
To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Stats for TDT-2 Eval set.
Date: Mon, 11 Jan 1999 15:55:07 EST

In response to Steve Lowe's more detailed question, I have realized
there is another oddity in the original and adjudicated topic
relevance tables that I released recently: they contain entries for
stories that were collected during the first three days of July 1998.

The LDC included the data files and annotations for 1998070[123] in
the eval-data tar file that we shipped to NIST in October, but Jon
Fiscus opted to eliminate the July data from the test set, to make
things easier. If you eliminate all references to July data from
those tables, the numbers ought to make more sense.
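
For anyone who wants to strip the July entries mechanically, here is a
minimal sketch. It assumes each table row is one line of text and that
the collection date appears in that line as an 8-digit YYYYMMDD string;
the sample rows and the name drop_july are hypothetical, and the real
table layout may differ.

```python
import re

# Matches the three July dates named above: 19980701, 19980702, 19980703.
# Assumption: the date appears somewhere in each row as YYYYMMDD.
JULY_DATE = re.compile(r"1998070[123]")


def drop_july(lines):
    """Return only the lines that do not mention one of the July dates."""
    return [line for line in lines if not JULY_DATE.search(line)]


# Hypothetical rows, for illustration only (not taken from the real tables).
sample = [
    "19980630_2130_2200_CNN_HDL  topicid=70  YES",
    "19980701_1830_1900_ABC_WNT  topicid=70  YES",
    "19980703_0000_0100_NYT_NYT  topicid=76  BRIEF",
]
kept = drop_july(sample)
for line in kept:
    print(line)
```

Note that the pattern matches anywhere in the line, so it would also drop
a row in which those digits occur for some other reason; anchor it to the
story-id field if the actual layout allows.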

Here are the "true" counts of topic hits for the adjudicated TDT-2
eval data, leaving out the three days' worth of July material:


ON-TOPIC Story counts by topic -- ALL/YES/BRIEF:
---------------------------------------------------
Topic Eval-test
topicid=67	   1 /    1 /   0	
topicid=68	   3 /    3 /   0	
topicid=69	   2 /    2 /   0	
topicid=70	 521 /  440 /  81	
topicid=71	 201 /  173 /  28	
topicid=72	  18 /   16 /   2	
topicid=73	   1 /    1 /   0	
topicid=74	  54 /   39 /  15	
topicid=75	   7 /    6 /   1	
topicid=76	 393 /  286 / 107	
topicid=77	  18 /   18 /   0	
topicid=78	  12 /   12 /   0	
topicid=79	   9 /    9 /   0	
topicid=80	   1 /    1 /   0	
topicid=81	   1 /    1 /   0	
topicid=82	   4 /    4 /   0	
topicid=83	  21 /   21 /   0	
topicid=84	  11 /    8 /   3	
topicid=85	  14 /    8 /   6	
topicid=86	 141 /  138 /   3	
topicid=87	 105 /   83 /  22	
topicid=88	  55 /   23 /  32	
topicid=89	  23 /   22 /   1	
topicid=90	   1 /    1 /   0	
topicid=91	  55 /   54 /   1	
topicid=92	   2 /    2 /   0	
topicid=93	  13 /   13 /   0	
topicid=94	   3 /    3 /   0	
topicid=95	   4 /    4 /   0	
topicid=96	 106 /   69 /  37	
topicid=97	   2 /    2 /   0	
topicid=98	   8 /    8 /   0	
topicid=99	   2 /    1 /   1	
topicid=100	  10 /    9 /   1	
Total    	1822 / 1481 / 341	
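
If you want to check your own tallies against the table above, the
following sketch parses rows of the form "label ALL / YES / BRIEF" and
verifies that ALL equals YES plus BRIEF on every row and that a "Total"
row, if present, matches the column sums. The function name check_counts
and the excerpt (three rows copied from the table, plus a subtotal
computed for just those three rows) are assumptions for illustration.

```python
import re

# One table row: a label ("topicid=NN" or "Total") and three counts.
ROW = re.compile(r"^(topicid=\d+|Total)\s+(\d+)\s*/\s*(\d+)\s*/\s*(\d+)$")


def check_counts(text):
    """Parse 'label ALL / YES / BRIEF' rows; return (rows, problems)."""
    rows, problems = {}, []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # skip headers, rules and blank lines
        label = m.group(1)
        counts = tuple(int(g) for g in m.group(2, 3, 4))
        all_, yes, brief = counts
        if all_ != yes + brief:
            problems.append(f"{label}: ALL != YES + BRIEF")
        rows[label] = counts
    if "Total" in rows:
        topics = [c for k, c in rows.items() if k != "Total"]
        sums = tuple(sum(c[i] for c in topics) for i in range(3))
        if sums != rows["Total"]:
            problems.append(f"Total: expected {sums}, found {rows['Total']}")
    return rows, problems


# Excerpt: three rows from the table; the Total here is a subtotal for
# this excerpt only, not the full-table total.
excerpt = """
topicid=67	   1 /    1 /   0
topicid=70	 521 /  440 /  81
topicid=99	   2 /    1 /   1
Total    	 524 /  442 /  82
"""
rows, problems = check_counts(excerpt)
print(problems or "counts are consistent")
```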



Here are some additional stats on the eval data, since I happen to
have them on hand (again, excluding July data):

File counts by data type:
-------------------------
    937 sgml
    937 tkn     937 bndtkn
    384 asr     384 bndasr
   2258 data   1321 bnd-tables

ASR File counts by source and month:
------------------------------------
	ABC	CNN	PRI	VOA	Total
199805	  27	 115	  19	  36	 197
199806	  30	 107	  16	  34	 187
Total	  57	 222	  35	  70	 384


SGM/TKN File counts by source and month:
----------------------------------------
	ABC.cc	ABC.fd	APW	CNN	NYT	PRI	VOA	Total
199805	  27	  25	 123	 115	 123	  20	  39	 472
199806	  30	  30	 120	 110	 120	  21	  34	 465
Total	  57	  55	 243	 225	 243	  41	  73	 937


NEWS Story counts by source and month:
--------------------------------------
	ABC.cc	ABC.fd	APW	CNN	NYT	PRI	VOA	Total
199805	  359	  339	 2169	 2802	 2130	  478	 1375	 9652
199806	  419	  419	 2061	 2828	 2039	  493	 1225	 9484
Total	  778	  758	 4230	 5630	 4169	  971	 2600	19136


Average TKN words per story by source:
--------------------------------------
	ABC.cc	ABC.fd	APW	CNN	NYT	PRI	VOA
	 216	 224	 345	 117	 872	 322	 214


Average ASR words per story by source:
--------------------------------------
	ABC	CNN	PRI	VOA
	 228	 125	 335	 224


ASR NON-NEWS segment counts by source and type:
-----------------------------------------------
Segment type 	ABC	CNN	PRI	VOA	Total
MISCELLANEOUS	 373	1912	 453	 826	3564
UNASSIGNED	  56	 217	   2	   1	 276
UNTRANSCRIBED	   1	 276	  16	  21	 314
Total       	 430	2405	 471	 848	4154


TKN NON-NEWS segment counts by source and type:
-----------------------------------------------
Segment type 	ABC.cc	ABC.fd	APW	CNN	NYT	PRI	VOA	Total
MISCELLANEOUS	 373	 362	   0	1943	   0	 530	 862	4070
UNTRANSCRIBED	   1	   0	   0	 282	   0	  16	  21	 320
Total       	 374	 362	   0	2225	   0	 546	 883	4390




You might notice that the "ABC.cc" and "ABC.fd" data (closed caption and
FDCH text) differ in the number of NEWS, MISCELLANEOUS and UNTRANSCRIBED
segments in the May collection. Obviously, this is an error; it
affects 14 of the 58 ABC broadcasts in eval for which we had both
FDCH and CC text. We are still working on reconciling this problem.

Dave G.

Last updated Wed Feb 3 10:44:21 1999