To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Stats for TDT-2 Eval set.
Date: Mon, 11 Jan 1999 15:55:07 EST
In response to Steve Lowe's more detailed question, I have realized
there is another oddity about the original and adjudicated topic
relevance tables that I released recently: they contain entries for
stories that were collected during the first three days of July 1998.
The LDC included the data files and annotations for 1998070[123] in
the eval-data tar file that we shipped to NIST in October, and Jon
Fiscus opted to eliminate the July data from the test set, to make
things easier. If you eliminate all references to July data from
those tables, the numbers ought to make more sense.
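As a rough illustration of that filtering step, here is a minimal sketch. It assumes each table line carries a story or file identifier that embeds the collection date as YYYYMMDD; the table layout itself is not shown in this message, so the function name `drop_july_entries` and the sample lines are hypothetical.

```python
import re

# Matches the July dates named above: 19980701, 19980702, 19980703.
# Assumption: the date appears in each line as an 8-digit YYYYMMDD
# string that is not part of a longer run of digits.
JULY_DATE = re.compile(r"(?<!\d)1998070[123](?!\d)")

def drop_july_entries(lines):
    """Return only the lines that do not reference 1-3 July 1998."""
    return [line for line in lines if not JULY_DATE.search(line)]

# Hypothetical sample lines; the real table format may differ.
sample = [
    "topicid=70 19980615_APW_0001 YES",
    "topicid=70 19980702_CNN_0042 BRIEF",
    "topicid=76 19980531_NYT_0007 YES",
]
kept = drop_july_entries(sample)
```

With the sample above, `kept` holds the two non-July lines. If the real identifiers encode the date differently, only the pattern should need adjusting.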
Here are the "true" counts of topic hits for the adjudicated TDT-2
eval data, leaving out the three days' worth of July material:
ON-TOPIC Story counts by topic -- ALL/YES/BRIEF:
---------------------------------------------------
Topic        Eval-test
topicid=67      1 /    1 /   0
topicid=68      3 /    3 /   0
topicid=69      2 /    2 /   0
topicid=70    521 /  440 /  81
topicid=71    201 /  173 /  28
topicid=72     18 /   16 /   2
topicid=73      1 /    1 /   0
topicid=74     54 /   39 /  15
topicid=75      7 /    6 /   1
topicid=76    393 /  286 / 107
topicid=77     18 /   18 /   0
topicid=78     12 /   12 /   0
topicid=79      9 /    9 /   0
topicid=80      1 /    1 /   0
topicid=81      1 /    1 /   0
topicid=82      4 /    4 /   0
topicid=83     21 /   21 /   0
topicid=84     11 /    8 /   3
topicid=85     14 /    8 /   6
topicid=86    141 /  138 /   3
topicid=87    105 /   83 /  22
topicid=88     55 /   23 /  32
topicid=89     23 /   22 /   1
topicid=90      1 /    1 /   0
topicid=91     55 /   54 /   1
topicid=92      2 /    2 /   0
topicid=93     13 /   13 /   0
topicid=94      3 /    3 /   0
topicid=95      4 /    4 /   0
topicid=96    106 /   69 /  37
topicid=97      2 /    2 /   0
topicid=98      8 /    8 /   0
topicid=99      2 /    1 /   1
topicid=100    10 /    9 /   1
Total        1822 / 1481 / 341
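For anyone who wants to cross-check these tallies against their own copy of the tables, the sketch below parses the rows and tests two things: that ALL equals YES plus BRIEF on every row (an assumed relation, though the numbers above are consistent with it), and that the Total row equals the column sums. The name `check_table` is illustrative only.

```python
import re

# Rows copied from the ON-TOPIC table above: label, then ALL / YES / BRIEF.
TABLE = """\
topicid=67 1 / 1 / 0
topicid=68 3 / 3 / 0
topicid=69 2 / 2 / 0
topicid=70 521 / 440 / 81
topicid=71 201 / 173 / 28
topicid=72 18 / 16 / 2
topicid=73 1 / 1 / 0
topicid=74 54 / 39 / 15
topicid=75 7 / 6 / 1
topicid=76 393 / 286 / 107
topicid=77 18 / 18 / 0
topicid=78 12 / 12 / 0
topicid=79 9 / 9 / 0
topicid=80 1 / 1 / 0
topicid=81 1 / 1 / 0
topicid=82 4 / 4 / 0
topicid=83 21 / 21 / 0
topicid=84 11 / 8 / 3
topicid=85 14 / 8 / 6
topicid=86 141 / 138 / 3
topicid=87 105 / 83 / 22
topicid=88 55 / 23 / 32
topicid=89 23 / 22 / 1
topicid=90 1 / 1 / 0
topicid=91 55 / 54 / 1
topicid=92 2 / 2 / 0
topicid=93 13 / 13 / 0
topicid=94 3 / 3 / 0
topicid=95 4 / 4 / 0
topicid=96 106 / 69 / 37
topicid=97 2 / 2 / 0
topicid=98 8 / 8 / 0
topicid=99 2 / 1 / 1
topicid=100 10 / 9 / 1
Total 1822 / 1481 / 341
"""

ROW = re.compile(r"^(topicid=\d+|Total)\s+(\d+)\s*/\s*(\d+)\s*/\s*(\d+)$")

def check_table(text):
    """Return a list of inconsistencies; an empty list means none were found."""
    problems = []
    sums = [0, 0, 0]
    total = None
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # skip headers, rules, and blank lines
        label = m.group(1)
        counts = [int(g) for g in m.groups()[1:]]
        # Assumed relation: ALL = YES + BRIEF on every row.
        if counts[0] != counts[1] + counts[2]:
            problems.append(f"{label}: ALL != YES + BRIEF")
        if label == "Total":
            total = counts
        else:
            sums = [s + c for s, c in zip(sums, counts)]
    if total is None:
        problems.append("no Total row found")
    elif total != sums:
        problems.append(f"Total row {total} != column sums {sums}")
    return problems

problems = check_table(TABLE)
```

An empty `problems` list means both checks passed for the rows supplied. The same function can be pointed at a re-tallied table after the July entries are removed.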
Here are some additional stats on the eval data, since I happen to
have them on hand (again, excluding July data):
File counts by data type:
-------------------------
 937 sgml
 937 tkn       937 bndtkn
 384 asr       384 bndasr
2258 data     1321 bnd-tables
ASR File counts by source and month:
------------------------------------
           ABC   CNN   PRI   VOA   Total
199805      27   115    19    36     197
199806      30   107    16    34     187
Total       57   222    35    70     384
SGM/TKN File counts by source and month:
----------------------------------------
          ABC.cc  ABC.fd     APW     CNN     NYT     PRI     VOA   Total
199805        27      25     123     115     123      20      39     472
199806        30      30     120     110     120      21      34     465
Total         57      55     243     225     243      41      73     937
NEWS Story counts by source and month:
--------------------------------------
          ABC.cc  ABC.fd     APW     CNN     NYT     PRI     VOA   Total
199805       359     339    2169    2802    2130     478    1375    9652
199806       419     419    2061    2828    2039     493    1225    9484
Total        778     758    4230    5630    4169     971    2600   19136
Average TKN words per story by source:
--------------------------------------
  ABC.cc  ABC.fd     APW     CNN     NYT     PRI     VOA
     216     224     345     117     872     322     214
Average ASR words per story by source:
--------------------------------------
   ABC   CNN   PRI   VOA
   228   125   335   224
ASR NON-NEWS segment counts by source and type:
-----------------------------------------------
Segment type      ABC   CNN   PRI   VOA   Total
MISCELLANEOUS     373  1912   453   826    3564
UNASSIGNED         56   217     2     1     276
UNTRANSCRIBED       1   276    16    21     314
Total             430  2405   471   848    4154
TKN NON-NEWS segment counts by source and type:
-----------------------------------------------
Segment type    ABC.cc  ABC.fd     APW     CNN     NYT     PRI     VOA   Total
MISCELLANEOUS      373     362       0    1943       0     530     862    4070
UNTRANSCRIBED        1       0       0     282       0      16      21     320
Total              374     362       0    2225       0     546     883    4390
You might notice that the "ABC.cc" and "ABC.fd" sources (closed
caption and FDCH data) differ in the number of NEWS, MISC and UNTRANS
stories in the May collection. Obviously, this is an error; it
affects 14 of the 58 ABC broadcasts in eval for which we had both
FDCH and CC text. We are still working on reconciling this problem.
Dave G.
Last updated Wed Feb 3 10:44:21 1999