To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Summary tables of TDT2 Corpus Contents
Date: Thu, 08 Oct 1998 11:47:56 EDT
Folks,
Below is a set of tables that summarize the contents of the TDT2
Training and Development Test corpus release that you should be
receiving today. (If you do not receive your CD-ROM today, please
contact the LDC and we will put a trace on the shipment.)
The numbers that appear in parentheses next to some of the table
titles are footnote references; the footnotes are further down.
Dave Graff
------------------
File counts by data type:
-------------------------
1750 sgml
1750 tkn     1750 bndtkn
 639 asr      639 bndasr
4139 data    2389 bnd-tables
ASR sample counts by source and month:
--------------------------------------
          ABC    CNN    PRI    VOA  Total
199801     21     91     19      0    131
199802     23     94     20      0    137
199803     28    122     21     17    188
199804     30    108     21     24    183
Total     102    415     81     41    639
SGM/TKN sample counts by source and month:(1)
------------------------------------------
        ABC.cc  ABC.fd     APW     CNN     NYT     PRI     VOA   Total
199801      23      20     112      91     111      19      33     409
199802      24      24     112      94     104      20      38     416
199803      28      27     124     122     114      21      45     481
199804      30      29     120     109      97      21      38     444
Total      105     100     468     416     426      81     154    1750
NEWS story counts by source and month:(1)
--------------------------------------
        ABC.cc  ABC.fd     APW     CNN     NYT     PRI     VOA   Total
199801     319     280    2145    2275    2082     511    1186    8798
199802     307     307    2164    2308    1975     472    1411    8944
199803     359     350    2225    2907    1932     488    1669    9930
199804     384     370    1997    2684    1637     498    1353    8923
Total     1369    1307    8531   10174    7626    1969    5619   36595
Average TKN words per news story by source:
-------------------------------------------
  ABC.cc  ABC.fd     APW     CNN     NYT     PRI     VOA
     226     237     340     120     858     317     220
Average ASR words per news story by source:
-------------------------------------------
    ABC    CNN    PRI    VOA
    238    128    331    224
ASR NON-NEWS segment counts by source and type:(2)
-----------------------------------------------
Segment type       ABC    CNN    PRI    VOA  Total
MISCELLANEOUS      636   3265    943    461   5305
UNASSIGNED          96    385     44     14    539
UNTRANSCRIBED        1    295      1      0    297
Total              733   3945    988    475   6141
TKN NON-NEWS segment counts by source and type:(1)(2)
-----------------------------------------------
Segment type     ABC.cc  ABC.fd     APW     CNN     NYT     PRI     VOA   Total
MISCELLANEOUS       651     624       0    3278       0     944    1686    7183
UNTRANSCRIBED         1       1       0     295       0       1      32     330
Total               652     625       0    3573       0     945    1718    7513
Notes:
------
(1) Items counted under "ABC.fd" are a proper subset of items counted
    under "ABC.cc" -- both columns were added into the row totals, so
    the total number of distinct items in a row = "row total" minus
    "ABC.fd".
(2) Explanation of "NON-NEWS" segment types:
    - MISCELLANEOUS: commercial breaks, introductions and "teasers"
      (portions that lack two or more independent declarative
      clauses on a single news story)
    - UNTRANSCRIBED: segments that contain spoken news reporting, but
      whose transcription did not contain enough information for
      doing topic annotation -- some of these segments may contain
      material on a target topic, but they have not been annotated
    - UNASSIGNED: a portion of an audio recording that extends beyond
      the boundary of the targeted broadcast -- no transcription is
      provided for these (out-of-bounds) portions; ASR data might
      contain material that could appear relevant to a target topic,
      but nothing in these segments has been annotated, and these
      segments do not have any DOCNO identifier assigned to them.
By design, "NON-NEWS" segments are excluded entirely from newswire
data, but are retained in broadcast data.
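
The subtraction described in note (1) can be sketched in a few lines of
Python. This is only an illustration: the dictionary layout and the
helper name distinct_items are assumptions made for the example, not
part of any corpus tooling. The values are copied from the 199801 row
of the NEWS story counts table.

```python
# Row values copied from the "NEWS story counts" table (199801 row).
# The dict layout and the helper name are hypothetical, for illustration.
row_199801 = {
    "ABC.cc": 319, "ABC.fd": 280, "APW": 2145, "CNN": 2275,
    "NYT": 2082, "PRI": 511, "VOA": 1186,
}

def distinct_items(row):
    """Printed row total minus the ABC.fd subset, as described in note (1)."""
    row_total = sum(row.values())      # both ABC columns were added in
    return row_total - row["ABC.fd"]   # ABC.fd items are already in ABC.cc

print(sum(row_199801.values()))     # 8798, the printed row total
print(distinct_items(row_199801))   # 8518 distinct stories
```

The same helper applies to any row of the tables marked (1), since each
of them counts "ABC.fd" and "ABC.cc" side by side.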
ON-TOPIC Story counts by partition and topic:
---------------------------------------------
            Train  Devel  Total
topicid=01   1078    306   1384
topicid=02    792    281   1073
topicid=03      0      0      0
topicid=04     16      1     17
topicid=05      9      5     14
topicid=06      6      2      8
topicid=07     24      1     25
topicid=08     49      5     54
topicid=09     53      1     54
topicid=10      7      0      7
topicid=11    107      0    107
topicid=12    177     14    191
topicid=13    652     48    700
topicid=14      2      2      4
topicid=15   1473    207   1680
topicid=16      6      0      6
topicid=17     22      0     22
topicid=18     77     11     88
topicid=19     69     19     88
topicid=20     35      4     39
topicid=21     57      6     63
topicid=22     30      0     30
topicid=23    102     17    119
topicid=24     35      5     40
topicid=25      1      0      1
topicid=26     69      1     70
topicid=27      0      1      1
topicid=28      9      3     12
topicid=29      9      2     11
topicid=30      2      0      2
topicid=31     31      7     38
topicid=32     57     69    126
topicid=33    127      3    130
topicid=34     18      0     18
topicid=35      6      0      6
topicid=36      5      0      5
topicid=37     16     15     31
topicid=38      0      1      1
topicid=39     58     65    123
topicid=40      0      3      3
topicid=41      1     25     26
topicid=42      0     27     27
topicid=43      1     14     15
topicid=44     40    171    211
topicid=45      0      0      0
topicid=46      0      5      5
topicid=47      0     29     29
topicid=48      0    146    146
topicid=49      0      0      0
topicid=50      0     11     11
topicid=51      0      0      0
topicid=52      0      5      5
topicid=53      0      7      7
topicid=54      0      1      1
topicid=55      0      1      1
topicid=56      2     54     56
topicid=57      0     17     17
topicid=58      0      1      1
topicid=59      0      1      1
topicid=60      0      8      8
topicid=61      1      4      5
topicid=62      0      2      2
topicid=63      0     17     17
topicid=64      0     12     12
topicid=65      0     47     47
topicid=66      0      6      6
Total        5331   1716   7047
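
For anyone who wants to work with the ON-TOPIC table programmatically,
here is a minimal Python sketch that parses rows of the form
"topicid=NN train devel total". The function name parse_topic_rows and
the in-memory sample are assumptions made for illustration; only four
rows are copied in, and the full table text could be pasted in their
place.

```python
# A few rows copied from the ON-TOPIC table; the full table text could be
# substituted here. The function and variable names are hypothetical.
table_text = """\
topicid=01 1078 306 1384
topicid=27 0 1 1
topicid=44 40 171 211
topicid=48 0 146 146
"""

def parse_topic_rows(text):
    """Return {topic_id: (train, devel, total)} for 'topicid=' lines."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()   # any run of whitespace separates columns
        if len(fields) != 4 or not fields[0].startswith("topicid="):
            continue            # skip the header, the Total row, blanks
        topic = int(fields[0].split("=", 1)[1])
        train, devel, total = (int(f) for f in fields[1:])
        rows[topic] = (train, devel, total)
    return rows

rows = parse_topic_rows(table_text)

# Rows where Train + Devel does not equal the printed Total.
inconsistent = [t for t, (tr, dv, tot) in rows.items() if tr + dv != tot]

# Topics whose on-topic stories fall only in the Devel partition.
devel_only = sorted(t for t, (tr, dv, _) in rows.items() if tr == 0 and dv > 0)

print(inconsistent)   # []
print(devel_only)     # [27, 48]
```

Splitting on whitespace means the parser does not depend on how the
columns happen to be padded.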
Last updated Wed Oct 28 14:44:11 1998