To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Summary tables of TDT2 Corpus Contents
Date: Thu, 08 Oct 1998 11:47:56 EDT

Folks,

The tables below summarize the contents of the TDT2 Training and
Development Test corpus release that you should be receiving today:
file, sample, story and segment counts, average story lengths, and
on-topic story counts. (If your CD-ROM does not arrive today, please
contact the LDC and we will put a trace on the shipment.)

The numbers in parentheses next to some of the table titles are
footnote references; the footnotes appear under "Notes:" further down.

Dave Graff

------------------


File counts by data type:
-------------------------
   1750 sgml
   1750 tkn    1750 bndtkn
    639 asr     639 bndasr
   4139 data   2389 bnd-tables


ASR sample counts by source and month:
--------------------------------------
	ABC	CNN	PRI	VOA	Total
199801	  21	  91	  19	   0	 131
199802	  23	  94	  20	   0	 137
199803	  28	 122	  21	  17	 188
199804	  30	 108	  21	  24	 183
Total	 102	 415	  81	  41	 639
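
Because the tables in this message are tab-separated, their row and
column totals are easy to check mechanically. The sketch below is
only an illustration, using the ASR sample counts copied from the
table above; the names ASR_TABLE, parse_table and totals_consistent
are made up for this example and are not part of the corpus release.

```python
# ASR sample counts copied from the table above; cells are tab-separated.
ASR_TABLE = (
    "\tABC\tCNN\tPRI\tVOA\tTotal\n"
    "199801\t  21\t  91\t  19\t   0\t 131\n"
    "199802\t  23\t  94\t  20\t   0\t 137\n"
    "199803\t  28\t 122\t  21\t  17\t 188\n"
    "199804\t  30\t 108\t  21\t  24\t 183\n"
    "Total\t 102\t 415\t  81\t  41\t 639\n"
)


def parse_table(text):
    """Return (column names, {row label: {column name: count}})."""
    lines = text.strip("\n").split("\n")
    # The header line starts with an empty cell above the row labels.
    header = lines[0].split("\t")[1:]
    rows = {}
    for line in lines[1:]:
        label, *cells = line.split("\t")
        # int() ignores the padding spaces around each number.
        rows[label.strip()] = dict(zip(header, (int(c) for c in cells)))
    return header, rows


def totals_consistent(header, rows):
    """Check that every row and every column adds up to its stated total."""
    sources = [h for h in header if h != "Total"]
    months = [r for r in rows if r != "Total"]
    rows_ok = all(
        sum(rows[r][s] for s in sources) == rows[r]["Total"] for r in rows
    )
    cols_ok = all(
        sum(rows[m][h] for m in months) == rows["Total"][h] for h in header
    )
    return rows_ok and cols_ok


header, rows = parse_table(ASR_TABLE)
print(totals_consistent(header, rows))
```

The same two functions should work on the other source-by-month
tables, provided the table text is pasted with its tabs intact.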


SGM/TKN sample counts by source and month:(1)
------------------------------------------
	ABC.cc	ABC.fd	APW	CNN	NYT	PRI	VOA	Total
199801	  23	  20	 112	  91	 111	  19	  33	 409
199802	  24	  24	 112	  94	 104	  20	  38	 416
199803	  28	  27	 124	 122	 114	  21	  45	 481
199804	  30	  29	 120	 109	  97	  21	  38	 444
Total	 105	 100	 468	 416	 426	  81	 154	1750


NEWS story counts by source and month:(1)
--------------------------------------
	ABC.cc	ABC.fd	APW	CNN	NYT	PRI	VOA	Total
199801	  319	  280	 2145	 2275	 2082	  511	 1186	 8798
199802	  307	  307	 2164	 2308	 1975	  472	 1411	 8944
199803	  359	  350	 2225	 2907	 1932	  488	 1669	 9930
199804	  384	  370	 1997	 2684	 1637	  498	 1353	 8923
Total	 1369	 1307	 8531	10174	 7626	 1969	 5619	36595


Average TKN words per news story by source:
-------------------------------------------
	ABC.cc	ABC.fd	APW	CNN	NYT	PRI	VOA
	 226	 237	 340	 120	 858	 317	 220


Average ASR words per news story by source:
-------------------------------------------
	ABC	CNN	PRI	VOA
	 238	 128	 331	 224


ASR NON-NEWS segment counts by source and type:(2)
-----------------------------------------------
Segment type 	ABC	CNN	PRI	VOA	Total
MISCELLANEOUS	 636	3265	 943	 461	5305
UNASSIGNED	  96	 385	  44	  14	 539
UNTRANSCRIBED	   1	 295	   1	   0	 297
Total       	 733	3945	 988	 475	6141


TKN NON-NEWS segment counts by source and type:(1)(2)
-----------------------------------------------
Segment type 	ABC.cc	ABC.fd	APW	CNN	NYT	PRI	VOA	Total
MISCELLANEOUS	 651	 624	   0	3278	   0	 944	1686	7183
UNTRANSCRIBED	   1	   1	   0	 295	   0	   1	  32	 330
Total       	 652	 625	   0	3573	   0	 945	1718	7513



Notes:
------

(1) Items counted under "ABC.fd" are a proper subset of items counted
under "ABC.cc" -- both columns were added into the row totals, so
the total number of distinct items in a row = "row total" minus
"ABC.fd".

(2) Explanation of "NON-NEWS" segment types:

- MISCELLANEOUS: commercial breaks, introductions and "teasers"
  (portions that lack two or more independent declarative
  clauses on a single news story)
- UNTRANSCRIBED: segments that contain spoken news reporting, but
  whose transcription did not contain enough information for
  doing topic annotation -- some of these segments may contain
  material on a target topic, but they have not been annotated
- UNASSIGNED: a portion of an audio recording that extends beyond
  the boundary of the targeted broadcast -- no transcription is
  provided for these (out-of-bounds) portions; ASR data might
  contain material that could appear relevant to a target topic,
  but nothing in these segments has been annotated, and these
  segments do not have any DOCNO identifier assigned to them.

By design, "NON-NEWS" segments are excluded entirely from newswire
data, but are retained in broadcast data.
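
As a worked illustration of note (1), the sketch below subtracts the
ABC.fd count from each row total of the NEWS story counts table to
get the number of distinct stories in that row. The figures are
copied from that table; the names news_story_rows and distinct_count
are made up for this example.

```python
# (row total, ABC.fd count) for each row of the NEWS story counts table.
news_story_rows = {
    "199801": (8798, 280),
    "199802": (8944, 307),
    "199803": (9930, 350),
    "199804": (8923, 370),
    "Total": (36595, 1307),
}


def distinct_count(row_total, abc_fd):
    """Apply note (1): ABC.fd items are already counted under ABC.cc."""
    return row_total - abc_fd


distinct = {
    label: distinct_count(total, abc_fd)
    for label, (total, abc_fd) in news_story_rows.items()
}

for label, count in distinct.items():
    print(label, count)
```

For the Total row this gives 36595 - 1307 = 35288 distinct news
stories, which is also the sum of the four monthly results.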


ON-TOPIC Story counts by partition and topic:
---------------------------------------------
		Train	Devel	Total
topicid=01	 1078	  306	 1384
topicid=02	  792	  281	 1073
topicid=03	    0	    0	    0
topicid=04	   16	    1	   17
topicid=05	    9	    5	   14
topicid=06	    6	    2	    8
topicid=07	   24	    1	   25
topicid=08	   49	    5	   54
topicid=09	   53	    1	   54
topicid=10	    7	    0	    7
topicid=11	  107	    0	  107
topicid=12	  177	   14	  191
topicid=13	  652	   48	  700
topicid=14	    2	    2	    4
topicid=15	 1473	  207	 1680
topicid=16	    6	    0	    6
topicid=17	   22	    0	   22
topicid=18	   77	   11	   88
topicid=19	   69	   19	   88
topicid=20	   35	    4	   39
topicid=21	   57	    6	   63
topicid=22	   30	    0	   30
topicid=23	  102	   17	  119
topicid=24	   35	    5	   40
topicid=25	    1	    0	    1
topicid=26	   69	    1	   70
topicid=27	    0	    1	    1
topicid=28	    9	    3	   12
topicid=29	    9	    2	   11
topicid=30	    2	    0	    2
topicid=31	   31	    7	   38
topicid=32	   57	   69	  126
topicid=33	  127	    3	  130
topicid=34	   18	    0	   18
topicid=35	    6	    0	    6
topicid=36	    5	    0	    5
topicid=37	   16	   15	   31

topicid=38	    0	    1	    1
topicid=39	   58	   65	  123
topicid=40	    0	    3	    3
topicid=41	    1	   25	   26
topicid=42	    0	   27	   27
topicid=43	    1	   14	   15
topicid=44	   40	  171	  211
topicid=45	    0	    0	    0
topicid=46	    0	    5	    5
topicid=47	    0	   29	   29
topicid=48	    0	  146	  146
topicid=49	    0	    0	    0
topicid=50	    0	   11	   11
topicid=51	    0	    0	    0
topicid=52	    0	    5	    5
topicid=53	    0	    7	    7
topicid=54	    0	    1	    1
topicid=55	    0	    1	    1
topicid=56	    2	   54	   56
topicid=57	    0	   17	   17
topicid=58	    0	    1	    1
topicid=59	    0	    1	    1
topicid=60	    0	    8	    8
topicid=61	    1	    4	    5
topicid=62	    0	    2	    2
topicid=63	    0	   17	   17
topicid=64	    0	   12	   12
topicid=65	    0	   47	   47
topicid=66	    0	    6	    6

Total		 5331	 1716	 7047		


Last updated Wed Oct 28 14:44:11 1998