The TDT3 Corpus comprises seven components:
The LDC expects that, for the foreseeable future, item (7) will be distributed only to a few recipients, in the form of recordable CD-ROMs (additional copies of the CD-ROMs can be made as needed for other recipients). Items (1-6) will be distributed to all TDT3 participants, and will be described in greater detail below.
For each sampling unit created for TDT3, there will be one file of SGML-tagged text data; each sampling unit consists of the stories captured from one source during one contiguous time span. Among the broadcast (audio) sources, there are typically four sampling units per day for CNN, two per day for VOA (English and Mandarin???), and one per day for ABC, NBC, MSNBC and PRI. The newswire sources (APW, NYT, XIN???) each have up to four sampling units per day, representing a quasi-arbitrary selection of four discrete blocks of stories taken at evenly-spaced intervals from the full day's data stream.
In terms of story content, each CNN sample unit covers 30 minutes of broadcast time and contains, on average, about 20 news stories; ABC sample units also cover 30 minutes, with 12 stories on average; PRI, VOA English, VOA Mandarin, NBC and MSNBC sample units each cover 60-minute broadcasts, and range between two and three dozen stories per unit. The time periods covered by the newswire sample units are widely variable; the selection process that extracts these units from the daily newswire feeds is configured to create units that are fairly equivalent in size, such that there will be 20 stories per unit, on average (the sampling units may contain more or fewer stories, but the total number of words in each unit should be comparable).
Each sampling unit is assigned a unique file ID, which identifies the date and time that the sampling took place, and the source. The file ID's are formatted so that, when they are put in a default sorted order, they will appear in chronological sequence. Here are some typical file ID's:
19980106_1830_1900_ABC_WNT
19980121_2001_2105_NYT_NYT
19980122_2300_2400_VOA_TDY
The file ID's are used to name the file that contains the SGML text data for each sample unit, and are likewise used in all other files derived from or relating to that sample unit (i.e. the token-text, ASR-text and boundary table files).
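As a minimal sketch of how the file IDs can be handled, the Python below splits an ID into its underscore-separated fields and sorts a list of IDs. The field names (date, start, end, source, suffix) are assumptions inferred from the examples above, not names defined by this document, and the helper names are hypothetical.

```python
from typing import List, NamedTuple


class FileID(NamedTuple):
    date: str    # assumed YYYYMMDD
    start: str   # assumed HHMM, kept as a string
    end: str     # assumed HHMM; can be "2400", which datetime would reject
    source: str  # e.g. ABC, NYT, VOA
    suffix: str  # e.g. WNT, NYT, TDY (meaning not documented here)


def parse_file_id(file_id: str) -> FileID:
    """Split a file ID into its five underscore-separated fields."""
    parts = file_id.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected file ID layout: {file_id!r}")
    return FileID(*parts)


def chronological(file_ids: List[str]) -> List[str]:
    """Default string sort; the IDs are designed to sort chronologically."""
    return sorted(file_ids)


example_ids = [
    "19980122_2300_2400_VOA_TDY",
    "19980106_1830_1900_ABC_WNT",
    "19980121_2001_2105_NYT_NYT",
]
ordered = chronological(example_ids)
first = parse_file_id(ordered[0])
```

The time fields are left as strings because an end time such as 2400 is not a valid clock time for most date libraries.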
Within each SGML sample unit file, the individual stories are marked off as SGML <DOC> units. Essential information about each story is provided in the following SGML tags that appear in the initial part of each DOC unit:
An additional piece of essential information, which applies to audio sources only, is provided in the following SGML tag:
Additional details about the format and content of the SGML text data are not directly relevant to specifying the other corpus components, and are already described on the LDC's web page for the TDT2 Corpus???.
In the case of newswire sources (APW, NYT and XIN???), the SGML archival form is derived from the ANPA transmission format received by LDC via modem; the SGML sample files contain only the stories that have been selected for annotation, and that have not been rejected by annotators due to unsuitable content (i.e. only those DOC's that remain classified as "NEWS STORY" are retained). The selection process for newswire annotation automatically eliminates classes of DOC units in the stream that do not contain news stories, but some non-news units do get passed through for annotation; annotators have the option to flag stories as unacceptable for topic labeling if any of the following conditions is seen:
Of these four causes for rejection from topic labeling, the first three will result in the DOC unit being excluded from the SGML files delivered to researchers.
In the case of audio sources (ABC, NBC, MSNBC, CNN, PRI, VOA English, VOA Mandarin), the SGML archival form is derived from closed-caption text or commercially produced transcripts that have been annotated manually to establish time stamps and DOCTYPE labels for all DOC unit boundaries; all DOC units are retained for the audio sources (i.e. both "NEWS STORY" and "MISCELLANEOUS TEXT").
This form of text data is derived from the SGML-tagged text archive. For each sample unit from each source, there will be one file containing all the text content of the sample, in which each space-separated orthographic token is presented as a separate data record, and all other information from the original sample is excluded. The excluded information consists of:
Regarding the last item, newswire stories typically begin with a "dateline" string that identifies the location from which the news report was originally sent; for example:
BELFAST, Northern Ireland (AP) _ With peace talks already
threatened by a spate of killings, police in Northern Ireland
detonated a car bomb Wednesday in a rural town near Belfast.
Given this paragraph from the SGML-tagged archive, the untagged stream would contain only the space-separated tokens following the underscore character "_".
The tokenization process simply splits the text stream using white-space as the delimiter. All punctuation, hyphenation, quotations and parentheses are retained without modification; in other words, every string of one or more contiguous non-space characters is output as one token. All strings of one or more white-space characters are treated alike as single delimiters; this has the effect of eliminating paragraph boundaries in newswire, which are marked in the SGML files by indentation.
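The dateline removal and white-space tokenization described above can be sketched in Python as follows. This is an illustration, not the LDC's actual tool: the function names are hypothetical, and it assumes the dateline always ends at the first token that is exactly "_".

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Split on runs of white-space; punctuation stays attached to tokens."""
    # str.split() with no argument treats any run of spaces, tabs or
    # newlines as a single delimiter, so paragraph indentation disappears.
    return text.split()


def drop_dateline(tokens: List[str]) -> List[str]:
    """Keep only the tokens after the first bare underscore, if there is one."""
    # Assumption: the dateline is terminated by a token that is exactly "_".
    if "_" in tokens:
        return tokens[tokens.index("_") + 1:]
    return tokens


paragraph = (
    "   BELFAST, Northern Ireland (AP) _ With peace talks already\n"
    "threatened by a spate of killings, police in Northern Ireland\n"
    "detonated a car bomb Wednesday in a rural town near Belfast."
)
story_tokens = drop_dateline(tokenize(paragraph))
```

A story with no dateline passes through unchanged, but a story whose body contains a bare underscore and no dateline would be truncated by this sketch, so real data would need a stricter dateline test.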
Except for this neutralization of white-space, and the elimination of datelines and ANNOTATION tags, it is possible to reconstruct the full content of the TEXT elements of the SGML files from the tokenized text files. The following example shows the format of the tokenized stream:
<DOCSET type=NEWSWIRE fileid=19980107_0000_0800_APW_ENG>
<W recid=1> With
<W recid=2> peace
<W recid=3> talks
<W recid=4> already
<W recid=5> threatened
<W recid=6> by
<W recid=7> a
<W recid=8> spate
<W recid=9> of
<W recid=10> killings,
...
</DOCSET>
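One possible way to read a tokenized stream back into memory is sketched below in Python. The regular expressions are assumptions based only on the example above (unquoted attribute values, exactly one token per W record), and `read_docset` is a hypothetical helper name.

```python
import re
from typing import Dict, Tuple

# Patterns inferred from the example; attribute values are unquoted.
DOCSET_RE = re.compile(r"<DOCSET\s+type=(\S+)\s+fileid=([^\s>]+)>")
W_RE = re.compile(r"<W\s+recid=(\d+)>\s*(\S+)")


def read_docset(text: str) -> Tuple[str, str, Dict[int, str]]:
    """Return (type, fileid, {recid: token}) for one tokenized-text file."""
    header = DOCSET_RE.search(text)
    if header is None:
        raise ValueError("no DOCSET header found")
    records = {int(recid): token for recid, token in W_RE.findall(text)}
    return header.group(1), header.group(2), records


sample_stream = (
    "<DOCSET type=NEWSWIRE fileid=19980107_0000_0800_APW_ENG> "
    "<W recid=1> With <W recid=2> peace <W recid=3> talks </DOCSET>"
)
doc_type, file_id, records = read_docset(sample_stream)
```

Because the patterns ignore line breaks, the same code works whether each record is on its own line or the whole stream is on one line.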
Given the "fileid" and "recid" attributes in the tokenized text stream data, the story boundary table will relate the original DOC units to the DOCSET data records, by providing the starting and ending record ID's for each DOCNO; each line of the table will provide the SGML file ID, the DOCSET file ID, the DOCNO string, and the starting and ending "recid" values within the DOCSET file that represent the boundaries of the DOC unit. For audio sources, the table also gives starting and ending time offsets for each DOC, in seconds, from the start of the file. Here is an example:
<BOUNDSET type=CAPTION fileid=19980107_0130_0200_CNN_HDL>
<BOUNDARY docno=CNN19980107.0130.0000 doctype=MISCELLANEOUS Bsec=0.00 Esec=16.00 Brecid=1 Erecid=30>
<BOUNDARY docno=CNN19980107.0130.0016 doctype=NEWS Bsec=16.00 Esec=53.67 Brecid=31 Erecid=124>
<BOUNDARY docno=CNN19980107.0130.0053 doctype=NEWS Bsec=53.67 Esec=121.51 Brecid=125 Erecid=313>
...
</BOUNDSET>
It can happen that a DOC unit from a broadcast source may contain an empty TEXT portion; in this case, the table entry for that DOC does not contain the "Brecid" and "Erecid" attributes, but the "Bsec" and "Esec" attributes are always present for broadcast sources. The "Bsec" and "Esec" attributes are not present for newswire sources.
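The sketch below shows one way a boundary table might be parsed and used to pull out the tokens of a DOC unit, treating "Brecid"/"Erecid" and "Bsec"/"Esec" as optional. The helper names are hypothetical, the parsing assumes unquoted single-token attribute values as in the example, and the first BOUNDARY entry in the sample input is a made-up illustration of an empty TEXT portion.

```python
import re
from typing import Dict, List, Optional

BOUNDARY_RE = re.compile(r"<BOUNDARY\s+([^>]*)>")
ATTR_RE = re.compile(r"(\w+)=(\S+)")


def _optional(attrs: Dict[str, str], key: str, cast):
    """Convert an attribute if it is present, otherwise return None."""
    return cast(attrs[key]) if key in attrs else None


def read_boundaries(text: str) -> List[dict]:
    """Parse every BOUNDARY tag into a dict; missing attributes become None."""
    rows = []
    for match in BOUNDARY_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        rows.append({
            "docno": attrs["docno"],
            "doctype": attrs["doctype"],
            "Bsec": _optional(attrs, "Bsec", float),   # absent for newswire
            "Esec": _optional(attrs, "Esec", float),
            "Brecid": _optional(attrs, "Brecid", int),  # absent if TEXT is empty
            "Erecid": _optional(attrs, "Erecid", int),
        })
    return rows


def tokens_for(row: dict, records: Dict[int, str]) -> List[str]:
    """Return the tokens of one DOC unit, or [] when it has no records."""
    if row["Brecid"] is None or row["Erecid"] is None:
        return []
    return [records[i] for i in range(row["Brecid"], row["Erecid"] + 1)]


# Illustrative input only: the first entry is invented to show an empty DOC.
sample_table = (
    "<BOUNDSET type=CAPTION fileid=19980107_0130_0200_CNN_HDL> "
    "<BOUNDARY docno=CNN19980107.0130.0000 doctype=MISCELLANEOUS Bsec=0.00 Esec=16.00> "
    "<BOUNDARY docno=CNN19980107.0130.0016 doctype=NEWS Bsec=16.00 Esec=53.67 Brecid=1 Erecid=3> "
    "</BOUNDSET>"
)
sample_records = {1: "With", 2: "peace", 3: "talks"}
rows = read_boundaries(sample_table)
```

The record range is inclusive at both ends, which matches the example table, where one DOC ends at record 30 and the next begins at record 31.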
For each audio sample file, there will be a file of ASR output text, which will be similar in format to the tokenized text stream: each orthographic word output by the ASR system will be a separate data record, marked as follows:
<DOCSET type=ASRTEXT fileid=19980109_1830_1900_ABC_WNT>
<X Bsec=0.00 Dur=0.01 Conf=NA>
<W recid=1 Bsec=0.01 Dur=0.15 Clust=45 Conf=0.32> IS
<W recid=2 Bsec=0.16 Dur=0.40 Clust=45 Conf=0.68> WHETHER
<W recid=3 Bsec=0.74 Dur=0.32 Clust=45 Conf=0.90> FOR
<W recid=4 Bsec=1.06 Dur=0.38 Clust=45 Conf=0.87> OTHERS
<W recid=5 Bsec=1.44 Dur=0.19 Clust=45 Conf=0.80> IT
<W recid=6 Bsec=1.63 Dur=0.16 Clust=45 Conf=0.75> IS
<W recid=7 Bsec=1.79 Dur=0.49 Clust=45 Conf=0.78> SPRING
...
</DOCSET>
The additional attributes attached to each word ("Bsec", "Dur", "Clust" and "Conf") are provided by the ASR system; the "<X>" tags represent periods of time in the audio signal where no speech was recognized. The "Conf=" attribute is a computed estimate of confidence in the correctness of a given recognized word, varying between 0 (no confidence) and 1 (highest confidence). At present, this measure does not apply to non-speech (X) segments, and it is also possible that the ASR system may be unable to assign a confidence score to some recognized words; in these cases, the attribute is given as "Conf=NA".
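A minimal Python sketch for reading ASR records, including the "Conf=NA" case, might look like this. The record layout is inferred from the example above; the class and function names are hypothetical, and "Clust" is kept as an uninterpreted string because its meaning is not specified here.

```python
import re
from typing import List, NamedTuple, Optional

# A W or X tag, optionally followed by one word that does not start with "<".
TAG_RE = re.compile(r"<([WX])\s+([^>]*)>\s*([^\s<]\S*)?")
ATTR_RE = re.compile(r"(\w+)=(\S+)")


class AsrRecord(NamedTuple):
    kind: str               # "W" for a recognized word, "X" for non-speech
    bsec: float             # start time in seconds
    dur: float              # duration in seconds
    conf: Optional[float]   # None when the file says Conf=NA
    recid: Optional[int]    # None for X segments
    clust: Optional[str]    # kept as a raw string; meaning not specified here
    word: Optional[str]     # None for X segments


def parse_conf(value: str) -> Optional[float]:
    """Map the literal NA to None and anything else to a float."""
    return None if value == "NA" else float(value)


def read_asr(text: str) -> List[AsrRecord]:
    records = []
    for kind, attr_text, word in TAG_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_text))
        records.append(AsrRecord(
            kind=kind,
            bsec=float(attrs["Bsec"]),
            dur=float(attrs["Dur"]),
            conf=parse_conf(attrs["Conf"]),
            recid=int(attrs["recid"]) if "recid" in attrs else None,
            clust=attrs.get("Clust"),
            word=word if kind == "W" and word else None,
        ))
    return records


sample_asr = (
    "<DOCSET type=ASRTEXT fileid=19980109_1830_1900_ABC_WNT> "
    "<X Bsec=0.00 Dur=0.01 Conf=NA> "
    "<W recid=1 Bsec=0.01 Dur=0.15 Clust=45 Conf=0.32> IS "
    "<W recid=2 Bsec=0.16 Dur=0.40 Clust=45 Conf=0.68> WHETHER "
    "</DOCSET>"
)
asr_records = read_asr(sample_asr)
```

Representing "NA" as `None` rather than 0 keeps words with no confidence estimate distinct from words the recognizer scored as unreliable.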
The ASR output spans the entire audio recording for each sample file. In some cases, the manual transcription or closed-caption text for the sample may begin or end at a different point in time than the audio recording, so that the ASR output may contain more (or less) material at the beginning or end than the corresponding SGML file. In any case, the boundaries of news-story segments should always be properly aligned; discrepancies in the amount of content at the beginning or end of a sample will be absorbed by <DOC> units whose <DOCTYPE> is "MISCELLANEOUS TEXT".
Also, the "Bsec" and "Dur" attributes do not necessarily account for every second of elapsed time in the broadcast -- there may be time gaps between successive records in the ASR file.
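To illustrate the point about time gaps, the following Python sketch scans a sequence of (Bsec, Dur) pairs and reports any interval not covered by a record. The function name and the tolerance value are assumptions made for this example; the input times are taken from the sample ASR records shown earlier.

```python
from typing import List, Tuple


def find_gaps(spans: List[Tuple[float, float]],
              tolerance: float = 0.005) -> List[Tuple[float, float]]:
    """Return (gap_start, gap_end) pairs between successive records.

    spans holds (Bsec, Dur) pairs in file order. The tolerance absorbs
    floating-point rounding on times given to two decimal places.
    """
    gaps = []
    for (bsec, dur), (next_bsec, _) in zip(spans, spans[1:]):
        end = bsec + dur
        if next_bsec - end > tolerance:
            gaps.append((round(end, 2), round(next_bsec, 2)))
    return gaps


# (Bsec, Dur) values from the first five records of the ASR example.
sample_spans = [(0.00, 0.01), (0.01, 0.15), (0.16, 0.40), (0.74, 0.32), (1.06, 0.38)]
gaps = find_gaps(sample_spans)
```

For these values the only uncovered interval lies between the end of "WHETHER" (0.16 + 0.40 = 0.56) and the start of "FOR" (0.74).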
Given the "fileid" and "recid" attributes in the ASR text stream data, the story boundary table will relate DOC units to the ASR data records, by providing the starting and ending record ID's for each DOCNO; each line of the table will provide the DOCNO string, the DOCTYPE value, the starting and ending "recid" values, and the starting and ending time offsets in seconds. Here is an example:
<BOUNDSET type=ASRTEXT fileid=19980109_0100_0130_CNN_HDL>
<BOUNDARY docno=CNN19980109.0100.0000 doctype=NEWS Bsec=0.00 Esec=8.00 Brecid=1 Erecid=21>
<BOUNDARY docno=CNN19980109.0100.0008 doctype=MISCELLANEOUS Bsec=8.00 Esec=19.00 Brecid=22 Erecid=49>
<BOUNDARY docno=CNN19980109.0100.0019 doctype=NEWS Bsec=19.00 Esec=67.00 Brecid=50 Erecid=189>
<BOUNDARY docno=CNN19980109.0100.0067 doctype=NEWS Bsec=67.00 Esec=89.09 Brecid=190 Erecid=251>
...
</BOUNDSET>
In some cases, a "DOC" of type "MISCELLANEOUS" will span a period of time in which the ASR system will not have found any recognizable speech. In such cases the "Brecid" and "Erecid" attributes will not be present in the BOUNDARY tag; the "Bsec" and "Esec" attributes are always present.
For each of the target topics defined for TDT3, this table will list the DOCNO strings for all DOC units that were judged relevant to that topic. The topics will be identified by an index number (101..160???). Each line of the table will have the topic index number, the file ID in the SGML archive set, the DOCNO, and the level of relevance ("YES" or "BRIEF"). DOC units that were judged irrelevant to all topics will not be listed in this table. If the annotator entered remarks (about something unusual or noteworthy in a given story or its relation to a given topic) the existence of a comment is noted, and all comments are assembled in a separate listing. Here are some examples of entries in the topic relevance table:
<TOPICSET>
<ONTOPIC topicid=1 level=YES docno=ABC19980108.1830.0711 fileid=19980108_1830_1900_ABC_WNT comments=NO>
<ONTOPIC topicid=1 level=YES docno=ABC19980109.1830.0551 fileid=19980109_1830_1900_ABC_WNT comments=NO>
...
<ONTOPIC topicid=1 level=BRIEF docno=APW19980105.0806 fileid=19980105_0000_2400_APW_ENG comments=NO>
<ONTOPIC topicid=1 level=YES docno=APW19980105.0808 fileid=19980105_0000_2400_APW_ENG comments=NO>
...
</TOPICSET>
Stories can be judged as relevant to more than one topic, in which case the same "docno" will appear more than once in this table, with different "topicid" values.
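As a sketch of how the topic relevance table could be consumed, the Python below parses ONTOPIC entries and groups topic IDs by "docno", which makes multi-topic stories easy to find. The helper names are hypothetical, the parsing assumes unquoted attribute values as in the example, and the third entry of the sample input is invented purely to show a story that is relevant to two topics.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

ONTOPIC_RE = re.compile(r"<ONTOPIC\s+([^>]*)>")
ATTR_RE = re.compile(r"(\w+)=(\S+)")


def read_topics(text: str) -> List[Dict[str, str]]:
    """Parse every ONTOPIC tag into a dict of its attributes."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in ONTOPIC_RE.finditer(text)]


def topics_by_doc(rows: List[Dict[str, str]]) -> Dict[str, Set[int]]:
    """Map each docno to the set of topic IDs it was judged relevant to."""
    by_doc: Dict[str, Set[int]] = defaultdict(set)
    for row in rows:
        by_doc[row["docno"]].add(int(row["topicid"]))
    return dict(by_doc)


# Illustrative input: the topicid=2 entry is invented for this example.
sample_topics = (
    "<TOPICSET> "
    "<ONTOPIC topicid=1 level=YES docno=ABC19980108.1830.0711 "
    "fileid=19980108_1830_1900_ABC_WNT comments=NO> "
    "<ONTOPIC topicid=1 level=YES docno=ABC19980109.1830.0551 "
    "fileid=19980109_1830_1900_ABC_WNT comments=NO> "
    "<ONTOPIC topicid=2 level=BRIEF docno=ABC19980108.1830.0711 "
    "fileid=19980108_1830_1900_ABC_WNT comments=NO> "
    "</TOPICSET>"
)
topic_rows = read_topics(sample_topics)
doc_topics = topics_by_doc(topic_rows)
multi_topic = {doc: ids for doc, ids in doc_topics.items() if len(ids) > 1}
```

DOC units that were judged irrelevant to all topics never appear in the table, so they are simply absent from the resulting mapping.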