There are two basic units of organization for the TDT2 Corpus: the sample file, identified by a unique file-ID, and the individual story unit, identified by a unique story-ID. The file-ID identifies the date and time span (in terms of Eastern time zone) of the sample file's content, as well as the news source from which it was captured. The sample file-ID is applied to every form of data associated with a given sample (waveform, closed caption text, transcription or newswire text, automatic speech recognition text), and different data types from the sample are distinguished by a three-letter extension to the file-ID ("sph", "cap", "txt", "asr"). Here are two example file-IDs:
19980120_1830_1900_ABC_WNT
19980120_1924_2237_APW_ENG
The story-ID is made up of a 3-letter identifier for the news source, the date of collection, and an index number. For broadcast sources, the starting time of the broadcast is also included as part of each story-ID. Here are two examples of story-IDs that could be found within the file-IDs shown above:
ABC19980120.1830.0003
APW19980120.0881
While the story index numbers will increase over the span of the file, they are not strictly sequential. In newswire data, the story-ID's are applied sequentially at the first stage of processing to establish a stable reference for other possible uses of the data, where the filtering and subsampling of stories for TDT might not apply. In broadcast data, the integer number of seconds from the beginning of the sample file to the beginning of the story is used as the index number.
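The ID conventions above can be sketched as a pair of small parsers. This is a minimal illustration based only on the four example IDs shown; the regular expressions, the function names, and the dictionary keys (in particular "suffix" for the last three-letter field of the file-ID, whose meaning the text does not spell out) are assumptions, not part of the corpus specification.

```python
import re
from datetime import datetime

# Patterns inferred from the example IDs in the text, not from a formal spec.
FILE_ID_RE = re.compile(r"^(\d{8})_(\d{4})_(\d{4})_([A-Z]{3})_([A-Z]{3})$")
STORY_ID_RE = re.compile(r"^([A-Z]{3})(\d{8})\.(?:(\d{4})\.)?(\d+)$")


def parse_file_id(file_id):
    """Split a file-ID into date, time span and source fields."""
    m = FILE_ID_RE.match(file_id)
    if m is None:
        raise ValueError(f"not a recognizable file-ID: {file_id!r}")
    date, start, end, source, suffix = m.groups()
    return {
        "date": datetime.strptime(date, "%Y%m%d").date(),
        "start": start,    # HHMM strings, Eastern time; kept as text
        "end": end,
        "source": source,
        "suffix": suffix,  # hypothetical key name for the trailing field
    }


def parse_story_id(story_id):
    """Split a story-ID; "start" is None for newswire stories."""
    m = STORY_ID_RE.match(story_id)
    if m is None:
        raise ValueError(f"not a recognizable story-ID: {story_id!r}")
    source, date, start, index = m.groups()
    return {
        "source": source,
        "date": datetime.strptime(date, "%Y%m%d").date(),
        "start": start,
        # For broadcast data this is the offset in whole seconds from the
        # start of the sample file; for newswire it is a running index.
        "index": int(index),
    }


if __name__ == "__main__":
    print(parse_file_id("19980120_1830_1900_ABC_WNT"))
    print(parse_story_id("ABC19980120.1830.0003"))
    print(parse_story_id("APW19980120.0881"))
```

The start and end times are left as strings rather than converted to time objects, because the values in a file-ID mark a span and need not be valid clock times for a time parser.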
In terms of story content, each CNN sample unit covers 30 minutes of broadcast time and contains, on average, about 20 news stories; ABC sample units also cover 30 minutes, with 12 stories on average; PRI and VOA sample units each cover 60-minute broadcasts and contain between two and three dozen stories per unit. The time periods covered by the newswire sample units vary widely; the selection process that extracts these units from the daily newswire feeds is configured to create units of roughly equal size, with 20 stories per unit on average (a sampling unit may contain more or fewer stories, but the total number of words in each unit should be comparable).
Within each SGML sample unit file, the individual stories are marked off as SGML <DOC> units, using the structure illustrated below. The numbers enclosed in square brackets refer to explanations in the list that follows the illustration.
<DOC>
<DOCNO> [1] </DOCNO>
<DOCTYPE> [2] </DOCTYPE>
<DATE_TIME> [3] </DATE_TIME>
<HEADER>
[4]
</HEADER>
<BODY> [5]
<SLUG> [6] </SLUG>
<HEADLINE>
[7]
</HEADLINE>
[8]
<TEXT>
[9]
<ANNOTATION> [10] </ANNOTATION>
<TURN> [11]
</TEXT>
[12]
</BODY>
<END_TIME> [13] </END_TIME>
<TRAILER>
[14]
</TRAILER>
</DOC>
Some of the SGML elements shown above are source dependent in their occurrence and/or content, while other elements are specifically intended to assure comparable use across all sources.
In the case of newswire sources (APW, NYT), the SGML archival form is
derived from the ANPA transmission format received by LDC via modem;
the SGML sample files contain only the stories that have been selected
for annotation, and that have not been rejected by annotators due to
unsuitable content (i.e. only those DOC's that remain classified as
"NEWS STORY" are retained). The selection process for newswire
annotation automatically eliminates classes of DOC units in the stream
that do not contain news stories, but some non-news units do get passed
through for annotation; annotators have the option to flag stories as
unacceptable for topic labeling if any of the following conditions is
seen:
Of these four causes for rejection from topic labeling, the first three will result in the DOC unit being excluded from the SGML files delivered to researchers.
In the case of audio sources (ABC, CNN, PRI, VOA), the SGML archival form is derived from closed-caption text or commercially produced transcripts that have been annotated manually to establish time stamps and DOCTYPE labels for all DOC unit boundaries; all DOC units are retained for the audio sources (i.e. both "NEWS STORY" and "MISCELLANEOUS TEXT").
Regarding the last item, newswire stories typically begin with a
"dateline" string that identifies the location from which the news
report was originally sent; for example:
BELFAST, Northern Ireland (AP) _ With peace talks already
threatened by a spate of killings, police in Northern Ireland
detonated a car bomb Wednesday in a rural town near Belfast.
Given this paragraph from the SGML-tagged archive, the untagged stream would contain only the space-separated tokens following the underscore character "_".
The tokenization process simply splits the text stream using white-space as the delimiter. All punctuation, hyphenation, quotations and parentheses are retained without modification; in other words, every string of one or more contiguous non-space characters is output as one token. All strings of one or more white-space characters are treated alike as single delimiters; this has the effect of eliminating paragraph boundaries in newswire, which are marked in the SGML files by indentation.
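The dateline removal and white-space tokenization described above might look roughly like this. It is a sketch under two assumptions that the text does not state outright: that the underscore appears as a stand-alone token, and that a paragraph without one is passed through unchanged. The function names are illustrative.

```python
def tokenize(text):
    """Split on runs of white-space; punctuation stays attached to tokens."""
    # str.split() with no argument treats any run of spaces, tabs and
    # newlines as one delimiter, so paragraph indentation disappears.
    return text.split()


def strip_dateline(tokens):
    """Drop everything up to and including the first stand-alone "_" token.

    Assumption: a paragraph with no "_" token has no dateline and is
    returned unchanged.
    """
    if "_" in tokens:
        return tokens[tokens.index("_") + 1:]
    return list(tokens)


if __name__ == "__main__":
    paragraph = """   BELFAST, Northern Ireland (AP) _ With peace talks already
    threatened by a spate of killings, police in Northern Ireland
    detonated a car bomb Wednesday in a rural town near Belfast."""
    print(strip_dateline(tokenize(paragraph))[:10])
```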
Except for this neutralization of white-space, and the elimination of
datelines and ANNOTATION tags, it is possible to reconstruct the full
content of the TEXT elements of the SGML files from the tokenized text
files. The following example shows the format of the tokenized
stream:
<DOCSET type=NEWSWIRE fileid=19980107_0000_0800_APW_ENG>
<W recid=1> With
<W recid=2> peace
<W recid=3> talks
<W recid=4> already
<W recid=5> threatened
<W recid=6> by
<W recid=7> a
<W recid=8> spate
<W recid=9> of
<W recid=10> killings,
...
</DOCSET>
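Reading such a stream back into (recid, token) pairs could be done along the following lines. This is only a sketch: the regular expression is inferred from the example above rather than from a formal grammar, and the function names are made up for illustration.

```python
import re

# One record = a <W recid=N> tag followed by a single white-space-free token.
W_RE = re.compile(r"<W recid=(\d+)>\s+(\S+)")


def read_tokens(docset_text):
    """Return a list of (recid, token) pairs from a tokenized DOCSET stream."""
    return [(int(recid), tok) for recid, tok in W_RE.findall(docset_text)]


def rebuild_text(pairs):
    """Join tokens with single spaces; original white-space is not recoverable."""
    return " ".join(tok for _, tok in pairs)


SAMPLE = """<DOCSET type=NEWSWIRE fileid=19980107_0000_0800_APW_ENG>
<W recid=1> With
<W recid=2> peace
<W recid=3> talks
<W recid=4> already
<W recid=5> threatened
</DOCSET>"""

if __name__ == "__main__":
    pairs = read_tokens(SAMPLE)
    print(pairs)
    print(rebuild_text(pairs))
```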
Given the "fileid" and "recid" attributes in the tokenized text stream data, the story boundary table will relate the original DOC units to the DOCSET data records, by providing the starting and ending record ID's for each DOCNO; each line of the table will provide the SGML file ID, the DOCSET file ID, the DOCNO string, and the starting and ending "recid" values within the DOCSET file that represent the boundaries of the DOC unit. For audio sources, the table also gives starting and ending time offsets for each DOC, in seconds, from the start of the file. Here is an example:
<BOUNDSET type=CAPTION fileid=19980107_0130_0200_CNN_HDL>
<BOUNDARY docno=CNN19980107.0130.0000 doctype=MISCELLANEOUS Bsec=0.00 Esec=16.00 Brecid=1 Erecid=30>
<BOUNDARY docno=CNN19980107.0130.0016 doctype=NEWS Bsec=16.00 Esec=53.67 Brecid=31 Erecid=124>
<BOUNDARY docno=CNN19980107.0130.0053 doctype=NEWS Bsec=53.67 Esec=121.51 Brecid=125 Erecid=313>
...
</BOUNDSET>
It can happen that a DOC unit from a broadcast source may contain an empty TEXT portion; in this case, the table entry for that DOC does not contain the "Brecid" and "Erecid" attributes, but the "Bsec" and "Esec" attributes are always present for broadcast sources. The "Bsec" and "Esec" attributes are not present for newswire sources.
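A boundary table of this form can be read with a small attribute parser. The sketch below assumes the attribute names shown in the example (docno, doctype, Bsec, Esec, Brecid, Erecid) and maps any attribute that is absent to None, which covers both the empty-TEXT case and the newswire case; the second BOUNDARY entry in the sample is a shortened, made-up variant used only to exercise the missing-attribute path.

```python
import re

BOUNDARY_RE = re.compile(r"<BOUNDARY\s+([^>]*)>")
ATTR_RE = re.compile(r"(\w+)=(\S+)")


def _opt(attrs, key, convert):
    """Convert attrs[key] if present, otherwise return None."""
    return convert(attrs[key]) if key in attrs else None


def parse_boundaries(text):
    """Return one dict per BOUNDARY tag; absent attributes become None."""
    rows = []
    for body in BOUNDARY_RE.findall(text):
        attrs = dict(ATTR_RE.findall(body))
        rows.append({
            "docno": attrs["docno"],
            "doctype": attrs["doctype"],
            "bsec": _opt(attrs, "Bsec", float),      # absent for newswire
            "esec": _opt(attrs, "Esec", float),
            "brecid": _opt(attrs, "Brecid", int),    # absent for empty TEXT
            "erecid": _opt(attrs, "Erecid", int),
        })
    return rows


SAMPLE = """<BOUNDSET type=CAPTION fileid=19980107_0130_0200_CNN_HDL>
<BOUNDARY docno=CNN19980107.0130.0000 doctype=MISCELLANEOUS Bsec=0.00 Esec=16.00 Brecid=1 Erecid=30>
<BOUNDARY docno=CNN19980107.0130.0016 doctype=NEWS Bsec=16.00 Esec=53.67>
</BOUNDSET>"""

if __name__ == "__main__":
    for row in parse_boundaries(SAMPLE):
        print(row)
```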
For each audio sample file, there will be a file of ASR output text,
which will be similar in format to the tokenized text stream: each
orthographic word output by the ASR system will be a separate data
record, marked as follows:
<DOCSET type=ASRTEXT fileid=19980109_1830_1900_ABC_WNT>
<X Bsec=0.00 Dur=0.01 Conf=NA>
<W recid=1 Bsec=0.01 Dur=0.15 Clust=45 Conf=0.32> IS
<W recid=2 Bsec=0.16 Dur=0.40 Clust=45 Conf=0.68> WHETHER
<W recid=3 Bsec=0.74 Dur=0.32 Clust=45 Conf=0.90> FOR
<W recid=4 Bsec=1.06 Dur=0.38 Clust=45 Conf=0.87> OTHERS
<W recid=5 Bsec=1.44 Dur=0.19 Clust=45 Conf=0.80> IT
<W recid=6 Bsec=1.63 Dur=0.16 Clust=45 Conf=0.75> IS
<W recid=7 Bsec=1.79 Dur=0.49 Clust=45 Conf=0.78> SPRING
...
</DOCSET>
The additional attributes attached to each word ("Bsec", "Dur", "Clust", "Conf") are provided by the ASR system; the "<X>" tags represent periods of time in the audio signal where no speech was recognized. The "Conf=" attribute is a computed estimate of confidence in the correctness of a given recognized word, varying between 0 (no confidence) and 1 (highest confidence). At present, this measure does not apply to non-speech (X) segments, and it is also possible that the ASR system may be unable to assign a confidence score to some recognized words; in these cases, the attribute is given as "Conf=NA".
The ASR output spans the entire audio recording for each sample file. In some cases, the manual transcription or closed-caption text for the sample may begin or end at a different point in time than the audio recording, so that the ASR output may contain more (or less) material at the beginning or end than the corresponding SGML file. In any case, the boundaries of news-story segments should always be properly aligned; discrepancies in the amount of content at the beginning or end of a sample will be absorbed by <DOC> units whose <DOCTYPE> is "MISCELLANEOUS TEXT".
Also, the "Bsec" and "Dur" attributes do not necessarily account for every second of elapsed time in the broadcast -- there may be time gaps between successive records in the ASR file.
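The ASR record format and the time gaps mentioned above can be illustrated with a short parser. This is a sketch that assumes the attribute layout of the example records; the function names, the dictionary keys, and the 0.01-second gap threshold are illustrative choices, not part of the corpus definition.

```python
import re

# A record is a <W ...> or <X ...> tag, optionally followed by one word.
REC_RE = re.compile(r"<(W|X)\s+([^>]*)>\s*([^<\s]*)")
ATTR_RE = re.compile(r"(\w+)=(\S+)")


def parse_asr(text):
    """Return one dict per <W> or <X> record in an ASR DOCSET stream."""
    records = []
    for kind, attr_text, word in REC_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_text))
        conf = attrs.get("Conf", "NA")
        records.append({
            "kind": kind,                                   # "W" or "X"
            "recid": int(attrs["recid"]) if "recid" in attrs else None,
            "bsec": float(attrs["Bsec"]),
            "dur": float(attrs["Dur"]),
            "conf": None if conf == "NA" else float(conf),  # NA -> None
            "word": word or None,                           # None for <X>
        })
    return records


def find_gaps(records, min_gap=0.01):
    """List (end_of_previous, start_of_next, gap) for unaccounted time."""
    gaps = []
    for prev, nxt in zip(records, records[1:]):
        end = round(prev["bsec"] + prev["dur"], 2)
        gap = round(nxt["bsec"] - end, 2)
        if gap >= min_gap:
            gaps.append((end, nxt["bsec"], gap))
    return gaps


SAMPLE = """<DOCSET type=ASRTEXT fileid=19980109_1830_1900_ABC_WNT>
<X Bsec=0.00 Dur=0.01 Conf=NA>
<W recid=1 Bsec=0.01 Dur=0.15 Clust=45 Conf=0.32> IS
<W recid=2 Bsec=0.16 Dur=0.40 Clust=45 Conf=0.68> WHETHER
<W recid=3 Bsec=0.74 Dur=0.32 Clust=45 Conf=0.90> FOR
</DOCSET>"""

if __name__ == "__main__":
    recs = parse_asr(SAMPLE)
    print(find_gaps(recs))
```

In the sample, the record for WHETHER ends at 0.56 seconds while the record for FOR begins at 0.74, so the sketch reports one gap between them.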
Given the "fileid" and "recid" attributes in the ASR text stream data,
the story boundary table will relate DOC units to the ASR data
records, by providing the starting and ending record ID's for each
DOCNO; each line of the table will provide the DOCNO string, the
DOCTYPE value, the starting and ending "recid" values, and
the starting and ending time offsets in seconds. Here is an example:
<BOUNDSET type=ASRTEXT fileid=19980109_0100_0130_CNN_HDL>
<BOUNDARY docno=CNN19980109.0100.0000 doctype=NEWS Bsec=0.00 Esec=8.00 Brecid=1 Erecid=21>
<BOUNDARY docno=CNN19980109.0100.0008 doctype=MISCELLANEOUS Bsec=8.00 Esec=19.00 Brecid=22 Erecid=49>
<BOUNDARY docno=CNN19980109.0100.0019 doctype=NEWS Bsec=19.00 Esec=67.00 Brecid=50 Erecid=189>
<BOUNDARY docno=CNN19980109.0100.0067 doctype=NEWS Bsec=67.00 Esec=89.09 Brecid=190 Erecid=251>
...
</BOUNDSET>
In some cases, a "DOC" of type "MISCELLANEOUS" will span a period of time in which the ASR system will not have found any recognizable speech. In such cases the "Brecid" and "Erecid" attributes will not be present in the BOUNDARY tag; the "Bsec" and "Esec" attributes are always present.
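When the boundary table is used to pull the ASR words for one story, the missing-attribute case has to be handled explicitly. The following sketch assumes that a boundary entry has already been read into a dictionary keyed by attribute name and that the ASR records are available as (recid, word) pairs; both representations and the function name are hypothetical.

```python
def words_for_doc(boundary, records):
    """Return the words whose recid falls inside one BOUNDARY entry.

    boundary: dict of attribute name -> string value for one BOUNDARY tag.
    records:  iterable of (recid, word) pairs from the ASR stream.
    An entry without Brecid/Erecid covers no recognized speech, so the
    result is an empty list rather than an error.
    """
    if "Brecid" not in boundary or "Erecid" not in boundary:
        return []
    first, last = int(boundary["Brecid"]), int(boundary["Erecid"])
    return [word for recid, word in records if first <= recid <= last]


if __name__ == "__main__":
    # Tiny invented record list, used only to demonstrate the slicing.
    records = [(1, "IS"), (2, "WHETHER"), (3, "FOR"), (4, "OTHERS")]
    with_speech = {"docno": "X1", "Bsec": "0.00", "Esec": "1.00",
                   "Brecid": "2", "Erecid": "3"}
    no_speech = {"docno": "X2", "Bsec": "1.00", "Esec": "5.00"}
    print(words_for_doc(with_speech, records))
    print(words_for_doc(no_speech, records))
```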
For each of the target topics defined for TDT2, this table will list
the DOCNO strings for all DOC units that were judged relevant to that
topic. The topics will be identified by an index number (1..100).
Each line of the table will have the topic index number, the file ID
in the SGML archive set, the DOCNO, and the level of relevance ("YES"
or "BRIEF"). DOC units that were judged irrelevant to all topics will
not be listed in this table. If the annotator entered remarks (about
something unusual or noteworthy in a given story or its relation to a
given topic) the existence of a comment is noted, and all comments are
assembled in a separate listing. Here are some examples of entries in
the topic relevance table:
<TOPICSET>
<ONTOPIC topicid=1 level=YES docno=ABC19980108.1830.0711 fileid=19980108_1830_1900_ABC_WNT comments=NO>
<ONTOPIC topicid=1 level=YES docno=ABC19980109.1830.0551 fileid=19980109_1830_1900_ABC_WNT comments=NO>
...
<ONTOPIC topicid=1 level=BRIEF docno=APW19980105.0806 fileid=19980105_0000_2400_APW_ENG comments=NO>
<ONTOPIC topicid=1 level=YES docno=APW19980105.0808 fileid=19980105_0000_2400_APW_ENG comments=NO>
...
</TOPICSET>
Stories can be judged as relevant to more than one topic, in which case the same "docno" will appear more than once in this table, with different "topicid" values.
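Grouping the relevance table by story makes the multi-topic case easy to see. The sketch below assumes the attribute names from the example entries; the function names are illustrative, and the topicid=2 line in the sample is invented purely to show a story that is listed under two topics.

```python
import re
from collections import defaultdict

ONTOPIC_RE = re.compile(r"<ONTOPIC\s+([^>]*)>")
ATTR_RE = re.compile(r"(\w+)=(\S+)")


def parse_topic_table(text):
    """Return one attribute dict per ONTOPIC entry."""
    return [dict(ATTR_RE.findall(body)) for body in ONTOPIC_RE.findall(text)]


def topics_by_doc(entries):
    """Map each docno to a list of (topicid, level) judgments."""
    by_doc = defaultdict(list)
    for entry in entries:
        by_doc[entry["docno"]].append((int(entry["topicid"]), entry["level"]))
    return dict(by_doc)


SAMPLE = """<TOPICSET>
<ONTOPIC topicid=1 level=YES docno=ABC19980108.1830.0711 fileid=19980108_1830_1900_ABC_WNT comments=NO>
<ONTOPIC topicid=1 level=BRIEF docno=APW19980105.0806 fileid=19980105_0000_2400_APW_ENG comments=NO>
<ONTOPIC topicid=2 level=YES docno=APW19980105.0806 fileid=19980105_0000_2400_APW_ENG comments=NO>
</TOPICSET>"""
# The topicid=2 entry above is made up for illustration only.

if __name__ == "__main__":
    judgments = topics_by_doc(parse_topic_table(SAMPLE))
    for docno, topics in judgments.items():
        print(docno, topics)
```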