Organization of the TDT2 Corpus

Overview

There are two basic units of organization for the TDT2 Corpus: the sample file, identified by a unique file-ID, and the individual story unit, identified by a unique story-ID. The file-ID identifies the date and time span (in terms of Eastern time zone) of the sample file's content, as well as the news source from which it was captured. The sample file-ID is applied to every form of data associated with a given sample (waveform, closed caption text, transcription or newswire text, automatic speech recognition text), and different data types from the sample are distinguished by a three-letter extension to the file-ID ("sph", "cap", "txt", "asr"). Here are two example file-IDs:

19980120_1830_1900_ABC_WNT
19980120_1924_2237_APW_ENG

The story-ID is made up of a 3-letter identifier for the news source, the date of collection, and an index number. For broadcast sources, the starting time of the broadcast is also included as part of each story-ID. Here are two examples of story-IDs that could be found within the file-IDs shown above:

ABC19980120.1830.0003
APW19980120.0881

While the story index numbers will increase over the span of the file, they are not strictly sequential. In newswire data, the story-IDs are applied sequentially at the first stage of processing to establish a stable reference for other possible uses of the data, where the filtering and subsampling of stories for TDT might not apply. In broadcast data, the integer number of seconds from the beginning of the sample file to the beginning of the story is used as the index number.
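
The ID conventions above can be split into their fields with two regular expressions. This is a minimal sketch based only on the four example IDs shown; the names FileId, StoryId, parse_file_id and parse_story_id are illustrative, and the trailing file-ID field ("WNT", "ENG") is kept as an opaque "suffix" because its meaning is not stated here.

```python
import re
from typing import NamedTuple, Optional


class FileId(NamedTuple):
    date: str    # YYYYMMDD
    start: str   # HHMM, Eastern time
    end: str     # HHMM, Eastern time
    source: str  # news source, e.g. "ABC" or "APW"
    suffix: str  # trailing field, e.g. "WNT" or "ENG" (kept opaque)


class StoryId(NamedTuple):
    source: str
    date: str
    start: Optional[str]  # broadcast start time; None for newswire
    index: int            # for broadcast: seconds from start of the sample file


# Patterns inferred from the examples above (an assumption, not a specification).
FILE_ID_RE = re.compile(r"(\d{8})_(\d{4})_(\d{4})_([A-Z]+)_([A-Z]+)")
STORY_ID_RE = re.compile(r"([A-Z]{3})(\d{8})(?:\.(\d{4}))?\.(\d+)")


def parse_file_id(file_id: str) -> FileId:
    m = FILE_ID_RE.fullmatch(file_id)
    if m is None:
        raise ValueError(f"unrecognized file-ID: {file_id!r}")
    return FileId(*m.groups())


def parse_story_id(story_id: str) -> StoryId:
    m = STORY_ID_RE.fullmatch(story_id)
    if m is None:
        raise ValueError(f"unrecognized story-ID: {story_id!r}")
    source, date, start, index = m.groups()
    return StoryId(source, date, start, int(index))
```

For a broadcast story-ID such as ABC19980120.1830.0003, the parsed index is the story's offset in seconds from the start of the sample file; for a newswire ID the start field is None.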

Components of the TDT2 Corpus:


  1. The SGML-tagged text archive: contains all annotated text content from all sources, organized by source and sampling period, and structured as a sequence of SGML "<DOC>" units, each uniquely identified by a "<DOCNO>" field. Each news story in the corpus is represented as one "DOC" unit.
  2. The untagged text stream: contains tokenized versions of all annotated text data from all sources, with story boundaries and other structural cues removed, organized by source and sampling period. Each word token in this version of the text constitutes one data record.
  3. The text stream boundary tables: these list, for each SGML "<DOC>" unit (referenced by its "<DOCNO>" field), the data record boundaries for that unit in the tokenized version of the text.
  4. The ASR output: as provided by Dragon Systems, for the four audio sources, organized by source and sampling period. Each word token in this output constitutes one data record.
  5. The ASR boundary tables: these list, for each SGML "<DOC>" unit covered by the ASR output (referenced by its "<DOCNO>" field), the data record boundaries for that unit in the ASR output.
  6. A table of topic relevance judgements: created by annotators at the LDC, this lists, for each of the target topics defined for TDT2, all the stories (by DOCNO) that have been judged relevant to that topic, along with the level of relevance: fully relevant (labeled "YES"), or slightly relevant (labeled "BRIEF").
  7. The full set of digital audio files from the four audio sources.

1. The SGML-tagged text archive

For each sampling unit created for TDT2, there will be one file of SGML-tagged text data; each sampling unit consists of the stories captured from one source during one contiguous time span. Among the broadcast (audio) sources, there are typically four sampling units per day for CNN, two per day for VOA, and one per day for ABC and PRI. The newswire sources (APW, NYT) each have four sampling units per day, representing a quasi-arbitrary selection of four discrete blocks of stories taken at evenly-spaced intervals from the full day's data stream.

In terms of story content, each CNN sample unit covers 30 minutes of broadcast time and contains, on average, about 20 news stories; ABC sample units also cover 30 minutes, with 12 stories on average; PRI and VOA sample units each cover 60-minute broadcasts, and range between two and three dozen stories per unit. The time periods covered by the newswire sample units are widely variable; the selection process that extracts these units from the daily newswire feeds is configured to create units that are fairly equivalent in size, such that there will be 20 stories per unit, on average (the sampling units may contain more or fewer stories, but the total number of words in each unit should be comparable).

Within each SGML sample unit file, the individual stories are marked off as SGML <DOC> units, using the structure illustrated below. The numbers enclosed in square brackets refer to explanations in the list that follows the illustration.

<DOC>
   <DOCNO> [1] </DOCNO>
   <DOCTYPE> [2] </DOCTYPE>
   <DATE_TIME> [3] </DATE_TIME>
   <HEADER>
      [4]
   </HEADER>
   <BODY> [5]
      <SLUG> [6] </SLUG>
      <HEADLINE>
         [7]
      </HEADLINE>
      [8] 
      <TEXT>
         [9]
         <ANNOTATION> [10] </ANNOTATION>
         <TURN> [11]
      </TEXT>
      [12]
   </BODY>
   <END_TIME> [13] </END_TIME>
   <TRAILER>
      [14]
   </TRAILER>
</DOC>

Some of the SGML elements shown above are source dependent in their occurrence and/or content, while other elements are specifically intended to assure comparable use across all sources.

  1. The DOCNO tag is common to all sources, and contains the story-ID string described earlier.
  2. The DOCTYPE tag is common to all sources, and contains one of two possible text values: "NEWS STORY" appears here if the content of the subsequent TEXT region is a news story; "MISCELLANEOUS TEXT" appears if the TEXT region contains material that is not a news story.
  3. The DATE_TIME tag is common to all sources, and contains a uniformly formatted string giving the date and time at which reception of the story began, in terms of Eastern (U.S.) time (either standard or daylight savings, depending on clock settings at the time of reception).
  4. The HEADER tag is specifically for ANPA newswire material; it stores the ANPA-encoded transmission header that accompanies each story unit. The content and interpretation of this tag is source dependent (different newswires use it in different ways). This tag is not used for broadcast sources.
  5. The BODY tag is common to all sources. For ANPA newswires, this tag contains all material that is transmitted between the ANPA header structure and the closing "trailer" information. For broadcast sources, this includes all material provided by either closed captions or transcribers that is associated with the story unit.
  6. The SLUG tag is specifically for ANPA newswire material; it contains an oddly-formatted "keyword" string that is commonly supplied by wire services and used by receiving news editors to sort through the daily newswire flow.
  7. The HEADLINE tag is specifically for ANPA newswire material; it contains text that has been selected from the initial portion of the article body (prior to the start of the actual text content of the article), based on a heuristic pattern match that is source-dependent. Some newswire story units will use ambiguous or unexpected formatting when rendering the headline, causing the pattern match to fail partially or entirely; other story units lack a headline altogether.
  8. The region between the opening of the story BODY and the opening of the TEXT tag will often include descriptive material about the story, i.e. material provided by a newswire service or a transcriber to classify or explain the story's content. In the case of newswire, this material often includes cryptic typesetting instructions and advisories to editors.
  9. The TEXT tag is common to all sources, and contains the actual "narrative" content of the story -- i.e. what is being reported by the newswire authors and broadcast speakers. The text content is typically mixed case for newswires and monocase for closed caption data; transcription text may be either mixed or monocase. Newswires use conventional paragraph formatting and punctuation, except that there is use of the underscore character "_" as an "em-dash", in addition to the common hyphen character "-"; closed caption and transcript text also uses common punctuation (perhaps less than is common in newswire), but paragraph breaks are essentially undefined or non-existent.
  10. The ANNOTATION tag may appear in all sources, but more often in transcripts and closed captions than in newswire. It is intended to contain material within the TEXT portion that is not part of the narrative content. This includes speaker attributions and indications of background noise or music in the transcripts and captions, as well as brief advisories to editors embedded within newswire stories (e.g. "STORY CAN END HERE -- OPTIONAL MATERIAL FOLLOWS").
  11. The TURN tag is specifically for broadcast sources, and flags a point at which the transcript or closed caption has marked a speaker turn boundary within a story.
  12. As with item [8] above, the region between the close of the TEXT tag and the close of the BODY tag may contain additional material about the story that has been provided by a transcriber or newswire service.
  13. The END_TIME tag is specifically for broadcast sources, and identifies the time at which the story ended in the broadcast, using the same format as the DATE_TIME tag.
  14. The TRAILER tag is specifically for newswire sources, and contains the formulaic date/time stamp that is typically transmitted at the end of every story unit.
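
Because the TURN tag has no closing counterpart, a DOC unit is not well-formed XML, so a pattern-based reader is one reasonable way to pull out the fields that are common to all sources. The following is a minimal sketch under that assumption; iter_docs and get_field are illustrative names, and the sample text is invented for the example rather than taken from the corpus.

```python
import re
from typing import Dict, Iterator, Optional

# "<DOC>" includes the closing bracket, so it cannot be confused with "<DOCNO>".
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def get_field(doc: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.DOTALL)
    return m.group(1).strip() if m else None


def iter_docs(sgml: str) -> Iterator[Dict[str, Optional[str]]]:
    """Yield the DOCNO, DOCTYPE and TEXT content of each DOC unit in a sample file."""
    for m in DOC_RE.finditer(sgml):
        doc = m.group(1)
        yield {
            "docno": get_field(doc, "DOCNO"),
            "doctype": get_field(doc, "DOCTYPE"),
            "text": get_field(doc, "TEXT"),
        }


# Invented sample, for illustration only.
SAMPLE = """
<DOC>
<DOCNO> APW19980120.0881 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<BODY>
<TEXT>
Some story text.
</TEXT>
</BODY>
</DOC>
"""

docs = list(iter_docs(SAMPLE))
```

The TEXT content returned here still contains any embedded ANNOTATION and TURN markup; removing it is left to the caller.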

    In the case of newswire sources (APW, NYT), the SGML archival form is derived from the ANPA transmission format received by LDC via modem; the SGML sample files contain only the stories that have been selected for annotation, and that have not been rejected by annotators due to unsuitable content (i.e. only those DOC units that remain classified as "NEWS STORY" are retained). The selection process for newswire annotation automatically eliminates classes of DOC units in the stream that do not contain news stories, but some non-news units do get passed through for annotation; annotators have the option to flag stories as unacceptable for topic labeling if any of the following conditions is seen:

    Of these four causes for rejection from topic labeling, the first three will result in the DOC unit being excluded from the SGML files delivered to researchers.

    In the case of audio sources (ABC, CNN, PRI, VOA), the SGML archival form is derived from closed-caption text or commercially produced transcripts that have been annotated manually to establish time stamps and DOCTYPE labels for all DOC unit boundaries; all DOC units are retained for the audio sources (i.e. both "NEWS STORY" and "MISCELLANEOUS TEXT").

    2. The untagged text stream files

    This form of text data is derived from the SGML-tagged text archive. For each sample unit from each source, there will be one file containing all the text content of the sample, in which each space-separated orthographic token is presented as a separate data record, and all other information from the original sample is excluded. The excluded information consists of:

    Regarding the last item, newswire stories typically begin with a "dateline" string that identifies the location from which the news report was originally sent; for example:

               BELFAST, Northern Ireland (AP) _ With peace talks already
        threatened by a spate of killings, police in Northern Ireland
        detonated a car bomb Wednesday in a rural town near Belfast.
    

    Given this paragraph from the SGML-tagged archive, the untagged stream would contain only the space-separated tokens following the underscore character "_".

    The tokenization process simply splits the text stream using white-space as the delimiter. All punctuation, hyphenation, quotations and parentheses are retained without modification; in other words, every string of one or more contiguous non-space characters is output as one token. All strings of one or more white-space characters are treated alike as single delimiters; this has the effect of eliminating paragraph boundaries in newswire, which are marked in the SGML files by indentation.
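
The white-space tokenization described above amounts to a plain split on runs of white-space. The sketch below applies it to the dateline example; strip_dateline is a hypothetical helper that simply drops everything up to the first space-delimited underscore, which is an assumption for illustration and not necessarily how the corpus tools detect datelines.

```python
from typing import List

PARAGRAPH = """BELFAST, Northern Ireland (AP) _ With peace talks already
        threatened by a spate of killings, police in Northern Ireland
        detonated a car bomb Wednesday in a rural town near Belfast."""


def strip_dateline(paragraph: str) -> str:
    """Drop text up to the first ' _ ' separator (assumed dateline marker)."""
    head, sep, tail = paragraph.partition(" _ ")
    return tail if sep else paragraph


def tokenize(text: str) -> List[str]:
    """Split on runs of white-space; punctuation stays attached to its token."""
    return text.split()


def to_records(tokens: List[str]) -> List[str]:
    """Render tokens as numbered <W> records, one per token, counting from 1."""
    return [f"<W recid={i}> {tok}" for i, tok in enumerate(tokens, start=1)]


tokens = tokenize(strip_dateline(PARAGRAPH))
records = to_records(tokens)
```

Note that "killings," keeps its comma, since no punctuation is separated from the word it is attached to.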

    Except for this neutralization of white-space, and the elimination of datelines and ANNOTATION tags, it is possible to reconstruct the full content of the TEXT elements of the SGML files from the tokenized text files. The following example shows the format of the tokenized stream:

    	<DOCSET type=NEWSWIRE fileid=19980107_0000_0800_APW_ENG>
    	<W recid=1> With
    	<W recid=2> peace
    	<W recid=3> talks
    	<W recid=4> already
    	<W recid=5> threatened
    	<W recid=6> by
    	<W recid=7> a
    	<W recid=8> spate
    	<W recid=9> of
    	<W recid=10> killings,
    	...
    	</DOCSET>
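
A stream file in this format can be read back into a mapping from record ID to token. The following is a minimal sketch that assumes one tag per line, as in the example; parse_stream is an illustrative name.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

DOCSET_RE = re.compile(r"<DOCSET\s+type=(\S+)\s+fileid=([^\s>]+)>")
W_RE = re.compile(r"<W\s+recid=(\d+)>\s+(\S+)")


def parse_stream(lines: Iterable[str]) -> Tuple[Optional[str], Dict[int, str]]:
    """Return (fileid, {recid: token}) for one tokenized text stream file."""
    fileid = None
    records: Dict[int, str] = {}
    for raw in lines:
        line = raw.strip()
        m = DOCSET_RE.match(line)
        if m:
            fileid = m.group(2)
            continue
        m = W_RE.match(line)
        if m:
            records[int(m.group(1))] = m.group(2)
        # Anything else (closing tag, blank lines) is ignored.
    return fileid, records


EXAMPLE = """\
<DOCSET type=NEWSWIRE fileid=19980107_0000_0800_APW_ENG>
<W recid=1> With
<W recid=2> peace
<W recid=3> talks
</DOCSET>
"""

fileid, records = parse_stream(EXAMPLE.splitlines())
```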
    

    3. Story boundary table for untagged text stream data

    Given the "fileid" and "recid" attributes in the tokenized text stream data, the story boundary table will relate the original DOC units to the DOCSET data records, by providing the starting and ending record IDs for each DOCNO; each line of the table will provide the SGML file ID, the DOCSET file ID, the DOCNO string, and the starting and ending "recid" values within the DOCSET file that represent the boundaries of the DOC unit. For audio sources, the table also gives starting and ending time offsets for each DOC, in seconds, from the start of the file. Here is an example:

    <BOUNDSET type=CAPTION fileid=19980107_0130_0200_CNN_HDL>
    <BOUNDARY docno=CNN19980107.0130.0000 doctype=MISCELLANEOUS Bsec=0.00 Esec=16.00 Brecid=1 Erecid=30>
    <BOUNDARY docno=CNN19980107.0130.0016 doctype=NEWS Bsec=16.00 Esec=53.67 Brecid=31 Erecid=124>
    <BOUNDARY docno=CNN19980107.0130.0053 doctype=NEWS Bsec=53.67 Esec=121.51 Brecid=125 Erecid=313>
    ...
    </BOUNDSET>
    

    It can happen that a DOC unit from a broadcast source may contain an empty TEXT portion; in this case, the table entry for that DOC does not contain the "Brecid" and "Erecid" attributes, but the "Bsec" and "Esec" attributes are always present for broadcast sources. The "Bsec" and "Esec" attributes are not present for newswire sources.
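
Since any of the "Brecid", "Erecid", "Bsec" and "Esec" attributes may be absent, a reader for BOUNDARY lines should treat all four as optional. The sketch below does so, assuming one tag per line and attribute values without embedded spaces, as in the example; parse_boundary is an illustrative name.

```python
import re
from typing import Optional

TAG_RE = re.compile(r"<(\w+)((?:\s+\w+=[^\s>]+)*)\s*>")
ATTR_RE = re.compile(r"(\w+)=([^\s>]+)")


def parse_boundary(line: str) -> Optional[dict]:
    """Parse one <BOUNDARY ...> line; return None for any other line."""
    m = TAG_RE.match(line.strip())
    if m is None or m.group(1) != "BOUNDARY":
        return None
    attrs = dict(ATTR_RE.findall(m.group(2)))

    def opt(key, cast):
        # Missing attributes become None instead of raising KeyError.
        return cast(attrs[key]) if key in attrs else None

    return {
        "docno": attrs["docno"],
        "doctype": attrs["doctype"],
        "bsec": opt("Bsec", float),      # absent for newswire sources
        "esec": opt("Esec", float),
        "brecid": opt("Brecid", int),    # absent when the TEXT portion is empty
        "erecid": opt("Erecid", int),
    }


row = parse_boundary(
    "<BOUNDARY docno=CNN19980107.0130.0016 doctype=NEWS "
    "Bsec=16.00 Esec=53.67 Brecid=31 Erecid=124>"
)
```

Callers can then test `row["brecid"] is None` to detect a DOC unit with no text records.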

    4. ASR output

    For each audio sample file, there will be a file of ASR output text, which will be similar in format to the tokenized text stream: each orthographic word output by the ASR system will be a separate data record, marked as follows:

    	<DOCSET type=ASRTEXT fileid=19980109_1830_1900_ABC_WNT>
    	<X Bsec=0.00 Dur=0.01 Conf=NA>
    	<W recid=1 Bsec=0.01 Dur=0.15 Clust=45 Conf=0.32> IS
    	<W recid=2 Bsec=0.16 Dur=0.40 Clust=45 Conf=0.68> WHETHER
    	<W recid=3 Bsec=0.74 Dur=0.32 Clust=45 Conf=0.90> FOR
    	<W recid=4 Bsec=1.06 Dur=0.38 Clust=45 Conf=0.87> OTHERS
    	<W recid=5 Bsec=1.44 Dur=0.19 Clust=45 Conf=0.80> IT
    	<W recid=6 Bsec=1.63 Dur=0.16 Clust=45 Conf=0.75> IS
    	<W recid=7 Bsec=1.79 Dur=0.49 Clust=45 Conf=0.78> SPRING
    	...
    	</DOCSET>
    

    The additional attributes attached to each word ("Bsec", "Dur", "Clust", "Conf") are provided by the ASR system; the "<X>" tags represent periods of time in the audio signal where no speech was recognized. The "Conf=" attribute is a computed estimate of confidence in the correctness of a given recognized word, varying between 0 (no confidence) and 1 (highest confidence). At present, this measure does not apply to non-speech (X) segments, and it is also possible that the ASR system may be unable to assign a confidence score to some recognized words; in these cases, the attribute is given as "Conf=NA".

    The ASR output spans the entire audio recording for each sample file. In some cases, the manual transcription or closed-caption text for the sample may begin or end at a different point in time than the audio recording, so that the ASR output may contain more (or less) material at the beginning or end than the corresponding SGML file. In any case, the boundaries of news-story segments should always be properly aligned; discrepancies in the amount of content at the beginning or end of a sample will be absorbed by <DOC> units whose <DOCTYPE> is "MISCELLANEOUS TEXT".

    Also, the "Bsec" and "Dur" attributes do not necessarily account for every second of elapsed time in the broadcast -- there may be time gaps between successive records in the ASR file.
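
The ASR record format and the time gaps just mentioned can be illustrated with a short reader. This is a sketch that assumes one record per line, as in the example above; parse_asr_line and find_gaps are illustrative names, "Conf=NA" is mapped to None, and the gap tolerance of 0.005 seconds is an arbitrary choice to absorb rounding in the two-decimal time values.

```python
import re
from typing import List, Optional, Tuple

REC_RE = re.compile(r"<([WX])((?:\s+\w+=[^\s>]+)*)\s*>\s*(\S*)")
ATTR_RE = re.compile(r"(\w+)=([^\s>]+)")


def parse_asr_line(line: str) -> Optional[dict]:
    """Parse one <W ...> or <X ...> record; return None for other lines."""
    m = REC_RE.match(line.strip())
    if m is None:
        return None
    kind, attr_text, word = m.groups()
    attrs = dict(ATTR_RE.findall(attr_text))
    conf = attrs.get("Conf")
    return {
        "kind": kind,  # "W" = recognized word, "X" = no speech recognized
        "recid": int(attrs["recid"]) if "recid" in attrs else None,
        "bsec": float(attrs["Bsec"]),
        "dur": float(attrs["Dur"]),
        "conf": None if conf in (None, "NA") else float(conf),
        "word": word or None,
    }


def find_gaps(records: List[dict], tol: float = 0.005) -> List[Tuple[float, float]]:
    """Return (end_of_previous, start_of_next) wherever time is unaccounted for."""
    gaps = []
    for prev, nxt in zip(records, records[1:]):
        prev_end = prev["bsec"] + prev["dur"]
        if nxt["bsec"] - prev_end > tol:
            gaps.append((prev_end, nxt["bsec"]))
    return gaps


LINES = [
    "<DOCSET type=ASRTEXT fileid=19980109_1830_1900_ABC_WNT>",
    "<X Bsec=0.00 Dur=0.01 Conf=NA>",
    "<W recid=1 Bsec=0.01 Dur=0.15 Clust=45 Conf=0.32> IS",
    "<W recid=2 Bsec=0.16 Dur=0.40 Clust=45 Conf=0.68> WHETHER",
    "<W recid=3 Bsec=0.74 Dur=0.32 Clust=45 Conf=0.90> FOR",
]
asr = [r for r in map(parse_asr_line, LINES) if r is not None]
gaps = find_gaps(asr)
```

In the sample above, the record for WHETHER ends at about 0.56 seconds while FOR begins at 0.74, so one gap is reported between them.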

    5. Story boundary table for ASR text stream data

    Given the "fileid" and "recid" attributes in the ASR text stream data, the story boundary table will relate DOC units to the ASR data records, by providing the starting and ending record IDs for each DOCNO; each line of the table will provide the DOCNO string, the DOCTYPE value, the starting and ending "recid" values, and the starting and ending time offsets in seconds. Here is an example:

    <BOUNDSET type=ASRTEXT fileid=19980109_0100_0130_CNN_HDL>
    <BOUNDARY docno=CNN19980109.0100.0000 doctype=NEWS Bsec=0.00 Esec=8.00 Brecid=1 Erecid=21>
    <BOUNDARY docno=CNN19980109.0100.0008 doctype=MISCELLANEOUS Bsec=8.00 Esec=19.00 Brecid=22 Erecid=49>
    <BOUNDARY docno=CNN19980109.0100.0019 doctype=NEWS Bsec=19.00 Esec=67.00 Brecid=50 Erecid=189>
    <BOUNDARY docno=CNN19980109.0100.0067 doctype=NEWS Bsec=67.00 Esec=89.09 Brecid=190 Erecid=251>
    ...
    </BOUNDSET>
    

    In some cases, a "DOC" of type "MISCELLANEOUS" will span a period of time in which the ASR system will not have found any recognizable speech. In such cases the "Brecid" and "Erecid" attributes will not be present in the BOUNDARY tag; the "Bsec" and "Esec" attributes are always present.
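
Once the records and a boundary entry are both available, recovering the words of one DOC unit is a range lookup, with the missing-"Brecid" case yielding an empty story. The sketch below assumes the records are held as a mapping from record ID to word; story_words is an illustrative name, and the record range used in the example is hypothetical.

```python
from typing import Dict, List, Optional


def story_words(records: Dict[int, str],
                brecid: Optional[int],
                erecid: Optional[int]) -> List[str]:
    """Return the words for one DOC unit, given its inclusive record ID range."""
    if brecid is None or erecid is None:
        # No recognized speech fell inside this DOC unit.
        return []
    return [records[i] for i in range(brecid, erecid + 1) if i in records]


# Words taken from the ASR example above; the ranges below are hypothetical.
records = {1: "IS", 2: "WHETHER", 3: "FOR", 4: "OTHERS", 5: "IT", 6: "IS", 7: "SPRING"}
words = story_words(records, 3, 5)
silent = story_words(records, None, None)
```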

    6. Table of topic relevance judgements

    For each of the target topics defined for TDT2, this table will list the DOCNO strings for all DOC units that were judged relevant to that topic. The topics will be identified by an index number (1..100). Each line of the table will have the topic index number, the file ID in the SGML archive set, the DOCNO, and the level of relevance ("YES" or "BRIEF"). DOC units that were judged irrelevant to all topics will not be listed in this table. If the annotator entered remarks (about something unusual or noteworthy in a given story or its relation to a given topic) the existence of a comment is noted, and all comments are assembled in a separate listing. Here are some examples of entries in the topic relevance table:

    <TOPICSET>
    <ONTOPIC topicid=1 level=YES docno=ABC19980108.1830.0711 fileid=19980108_1830_1900_ABC_WNT comments=NO>
    <ONTOPIC topicid=1 level=YES docno=ABC19980109.1830.0551 fileid=19980109_1830_1900_ABC_WNT comments=NO>
    ...
    <ONTOPIC topicid=1 level=BRIEF docno=APW19980105.0806 fileid=19980105_0000_2400_APW_ENG comments=NO>
    <ONTOPIC topicid=1 level=YES docno=APW19980105.0808 fileid=19980105_0000_2400_APW_ENG comments=NO>
    ...
    </TOPICSET>
    

    Stories can be judged as relevant to more than one topic, in which case the same "docno" will appear more than once in this table, with different "topicid" values.
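
Because one "docno" can appear under several topics, it is often convenient to index the table by story. The following is a minimal sketch, assuming one ONTOPIC tag per line as in the example; topics_by_doc is an illustrative name, and the topicid=2 line in the sample is hypothetical, added only to show a story with two judgements.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

ATTR_RE = re.compile(r"(\w+)=([^\s>]+)")


def topics_by_doc(lines: Iterable[str]) -> Dict[str, Dict[int, str]]:
    """Map each docno to {topicid: level}, where level is "YES" or "BRIEF"."""
    table: Dict[str, Dict[int, str]] = defaultdict(dict)
    for raw in lines:
        line = raw.strip()
        if not line.startswith("<ONTOPIC"):
            continue  # skip <TOPICSET>, </TOPICSET>, and anything else
        attrs = dict(ATTR_RE.findall(line))
        table[attrs["docno"]][int(attrs["topicid"])] = attrs["level"]
    return dict(table)


SAMPLE = [
    "<TOPICSET>",
    "<ONTOPIC topicid=1 level=BRIEF docno=APW19980105.0806 fileid=19980105_0000_2400_APW_ENG comments=NO>",
    "<ONTOPIC topicid=1 level=YES docno=APW19980105.0808 fileid=19980105_0000_2400_APW_ENG comments=NO>",
    # Hypothetical second judgement for the same story:
    "<ONTOPIC topicid=2 level=YES docno=APW19980105.0806 fileid=19980105_0000_2400_APW_ENG comments=NO>",
    "</TOPICSET>",
]

judgements = topics_by_doc(SAMPLE)
```

Stories that are absent from the resulting mapping were judged irrelevant to all topics.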
