The TDT-Phase2 Corpus has been designed to include six months of material drawn from six news sources. The period of time covered is from January 4 to July 3, 1998. The six sources are:
New York Times News Service (NYT)
Associated Press Worldstream News Service (APW)
CNN "Headline News"
ABC "World News Tonight"
Public Radio International's "The World" (PRI)
Voice of America (VOA)
The two main goals of the segmentation task are:
1) Verify and categorize the section boundaries provided
in each broadcast and radio news program transcript.
2) Correct the time-marks associated with every section boundary.
Every section boundary in the transcripts is associated
with a time-mark. Time-marks indicate the particular
time in seconds where such boundary is located in
the audio file. That is, they link the text to the audio.
Time-marks should be located at natural boundaries of speech, such as pauses and breaths. They should never be placed in the middle of a word!
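The placement rule above can be checked mechanically when word-level timing is available. The following is a minimal sketch only: it assumes a hypothetical list of (start, end) word intervals in seconds, which these guidelines do not provide, and the function names are illustrative.

```python
from typing import List, Optional, Tuple

# Hypothetical input: (start_sec, end_sec) for each spoken word.
Word = Tuple[float, float]


def falls_inside_word(time_mark: float, words: List[Word]) -> bool:
    """Return True if the time-mark lies strictly inside any word interval."""
    return any(start < time_mark < end for start, end in words)


def nearest_pause(time_mark: float, words: List[Word]) -> Optional[float]:
    """Return the midpoint of the silent gap closest to the time-mark.

    A gap is the silence between the end of one word and the start of
    the next. Returns None when the words leave no gap at all.
    """
    ordered = sorted(words)
    gaps = [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start > prev_end
    ]
    if not gaps:
        return None
    midpoints = [(gap_start + gap_end) / 2 for gap_start, gap_end in gaps]
    return min(midpoints, key=lambda mid: abs(mid - time_mark))
```

A time-mark flagged by falls_inside_word would still need to be corrected by listening to the audio; nearest_pause only suggests a candidate position.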
Every transcript is segmented into generic section boundaries. The section boundaries mark changes in the content of the material. For TDT purposes, the section boundaries serve to classify the segments into "stories" and "non-stories." Thus, all generic section boundaries <sx> should be categorized accordingly.
sx = generic section boundary.
It should always be changed to either "sr", "su" or "sn".
sr = story report
su = story untranscribed or undertranscribed
sn = non-story
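Since no generic boundary may remain after annotation, a quick scan can flag leftovers. This is a rough sketch, assuming that every boundary tag opens with "<" followed by its two-letter code; the full tag syntax is not specified here, so the pattern and the function names are assumptions.

```python
import re
from collections import Counter

GENERIC_TAG = "sx"

# Assumption: a boundary tag starts with "<" plus its two-letter code,
# followed by a non-word character such as ">" or a space.
TAG_PATTERN = re.compile(r"<(sx|sr|su|sn)\b")


def count_section_tags(text: str) -> Counter:
    """Count how often each section boundary code opens a tag in the text."""
    return Counter(TAG_PATTERN.findall(text))


def has_uncategorized_boundaries(text: str) -> bool:
    """Return True if any generic sx boundary is still present."""
    return count_section_tags(text)[GENERIC_TAG] > 0
```

A transcript is ready for the next step only when has_uncategorized_boundaries returns False for it.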
A segment of transcribed text that has been marked off
by section boundaries in the original transmission will be considered a
"story report" if it includes two or more DECLARATIVE
independent clauses about a single event. If
the segment does not meet this minimum condition, it will be classed as
a "non-story" <sn>.
Non-story sections include but are not restricted to:
*Commercial blocks, including station IDs
*Music sections longer than 7 sec. (when not part of a report on music)
*Lists of scores in sports news
*Lists of temperatures in weather reports
*Lists of financial indicators
Untranscribed or "undertranscribed" <su>
Sections of the speech file that meet the "story report" criterion but were only partially transcribed, or not transcribed at all, are labeled <su>.
NOTE - if the amount that has been transcribed
consists of two or more independent declarative clauses, the story is labeled
<sr>, even though more text is missing. Make sure that the closing tag
indicates the end of the report proper, not just the end of the transcribed area.
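
The labeling rules above can be summarized as a small decision function. This is a sketch under stated assumptions: the two clause counts are judged by the annotator (nothing here detects clauses automatically), a segment that fails the "story report" criterion is treated as a non-story, and the function name is illustrative.

```python
def classify_section(clauses_in_audio: int, clauses_transcribed: int) -> str:
    """Return "sr", "su" or "sn" for one section.

    clauses_in_audio: declarative independent clauses about a single
        event that the annotator hears in the audio segment.
    clauses_transcribed: how many of those clauses appear in the text.
    """
    if clauses_in_audio < 0 or clauses_transcribed < 0:
        raise ValueError("clause counts cannot be negative")
    if clauses_transcribed > clauses_in_audio:
        raise ValueError("cannot transcribe more clauses than were spoken")
    if clauses_in_audio < 2:
        return "sn"  # does not meet the story report criterion
    if clauses_transcribed >= 2:
        return "sr"  # enough text is present, even if more is missing
    return "su"      # a story in the audio, but under- or untranscribed
```

For example, classify_section(5, 2) gives "sr" because two clauses were transcribed, while classify_section(5, 1) gives "su".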