
Corpus Description
The TDT-Phase2 Corpus has been designed to include six
months of material drawn from six news sources. The
period of time covered is from January 4 to July 3,
1998. The six sources are:
Newswires:
New York Times News Service (NYT)
Associated Press Worldstream News Service (APW)
Broadcast:
CNN "Headline News"
ABC "World News Tonight"
Radio:
Public Radio International's "The World" (PRI)
Voice of America (VOA)
Tasks
I) Segmentation
a) Segmenting
b) Checking
II) Event-labeling
Segmentation
The two main goals of the segmentation task are:
1) Verify and categorize the section boundaries provided
in every broadcast and radio news program transcript.
2) Correct the time-marks associated with every section
boundary.
Time-marks
Every section boundary in the transcripts is associated
with a time-mark. A time-mark gives the time, in seconds,
at which that boundary is located in the audio file.
That is, time-marks link the text to the audio.
Time-marks should be located at the natural boundaries
of speech, such as pauses, breaths, etc. They should
never be placed in the middle of a word!
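The rule that a time-mark must never fall in the middle of a word can be
illustrated with a minimal sketch. It assumes that word-level start and end
times are available for the audio, which this guide does not state; the
function name and the sample timings are hypothetical.

```python
from typing import List, Tuple

# Hypothetical word alignment: (start_sec, end_sec) for each spoken word.
WordSpan = Tuple[float, float]


def falls_inside_word(time_mark: float, words: List[WordSpan]) -> bool:
    """Return True if the time-mark lies strictly inside any word's span."""
    return any(start < time_mark < end for start, end in words)


# Made-up timings for three consecutive words.
words = [(0.00, 0.42), (0.45, 0.90), (1.30, 1.75)]

print(falls_inside_word(1.10, words))  # pause between the 2nd and 3rd word
print(falls_inside_word(0.60, words))  # middle of the 2nd word
```

A time-mark that coincides exactly with a word's start or end is treated as
acceptable here, since it sits on the boundary rather than inside the word.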
Section Boundaries
Every transcript is segmented by generic section boundaries.
The section boundaries mark changes in the content
of the material. For TDT purposes, the section boundaries
serve to classify the segments into "stories" and "non-stories."
Thus, all generic section boundaries <sx> should be categorized
accordingly.
Types:
sx = generic section boundary. It should always be changed
to "sr", "su" or "sn".
sr = story report
su = story untranscribed or undertranscribed
sn = non-story
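Since every generic boundary must end up as "sr", "su" or "sn", a finished
transcript can be checked for leftover "sx" tags. The following is a minimal
sketch only: it assumes tags appear in the text as <sx>, <sr>, etc.,
optionally with attributes, which may differ from the actual file format.
The function name and the sample transcript are hypothetical.

```python
import re
from typing import List

VALID_TYPES = {"sr", "su", "sn"}

# Assumed tag shape: "<" + two-letter type starting with "s" + optional attributes + ">".
TAG_RE = re.compile(r"<(s[a-z])\b[^>]*>")


def unresolved_boundaries(transcript: str) -> List[int]:
    """Return 1-based line numbers carrying a section tag other than sr/su/sn."""
    bad_lines = []
    for lineno, line in enumerate(transcript.splitlines(), start=1):
        types = [m.group(1) for m in TAG_RE.finditer(line)]
        if any(t not in VALID_TYPES for t in types):
            bad_lines.append(lineno)
    return bad_lines


# Made-up transcript fragment with one boundary left uncategorized.
sample = "<sr>\nstory text\n<sx>\nmore text\n<sn>\n"
print(unresolved_boundaries(sample))
```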
Story <sr>
A segment of transcribed text that has been marked off
by section boundaries in the original transmission will be
considered a "story report" if it includes two or more
DECLARATIVE independent clauses about a single event. If
the segment does not meet this minimum condition, it will
be classed as "non-story".
Non-story sections include but are not restricted to:
*Commercial blocks, including station IDs
*Music sections longer than 7 sec. (when not part of
a report on music)
*Lists of scores in sports news
*Lists of temperatures in weather reports
*Lists of financial indicators
Story "untranscribed" or "undertranscribed" <su>
Those sections in the speech file that meet the "story
report" criterion but were only partially transcribed
or not transcribed at all are labeled <su>.
NOTE - if the amount that has been transcribed contains
at least two independent declarative clauses, the story
is labeled <sr>, even though more text is missing. Make
sure that the closing tag indicates the end of the report
proper, not just the end of the transcribed portion.
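The decision among <sr>, <su> and <sn> described above can be summarized in
a short sketch. The clause counts are judgments made by the annotator while
listening and reading; the function and parameter names are hypothetical and
are not part of any TDT tool.

```python
MIN_CLAUSES = 2  # "two or more DECLARATIVE independent clauses about a single event"


def classify_section(clauses_in_audio: int, clauses_transcribed: int) -> str:
    """Map an annotator's clause counts to a section type.

    clauses_in_audio:    declarative independent clauses heard in the audio
    clauses_transcribed: how many of those appear in the transcript
    """
    if clauses_in_audio < MIN_CLAUSES:
        return "sn"  # does not meet the story-report criterion
    if clauses_transcribed >= MIN_CLAUSES:
        return "sr"  # enough is transcribed, even if more text is missing
    return "su"      # a story in the audio, but untranscribed or undertranscribed


print(classify_section(5, 5))
print(classify_section(5, 1))
print(classify_section(1, 1))
```

Checking the audio first reflects that <su> applies only to sections that
would qualify as stories if they had been fully transcribed.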