
Topic Detection and Tracking (TDT) refers to automatic
techniques for finding topically related material
in streams of data such as newswire, broadcast and
radio news. Research objectives of this study include finding topically
homogeneous sections (segmentation), detecting the occurrence of
news events (detection), and tracking the reoccurrence of
old or new events (tracking).
Corpus Description
The TDT-Phase3 Corpus has been designed to include three months of material drawn from 8 English news sources and 3 Mandarin news sources. The period covered is October 1 to December 31, 1998. The sources are as follows:
ENGLISH
Newswires:
New York Times News Service (NYT)
Associated Press Worldstream News Service (APW)
Broadcast:
CNN "Headline News"
ABC "World News Tonight"
NBC "Nightly News"
MSNBC "News with Brian Williams"
Radio:
Public Radio International's "The World" (PRI)
Voice of America (VOA)
MANDARIN
Newswires:
Xinhua News Service
Zaobao WWW News Service
Radio:
Voice of America Mandarin News
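For scripts that process the corpus, the source list above can be held in a small table. The following is only a sketch: the (language, medium, name) tuple layout and the function name count_by_language are illustrative assumptions, not part of the corpus specification.

```python
# Source inventory copied from the list above.
# The tuple layout (language, medium, name) is an illustrative assumption.
SOURCES = [
    ("English", "newswire", "New York Times News Service (NYT)"),
    ("English", "newswire", "Associated Press Worldstream News Service (APW)"),
    ("English", "broadcast", 'CNN "Headline News"'),
    ("English", "broadcast", 'ABC "World News Tonight"'),
    ("English", "broadcast", 'NBC "Nightly News"'),
    ("English", "broadcast", 'MSNBC "News with Brian Williams"'),
    ("English", "radio", 'Public Radio International\'s "The World" (PRI)'),
    ("English", "radio", "Voice of America (VOA)"),
    ("Mandarin", "newswire", "Xinhua News Service"),
    ("Mandarin", "newswire", "Zaobao WWW News Service"),
    ("Mandarin", "radio", "Voice of America Mandarin News"),
]


def count_by_language(sources):
    """Count how many sources are listed for each language."""
    counts = {}
    for language, _medium, _name in sources:
        counts[language] = counts.get(language, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_by_language(SOURCES))
```

Counting the entries per language is a quick consistency check against the totals stated in the corpus description.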
Tasks
I) Segmentation
a) Segmenting
b) Second-passing
II) Event-labeling
Segmentation
The two main goals of the segmentation task are:
1) Verify and categorize the section boundaries provided in every
broadcast and radio news program transcript.
2) Correct the timemarks associated with every section boundary.
Timemarks
Every section boundary in the transcripts is associated
with a timemark. A timemark gives the time, in seconds,
at which the boundary is located in the audio file.
That is, timemarks link the text to the audio. Timemarks
should be located at the natural boundaries of
speech, such as pauses and breaths. They should never be placed in the
middle of a word!
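Where word-level timings are available, the rule against placing a timemark in the middle of a word can be checked mechanically. The sketch below assumes a hypothetical list of (start, end) word intervals in seconds; such alignments are not described in this guide, and the function names are illustrative. The suggested position is only the nearest word edge, so the annotator still has to confirm by listening that it falls on a natural pause.

```python
def containing_word(timemark, word_intervals):
    """Return the (start, end) interval the timemark falls strictly inside, or None."""
    for start, end in word_intervals:
        if start < timemark < end:
            return (start, end)
    return None


def suggest_timemark(timemark, word_intervals):
    """Return the timemark unchanged if it is outside every word,
    otherwise the nearest edge of the word it falls inside."""
    word = containing_word(timemark, word_intervals)
    if word is None:
        return timemark
    start, end = word
    # Move to whichever edge of the word is closer.
    return start if (timemark - start) <= (end - timemark) else end


if __name__ == "__main__":
    # Hypothetical word timings in seconds (assumed input, not corpus data).
    words = [(0.0, 0.4), (0.5, 1.1), (1.6, 2.0)]
    for mark in (0.6, 1.0, 1.3):
        print(mark, containing_word(mark, words), suggest_timemark(mark, words))
```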
Every transcript is divided into sections by generic section boundaries.
The section boundaries mark changes in the content
of the material. For TDT purposes, the section boundaries
serve to classify the segments into "stories" and "non-stories."
Thus, all generic section boundaries <sx> should be categorized
accordingly.
Types:
sx = generic section boundary. It should always
be changed to either "sr", "su" or "sn".
sr = story report
su = story untranscribed or undertranscribed
sn = non-story
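Because every generic boundary must end up as "sr", "su" or "sn", a finished transcript can be scanned for leftover <sx> tags. This is a minimal sketch that assumes boundaries appear in the transcript text as tags such as <sx> or <sr>, possibly with attributes; the exact file format is not specified here, and the function name is illustrative.

```python
import re

GENERIC_TAG = "sx"
# Matches an opening tag whose name is "s" plus one letter, e.g. <sx> or <sr ...>.
TAG_PATTERN = re.compile(r"<(s[a-z])\b[^>]*>")


def unresolved_boundaries(transcript_text):
    """Return the 1-based line numbers that still contain a generic <sx> tag."""
    line_numbers = []
    for number, line in enumerate(transcript_text.splitlines(), start=1):
        if any(m.group(1) == GENERIC_TAG for m in TAG_PATTERN.finditer(line)):
            line_numbers.append(number)
    return line_numbers


if __name__ == "__main__":
    # Hypothetical transcript fragment (assumed layout, not corpus data).
    sample = "<sx>\nfirst segment\n<sr>\nsecond segment\n<sx>\n"
    print(unresolved_boundaries(sample))
```

An empty result means no generic boundary remains to be categorized.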
Story <sr>
A segment of transcribed text that has been marked off
by section boundaries in the original transmission will be considered a
"story report" if it includes two or more DECLARATIVE
independent clauses about a single event. If
the segment does not meet this minimum condition, it will be classed as
"non-story".
Non-story sections include but are not restricted to:
*Commercial blocks, including station IDs
*Music sections longer than 7 seconds (when not part of
a report on music)
*Lists of scores in sports news
*Lists of temperatures in weather reports
*Lists of financial indicators
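The minimum condition for a story report and the music rule above can be written as two small decision functions. This is a sketch only: the inputs (the clause count, whether the clauses concern a single event, whether music belongs to a report on music) are annotator judgments, and the function and parameter names are hypothetical.

```python
def classify_segment(declarative_clause_count, single_event):
    """Return "sr" if the segment has two or more declarative independent
    clauses about a single event, otherwise "sn"."""
    if declarative_clause_count >= 2 and single_event:
        return "sr"
    return "sn"


def is_music_non_story(duration_seconds, part_of_music_report):
    """Return True for a music section longer than 7 seconds that is not
    part of a report on music."""
    return duration_seconds > 7 and not part_of_music_report


if __name__ == "__main__":
    print(classify_segment(3, True))      # meets the minimum condition
    print(classify_segment(1, True))      # too few declarative clauses
    print(is_music_non_story(12, False))  # long music bed outside a music report
```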
Story "untranscribed" or "undertranscribed" <su>
Those sections in the speech file that meet the "story report" criterion (not filler, contain substantive information) but were only partially transcribed or not transcribed at all are labeled <su>.
NOTE - if the amount that has been transcribed
is substantive enough to get some information from the report, the story is labeled as
<sr>, even though more text is missing. Make sure that the closing tag indicates the end of the report proper, not just the transcribed area that is evident