Introduction
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire, broadcast and radio news. The research objectives of this study include finding topically homogeneous sections (segmentation), detecting the occurrence of news events (detection), and tracking the recurrence of old or new events (tracking).
 

Corpus Description
The TDT-Phase2 Corpus has been designed to include six months of material drawn from six news sources. The period of time covered is from January 4 to July 3, 1998. The six sources are:

        Newswires:
                New York Times News Service (NYT)
                Associated Press Worldstream News Service (APW)

        Broadcast:
                CNN "Headline News"
                ABC "World News Tonight"

        Radio:
                Public Radio International's "The World" (PRI)
                Voice of America (VOA)

Tasks

I) Segmentation
   a) Segmenting
   b) Checking

II) Event-labeling
 

Segmentation

The two main goals of the segmentation task are:

1) Verify and categorize the section boundaries provided in every
broadcast and radio news program transcript.

2) Correct the time-marks associated with every section boundary.
 

Time-marks

Every section boundary in the transcripts is associated with a time-mark. A time-mark gives the time, in seconds, at which the boundary is located in the audio file. That is, time-marks link the text to the audio.
Time-marks should be located at natural boundaries of speech, such as pauses or breaths. They should never be placed in the middle of a word!
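
The rule that a time-mark must never fall in the middle of a word can be checked mechanically when word timings are available. The following is a minimal sketch in Python, not part of the annotation tools. It assumes a hypothetical list of (start, end) word times in seconds, for example from a forced alignment; the function names are illustrative only.

```python
from typing import List, Tuple

# Assumption: each word is given as (start, end) in seconds.
# This input format is hypothetical and not defined by the guidelines.
WordSpan = Tuple[float, float]


def falls_inside_word(time_mark: float, words: List[WordSpan]) -> bool:
    """Return True if the time-mark lies strictly inside any word's span."""
    return any(start < time_mark < end for start, end in words)


def nearest_word_edge(time_mark: float, words: List[WordSpan]) -> float:
    """Return the word start or end time closest to the time-mark.

    This is only a candidate position; an annotator still has to confirm
    that it coincides with a pause or breath in the audio.
    """
    if not words:
        return time_mark
    edges = [t for span in words for t in span]
    return min(edges, key=lambda t: abs(t - time_mark))
```

A time-mark flagged by falls_inside_word should be moved by the annotator; nearest_word_edge only suggests where to start listening.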
 

Section Boundaries 
Every transcript is divided into sections by generic section boundaries. The section boundaries mark changes in the content of the material. For TDT purposes, the section boundaries serve to classify the segments into "stories" and "non-stories." Thus, every generic section boundary <sx> should be recategorized as one of the types below.
 
Types:

sx = generic section boundary.
       It should always be changed to either "sr", "su" or "sn".
sr = story report
su = story untranscribed or undertranscribed
sn = non-story
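
Because no <sx> boundary should remain once a transcript has been checked, leftover generic tags can be found with a simple scan. The following Python sketch assumes that a boundary tag appears literally in the transcript text as "<sx", optionally followed by attributes; the actual file format may differ, and the function name is illustrative only.

```python
import re
from typing import List

# The four boundary types listed above.
SECTION_TYPES = {
    "sx": "generic section boundary",
    "sr": "story report",
    "su": "story untranscribed or undertranscribed",
    "sn": "non-story",
}

# Assumption: a generic boundary is written as "<sx>" or "<sx ...>".
GENERIC_TAG = re.compile(r"<sx\b")


def unresolved_boundaries(transcript: str) -> List[int]:
    """Return the 1-based line numbers that still carry a generic <sx> tag."""
    return [
        number
        for number, line in enumerate(transcript.splitlines(), start=1)
        if GENERIC_TAG.search(line)
    ]
```

An empty result means every boundary has been categorized as "sr", "su" or "sn".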
 

Story <sr>

A segment of transcribed text that has been marked off by section boundaries in the original transmission will be considered a "story report" if it includes two or more DECLARATIVE independent clauses about a single event. If the segment does not meet this minimum condition, it will be classified as "non-story".
 

Non-story <sn>

Non-story sections include but are not restricted to:

*Commercial blocks, including station IDs
*Music sections longer than 7 seconds (when not part of a report on music)
*Lists of scores in sports news
*Lists of temperatures in weather reports
*Lists of financial indicators
 

Story "untranscribed" or "undertranscribed" <su>
Sections of the speech file that meet the "story report" criterion but were only partially transcribed, or not transcribed at all, are labeled <su>.

NOTE - If the transcribed portion contains at least two declarative independent clauses about a single event, the story is labeled <sr>, even though more text is missing. Make sure that the closing tag marks the end of the report proper, not just the end of the transcribed portion.
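
The three-way decision among <sr>, <su> and <sn> can be summarized as a short decision rule. The Python sketch below is illustrative only. Both inputs are judgments made by the annotator (the number of declarative independent clauses about a single event found in the transcript, and whether the audio itself meets the "story report" criterion); the function and parameter names are hypothetical.

```python
def classify_section(transcribed_clauses: int, audio_is_story: bool) -> str:
    """Return "sr", "su" or "sn" for one section.

    transcribed_clauses: declarative independent clauses about a single
        event that appear in the transcript (annotator's count).
    audio_is_story: True if the audio for the section meets the
        "story report" criterion, whether or not it was transcribed.
    """
    # Two or more transcribed clauses make a story report,
    # even if part of the audio is missing from the transcript.
    if transcribed_clauses >= 2:
        return "sr"
    # The audio is a story, but too little of it was transcribed.
    if audio_is_story:
        return "su"
    # Everything else is a non-story.
    return "sn"
```

For example, a report whose transcript contains only one complete clause is labeled "su", while a block of commercials is labeled "sn".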