TDT Audio Segmentation Guidelines - Television Sources (CNN, ABC)

The text files associated with TV audio data come from the closed captions that are provided with the video signal; the closed captioning services use established conventions for marking topic boundaries and speaker turn changes in news broadcasts, and these marks come with approximate time stamps.

At present we do not have access to the specific conventions used by closed captioners for marking topic boundaries. The following is apparent from our observations of the data:

Our present method for creating these text files also inserts time stamps at speaker turn changes and at apparent sentence boundaries. The only time stamps that matter in the TDT segmentation task are those at topic (story) boundaries. The timing of speaker changes within a story and the accuracy of time stamps at sentence boundaries are of no importance, and annotators should pay no attention to these other time stamps.

Objectives

The actual closed-captioned text provided should not be changed, even if the transcription is inaccurate.