TDT Audio Segmentation Guidelines - Television Sources (CNN, ABC)
The text files associated with TV audio data come from the closed captions
that are provided with the video signal; the closed captioning services use
established conventions for marking topic boundaries and speaker turn changes
in news broadcasts, and these marks come with approximate time stamps.
At present we do not have access to the specific conventions used by
closed captioners for marking topic boundaries. The following is
apparent from our observations of the data:
- A topic boundary is marked whenever a new topic begins -- but
when a topic is followed by something other than news (e.g. a
commercial or music break), no boundary is given to mark the end of the topic.
- When an announcer is giving a list of upcoming stories, each
story s/he mentions is marked with a topic boundary, producing a
sequence of very short segments.
- It may happen on occasion that the closed-captioner will
mistakenly insert a topic boundary in the middle of a story.
Our present method for creating these text files also inserts time
stamps at speaker turn changes and at apparent sentence boundaries.
The only time stamps that matter in the TDT segmentation task are the
times for the topic (story) boundaries. The timing of speaker changes
within a story and the accuracy of time stamps at sentence boundaries
are of no importance, and annotators should pay no attention to these
other time stamps.
Objectives
- The time stamps at segment boundaries must be correct: when you listen to
the portion of the recording that starts at the time stamp for
a boundary, you should hear all of the text that follows that
boundary without any clipping of the first word, and you should
NOT hear any of the text that precedes the boundary. If the
first word following the boundary is not fully audible, or if
you hear text that comes before the boundary, the time stamp
needs to be shifted.
- Every segment must be categorized as either a news report, using "<sr", or as
non-news, using "<sn". In the original form of the text (as
produced from the closed-caption signal), all segment
boundaries are marked with "<sx" -- all the "x"'s must be
changed. In the event that a news report is encountered that
has not been fully transcribed, the segment boundary is marked
with "<su" (for "untranscribed" or "under-transcribed" segment).
- When you come to a topic boundary in the middle of a broadcast
and find that text from a commercial has been included as the
last portion of a previous "<sr" (news story) segment, trace
back in the text and insert "<sn", with a correct time
stamp, to mark where that previous story ends and the commercial begins.
- When you come to a topic boundary in the middle of a broadcast
and find that it clearly should NOT be a boundary (that is, the
material following the boundary is clearly a continuation of the story
that precedes the boundary), you may delete this boundary.
The actual closed-captioned text provided should not be changed,
even if the transcription is inaccurate.
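As a rough aid for the categorization objective, the following Python sketch scans a text file's contents and reports any segment markers that have not yet been changed to "<sr", "<sn" or "<su". It is only an illustration under stated assumptions: the function name check_segment_tags is hypothetical, the sample lines (including the way time stamps are written) are invented for the example, and the only thing assumed about the real file format is that each boundary marker begins with "<s" followed by a single category letter.

```python
import re
from collections import Counter

# Assumption: every segment boundary starts with "<s" plus one category
# letter; the rest of the marker (time stamp, etc.) is not parsed here.
MARKER = re.compile(r"<s([a-z])(?![a-z])")

# Final categories from the guidelines:
# "<sr" news report, "<sn" non-news, "<su" untranscribed news report.
FINAL_TAGS = {"r", "n", "u"}


def check_segment_tags(text):
    """Return (count per category letter, line numbers of unresolved markers)."""
    counts = Counter()
    unresolved = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in MARKER.finditer(line):
            tag = match.group(1)
            counts[tag] += 1
            if tag not in FINAL_TAGS:
                # e.g. an original "<sx" that still needs a category
                unresolved.append(lineno)
    return counts, unresolved


# Hypothetical sample; the real time stamp syntax may differ.
sample = "\n".join([
    "<sr 12.5",
    "first story text",
    "<sx 47.0",
    "more text",
    "<sn 90.2",
])

counts, unresolved = check_segment_tags(sample)
print(dict(counts))
print("lines with unresolved markers:", unresolved)
```

The sketch only reads the text and never modifies it, in keeping with the rule that the closed-captioned text itself is left unchanged. It cannot judge whether a time stamp is correct; that still requires listening to the recording.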