English Video Sources

Overview

CNN Headline News is a cable-TV broadcast that runs 24 hours a day and presents top news stories continuously. Some portion of the daily broadcast schedule includes closed-caption transcription. (No other form of transcription exists for this program in the normal course of events.) Typically, 20 or more distinct news stories are covered in each 30-minute portion of programming. When closed captions are provided, they include markers that flag changes of topic and changes of speaker. The closed caption text stream is padded with null bytes so that the text content is reasonably well aligned in time with the audio content. The accuracy of the caption text is reasonably good, but within any given 30-minute broadcast, one should expect to find numerous cases (perhaps dozens) where whole words or phrases in the audio are missing from the captions, as well as a few cases (probably fewer than a dozen) where the caption text is clearly wrong (i.e., the captioner has misspelled or misunderstood what was said).
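
As a rough illustration of how null padding can carry timing information, the following Python sketch estimates when each caption word arrived from its byte position in the stream. It assumes that the decoder emits bytes at a constant rate, which is a simplification; the function name word_offsets and the bytes_per_second parameter are hypothetical and are not taken from the LDC capture software.

```python
def word_offsets(stream: bytes, bytes_per_second: float):
    """Return (seconds, word) pairs, timing each word by its byte position.

    Null bytes and whitespace both end a word; nulls otherwise only
    advance the position, and therefore the estimated time.
    """
    results = []
    word = bytearray()
    start = None
    for pos, byte in enumerate(stream):
        is_break = byte == 0 or chr(byte).isspace()
        if not is_break:
            if start is None:
                start = pos  # first byte of a new word
            word.append(byte)
        elif word:
            results.append((start / bytes_per_second,
                            word.decode("ascii", "replace")))
            word.clear()
            start = None
    if word:  # flush a word that runs to the end of the stream
        results.append((start / bytes_per_second,
                        word.decode("ascii", "replace")))
    return results
```

With an assumed rate of 2 bytes per second, the stream b"HI\x00\x00\x00\x00THERE NOW" yields offsets of 0.0, 3.0 and 6.0 seconds for the three words; the four null bytes are what push "THERE" later in time.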

The insertion points for topic boundaries are determined by the people who create the closed caption transcription, and it is not clear to us exactly how this is done. We have observed that when an announcer quickly lists a series of stories that are coming up in the broadcast, the captioner typically marks a boundary at each mention of a story. LDC annotators have also seen cases where a topic boundary was apparently inserted by mistake in the middle of a story. Another characteristic of topic boundary marking by captioners is that they do not mark the end of a topic, but only the beginning point of the next topic. If a news story is followed directly by a commercial break (without the announcer listing the stories to follow), there is no boundary marked at the beginning of the commercial. (However, it is possible to detect other cues in the closed-caption stream that flag the onset of commercial breaks, and we make use of these cues when capturing the stream to a text file.)

ABC World News Tonight is a daily 30-minute news broadcast that typically covers about a dozen different news items. Closed captioning is provided for every broadcast, and commercially produced transcripts are also available from Federal Documents Clearing House (FDCH). The closed captioning is presumably of better quality than the CNN Headline News captions, because the program content is more carefully prepared (there tend to be fewer mistakes in the caption content); still, owing to differences between rate of speech and rate of text display, it is not uncommon to find words and phrases that were spoken but omitted from the captions. Story and speaker turn boundaries are marked in roughly the same way as in the CNN captions.

The commercial transcripts of ABC WNT are created from recordings of the broadcasts, and are of very high quality. The marking of topic boundaries is more conservative than in the captions: the initial portion of the broadcast (including the headline summary, the voice-over introduction of Peter Jennings, and the lead story) is grouped into a single topic unit in the FDCH transcript, whereas several topic boundaries are marked in the corresponding portion of caption text. Also, the FDCH presentation of topic units is more akin to newswire data: each topical section is formatted as a distinct story unit, with information about the story in a header and footer that surround the transcription text.

Sampling and capture

Each day we record the ABC WNT broadcast and up to four 30-minute sections of CNN HDL. The CNN segments are drawn from that portion of the daily schedule that happens to include closed captioning; CNN provides captioning over a 16-hour period each weekday and an 8-hour period each weekend day, so we typically collect four half-hour samples per day on weekdays and three per day on weekends.

The broadcasts are captured directly from a cable TV connection. The signal first goes to a VCR, which is programmed to record the broadcast on VHS tape. The audio line output from the VCR is connected to a DAT recording deck, whose digital audio output is connected to a Townshend DATLink unit, which in turn has a SCSI connection to a Sun SPARC workstation. In addition, the VCR's video output is passed through a closed-caption decoder, which converts the closed caption signal into ASCII text and sends this data over a serial connection to the same SPARC workstation.

So, at broadcast time, the VCR comes on under its own control to tape the broadcast; the DAT recorder is activated by a remote-control emitter that is connected to the DATLink unit and controlled by the workstation. A control process is scheduled on the workstation to execute at broadcast time to (a) start the DAT deck in "Record" mode, (b) start sampling on the DATLink device and (c) start receiving closed caption text data on the serial port. The DAT deck samples the signal at 32 kHz for storage on the DAT cartridge, and the DATLink downsamples the digital audio to 16 kHz for storage in a disk file.
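
To make the 2:1 rate reduction concrete, here is a minimal Python sketch that halves a sample rate by averaging each adjacent pair of samples. This is only an illustration of the idea: the filtering actually performed by the DATLink is not described here, pair averaging is a very crude low-pass filter, and the function name downsample_by_two is hypothetical.

```python
def downsample_by_two(samples):
    """Halve the sample rate by averaging each adjacent pair of samples.

    `samples` is a sequence of integer sample values. A trailing
    unpaired sample is dropped. Integer floor division keeps the
    output in the same integer range as the input.
    """
    return [(samples[i] + samples[i + 1]) // 2
            for i in range(0, len(samples) - 1, 2)]
```

Applied to one second of audio sampled at 32 kHz (32000 values), this returns 16000 values, which corresponds to a 16 kHz rate.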

When the DATLink recording process is done, the VCR shuts itself off, a remote-control signal is sent to the DAT deck to stop its recording, the serial port connection is closed, and the control process on the workstation runs a quality check program on the resulting data files. The results of that quality check (waveform and text file sizes, min and max waveform sample values, number of occurrences of apparent peak clipping in the waveform) are placed into an Oracle database table together with the file-id for the broadcast.
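
A summary of the kind described above could be computed along the following lines. This is a minimal Python sketch, not the LDC quality check program: it assumes headerless 16-bit PCM sample data, big-endian by default, and it counts a run of two or more consecutive full-scale samples as one occurrence of apparent peak clipping. The function name waveform_stats and the byte_order and min_run defaults are all assumptions.

```python
import struct

FULL_SCALE_MAX = 32767    # largest 16-bit signed sample value
FULL_SCALE_MIN = -32768   # smallest 16-bit signed sample value


def waveform_stats(data: bytes, byte_order: str = ">", min_run: int = 2) -> dict:
    """Summarize raw 16-bit PCM sample data (assumed to have no header)."""
    count = len(data) // 2
    samples = struct.unpack(f"{byte_order}{count}h", data[: count * 2])

    clip_events = 0
    run = 0  # length of the current run of full-scale samples (either sign)
    for s in samples:
        if s == FULL_SCALE_MAX or s == FULL_SCALE_MIN:
            run += 1
            if run == min_run:  # count each qualifying run exactly once
                clip_events += 1
        else:
            run = 0

    return {
        "size_bytes": len(data),
        "num_samples": count,
        "min_sample": min(samples) if samples else None,
        "max_sample": max(samples) if samples else None,
        "clip_events": clip_events,
    }
```

The returned dictionary holds the values that would be stored alongside the file-id for the broadcast; a single isolated full-scale sample is not counted as clipping under the min_run assumption.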

If the quality check indicates any problem with either the closed caption text file or the waveform file, the video tape is checked, and if it was recorded correctly, it is used as the signal source (the particular broadcast is played back in its entirety) to redo the digital audio capture (to DAT and disk) and the closed caption capture.

Issues and problems

The distribution of topic boundary marks in the closed caption text shows some clear shortcomings (not marking the ends of stories in some conditions), and some practices that may pose problems for annotators (marking a boundary at every short phrase during a listing of upcoming stories). Annotators must be careful to make sure that story boundaries are marked where they need to be, adding marks where necessary. They must also decide when the amount of text between two story boundaries is too short to be called a story; in this regard, a boundary would be marked as the start of a story if the following content consists of two or more independent clauses of informative text about a news item. This will mean that a listing of upcoming news like the following will not be classed as three independent stories, but rather as a single region of non-news material (where ">>" indicates topic boundaries present in the original closed caption text):

  >>
  In this hour, the latest report from the Persian Gulf,
  >>
  more leaks from Kenneth Starr's special investigation,
  >>
  and another tornado hits central Florida.
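
A listing like the one above can be split into candidate segments for the annotator to review. The Python sketch below is a minimal illustration, assuming that each ">>" marker sits on a line of its own as in the example; the function name split_on_boundaries is hypothetical. It does not apply the two-clause rule, which remains a human judgment.

```python
def split_on_boundaries(caption_text: str, marker: str = ">>") -> list:
    """Split caption text into segments at lines consisting of `marker`.

    Lines within a segment are joined with single spaces. Empty
    segments (two markers with no text between them) are skipped.
    """
    segments = []
    current = []
    for line in caption_text.splitlines():
        stripped = line.strip()
        if stripped == marker:
            if current:
                segments.append(" ".join(current))
            current = []  # a marker always starts a new segment
        elif stripped:
            current.append(stripped)
    if current:  # flush the final segment
        segments.append(" ".join(current))
    return segments
```

Run on the three-line listing above, this returns three short segments, each a single phrase, which is the pattern an annotator would merge into one region of non-news material.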

Another property of closed caption text that we observe on occasion is that a portion of the broadcast may contain a spoken report (comprising an independent news story) for which little or no closed caption text is provided. For example, an announcer may utter two or three independent, informative clauses reporting a brief news item, but the closed caption text will contain none or only one of these clauses.

In this case, the person doing the segmentation will classify the segment boundary as an "untranscribed news segment", meaning that there is not enough text provided to pass the "two-clause" rule, even though the audio signal does contain enough information to constitute a story. Segments marked in this way will not be treated during the topic labeling stage of annotation.

Audio segmentation

When the closed caption and waveform data have been confirmed to be okay, the broadcast episode is made available for manual verification of the time stamps associated with the topic (story) boundaries. The task of the annotator is three-fold:

This annotation is done using a modified version of the closed caption text file in a hybrid emacs/xwaves user interface developed by the LDC. When the audio segmentation has been completed for a given episode, the modified text file is filtered to produce the common TDT2 SGML format. Every file is reviewed twice, with the second pass done by a different annotator.

In the case of the ABC programs collected during the TDT2 period, where we also have the higher-quality FDCH transcripts, there was an extra step of reconciling the audio segmentation in both the FDCH transcripts and the closed caption text. After the first pass of audio segmentation is completed on the caption data, the FDCH transcript data is conditioned into an equivalent annotation format, and the person doing the second pass over the segmentation has the additional task of replicating the time-stamped story boundaries into the FDCH file. Both the caption and FDCH texts are then filtered to produce the common TDT2 SGML format, so that there are two forms of text data (in two separate text files) for this broadcast source.