Public Radio International's "The World" is a one-hour program that airs five days a week. It is a combined effort of WGBH in Boston and the BBC in London. As a result, it is the one source of audio news data where non-U.S. accents of English are certain to occur in each broadcast.
During most of the TDT2 collection period, the Voice of America produced radio broadcasts that included a one-hour news program in English seven days a week ("VOA Today"), and an additional one-hour show on weekdays ("World Report"). Both shows have the typical format of one or two anchor announcers presenting a variety of stories from read scripts, plus a collection of correspondent reports from both studio and field recordings; the correspondent reports tend to include interview segments or portions of speeches, and may have varied bandwidth conditions. As of May 28, 1998, VOA changed its broadcast format and schedule; there are still 60-minute blocks of news, but there are more of them in the course of the day, and they are not assigned distinctive titles or formats.
Commercial or broadcast-source transcripts are not created for either of these sources in the normal course of events, so the LDC established subcontracts with professional transcription services, whose normal business operations included closed-captioning, to produce transcripts of every recorded episode. Our intention was to produce a level of quality that is comparable (or close) to that of FDCH transcripts, and a level of markup (topic and speaker turn boundaries) that is comparable to that of the video sources.
For VOA, the original plan was to have a custom satellite downlink system installed at the LDC in time to capture the entire TDT2 six-month period of VOA directly from their satellite transmission. Due to delays in the delivery of key hardware components, the downlink system did not begin operation until Feb. 20. For broadcasts between Jan. 4 and Feb. 19, the LDC simply collected the full one-hour audio files posted on the VOA web site. These files are sampled at the VOA studios using 16-bit PCM and a sample rate of 11025 Hz. With the satellite downlink in operation, the recording process now consists of direct digital capture from the satellite transmission. The signal is transmitted using MPEG encoding, and the downlink system converts this to standard line audio output, which is fed directly to a standard DAT recorder.
Each DAT recorder is controlled by a simple digital clock timer, which is set to turn the DAT recorder on during the one-hour broadcast. The recorder is set to go directly into recording mode (at a 32 kHz sample rate) on power-up. The digital audio output from the DAT recorder is passed to a DATLink device for downsampling to 16 kHz, and this in turn is connected to a Sun SPARC workstation. A control process is scheduled to run on the workstation at broadcast time to sample from the DATLink for that hour. At the end of the hour, the DAT recorder shuts off, the waveform file is closed, and a quality check process is run on the waveform data to report the minimum and maximum sample values and to check for peak clipping.
The results of the check are entered into an Oracle table with the file ID of the broadcast. If there is a problem with the waveform file, the DAT cartridge is checked, and if it was recorded correctly, it is used as the signal source to redo the waveform file capture.
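The quality check described above can be illustrated with a minimal sketch. It is not the LDC's actual process: the names `check_waveform` and `WaveformCheck` are hypothetical, the samples are assumed to be 16-bit signed PCM values already read into memory, and the clipping criterion (a run of at least `min_run` consecutive samples pinned at full scale) is an assumption, since the text does not say how clipping was detected.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Full-scale limits for 16-bit signed PCM samples.
INT16_MIN = -32768
INT16_MAX = 32767


@dataclass
class WaveformCheck:
    """Summary of one waveform file (hypothetical record layout)."""
    min_sample: int
    max_sample: int
    clipped_runs: int

    @property
    def is_clipped(self) -> bool:
        return self.clipped_runs > 0


def check_waveform(samples: Sequence[int], min_run: int = 3) -> WaveformCheck:
    """Report min/max sample values and count runs of full-scale samples.

    Assumption: a run of `min_run` or more consecutive samples stuck at
    the same full-scale value is counted as one clipping event.
    """
    if not samples:
        raise ValueError("empty waveform")

    clipped_runs = 0
    run_value: Optional[int] = None
    run_length = 0
    for s in samples:
        if s in (INT16_MIN, INT16_MAX):
            if s == run_value:
                run_length += 1
            else:
                run_value, run_length = s, 1
            # Count each run once, at the moment it reaches min_run.
            if run_length == min_run:
                clipped_runs += 1
        else:
            run_value, run_length = None, 0

    return WaveformCheck(min(samples), max(samples), clipped_runs)
```

A report of this kind, keyed by the broadcast's file ID, is the sort of record that could be stored in a database table; a nonzero `clipped_runs` would flag the file for re-capture from the DAT cartridge.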
For every episode that is successfully recorded, a brief inspection of the waveform is made to divide it cleanly in half; four time stamp labels are set, marking the beginning and ending points for each program half (leaving out local station promotional messages at these boundaries). These time stamps are used to play back the two halves over the DATLink for recording onto the two sides of a 60-minute analog audio cassette tape. The cassette is then sent to the transcription service, and the service returns the completed transcript via email.
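As a rough illustration of how the four time stamps might be used, the sketch below converts them to sample offsets in the 16 kHz waveform file and checks that each program half fits on one side of a 60-minute cassette (30 minutes per side). The function name `program_halves` and the representation of time stamps as seconds from the start of the file are assumptions made for this example.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 16_000      # waveform files are stored at 16 kHz
SIDE_CAPACITY_S = 30 * 60    # one side of a 60-minute cassette


def program_halves(
    first_start: float,
    first_end: float,
    second_start: float,
    second_end: float,
    sample_rate: int = SAMPLE_RATE_HZ,
) -> List[Tuple[int, int]]:
    """Convert four time stamps (seconds) to two (start, end) sample ranges."""
    stamps = [first_start, first_end, second_start, second_end]
    if first_start < 0 or stamps != sorted(stamps):
        raise ValueError("time stamps must be non-negative and in order")

    halves = []
    for start, end in ((first_start, first_end), (second_start, second_end)):
        if end - start > SIDE_CAPACITY_S:
            raise ValueError("program half does not fit on one cassette side")
        halves.append((round(start * sample_rate), round(end * sample_rate)))
    return halves
```

For example, `program_halves(12.5, 1790.0, 1815.25, 3590.0)` yields the sample ranges `(200000, 28640000)` and `(29044000, 57440000)`; the gap between the two ranges corresponds to the material left out at the mid-program boundary.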
When the transcript arrives, it is filtered to produce the format used by the segmentation interface. The annotators' task is basically the same as for the video sources, except that the original story boundaries in the transcript do not have time stamps provided by the original transcriber. This means that the LDC annotator needs to spend somewhat more time to locate the story boundary in the waveform file, and place the correct time stamp for each boundary into the transcript.