The VOA Mandarin news service provides at least two distinct 60-minute broadcasts per day, each having the typical format of one or two anchor announcers presenting a variety of stories from read scripts, plus a collection of correspondent reports from both studio and field recordings; the correspondent reports tend to include interview segments or portions of speeches, and may have varied bandwidth conditions.
The announcers and reporters are all native speakers of mainland Mandarin Chinese, and use only their native language when reporting the news. However, many broadcasts will include one or more brief audio clips from taped speeches or interviews that happened to be in English. In these cases, the person speaking English is presented at full volume for the duration of the audio clip, and after that, the Chinese reporter will summarize or discuss what the English speaker said. (This contrasts with the usual treatment of foreign language speakers in English news broadcasts, in which the volume of the foreign speaker is reduced to a low background level and an interpreter provides a translation to English as a full-volume "voice-over".)
Full transcripts of VOA Mandarin news broadcasts are not created in the normal course of events, so the LDC established a subcontract with a qualified Mandarin-language transcription service to produce full transcripts for every recorded episode. The transcripts are created using GB character encoding, and follow conventional practices for punctuation; the transcribers do not impose word segmentation on the text, since there are no widely accepted conventions for doing so.
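For readers working with the transcripts in a Unicode environment, the following is a minimal sketch of decoding GB-encoded text. It assumes that "GB" here means GB 2312, which Python exposes as the "gb2312" codec; the function name decode_transcript is hypothetical and not part of any LDC tool.

```python
def decode_transcript(raw, encoding="gb2312"):
    """Decode raw transcript bytes into a Unicode string.

    Assumption: the transcript uses GB 2312. If decoding fails on a
    real file, a superset codec such as "gbk" may be worth trying.
    """
    # errors="strict" makes corrupt or mis-encoded bytes raise
    # UnicodeDecodeError instead of being silently replaced.
    return raw.decode(encoding, errors="strict")


if __name__ == "__main__":
    sample = "中文".encode("gb2312")   # two characters, four bytes
    text = decode_transcript(sample)
    # The text is unsegmented, so character counts are the natural
    # unit of length; no word boundaries are present.
    print(len(sample), len(text))
```

Since the transcripts carry no word segmentation, any downstream word-level processing would have to apply its own segmentation after decoding.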
On Feb. 20, 1998, the LDC began operation of a custom satellite downlink system for receiving VOA broadcasts directly from VOA's global satellite transmission network. The recording process consists of direct digital capture from the satellite transmission. The signal is transmitted using MPEG encoding, and the downlink system converts this to standard line audio output, which is fed directly to a standard DAT recorder.
There have been occasions when the VOA satellite carrier signal was affected by poor reception quality, causing corruption of the MPEG encoded data streams for all VOA channels; when this happened, the audio output from the MPEG decoders was subject to intermittent distortions and drop-outs, lasting anywhere from a fraction of a second to several seconds. These in turn became part of our digital recording for the affected broadcasts. The problem varied in severity from day to day, but it was not uncommon for these distortions or drop-outs to appear two or three times in the course of a 60-minute broadcast.
The DAT recorder is controlled by a simple digital clock timer, which is set to turn the DAT recorder on during the one-hour broadcast. The recorder is set to go directly into recording mode (at a 32 kHz sample rate) on power-up.
The digital audio output from the DAT recorder is passed to a DATLink device for downsampling to 16 kHz, and this in turn is connected to a Sun SPARC workstation. A control process is scheduled to run on the workstation at broadcast time to sample from the DATLink for that hour. At the end of the hour, the DAT recorder shuts off, the waveform file is closed, and a quality check process is run on the waveform data to report the minimum and maximum sample values and check for peak clipping.
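The quality check described above can be illustrated with a short sketch. This is not the LDC's actual program; it assumes headerless 16-bit signed little-endian PCM samples, and it assumes that "peak clipping" can be approximated as a run of consecutive samples stuck at full scale. The function name check_waveform and the clip_run threshold are hypothetical.

```python
import array
import sys

INT16_MIN, INT16_MAX = -32768, 32767


def check_waveform(pcm_bytes, clip_run=3):
    """Report min/max sample values and a simple clipping flag.

    Assumptions: pcm_bytes holds raw 16-bit signed little-endian
    samples; a run of clip_run or more consecutive full-scale
    samples is treated as peak clipping.
    """
    samples = array.array("h")
    samples.frombytes(pcm_bytes)       # raises ValueError on odd length
    if sys.byteorder == "big":
        samples.byteswap()             # data is assumed little-endian
    if not samples:
        raise ValueError("empty waveform")

    run = longest = 0
    for s in samples:
        if s in (INT16_MIN, INT16_MAX):
            run += 1
            longest = max(longest, run)
        else:
            run = 0

    return {
        "min": min(samples),
        "max": max(samples),
        "clipped": longest >= clip_run,
    }


if __name__ == "__main__":
    values = [0, 100, -200, 50]
    data = b"".join(v.to_bytes(2, "little", signed=True) for v in values)
    print(check_waveform(data))
```

A single full-scale sample is not flagged under this heuristic, since an isolated peak can legitimately reach the limit; only sustained runs are reported.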
The results of the check are entered into an Oracle table with the file ID of the broadcast. If there is a problem with the waveform file, the DAT cartridge is checked, and if it was recorded correctly, it is used as the signal source to redo the waveform file capture. If the DAT recording shows severe distortions or drop-outs due to poor satellite reception, the episode is removed from the collection.
For every episode that is successfully recorded, a brief inspection of the waveform is made to divide it cleanly in half; four time stamp labels are set, marking the beginning and ending points for each program half (leaving out local station promotional messages at these boundaries). These time stamps are used to play back the two halves over the DATLink for recording onto the two sides of a 60-minute analog audio cassette tape. The cassette is then sent to the transcription service, and the service returns the completed transcript via ftp.
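As an illustration of how the four time stamps relate to the waveform data, the sketch below converts them into sample ranges for the two program halves and checks that each half fits on one side of a 60-minute cassette. It assumes time stamps expressed in seconds and the 16 kHz sample rate of the captured file; the function names and the example time values are hypothetical.

```python
SAMPLE_RATE = 16000          # Hz, rate of the captured waveform file
SIDE_SECONDS = 30 * 60       # one side of a 60-minute cassette


def half_ranges(stamps, sample_rate=SAMPLE_RATE):
    """Map (begin1, end1, begin2, end2) in seconds to sample ranges."""
    if len(stamps) != 4:
        raise ValueError("expected exactly four time stamps")
    b1, e1, b2, e2 = stamps
    # The halves must be in order and must not overlap.
    if not (0 <= b1 < e1 <= b2 < e2):
        raise ValueError("time stamps out of order")

    def to_sample(t):
        return int(round(t * sample_rate))

    return [(to_sample(b1), to_sample(e1)), (to_sample(b2), to_sample(e2))]


def fits_cassette(stamps, side_seconds=SIDE_SECONDS):
    """True if each program half is no longer than one cassette side."""
    b1, e1, b2, e2 = stamps
    return (e1 - b1) <= side_seconds and (e2 - b2) <= side_seconds


if __name__ == "__main__":
    stamps = (12.5, 1795.0, 1805.0, 3590.25)   # made-up example values
    print(half_ranges(stamps))
    print(fits_cassette(stamps))
```

The gap between the end of the first half and the start of the second corresponds to the material left out at the boundary, such as local station promotional messages.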
When the transcript arrives, it is filtered to produce the format used by the segmentation interface. The annotators' task is basically the same as for the video sources, except that the original story boundaries in the transcript do not have time stamps provided by the original transcriber. This means that the LDC annotator needs to spend somewhat more time to locate each story boundary in the waveform file, and place the correct time stamps for all boundaries into the transcript.