GALE: Transcription
LDC will create, purchase or otherwise acquire verbatim transcripts for large volumes of speech for GALE. Final transcripts will conform to the quick transcription (QTR) specification whose elements include accurate transcription of content words, segmentation and time-alignment at least to the level of sections and speaker turns, speaker identification, and standardized punctuation and orthography but no additional markup. A portion of the data will be further subject to the quick rich transcription (QRTR) specification, which adds time-aligned sentences plus sentence type identification (SUs). A small amount of data will be dually transcribed and discrepancies adjudicated to provide data for analysis of human transcription variation. Upon sponsor request, small amounts of data may also be transcribed using Careful Transcription (CTR) specifications in order to provide material for evaluating system performance.
Initial transcripts are obtained in a number of ways, including:
- manual creation by LDC
- manual creation by professional transcription agencies
- harvesting from the web
- acquisition of transcript archives from commercial sources
Transcripts will be distributed to GALE sites as part of quarterly data releases. Initial distributions may include transcripts that do not fully conform to the QTR specification. Such transcripts will be reviewed and revised for future releases.
Guidelines
- Quick Transcription Specification (QTR)
The goal of quick transcription (QTR) for broadcast news and broadcast conversation is to produce a verbatim, time-aligned transcript with minimal but useful markup. The elements of a quick transcript include verbatim transcription plus time-aligned section boundaries and speaker turns (segmentation), section type identification, and speaker identification.
- Quick Rich Transcription Specification (QRTR)
The goal of quick rich transcription (QRTR) for broadcast news and broadcast conversation is to produce a verbatim, time-aligned transcript with minimal but useful markup. QRTR also identifies some salient structural features of the broadcast recording. The elements of a quick rich transcript include verbatim transcription plus time-aligned section boundaries, speaker turns and sentences (segmentation), section and sentence type identification, and speaker identification.
- Arabic Broadcast QRTR Guidelines (updated 05/22/2008)
- Chinese Broadcast QRTR Guidelines (updated 03/05/2008)
- English Broadcast QRTR Guidelines
- Careful Transcription Specification (CTR)
The goal of careful transcription (CTR) for broadcast news and broadcast conversation is to produce a verbatim, time-aligned transcript with rich markup and extensive quality control. Audio is first manually segmented and time-aligned with the transcript at the level of the section, speaker turn and breath group. The verbatim transcript contains section type identification and speaker identification as well as standardized markup for disfluencies, partial words, acronyms, mispronounced words and several other features.
- RT-04 Careful Transcription (CTR) Guidelines, Version 3.1 (updated 03/31/2004)
- Metadata Extraction (MDE)
During the Metadata Extraction task, annotators identify fillers, strings of deletable words (deletable regions, or delregs) within edit disfluencies, and SUs ("semantic", "syntactic" or "sentence" units). Transcripts annotated for metadata can be "cleaned up" to enhance readability; for instance, delregs and fillers might be removed and each SU presented as a separate line within the transcript.
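The "clean-up" step described above can be sketched in a few lines of Python. This is only an illustration: the Token class, the label names "filler" and "delreg", and the su_end flag are hypothetical stand-ins, not the actual MDE annotation format, which is defined in the guidelines themselves.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One transcribed token with hypothetical MDE-style annotations."""
    text: str
    label: Optional[str] = None  # "filler", "delreg", or None (assumed label names)
    su_end: bool = False         # True if this token closes an SU (assumed flag)


def clean_transcript(tokens: List[Token]) -> List[str]:
    """Drop fillers and delregs, and emit each SU as its own line."""
    lines: List[str] = []
    current: List[str] = []
    for tok in tokens:
        # Keep only tokens that are neither fillers nor deletable regions.
        if tok.label not in ("filler", "delreg"):
            current.append(tok.text)
        # An SU boundary closes the current line, even if the last token was dropped.
        if tok.su_end:
            if current:
                lines.append(" ".join(current))
            current = []
    if current:  # trailing material with no closing SU boundary
        lines.append(" ".join(current))
    return lines


tokens = [
    Token("um", label="filler"),
    Token("we", label="delreg"),
    Token("went", label="delreg"),
    Token("we"),
    Token("drove"),
    Token("to"),
    Token("Boston", su_end=True),
    Token("you know", label="filler"),
    Token("it"),
    Token("rained", su_end=True),
]
cleaned = clean_transcript(tokens)
print("\n".join(cleaned))
```

In this toy input the filler "um" and the delreg "we went" are removed, and the two SUs are printed on separate lines.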
- Description of Web-harvested Transcripts
In addition to creating verbatim transcripts for GALE, LDC also harvests transcripts from online news sources, where possible. The transcripts are in a plain text format. LDC extracts the transcripts from the downloaded HTML files, converts them to UTF-8, and divides them into "sentences" based on punctuation characters. In most cases, speakers are identified. However, the web-harvested transcript format differs from standard LDC transcription format and content in several ways. The transcripts have not been time-aligned with the audio and they do not follow LDC-style transcription markup. In addition, sections and sentence units (SUs) have not been explicitly identified.
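The harvesting steps described above (extract text from HTML, convert to UTF-8, split on punctuation) might look roughly like the following Python sketch. It is not LDC's actual pipeline: the harvest function, the TextExtractor class, the sample page and the choice of sentence-final characters are all assumptions made for illustration, and the source encoding is assumed to be known in advance.

```python
import re
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style content."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def harvest(raw: bytes, source_encoding: str) -> List[str]:
    """Decode a downloaded page, strip markup, and split into 'sentences'."""
    html = raw.decode(source_encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Naive split: break after ., ! or ? when followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in pieces if p.strip()]


# Hypothetical page in a non-UTF-8 encoding.
page = (
    "<html><body><p>ANCHOR: Good evening. Here is the news.</p>"
    "<p>REPORTER: Thank you, Jos\u00e9!</p></body></html>"
).encode("iso-8859-1")

sentences = harvest(page, "iso-8859-1")
utf8_bytes = "\n".join(sentences).encode("utf-8")  # one sentence per line, UTF-8
```

A punctuation-based split like this will mis-segment abbreviations and decimal numbers, which is consistent with the source's use of quotation marks around "sentences".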














