GALE: Transcription

LDC will create, purchase or otherwise acquire verbatim transcripts for large volumes of speech for GALE. Final transcripts will conform to the quick transcription (QTR) specification, whose elements include accurate transcription of content words, segmentation and time-alignment at least to the level of sections and speaker turns, speaker identification, and standardized punctuation and orthography, but no additional markup.  A portion of the data will be further subject to the quick rich transcription (QRTR) specification, which adds time-aligned sentence units (SUs) with sentence type identification.  A small amount of data will be dually transcribed, with discrepancies adjudicated, to provide data for analysis of human transcription variation. Upon sponsor request, small amounts of data may also be transcribed using the careful transcription (CTR) specification in order to provide material for evaluating system performance.

Initial transcripts are obtained in a number of ways, including:

  • manual creation by LDC
  • manual creation by professional transcription agencies
  • harvesting from the web
  • acquisition of transcript archives from commercial sources

Transcripts will be distributed to GALE sites as part of quarterly data releases.  Initial distributions may include transcripts that do not fully conform to the QTR specification.  Such transcripts will be reviewed and revised for future releases.

Guidelines

  • Quick Transcription Specification (QTR)

    The goal of quick transcription (QTR) for broadcast news and broadcast conversation is to produce a verbatim, time-aligned transcript with minimal but useful markup.  The elements of a quick transcript include verbatim transcription plus time-aligned section boundaries and speaker turns (segmentation), section type identification, and speaker identification.

  • Quick Rich Transcription Specification (QRTR)

    The goal of quick rich transcription (QRTR) for broadcast news and broadcast conversation is to produce a verbatim, time-aligned transcript with minimal but useful markup.  QRTR also identifies some salient structural features of the broadcast recording.  The elements of a quick rich transcript include verbatim transcription plus time-aligned section boundaries, speaker turns and sentences (segmentation), section and sentence type identification, and speaker identification.
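
    As a rough illustration of how the QTR and QRTR elements nest, the sketch below models sections, speaker turns and sentences as Python dataclasses. It is not LDC's transcript format: the class names, field names and label values ("report", "statement") are hypothetical, chosen only to mirror the elements listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    """Time-aligned sentence with a type label (QRTR only)."""
    start: float
    end: float
    text: str
    sentence_type: str  # hypothetical label, e.g. "statement"


@dataclass
class Turn:
    """Time-aligned speaker turn with speaker identification (QTR and QRTR)."""
    speaker: str
    start: float
    end: float
    text: str
    sentences: List[Sentence] = field(default_factory=list)  # empty for QTR


@dataclass
class Section:
    """Time-aligned section with a section type label (QTR and QRTR)."""
    section_type: str  # hypothetical label, e.g. "report"
    start: float
    end: float
    turns: List[Turn] = field(default_factory=list)


def has_sentence_level(sections: List[Section]) -> bool:
    """Return True if every turn carries time-aligned, typed sentences."""
    turns = [t for s in sections for t in s.turns]
    return bool(turns) and all(t.sentences for t in turns)


# A QTR-style transcript: sections and turns only.
qtr = [Section("report", 0.0, 5.0,
               [Turn("speaker1", 0.0, 5.0, "Good evening. Here is the news.")])]

# A QRTR-style transcript: the same turn, further divided into sentences.
qrtr = [Section("report", 0.0, 5.0, [
    Turn("speaker1", 0.0, 5.0, "Good evening. Here is the news.", [
        Sentence(0.0, 1.5, "Good evening.", "statement"),
        Sentence(1.5, 5.0, "Here is the news.", "statement"),
    ])])]
```

    Under these assumptions, a QRTR transcript is a QTR transcript whose turns have been divided into typed, time-aligned sentences.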

  • Careful Transcription Specification (CTR)

    The goal of careful transcription (CTR) for broadcast news and broadcast conversation is to produce a verbatim, time-aligned transcript with rich markup and extensive quality control.  Audio is first manually segmented and time-aligned with the transcript at the level of the section, speaker turn and breath group.  The verbatim transcript contains section type identification and speaker identification as well as standardized markup for disfluencies, partial words, acronyms, mispronounced words and several other features.

  • Metadata Extraction (MDE) 

    During the Metadata Extraction task, annotators identify fillers, strings of deletable words (deletable regions, or delregs) within edit disfluencies, and SUs ("semantic", "syntactic" or "sentence" units). Transcripts annotated for metadata can be "cleaned up" to enhance readability; for instance, delregs and fillers might be removed and each SU presented as a separate line within the transcript.
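
    The "clean up" step described above can be sketched in a few lines of Python. This is a minimal illustration, not the MDE annotation format: the Token class, its kind labels ("word", "filler", "delreg") and the su_end flag are all hypothetical stand-ins for whatever the annotated transcript actually records.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    kind: str = "word"    # hypothetical labels: "word", "filler", "delreg"
    su_end: bool = False  # True if this token is the last one in an SU


def clean_transcript(tokens: List[Token]) -> List[str]:
    """Drop fillers and delregs, and return one string per SU."""
    lines, current = [], []
    for tok in tokens:
        if tok.kind == "word":
            current.append(tok.text)
        if tok.su_end:
            # Close the SU even if its final token was itself removed.
            if current:
                lines.append(" ".join(current))
            current = []
    if current:  # trailing words with no SU boundary marked
        lines.append(" ".join(current))
    return lines


tokens = [
    Token("um", "filler"),
    Token("we", "delreg"),
    Token("went", "delreg"),
    Token("we"),
    Token("drove"),
    Token("home", su_end=True),
    Token("you know", "filler"),
    Token("it"),
    Token("rained", su_end=True),
]

for line in clean_transcript(tokens):
    print(line)
```

    With the sample tokens, the filler "um" and the delreg "we went" are removed, and the two SUs are printed on separate lines.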


  • Description of Web-harvested Transcripts

    In addition to creating verbatim transcripts for GALE, LDC also harvests transcripts from online news sources where possible. The transcripts are in a plain text format.  LDC extracts the transcripts from the downloaded HTML files, converts them to UTF-8, and divides them into "sentences" based on punctuation characters. In most cases, speakers are identified. However, the web-harvested transcripts differ from standard LDC transcripts in format and content in several ways. They have not been time-aligned with the audio, and they do not follow LDC-style transcription markup. In addition, sections and sentence units (SUs) have not been explicitly identified.
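
    The harvesting steps described above (extract text from HTML, convert to UTF-8, split on punctuation) can be approximated with the Python standard library. This is a sketch under stated assumptions, not LDC's pipeline: the harvest function, the TextExtractor class and the choice of ".", "!" and "?" as sentence-ending characters are illustrative, and the source encoding is assumed to be known in advance.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def harvest(raw_html: bytes, source_encoding: str) -> bytes:
    """Return UTF-8 plain text with one punctuation-delimited "sentence" per line."""
    html_text = raw_html.decode(source_encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse all runs of whitespace to single spaces.
    text = " ".join(" ".join(parser.chunks).split())
    # Split after sentence-final punctuation (an assumed, simplistic rule).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return "\n".join(sentences).encode("utf-8")


page = (
    "<html><body><p>ANCHOR: Good evening. Café prices rose today!</p>"
    "<script>var x = 1;</script><p>REPORTER: Why?</p></body></html>"
).encode("latin-1")

print(harvest(page, "latin-1").decode("utf-8"))
```

    A purely punctuation-based split will mis-segment abbreviations and decimal numbers, which is consistent with the source's use of quotation marks around "sentences".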