This document describes a convention for formatting and structuring transcripts of spoken language. The convention is intended to apply equally well to all forms of spoken language, and to support varying types and amounts of detail in transcription. It is also designed to support an efficient user interface for the creation of transcript data, as well as clear, stable and unambiguous interpretation by users of transcripts.
In general, it is expected that both creators and users of transcripts will make use of digital recordings of the speech in conjunction with the transcripts, so an important feature of the convention is to provide consistent and reliable references to time offsets within the recordings. But for applications in which accurate reference to the acoustic signal is not required, the time-offset information can readily be ignored or excluded, without any impact on the presentation of relevant data (i.e. the text of utterances and any structural or featural information associated with the utterances).
The convention is based on a simple skeletal structure of SGML markup, allowing extensibility to support varying levels of detail, as well as parsability to assure both formal adherence to an established specification and overall coherence within each transcript document. The formal specification is simply the SGML Document Type Definition (DTD), which dictates the character encoding to be used, the inventory of units and structures that make up the transcript document, and the organization of these units and structures within the document.
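As a concrete sketch, a skeletal DTD of this kind might declare a transcript as a sequence of speaker turns bounded by time-offset tags. The element and attribute names used below (Transcript, Turn, Time, audioFile, speaker) are illustrative placeholders, not names fixed by this specification:

```dtd
<!-- Hypothetical skeletal DTD; all names are illustrative only -->
<!ELEMENT Transcript - - (Turn+)>
<!ATTLIST Transcript audioFile CDATA   #REQUIRED>
<!ELEMENT Turn       - - (Time, (#PCDATA, Time)+)>
<!ATTLIST Turn       speaker   CDATA   #REQUIRED>
<!ELEMENT Time       - o EMPTY>
<!ATTLIST Time       sec       NMTOKEN #REQUIRED>
```

The inventory of units would then be extended or refined for particular applications, in the manner of the scenarios discussed later in this document.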
This design specification is organized into two sections: General Properties, and Properties of transcribed speech. The former describes the elements of the design that apply equally to all transcripts, while the latter provides details that depend on such issues as character encoding, orthographic practice, definitions of "word", etc., which differ from one language to another.
All transcripts are derived from recordings of speech, so the boundaries of a transcript are, at maximum, the boundaries of a particular recording. For applications that involve use of the acoustic signal with the transcript, identification of the acoustic data (i.e. a file name or other reference) is provided. The transcript may exclude one or more portions of the recording, and the excluded portion(s) may or may not have an explicit representation in the transcript, depending on the needs of particular applications where the transcript is to be used.
All transcripts are based on a fundamental unit of recorded speech behavior, which will be referred to as a "speaker turn", or simply "Turn". The Turn may range in length from a brief conversational interjection to an extended speech or lecture. Whatever its length, its content represents a single act of communication by one individual, bounded either by the limits of the transcript or by other Turns within the transcript.
In transcripts where two or more speakers are present, Turns are labeled to identify each speaker uniquely. The nature of the labeling can range from arbitrary and generic ("A", "B", etc) to featural ("M1", "F1", etc) to specific ("BClinton", "PJennings", etc). In general, the speaker labels applied to Turns should reflect relevant information about the speaker, such as gender; in the case of multichannel recordings with speakers divided by channel, the channel identification is reflected in the speaker labels as well.
For some applications, sequences of Turns may be grouped into larger structural units, such as transactions or stories. In addition, some applications may require that Turns be subdivided according to various events or conditions that occur within Turns. These varying needs can be met by adding levels of structure to the SGML markup above or below the level of the Turn. In all cases, the Turn remains the fundamental unit whose basic representation and function are consistent across all transcripts.
All time offsets in a transcript are given a single, uniform representation, called a "Breakpoint" tag. In order for the correlation of transcripts and recordings to be consistent and reliable across all uses of this specification, each Turn must begin and end with a Breakpoint tag to define the temporal extent of the Turn in the recording. The Turn can contain additional Breakpoint tags, to associate time offsets with other notations within the Turn, or simply to split a long Turn into chunks that are more convenient for auditing, transcribing or processing.
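For instance, a short Turn bounded by Breakpoint tags might be rendered as follows; the tag and attribute names here (Turn, Time, speaker, sec) are illustrative placeholders rather than names fixed by this specification:

```sgml
<Turn speaker="A">
<Time sec="12.34">
yeah I think that's right
<Time sec="14.02">
but I would have to check the schedule
<Time sec="16.75">
</Turn>
```

The first and last Breakpoint tags define the temporal extent of the Turn; the intermediate one splits the Turn into two chunks for convenience.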
If portions that are excluded from the transcription process are to be listed explicitly as such in the transcript, these portions must be identified by some alternative SGML tagged unit, whose contents are simply the two Breakpoint tags that define its temporal extent.
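Such an excluded unit might look like the following, assuming a hypothetical element name (Excluded) and the same illustrative Breakpoint notation used above:

```sgml
<Excluded>
<Time sec="305.10">
<Time sec="322.48">
</Excluded>
```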
In cases where two or more speakers are recorded in conversation, it is expected that their respective Turns will occasionally overlap in time. Since each Turn contains an initial and final Breakpoint time offset, this overlap already has a direct (but minimal) representation without further notation. In applications where it is important to identify overlapping speech in more careful detail (e.g. where two voices overlap on a single audio channel), the portions of each Turn affected by the overlap can be bounded by Overlap tags, and these can be bound to additional Breakpoint tags within each Turn where necessary.
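The following sketch shows how such detailed overlap marking might look for two Turns that overlap between two shared time offsets; all tag names are illustrative placeholders:

```sgml
<Turn speaker="A">
<Time sec="20.00">
I was just going to say
<Time sec="21.90">
<Overlap> we should probably leave </Overlap>
<Time sec="23.50">
</Turn>
<Turn speaker="B">
<Time sec="21.90">
<Overlap> no wait a minute </Overlap>
<Time sec="23.50">
hang on
<Time sec="24.80">
</Turn>
```

Here the additional Breakpoint tags at 21.90 and 23.50 bound the overlapping portion within each Turn.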
In order to clarify this set of conventions, a few different transcription scenarios will be considered in detail below. In each scenario, there is a brief summary of requirements imposed by the nature of the recording and the needs of an application to be served by the transcript. Following the summary, there is a table that summarizes the transcript specifications in terms of SGML units; these would be implemented by rendering the contents of the table into a corresponding DTD. Following the table, there is an explanation of the table notations, and an example of the resulting transcript format, together with comments about the format.
Scenario 1: Two-channel telephone conversation
|Turn||SpkrID=[AB][1-9]*||Time,((OV | #TRANSDATA?),Time)+|
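One possible rendering of this table row into DTD form is sketched below. Here #TRANSDATA is assumed to be supplied by a language-specific parameter entity (given a placeholder value for illustration), and the Time and OV declarations are likewise assumptions, not part of the table itself:

```dtd
<!ENTITY % TRANSDATA "#PCDATA">  <!-- placeholder; defined per language -->
<!ELEMENT Turn - - (Time, ((OV | (%TRANSDATA;)?), Time)+)>
<!ATTLIST Turn SpkrID NMTOKEN #REQUIRED>  <!-- values match [AB][1-9]* -->
<!ELEMENT OV   - - (%TRANSDATA;)>
<!ELEMENT Time - o EMPTY>
<!ATTLIST Time sec    NMTOKEN #REQUIRED>
```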
The definition of what constitutes the text content of a transcription (i.e. #TRANSDATA) is language dependent. It will be discussed in greater detail under Properties of transcribed speech.
Words preceded by an initial "/" are to be classified as proper nouns; this is a language-dependent convention for token classification, to be discussed below. No punctuation was used by the transcriber in this case, except for a question mark where appropriate. (This sample was adapted from an excerpt produced in accordance with previous CallHome conventions.) It would be possible to define other forms of punctuation in the language-specific portion of the DTD.
Scenario 2: Single-channel broadcast news
|Episode||Filename, Program, Language||Section+|
|Section||Type, startTime, endTime||(Turn*|Comment?)|
|Turn||Speaker, Sex, startTime, endTime||(Time?,(#PCDATA|Foreign|Unclear|Overlap))+|
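These three rows might be rendered into a DTD along the following lines. This is only a sketch: the attribute types, and the declarations for Comment, Foreign, Unclear and Overlap, are assumptions rather than part of the table:

```dtd
<!ELEMENT Episode - - (Section+)>
<!ATTLIST Episode Filename  CDATA   #REQUIRED
                  Program   CDATA   #REQUIRED
                  Language  CDATA   #REQUIRED>
<!ELEMENT Section - - (Turn* | Comment?)>
<!ATTLIST Section Type      CDATA   #REQUIRED
                  startTime NMTOKEN #REQUIRED
                  endTime   NMTOKEN #REQUIRED>
<!ELEMENT Turn - - ((Time?, (#PCDATA | Foreign | Unclear | Overlap))+)>
<!ATTLIST Turn    Speaker   CDATA   #REQUIRED
                  Sex       CDATA   #IMPLIED
                  startTime NMTOKEN #REQUIRED
                  endTime   NMTOKEN #REQUIRED>
```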
These definitions establish the possible character patterns that make up the transcription text proper (i.e. the #PCDATA portions of the DTD specification): which characters make up lexical tokens (as well as non-lexical verbalizations, or "non-lexemes", e.g. "um"), and which characters divide tokens. Among the tokens, it will generally be useful to define particular subsets having peculiar qualities, such as proper nouns, non-lexemes, alphabetic strings ("FBI"), etc. A special set of characters, not otherwise used in rendering the tokens, is applied to identify members of each class.
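For example, if "/" marks proper nouns (as in Scenario 1) and a character such as "%" were reserved for non-lexemes, a fragment of transcription text might look like the following; the "%" convention here is purely hypothetical:

```
%um well /Clinton spoke about it %uh yesterday
```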
The list (and explanation) of these special characters is found in the document: