Santa Barbara Corpus of Spoken American English


Guide to Conventions



 
 


How-to Guide: Timestamping
How-to Guide: Second-Passing


 
For each file, we begin with an initial transcript created by an external transcription agency. The annotator's job is to produce a perfect verbatim (word-for-word) transcription of each file, complete with accurate timestamps and the appropriate markup according to these conventions.

 

Timestamping
Each file may contain a number of different speakers.

There may be a number of different turns inserted, but it is necessary to listen to the entire audio file. Breakpoints (time segment boundaries that do not indicate a new speaker) are inserted mainly for ease of transcription. These will typically coincide with breath groups or pauses in the speech, and may coincide with the ends of sentences or phrases. Breakpoints will typically occur every 3-8 seconds.
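As a rough aid for reviewing breakpoint spacing, the sketch below flags segments that fall outside the typical 3-8 second range. The input format (a sorted list of breakpoint times in seconds) and the function name are assumptions made for illustration, not part of the annotation tool. Because the range is only typical, a flagged segment is a candidate for a second look, not necessarily an error.

```python
def flag_unusual_segments(breakpoints, min_len=3.0, max_len=8.0):
    """Return (start, end, duration) for segments outside the typical range.

    `breakpoints` is a sorted list of times in seconds (hypothetical input
    format); each consecutive pair of times delimits one segment.
    """
    flagged = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        duration = end - start
        if duration < min_len or duration > max_len:
            flagged.append((start, end, duration))
    return flagged


# The 12.5-second and 1.5-second segments are reported for review.
print(flag_unusual_segments([0.0, 4.2, 9.0, 21.5, 23.0]))
```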

Some things to consider when inserting breakpoints:



 

Speaker Identification

At the beginning of each transcript, the speaker should be given a unique identifier if the name is not present. The speaker's gender will also be indicated during the second pass. Personal information about the speakers (their names, where they live, etc.) that is present in the audio data will be distorted in the audio and in the transcripts during post-processing to protect participant privacy. To support this, sections of the transcripts in which personal information occurs must be identified. More on that later.



Transcription Conventions

Orthography
We follow the general orthographic conventions (spelling) for English. Some special situations:

Capitalization
Capitalization in our transcripts is used as an aid for human comprehension of the text. You should follow the accepted standard way to capitalize words, including words at the beginning of a sentence, proper names, and so on.
Numerals
Write out all numerals. Hyphenate numbers between twenty-one and ninety-nine only.
twenty-two
nineteen ninety-five
seven thousand two hundred seventy-five
nineteen oh nine
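The hyphenation rule can be illustrated with a small sketch that spells out cardinal numbers and hyphenates only the twenty-one to ninety-nine part. The function names and the 0 to 999,999 range are choices made for this example. The transcript should always reflect what the speaker actually said, so readings such as "nineteen ninety-five" or "nineteen oh nine" are not produced by spell_out and must be transcribed as heard.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def under_hundred(n):
    """Spell out 0-99, hyphenating compound numbers such as twenty-two."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def spell_out(n):
    """Spell out an integer from 0 to 999,999 as a cardinal number."""
    if not 0 <= n <= 999_999:
        raise ValueError("outside the range this sketch handles")
    if n < 100:
        return under_hundred(n)
    parts = []
    thousands, rest = divmod(n, 1000)
    if thousands:
        parts.append(spell_out(thousands) + " thousand")
    hundreds, rest = divmod(rest, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        parts.append(under_hundred(rest))
    return " ".join(parts)


print(spell_out(22))    # twenty-two
print(spell_out(7275))  # seven thousand two hundred seventy-five
# A year spoken as two pairs can be built from the two halves:
print(under_hundred(19) + " " + under_hundred(95))  # nineteen ninety-five
```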
Abbreviations
When abbreviations are used as part of a personal title, they can remain as abbreviations:
Mr. Brown
Mrs. Jones
Dr. Spock
However, when they are used in any other context, write them out in full.
I went to the junior league game.
I'm going home to see the missus.
I went to the doctor, and all he said was, don't worry, it's natural.
Hey mister, do you know how to get to the stadium?
Punctuation
Use standard English punctuation, limited to the following:
.
?
,
Do not use any additional punctuation, such as quotation marks, exclamation marks, colons, semicolons, dashes or ellipses.
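A quick way to catch stray punctuation is a pattern check over the transcript text, sketched below. Two assumptions are made here: a single hyphen is not treated as a dash, because it also appears in hyphenated words and in partial-word markup, and the check is run on the transcript text only, not on timestamps or other fields that may legitimately contain colons. The function name is hypothetical.

```python
import re

# Characters and sequences the guide rules out: quotation marks, exclamation
# marks, colons, semicolons, dashes, and ellipses.
# Assumption: only "--" and the en/em dash characters count as dashes, so that
# hyphenated words and partial words (e.g. absolu-) are left alone.
DISALLOWED = re.compile(
    r'\.{3,}|\u2026|--+|[\u2013\u2014]|["\u201c\u201d!:;]'
)


def find_disallowed_punctuation(line):
    """Return every disallowed punctuation mark found in a transcript line."""
    return [m.group(0) for m in DISALLOWED.finditer(line)]


print(find_disallowed_punctuation('He said, "wait!" and left...'))
print(find_disallowed_punctuation("Hey mister, do you know the way?"))
```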
Contractions and apostrophe -s
Limit your use of contractions to those that exist in standard written English, and of course use one only when the speaker actually produces a contraction. The table below, while not comprehensive, illustrates what is considered standard written English with respect to contractions. (Note: Avoid the common mistakes of confusing possessive its with the contraction it's (it is), and possessive your with the contraction you're (you are).)
Complete words        Contraction allowed    Contraction not allowed
I have                I've
cannot                can't
will not              won't
you have              you've
could not             couldn't
we will               we'll
should have           should've
it is                 it's
Marvin (possessive)   Marvin's
going to              --                     gonna
want to               --                     wanna
she is                --                     she's
Marvin is             --                     Marvin's
Marvin has            --                     Marvin's
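The unambiguous entries in the "not allowed" column can be searched for automatically. The sketch below is only an illustration: the mapping is copied from the table above, the function name is hypothetical, and forms such as Marvin's are deliberately left out because a script cannot tell the possessive from a contracted "is" or "has" without listening to the audio.

```python
import re

# Disallowed forms and their full spellings, taken from the table above.
NOT_ALLOWED = {"gonna": "going to", "wanna": "want to", "she's": "she is"}


def flag_contractions(line):
    """Return (word, suggested full form) pairs for disallowed contractions."""
    words = re.findall(r"[A-Za-z']+", line)
    return [(w, NOT_ALLOWED[w.lower()]) for w in words if w.lower() in NOT_ALLOWED]


print(flag_contractions("I'm gonna ask if she's ready"))
```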
 
Hyphenated words and compounds
This is a tricky issue. In general, be conservative about use of hyphens. When in doubt, leave it out. For instance:
an overly complicated analysis not an overly-complicated analysis
However, in some cases, a hyphen is required:
anti-nuclear protests not anti nuclear protests
Consult your supervisors with questions.


 

Special Conventions
Some special conventions are used to indicate particular kinds of words.

Condition                             Symbol                       Example                       Description of symbol's use
Acronyms I                            @                            @NATO                         Acronyms commonly pronounced as a single word
Acronyms II                           ~                            ~FBI                          Acronyms commonly pronounced as a series of letters
Individual letters                    ~                            ~I before ~E except after ~C  Individual letters spelled out
Proper nouns                          ^                            ^Mitsubishi                   Proper names, place names, and organization names should be marked, and segmented for SBCSAE
Partial words                         -                            absolu-                       Speaker-produced partial words are indicated with a dash. Transcribe as much of the word as you hear.
Mispronounced words                   +                            +probably                     Mispronounced word (a speech error). NOTE: Do not use this symbol to indicate non-standard but common regional or social dialect pronunciations, such as "gonna" for "going to". Transcribe these variants using normal standard orthography.
Idiosyncratic words                   *                            *poodleish                    Speaker uses a "made-up" word.
Speaker noise                         { }                          {breath}                      Sounds made by the talker. Limited to these five sounds.
                                                                   {cough}
                                                                   {laugh}
                                                                   {sneeze}
                                                                   {lipsmack}
Background noise (instantaneous)      [background]                 [background]                  Short sound not made by the talker (in the background).
Background noise (extended)           [background/] [/background]  [background/] [/background]   Indicates start/end of a longer non-speaker background noise.
Distortion (instantaneous)            [distortion]                 [distortion]                  Brief period of distortion in the audio.
Distortion (extended)                 [distortion/] [/distortion]  [distortion/] [/distortion]   Indicates start/end of a longer period of distortion in the audio.
Semi-intelligible speech              ((text))                     ((the software design))       This is the transcriber's best attempt at transcribing a difficult passage.
Unintelligible speech (single token)  (( ))                        (( ))                         Used if a single word or short phrase is completely unintelligible.
Unintelligible speech (extended)      [[skip]]                     [[skip]]                      Used to indicate a long period of unintelligible speech. This mark should receive a separate timestamp.
Foreign language                      <language text>              <French merci>                This is used to indicate foreign speech. If the foreign word is unknown, merely write the language. If the language is unknown, consult your supervisors. NOTE: Do not use this convention for common foreign language borrowings into English, such as "..."
Non-lexemes                           %                            %ah                           Non-word hesitation sounds. Limited to the list below.
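To show how the symbols in the table above combine in a transcript line, the following sketch labels each marked-up item with a regular expression. The label names and the classify function are invented for this example, and the pattern is only one reading of the table: it treats both uses of ~ alike, and it separates (( )) from ((text)) by checking whether anything sits between the parentheses. Punctuation marks are skipped.

```python
import re

# One alternative per convention; the first alternative that matches wins.
TOKEN = re.compile(r"""
    (?P<speaker_noise>\{(?:breath|cough|laugh|sneeze|lipsmack)\})
  | (?P<background>\[background\]|\[background/\]|\[/background\])
  | (?P<distortion>\[distortion\]|\[distortion/\]|\[/distortion\])
  | (?P<skip>\[\[skip\]\])
  | (?P<unclear>\(\(.*?\)\))
  | (?P<foreign><[^<>]+>)
  | (?P<acronym_word>@\w+)
  | (?P<spelled_letters>~\w+)
  | (?P<proper_noun>\^\w+)
  | (?P<mispronounced>\+\w+)
  | (?P<idiosyncratic>\*\w+)
  | (?P<non_lexeme>%\w+)
  | (?P<partial_word>\w+-(?!\w))
  | (?P<word>[\w'-]+)
""", re.VERBOSE)


def classify(line):
    """Return (label, text) pairs for each item found in a transcript line."""
    result = []
    for m in TOKEN.finditer(line):
        kind, text = m.lastgroup, m.group(0)
        # Empty double parentheses mark fully unintelligible speech.
        if kind == "unclear" and not text[2:-2].strip():
            kind = "unintelligible"
        result.append((kind, text))
    return result


for item in classify("^Mitsubishi hired ~FBI {laugh} absolu- (( )) %ah"):
    print(item)
```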
 
Non-lexemes
Indicate filled pauses using a standardized spelling, shown in the table below.
(*If you believe a speaker uses a word that does not appear on this list, let us know.)
%ach  %eh  %hee  %huh  %oh   %aw
%ah   %ew  %huh  %um   %er
%eee  %ha  %hm   %uh   %ooh
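Because filled pauses are limited to the list above, a transcript can be checked for any %-marked token that is not on it. The sketch below assumes the check is case-sensitive (the standardized spellings are lowercase), and the function name is hypothetical. A flagged token is either a spelling to correct or a candidate to report as a new non-lexeme.

```python
import re

# The standardized filled-pause spellings from the table above.
ALLOWED_NON_LEXEMES = {
    "%ach", "%ah", "%aw", "%eee", "%eh", "%er", "%ew", "%ha", "%hee",
    "%hm", "%huh", "%oh", "%ooh", "%uh", "%um",
}


def unknown_non_lexemes(line):
    """Return %-marked tokens that are not on the standardized list."""
    return [tok for tok in re.findall(r"%\w+", line)
            if tok not in ALLOWED_NON_LEXEMES]


print(unknown_non_lexemes("%um I think %mm it was %UH fine"))
```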
 
Interjections
Unlike hesitation sounds, interjections are considered words and require no special markup. Use the standardized spellings listed in the table below.
Please pay particular attention to okay and uh-oh (not OK and %UH %OH).
 
 
mhm okay yeah
uh-huh whoa jeeze
uh-oh whew  

Overlapping Speech
Overlapping speech segments between two different speakers may occur, but a single speaker can never overlap with themselves. Please make sure that no two segments for the same speaker overlap.
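A same-speaker overlap can be detected mechanically once the segments are available as data. The sketch below assumes each segment is a (speaker, start, end) tuple with times in seconds; that format and the function name are hypothetical and not taken from the annotation tool. Overlaps between different speakers are allowed and are not reported.

```python
from collections import defaultdict


def same_speaker_overlaps(segments):
    """Return (speaker, earlier_segment, later_segment) for each overlap
    between two segments that belong to the same speaker."""
    by_speaker = defaultdict(list)
    for speaker, start, end in segments:
        by_speaker[speaker].append((start, end))

    problems = []
    for speaker, spans in by_speaker.items():
        spans.sort()
        latest = spans[0]  # the segment seen so far that ends last
        for current in spans[1:]:
            if current[0] < latest[1]:  # starts before that segment ends
                problems.append((speaker, latest, current))
            if current[1] > latest[1]:
                latest = current
    return problems


segments = [("A", 0.0, 4.0), ("B", 3.5, 6.0), ("A", 3.8, 7.0), ("A", 7.0, 9.0)]
print(same_speaker_overlaps(segments))
```

Segments that merely touch, such as one ending at 7.0 and the next starting at 7.0, are not counted as overlapping in this sketch.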