Guide to Conventions
For each file, we begin with an initial transcript created by an external transcription agency. The annotator's job is to produce a perfect verbatim (word-for-word) transcription of each file, complete with accurate timestamps and the appropriate markup according to these conventions.
Timestamping
Each file may contain a number of different speakers.
There may be a number of different turns inserted, but it is necessary to listen to the entire audio file. Breakpoints (time segments that do not indicate a new speaker) are inserted mainly for ease of transcription. These will typically coincide with breath groups or pauses in the speech, and may coincide with the ends of sentences or phrases. Breakpoints will typically occur every 3-8 seconds.
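As a rough self-check on the 3-8 second guideline, the sketch below flags segments that fall outside that range. It assumes breakpoints are available as a sorted list of times in seconds; the function name and the input format are hypothetical and not part of any annotation tool. A flagged segment is not necessarily wrong, since breakpoints should follow pauses and breath groups rather than a fixed clock.

```python
def flag_unusual_segments(breakpoints, min_len=3.0, max_len=8.0):
    """Return (start, end, duration) for each segment outside min_len..max_len.

    `breakpoints` is assumed to be a sorted list of times in seconds.
    """
    flagged = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        duration = end - start
        if duration < min_len or duration > max_len:
            flagged.append((start, end, duration))
    return flagged


# Hypothetical breakpoint times, in seconds.
times = [0.0, 4.5, 6.0, 17.0, 22.0]
print(flag_unusual_segments(times))
# [(4.5, 6.0, 1.5), (6.0, 17.0, 11.0)]
```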
Some things to consider when inserting breakpoints:
Speaker Identification
At the beginning of each transcript, the speaker should be given a unique identifier if the name is not present. The speaker's gender will also be indicated during the second pass. Personal information about the speakers (their names, where they live, etc.) that is present in the audio data will be distorted in the audio and in the transcripts during post-processing in order to protect participant privacy. To support this, sections of the transcripts in which personal information occurs must be identified. More on that later.
Orthography
We follow the general orthographic conventions (spelling) for English.
Some special situations:
Capitalization
Capitalization in our transcripts is used as an aid for human comprehension of the text. You should follow the accepted standard way to capitalize words, including words at the beginning of a sentence, proper names, and so on.
Numerals
Write out all numerals. Hyphenate numbers between twenty-one and ninety-nine only.
twenty-two
nineteen ninety-five
seven thousand two hundred seventy-five
nineteen oh nine
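The hyphenation rule can be illustrated with a small helper that spells out a whole number. This is only a sketch: the name `spell_out` and the 0 to 999,999 range are assumptions made for the example. It does not cover spoken year forms such as "nineteen ninety-five" or "nineteen oh nine", which should always be written the way the speaker says them.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def spell_out(n):
    """Spell out an integer from 0 to 999,999, hyphenating only 21-99."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0 to 999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        # The hyphen appears only here, between the tens word and the ones word.
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = f"{ONES[hundreds]} hundred"
        return head if rest == 0 else f"{head} {spell_out(rest)}"
    thousands, rest = divmod(n, 1000)
    head = f"{spell_out(thousands)} thousand"
    return head if rest == 0 else f"{head} {spell_out(rest)}"


print(spell_out(22))    # twenty-two
print(spell_out(7275))  # seven thousand two hundred seventy-five
```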
Abbreviations
When abbreviations are used as part of a personal title, they can remain as abbreviations:
Mr. Brown
Mrs. Jones
Dr. Spock
However, when they are used in any other context, write them out in full.
I went to the junior league game.
I'm going home to see the missus.
I went to the doctor, and all he said was, don't worry, it's natural.
Hey mister, do you know how to get to the stadium?
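One way to spot title abbreviations used outside a personal title is to look for "Mr.", "Mrs.", or "Dr." that is not followed by a capitalized word. The sketch below is a heuristic only, and the function name is hypothetical. It assumes a name always starts with a capital letter (optionally preceded by the `^` proper-noun symbol), so it will miss an abbreviation that happens to be followed by a capitalized non-name.

```python
import re

# A title abbreviation NOT followed by whitespace and a capitalized word.
TITLE = re.compile(r"\b(Mr|Mrs|Dr)\.(?!\s+\^?[A-Z])")


def misplaced_title_abbreviations(line):
    """Return title abbreviations that do not appear to precede a name."""
    return [m.group(0) for m in TITLE.finditer(line)]


print(misplaced_title_abbreviations("I went to the Dr. and saw Mr. Brown."))
# ['Dr.']
```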
Punctuation
Use standard English punctuation, limited to the following:
.
?
,
Do not use any additional punctuation, such as quotation marks, exclamation marks, colons, semicolons, dashes or ellipses.
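A quick check for the disallowed marks might look like the following sketch. The pattern covers only the marks named above (straight and curly quotation marks, exclamation marks, colons, semicolons, en and em dashes, double hyphens, and ellipses); the function name is hypothetical. Single hyphens and apostrophes are deliberately left alone because hyphenated words, partial words, and contractions use them.

```python
import re

FORBIDDEN = re.compile(r'["\u201c\u201d!:;\u2013\u2014\u2026]|\.{2,}|--')


def find_forbidden_punctuation(line):
    """Return every disallowed punctuation mark found in a transcript line."""
    return [m.group(0) for m in FORBIDDEN.finditer(line)]


print(find_forbidden_punctuation('Wait... he said, "no way!"'))
# ['...', '"', '!', '"']
print(find_forbidden_punctuation("Is that okay? Yes, it is."))
# []
```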
Contractions and apostrophe -s
Limit your use of contractions to those that exist in standard written English, and of course only when a contraction is actually produced by the speaker. The table below, while not comprehensive, illustrates what is considered standard written English with respect to contractions. (Note: Avoid the common mistakes of confusing the possessive its with the contraction it's (it is) and the possessive your with the contraction you're (you are).)
| Complete words | Contraction allowed | Contraction not allowed |
| I have | I've | |
| cannot | can't | |
| will not | won't | |
| you have | you've | |
| could not | couldn't | |
| we will | we'll | |
| should have | should've | |
| it is | it's | |
| Marvin - possessive | Marvin's | |
| going to | -- | gonna |
| want to | -- | wanna |
| she is | -- | she's |
| Marvin is | -- | Marvin's |
| Marvin has | -- | Marvin's |
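The disallowed forms in the table can be turned into a simple lookup, as in the sketch below. It includes only the entries from the table that can be identified by spelling alone; forms like "Marvin's" are omitted because the possessive is allowed and a script cannot tell the two uses apart. The names `DISALLOWED` and `flag_contractions` are hypothetical, and the suggested expansion must still be checked against the audio.

```python
import re

# Disallowed forms from the table above, mapped to their full spelling.
DISALLOWED = {
    "gonna": "going to",
    "wanna": "want to",
    "she's": "she is",
}
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def flag_contractions(line):
    """Return (form, suggested expansion) pairs for disallowed contractions."""
    return [(w, DISALLOWED[w]) for w in WORD.findall(line.lower())
            if w in DISALLOWED]


print(flag_contractions("She's gonna say it's fine, I can't wait."))
# [("she's", 'she is'), ('gonna', 'going to')]
```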
Hyphenated words and compounds
This is a tricky issue. In general, be conservative about use of hyphens. When in doubt, leave it out. For instance:
an overly complicated analysis not an overly-complicated analysis
However, in some cases, a hyphen is required:
anti-nuclear protests not anti nuclear protests
Consult your supervisors with questions.
Special Conventions
Some special conventions are used to indicate particular kinds of words.
| Condition | Symbol | Example | Description of symbol's use |
| Acronyms I | @ | @NATO | acronyms commonly pronounced as a single word |
| Acronyms II | ~ | ~FBI | acronyms commonly pronounced as series of letters |
| Individual letters | ~ | ~I before ~E except after ~C | Individual letters spelled out |
| Proper Nouns | ^ | ^Mitsubishi | Proper names, place names, organization names should be marked, and segmented for SBCSAE |
| Partial words | - | absolu- | Speaker-produced partial words are indicated with a dash. Transcribe as much of the word as you hear. |
| Mispronounced words | + | +probably | Mispronounced word (a speech error). NOTE: Do not use this symbol to indicate non-standard but common regional or social dialect pronunciations, such as "gonna" for "going to". Transcribe these variants using normal standard orthography. |
| Idiosyncratic words | * | *poodleish | Speaker uses a "made-up" word. |
| Speaker noise | { } | {breath} {cough} {laugh} | Sounds made by the talker. Limited to these five sounds |
| Background noise (instantaneous) | [background] | [background] | Short sound not made by the talker (in the background). |
| Background noise (extended) | [background/] [/background] | [background/] [/background] | Indicates start/end of a longer non-speaker background noise. |
| Distortion (instantaneous) | [distortion] | [distortion] | Brief period of distortion in the audio |
| Distortion (extended) | [distortion/] [/distortion] | [distortion/] [/distortion] | Indicates start/end of a longer period of distortion in the audio |
| Semi-intelligible speech | ((text)) | ((the software design)) | This is the transcriber's best attempt at transcribing a difficult passage |
| Unintelligible speech (single token) | (( )) | (( )) | Used if a single word or short phrase is completely unintelligible. |
| Unintelligible speech (extended) | [[skip]] | [[skip]] | Used to indicate a long period of unintelligible speech. This mark should receive a separate timestamp. |
| Foreign language | <language text> | <French merci> | This is used to indicate foreign speech. If the foreign word is unknown, merely write the language. If the language is unknown, consult your supervisors. NOTE: Do not use this convention for common foreign language borrowings into English, such as "..." |
| Non-lexemes | % | %ah | Non-word hesitation sounds. Limited to the list below. |
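To see how the symbols in the table combine in a real line, the sketch below scans a transcript line and labels each piece of markup it recognizes. The category labels and the `find_markup` function are hypothetical names chosen for this example, and the patterns are an approximation of the table rather than an official grammar. Note that `~` covers both letter-by-letter acronyms and individual letters, so the two share one label.

```python
import re

# (label, pattern) pairs; earlier entries win when two could match.
MARKUP = [
    ("unintelligible", r"\(\(\s*\)\)"),
    ("semi_intelligible", r"\(\([^()]+\)\)"),
    ("skip", r"\[\[skip\]\]"),
    ("noise_or_distortion", r"\[/?(?:background|distortion)/?\]"),
    ("speaker_noise", r"\{[a-z]+\}"),
    ("foreign", r"<[A-Za-z]+(?: [^<>]+)?>"),
    ("acronym_word", r"@\w+"),
    ("spelled_letters", r"~\w+"),
    ("proper_noun", r"\^\w+"),
    ("mispronounced", r"\+\w+"),
    ("idiosyncratic", r"\*\w+"),
    ("non_lexeme", r"%\w+"),
    ("partial_word", r"\b\w+-(?!\w)"),
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in MARKUP))


def find_markup(line):
    """Return (label, text) for each piece of special markup in a line."""
    return [(m.lastgroup, m.group(0)) for m in SCANNER.finditer(line)]


line = "%um the ~FBI called ^Mitsubishi absolu- {laugh} ((the software design))"
for label, text in find_markup(line):
    print(label, text)
```

Ordinary hyphenated words such as uh-oh are not reported as partial words, because the partial-word pattern requires the hyphen to end the token.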
Non-lexemes
Indicate filled pauses using a standardized spelling, shown in the table below.
(*If you believe a speaker uses a word that does not appear on this list, let us know.)
| %ach | %eh | %hee | %huh | %oh | %aw |
| %ah | %ew | %huh | %um | %er | |
| %eee | %ha | %hm | %uh | %ooh | |
Please pay particular attention to okay and uh-oh (not OK and %UH %OH).
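A transcript can be checked against the standardized list with a small lookup, sketched below. The set is copied from the table above; the function name is hypothetical. The comparison is case-sensitive, so a token such as %UH is reported even though %uh is on the list.

```python
import re

# Standardized non-lexeme spellings, copied from the table above.
NON_LEXEMES = {
    "%ach", "%eh", "%hee", "%huh", "%oh", "%aw",
    "%ah", "%ew", "%um", "%er",
    "%eee", "%ha", "%hm", "%uh", "%ooh",
}


def unknown_non_lexemes(line):
    """Return %-marked tokens that are not on the standardized list."""
    return [t for t in re.findall(r"%\w+", line) if t not in NON_LEXEMES]


print(unknown_non_lexemes("%um I think %UH %hmm okay"))
# ['%UH', '%hmm']
```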
Interjections
Unlike hesitation sounds, interjections are considered words and require no special markup. Use the standardized spellings listed in the table below.
| mhm | okay | yeah |
| uh-huh | whoa | jeeze |
| uh-oh | whew | |