Guidelines for RT-03 Transcription

Official RT-03 Transcription Guidelines in .pdf format



RT-03 Web Guidelines for Annotators





Your goal is to provide an accurate, verbatim (word-for-word) transcript of the entire broadcast.  The transcript will be time-aligned with the audio file.

The Transcription Process

1) Segmentation
The segmentation process creates initial timestamps for the audio file.  Timestamps indicate when different things are happening in the audio, and so allow us to align the transcript with the corresponding audio file. Timestamps also make transcription of the audio easier, by allowing the transcriber to listen to small chunks of segmented speech at a time.

Segment boundaries, or timestamps, must occur at regular intervals within each audio file.   Segment boundaries, at a minimum, must identify

    • story or section boundaries within a news broadcast
    • speaker turns (change of speaker)
In addition, you may insert several additional breakpoints within each speaker's turn, especially if the turn is lengthy.  This will help break up long turns into more manageable units, and make transcription easier.

Some things to keep in mind while doing segmentation:

Section and turn boundaries are easy to detect, and timestamps must be inserted at these points.
Breakpoints, or timestamps within a speaker's turn, will typically coincide with breath groups or pauses in the speech, and may coincide with the ends of sentences or phrases.  Breakpoints should typically be inserted every 3-8 seconds.

Some things to consider when inserting breakpoints:

  • Breakpoints should never occur in the middle of a word
  • Be careful not to clip off the end/beginning of a word when inserting a breakpoint
    • This is trickiest with certain sounds, like "s", "sh", "f", "th".  Take special care when inserting breakpoints around words that begin or end with these sounds.
  • Good places to insert breakpoints are
    • at pauses
    • at breaths
    • at ends of sentences
  • If you encounter a period of silence which lasts more than .5 seconds, insert a breakpoint.
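
The 3-8 second guideline for breakpoints can be checked mechanically once the timestamps are in place. The following is a minimal sketch, not part of the transcription tools: the function name and the 8-second default are illustrative assumptions, and it is only meaningful for the timestamps inside a single speaker turn.

```python
from typing import List, Tuple

MAX_SEGMENT_SECONDS = 8.0  # upper end of the suggested 3-8 second range (assumed default)

def overlong_segments(timestamps: List[float],
                      max_len: float = MAX_SEGMENT_SECONDS) -> List[Tuple[float, float]]:
    """Return (start, end) pairs of consecutive timestamps more than max_len seconds apart."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > max_len]

# Timestamps within one turn, in seconds from the start of the audio file.
print(overlong_segments([25.907, 31.105, 42.0]))
```

Any pair it reports is a stretch where an additional breakpoint at a pause or breath would probably help.
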
Beginning of a section
Because section boundaries are presumed to begin with a speaker turn, it is not necessary to insert a turn boundary directly after a section boundary.  For instance:
<sf 21.232> <<male, Lou_Waters>>
The last great explorer ^Jacques ^Cousteau has died in ^Paris at age eighty-seven.
<t 25.907> <<female, Natalie_Allen>>
{breath} Part of Early Prime is being preempted so that for the next half hour we can remember one of the giants
<b 31.105>
of the twentieth century. Hello, I'm ^Natalie ^Allen.
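
The boundary lines in the example above follow a regular pattern: a label, a time in seconds, and optionally a speaker description in double angle brackets. A minimal sketch of reading such a line is shown here; the regular expression and the `Boundary` type are assumptions made for illustration, not part of any LDC tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: <label seconds>, optionally followed by <<speaker info>>.
BOUNDARY_RE = re.compile(r"^\s*<(\w+)\s+(\d+(?:\.\d+)?)\s*>(?:\s*<<(.*?)>>)?\s*$")

class Boundary(NamedTuple):
    label: str              # e.g. "sf", "t", "b"
    time: float             # offset in seconds from the start of the audio
    speaker: Optional[str]  # raw text between << and >>, if present

def parse_boundary(line: str) -> Optional[Boundary]:
    """Return a Boundary for a timestamp line, or None for a line of transcript text."""
    m = BOUNDARY_RE.match(line)
    if m is None:
        return None
    return Boundary(m.group(1), float(m.group(2)), m.group(3))

print(parse_boundary("<sf 21.232> <<male, Lou_Waters>>"))
print(parse_boundary("<b 31.105>"))
```

Lines of transcript text do not match the pattern, so the function returns None for them.
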
End of a turn
If the end of one speaker's turn is directly followed by the start of another speaker's turn, there is no need to specifically label or timestamp the end of the first speaker's turn.  If a speaker's turn is followed by a period of non-speech (music, sound effects or silence), then you must explicitly timestamp and label the end point of the speaker's turn with <e>.

End of a section
If the end of a section is directly followed by the start of another section, there is no need to specifically label or timestamp the end of the first section.  If the section is followed by a period of non-speech (music, sound effects or silence), then you must explicitly timestamp and label the end point of the section with <e>.

End of the file
Each file must end with a final timestamp, indicating where the audio recording for that program concludes.  This timestamp should be labeled with <e> to indicate end.

Overlapping speech
A special subclass of breakpoints marks the beginning and end points of overlapped speech; that is, periods of the recording where there are multiple speakers talking at once.  Use the notation <o> to mark the beginning of the overlapping speech section.  The <e1> or <e2> label indicates the end of the period of overlap.  <e1> is used when speaker one stops talking while speaker two continues; <e2> is used when speaker two stops talking while speaker one continues.  If both speakers stop talking at the same time and a third person begins talking, a new <t> turn label or <sx> section label should be used.  Instructions for transcribing overlapping speech appear below.

Multiple speakers
In situations where several people are speaking at once and it is very difficult to make them out, insert an <e> tag at the start of the difficult section. Then start the new turn <t> at the next region of clear, discernible speech.

   <t 223.456> <<male, speaker_12>>
        <b 225.678>
        <e 230.302>
     ((region of multiple speakers, impossible to transcribe))
        <t 232.563> <<female, speaker_13>>

Speakers start simultaneously
When speakers start talking simultaneously, create start times for the speakers that are about one tenth of a second (or less) apart, and use the <o> overlap tag for the second speaker's turn. For instance,

   <t 1123.176> <male, Jacques_Cousteau>
        You know,
        <o 1123.276> <female, speaker_2>
        SPEAKER1:I understand {laugh}
        SPEAKER2:well,
        <e1 1124.256>
        oceanography is new exploration and we're not...

Extended periods of non-speech
For an extended (more than 5 seconds) period of silence, music or other non-speech, insert an <e> tag at the start of the non-speech section. Then start the new turn <t> at the next region of speech.  For example,

   <t 223.456> <<male, speaker_1>>
        <b 225.678>
        <e 230.302>
         ((region of silence, sound effects or music))
        <t 236.563> <<female, speaker_3>>

2) Section Labels

In addition to providing timestamps, you must also label each section, turn or breakpoint with the appropriate label.

There are three types of section boundaries:

  • <sr> refers to news reports
  • <sf> refers to fillers
  • <sn> refers to non-news sections, including commercials
These three sections are defined in detail here.

The table below summarizes the segment labels you will use.

      Label | Meaning
      sr | start of <section type=report> news story section
      sf | start of <section type=filler> filler section
      sn | start of <section type=non-news> non-news section: commercials, etc.
      t | start of (non-initial) speaker turn within section
      b | turn interval breakpoints
      e | end of turn within section, followed by a non-speech region
      o | start of overlap region (speaker one is interrupted by speaker two)
      e1 | end of overlap where speaker one stops and speaker two continues
      e2 | end of overlap where speaker two stops and speaker one continues

3) Speaker Identification
In addition to identifying segment boundaries and timestamping them, you must also identify all of the speakers within a broadcast.  If you are unable to determine the name of a speaker, you must assign that speaker a unique identification, and use the same speaker ID throughout the transcript file.  You must also identify the speaker type as one of:

  • Female
  • Male
  • Child
  • Other (used for speakers in unison, altered voices, etc.)

Names and Identifiers
Whenever possible, include the proper name of the speaker. Examples of proper names include Jacques_Cousteau, William_Cohen, and Madeleine_Albright.    You must use the same spelling of proper names within and across all broadcast files.

If a speaker is not identified by name within a recording, a unique numerical index is used.  Unnamed speakers are divided into Reporter and Speaker.  Reporter is used for news anchors, interviewers, or reporters on the scene of a story. Speaker refers to anyone else who is not identified by name.  The numerical IDs for Reporter and Speaker cannot overlap; each successive anonymous speaker has a unique number, regardless of the category the speaker is assigned to. For example, the following sequence is entirely possible:

        reporter_1
        reporter_2
        speaker_3
        speaker_4
        reporter_2 (assuming it is the same voice as the previous reporter_2)
        reporter_6 (a new reporter distinct from the two above)
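
One way to keep Reporter and Speaker numbers from overlapping is to draw both from a single counter. The sketch below assumes consecutive numbering, which the guidelines do not require (they only require that each anonymous speaker's number be unique). The class name and the `voice_key` argument are hypothetical; `voice_key` stands for whatever private note you use to recognize a returning voice.

```python
from typing import Dict

class AnonymousSpeakerIds:
    """Hands out reporter_N / speaker_N identifiers from one shared counter."""

    def __init__(self) -> None:
        self._next = 1
        self._known: Dict[str, str] = {}

    def id_for(self, category: str, voice_key: str) -> str:
        """Return the ID for a voice, creating one the first time the voice is heard."""
        if category not in ("reporter", "speaker"):
            raise ValueError(f"unknown category: {category}")
        if voice_key not in self._known:
            self._known[voice_key] = f"{category}_{self._next}"
            self._next += 1
        return self._known[voice_key]

ids = AnonymousSpeakerIds()
print(ids.id_for("reporter", "studio anchor"))
print(ids.id_for("speaker", "eyewitness"))
print(ids.id_for("reporter", "studio anchor"))  # same voice, same ID as before
```
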

Native and non-native speakers
In addition to indicating speaker type and name/ID, you must also indicate when a speaker is a non-native speaker.  In English broadcast news, native is defined as a speaker of any North American English dialect. As native is the default, you do not need to explicitly mark this. Non-native is used for speakers of other dialects of English, including British English or Indian English; non-native is also used to indicate people who are not native English speakers and have a discernible foreign accent.

Examples of speaker identifications

         <sr 1.402> <<male, Leon_Harris>>
         <sr 158.244> <<female, Joie_Chen>>
         <t 196.813> <<male, speaker_1>>
         <t 498.314> <<female, non-native, speaker_3>>
         <t 567.215> <<male, altered, speaker_4>>
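
The speaker descriptions in these examples can be read as a comma-separated list: the speaker type first, the name or ID last, and any extra markers (such as non-native or altered) in between. The sketch below works under that assumption; the `Speaker` type and its field names are illustrative only.

```python
from typing import NamedTuple, Tuple

SPEAKER_TYPES = {"female", "male", "child", "other"}

class Speaker(NamedTuple):
    speaker_type: str
    native: bool            # native is the default and is never written explicitly
    flags: Tuple[str, ...]  # any other markers between type and name, e.g. "altered"
    name: str

def parse_speaker(descriptor: str) -> Speaker:
    """Parse text such as '<<female, non-native, speaker_3>>'."""
    inner = descriptor.strip()
    if not (inner.startswith("<<") and inner.endswith(">>")):
        raise ValueError("expected <<...>> around the speaker fields")
    fields = [f.strip() for f in inner[2:-2].split(",")]
    if len(fields) < 2 or fields[0].lower() not in SPEAKER_TYPES:
        raise ValueError(f"unrecognized speaker descriptor: {descriptor!r}")
    middle = fields[1:-1]
    return Speaker(
        speaker_type=fields[0].lower(),
        native="non-native" not in middle,
        flags=tuple(f for f in middle if f != "non-native"),
        name=fields[-1],
    )

print(parse_speaker("<<female, non-native, speaker_3>>"))
```
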
 



In Progress

4) Transcription
Once a file has been fully segmented and the speakers identified, it must be transcribed.  Annotators must produce a verbatim (word-for-word) transcript of everything that is said within the file.  The words transcribed within each segment boundary must correspond exactly to the timestamps that have been created, so that the audio file is aligned with the transcript.

Transcription Conventions

Capitalization
Capitalization in our transcripts is used to aid human comprehension of the text. You should follow accepted standard written capitalization patterns, and capitalize words at the beginning of a sentence, proper names, and so on.

Orthography and spelling
Transcribers should use standard orthography, word segmentation and word spelling.  All files must be spell-checked after transcription is complete.  When in doubt about the spelling of a word or name, transcribers should consult a standard reference, like an online or paper dictionary, world atlas or news website.

Punctuation
Transcribers should use standard punctuation for ease of transcription and reading.   Acceptable punctuation is limited to periods and question marks at the end of a sentence, and commas within a sentence.  Write the punctuation as you normally would in standard written English (with no additional spaces around the punctuation marks).

DO NOT use quotation marks, exclamation marks, colons, semicolons, dashes or ellipses in transcribing.  If you encounter these symbols in an existing transcript, you must remove them.
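
When cleaning up an existing transcript, a simple scan can point out the forbidden marks. This is a rough sketch under stated assumptions: it looks only for quotation marks, exclamation marks, colons, semicolons and ellipses, and it skips a leading SPEAKER1: style prefix used for overlapping speech. Dashes are deliberately not checked, because - and -- have their own meanings in the transcription symbols.

```python
import re
from typing import List

# Three dots in a row, or any single forbidden character.
FORBIDDEN = re.compile(r'\.{3}|[!:;"…]')
# Overlap lines begin with a prefix such as "SPEAKER1:"; that colon is not punctuation.
OVERLAP_PREFIX = re.compile(r"^\s*SPEAKER\d+:")

def forbidden_punctuation(text: str) -> List[str]:
    """List each forbidden punctuation mark found in one line of transcript text."""
    body = OVERLAP_PREFIX.sub("", text)
    return FORBIDDEN.findall(body)

print(forbidden_punctuation('He said: "stop!"'))
```
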

Abbreviations
Avoid word abbreviations whenever possible; instead, spell out the word in full.  When abbreviations are used as part of a personal title, they can remain as abbreviations:

       Mr. Brown
       Mrs. Jones
       Dr. Spock

However, when they are used in any other context, write them out in full.

          I went to the junior league game.
          I went to the doctor, and all he said was, don't worry, it's natural.
          Hey mister, do you know how to get to the stadium?

Hyphenated words and compounds
In general, be conservative about use of hyphens.  For instance:

          an overly complicated analysis not an overly-complicated analysis

However, in some cases, a hyphen is required:

       anti-nuclear protests not anti nuclear protests

Compounds can be tricky.  When in doubt, consult a dictionary and talk to your team leader.

Numbers
Write out all numerals as words. Hyphenate numbers between twenty-one and ninety-nine only.

             twenty-two
             nineteen ninety-five
             seven thousand two hundred seventy-five
             nineteen oh nine
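
The hyphenation rule can be illustrated with a small number speller. This is only a sketch of the rule, not a substitute for listening: you must still write what the speaker actually says (a year spoken as "nineteen ninety-five" is written that way, not as a cardinal number). The function name and the supported range are assumptions.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0 <= n <= 999999, hyphenating only twenty-one through ninety-nine."""
    if not 0 <= n <= 999_999:
        raise ValueError("supported range is 0 to 999999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = f"{ONES[hundreds]} hundred"
        return head if rest == 0 else f"{head} {number_to_words(rest)}"
    thousands, rest = divmod(n, 1000)
    head = f"{number_to_words(thousands)} thousand"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"

print(number_to_words(22))
print(number_to_words(7275))
```
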

Contractions and apostrophe -s / Contractions, abbreviations and compound words  **NEEDS WORK**
Limit your use of contractions to those that exist in standard written English, and of course only when a contraction is actually produced by the speaker.  Take care to transcribe exactly what the speaker says.  The table below, while not comprehensive, shows some examples of how to transcribe common contractions.
 
Complete Form | Spoken As | Transcribed As | Incorrect
I have | I've | I've |
cannot | can't | can't |
will not | won't | won't |
you have | you've | you've |
could not | couldn't | couldn't |
should have | should've | should've | should of, shoulda
it is | it's | it's |
Marvin (possessive) | Marvin's | Marvin's |
Marvin is | Marvin's | Marvin's |
Marvin has | Marvin's | Marvin's |
going to | gonna | going to | gonna
want to | wanna | want to | wanna
got to | gotta | got to | gotta

Note: Avoid the common mistakes of confusing possessive its with the contraction it's (it is), possessive your with the contraction you're (you are), and possessive their with they're (they are) and there.

Transcribe exactly what you hear using standard orthography.  If a speaker uses a contraction, transcribe the word as contracted: they're, won't, isn't, don't and so on.  If the speaker uses a complete form, transcribe what you hear: they are, is not and so on.

For non-standard contractions like "gonna" and "wanna" spell out the entire word: going to, want to.  If you are unsure about whether a contraction is standard or non-standard, talk to your team leader.
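
If a draft transcript already contains non-standard contractions, they can be expanded with a small substitution table. The sketch below covers only the three forms listed in the table above (gonna, wanna, gotta); the function name is illustrative, and any other form should be checked with your team leader rather than added by guesswork.

```python
import re

# Only the non-standard forms named in these guidelines.
NONSTANDARD = {"gonna": "going to", "wanna": "want to", "gotta": "got to"}
_PATTERN = re.compile(r"\b(" + "|".join(NONSTANDARD) + r")\b", re.IGNORECASE)

def expand_nonstandard(text: str) -> str:
    """Replace gonna/wanna/gotta with their full forms, keeping initial capitals."""
    def repl(match: "re.Match[str]") -> str:
        full = NONSTANDARD[match.group(1).lower()]
        return full.capitalize() if match.group(1)[0].isupper() else full
    return _PATTERN.sub(repl, text)

print(expand_nonstandard("I'm gonna go, I wanna see it."))
```

Standard contractions such as won't and they're are left untouched.
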

Disfluent speech **NEEDS WORK**
Regions of disfluent speech are particularly difficult to transcribe.  Speakers may stumble over their words, repeat themselves, utter partial words, restart phrases or sentences, and use lots of hesitation sounds.  Take particular care in sections of disfluency to transcribe exactly what you hear.

Filled Pauses/Hesitation Sounds
Filled pauses are non-lexemes (non-words) that speakers employ to indicate hesitation or to maintain control of a conversation while thinking of what to say next.  Each language has a limited set of filled pauses that speakers can employ.    Use the standardized spellings shown in the table below for filled pauses.  Don't alter the spelling to reflect how the speaker pronounces the word (e.g., typing AH for a loud "ah" or hmmmmmmm for a long "hmm".)  For English, this set includes ah, eh, er, uh, um.

If you believe a speaker uses a word or sound as a filled pause or hesitation marker, and the word does not appear on this list, let your team leader know.  All filled pauses are indicated with a % sign preceding the word.
 
English Filled Pauses
%ah
%eh
%er
%uh
%um
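
A quick check can catch filled pauses that were typed with a non-standard spelling. This sketch assumes tokens are separated by spaces and that only commas, periods and question marks can be attached to them; the function name is illustrative.

```python
from typing import List

FILLED_PAUSES = {"%ah", "%eh", "%er", "%uh", "%um"}

def unknown_filled_pauses(text: str) -> List[str]:
    """Return tokens marked with % that are not on the English filled-pause list."""
    tokens = (t.strip(",.?") for t in text.split())
    return [t for t in tokens if t.startswith("%") and t not in FILLED_PAUSES]

print(unknown_filled_pauses("We're %uh waiting for %hmm that report, %um."))
```

Anything it reports is either a misspelling to correct or a new hesitation sound to raise with your team leader.
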

Partial Words
Use - to indicate the point at which a word was broken off.

Mispronounced Words
Use + symbol for obviously mispronounced words (not regional or non-standard dialect pronunciation).

Hard-to-understand Words
Sometimes you will encounter a section of speech that is difficult or impossible to understand.  In these cases, you should use the (( )) symbol to mark the region of difficulty.

If you have some idea of the speaker's words but aren't entirely sure, type what you think you hear and surround the stretch of uncertain transcription with double parentheses:

((here you type what you think you hear but aren't sure))
If you're truly mystified and can't at all make out what the speaker is saying, don't type anything, and use empty double parentheses to surround the untranscribed region.  If possible, this untranscribed region should get its own timestamp.

Idiosyncratic words
Occasionally a speaker will make up a new word on the spot.  These are not the same as slang words; they're words that are unique to the speaker in that conversation.  If you encounter an idiosyncratic word, transcribe it to the best of your ability and mark it with a * symbol.

Proper nouns
We mark all proper nouns, including personal names, place names and the like, with a  ^ symbol.  If the name contains more than one word, mark all words in the name with the symbol.

Interjections
Please use these standardized spellings for interjections.  Interjections *do not* require any special symbol.  If you encounter an interjection that does not appear on this list and are unsure how to spell it, notify your team leader.
 
 ach
 eee
 ew
 ha
 hee
 huh
 hm
 jeez
 mhm
 oh
 okay
 ooh
 uh-huh
 uh-oh
 whoa
 whew
 yeah

Summary of special symbols
Condition | Symbol | Example | Description of symbol's use
Individual letters | ~ | ~I before ~E except after ~C | Individual letters spelled out, each with ~
Partial words | - | absolu- | Speaker-produced partial words are indicated with a dash.  Transcribe as much of the word as you hear.
Mispronounced words | + | +probably | Mispronounced word (a speech error).  NOTE: Do not use this symbol to indicate non-standard but common regional or social dialect pronunciations, such as "gonna" for "going to". Transcribe non-standard pronunciation variants using normal standard orthography.
Idiosyncratic words | * | *poodleish | Speaker uses a "made-up" word.  NOTE: Some speakers may use non-standard dialect words which don't constitute idiosyncratic words.  If you're unsure, consult your team leader.
Speaker restart | -- | I thought he -- I thought he was there. | Used when the speaker stops short and then repeats themselves or abandons the utterance completely, restarting with a new sentence.
Speaker noise | { } | {breath} {cough} {laugh} {sneeze} {lipsmack} | Sounds made by the talker.  Limited to these five sounds.
Semi-intelligible speech | ((text)) | ((they lived next door to us)) | This is the transcriber's best attempt at transcribing a difficult passage.
Unintelligible speech (single token) | (( )) | (( )) | Used if a single word or short phrase is completely unintelligible.
Foreign language | <language text> | <French merci> | This is used to indicate foreign speech.  If the foreign word is unknown, merely write the language.  If the language is unknown, consult your team leader.  NOTE: Do not use this convention for common foreign language borrowings into English, such as XX
Punctuation | , . ? | | Limited to end-of-sentence marks and commas
Numbers | | twenty-five, one hundred and six | Written in full
Non-standard contractions | | going to, want to | Spell out in full; do not write gonna, wanna
Proper names | ^ | ^Jacques ^Cousteau | Mark every word in the name with ^
Interjections and non-lexemes | no special markup | uh-huh, mhm, yeah, uh-oh, whoa |
Filled pauses | % | %ah, %eh, %er, %uh, %um | Limited to this list
Pronounced acronyms | @ | @NAFTA, @AIDS |
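
The single-word symbols in the summary can be told apart by their first or last character. The sketch below classifies one space-separated token at a time under that assumption; the category names are illustrative, and multi-word markup such as ((text)) or <French merci> is outside its scope.

```python
PREFIXES = {
    "~": "individual letter",
    "+": "mispronounced word",
    "*": "idiosyncratic word",
    "^": "proper noun",
    "%": "filled pause",
    "@": "pronounced acronym",
}

def classify_token(token: str) -> str:
    """Name the special symbol, if any, carried by one space-separated token."""
    word = token.strip(",.?")  # drop attached sentence punctuation
    if word == "--":
        return "restart"
    if word.startswith("{") and word.endswith("}"):
        return "speaker noise"
    if word[:1] in PREFIXES:
        return PREFIXES[word[:1]]
    if len(word) > 1 and word.endswith("-"):
        return "partial word"
    return "plain"

for tok in "I thought he -- I saw ^Jacques absolu- %um, {laugh}".split():
    print(tok, "->", classify_token(tok))
```
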

Some general considerations
Do not try to correct grammatical errors: "I seen him" should be transcribed as spoken, not corrected to "I saw him".  The same goes for misused words: transcribe what is spoken, not what you expect to hear.

Don't try to imitate a speaker's non-standard pronunciation.  Use standard spelling for non-standard pronunciations.  Obviously mispronounced words (as opposed to non-standard pronunciations) should be marked with the special + symbol.  When in doubt, consult your team leader.


    -----------------------------------

    Of all of the sections, you should only transcribe those that are reports, "sr" (section=report), including weather, or filler material, "sf" (section=filler). There are several things which should not be transcribed:
    Commercials
    Material repeated between broadcasts.
    Anything too "difficult" to understand: if you have to listen to a passage more than 4 times in order to understand anything, it is probably too difficult to transcribe
    Anything obscured by heavy distortion or overwhelming background noise

    If you skip any portion of the broadcast, you should provide a time-stamp of the
    skipped speech portion (even if it is a minute long). Use the notation "sn" (section
    non-transcribed) to designate sections that fall into the categories above. Generally,
    "sn"

       <sn 323.08>

    Furthermore, if the material is marked as "sn" because it is a repeat of material found
    elsewhere in the transcripts, add the notation [[repeat]] after the "sn". If you
    happen to know the other source for the repeated material, include that information
    (file id, timestamp(s) if you know it) after the [[repeat]]:

       <sn 323.08>
       [[repeat]]
       <sn 156.997>
       [[repeat sv970613d at time 708.388 to 840.328]]

    For the sections marked as <sn> you should not provide any transcription.

    If you have any questions about this, please consult your language leader.
     

      Overlapping speech

    • During interruptions, <o> and a timestamp are used to indicate the start of the overlapping speech region corresponding to the interrupting speaker.
    • An overlapping speech region is determined by overlapping word boundaries, rather than by the exact point in the waveform, which might require splitting words.

                <t 1122.443> <male, Jacques_Cousteau>
                about the same thing I
                <o 1123.276> <female, speaker_2>
                SPEAKER1:understand {laugh}
                SPEAKER2:well,
                <e1 1124.256>
                oceanography is new exploration and we're not

    • <e1> indicates that the first speaker has now stopped, while the second speaker has continued to speak. If both speakers stop at the same point in time, the next speaker turn is indicated by a <t>. If the overlap ends with a non-speech section (silence, music, etc.), mark the beginning of the non-speech section with a terminating <e> and timestamp.


      The [[NS]] tag can be used when there is an area within a turn that has no speech within it, such as a musical interruption or extended background noise.

      <b 123.456>
      The crowd was furious.
      <b 124.567>
      [[NS]]
      <b 128.987>
      Calm was soon restored by the arrival of the riot police.
       
    • Indicate disfluencies by using a hyphen to mark partial words, and transcribe pause fillers, e.g.

         We're jus- just waiting for that %uh tha- that report to to come in.

    • Transcribe standard English contractions as they're spoken: they're, won't, isn't, don't, etc.
    • For non-standard contractions like "gonna" and "wanna", spell out the entire word: going to, want to.
    • Identify extended non-speech sections (music, dead air, sound effects) with <e> and a timestamp at the beginning of the section, followed by <t> and a timestamp when the speech resumes, e.g.

      <t 148.57>
      Gunfire filled the air.
      <e 154.50>
      <t 170.89>
      That sound greeted early morning visitors on Tuesday.
     
    Checking and separation of unintelligible (( )) speech
        <t 859.405> <<male>>
        Not only do we methodically destroy the coastal fringe
        <b 863.598>
        but we also throw back our toxic *1(( ))*2 directly in the sea
        <b 868.453>
        or under the sea when we feel ashamed.

        i) Use Ctrl-c s to "send" the segment to the waveform window. Then find the section which cannot be understood and isolate it as you would if you were placing breakpoints.
        ii) Return to the text area, and place the cursor in front of the brackets (*1).
        iii) Hold Alt and the middle mouse button (M2), drag the cursor over the (( )) region, and release M2 (*2). The region is not highlighted to show that it is selected, but if the selection succeeded you will be prompted to apply the change.

    Syntax Checking
    Common messages include:

  • time-stamp without text data?
     - The timestamp does not contain corresponding transcript data.
  • time-stamp follows non-empty line
     - An empty line should follow each transcribed timestamp.
  • turn should be on single line
     - Only one turn is permitted on each line.
  • <English ...> should not be inside (( ))
     - Foreign speech should not be contained within "guess" brackets.
  • <English (()) > has no text content
     - Rather, the "guess" should be contained within the foreign language bracket.
  • closing angle (`>') should be followed by space
     - Self-evident.
  • bracket error with '[]'
     - There may be a number of possible causes.
  • bracket error with '()'
     - There may be a number of possible causes.
  • punctuation should be inside `((...))'
     - If absolutely necessary, punctuation should go inside these brackets; in most cases, no punctuation will be necessary.
  • punctuation should be inside `<... >'
     - If there is punctuation immediately outside a <...> bracket, place it inside the bracket.
  • bad spacing around punctuation `.'
     - There should be a space after the punctuation mark.
  • bad spacing around punctuation `?'
     - There should be a space after the punctuation mark.
  • closing paren (`))') should be followed by space
     - Self-explanatory.
  • turn contains ILLEGAL CHARACTER `!'
     - Some characters are not allowed within the text, for instance exclamation points. Please let your language leader know when you come across this error warning.
  • digits found in text
     - There should not be any numerals in the text outside of the timestamps.
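
To give a feel for what a few of these messages are reacting to, here is a rough re-creation of three of the checks. It is a sketch only and is not the actual syntax checker: the function name, the regular expressions and the decision to skip timestamp lines are all assumptions, and the message texts are simply copied from the list above.

```python
import re
from typing import List

# Lines that begin with a boundary tag such as <b 31.105> are not transcript text.
TIMESTAMP_RE = re.compile(r"^\s*<\w+\s+\d+(?:\.\d+)?\s*>")

def check_line(line: str) -> List[str]:
    """Return warning messages for one line of transcript text."""
    if TIMESTAMP_RE.match(line):
        return []
    warnings = []
    if re.search(r"\d", line):
        warnings.append("digits found in text")
    if "!" in line:
        warnings.append("turn contains ILLEGAL CHARACTER `!'")
    for mark in ".?":
        # The mark must be followed by whitespace, a closing bracket, or the end of the line.
        if re.search(re.escape(mark) + r"(?![\s)>]|$)", line):
            warnings.append(f"bad spacing around punctuation `{mark}'")
    return warnings

print(check_line("He won 3 games!"))
print(check_line("It stopped.Then it rained."))
```
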