The CF_Farsi goal is to transcribe a "good" fifteen minutes (900 seconds) from each call.

Quick Reminders:

  • A: corresponds to the local channel (the lower waveform window).
  • B: corresponds to the remote channel (the top waveform window).
  • A "turn" has a speaker channel identification and a beginning and end timestamp.

  • The insertion of "breakpoints" has the same appearance as a new speaker turn. Breakpoints can be inserted wherever they seem convenient to the transcriber, but they should occur at the natural boundaries of speech, such as pauses, breaths, etc. Do not insert a breakpoint (timestamp) in the middle of a word! Each timestamp has both a start and an end point, and neither point can overlap a previous timestamp of the same speaker.

    Transcribers can, and should, skip over those parts of a conversation that are "difficult". As a rule of thumb, "difficult" means:

  • More than one or two portions of overlapping speech in a row.
  • Heavy distortion or overwhelming background noise over a portion of the conversation.
  • If you have to listen to a passage more than 4 times in order to understand the content, it is probably too difficult to transcribe.

    If any portion of the conversation is skipped, timestamp the skipped portion (even if it is a minute long) and add the notation "[[skip]]" on the line following the timestamp, with a single space.
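    The timestamp rule above (each turn has a start and an end point, and a speaker's timestamps may not overlap) can be checked mechanically. The sketch below is only an illustration: the (channel, start, end) tuple layout and the function name find_overlaps are assumptions, not part of the transcription tool or its file format.

```python
def find_overlaps(turns):
    """Return (earlier, later) pairs of same-channel turns that overlap.

    `turns` is a list of (channel, start_sec, end_sec) tuples. This layout
    is a hypothetical stand-in for the real transcript format.
    """
    latest = {}    # channel -> turn with the latest end time seen so far
    problems = []
    for turn in sorted(turns, key=lambda t: (t[1], t[2])):
        channel, start, end = turn
        prev = latest.get(channel)
        # A turn may begin exactly where the previous one ended, not before.
        if prev is not None and start < prev[2]:
            problems.append((prev, turn))
        if prev is None or end > prev[2]:
            latest[channel] = turn
    return problems


turns = [
    ("A", 0.0, 2.5),
    ("B", 1.0, 3.0),   # other channel: may overlap channel A freely
    ("A", 2.5, 4.0),
    ("A", 3.9, 5.0),   # starts before the previous A turn ends
]
overlaps = find_overlaps(turns)
```

    Turns on different channels are allowed to overlap in time, so only turns that share a channel are compared.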

    Punctuation

    The following punctuation marks should be used in the transcripts. They are primarily for ease of (human) reading. Use only the punctuation marks indicated below. Do not use other marks, such as single quotation marks (' '), exclamation marks (!) or apostrophes ('), except as specified below.

  • periods "." should be added at the end of declarative sentences
  • question marks "?" should be added at the end of interrogative sentences
  • commas "," should be added between clauses as is accepted in the standard orthography of the language
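    As a rough aid, a line can be scanned for the marks that the text names as disallowed. This is a minimal sketch: the set FORBIDDEN_MARKS and the function forbidden_marks are hypothetical names, and the set covers only the single quotation mark, apostrophe and exclamation mark mentioned above (plus their curly variants), not every mark outside the permitted list.

```python
# Marks the guidelines explicitly rule out; the curly quotes are an
# assumption added so that typographic variants are caught as well.
FORBIDDEN_MARKS = {"'", "!", "\u2018", "\u2019"}


def forbidden_marks(line):
    """Return (position, character) pairs for each disallowed mark in `line`."""
    return [(i, ch) for i, ch in enumerate(line) if ch in FORBIDDEN_MARKS]


hits = forbidden_marks("yes, okay!")
```

    Periods, question marks and commas pass through untouched, since they are the marks the list above permits.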

    Symbols

  • Acronyms I: are pronounced as a single word and should be written in caps (no spaces) and preceded by a "@" symbol: @AIDS

  • Acronyms II: are normally written as a single word but pronounced as a sequence of individual letters and should be written in all caps (no spaces) and preceded by a "~" symbol: ~FBI

  • Individual letters: are pronounced as such and should be written in caps and preceded by a "~" symbol.

  • Proper names: Both proper names and place names should be marked with a "^" symbol. If you encounter a "proper name phrase", mark only those words as proper names that are true proper names on their own.

  • Partial words: are indicated with a dash (without any spacing between the dash and the word): week-

  • If a word is mispronounced (such as a slip of the tongue), provide the correct spelling of the word, and place a "+" symbol in front of the word.

  • Interjections

    Use one from a set of standardized spellings for interjections. When it is hard to determine how to represent the interjection, ask your language leader.

  • English interjections, as transcribed in English:


    mhm
    uh-huh
    uh-oh
    whoa
    whew
    yeah
    jeeze

  • Non-lexemes

    In addition to the interjections (which are considered to be words), we also have a set of standardized spellings for hesitation sounds that speakers make while talking. Every such "non word" in the transcripts is marked with the "%" symbol.

  • English non-lexemes (to give you an idea of the criterion for lexemes and non-lexemes):


    %ach
    %ah
    %eee
    %eh
    %ew
    %ha
    %hee
    %huh
    %hm
    %um
    %uh
    %oh

  • Idiosyncratic words

    If a speaker uses a "made-up" word which is not used by other speakers (although it may be understandable), place a "*" symbol before the word. Consult your language leader in cases where you are uncertain whether a word fits in this category. Onomatopoeia also fits into this category.
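    The word-level symbols introduced so far ("@", "~", "^", "+", "%", "*" and the trailing dash) can be summarized as a small lookup. The sketch below is illustrative only: the function classify_token and the label strings are hypothetical, and it assumes tokens are separated by whitespace.

```python
# Prefix symbol -> what it marks, following the conventions listed above.
PREFIX_LABELS = {
    "@": "acronym pronounced as a word",
    "~": "acronym or letter pronounced letter by letter",
    "^": "proper name",
    "+": "mispronounced word",
    "%": "non-lexeme",
    "*": "idiosyncratic word",
}


def classify_token(token):
    """Return a label for one whitespace-separated transcript token."""
    core = token.rstrip(".,?")          # ignore sentence punctuation
    if core[:1] in PREFIX_LABELS:
        return PREFIX_LABELS[core[0]]
    if len(core) > 1 and core.endswith("-"):
        return "partial word"           # e.g. week-
    return "plain word"


labels = [classify_token(t) for t in "@AIDS ~FBI week- %um yeah".split()]
```

    Interjections such as "uh-huh" contain a dash but do not end with one, so they are treated as plain words rather than partial words.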

  • Noises

    In order to account for sound phenomena such as distortion, coughs, breaths, unintelligible speech, foreign words and phrases, etc., we utilize a set of unique brackets.

    {text} Sound made by the talker. Use only those sounds described below: {laugh} {cough} {sneeze} {breath} {lipsmack}

    [text] Sound not made by the talker (usually background or channel). This notation should be used only in those rare cases where the background condition is overwhelming.

    Use only those descriptions provided below:

    [distortion] [static] -- used for channel noise such as "buzzes", "pops", etc.
    [background] -- used for other noises such as children crying, pots being struck, etc.

    There may be many instances of brief channel noise, such as intermittent [static] or [background] noises. You can ignore these occurrences. The focus of these transcriptions is areas of speech, so there is no need to be overly concerned with small distortions. Similarly, if a speaker is stuttering, or starts to speak with a series of partial, hesitant words which have been individually timestamped, include the partial speech in a larger speech section.

    [text/] [/text] Marks when sound not made by the talker is non-instantaneous. Place this at the beginning and end of the noisy region. These tags are channel specific, and therefore the tag can cross turn changes if the sound is extended.
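    The noise conventions above lend themselves to a simple consistency check: braces may hold only the five talker sounds, square brackets may hold only the three listed descriptions, and every region opened with [text/] should be closed with [/text]. The following is a sketch under assumptions: check_noise_tags and its message strings are hypothetical, and the input is taken to be all of the text for one channel, since the region tags are channel specific.

```python
import re

TALKER_SOUNDS = {"laugh", "cough", "sneeze", "breath", "lipsmack"}
BACKGROUND_SOUNDS = {"distortion", "static", "background"}

BRACE = re.compile(r"\{([^{}]*)\}")
# Single square brackets only; the lookbehind and the character class
# together keep the double-bracket [[skip]] notation from matching.
BRACKET = re.compile(r"(?<!\[)\[(/?)([^\[\]/]+)(/?)\]")


def check_noise_tags(text):
    """Return a list of problems found in the noise tags of one channel."""
    problems = []
    for name in BRACE.findall(text):
        if name not in TALKER_SOUNDS:
            problems.append(f"unknown talker sound: {{{name}}}")

    open_regions = []
    for lead, name, trail in BRACKET.findall(text):
        if name not in BACKGROUND_SOUNDS:
            problems.append(f"unknown background sound: [{lead}{name}{trail}]")
        elif lead and trail:
            problems.append(f"malformed tag: [/{name}/]")
        elif trail:                      # [name/] opens a noisy region
            open_regions.append(name)
        elif lead:                       # [/name] closes a noisy region
            if name in open_regions:
                open_regions.remove(name)
            else:
                problems.append(f"[/{name}] has no matching [{name}/]")
    problems.extend(f"[{name}/] is never closed" for name in open_regions)
    return problems


clean = check_noise_tags("{laugh} [static/] salam [/static] [[skip]]")
flawed = check_noise_tags("{giggle} [background/] bale")
```

    Because a region may cross turn changes, the check is run over the whole channel rather than turn by turn.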

  • Other Conventions

    ((text)) Unintelligible speech. This is the transcriber's best guess.

    (( )) Unintelligible speech (one or more words) that you cannot even make a guess at (with a single space between the parentheses).

    This is used to indicate speech (one or more words) in another language. In place of "language", write the name of the language, if known. This can overlap with the (( )) notation above. If the language is recognized and can be transcribed, use the notation. If the language is recognized but cannot be transcribed, use . If the language is not even recognized, use just the (( )) notation as above.

    Our rule of thumb for noting a "foreign word" is that you wouldn't come across these words in a written form of Persian. For example, the pronunciation of the word "okay" has been nativized and we are writing it as a Korean word. However, a word such as "conversation", which is usually pronounced as "combersation" among Koreans, should still be treated as a foreign word despite the nativized pronunciation, since we don't expect to come across its written form in Korean.

    text This is used to mark an aside made by the primary talker where the talker is addressing someone in the background.

    text Overlapping speech is when a speaker is interrupted by another speaker, at a roughly equal volume. In situations where overlapping speech occurs, insert the breakpoint at the beginning of the word in which the interruption started, in other words, at the end of the last complete word.

    [[skip]] Used to indicate a portion of speech that has proven too difficult to transcribe. Use this to indicate substantial areas (more than four words, or three seconds) of unintelligible speech.
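    The choice between (( )) and [[skip]] for speech that cannot be made out comes down to the size threshold stated above. A minimal sketch, assuming the transcriber can estimate the number of words and the duration of the stretch; the function name unintelligible_notation is hypothetical:

```python
def unintelligible_notation(word_count, duration_sec):
    """Pick the notation for a stretch of speech that cannot be made out.

    Stretches of more than four words, or more than three seconds, count
    as substantial and are marked [[skip]]; shorter ones get (( )).
    """
    if word_count > 4 or duration_sec > 3.0:
        return "[[skip]]"
    return "(( ))"


short_gap = unintelligible_notation(word_count=2, duration_sec=1.0)
long_gap = unintelligible_notation(word_count=6, duration_sec=2.0)
```

    A stretch that is exactly four words and three seconds long stays with (( )), since the guideline says "more than".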


    damiller@ldc.upenn.edu
    Last modified: Wed Jul 12 11:50:07 2000