
The CF_Farsi goal is to transcribe a "good" fifteen minutes (900 seconds)
from each call.
Quick Reminders:
If any portion of the conversation is skipped, there should be a timestamp for the skipped portion of speech (even if it is a minute long), followed by the notation "[[skip]]" on the line after the timestamp, with a single space:
Punctuation
The following punctuation marks should be used in the transcripts. The punctuation marks are primarily for ease of (human) reading. Use only those punctuation marks indicated below. Do not use marks such as single quotation (' '), exclamation ('!') or apostrophe (') other than those given below.
Symbols
Use one from a set of standardized spellings for interjections. When it is hard to determine how to represent the interjection, ask your language leader.
mhm
uh-huh
uh-oh
whoa
whew
yeah
jeeze
In addition to the interjections (which are considered to be words), we also have a set of standardized spellings for hesitation sounds that speakers make while talking. Every such "non word" in the transcripts is marked with the "%" symbol.
%ach
%ah
%eee
%eh
%ew
%ha
%hee
%huh
%hm
%um
%uh
%oh
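A transcript can be checked mechanically against the hesitation list above. The following is a minimal sketch, not part of the official tooling: the function name `nonstandard_hesitations` is hypothetical, and it assumes that a hesitation is written as a whitespace-delimited token beginning with "%" and that any trailing punctuation is not part of the token.

```python
import re

# Standardized hesitation spellings, copied from the list above.
HESITATIONS = {
    "%ach", "%ah", "%eee", "%eh", "%ew", "%ha", "%hee",
    "%huh", "%hm", "%um", "%uh", "%oh",
}

# Assumption: a hesitation token starts with "%" at the beginning of a
# whitespace-delimited token and continues over letters, digits or hyphens.
PERCENT_TOKEN = re.compile(r"(?<!\S)%[\w-]*")


def nonstandard_hesitations(line):
    """Return the %-marked tokens in `line` that are not on the standard list."""
    return [tok for tok in PERCENT_TOKEN.findall(line) if tok not in HESITATIONS]


if __name__ == "__main__":
    # "%uhh" is not a standardized spelling, so it is reported.
    print(nonstandard_hesitations("%um yes %uhh I %ah, think"))
```

Any token the sketch reports would still need a human decision: either respell it with one of the standard forms or ask the language leader.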
If a speaker uses a "made-up" word which is not used by other speakers (although it may be understandable), place a "*" symbol before the word. Consult your language leader in cases where you are uncertain whether a word fits in this category. Onomatopoeia also fits into this category.
In order to account for sound phenomena such as distortion, coughs, breaths, unintelligible speech, foreign words and phrases, etc., we utilize a set of unique brackets.
{text} Sound made by the talker. Use only those sounds described below: {laugh} {cough} {sneeze} {breath} {lipsmack}
[text] Sound not made by the talker (usually background or channel). This notation should be used only in those rare cases where the background condition is overwhelming. Use only those descriptions provided below:
[distortion] [static] -- used for channel noise such as "buzzes", "pops", etc.
[background] -- used for other noises such as children crying, pots being struck, etc.
There may be many instances of brief channel noise, such as intermittent [static] or [background] noises. You can ignore these occurrences. The focus of these transcriptions is areas of speech, so there is no need to be overly concerned with small distortions.
Similarly, if a speaker is stuttering, or starts to speak with a series of partial, hesitant words which have been individually timestamped, include the partial speech in a larger speech section.
[text/] [/text] Marks when sound not made by the talker is non-instantaneous. Place this at the beginning and end of the noisy region. These tags are channel specific, and therefore the tag can cross turn changes if the sound is extended.
(( )) Unintelligible speech (one or more words) that you cannot even make a guess at (with a single space between the parentheses).
Our rule of thumb for noting a "foreign word" is that you wouldn't come
across these words in a written form of Persian. For example, the
pronunciation of the word "okay" has been nativized and we are
writing it as a Korean word. However, a word such as "conversation",
which is usually pronounced as "combersation" among Koreans, should
still be treated as a foreign word despite the nativized
pronunciation, since we don't expect to come across its written form
in Korean.
[[skip]] Used to indicate a portion of speech that has proven to be too
difficult to transcribe. Use this to indicate substantial areas
(more than four words, or three seconds) of unintelligible speech.
damiller@ldc.upenn.edu
Last modified: Wed Jul 12 11:50:07 2000