RT-04 GUIDELINES
LEVANTINE ARABIC TELEPHONE SPEECH
TRANSCRIPTION CONVENTIONS
[Mohamed Maamouri 1.12 dated April 1, 2004]
This document is a short version of the more complete and comprehensive set of
specifications that the LDC produced for RT-03 transcription. For more information, please go to: http://ldc.upenn.edu/Projects/Transcription/rt03/RT_Transcription_V2.2.pdf and/or to: http://www.ldc.upenn.edu/Projects/EARS/.
The present transcription conventions apply to telephone speech transcription and files. When used with the LDC Arabic Multi-dialectal Transcription Tool, AMADAT, the annotation conventions below apply to the Modern Standard Arabic-based Transcription (MSAT) level and are either tool-based tags (with push-button labels) or keyboard-based and manually added annotation. The annotation represented by these tags will transfer directly to the Arabic Orthographic System-based Transliteration (AOST). All Arabic
examples below are unvocalized because they reflect the MSAT level of annotation.
1.0 TRANSCRIPTION CONVENTIONS: ORTHOGRAPHY AND SPELLING
1.1 Telephone speech is recorded on two separate channels. In most cases, each channel corresponds to a single speaker for the duration of the call. The two channels are labeled
Speaker A (the local channel) and Speaker B (the remote channel). Occasionally, multiple speakers appear on a single channel. When/if this happens, the additional speakers are to be identified by the channel they appear on with a number to distinguish
them from the main speaker, e.g.
Speaker A1: second speaker on Channel A
Speaker overlap is marked with the tag (تداخل) 'tdAxl' 'overlap' at its beginning and (\تداخل) 'tdAxl\' at its end.
1.2 Once a file has been fully segmented and speakers identified, annotators should transcribe the file in its entirety, working with both channels at once. Annotators
must produce a word-by-word transcript of everything that is said within the file indicating all pauses and other audio/sound features.
1.3 Spelling: Transcribers/Annotators should use standard Modern Standard Arabic (MSA) orthographic conventions for word segmentation, word spacing, and word spelling, e.g.
والله not و الله for ‘wa Al-l~ah’
When in doubt about the spelling of a word or name, annotators should consult paper
or online references (dictionaries, atlases, etc.).
1.4 Capitalization and Proper Names: No capitalization is used in Arabic orthography. Proper names are labeled with a caret ^ symbol, which is placed in front of each word that is part of the proper name. Do not use the caret for common nouns that are part of a title or name. Examples:
هندي ^ سامية ^ ↔ ^Samia ^Hindi
مبارك ^ الرئيس ↔ President ^Mubarak
The caret ^ is only used with named entities and place names, mostly to signal that they could have a variable transcription/transliteration. The names of days, months, and holidays are considered normal nouns and are not marked in Arabic. Do not be misled by English usage, which treats all of the above as proper nouns.
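The caret convention lends itself to a simple automatic check. The following is a minimal sketch, not part of AMADAT: the function name `proper_name_words` is hypothetical, and the code assumes the caret is attached directly to the word it marks, as in the Latin-script renderings above. It does not treat the double caret used for uncertain spellings as a separate case.

```python
import re

# Assumption: the caret is attached directly to the word it marks,
# as in the Latin-script renderings "^Samia ^Hindi".
CARET_WORD = re.compile(r"\^([^\s^]+)")


def proper_name_words(segment: str) -> list[str]:
    """Return every caret-marked word in the segment, without the caret."""
    return CARET_WORD.findall(segment)


# The title word "President" carries no caret and is therefore not returned.
print(proper_name_words("President ^Mubarak"))
```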
1.5 Contractions: Contractions are extremely rare in Arabic. Annotators should limit their use to cases where they are actually produced by the speaker. In those rare cases
annotators must take care to transcribe exactly what the speaker says and what they hear using standard orthography, e.g.:
? شقتله for ?له قلت اش ‘What did you say to him’
or, perhaps, نصّ for نصف ‘half’ and بتّ for بنت ‘daughter’
Non-standard contractions should be spelled out in full. There are no hyphenated words
in Arabic and annotators should consult a dictionary before using a hyphen. Compounds
are written as single words as in: برمائي or قروسطي
Annotators must take care to transcribe exactly what they hear.
1.6 Numbers: All numerals should be written out as complete words. Numerals from 11 to 19 show a form of word contraction in most Arabic dialects.
إِحداش <iHdA$ ‘eleven’
أطناش >TnA$ ‘twelve’
خمصطاشر xamaSTA$ar ‘fifteen’
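Because all numerals must be spelled out, any digit left in a transcript segment is a candidate error. The sketch below is only an illustration under stated assumptions: `find_numerals` is a hypothetical helper, and it expects the segment text alone, without timestamps, which legitimately contain digits.

```python
import re

# \d matches any Unicode decimal digit, so both ASCII (0-9) and
# Arabic-Indic (U+0660 to U+0669) numerals are caught.
DIGIT_RUN = re.compile(r"\d+")


def find_numerals(segment: str) -> list[str]:
    """Return every run of digits that should have been written out as words."""
    return DIGIT_RUN.findall(segment)


# A fully spelled-out number (Buckwalter transliteration) yields no hits.
print(find_numerals("wAHd wE$ryn snp"))
```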
1.7 Acronyms: Acronyms that are pronounced as a single word should be written in single letters and preceded by a @ symbol as in:
صلعم @ pronounced ‘SalEama’ for صلّى الله عليه وسلّم
إلَخ @ for إلى آخره
أبجد @ for ‘abc’ as in ‘to know the abc of ‘
1.8 Arabic spoken letters are pronounced and transcribed as separate individual words. The Arabic letters corresponding to English ‘j’ and ‘n’ should not be written as ج and
ن but as full words جيم and نون. Similarly, individual letters in borrowed non-Arabic acronyms are pronounced and transcribed as separate words, e.g.:
أي بي سي for ‘ABC’
1.9 Punctuation: Annotators should use standard MSA punctuation for ease of transcription and reading. Acceptable Arabic punctuation should be limited to
periods and question marks at the end of a sentence. Transcripts should not contain
quotation marks, exclamation marks, colons, semicolons, dashes, or ellipses. No additional spaces should be used around punctuation marks.
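The punctuation rule can be approximated with a small checker. This is a sketch under assumptions rather than an LDC tool: the name `punctuation_problems` and the exact character set are illustrative choices, the check is meant for Arabic-script segment text, and dashes are deliberately not flagged because section 2.3 reuses them as markup for partial words and restarts.

```python
import re

# Quotation marks, guillemets, exclamation mark, colon, Latin and Arabic
# semicolons, and the single-character ellipsis. Dashes are left out on
# purpose: section 2.3 uses them as partial-word and restart markup.
FORBIDDEN = set('"\u201c\u201d\u00ab\u00bb!:;\u061b\u2026')

# Whitespace directly before a period or a Latin/Arabic question mark.
SPACE_BEFORE_MARK = re.compile(r"\s+[.?\u061f]")


def punctuation_problems(segment: str) -> list[str]:
    """List the punctuation issues found in one transcript segment."""
    problems = [f"forbidden mark {ch!r}" for ch in segment if ch in FORBIDDEN]
    if "..." in segment:
        problems.append("ellipsis written as three periods")
    if SPACE_BEFORE_MARK.search(segment):
        problems.append("space before sentence-final punctuation")
    return problems
```

An empty list means the segment passed these particular checks; it does not guarantee that the sentence-final mark itself was chosen correctly.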
1.10 Word compounds and Multiwords: Annotators should follow the standard
MSA conventions for segmented and non-segmented words when attaching
affixes and pronouns to nouns or verbs (including participles) and,
separately, when writing prepositions with the words that follow them.
Examples: صار لِي (2 separate words) not صارلِي (one word)
الّلَه^ عبد^ not عبدلله ^
2.0 TRANSCRIPTION CONVENTIONS: DISFLUENT SPEECH
2.1 Areas of disfluent speech are particularly important and difficult to transcribe. Speakers may stumble over their words, repeat themselves, utter partial words, restart
phrases or sentences, and use a lot of hesitation sounds. Annotators should transcribe
exactly what is spoken, including all the partial words, repetition and filled pauses used
by the speaker.
2.2 Filled pauses and hesitation sounds: Filled pauses are non-words. Speakers use them to indicate hesitation or to maintain control of a conversation. Each language
has a limited set of filled pauses that speakers can employ. Annotators should use
the standardized spelling of filled pauses without trying to alter them to reflect how the
speaker pronounces the word (as in ‘hummmmm’ for a long ‘hum’). The Arabic set
includes the following five interjections:
'%ah' '%>h' أه
'%eh' '%<yh' إيه
'%um' '%>m' أم
'%ooh' '%>ww' أوو
'%hm' '%hAy' هاي
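The closed set of filled pauses can be held in a lookup table so that elongated or improvised spellings are caught. The sketch below works on the transliterated forms in the second column of the list above; the function names are hypothetical, and it assumes that other %-prefixed tokens (such as noise tags) are filtered out or whitelisted before the second check is applied.

```python
# Transliterated spellings from the list above, mapped to the labels
# given in its first column.
FILLED_PAUSES = {
    "%>h": "%ah",
    "%<yh": "%eh",
    "%>m": "%um",
    "%>ww": "%ooh",
    "%hAy": "%hm",
}


def filled_pauses(segment: str) -> list[str]:
    """Return the standardized filled pauses found in a transliterated segment."""
    return [tok for tok in segment.split() if tok in FILLED_PAUSES]


def nonstandard_fillers(segment: str) -> list[str]:
    """Return %-prefixed tokens that are not in the standardized set."""
    return [
        tok
        for tok in segment.split()
        if tok.startswith("%") and tok not in FILLED_PAUSES
    ]
```

A token such as `%>mmmm` would be reported by `nonstandard_fillers`, which mirrors the instruction not to stretch the spelling to imitate the speaker.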
2.3 Partial words and restarts: When a speaker breaks off in the middle of a word, annotators should transcribe as much of the word as can be made out. A single dash - is used to indicate the point at which the word was broken off.
Example: -تـ or -تحـ before uttering the full word تحبهاش
wyn$- yEny yn$rwA يعني ينشروا -وينش
wnEm AltE- -- wnEm Alnsb ونعم النسب -- -ونعم التع
yEn- brAmj برامج -يعن
Speaker restarts are indicated with a double dash --. This annotation should be used when a speaker stops short, cutting him/herself off, before continuing with the utterance.
Please study the following examples:
إسمي محمّد-- -إسـ
brAmj -- brnAmj tEArfy yEny يعني تعارفي برنامج -- برامج
yEny mA -- mA fyh kvyr ما فيه كثير -- يعني ما
>nA mA Em bqdm -- mA Em bqdm twjyhy ما عم بقدم توجيهي -- أنا ما عم بقدم
<y$ fyh Hlwl -- <y$ fyh Hlwl <nty btqtrHyhA?
?إيش فيه حلول إنتي بتقترحيها -- إيش فيه حلول
w- -- wbEdyn mrAt وبعدين مرات -- -و
hm yEny h- -- h*A hAy hy هذا هاي هي -- -هم يعني ه
lkAn >H- --Hsn <n kAn btdrd$ أحسن إن كان بتدردش -- -لكان أح
mn E$- -- mn >kvrmn E$ryn snp من أكثرمن عشرين سنة -- -من عش
myp w>rbEyn $hr- -- $hry? ?شهري -- -أربعين شهرو مية
bySyr fyh >$yA' mA k- -- mA knt$ tqdr tkwn
ما كنتش تقدر تكون -- -بيصير فيه أشياء ما ك
lmA Alwld bt- -- bySyr Emrh wAHd wE$ryn snp
بيصير عمره واحد وعشرين سنة -- -لما الولد بت
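The two dash conventions can be told apart mechanically. The following sketch is an illustration only: `split_disfluencies` is a hypothetical helper, and it assumes transliterated text in which the single dash is attached to the end of the partial word and the double dash stands as its own space-separated token, as in most of the examples above.

```python
def split_disfluencies(segment: str) -> tuple[list[str], int]:
    """Return (partial words, number of restart markers) for one segment."""
    partial_words: list[str] = []
    restarts = 0
    for token in segment.split():
        if token == "--":
            # A free-standing double dash marks a speaker restart.
            restarts += 1
        elif len(token) > 1 and token.endswith("-") and not token.endswith("--"):
            # A word ending in a single dash was broken off by the speaker.
            partial_words.append(token)
    return partial_words, restarts


print(split_disfluencies("w- -- wbEdyn mrAt"))
```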
2.4 Contracted words: There are very few cases of contracted forms in Arabic
(as opposed to English, for instance, which frequently uses contracted forms such
as ‘I’ve’ or ‘they’re’). A good example exists in the Iraqi Arabic form
pronounced /has~a/ for /ha+Al+sAEah/. The suggested written form for
that word is هسّا instead of the commonly used form هسّه. The form هسّا can be
connected to a current but less frequent variant هسّع through the deletion of the final consonant. There is evidence for this form in Tunisian Arabic, where the
word فيسع ‘quickly’ is extremely frequent (from في ساعة).
Since contractions are extremely rare in Arabic, annotators should use extreme
caution and try to keep as close as possible to what they hear. An important
annotation decision needs to be made when it comes to choosing whether the word
under scrutiny is a ‘made-up’ word, a misused word, or a non-standard dialect term.
2.5 Mispronounced words and hard-to-understand sections: A ‘+’ symbol is used for obviously mispronounced words, not regional or non-standard dialect pronunciation. Annotators should transcribe using the standard spelling and should not try to represent the pronunciation.
Example: ياسمين ^ و+لخرة فاطمة ^ instead of ياسمين ^ و+لخة فاطمة ^
Sometimes an audio file will contain a section of speech that is impossible
to understand. In these cases, annotators should use empty double parentheses (( )) to mark totally unintelligible speech. If it is possible to guess the speaker’s words, annotators should transcribe what they think they hear and surround the uncertain transcription/text with double parentheses. If possible, the empty untranscribed region
should get its own timestamp, e.g.: [1125.145] (( ))
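Double parentheses and timestamps are easy to locate with regular expressions. The sketch below is a hedged illustration: the function names are hypothetical, and the timestamp pattern assumes the bracketed seconds format shown in the example above, placed at the start of the segment.

```python
import re
from typing import Optional

DOUBLE_PAREN = re.compile(r"\(\((.*?)\)\)")
LEADING_TIMESTAMP = re.compile(r"^\s*\[(\d+\.\d+)\]")


def uncertain_regions(segment: str) -> list[tuple[str, str]]:
    """Classify each (( )) region as 'unintelligible' (empty) or 'uncertain' (a guess)."""
    regions = []
    for match in DOUBLE_PAREN.finditer(segment):
        guess = match.group(1).strip()
        regions.append(("uncertain", guess) if guess else ("unintelligible", ""))
    return regions


def leading_timestamp(segment: str) -> Optional[float]:
    """Return the bracketed timestamp at the start of the segment, if present."""
    match = LEADING_TIMESTAMP.match(segment)
    return float(match.group(1)) if match else None
```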
2.6 Noise: Sometimes, audio files contain various types of noises. All noises should be
accurately transcribed and marked.
2.6.1 Speaker-produced noises أصوات % 'peopletalk' are identified with the five following tags:
صمت % 'silence'
تنفّس % 'breath'
ضحك % 'laugh'
سعال % 'cough'
عطس% 'sneeze'
When there is noticeable background noise (not speaker noise) present during a span
of speech, annotators employ the ضجّة% 'noise' tag. When the noise is instantaneous, like
a door slamming or a gunshot, the ضجّة% symbol is inserted next to the word during which
the noise occurs. If the sound is prolonged and spans several words in the transcript,
the ضجّة% symbol is inserted before the word where the sound begins, and an end
\ضجّة% symbol is inserted after the word where the sound ends.
2.6.2 Three additional noise-related tags have been added:
إنقطاع % 'pause'
موسيقى % 'music'
3.0 ADDITIONAL LINGUISTIC AND SOCIOLINGUISTIC MARKUP
3.1 Linguistic tags: Annotation of linguistic and dialectal variation is extremely important in multidialectal speech transcription. This annotation takes place in the Arabic Orthographic System-based Transliteration (AOST). Eight tags are included in AMADAT to cover important phonological and morphophonemic changes. These are:
'(Cons Change)'
'(Velarized Cons)'
'(Voc Variant)'
'(Hamzah Drop)'
'(Diphthong)'
'(-h Deletion)'
'(Cons Deletion)'
'(-ap Silent)' and '(-ap Pronounced)' for recording the Ta marbuta
It is to be noted that the assimilation of the definite article Al- has not been included
because it could be covered by automatic change rules.
3.2 Sociolinguistic tags: Annotators should annotate all words that they identify as not belonging to the targeted speech. Annotators should mark and annotate all language variation and diglossia phenomena. These phenomena occur frequently in intradialectal and interdialectal communication. The following situations are important annotation
areas:
3.2.1 Modern Standard Arabic ‘MSA’: This happens when annotators feel that some words are clearly borrowed from the Standard language MSA. This usually happens in Koranic citations or when there is a lexical need which is not covered in the targeted dialect and which forces a borrowing from MSA. This is a difficult issue and annotators should be very conservative and cautious when making this decision.
3.2.2 Interdialectal variation: Sometimes speakers borrow from other Arabic dialects. In this case, the source dialect should be tagged. For the moment, the following Arabic dialects have been included in the tool: 'NA', 'ALG', 'EGP', 'GLF', 'IRQ', 'LEB', 'JOR', 'MOR', 'PAL', 'SAU', 'SYR', 'TUN', 'YEM'. This is not a complete listing and other tags
will be added when needed.
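Since the dialect inventory is a closed list for now, a tag can be validated by simple set membership. This is a minimal sketch with a hypothetical function name; how the tag is attached to a word in the transcript is not specified here, so only the tag value itself is checked.

```python
# The dialect tags currently listed for the tool; the list is expected to grow.
DIALECT_TAGS = frozenset(
    {"NA", "ALG", "EGP", "GLF", "IRQ", "LEB", "JOR",
     "MOR", "PAL", "SAU", "SYR", "TUN", "YEM"}
)


def is_known_dialect_tag(tag: str) -> bool:
    """Return True if the tag is one of the currently listed dialect codes."""
    return tag in DIALECT_TAGS
```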
3.2.3 Foreign languages and foreign borrowings: Portions of speech in another language
are annotated using the (language text) convention to indicate the language and to transcribe the words that are spoken in that language. The transcription of the foreign word(s) in Latin script is only done in the YELLOW careful transcription. Only the tag
Foreign أجنبي% is used in the GREEN Quick Transcription.
Examples:
In GREEN: كثير أجنبي%
In YELLOW: كثير (FRENCH merci)
If the annotator does not know the name of the language or what is being said, he/she should use the tag FOR in parentheses: (FOR).
It should be noted that the above convention should not be used for foreign borrowings that are common in the target language. These words should be transcribed in the modified Arabic orthographic system using the extra four Persian letters.
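The (language text) convention of the careful transcription can be extracted with a regular expression. The sketch below is an assumption-laden illustration, not the tool's behavior: `foreign_spans` is a hypothetical name, the language label is assumed to be written in uppercase Latin letters as in the examples, and double parentheses (used for uncertain speech) are explicitly excluded.

```python
import re

# One opening parenthesis (not part of "(("), an uppercase language label,
# optional foreign text, and one closing parenthesis (not part of "))").
FOREIGN_SPAN = re.compile(r"(?<!\()\(([A-Z]+)(?:\s+([^()]+))?\)(?!\))")


def foreign_spans(segment: str) -> list[tuple[str, str]]:
    """Return (language, text) pairs; text is empty for a bare tag such as (FOR)."""
    return [
        (match.group(1), (match.group(2) or "").strip())
        for match in FOREIGN_SPAN.finditer(segment)
    ]


print(foreign_spans("kvyr (FRENCH merci)"))
```

Mixed-case linguistic tags such as '(Cons Change)' do not match the pattern, because the label must consist of uppercase letters only.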
3.2.4 In transcribing foreign names, vowels are often conventionally represented in the script using the long glide (vocalic/consonantal) letters ي , و and the ‘alif’ ا to
represent the non-Arabic vocalic range [ i , e , a , o , u ]:
Examples: ولسون^ for ‘^Wilson’
ماري ^ for ‘^Mary’
شيراك^ for ‘^Chirak’
When annotators encounter a name whose spelling they are not sure of, they should make their best guess and transcribe it with a double caret ^ ^ .
The above MSA-based conventional orthographic transcription implies long vowels that do not occur in actual pronunciation. The question remains as to whether we leave the above conventions in the Modern Standard Arabic-based Transcription (MSAT) level only or whether we should correct/readjust in the Arabic Orthographic System-based Transliteration (AOST) to reflect real pronunciation.