RT-04 GUIDELINES

 

LEVANTINE ARABIC TELEPHONE SPEECH

TRANSCRIPTION CONVENTIONS

 

 [Mohamed Maamouri 1.12 dated April 1, 2004]

 

This document is a short version of the more complete and comprehensive set of specifications that the LDC produced for RT-03 transcription.  For more information, please go to http://ldc.upenn.edu/Projects/Transcription/rt03/RT_Transcription_V2.2.pdf and/or http://www.ldc.upenn.edu/Projects/EARS/.

 

The present transcription conventions apply to the transcription of telephone speech files.  When used with the LDC Arabic Multi-dialectal Transcription Tool, AMADAT, the annotation conventions below apply to the Modern Standard Arabic-based Transcription (MSAT) level and are either tool-based tags (with push-button labels) or keyboard-based, manually added annotations.  The annotation represented by these tags will transfer directly to the Arabic Orthographic System-based Transliteration (AOST).  All Arabic examples below are unvocalized because they reflect the MSAT level of annotation.

 

1.0 TRANSCRIPTION CONVENTIONS:  ORTHOGRAPHY AND SPELLING

 

1.1  Telephone speech is recorded on two separate channels.  In most cases, each channel corresponds to a single speaker for the duration of the call.  The two channels are labeled Speaker A (the local channel) and Speaker B (the remote channel).  Occasionally, multiple speakers appear on a single channel.  When this happens, the additional speakers are identified by the channel they appear on, followed by a number that distinguishes them from the main speaker, e.g.:

           

Speaker A1: second speaker on Channel A
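The labeling scheme above can be checked mechanically.  The following is a minimal Python sketch, assuming labels are written as the channel letter optionally followed by a number (as in A1).  The names SpeakerLabel and parse_speaker_label are hypothetical and are not part of AMADAT.

```python
import re
from typing import NamedTuple


class SpeakerLabel(NamedTuple):
    channel: str      # "A" (local channel) or "B" (remote channel)
    extra_index: int  # 0 for the main speaker, 1 and up for additional speakers


# Channel letter followed by an optional number, e.g. "A", "B", "A1".
_LABEL_RE = re.compile(r"^([AB])([0-9]*)$")


def parse_speaker_label(label: str) -> SpeakerLabel:
    """Split a label such as 'A1' into its channel and speaker number."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"not a valid speaker label: {label!r}")
    channel, digits = match.groups()
    return SpeakerLabel(channel, int(digits) if digits else 0)
```

For example, parse_speaker_label("A1") yields channel "A" with speaker number 1, while a label such as "C" is rejected.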

 

Speaker overlap is marked with the tag (تداخل ) 'tdAxl' ('overlap') at its beginning and (\تداخل) 'tdAxl\' at its end.
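A simple consistency check for the overlap tags might look like the sketch below.  It assumes the transcript has already been split into tokens and that the tags appear in the transliterated forms 'tdAxl' and 'tdAxl\' given above; the constant and function names are hypothetical.

```python
from typing import List, Optional

OVERLAP_BEGIN = "tdAxl"    # begin-overlap tag, transliterated form (assumed)
OVERLAP_END = "tdAxl\\"    # end-overlap tag with trailing backslash (assumed)


def unmatched_overlap_tags(tokens: List[str]) -> List[int]:
    """Return the positions of overlap tags that have no partner."""
    problems: List[int] = []
    open_at: Optional[int] = None  # position of the currently open begin tag
    for i, tok in enumerate(tokens):
        if tok == OVERLAP_BEGIN:
            if open_at is not None:
                problems.append(open_at)  # earlier begin tag was never closed
            open_at = i
        elif tok == OVERLAP_END:
            if open_at is None:
                problems.append(i)        # end tag without a begin tag
            else:
                open_at = None
    if open_at is not None:
        problems.append(open_at)          # begin tag still open at the end
    return problems
```

An empty result means every begin tag is closed by an end tag before the next overlap starts.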

 

1.2 Once a file has been fully segmented and speakers identified, annotators should transcribe the file in its entirety, working with both channels at once.  Annotators must produce a word-by-word transcript of everything that is said within the file, indicating all pauses and other audio/sound features.

 

1.3  Spelling:  Transcribers/Annotators should follow standard Modern Standard Arabic (MSA) orthographic conventions for word segmentation, word spacing, and word spelling, e.g.:

                                  والله   not     و الله   for ‘wa Al-l~ah’

                       

When in doubt about the spelling of a word or name, annotators should consult paper or online references (dictionaries, atlases, etc.).

 

1.4  Capitalization and Proper Names:  No capitalization is used in Arabic orthography.  Proper names are labeled with a caret ^ symbol.  The symbol is placed in front of each word that is part of the proper name.  Do not use the caret for common nouns that are part of a title or name.   Examples:

           

                        هندي ^ سامية ^   for  ^Samia ^Hindi

                        مبارك ^ الرئيس    for  President ^Mubarak

 

The caret ^ is used only with named entities and place names, mostly to signal that they could have a variable transcription/transliteration.  The names of days, months, and holidays are treated as common nouns and are not marked in Arabic.  Do not be misled by English practice, which treats all of the above as proper nouns.
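As an illustration of how the caret marking can be consumed downstream, the sketch below collects the caret-marked words from a line.  It assumes the caret is attached directly in front of the word with no space, as in the Latin-script glosses above (^Samia ^Hindi); the function name is hypothetical.

```python
from typing import List


def proper_name_tokens(line: str) -> List[str]:
    """Return the words marked with a leading caret, without the caret."""
    return [
        tok[1:]
        for tok in line.split()
        if tok.startswith("^") and len(tok) > 1  # skip a caret standing alone
    ]
```

With this convention, the title in "President ^Mubarak" is left out and only the name itself is returned.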

 

1.5  Contractions:  Contractions are extremely rare in Arabic.  Annotators should limit their use to cases where they are actually produced by the speaker.  In those rare cases, annotators must take care to transcribe exactly what the speaker says, using standard orthography, e.g.:

 

                        ?        شقتله   for  ?له  قلت اش  ‘What did you say to him’

or, perhaps, نصّ  for  نصف  ‘half’ and  بتّ for بنت ‘daughter’

 

Non-standard contractions should be spelled out in full.  There are no hyphenated words in Arabic, and annotators should consult a dictionary before using a hyphen.  Compounds are written as single words, as in:  برمائي  or  قروسطي

Annotators must take care to transcribe exactly what they hear.

 

1.6 Numbers:  All numerals should be written out as complete words.  Numerals from 11 to 19 show a form of word contraction in most Arabic dialects. 

            إِحداش           <iHdA$                   ‘eleven’

            أطناش           >TnA$                    ‘twelve’

            خمصطاشر         xamaSTA$ar               ‘fifteen’

 

1.7  Acronyms:  Acronyms that are pronounced as a single word should be written in single letters and preceded by an @ symbol, as in:

 

                        صلعم @     pronounced  ‘SalEama’ for  صلّى الله عليه وسلّم

      إلَخ @                                     for     إلى آخره

                           أبجد @       for ‘abc’  as in ‘to know the abc of ‘   

 

1.8  Spoken Arabic letters are pronounced and transcribed as separate individual words.  The Arabic letters corresponding to English ‘j’ and ‘n’ should not be written as ج and ن but as the full words جيم  and   نون.  Similarly, individual letters in borrowed non-Arabic acronyms are pronounced and transcribed as separate words, e.g.:

 

                                            أي بي سي    for ‘ABC’

 

1.9  Punctuation:  Annotators should use standard MSA punctuation for ease of transcription and reading.  Acceptable Arabic punctuation is limited to periods and question marks at the end of a sentence.  Transcripts should not contain quotation marks, exclamation marks, colons, semicolons, dashes, or ellipses.  No additional spaces should be used around punctuation marks.
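The punctuation rule lends itself to an automatic check.  The sketch below is one possible way to flag excluded characters in a line of transliterated transcript; the character set and the function name are assumptions made for illustration.  Dashes are deliberately not flagged because - and -- are used as markup for partial words and restarts (section 2.3), and the single quote is not flagged because it is a letter in the transliteration used in this document.

```python
import re
from typing import List

# Double quotation marks (straight, curly, angle), exclamation marks,
# colons, semicolons, and ellipses written as "..." or as one character.
_FORBIDDEN = re.compile(r'\.\.\.|["\u201c\u201d\u00ab\u00bb!:;\u2026]')


def find_forbidden_punctuation(line: str) -> List[str]:
    """Return every excluded punctuation mark found in the line, in order."""
    return _FORBIDDEN.findall(line)
```

A line that ends in a single period or question mark passes the check, since those two marks are the only punctuation the convention allows.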

 

1.10 Word compounds and multiwords:  Annotators should follow the standard MSA conventions for segmented and non-segmented words when attaching affixes and pronouns to nouns or verbs (including participles) and, separately, when writing prepositions with the words that follow them.

Examples:       صار لِي   (2 separate words)  not صارلِي (one word)

                                    الّلَه^  عبد^                            not   عبدلله ^

 

2.0    TRANSCRIPTION CONVENTIONS:  DISFLUENT SPEECH

 

2.1 Areas of disfluent speech are particularly important and difficult to transcribe.  Speakers may stumble over their words, repeat themselves, utter partial words, restart phrases or sentences, and use a lot of hesitation sounds.  Annotators should transcribe exactly what is spoken, including all the partial words, repetitions, and filled pauses used by the speaker.

 

2.2  Filled pauses and hesitation sounds:  Filled pauses are non-words.  Speakers use them to indicate hesitation or to maintain control of a conversation.  Each language has a limited set of filled pauses that speakers can employ.  Annotators should use the standardized spelling of filled pauses without trying to alter them to reflect how the speaker pronounces the word (as in ‘hummmmm’ for a long ‘hum’).  The Arabic set includes the following five interjections:

 

'%ah'               '%>h'               أه

'%eh'               '%<yh'             إيه

'%um'              '%>m'              أم

'%ooh'           '%>ww'           أوو

'%hm'             '%hAy'             هاي
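Because the set of filled pauses is closed, a transcript checker can test tokens against it directly.  The sketch below assumes the middle column of the table above is the transliterated form that appears in the transcript; the constant and function names are hypothetical.

```python
# Transliterated filled pauses, copied from the middle column of the table.
FILLED_PAUSES = frozenset({"%>h", "%<yh", "%>m", "%>ww", "%hAy"})


def is_standard_filled_pause(token: str) -> bool:
    """True only for the standardized spellings, not for stretched variants."""
    return token in FILLED_PAUSES
```

A lengthened spelling such as '%>mmmm' is rejected, which matches the rule that the spelling must not be altered to reflect pronunciation.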

 

2.3 Partial words and restarts:  When a speaker breaks off in the middle of a word, annotators should transcribe as much of the word as can be made out.  A single dash - is used to indicate the point at which the word was broken off.

 

            Example:      -تـ or -تحـ before uttering the full word تحبهاش

 

wyn$-  yEny yn$rwA      يعني  ينشروا     -وينش      

 

wnEm AltE- -- wnEm Alnsb     ونعم النسب  --  -ونعم التع

 

yEn-  brAmj       برامج   -يعن      

 

Speaker restarts are indicated with a double dash --.  This annotation should be used when a speaker stops short, cutting him/herself off, before continuing with the utterance.

Please study the following examples:

 

إسمي محمّد--  -إسـ

 

brAmj -- brnAmj tEArfy yEny              يعني     تعارفي        برنامج     --  برامج  

 

yEny mA -- mA fyh kvyr           ما   فيه كثير  --  يعني   ما

 

>nA mA Em bqdm -- mA Em bqdm twjyhy    ما عم بقدم توجيهي --  أنا ما عم بقدم

 

<y$ fyh Hlwl -- <y$ fyh Hlwl <nty btqtrHyhA? 

 

?إيش فيه حلول إنتي بتقترحيها  --  إيش فيه حلول

 

w- -- wbEdyn mrAt       وبعدين مرات  -- -و

 

hm yEny h-  -- h*A hAy hy        هذا هاي هي -- -هم يعني ه

 

lkAn >H- --Hsn <n kAn btdrd$   أحسن إن كان بتدردش --  -لكان أح

 

mn E$-  -- mn >kvrmn E$ryn snp          من أكثرمن عشرين سنة -- -من عش

 

myp w>rbEyn $hr- -- $hry?      ?شهري -- -أربعين شهرو مية   

 

bySyr fyh >$yA' mA k- -- mA knt$ tqdr tkwn

ما كنتش تقدر تكون --  -بيصير فيه أشياء ما ك

 

lmA Alwld bt- -- bySyr Emrh wAHd wE$ryn snp

 بيصير عمره واحد وعشرين سنة -- -لما الولد بت
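The two dash conventions can be told apart automatically.  The following sketch works on the transliterated form of the examples above, where the single dash trails the broken-off word and the double dash separates the abandoned start from the restart; both function names are hypothetical.

```python
from typing import List


def split_restarts(utterance: str) -> List[str]:
    """Split an utterance into the stretches separated by the -- marker."""
    return [seg.strip() for seg in utterance.split("--") if seg.strip()]


def partial_words(utterance: str) -> List[str]:
    """Return the broken-off words, i.e. tokens ending in a single dash."""
    # Remove restart markers first so "--" is not mistaken for a partial word.
    tokens = utterance.replace("--", " ").split()
    return [tok for tok in tokens if tok.endswith("-") and len(tok) > 1]
```

Applied to 'wnEm AltE- -- wnEm Alnsb', the first function separates the abandoned start from the restart and the second reports 'AltE-' as the only partial word.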

 

 

 

 

 

 

 

 

 

2.4 Contracted words:  There are very few contracted forms in Arabic (unlike English, for instance, which frequently uses contracted forms such as ‘I’ve’ or ‘they’re’).  A good example is the Iraqi Arabic form pronounced /has~a/, from /ha+Al+sAEah/.  The suggested written form for that word is هسّا instead of the commonly used form هسّه.  هسّا can be connected to a current but less frequent variant, هسّع, through the deletion of the final consonant.  There is evidence for this form in Tunisian Arabic, where the word فيسع ‘quickly’ (from في ساعة ) is extremely frequent.

Since contractions are extremely rare in Arabic, annotators should use extreme caution and keep as close as possible to what they hear.  An important annotation decision must be made when choosing whether the word under scrutiny is a ‘made-up’ word, a misused word, or a non-standard dialect term.

 

2.5 Mispronounced words and hard-to-understand sections:  A ‘+’ symbol is used for obviously mispronounced words, not regional or non-standard dialect pronunciation.  Annotators should transcribe using the standard spelling and should not try to represent the pronunciation.

 

            Example:            ياسمين ^  و+لخرة  فاطمة ^  instead of  ياسمين ^ و+لخة  فاطمة ^ 

 

Sometimes an audio file will contain a section of speech that is impossible to understand.  In these cases, annotators should use empty double parentheses (( )) to mark totally unintelligible speech.  If it is possible to guess the speaker’s words, annotators should transcribe what they think they hear and surround the uncertain text with double parentheses.  If possible, the empty untranscribed region should get its own timestamp, e.g.:    [1125.145] (( ))
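A sketch of how the double-parenthesis regions could be extracted from a transcript line follows.  It assumes the timestamp, when present, is written in square brackets immediately before the region, as in the example above; the function name and the returned tuple shape are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Optional "[timestamp]" followed by a (( ... )) region.
_UNCERTAIN_RE = re.compile(r"(?:\[(\d+(?:\.\d+)?)\]\s*)?\(\((.*?)\)\)")


def uncertain_regions(line: str) -> List[Tuple[Optional[float], str]]:
    """Return (timestamp, text) pairs; empty text means unintelligible speech."""
    regions = []
    for match in _UNCERTAIN_RE.finditer(line):
        stamp = float(match.group(1)) if match.group(1) else None
        regions.append((stamp, match.group(2).strip()))
    return regions
```

An empty string in the second position corresponds to (( )), that is, speech that could not be understood at all, while a non-empty string is the annotator's best guess.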

 

2.6 Noise:  Sometimes, audio files contain various types of noises.  All noises should be accurately transcribed and marked.

 

2.6.1 Speaker-produced noises  أصوات %   'peopletalk' are identified with the five following tags:

صمت %            'silence'

تنفّس %                 'breath'

ضحك %               'laugh' 

سعال %               'cough'

عطس%                'sneeze'

 

When there is noticeable background noise (not speaker noise) present during a span of speech, annotators employ the ضجّة% 'noise' tag.  When the noise is instantaneous, like a door slamming or a gunshot, the ضجّة% symbol is inserted next to the word during which the noise occurs.  If the sound is prolonged and spans several words in the transcript, the ضجّة% symbol is inserted before the word where the sound begins, and an end \ضجّة% symbol is inserted after the word where the sound ends.

 

2.6.2 Three additional noise-related tags have been added:

 

إنقطاع %            'pause'

موسيقى %         'music'

 

3.0 ADDITIONAL LINGUISTIC AND SOCIOLINGUISTIC MARKUP

 

3.1 Linguistic tags:  Annotation of linguistic and dialectal variation is extremely important in multidialectal speech transcription.  This annotation takes place in the Arabic Orthographic System-based Transliteration (AOST).   Eight tags are included in AMADAT to cover important phonological and morphophonemic changes.  These are:

 

        '(Cons Change)'

        '(Velarized Cons)'

        '(Voc Variant)'

        '(Hamzah Drop)'

        '(Diphthong)'

        '(-h Deletion)'

        '(Cons Deletion)'

        '(-ap Silent)' and '(-ap Pronounced)' for recording the Ta marbuta

 

Note that the assimilation of the definite article Al- has not been included because it could be covered by automatic change rules.

 

3.2 Sociolinguistic tags:  Annotators should annotate all words that they identify as not belonging to the targeted speech.  Annotators should mark and annotate all language variation and diglossia phenomena.  These phenomena occur frequently in intradialectal and interdialectal communication.  The following situations are important annotation areas:

 

3.2.1 Modern Standard Arabic ‘MSA’:  This happens when annotators feel that some words are clearly borrowed from the Standard language MSA.  This usually happens in Koranic citations or when there is a lexical need which is not covered in the targeted dialect and which forces a borrowing from MSA. This is a difficult issue and annotators should be very conservative and cautious when making this decision.

           

3.2.2 Interdialectal variation:  Sometimes speakers borrow from other Arabic dialects.  In this case, the source dialect should be tagged.  For the moment, the following Arabic dialect tags have been included in the tool: 'NA', 'ALG', 'EGP', 'GLF', 'IRQ', 'LEB', 'JOR', 'MOR', 'PAL', 'SAU', 'SYR', 'TUN', 'YEM'.  This is not a complete listing, and other tags will be added when needed.
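Since the dialect tags form a fixed (if provisional) list, a transcript checker can validate them with a simple lookup.  The sketch below copies the thirteen tags listed above; the constant and function names are hypothetical, and the set would need to be extended as new tags are added to the tool.

```python
# Dialect tags currently listed in the guidelines (not a complete listing).
DIALECT_TAGS = frozenset({
    "NA", "ALG", "EGP", "GLF", "IRQ", "LEB", "JOR",
    "MOR", "PAL", "SAU", "SYR", "TUN", "YEM",
})


def is_known_dialect_tag(tag: str) -> bool:
    """True if the tag is one of the dialect tags listed in the guidelines."""
    return tag.strip() in DIALECT_TAGS
```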

 

3.2.3 Foreign languages and foreign borrowings:  Portions of speech in another language are annotated using the (language text) convention to indicate the language and to transcribe the words that are spoken in that language.  The transcription of the foreign word(s) in Latin script is only done in the YELLOW careful transcription.  Only the tag Foreign  أجنبي% is used in the GREEN Quick Transcription.

Examples:

In GREEN:  كثير  أجنبي%

 In YELLOW:  كثير  (FRENCH merci)

 

If the annotator does not know the name of the language or what is being said, he/she should use the tag FOR in parentheses: (FOR).

It should be noted that the above convention should not be used for foreign borrowings that are common in the target language.  These words should be transcribed in the modified Arabic orthographic system using the extra four Persian letters. 
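The (language text) convention of the YELLOW careful transcription can be parsed with a short pattern.  The sketch below assumes the language name is written in capital letters, as in (FRENCH merci) and (FOR); the function name is hypothetical.  Double-parenthesis regions such as (( )) and mixed-case tags such as (Cons Change) are not matched.

```python
import re
from typing import List, Tuple

# "(" + LANGUAGE in capitals + optional foreign text + ")".
_FOREIGN_RE = re.compile(r"\(([A-Z]+)(?:\s+([^()]*))?\)")


def foreign_spans(line: str) -> List[Tuple[str, str]]:
    """Return (language, text) pairs; text is empty for a bare tag like (FOR)."""
    return [
        (match.group(1), (match.group(2) or "").strip())
        for match in _FOREIGN_RE.finditer(line)
    ]
```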

 

3.2.4 In transcribing foreign names, vowels are often conventionally written with the long glide (vocalic/consonantal) letters ي , و  and the ‘alif’  ا to cover the non-Arabic vocalic range [ i , e , a , o , u ]:

 

Examples:       ولسون^           for ‘^Wilson’  

                        ماري ^               for ‘^Mary’

                        شيراك^           for ‘^Chirak’

 

When annotators encounter a name whose spelling they are not sure of, they should make their best guess and transcribe it with a double caret ^ ^ .

 

The above MSA-based conventional orthographic transcription implies long vowels that do not occur in actual pronunciation.  The question remains whether we leave the above conventions in the Modern Standard Arabic-based Transcription (MSAT) level only or whether we should correct/readjust them in the Arabic Orthographic System-based Transliteration (AOST) to reflect real pronunciation.