Guidelines for the Transcription of Arabic Dialects (EARS)
Tim Buckwalter, Mohamed Maamouri
Arabic Treebank Project
LDC, University of Pennsylvania
January 9, 2004

GUIDELINES FOR TRANSCRIBING LEVANTINE ARABIC:
AOST TRANSCRIPTION

IMPORTANT NOTICE!! These are the Guidelines for the Arabic Orthographic System-based Transcription (=AOST: transcription in the "Yellow" area). If you are engaged in MSA-based transcription (=MSAT: transcription in the "Green" area) please use the MSAT Guidelines.


GENERAL OBSERVATIONS

The AOST transcription starts out as an exact copy of the MSA-based Arabic transcription (MSAT), and it is in transliteration (AOST-T) as well as in Arabic script (AOST-A). The mapping between AOST-T and AOST-A is one-to-one and reversible. Example:

   MSAT: بالضبط (ضحك) اما (إنقطاع) (إنقطاع) اما صحيح اللي بحكيه والا لأ
   AOST-T (before changes): bAlDbT (DHk) AmA (<nqTAE) (<nqTAE) AmA SHyH Ally bHkyh wAlA l>
   AOST-A (before changes): بالضبط (ضحك) اما (إنقطاع) (إنقطاع) اما صحيح اللي بحكيه والا لأ
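
The one-to-one, reversible mapping between AOST-T and AOST-A can be illustrated with a short sketch. It assumes the widely used Buckwalter transliteration table, which the examples in these guidelines appear to follow, plus the three Persian letters (P, J, G) as they are used later in this document. The names BUCKWALTER, to_arabic and to_translit are illustrative only and are not part of AMADAT.

```python
# Assumed Buckwalter-style table (ASCII symbol -> Arabic character).
# P, J and G follow the Persian-letter examples in these guidelines.
BUCKWALTER = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A", "F": "\u064B", "N": "\u064C", "K": "\u064D",
    "a": "\u064E", "u": "\u064F", "i": "\u0650", "~": "\u0651",
    "o": "\u0652", "`": "\u0670", "{": "\u0671",
    "P": "\u067E", "J": "\u0686", "G": "\u06AF",
}

# Inverting the table only works because the mapping is one-to-one.
ARABIC = {arabic: ascii_ for ascii_, arabic in BUCKWALTER.items()}
assert len(ARABIC) == len(BUCKWALTER)


def to_arabic(text: str) -> str:
    """AOST-T -> AOST-A; characters outside the table pass through unchanged."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def to_translit(text: str) -> str:
    """AOST-A -> AOST-T; characters outside the table pass through unchanged."""
    return "".join(ARABIC.get(ch, ch) for ch in text)
```

Because every symbol maps to exactly one character, to_translit(to_arabic(s)) returns s for any transliterated string. This sketch does not treat Annotation Remarks or punctuation specially: Latin letters inside a remark such as "(-h Deletion)" would also be converted, so remarks would need to be set aside before conversion.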

The annotator's task is to modify the transliteration so that it more closely reflects actual pronunciation. These modifications consist of adding missing short vowels and diacritics, removing letters that are not pronounced, and changing letters that are pronounced differently from how they are written. In addition to changing the transliteration text, the annotator will occasionally attach one or more Annotation Remarks (e.g., "-h Deletion") to the relevant words in the text. For example:

   MSAT: بالضبط (ضحك) اما (إنقطاع) (إنقطاع) اما صحيح اللي بحكيه والا لأ
   AOST-T (after changes): biAlz~abT >am~A SaHiyH <ill~iy baHkiy (-h Deletion) wil~A la>
   AOST-A (after changes): بِالزَّبط أَمّا صَحِيح إِللِّي بَحكِي (-h Deletion) وِلّا لَأ


DETAILED INSTRUCTIONS

(*) Transcribing the Definite Article "Al-"

Do not modify the definite article "Al" – leave it unvocalized, but add the shadda to the word after "Al" when it begins with a geminated "shamsiyya" consonant (e.g. Als~alAm) if that is the way it was pronounced. Note that some dialects extend gemination to consonants that are not geminated in MSA (e.g. Alj~amal). Note the vocalization of the definite article when it is preceded by the preposition "li": lils~alAm, lilmaktab. Examples:

   MSAT: القمر والشمس
   AOST-T: Al>amar wiAl$~ams
   AOST-A: الأَمَر وِالشَّمس

(*) Transcribing Ta Marbuta

The ta marbuta is always transcribed as "p" regardless of how it is pronounced. However, if it is pronounced /t/ (such as when the word is in an idafa construction), then you must add the Annotation Remark "(-ap Pronounced)". For example:

   MSAT: مدرسة البنات مدرسة ثانوية
   AOST-T: madrasp(-ap Pronounced) AlbinAt madrasap tAnawiy~ap
   AOST-A: مَدرَسة(-ap Pronounced) البِنات مَدرَسَة تانَوِيَّة

Note that some words have ta marbuta that is often pronounced as /t/ outside of idafa constructions:

   MSAT: الحياة
   AOST-T: AlHayAp(-ap Pronounced)
   AOST-A: الحَياة(-ap Pronounced)
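
As a minimal sketch of this rule, the hypothetical helper below leaves the transliterated word unchanged and only appends the Annotation Remark when the annotator heard /t/. The function name and its boolean parameter are assumptions made for illustration; they do not describe how AMADAT attaches remarks.

```python
REMARK = "(-ap Pronounced)"


def mark_ta_marbuta(word: str, pronounced_t: bool) -> str:
    """Return the AOST-T form of a word that ends in ta marbuta.

    The ta marbuta itself always stays "p"; the remark is appended
    only when the annotator heard it pronounced as /t/.
    """
    if not word.endswith("p"):
        raise ValueError("word does not end in ta marbuta ('p'): " + word)
    return word + REMARK if pronounced_t else word


print(mark_ta_marbuta("AlHayAp", True))
print(mark_ta_marbuta("madrasap", False))
```

The pronunciation judgment stays with the annotator; the helper only keeps the remark spelling consistent.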

(*) Transcribing with Non-Standard (Persian) Letters

AMADAT allows for the use of three non-standard (Persian) characters for AOST transcription:

   /g/  گ  گـ  ـگـ  ـگ
   /č/  چ  چـ  ـچـ  ـچ
   /p/  پ  پـ  ـپـ  ـپ

Examples:

   MSAT: انت بتحكي انكليزي؟
   AOST-T: <inta btiHkiy <inGliyziy?
   AOST-A: إِنتَ بتِحكِي إِنگلِيزِي؟

   MSAT: كيف حالك؟
   AOST-T: Jiyf HAliJ?
   AOST-A: چِيف حالِچ؟

   MSAT: عندك كمبيوتر في البيت؟
   AOST-T: Eindak kamPyuwtar fiy Albayt?
   AOST-A: عِندَك كَمپيُوتَر فِي البَيت؟

(*) Transcribing "q" Pronounced as Glottal Stop

When the "q" is pronounced as a glottal stop, it should be transcribed as hamza. Follow the standard rules of orthography, which are based on the surrounding short vowels. (If there is known variation, such as شئون/شؤون and رءوس/رؤوس, follow the predominant Levantine spelling of the hamza.) Examples:

   MSAT: مين اللي بيقول هيك
   AOST-T: miyn All~iy biy&uwl hayk?
   AOST-A: مِين اللِّي بِيؤُول هَيك؟

   MSAT: ما قريتش المقالة
   AOST-T: mA >arayti$ Alma|alap
   AOST-A: ما أَرَيتِش المَآلَة

(*) Transcribing "q" Pronounced as /g/

When the "q" is pronounced as /g/ (as in some non-urban Levantine dialects), it should be transcribed with the Persian letter گ. Example:

   MSAT: انا بقول لك
   AOST-T: >anA baGuwl lak
   AOST-A: أَنا بَگُول لَك

(*) Transcribing Hamza

The hamza at the beginning of words will be written above or below alif depending on the accompanying short vowel: above if the vowel is "a" or "u"; below if it's "i". For example:

   MSAT: وين امك
   AOST-T: wayn <im~ak
   AOST-A: وَين إِمَّك
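
The choice of hamza seat at the beginning of a word can be sketched as follows. It assumes the transliteration symbols used in the examples of this document (">" for hamza above alif, "<" for hamza below alif); the function name is hypothetical.

```python
def initial_hamza(vowel: str) -> str:
    """Return the word-initial hamza plus its short vowel in AOST-T.

    ">" is hamza above alif (used with "a" or "u");
    "<" is hamza below alif (used with "i").
    """
    if vowel in ("a", "u"):
        return ">" + vowel
    if vowel == "i":
        return "<" + vowel
    raise ValueError("expected a short vowel 'a', 'u' or 'i', got: " + repr(vowel))


# Building the second word of the example above.
print("wayn " + initial_hamza("i") + "m~ak")
```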

(*) Transcribing the Short Vowel "a" before Alif

The short vowel "a" is redundant before the long vowel "A", so there is no need to write it. (Technical note: if the sequence "aA" is desired in the final transcription, it can be generated automatically via a substitution script. The reverse process – replacing all instances of "aA" with "A" – is even easier to implement.)
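
The substitution mentioned in the technical note could look like the sketch below. It rests on one assumption that the guidelines do not state: an "A" counts as a long vowel only when it directly follows a consonant or a shadda ("~"), so that word-initial alif (as in the definite article "Al") and alif after another short vowel (as in "wiAl") are left alone. The consonant inventory and the function names are illustrative.

```python
import re

# Assumed consonant symbols of the transliteration, including P, J, G.
_CONSONANTS = "'|>&<}bptvjHxd*rzs$SDTZEgfqklmnhwYyPJG"

# "A" directly preceded by a consonant or a shadda ("~").
_LONG_A = re.compile("(?<=[" + re.escape(_CONSONANTS + "~") + "])A")


def expand_long_a(text: str) -> str:
    """Insert the redundant short "a" before long "A" (mA -> maA)."""
    return _LONG_A.sub("aA", text)


def collapse_long_a(text: str) -> str:
    """Reverse process: replace every "aA" with "A" (maA -> mA)."""
    return text.replace("aA", "A")


print(expand_long_a("mA >arayti$ Almuwdiyl"))
print(collapse_long_a("maA >arayti$ Almuwdiyl"))
```

As the note says, the reverse direction is a plain string replacement, while the forward direction needs a rule for deciding which alifs are long vowels.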

(*) Transcribing "Dagger Alif" as Long Vowel "A"

There are several words that in MSA are written with a "dagger alif" diacritic. On PCs this diacritic appears only in the word "Allah" (الله) because there is a special glyph for the sequence of 3 letters "llh" (لله). The dagger alif would also appear in words like هذا, هذي, and طه if it were available. Note that the dagger alif is transcribed in AOST as the long vowel "A":

   MSAT: الله ـ هذا ـ هذي ـ طه
   AOST-T: All~Ah - hA*A - hA*iy - TAha
   AOST-A: اللّاه ـ هاذا ـ هاذِي ـ طاهَ

(*) Transcribing Unstressed Long Vowels

In words with more than one long vowel, unstressed long vowels are often shortened. For example: مباريات is often pronounced /mubaraya:t/. This long-vowel shortening is related to word stress and will therefore not be recorded in AOST transcription.

   MSAT: صواريخ ـ مباريات ـ كانون
   AOST-T: SawAriyx - mubArayAt - kAnuwn
   AOST-A: صَوارِيخ ـ مُبارَيات ـ كانُون


DISPLAY ISSUES

(*) Vocalization of Lam-Alif Ligature

On all platforms and in most applications the lam-alif ligature (لا) displays strangely when a diacritic is inserted between the lam and the alif (e.g., "li>an~a" لِأَنَّ). This is a display issue only, and the data should not be considered corrupted!


ISSUES TO BE CONSIDERED

(*) Transcribing the Long Vowels /e:/ and /o:/

The long vowels /e:/ and /o:/ should be transcribed as the diphthongs /ay/ and /aw/ if the word's MSA counterpart also has a diphthong (e.g., بيت, صوت). However, if the MSA counterpart has a long vowel sound (e.g., موتور, موديل), then transcribe the /e:/ and /o:/ sounds as the long vowels /iy/ and /uw/. Examples:

   MSAT: صوت الموتور ـ الموديل ـ البيت
   AOST-T: Sawt Almuwtuwr - Almuwdiyl - Albayt
   AOST-A: صَوت المُوتُور ـ المُودِيل ـ البَيت

(*) Annotation of Epenthetic Vowels

How epenthetic vowels should be annotated is still open. For example, which of the following AOST transcriptions do we prefer, and why?

(1.a)
   MSAT: انا قلت لك
   AOST-T: >nA >ult l~ak
   AOST-A: أنا أُلت لَّك

(1.b)
   MSAT: انا قلت لك
   AOST-T: >nA >ulti l~ak
   AOST-A: أنا أُلتِ لَّك

(1.c)
   MSAT: انا قلت لك
   AOST-T: >nA >ult il~ak
   AOST-A: أنا أُلت ِلَّك

(1.d)
   MSAT: انا قلت لك
   AOST-T: >nA >ult <il~ak
   AOST-A: أنا أُلت إِلَّك

(2.a)
   MSAT: من البيت
   AOST-T: min Albayt
   AOST-A: مِن البَيت

(2.b)
   MSAT: من البيت
   AOST-T: mini Albayt
   AOST-A: مِنِ البَيت

(2.c)
   MSAT: من البيت
   AOST-T: mina Albayt
   AOST-A: مِنَ البَيت

(2.d)
   MSAT: من البيت
   AOST-T: min iAlbayt
   AOST-A: مِن ِالبَيت