Guidelines for the Transcription of Arabic Dialects (EARS)
Tim Buckwalter, Mohamed Maamouri
Arabic Treebank Project
LDC, University of Pennsylvania
Oct. 8, 2004

GUIDELINES FOR TRANSCRIBING LEVANTINE ARABIC:
MSA-BASED TRANSCRIPTION

(1) General spelling

The Guidelines can be summarized in the following general statement: Adhere as closely as possible to unvocalized MSA spelling and word segmentation in all cases. The following example illustrates the application of these principles: the Levantine utterance /?ultil:ak/ ("I told you") will be transcribed as MSA قلت لك unvocalized, spelled as two words, using accepted MSA orthography. There are three notable exceptions to these rules:

(2) MSA and LA phonological differences

The following regular phonological differences between MSA and LA (and many other dialects as well) do not justify departing from MSA orthography when transcribing the colloquial form:

(3) Ta marbuta

The ta marbuta should always be written with the two dots (ـة), whether it is pronounced /a/ or /t/.

(4) Short vowels, diacritics, and Nunation

Do not transcribe the short vowels /a/, /u/, /i/, or the diacritics indicating zero-vowel ("sukun") and gemination ("shadda"). When Nunation ("tanwin") is recorded in speech, it should be transcribed. The most common examples all involve the use of "fatHatan" (e.g., /?ahlan wa-sahlan/ أهلاً وسهلاً), although some uses of "kasratan" are recorded in educated or elevated speech (e.g., /?ila Had:in ma:/ إلى حدٍ ما). Note that the "fatHatan" may occur without an alif chair (e.g., /xa:S:atan/ خاصةً).
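The rule above can be sketched as a small text filter. This is only an illustration, not part of the Guidelines: the function name and the mapping of each diacritic to a Unicode code point are assumptions made for the example. The filter drops the short vowels, sukun, and shadda, and leaves the three tanwin marks (U+064B, U+064C, U+064D) untouched.

```python
# Combining marks that the rule says NOT to transcribe.
# The code-point assignments below are an assumption of this sketch.
DROPPED_MARKS = {
    "\u064E",  # fatha  (short /a/)
    "\u064F",  # damma  (short /u/)
    "\u0650",  # kasra  (short /i/)
    "\u0651",  # shadda (gemination)
    "\u0652",  # sukun  (zero-vowel)
}


def strip_unwanted_diacritics(text: str) -> str:
    """Remove short vowels, sukun, and shadda; keep tanwin and all letters."""
    return "".join(ch for ch in text if ch not in DROPPED_MARKS)


# A vocalized spelling of /?ahlan/ with fatha, sukun, and final fatHatan.
vocalized = "\u0623\u064E\u0647\u0652\u0644\u0627\u064B"
cleaned = strip_unwanted_diacritics(vocalized)
```

In this sketch, `cleaned` retains the fatHatan on the final alif, which matches the spelling أهلاً given in the example above.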

(5) Glottal stop (hamza)

All glottal stops should be written when and where they occur. For example:

           /ğara:?id/ جرائد  (contrast with: /ğara:yid/ جرايد)
           /fa:?iz/ فائز  (contrast with: /fa:yiz/ فايز)
           /ra?s/ رأس  (contrast with: /ra:s/ راس)
           /bi?r/ بئر  (contrast with: /bi:r/ بير)

(6) Verbal prefixes and suffixes

Colloquial verbs follow MSA paradigms, but with modifications to the underlying tri-consonantal radicals and with the addition of purely colloquial clitics.

The colloquial Perfect Verb follows MSA suffixation orthography except for the following:

The colloquial Perfect Verb follows MSA stem orthography except for the following:

PERFECT VERB PARADIGMS

      Affirmative:

      هو   كتب   شاف   قرى   إجى
      هم   كتبوا   شافوا   قروا   إجوا
      هي   كتبت   شافت   قرت   إجت
      إنت   كتبت   شفت   قريت   جيت
      إنتي   كتبتي   شفتي   قريتي   جيتي
      إنتوا   كتبتوا   شفتوا   قريتوا   جيتوا
      أنا   كتبت   شفت   قريت   جيت
      إحنا   كتبنا   شفنا   قرينا   جينا

      Negative:

      هو ما   كتبش   شافش   قراش   جاش
      هم ما   كتبوش   شافوش   قروش   جوش
      هي ما   كتبتش   شافتش   قرتش   جتش
      إنت ما   كتبتش   شفتش   قريتش   جيتش
      إنتي ما   كتبتيش   شفتيش   قريتيش   جيتيش
      إنتوا ما   كتبتوش   شفتوش   قريتوش   جيتوش
      أنا ما   كتبتش   شفتش   قريتش   جيتش
      إحنا ما   كتبناش   شفناش   قريناش   جيناش


The colloquial Imperfect Verb follows MSA orthography except for the following:

IMPERFECT VERB PARADIGMS

      Affirmative:

      هو   بيكتب   بيشوف   بيقرى   بيجي
      هم   بيكتبوا   بيشوفوا   بيقروا   بيجوا
      هي   بتكتب   بتشوف   بتقرى   بتيجي
      إنت   بتكتب   بتشوف   بتقرى   بتيجي
      إنتي   بتكتبي   بتشوفي   بتقري   بتيجي
      إنتوا   بتكتبوا   بتشوفوا   بتقروا   بتيجوا
      أنا   بكتب   بشوف   بقرى   بجي
      إحنا   منكتب   منشوف   منقرى   منيجي

      Negative:

      هو ما   بيكتبش   بيشوفش   بيقراش   بيجيش
      هم ما   بيكتبوش   بيشوفوش   بيقروش   بيجوش
      هي ما   بتكتبش   بتشوفش   بتقراش   بتيجيش
      إنت ما   بتكتبش   بتشوفش   بتقراش   بتيجيش
      إنتي ما   بتكتبيش   بتشوفيش   بتقريش   بتيجيش
      إنتوا ما   بتكتبوش   بتشوفوش   بتقروش   بتيجوش
      أنا ما   بكتبش   بشوفش   بقراش   بجيش
      إحنا ما   منكتبش   منشوفش   منقراش   منيجيش


(7) Pronominal suffixes and verbal objects (prepositional phrases and direct objects)

MSA orthography should be followed in all cases. Note, especially, the following examples:

There are three notable exceptions to the preceding MSA-based rule:

(8) Pronominal suffixes and active participles

Unlike MSA, LA active participles may take pronominal suffixes. Note the following examples:

(9) Numerals

Except for the numerals 11-19 (see below), follow MSA orthography (or parallel MSA forms) in all cases. Note, especially, the following examples:

The Levantine Arabic numerals 11-19 show considerable deviation from MSA pronunciation and word segmentation. The orthography of these colloquial words varies considerably, as observed in informal writing on the Web. The following recommended orthography is based on a survey of Google frequency counts of all known orthographic variants. Note that the word-final /r/ in all these forms is to be transcribed even in cases where it is not pronounced.

      11   إحدعشر
      12   إثنعشر
      13   ثلاثطعشر
      14   أربعطعشر
      15   خمسطعشر
      16   سطعشر
      17   سبعطعشر
      18   ثمانطعشر
      19   تسعطعشر
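For tooling that checks transcripts against this list, the recommended forms can be held in a lookup table. The following is a minimal sketch; the names `LEVANTINE_TEENS` and `spell_teen` are hypothetical and chosen only for illustration, while the spellings are copied from the list above.

```python
# Recommended spellings for the numerals 11-19, copied from the list above.
LEVANTINE_TEENS = {
    11: "إحدعشر",
    12: "إثنعشر",
    13: "ثلاثطعشر",
    14: "أربعطعشر",
    15: "خمسطعشر",
    16: "سطعشر",
    17: "سبعطعشر",
    18: "ثمانطعشر",
    19: "تسعطعشر",
}


def spell_teen(n: int) -> str:
    """Return the recommended spelling for a numeral in the range 11-19."""
    if n not in LEVANTINE_TEENS:
        raise ValueError(f"{n} is outside 11-19; follow MSA orthography instead")
    return LEVANTINE_TEENS[n]


# Every form keeps the word-final ر, whether or not it was pronounced.
assert all(form.endswith("ر") for form in LEVANTINE_TEENS.values())
```

The final assertion simply restates the note about word-final /r/ as a consistency check on the table itself.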

(10) Days of the week

      Sunday   يوم الأحد
      Monday   يوم الإثنين
      Tuesday   يوم الثلاثة
      Wednesday   يوم الأربعة
      Thursday   يوم الخميس
      Friday   يوم الجمعة
      Saturday   يوم السبت

(11) Months of the year

      Eastern:

      January   شهر كانون الثاني / شهر واحد / شهر يناير
      February   شهر شباط / شهر إثنين / شهر فبراير
      March   شهر آذار / شهر ثلاثة / شهر مارس
      April   شهر نيسان / شهر أربعة / شهر إبريل
      May   شهر أيار / شهر خمسة / شهر مايو
      June   شهر حزيران / شهر ستة / شهر يونيو
      July   شهر تموز / شهر سبعة / شهر يوليو
      August   شهر آب / شهر ثمانية / شهر أغسطس
      September   شهر أيلول / شهر تسعة / شهر سبتمبر
      October   شهر تشرين الأول / شهر عشرة / شهر أكتوبر
      November   شهر تشرين الثاني / شهر إحدعشر / شهر نوفمبر
      December   شهر كانون الأول / شهر إثنعشر / شهر ديسمبر

      Islamic:

      Muharram   محرم
      Safar   صفر
      Rabie I   ربيع الأول
      Rabie II   ربيع الثاني
      Jumada I   جمادى الأول
      Jumada II   جمادى الثاني
      Rajab   رجب
      Shaaban   شعبان
      Ramadan   رمضان
      Shawwal   شوال
      Dhul Qada   ذو القعدة
      Dhul Hijja   ذو الحجة

(12) Foreign words and placenames

Most foreign words and placenames already have established MSA spellings (e.g., Washington واشنطن, Los Angeles لوس انجلوس). In cases where the MSA spelling has regional variants, follow the Levantine spelling. Note the following examples:

Words that are not attested in MSA should be transcribed as expected in MSA, but according to Levantine orthography. Note that although many computers are able to display "extended" Arabic characters, such as the Persian letters /p/ پ, /č/ چ, /ž/ ژ, and /g/ گ, few systems provide the user with an easy way to actually type these characters on the keyboard. So, although these letters are potentially available for representing foreign sounds, the convention in MSA orthography is to substitute the corresponding and easily available Arabic letters instead. Therefore, according to Levantine practice,


APPENDIX A: HIGH-FREQUENCY DIALECTAL WORDS

In order to maintain lexical consistency, the transcriber should regularly consult the following list of high-frequency dialectal words and their prescribed orthographic forms.

أتاري /?ata:ri/ "it turned out to me" (also pronounced /?aθa:ri/--perhaps as a "classicisation"--among non-urban speakers)

إحنا /?iHna/ "we"; cf. نحنا /niHna/

إشي /?iši:/ "something"; بدك إشي؟ "you want something?"; الإشي الجديد هذا "this new thing" (cf. شي /ši:/)

إلـ /?il-/ used primarily in possessive constructions; it attaches directly to a following clitic (e.g. أنا إلي /?ana ?ili:/ "I have", كم أخ إلك /kam ?ax ?ilak/ "How many brothers do you have?"). Note also its use with a preposition: لإله /la?ilo/ "for him", لإلها /la?ilha/ "for her"

اللي /?il:i:/ "who; which" (with prepositions: باللي /bil:i:/, للي /lil:i:/)

أكمن /?akam:an, ?akam:in/ "how many"; هالأكمن /ha:l?akam:an, ha:l?akam:in/ "this many"

ألو /?alu:, ?alo:/ "hello" (in telephone conversations only)

إمبارح /?imba:riH/ "yesterday"

إمبيرح /?imbe:riH/ "yesterday"

إمراة /?imra:/ "woman; wife" (إمراتك /?imra:tak/ "your wife") see also مرة /mara/

إنتي /?inti:/ "you" (fem.sg.)

إنتوا /?intu:/ "you" (masc.pl.)

أنو /?anu:/ "which" من أنو جامعة؟ "from which university?"

أني /?ani:/ (1) "I" أني معاك "I'm with you" (2) "which" (fem. variant of أنو, see above) حضرتك من أني عائلة؟ "which family do you come from?"

أوضة /?o:Da/ "room"

أوكي /?o:ke:/ "O.K."

إيد /?i:d/ "hand"

إيش /?e:š/ "what"; cf. ليش /le:š/ "why"

إيمتى /?e:mta/ "when"; also إيمتىً /?e:mtan/

إيه /?e:h, ?e:/ "what?"; cf. ليه /le:h, le:/ "why?"

أيواً /?ay:uwan/ "yes"

أيوه /?ay:uwah/, also /?ay:uwa/ "yes"

بالله /bal:a/ "I swear; really?" (cf. والله)

بد /bidd-/ "want" (بدي /biddi/ "I want", بده /biddo/ "he wants", بدها /bidda:, biddha/ "she wants")

برا /bar:a/ "outside" (cf. جوا /juw:a/ "inside")

برضو /barDo/ "also"

بركي /barki/ "maybe" (variant of بلكي)

بس /bas:/ "only; just"

بعديـ /ba`di:-/ "after" بعديكوا (cf. قبليـ)

بعدين /ba`de:n/ "afterwards"

بكرة /bukra/ "tomorrow"

بلكي /balki/ "maybe" (variant of بركي)

بيـ /bi:-/ "with" (with following clitic: بيك /bi:k/ "to/with you (masc.sg.)", بيكي /bi:ki:/ "to/with you (fem.sg.)"): أهلاً بيكوا /?ahlan bi:ku/ "welcome!"

بيناتـ /be:na:t-/ "among; between" (with following clitic: بيناتهم /be:na:thum/ "among them")

تبع /taba`/ possessive function word, usually with following clitic: الملك تبعكم "your king"; تبع مين هاي السيارة؟ "whose car is this?"

تع /ta`/ "come!" fem. تعي, pl. تعوا (short forms of تعال, تعالي, and تعالوا)

تم /tum:/ "mouth": /sak:ir tum:ak/ "shut up!"

جاج /ža:ž/ "chicken"

جوا /juw:a/ "inside" (cf. برا /bar:a/ "outside")

حد /Hadd/ (in the phrase ما حدش /ma: Had:iš/ "nobody")

حدا /Hada/ (esp. in neg phrases such as ما حدا بيعرف /ma: Hada biya`ref/ "nobody knows")

دغري /dughri/ "straight; forward"

زاكي /za:ki:/ "delicious"

زلمة /zalamah, zalameh, zalami/ "guy; buddy; dude": زلمتنا /zilmitna/ "our friend; our colleague"

زي /zey:/ "like; as"

ساعيات (used with هالـ) /has:a:`iya:t/ هالساعيات "nowadays"

ش /-š/ neg. part. used with verbs and function words: ما عنديش, ما فيش

شو /šu:/ "what"

شوي /šway:/ "a little bit" (fem. شوية /šway:a, šway:it/)

شي /ši:/ "something"; cf. إشي /?iši:/

طب /Tab/ "okay"

عـ /`a/ a shortened form of على; it attaches directly to a following particle (e.g. عكل حال /`a-kull Ha:l/ "in any case"), to a noun stem (e.g. علبنان /`a-lubna:n/ "to Lebanon"), or to the definite article and following noun (e.g. عالسفارة /`as-safa:ra/ "to the embassy")

عبين /`abe:n/ "until" (typically with ما):
استنى شوية عبين ما حدا يحكي معاك "wait a second till someone (comes to the phone) to speak with you";
استنيته عبين ما رجع "I waited for him till he returned"

عشان /`aš:a:n/ "because; 'cause"

علشان /`alša:n/ "because; 'cause" (note: /`alaša:n/ should be transcribed as على شان)

عم /`am/ (imperfect verb particle: شو عم تحكي؟ "what are you saying?")

عمالـ /`am:a:l-/: (عماله /`am:a:lo/, عمالنا /`am:a:lna/) modal to express habitual action: عمالنا نستنى فيه "we've been waiting a while for him"

فا /fa:/ "so, therefore" (written frequently as a separate word: فا اللي يقدر يروح "so, whoever is able to go")

فيه /fi:, fi:h/ "there is/are" (note: transcribe /fiy:o/ as فيو; cf. هيو)

فيش /fi:š/ "there isn't/aren't/ain't"; also فيشي /fi:ši:/

فين /fe:n/ "where" (cf. وين)

قبليـ /?abli:-/ "before" قبليكوا (cf. بعديـ)

قديش /?ad:e:š/ "how much"

قديه /?ad:e:h, ?ad:e:/ "how much"

كلياتـ /kul:iy:a:t-/ "all of" (with following clitic: كلياتهم /kul:iy:a:thum/ "all of them")

كمان /kama:n/ "also"

كويس /kway:is/ "good"

لأ /la?/ "no; nope" (follow the hamza transcription rule here: if you hear it, transcribe it)

لسة /lis:a/ "still, yet" (with following clitic لساتـ /lis:a:t/: لساتك هون؟ /lis:a:tak ho:n/ "are you still here?" لساتني طفل صغير /lis:a:tni Tifl ?izgi:r/ "I'm still a small child")

له /lah/ "no; no way"

ليش /le:š/ "why"; cf. إيش /?e:š/ "what"

ماي /may/ "water"

ماية /may:a/ "water"

مبلى /mbala/ "yes, certainly"

مرة /mara/ "woman; wife" (مرتك /martak/ "your wife") see also إمراة /?imra:/

مش /miš, muš/ "not"

مشان /miša:n/ "in order to; for the sake of"

مظلش /maZal:iš/ (contraction of ما ظلش) "not to remain": مظلش ولا واحد "nobody remained; not a single person remained"

معا /ma`a:/ "with" (with following clitic: معاك /ma`a:k/ "with you (sg.)", معاكوا /ma`a:ku:/ "with you (pl.)", معاي /ma`a:y(a)/ "with me")

معليش /ma`le:š/ "never mind, it's okay"

معليه /ma`le:h, ma`le:/ "never mind, it's okay"

معناة /ma`na:t-/ "meaning" (with following clitic: معناته /ma`na:to/ "its meaning")

مليح /mli:H/ "good"

منو /manu:, minu:/ "who" (in both direct and indirect questions):
      منو هي؟ /minu: hiy:e/ "who is she?"
      منو معي؟ /minu: ma`i/ lit. "who is with me?" (on the phone line), i.e., "who am I speaking with?"

مني /mini:/ "who" (rare fem. variant of منو /manu:, minu:/; see above)

منيح /mni:H/ "good"

مية /miy:e/ "one hundred" (/mi:t/ in const. case)

مين /mi:n/ "who"

نحنا /niHna/ "we"; cf. إحنا /?iHna/

نص /nuSS/ "half"

نيالـ /niya:l-/ "how good for" (نيالك /niya:lak, niya:lik/, نياله /niya:lo/, نيالكوا /niya:lku/)

هـ /ha/ (with following definite article: هالـ /ha:l/) "this" (هالكتاب /hal-kta:b/ "this book"); with preposition: /li-ha-d-daraže/ لهالدرجة

هاذ /ha:ð/ "this (masc.sg.)"

هالو /ha:lu:, ha:lo:/ "hello" (in telephone conversations only)

هانا /ha:na/ (also هان /ha:n/) "here" (Bedouin speech)

هاي /ha:y/ "this (fem.sg.)"

هذاك /haða:k/ "that"

هذول /haðo:l, hado:l/ "those; these"

هذيك /haði:k/ "that" (rare fem.: هذيكة /haði:kt/ "that")

هسع /his:a`/ "still, yet; now, right now"

هلق /halla?/ "now"

هنيك /hune:k, hne:k/ "there, over there"

هون /ho:n/ "here"

هيدي /haydi/ "this" (fem.sg.)

هيـ /hay:-/ "there!" (with following pronoun suffix: هيو /hay:o/ "there he/it is!", هيها /hay:ha/ "there she/it is!", هيهم /hay:hum/ "there they are!")

هيك /he:k/ "like this/that" (also هيكة /he:ka/ and هيكي /he:ke/)

وإلا /wa?il:a/ "or" (also والا /wil:a/). Note the difference between ولا /wala/ ("neither...nor", as in ولا بيهش ولا بينش) and والا /wil:a/ ("otherwise", as in اسكت والا بضربك). Follow the hamza transcription rule here: if you hear /wa?il:a/ write وإلا, and if you hear /wil:a/ write والا.

والله /wal:a/ "I swear; really?" (cf. بالله)

وين /we:n/ "where" (cf. فين)

يالله /yal:a/ "c'mon! let's...!" (not the same as the two-word phrase يا الله /ya: ?al:a:h/ "oh my goodness!")

يللي /yal:i:/ "who; which" cf. اللي /?il:i:/


PRACTICAL TIPS ON HOW TO APPLY THE GUIDELINES

In practical terms, for each item the transcriber will ask the following question: is the item on the list of frequent colloquial items? If Yes, follow the orthography used in the list; if No, follow the orthography of the MSA parallel form, with any necessary modifications (e.g., to reflect the morphology of colloquial verbs). The transcriber should be able to justify each transcription decision by showing either that the transcribed item parallels the MSA form or that it is among the high-frequency dialectal items listed in Appendix A. Any new dialectal item that is identified will be added to this list. The following examples illustrate a rigorous application of the Guidelines.

Utterance: /muš Hakeitlak ?inno fi: na:s ?ikti:r rayHi:n `as:inama/
Transcription: مش حكيت لك إنه فيه ناس كثير رايحين عالسينما؟

/muš/: dialectal item listed in Appendix A (مش)

/Hakeitlak/: the parallel MSA form is /Hakayt laka/ حكيت لك

/?inno/: the parallel MSA form is /?innahu/ إنه

/fi:/: dialectal item listed in Appendix A (فيه)

/na:s/: the parallel MSA form is ناس

/?ikti:r/: the parallel MSA form is كثير

/rayHi:n/: the parallel MSA form is رايحين

/`as:inama/: the stem /sinama/ is transcribed as سينما because this is the parallel MSA form; the clitic /`a-/ عـ is a dialectal item listed in Appendix A
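
The decision procedure described in this section (check Appendix A first, otherwise fall back on the MSA parallel form) can be sketched as a two-step lookup. This is a minimal illustration under stated assumptions: the two tables hold only the items from the worked example above, the table of MSA parallels is a hypothetical stand-in for the transcriber's own judgment, and the clitic-plus-stem case (عـ with سينما) is left out for brevity.

```python
from typing import Dict, Tuple

# Small excerpt of Appendix A, keyed by the phonemic forms used in the example.
APPENDIX_A: Dict[str, str] = {
    "muš": "مش",
    "fi:": "فيه",
}

# Hypothetical table of MSA parallel spellings; in practice this step is the
# transcriber's judgment, not a finite list.
MSA_PARALLEL: Dict[str, str] = {
    "Hakeitlak": "حكيت لك",
    "?inno": "إنه",
    "na:s": "ناس",
    "?ikti:r": "كثير",
    "rayHi:n": "رايحين",
}


def transcribe_item(item: str) -> Tuple[str, str]:
    """Return (spelling, justification) for one item of an utterance."""
    if item in APPENDIX_A:        # Step 1: listed high-frequency dialectal item
        return APPENDIX_A[item], "Appendix A"
    if item in MSA_PARALLEL:      # Step 2: follow the MSA parallel form
        return MSA_PARALLEL[item], "MSA parallel"
    # Neither source applies: flag the item as a candidate for the list.
    raise LookupError(f"{item!r}: no entry; candidate for addition to Appendix A")


if __name__ == "__main__":
    for item in ["muš", "Hakeitlak", "?inno", "fi:", "na:s", "?ikti:r", "rayHi:n"]:
        spelling, reason = transcribe_item(item)
        print(f"/{item}/ -> {spelling} ({reason})")
```

Returning the justification alongside the spelling mirrors the requirement that every transcription decision be traceable to one of the two sources.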