Word-level information for Arabic text analysis

In past Treebanking efforts, two different kinds of information have been produced, typically by two different (human and computer) processes.

The first kind of information is often called "part of speech tagging". It divides the text into lexical tokens, and gives relevant information about each token. For each word, we want at a minimum to identify its lexical category (noun, verb etc.) and inflectional features (plural, past tense etc.). We might also identify some quasi-semantic features (proper noun) or even specify a word sense relative to some lexicon.

The second kind of information is the treebanking proper -- it characterizes the constituent structure of the word sequences, and provides a category for each non-terminal node. It may identify null elements such as traces, null pronouns and so on.

This page is about how to provide word-level information for Arabic treebanking, in the context of the DARPA TIDES project. It was written by Mark Liberman on Nov. 3, 2001. I've presented it as a web page so that various supporting documents can more easily be referenced by readers.

At an October 1 workshop at UPenn, we agreed to consider two alternatives. One is to build the word-by-word tagging of texts on the analyses provided by Tim Buckwalter's Arabic dictionary. The other is to use the lexical information implicit in Apptek's translation tool. In either case, the method would be to present the alternative analyses for each word, along with the sentential context, to native-speaker annotators for disambiguation. Past experience suggests that when the set of alternatives is simple and clearly presented by a well-designed tool, annotators can work very fast, on the order of 1,000 words per hour.
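As a rough illustration of the disambiguation step described above, here is a minimal Python sketch of how one word might be shown to an annotator: the word in its sentential context, followed by a numbered list of bundled alternatives. The function names, the window size, and the sample analyses (the familiar ktb ambiguity between "he wrote" and "books") are my own illustrative assumptions, not the output of either system or the design of any existing tool.

```python
def format_prompt(tokens, index, alternatives, window=3):
    """Render one prompt: the target word in context, then numbered choices.

    tokens       -- the sentence as a list of word strings
    index        -- position of the word to disambiguate
    alternatives -- bundled analyses as display strings (hypothetical format)
    window       -- context words shown on each side (an assumed default)
    """
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    context = " ".join(left + ["[" + tokens[index] + "]"] + right)
    choices = [f"  {n}. {alt}" for n, alt in enumerate(alternatives, start=1)]
    return "\n".join([context] + choices)


def annotation_hours(word_count, words_per_hour=1000):
    """Estimate annotator time at the rough rate quoted in the text."""
    return word_count / words_per_hour


if __name__ == "__main__":
    # Placeholder tokens; the analyses below are illustrative only.
    tokens = ["w1", "w2", "ktb", "w4"]
    alternatives = ["kataba / VERB / he wrote", "kutub / NOUN / books"]
    print(format_prompt(tokens, 2, alternatives))
    print(annotation_hours(1000), "hour(s) for 1,000 words")
```

The point of the sketch is only that each choice is a single bundled line, so the annotator picks a number rather than assembling an analysis from separate fields.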

At the October 1 meeting, we agreed to compare the outputs of these systems on about 1,000 words of sample text, as a basis for making a decision. I sent out text samples on 10/8, and analyses came in over the following three weeks or so. After examination of the results, a provisional decision has been reached to use Buckwalter's analysis as a basis, for reasons discussed below.

Both Apptek and Buckwalter sent in different sorts of analysis at different times. I've identified analyses by the date I received them for clarity of reference.

Arabic Text Examples

These are some Arabic-language files taken from the TREC Arabic dataset. They come from the AFP newswire for 1994 and 1995.

File ID                 Words   Location
________________________________________________
19941020_AFP_ARB.0025   160   arabic_text1.html
19941021_AFP_ARB.0111   203   arabic_text2.html
19950115_AFP_ARB.0114   165   arabic_text3.html 
19950220_AFP_ARB.0125   279   arabic_text4.html
19950303_AFP_ARB.0069   292   arabic_text5.html

The TREC files (whose Arabic content is UTF-8) have been placed in a simple html wrapper:


<html dir="RTL" lang="AR">
<head>
<title>Arabic Text Example</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
ARABIC UTF-8 FILE GOES HERE
</body>
</html>
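
A wrapper of this kind could be generated along the following lines. This is a minimal Python sketch, not the script actually used for these pages; the function name is hypothetical, and the decision to escape markup characters in the source text is my assumption.

```python
import html

# The wrapper shown above, with placeholders for the title and the text.
WRAPPER = """<html dir="RTL" lang="AR">
<head>
<title>{title}</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
{body}
</body>
</html>
"""


def wrap_arabic_text(text, title="Arabic Text Example"):
    """Return an HTML page containing the given (already decoded) text.

    Characters such as < and & are escaped so that any markup in the
    source file is displayed rather than interpreted (an assumption).
    """
    return WRAPPER.format(title=html.escape(title), body=html.escape(text))


def wrap_file(src_path, dst_path):
    """Read a UTF-8 text file and write the wrapped page, also as UTF-8."""
    with open(src_path, encoding="utf-8") as src:
        page = wrap_arabic_text(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(page)
```

Line breaks in the source text are passed through unchanged, so a browser will reflow them as ordinary HTML whitespace.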

Buckwalter morphology/POS Annotation

Buckwalter produced two analyses, one on 10/15, and the second on 10/25. The second version is improved by having part-of-speech information, which was originally only implicit in the lexicon, made explicit.

Tim's first-cut annotations, coming from his dictionary as it was on 10/15, are to be found in this directory in the files


19941020.win1256.txt.POS  buckwalter_19941020.html
19941021.win1256.txt.POS  buckwalter_19941021.html
19950115.win1256.txt.POS  buckwalter_19950115.html
19950220.win1256.txt.POS  buckwalter_19950220.html
19950303.win1256.txt.POS  buckwalter_19950303.html

The files in the left column are plain text files with Arabic inputs represented in (I believe) Windows Code Page 1256 character encoding. The corresponding files in the right column are presented with a simple html wrapper, which should make them correctly viewable in IE 5.5 or 6. In the analysis, the Arabic is presented in Tim's transliteration, which is explained here. Tim's explanation of the alternative analyses and their presentation is here.
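
If the encoding is indeed Code Page 1256, the files can be converted to UTF-8 with Python's built-in "cp1256" codec. The sketch below assumes that encoding; the function names and paths are hypothetical.

```python
def cp1256_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes from Windows Code Page 1256 to UTF-8.

    Assumes the input really is CP1256; bytes that are undefined in that
    code page would raise UnicodeDecodeError.
    """
    return raw.decode("cp1256").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert one file on disk (paths are supplied by the caller)."""
    with open(src_path, "rb") as src:
        converted = cp1256_to_utf8(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)


if __name__ == "__main__":
    # Round-trip check on three Arabic letters (kaf, teh, beh).
    sample = "\u0643\u062a\u0628"
    raw = sample.encode("cp1256")
    print(len(raw), "bytes in CP1256,", len(cp1256_to_utf8(raw)), "bytes in UTF-8")
```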

A little later, Tim modified his lexicon to move part-of-speech tags from a comment field or an implicit grouping of stems to an explicit field in each entry. He also presented his alternative analyses in xml and provided a simple .css file so that the xml is conveniently viewable in any compliant browser/OS combination in this sample:

19941020.win1256.xml 

(This works for me in W2000/IE5.5, but fails in W98/IE6. YMMV. You can always download the .xml file and view it locally.)

Output analysis of this general form can be provided automatically, without further human intervention, for any amount of Arabic input text.

Apptek POS/Constituency Annotation

Here is a document (in .doc form) explaining Apptek's approach to word tagging and name finding.

Here are some immediate constituent analyses, in which pre-terminal categories (parts of speech) can be found, but not detailed morphological features. A single analysis (believed to be correct) is produced, rather than a set of possibilities. These analyses were provided by Paul Roochnik on 11/17; as I understand it, they represent the output of a process of human interaction with (aspects of) Apptek's translation tool. I gather that the amount of interaction required is significant. Paul did the analysis in the time he could spare from his more-than-full-time job at Princeton -- his labors are much appreciated -- and then wrote: "It's not everything. It's the best my colleague and I could do. We can continue, but not tonight and not tomorrow."

The files are supplied in two forms: one is an Excel spreadsheet (the form that Paul sent), and the other is a corresponding html file produced by Excel (on my computer).

apptek_19941020.xls     apptek_19941020.html
apptek_19941021.xls     apptek_19941021.html

Here are some further sentence-by-sentence analyses provided by Paul on 10/23:

sentence1Xdag.txt     sentenceX3Xdag.txt 
sentenceX5Xdag.txt    sentenceX2Xdag.txt 
sentenceX4Xdag.txt

Here is a guide to the notation used.

Mudar Yaghi has sent me some xml files representing the output of the word tagging and name finding components of their tool. Again, I believe that some human interaction goes into producing these. Mudar sent two versions, one on 10/23 and one on 10/31; the second release includes some additions and corrections.

The word tagging outputs are:

10/23:
19941020_AFP_ARB.0025.txt   19950115_AFP_ARB.0114.txt   19950303_AFP_ARB.0069.txt
19941021_AFP_ARB.0111.txt   19950220_AFP_ARB.0125.txt

10/31:
19941020_AFP_ARB.0025.xml   19950115_AFP_ARB.0114.xml   19950303_AFP_ARB.0069.xml
19941021_AFP_ARB.0111.xml   19950220_AFP_ARB.0125.xml

The name-tagged files are:

10/23:
19941020_AFP_ARB.0025.html 19950220_AFP_ARB.0125.html 19941021_AFP_ARB.0111.html 
19950303_AFP_ARB.0069.html 19950115_AFP_ARB.0114.html


10/31:
19941020_AFP_ARB.0025.html   19950115_AFP_ARB.0114.html   19950303_AFP_ARB.0069.html
19941021_AFP_ARB.0111.html   19950220_AFP_ARB.0125.html

Comparison of outputs

Tim Buckwalter originally expressed the opinion that the Apptek system provided a richer characterization of lexical features than his lexicon did. However, on inspecting the outputs, he concluded that his lexicon did a better or more complete job on the test passages in some cases, and provided these examples.

Discussion

(This is an expression of opinion by the author, based on current understanding.)

The level of analysis provided by Tim Buckwalter's lexicon is just about what we want, from the point of view of supporting further processing. In addition to part-of-speech, it also identifies a dictionary lemma and the morphosyntactic features associated with its inflections and enclitics, if any; and it also provides one or more English glosses. [For ease of reference I'll call this kind of annotation MPG (morphology/part-of-speech/gloss) tagging.] The lemma ID, inflectional characterization, and glosses will be useful both for human annotators and for parsers, and provide about the level and type of lexical information that parsers trained on Penn-Treebank-style annotation will want.
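
To make the shape of an MPG annotation concrete, here is one way it might be represented in Python. This is a sketch under my own assumptions: the class names, field names, and example tags are illustrative and are not taken from Tim's lexicon or its output format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MPGAnalysis:
    """One candidate analysis: lemma, part of speech, features, glosses."""
    lemma: str
    pos: str
    features: List[str] = field(default_factory=list)
    glosses: List[str] = field(default_factory=list)


@dataclass
class TokenAnnotation:
    """A word with its alternative analyses and the annotator's choice."""
    surface: str
    alternatives: List[MPGAnalysis] = field(default_factory=list)
    chosen: Optional[int] = None  # index into alternatives, set by a human

    def needs_disambiguation(self) -> bool:
        return self.chosen is None and len(self.alternatives) > 1

    def selected(self) -> Optional[MPGAnalysis]:
        """Return the chosen analysis, or the only one if unambiguous."""
        if self.chosen is not None:
            return self.alternatives[self.chosen]
        if len(self.alternatives) == 1:
            return self.alternatives[0]
        return None


if __name__ == "__main__":
    # Illustrative values only, not actual lexicon entries or tag names.
    token = TokenAnnotation("ktb", [
        MPGAnalysis("katab", "VERB", ["PERFECT", "3MS"], ["write"]),
        MPGAnalysis("kitAb", "NOUN", ["PLURAL"], ["books"]),
    ])
    print(token.needs_disambiguation())
    token.chosen = 1
    print(token.selected().glosses)
```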

Although the level of detail is substantially greater than that used in the Penn Treebank, Tim's presentation of simple bundled alternative analyses, combining full vowelization, morphological tags, and glosses, has the right properties to permit rapid and efficient human disambiguation. The additional details (resulting in a much larger effective tag set) are motivated in the case of Arabic, and the technology for dealing with tag sets of this kind is now well understood. The current OOV ("out of vocabulary") rate seems to be low enough that human provision of missing words will not be a large problem.
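
For concreteness, the OOV rate referred to here can be taken as the fraction of tokens for which the analyzer returns no analysis at all. A minimal Python sketch of that calculation follows; the function name and input format are assumptions, and no measured figures are implied.

```python
def oov_rate(analysis_counts):
    """Fraction of tokens with zero candidate analyses.

    analysis_counts -- one integer per token: how many alternative
                       analyses the lexicon returned for that token.
    """
    if not analysis_counts:
        return 0.0
    missing = sum(1 for count in analysis_counts if count == 0)
    return missing / len(analysis_counts)


if __name__ == "__main__":
    # Made-up counts for four tokens; one token has no analysis.
    print(oov_rate([2, 0, 1, 3]))
```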

Tim's system is already set up so that a large block of Arabic text can be processed, without further human intervention, to produce a large block of output, as rapidly as the computer analysis can operate (a few hundred to a thousand words a second, I believe).

The Apptek "word tagger" also provides MPG information, and indeed aims to provide a variety of more ambitious features. The organization of this information is considerably more complex, and would have to be simplified before human annotators could efficiently disambiguate it. Apptek's processing is aimed towards interactive development of a translation, and it does not seem to be currently set up to provide a simple but automatic listing of alternative MPG analyses on a batch basis. In addition, the sample analysis suggests that the coverage and accuracy of Apptek's lexicon may be lower than Buckwalter's.

Before using Apptek's word tagger as the first stage in an MPG tagging project for Arabic, we would need to determine what is required to get it to produce fully-automatic lists of alternative lexical analyses quickly and efficiently; and we would also have to figure out how to re-organize the presentation of alternative analyses for efficient human disambiguation. I don't have enough information to estimate the size of these tasks -- but the fact that they haven't been done yet, in order to present the material for the consideration of this group, suggests that the amount of time and effort involved is not trivial.

Since Buckwalter's lexical analyses are suitable for our needs, since they are available right away, since their quality appears to be at least as high as if not higher than Apptek's, and since we would like to move forward without unnecessary delay, I propose (subject to a brief further discussion) that we do MPG tagging on the basis of his system's analyses.