(019) previous ~ index ~ next

To: Jon_Yamron@dragonsys.com
From: Mark Liberman <myl@unagi.cis.upenn.edu>
Subject: Re: Mandarin Resources and Large, Non-Punctual Topics
Date: Wed, 10 Feb 1999 12:32:02 EST


>I agree with Rich that the big problem is getting hold of some kind of
>bilingual concordance. This is made at least a little trickier in Mandarin
>because it interacts with the tokenization problem---basically, if we agree on a
>bilingual dictionary, we also must agree on a word list. In short, the word
>list the LDC uses in the tokenization of the Mandarin text data, the word list
>used in the Mandarin recognizer used to process the Mandarin broadcast sources,
>and the word list used in the English-Mandarin concordance, had better all be
>pretty similar.
>
>- Jon

During the CallHome data collections, we began the process of adding English
glosses to the Mandarin entries. Because it wasn't on the critical path
to that task, and because it would have taken a significant effort, we
stopped. However, it would be simple enough to do, especially if a
bilingual dictionary could be found to start the process and if the work
were restricted to (say) the 10K most common words.

We've found several bilingual dictionaries, one of which Xiaoyi Ma has used
with success for aligning bilingual Mandarin/English text from the web.

If there is interest, we could come up with an estimate for the amount of
labor involved, and the likely elapsed time.

Another resource that might be of interest would be bilingual text
harvested from the web. Xiaoyi's spider has been working on German
every night for a while, and has found and aligned a large quantity of
bitext. We could switch it to work on Mandarin instead (though the
pickings are slimmer there -- which is why we've been doing German).

-Mark Liberman

Last updated Thu May 13 09:28:14 1999