(076) previous ~ index ~ next

To: Ralf Brown <Ralf_Brown@v.gp.cs.cmu.edu>
From: Mark Liberman <myl@unagi.cis.upenn.edu>
Subject: Re: Two thoughts about rapid development of cross-language capabilities in resource-poor new languages
Date: Wed, 31 Mar 1999 14:22:16 EST

>>As for data point #4, we've worked hard to identify some relevant
>>Mandarin/English bitexts. We've found a reasonable amount (Hong Kong
>>Hansards, Hong Kong Information Services Department News, and Hong
>>Kong legal code). The total is less than what IBM used in their French
>>system (about half the size?), but it should be enough for some
>
>If you really have ~300 megs of bitext (IBM Hansards are 700 megs),
>that's an order of magnitude more than the DIPLOMAT project has been
>using to build EBMT systems.... If I could get access to a
>sentence-aligned version (even a 95%-correct automatically-aligned
>version), preferably with the Mandarin side word-segmented, I'd
>volunteer to build a translator -- it would only take me a few days.
>We'd also get a substantial probabilistic dictionary as a by-product;
>the IBM Hansards yielded a 53k-word English-French and a 62k-word
>French-English dictionary.
>
> Ralf

I was a little too optimistic, because I remembered the size of the
IBM Hansards as 2 million sentence pairs. Instead they are about 2.9
million sentence pairs; so if the sentences are 20 words long on
average, this would be something like 2.9M * 20 = 58M words on each
side.

What we have (thanks to the enterprise of Xiaoyi Ma over the past year or two)
is (all counts on the English side):

Hong Kong Law Code:    8M words
Hong Kong Gov't News:  6M words (growing at about 600K words per month)
Hong Kong Hansard:     6M words (maybe more - MS Word files yet to be converted)
_______________________________
Total:                20M words

So it's more like a third than a half, when my faulty memory is corrected.
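
As a quick check of the arithmetic above, here is a minimal Python sketch.
The variable names are illustrative only, and the 20-words-per-sentence
figure is the rough assumption made above, not a measured value.

```python
# Rough size comparison; all figures are the approximate ones quoted above.
ibm_sentence_pairs = 2_900_000
assumed_words_per_sentence = 20  # assumption, not a measurement
ibm_words = ibm_sentence_pairs * assumed_words_per_sentence

# English-side word counts for the Hong Kong bitexts.
hk_words = {
    "Hong Kong Law Code": 8_000_000,
    "Hong Kong Gov't News": 6_000_000,
    "Hong Kong Hansard": 6_000_000,
}
hk_total = sum(hk_words.values())

ratio = hk_total / ibm_words
print(f"IBM Hansards: ~{ibm_words // 1_000_000}M words per side")
print(f"Hong Kong bitexts: ~{hk_total // 1_000_000}M words (English side)")
print(f"Ratio: {ratio:.2f}")
```

The ratio comes out just over one third, and the monthly growth of the
Gov't News collection would raise it slowly over time.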

Still, this may be enough to do something useful.

Some factual issues, which may take you a few extra days to resolve :-) ...

(1) The texts are not aligned yet. However, they appear to be strictly
sentence-to-sentence translations, always or almost always, so
sentence alignment should be easy.

(2) The Mandarin is Big5-encoded. Big5->GB mapping has some problems (though
fewer than GB->Big5), specifically quite a few characters with no
correspondence, mainly in names.

(3) The Mandarin is not segmented into words (as is normal for
Chinese-character text, of course). I presume that your algorithms assume
that word divisions are known? So you would have to add segmentation to your
model, or else borrow a good segmenter.
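
To make the three issues above concrete, here is a rough Python sketch of
one way each step might be approached. None of this is existing LDC or
DIPLOMAT code: the function names, the two-entry traditional-to-simplified
table, the length-ratio thresholds, and the toy lexicon are all
hypothetical, and greedy maximum matching is only a simple stand-in for a
good segmenter. The text is assumed to have been decoded from Big5 already
(e.g. with raw_bytes.decode("big5")).

```python
# Illustrative only: tables, thresholds, and lexicon below are made up.

# (1) Pair sentences by position and flag pairs with an odd length ratio.
def pair_and_flag(english, chinese, low=1.0, high=8.0):
    """Return (pairs, suspect_indexes), assuming strict 1-to-1 translation."""
    if len(english) != len(chinese):
        raise ValueError("sentence counts differ; 1-to-1 pairing does not apply")
    pairs = list(zip(english, chinese))
    suspects = []
    for i, (en, zh) in enumerate(pairs):
        ratio = len(en) / max(len(zh), 1)  # English chars per Chinese char
        if not low <= ratio <= high:
            suspects.append(i)
    return pairs, suspects


# (2) Map traditional characters to forms that GB2312 can encode,
#     collecting the characters that have no correspondence.
TRAD_TO_SIMP = {"國": "国", "會": "会"}  # toy table; a real one is far larger

def to_gb_text(text, table=TRAD_TO_SIMP, placeholder="?"):
    out, missing = [], []
    for ch in text:
        mapped = table.get(ch, ch)
        try:
            mapped.encode("gb2312")  # raises if GB2312 lacks the character
        except UnicodeEncodeError:
            missing.append(ch)
            mapped = placeholder
        out.append(mapped)
    return "".join(out), missing


# (3) Greedy forward maximum matching against a word list.
def max_match(text, lexicon, max_len=4):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


pairs, suspects = pair_and_flag(["The Council met."], ["立法會舉行會議。"])
gb_text, missing = to_gb_text("香港國鎔")
words = max_match("香港的政府", {"香港", "政府"})
print(suspects, gb_text, missing, words)
```

The character list returned by to_gb_text is the useful part for issue
(2): it shows which characters, mostly from names, would need special
handling before the Mandarin side could be used in a GB-based system.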

Of course the vocabulary and the style will not be a particularly good
fit to the TDT-3 task, but such is life.

The good news is that we now have permission to distribute the data
(well, I'm not sure that the Hansards permission has arrived yet, but
it is expected), so you can give it a try. We'll put the data up on
our http://www.ldc.upenn.edu/Projects/Chinese web site as soon as
the permissions issues are fully resolved.

-Mark Liberman


Last updated Thu May 13 09:28:23 1999