(068) previous ~ index ~ next

To: Mark Liberman <myl@unagi.cis.upenn.edu>
From: Rich Schwartz <schwartz@bbn.com>
Subject: Re: Two thoughts about rapid development of cross-language capabilities in resource-poor new languages
Date: Tue, 30 Mar 1999 12:59:17 -0500 (EST)

Mark,

I'm glad to hear you talk about the cross-lingual problem for many
languages as a problem that we should consider solving with "just work".
(I've had similar thoughts for a long time on the issue of creating
capabilities within each language for hundreds of languages.)

Basically, if we focus some effort - as you suggest - on seeing
how quickly we can create basic resources for a language (in this case
English to/from Other language), then it's very likely that we can come up
with a process that gets us to a modest but quite usable level with a very
small amount of time and money. This is in marked contrast to the very
large amounts of time and money that COULD be (and often have been)
spent on a language.

While some might consider it technically less interesting to solve
these problems the old-fashioned way, I think we owe it to the community
to consider these approaches, with the work focused on how to do it
very efficiently (as you've suggested).

As always, I believe strongly in controlled experiments. So for this
question in particular, I would say that we should compare the
performance of several obvious systems on a language (like Mandarin) for
which we will have all of the options:

1. Systran translator

2. Extensive bilingual word and phrase dictionary with probabilities

3. Rapidly created bilingual dictionary (as you've described)

4. Automatically created bilingual dictionary from large parallel
or comparable corpora.

5. Resources created from minimal data of some sort as yet unspecified
for low-density languages where we can't find any people to make #3,
and we only have small corpora for creating #4.


I think we all believe that #2 should be the best in the long run.
But is #1 almost as good?
How close can #3 get to #2 with n weeks of effort?
Is #3 (small manual dictionary) more expensive than #5 (limited parallel
resources)?
etc., etc.


In any case, I think these are the right questions to ask.

--Rich





Last updated Thu May 13 09:28:22 1999