To: Rich Schwartz <schwartz@bbn.com>
From: Mark Liberman <myl@unagi.cis.upenn.edu>
Subject: Re: Two thoughts about rapid development of cross-language capabilities in resource-poor new languages
Date: Wed, 31 Mar 1999 11:16:51 EST
A couple of comments on Rich's comments (taking them somewhat out of order):
> As always, I believe strongly in controlled experiments. So in
>particular in this question, I would say that we should compare the
>performance of several obvious systems on a language (like Mandarin) for
>which we will have all of the options:
>
>1. Systran translator
>
>2. Extensive bilingual word and phrase dictionary with probabilities
>
>3. Rapidly created bilingual dictionary (as you've described)
>
>4. Automatically created bilingual dictionary from large parallel
> or comparable corpora.
>
>5. Resources created from minimal data of some sort as yet unspecified
> for low-density languages where we can't find any people to make #3,
> and we only have small corpora for creating #4.
(At least part of) this was exactly what I had in mind for Mandarin TDT.
We will obviously have data point #1 anyhow. It has a big random
component, since it just samples one case: Systran
Mandarin-->English. But eventually we'll have data points comparing
some other translation systems for other languages.
No one now has the basis for data point #2, as far as I know. It would
be great to try this, but we can probably only approximate it, as
discussed below.
I hope to try #3 -- the first time around, it will be a bigger chore
(since we need to create the "template"), but it will still be a
pretty cheap test.
As for data point #4, we've worked hard to identify some relevant
Mandarin/English bitexts. We've found a reasonable amount (Hong Kong
Hansards, Hong Kong Information Services Department News, and Hong
Kong legal code). The total is less than what IBM used in their French
system (about half the size?), but it should be enough for some
experiments. It may also be possible to use independent Mandarin and
English news service materials from comparable periods, e.g. to
identify proper name transliterations automatically. This connects to
the plans for the next JHU summer workshop, which might produce a
result that could serve for a data point in this category. We could
also use software recently produced here at Penn (by Dan Melamed and
others) to produce candidate lists of this type. From what I've seen
of such results, I suspect that (unless VERY much larger bitexts are
available) it may be best seen as a component of approach #3, i.e. as
a source of candidate translation-equivalents, and a way to estimate
probabilities. But as always, the experimental result is what counts.
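To make the "source of candidate translation-equivalents" idea concrete, here is a minimal sketch in Python. It is not the Penn software mentioned above, nor any particular published method: it simply ranks target words by a Dice co-occurrence score over aligned segment pairs. The function name, the min_count threshold, and the input format (pre-tokenized, pre-aligned segments) are all assumptions made for illustration, and the score is a crude association measure rather than a true translation probability.

```python
from collections import Counter
from itertools import product


def candidate_translations(bitext, min_count=2):
    """Rank candidate translation-equivalents by Dice co-occurrence.

    bitext: iterable of (source_tokens, target_tokens) pairs, assumed to be
    already aligned at the segment level and already tokenized.
    Returns {source_word: [(target_word, dice_score), ...]}, best first.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        # Count each word type once per segment pair.
        src_types, tgt_types = set(src), set(tgt)
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        pair_count.update(product(src_types, tgt_types))

    scores = {}
    for (s, t), n in pair_count.items():
        if n >= min_count:  # hypothetical cutoff to drop chance pairings
            dice = 2 * n / (src_count[s] + tgt_count[t])
            scores.setdefault(s, []).append((t, dice))
    for s in scores:
        # Highest score first; ties broken alphabetically for stable output.
        scores[s].sort(key=lambda item: (-item[1], item[0]))
    return scores
```

In the spirit of approach #3, the ranked lists would be shown to a bilingual speaker for confirmation rather than used directly.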
Finally, your data point #5 -- "Resources created from minimal data of
some sort as yet unspecified for low-density languages where we can't
find any people to make #3, and we only have small corpora for
creating #4" -- I wonder how often this case really exists in a
practical sense. If a language has any reasonable number of speakers
-- and otherwise where would the operational data come from -- then I
assume that the resources of the U.S. Government can find and recruit
a few speakers to participate in an effort like #3.
Another way to construe this case, though, would be to focus on what can
be done quickly with a few scraps of existing documentation -- a small
print dictionary, a grammar book with some examples and a few
transcribed texts, etc. I'd prefer to see this as a supplement to #3,
especially for the case of languages with significant morphology,
where some sort of quick-and-dirty stem/affixes analyzer would
probably be needed to do much of anything, and again, one would like
to know how to build one very fast. I'd also cite again the idea
of helping the existing army of fieldworkers to produce data that
our community can assimilate.
>
> I think we all believe that #2 should be the best in the long run.
>But is #1 almost as good?
>How close can #3 get to #2 with n weeks of effort?
Very good questions, as always. I'd add the observation that in most
practical cases, where there will be extensive monolingual texts
available (if only in paper form to be OCR'ed, or in audio form to be
transcribed), there are some obvious things to try that move through
the space from #3 toward #2, or towards a full statistical MT system.
>Is it more expensive to do #3 (manual small dict) vs #5 (limited parallel
>resources)?
I think you mean #3 vs. #4, right? Anyhow, the zeroth-order economics
should be easy to estimate, if we knew what #3 really cost, since we know
what text-translation costs are per word, and how much bitext is
necessary to create a decent statistical translation lexicon.
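As a sketch of that zeroth-order comparison, in Python: the cost of #4 is just the bitext size times the per-word translation cost, to be set against whatever #3 turns out to cost. All names and inputs here are hypothetical; no actual figures are implied, and the costs can be in any single consistent unit.

```python
def bitext_cost(words_needed, cost_per_word):
    """Cost of commissioning enough translated text for approach #4."""
    return words_needed * cost_per_word


def cheaper_option(dictionary_cost, words_needed, cost_per_word):
    """Compare approach #3 (hand-built dictionary) with #4 (bitext).

    dictionary_cost: total cost of #3, which is the unknown quantity.
    words_needed: bitext size judged sufficient for a decent lexicon.
    cost_per_word: going rate for text translation.
    Returns (label, cost) for the cheaper option; ties go to #3.
    """
    parallel = bitext_cost(words_needed, cost_per_word)
    if dictionary_cost <= parallel:
        return ("#3", dictionary_cost)
    return ("#4", parallel)
```

This ignores quality differences between the two resulting lexicons, which is the part the experiments would have to settle.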
It might also be interesting to consider mixed strategies, where
phrases to be translated are specifically selected to help
with cases chosen on the basis of a bilingual glossary and
monolingual texts.
>
> In any case, I think these are the right questions to ask.
>
Just as important, a systematic program of trying to get answers,
along the lines that Rich laid out, will naturally lead to a new
series of questions, some of which may be more interesting than
any of the original ones.
Rich began with the comments:
> I'm glad to hear you talk about the cross-lingual problem for many
>languages as a problem that we should consider solving with "just work".
>(I've had similar thoughts for a long time on the issue of creating
>capabilities within each language for hundreds of languages.)
>
> Basically, if we focus some effort - as you suggest - on seeing
>how quickly we can create basic resources for a language (in this case
>English to/from Other language), then it's very likely that we can come up
>with a process that gets us to a modest but quite usable level with a very
>small amount of time and money. This is in marked contrast to the very
>large amounts of time and money that COULD be (and often have been)
>spent on a language.
>
> While some might consider it technically less interesting to solve
>these problems the old fashioned way, I think we owe it to the community
>to consider these approaches but with the work focussed on how to do it
>very efficiently (as you've suggested).
I agree with this, but would also like to emphasize that the "just work" methods
are not an alternative to "new fashioned" methods, but rather part of an
integrated approach that looks at every stage for the best combination
of methods given a particular set of goals and constraints.
--
-Mark Liberman
Last updated Thu May 13 09:28:22 1999