
To: tdt-distrib@unagi.cis.upenn.edu
From: Mark Liberman <myl@unagi.cis.upenn.edu>
Subject: Two thoughts about rapid development of cross-language capabilities in
Date: Tue, 30 Mar 1999 10:45:55 -0500

This is not a salvo in the current battle over TDT dry-run task
definitions, nor a general manifesto about right and wrong ways to do
multilingual TDT. Instead, I've been doing some thinking recently about
approaches to very rapid development of cross-language capabilities, for
new languages with meagre existing resources, and I'd like to get the
benefit of your collective experience and opinion on some of the
questions involved.

Specifically, I have two thoughts. First, we ought to develop concrete
specs, tools and procedures for very rapid construction of useful
bilingual glossaries. Second, we ought to find better ways to get access
to resources routinely created by linguists, anthropologists and so
forth.

FAST BILINGUAL GLOSSARIES. I have in mind the sorts of lists that are
used in one approach to CLIR -- a list of words in English, each
associated with a set of translation-equivalents in Language X; and the
same thing in the other direction. Add probabilities, and you're on the
way to a lexicon for statistical MT, but that isn't the immediate goal
here. Rather, the goal is rough-and-ready cross-language IR, and even
rougher word-by-word glossing as a (very) poor man's form of MT. How big
does such a list need to be to be useful? And how quickly could one be
built from a standing start, with no existing resources at all? My guess
is that "10K lemmas" is about the right answer to the first question
(i.e. for English, something like 15K word-forms); and that (if 10K is
right) "two weeks" is about the right answer to the second one.
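
To make the intended data structure concrete, here is a minimal sketch of such a list and of the word-by-word glossing it supports. The entries are invented placeholders (not real translations into any language), and the function name and the angle-bracket convention for unknown words are my own assumptions, not an established tool.

```python
from typing import Dict, List

# English word -> set of translation-equivalents in Language X.
# The "x_..." strings are placeholders, not real words of any language.
Glossary = Dict[str, List[str]]


def gloss(tokens: List[str], glossary: Glossary) -> List[str]:
    """Gloss each token with all of its listed equivalents.

    Tokens missing from the glossary are passed through in angle
    brackets so that a reader can see where coverage ran out.
    """
    out = []
    for tok in tokens:
        equivalents = glossary.get(tok.lower())
        if equivalents:
            out.append("|".join(equivalents))
        else:
            out.append(f"<{tok}>")
    return out


toy_glossary = {"water": ["x_water"], "river": ["x_river", "x_stream"]}
glossed = gloss(["River", "water", "Penn"], toy_glossary)
```

Keeping every alternative (joined with "|") rather than picking one is deliberate in this sketch: without probabilities there is no principled basis for choosing, and for IR purposes the extra alternatives act as query expansion.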

The first question -- how big does such a glossary need to be? -- is
susceptible to various sorts of empirical testing. For instance, suppose
that we sort all the words in a test corpus according to TF/IDF scores,
and then run e.g. TDT detection and tracking with vocabulary lists
clipped at various lengths (i.e. ignoring all words below the cut-off
point). How do scores degrade? There is no question that scores will
degrade, since rare proper names are certainly an important part of the
process (though some languages will be nice enough to preserve foreign
proper names in a recognizable form), but I presume that performance
will still be decent at plausible lexicon sizes.
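
The clipping step of that experiment could be sketched roughly as follows. This is only an illustration under stated assumptions: the scoring (term frequency times log inverse document frequency, summed over documents) is one of several TF/IDF variants, the three-document corpus is a toy, and the actual TDT detection and tracking runs at each cut-off are not shown.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def tfidf_ranking(docs: List[List[str]]) -> List[str]:
    """Rank word types by summed TF * IDF over the corpus, best first."""
    n_docs = len(docs)
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each word once per doc
    score: Dict[str, float] = defaultdict(float)
    for doc in docs:
        for word, tf in Counter(doc).items():
            score[word] += tf * math.log(n_docs / df[word])
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(score, key=lambda w: (-score[w], w))


def clip(docs: List[List[str]], ranking: List[str], size: int) -> List[List[str]]:
    """Drop every word below the cut-off point in the ranking."""
    keep = set(ranking[:size])
    return [[w for w in doc if w in keep] for doc in docs]


docs = [
    ["the", "quake", "hit", "the", "city"],
    ["the", "election", "results"],
    ["the", "quake", "relief"],
]
ranking = tfidf_ranking(docs)
clipped = clip(docs, ranking, size=6)
```

One would then run the same system on `clip(docs, ranking, size)` for a series of sizes and plot the score against vocabulary size.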

The second question -- how quickly could such a glossary be constructed?
-- depends crucially on how it is done. If maximum speed is the goal,
the logical approach is to prepare the English side carefully in
advance, along with a recipe for searching out key missing items on the
other side. For instance, the SIL folks (I think originally it was SIL
Cameroon) prepared a term list of about 2,500 items to be used in
building a quick dictionary of a new language. Suppose we had a similar
list of (say) 10K items, chosen for maximum IR relevance by e.g. sorting
on TF/IDF in a relevant training corpus. For ease and clarity of use, we
would probably want to include quite a few terms with internal white
space. Anyhow, suppose we then allocate two minutes to producing the
translation of each term. This is not enough time to do it perfectly --
but we'll just take what we can get at that rate. Then the labor time to
gloss 10K terms is about 333 hours, or roughly 10 person-weeks. Assuming a crew of
five translators (with suitable supervision and technical support), this
would take two weeks.
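
The arithmetic can be checked in a few lines. The 40-hour week is my assumption; the message does not state one. Under that assumption the labor comes to about 8.3 person-weeks, so the "10 person-weeks" figure leaves some slack (equivalently, it corresponds to roughly 33 productive hours per translator per week).

```python
TERMS = 10_000
MINUTES_PER_TERM = 2
TRANSLATORS = 5
HOURS_PER_WEEK = 40  # assumption: not stated in the message

labor_hours = TERMS * MINUTES_PER_TERM / 60    # about 333 hours
person_weeks = labor_hours / HOURS_PER_WEEK    # about 8.3 at 40 h/week
calendar_weeks = person_weeks / TRANSLATORS    # under two weeks for a crew of five

# Hours per week implied by the "10 person-weeks" figure.
implied_hours_per_week = labor_hours / 10

print(f"{labor_hours:.0f} hours, {person_weeks:.1f} person-weeks, "
      f"{calendar_weeks:.1f} calendar weeks")
```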

The English->X glossary would then be inverted to produce an X->English
glossary (with significant degradation, but anyhow). If other resources
are available (bilingual texts, dictionaries, monolingual texts) then
obvious things can be done to improve the results.
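
A minimal sketch of the inversion step, again with invented placeholder entries and a hypothetical function name. It also shows where the degradation comes from: each X word simply collects every English headword that listed it, so unrelated senses pile up, and X words that no translator happened to choose never appear at all.

```python
from collections import defaultdict
from typing import Dict, List


def invert(glossary: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn an English->X glossary into an X->English glossary."""
    inverted: Dict[str, List[str]] = defaultdict(list)
    for english, equivalents in glossary.items():
        for x_word in equivalents:
            if english not in inverted[x_word]:  # avoid duplicate entries
                inverted[x_word].append(english)
    return dict(inverted)


# Placeholder entries: "x_a" etc. stand in for words of Language X.
en_to_x = {
    "river": ["x_a", "x_b"],
    "stream": ["x_b"],
    "bank": ["x_a", "x_c"],
}
x_to_en = invert(en_to_x)
```

Here `x_a` ends up glossed as both "river" and "bank", which is the kind of noise one would hope to reduce using bilingual or monolingual text when it is available.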

Has this ever been tried? If not (and I suspect that it has not, at
least on the scale suggested), then an experiment or two seems to be in
order. TDT-3 would provide a useful test bed -- though Mandarin/English
is a particularly easy language pair to do quickly, because of the
relative lack of morphology on both sides. Assuming that this approach
is plausible at all, then there are dozens of associated research
topics, especially in the area of how to use this minimal cross-language
resource in various bootstrapping activities.

BETTER ACCESS TO EXISTING RESOURCES. Considering only the situation
within the U.S., there are more than a thousand
linguists/anthropologists/sociologists/historians/folklorists/missionaries
etc. who (mostly as a by-product of other activities) produce primary
materials on various of the world's languages. By "primary materials" I
mean transcribed speech recordings, lexicons and so on. Worldwide, the
number of people doing this sort of thing is very much larger. A very
small fraction of this material is published in paper form; some
additional modest fraction is deposited in library or museum
collections; most remains in dusty piles of notebooks, floppy disks and
tapes in offices and homes. The quality is variable but overall it is
pretty good, in my experience.

In most cases, the people doing this work would love to see it
published. The limited amount of current publication is not because the
authors are uninterested, but rather because it is not commercially
viable to publish most of the stuff, since the market is too small. In
principle, the internet changes this; the cost of distribution goes to
zero. And in fact we're starting to see quite a bit of net-publication
of such stuff.

However, serious barriers remain: adequate computer tools for
multilingual transcription/annotation coupled with lexicography don't
exist (though there are some noble attempts); and translation to web
publication form remains very difficult -- and paradoxically, it is
hardest for the output of the best existing basic tools. The problems
will eventually be solved, by the usual Brownian motion, but the current
rate of progress is distressingly slow in some key respects (details
available on request).

If our community is serious about learning to function in hundreds of
languages, then it seems to me to make a good deal of sense to find ways
to build bridges to the people who are already producing materials in
these languages. The best way to do this, in my opinion, is to invest in
standards, tools and examples that these folks can use to do a better
job of what they are trying to do anyhow -- with the side effect that
their work becomes accessible to us. One way to move in that direction
might be to include some linguistic researchers of this stripe as
partners in e.g. a TIDES project.

-Mark Liberman

Last updated Thu May 13 09:28:22 1999