To: Rich Schwartz <schwartz@bbn.com>
From: Doug Oard <oard@glue.umd.edu>
Subject: Re: TDT3 dry run
Date: Fri, 26 Mar 1999 20:41:04 -0500 (EST)
Rich/Jon/Larry,
You make a good set of points - I've added a few comments below
that might help clarify a few things, but in general I agree that having
the required run be on MT data would make sense.
Doug
On Fri, 26 Mar 1999, Rich Schwartz wrote:
> Your (and Larry's) suggestion sounds very good at least for the
> following reason. Even if there were a goal to do the problem from raw
> data, surely a baseline measurement for that would be to see how you do
> with the sequential approach of translating first.
The sequential approach has been compared to more closely integrated
approaches in the context of cross-language IR fairly extensively already,
although only among western languages to date.
> There can certainly be arguments about why you might do better
> going from the raw data - like being able to translate each word in
> multiple ways - but those are empirical questions and still need a
> baseline to measure whether they worked, rather than someone just saying
> that they do it this way and get this performance.
The best answer to this question that we have so far from reported CLIR
research is that closely coupled approaches seem to do better when they
are able to train on data that is very similar to the evaluation data,
but that modular MT is a good choice from an effectiveness standpoint
when there is no comparable training data available. The TDT setup
disallows lookahead to the entire document
collection, suggesting that MT will probably perform reasonably well -
again viewed from an effectiveness standpoint. From an efficiency
standpoint, the closely coupled techniques are typically MUCH faster. And
that's a big deal in real applications - you want to train your topic
tracker to process in the language of the documents if you need to do high
volume, and then only translate the documents that are detected.
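As a rough illustration of the efficiency argument above, here is a minimal
sketch of "track in the document language, translate only what is detected".
Everything in it is an assumption made for illustration: the function name
track_then_translate, the toy term-overlap score, the threshold, and the
translate callable standing in for an MT system. It is not a description of
any site's actual tracker.

```python
from typing import Callable, Iterable, List, Set


def track_then_translate(
    docs: Iterable[List[str]],
    topic_terms: Set[str],
    translate: Callable[[List[str]], List[str]],
    threshold: float = 0.2,
) -> List[List[str]]:
    """Score each document in its own language; translate only the hits.

    `topic_terms` is a hypothetical set of source-language terms for one
    topic, and `translate` is a stand-in for a (slow) MT system.
    """
    translated_hits = []
    for tokens in docs:
        if not tokens:
            continue
        # Toy on-topic score: fraction of tokens that are topic terms.
        score = sum(1 for t in tokens if t in topic_terms) / len(tokens)
        if score >= threshold:
            # The expensive translation step runs only for detected documents.
            translated_hits.append(translate(tokens))
    return translated_hits
```

With a high-volume stream in which few documents are on topic, the MT step
in this arrangement is invoked only for the small detected fraction, which
is the cost saving the paragraph above is pointing at.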
> But here's a question that may indicate I haven't been paying
> attention. Were there going to be English translations of the Mandarin
> materials produced? Or is there an automatic translation system for
> Mandarin -> English? It doesn't have to be a great one - because we're
> just taking this as a baseline to start.
I believe that Systran is unidirectional, Chinese->English. That means
that in some cases training documents will be translated, and in others it
will be evaluation documents that were translated. This creates some
interesting problems for some experiment designs since (at least in IR)
same-language retrieval typically outperforms cross-language retrieval by
enough to make a difference when merging results.
> > Date: Fri, 26 Mar 1999 13:45:27 -0500
> > From: Jon_Yamron@dragonsys.com
> > To: tdt-distrib@unagi.cis.upenn.edu
> > Subject: TDT3 dry run
> >
> > Larry Gillick, in an email not widely distributed, made the proposal that
> > the English translations of the Mandarin material be prepared for use as
> > evaluation material (i.e., tokenized files and boundary tables supplied).
> > I excerpt from Larry's message on the advantages of this approach, with
> > some comments of mine added in brackets:
> >
> > ---------------
> >
> > 1) It simplifies the engineering. No special steps need be taken to deal
> > with Mandarin or indeed, any other language from which source material is
> > translated into English. [Note also that Mandarin-English translation is
> > not required; the entire evaluation can be done in English.]
Agree, although from the points above I would observe that it limits
somewhat the questions that you can answer. That's fine for the required
run, though, as long as (as Larry states below) sites that want to are able
to experiment with the Mandarin as well.
> > 2) It is better than a concordance or dictionary. This is because MT
> > systems are much better than random at using context to figure out which
> > word sense is implied when there is ambiguity. A concordance would be
> > noncontextual and therefore much more prone to error. [Of course, the
> > concordances located so far would be made available for those sites that
> > chose to use the Mandarin directly.]
The situation is - at least in theory - a bit more complex than this. MT
systems use a "lexicon" and some algorithm that encodes appropriate
linguistic knowledge. CLIR systems use a "dictionary" (and/or a training
corpus) and a set of algorithms that encode linguistic knowledge. In the
final analysis, the two approaches are both MT systems, but one is
optimized for human readability and the other is optimized for effective
retrieval. Focusing on the limitations of the existing dictionaries is
appropriate, of course, but thinking of a dictionary as something
inherently different from a lexicon would only serve to obscure the
commonality between the two problems. I say "at least in theory" because
the state of the art in CLIR is not yet as refined linguistically (and may
not ever need to be) as it is in MT. So we really don't yet know where
the natural dividing line between the two problems lies.
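To make the contrast concrete, here is a minimal sketch of the two styles of
using the same word list: keeping every listed translation (as dictionary-based
CLIR often does) versus committing to a single choice per word. The toy
French-English entries, the names TOY_DICT, translate_all_senses and
translate_first_sense, and the "take the first entry" rule are all
illustrative assumptions; a real MT system chooses among senses using context
rather than list order.

```python
from typing import Dict, List

# Toy bilingual word list, for illustration only.
TOY_DICT: Dict[str, List[str]] = {
    "avocat": ["lawyer", "avocado"],
    "banque": ["bank"],
}


def translate_all_senses(tokens: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Retrieval-oriented: keep every listed translation of each word."""
    out: List[str] = []
    for t in tokens:
        # Words missing from the dictionary are passed through unchanged.
        out.extend(dictionary.get(t, [t]))
    return out


def translate_first_sense(tokens: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Single-choice stand-in: commit to the first listed translation."""
    return [dictionary.get(t, [t])[0] for t in tokens]
```

The first function produces output no one would want to read but gives a
retrieval system every candidate term to match on; the second produces one
word per source word, and any wrong choice is unrecoverable downstream.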
> > 3) It factors the problem. Just as the transcription problem has been
> > decoupled from the TDT problem, in an analogous manner, this suggestion
> > decouples the translation problem from the TDT problem.
This strikes me as a potential disadvantage (except to the extent that it
amplifies point 1) since it may not be the right factorization. As an
example, it might be harder to recover from names that are missegmented
(and hence mistranslated) by Systran than it would be to implement good
name recognition in Chinese.
> > 4) It doesn't rule out access to the Chinese transcription. Any site that
> > wants to pursue algorithms that make use of the Mandarin directly could be
> > free to do so, just as sites are free to directly use the original audio
> > should they find that valuable, instead of relying solely on the ASR
> > transcripts.
A critical point. If this were not the case, working with Mandarin
wouldn't tell us much that we don't already know.
> > 5) It fits in well with the TIDES program that Ron Larsen spoke of. As I
> > understood DARPA's plans for this new program in cross-lingual information
> > extraction, the assumption there was also that everything would be
> > translated into a common language using MT.
My perception of TIDES is different - translating everything into a common
language might be workable in some applications covered by TIDES (e.g., if
sources were preselected to create a high density of relevant documents),
but it would likely not be suitable for applications with high volumes of
mostly useless stuff.
> > ---------------
> >
> > I am particularly enamored of (1), I think (3) is in the proper spirit of
> > the TDT3 project (this is, after all, not a transcription or translation
> > project, so using state-of-the-art systems to do those steps and allowing
> > sites to concentrate on the TDT aspects makes sense), and I believe the
> > flexibility of (4) accommodates approaches that wish to go to the "raw"
> > data.
> >
> > Notice that this also makes irrelevant the issue of how to tokenize the
> > Mandarin corpora.
Not for the sites that want to use Mandarin!
> > In any event, it is time to decide how the evaluation is going to work! I
> > submit this as a basis for discussion, and welcome comments.
While I have been typing this two more responses have come in ... looks
like you got your wish!
Doug
Last updated Thu May 13 09:28:21 1999