To: Jon_Yamron@dragonsys.com
From: Rich Schwartz <schwartz@bbn.com>
Subject: Re: TDT3 dry run
Date: Fri, 26 Mar 1999 19:56:05 -0500 (EST)
Jon,
Your (and Larry's) suggestion sounds very good at least for the
following reason. Even if there were a goal to do the problem from raw
data, surely a baseline measurement for that would be to see how you do
with the sequential approach of translating first.
Even if the eventual goal were to do this on "low density
languages" for which there is no translation system, we would still want
to measure our approaches in comparison to how we would do if we DID build
a translation system. (I claim that any work on dealing with low density
languages or a limited amount of training data should always be done on high
density languages or with lots of training data so you can SIMULATE the bad
environment and then measure how close you can get in performance to the
ideal case.)
There can certainly be arguments about why you might do better
going from the raw data - like being able to translate each word in
multiple ways - but those are empirical questions and still need a
baseline to measure whether they worked, rather than someone just saying
that they do it this way and get this performance.
-------
But here's a question that may indicate I haven't been paying
attention. Were there going to be English translations of the Mandarin
materials produced? Or is there an automatic translation system for
Mandarin -> English? It doesn't have to be a great one - because we're
just taking this as a baseline to start.
--Rich
============================================================
On Fri, 26 Mar 1999 Jon_Yamron@dragonsys.com wrote:
> Date: Fri, 26 Mar 1999 13:45:27 -0500
> From: Jon_Yamron@dragonsys.com
> To: tdt-distrib@unagi.cis.upenn.edu
> Subject: TDT3 dry run
>
> There has been a fair amount of correspondence recently on the subject of
> Mandarin resources, but none of it addresses what is becoming a central
> concern of mine now that we are nearly a month past the Workshop: What is
> the dry-run evaluation, now less than 3 months away, going to look like?
>
> Larry Gillick, in an email not widely distributed, made the proposal that
> the English translations of the Mandarin material be prepared for use as
> evaluation material (i.e., tokenized files and boundary tables supplied).
> I excerpt from Larry's message on the advantages of this approach, with
> some comments of mine added in brackets:
>
> ---------------
>
> 1) It simplifies the engineering. No special steps need be taken to deal
> with Mandarin or indeed, any other language from which source material is
> translated into English. [Note also that Mandarin-English translation is
> not required; the entire evaluation can be done in English.]
>
> 2) It is better than a concordance or dictionary. This is because MT
> systems are much better than random at using context to figure out which
> word sense is implied when there is ambiguity. A concordance would be
> noncontextual and therefore much more prone to error. [Of course, the
> concordances located so far would be made available for those sites that
> chose to use the Mandarin directly.]
>
> 3) It factors the problem. Just as the transcription problem has been
> decoupled from the TDT problem, in an analogous manner, this suggestion
> decouples the translation problem from the TDT problem.
>
> 4) It doesn't rule out access to the Chinese transcription. Any site that
> wants to pursue algorithms that make use of the Mandarin directly could be
> free to do so, just as sites are free to directly use the original audio
> should they find that valuable, instead of relying solely on the ASR
> transcripts.
>
> 5) It fits in well with the TIDES program that Ron Larsen spoke of. As I
> understood DARPA's plans for this new program in cross-lingual information
> extraction, the assumption there was also that everything would be
> translated into a common language using MT.
>
> ---------------
>
> I am particularly enamored of (1). I think (3) is in the proper spirit of
> the TDT3 project (this is, after all, not a transcription or translation
> project, so using state-of-the-art systems to do those steps and allowing
> sites to concentrate on the TDT aspects makes sense), and I believe the
> flexibility of (4) accommodates approaches that wish to go to the "raw"
> data.
>
> Notice that this also makes irrelevant the issue of how to tokenize the
> Mandarin corpora.
>
> In any event, it is time to decide how the evaluation is going to work! I
> submit this as a basis for discussion, and welcome comments.
>
> - Jon
>
>
Last updated Thu May 13 09:28:20 1999