
To: Mark Liberman <myl@unagi.cis.upenn.edu>
From: Rich Schwartz <schwartz@bbn.com>
Subject: Re: TDT3 dry run
Date: Fri, 26 Mar 1999 20:38:21 -0500 (EST)

Mark,

On Fri, 26 Mar 1999, Mark Liberman wrote:

> > But here's a question that may indicate I haven't been paying
> >attention.
>
> Indeed, you haven't. While we have your attention, Rich, are you willing
> to provide a frequency-sorted list of English named-entities from TDT-2,
> to be used in extending the bilingual glossary?

Sure. We are planning on running TDT-2 to get named entities (I think for
the summer workshop) and then giving out the output for people to work on.
Once we have that, any statistics desired can be gathered.


> > Were there going to be English translations of the Mandarin
> >materials produced?
>
> Yes. As discussed earlier, the LDC is buying the Systran Mandarin->English
> system, and will be supplying translations of all the TDT-3 Mandarin
> material. I posted a sample translation of a VOA transcript earlier.
>
> However, I don't see what this really has to do with the task
> definition. As I understood George's description at the workshop,
> the tasks will be just as before, except that some of the data is
> Mandarin and some is English. If each site had its own copy of the
> Mandarin->English translation system, they could run it on either the
> unsegmented or segmented input, and use it as they please in training
> or testing their algorithms in each of the various task conditions.
>
> We'll help you save (a little) money and (more significant) time by
> running the system for everyone. It is worth discussing how we will
> present the data so as to signal the cross-language correspondence --
> One possibility would be external pointers that enable you to recover the
> corresponding story boundaries in the two text streams, for the case
> in which you are using veridical story boundaries.

That's great that a 'standard' translation would be provided.
Oh, and I don't think people should be spending a lot of effort trying to
find a better translator.
I would assume that the systems would know that this English document
originally came from Mandarin in cases they wanted to treat it differently
in some way.

But.......


> Since you cannot be prevented from buying your own copy of Systran
> Mandarin-->English and incorporating it into your system, we would aim
> to enable you to do anything you could do with that capability. But
> none of this would change the task definition in any way.

I agree that the task definition wouldn't change. It's just that there
would be this (virtual) resource of the Systran translator that I suspect
most people might try using.



> I think that the Dragon folks were making a more radical suggestion, namely
> that the Mandarin materials should be translated into English and then
> (from the perspective of the task definition) thrown away. Our local
> research effort would find that strategy an uninteresting one, but of
> course it is up to George and others to decide if the task definitions
> (especially of segmentation) should be changed to accommodate it.


I think you have this wrong. Dragon is NOT proposing that the Chinese be
discarded, any more than the speech audio was discarded for TDT2. The
Chinese will be there so that people can explore techniques for using
various kinds of bilingual dictionaries or other techniques to work
directly from the raw data.

Jon was just making the case that it's at least possible that you couldn't
do much better with the raw data than you could with the standard
translation, and that you might do better with the Systran translator
because it might do a better job of disambiguating senses. But even if
this weren't true, we owe it to ourselves to use the translation at least
as a baseline for comparison.

Now it's true that it might turn out that once you have the translation,
it might take a good bit of energy to overcome the inertia of just using
that for a while. I'm not sure if this is so bad at the start, as there
are lots of measurements to make. But I'm assuming that some of us will
have the ambition to try more.

--Rich

P.S. The reason I was suggesting that people not try to just do a 'better'
translation is that this is likely to be a small difference compared to
doing something more fundamental. This is just like TDT2, where no one
bothered trying to do better speech recognition than Dragon, because we
all knew that a 1 or 2 point difference in word error rate wouldn't
matter that much. Now if it turns out after a few years of trying that
the best thing we can do for this is just to translate, and if it also
turns out that the loss due to imperfect translation is large, then
someone might be tempted to just devote effort to better translation.

And then it might also be true at that time that the sponsors would tell
us not to do that because they wanted this mainly for languages without
translation systems (though that might be shooting themselves in the
foot). But we can see where that goes.

--Rich



Last updated Thu May 13 09:28:21 1999