To: tdt-distrib@unagi.cis.upenn.edu
From: Jon_Yamron@dragonsys.com
Subject: TDT3 dry run
Date: Fri, 26 Mar 1999 13:45:27 -0500

There has been a fair amount of correspondence recently on the subject of
Mandarin resources, but none of it addresses what is becoming a central
concern of mine now that we are nearly a month past the Workshop: What is
the dry-run evaluation, now less than 3 months away, going to look like?

Larry Gillick, in an email not widely distributed, made the proposal that
the English translations of the Mandarin material be prepared for use as
evaluation material (i.e., tokenized files and boundary tables supplied).
I excerpt from Larry's message on the advantages of this approach, with
some comments of mine added in brackets:

---------------

1) It simplifies the engineering. No special steps need be taken to deal
with Mandarin or, indeed, any other language from which source material is
translated into English. [Note also that Mandarin-English translation is
not required; the entire evaluation can be done in English.]

2) It is better than a concordance or dictionary. This is because MT
systems are much better than random at using context to figure out which
word sense is implied when there is ambiguity. A concordance would be
noncontextual and therefore much more prone to error. [Of course, the
concordances located so far would be made available for those sites that
chose to use the Mandarin directly.]

3) It factors the problem. Just as the transcription problem has been
decoupled from the TDT problem, in an analogous manner, this suggestion
decouples the translation problem from the TDT problem.

4) It doesn't rule out access to the Chinese transcription. Any site that
wants to pursue algorithms that make use of the Mandarin directly could be
free to do so, just as sites are free to directly use the original audio
should they find that valuable, instead of relying solely on the ASR
transcripts.

5) It fits in well with the TIDES program that Ron Larsen spoke of. As I
understood DARPA's plans for this new program in cross-lingual information
extraction, the assumption there was also that everything would be
translated into a common language using MT.

---------------

I am particularly enamored of (1). I think (3) is in the proper spirit of
the TDT3 project (this is, after all, not a transcription or translation
project, so using state-of-the-art systems to do those steps and allowing
sites to concentrate on the TDT aspects makes sense), and I believe the
flexibility of (4) accommodates approaches that wish to go to the "raw"
data.

Notice that this also makes irrelevant the issue of how to tokenize the
Mandarin corpora.

In any event, it is time to decide how the evaluation is going to work! I
submit this as a basis for discussion, and welcome comments.

- Jon



Last updated Thu May 13 09:28:20 1999