To: tdt-distrib@unagi.cis.upenn.edu
From: Jon_Yamron@dragonsys.com
Subject: Re: TDT3 dry run
Date: Mon, 29 Mar 1999 17:25:47 -0500
There is so much to respond to that I will take the points one at a time
and try to be even more explicit about what I am proposing. First, let me
summarize a few key points:
1) Dragon's proposal does not restrict the types of approaches to this
problem in any way that I can see. If you think it does, you are not
reading carefully.
2) In the same way that supplying a default transcription of the audio
allowed sites without fast broadcast recognizers to participate in TDT2,
supplying a default Mandarin translation will allow sites with limited
Mandarin/MT resources to participate in TDT3.
3) This proposal provides for one basic evaluation condition (of the many
that might ultimately be supported) in which the format of the corpus is
the same as TDT2. This will dramatically simplify the preparation for, and
the execution of, the dry run (which is very close and still not defined).
On to the responses:
> From Mark Liberman:
> I think that the Dragon folks were making a more radical suggestion,
> namely that the Mandarin materials should be translated into English and
> then (from the perspective of the task definition) thrown away. Our local
> research effort would find that strategy an uninteresting one, but of
> course it is up to George and others to decide if the task definitions
> (especially of segmentation) should be changed to accommodate it.
I am ABSOLUTELY NOT proposing that the Mandarin materials should be "thrown
away" in any sense (as Rich has already observed). What I am proposing is
that there exists a data set within the corpus, consisting entirely of
English, on which the evaluation could be run.
Let me make an explicit comparison with the way we currently handle the
transcription problem. Every site is currently supplied with the raw audio
data (many gigabytes of .WAV files). There is also a transcription of the
audio, courtesy of Dragon's Hub4 recognizer. A site may choose to use
Dragon's transcription, may substitute a transcription of their own, may
use the .WAV data directly, or any combination of the above. The
evaluation supports all of these approaches. The point is, the data is
supplied in a form in which the transcription problem has been "factored
out". Those sites that think this a bad way to factor the problem are free
to pursue their own approach (as SRI did), because the raw data has also
been supplied (as well as a method of evaluation on the raw data).
Dragon's proposal is that there exist an evaluation condition in which the
translation problem has been factored out as well. So in addition to the
raw Mandarin data (which might be tokenized into characters, and includes,
for example, the usual .bnd and .tkn files), there also exists in the
corpus a default English translation of the Mandarin data (including the
usual .bnd and .tkn files). The evaluation supports scoring system output
on either set of data. Those sites that think this a bad way to factor the
problem are free to pursue their own approach, because the raw data has
been supplied.
> From Doug Oard:
> This strikes me as a potential disadvantage [factoring out the translation
> problem] since it may not be the right factorization. As an example, it
> might be harder to recover from names that are missegmented (and hence
> mistranslated) by Systran than it would be to implement good name
> recognition in Chinese.
The reference to the problem of names is interesting, because the
recognizer used by Dragon has exactly the same problem---unknown names are
turned into recognizer gibberish. But I don't recall anyone arguing that
the corpus should not include a standard transcription of the acoustic data
because information might be lost in the recognition. (And it is:
inflection, channel, speaker...all this is lost or reduced to barely
reliable features.)
As I said above, any site that is uncomfortable with the translation, or
even a site that is inclined to trust the translation but would like more
information, can build systems that consult the raw data that comes with
the corpus.
> From Jaime:
> There are translingual tasks that do not require MT, a case in point is
> translingual IR. We refer you to Yang, Carbonell et al (IJCAI97 and AI
> Journal 98) where non-MT-based IR achieves well over 90% of the accuracy of
> monolingual IR -- based on exploiting parallel corpora. Others have also
> achieved very good performance -- if not quite as high -- without MT or a
> parallel corpus. So why reduce one task (translingual IR) to a much
> harder task (Machine Translation) without need of doing so? Of course we
> don't know whether the same translation-free simpler methods will succeed
> for tracking and detection. But we (CMU at least) very much want to find
> out -- this is a key scientific question, not to be brushed aside for
> reasons of temporary expediency.
> So, my recommendation is that TDT3 focus on true translingual TDT, even
> if resources of each site permit us to do only some tasks and not others.
> And evaluations must be conducted by judgements on the original language
> texts -- judgments based on degraded and potentially-perverted MTed source
> documents risk diverging TDT from reality. Of course, sites can choose to
> use the Systraned outputs instead of Mandarin, but all evaluations should
> be with respect to the original texts in the original languages, as that is
> the ground truth. And, such sites will be less likely to have scalable
> technology (both in size and language diversity) by requiring up-front full
> MT.
There is no need for you to brush aside any questions, as you can use the raw
Mandarin data supplied in the corpus. But another site might feel that
there are language-independent approaches that could be tried, possibly
robust to the inaccuracies of transcription and translation, which should
be compared to your methods. Requiring that site to work from the Mandarin
does nothing but make it pointlessly harder to get those results.
As for requiring "judgements on the original language texts", I don't know
what this means. Except for the detail of reporting RECID numbers, easily
handled by NIST, judgments are on stories, which are (by definition) in
one-to-one correspondence in the raw and translated corpus. (One might
argue that, through some twist, a story might get lost somehow in
translation, but we already see this in the transcriptions where an entire
story fails to get recognized, and the evaluation handles it.)
> From George:
> I'm sorry for not responding sooner, because there clearly exists confusion
> regarding the required conditions for the TDT3 tasks. The required source
> conditions are newswire text and the TDT3-supplied automatic transcription
> of the audio. There is no requirement to use any MTed version of either a
> manual or an ASR transcription.
But the question is: will the evaluation support judgments that refer to
RECID assignments in the English translations of the Mandarin? I can think
of no good reason why this should not be so, as it is easy to implement
(much easier than supporting segmentation judgments on the raw audio, which
we currently do).
Cranky from lack of sleep,
- Jon
Last updated Thu May 13 09:28:21 1999