From: Jaime Carbonell <jgc@NL.CS.CMU.EDU>
Subject: Response to Anglocentrism in TDT
Date: Sat, 27 Mar 99 11:58:27 EST
Let me put on my suit of armor to help deflect the coming sticks and stones,
rotten eggs, verbal barbs, and what have you... but I humbly disagree in a
fundamental way with the emerging consensus, to wit:
A major objective of TDT3 is translingual detection and tracking -- TRANSLINGUAL.
For instance, detection of novel events can be run in Mandarin and in English
separately, and then the event/topic clusters merged or correlated translanguage.
Or, events detected in one language (e.g. English) result in training examples
used for tracking in other languages (e.g. Mandarin).
If we first translate everything into English, we can then just do good ol'
monolingual TDT -- not that there isn't more research and improvement possible;
on the contrary, there certainly is. But there are other serious problems with this
English-centric mindset:
- Translation systems are not accurate. Worse yet, translation of ASR
text is even less so. So, are we using Systran translations of ASR Mandarin?
If so, I would expect significant degradation. If not, are we kissing
goodbye to ASR-based TDT?
- Using Systran Translation causes us to train to the particular quirks
of that system. Maybe this won't affect the underlying technology
development, but that's like saying that evaluation results play little
or no role in the selection and refinement of the technology, and I
am quite sure no one in this mailing list seriously believes that.
- It is unrealistic to translate the entire world into English as a
preprocessing step before doing anything else. Doug already touched
on this point. This was also clearly realized by Ron Larsen in TIDES.
Can you imagine translating the entire web into English, or all possible
news sources and all raw intelligence data first? Too much of it.
No translation system exists for well over 90% of languages. Of course,
if DoD wishes to massively fund translation technology, we at CMU
MT systems w/o commercial motivation :-)
- Other well-recognized evaluations (e.g. TREC-CLIR) do not do their
evaluations in English only. This is very reasonable because far
more IR (especially with the web) is being deployed in many more languages
than is MT, the more difficult challenge.
- There are translingual tasks that do not require MT; a case in point is
translingual IR. We refer you to Yang, Carbonell et al (IJCAI97 and
AI Journal 98) where non-MT-based IR achieves well over 90% of the
accuracy of monolingual IR -- based on exploiting parallel corpora.
Others have also achieved very good performance -- if not quite as high --
without MT or a parallel corpus. So why reduce one task (translingual IR)
to a much harder task (Machine Translation) without need of doing so?
Of course we don't know whether the same translation-free simpler
methods will succeed for tracking and detection. But we (CMU at least)
very much want to find out -- this is a key scientific question, not
to be brushed aside for reasons of temporary expediency.
So, my recommendation is that TDT3 focus on true translingual TDT, even if
resources of each site permit us to do only some tasks and not others. And
evaluations must be conducted by judgments on the original language texts --
judgments based on degraded and potentially-perverted MTed source documents risk
diverging TDT from reality. Of course, sites can choose to use the Systraned
outputs instead of Mandarin, but all evaluations should be with respect to
the original texts in the original languages, as that is the ground truth.
And, such sites will be less likely to have scalable technology (both in size
and language diversity) by requiring up-front full MT.
Best, -- wait let me put on my protective hood and iron mask -- ah, there we go.
-- Jaime (including Yiming's points as well)
Last updated Thu May 13 09:28:21 1999