(065) previous ~ index ~ next
To: doddington@nist.gov
From: Steve Lowe <steve@dragonsys.com>
Subject: Re: TDT3 source conditions
Date: Tue, 30 Mar 1999 10:57:58 -0500
George Doddington Mon, 29 Mar 1999 19:26:43 -0500 wrote:
> An automatic MT translation of the Mandarin language data into English
> will also be supplied, as an OPTIONAL form of the Mandarin source data.
> This will be derived from the source data for the required condition
> and therefore satisfies the required source data condition. Story
> boundary information for the MT translation will also be supplied.
This may be implied by your statement, but I want to clarify. Is the
following true?
1) There will be a complete set of token (.tkn and .asr) files and
accompanying boundary (.bndtkn and .bndasr) files for the MT
translation form of the Mandarin training and test data. These files
will not differ in format from the corresponding TDT2 and TDT3 English
files.
2) A site that WISHES to make use of this translation MAY submit
results using the RECIDs in the MT translated form of the .tkn and
.asr files, WITHOUT reference to the original Mandarin data.
I won't put on a suit of armor like Jaime (I prefer deflector shields
myself, they repel higher-wattage blasts), but I would like to
reiterate that Dragon's requests on the Mandarin resources issue have
NOT been intended to force the TDT research in any particular
direction. They have been entirely in response to the realities of the
sponsor's requirements:
* we ALL have to do cross-lingual IR
* we ALL have to submit dry run results on cross-lingual IR in June
We at Dragon have been focusing on some non-traditional approaches to
IR that we think are promising, based on our experience in language
modeling for ASR. I have to admit that we don't have the amount of
experience in the information retrieval field that some other sites
have. Maybe we're just barking up the wrong tree and our ideas will
not pan out. But we're trying to catch up and understand how our
algorithms are related to those already established in the field, so
we can figure out where (and whether) we may have something useful to
offer.
The cross-lingual effort doesn't help this line of research, since,
speaking personally now, I don't know what I'm doing in ENGLISH yet.
Given time, we can do Mandarin---it would be INTERESTING to do
Mandarin!---but doing it by June in any but the most simple way will
bring our work on algorithms to a halt.
Some writers seem to be concerned that research will be channeled into
the factored MT/IR path. Personally, I'd rather see it go the other
way---it would be kind of boring if Systran+IR just worked. Where's
the fun? But maybe the factored MT/IR is adequate, so it needs to be
tried. And, it seems there are plenty of sites that are very
interested in doing the Mandarin cross-lingual task carefully and
thoroughly, so I doubt that closely-coupled, non-explicit-MT
approaches will be neglected. Maybe, given time, even us...
Steve
Last updated Thu May 13 09:28:22 1999