(066) previous ~ index ~ next

To: Steve Lowe <steve@dragonsys.com>
From: George Doddington <doddington@nist.gov>
Subject: Re: TDT3 source conditions
Date: Tue, 30 Mar 1999 12:05:56 -0500

Steve Lowe wrote:
>
> George Doddington <doddington@nist.gov> Mon, 29 Mar 1999 19:26:43 -0500 wrote:
> An automatic MT translation of the Mandarin language data into English
> will also be supplied, as an OPTIONAL form of the Mandarin source data.
> This will be derived from the source data for the required condition
> and therefore satisfies the required source data condition. Story
> boundary information for the MT translation will also be supplied.
>
> This may be implied by your statement, but I want to clarify. Is the
> following true?
>
> 1) There will be a complete set of token (.tkn and .asr) files and
> accompanying boundary (.bndtkn and .bndasr) files for the MT
> translation form of the Mandarin training and test data. These files
> will not differ in format from the corresponding TDT2 and TDT3 English
> files.
>
> 2) A site that WISHES to make use of this translation MAY submit
> results using the RECIDs in the MT translated form of the .tkn and
> .asr files, WITHOUT reference to the original Mandarin data.

Yes.
--------

On a related issue, it isn't clear to me how a "non-Mandarin"
participant in the segmentation task will be able to handle
Mandarin data, since we aren't planning to include a mapping
of MT'd English words back into Mandarin characters. (Note
that segmentation of Mandarin requires outputting story
boundaries in terms of record IDs in the Mandarin source
stream.)

This brings up another issue. (Actually NOT an issue, I trust.)
The Mandarin source stream will be one CHARACTER per record for
the manual transcription and one WORD per record for the ASR
transcription. (There is no good reason to discard word boundary
information, and it may well prove useful if preserved.) For the
segmentation task, however, the system output will still be
required to be in terms of record IDs, and NOT in terms of either
characters or words.
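
The record-ID requirement above can be illustrated with a minimal
sketch. Everything in it is an assumption made for illustration: the
Record type, the helper names, the toy Mandarin text, and the record
numbering starting at 1 are all hypothetical and do not reflect the
actual TDT file formats. It shows only that the same story boundary
falls on different record IDs in a character-per-record stream and a
word-per-record stream, and that the reported boundary is a record ID
in either case.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    recid: int  # hypothetical record ID; real TDT data defines its own numbering
    text: str   # one character (manual transcript) or one word (ASR transcript)


def make_stream(units: List[str], first_recid: int = 1) -> List[Record]:
    """Build a toy source stream with one unit (character or word) per record."""
    return [Record(first_recid + i, unit) for i, unit in enumerate(units)]


def story_start_recids(stream: List[Record], start_positions: List[int]) -> List[int]:
    """Turn 0-based stream positions of story starts into record IDs."""
    return [stream[pos].recid for pos in sorted(set(start_positions))]


# Toy data (assumed): the same text, split per character and per word.
manual = make_stream(list("今天天气很好明天下雨"))                # one CHARACTER per record
asr = make_stream(["今天", "天气", "很", "好", "明天", "下雨"])    # one WORD per record

# The second story starts at "明天": position 6 by character, position 4 by word.
manual_out = story_start_recids(manual, [0, 6])
asr_out = story_start_recids(asr, [0, 4])

print(manual_out)  # [1, 7]
print(asr_out)     # [1, 5]
```

In this sketch the system never reports a character or word offset
directly; it always looks up the record ID of the record at which a
story begins, whichever form of the stream it was given.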
--
George Doddington at NIST: doddington@nist.gov or 301/975-3261

Last updated Thu May 13 09:28:22 1999