(069) previous ~ index ~ next

To: doddington@nist.gov
From: Mark Liberman <myl@unagi.cis.upenn.edu>
Subject: Re: TDT3 source conditions
Date: Tue, 30 Mar 1999 13:42:09 EST

>This brings up another issue. (Actually NOT an issue, I trust.)
>That is that the Mandarin source stream will be one CHARACTER
>per record for the manual transcription and one WORD per record
>for the ASR transcription. (There is no good reason to discard
>word boundary information, and it may well be found to be useful
>if preserved.) For the segmentation task, the system output will
>still be required to be in terms of record ID's, however, and NOT
>in terms of either characters or words.

One more related issue. The Systran Mandarin-->English MT output, as
some of you will have noticed from the example posted earlier, is not
entirely English. When the system encounters a word (or words) it
doesn't know, it simply passes a block of Chinese characters through.
In the examples that I have seen, these are most commonly two- or
three-character sequences (four or six bytes), and I presume that they are
usually a single Mandarin word, though I don't imagine that this can be
guaranteed.

If the MT output is to be treated like the other kinds of data (as
Steve's questions and George's answer suggest), then we need to decide
how to tokenize this stuff. Is there any objection to using the normal
rules of English tokenization, which I presume will make each such stretch
of passed-through Chinese characters into a record?
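
To make the proposal concrete, here is a minimal sketch of what such a tokenizer might do. It is only an illustration, not the actual TDT tokenizer: the function names (tokenize, to_records), the regular expression, and the record numbering are all hypothetical, and the rule for English words and punctuation is a simplification. The one point it is meant to show is that an unbroken run of non-ASCII characters comes out as a single record.

```python
import re

# Hypothetical token pattern (an assumption, not the official TDT rules):
#   1. a run of non-ASCII characters (passed-through Chinese) is one token
#   2. a run of ASCII letters/digits, with an optional 's-style suffix
#   3. any other single non-space character (punctuation) is its own token
TOKEN_RE = re.compile(
    r"[^\x00-\x7f]+"
    r"|[A-Za-z0-9]+(?:'[A-Za-z]+)?"
    r"|[^\sA-Za-z0-9]"
)


def tokenize(text):
    """Split MT output into tokens, keeping each Chinese run whole."""
    return TOKEN_RE.findall(text)


def to_records(text, start_id=1):
    """Pair each token with a sequential record ID (numbering is illustrative)."""
    return [(start_id + i, tok) for i, tok in enumerate(tokenize(text))]


if __name__ == "__main__":
    sample = "The 总统 said that 北京 is ready."
    for rec_id, tok in to_records(sample):
        print(rec_id, tok)
```

With this sketch, tokenize("The 总统 said that 北京 is ready.") returns ['The', '总统', 'said', 'that', '北京', 'is', 'ready', '.'], so each two-character block is one record. If the data are in a two-byte encoding such as GB2312 (an assumption; the encoding is not stated above), each such block is four bytes, which matches the counts mentioned earlier.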

--

-Mark Liberman

Last updated Thu May 13 09:28:22 1999