
To: "'Jon_Yamron@dragonsys.com'" <Jon_Yamron@dragonsys.com>
From: "Strzalkowski, Tomek (CRD)" <strzalkowski@crd.ge.com>
Subject: RE: Mandarin Resources and Large, Non-Punctual Topics
Date: Wed, 10 Feb 1999 13:54:24 -0500

I'm sure some Mandarin resources would be useful, but do we really
need to worry about tokenization, i.e. word lists? I guess this may be
needed for speech recognition, but IR does well without it. In the TREC
Chinese track, the CUNY system (Kwok) used character bigrams, and this
(if I recall correctly) worked just as well as, or even better than, any
word segmentation.
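
To make the character-bigram idea concrete, here is a minimal sketch of
indexing without a word list. It is not the CUNY system; the function
names and the overlap score are assumptions made up for illustration,
and punctuation handling is left out.

```python
def char_bigrams(text):
    """Return overlapping character bigrams, ignoring whitespace.

    No word list or segmenter is involved: every adjacent pair of
    characters becomes an index term.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character is kept as its own term.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def bigram_overlap(query, document):
    """Fraction of the query's distinct bigrams found in the document.

    A hypothetical toy score, only meant to show that matching can be
    done on bigrams alone.
    """
    q = set(char_bigrams(query))
    d = set(char_bigrams(document))
    return len(q & d) / len(q) if q else 0.0


if __name__ == "__main__":
    print(char_bigrams("北京大学"))
    print(bigram_overlap("北京", "我在北京大学"))
```

The same tokenizer would have to be applied to queries and documents
alike, which is the only agreement needed in place of a shared word list.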

Can we use some commercial MT system? Systran
has Eng-Mandarin, at least one way. Building a custom dictionary will be
too expensive, and probably a moving target with TDT.

---- Tomek


> ----------
> From: Jon_Yamron@dragonsys.com[SMTP:Jon_Yamron@dragonsys.com]
> Sent: Wednesday, February 10, 1999 10:53 AM
> To: Rich Schwartz
> Cc: Christopher Cieri; tdt-distrib@ldc.upenn.edu
> Subject: Re: Mandarin Resources and Large, Non-Punctual Topics
>
> I agree with Rich that the big problem is getting hold of some kind of
> bilingual concordance. This is made at least a little trickier in Mandarin
> because it interacts with the tokenization problem---basically, if we agree on a
> bilingual dictionary, we also must agree on a word list. In short, the word
> list the LDC uses in the tokenization of the Mandarin text data, the word list
> used in the Mandarin recognizer used to process the Mandarin broadcast sources,
> and the word list used in the English-Mandarin concordance, had better all be
> pretty similar.
>
> - Jon
>
>
>
>
>
> Rich Schwartz <schwartz@bbn.com> on 02/10/99 09:20:01 AM
>
> To: Christopher Cieri <ccieri@ldc.upenn.edu>
> cc: tdt-distrib@ldc.upenn.edu (bcc: Jon Yamron/Dragon Systems USA)
> Subject: Re: Mandarin Resources and Large, Non-Punctual Topics
>
>
>
>
>
>
> Chris,
>
> On Tue, 9 Feb 1999, Christopher Cieri wrote:
>
> > James asked us to list any Mandarin resources we might contribute to
> > TDT-3.
>
> The resource we all need most is a bilingual dictionary, because we
> are supposed to find Mandarin and English documents that are about the
> same topic. We could each start to work on techniques for estimating
> a probabilistic bilingual dictionary from general news. But I'm
> assuming that is beyond the scope of this effort. Am I wrong there?
> It's certainly an interesting and current topic. I just didn't think
> we were going to do that here.
>
> Doug Oard's message is encouraging.
>
> me.
>
>
>

Last updated Thu May 13 09:28:14 1999