(089) previous ~ index ~ next
To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: TDT2 Mandarin Collection -- training release
Date: Fri, 30 Apr 1999 18:25:23 EDT
Folks,
The first release of TDT2 Mandarin text data is ready for ftp
retrieval in the usual way (see below). Documentation is still in
progress, and will be available on the LDC web site soon.
This release covers material spanning January-March 1998. Our
annotators assessed all 20 target topics over this set, though only
11 topics yielded a usable number of hits. (We're hoping that the
annotation of the April-June data will yield hits on additional
topics.)
In terms of data organization, the main difference relative to TDT2
English is the addition of a new data type: machine-translated text
(stored in an "mtrtext" directory, file names "*.mtr", boundary
tables "*.bndmtr"). This is the output of a Systran Chinese-English
system, presented in tokenized form only.
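For anyone scripting against the new data type, here is a minimal Python sketch of one way to match each "*.mtr" file with its "*.bndmtr" boundary table. It rests on two assumptions not stated in this message: that a text file and its boundary table share the same base name, and that both sit somewhere under a single unpacked root directory. The function name pair_mtr_files is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def pair_mtr_files(root):
    """Group *.mtr and *.bndmtr files found under root by base name.

    Assumption (not confirmed by the announcement): a machine-translated
    text file and its boundary table share the same base name.
    Returns {base_name: {"mtr": Path, "bndmtr": Path}}, where either key
    is absent if the corresponding file was not found.
    """
    pairs = defaultdict(dict)
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in (".mtr", ".bndmtr"):
            # Key by extension without the leading dot.
            pairs[path.stem][path.suffix.lstrip(".")] = path
    return dict(pairs)
```

Base names that come back with only one of the two keys point to a missing text file or a missing boundary table.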
The data set occupies about 165 MB when unpacked. The compressed
tar file is 42181133 bytes (about 42 MB).
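The byte count quoted here can serve as a quick check that a transfer completed. A small Python sketch follows; the tar file name is not given in this message, so the path argument is whatever name the file has locally, and the function name tar_size_matches is hypothetical.

```python
import os

# Size of the compressed tar file as quoted in this announcement.
EXPECTED_TAR_BYTES = 42181133


def tar_size_matches(path, expected=EXPECTED_TAR_BYTES):
    """Return True if the file at path has exactly the expected byte count."""
    return os.path.getsize(path) == expected
```

A matching size only shows that the transfer was not truncated; it does not verify the file contents.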
Below are the ftp instructions.
Dave Graff
------------------
[ftp instructions available upon request from graff@ldc.upenn.edu]
Last updated Thu May 13 09:28:23 1999