To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: TDT2 Mandarin Text Corpus
Date: Thu, 27 May 1999 00:37:47 EDT
Folks,
The full six-month collection of TDT2 Mandarin data is now available
for ftp access via the usual "members_only" method. We have annotated
all six months of text from three sources (Xinhua, Zaobao and VOA)
against the 20 topics chosen from the TDT2 English set.
There is a scattering of known bugs in this release, including
sections of corrupted text in some Xinhua files (due to modem line
noise), and some stories that came up missing in the machine
translated version of a few Xinhua and Zaobao files. We intend to
work these issues out over the next few weeks, but the problems are
quite limited in scope and quantity, and we do not foresee a need for
patches between now and the PI meeting at the end of June (unless of
course someone finds a serious bug that we are not yet aware of).
Below is a portion of the readme file that comes with the
distribution. The file name is:
tdt-deliv-990526.tar.gz
The exact size of this file is 101,973,987 bytes. It will be
nearly 400 MB when uncompressed. I expect that an ftp transfer of
this size may not be convenient for some people, so we will put the
uncompressed data on recordable cdroms today (May 27) for shipment to
the sites who received the English TDT2 cdrom today.
(For those of you who can make the ftp transfer without trouble, I
would be grateful if you could send me a message when you have the
data, so as to put less wear and tear on our shipping crew. Thank
you.)
Dave Graff
------------------------------------------------------------------
Top-Level README file for the TDT2 Mandarin Text Corpus
Release date: May 26, 1999
This is the first full release of the TDT2 Mandarin Text Corpus. It
consists of the following three sources:
Xinhua News, January 1 - June 30, 1998, 1-3 files per day
Zaobao News, February 26 - June 30, 1998, 2 files per day
VOA News, February 20 - June 30, 1998, 1-3 files per day (with gaps)
As with the TDT2 English Corpus, the following data formats are
provided:
sgml -- reference text data, in GB-encoded Chinese (unsegmented)
tkntext -- tokenized reference text (1 GB character = 1 token)
asrtext -- output of Dragon's Mandarin ASR, word-segmented, in GB
In addition, the following translated data formats are also included;
these were created using the "Systran" machine translation system to
produce English output from the Chinese source text:
mtrtext -- machine translation of reference text, in tokenized form
mtatext -- machine translation of ASR text, in tokenized form
In both of these translated data sets, the tokenization into English
words is done in the same manner as TDT2 English tokenization: simply
by splitting up the text at white-space characters (punctuation,
quotes and brackets are not separated from adjacent word tokens). But
unlike the TDT2 English corpus, these text streams contain occasional
strings of GB characters; these are portions of the Chinese input text
that Systran was unable to translate, for whatever reason (either
inability to segment the strings into putative words, or inability to
locate words in its internal lexicon). Each untranslated string of GB
characters is rendered as-is, and treated as a single "word" token.
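As a rough illustration of the tokenization just described, here is a
minimal Python sketch. The sample line is made up, and the test for
"untranslated" tokens (any byte above 0x7F) is only an assumed
heuristic, not something the distribution itself defines:

```python
def tokenize(line):
    """Split on white space only; punctuation stays attached to words."""
    return line.split()


def is_untranslated(token):
    """Assumed heuristic: any non-ASCII byte marks leftover GB text."""
    return any(byte > 0x7F for byte in token)


# Made-up sample line (as bytes) with one GB string left untranslated.
gb_string = "北京".encode("gb2312")
sample = b"The delegation visited " + gb_string + b" yesterday, officials said."

tokens = tokenize(sample)
leftover = [t for t in tokens if is_untranslated(t)]
```

Working on bytes rather than decoded text keeps each GB string intact
as one token, since GB-encoded bytes never coincide with ASCII white
space.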
For each of the data files in the "*text" directories, there is a
corresponding boundary-table file in the "tables" directory, to allow
for mapping of story boundaries to the word streams by means of the
sequential token numbers (recids) assigned to each token in the data
file.
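The mapping from recids to stories might be sketched as follows. The
layout of the boundary-table files is not described in this message,
so the sketch assumes each table row has already been reduced to a
(docno, first recid, last recid) triple; the docnos and recid ranges
shown are invented for illustration:

```python
from bisect import bisect_right

# Hypothetical rows: (docno, first recid, last recid), sorted by first recid.
boundaries = [
    ("XIN19980101.0001", 1, 120),
    ("XIN19980101.0002", 121, 305),
    ("XIN19980101.0003", 306, 410),
]
starts = [first for _, first, _ in boundaries]


def story_for_recid(recid):
    """Return the docno whose recid span contains recid, or None."""
    i = bisect_right(starts, recid) - 1
    if i < 0:
        return None
    docno, first, last = boundaries[i]
    return docno if first <= recid <= last else None
```

A binary search over the starting recids is used here only because
the rows are assumed to be sorted; a linear scan would serve equally
well for a single file.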
Finally, the file "tdt2_man_topic.rel" provides the results of topic
relevance annotation conducted by native speakers of Mandarin at the
LDC. The reference text for each story in the collection was judged
against a set of 20 topics, which were selected from 100 target topics
of the TDT2 English corpus. These 20 were chosen to assure a minimum
of 4 hits per topic in the Chinese data. The current tally of hits
per topic is shown below, with "topicid" values that map to the labels
assigned to the original English target topics:
    Topic        # of hits
    ======================
    topicid=1         1775
    topicid=2           39
    topicid=5           53
    topicid=13         141
    topicid=15         425
    topicid=20          10
    topicid=23           9
    topicid=39          56
    topicid=44           7
    topicid=48          20
    topicid=57           6
    topicid=70         243
    topicid=71          60
    topicid=76         239
    topicid=85           5
    topicid=88          47
    topicid=89          15
    topicid=91          14
    topicid=96          19
(These quantities may be subject to change as further quality control
checks are carried out, and as we get feedback from research sites
about misses or false alarms on the part of LDC annotators.)
With regard to the machine-translated material (derived from both
reference and ASR text data), we have tried very hard to make it as
complete as possible, but there are some known problems with the
Systran output. In addition to leaving a scattering of untranslated
GB strings in every file (where each string is typically just a few
characters long), it appears that there was loss or corruption of some
story boundaries in some of the files. In particular, the "docnos"
for the following stories appear in the boundary tables for the
reference text data, but not in the tables for the translated data:
ZBN19980428.0117 XIN19980106.0008
ZBN19980501.0042 XIN19980115.0113
ZBN19980510.0089 XIN19980116.0126
ZBN19980518.0101 XIN19980117.0084
ZBN19980527.0028 XIN19980118.0007
ZBN19980531.0113 XIN19980128.0059
ZBN19980617.0040 XIN19980214.0103
ZBN19980626.0061 XIN19980607.0061
We have not established yet whether the translated content of these
stories has been absorbed into adjacent stories within the same files,
or was simply discarded in some way (lending new meaning to the phrase
"lost in translation").
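A check of this kind amounts to a set difference between two docno
inventories. The sketch below assumes the docnos have already been
read out of the reference and translated boundary tables (the parsing
step is omitted because the table layout is not described here). Only
ZBN19980428.0117 comes from the list above; the neighboring docnos are
made up:

```python
def missing_docnos(reference_docnos, translated_docnos):
    """Return docnos present in the reference tables but not the translated ones."""
    return sorted(set(reference_docnos) - set(translated_docnos))


# Illustrative inventories for one file; .0116 and .0118 are invented.
reference = ["ZBN19980428.0116", "ZBN19980428.0117", "ZBN19980428.0118"]
translated = ["ZBN19980428.0116", "ZBN19980428.0118"]

missing = missing_docnos(reference, translated)
```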
Among the VOA data, there is a greater apparent discrepancy in docno
inventories between the original and translated files, but these
actually pose less of a problem. Across ASR and reference text data,
there are a total of 98 distinct docnos that appear in the boundary
tables for untranslated files (*.bndtkn and *.bndasr) but do not
appear in the corresponding translated tables (*.bndmtr and *.bndmta,
respectively). It turns out that all of these VOA docnos correspond
to non-news or unannotated segments of the broadcasts (i.e.
"miscellaneous" or "untranscribed" portions). We believe these are
all cases where the reference or ASR file contained no text over
the entire duration of the "story unit", and in the various stages of
filtering these files into and out of Systran processing, the tags
identifying these empty "story units" were lost.
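For anyone repeating this comparison, the pairing of untranslated and
translated tables can be sketched as below. The suffix pairs are the
ones named above; that paired tables share a base name and directory
is an assumption, and the example path is hypothetical:

```python
from pathlib import PurePosixPath

# Suffix pairs taken from the text: untranslated table -> translated table.
TRANSLATED_SUFFIX = {".bndtkn": ".bndmtr", ".bndasr": ".bndmta"}


def translated_table(path):
    """Return the translated-table path assumed to pair with an untranslated one."""
    p = PurePosixPath(path)
    return str(p.with_suffix(TRANSLATED_SUFFIX[p.suffix]))


# Hypothetical file name, used only to show the suffix swap.
example = translated_table("tables/example_voa_file.bndasr")
```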
Last updated Mon Jun 21 11:18:47 1999