To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Addendum to TDT2 Mandarin release
Date: Wed, 12 May 1999 18:59:26 EDT
Folks,
Last week (990503), I released a corrected version of the TDT2
Mandarin data and tables, covering the January-March '98 collection
period.
There was one component of the corpus missing from that release:
the "systran" machine translation of ASR output for 34 VOA files.
That last component, consisting of machine-translated token streams
and associated boundary tables, is now available in our members_only
ftp area, via the usual anonymous ftp procedure.
The file name of this distribution is:
tdt-addendum_to-990503.tar.gz
The exact file size is 1539040 bytes.
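A quick way to confirm the download is complete is to compare the size on disk against the figure above. The following is only a minimal sketch in Python; the helper name size_matches is hypothetical, and the path you pass in is wherever you saved the tar file locally.

```python
import os

# Size in bytes of tdt-addendum_to-990503.tar.gz, as stated in this message.
EXPECTED_SIZE = 1539040


def size_matches(path, expected=EXPECTED_SIZE):
    """Return True if the file at `path` is exactly `expected` bytes long."""
    return os.path.getsize(path) == expected
```

A size mismatch usually means the transfer was truncated or was made in ascii mode rather than binary mode.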
The tar file contents are as follows:
    mtatext/    # directory containing 34 files, "1998*VOA_MAN.mta"
    tables/     # directory containing 34 files, "1998*VOA_MAN.bndmta"
Note that you should "cd" to your TDT base directory to unpack this
file; this will add the 34 files to your current "tables" directory,
and create the new "mtatext" directory with 34 files in it.
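For those who prefer to script the unpacking step, it could look roughly like the sketch below. It assumes that the member paths inside the tar file are relative ("mtatext/..." and "tables/..."), as the listing above suggests; the function name unpack_addendum and both arguments are hypothetical.

```python
import tarfile


def unpack_addendum(archive_path, tdt_base_dir):
    """Extract the gzipped tar file into tdt_base_dir.

    Assumes member paths are relative, so that files land in
    tdt_base_dir/mtatext and tdt_base_dir/tables.
    Returns the list of member names found in the archive.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=tdt_base_dir)
    return names
```

Passing your TDT base directory as tdt_base_dir has the same effect as doing a "cd" there before unpacking by hand.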
As with the machine-translated versions of the reference human
transcribed text data ("mtrtext/*.mtr" and "tables/*.bndmtr"), some
tokens in these token stream files are actually untranslated GB
strings: when SYSTRAN is unable to associate a string of one or more
GB characters with one or more Chinese words in its internal lexicon,
it passes that string through to its output without modification.
The resulting token stream files treat each string of one or more GB
characters as a single "word token" (in which all the byte values
happen to be in the range of 0xA1 - 0xFE); these tokens receive the
same sgml mark-up, and are assigned sequential ids, just as if they
were English words.
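If you need to separate the untranslated GB tokens from the English ones, the byte range given above is enough to tell them apart. The following is a minimal sketch, assuming you have already pulled each token out of its sgml mark-up as raw bytes; the function name is_gb_token is hypothetical.

```python
def is_gb_token(token: bytes) -> bool:
    """Return True if every byte of a non-empty token is in 0xA1..0xFE."""
    return len(token) > 0 and all(0xA1 <= b <= 0xFE for b in token)


# A four-byte string with all bytes in range, as for two GB characters.
print(is_gb_token(b"\xd6\xd0\xce\xc4"))
# An ordinary English word token.
print(is_gb_token(b"China"))
```

Note that this tests raw bytes, so the token stream files should be read in binary mode rather than decoded as text first.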
The file-level attributes are consistent with those used in the May 3
release: each <DOCSET> and <BOUNDSET> has a "type=" attribute and a
"fileid=" attribute. The formatting of sgml attributes will be
changed when we make our next data release at the end of this month,
in order to better accommodate the larger range of different data
types in the corpus.
Dave Graff
Last updated Thu May 13 09:28:24 1999