To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: re-release of Jan-March TDT2 Mandarin data
Date: Tue, 04 May 1999 18:48:19 EDT
Folks,
I apologize for the delay in fixing the problems with last Friday's
release of the Mandarin TDT2 data. In addition to the errors made in
generating the topic table, we also found that a number of the data
files (28 of the machine-translated "English" set) contained a variety
of formatting errors.
Here is a correct summary of topic hits in this data set (sorted by
number of hits), where the "topicid" numbers match those that were
used in the TDT2 English table. This list shows hits on 15 topics
(three of which have fewer than 4 hits); when the next three months'
worth of data are released, there will be 20 topics in the table, all
of which will have at least 4 hits.
694 topicid=1
367 topicid=15
142 topicid=13
45 topicid=39
29 topicid=76
18 topicid=2
11 topicid=88
10 topicid=48
10 topicid=23
10 topicid=20
8 topicid=5
5 topicid=7
2 topicid=44
2 topicid=71
1 topicid=96
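
For anyone who wants to check the table against the summary above, here is a
minimal Python sketch. It is only an illustration and not part of the release:
the function name parse_topic_hits and the inline copy of the table are
assumptions made for this example, and the only format assumed is the
"<hits> topicid=<id>" layout shown above.

```python
# Topic-hit table copied from this message, one "<hits> topicid=<id>" per line.
TABLE = """\
694 topicid=1
367 topicid=15
142 topicid=13
45 topicid=39
29 topicid=76
18 topicid=2
11 topicid=88
10 topicid=48
10 topicid=23
10 topicid=20
8 topicid=5
5 topicid=7
2 topicid=44
2 topicid=71
1 topicid=96
"""


def parse_topic_hits(text):
    """Return a list of (topicid, hits) pairs in the order they appear.

    Hypothetical helper for this example; raises ValueError on a line
    that does not look like "<hits> topicid=<id>".
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        hits, field = line.split()
        key, _, topicid = field.partition("=")
        if key != "topicid":
            raise ValueError("unexpected line: " + line)
        rows.append((int(topicid), int(hits)))
    return rows


rows = parse_topic_hits(TABLE)
topic_count = len(rows)
# Topics with fewer than 4 hits in this three-month set.
sparse_topics = [topicid for topicid, hits in rows if hits < 4]
# Sum of the hits column as listed in the table.
total_hits = sum(hits for _, hits in rows)

print("topics:", topic_count)
print("topics with fewer than 4 hits:", sparse_topics)
print("sum of hits column:", total_hits)
```

Run as written, this should report 15 topics, three of which (44, 71, and 96)
have fewer than 4 hits, matching the summary above.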
The new release is 167 MB when uncompressed, and the compressed tar
file is 42588112 bytes (42.6 MB).
The new file, available now via the usual "members_only" method, is
named:
tdt-deliv-990503.tar.gz
-----------
David Graff Linguistic Data Consortium
graff@ldc.upenn.edu 3615 Market St., Suite 200
voice: (215) 898-0887 University of Pennsylvania
fax: (215) 573-2175 Philadelphia, PA 19104
http://www.ldc.upenn.edu
Last updated Thu May 13 09:28:24 1999