To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Shipment of TDT2 Text Corpus, REVISED VERSION
Date: Thu, 19 Aug 1999 15:49:34 EDT

Folks,

We have finally completed a wide-ranging revision of the TDT2 Text
Corpus (comprising all TDT2 sources, six English and three Mandarin),
and we are shipping cdroms today to the participants listed below.
You can expect to receive your copies tomorrow:

BBN -- Rich Schwartz
CMU -- Tom Pierce
Columbia Univ. -- Dragomir Radev
Dragon -- Jon Yamron
GE -- Tomek Strzalkowski
IBM -- Ramesh Gopinath
Mitre -- Patricia Robinson
NIST -- Jon Fiscus
NSA -- Grace Crowder
SRI -- Andreas Stolcke
UMass -- James Allan
Univ. of Iowa -- Padmini Srinivasan
Univ. of Maryland -- Gina-Anne Levow
Univ. of Pennsylvania -- Mike Schultz

If any other site on the "tdt-distrib" email group needs a copy and
is not listed above, please let me know.

In a message following this one, I will provide a copy of one of the
documentation files from this new cdrom, explaining how this release
differs from the June 6 release prepared by NIST. The differences
are substantial and are likely to require some retooling by users,
because the directory organization and file name patterns have
changed. I apologize for the inconvenience this will
cause, but I believe that the new format will be much easier to work
with, and you will find the quality of the data noticeably improved
(especially with respect to the Mandarin newswire sources).

I realize that, in one major regard, this new version of the corpus
will be unusable -- until the NIST scoring software is adjusted to
support it. Jon Fiscus has been providing advice during our
preparation of the new corpus design, and the necessary changes will
be fairly simple. Our intention is to apply this same corpus format
in preparing the TDT3 Eval Test data set.

Dave Graff

Last updated Thu Aug 19 16:14:48 1999