To: tdt-distrib@ldc.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: First installment of TDT text data
Date: Mon, 16 Feb 1998 03:24:54 EST
The first delivery of text data for the TDT2 project is now available
using the ftp protocol shown below. If anyone (besides George D. and
Charles W.) needs physical media because ftp is impractical, let me
know as soon as possible, and include a usable postal address and
phone number for a FedEx label.
[ftp instructions available on request from graff@ldc.upenn.edu]
ftp directory after seven days (or when the next installment is ready,
whichever happens later).
Size of compressed tar file: 29012931 bytes
Size of uncompressed tar file: 97698816 bytes
This delivery covers Jan. 4-31. It is bulky because I have chosen,
for this first installment, to include ALL newswire data from this
time period, rather than just the 80 stories per day that were
requested for this delivery. However, in keeping with the
established request, I have provided randomized lists for all data
sources for each site, and in these lists, there are no more than 80
stories per day from a given newswire. (It turns out that there are
more than 80 stories from the four combined episodes of CNN on some
days; but these tallies may be skewed by the notion of "story
boundaries" in closed-caption data.)
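
For readers who want to build comparable lists from the full data, here
is a minimal sketch of a per-source, per-day cap with random ordering.
It is an illustration only, not the procedure actually used to produce
the lists in this delivery. The record layout (source, date, story id),
the function name capped_random_lists, and the source labels in the
usage example are all assumptions made for the example.

```python
import random
from collections import defaultdict

MAX_PER_DAY = 80  # daily cap per source, as described in the message


def capped_random_lists(stories, cap=MAX_PER_DAY, seed=0):
    """Return {(source, date): [story_id, ...]} with at most `cap` ids each.

    `stories` is an iterable of (source, date, story_id) tuples; this
    layout is hypothetical and not taken from the delivered files.
    """
    rng = random.Random(seed)  # fixed seed so the lists are reproducible
    by_key = defaultdict(list)
    for source, date, story_id in stories:
        by_key[(source, date)].append(story_id)

    selected = {}
    for key, ids in by_key.items():
        shuffled = list(ids)
        rng.shuffle(shuffled)           # randomize the order within a day
        selected[key] = shuffled[:cap]  # keep no more than `cap` stories
    return selected


if __name__ == "__main__":
    # Placeholder data: 100 stories from one source, 10 from another.
    demo = [("WIRE_A", "19980104", i) for i in range(100)]
    demo += [("WIRE_B", "19980104", i) for i in range(10)]
    for key, ids in capped_random_lists(demo).items():
        print(key, len(ids))
```

A day with fewer stories than the cap keeps all of them, only in
shuffled order.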
There is documentation and even some source code included with the
text data and story lists. I should apologize for the length of the
top-level readme file, but I sincerely hope that people will read it.
Please note (on the "To:" line above) that the LDC has established an
email alias for TDT2 participants:
<tdt-distrib@ldc.upenn.edu>
which currently contains the addresses listed below. The list will be
maintained by Chris Cieri and Dave Graff at the LDC, so please send
any requests for changes to one (or both) of us. Naturally, it makes
sense for all of you to use this alias in continuing email discussions
and announcements.
We look forward to feedback on the quality of these materials, and to
timely selection of the first installment of events/topics for our
annotators.
Dave Graff
-------------------------------
Listing of current tdt-distrib:
Andreas Stolcke <stolcke@speech.sri.com>
James Allan <allan@freya.cs.umass.edu>
Rich Schwartz <schwartz@bbn.com>
Jon Yamron <jon@dragonsys.com>
John Lafferty <lafferty@cs.cmu.edu>
Yiming Yang <yiming@cs.cmu.edu>
Jaime Carbonell <jgc@nl.cs.cmu.edu>
David Graff <graff@unagi.cis.upenn.edu>
Chris Cieri <ccieri@unagi.cis.upenn.edu>
Mark Liberman <myl@unagi.cis.upenn.edu>
Charles Wayne <clwayne@snap.org>
Grace Crowder <crowder@afterlife.ncsc.mil>
George Doddington <doddington@sri.com>
Last updated Wed Sep 9 09:40:45 1998