To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Complete January distribution is ready
Date: Tue, 07 Apr 1998 19:43:40 EDT
Folks,
I'm sorry it took longer than intended, but here it is; you can
retrieve the delivery of January TDT data as instructed below. For
completeness, I have included again the SGML data and the
topic-relevance.table, which were initially made available last week.
In addition to these, the present release includes ASR text streams,
tokenized text streams derived from the SGML data, and complete
tables for relating story boundaries to locations in the stream files.
I have also included the script that I used to do the tokenization of
the SGML data. This is meant to be an initial point of departure for
any discussions about the finer points of tokenization -- I would
encourage anyone who is concerned about this matter to suggest
improvements or alternatives, in order to establish a reasonable
consensus on how tokenization should be done for this project.
Good luck with the materials. I will be away from the office for the
rest of this week, but I hope you will not hesitate to send any
comments, questions, etc.
The compressed file size is 58004389 bytes, and it will uncompress to
a tad over 205 MB. Anyone for whom this poses a hardship in terms of
ftp transfer or online storage can send a request for a cdrom copy to
<ldc@ldc.upenn.edu> -- be sure to point out that you are on the TDT2
distribution email list, and be prepared to face some paperwork...
Dave Graff
----------------- instructions for ftp retrieval ------------------
[ftp instructions available on request from graff@ldc.upenn.edu]
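After retrieval, it may be worth confirming that the transfer was complete by comparing the size of the local file with the 58004389 bytes quoted in the message. The following is only a minimal sketch of such a check: the function name check_download and the file name in the usage note are hypothetical, since the message does not name the archive.

```python
import os

# Compressed size in bytes, as quoted in the message.
EXPECTED_COMPRESSED_BYTES = 58004389


def check_download(path, expected=EXPECTED_COMPRESSED_BYTES):
    """Compare the size of a local file with an expected byte count.

    Returns a tuple (ok, actual_size): ok is True when the sizes match,
    and actual_size is the size of the file on disk in bytes.
    """
    actual = os.path.getsize(path)
    return actual == expected, actual
```

For example, check_download("tdt_january.tar.gz") (a hypothetical file name) returns a pair whose first element is False if the file was truncated in transit, which is a common result of an ASCII-mode or interrupted ftp transfer.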
Last updated Wed Sep 9 09:40:47 1998