To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re-release of TDT2 Training+Devtest data
Date: Wed, 07 Oct 1998 14:59:21 EDT
Folks,
To go along with new versions of evaluation software and index files
(just announced by Jon Fiscus), I'd like to announce also that the
re-release of TDT2 training and devtest data will be shipping today,
via overnight delivery service.
As described in previous email messages, the corpus occupies a single
cdrom (a tad over 600 MB of data, uncompressed), includes all four
months of data (January-April) in the same directory structure common
to earlier releases, and provides a single topic relevance table that
contains a listing of all on-topic stories (January-April) with
respect to all training and devtest topics (1-66) found by LDC
annotators. The topic table also incorporates revisions based on
adjudication of all dry-run test results and additional feedback from
research sites about missed stories.
I will attach the main readme file from the cdrom below, and will
circulate summary tables of quantities (files/stories by month/source,
etc.) tomorrow.
Those of you who lack the benefit of Rock-Ridge-capable cdrom readers
can use file-name mapping tables that have been included on the cdrom
to relate the iso-9660-truncated (8.3 character) file names to the
original (28-or-so character) file names. Please contact me if you have
questions or problems with this feature of the corpus.
Dave Graff
Top level readme file for the October 6 Release of the TDT2 Corpus
==================================================================
This release of TDT2 data consists of all text and table files
spanning the collection period of January 4 through April 30, 1998,
comprising the training and development test sets designated for the
1998 TDT2 Benchmark Test. Most of the material here has been released
previously to TDT2 participants, and the present release incorporates
a number of fixes to the earlier versions. Also, some data from the
training and dev-test periods is being presented now for the first
time.
PARTITIONING OF THE DATA
------------------------
All files are grouped into main data directories by type of data:
  sgml/    -- contains the reference "true" text of news content,
              together with descriptive markup surrounding each story
  tkntext/ -- contains tokenized news content from sgml files, with
              individual tokens labeled by "recid" numbers
  asrtext/ -- contains ASR output from audio recordings of broadcasts,
              with individual words labeled by "recid" numbers
  tables/  -- contains boundary tables for each tkntext and each asrtext
              file, indicating the extent of each story unit in terms of
              word/token "recid" numbers, and in terms of starting and
              ending times for broadcast sources; also contains the topic
              relevance table
In the sgml and tkntext directories, there are up to 16 files for each
calendar date in the collection period (up to 8 files per day in the
asrtext directory).
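The relationship between a boundary table entry and a recid-labeled
token stream can be sketched as follows. The on-disk formats of the
tkntext, asrtext and boundary table files are not described in this
readme, so the sketch works on in-memory stand-ins: the (recid, text)
pairs, the function name story_tokens, and the treatment of both recid
bounds as inclusive are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical in-memory stand-in for one tkntext/asrtext stream:
# each token is a (recid, text) pair. The real file format is not
# specified in this readme.
Token = Tuple[int, str]


def story_tokens(tokens: Iterable[Token], first_recid: int, last_recid: int) -> List[str]:
    """Return the text of every token whose recid lies in the story extent.

    ASSUMPTION: the extent is given as a first and last recid, and both
    ends are inclusive.
    """
    if first_recid > last_recid:
        raise ValueError("first_recid must not exceed last_recid")
    return [text for recid, text in tokens if first_recid <= recid <= last_recid]


# Made-up example stream containing two short "stories".
stream = [(1, "the"), (2, "first"), (3, "story"), (4, "next"), (5, "story")]
first_story = story_tokens(stream, 1, 3)
```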
The TDT2 project defines a partitioning of the data into training and
development test sets: the training data comprise the material
collected in January and February, and the development test data
comprise the March and April collection period. While this division
of the data is not explicit in the organization of the data
directories, it is directly represented in the individual file names,
which reflect the date and time of collection for each sample.
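As a minimal sketch, the partition of a file could be recovered from
its name as shown here. The exact file name layout is not spelled out
in this readme; the code assumes that each long file name begins with
an eight-digit YYYYMMDD collection date, and the function name and the
example file names are hypothetical.

```python
import re
from datetime import date

# Collection period and partition months as stated in this readme.
COLLECTION_START = date(1998, 1, 4)
COLLECTION_END = date(1998, 4, 30)
TRAINING_MONTHS = {1, 2}   # January, February
DEVTEST_MONTHS = {3, 4}    # March, April

# ASSUMPTION: long file names start with the collection date as YYYYMMDD.
_DATE_PREFIX = re.compile(r"(\d{4})(\d{2})(\d{2})", re.ASCII)


def file_partition(file_name: str) -> str:
    """Return "training" or "devtest" for a long (untruncated) file name."""
    match = _DATE_PREFIX.match(file_name)
    if match is None:
        raise ValueError(f"no YYYYMMDD prefix in {file_name!r}")
    year, month, day = (int(part) for part in match.groups())
    collected = date(year, month, day)  # raises ValueError on impossible dates
    if not COLLECTION_START <= collected <= COLLECTION_END:
        raise ValueError(f"{collected} is outside the collection period")
    return "training" if collected.month in TRAINING_MONTHS else "devtest"
```

For example, a hypothetical name such as "19980215_example.sgml" would
be placed in the training partition, and "19980301_example.sgml" in
the development test partition.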
In addition to partitioning the four-month data collection, the TDT2
project partitions the target topics that have been defined and
annotated over this material. A total of 66 topics were drawn from
this collection period; of these, topics 1 through 37 are drawn from
the training partition, and topics 38 through 66 are drawn from the
development test partition.
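The topic split described above amounts to a simple range check. The
following is only an illustration; the function name is hypothetical
and the topic number is assumed to be available as an integer.

```python
def topic_partition(topic_id: int) -> str:
    """Return the partition a topic was drawn from.

    Topics 1 through 37 are drawn from the training partition and
    topics 38 through 66 from the development test partition.
    """
    if 1 <= topic_id <= 37:
        return "training"
    if 38 <= topic_id <= 66:
        return "devtest"
    raise ValueError(f"topic {topic_id} is not one of the 66 defined topics")
```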
The single topic relevance table contains all relations of topics to
stories; that is, all stories in the four-month collection have been
judged against all 66 topics, and this one table stores all the cases
found by LDC annotators of on-topic stories. (This release of the
table also incorporates a number of inputs from the TDT2 research
sites, including dry-run test results from the Tracking task.)
DIFFERENCES FROM EARLIER RELEASES
---------------------------------
As indicated above, this release includes a number of sample files
(34) that had not been released previously for various reasons. Also,
most of the newswire (APW, NYT) sample files have been modified to fix
some minor errors in the SGML markup format. These repairs altered the
token content of a minority of stories in most files, which in turn
changes the tokenized streams for those files. In all such cases, the
tokenized versions in the earlier releases contained some non-news
content that had not been properly set apart by the SGML markup. Most
of these cases have now been corrected.
In addition, a total of 54 segmentation errors in 48 CNN files were
discovered in the earlier releases, and these have been repaired. Each
involved a missed end-of-story (start-of-commercial) boundary, which
caused too much material from the asrtext files to be included as
story content.
The topic relevance table incorporates a number of repairs stemming
from reviews of dry-run test results and other findings from TDT2
researchers. It also includes, for the first time, the annotations of
training topics against dev-test data, and of dev-test topics against
training data. These latter sets of judgements have not undergone as
complete a quality review as the within-partition topic labels, and
the LDC is looking forward to feedback from researchers to further
refine the table.
FILE NAMES
----------
All data file names in this corpus are quite long. This cdrom is
being created using Rock Ridge extensions to the ISO 9660 format for
cdroms, to allow full name recovery on systems that support Rock
Ridge. For systems that lack this support, each directory contains a
file called "CDRNAMES.TBL", which maps the original long name to the
truncated ISO 9660 name. Since use of the data (especially with
reference to index files supplied by NIST) requires access to the long
names, it may be necessary for some people to use the CDRNAMES.TBL
file in each directory to create a copy of (or link to) the cdrom
files in order to assign the correct full-length file name to each
one.
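One possible way to restore the long names on a system without Rock
Ridge support is sketched here. The layout of CDRNAMES.TBL is not
described in this readme, so the code assumes that each line holds
two whitespace-separated fields, the long name followed by the
truncated name; check an actual table and swap the fields if needed.
The function names are hypothetical, and the truncated names are
assumed to appear on disk exactly as listed in the table.

```python
from pathlib import Path
from typing import Dict


def read_name_table(table_path: Path) -> Dict[str, str]:
    """Return a mapping from truncated name to original long name.

    ASSUMPTION: each useful line is "<long name> <truncated name>".
    Lines that do not have exactly two fields are skipped.
    """
    mapping: Dict[str, str] = {}
    for line in table_path.read_text().splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        long_name, short_name = fields
        mapping[short_name] = long_name
    return mapping


def link_with_long_names(cdrom_dir: Path, target_dir: Path,
                         table_name: str = "CDRNAMES.TBL") -> int:
    """Create symbolic links in target_dir that carry the long names.

    Returns the number of links created. Files named in the table but
    missing from cdrom_dir, and links that already exist, are skipped.
    """
    mapping = read_name_table(cdrom_dir / table_name)
    target_dir.mkdir(parents=True, exist_ok=True)
    created = 0
    for short_name, long_name in mapping.items():
        source = cdrom_dir / short_name
        link = target_dir / long_name
        if source.exists() and not link.exists():
            link.symlink_to(source.resolve())
            created += 1
    return created
```

This would be run once per data directory (sgml, tkntext, asrtext,
tables), since each directory carries its own table. Where symbolic
links are not available, copying each file to its long name (for
instance with shutil.copyfile) serves the same purpose at the cost of
disk space.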
Last updated Wed Oct 28 14:44:11 1998