To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Plans for full train+devtest TDT release
Date: Fri, 02 Oct 1998 13:38:09 EDT

Folks,

At the last TDT-PI meeting at IBM, I mentioned that there would be
some details to resolve about how to organize the complete release of
training and devtest data, which is scheduled to go out next Tuesday
(Oct. 6). I'd like to review those issues now, and propose a
structure for the release. Please give some careful attention to this
proposal, and let me know very soon if you believe strongly that
anything should be done differently.

The main objective will be to keep things as similar as possible to
the structure of earlier releases, so that the indexing and scoring
software will continue to work.

Another objective is to make sure that there be no confusion about how
to partition the data into training and testing sets -- both in terms
of the overall division of the corpus, and in terms of the "per-topic"
cut-off points (based on "Nt-max") for the tracking task.

Regarding that last point about the tracking task: we will be
releasing topic judgements for all four months of data (January-April)
on all train+devtest topics (1-66). Mike Schultz pointed out that a
problem may arise when devtest topics (38-66) happen to show up in the
two months of training texts (Jan-Feb) -- the last time I checked, we
had a total of 110 hits in the training data involving 8 of the
devtest topics (these numbers may change as a result of recent QC).

I'm not entirely certain which of the following two options will be
adopted by the sites when running development tests for the tracking
task:

(1) use all four months of text data, and locate the train/test
cut-off point for each topic at the "Nt-max"th on-topic story
starting from 19980104

(2) use only March-April text data, and locate the cut-off points at
the "Nt-max"th on-topic story starting from 19980301

I will assume that option (2) is the more likely one, which means that
there will need to be separate topic-relevance tables for the
within-set and across-set annotations. If option (1) is adopted, the
two devtest topic tables can simply be combined to get the overall
coverage.
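
To make the difference between the two options concrete, here is a
minimal Python sketch of locating the cut-off for a single topic. It
is only an illustration: the function name, the (file_name,
story_index) layout, and the sample stories are all hypothetical, the
parsing of the relevance tables is left out, and it assumes that
sorting by file name approximates chronological order.

```python
def split_at_nt_max(on_topic, nt_max, start_date):
    """Split one topic's on-topic stories at the Nt-max-th story.

    on_topic   -- list of (file_name, story_index) pairs (hypothetical layout)
    nt_max     -- number of on-topic stories allowed for training
    start_date -- "YYYYMMDD"; stories in earlier files are ignored
    """
    # Assumption: file names start with an 8-digit date, so string
    # comparison on the first 8 characters orders them by date.
    eligible = sorted(s for s in on_topic if s[0][:8] >= start_date)
    return eligible[:nt_max], eligible[nt_max:]


# Made-up on-topic stories for one devtest topic.
stories = [
    ("19980105_A", 3), ("19980220_B", 1),
    ("19980302_C", 2), ("19980310_D", 5), ("19980401_E", 1),
]

# Option (1): all four months, counting from 19980104.
train1, test1 = split_at_nt_max(stories, 2, "19980104")
# Option (2): March-April only, counting from 19980301.
train2, test2 = split_at_nt_max(stories, 2, "19980301")
```

With these made-up stories and Nt-max = 2, option (1) uses up both
training stories in Jan-Feb, while option (2) takes them from March.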

Having explained that point, here is an overview of the planned
structure for the release:

tdt_deliv_981006/

asrtext/ -- contains ALL asr files, 19980104* - 19980430*
tkntext/ -- contains ALL tkn files, 19980104* - 19980430*
tables/ -- contains ALL boundary table files for asr and tkn data
sgml/ -- contains ALL sgml files, 19980104* - 19980430*
dtd/ -- contains DTD files for tables, asr and tkn data

trntop-trntxt.rel -- relevance table, train topics * train data
trntop-devtxt.rel -- relevance table, train topics * dev data
devtop-trntxt.rel -- relevance table, dev topics * train data
devtop-devtxt.rel -- relevance table, dev topics * dev data

Please note the following:

- The various data directories will have LOTS of files in them
(tkntext/ will have over 1700, tables/ will have over 2300).

- The overall training/devtest partition of data files will be on the
basis of file name patterns ONLY -- 19980[12]* for training files,
19980[34]* for devtest files.

- The topic relevance tables will no longer be kept in the tables/
directory -- they will be in the top-level directory.

- We will distribute the package on cdrom -- it will be about 600 MB
of data, uncompressed.

- We will coordinate with NIST to make sure that new index files are
available by the time you receive the cdroms.

- The cdroms will use ISO 9660 format with Rock Ridge extensions to
provide full POSIX file names on systems that support Rock Ridge;
in case your system does not have Rock Ridge support, the cdroms
will include file-name translation tables (a file named
"CDRNAMES.TBL" in each directory) to map the "8.3-character" file
names of ISO 9660 to the longer TDT file names.

- Later today we will send email to tdt-distrib listing the intended
recipients of the cdroms, with postal addresses -- please respond
to ldc@ldc.upenn.edu if your address needs to be added, removed or
corrected.
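
As an illustration of the file-name-based partition described in the
notes above, the following Python sketch applies the two patterns with
the standard fnmatch module. The function name and the sample file
names are made up for the example; only the two patterns come from
this message.

```python
from fnmatch import fnmatchcase

# Patterns as given in this message.
TRAIN_PATTERN = "19980[12]*"    # January-February
DEVTEST_PATTERN = "19980[34]*"  # March-April


def partition(file_names):
    """Return (training, devtest) lists chosen by file name pattern only."""
    train = [n for n in file_names if fnmatchcase(n, TRAIN_PATTERN)]
    devtest = [n for n in file_names if fnmatchcase(n, DEVTEST_PATTERN)]
    return train, devtest


# Hypothetical file names, for illustration only.
names = ["19980104_x.tkn", "19980228_y.asr", "19980301_z.tkn",
         "19980430_w.sgml", "README"]
train_files, devtest_files = partition(names)
```

Files matching neither pattern (such as "README" here) end up in
neither set.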

Thanks very much.

Dave Graff

Last updated Fri Oct 2 19:04:21 1998