To: TDT distribution <tdt-distrib@ldc.upenn.edu>
From: George Doddington <doddington@nist.gov>
Subject: Re: Plans for full train+devtest TDT release
Date: Mon, 05 Oct 1998 16:47:54 -0400
Regarding the division of the upcoming LDC TDT2 corpus release
into training and devset portions, I've talked with Dave Graff and
Jon Fiscus about this, with the following result:
* In general, the training set comprises data collected in January and
February, and the devset comprises data collected in March and April.
* In general, the training set topics are topics defined from stories in
the training data, and the devset topics are topics defined from stories
in the devset.
* The tracking task for training set topics will use training stories taken
from the training data (January and February), and the tracking task for
devset topics will use training stories taken from the devset data only
(March and April). In particular, the division of the corpus into training
and test will be formally defined by the index files used for evaluation,
and sites are urged to use these index files to control their systems.
* The detection task will include all of the source files in the index file
and will not make a distinction between training data and devset data.
However, performance scores may be computed separately for training
topics and devset topics.
* All of the relevance data will be combined within a single table, rather
than the four separate tables suggested by Dave Graff in his email note
of earlier today. (The formal division of training and devset data for the
tracking task will be determined implicitly by the evaluation index files
for the tracking task.)
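
For sites that want a rough sanity check of the general date rule above, it
could be sketched in Python as follows. This is only an illustration under
stated assumptions: the story IDs, the (story_id, collection_date) record
layout, and the function names are hypothetical and not part of the LDC
release. The evaluation index files remain the formal definition of the
division, so a check like this should not replace them.

```python
from datetime import date

# General rule from the message: January-February is training data,
# March-April is devset data. The index files are authoritative.
TRAINING_MONTHS = {1, 2}
DEVSET_MONTHS = {3, 4}


def partition(collection_date):
    """Return 'training', 'devset', or None for a story's collection date."""
    if collection_date.month in TRAINING_MONTHS:
        return "training"
    if collection_date.month in DEVSET_MONTHS:
        return "devset"
    return None  # outside the months covered by this release


def split_by_partition(stories):
    """Group hypothetical (story_id, collection_date) pairs by partition."""
    split = {"training": [], "devset": []}
    for story_id, collected in stories:
        part = partition(collected)
        if part is not None:
            split[part].append(story_id)
    return split


if __name__ == "__main__":
    # Made-up story IDs, for illustration only.
    example = [
        ("story-a", date(1998, 1, 15)),
        ("story-b", date(1998, 3, 2)),
        ("story-c", date(1998, 5, 1)),
    ]
    print(split_by_partition(example))
```

Stories dated outside January through April are left out of both groups
here, since the message does not say how such data would be treated.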
--
George Doddington at NIST: doddington@nist.gov or 301/975-3261
Last updated Wed Oct 28 14:44:10 1998