
To: TDT Distrib <tdt-distrib@ldc.upenn.edu>
From: Jonathan Fiscus <jonathan.fiscus@nist.gov>
Subject: 2001 TDT Evaluation Data Use Outline
Date: Wed, 09 May 2001 09:42:36 -0400

TDT Evaluation Participants:

Since we'll be using the TDT3 corpus for this Fall's evaluation and half
of the TDT3 topics will be released for training, we should modify the
tasks or corpora slightly to make this year's TDT evaluation
sufficiently different and to protect systems from training to the TDT3
data.

Previous discussions identified four potential ways to accomplish this.
I believe that options 1 and 2 are going to happen; options 3 and 4 are
certainly more controversial. Also, the training and development data
should be specified for each evaluation task.

The rest of this email outlines the corpus changes and potential rules
for corpus usage by evaluation task. I'd like to get consensus on these
issues so that the evaluation plan can be as specific as possible.

Regards,
Jon



Corpus modifications:
---------------------

1: Release the newswire texts for the intervening three months between
TDT2 and TDT3 for the evaluation. (I'll call this TDT2.5 for lack of a
better name.) IBM agreed that this would be a good way to "confuse" :)
the detection systems. If we did this, all TDT2.5 stories would need to
be excluded from scoring during the evaluation since there are no topic
annotations for that data.

2: Add additional newswire data from the TDT3 epoch. The published TDT3
corpus contains only a subsample of the available newswire data. This
option would publish all the newswire data.

3: Declare BRIEF documents to be scorable, as either on-topic or
off-topic. (I'm in favor of on-topic). This will give a more
"realistic" quality to the evaluation, since right now we're closing our
eyes to a serious issue.

4: Declare Non-News stories to be scorable, off-topic stories. Again,
this gives a more realistic data set.
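
As a rough illustration of the scoring exclusion in option 1, the sketch
below drops TDT2.5 stories before scoring, since they carry no topic
annotations. The Story record, the corpus labels, and the scorable()
helper are hypothetical names chosen for this example; they are not part
of any TDT tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    story_id: str
    corpus: str  # hypothetical labels: "TDT2.5" or "TDT3"


def scorable(stories):
    """Keep only stories that can be scored, i.e. everything except TDT2.5."""
    return [s for s in stories if s.corpus != "TDT2.5"]


# Hypothetical stories: systems still see all of them, but only the
# TDT3 stories would count toward the score.
stories = [Story("a", "TDT2.5"), Story("b", "TDT3"), Story("c", "TDT3")]
scored = scorable(stories)
```

Systems would still process the TDT2.5 stories in the stream; the filter
applies only when decisions are scored.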


Evaluation Task Data Usage Guidelines:
--------------------------------------

For each evaluation task, I propose the following test scenarios:

Segmentation:
Same status as last year.

Training and Development Test Corpora: TDT2
Testing Corpora: TDT3

Topic Tracking:
Release 1/2 of the TDT3 topics for system development. The TDT3 topics
will be divided into two sets balanced by 1999 and 2000 topics and by
topic size.
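
One way such a balanced division could be made is sketched below. This
is only an assumption about the procedure, not the method that will
actually be used: topics are grouped by year, ranked by size within each
year, and dealt out in "snake" order (A, B, B, A, ...) so that both sets
get a similar mix of large and small topics. The topic IDs and sizes are
made up for the example.

```python
def split_topics(topics):
    """Split (topic_id, year, size) tuples into two roughly balanced sets."""
    dev, held_out = [], []
    by_year = {}
    for topic in topics:
        by_year.setdefault(topic[1], []).append(topic)
    for year in sorted(by_year):
        # Largest topics first; ties broken by topic ID for determinism.
        ranked = sorted(by_year[year], key=lambda t: (-t[2], t[0]))
        for i, topic in enumerate(ranked):
            # Snake order: positions 0 and 3 of every four go to dev,
            # positions 1 and 2 go to the held-out set.
            (dev if i % 4 in (0, 3) else held_out).append(topic)
    return dev, held_out


# Hypothetical topics: (id, year, number of on-topic stories).
topics = [
    ("t1", 1999, 40), ("t2", 1999, 30), ("t3", 1999, 20), ("t4", 1999, 10),
    ("t5", 2000, 8), ("t6", 2000, 6), ("t7", 2000, 4), ("t8", 2000, 2),
]
dev, held_out = split_topics(topics)
```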

The exposed TDT3 topics and the TDT3 corpus can only be used as a
development test set, i.e. participants can tune systems to the
development test topics, but not use the corpus for background
statistics, etc.

During the evaluation, test on all TDT3 topics and permit systems access
to the TDT2.5 corpus for the training epoch, but report results using
the unexposed evaluation topics. The performance difference between the
exposed and unexposed topics can then be used to gauge the "training"
effect.
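
A minimal sketch of the exposed versus unexposed comparison, assuming
each topic has a single per-topic score where lower is better (for
example a cost value). The function name, the topic IDs, and the scores
are hypothetical and only show the arithmetic.

```python
from statistics import mean


def training_effect(scores, exposed):
    """Mean score on unexposed topics minus mean score on exposed topics.

    With a lower-is-better score, a positive value suggests systems did
    better on the topics they could tune to.
    """
    exposed_scores = [v for t, v in scores.items() if t in exposed]
    unexposed_scores = [v for t, v in scores.items() if t not in exposed]
    return mean(unexposed_scores) - mean(exposed_scores)


# Hypothetical per-topic scores; t1 and t2 stand in for the released topics.
scores = {"t1": 0.10, "t2": 0.20, "t3": 0.30, "t4": 0.40}
effect = training_effect(scores, exposed={"t1", "t2"})
```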

Training Corpora: TDT2, all topics
Development Test Corpora: TDT3, development topic subset
Testing Corpora: TDT2.5 and the Augmented TDT3 corpus, all TDT3 topics

Topic Detection:
There are two possibilities here:

1) Follow the regime described above for tracking. There is a concern
that training on the exposed TDT3 topics will distort the performance on
the unexposed topics. Therefore, I think option 2 is better.

2) Restrict the training and development to TDT2, like last year, and
evaluate using TDT2.5 and the Augmented TDT3.

First Story Detection:

Same as Topic Detection.

Link Detection:

Same as Topic Tracking, except generate a new set of index files for the
unexposed topics.

Last updated Wed Aug 22 16:07:30 2001