To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Adjudicated topic_relevance.table
Date: Fri, 08 Jan 1999 15:43:00 EST
Folks,
We have finished our review of the TDT-2 site results submitted to
NIST last month, and have prepared an adjudicated version of the
topic_relevance.table for the TDT-2 evaluation data. This table
differs from the one delivered to NIST last October, as a result of
the following process:
- we reviewed every story which the LDC had originally labeled as
"on-topic" but which had been missed in at least one set of site
results; if the annotator decided at this point that the story was in
fact NOT on-topic, it was eliminated from the topic relevance table.
- we reviewed every story which had been identified by any site as
being "on-topic" but which was not originally labeled as such; if the
annotator decided at this point that the story was in fact on-topic,
it was added to the topic relevance table.
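The two review passes above can be read as set operations on
(topic, story) pairs. The sketch below only illustrates that logic
and is not the LDC's actual tooling: the name `adjudicate`, the pair
representation, and the `annotator_confirms` callback (standing in
for the human re-review) are all hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple

Pair = Tuple[int, str]  # (topic id, story id); hypothetical representation


def adjudicate(
    original: Set[Pair],
    site_results: Iterable[Set[Pair]],
    annotator_confirms: Callable[[Pair], bool],
) -> Set[Pair]:
    """Return an adjudicated table from the original labels and site outputs.

    annotator_confirms stands in for a human re-review of a single pair.
    """
    sites = list(site_results)
    table = set(original)

    # Pass 1: pairs originally labeled on-topic but missed by at least one
    # site. If the annotator now rejects one, it was a false alarm: drop it.
    for pair in original:
        missed_somewhere = any(pair not in site for site in sites)
        if missed_somewhere and not annotator_confirms(pair):
            table.discard(pair)

    # Pass 2: pairs that any site called on-topic but that were not in the
    # original table. If the annotator now accepts one, it was a miss: add it.
    candidates = set().union(*sites) - original
    for pair in candidates:
        if annotator_confirms(pair):
            table.add(pair)

    return table


# Toy usage with made-up story ids; `truth` plays the re-reviewing annotator.
original = {(70, "A"), (70, "B")}
sites = [{(70, "A"), (70, "C")}, {(70, "A"), (70, "B"), (70, "D")}]
truth = {(70, "A"), (70, "C")}
adjudicated = adjudicate(original, sites, lambda pair: pair in truth)
```

Note that a story found by every site and already in the original
table is never re-reviewed in this sketch, which matches the two
bullets as written.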
Here is a summary of how many stories were eliminated from the
original table (these were "false alarms" on the part of LDC
annotators during the initial topic labeling last fall):
21 topicid=70
6 topicid=71
4 topicid=74
2 topicid=76
1 topicid=77
4 topicid=86
1 topicid=87
1 topicid=91
12 topicid=96
52 total
Here is a summary of how many stories were added to the table (these
were "misses" on the part of LDC annotators):
80 topicid=70
12 topicid=71
4 topicid=72
20 topicid=74
39 topicid=76
1 topicid=77
1 topicid=79
3 topicid=83
3 topicid=84
4 topicid=85
4 topicid=86
6 topicid=87
18 topicid=88
1 topicid=89
4 topicid=91
1 topicid=93
43 topicid=96
2 topicid=100
246 total
You'll notice that the two kinds of error are correlated by topic:
topics 70 and 96 account for the most false alarms and the most misses.
This suggests that further investigation may be warranted to
determine why certain topics were so error-prone, and to make sure
that the topic definitions were both sensible and stable.
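For anyone who wants to check the per-topic pattern, here is a small
sketch that tallies the two summaries given in this message. The
counts are copied from the lists above; the helper name `top_topics`
is just an illustrative choice.

```python
# Per-topic counts copied from the two summaries in this message.
eliminated = {70: 21, 71: 6, 74: 4, 76: 2, 77: 1, 86: 4, 87: 1, 91: 1, 96: 12}
added = {
    70: 80, 71: 12, 72: 4, 74: 20, 76: 39, 77: 1, 79: 1, 83: 3, 84: 3,
    85: 4, 86: 4, 87: 6, 88: 18, 89: 1, 91: 4, 93: 1, 96: 43, 100: 2,
}

# The stated totals should match the sums of the per-topic lines.
total_eliminated = sum(eliminated.values())  # 52
total_added = sum(added.values())            # 246


def top_topics(counts, k):
    """Return the k topic ids with the largest counts, largest first."""
    return sorted(counts, key=counts.get, reverse=True)[:k]


# Topics with the most false alarms and the most misses.
worst_false_alarms = top_topics(eliminated, 2)  # [70, 96]
worst_misses = top_topics(added, 2)             # [70, 96]

# Combined error count per topic, over every topic in either list.
combined = {
    topic: eliminated.get(topic, 0) + added.get(topic, 0)
    for topic in sorted(set(eliminated) | set(added))
}

print(total_eliminated, total_added)
print(worst_false_alarms, worst_misses)
print(top_topics(combined, 3))
```

The same two topics head both lists, which is the pattern worth
investigating.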
You can download the adjudicated topic relevance table for the
eval-set data from the LDC's "members_only" ftp directory:
[ftp instructions available on request from graff@ldc.upenn.edu]
Note that this is simply the single, uncompressed, plain-text file
containing the SGML-formatted table of topic-story relations for the
eval data. The file size is 194306 bytes.
Dave Graff
Last updated Wed Feb 3 10:44:20 1999