To: David Graff <graff@unagi.cis.upenn.edu>
From: Rich Schwartz <schwartz@bbn.com>
Subject: Re: Adjudicated topic_relevance.table
Date: Sat, 9 Jan 1999 23:05:54 -0500 (EST)

Dave,

It's interesting that the human miss rate of 246 / 1575 is
comparable to or higher than that of the automatic trackers. Of course,
the human false accept rate is much lower. I guess this should be expected.
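
For reference, the ratio quoted above works out to roughly 15.6%. A
minimal Python sketch of the arithmetic follows. It assumes that 1575 is
the total number of on-topic stories the misses are measured against
(the message does not spell this out), and the function name is purely
illustrative.

```python
def miss_rate(missed: int, total_on_topic: int) -> float:
    """Fraction of on-topic stories that were not labeled as on-topic."""
    if total_on_topic <= 0:
        raise ValueError("total_on_topic must be positive")
    return missed / total_on_topic


# Figures quoted in the message: 246 misses out of 1575 stories.
rate = miss_rate(246, 1575)
print(f"human miss rate: {rate:.1%}")  # prints "human miss rate: 15.6%"
```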

It is also striking how radically different the eval test is from
the dev-test. We had said that the dev-test was different from the "train
set" (Jan-Feb) because of a few factors:

1. Since the train set was the first, there was a start-up period with a
big backlog of stories.

2. The rules changed a little from train to dev.

But the eval set is different still. (The number of on-topic
stories is more than 3 times as large to start with.) The obvious
difference is that the scores are several times worse than on the dev set.

I think this says what we have known (and seen) for years. If
test sets are made at different times, they will be different. While this
may be natural, it makes research difficult because you can't reliably
develop algorithms that are stable. The only way to get two sets of data
that are remotely comparable is to create them at the same time. That
way, the random differences will wash out (as Ellen Voorhees pointed out)
by virtue of having 25 randomly chosen topics in each set.

--Rich
====================================================================

On Fri, 8 Jan 1999, David Graff wrote:

> Date: Fri, 08 Jan 1999 15:43:00 EST
> From: David Graff <graff@unagi.cis.upenn.edu>
> To: tdt-distrib@unagi.cis.upenn.edu
> Subject: Adjudicated topic_relevance.table
>
>
> Folks,
>
> We have finished our review of the TDT-2 site results submitted to
> NIST last month, and have prepared an adjudicated version of the
> topic_relevance.table for the TDT-2 evaluation data. This table
> differs from the one delivered to NIST last October, as a result of
> the following process:
>
> - we reviewed every story which the LDC had originally labeled as
> "on-topic" but which had been missed in at least one set of site
> results; if the annotator decided at this point that the story was in
> fact NOT on-topic, it was eliminated from the topic relevance table.
>
> - we reviewed every story which had been identified by any site as
> being "on-topic" but which was not originally labeled as such; if the
> annotator decided at this point that the story was in fact on-topic,
> it was added to the topic relevance table.
>
> Here is a summary of how many stories were eliminated from the
> original table (these were "false alarms" on the part of LDC
> annotators during the initial topic labeling last fall):
>
> 21 topicid=70
> 6 topicid=71
> 4 topicid=74
> 2 topicid=76
> 1 topicid=77
> 4 topicid=86
> 1 topicid=87
> 1 topicid=91
> 12 topicid=96
> 52 total
>
> Here is a summary of how many stories were added to the table (these
> were "misses" on the part of LDC annotators):
>
> 80 topicid=70
> 12 topicid=71
> 4 topicid=72
> 20 topicid=74
> 39 topicid=76
> 1 topicid=77
> 1 topicid=79
> 3 topicid=83
> 3 topicid=84
> 4 topicid=85
> 4 topicid=86
> 6 topicid=87
> 18 topicid=88
> 1 topicid=89
> 4 topicid=91
> 1 topicid=93
> 43 topicid=96
> 2 topicid=100
> 246 total
>
> You'll notice that the quantity of errors is correlated by topic.
> This suggests that further investigation may be warranted to
> determine why certain topics were so error-prone, and to make sure
> that the topic definitions were both sensible and stable.
>
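
The per-topic counts in the quoted message can be checked and ranked
with a few lines of Python. This is only a sketch: the dictionary names
are illustrative, and adding "eliminated" and "added" counts together as
a single measure of annotation change per topic is an assumption made
here, not something the message defines.

```python
# Stories eliminated from the table (annotator false alarms), by topic id.
removed = {70: 21, 71: 6, 74: 4, 76: 2, 77: 1, 86: 4, 87: 1, 91: 1, 96: 12}

# Stories added to the table (annotator misses), by topic id.
added = {70: 80, 71: 12, 72: 4, 74: 20, 76: 39, 77: 1, 79: 1, 83: 3,
         84: 3, 85: 4, 86: 4, 87: 6, 88: 18, 89: 1, 91: 4, 93: 1,
         96: 43, 100: 2}

# The totals agree with the ones given in the message.
assert sum(removed.values()) == 52
assert sum(added.values()) == 246

# Combined number of changes per topic (assumed measure, see above).
changes = {t: removed.get(t, 0) + added.get(t, 0)
           for t in set(removed) | set(added)}

# Topics with the most changes first.
ranked = sorted(changes.items(), key=lambda kv: kv[1], reverse=True)
top3 = ranked[:3]
share = sum(n for _, n in top3) / sum(changes.values())

for topic, n in top3:
    print(f"topicid={topic}: {n} changes")
print(f"top 3 topics account for {share:.0%} of all changes")
```

On these figures, topics 70, 96 and 76 lead both lists, which is the
concentration of errors that the message points to.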

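The two review passes described in the quoted message amount to simple
set operations on story identifiers. The sketch below is one possible
reading for a single topic, not the LDC's actual tooling: the function
name, the data layout and the story ids are all hypothetical.

```python
def review_candidates(ldc_on_topic, site_results):
    """Return the two sets of stories to send back to an annotator.

    ldc_on_topic: set of story ids originally labeled on-topic.
    site_results: dict mapping a site name to the set of story ids
                  that site judged on-topic (hypothetical layout).
    """
    if not site_results:
        return set(), set()

    flagged_by_any = set().union(*site_results.values())
    flagged_by_all = set.intersection(*site_results.values())

    # Labeled on-topic, but missed in at least one set of site results:
    # possible annotator false alarms.
    possible_false_alarms = ldc_on_topic - flagged_by_all

    # Flagged by some site, but not originally labeled on-topic:
    # possible annotator misses.
    possible_misses = flagged_by_any - ldc_on_topic

    return possible_false_alarms, possible_misses


# Made-up story ids, for illustration only.
ldc = {"a", "b", "c"}
sites = {"site1": {"a", "b", "d"}, "site2": {"a", "e"}}
false_alarms, misses = review_candidates(ldc, sites)
print(sorted(false_alarms), sorted(misses))  # prints ['b', 'c'] ['d', 'e']
```

An annotator's decision on each candidate would then determine whether
the story is removed from, added to, or left unchanged in the table.
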
Last updated Wed Feb 3 10:44:21 1999