(302) previous ~ index ~ next

To: Rich Schwartz <schwartz@bbn.com>
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: Adjudicated topic_relevance.table
Date: Mon, 11 Jan 1999 13:13:27 EST

Rich Schwartz wrote:

> It's interesting that the human miss rate of 246 / 1575 is
> comparable to or higher than the automatic trackers. Of course,
> the false accept rate is much lower. I guess this should be
> expected.

Yes, this is to be expected, in particular because the eval-set
annotations went through only two of the three QC stages applied to
the earlier data sets. Since we knew we would be reviewing results
from the sites, we held off from doing the "recall" stage of QC, which
looks for on-topic stories that were missed in the first annotation
pass. The LDC false alarm rate was higher than I expected,
considering that the eval-set did undergo a "precision" check back in
October, in which all on-topic judgements were reviewed for accuracy.
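
For reference, the quoted miss rate works out as follows. This is only a sketch: it assumes that 1575 is the total number of on-topic story judgements after adjudication and that 246 is the number of those the annotators missed, and the function name is made up for illustration.

```python
def miss_rate(missed: int, on_topic_total: int) -> float:
    """Fraction of on-topic stories that were not labeled on-topic."""
    if on_topic_total <= 0:
        raise ValueError("on_topic_total must be positive")
    return missed / on_topic_total

# Figures quoted above; their interpretation is an assumption.
rate = miss_rate(246, 1575)
print(f"{rate:.1%}")  # prints 15.6%
```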

> It also is striking how radically different the eval test is from
> the dev-test. ... (The number of on-topic
> stories is more than 3 times as much to start.) The obvious
> difference is that the scores are several times worse than on the
> dev set.

Rich may be exaggerating the on-topic quantity difference a bit (I
think the factor is closer to 2.7, not "more than 3"), but it's clear
that topic selection during the three partition periods did not yield
comparable quantities; the training set was inordinately rich, and
the devtest set inordinately poor. The eval-set, thank goodness, was
middling between those extremes.

I haven't looked at any of the scores -- did the adjudication improve
them?

Apart from the differences in the number of on-topic stories, the
number of trackable topics, and the scoring of system performance, I
don't know of anything that makes the devtest data "radically
different" from the evaltest data. Are these differences "radical"?

> I think this says what we have known (and seen) for years.
> If test sets are made at different times, they will be different.
> ... The only way to
> get two sets of data that are remotely comparable is to create them
> at the same time. That way, the random differences will wash out
> (as Ellen Voorhees pointed out) by virtue of having 25 randomly
> chosen topics in each set.

Does this mean that you would propose defining/annotating 50 topics
on a given time-period/data-set, using 25 of those topics and that
data-set for devtest, and then using the other 25 topics and the same
data-set for eval? Would there really be an adequate stress test on
story segmentation and topic detection if you're just using the same
data-set over again?
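
To make the question concrete, here is a minimal sketch of that kind of split, assuming 50 topics identified by the integers 1 to 50 and a purely random assignment. The function name, the IDs, and the fixed seed are hypothetical and not taken from any actual annotation procedure.

```python
import random

def split_topics(topic_ids, seed=0):
    """Randomly divide topic IDs into two equal-sized, disjoint halves."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(topic_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])

# Hypothetical topic IDs 1..50: 25 for devtest, 25 for eval.
devtest_topics, eval_topics = split_topics(range(1, 51))
print(len(devtest_topics), len(eval_topics))
```

Both halves would still be judged against the same story collection, which is the concern raised above.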

I suppose if there are enough pairs of comparable data sources, the
overall data-set could be split by source into comparable halves
along with the topic set (after all the annotation is done), and
there's a reasonable chance that relative quantities could be kept
comparable after the split, as well (maybe). Halving the collection
this way would yield less material to test on, unless the length of
the collection period is extended. This might be a topic for discussion
in planning the next eval.

Dave Graff

Last updated Wed Feb 3 10:44:21 1999