To: Jonathan Fiscus <firstname.lastname@example.org>
From: Rich Schwartz <email@example.com>
Subject: Re: Example Evaluation Mismatch
Date: Thu, 3 Sep 1998 10:09:05 -0400 (EDT)
I'm really concerned about what we've got here. At the May
meeting (5/7,8) we agreed that we all wanted to start using the new
release of the data (the Mar/Apr dev test) and the new scoring
software. There were several false starts: problems were found with
the corpus, and the software was not released in final form until a
few days before the 7/30 meeting.
At that meeting we made a few changes to some constants, but
we all agreed that this was OK, because the software was now really
ready and we could use it for the remaining month leading up to the
9/9 dry run evaluation. It was clear that the sole purpose of the
dry run was to debug the corpus and the evaluation procedure.
Well, there have been a large number of messages about problems
found and solved, so the procedure and the corpus are getting worked
out. That's great. But each problem that is found (with the corpus,
the lists, and the postprocessing) raises the question of how many
more problems remain. Is anyone worried that a large percentage of
the topics have a serious problem? Are these checks and tests being
gathered by NIST into a suite of tests to be run on the new corpus?
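To be concrete, I imagine a suite containing mechanical checks
like the following sketch (the data layout and the rules here are
only illustrative, not an actual NIST tool):

    # Illustrative corpus sanity checks; field names and rules are
    # hypothetical, not NIST's actual suite.
    def check_corpus(story_ids, judgments):
        """story_ids: list of story IDs in the corpus.
        judgments: {topic: {story_id: 'YES' or 'NO'}}."""
        errors = []
        ids = set(story_ids)
        if len(ids) != len(story_ids):
            errors.append("duplicate story IDs in corpus")
        for topic, labels in judgments.items():
            for story_id in labels:
                if story_id not in ids:
                    errors.append("topic %s judges missing story %s"
                                  % (topic, story_id))
            if "YES" not in labels.values():
                errors.append("topic %s has no YES stories" % topic)
        return errors

The idea would be that every new release has to pass the accumulated
suite before anyone trains on it.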
Moreover, for the "real" evaluation, the corpus will not have the
benefit of all of us groveling over it for 3 months in order to find
serious problems. My assumption would have to be that the evaluation
corpus will be in the same condition as the previous corpora were when
they arrived. Some of the mechanical problems will have been solved,
but there will surely be many new problems.
I'm not trying to lay any blame. I have been impressed by the
great lengths to which LDC and NIST have gone to try to make things
right. But the fact is that they are not right. I'm not sure what to
do about this, but it seems to me that we all need to put our heads
together to find a way to make things more correct. I envision a
collection of quality checks.
For example, we know that a number of the topic judgements are
wrong. It's a small percentage, well within what you could expect of
human annotators. But at the same time, it's almost comparable to the
error rates our systems produce. (This is possible because the number
of YES stories is very small compared to the total for any topic, so
a 99.9% accuracy on the corpus may mean a 25% error on the topics.)
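To make that arithmetic concrete (the numbers here are purely
illustrative, not actual corpus counts):

    # Illustrative numbers only -- not actual corpus counts.
    judgments_per_topic = 10000   # one YES/NO judgment per story
    error_rate = 0.001            # i.e., 99.9% annotation accuracy
    wrong = judgments_per_topic * error_rate    # 10 wrong judgments
    true_yes = 40                 # YES stories are rare for any topic
    # If most of those errors are misses on this topic's YES stories:
    topic_error = wrong / true_yes              # 0.25, i.e., 25%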
One suggestion I made at the last meeting was that LDC should
use the TRACKING systems to help verify the data. Most of the
annotation errors are misses.
I would suggest training the tracking systems on ALL of the YESses
for a topic (including the test ones) and then scoring the entire
corpus. Then examine any NO stories with a high score and any YES
stories with a low score. I claim that this will find a large number
of errors. Because the systems will have been trained on many more
positive examples, and because those examples span the whole time
interval, the accuracy should be very high. Of course, to be fair to
all sites, one must use all of the systems; that will also increase
the accuracy of the test further.
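In rough code, the verification pass I have in mind looks like
this (the score interface and the thresholds are placeholders for
whatever the sites' trackers actually produce):

    # Sketch only: assumes each site's tracker, trained on ALL YES
    # stories for the topic, emits a score for every story.
    def suspect_stories(annotations, all_system_scores, hi=0.9, lo=0.1):
        """annotations: {story_id: 'YES' or 'NO'} for one topic.
        all_system_scores: one {story_id: score} dict per system."""
        suspects = set()
        for scores in all_system_scores:       # union over all systems
            for story_id, score in scores.items():
                if annotations[story_id] == 'NO' and score >= hi:
                    suspects.add(story_id)     # likely annotation miss
                elif annotations[story_id] == 'YES' and score <= lo:
                    suspects.add(story_id)     # likely spurious YES
        return suspects                # stories for LDC to re-judge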
I claim this kind of test needs to be done ahead of time, in
batch, rather than after the data is released, on a story-by-story
basis or in a very painful adjudication process. It doesn't make
sense for each site to paw over all of the errors it made and read
the stories to decide whether to complain. This can be done all at
once, on the union of high-scoring false alarms and low-scoring
misses, by LDC, which is much more experienced at it.
This is only one suggestion; we need many more. I feel that if
the evaluation is to mean anything at all, we need to do these
things. Perhaps this should be the primary subject of discussion at
the next meeting.