(164) previous ~ index ~ next
To: Ralf Brown <Ralf_Brown@v.gp.cs.cmu.edu>
From: Mark Liberman <myl@unagi.cis.upenn.edu>
Subject: Re: more missed labels in devset
Date: Wed, 09 Sep 1998 09:13:22 EDT
In evaluating "miss rates," everyone should recall the earlier estimates
of the "kappa" statistic, measuring inter-annotator agreement, from
experiments with double tagging.
You'll recall that kappa is the ratio of the observed frequency of
agreement in excess of chance (observed agreement minus the agreement
expected by chance) to one minus the frequency of agreement expected by
chance, and that our observed kappas are roughly .9. According to the
literature, this is regarded as a rather good level of intersubjective
agreement.
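
For reference, here is a minimal sketch of how kappa is usually computed
for two annotators, assuming the standard Cohen formulation
(p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the
agreement expected by chance from each annotator's label frequencies. The
function name and the toy labels are made up for illustration; they are not
the project's actual scripts or data.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    labels_a, labels_b: equal-length lists of category labels
    (e.g. 1 = on-topic, 0 = off-topic). Hypothetical helper.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two
    # annotators' marginal frequencies, summed over categories.
    categories = set(labels_a) | set(labels_b)
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )

    if p_e == 1.0:
        # Both annotators used a single identical label throughout;
        # kappa is undefined, so report perfect agreement by convention.
        return 1.0

    return (p_o - p_e) / (1.0 - p_e)


# Toy example (invented data): ten stories, one disagreement.
annotator_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Note that when on-topic stories are rare, a single disagreement moves kappa
much more than it moves raw percent agreement, which is why kappa rather
than raw agreement is the figure quoted here.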
The implication of a kappa of about this level is that about five
percent of the stories that a given person tags as relevant to a given
topic will not be tagged the same way by someone else. Some of these
would be clear mistakes, while others would represent differences of
judgment. In any case, the present process is not going to produce
average miss rates much lower than that. Depending on the proportion
of cases that would be regarded on adjudication as clear mistakes, a
screening of candidates from many search engines might produce lower
rates.
The apparent miss rate cited by Ralf is obviously higher than this would
lead us to expect. Logically, there are several possible reasons for this:
* Ralf understands (some of) the topics differently than the annotators did
* his thresholds are somewhat lower than the annotators' thresholds
* the annotators made more mistakes on this material than our spot checks
on consistency have led us to believe that they should
* the method of checking inter-annotator consistency is underestimating
the miss rate
We'll take a careful look at the additional suggested stories, to try
to figure out which of these is the case.
From the particulars cited earlier, it looks like there are a few
examples that result from simple mis-tagging (e.g. confusions between
1 and 2 or 1 and 12). We should be able to do things to make such
errors less frequent. It would not surprise me to find some differences
of interpretation of topic definitions, as well.
However, if very low miss rates are needed, then there is no choice
but to revise the process, to expand the stage of checking candidates
from search engines, which we are now doing only for one search engine.
Independently, we've put into place a formal bug reporting/tracking
mechanism (to be announced soon), which should make it easier to keep
careful track of how many of what kind of bugs are found, and what
their ultimate disposition is.
-Mark
Last updated Fri Sep 11 13:52:53 1998