(169) previous ~ index ~ next
To: Jaime Carbonell <jgc@NL.CS.CMU.EDU>
From: Mark Liberman <myl@unagi.cis.upenn.edu>
Subject: Re: more missed labels in devset
Date: Wed, 09 Sep 1998 14:09:33 EDT
>Mark,
> Thanks for your analytical message. The Kappa ratio, of course,
>measures inter-rater consistency, and not systematic misses (by all
>raters). One would hope the two would be correlated, but it is not
>necessarily so.
To clarify: inter-rater agreement is at best a very crude estimator of
miss rates, but it should be related in some way, unless all
annotators are doing the same wrong thing in lock step. Under various
assumptions, it would bound miss rates in various ways. Not to belabor
the point, but if each annotator were (for whatever reason) missing
(say) a random 20% of the truly topic-related stories, then it would
not be possible to have a kappa as high as .9. Thus the apparent
inconsistency between Ralf's observations and our tests.
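As a rough check of that claim, here is a minimal sketch under assumptions that go beyond what is stated above: exactly two annotators, misses that are independent between them, no false alarms, and Cohen's kappa as the agreement measure. The function name and the prevalence values are illustrative only.

```python
def kappa_with_random_misses(prevalence, miss_rate):
    """Cohen's kappa for two annotators who each independently miss a random
    fraction of the truly on-topic stories and never raise false alarms.
    (Assumed model, for illustration only.)"""
    p, m = prevalence, miss_rate
    both_yes = p * (1 - m) ** 2
    one_yes = 2 * p * m * (1 - m)      # exactly one annotator says "yes"
    both_no = 1 - both_yes - one_yes
    observed = both_yes + both_no      # observed agreement
    yes_rate = p * (1 - m)             # each annotator's marginal "yes" rate
    expected = yes_rate ** 2 + (1 - yes_rate) ** 2  # chance agreement
    return (observed - expected) / (1 - expected)


# Hypothetical on-topic prevalences; the miss rate is the 20% from the text.
for prevalence in (0.001, 0.01, 0.1):
    print(prevalence, round(kappa_with_random_misses(prevalence, 0.20), 3))
```

Under this model kappa simplifies to 1 - m / (1 - p(1 - m)), which can never exceed 1 - m. With a 20% miss rate it therefore stays at or below .8 whatever the topic prevalence.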
> Therefore, it would be useful to develop another
>measure to check for systematic misses.
I absolutely agree.
>[Aside: This was the side
>discussion that Rich Schwartz and I were having last meeting, and
>the reason for Rich's suggestion that the trackers be used to catch
>misses, which is exactly what Ralf did.]
This was also suggested (perhaps also by Rich?) during the first
planning meetings, late last year or early this year. As I recall,
George vetoed the idea on the grounds that it would tend to warp the
notion of "topic" in the direction of whatever the current search
technologies tend to do. However, I have continued to suspect that
there is no other economical way to get miss rates down.
> I think that it would be useful to search for misses with another
>tracker or two, and then investigate the causes (your hypotheses are
>plausible), leading to an improved labeling on the TDT2 corpus.
>This is not a high cost operation, as only a small number of stories
>(high-scoring alleged false alarms) need be re-read and perhaps
>relabeled manually, if they indeed prove to be systematic misses.
We are already doing a QC pass down the N most related stories according
to Mike Schultz's tracker. There would be no extra expense at all if we just
changed the ranking to be some sort of amalgam of several trackers. We could
also increase N with modest impacts on overall budgets.
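One possible form of such an amalgam, sketched under assumptions: each tracker supplies a score per story (higher meaning more on-topic), and the trackers are combined by summed rank. Summed rank is just one simple choice, not a method anyone has settled on, and the function and story names are hypothetical.

```python
def top_n_by_combined_rank(tracker_scores, n):
    """Return the n stories with the best summed rank across trackers.

    tracker_scores: list of dicts mapping story id -> score, where a higher
    score means the tracker considers the story more on-topic (assumed).
    """
    stories = sorted(set().union(*tracker_scores))
    total_rank = {story: 0 for story in stories}
    for scores in tracker_scores:
        ranked = sorted(scores, key=scores.get, reverse=True)
        position = {story: i for i, story in enumerate(ranked)}
        for story in stories:
            # A story a tracker did not score gets that tracker's worst rank.
            total_rank[story] += position.get(story, len(stories))
    # Lower summed rank is better; ties fall back to story-id order.
    return sorted(stories, key=lambda story: total_rank[story])[:n]


# Made-up scores from two hypothetical trackers.
tracker_a = {"s1": 0.9, "s2": 0.4, "s3": 0.1}
tracker_b = {"s1": 0.2, "s2": 0.8, "s3": 0.7}
qc_list = top_n_by_combined_rank([tracker_a, tracker_b], n=2)
```

Ranks are used rather than raw scores because scores from different trackers are not necessarily on a comparable scale.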
-Mark
Last updated Fri Sep 11 13:52:53 1998