To: George Doddington <doddington@nist.gov>
From: Rich Schwartz <schwartz@bbn.com>
Subject: Re: Comments on the TDT3 performance metric
Date: Fri, 23 Jul 1999 14:36:26 -0400 (EDT)
George,
I agree with most of what you say.
1. The cost model tries to model an imagined application rather than
a technology. This somewhat contradicts your stated general goal of
measuring performance in a way that is oriented more toward the
technology. But it's OK.
2. Given realistic costs for miss and FA, it is entirely possible that a
system that does much better than random still cannot improve on the
cost of 'just say no'.
NOTE that this does NOT mean that we should change the cost function to
make such systems appear to have negative (i.e., good) cost.
3. I still think, ultimately, that the cost function should be chosen
based on some guess about what INCREMENTAL precision is tolerable. I
guess this would be a new concept. The incremental precision is the
partial derivative of spurious documents found for each missed document
recovered. (Actually, this isn't 'precision', but incremental
'extra'.) This can be related directly to how you optimize the cost
function.
4. The residual entropy is still a useful concept if you would like to
know (for scientific curiosity) how much information an algorithm is
producing. However, having a second measure is fundamentally bad,
since it would require a different optimization -- and perhaps even
a different algorithm. So never mind :-).
--Rich
====================================================================
On Thu, 22 Jul 1999, George Doddington wrote:
> Date: Thu, 22 Jul 1999 18:52:54 -0400
> From: George Doddington <doddington@nist.gov>
> To: TDT distribution <tdt-distrib@ldc.upenn.edu>
> Subject: Comments on the TDT3 performance metric
>
> In his email note of 9 July, Rich Schwartz criticized the cost model
> as a means of measuring the performance of TDT systems and offered
> some alternative measures. This is a response to his note.
>
> Rich offered five possible alternatives:
> 1) Recall/Precision
> 2) Missing/Extra
> 3) F-measure
> 4) Geometric mean of P(miss) and P(fa)
> 5) Residual entropy
> Rich expressed a preference for the second of these measures.
>
> The performance metric has been such a significant topic of discussion
> and concern that I would like to respond and try to explain why we are
> using a cost model as the basis for measuring TDT performance.
>
> The cost model: Assume a real operational system. We don't know much
> about this system. It probably has a human in the loop somewhere, but
> perhaps not. In any case, assume an incoming stream of objects. The
> system is supposed to classify each of these objects (e.g., "stories")
> as to whether it belongs to a target class (e.g., is "on topic"). If
> the system performs perfectly, then we get full value from the system.
> If, however, it misses a target object, then that is a loss of value
> and we assign a cost, Cmiss, to each such missed target object. And
> if the system falsely declares a non-target object to be a target,
> then the system loses efficiency by wasting time dealing with this
> object that has no value. This also represents a loss of value, and
> we assign a cost, Cfa, to each such falsely detected target object.
>
> This simple cost model, while rather crude, serves reasonably well
> in representing the general needs of most TDT applications. As in
> most document retrieval applications, the a priori probability of
> a target object is usually very small (much closer to zero than to
> one). And typically the cost of a false alarm is much less than
> the cost of a miss (because false alarms are usually disposed of
> without much effort).
>
> At the last workshop, concern was voiced about systems that clearly
> were "working" yet that scored worse than never detecting a target
> ("just say no"). This is attributable to the cost model parameters.
> Namely, for the putative application, Ptarget = 0.02 and Cmiss = Cfa.
> This means that in order for a system to provide value it must be
> capable of a detection rate more than 49 times its false alarm rate.
> This comes directly from the formula Cdet = 0.02*Pmiss + 0.98*Pfa and
> the "just say no" cost, Cjsn = 0.02: 0.02 = 0.02*Pmiss + 0.98*Pfa.
> Thus, for Cdet = Cjsn: (1 - Pmiss) = 49*Pfa.
>
> If a system can't do better than this, then the lowest cost strategy
> will be to ignore it and "just say no". This does put a floor on
> the minimum acceptable performance of a system, and not all systems
> that are capable of "better-than-random" performance will be able to
> do better than the "just say no" strategy.
>
> Now let me say a few words about the performance measures that Rich
> offers. The first two, namely Recall/Precision and Missing/Extra,
> are not really alternatives to the cost measure. Rather, they are
> alternative ways of expressing miss and false alarm probabilities.
> And both Precision and Extra suffer from being influenced by the
> richness of the corpus and thus being less indicative of system
> performance alone. This is why Miss and False Alarm probabilities
> are preferable measures for TDT system R&D. (I sent out a rather
> compelling 2-page brief on this subject in a TDT email on February
> 9 of this year. If you don't have it, you may retrieve it from
> LDC: http://www.ldc.upenn.edu/Projects/TDT3/email/index.html)
>
> The last three measures, namely F-measure, the geometric mean of
> P(miss) and P(fa), and residual entropy are reasonable performance
> measures that might be considered. In general, however, they all
> share the inability to control the relative importance of misses
> and false alarms, according to the needs of the applications. In
> addition, here are some other reasons that they haven't been
> chosen:
> 3) F-measure: This measure also is influenced by corpus richness,
> since it is a function of precision. Thus it is indicative of
> performance only when the corpus richness of the application
> matches that of the test.
> 4) Geometric mean of P(miss) and P(fa): This is a reasonable
> measure for gauging system performance. Because of the lack
> of control over miss/fa balance, however, it may happen that
> performance is "best" in a region of no application interest.
> 5) Residual entropy: While this is good from the point of view
> of information theory, it is even further removed from the
> ability of most to appreciate its meaning with respect to
> applications.
> --
> George Doddington at NIST: doddington@nist.gov or 301/975-3261
>
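
The "just say no" arithmetic and the corpus-richness argument in the
quoted note can be checked with a short sketch. It assumes the cost has
the form Cdet = Cmiss*Pmiss*Ptarget + Cfa*Pfa*(1 - Ptarget), with
Cmiss = Cfa = 1 and Ptarget = 0.02. That form, the function names, and
the example operating point (Pmiss = 0.5, Pfa = 0.02) are assumptions
made here for illustration and do not come from either message.

```python
def detection_cost(p_miss, p_fa, p_target=0.02, c_miss=1.0, c_fa=1.0):
    """Expected cost per story under the assumed cost model."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def just_say_no_cost(p_target=0.02, c_miss=1.0):
    """Cost of never declaring a target: Pmiss = 1, Pfa = 0."""
    return detection_cost(1.0, 0.0, p_target=p_target, c_miss=c_miss)


def precision(p_miss, p_fa, p_target):
    """Fraction of declared targets that really are targets."""
    hits = p_target * (1.0 - p_miss)
    false_alarms = (1.0 - p_target) * p_fa
    return hits / (hits + false_alarms)


# Hypothetical system: detects half the targets with a 2% false alarm
# rate. A random system would have detection rate equal to Pfa, so this
# one is far better than random.
p_miss, p_fa = 0.5, 0.02
system_cost = detection_cost(p_miss, p_fa)
jsn_cost = just_say_no_cost()

# Under the assumed model, break-even with "just say no" occurs where
# (1 - Pmiss) = ((1 - Ptarget) / Ptarget) * Pfa, i.e. 49 * Pfa here.
break_even_p_fa = (1.0 - p_miss) / 49.0
break_even_cost = detection_cost(p_miss, break_even_p_fa)

# Same Pmiss and Pfa, two corpus richness values: only precision moves.
precision_sparse = precision(p_miss, p_fa, p_target=0.02)
precision_rich = precision(p_miss, p_fa, p_target=0.20)

print(f"system cost      = {system_cost:.4f}")
print(f"just-say-no cost = {jsn_cost:.4f}")
print(f"break-even cost  = {break_even_cost:.4f}")
print(f"precision at Ptarget=0.02: {precision_sparse:.3f}")
print(f"precision at Ptarget=0.20: {precision_rich:.3f}")
```

Under these assumptions the example system costs more than "just say
no" even though it is much better than random, and its precision
changes with Ptarget while Pmiss and Pfa stay fixed.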
Last updated Thu Aug 19 16:14:47 1999