(147) previous ~ index ~ next
To: TDT distribution <tdt-distrib@ldc.upenn.edu>
From: George Doddington <doddington@nist.gov>
Subject: Comments on the TDT3 performance metric
Date: Thu, 22 Jul 1999 18:52:54 -0400
In his email note of 9 July, Rich Schwartz criticized the cost model
as a means of measuring the performance of TDT systems and offered
some alternative measures. This is a response to his note.
Rich offered five possible alternatives:
1) Recall/Precision
2) Missing/Extra
3) F-measure
4) Geometric mean of P(miss) and P(fa)
5) Residual entropy
Rich expressed a preference for the second of these measures.
The performance metric has been such a significant topic of discussion
and concern that I would like to respond and try to explain why we are
using a cost model as the basis for measuring TDT performance.
The cost model: Assume a real operational system. We don't know much
about this system. It probably has a human in the loop somewhere, but
perhaps not. In any case, assume an incoming stream of objects. The
system is supposed to classify each of these objects (e.g., "stories")
as to whether it belongs to a target class (e.g., is "on topic"). If
the system performs perfectly, then we get full value from the system.
If, however, it misses a target object, then that is a loss of value
and we assign a cost, Cmiss, to each such missed target object. And
if the system falsely declares a non-target object to be a target,
then the system loses efficiency by wasting time dealing with an
object that has no value. This too is a loss of value, and we
assign a cost, Cfa, to each such falsely detected object.
This simple cost model, while rather crude, serves reasonably well
in representing the general needs of most TDT applications. As in
most document retrieval applications, the a priori probability of
a target object is usually very small (much closer to zero than to
one). And typically the cost of a false alarm is much less than
the cost of a miss (because false alarms are usually disposed of
without much effort).
At the last workshop, concern was voiced about systems that clearly
were "working" yet that scored worse than never detecting a target
("just say no"). This is attributable to the cost model parameters.
Namely, for the putative application, Ptarget = 0.02 and Cmiss = Cfa.
This means that in order for a system to provide value, its detection
rate (1 - Pmiss) must be more than 49 times its false alarm rate.
This comes directly from the formula Cdet = 0.02*Pmiss + 0.98*Pfa and
the "just say no" cost (Pmiss = 1, Pfa = 0), Cjsn = 0.02.
Thus, for Cdet = Cjsn: 0.02 = 0.02*Pmiss + 0.98*Pfa, i.e.,
(1 - Pmiss) = 49*Pfa.
If a system can't do better than this, then the lowest cost strategy
will be to ignore it and "just say no". This does put a floor on
the minimum acceptable performance of a system, and not all systems
that are capable of "better-than-random" performance will be able to
do better than the "just say no" strategy.
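
As a rough illustration of the "just say no" floor, here is a minimal
Python sketch. It assumes the cost takes the form
Cdet = Cmiss*Ptarget*Pmiss + Cfa*(1 - Ptarget)*Pfa, with Ptarget = 0.02
and Cmiss = Cfa = 1 as defaults. The function names and the example
operating points are hypothetical, chosen only for illustration.

```python
def detection_cost(p_miss, p_fa, p_target=0.02, c_miss=1.0, c_fa=1.0):
    """Expected cost per story (assumed form of the cost model):
    misses are weighted by the target prior, false alarms by the
    non-target prior."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


def just_say_no_cost(p_target=0.02, c_miss=1.0):
    """Cost of never declaring a target: every target is missed
    (Pmiss = 1) and there are no false alarms (Pfa = 0)."""
    return detection_cost(1.0, 0.0, p_target=p_target, c_miss=c_miss)


def beats_just_say_no(p_miss, p_fa, p_target=0.02, c_miss=1.0, c_fa=1.0):
    """True if the system costs less than ignoring it altogether."""
    cost = detection_cost(p_miss, p_fa, p_target, c_miss, c_fa)
    return cost < just_say_no_cost(p_target, c_miss)


if __name__ == "__main__":
    # Two hypothetical systems, both far better than random.
    for p_miss, p_fa in [(0.2, 0.01), (0.2, 0.02)]:
        print(p_miss, p_fa,
              detection_cost(p_miss, p_fa),
              beats_just_say_no(p_miss, p_fa))
```

Under these assumptions the first system (Pmiss = 0.2, Pfa = 0.01)
costs about 0.0138 and provides value, while the second (Pmiss = 0.2,
Pfa = 0.02) costs about 0.0236 and loses to "just say no", even though
it detects 80% of targets and falsely accepts only 2% of non-targets.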
Now let me say a few words about the performance measures that Rich
offers. The first two, namely Recall/Precision and Missing/Extra,
are not really alternatives to the cost measure. Rather, they are
alternative ways of expressing miss and false alarm probabilities.
And both Precision and Extra suffer from being influenced by the
richness of the corpus and thus being less indicative of system
performance alone. This is why Miss and False Alarm probabilities
are preferable measures for TDT system R&D. (I sent out a rather
compelling 2-page brief on this subject in a TDT email on February
9 of this year. If you don't have it, you may retrieve it from
LDC: http://www.ldc.upenn.edu/Projects/TDT3/email/index.html)
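
The dependence of Precision on corpus richness can be seen in a few
lines of Python. This is only a sketch: the function name and the
operating point (Pmiss = 0.2, Pfa = 0.01) are hypothetical, and
Precision is taken in its usual sense of the fraction of declared
targets that really are targets.

```python
def precision(p_miss, p_fa, p_target):
    """Precision implied by fixed miss and false alarm probabilities
    on a corpus in which a fraction p_target of the stories are
    targets."""
    true_pos = (1.0 - p_miss) * p_target   # targets correctly detected
    false_pos = p_fa * (1.0 - p_target)    # non-targets falsely detected
    declared = true_pos + false_pos
    if declared == 0.0:
        raise ValueError("system declares no targets; precision undefined")
    return true_pos / declared


if __name__ == "__main__":
    # Same hypothetical system, evaluated on corpora of differing richness.
    for p_target in (0.002, 0.02, 0.2):
        print(p_target, precision(0.2, 0.01, p_target))
```

The system itself is unchanged across the three corpora (Pmiss and
Pfa are fixed), yet its Precision rises steadily as the corpus gets
richer in targets.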
The last three measures, namely F-measure, the geometric mean of
P(miss) and P(fa), and residual entropy, are reasonable performance
measures that might be considered. In general, however, none of
them lets us set the relative importance of misses and false
alarms according to the needs of the application. In
addition, here are some other reasons that they haven't been
chosen:
3) F-measure: This measure also is influenced by corpus richness,
since it is a function of precision. Thus it is indicative of
performance only when the corpus richness of the application
matches that of the test.
4) Geometric mean of P(miss) and P(fa): This is a reasonable
measure for gauging system performance. Because of the lack
of control over miss/fa balance, however, it may happen that
performance is "best" in a region of no application interest.
5) Residual entropy: While this is good from the point of view
of information theory, most people find it even harder to
relate its value to the needs of an application.
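
To make the point about miss/fa balance concrete, the following
Python sketch compares two hypothetical operating points that have
the same geometric mean of P(miss) and P(fa) but very different
costs. It assumes the cost takes the form
Cdet = Cmiss*Ptarget*Pmiss + Cfa*(1 - Ptarget)*Pfa with Ptarget = 0.02
and Cmiss = Cfa = 1. The function names and the numbers are for
illustration only.

```python
import math


def geometric_mean_error(p_miss, p_fa):
    """Geometric mean of the miss and false alarm probabilities."""
    return math.sqrt(p_miss * p_fa)


def detection_cost(p_miss, p_fa, p_target=0.02, c_miss=1.0, c_fa=1.0):
    """Assumed form of the cost model: misses weighted by the target
    prior, false alarms by the non-target prior."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


if __name__ == "__main__":
    # Two hypothetical operating points with equal geometric mean.
    for p_miss, p_fa in [(0.4, 0.001), (0.004, 0.1)]:
        print(p_miss, p_fa,
              geometric_mean_error(p_miss, p_fa),
              detection_cost(p_miss, p_fa))
```

Both points have a geometric mean of about 0.02, so that measure
cannot tell them apart. Under the assumed cost parameters the first
costs about 0.009, while the second costs about 0.098, which is
worse than the "just say no" cost of 0.02.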
--
George Doddington at NIST: doddington@nist.gov or 301/975-3261
Last updated Fri Jul 23 10:52:01 1999