To: George Doddington <firstname.lastname@example.org>
From: Rich Schwartz <email@example.com>
Subject: Re: TDT3 cost function changes
Date: Fri, 9 Jul 1999 19:12:21 -0400 (EDT)
On Tue, 6 Jul 1999, George Doddington wrote:
> Date: Tue, 06 Jul 1999 13:27:00 -0700
> From: George Doddington <firstname.lastname@example.org>
> To: TDT distribution <email@example.com>
> Subject: TDT3 cost function changes
> The TDT evaluation cost functions provide a linear combination of miss
> and false alarm probabilities for the various TDT tasks. These cost
> functions serve to represent the overall system performance in terms
> of a single performance cost, or "score". This score is a function of
> the application and cost parameters, namely the prior probability of
> the target and the costs of miss and false alarm errors. It became
> clear during the recent TDT workshop that it would be desirable to
> normalize the cost so as to represent system performance in a way that
> minimizes the effect of cost parameters. This would provide a stable
> value that would be intuitively meaningful. To this end, I have
> modified the TDT3 evaluation plan to provide such a normalized value.
> The normalization yields a normalized cost of 1.0 for "most favorable
> guessing", i.e., for always guessing either yes or no, whichever gives
> minimum cost.
This is fine. It will at least make each site completely aware when they
have mistuned their system. But it will be a little harder to explain to
outsiders since there is an extra step.
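For concreteness, here is a little Python sketch of how I read that normalization. The cost constants in it are placeholders, not the official TDT3 settings; the point is only the shape of the computation.

def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.02):
    # Unnormalized cost: a linear combination of miss and false-alarm rates,
    # weighted by the error costs and the prior probability of the target.
    # (c_miss, c_fa, p_target here are illustrative values, not the TDT3 ones.)
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.02):
    # Divide by the cost of the most favorable "blind guess": always saying
    # "no" (p_miss=1, p_fa=0) or always saying "yes" (p_miss=0, p_fa=1),
    # whichever is cheaper.  A normalized cost of 1.0 then means
    # "no better than the most favorable guessing".
    best_guess = min(detection_cost(1.0, 0.0, c_miss, c_fa, p_target),
                     detection_cost(0.0, 1.0, c_miss, c_fa, p_target))
    return detection_cost(p_miss, p_fa, c_miss, c_fa, p_target) / best_guess
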
> There was also substantial discussion at the last TDT workshop
> regarding the appropriateness of the cost model for the topic tracking
> and detection tasks. The gist of this discussion suggested an
> imbalance in the relative costs of misses and false alarms. In the
> past, these two costs have been the same. To correct this imbalance,
> I've changed the evaluation plan to reduce the cost of a false alarm
> by a factor of 10.
I think this is MISSING THE POINT entirely.
The fact that some sites are getting scores worse than you can get by just
saying "no" doesn't mean that the COSTS should change. It also doesn't
mean that their systems are worse than doing nothing. What it DOES mean
is that they have simply not gone to the trouble of tuning their systems
to minimize CTRACK or CDET.
I have to say that some of the sites are newer to the rigors (game
playing, if you like) of evaluation. So for example, at BBN, whenever we
turn in any result, we take it for granted that we MUST optimize our
result according to the prescribed evaluation measure. I don't want to
get into a philosophical discussion about whether this is good or not. But
we DO treat it as a 'competition'. (We happen to believe, also, that this
is a good way to do research, since it forces everyone to deal with
certain realities. And if we don't like the measure, then we must come up
with a new one.)
But back to the metric. Yes, it happens that the cost functions we have
been using make it particularly important to keep the false accept rate
low. We can all deal with that, I think. We certainly should NOT change
the cost function just to get better numbers. If someone gets more than
100% error then they should retune their system. We should only change
the cost function if the imagined application (and yes, we MUST imagine
one) demands it.
So what is a reasonable cost function?
When we get 10% p(miss) and 0.3% p(FA), what does this really mean
operationally? For a corpus with 20,000 stories and a topic with 20 true
stories, it means that you found 18/20 of the true stories and you found
60 spurious stories. The IR community would say that
the recall was 18/20 = 90% and
the precision was 18/(18+60) ~ 23%.
This is a good operational measure of the lost information and the human
labor needed to slog through the errors. Now, we (John Makhoul at BBN)
have written some papers arguing that precision should be replaced with a
slightly different measure that doesn't add the returned answers to the
denominator. So we talk about
missing = #missed / #true = 2/20 = 10%
extra = #extra / #true = 60/20 = 300%,
which means you lost 10% of the true documents, and for every true
document you have to look at 3 garbage ones.
We believe these are more useful numbers. But our opinion about that is
beside the point. The question is whether this kind of performance is the
balance you want. I personally don't think we would want to have more
false alarms. If we decrease the cost of false alarms by a factor of 10,
then we are likely to see AT LEAST 3 times more false alarms. So we might get
missing = 1/20 = 5%
extra = 200/20 = 1000%, or precision ~ 9.5%.
(This would be reported as p(miss) = 5% and p(FA) = 1%, which on the
surface sounds good but really is abysmal in terms of helping someone.)
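To make the arithmetic explicit, here is a small sketch that turns raw counts into all four numbers. The counts are just the rough ones from the two operating points above.

def operating_point(n_true, n_missed, n_extra):
    found = n_true - n_missed
    recall = found / n_true                  # IR-style recall
    precision = found / (found + n_extra)    # IR-style precision
    missing = n_missed / n_true              # fraction of true stories lost
    extra = n_extra / n_true                 # garbage stories per true story
    return recall, precision, missing, extra

# 10% p(miss), 0.3% p(FA): 18 of 20 found, roughly 60 spurious stories
print(operating_point(n_true=20, n_missed=2, n_extra=60))    # ~ (0.90, 0.23, 0.10, 3.0)
# 5% p(miss), 1% p(FA): 19 of 20 found, roughly 200 spurious stories
print(operating_point(n_true=20, n_missed=1, n_extra=200))   # ~ (0.95, 0.09, 0.05, 10.0)
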
So I strongly disagree that we should change the cost function, unless you
really believe that this operating point is a better one for some imagined
application.
Note, though, that however you report it, the issue here is not about the
individual probability of error measures, but about the combination of the
two measures into a single number that reflects how well your system is doing.
So what do I propose?
For speech recognition it has always been pretty easy because we can just
add up the errors under the assumption that the cost of errors is the cost
of correcting all of the insertions, substitutions and deletions. It's
impossible to 'cheat' the metric. But here the costs are different and we
have to think harder about it.
Well, there are a few possibilities. But first, each site should be aware
of the boundary conditions and do the work to tune their system to avoid
doing worse than chance according to the metric.
But we should also have a metric that makes it clear how much the system
is doing better than chance. The F1-measure commonly used in information
extraction (harmonic mean of recall and precision) attempts to do this by
making it impossible to cheat. If you always answer "no" or "yes", your
recall or precision go to zero and the harmonic mean is zero. (I don't
happen to like this, but only because of the detail that precision
benefits you in a funny nonlinear way for giving more answers.) So I
wouldn't mind it that much. I do prefer the simpler "missing" and
"extra", each divided by truth, and then combined in much the same
From the view of detection, I have always found that the simple PRODUCT of
the false rejection (p-miss) and false accept (p-fa) is a very good single
number measure that is roughly independent of system tunings, since they
tend to trade off on each other that way. That's why our DET curve
results are approximately straight lines on log-log plots (and almost
straight on normal plots). But then I think you (George) have been trying
to get people to give more than just an ROC curve and also provide a
decision point. The harmonic mean above has the nice property that you
get the most benefit when both measures are about the same.
If you like information theory, then you might want to measure the added
information that the systems provide over doing nothing. So doing nothing
in this case would be always saying 'NO'. There is an entropy associated
with that. We could measure the cross entropy between the system output
and the truth and see how much better it is than always saying "NO".
The only problem with this measure is that no one outside of the program
will understand what we're talking about.
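To sketch what such a measure might look like: this is only my reading of "cross entropy between the system output and the truth", namely the entropy of the truth that is left once you know the decision, so that always saying "NO" gains exactly zero bits.

import math

def entropy(probs):
    # Entropy in bits of a discrete distribution given as a list of probabilities.
    return -sum(p * math.log(p, 2) for p in probs if p > 0.0)

def bits_gained(decisions, truth):
    # 'decisions' and 'truth' are parallel lists of booleans, one per trial.
    # Returns H(truth) - H(truth | decision): the information, in bits per
    # trial, that the system's yes/no decisions add over knowing nothing.
    n = len(truth)
    p_true = sum(truth) / n
    prior = entropy([p_true, 1.0 - p_true])
    remaining = 0.0
    for d in (False, True):
        group = [t for dec, t in zip(decisions, truth) if dec == d]
        if group:
            q = sum(group) / len(group)
            remaining += (len(group) / n) * entropy([q, 1.0 - q])
    return prior - remaining

# A system that always says "NO" (or always "YES") gains exactly 0.0 bits.
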
So I, personally, would vote for measuring
p(miss) = #missed / truth and
p(extra) = #extra / truth.
and then if we want a single-number measure we can either take a weighted
sum or an exponentially weighted geometric mean or harmonic mean.
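Each of those combiners is only a line or two of code. The weights below are placeholders; choosing them is exactly the cost-function argument we are having.

def weighted_sum(missing, extra, w_miss=1.0, w_extra=0.1):
    return w_miss * missing + w_extra * extra

def weighted_geometric_mean(missing, extra, w_miss=0.5, w_extra=0.5):
    # The weights act as exponents on the two error rates.
    return (missing ** w_miss) * (extra ** w_extra)

def harmonic_mean(missing, extra):
    if missing == 0.0 or extra == 0.0:
        return 0.0
    return 2.0 * missing * extra / (missing + extra)
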
P.S. The case where this came up most distinctly was on first story
detection. For tracking and detection, I think everyone COULD tune their
systems to avoid getting a score > 0.02. But for first-story detection,
it is very hard to find the first story and we are likely to never find
more than half of them without dozens of false alarms. Again, this
shouldn't mean that it's not worth looking for them, since James' system
found 30% of them, which is MUCH BETTER than chance which would not find
even one first story with the number of false alarms he had.
(There are the separate questions of whether the scoring is too
restrictive in that it gives no credit for finding the second story, and
also statistically degenerate because the number of test samples is only
equal to the number of stories. But these are separate issues and should be
dealt with separately.)
Again, we need a measure of how much better this is than doing it by
chance. And any weighted sum of miss and false accept CANNOT reflect how
much more information you have provided. To do this you must have an
INFORMATION measure. So a simple measure would be
NUMBER of 1st stories found, divided by
the number of 1st stories you would expect to find by accident if you
had marked the same number of stories at random.
So, for example, if you have 20,000 stories with 20 topics and you mark a
random 1000 of them as 1st stories, you would expect to find about 1 of
them by accident. (This isn't exactly correct, but it's close.)
But if you find 7 of them with only 1000 answers, then you have done 7
times better than chance.
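In code, that chance-normalized measure is just the following (a toy sketch using the numbers from the example):

def times_better_than_chance(n_found, n_answers, n_first, n_stories):
    # Expected accidental hits if you mark n_answers stories at random:
    # roughly n_answers * n_first / n_stories (not exact, but close).
    expected_by_chance = n_answers * n_first / n_stories
    return n_found / expected_by_chance

# 20,000 stories, 20 first stories, 1000 answers, 7 of them correct:
print(times_better_than_chance(n_found=7, n_answers=1000, n_first=20, n_stories=20000))  # 7.0
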
Say we gave a credit of (1/2)^(n-1) for finding the nth story of a topic; then you
would get a few more points for finding some second and third stories.
(By the way, there is no benefit for marking all of the stories, because
even though this would give you a credit of 2 for the topic, you would
have a ton of false alarms.)
OK, this message is plenty long enough!
Speech and Language Dept., BBN Technologies, GTE Corp.
70 Fawcett St., Cambridge, MA 02138
Voice: (617) 873-3360    Fax: (617) 873-2534
E-Mail: Schwartz@bbn.com
> The attached postscript file summarizes the changes and illustrates
> the result of changing the cost parameters with constant-cost curves
> on a DET plot. -- George Doddington in Orinda, CA:
> firstname.lastname@example.org or 925/631-6628