(243) previous ~ index ~ next

To: tdt-distrib@ldc.upenn.edu
From: Jon_Yamron@Dragonsys.com
Subject: Re: Evaluation plan
Date: Mon, 12 Jun 2000 12:06:00 -0400

I agree with George that, as far as the 60 topics from the '99 Evaluation are
concerned, the cat is substantially out of the bag, and the sensible thing is to
allow their use for research. As for the effect of this on the evaluation, I
believe it means that our systems will produce values of C_det, C_track, etc.
that are optimistically well-tuned. Under these conditions, we will still be
able to compare relative performance across systems, unless sites differ
radically in how much information about the corpus they glean from "tuning".
I've already suggested that we NOT allow sites to train models from the
Oct-Dec data. In Dragon's case, this leaves only a small number of parameters
with which to characterize the corpus.

If one believes that the "optimistic" performance figures we get by using the 60
Oct-Dec topics are substantially better than what could be achieved otherwise, I
would regard that as proof that our development set is too small to capture a
reasonable range of behavior. In that case, the evaluation results are already
questionable, and we should surely use the 60 Oct-Dec topics in order to do
useful work.

One point that Paul van Mulbregt has raised is that for those sites doing
segmentation, the Jan-June data is probably sufficient for training. Dragon has
no objection to making the Oct-Dec corpus completely off limits to those doing
segmentation.

In any case, we need a decision of some kind soon.

- Jon





doddingt@nist.gov on 06/07/2000 05:54:33 AM

Please respond to doddingt@nist.gov

To: allan@cs.umass.edu
cc: tdt-distrib@ldc.upenn.edu (bcc: Jon Yamron/Dragon Systems USA)
Subject: Re: Evaluation plan



On Tue, 06 Jun 2000 18:35:35 -0400
James Allan wrote:

> Regarding Wessel's comments....
>
> Note that if you use the existing TDT-3 topics for your training, you
> are learning how to put them into clusters. Anything you do to
> improve their clustering might also improve your clustering of stories
> in the as-yet-untagged 60 new topics in the same corpus. Of course,
> people will try not to fall into a trap like that, but it's possible.

Huh? People will try not to fall into a trap like what? I don't know
how you would avoid it. And if there were some possible way, I seriously
doubt that people would try to avoid it. It seems very clear to me that
having 60 (old) topics identified in the test corpus will definitely bias
results in a favorable way, more so for topic detection and first story
detection than for topic tracking or link detection. Therefore the formal
evaluation can be considered only suggestive -- it will not give
reliable estimates of absolute performance. This is not to say that I
disagree with using TDT3 for experimentation and development. I do agree
that it would be more valuable to use it for R&D than to try to exclude it
(which, by the way, is not really possible in any case, since we have
already tested on it).








Last updated Mon Jun 12 13:26:39 2000