To: Jon_Yamron@dragonsys.com
From: Rich Schwartz <schwartz@bbn.com>
Subject: Re: Evaluation plan
Date: Mon, 12 Jun 2000 12:26:05 -0400 (EDT)

Jon,
It's hard to know how good different sites are at memorizing the
answers (i.e., tuning). But as I understand it, there are additional
topics defined now, isn't that right? In this case, I don't see what
would be wrong with declaring the old 60 topics as dev test, and the new
ones as eval test. It's pretty hard to see how training on the old topics
is somehow cheating on the new ones, even though the corpus is the same.
It would, at best, be a second-order effect.

--Rich
========================================

On Mon, 12 Jun 2000 Jon_Yamron@Dragonsys.com wrote:

> Date: Mon, 12 Jun 2000 12:06:00 -0400
> From: Jon_Yamron@Dragonsys.com
> To: tdt-distrib@ldc.upenn.edu
> Subject: Re: Evaluation plan
>
> I agree with George, that as far as the 60 topics from the 99 Evaluation are
> concerned, the cat is substantially out of the bag, and the sensible thing is to
> allow their use for research. As for the effect of this on the evaluation, I
> believe it means that our systems will produce values of C_det, C_track, etc.
> that are optimistically well-tuned. Under these conditions, we will still be
> able to compare relative performance across systems (unless sites differ
> radically on how much information about the corpus is gleaned from
> "tuning"---I've already made the suggestion that we NOT allow sites to train
> models from the Oct-Dec data. In Dragon's case, this leaves only a small number
> of parameters with which to characterize the corpus.)
>
> If one believes that the "optimistic" performance figures we get by using the 60
> Oct-Dec topics are substantially better than what could be achieved otherwise, I
> would regard that as proof that our development set is too small to capture a
> reasonable range of behavior. In that case, the evaluation results are already
> questionable, and we should surely use the 60 Oct-Dec topics in order to do
> useful work.
>
> One point that Paul van Mulbregt has raised is that for those sites doing
> segmentation, the Jan-June data is probably sufficient for training. Dragon has
> no objection if the Oct-Dec corpus were completely off limits for those doing
> segmentation.
>
> In any case, we need a decision of some kind soon.
>
> - Jon
>
> doddingt@nist.gov on 06/07/2000 05:54:33 AM
>
> Please respond to doddingt@nist.gov
>
> To: allan@cs.umass.edu
> cc: tdt-distrib@ldc.upenn.edu (bcc: Jon Yamron/Dragon Systems USA)
> Subject: Re: Evaluation plan
>
>
>
> On Tue, 06 Jun 2000 18:35:35 -0400
> James Allan wrote:
>
> > Regarding Wessel's comments....
> >
> > Note that if you use the existing TDT-3 topics for your training, you
> > are learning how to put them into clusters. Anything you do to
> > improve their clustering might also improve your clustering of stories
> > in the as-yet-untagged 60 new topics in the same corpus. Of course,
> > people will try not to fall into a trap like that, but it's possible.
>
> Huh? People will try not to fall into a trap like what? I don't know
> how you would avoid it. And if there were some possible way, I seriously
> doubt that people would try to avoid it. It seems very clear to me that
> having 60 (old) topics identified in the test corpus will definitely bias
> results in a favorable way, more so for topic detection and first story
> detection than for topic tracking or link detection. Therefore the formal
> evaluation can only be considered as being suggestive -- it will not give
> reliable estimates of absolute performance. This is not to say that I
> disagree with using TDT3 for experimentation and development. I do agree
> that it would be more valuable to use it for R&D than to try to exclude it
> (which, by the way, is not really possible in any case, since we have
> already tested on it).

Last updated Mon Jun 12 13:26:40 2000