To: James Allan <firstname.lastname@example.org>
From: "Charles L. Wayne" <email@example.com>
Subject: Re: Evaluation plan
Date: Thu, 15 Jun 2000 16:08:43 -0400 (EDT)
I have the following comments:
To clarify terminology: "TDT2" refers to the Jan-Jun data and "TDT3" to
the Oct-Dec data.
I believe the right approach is the one NIST announced, viz.,
reserving all of the TDT3 data for the next round of tests. This is also
what James said on 1 March at the end of the TDT Workshop.
I believe this is the right approach both because it will give us a
reasonable way to measure progress from TDT1999 to TDT2000 and because it
will let us compare performance on somewhat different topic sets: The
existing 60 topics were chosen to have at least 4 stories in each
language; the new 60 topics will consist of 30 chosen from the English
portion and 30 from the Chinese portion, with no requirement that those
topics have any stories in the other language. (Annotators and
algorithms will, however, search for on-topic stories in both languages.)
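
For concreteness, the two selection rules can be stated in a few lines
of code. The sketch below is illustrative only; the function names and
language labels are mine, not the LDC's, and it assumes per-topic
on-topic story counts are available:

    def eligible_tdt1999_topic(story_counts):
        # Existing 60 topics: at least 4 on-topic stories in EACH
        # language (story_counts maps a language name to a count).
        return all(story_counts.get(lang, 0) >= 4
                   for lang in ("english", "chinese"))

    def select_tdt2000_topics(english_candidates, chinese_candidates):
        # New 60 topics: 30 drawn from the English portion and 30 from
        # the Chinese portion, with no cross-language requirement.
        # (Annotators still search both languages for on-topic stories.)
        return english_candidates[:30] + chinese_candidates[:30]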
While it would be very desirable to have more training and development
test data, I don't think this should come at the expense of evaluation
test data, and I would be very nervous about the (unassessable) effects of
using any TDT3 data for any kind of algorithm development or training.
As James pointed out, the LDC annotated a fair amount of additional
material in the TDT2 corpus (to support the 1999 JHU workshop), and sites
could use these annotations. TDT researchers are very creative people;
I hope some will be able to locate additional data or devise clever ways
to use other non-TDT3 data.
If someone wishes to raise the issue of mining the TDT3 corpus for
training data, that could be done at the TDT2000 dry run meeting scheduled
for 7 August at Penn. For now, I believe the decision to leave TDT3 alone
should stand. (The purpose of the dry run is to debug the evaluation and
to try out new ideas, not to get optimal results.)
On Tue, 6 Jun 2000, James Allan wrote:
> Anyone else have any comments on Jon's concerns about whether or not
> people can look at the evaluation data in any way at all? At the
> moment, the only comment has been from Rich who agrees with Jon.
> Anyone else?
> -- james
> An extract from Jon's message:
> > (I assume we are talking about topics annotated on the
> > October-December data.) Under this plan, the October-December data
> > is off-limits for any use, which is PROPER, in the sense that
> > evaluation data should be off limits, but VERY LIMITING, as we are
> > starved for development data as it is. Under this plan, we
> > cannot even look at our Eval-99 output (even looking at the DET
> > curve for Tracking or C_det for Detection is a cheat, something we
> > have all already done)!
> > I thought that what we actually decided was to use the 60 new topics
> > for evaluation, but to allow sites to use the original 60 topics for
> > tuning (or more precisely, as the development test set). Of course,
> > sites would NOT be allowed to incorporate the October-December data
> > into their training (e.g., for building discriminator models).
> > While this is a little bit suspect, because we are tuning on a data
> > set that contains the stories we will evaluate on, it isn't too bad,
> > since the topics we are tuning on are disjoint from the ones we will
> > be evaluated on. This seems like a reasonable accommodation for the
> > research, given the fact that we were unable to acquire a new corpus
> > for this evaluation.
> > In short, if we want to get useful work done this year, I think we
> > need an evaluation plan that allows us to use the 60 topics from
> > Eval-99 for development.
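
For readers without the evaluation plan at hand, the C_det Jon mentions
is the detection cost used to score TDT systems, and a DET curve plots
the miss rate against the false-alarm rate as the decision threshold
varies. The sketch below uses the customary cost parameters
(C_miss = 1.0, C_fa = 0.1, P_target = 0.02); treat those values as an
assumption here and take the official TDT2000 evaluation plan as
authoritative:

    def c_det(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
        # Raw detection cost: weighted combination of the miss and
        # false-alarm probabilities.
        return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

    def c_det_norm(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
        # Normalize by the cheaper of the two trivial systems (say YES
        # to every story, or NO to every story), so a score below 1.0
        # means the system beats both baselines.
        trivial = min(c_miss * p_target, c_fa * (1.0 - p_target))
        return c_det(p_miss, p_fa, c_miss, c_fa, p_target) / trivial

    # Example: a system that misses 20% of on-topic stories with a
    # 1% false-alarm rate.
    print(c_det_norm(p_miss=0.20, p_fa=0.01))   # -> 0.249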