To: Yiming Yang <Yiming_Yang@agra.mt.cs.cmu.edu>
From: Rich Schwartz <schwartz@bbn.com>
Subject: Re: Adjudicated topic_relevance.table
Date: Fri, 15 Jan 1999 14:24:07 -0500 (EST)
Yiming,
I agree that events change over time and that we should track them
as they change. But it's not an issue of tuning to events. It's that if
the DEFINITION of an event changes, then of course a system that was good
at detecting events based on one criterion would not be good based on
another.
I do not believe the news changed radically in character between
the March-April period and the May-June period. There might have been one
or two special events in each of those periods. But over the 25 events,
there should not be a difference if they were selected randomly. However,
if the definitions of the events were made up at two different times, then
there is the possibility (and we see the reality) that the criteria
changed.
So what I'm suggesting would be to collect a large corpus over a
period (say March through June). And then choose 50 events. You would
alternate choosing events from the first two months and then from the
second two months. This way the criteria remain constant. (Actually, the
criteria will drift as you go, but this effect will be divided between the
two data sets.)
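
A minimal sketch of the alternating selection described above. The function name, the pool contents, and the use of a seeded shuffle to stand in for the annotators' choices are all assumptions made for illustration; none of this is existing TDT tooling.

```python
import random


def alternate_event_selection(first_pool, second_pool, n_events=50, seed=0):
    """Choose events alternately from two time periods in a single pass.

    Returns a list of (selection_index, period, event) tuples. The
    selection index records when each event was chosen, so any drift in
    the selection criteria is spread across both periods.
    """
    rng = random.Random(seed)
    # Hypothetical stand-in for annotator choice: shuffle each pool.
    pools = {"first": list(first_pool), "second": list(second_pool)}
    for pool in pools.values():
        rng.shuffle(pool)

    selection = []
    for i in range(n_events):
        # Even picks come from the first period, odd picks from the second.
        period = "first" if i % 2 == 0 else "second"
        if not pools[period]:
            raise ValueError(f"ran out of candidate events in {period} period")
        selection.append((i, period, pools[period].pop()))
    return selection


# Illustrative candidate pools (made-up identifiers, not real TDT events).
mar_apr = [f"mar_apr_event_{k}" for k in range(40)]
may_jun = [f"may_jun_event_{k}" for k in range(40)]

order = alternate_event_selection(mar_apr, may_jun)
dev_set = [(i, e) for i, period, e in order if period == "first"]
eval_set = [(i, e) for i, period, e in order if period == "second"]

# Average selection index per set: nearly equal, so drift is shared.
dev_mean = sum(i for i, _ in dev_set) / len(dev_set)
eval_mean = sum(i for i, _ in eval_set) / len(eval_set)
```

Because the picks interleave, each set gets 25 events, and their average selection positions differ by only one step. If the first 25 events were defined in one sitting and the other 25 in a later one, the positions would instead differ by 25 steps.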
You are also correct that in the REAL world, even the criteria for
what is an event will change. But if we are to make ANY sense out of our
research we must have some kind of continuity or homogeneity in the
problem we are trying to solve. If the measure of success changes, then
although that might be real, it's very hard for us to learn from it.
So obviously, we can't redo the dev-set and eval-set. What I'm
saying hopefully can be applied to the next corpus. That is, given that I
think the data is all collected, we can alternate between corpora, having
the same annotators work alternately between them so that we have a
representative development set.
--Rich
=-============================================================
On Fri, 15 Jan 1999, Yiming Yang wrote:
> Date: Fri, 15 Jan 1999 14:11:52 -0500
> From: Yiming Yang <Yiming_Yang@agra.mt.cs.cmu.edu>
> To: David Graff <graff@unagi.cis.upenn.edu>
> Cc: Rich Schwartz <schwartz@bbn.com>, tdt-distrib@unagi.cis.upenn.edu
> Subject: Re: Adjudicated topic_relevance.table
>
>
> >> I think this says what we have known (and seen) for years.
> >> If test sets are made at different times, they will be different.
> >> ... The only way to
> >> get two sets of data that are remotely comparable is to create them
> >> at the same time. That way, the random differences will wash out
> >> (as Ellen Voorhees pointed out) by virtue of having 25 randomly
> >> chosen topics in each set.
>
> >Does this mean that you would propose defining/annotating 50 topics
> >on a given time-period/data-set, using 25 of those topics and that
> >data-set for devtest, and then using the other 25 topics and the same
> >data-set for eval?
>
> I think the dynamically changing nature of events is a task-specific
> property of the TDT problems (segmentation, tracking and detection).
> In this sense, using an evaluation set generated in a different time
> period from the training or development set is a more realistic test
> than using an evaluation set in the same time period as the training
> or development set. As long as the events in the evaluation set are
> sufficiently "random", not artificially odd from the pool (assuming
> the pool is sufficiently large), I do not see what's wrong with it.
>
> Sure, classifiers tuned to be optimal for the training or development
> sets may not be optimal on the evaluation set, but didn't we all agree
> that events evolve and die over time, and that this is part of the
> challenge?
>
> - Yiming
>
Last updated Wed Feb 3 10:44:21 1999