To: George Doddington <doddington@nist.gov>
From: Doug Oard <oard@glue.umd.edu>
Subject: Re: Proposed change to the topic tracking task
Date: Thu, 10 Aug 2000 00:28:53 -0400 (EDT)

On Wed, 9 Aug 2000, George Doddington wrote:

> Your note exhibits term warp. In points 1) through 3) your word
> "condition" seems to translate to the TDT term "topic".
>
> > 1) We could conceivably run 1500 conditions, but getting to that scale
> > would detract focus from other aspects of system tuning.

I'm confused, George. A topic is defined in the evaluation plan as:

"A topic is defined to be a seminal event or activity,
along with all directly related events and activities."

If you have created 1500 topics (and by implication 1500 seminal events)
we're talking about something very different from what I had understood. I
thought we were talking about submitting 1500 result sets, for 1500 of
what I referred to as "conditions." If I have inadvertently overloaded the
term "condition" and we have another name for the set of parameters that
define a run, I'll be happy to use that. But referring to different
conditions for the same topic as if they are different topics seems like
it muddies the discussion in a way that is not helpful.

> In point
> 4) your word "condition" would seem to translate to the TDT term
> "training story".
>
> > 4) Building on that last point, have we agreed on what kind of averaging
> > makes sense if there are variable numbers of conditions per
> > topic? Condition-weighted? topic-weighted?

As I understand it, in this case, the conditions are defined by SETS of
training stories, so I suppose we could call the question "training story
set weighted" vs. "topic weighted" vs. anything else we can think of. My
question (which remains unanswered) is: what are we seeking to optimize?
To make the point clearer, imagine that the evaluation requires that we
run 121 conditions, Nt=1 for all 120 topics and Nt=2 for one of the
topics. If we do topic-weighted scoring in the sense of topic that the
evaluation plan uses, then I suppose we would average the performance for
that one topic over two training story set sizes and then average that in
with the results from the other 119 topics. I don't mean to suggest that
this is a good way to do things, just that it is one possible
interpretation of what "topic weighted" means in this context. I
understand that lots of different ways of aggregating things could be
reported. My question is, when NIST flashes up the one slide with
everyone's DET curve on it (you know, the one with BBN in the lower left
:) how will that have been computed?
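
To make the two candidate averages concrete, here is a minimal sketch of
the 121-condition example above. The cost values, the tuple layout, and
the function names are all hypothetical, chosen only for illustration;
nothing here is taken from the evaluation plan or from NIST's scoring
software.

```python
from collections import defaultdict

def condition_weighted(results):
    """Every (topic, Nt) condition counts equally in the average."""
    return sum(cost for _topic, _nt, cost in results) / len(results)

def topic_weighted(results):
    """Average the conditions within each topic, then average the topics."""
    by_topic = defaultdict(list)
    for topic, _nt, cost in results:
        by_topic[topic].append(cost)
    per_topic = [sum(costs) / len(costs) for costs in by_topic.values()]
    return sum(per_topic) / len(per_topic)

# Hypothetical costs: 120 topics at Nt=1, plus one extra Nt=2 run
# for a single topic (topic 0), giving 121 conditions in total.
results = [(topic, 1, 0.10) for topic in range(120)]
results.append((0, 2, 0.04))

print("conditions:        ", len(results))
print("condition-weighted:", condition_weighted(results))
print("topic-weighted:    ", topic_weighted(results))
```

With these made-up numbers the two averages differ slightly, because
under topic weighting the extra Nt=2 run is diluted within its own topic,
while under condition weighting it counts as a full 121st data point. The
gap would grow as the number of conditions per topic becomes more uneven.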

> That being the case, your concern about an
> excessive number of topics is well taken. But not to worry --
> There is no need to run all possible different sets of training
> stories. This may be controlled to taste, balancing the desire
> to explore different values of Nt and to have good statistical
> significance against limited processing abilities of sites with
> limited computational facilities.

I realize that it is premature to ask how many "conditions" there will be
while we are still discussing your question regarding whether we will have
lots of conditions in the first place. But if we do decide to require
more than 120 conditions, it would be helpful if you could tell us how
many will be required as soon as you have decided that.

> Regarding your point 3), there should be no concern about limits
> to development efforts, because the processing load that we are
> discussing applies only to the evaluation, not to your research
> and development.

My point had been that developing a system that is fast and space
efficient enough to run 1500 conditions in the available time could use up
resources (time and effort) that we could have otherwise invested in
making a system that moved our DET curve down and to the left. Our
present system is not sufficiently space efficient to store 1500 result
sets, much less the intermediate files that we generate, and fixing that
would take time.

------- Everyone but George can stop reading here ----

BTW, while we're nitpicking terminology, the training set (a
defined subcollection of the TDT-2 collection) contains stories that could
reasonably be referred to as a set of training stories. We also have the
concept of a training story set (as defined above), which could also
reasonably be referred to as a set of training stories. Thus, the term
"training story" is ambiguous in TDT. Within our group we routinely refer
to the stories that are presented as examples of a desired topic as
"exemplars" for just this reason. But we continue to nod politely when
you refer to training stories, and to disambiguate your meaning from
context. If we ever agree on what a "condition" should be called, perhaps
we could then take up the issue of what "training stories" should be
called :-)

Doug

Last updated Tue Sep 19 14:30:55 2000