To: James Allan <email@example.com>
From: Rich Schwartz <firstname.lastname@example.org>
Subject: Re: TDT3, a variety of issues
Date: Tue, 9 Feb 1999 09:26:51 -0500 (EST)
Some minor comments below. Again, I mostly agree. It is
important to focus our efforts so that we end up with something of value
rather than a smattering of bits and pieces.
And anything we can do to accelerate delivery of the data so that we
can start working on this before July would be good as well. It would be a pity
if we only got to work on it for 4 months before an evaluation (which
would be a dry run no matter what we called it), rather than working on it
for most of a year. I firmly believe that evaluations focus and drive
the research. But evaluations without the research aren't much use.
So while I objected that a meeting with results every two months
would be excessive, if we really want to make more progress, I think a
meeting about twice a year with dry-run results would get things moving
faster. Progress in these new areas requires several ingredients:
1. An appropriate corpus for the problem, labeled as needed.
2. An evaluation metric with corresponding software.
3. Time to do the research.
4. Iterations to work out the kinks, because we never get it right
the first time.
5. A few more iterations to really nail the problem, otherwise all we did
was measure the performance on the problem.
So I think it's safe to say that we've now nailed the topic
tracking problem pretty well. The topic detection problem is done pretty
well also, except for issues related to topic definition, which are
difficult to resolve. Story segmentation is a little up in the air,
because the purpose and metric are in debate.
The new goal of not LABELING background stories is a reasonable
one in that it reduces the necessary effort in the application. I'm
assuming there would never be any reason to limit use of past
(or even concurrent) background data. It's just that it wouldn't be
labeled. I doubt that this will be hard to solve.
Cross Lingual TDT is definitely new and very hard. It also
requires that we be very careful to define what it is trying to accomplish
in a realistic scenario with reasonable goals and resources. I understand
the tracking. The detection will be hard to define or do with no a priori
idea of what a topic is. I'm not saying we shouldn't do it. Just that we
need to think about it carefully.
First story detection was done in TDT1 and dropped for several
logistical reasons, such as there being too few test stories. More
test cases were created artificially by removing the true first
stories, but people were unsatisfied. So it clearly will need some
thought to define it better.
This doesn't mean that it isn't useful -- I assume it would be very useful
operationally. Again, we just need to think about why it was dropped
before and improve the definition.
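For concreteness, here is a minimal sketch of the TDT1 trick, in
Python with a hypothetical data layout ((story_id, topic_id) pairs in
chronological order): drop the first story (or first few) of a topic,
and a later story becomes the "first" one the system must detect.

    from collections import defaultdict

    def expand_first_story_cases(stories, max_drops=2):
        # stories: chronological list of (story_id, topic_id) pairs
        # (hypothetical layout, not the actual TDT file format)
        by_topic = defaultdict(list)
        for sid, topic in stories:
            by_topic[topic].append(sid)
        cases = []
        for topic, sids in by_topic.items():
            # k = number of true leading stories to hide; k=0 is the
            # real case, so artificial cases start at k=1
            for k in range(1, min(max_drops, len(sids) - 1) + 1):
                dropped = set(sids[:k])
                stream = [(s, t) for (s, t) in stories if s not in dropped]
                cases.append((topic, sids[k], stream))
        return cases

Each case pairs a topic with the story the system should now flag as
"first"; presumably the artificiality of those firsts is part of why
people were unsatisfied.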
On Mon, 8 Feb 1999, James Allan wrote:
> Date: Mon, 08 Feb 1999 23:08:47 -0500
> From: James Allan <email@example.com>
> To: firstname.lastname@example.org
> Subject: Re: TDT3, a variety of issues
> Following up on the few comments....
> No one has screamed in response to either of our comments about
> considering a limit to the number of different tasks we undertake. It
> may be that there is enough interest in each of them, though, that we
> will have to prioritize, or at least choose 2-3 "major" tasks that
> everyone is expected to contribute to. Other tasks could then be
> viewed as more exploratory?
Remember that even in TDT2 we didn't have everyone do all 3 tasks.
Most did 1 or 2. If we could limit the total number of tasks to 3-4,
that would get more participation in each.
> There appears to be general agreement that tracking without large
> numbers of off-topic stories is a reasonable task that would not be
> too difficult to accommodate. I am against having *NO* off-topic
> stories at all, though I am not against having a "what happens if
> there are none"--I would think that the interesting question would be
> how much off-topic material do you need, and what should its nature
> be? (Of course, we can play with this somewhat with the TDT-2 corpus,
> so perhaps this should all be low-priority for TDT-3.)
Again, the issue is not having background data. It's that the
background is not labeled, so there may be on-topic stories in it. (We
can simulate this condition now by making believe we don't know some
of the YES-labeled stories. The only problem is that we know that
there are only 1 or 2 of them, while in the real case there might be
many.)
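As a minimal sketch of that simulation (Python; the label layout and
names are hypothetical), hide a fraction of the known YES labels so
the background may secretly contain on-topic stories:

    import random

    def hide_on_topic_labels(labels, hide_fraction=0.5, seed=0):
        # labels: dict story_id -> 'YES' or 'NO' for one topic
        # (hypothetical layout, not the actual TDT annotation format)
        rng = random.Random(seed)
        yes_ids = [sid for sid, lab in labels.items() if lab == 'YES']
        hidden = set(rng.sample(yes_ids, int(len(yes_ids) * hide_fraction)))
        # hidden YES stories become unlabeled background, so a tracker
        # can no longer assume the background is purely off-topic
        visible = {sid: ('UNLABELED' if sid in hidden else lab)
                   for sid, lab in labels.items()}
        return visible, hidden

The caveat above still applies: with only 1 or 2 on-topic stories per
topic to hide, the simulation understates how contaminated a real
unlabeled background could be.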
> > 2. I agree completely that we should not all try to create Mandarin
> > resources. It's not that hard, but would absorb most of our effort and
> > would also create differences in the results that would be impossible to
> > separate from the underlying IR techniques.
> So one excellent goal for the TDT-3 meeting at Dulles will be a list
> of known Mandarin resources, from the LDC and from participating (or
> not) sites.
YES. Can we get that list before the meeting by Email so that it can be
almost finalized by the workshop?
> > 3. I don't personally care about getting more English data, since we
> > already have so much -- 3 sets of TDT2 -- and I'd hate to use valuable LDC
> > time to create it rather than getting the new data we need ASAP. I'm
> > concerned that we won't really be able to start on any of these new tasks
> > until months from now.
> I agree that it is not worthwhile if it takes more than a pittance of
> LDC time. If it's gathered, though, and if it's already in usable
> shape (though obviously unlabelled wrt the topics) then it'd be nice
> to have it as a possible resource.
> I'm less concerned than Rich is about corpus consistency, though I
> haven't looked into it as much as he appears to have. I look forward to
> seeing the analysis that George announced. My gut feeling is that the
> LDC took a bit to get "into the groove" and that they are now fairly
> consistent, whereas the training data is somewhat out of line with the
> dev and eval sets. But that's a gut feeling and not based on any
> careful analysis.
I also would like to see that analysis. If the "statistical analysis"
says that there is no difference, but every site reports 2-3 times the
CTrack error on the May-June data that it gets on the Mar-April data
(even after tuning to the May-June data), then obviously the
statistical analysis needs rethinking. A quick per-site ratio check,
sketched below, would make the disagreement easy to see.
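Here is that back-of-the-envelope check (Python; the site names and
error numbers are made up purely for illustration): compute each
site's May-June to Mar-April error ratio.

    def period_ratio(errs):
        # errs: per-period CTrack costs for one site (hypothetical)
        return errs['may_jun'] / errs['mar_apr']

    # made-up numbers, purely for illustration
    site_errors = {
        'site_a': {'mar_apr': 0.010, 'may_jun': 0.025},
        'site_b': {'mar_apr': 0.008, 'may_jun': 0.021},
    }
    for site, errs in sorted(site_errors.items()):
        print(site, round(period_ratio(errs), 2))
    # ratios consistently in the 2-3 range across sites would
    # contradict a "no difference" finding

If every site's ratio lands well above 1, the burden is on the
statistical analysis to explain why that isn't a difference.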
Let's get started,