To: James Allan <firstname.lastname@example.org>
From: Rich Schwartz <email@example.com>
Subject: Re: TDT3, a variety of issues
Date: Mon, 8 Feb 1999 09:41:43 -0500 (EST)
I agree with most of your comments. I know we don't all have to
do everything, but if there are too many tasks, then we might not have
many sites working on any particular one. It would be good to focus on one really
new thing -- like the cross-language TDT. Additional evaluation measures
and conditions will take time but may not really teach us as much. And
certainly in terms of public perception, I doubt that anyone will
understand the distinctions.
1. Tracking without labeled background stories.
This just means that there may be several additional positive
stories within the training data. While I agree that this is essential
for a practical system, I don't know whether it changes our systems
much. Most of the systems don't even use the negative background
stories (a minimal sketch of such a positives-only tracker appears
after these numbered points). The BBN
system does, but even for TDT2, there were enough labeling mistakes so
that we had to deal with this anyway. So it's fine as a new condition, in
my mind, and won't create much extra work. However, it won't have the
effect of decreasing the LDC annotation effort, since full annotation is
still needed for the detection task (and certainly would be needed for
first story detection).
2. I agree completely that we should not all try to create Mandarin
resources. It's not that hard, but would absorb most of our effort and
would also create differences in the results that would be impossible to
separate from the underlying IR techniques.
3. I don't personally care about getting more English data, since we
already have so much -- 3 sets of TDT2 -- and I'd hate to use valuable LDC
time to create it rather than getting the new data we need ASAP. I'm
concerned that we won't really be able to start on any of these new tasks
until months from now.
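To make point 1 concrete, here is a minimal sketch (Python; the
tokenization and the 0.2 threshold are arbitrary illustrations, not
any site's actual system) of a tracker built from the labeled positive
training stories alone. Unlabeled or mislabeled background stories in
the training epoch never enter the model, which is why the new
condition changes little for systems of this kind.

    import math
    from collections import Counter

    def centroid(stories):
        """Average term-frequency vector of the labeled positive stories."""
        c = Counter()
        for text in stories:
            c.update(text.lower().split())
        return {t: n / len(stories) for t, n in c.items()}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def track(positives, incoming, threshold=0.2):
        """Flag an incoming story as on-topic if it is near the centroid;
        negative/background training stories are simply never consulted."""
        c = centroid(positives)
        return [(story,
                 cosine(c, Counter(story.lower().split())) >= threshold)
                for story in incoming]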
My own issue:
We always have a huge problem about vast differences between data
sets. We tune to one and then find that the next set is completely
different in that the criteria used have drifted -- even though not
officially. So we see 4000 on-topic stories for Jan-Feb, 500 for Mar-Apr,
and 1800 for May-June. These differences are clearly not due to random
sampling, because there were plenty of topics to keep the differences
smaller than that. It is because the types of topics and the criteria for
inclusion of stories obviously drifted -- even if the drift was unintentional.
I know that this may mirror naturally changing conditions in the
real world as well. But it makes useful research nearly
impossible. Research requires repeatable experimental conditions.
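A quick back-of-the-envelope check shows how far off random sampling
is. Naively treating the roughly 6300 on-topic stories as drawn
independently and uniformly across the three two-month periods (a
cruder assumption than the topic-level argument above, but it makes
the scale obvious), each period should land within a few dozen
stories of the others:

    import math

    # On-topic story counts per two-month period, from above.
    counts = {"Jan-Feb": 4000, "Mar-Apr": 500, "May-Jun": 1800}
    total = sum(counts.values())                  # 6300 stories overall
    expected = total / len(counts)                # ~2100 per period if uniform
    sigma = math.sqrt(total * (1 / 3) * (2 / 3))  # ~37 stories of binomial noise

    for period, n in counts.items():
        z = (n - expected) / sigma
        print(f"{period}: {n} stories ({z:+.0f} sigma from uniform)")

The observed swings are tens of sigma, far beyond anything sampling
noise could produce.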
My proposal for any corpus (whether TDT or speech or anything
else) is to collect all of the data at the same time, interleaving effort
among test sets. For speech recognition this would mean simply collecting
all of the test shows at the same time, during the same epoch. For TDT,
since the whole set is used for training and test, it requires different
epochs. But these can be done all at the same time. Then at least the
criteria applied are the same.
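In practice the interleaving could be as simple as rotating annotation
batches across all of the epochs instead of finishing one epoch before
starting the next. A toy schedule (the epoch names and batch count are
made up for illustration) might look like:

    from itertools import cycle, islice

    def interleaved_schedule(epochs, batches_per_epoch):
        """Round-robin annotation batches across all epochs at once,
        so any drift in judging criteria is spread evenly among them."""
        return list(islice(cycle(epochs), len(epochs) * batches_per_epoch))

    epochs = ["Jan-Feb", "Mar-Apr", "May-Jun"]
    for step, epoch in enumerate(interleaved_schedule(epochs, 4), 1):
        print(f"annotation batch {step}: judge stories from the {epoch} epoch")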