
To: Christopher Cieri <ccieri@ldc.upenn.edu>
From: Rich Schwartz <schwartz@bbn.com>
Subject: Re: Mandarin Resources and Large, Non-Punctual Topics
Date: Wed, 10 Feb 1999 09:20:01 -0500 (EST)

Chris,

On Tue, 9 Feb 1999, Christopher Cieri wrote:

> James asked us to list any Mandarin resources we might contribute to
> TDT-3.

The resource we all need most is a bilingual dictionary, because we
are supposed to find Mandarin and English documents that are about the
same topic. We could each start to work on techniques for estimating
a probabilistic bilingual dictionary from general news. But I'm
assuming that is beyond the scope of this effort. Am I wrong there?
It's certainly an interesting and current topic. I just didn't think
we were going to do that here.

Doug Oard's message is encouraging.



> Rich raised the issue of differences in the number of on-topic stories
> in TDT-2 Training, Dev/Test and Evaluation sets and how that affects the
> research.

.......

> So under current practice, large, ongoing topics present early in the
> corpus affect the distribution of on-topic stories in two ways: 1) they
> concentrate on-topic stories in the early segment, and 2) they consume
> air-time so that topics selected subsequently are less likely to be
> fruitful. These are, I think, the major causes of the differences in
> TDT-2.


I agree that this explains the extremely large difference between the
Jan-Feb set and the Mar-Apr set. Based on this logic, the May-June set
should have had even FEWER new topics, with correspondingly fewer
on-topic stories. But it had about 3-4 times as many as the Mar-Apr
set. I understand that the variance in topic size is very large, so
even with 25 topics in a set, we need to expect large variations. But
I wouldn't expect such large variations in performance. There were
basically a few topics in the eval set that tended to get divided in
half, resulting in large p(MISS), and also some with large overlap in
topic with some other (perhaps nonlabeled) topics.
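
The point about topic-size variance can be illustrated with a small
simulation. The sketch below is not from the original message. It
assumes, purely for illustration, that topic sizes follow a
heavy-tailed Pareto distribution; the shape parameter `alpha`, the
function name `simulate_set_totals`, and all numbers are hypothetical
and are not fitted to TDT-2. It draws many sets of 25 topics and
compares their total on-topic story counts.

```python
import random
import statistics


def simulate_set_totals(num_sets=1000, topics_per_set=25, alpha=1.2, seed=0):
    """Return the total on-topic story count for each simulated set.

    Topic sizes are drawn from a Pareto distribution. This is an
    assumption made for illustration, not a model of the TDT-2 data.
    A smaller alpha gives a heavier tail (more very large topics).
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(num_sets):
        # paretovariate returns values >= 1, so every topic has >= 1 story
        sizes = [int(rng.paretovariate(alpha)) for _ in range(topics_per_set)]
        totals.append(sum(sizes))
    return totals


totals = simulate_set_totals()
median_total = statistics.median(totals)

# Fraction of simulated sets whose total is at least 3x the median total
frac_3x = sum(t >= 3 * median_total for t in totals) / len(totals)

print("median total stories per set:", median_total)
print("smallest / largest set total:", min(totals), "/", max(totals))
print("fraction of sets at >= 3x the median:", frac_3x)
```

Under a heavy-tailed assumption like this one, a single very large
topic can dominate a set's total, which is one way a 3-4x difference
between two 25-topic sets could arise by chance alone.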

Somehow, this resulted in a large change in the measured
performance. As I said in my previous message, this large
change might actually be expected over time in the real world due to
changing realities. But it would be nice if we could make the
research (artificially if you like) homogeneous in order to be able to
understand what's happening. I think this can only be done by
artificial means, like creating all the corpora at the same time.
So I'd like to propose that we do that for TDT-3.

--Rich



Last updated Thu May 13 09:28:14 1999