
To: Yiming Yang <yiming@cs.cmu.edu>
From: Hubert Jin <hjin@bbn.com>
Subject: Re: questions about the FSD analysis
Date: Mon, 11 Oct 1999 14:01:46 -0400 (EDT)

Yiming,

The quantitative analysis rests on a few simplifying assumptions.
You can imagine the process as follows:

(1) Let the FSD system make a judgment on an incoming story.
(2) Record error if the FSD system is wrong.
(3) Assign the story to the right cluster where it belongs.

So at any given moment, the stories in the history are always
correctly clustered. We take the topic-weighted P_trk(FA) and
P_trk(Miss) as the error rates for any generic topic. We
also assume that the topics are independent in content.
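
The three steps above can be sketched as a small evaluation loop. This is only an illustration of the assumed process, not the code used for the analysis; `run_oracle_corrected_fsd` and the `judge` callback are hypothetical names, and the true topic labels stand in for the "right cluster" of step (3).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a story is (story_id, true_topic); the history maps
# each topic seen so far to the ids of the stories assigned to it.
Story = Tuple[str, str]
History = Dict[str, List[str]]


def run_oracle_corrected_fsd(
    stories: List[Story],
    judge: Callable[[str, History], bool],
) -> Tuple[int, int]:
    """Return (false_alarms, misses) under the idealized process.

    judge(story_id, history) returns True when the system declares the
    story to be the first story of a new topic.
    """
    history: History = {}
    false_alarms = 0
    misses = 0
    for story_id, true_topic in stories:
        is_first = true_topic not in history
        # (1) Let the FSD system make a judgment on the incoming story.
        said_first = judge(story_id, history)
        # (2) Record an error if the judgment is wrong.
        if said_first and not is_first:
            false_alarms += 1
        elif not said_first and is_first:
            misses += 1
        # (3) Assign the story to its correct cluster regardless of the
        #     judgment, so the history stays perfectly clustered.
        history.setdefault(true_topic, []).append(story_id)
    return false_alarms, misses
```

Because step (3) ignores the system's own decision, errors never propagate into the history, which is what makes the later probability argument tractable.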

There are roughly 50 stories per LDC-annotated topic. Therefore,
in a corpus of 20K stories, the expected number of topics should
be around 20,000 / 50 = 400. So we use N=400.

Of course, in a real FSD task, the clusters are not always perfect
at any given moment and the topics are not always independent. It
would be a much messier situation to analyze.

Hope this clarifies.

-Hubert

On Sat, 9 Oct 1999, Yiming Yang wrote:

>
> 1) Could you define the lowercase "n" in the formula? Is it the
> number of topics before the first story? Is it equal to k - 1 for
> topic-k? If so, is n ranging from 0 to N - 1?

'n' is the number of topics that have appeared in the history at any
given moment. For the 1st story of topic-k, n is k-1. But for a
non-first story of topic-k, n can be anywhere from k to N.
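
To make the definition concrete, here is a minimal sketch that computes n for every story in a stream, assuming the topic label of each story is known. The function name `topics_in_history` and the example labels are made up for illustration.

```python
from typing import List


def topics_in_history(topic_sequence: List[str]) -> List[int]:
    """For each story, return n: the number of distinct topics already
    present in the history at the moment the story arrives."""
    seen = set()
    counts = []
    for topic in topic_sequence:
        counts.append(len(seen))  # history size before this story is added
        seen.add(topic)
    return counts


# Topics are numbered by order of first appearance: A=1, B=2, C=3.
print(topics_in_history(["A", "A", "B", "A", "C", "B"]))
# prints [0, 1, 1, 2, 2, 3]
```

The first stories of A, B and C get n = 0, 1, 2 (that is, k-1), while the non-first stories get values between k and N = 3.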

> 2) Are P_trk(FA) and P_trk(Miss) topic-weighted averages over topics?

Yes. We used the topic-weighted P_trk(FA) and P_trk(Miss) as
the expected error rates for any labelled or unlabelled topic.

> 3) In the upper bound of P_fsd(FA), the RHS seems to be just the
> average over topics from 1 to N, why is it the upper bound? I also
> don't understand the lower bound formula. My confusion may have
> something to do with the definition of "n", i.e., the first question ...

The non-first stories of topic-k occur at random. They could come
immediately after the 1st story of topic-k or at the end of the data
corpus. The later they occur, the more likely they are to be tracked
by some existing topic in the history, simply because there are more
and more topics in the history as time goes on. As long as a non-first
story is tracked (even mistakenly, by another topic), we won't have
a false alarm error.
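
The per-story formula is not quoted in this message. One reading consistent with the argument above, assuming independent topics that all share the same topic-weighted rates, is that a non-first story becomes a false alarm only when its own topic misses it and none of the other n-1 topics in the history falsely tracks it:

```latex
P_{fsd}(FA \mid n) = P_{trk}(Miss)\,\bigl(1 - P_{trk}(FA)\bigr)^{n-1}
```

This quantity shrinks as n grows, which matches the point that later stories are less likely to be false alarms. Treat it as a sketch of the reasoning rather than the exact expression used in the analysis.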

If all the non-first stories of topic-k came in immediately after
the 1st story of topic-k, n would be at its smallest possible value
(n=k) for each of them, and that gives the upper bound for the FA.
The topic-weighted P_fsd(FA) is then just the average over topics.

The lower bound case happens when all the non-first stories occur
after there are already N topics in the history. You could imagine
a data corpus in which the first N stories are all 1st
stories of the N topics. Then the non-first stories all have the
same P_fsd(FA), which is the individual-story P_fsd(FA) with n=N.
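
The two bounding cases can be sketched numerically. The per-story expression below is an assumption on my part (own topic misses the story, and none of the other n-1 topics falsely tracks it); the actual formula is not reproduced in this message. The function names and the example rates are hypothetical.

```python
def fsd_false_alarm(n: int, p_miss: float, p_fa: float) -> float:
    """Assumed per-story FSD false-alarm probability for a non-first story
    arriving when n topics (including its own) are in the history."""
    return p_miss * (1.0 - p_fa) ** (n - 1)


def fa_bounds(num_topics: int, p_miss: float, p_fa: float):
    """Return (lower, upper) bounds on the topic-weighted P_fsd(FA).

    Upper: every non-first story of topic-k arrives right after its
    first story, so n = k; average over k = 1..N.
    Lower: every non-first story arrives after all N topics have
    appeared, so n = N for all of them.
    """
    upper = sum(
        fsd_false_alarm(k, p_miss, p_fa) for k in range(1, num_topics + 1)
    ) / num_topics
    lower = fsd_false_alarm(num_topics, p_miss, p_fa)
    return lower, upper


# Illustrative rates only, not measured values.
lower, upper = fa_bounds(num_topics=400, p_miss=0.10, p_fa=0.001)
print(f"{lower:.4f} <= P_fsd(FA) <= {upper:.4f}")
```

Since the per-story value decreases with n, the n=N case can never exceed the average over n=1..N, so the lower bound stays below the upper bound for any rates in [0, 1].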

> Finally, I have a real question: How accurately can we use tracking
> performance to predict FSD performance (upper/lower bounds or expected
> DET)? Given that tracking is conditioned on the availability of
> labelled stories (on-topic ones, at least) while FSD is not, the
> "expected" FSD performance would be optimistic. The more training
> data we have, and the better the learning method is, the less accurate
> the estimate, I would assume.

The labelled stories make it possible to evaluate the tracking system
performance. We have to assume that the tracking performance on the
labelled topics is approximately the same as that on the unlabelled
topics. Every story in the TDT corpus is on-topic for some topic.
The off-topic stories are just not on the selected topics, but they
are still on some unlabelled topics. So in my opinion, labelling all
stories/topics in the data corpus won't change the probabilistic
relation between FSD and Tracking.

>
> What do you think?
>
> - Yiming
>


Last updated Tue Oct 19 10:10:09 1999