
To: tdt-distrib@unagi.cis.upenn.edu
From: Hubert Jin <hjin@bbn.com>
Subject: Re: TDT3, a variety of issues
Date: Tue, 9 Feb 1999 09:01:41 -0500 (EST)

Following up on the not-on-topic stories issue...

If we researchers are allowed to use the previous data corpus in whatever
way we like, then having or not having a not-on-topic story list is not
even an issue. For any large corpus, we can safely say that over 95% of the
stories are not on topic. So if we do a back-tracking pass, we can almost
surely filter out the stories that are likely to be on-topic. (We do
similar filtering during normalization to discard the few human
false-alarm stories in the LDC-provided not-on-topic lists.) The resulting
automatically detected not-on-topic story list may not have the same
quality as those provided by LDC, but it is probably good enough to
generate approximate statistics for normalization (though I would guess
it may introduce a little degradation).
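
A rough sketch of the idea, for illustration only: score every story in a
past corpus against the topic, discard the top-scoring few as likely
on-topic, and compute normalization statistics from the rest. The
function names, the scoring, the drop fraction, and the z-score style
normalization are all assumptions made for this example, not a
description of any actual system.

```python
import math

def approx_off_topic_stats(scores, drop_fraction=0.05):
    """Estimate mean/std of off-topic scores without a labeled list.

    scores: topic-similarity scores for stories in a past corpus
            (higher = more likely on-topic); the scorer is assumed.
    drop_fraction: share of top-scoring stories discarded as likely
            on-topic; 0.05 is an illustrative choice.
    """
    if not scores:
        raise ValueError("need at least one score")
    ranked = sorted(scores)
    # Drop the highest-scoring stories, but always keep at least one.
    n_drop = min(int(len(ranked) * drop_fraction), len(ranked) - 1)
    kept = ranked[:len(ranked) - n_drop]
    mean = sum(kept) / len(kept)
    var = sum((s - mean) ** 2 for s in kept) / len(kept)
    return mean, math.sqrt(var)

def normalize(score, mean, std):
    """Hypothetical z-score normalization using the estimated statistics."""
    return (score - mean) / std if std > 0 else 0.0
```

Any on-topic stories that slip past the cutoff will bias the statistics
slightly, which is the small degradation mentioned above.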

But if we are not allowed to use any information from not-on-topic
stories, that will be artificially unfair and contrary to reality. We
will end up with algorithms that are not smart enough to take advantage
of widely available information.

-Hubert


Last updated Thu May 13 09:28:12 1999