To: Jon_Yamron@dragonsys.com
From: James Allan <allan@cs.umass.edu>
Subject: Re: Clarification of tracking
Date: Thu, 05 Nov 1998 13:13:30 -0500
Following up on Doug's and George's comments...
> 1) BBN has demonstrated that, due to occasional labeling errors, a few
> "off-topic" training examples may actually be on topic, and they put some
> effort into automatically finding these and eliminating them from the
> training of the background. The first question is: can these mis-labeled
> stories be used to supplement the training material for the topic model?
> (My assumption is NO, because if they had been labeled properly, the
> systems would not have been allowed to use them.)
I would think that you could use the mislabeled stories to supplement
the training material, but you cannot use the *fact* that they're
mislabeled since you wouldn't know that by any process other than
human intervention.
That is, if I have a process that mines the past for stories that are
strongly similar to my Nt training stories, I can use those any way I
want. If those similar stories are from the training data, then I
know they're judged off-topic; I may choose to use them anyway, hoping
that the judgement was wrong. If they're from an entirely different
corpus, then I haven't a clue about the judgement and have no way of
finding it out.
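To make the kind of process I mean concrete, here is a minimal sketch in Python. It assumes a plain bag-of-words cosine similarity against the pooled Nt training stories; the names (mine_similar, the 0.5 threshold, the label field) are all hypothetical and stand in for whatever a real tracking system would use. The point is only that the label travels along as whatever the index file says, or None for a story from another corpus, and the miner never learns whether that label is correct.

```python
import math
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words term counts (an assumed, very crude representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mine_similar(training_stories, candidates, threshold=0.5):
    """Return (story_id, score, label) for candidates that resemble the Nt stories.

    candidates is an iterable of (story_id, text, label). The label is the
    index-file judgement (e.g. "off-topic") for stories from the training
    data, or None for stories from a corpus that has no judgements at all.
    The threshold is an arbitrary illustrative value.
    """
    # Pool the Nt on-topic training stories into one term-count vector.
    centroid = Counter()
    for text in training_stories:
        centroid.update(vectorize(text))

    hits = []
    for story_id, text, label in candidates:
        score = cosine(centroid, vectorize(text))
        if score >= threshold:
            # The label is carried along unchanged; nothing here can
            # tell whether an "off-topic" judgement was a labeling error.
            hits.append((story_id, score, label))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

A system could then add the returned stories to the topic model's training material on the strength of the similarity score alone, which is all it legitimately has to go on.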
> 2) If we chose to use other material to train a background model, such as
> the January-April data, are the restrictions the same? In other words, if
> we (automatically) scan for and find what we believe to be on-topic
> material in data that predates the test corpus, can we use it to help train
> the topic model? (Again, I assume the answer is NO, but that we are free
> to eliminate this material from the training of the background.)
Again, it seems to me that you can use those stories; you just can't
*know* that they're on- or off-topic. The only judgements we have are
those in the index file; anything else the system would have to take
on faith. (And given point (1) above, even the ones in the index file
are taken on faith.)
-- james
Last updated Fri Nov 6 15:29:22 1998