(226) previous ~ index ~ next

To: Jon_Yamron@dragonsys.com, "'James Allan'" <allan@cs.umass.edu>
From: "Strzalkowski, Tomek (CRD)" <strzalkowski@crd.ge.com>
Subject: RE: Clarification of tracking
Date: Thu, 5 Nov 1998 13:30:11 -0500

I think that the only realistic assumption is that, beyond the handful of
training examples, we know nothing at all about any past stories other
than what can be inferred from those training examples. In TDT2
we actually know more: we know that non-topical stories in the training
set are most likely off topic -- and therefore probably useful for training
without too much risk. In reality, using unknown data for training is not
likely to be much more useful than doing adaptive tracking....

---- Tomek

> ----------
> From: James Allan[SMTP:allan@cs.umass.edu]
> Sent: Thursday, November 05, 1998 1:13 PM
> To: Jon_Yamron@dragonsys.com
> Cc: tdt-distrib@unagi.cis.upenn.edu
> Subject: Re: Clarification of tracking
>
> Following up Doug's and George's comments....
>
> > 1) BBN has demonstrated that, due to occasional labeling errors, a few
> > "off-topic" training examples may actually be on topic, and they put some
> > effort into automatically finding these and eliminating them from the
> > training of the background. The first question is: can these mis-labeled
> > stories be used to supplement the training material for the topic model?
> > (My assumption is NO, because if they had been labeled properly, the
> > systems would not have been allowed to use them.)
>
> I would think that you could use the mislabeled stories to supplement
> the training material, but you cannot use the *fact* that they're
> mislabeled since you wouldn't know that by any process other than
> human intervention.
>
> That is, if I have a process that mines the past for stories that are
> strongly similar to my Nt training stories, I can use those any way I
> want. If those similar stories are from the training data, then I
> know they're judged off-topic; I may choose to use them anyway, hoping
> that the judgement was wrong. If they're from an entirely different
> corpus, then I haven't a clue about the judgement and have no way of
> finding it out.
>
> > 2) If we chose to use other material to train a background model, such as
> > the January-April data, are the restrictions the same? In other words, if
> > we (automatically) scan for and find what we believe to be on-topic
> > material in data that predates the test corpus, can we use it to help train
> > the topic model? (Again, I assume the answer is NO, but that we are free
> > to eliminate this material from the training of the background.)
>
> Again, it seems to me that you can use those stories; you just can't
> *know* that they're on- or off-topic. The only judgements we have are
> those in the index file; anything else the system would have to take
> on faith. (And given point (1) above, even the ones in the index file
> are taken on faith.)
>
> 			-- james


Last updated Fri Nov 6 15:29:22 1998