
To: tdt-distrib@unagi.cis.upenn.edu
From: James Allan <allan@cs.umass.edu>
Subject: Re: TDT3, a variety of issues
Date: Mon, 08 Feb 1999 23:08:47 -0500

Following up on the few comments...

No one has screamed in response to either of our comments about
considering a limit to the number of different tasks we undertake. It
may be that there is enough interest in each of them, though, that we
will have to prioritize, or at least choose 2-3 "major" tasks that
everyone is expected to contribute to. Other tasks could then be
viewed as more exploratory?

There appears to be general agreement that tracking without large
numbers of off-topic stories is a reasonable task that would not be
too difficult to accommodate. I am against having *NO* off-topic
stories at all, though I am not against running a "what happens if
there are none" experiment. I would think the interesting question
is how much off-topic material you need, and what its nature should
be. (Of course, we can play with this somewhat with the TDT-2 corpus,
so perhaps this should all be low-priority for TDT-3.)

> 2. I agree completely that we should not all try to create Mandarin
> resources. It's not that hard, but would absorb most of our effort and
> would also create differences in the results that would be impossible to
> separate from the underlying IR techniques.

So one excellent goal for the TDT-3 meeting at Dulles will be a list
of known Mandarin resources, from the LDC and from participating (or
not) sites.

> 3. I don't personally care about getting more English data, since we
> already have so much -- 3 sets of TDT2 -- and I'd hate to use valuable LDC
> time to create it rather than getting the new data we need ASAP. I'm
> concerned that we won't really be able to start on any of these new tasks
> until months from now.

I agree that it is not worthwhile if it takes more than a pittance of
LDC time. If it's gathered, though, and if it's already in usable
shape (though obviously unlabelled wrt the topics) then it'd be nice
to have it as a possible resource.


I'm less concerned than Rich is about corpus consistency, though I
haven't looked into it as much as he appears to have. I look forward to
seeing the analysis that George announced. My gut feeling is that the
LDC took a bit to get "into the groove" and that they are now fairly
consistent, whereas the training data is somewhat out of line with the
dev and eval sets. But that's a gut feeling and not based on any
careful analysis.
			-- james


Last updated Thu May 13 09:28:12 1999