To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Summary of topic selection method
Date: Tue, 17 Feb 1998 12:32:55 EST

I felt compelled to go back through the TDT email to find the
relevant points about topic selection (mostly to help orient Mike
Schultz, who just joined the discussion recently). George D.
encouraged me to circulate this summary to the entire group, to see
whether anyone has anything to add from their own recollections.

The summary is just a collection of excerpts from George's email over
the last few weeks, separated by "[From GD's ... message of ...]".
Following the summary, I will attach some comments about the final
selection of topics using this week's input from the sites.


------------- SUMMARY OF INFORMATION ON TOPIC SELECTION -----------

[From GD's "data collection question" message of 1/7:]

CORPUS CONSTRUCTION:
...
* TDT2 researchers will define 4 events every week, one event
from each of four sources. (The sources will rotate, so that
each source will account for the same number of events.)
...

[From GD's "Action items" message of 2/11:]

Mon, Feb 16th:

* LDC posts for ftp access the story text for the first 4 weeks
of TDT2 (all 6 sources).

* LDC distributes (via email) randomized lists of story pointers
for this data to each of the participating sites.

These lists are to be used by the sites to identify and define
candidate target topics. There will be 6 randomized lists per
site, one for each source. Each list is to be independently
randomized for each source and site.

Mon-Thu, Feb 16th-19th:

* Each site defines 12 unique topics from the stories identified
in the lists, with two topics per source.

[From GD's "meta-definition of TOPIC" message of 2/12:]

How to define TDT2 topics is a troublesome problem, one that we have
discussed at length. One of the most difficult issues is how to
limit the scope of a topic. Charles Wayne and I have discussed this
at length and have decided that the best solution is to define topic
boundaries as simply as possible by asking whether the topics
(events/activities) being discussed are independent or causally
related. If they are independent, then they are NOT part of the same
topic. If they are causally related, then they ARE part of the same
topic. So, here is the definition of topic:

For the purposes of TDT2 research:

A TOPIC is a set of connected events and activities. This includes
all directly related and/or consequential events and activities.

This means that the "Nicole Simpson murder" TOPIC is a large topic
that includes all of the derivative events and that is of indefinite
extent. For example, future related events, such as exhuming
Nicole's body or the discovery of new criminal evidence, would be
included in the topic. But O.J.'s current activities, which might
give rise to stories due to his notoriety stemming from the case,
would NOT be part of the topic, because there is no direct connection
with it.

[From GD's "please expedite" message of 2/17:]

In order to give LDC time to tag stories before our meeting, it would
be very helpful for y'all to complete your topic definitions as soon
as possible. Also, please submit them to LDC as you define them.
Don't wait to complete them before submitting them. Thanks.

--------------------------- END OF SUMMARY --------------------------


SO, based on the "Action items" info of 2/11, I am actually expecting
that by the end of this week, each site will have sent me 10
story-ids, representing two topics from each of the five sources that
I made available as of yesterday morning. From the total of 80 such
story-ids, I should select 14, making sure that all sources are
represented at least twice. (There will still need to be two
additional topics (story-ids) selected from the public radio source,
as soon as this text becomes available.) I believe I should also
make sure that the 14 stories selected this way are all about
different topics.
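
For anyone who wants to check the arithmetic, the selection constraints
above (14 picks, every source represented at least twice, no two picks
about the same topic) can be sketched roughly as follows. This is only
an illustrative greedy pass, not a description of how the selection
will actually be done: the function name, the tuple layout and the
per-story topic label are all hypothetical, and deciding whether two
stories are "about the same topic" is a human judgment that the code
simply takes as given. A greedy pass can also fail in cases where a
valid selection exists.

```python
from collections import Counter


def select_topics(candidates, total=14, min_per_source=2):
    """Greedy sketch of the selection constraints.

    candidates: list of (story_id, source, topic_label) tuples, in
    whatever order of preference the selector chooses. The topic_label
    is an assumed, manually assigned label (hypothetical).
    """
    sources = sorted({source for _, source, _ in candidates})
    if total < min_per_source * len(sources):
        raise ValueError("total is too small to cover every source")

    selected = []
    used_topics = set()
    per_source = Counter()

    # Pass 1: give every source its minimum number of distinct topics.
    for source in sources:
        for cand in candidates:
            if per_source[source] >= min_per_source:
                break
            _, cand_source, topic = cand
            if cand_source == source and topic not in used_topics:
                selected.append(cand)
                used_topics.add(topic)
                per_source[source] += 1
        if per_source[source] < min_per_source:
            raise ValueError(f"not enough distinct topics for {source}")

    # Pass 2: fill the remaining slots with any unused topic.
    for cand in candidates:
        if len(selected) >= total:
            break
        _, cand_source, topic = cand
        if topic not in used_topics:
            selected.append(cand)
            used_topics.add(topic)
            per_source[cand_source] += 1

    if len(selected) < total:
        raise ValueError("fewer distinct topics than requested")
    return selected
```

Ordering the candidate list before calling the function is where any
real preference among the submitted story-ids would come in; the sketch
itself only enforces the counts.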

PLEASE send email to <tdt-distrib@ldc.upenn.edu> if you have any
questions, additions, corrections or disputes about any of the above.

Thanks,

Dave Graff

Last updated Wed Sep 9 09:40:45 1998