To: tdt-distrib@ldc.upenn.edu
From: James Allan <allan@cs.umass.edu>
Subject: TDT 2001 call for participation
Date: Wed, 21 Mar 2001 16:42:08 -0500

CALL FOR PARTICIPATION
		     Topic Detection and Tracking
			       TDT 2001

			March to November 2001


You are invited to participate in TDT 2001. This year's workshop is
the fourth in a series investigating methods for organizing a stream
of broadcast news into stories based on the real-world events they
describe.

TDT 2001 will continue investigation into the five core tasks
described below. The workshop format includes a training/development
corpus available immediately, and a dry run evaluation on that data
(optional for returning participants) to be held over the summer. The
evaluation data will be distributed in September, with system results
due in October, and evaluation results returned by late October.

Interested participants should contact the workshop leaders by early
April, 2001, to help with planning the details of the workshop.

A workshop for participants will be held in conjunction with the TREC
workshop in Gaithersburg, Maryland, on November 12-13.


... THE TASKS OF TDT

The TDT workshop investigates organization of broadcast news by the
events described in the news. Processing must be done as the news
arrives, and not using a static collection. TDT investigates the
following tasks:

(a) Story Segmentation is the task of dividing the audio
transcript of a news program into individual news stories.

(b) Topic Cluster Detection is the task of grouping all stories in
the stream into "bins" that correspond to events in the news,
including creating new "bins" for previously unseen events.

(c) Topic Tracking is a supervised variant of Cluster Detection,
where a system is provided with a handful of on-topic
stories and must find the rest.

(d) First Story Detection is also a variant of Cluster Detection,
where a system must identify the onset of any new topic as
rapidly as possible.

(e) Story Link Detection is a core task that requires a system to
decide whether two randomly selected stories are on the same
topic.
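
As a rough illustration of the Story Link Detection decision in (e),
the sketch below compares two stories by the cosine similarity of
their word counts and reports them as on the same topic when the
similarity clears a threshold. This is only one simple baseline, not
the method of any TDT system; the function names and the 0.2 threshold
are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty story shares nothing with any other story
    return dot / (norm_a * norm_b)


def same_topic(story_a, story_b, threshold=0.2):
    """Hypothetical link decision: True if the stories look related.

    The threshold is an illustrative assumption; a real system would
    tune it on training data.
    """
    sim = cosine_similarity(term_counts(story_a), term_counts(story_b))
    return sim >= threshold
```

A call such as same_topic(text_of_story_1, text_of_story_2) returns a
yes/no decision for one story pair.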

Research on all tasks will continue in TDT 2001. However, strong
emphasis will be placed on the tracking task this year, and in
particular on how to normalize scores across topics. Participants are
also strongly encouraged to focus on the Story Link Detection task
because of its general applicability.
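
To picture the score normalization problem mentioned above: raw
tracking scores for different topics can sit on different scales, so a
single decision threshold does not transfer from one topic to another.
The sketch below applies a per-topic z-score as one possible
normalization. It is an assumption for illustration, not a technique
prescribed by the evaluation; the function name and data layout are
hypothetical.

```python
from statistics import mean, pstdev


def normalize_scores(scores_by_topic):
    """Z-score the raw tracking scores within each topic.

    scores_by_topic maps a topic id to a list of raw scores (a
    hypothetical layout). Returns the same shape with each topic's
    scores shifted to mean 0 and scaled to unit standard deviation.
    """
    normalized = {}
    for topic, scores in scores_by_topic.items():
        if not scores:
            normalized[topic] = []
            continue
        mu = mean(scores)
        sigma = pstdev(scores)
        if sigma == 0:
            # All scores identical: nothing to scale, so map them to 0.
            normalized[topic] = [0.0 for _ in scores]
        else:
            normalized[topic] = [(s - mu) / sigma for s in scores]
    return normalized
```

After this step, one threshold can be compared against scores from
every topic, which is the property the normalization is meant to
provide.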

Variations on the tasks may be adopted this year, provided there is
consensus from the participating sites. Those may include more
realistic handling of "brief" stories (those that include only a
passing mention of a topic), detection of non-news stories, and so on.


... THE EVALUATION CORPUS

The evaluation corpus for TDT 2001 will be based upon three months of
news stories from the end of 1998 (the TDT-3 corpus). The
approximately 60,000 news stories are in English or Mandarin and come
from a variety of sources spanning newswire, television, radio, and
the Web. Because this corpus was used for TDT 2000, it will be
"perturbed" this year to make it a different task.

Evaluation topics will be selected from the TDT 1999 and TDT 2000
evaluation tasks.

Training and development data will use the TDT-2 corpus, approximately
72,000 news stories from the first six months of 1998. Sites will
also be permitted to use the "unperturbed" TDT-3 corpus for
development, though they are cautioned against overfitting. Training and
development topics will include approximately 100 topics available for
the TDT-2 English corpus, of which 20 are also available for the TDT-2
Mandarin corpus. A set of 30 topics judged for the TDT-3 corpus will
also be available.

A TDT-4 corpus is being created, but will not be ready this year.
This corpus will contain all of the TDT-3 sources plus four additional
Chinese broadcast sources. The four-month collection will include news
text and over 600 hours of English and Mandarin broadcast news
yielding approximately 48,000 English and Mandarin news
stories. Annotations will include story boundaries and 60 new topics
defined and annotated following the processes that were used for the
TDT 2000 evaluation. Sites particularly interested in the new evaluation
corpus should participate in TDT 2001 to be involved in the
specifications for TDT-4.

TDT data is provided by the Linguistic Data Consortium. Sites that
are current members of the LDC will have access to the data via their
membership. Sites that are unable to join the LDC and cannot afford
the appropriate license will be given access to the data via an
Evaluation Membership as long as they are participating in TDT. (We
are working on less-restrictive conditions, but the high cost of the
intellectual property makes it difficult to achieve.)


... PARTICIPATING IN TDT

More information about TDT, including details of TDT 2001, many TDT
publications, and information from past TDT workshops, is available at
the TDT Web site,

http://www.nist.gov/TDT

Organizations wishing to participate in TDT 2001 should respond to
this call by:

(1) sending an email message to James Allan (allan@cs.umass.edu)
indicating your intent, and including a list of the TDT tasks
in which you expect to be involved;

(2) joining the TDT mailing list (tdt-distrib@ldc.upenn.edu) by
sending a message to David Graff at graff@ldc.upenn.edu.

Please indicate your interest as soon as possible. Dates of all
meetings and evaluations will be specified by April 1, 2001. Sites
may join TDT 2001 after that time, but will have to abide by those
deadlines.

This workshop will be conducted by the National Institute of Standards
and Technology (NIST), with support from the Defense Advanced Research
Projects Agency (DARPA).

Last updated Wed Mar 21 16:47:26 2001