To: tdt-distrib@unagi.cis.upenn.edu
From: "Charles L. Wayne" <clwayne@afterlife.ncsc.mil>
Subject: TDT3 Plans & Progress
Date: Mon, 1 Feb 1999 23:25:00 -0500 (EST)

TDTers,

I was very pleased with the results of the TDT2 evaluations in December
-- the product of creative thinking and hard work by researchers, data
collectors, and evaluators. Kudos to all.

I look forward to the upcoming presentations at the Broadcast News
Workshop, and would also encourage you to share important insights
before then.

In this message, please let me bring you up to date concerning the
evolving plans for TDT3 and solicit your advice. Most of this was
discussed during our meetings at BBN and IBM:

1. Tasks: We are planning to keep the principal tasks much as they
were last year with the following changes:

a. Add First Story Detection.

b. Do Tracking without labeled background stories.

c. Add Cross-Language TDT (i.e., find all stories on a topic
regardless of language).

d. Tweak some scoring details.

e. Possibly add Topic Links as a supplemental task.

f. Possibly add a Broad Classification task (e.g., distinguish
stories from commercials, et cetera).

2. Languages: We will use both English and Mandarin data. (We had
earlier talked about using Spanish as well, but decided to stick to a
single non-English language for this year.)

3. Test Material: We will draw TDT3 test material from a three-
month period, October through December 1998 (skipping over July through
September to gain more sources). The English language material will
encompass eight sources; Mandarin, three. The test material will be
annotated with respect to 50 target topics. (We should revisit the
question about how topics are chosen and whether 50 are enough.)

4. Training Material: We will not provide any additional English
language training material. Sites may use all of the TDT2 data (in
fact, any data prior to October 1998) for training, but may wish to
preserve TDT2 test material for internal baselines. The LDC will
provide Mandarin data from the first six months of 1998 and will
annotate it with respect to TDT2 topics.

Things are moving ahead nicely on the data front:

1. To let sites start working with more accurate automatic
transcripts, NIST is gearing up to begin retranscribing all of the TDT2
audio data using an improved recognizer provided by BBN. The
processing will begin in February and end in March.

2. To help sites start looking at Mandarin data, the LDC will
release (unannotated) newswire data from the first half of 1998 in
February.

3. The LDC has found some promising Mandarin transcription
agencies and is hiring Mandarin annotators. They plan to begin doing
Mandarin segmentation and annotation later this month.

George will shortly send out a draft evaluation plan to stimulate
discussion and to help us converge on and clarify our research
objectives and metrics.

The formal TDT3 evaluation will happen late in the fall, the exact
timing chosen to avoid conflicts with the Hub4 and Named Entity
evaluations.

Since TDT3 will be breaking new ground, we should hold a small dry run
as soon as practicable. I would favor doing this early in the summer,
if enough Mandarin data is available then; otherwise, late in the
summer like last year.

Since face-to-face discussions are very helpful, I have reserved a
meeting room for us to use from 7 to 9 p.m. on Tuesday night during the
Broadcast News Workshop. (This is the only gap in the workshop
schedule.) Please have a senior representative from your site attend
that meeting, which will be held in the Fairfax Room at the Hilton.

In the interim, please let me know any thoughts or suggestions you have
about any of the above items.

Best wishes,

Charles

Last updated Thu May 13 09:28:12 1999