Topic Detection and Tracking - Phase 3 (TDT-3) Overview

Introduction

The goal of Topic Detection and Tracking - Phase 3 (TDT-3) is to create core technology to monitor multiple streams of news in multiple languages and media (newswire, radio, television, web sites or some future combination or innovation), segmenting the streams into individual stories, detecting new topics and tracking all stories that discuss them. In addition to the TDT-2 tasks of segmentation, detection and tracking, TDT-3 adds the tasks of first story detection and story-link detection. The goal of the latter is to detect links between stories that discuss the same topic even though the topic has not been defined in advance.

Nomenclature

The topic detection and tracking project is now entering its third phase of research and corpus creation. In the first phase, research sites created and tested the TDT Pilot corpus. In 1998, LDC undertook corpus creation for the second phase; the resulting corpus included 6 English sources collected from January through June of 1998. We call this the TDT-2 Corpus. The second phase of the research program ran through 1998.

In 1999, LDC has undertaken two corpus creation efforts for TDT. LDC is annotating 3 Mandarin language sources for the period of January through June 1998. We call this the TDT-2 Mandarin Corpus since it overlaps with the original TDT-2 collection period. LDC is also annotating 8 English sources and 3 Mandarin sources for the period of October through December 1998. We call this the TDT-3 Corpus. The third phase of TDT research runs through 1999.

Phase 2 used the TDT-2 Corpus exclusively. Phase 3 will use the TDT-2 Corpus as Training and Dev/Test data and the TDT-3 Corpus as Evaluation Data.

Collection Period and Data Sets

Again, the TDT-2 Corpus contains data collected daily from January 1998 through June 1998. The TDT-3 Corpus contains data collected from October 1998 through December 1998. The Training and Development/Test data used in Phase 2 will become Training Data for TDT-3 Research. Evaluation data in the TDT-2 Corpus will become Development/Test data for Phase 3 Research. The TDT-3 Corpus -- collected from October through December of 1998 -- will become the Phase 3 Evaluation data. The data LDC collected from July through September of 1998 will not be used in Phase 3 Research. Table 1 summarizes.
 

1998 Data          Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
Phase 2 Research   T    T    D    D    E    E
Phase 3 Research   T    T    T    T    D    D                   E    E    E

Table 1: Distribution of data by month and data set for Phase 2 and Phase 3 Research.
Legend. T=Training Set. D=Development/Test Set. E=Evaluation Set
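
The month-to-set assignment in Table 1 can be written down as a small lookup. The following is a minimal sketch in Python, not part of any TDT tooling; the names MONTHS, PHASE_SPLITS and data_set_for are hypothetical and chosen only for illustration.

```python
# Month-to-data-set assignment transcribed from Table 1.
# T = Training, D = Development/Test, E = Evaluation.
# All names here are hypothetical, for illustration only.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

PHASE_SPLITS = {
    2: {"Jan": "T", "Feb": "T", "Mar": "D", "Apr": "D",
        "May": "E", "Jun": "E"},
    3: {"Jan": "T", "Feb": "T", "Mar": "T", "Apr": "T",
        "May": "D", "Jun": "D",
        "Oct": "E", "Nov": "E", "Dec": "E"},
}


def data_set_for(phase, month):
    """Return 'T', 'D' or 'E' for a 1998 month, or None if the month is unused."""
    return PHASE_SPLITS[phase].get(month)


if __name__ == "__main__":
    for phase in (2, 3):
        # Unused months are shown as '-' in the printed row.
        row = " ".join(data_set_for(phase, m) or "-" for m in MONTHS)
        print(f"Phase {phase}: {row}")
```

Looking up a month that is not used in a phase, such as July through September in Phase 3, returns None rather than raising an error.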

News Sources

The TDT-3 Corpus contains eight English news sources and three Mandarin sources collected roughly from October through December 1998.
 

 

                  English                           Mandarin
Newswire          Associated Press Worldstream      Xinhua News
                  New York Times Newswire Service
WWW Site                                            Zaobao
Broadcast Radio   PRI The World                     VOA Mandarin
                  VOA English
Broadcast TV      CNN Headline News
                  ABC World News Tonight
                  NBC Nightly News
                  MSNBC News with Brian Williams

 

Sampling

Given the large number of stories in a collection of this size and the small percentage of on-topic stories in the TDT-2 Corpus, we have adjusted both the sampling of stories and the rules for defining topics in the hope of yielding a higher percentage of on-topic stories. The TDT-3 English sources fall into three categories: Newswire, Broadcast Radio and Broadcast Television. The English sources contain an estimated 34,600 stories.

The TDT-3 Mandarin sources include Newswire (Xinhua), Broadcast Radio (VOA) and a new category, WWW-based news (Zaobao). We sampled a maximum of 60 stories per day from Xinhua. We included all of the available broadcast radio from VOA's Mandarin service. Of two possible WWW-based news sources, we have excluded VOA WWW summaries given their likely similarity to the radio broadcast. We have, pending IPR negotiations, included Zaobao. Zaobao stories are divided into categories. We have selected those categories most likely to bear fruit and then sampled half of the stories in those categories. Table 2 shows the Zaobao categories and which have been selected.
 

Suggestion   Label   Category            # Stories
Exclude      SP      Singapore news           9019
Include      YX      Asia news                4795
Include      GJ      International            3852
Include      ZG      China, HK, Taiwan        3005
Include      CJ      Business                 6827
Exclude      TY      Sports                   3228
Exclude      SL      Editorial                 272
Exclude      YL      Opinion                  1986
TOTAL                                        18479
SUBSAMPLE                                     9240

Table 2: Zaobao News categories and proposed disposition
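
The TOTAL and SUBSAMPLE rows of Table 2 follow from the per-category counts: the total is the sum over the included categories, and the subsample is half of that total. The sketch below checks this arithmetic in Python. The function names are hypothetical, and rounding the half upward is an assumption made here to match the published figure; the text says only that half of the stories were sampled.

```python
# Story counts per Zaobao category, transcribed from Table 2.
# Each value is (category name, story count, included in sample?).
ZAOBAO_CATEGORIES = {
    "SP": ("Singapore news", 9019, False),
    "YX": ("Asia news", 4795, True),
    "GJ": ("International", 3852, True),
    "ZG": ("China, HK, Taiwan", 3005, True),
    "CJ": ("Business", 6827, True),
    "TY": ("Sports", 3228, False),
    "SL": ("Editorial", 272, False),
    "YL": ("Opinion", 1986, False),
}


def included_total(categories):
    """Sum the story counts of the categories marked as included."""
    return sum(count for _, count, included in categories.values() if included)


def half_sample_size(total):
    """Half of `total`, rounded up (the rounding rule is an assumption)."""
    return (total + 1) // 2


if __name__ == "__main__":
    total = included_total(ZAOBAO_CATEGORIES)
    print(total)                    # 18479
    print(half_sample_size(total))  # 9240
```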

The total number of TDT-2 and TDT-3 Mandarin stories will be approximately 30,000. The total 1999 collection will contain 63,000 stories. This represents an 11% increase in stories over TDT-2. Table 3 shows the superset of TDT-2 and TDT-3 data by source and month and its use in Phase 3 research.

 

1998: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Medium     Source    Marks (in cell order)
Newswire   AP        T
           NYT       T  T
           Xinhua    T
WWW        Zaobao    T
Radio      VOA Eng   T
           VOA Man   T
           PRI       T  T
TV         ABC       T  T  T  D  D
           CNN       T  T  T  D  D
           NBC       X
           MSNBC     X

Table 3: Data by source, month and data set for Phase 3 research. T=Training Set. D=Development/Test Set. E=Evaluation Set. X=No Data Collected or Data Excluded. Red indicates that the work was completed as part of the TDT-2 Corpus.

Topics

The TDT-2 Corpus defined 100 topics. In the TDT-3 Corpus, LDC will define 60 additional topics for purposes of general annotation. Topics were defined in TDT-2 by selecting stories at random from subsets that correspond to all of the data for a single month of a given source. This gave each source and each time period an equal representation in the corpus but otherwise made topic selection a random process. In TDT-3, topics will also be constrained to those that yield at least four on-topic stories in both the English and Mandarin sources. The 60 new topics for TDT-3 will be divided into three new topic lists -- lists six, seven and eight. Each will contain 20 topics. LDC will define these 60 topics in consultation with the sponsors and with NIST.
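
The selection procedure described above amounts to stratified random sampling over (source, month) subsets, followed in TDT-3 by a minimum-count filter. The Python sketch below illustrates the idea under stated assumptions: the story records, their keys ('id', 'source', 'month') and both function names are hypothetical, and the constraint is read as four on-topic stories in each language, which is one interpretation of the text. It is not LDC's actual tooling.

```python
import random
from collections import defaultdict


def pick_seed_stories(stories, per_stratum=1, seed=0):
    """Pick seed stories at random from each (source, month) subset.

    `stories` is an iterable of dicts with the hypothetical keys
    'id', 'source' and 'month'. Every stratum contributes the same
    number of seeds (or all of its stories, if it has fewer).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for story in stories:
        strata[(story["source"], story["month"])].append(story)

    seeds = {}
    for key in sorted(strata):  # sorted for a reproducible draw order
        pool = strata[key]
        seeds[key] = rng.sample(pool, min(per_stratum, len(pool)))
    return seeds


def meets_tdt3_constraint(on_topic_counts, minimum=4):
    """True if a topic has at least `minimum` on-topic stories in each language."""
    return (on_topic_counts.get("English", 0) >= minimum
            and on_topic_counts.get("Mandarin", 0) >= minimum)


if __name__ == "__main__":
    # Toy data: two sources, two months, five stories per subset.
    toy = [{"id": f"{s}-{m}-{i}", "source": s, "month": m}
           for s in ("APW", "XIN") for m in ("Oct", "Nov") for i in range(5)]
    for key, chosen in pick_seed_stories(toy).items():
        print(key, [story["id"] for story in chosen])
    print(meets_tdt3_constraint({"English": 12, "Mandarin": 3}))
```

Sampling within each subset, rather than from the pooled corpus, is what gives every source and time period equal representation regardless of how many stories it contributes.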

Annotation

LDC will annotate all English and Mandarin sources sampled from October through December against the 60 new topics. To provide cross-linguistic Training material, LDC will also annotate all Mandarin data collected from January 1998 through June 1998 against 20 of the 100 TDT-2 topics that have at least 4 on-topic stories in that Mandarin data. The combination of TDT-2 and TDT-3 corpora will thus provide Training, Development/Test and Evaluation data for both English and Mandarin. In support of First Story Detection, LDC will identify the first stories to discuss an additional 120 topics, which will be defined according to the rules used to define the 60 primary topics but will not be exhaustively annotated against all the stories in the corpus. In support of Story Link Detection, LDC will annotate story links for 180 topics, including the 60 used for the bulk of annotation.
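
The topic counts in this section can be tallied as follows. This is a small bookkeeping sketch in Python; the group labels and function names are illustrative and do not come from the TDT documentation.

```python
# Topic groups described in the Topics and Annotation sections.
# Keys and field names are hypothetical labels for illustration.
TOPIC_GROUPS = {
    # Lists six, seven and eight: 3 lists of 20 topics each.
    "primary": {"count": 3 * 20, "exhaustive": True},
    # Topics for which only the first story is identified.
    "first_story_only": {"count": 120, "exhaustive": False},
}


def total_topics(groups):
    """Number of topics that receive story-link annotation."""
    return sum(group["count"] for group in groups.values())


def exhaustively_annotated(groups):
    """Number of topics annotated against every story in the corpus."""
    return sum(group["count"] for group in groups.values() if group["exhaustive"])


if __name__ == "__main__":
    print(total_topics(TOPIC_GROUPS))            # 180
    print(exhaustively_annotated(TOPIC_GROUPS))  # 60
```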