Topic Detection and Tracking - Phase 3 (TDT-3) Overview
Introduction
The goal of Topic Detection and Tracking - Phase 3 (TDT-3) is to create core technology to monitor multiple streams of news in multiple languages and media (newswire, radio, television, web sites or some future combination or innovation), segmenting the streams into individual stories, detecting new topics and tracking all stories that discuss them. In addition to the TDT-2 tasks of segmentation, detection and tracking, TDT-3 adds the tasks of first story detection and story-link detection. The goal of the latter is to detect links between stories that discuss the same topic even though the topic has not been defined in advance.
Nomenclature
The topic detection and tracking project is now entering its third phase of research and corpus creation. In the first phase, research sites created and tested the TDT Pilot corpus. In 1998, LDC undertook the corpus creation efforts for the second phase corpus which included 6 English sources collected from January to June of 1998. We call this the TDT-2 Corpus. The second phase of the research program ran through 1998.
In 1999, LDC has undertaken two corpus creation efforts for TDT. LDC is annotating 3 Mandarin language sources for the period of January through June 1998. We call this the TDT-2 Mandarin Corpus since it overlaps with the original TDT-2 collection period. LDC is also annotating 8 English sources and 3 Mandarin sources for the period of October through December 1998. We call this the TDT-3 Corpus. The third phase of TDT research runs through 1999.
Phase 2 used the TDT-2 Corpus exclusively. Phase 3 will use the TDT-2 Corpus as Training and Dev/Test data and the TDT-3 Corpus as Evaluation Data.
Collection Period and Data Sets
Again, the TDT-2 Corpus contains data collected daily from January 1998 through June 1998. The TDT-3 Corpus contains data collected from October 1998 through December 1998. The Training and Development/Test data used in Phase 2 will become Training Data for TDT-3 Research. Evaluation data in the TDT-2 Corpus will become Development/Test data for Phase 3 Research. The TDT-3 Corpus -- collected from October through December of 1998 -- will become the Phase 3 Evaluation data. The data LDC collected from July through September of 1998 will not be used in Phase 3 Research. Table 1 summarizes.
| 1998 Data | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Phase 2 Research | T | T | D | D | E | E | | | | | | |
| Phase 3 Research | T | T | T | T | D | D | | | | E | E | E |
Table 1: Distribution of data by month and data set for Phase 2 and Phase 3 Research.
Legend. T=Training Set. D=Development/Test Set. E=Evaluation Set
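The assignment in Table 1 can be written as a small lookup. The Python sketch below is illustrative only: the name `data_set_for` and the numeric month encoding are assumptions made for this example, not part of any LDC tooling, while the month ranges and the T/D/E codes are transcribed from the table and its legend.

```python
# Month ranges (1 = January 1998) for each data set, transcribed from Table 1.
# Keys are research phases; T/D/E follow the table legend.
SPLITS = {
    2: {"T": range(1, 3), "D": range(3, 5), "E": range(5, 7)},
    3: {"T": range(1, 5), "D": range(5, 7), "E": range(10, 13)},
}


def data_set_for(phase, month):
    """Return 'T', 'D' or 'E' for a 1998 month, or None if the month is unused."""
    for label, months in SPLITS[phase].items():
        if month in months:
            return label
    return None


if __name__ == "__main__":
    # Show how a few months are used in Phase 2 versus Phase 3.
    for month in (2, 5, 8, 11):
        print(month, data_set_for(2, month), data_set_for(3, month))
```

For example, May 1998 is Evaluation data in Phase 2 but Development/Test data in Phase 3, and the July through September months map to no data set in either phase.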
News Sources
The TDT-3 Corpus contains eight English news sources and three Mandarin sources collected roughly from October through December 1998.
| | English | Mandarin |
|---|---|---|
| Newswire | Associated Press Worldstream | Xinhua News |
| WWW Site | | Zaobao |
| Broadcast Radio | PRI The World | VOA Mandarin |
| Broadcast TV | CNN Headline News | |
Sampling
Given the large number of stories in a collection of this size and the small percentage of on-topic stories in the TDT-2 Corpus, we have adjusted both the sampling of stories and the rules for defining topics in the hope of yielding a higher percentage of on-topic stories. The TDT-3 English sources fall into three categories: Newswire, Broadcast Radio and Broadcast Television. The English sources contain an estimated 34,600 stories.
The TDT-3 Mandarin sources include Newswire (Xinhua), Broadcast Radio (VOA) and a new category, WWW-based news (Zaobao). We sampled a maximum of 60 stories per day from Xinhua. We included all of the available broadcast radio from VOA's Mandarin service. Of two possible WWW-based news sources, we have excluded VOA WWW summaries given their likely similarity to the radio broadcast. We have, pending IPR negotiations, included Zaobao. Zaobao stories are divided into categories. We have selected those categories most likely to bear fruit and then sampled half of the stories in those categories. Table 2 shows the Zaobao categories and which have been selected.
| Suggestion | Label | Category | # Stories |
|---|---|---|---|
| Exclude | SP | Singapore news | 9019 |
| Include | YX | Asia news | 4795 |
| Include | GJ | International | 3852 |
| Include | ZG | China, HK, Taiwan | 3005 |
| Include | CJ | Business | 6827 |
| Exclude | TY | Sports | 3228 |
| Exclude | SL | Editorial | 272 |
| Exclude | YL | Opinion | 1986 |
| TOTAL | | | 18479 |
| SUBSAMPLE | | | 9240 |
Table 2: Zaobao News categories and proposed disposition
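The TOTAL and SUBSAMPLE rows of Table 2 can be checked with a few lines of arithmetic. The sketch below transcribes the table into a Python dictionary (the variable names are chosen for this example only). It suggests that TOTAL counts only the categories marked Include, and that SUBSAMPLE is half of that total rounded up. The rounding direction is an assumption that happens to reproduce the published figure.

```python
# Story counts per Zaobao category, transcribed from Table 2.
# Each entry is label: (category, included?, story count).
ZAOBAO = {
    "SP": ("Singapore news", False, 9019),
    "YX": ("Asia news", True, 4795),
    "GJ": ("International", True, 3852),
    "ZG": ("China, HK, Taiwan", True, 3005),
    "CJ": ("Business", True, 6827),
    "TY": ("Sports", False, 3228),
    "SL": ("Editorial", False, 272),
    "YL": ("Opinion", False, 1986),
}

# Sum only the categories marked "Include".
included_total = sum(count for _, keep, count in ZAOBAO.values() if keep)

# Half of an odd total is not a whole number; rounding up is an assumption
# that matches the SUBSAMPLE row of the table.
subsample = (included_total + 1) // 2

print(included_total)  # 18479
print(subsample)       # 9240
```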
The total number of TDT-2 and TDT-3 Mandarin stories will be approximately 30,000. The total 1999 collection will contain 63,000 stories. This represents an 11% increase in stories over TDT-2. Table 3 shows the superset of TDT-2 and TDT-3 data by source and month and its use in Phase 3 research.
| | 1998 | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Newswire | AP | T | T | T | T | D | D | | | | E | E | E |
| | NYT | T | T | T | T | D | D | | | | E | E | E |
| | Xinhua | T | T | T | T | D | D | | | | E | E | E |
| WWW | Zaobao | X | T | T | T | D | D | | | | E | E | E |
| Radio | VOA Eng | T | T | T | T | D | D | | | | E | E | E |
| | VOA Man | X | X | T | T | D | D | | | | E | E | E |
| | PRI | T | T | T | T | D | D | | | | E | E | E |
| TV | ABC | T | T | T | T | D | D | | | | E | E | E |
| | CNN | T | T | T | T | D | D | | | | E | E | E |
| | NBC | X | X | X | X | X | X | | | | E | E | E |
| | MSNBC | X | X | X | X | X | X | | | | E | E | E |
Table 3: Data by source, month and data set for Phase 3 research. T=Training Set. D=Development/Test Set. E=Evaluation Set. X=No Data Collected or Data Excluded. Red indicates that the work was completed as part of the TDT-2 Corpus.
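One way to read Table 3 is to ask which sources contribute to each Phase 3 data set once the X cells are taken into account. The following Python sketch encodes only the X cells and the month ranges from the table. The names `EXCLUDED`, `PHASE3` and `sources_in` are invented for this illustration and do not come from any LDC tooling.

```python
# Source labels in the order they appear in Table 3.
SOURCES = ["AP", "NYT", "Xinhua", "Zaobao", "VOA Eng", "VOA Man",
           "PRI", "ABC", "CNN", "NBC", "MSNBC"]

# Months (1 = January 1998) marked X in Table 3, i.e. no data collected
# or data excluded. Sources not listed here have no X cells.
EXCLUDED = {
    "Zaobao": {1},
    "VOA Man": {1, 2},
    "NBC": {1, 2, 3, 4, 5, 6},
    "MSNBC": {1, 2, 3, 4, 5, 6},
}

# Month ranges of the Phase 3 data sets, as in the table.
PHASE3 = {"T": range(1, 5), "D": range(5, 7), "E": range(10, 13)}


def sources_in(label):
    """Sources that contribute at least one month of data to a Phase 3 set."""
    months = set(PHASE3[label])
    return [s for s in SOURCES if months - EXCLUDED.get(s, set())]


if __name__ == "__main__":
    for label in ("T", "D", "E"):
        print(label, sources_in(label))
```

Under this reading, NBC and MSNBC appear only in the Evaluation set, while Zaobao and VOA Mandarin contribute fewer Training months than the other sources.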
Topics
The TDT-2 Corpus defined 100 topics. In the TDT-3 Corpus, LDC will define 60 additional topics for purposes of general annotation. Topics were defined in TDT-2 by selecting stories at random from subsets, each of which corresponds to all of the data for a single month of a given source. This gave each source and each time period an equal representation in the corpus but otherwise made topic selection a random process. In TDT-3, topics will also be constrained to those that yield at least four on-topic stories in both the English and Mandarin sources. The 60 new topics for TDT-3 will be divided into three new topic lists -- lists six, seven and eight. Each will contain 20 topics. LDC will define these 60 topics in consultation with the sponsors and with NIST.
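The TDT-3 constraint on topics, together with the division into lists of 20, can be sketched as a simple filter. The Python below is a hypothetical illustration, not LDC's actual selection procedure. The function names, the shape of the input and the sequential assignment of topics to lists are all assumptions, and the demonstration counts are made up.

```python
MIN_ON_TOPIC = 4   # minimum on-topic stories required in each language
LIST_SIZE = 20     # topics per list
FIRST_LIST = 6     # the new lists are numbered six, seven and eight


def eligible(counts):
    """True if a candidate has enough on-topic stories in both languages."""
    return (counts.get("english", 0) >= MIN_ON_TOPIC
            and counts.get("mandarin", 0) >= MIN_ON_TOPIC)


def assign_lists(candidates):
    """Map list number -> topic ids, filling lists of LIST_SIZE in order.

    Sequential filling is an assumption; the text does not say how
    topics are distributed across the three lists.
    """
    kept = [tid for tid, counts in candidates.items() if eligible(counts)]
    lists = {}
    for i, tid in enumerate(kept):
        lists.setdefault(FIRST_LIST + i // LIST_SIZE, []).append(tid)
    return lists


if __name__ == "__main__":
    # Made-up counts for illustration only; not real TDT-3 data.
    demo = {
        "topic-a": {"english": 12, "mandarin": 5},
        "topic-b": {"english": 9, "mandarin": 3},  # too few Mandarin stories
        "topic-c": {"english": 4, "mandarin": 4},
    }
    print(assign_lists(demo))
```

With 60 eligible candidates, this sketch would fill lists six, seven and eight with 20 topics each.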
Annotation
LDC will annotate all English and Mandarin sources sampled from October through December against the 60 new topics. To provide cross-linguistic Training material, LDC will also annotate all Mandarin data collected from January 1998 through June 1998 against 20 of the 100 TDT-2 topics that have at least 4 on-topic stories in the Mandarin data collected from January to June. The combination of TDT-2 and TDT-3 corpora will thus provide Training, Development/Test and Evaluation data for both English and Mandarin. In support of First Story Detection, LDC will identify the first stories to discuss an additional 120 topics. These topics will be defined according to the rules used to define the 60 primary topics but will not be exhaustively annotated against all the stories in the corpus. In support of Story Link Detection, LDC will annotate story links for 180 topics, including the 60 used for the bulk of annotation.