To: tdt-distrib@ldc.upenn.edu
From: Christopher Cieri <ccieri@ldc.upenn.edu>
Subject: Mandarin Resources and Large, Non-Punctual Topics
Date: Tue, 09 Feb 1999 21:36:22 -0800
TDT Participants,
James asked us to list any Mandarin resources we might contribute to
TDT-3. Other than scads of telephone and newswire data, we have the
CALLHOME Mandarin Chinese Lexicon covering 44,405 words with
phonological, morphological and frequency information (but not glosses)
for each word. Xiaoyi Ma has developed software for aligning audio and
its transcript. He has also been working on a translation dictionary
based on parallel text corpora -- and a collection of parallel text. I'm
not sure what help any of this will be for what you are doing -- all
seems a little raw for TDT-3. Back when we did the transcription for
CALLHOME, we used Dragon's segmentation software.
Rich raised the issue of differences in the number of on-topic stories
in TDT-2 Training, Dev/Test and Evaluation sets and how that affects the
research. Since George and Alvin are working on the statistical
significance of any differences among the sets, I'll offer just a few
general comments. Although there have certainly been changes in the
mechanisms by which we think about, define and describe topics, I would
argue that neither the evolution of TDT-2 topics nor variation in
annotation practice is the primary cause of differences. Instead, I
think the differences are caused by one simple fact.
Over any period of time, there will be topics that are frequently
discussed in the news. Given our current practice, if these topics span
the period of news collection, they will have a high probability of
being selected early in the collection. Once selected, these topics
cannot be selected again but they will continue to account for on-topic
stories that will not be part of any subsequent data set.
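To put a rough number on "high probability", here is a minimal sketch. It assumes, which this message does not state, that topics are seeded by drawing stories uniformly at random from the early part of the collection. The function name, the shares and the number of draws are all hypothetical and chosen only for illustration.

```python
def selection_probability(share: float, draws: int) -> float:
    """Chance that at least one of `draws` independent, uniformly random
    stories is on a topic that accounts for `share` of all stories.

    Assumption (not from the message): topics are seeded by random story
    draws, so a topic is "selected" as soon as one drawn story is on it.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if draws < 0:
        raise ValueError("draws must be non-negative")
    # Probability that every draw misses the topic is (1 - share) ** draws.
    return 1.0 - (1.0 - share) ** draws


if __name__ == "__main__":
    hypothetical_draws = 30  # illustrative only, not the actual TDT-2 figure
    for hypothetical_share in (0.001, 0.01, 0.05):
        p = selection_probability(hypothetical_share, hypothetical_draws)
        print(f"share={hypothetical_share:.3f}  P(selected)={p:.3f}")
```

Under that assumption, a topic's chance of being picked early grows quickly with its share of the news, which is the effect described above.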
Consider the list below, which shows, for each training and dev/test
topic, the number of hits it received in Jan-Feb and in Mar-Apr. There are
large topics in both sets. In the dev/test set, India Parliamentary
Elections, National Tobacco Settlement, Jonesboro shooting and James
Earl Ray's Retrial had over 50 on-topic documents each. Between them,
they account for 2/3 of all on-topic stories in the Dev/Test set. They
are comparable to State of the Union Address, Pope visits Cuba, Violence
in Algeria and Superbowl '98 in the Training Set. But the topics that
make a difference are Asian Economic Crisis and Monica Lewinsky Case.
Together they account for 1870 hits in the Training set. That means more
than 10% of ALL stories in the Training Set were on one of these two
topics. It was very likely that they would be chosen for the Training
set and be unavailable for the Dev/Test set. (Note that there were also
587 stories in the Dev/Test set on these two topics.) Since we don't use
Training topics in the Dev/Test set and since stories are rarely about
more than one topic, a total of 1036 dev/test stories (about 6-7%) were
essentially unavailable to Dev/Test topics. Unless we had another
once-in-the-history-of-the-US event during Mar-Apr, the number of
on-topic stories was certain to be lower.
So under current practice, large, ongoing topics present early in the
corpus affect the distribution of on-topic stories in two ways: 1) they
concentrate on-topic stories in the early segment, and 2) they consume
air-time so that topics selected subsequently are less likely to be
fruitful. These are, I think, the major causes of the differences in
TDT-2.
Training topics:
Topic  Jan-Feb  Mar-Apr  Topic
Num.      Hits     Hits  Name
    1     1078      306  Asian Economic Crisis
    2      792      281  Monica Lewinsky Case
    4       16        1
    5        9        5
    6        6        2
    7       24        1
    8       49        5
    9       53        1
   10        7        0
   11      107        0  State of the Union Address
   12      177       14  Pope visits Cuba
   13      652       48  1998 Winter Olympics
   14        2        2
   15     1473      207  Current Conflict with Iraq
   16        6        0
   17       22        0
   18       77       11
   19       69       19
   20       35        4
   21       57        6
   22       30        0
   23      102       17  Violence in Algeria
   24       35        5
   25        1        0
   26       69        1
   27        0        1
   28        9        3
   29        9        2
   30        2        0
   31       31        7
   32       57       69
   33      127        3  Superbowl '98
   34       18        0
   35        6        0
   36        5        0
   37       16       15
Total     5228     1036
Dev/Test topics:
Topic  Jan-Feb  Mar-Apr  Topic
Num.      Hits     Hits  Name
   38        0        1
   39       58       65  India Parliamentary Elections
   40        0        3
   41        1       25
   42        0       27
   43        1       14
   44       40      171  National Tobacco Settlement
   46        0        5
   47        0       29
   48        0      146  Jonesboro shooting
   50        0       11
   52        0        5
   53        0        7
   54        0        1
   55        0        1
   56        2       54  James Earl Ray's Retrial?
   57        0       17
   58        0        1
   59        0        1
   60        0        8
   61        1        4
   62        0        2
   63        0       17
   64        0       12
   65        0       47
   66        0        6
Total      103      680
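The figures quoted in the text can be re-derived from the two lists above. The following is a small check, with the counts copied by hand from the lists; the variable names are only illustrative.

```python
# (Jan-Feb hits, Mar-Apr hits), copied from the lists above.
training_big_two = {
    "Asian Economic Crisis": (1078, 306),
    "Monica Lewinsky Case": (792, 281),
}
devtest_big_four = {
    "India Parliamentary Elections": (58, 65),
    "National Tobacco Settlement": (40, 171),
    "Jonesboro shooting": (0, 146),
    "James Earl Ray's Retrial": (2, 54),
}
# "Total" rows of the two lists: (Jan-Feb hits, Mar-Apr hits).
TRAINING_TOTALS = (5228, 1036)
DEVTEST_TOTALS = (103, 680)

# Hits on the two largest training topics in each period.
big_two_jan_feb = sum(jan_feb for jan_feb, _ in training_big_two.values())
big_two_mar_apr = sum(mar_apr for _, mar_apr in training_big_two.values())

# Mar-Apr hits on the four largest dev/test topics, and their share of
# all Mar-Apr hits on dev/test topics.
big_four_mar_apr = sum(mar_apr for _, mar_apr in devtest_big_four.values())
big_four_share = big_four_mar_apr / DEVTEST_TOTALS[1]

# Share of Jan-Feb training-topic hits due to the two largest topics.
big_two_share = big_two_jan_feb / TRAINING_TOTALS[0]

print("Big two, Jan-Feb hits:", big_two_jan_feb)
print("Big two, Mar-Apr hits:", big_two_mar_apr)
print("Big four dev/test, Mar-Apr hits:", big_four_mar_apr)
print(f"Big four share of dev/test Mar-Apr hits: {big_four_share:.2f}")
print(f"Big two share of training Jan-Feb hits: {big_two_share:.2f}")
```

The sums reproduce the 1870 and 587 quoted above, and the four large dev/test topics come to 436 of 680 Mar-Apr hits, which is the "2/3" in the text.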
--
Christopher Cieri
Executive Director, Linguistic Data Consortium
3615 Market Street, Philadelphia, PA 19104-2608 USA
phone: 215-573-5489, fax: 215-573-2175
mailto:Christopher.Cieri@ldc.upenn.edu
http://www.ldc.upenn.edu
Last updated Thu May 13 09:28:13 1999