(032) previous ~ index ~ next
To: tdt-distrib@ldc.upenn.edu
From: Christopher Cieri <ccieri@ldc.upenn.edu>
Subject: Topic Selection Criteria
Date: Wed, 17 Feb 1999 11:10:31 -0800
TDT Folks,
Charles, George, Dave and I have been discussing topic selection
criteria and wanted to get your comments. Note that I am viewing topic
selection as distinct from topic definition or explication. The latter
has been discussed on this list in the past few days and LDC is
working on a study of topic definition consistency as well as a summary
of inter-annotator consistency that I will send along shortly. In the
meantime, here are our thoughts on topic selection criteria.
In our discussions of topic selection, we came up with 11 criteria. I am
assuming in all cases that we want to keep the criteria of:
1) equal representation of sources in the topic selection process
2) seed article chosen at random from a source
3) seminal event determined from reading seed article (though not
necessarily explicitly named in seed article)
4) seminal event expanded into topic according to rules of
interpretation
So, the criteria open to debate are:
5) Number of Topics -- For TDT-3, we began thinking we would annotate 50
new topics from the October through December material. However, the
temporal and monetary cost of annotation is related primarily to
the number of stories and topic lists. Knowing that our current
annotation interface -- and more importantly our annotation staff -- can
handle about 20 topics per topic list, it would be more efficient to
select 20, 40 or 60 topics. Currently we are planning to select 60.
6) Topics selected from English only or also from Mandarin -- All TDT-3
topics will be annotated against both English and Mandarin data from
October through December. However, we have the option of defining TDT-3
topics based upon stories in the English sources, stories in the Mandarin
sources or both. We have annotators who are Mandarin/English bilinguals
and are familiar with TDT-2. We could train one of them to select topics
based upon the Mandarin sources. So far, we are planning to choose
topics from all sources, English and Mandarin, equally.
7) Selection limited to those topics likely to yield a minimum of
on-topic stories -- In TDT-2 several topics yielded too few stories to
permit tracking. We could adjust our TDT-3 procedures to improve the
yield on topics. One possibility would be to submit seed articles as
queries into a search engine, review hits and assure ourselves that
there are at least 4 (or 8 or 16) on-topic stories before accepting the
topic for annotation. We are currently planning to do this. We have the
means at hand to do this for English and are working on the same
capacity in Mandarin. We will need to agree upon a minimum of on-topic
stories to qualify a topic. I like 4.
8) Selection limited to those topics likely to yield a minimum of
on-topic stories IN OTHER SOURCES -- The annotators tell me that certain
sources generate topics that are only fruitful in the original source.
We are checking the numbers to see to what extent this affects the
corpus but intuitively it rings true (Public radio has more stories on
international human rights issues than network news). We could expand
the procedure in 7 to reject topics that don't cross-pollinate (to
continue our seedy metaphor) or we could be content with assuring that
all sources are roughly equally represented. Our current thought is that
this will be more work than is justifiable. However, we are looking for ways
to use the observation that some topics are only fruitful in a single
source to improve annotation or at least QC.
9) Selection limited to those topics likely to yield a minimum of
on-topic stories IN OTHER LANGUAGES -- This may be the most important
criterion for TDT-3. Many of the TDT-3 sources have large local news
components. In our TDT-3 proposal, we suggested rejecting entire sources
(C.N.A.) because of this bias. For the remaining sources, we may want to
select at least some of the topics carefully to assure that we have
enough topics with enough hits in both Mandarin and English to make the
cross-linguistic research possible. Currently, we plan to do this though
we are a little concerned about its effect on the time required to define
topics. We would want to set the minimum low (say 4) and to consider
this a desideratum for the corpus as a whole but not a requirement for
all topics.
10) Distribution of large and small topics -- In response to the
observation that large and small topics are unevenly distributed across
Training, Development/Test and Evaluation sets, it is conceivable that
we could select topics to even the distribution. In TDT-3 as defined,
this will not be possible because we already have Training and Dev/Test
topics selected but we could take this approach for subsequent
collections. I should also note that although I've presented this as a
possibility, I am not actually in favor of it. I think selecting the
topics for a corpus all at one time and then distributing them over data
sets so that large/small topics were equally represented runs counter to
the ultimate use of TDT technology as I understand it.
11) Selection including multiple instances of a certain type of topic --
Charles has suggested that we may also want to consider the idea of
making sure that the corpus contains several topics of the same type --
e.g., more than one airline crash topic, more than one Japanese
earthquake. To the extent that current events cooperate and to the
extent that our annotators can keep the topics distinct, their lexical
similarity will push the development of more robust algorithms than an
entirely random selection process would.
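As a rough illustration only, the yield checks proposed in criteria 7 and
9 might be sketched as follows in Python. Nothing here describes an
existing LDC tool: the Hit record, the function names and the language
labels are all hypothetical, and the thresholds of 4 are simply the
values suggested above, still open to discussion.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search-engine hit reviewed by an annotator (hypothetical record)."""
    story_id: str
    language: str   # assumed labels: "english" or "mandarin"
    on_topic: bool  # the annotator's judgment after reviewing the hit


def on_topic_counts(hits):
    """Count on-topic stories per language."""
    return Counter(h.language for h in hits if h.on_topic)


def qualifies(hits, min_total=4):
    """Criterion 7: accept a topic only if it yields at least
    min_total on-topic stories overall."""
    return sum(on_topic_counts(hits).values()) >= min_total


def is_cross_lingual(hits, min_per_language=4,
                     languages=("english", "mandarin")):
    """Criterion 9: true if every listed language has at least
    min_per_language on-topic stories."""
    counts = on_topic_counts(hits)  # a Counter returns 0 for missing keys
    return all(counts[lang] >= min_per_language for lang in languages)
```

Since criterion 9 is meant as a desideratum for the corpus as a whole,
a check like is_cross_lingual would more naturally be tallied across the
selected topics than used to reject any single topic.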
We will all be interested in your comments. Charles, George and Dave,
please correct me if I haven't represented your views accurately.
Best wishes,
Chris
--
Christopher Cieri
Executive Director, Linguistic Data Consortium
3615 Market Street, Philadelphia, PA 19104-2608 USA
phone: 215-573-5489, fax: 215-573-2175
mailto:Christopher.Cieri@ldc.upenn.edu
http://www.ldc.upenn.edu
Last updated Thu May 13 09:28:15 1999