
To: tdt-distrib@unagi.cis.upenn.edu
From: Stephanie Strassel <strassel@ldc.upenn.edu>
Subject: finding 'top 2 no stories' in TDT corpora
Date: Tue, 18 Apr 2000 14:57:20 -0400

Hello all,

At the most recent TDT PI meeting, the LDC agreed to identify the 'top 2
no stories' for each topic in the TDT corpora. Based on discussions at
that meeting and consultations with the sponsors, we have come up with
the following plan.

We will use a search engine (developed by Mike Schultz) to identify the
'top 2 no' stories in both English and Mandarin for the TDT2 and TDT3
corpora. We will identify 2 stories for each of 100 English TDT2
topics, 20 Mandarin TDT2 topics, and 60 TDT3 topics for both English and
Mandarin. All of the NO stories identified will come from the training
epoch (before the Nt=4th story) of the relevant corpus.

The process of locating the NO stories will be as follows: the annotator
will issue a query for each topic, using the training stories (the first
4 YES stories for that topic) as the query. The search engine will scan
across the stories in the training epoch, excluding the stories already
judged as YES or BRIEF for that topic, and will return a
relevance-ranked list of stories. The annotator will review this list
to identify the two highest-ranking stories for each language that are
truly off-topic; these will constitute the 'top 2 no' stories.
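
For readers who find code easier to follow, the procedure above can be
sketched roughly as below. This is only an illustration, not the actual
search engine: the bag-of-words cosine ranking is a stand-in for
whatever relevance measure the real engine uses, and every name here
(rank_candidates, top_two_no, is_off_topic, the story IDs) is
hypothetical. The simple \w+ tokenizer would also not segment Mandarin
text properly.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    # Lowercased word counts; an assumed, very simple text representation.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_candidates(training_stories, epoch_stories, judged_yes_or_brief):
    # The query is built from the training stories (the first 4 YES
    # stories). Stories already judged YES or BRIEF are excluded.
    query = bag_of_words(" ".join(training_stories))
    scored = [
        (cosine(query, bag_of_words(text)), story_id)
        for story_id, text in epoch_stories.items()
        if story_id not in judged_yes_or_brief
    ]
    # Highest score first; ties broken by story ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [story_id for _, story_id in scored]


def top_two_no(ranked_ids, is_off_topic):
    # is_off_topic stands in for the annotator's judgment of each story.
    picks = []
    for story_id in ranked_ids:
        if is_off_topic(story_id):
            picks.append(story_id)
            if len(picks) == 2:
                break
    return picks
```

In this sketch the ranking and the human review are kept separate on
purpose: the engine only proposes candidates, and a story counts as a
'top 2 no' story only after the annotator confirms it is off-topic.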

If you have any comments or questions about this plan, please let me
know. We hope to begin annotation of the 'top no' stories shortly.

Stephanie

--
Stephanie Strassel
Linguistic Data Consortium
strassel@ldc.upenn.edu

Last updated Wed May 24 17:18:23 2000