To: Sheera_Knecht@dragonsys.com, strassel@ldc.upenn.edu, graff@ldc.upenn.edu
From: Christopher Cieri <ccieri@ldc.upenn.edu>
Subject: Re: TDT2 topics with no stories
Date: Tue, 15 Aug 2000 10:56:22 -0400
Hi Sheera,
There were several topics in TDT-2 with no on-topic stories. I don't
remember the actual topic numbers but am copying Dave Graff and Stephanie
Strassel who should be able to provide them. It is true that all 100 topics
were annotated in TDT2 English. For those topics with zero hits, we were
really not able to find any on-topic stories in the 6-month corpus. This seems
odd if you know that topics are defined on the basis of a seed story. Here's
how it happened. During TDT-2 we were annotating the corpus as we were
collecting. After we collected the audio, we sent it out for transcription,
segmented the resulting transcripts into stories and then began annotating.
Selecting topics takes time and we were under a very tight schedule, so we had to
select topics for annotation as soon as the transcripts came in.
Unfortunately, this meant that in a few cases, the one story we used to define
a topic happened to be the only on-topic story we could find in the corpus and
was itself eventually rejected due to problems in its formatting, etc. This
doesn't happen in TDT-3 because the story inventory was stable before topic
selection began. This is also why we are cautious about scheduling for the
TDT-4 corpus. If we don't have enough time to complete
transcription/segmentation before topic selection and annotation, we could face
similar problems.
Dave, Stephanie, can you check the numbers of the empty topics and copy the
list on your response?
Chris
Sheera_Knecht@dragonsys.com wrote:
> Chris:
>
> I seem to be finding that 4 topics (20003, 20045, 20049 and 20051) have no
> on-topic stories designated in either English or Mandarin for the entire
> TDT2 6 month time frame (jan-june).
> Is that correct? We'd like to know if all 100 topics (20001-20100) were
> annotated, at least for English, for TDT2?
>
> Thanks,
> Sheera
--
Christopher Cieri
Executive Director, Linguistic Data Consortium
3615 Market Street, Philadelphia, PA 19104-2608 USA
phone: 215-573-5489, fax: 215-573-2175
mailto:Christopher.Cieri@ldc.upenn.edu
http://www.ldc.upenn.edu
Last updated Tue Sep 19 14:30:57 2000