To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: ASR TEXT - doc has no data - judged as "yes"
Date: Tue, 04 Aug 1998 17:11:13 EDT

I did a quick scan of the tables in the training and devtest
releases, and found a total of 9 stories that have the problem just
mentioned by Sreenivasa Sista.

The problem is this: these stories underwent normal topic labeling
(by LDC annotators working from transcripts or closed captions), they
were judged relevant to one of the target topics, and the corpus does
include ASR output for the audio sample files containing these
stories, BUT... the ASR system apparently produced no output at all
for the time segments occupied by these relevant stories.

Here is a listing of the affected entries in the topic_relevance.table
files from the training and devtest sets:

-- In the training set (Jan/Feb material, released 980522): --

<ONTOPIC topicid=2 level=YES docno=CNN19980121.1600.0983 fileid=19980121_1600_1630_CNN_HDL comments=NO>
<ONTOPIC topicid=2 level=YES docno=CNN19980218.2130.0394 fileid=19980218_2130_2200_CNN_HDL comments=NO>
<ONTOPIC topicid=13 level=YES docno=ABC19980220.1830.0989 fileid=19980220_1830_1900_ABC_WNT comments=NO>
<ONTOPIC topicid=15 level=YES docno=CNN19980225.1130.0470 fileid=19980225_1130_1200_CNN_HDL comments=NO>
<ONTOPIC topicid=15 level=YES docno=CNN19980127.1600.0416 fileid=19980127_1600_1630_CNN_HDL comments=NO>
<ONTOPIC topicid=18 level=YES docno=CNN19980201.1130.0429 fileid=19980201_1130_1200_CNN_HDL comments=NO>

-- In the devtest set (March/April material, released 980708): --

<ONTOPIC topicid=44 level=YES docno=CNN19980427.1600.0892 fileid=19980427_1600_1630_CNN_HDL comments=NO>
<ONTOPIC topicid=65 level=YES docno=CNN19980428.1600.0351 fileid=19980428_1600_1630_CNN_HDL comments=NO>
<ONTOPIC topicid=66 level=YES docno=CNN19980410.1130.1333 fileid=19980410_1130_1200_CNN_HDL comments=NO>

So, if you happen to look for these "docno" entries in the
corresponding ".bndasr" tables, you will find "Bsec" and "Esec"
values, but no "Brecid" or "Erecid" information, and no recognized
words in the ".asr" text file.
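For anyone who wants to repeat the check, here is a rough Python sketch
of the lookup described above. It is not the script used for the scan.
It assumes that the ".bndasr" tables use the same one-tag-per-line,
attribute=value layout as the ONTOPIC entries listed earlier, and that
each boundary entry carries a "docno" attribute; both are assumptions,
so adjust the parsing to the actual table format. The function names
are made up for illustration.

```python
def parse_tag(line):
    """Split a line like '<ONTOPIC topicid=2 level=YES ...>' into
    (tag_name, attribute_dict). Returns None for anything else."""
    line = line.strip()
    if not (line.startswith("<") and line.endswith(">")):
        return None
    parts = line[1:-1].split()
    if not parts:
        return None
    attrs = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    return parts[0], attrs


def relevant_docnos(relevance_lines):
    """Collect docno values of ONTOPIC entries judged level=YES."""
    docnos = set()
    for line in relevance_lines:
        parsed = parse_tag(line)
        if parsed is None:
            continue
        tag, attrs = parsed
        if tag == "ONTOPIC" and attrs.get("level") == "YES" and "docno" in attrs:
            docnos.add(attrs["docno"])
    return docnos


def stories_without_asr(boundary_lines):
    """Collect docno values whose boundary entry has Bsec and Esec
    but neither Brecid nor Erecid (assumed .bndasr layout)."""
    silent = set()
    for line in boundary_lines:
        parsed = parse_tag(line)
        if parsed is None:
            continue
        _, attrs = parsed
        has_times = "Bsec" in attrs and "Esec" in attrs
        has_recids = "Brecid" in attrs or "Erecid" in attrs
        if "docno" in attrs and has_times and not has_recids:
            silent.add(attrs["docno"])
    return silent


def relevant_but_silent(relevance_lines, boundary_lines):
    """Stories judged YES for some topic that have no ASR output."""
    return relevant_docnos(relevance_lines) & stories_without_asr(boundary_lines)
```

Feeding relevant_but_silent() the lines of a topic_relevance.table file
and of the matching ".bndasr" tables should give back the "docno" values
of the kind listed above, provided the layout assumptions hold.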

I also noticed that there are an additional 25 stories (6 in training,
22 in devtest) which are likewise present (and topic-labeled) in
transcript files and absent in ASR files, but which were not relevant
to any target topic.

Dave Graff

Last updated Wed Sep 9 09:40:53 1998