
To: graff@unagi.cis.upenn.edu, tdt-distrib@unagi.cis.upenn.edu
From: Jon Yamron <Jon@dragonsys.com>
Subject: Re: More Problems with TDT2 devset data -- follow-up -Reply
Date: Wed, 05 Aug 1998 10:41:05 -0500

We should identify the audio disks and see what's going on. I know there were some early
cases in which shows were not recorded properly. If that is not the case here, we should
re-recognize the show.

- Jon

>>> David Graff <graff@unagi.cis.upenn.edu> 08/04/98 04:45pm >>>

This follows up an earlier report about boundary table records for "NEWS" stories that
appeared to contain no "recid" information (hence no words). There were a total of 34 such
cases in the ASR boundary tables of the two releases -- I described these in my previous
message.
These appear to be cases where the ASR system simply failed to produce output over the
duration of the given story. (I haven't checked every case, but this is true for the ones I've
checked.)

There are also 9 boundary records in the "bndtkn" files -- 8 in the training release, 1 in the
devtest release -- that have "doctype=NEWS" but lack "recid" information. These 9 cases
should all have been marked as "MISCELLANEOUS" (or possibly as
"UNTRANSCRIBED") rather than as "NEWS" -- they really do not have any text content.
(Three of the cases, from APW, actually do contain weather information, but also contain
some anomaly that caused them to be mis-parsed by the newswire conditioning filter -- they
should have been eliminated as rejects.)

Here is the listing of these 9 erroneous entries from the "bndtkn" files -- "doctype" should be
"MISCELLANEOUS" in all cases:

tdt_deliv_980522/tables/19980104_1600_1630_CNN_HDL.bndtkn:<BOUNDARY
docno=CNN19980104.1600.1304 doctype=NEWS Bsec=1304.89 Esec=1316.29>
tdt_deliv_980522/tables/19980106_1600_1630_CNN_HDL.bndtkn:<BOUNDARY
docno=CNN19980106.1600.1348 doctype=NEWS Bsec=1348.07 Esec=1355.54>
tdt_deliv_980522/tables/19980115_2130_2200_CNN_HDL.bndtkn:<BOUNDARY
docno=CNN19980115.2130.1163 doctype=NEWS Bsec=1163.35 Esec=1173.50>
tdt_deliv_980522/tables/19980118_1743_1920_APW_ENG.bndtkn:<BOUNDARY
docno=APW19980118.0856 doctype=NEWS>
tdt_deliv_980522/tables/19980120_2300_2400_VOA_TDY.bndtkn:<BOUNDARY
docno=VOA19980120.2300.1691 doctype=NEWS Bsec=1691.54 Esec=1786.66>
tdt_deliv_980522/tables/19980205_1106_1135_APW_ENG.bndtkn:<BOUNDARY
docno=APW19980205.0916 doctype=NEWS>
tdt_deliv_980522/tables/19980206_1600_1630_CNN_HDL.bndtkn:<BOUNDARY
docno=CNN19980206.1600.1401 doctype=NEWS Bsec=1401.82 Esec=1414.26>
tdt_deliv_980522/tables/19980208_1643_1851_APW_ENG.bndtkn:<BOUNDARY
docno=APW19980208.0822 doctype=NEWS>

tdt_deliv_980708/tables/19980331_1700_1800_VOA_TDY.bndtkn:<BOUNDARY
docno=VOA19980331.1700.2954 doctype=NEWS Bsec=2954.97 Esec=2961.88>
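
A check like the one behind this listing can be sketched in Python as follows. This is only an illustration, not the procedure actually used to produce the listing. It assumes that BOUNDARY tags carry space-separated name=value attributes, as in the entries above, and that "recid" information shows up as one or more attributes whose names contain "recid". The helper names (suspect_docnos and so on) and the attribute names Brecid/Erecid in the sample are hypothetical.

```python
import re

# One <BOUNDARY ...> tag; the tag may be wrapped across lines, as in the listing.
BOUNDARY_RE = re.compile(r"<BOUNDARY\s+([^>]*)>")
# Space-separated name=value attributes (layout assumed from the listing).
ATTR_RE = re.compile(r"(\w+)=(\S+)")


def parse_attrs(attr_text):
    """Turn 'docno=... doctype=NEWS ...' into a dict of attribute values."""
    return dict(ATTR_RE.findall(attr_text))


def is_suspect(attrs):
    """True for a NEWS record with no attribute whose name contains 'recid'."""
    has_recid = any("recid" in name.lower() for name in attrs)
    return attrs.get("doctype") == "NEWS" and not has_recid


def suspect_docnos(text):
    """Return the docno of every suspect BOUNDARY record found in text."""
    found = []
    for match in BOUNDARY_RE.finditer(text):
        attrs = parse_attrs(match.group(1))
        if is_suspect(attrs):
            found.append(attrs.get("docno"))
    return found


if __name__ == "__main__":
    # The first record is taken from the listing; the second is made up, with
    # hypothetical Brecid/Erecid attributes, to show a record that passes.
    sample = (
        "<BOUNDARY\n"
        "docno=CNN19980104.1600.1304 doctype=NEWS Bsec=1304.89 Esec=1316.29>\n"
        "<BOUNDARY docno=XXX00000000.0000.0001 doctype=NEWS Brecid=1 Erecid=20>\n"
    )
    print(suspect_docnos(sample))
```

Run over the ASR boundary tables, the same test would also flag the 34 ASR cases mentioned earlier, so it separates the two groups only if it is applied to the "bndtkn" files alone.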


Dave Graff



Last updated Wed Sep 9 09:40:54 1998