To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: More Problems with TDT2 devset data -- follow-up
Date: Tue, 04 Aug 1998 17:45:44 EDT
To follow up on an earlier report about boundary table records for
"NEWS" stories that appeared to contain no "recid" information (hence
no words): there were a total of 34 such cases in the ASR boundary
tables of the two releases -- I described these in my previous message.
These appear to be cases where the ASR system simply failed to produce
output over the duration of the given story. (I haven't checked every
case, but this is true for the ones I've checked.)
There are also 9 boundary records in the "bndtkn" files -- 8 in the
training release, 1 in the devtest release -- that have "doctype=NEWS"
but lack "recid" information. These 9 cases should all have been
marked as "MISCELLANEOUS" (or possibly as "UNTRANSCRIBED") rather than
as "NEWS" -- they really do not have any text content. (Three of the
cases, from APW, actually do contain weather information, but also
contain some anomaly that caused them to be mis-parsed by the
newswire conditioning filter -- they should have been eliminated as
rejects.)
Here is the listing of these 9 erroneous entries from the "bndtkn"
files -- "doctype" should be "MISCELLANEOUS" in all cases:
tdt_deliv_980522/tables/19980104_1600_1630_CNN_HDL.bndtkn:<BOUNDARY docno=CNN19980104.1600.1304 doctype=NEWS Bsec=1304.89 Esec=1316.29>
tdt_deliv_980522/tables/19980106_1600_1630_CNN_HDL.bndtkn:<BOUNDARY docno=CNN19980106.1600.1348 doctype=NEWS Bsec=1348.07 Esec=1355.54>
tdt_deliv_980522/tables/19980115_2130_2200_CNN_HDL.bndtkn:<BOUNDARY docno=CNN19980115.2130.1163 doctype=NEWS Bsec=1163.35 Esec=1173.50>
tdt_deliv_980522/tables/19980118_1743_1920_APW_ENG.bndtkn:<BOUNDARY docno=APW19980118.0856 doctype=NEWS>
tdt_deliv_980522/tables/19980120_2300_2400_VOA_TDY.bndtkn:<BOUNDARY docno=VOA19980120.2300.1691 doctype=NEWS Bsec=1691.54 Esec=1786.66>
tdt_deliv_980522/tables/19980205_1106_1135_APW_ENG.bndtkn:<BOUNDARY docno=APW19980205.0916 doctype=NEWS>
tdt_deliv_980522/tables/19980206_1600_1630_CNN_HDL.bndtkn:<BOUNDARY docno=CNN19980206.1600.1401 doctype=NEWS Bsec=1401.82 Esec=1414.26>
tdt_deliv_980522/tables/19980208_1643_1851_APW_ENG.bndtkn:<BOUNDARY docno=APW19980208.0822 doctype=NEWS>
tdt_deliv_980708/tables/19980331_1700_1800_VOA_TDY.bndtkn:<BOUNDARY docno=VOA19980331.1700.2954 doctype=NEWS Bsec=2954.97 Esec=2961.88>
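For anyone who would rather patch local copies than fix these by hand,
here is a minimal sketch of the check and the correction described
above. It rests on assumptions: that each record sits on one line, that
attributes are written as key=value pairs separated by whitespace, and
that "recid" information shows up as one or more attributes whose names
contain "recid". The function names are made up for illustration and
are not part of any TDT2 tooling.

```python
import re

# key=value pairs inside a <BOUNDARY ...> tag (assumed layout).
ATTR_RE = re.compile(r"(\w+)=([^\s>]+)")


def parse_boundary(line):
    """Return the attributes of a <BOUNDARY ...> record as a dict, or None."""
    start = line.find("<BOUNDARY")
    if start < 0:
        return None
    # Slicing from the tag also skips any "filename:" prefix from grep output.
    return dict(ATTR_RE.findall(line[start:]))


def is_empty_news(attrs):
    """True for doctype=NEWS records that carry no recid attribute at all."""
    # Assumption: recid information appears as attributes named like *recid*.
    has_recid = any("recid" in key.lower() for key in attrs)
    return attrs.get("doctype") == "NEWS" and not has_recid


def fix_doctype(line):
    """Relabel an empty NEWS record as MISCELLANEOUS; leave other lines alone."""
    attrs = parse_boundary(line)
    if attrs is not None and is_empty_news(attrs):
        return line.replace("doctype=NEWS", "doctype=MISCELLANEOUS", 1)
    return line


if __name__ == "__main__":
    sample = "<BOUNDARY docno=APW19980118.0856 doctype=NEWS>"
    print(fix_doctype(sample))
```

Running fix_doctype over every line of a "bndtkn" file should change
only the records listed above and leave every other line untouched.
Note that the same test applied to the ASR boundary tables would also
relabel the 34 cases from my previous message, which may or may not be
what you want.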
Dave Graff
Last updated Wed Sep 9 09:40:54 1998