To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Problems/updates for some boundary tables
Date: Thu, 22 Oct 1998 12:27:06 EDT
Folks,
Hubert Jin at BBN brought to our attention a set of eight anomalous
"story" entries in "bndtkn" tables (the boundary info for tkntext data).
The anomaly involves a story classified as "NEWS" but having no
"Brecid, Erecid" information -- i.e. no text content in the
corresponding token streams. It's easy enough to find the bad
entries: "grep NEWS *.bndtkn | grep -v Brecid"
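For sites whose tables live in another directory, the same check can be wrapped in a small function. This is only a sketch: the name find_empty_news is made up for illustration, and it relies on the same assumptions as the grep pipeline above (each story entry sits on one line, and an entry with text content carries the string "Brecid").

```shell
# find_empty_news DIR   (hypothetical helper name)
# Print "file:entry" for every NEWS entry in DIR/*.bndtkn that has no Brecid.
find_empty_news() {
    dir=${1:-.}
    for f in "$dir"/*.bndtkn; do
        # Skip the unexpanded pattern when DIR holds no .bndtkn files.
        [ -f "$f" ] || continue
        # Passing /dev/null as a second file makes grep print the file name
        # even though only one real file is searched at a time.
        grep NEWS "$f" /dev/null | grep -v Brecid || true
    done
}
```

Calling "find_empty_news ." in the directory that holds the bndtkn tables should list the same entries as the one-line command.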
Six of the eight cases noted by BBN have been found to be errors of
misclassification that occurred when the files were segmented: the
stories in question should have shown up either as "UNTRANSCRIBED" or
as "MISCELLANEOUS" -- in these cases, the classification error affects
the corresponding "bndasr" table as well.
Here is the list of affected files:
19980104_1600_1630_CNN_HDL.{bndtkn,bndasr}
19980106_1600_1630_CNN_HDL.{bndtkn,bndasr}
19980115_2130_2200_CNN_HDL.{bndtkn,bndasr}
19980120_2300_2400_VOA_TDY.bndtkn (*)
19980206_1600_1630_CNN_HDL.{bndtkn,bndasr}
19980331_1700_1800_VOA_TDY.{bndtkn,bndasr}
(* there was no ASR data for 19980120_2300_2400_VOA_TDY)
The other two cases mentioned by BBN involved a single ABC file:
19980202_1830_1900_ABC_WNT.fdch.bndtkn
Here, it turns out that the last two "NEWS" stories in the broadcast
had adequate text content in the closed-caption (ccap) file, but were
untranscribed in the FDCH transcript version. Topic annotation was
done using the "ccap" file, so the stories in question have been
judged against the 66 topics, and qualify as "NEWS"; the "ccap.bndtkn"
file for this broadcast does not have a problem, and no correction is
called for.
In terms of providing updated versions of table files for the six
samples listed above, I'd like to support two methods of access, and
sites can choose which method they prefer.
The first method will be to post an "updates" tar file, which contains
only the table files that have been corrected (using the same
directory structure as the earlier release, so these files would
"replace" the originals).
The second method will be to post a complete "latest" tar file, which
contains ALL the tables, including the corrected ones. This would
also use the same directory structure as the earlier release, but in
this case the entire "tables" directory would be replaced by the
contents of the tar file.
I will continue to post both kinds of update tar files if and when any
additional corrections are made to tables. (Note that these updates
contain table data only, no sgml, tkntext or asrtext files.)
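As a rough illustration of how the two methods differ on the receiving end, here is a minimal shell sketch. The function names and the release-root argument are hypothetical, and the sketch assumes that paths inside each tar file are relative to the directory that contains "tables"; it is worth listing an archive with "tar -tzf" before unpacking to confirm that.

```shell
# apply_patch RELEASE_ROOT TAR_FILE   (hypothetical helper)
# Method 1: unpack only the corrected tables over the originals.
apply_patch() {
    tar -xzf "$2" -C "$1"
}

# apply_latest RELEASE_ROOT TAR_FILE   (hypothetical helper)
# Method 2: set the old "tables" directory aside, then unpack the full set.
apply_latest() {
    if [ -e "$1/tables.old" ]; then
        echo "apply_latest: $1/tables.old already exists" >&2
        return 1
    fi
    if [ -d "$1/tables" ]; then
        mv "$1/tables" "$1/tables.old" || return 1
    fi
    tar -xzf "$2" -C "$1"
}
```

With the first method, untouched tables stay exactly as they were; with the second, nothing from the old "tables" directory survives except under "tables.old", which can be deleted once the new set has been checked.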
[ftp instructions available on request from Dave Graff <graff@ldc.upenn.edu>]
File sizes of the tar sets are:
tdt_tables_latest.tar.gz 39473 bytes
tdt_tables_patch_981021.tar.gz 6840 bytes
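The byte counts above give a quick way to confirm that a transfer completed, for example after an ftp session left in ASCII mode. A small sketch, with a made-up function name:

```shell
# check_size FILE EXPECTED_BYTES   (hypothetical helper)
# Succeed only if FILE is exactly EXPECTED_BYTES long.
check_size() {
    # Some wc implementations pad the count with spaces, so strip them.
    actual=$(wc -c < "$1" | tr -d ' ') || return 1
    [ "$actual" -eq "$2" ]
}
```

For example, "check_size tdt_tables_latest.tar.gz 39473 && echo ok" prints "ok" only when the size matches the figure listed here.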
Let me know if you have questions or problems, and by all means do not
hesitate to point out any other data errors you find.
Dave Graff
Last updated Wed Oct 28 14:44:12 1998