To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: TDT2 materials for January
Date: Wed, 29 Apr 1998 10:57:38 EDT

Folks,

I apologize for not providing more details earlier about the problems
that have been found in the original release of the TDT tables and
data. I'll summarize what has been found so far (all of it is being
fixed in the next release); the problems range from trivial to
potentially confusing.

In the "asrtext" files, originally produced by Dragon and then
slightly reformatted here before being released, the final line of
each file contains "</ASR>" where it should contain "</DOCSET>".

In the "bndtkn" files for broadcast source data (ABC, CNN, PRI, VOA),
I failed to include the "doctype=" attribute for each entry, and the
token-index attributes are called "bgnrec=" and "endrec=", where they
should be called "Brecid=" and "Erecid=". These tables are also
lacking the final closing tag "</BOUNDSET>" at the end of each file.
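
For anyone who wants to patch an old copy of a "bndtkn" table in the
meantime, the renaming and the missing closing tag could be handled
roughly as below. This is only a sketch, not the script used to build
the release: it assumes each entry sits on one line with "name=value"
attributes, the function names are made up for illustration, and it
cannot restore the missing "doctype=" attribute.

```python
import re

# Old attribute names mapped to the names described above.
RENAMES = {"bgnrec": "Brecid", "endrec": "Erecid"}
_ATTR = re.compile(r"\b(bgnrec|endrec)=")


def rename_attrs(line: str) -> str:
    """Rename bgnrec=/endrec= to Brecid=/Erecid= in one table line."""
    return _ATTR.sub(lambda m: RENAMES[m.group(1)] + "=", line)


def patch_table(lines: list[str], closing_tag: str = "</BOUNDSET>") -> list[str]:
    """Rename attributes and append the closing tag if it is missing."""
    out = [rename_attrs(ln) for ln in lines]
    # Look at the last non-blank line to decide whether the tag is present.
    non_blank = [ln for ln in out if ln.strip()]
    if not non_blank or non_blank[-1].strip() != closing_tag:
        out.append(closing_tag)
    return out
```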

In both "bndasr" and "bndtkn" tables for broadcast sources, the
original script I used to create the tables did not do the right
thing with respect to DOC units that contained no text in either the
original SGML or the original ASR output data. In these cases of
"empty" DOC units, the tables contain funny "recid" attributes;
typically one or both of Brecid and Erecid equal 0, or bgnrec=1 and
endrec=0. The right thing, I think, is simply not to have "Brecid"
and "Erecid" attributes for such DOCs, and this will be the approach
in the next release.
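
The symptoms described above suggest a simple filter for spotting
"empty" DOC entries in an old table and dropping their recid
attributes. The sketch below is illustrative only: it assumes the
attributes of one entry have already been parsed into a dict of
strings, and that a value of 0, or an end value smaller than the
begin value, never occurs for a DOC that has text. The function name
is hypothetical.

```python
def drop_empty_recids(attrs: dict[str, str]) -> dict[str, str]:
    """Return a copy of attrs without recid attributes for empty DOCs."""
    # Accept either the new names or the old bndtkn names.
    begin_key = "Brecid" if "Brecid" in attrs else "bgnrec"
    end_key = "Erecid" if "Erecid" in attrs else "endrec"
    if begin_key not in attrs or end_key not in attrs:
        return dict(attrs)

    begin, end = int(attrs[begin_key]), int(attrs[end_key])
    # Assumption: these patterns only arise for DOC units with no text.
    if begin == 0 or end == 0 or end < begin:
        return {k: v for k, v in attrs.items() if k not in (begin_key, end_key)}
    return dict(attrs)
```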

In the SGML data, the following files had errors in the timing
information associated with the story boundaries, as indicated (the
line numbers identify the locations of the affected (adjacent)
"END_TIME" and "DATE_TIME" tags, respectively):

19980111_1130_1200_CNN_HDL.sgm  lines 406,411:   11:46:23.00 should be 11:46:20.90

19980112_1600_1630_CNN_HDL.sgm  lines 168,173:   16:06:29.60 should be 16:06:26.72
                                lines 179,184:   16:06:26.72 should be 16:06:29.60

19980106_2100_2200_VOA_WRP.sgm  lines 1103,1108: 22:01:00.00 should be 21:38:12.50

19980131_0130_0200_CNN_HDL.sgm  line 4:          11:30:00.00 should be 1:30:00.00

These were cases where one story boundary was mistakenly tagged as
being later in time than a following story boundary, which would have
caused bad alignment of the ASR data. (You'll notice that the bndasr
table for 19980131_0130_0200_CNN_HDL has negative values for all the
"Bsec" and "Esec" attributes, and that all the recid attributes are
clearly wrong. In the other files, the damage to the bndasr tables
would have been more localized.)
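
Errors of this kind can be caught by checking that the boundary times
in a file never go backwards. The sketch below assumes the time
strings have already been pulled out of the SGML in file order, and
that a broadcast does not cross midnight; the function names are
hypothetical.

```python
def to_seconds(hms: str) -> float:
    """Convert 'H:MM:SS.ss' or 'HH:MM:SS.ss' to seconds since midnight."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def out_of_order(times: list[str]) -> list[int]:
    """Return each index i where times[i] is later than times[i + 1]."""
    secs = [to_seconds(t) for t in times]
    return [i for i in range(len(secs) - 1) if secs[i] > secs[i + 1]]
```

Given the two swapped values from 19980112_1600_1630_CNN_HDL.sgm,
out_of_order(["16:06:29.60", "16:06:26.72"]) returns [0], and the
corrected order returns an empty list.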

Finally, I think my first treatment of aligning ASR data to story
boundaries did not deal properly with the case where the transcripts
or closed-caption text did not extend to the end of the audio stream.
In these cases, there would be some amount of text in the ASR data
for which there was no corresponding DOC unit in the SGML text. In
the earlier release, all those extra words at the end were simply
lumped into the last DOC unit. In the next release, this is handled
in the "bndasr" tables by adding a final BOUNDARY entry in which
"docno=UNASSIGNED" and "doctype=UNTRANSCRIBED".
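
To make the idea concrete, here is one way such a final entry could
be derived. It is a sketch under several assumptions that the message
does not spell out: ASR words carry start times in seconds and are
sorted by time, record ids are 1-based positions in the word list,
and a word belongs to the untranscribed tail if it starts at or after
the end time of the last DOC unit. The function name is hypothetical.

```python
from typing import Optional


def trailing_entry(word_times: list[float], last_doc_end: float) -> Optional[dict]:
    """Build a final entry for ASR words that follow the last DOC unit."""
    # Record ids are assumed to be 1-based positions in the ASR word list.
    extra = [i + 1 for i, t in enumerate(word_times) if t >= last_doc_end]
    if not extra:
        return None  # the transcript reaches the end of the audio
    return {
        "docno": "UNASSIGNED",
        "doctype": "UNTRANSCRIBED",
        "Brecid": extra[0],
        "Erecid": extra[-1],
    }
```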

Dave Graff

Last updated Wed Sep 9 09:40:47 1998