To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: Bad boundaries in PRI ASR files
Date: Wed, 28 Oct 1998 13:23:03 EST
My thanks to Bowden for that report about the discrepancy he found
between the story boundaries and the ASR file content. The source of
the problem was a mistake on my part in preparing the Oct. 6 release.
Eight audio recordings from January were resampled from tapes at
various times (for various reasons) -- and this happened AFTER the
initial cdroms of audio data had been sent to Dragon for ASR
processing. Then, manual transcription and segmentation of story
boundaries were done using the newly resampled audio files. The
result was that the original ASR output was significantly out of sync
with the published SGML and boundary data, because the starting point
of the resampled file differed by as much as 60 seconds from the
original sample version used for ASR.
In late June, I identified the affected files, and sent a cdrom of
the current audio versions to Dragon; a few weeks later, they sent
back a new set of ASR output for those files, which would align
properly with the current SGML and boundary data.
I simply failed to put those newer ASR files in place of the outdated
originals when preparing the Oct. 6 release.
The 8 affected files (in asrtext and tables) are:
19980104_1830_1900_ABC_WNT.asr,bndasr
19980106_1830_1900_ABC_WNT.asr,bndasr
19980108_2130_2200_CNN_HDL.asr,bndasr
19980120_1830_1900_ABC_WNT.asr,bndasr
19980120_2000_2100_PRI_TWD.asr,bndasr
19980122_2000_2100_PRI_TWD.asr,bndasr
19980123_0130_0200_CNN_HDL.asr,bndasr
19980130_2000_2100_PRI_TWD.asr,bndasr
(It turns out that these 8 files are all distinct from the 6 files
whose boundary tables were updated in the patch I posted last week.
Also, two of the files -- the ABC programs for 980106 and 980120 --
were absent from the original ASR set; this is the first appearance
of ASR data for these two programs.)
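For anyone scripting the replacement, the ".asr,bndasr" shorthand in the list above can be expanded into individual file paths. This is only a sketch: the message says the files are "in asrtext and tables", and the code assumes (without confirmation) that the .asr files sit under asrtext/ and the .bndasr files under tables/. The names AFFECTED and expand_affected are made up for the example.

```python
# Basenames of the 8 affected programs, copied from the list above.
AFFECTED = [
    "19980104_1830_1900_ABC_WNT",
    "19980106_1830_1900_ABC_WNT",
    "19980108_2130_2200_CNN_HDL",
    "19980120_1830_1900_ABC_WNT",
    "19980120_2000_2100_PRI_TWD",
    "19980122_2000_2100_PRI_TWD",
    "19980123_0130_0200_CNN_HDL",
    "19980130_2000_2100_PRI_TWD",
]


def expand_affected(basenames):
    """Expand each basename into its .asr and .bndasr relative paths.

    ASSUMPTION: .asr lives under asrtext/ and .bndasr under tables/;
    adjust the directory names if your copy is laid out differently.
    """
    paths = []
    for base in basenames:
        paths.append(f"asrtext/{base}.asr")
        paths.append(f"tables/{base}.bndasr")
    return paths


if __name__ == "__main__":
    for path in expand_affected(AFFECTED):
        print(path)
```

Run as a script, it prints the 16 relative paths, one per line.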
There is now a new patch tar file, which replaces the patch file I
posted last week.
[ftp instructions available on request from Dave Graff <graff@ldc.upenn.edu>]
This is a CUMULATIVE patch file -- it includes both the revised
tables posted last week AND the new ASR data and tables listed above.
The other tar file that I posted last week, containing ALL tables,
is also in that ftp directory, and this file has been updated
accordingly, to include the new tables as well as the new "asrtext"
files.
The file sizes are:
tdt_patch_981028.tar.gz 702237 bytes
tdt_tables_latest.tar.gz 1731696 bytes
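A quick way to confirm that a transfer was complete is to compare the downloaded files against the byte counts above. The following is a minimal sketch; the function name check_sizes and the idea of running it in the download directory are illustrative assumptions, and a matching size does not rule out corruption.

```python
import os

# Sizes in bytes, as given in the list above.
EXPECTED_SIZES = {
    "tdt_patch_981028.tar.gz": 702237,
    "tdt_tables_latest.tar.gz": 1731696,
}


def check_sizes(directory, expected=EXPECTED_SIZES):
    """Return {filename: actual_size_or_None} for missing or wrong-size files.

    An empty dict means every expected file is present with the stated size.
    """
    problems = {}
    for name, size in expected.items():
        path = os.path.join(directory, name)
        actual = os.path.getsize(path) if os.path.isfile(path) else None
        if actual != size:
            problems[name] = actual
    return problems


if __name__ == "__main__":
    # Check the current directory; None means the file was not found.
    for name, actual in check_sizes(".").items():
        print(f"{name}: expected {EXPECTED_SIZES[name]} bytes, found {actual}")
```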
Please let me know if you have questions or problems about the
patches, and please accept my apologies for the mishap.
Dave Graff
Last updated Wed Oct 28 14:44:12 1998