To: Sreenivasa Sista <ssista@bbn.com>
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: data problems!
Date: Tue, 28 Sep 1999 18:16:55 -0400 (EDT)
In response to Sreenivasa's report about the discrepancy between tkn and as0
data for this file:
19980302_0900_1000_VOA_MAN
The problem affects not only the machine-translated data, but the untranslated
Mandarin files as well. One other file set, 19980302_0700_0800_VOA_MAN, is
also affected.
It appears that the LDC experienced some confusion with regard to the two
VOA_MAN files that were collected on March 2, 1998. For reasons that I have
yet to fathom, the CD-ROMs of audio data that were sent to Dragon Systems for
ASR had the file names assigned to these two recordings one way, whereas the
audio tapes sent out for manual transcription (and the manual segmentation of
those transcripts) were identified with the file names assigned the other way.
(This has something to do with the fact that the transcription service needed
to use truncated file names, and they were not always careful or consistent in
doing the truncation.)
In other words, the ASR content in 19980302_0700_0800_VOA_MAN should
correspond to the tkn (ref. text) content for 19980302_0900_1000_VOA_MAN, and
vice versa.
In order to correct this problem, I would need to re-assign the as0 file
names, and redo the alignment of ASR data with manually segmented story
boundaries. This is easy enough to do, and I can send out patches of the as0
and mtas0 data and boundary tables tomorrow morning. In the absence of a
patch, it would be best to consider the as0 files for 19980302_*_VOA_MAN as
unusable.
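
For anyone who wants to see what the file-name re-assignment amounts to, here
is a rough sketch in Python. It is not the actual patch procedure. It assumes
(hypothetically) that each as0 file sits in a single directory and is named
with the file ID plus a ".as0" extension; the function name is made up for
illustration. It only swaps the two names and does not redo the alignment with
story boundaries or touch any file IDs recorded inside the files.

```python
import os

# The two affected file IDs named in this message.
ID_A = "19980302_0700_0800_VOA_MAN"
ID_B = "19980302_0900_1000_VOA_MAN"


def swap_as0_files(directory, ext=".as0"):
    """Exchange the names of the two affected files in `directory`.

    Assumption: files are named <file ID><ext> in one flat directory.
    The ".as0" extension and this layout are illustrative guesses.
    """
    path_a = os.path.join(directory, ID_A + ext)
    path_b = os.path.join(directory, ID_B + ext)
    tmp = path_a + ".swap_tmp"  # temporary name so neither file is overwritten

    os.rename(path_a, tmp)      # A -> temp
    os.rename(path_b, path_a)   # B -> A's name
    os.rename(tmp, path_b)      # old A -> B's name
```

Calling swap_as0_files("some_dir") would exchange the two names; the same call
with a different `ext` could be applied to the mtas0 files, again assuming that
naming scheme.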
This is the only case in the VOA_MAN data where this confusion occurred. (For
all other files, the audio sent to Dragon for ASR corresponds correctly to the
audio used for transcription and segmentation.)
Dave Graff
Last updated Wed Sep 29 12:12:22 1999