To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: Output file formats for TDT Recog. Output -Reply
Date: Fri, 27 Feb 1998 15:03:59 EST
Given that the Dragon folks are flexible about file format for ASR
output, would anyone object to the following:
<ASR fileid=19980104_1130_1200_CNN_HDL>
<speech spkrCluster=9>
35.00 0.18 PEOPLE 0.38
35.18 0.16 IN 0.74
35.34 0.37 TWO 0.36
35.71 0.56 STATES 0.82
36.27 0.15 TO 0.76
36.42 0.21 LEAVE 0.50
36.63 0.13 THEIR 0.82
36.76 0.55 HOMES 0.83
<speech spkrCluster=17>
37.57 0.10 THE 0.91
37.67 0.29 FIRE 0.88
37.96 0.46 STARTED 0.96
...
</ASR>
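To make the proposed layout concrete, here is a minimal Python sketch
of a reader for it. This is only an illustration under my own
assumptions, not an agreed tool: the names parse_asr and Word are
hypothetical, each word line is taken to be exactly four
whitespace-separated fields (start, duration, word, confidence), and
the "..." line is treated as an elision in the sample rather than as
part of the format.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Word(NamedTuple):
    """One recognized word (hypothetical record type for this sketch)."""
    cluster: int
    start: float
    duration: float
    word: str
    confidence: float


# Tag patterns assume unquoted attribute values, as in the sample.
ASR_OPEN = re.compile(r"<ASR\s+fileid=([^\s>]+)>")
SPEECH_OPEN = re.compile(r"<speech\s+spkrCluster=(-?\d+)>")


def parse_asr(text: str) -> Tuple[Optional[str], List[Word]]:
    """Return (fileid, words) for one <ASR> ... </ASR> block."""
    fileid = None
    cluster = None
    words = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "...":
            continue  # blank line, or the elision marker from the sample
        if line == "</ASR>":
            break
        m = ASR_OPEN.fullmatch(line)
        if m:
            fileid = m.group(1)
            continue
        m = SPEECH_OPEN.fullmatch(line)
        if m:
            # Every word line that follows belongs to this cluster.
            cluster = int(m.group(1))
            continue
        if cluster is None:
            raise ValueError("word line before any <speech> tag: " + line)
        start, duration, word, confidence = line.split()
        words.append(
            Word(cluster, float(start), float(duration), word, float(confidence))
        )
    return fileid, words
```

Since each <speech> tag simply resets the current cluster, the reader
does not need closing </speech> tags, which matches the sample above.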
Notes:
1) The value of the "spkrCluster" attribute would be the label of
the speaker cluster assigned to the section of speech that follows
the tag. I'm assuming that each chunk Dragon produces when it
segments the signal is assigned to exactly one cluster.
2) The use of SGML here is not crucial in any sense; if a simple,
flat table would make more sense (and/or if SGML really turns some
people off), then let's represent the chunk boundaries as ordinary
table rows, like this (I don't think we need to have file-ids on
every line):
;;;cluster start duration word confidence
-1 0.00 35.00 no-speech 1.0
09 35.00 0.18 PEOPLE 0.38
09 35.18 0.16 IN 0.74
09 35.34 0.37 TWO 0.36
09 35.71 0.56 STATES 0.82
09 36.27 0.15 TO 0.76
09 36.42 0.21 LEAVE 0.50
09 36.63 0.13 THEIR 0.82
09 36.76 0.55 HOMES 0.83
-1 37.31 0.26 no-speech 1.0
17 37.57 0.10 THE 0.91
17 37.67 0.29 FIRE 0.88
17 37.96 0.46 STARTED 0.96
...
(Perhaps Dragon has some variable confidence score regarding the
speech/nonspeech decision? If so, the fifth column would always be
meaningful.)
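As a rough sketch of how the flat table could be derived from
word-level output, the Python below fills the gaps between words with
"no-speech" rows. Everything in it is my own assumption for
illustration: the function names, the 0.005-second tolerance for
deciding that a gap exists, rounding times to two decimals, and the
fixed 1.0 confidence on no-speech rows (taken from the example table,
not from anything Dragon has specified).

```python
from typing import List, Tuple

# (cluster, start, duration, word, confidence)
Row = Tuple[int, float, float, str, float]

NO_SPEECH_CLUSTER = -1  # label used for no-speech rows in the example table


def fill_no_speech(words: List[Row], tol: float = 0.005) -> List[Row]:
    """Insert a no-speech row wherever consecutive words leave a gap."""
    rows = []
    prev_end = 0.0  # assume the recording starts at time 0.00
    for cluster, start, duration, word, confidence in words:
        gap = round(start - prev_end, 2)
        if gap > tol:
            rows.append((NO_SPEECH_CLUSTER, prev_end, gap, "no-speech", 1.0))
        rows.append((cluster, start, duration, word, confidence))
        prev_end = round(start + duration, 2)
    return rows


def format_row(row: Row) -> str:
    """Render one row with a zero-padded cluster label and 2-decimal numbers.

    Note: this prints no-speech confidence as 1.00, not 1.0 as in the table.
    """
    cluster, start, duration, word, confidence = row
    return f"{cluster:02d} {start:.2f} {duration:.2f} {word} {confidence:.2f}"
```

Applied to the words in the example, this reproduces the two no-speech
rows shown in the table: one covering 0.00 to 35.00, and one of 0.26
seconds starting at 37.31, between HOMES and THE.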
For the record, I will recommend the SGML format, but I won't argue
for it strenuously -- if someone really wants the flat table approach,
I will be easily swayed.
Dave Graff
Last updated Wed Sep 9 09:40:46 1998