(010) previous ~ index ~ next
To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Format of ASR text data for TDT
Date: Mon, 02 Mar 1998 10:04:38 EST
Following some suggestions provided by Jon Fiscus at NIST, I'd like
to settle on the following as the "official" format for ASR text data
in the TDT2 corpus.
The data for one broadcast episode would be bounded by
<ASR>...</ASR>, with an attribute to identify the file-ID of the
episode. Within this unit:
- each non-speech region is marked by a (contentless) <SIL> tag,
with attributes giving begin-time and duration in seconds (and a
confidence score if there is one);
- each hypothesized word is marked by a <W> tag, with attributes giving
begin-time and duration (in seconds), the speaker cluster used by the
recognizer, and the confidence score.
I'll use Jon Yamron's initial data sample to illustrate:
<ASR fileid=19980104_1130_1200_CNN_HDL>
<SIL Bsec=0.00 Dur=35.00 Conf=1.0>
<W Bsec=35.00 Dur=0.18 Clust=7 Conf=0.38> PEOPLE
<W Bsec=35.18 Dur=0.16 Clust=7 Conf=0.74> IN
<W Bsec=35.34 Dur=0.37 Clust=7 Conf=0.36> TWO
<W Bsec=35.71 Dur=0.56 Clust=7 Conf=0.82> STATES
<W Bsec=36.27 Dur=0.15 Clust=7 Conf=0.76> TO
<W Bsec=36.42 Dur=0.21 Clust=7 Conf=0.50> LEAVE
<W Bsec=36.63 Dur=0.13 Clust=7 Conf=0.82> THEIR
<W Bsec=36.76 Dur=0.55 Clust=7 Conf=0.83> HOMES
<SIL Bsec=37.31 Dur=0.26 Conf=1.0>
<W Bsec=37.57 Dur=0.10 Clust=19 Conf=0.91> THE
<W Bsec=37.67 Dur=0.29 Clust=19 Conf=0.88> FIRE
<W Bsec=37.96 Dur=0.46 Clust=19 Conf=0.96> STARTED
...
</ASR>
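As a rough illustration of how a consumer might read this format, here is a minimal Python sketch. It assumes exactly what the sample shows: one tag per line, unquoted name=value attributes, and the word token following the <W> tag on the same line. The function name parse_asr and the record field names are hypothetical and not part of the format.

```python
import re

# Assumption: each <W> or <SIL> tag sits on its own line, with unquoted
# name=value attributes, and a <W> tag is followed by its word token.
TAG_RE = re.compile(r"<(W|SIL)\s+([^>]*)>\s*(\S*)")
ATTR_RE = re.compile(r"(\w+)=(\S+)")
FILEID_RE = re.compile(r"<ASR\s+fileid=([^\s>]+)>")


def parse_asr(text):
    """Return (fileid, records) for one <ASR>...</ASR> unit.

    parse_asr and the record keys are illustrative names only.
    """
    m = FILEID_RE.search(text)
    fileid = m.group(1) if m else None
    records = []
    for line in text.splitlines():
        t = TAG_RE.match(line.strip())
        if not t:
            continue  # skips <ASR>, </ASR>, and any other non-tag line
        kind, attr_text, token = t.groups()
        attrs = dict(ATTR_RE.findall(attr_text))
        records.append({
            "kind": kind,
            "bsec": float(attrs["Bsec"]),
            "dur": float(attrs["Dur"]),
            # Clust appears only on <W>; Conf is optional on <SIL>.
            "clust": int(attrs["Clust"]) if "Clust" in attrs else None,
            "conf": float(attrs["Conf"]) if "Conf" in attrs else None,
            "word": token if kind == "W" else None,
        })
    return fileid, records


sample = """<ASR fileid=19980104_1130_1200_CNN_HDL>
<SIL Bsec=0.00 Dur=35.00 Conf=1.0>
<W Bsec=35.00 Dur=0.18 Clust=7 Conf=0.38> PEOPLE
<W Bsec=35.18 Dur=0.16 Clust=7 Conf=0.74> IN
</ASR>"""

fileid, records = parse_asr(sample)
words = [r["word"] for r in records if r["kind"] == "W"]
```

Because every record carries a begin-time and a duration, a reader can also check that each region starts where the previous one ends, as it does throughout the sample above.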
This format will make it fairly simple to add in story-boundary
markup when needed.
Dave Graff
Last updated Wed Sep 9 09:40:46 1998