
To: "Strzalkowski, Tomek (CRD)" <strzalkowski@exc01crdge.crd.ge.com>
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: FW: More Problems with TDT2 devset data
Date: Wed, 29 Jul 1998 12:29:25 EDT

Tomek (and all TDT'ers),

The following property of the various boundary tables is to be
expected in the TDT2 corpus:

> ... some records in bndtkn and bndasr files do not have the
> Brecid or Erecid tags, they only have Bsec and Esec (TIME) tags.

This is not an error, because the broadcast audio data do contain
portions between news stories that span an appreciable amount of time
and do not contain any speech. As there is no speech, there can be
no words, and with no words, there is nothing to be referred to by
"Brecid" and "Erecid" attributes. Given this condition for a
non-news segment, the boundary records for that segment will not
contain these attributes.

In the case of records that show "doctype=UNTRANSCRIBED", this
happens in the context of SGML files (and token stream files created
from the SGML data) that have been drawn from closed-captions. There
are cases where a reporter or announcer speaks what would be
classified as a full story, but the closed-caption text contains
few or none of the words spoken (e.g. because all the relevant
information already shows up in the video image). When the closed
captions contain none of the words, the topic labelers have no text
on which to base their judgments, and there is nothing for the
"Brecid" and "Erecid" attributes to refer to in the ".bndtkn" table,
so these attributes are left out for that news story. The ".bndasr"
table will still carry these attributes for the same segment,
assuming that the ASR system did recognize words in that region of
the audio recording.
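
For anyone whose software reads these tables, here is a minimal sketch
of treating "Brecid" and "Erecid" as optional. The record layout (one
tag per line with attribute=value pairs), the sample values, and the
function names parse_boundary and word_span are assumptions made for
illustration only; they are not taken from the corpus documentation.

```python
import re

# Assumed layout: attribute=value pairs inside one tag per line.
ATTR = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_boundary(line):
    """Return the attributes of one boundary record as a dict of strings."""
    body = line.strip().lstrip("<").rstrip(">")
    return {key: value.strip('"') for key, value in ATTR.findall(body)}

def word_span(rec):
    """Return (Brecid, Erecid) as ints, or None when the record has no words."""
    if "Brecid" not in rec or "Erecid" not in rec:
        # Time-only record: silence between stories, or an untranscribed story.
        return None
    return int(rec["Brecid"]), int(rec["Erecid"])

# Made-up sample records, not copied from the corpus.
sample = [
    '<BOUNDARY doctype=NEWS Bsec=10.5 Esec=95.2 Brecid=1 Erecid=240>',
    '<BOUNDARY doctype=MISCELLANEOUS Bsec=95.2 Esec=101.0>',
    '<BOUNDARY doctype=UNTRANSCRIBED Bsec=101.0 Esec=130.7>',
]
spans = [word_span(parse_boundary(s)) for s in sample]
```

Records for which word_span returns None can still be located in the
audio through their Bsec and Esec times; they simply have no word
range to look up.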

Dave Graff

Last updated Wed Sep 9 09:40:53 1998