(097) previous ~ index ~ next

To: Grace A Crowder <crowder@afterlife.ncsc.mil>
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: commercials in asr files
Date: Fri, 24 Jul 1998 13:12:37 EDT

Grace,

Thanks for this report:

> In listening to the audio disk and looking at the data for
> 0401_1130_1200_CNN_HDL, I find that the audio contains a commercial
> break beginning at approximately 11:43, or about 13 minutes into
> the broadcast. Given that, I would expect to see in
> 19980401_1130_1200_CNN_HDL.bndasr 2 records:
>
> 1. doctype=NEWS for the preceding story with Bsec=704.79
> Esec=770.69 Brecid=1822 Erecid=1985.
>
> 2. And for the commercial,
> doctype=MISCELLANEOUS Bsec=770.69 Esec=957.76 Brecid=1986
> Erecid=2363.
>
> But what I see is a single entry for doctype=NEWS Bsec=704.79
> Esec=770.69 Brecid=1822 Erecid=2363. It appears that commercials
> are considered part of the preceding NEWS story. Is this correct
> or did I just find one that slipped by?

The asr boundary table record involved actually reads as follows (as
released in tables/19980401_1130_1200_CNN_HDL.bndasr):

<BOUNDARY docno=CNN19980401.1130.0704 doctype=NEWS Bsec=704.79 Esec=957.76
Brecid=1822 Erecid=2363>

The corresponding boundary record for the token-stream data (released
in tables/19980401_1130_1200_CNN_HDL.bndtkn) reads as follows:

<BOUNDARY docno=CNN19980401.1130.0704 doctype=NEWS Bsec=704.79 Esec=957.76
Brecid=1591 Erecid=1761>

Both of these have the wrong end time for the story, due to an error
in the manual segmentation of the audio data -- in both the first and
second passes, our people failed to notice the onset of the
commercial, and did not insert a boundary at that point.

The token-stream boundary table has the correct "recid" information,
because there was no text (closed-caption) content during the
commercial break, but the "recid" info for the asr stream is wrong
and should be as you suggested (Brecid=1822 Erecid=1985). Also, there
should be an additional "MISCELLANEOUS" segment in both tables, with
the time stamps that you indicated, as well as an additional "<DOC>"
unit (with suitable tag content but no text content) in the sgml file.
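
For illustration only, here is a minimal Python sketch of the asr-table
correction described above. It assumes a boundary record is a single
"<BOUNDARY ...>" tag of whitespace-separated key=value fields, as in the
records quoted in this message. The helper names (parse_boundary,
split_boundary, format_boundary) are hypothetical and not part of any
released tool, and the docno for the new segment is left out because no
value for it is given here.

```python
def parse_boundary(tag):
    """Parse one <BOUNDARY ...> tag into an ordered dict of string fields."""
    body = tag.strip()
    if not (body.startswith("<BOUNDARY") and body.endswith(">")):
        raise ValueError("not a BOUNDARY tag: %r" % tag)
    # Fields are key=value pairs separated by spaces or line breaks.
    inner = body[len("<BOUNDARY"):-1]
    return dict(item.split("=", 1) for item in inner.split())


def split_boundary(rec, split_sec, last_recid, new_doctype):
    """Split one record at split_sec (hypothetical helper).

    last_recid is the final recid belonging to the first segment;
    the second segment starts at last_recid + 1.
    """
    first = dict(rec, Esec=split_sec, Erecid=str(last_recid))
    second = dict(rec, doctype=new_doctype, Bsec=split_sec,
                  Brecid=str(last_recid + 1))
    # Assumption: the new segment's docno is unknown, so it is omitted.
    second.pop("docno", None)
    return first, second


def format_boundary(rec):
    """Render a field dict back into a single-line BOUNDARY tag."""
    return "<BOUNDARY " + " ".join(f"{k}={v}" for k, v in rec.items()) + ">"


# The asr record as released, split at the commercial onset reported above.
released = """<BOUNDARY docno=CNN19980401.1130.0704 doctype=NEWS Bsec=704.79 Esec=957.76
Brecid=1822 Erecid=2363>"""

news, commercial = split_boundary(parse_boundary(released),
                                  split_sec="770.69",
                                  last_recid=1985,
                                  new_doctype="MISCELLANEOUS")
print(format_boundary(news))
print(format_boundary(commercial))
```

Field values are kept as strings so that time stamps such as 704.79 are
written back exactly as they appeared in the table.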

Dave Graff

Last updated Wed Sep 9 09:40:52 1998