(108) previous ~ index ~ next

To: "G. Bowden Wise" <wisegb@crd.ge.com>
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: FW: FW: More Problems with TDT2 devset data
Date: Wed, 29 Jul 1998 13:40:02 EDT

Bowden,

It looks to me like the cases where you find "doctype=NEWS" and there
are no recid attributes are all from "bndasr" tables. This would
mean that there was transcription or closed-caption text for the news
story (used by the annotators), but the ASR system failed to recognize
any words -- hence there are no words in that portion of the audio and
no recids for that segment in the "bndasr" table.

This is a third condition that would give rise to boundaries without
recid attributes in the tables. Given the number of times this
condition shows up in the listing you sent earlier, there is good
reason to check into this more carefully, to see whether something
else may have gone wrong in preparing the tables.

I did notice one case in your listing where the doctype was "NEWS"
and the recid attributes were missing in the "bndtkn" table. This
looks like a problem, and I will need to check into this one more
carefully as well:

> 19980331_1700_1800_VOA_TDY.bndtkn:<BOUNDARY docno=VOA19980331.1700.2954
> doctype=NEWS Bsec=2954.97 Esec=2961.88>

Thanks for the clarification.

Dave G.

> Date: Wed, 29 Jul 1998 13:29:30 -0400
> From: "G. Bowden Wise" <wisegb@crd.ge.com>
> To: graff@unagi.cis.upenn.edu
> Subject: Re: FW: FW: More Problems with TDT2 devset data
>
> Dave:
>
> ... in your response, you mention specifically UNTRANSCRIBED and
> non-news segments.... did you also mean this also happens
> to NEWS segments as well?
>
> Just to be clear, the examples I found are indeed
> NEWS segments. And if you look in the SGML files
> there are indeed words of usable text.
>
> So you are saying that because the broadcast audio
> had no usable words, that is why there are no Brecid
> or Erecids in the boundary files? Even for
> doctype=NEWS ?
>
> Thanks!
> Bowden

Last updated Wed Sep 9 09:40:53 1998