(031) previous ~ index ~ next
To: tdt-distrib@unagi.cis.upenn.edu,
From: "Strzalkowski, Tomek (CRD)" <strzalkowski@exc01crdge.crd.ge.com>
Subject: RE: TDT2 materials for January
Date: Wed, 29 Apr 1998 12:52:53 -0400
While we are at it, I should mention that the document tracking relevance table
has some problems. For example, the document listed as the first relevant one
(chronologically) for topic #2 (the Monica L story) is at best marginal (maybe
BRIEF). The next topic (#3) has a story on Monica listed as relevant; that
story would be properly relevant to topic #2. Topic #3 is on anti-torture laws in
Peru... perhaps not entirely unrelated, depending upon your perspective ;-)
--- Tomek
> ----------
> From: David Graff[SMTP:graff@unagi.cis.upenn.edu]
> Sent: Wednesday, April 29, 1998 10:57 AM
> To: tdt-distrib@unagi.cis.upenn.edu
> Subject: Re: TDT2 materials for January
>
>
> Folks,
>
> I apologize for not providing more details earlier about problems
> that have been found in the original release of TDT tables and data.
> I'll summarize what has been found so far (all of which is being
> fixed in the next release); the problems range from trivial to
> potentially confusing.
>
> In the "asrtext" files, originally produced by Dragon and then
> slightly reformatted here before being released, the final line of
> each file contains "</ASR>" where it should contain "</DOCSET>".
>
> In the "bndtkn" files for broadcast source data (ABC,CNN,PRI,VOA), I
> failed to include the "doctype=" attribute for each entry, and the
> token-index attributes are called "bgnrec=" and "endrec=", where they
> should be called "Brecid=" and "Erecid=". These tables are also
> lacking the final closing tag "</BOUNDSET>" at the end of each file.
>
> In both "bndasr" and "bndtkn" tables for broadcast sources, the
> original script I used to create the tables did not do the right
> thing with respect to DOC units that contained no text in either the
> original SGML or the original ASR output data. In these cases of
> "empty" DOC units, the tables contain funny "recid" attributes;
> typically one or both of Brecid and Erecid equal 0, or bgnrec=1 and
> endrec=0. The right thing, I think, is simply not to have "Brecid"
> and "Erecid" attributes for such DOCs, and this will be the approach
> in the next release.
>
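
The attribute cleanup described above could be sketched roughly as follows. This is only an illustration, not the script used to build the release. It assumes each table entry sits on one line with unquoted integer attributes (for example `<BOUNDARY docno=X bgnrec=1 endrec=0>`) and that record indices start at 1; the entry layout and the name `fix_boundary_line` are assumptions.

```python
import re

# Assumed attribute syntax: name=integer, unquoted, preceded by whitespace.
_REC = re.compile(r"\s+(bgnrec|endrec|Brecid|Erecid)=(-?\d+)")
_RENAME = {"bgnrec": "Brecid", "endrec": "Erecid"}


def fix_boundary_line(line: str) -> str:
    """Rename bgnrec/endrec to Brecid/Erecid; drop both for empty DOC units."""
    found = {_RENAME.get(name, name): int(value)
             for name, value in _REC.findall(line)}
    begin, end = found.get("Brecid"), found.get("Erecid")
    # Treat the DOC as empty if an index is missing, is 0 or less,
    # or the range is inverted (such as bgnrec=1 with endrec=0).
    empty = (begin is None or end is None
             or begin <= 0 or end <= 0 or begin > end)
    if empty:
        return _REC.sub("", line)
    return _REC.sub(
        lambda m: f" {_RENAME.get(m.group(1), m.group(1))}={m.group(2)}",
        line,
    )
```

Entries with a usable index range keep both attributes under the new names; all other attributes on the line are left untouched.
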
> In the SGML data, the following files had errors in the timing
> information associated with the story boundaries, as indicated (the
> line numbers identify the locations of the affected (adjacent)
> "END_TIME" and "DATE_TIME" tags, respectively):
>
> 19980111_1130_1200_CNN_HDL.sgm  lines 406,411:   11:46:23.00 should be 11:46:20.90
>
> 19980112_1600_1630_CNN_HDL.sgm  lines 168,173:   16:06:29.60 should be 16:06:26.72
>                                 lines 179,184:   16:06:26.72 should be 16:06:29.60
>
> 19980106_2100_2200_VOA_WRP.sgm  lines 1103,1108: 22:01:00.00 should be 21:38:12.50
>
> 19980131_0130_0200_CNN_HDL.sgm  line 4:          11:30:00.00 should be 1:30:00.00
>
> These were cases where one story boundary was mistakenly tagged as
> being later in time than a following story boundary, which would have
> caused bad alignment of the ASR data. (You'll notice that the bndasr
> table for 19980131_0130_0200_CNN_HDL has negative values for all the
> "Bsec" and "Esec" attributes, and all of its recid attributes are
> clearly wrong. In the other files, the damage to the
> bndasr tables would have been more localized.)
>
> Finally, I think my first treatment of aligning ASR data to story
> boundaries did not deal properly with the case where the transcripts
> or closed-caption text did not extend to the end of the audio stream.
> In these cases, there would be some amount of text in the ASR data
> for which there was no corresponding DOC unit in the SGML text. In
> the earlier release, all those extra words at the end were simply
> lumped into the last DOC unit. In the next release, this is treated
> in the "bndasr" tables by having a final BOUNDARY entry in which
> "docno=UNASSIGNED" and "doctype=UNTRANSCRIBED".
>
> Dave Graff
>
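
For anyone who wants to check their own copies for the kind of timing error Dave describes, here is a minimal sketch. It assumes the boundary times have already been pulled out of the SGML as "H:MM:SS.ss" strings in file order; the function names are made up for illustration, and broadcasts that cross midnight are not handled.

```python
def to_seconds(stamp: str) -> float:
    """Convert 'H:MM:SS.ss' (or 'HH:MM:SS.ss') to seconds since midnight."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def out_of_order(stamps):
    """Return each index i where stamps[i] is later than stamps[i + 1]."""
    secs = [to_seconds(s) for s in stamps]
    return [i for i in range(len(secs) - 1) if secs[i] > secs[i + 1]]


# The swapped pair reported for 19980112_1600_1630_CNN_HDL.sgm:
print(out_of_order(["16:06:29.60", "16:06:26.72"]))
# The same pair after the correction:
print(out_of_order(["16:06:26.72", "16:06:29.60"]))
```

Any index this returns marks a boundary tagged later than the one following it, which is where the ASR alignment would go wrong.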
Last updated Wed Sep 9 09:40:47 1998