To: tdt-distrib <tdt-distrib@unagi.cis.upenn.edu>
From: "G. Bowden Wise" <wisegb@crd.ge.com>
Subject: Bad boundaries in PRI ASR files
Date: Tue, 27 Oct 1998 17:14:55 -0500
Hello fellow TDTers
We found some discrepancies in some of the PRI bndasr
files. The Brecid and Erecid story boundaries often do not
match up to the correct segment in the associated ASR file
for a particular document.
For an example of this please take a look at the file
19980130_2000_2100_PRI_TWD.bndasr
We have confirmed this with the latest CDROM re-release
of the TDT2 data.
If you examine the SGML source file along with the
ASR tokens you can see where the boundaries of the stories
lie.
Now I realize that the document boundaries are not
supposed to coincide perfectly, but in this file, the stories
are way off.
Here is a run-down of where stories from
19980130_2000_2100_PRI_TWD.bndasr match up in the corresponding
.asr file:
The beginning of the file has several stories that do
not even appear in the ASR file:
From ASR
=========
PRI19980130.2000.0000 MISCELLANEOUS Brecid=1 ?missing?
PRI19980130.2000.0059 NEWS Brecid=128 ?missing?
PRI19980130.2000.0121 NEWS Brecid=276 ?missing?
PRI19980130.2000.0176 NEWS Brecid=419 ?missing?
PRI19980130.2000.0216 NEWS Brecid=514 ?missing?
PRI19980130.2000.0236 NEWS Brecid=573 ?missing?
PRI19980130.2000.0278 NEWS Brecid=695 ?missing?
PRI19980130.2000.0329 NEWS Brecid=851 ?missing?
In the middle of this boundary file we can, indeed, find
the stories in the ASR file, but notice how far off the
.bndasr file is from the stories!!
PRI19980130.2000.0350 MISCELLANEOUS Brecid=913 [1054,1080]
PRI19980130.2000.0371 NEWS Brecid=972 [1087,1415]
PRI19980130.2000.0489 NEWS Brecid=1273 [1416,2329]
PRI19980130.2000.0774 MISCELLANEOUS Brecid=2159 [2330,2340]
PRI19980130.2000.0818 NEWS Brecid=2294 [2350,3077]
PRI19980130.2000.1149 MISCELLANEOUS Brecid=3022 [3078,3156]
PRI19980130.2000.1188 NEWS Brecid=3078 [3319,4789]
PRI19980130.2000.1684 MISCELLANEOUS Brecid=4458 [4795,4866]
PRI19980130.2000.1723 NEWS Brecid=4571 [4867,4919]
PRI19980130.2000.1743 NEWS Brecid=4631 [4920,5059]
PRI19980130.2000.1792 NEWS Brecid=4785 [5060,5168]
PRI19980130.2000.1834 NEWS Brecid=4803 [5169,5236]
PRI19980130.2000.1857 NEWS Brecid=4829 [5237,5320]
PRI19980130.2000.1890 NEWS Brecid=4913 [5321,5355]
PRI19980130.2000.1901 MISCELLANEOUS Brecid=4936 [5356,5400]
PRI19980130.2000.1923 NEWS Brecid=4996 [5407,6107]
PRI19980130.2000.2176 NEWS Brecid=5679 [6110,6254]
PRI19980130.2000.2231 MISCELLANEOUS Brecid=5849 [6255,6272]
PRI19980130.2000.2240 NEWS Brecid=5871 [6273,7093]
PRI19980130.2000.2575 MISCELLANEOUS Brecid=6758 ?missing?
PRI19980130.2000.2604 NEWS Brecid=6839 [7206,7298]
PRI19980130.2000.2645 MISCELLANEOUS Brecid=6962 [7304,7314]
PRI19980130.2000.2648 NEWS Brecid=6971 [7315,7397]
PRI19980130.2000.2674 NEWS Brecid=7040 [7398,7447]
PRI19980130.2000.2689 NEWS Brecid=7079 [7448,7511]
PRI19980130.2000.2763 MISCELLANEOUS Brecid=7218 [7532,7636]
PRI19980130.2000.2834 NEWS Brecid=7407 [7898,8683]
PRI19980130.2000.3132 MISCELLANEOUS Brecid=8118 [8684,8780]
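For anyone who wants to quantify the drift rather than eyeball it, here is a minimal sketch that parses lines in the layout of the run-down above and reports how far each story's observed ASR start is from its Brecid. It assumes the bracketed pair is the first and last ASR record where the story text was actually found; the function names (parse_rundown, start_offsets) are made up for illustration, and the pattern matches this email's ad hoc layout, not the .bndasr file format itself.

```python
import re

# One line of the run-down above: docno, type, Brecid, then either an
# observed [start,end] range or the literal "?missing?".
# Assumption: this is the email's own layout, not the .bndasr format.
LINE_RE = re.compile(
    r"^(?P<docno>\S+)\s+(?P<kind>\S+)\s+Brecid=(?P<brecid>\d+)\s+"
    r"(?:\[(?P<start>\d+),(?P<end>\d+)\]|\?missing\?)\s*$"
)


def parse_rundown(lines):
    """Parse run-down lines into dicts; lines that do not match are skipped."""
    rows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        start = m.group("start")
        rows.append({
            "docno": m.group("docno"),
            "kind": m.group("kind"),
            "brecid": int(m.group("brecid")),
            # None means the story was not found in the ASR file.
            "observed": None if start is None
            else (int(start), int(m.group("end"))),
        })
    return rows


def start_offsets(rows):
    """Observed ASR start minus Brecid, for each story that was located."""
    return [
        (r["docno"], r["observed"][0] - r["brecid"])
        for r in rows
        if r["observed"] is not None
    ]


sample = [
    "PRI19980130.2000.0329 NEWS Brecid=851 ?missing?",
    "PRI19980130.2000.0350 MISCELLANEOUS Brecid=913 [1054,1080]",
    "PRI19980130.2000.0371 NEWS Brecid=972 [1087,1415]",
    "PRI19980130.2000.0489 NEWS Brecid=1273 [1416,2329]",
]

rows = parse_rundown(sample)
for docno, offset in start_offsets(rows):
    print(docno, offset)
```

On these four sample lines the located stories start 141, 115 and 143 records after their Brecid, and the first story is reported as missing (no observed range).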
Has anyone else noticed this before? Can anyone confirm
or dispute these findings?
We think that there may be more PRI boundary files that are suspect,
but we have not plowed through them in this detail.
Other candidates are:
19980120_2000_2100_PRI_TWD.bndasr
19980122_2000_2100_PRI_TWD.bndasr
Bowden
--
-------------------------------------------------------------------
G. Bowden Wise General Electric Company
wisegb@crd.ge.com Corporate Research and Development
Phone: 518 387-5175 Dial Comm: 8*833-5175 FAX: 518-387-6845
Last updated Wed Oct 28 14:44:12 1998