(138) previous ~ index ~ next

To: tdt-distrib <tdt-distrib@unagi.cis.upenn.edu>
From: "G. Bowden Wise" <wisegb@crd.ge.com>
Subject: Brecid Mismatches with Index files in version2 (tracking task)
Date: Thu, 03 Sep 1998 17:14:00 -0400

We have been using the index file from
tdt_deliv_980708_devtest_version2

and noticed some problems... can anyone else confirm or
dispute these? We are doing the topic tracking task.

For topic 39, using index: trk_nwt+asr_39.ndx
The first tracking file is:
tkntext/19980301_2300_2400_VOA_TDY.tkn 955

However, if you look in the bndtkn file for this one,
tables/19980301_2300_2400_VOA_TDY.bndtkn

we find that the document starting at Brecid=955 is of
type MISCELLANEOUS:

<BOUNDARY docno=VOA19980301.2300.0356 doctype=MISCELLANEOUS Bsec=356.44 Esec=417.80 Brecid=955 Erecid=1069>

I thought MISCELLANEOUS files were not considered at all
for training or tracking??? Can someone tell us if we
should track MISC docs or not??

I also found that the offsets for the first tracking document
do not match up to the respective boundary files for
some of the topics for the tracking task. Again we are
looking at the index files for trk_nwt+asr_*.ndx

If you grep the index files for the first document of each
topic and then look up the corresponding boundary file to
check the Brecid, you find that there are mismatches.
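
The lookup described above can be sketched in Python. This is
only a rough illustration: it assumes every BOUNDARY tag has the
attribute layout of the one quoted earlier, and the function
names parse_boundaries and check_start are made up for this
sketch, not part of any TDT tool.

```python
import re

# key=value attributes as they appear inside a <BOUNDARY ...> tag
ATTR = re.compile(r"(\w+)=([^\s>]+)")

def parse_boundaries(text):
    """Return one dict per <BOUNDARY ...> tag found in bndtkn text."""
    docs = []
    for tag in re.findall(r"<BOUNDARY\s[^>]*>", text):
        attrs = dict(ATTR.findall(tag))
        # Record ids are compared numerically, so convert them.
        attrs["Brecid"] = int(attrs["Brecid"])
        attrs["Erecid"] = int(attrs["Erecid"])
        docs.append(attrs)
    return docs

def check_start(docs, recid):
    """Return the document whose Brecid equals recid, or None."""
    for doc in docs:
        if doc["Brecid"] == recid:
            return doc
    return None

sample = ("<BOUNDARY docno=VOA19980301.2300.0356 doctype=MISCELLANEOUS "
          "Bsec=356.44 Esec=417.80 Brecid=955 Erecid=1069>")
hit = check_start(parse_boundaries(sample), 955)
print(hit["docno"], hit["doctype"])
```

A result of None means no document starts at that record id; a
hit whose doctype is not the expected one is the topic 39 case.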

Topic 39 mismatches because, as explained above, offset 955
is a MISCELLANEOUS document, not a NEWS document.

The topics whose Brecids do not match are
43, 46, 47, and 48.

Is there a reason why these mismatches occur???

Below I show, for each topic, whether there is a match and,
in case of a mismatch, the document that is closest to the
Brecid indicated in the index:
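
A listing of this kind could be produced along the following
lines. This is a sketch under assumptions: "closest" is taken
here to mean the smallest absolute difference between a
document's Brecid and the index offset, which may not be the
exact rule used for the listing, and the names closest_doc and
report are hypothetical. The two documents used as data are
taken from the listing itself.

```python
def closest_doc(docs, recid):
    """docs: list of (docno, brecid, erecid) tuples from one boundary file.

    Returns (is_match, doc), where doc is the document whose Brecid is
    nearest to recid and is_match says whether it equals recid exactly.
    """
    best = min(docs, key=lambda d: abs(d[1] - recid))
    return best[1] == recid, best

def report(docs, recid):
    """Format one result line in the style of the listing."""
    is_match, (docno, brecid, erecid) = closest_doc(docs, recid)
    label = "Match" if is_match else "Mismatch"
    return f"{label}--{docno} [{brecid},{erecid}]"

# Two documents from 19980324_1600_1630_CNN_HDL, as given in the listing.
cnn_docs = [
    ("CNN19980324.1600.0493", 1230, 1295),
    ("CNN19980324.1600.1670", 4269, 4318),
]
print(report(cnn_docs, 1230))
print(report(cnn_docs, 4319))
```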

trk_nwt+asr_39.ndx: tkntext/19980301_2300_2400_VOA_TDY.tkn 955
Mismatch because 955 is MISCELLANEOUS
trk_nwt+asr_41.ndx: tkntext/19980306_2203_2359_NYT_NYT.tkn 10855
Match--NYT19980306.0377 [10855,11120]
trk_nwt+asr_42.ndx: asrtext/19980312_0130_0200_CNN_HDL.asr 2543
Match--CNN19980312.0130.1017 [2543,2584]
trk_nwt+asr_43.ndx: asrtext/19980316_1600_1630_CNN_HDL.asr 2618
Mismatch--CNN19980316.1600.1008 [2619,2876]
trk_nwt+asr_44.ndx: asrtext/19980307_1130_1200_CNN_HDL.asr 872
Match--CNN19980307.1130.0380 [872,1023]
trk_nwt+asr_46.ndx: asrtext/19980324_1830_1900_ABC_WNT.asr 1811
Mismatch--ABC19980324.1830.0880 [2164,2469]
trk_nwt+asr_47.ndx: asrtext/19980327_2130_2200_CNN_HDL.asr 1215
Mismatch--CNN19980327.2130.0542 [1239,1438]
trk_nwt+asr_48.ndx: asrtext/19980324_1600_1630_CNN_HDL.asr 4319
Mismatch--CNN19980324.1600.1670 [4269,4318]
trk_nwt+asr_50.ndx: asrtext/19980331_1830_1900_ABC_WNT.asr 3438
Match--ABC19980331.1830.1553 [3438,3774]
trk_nwt+asr_52.ndx: tkntext/19980303_1928_2037_NYT_NYT.tkn 1693
Match--NYT19980303.0236 [1693,2088]
trk_nwt+asr_56.ndx: asrtext/19980324_1600_1630_CNN_HDL.asr 1230
Match--CNN19980324.1600.0493 [1230,1295]
trk_nwt+asr_57.ndx: tkntext/19980401_2116_2203_NYT_NYT.tkn 12397
Match--NYT19980401.0500 [12397,12977]
trk_nwt+asr_60.ndx: asrtext/19980409_1830_1900_ABC_WNT.asr 3496
Match--ABC19980409.1830.1432 [3496,3541]
trk_nwt+asr_63.ndx: tkntext/19980327_2130_2230_NYT_NYT.tkn 14432
Match--NYT19980327.0422 [14432,14731]
trk_nwt+asr_64.ndx: asrtext/19980414_1830_1900_ABC_WNT.asr 1312
Match--ABC19980414.1830.0472 [1312,1351]
trk_nwt+asr_65.ndx: asrtext/19980415_1700_1800_VOA_WRP.asr 6669
Match--VOA19980415.1700.2570 [6669,7378]
trk_nwt+asr_66.ndx: asrtext/19980409_2130_2200_CNN_HDL.asr 3147
Match--CNN19980409.2130.1332 [3147,3175]



--
-------------------------------------------------------------------
G. Bowden Wise General Electric Company
wisegb@crd.ge.com Corporate Research and Development
Phone: 518 387-5175 Dial Comm: 8*833-5175 FAX: 518-387-6845

Last updated Wed Sep 9 09:40:55 1998