
To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: About those index files...
Date: Mon, 13 Sep 1999 13:11:07 EDT

Folks,

In case you haven't worked it out already, the problem reported last week with
the NIST "asr" index tar file involved the use of the Solaris version of "tar"
on the NIST tar file.

Using the GNU version of "tar" (e.g. with the "z" option for gzip-compressed
input) yields a complete extraction of all files with no errors.
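
For anyone without GNU "tar" at hand, here is a minimal sketch of the same extraction using Python's standard "tarfile" module. The function name and both path arguments are placeholders of mine, not names from the NIST release, and this has not been run against the actual "asr" index tar file.

```python
import tarfile
from pathlib import Path


def extract_index_tar(archive_path, dest_dir):
    """Extract every member of a (possibly gzip-compressed) tar archive.

    Returns the list of member names found in the archive.
    Both arguments are hypothetical paths supplied by the caller.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect the compression by itself, which
    # covers what the "z" option does for GNU tar.
    with tarfile.open(archive_path, mode="r:*") as archive:
        names = archive.getnames()
        archive.extractall(path=dest)
    return names
```

Comparing the returned list of names against the files that actually land in the destination directory is one way to confirm that the extraction was complete.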

A comment to me from Bowden brought up another small detail: between
the NIST release of index files on August 12 and the latest release of TDT2
data last week, there were some changes to the inventory of stories in the
corpus -- in particular, some segmentation errors in the broadcast files were
corrected, and some unusable content in the newswire data was eliminated.

(There had also been a couple of sample files dropped from the corpus since
the June 6 NIST release of the data, but it appears that these have been taken
into account in the Aug.12 index files.)

I've done a thorough check of "docno" and "fileid" references in all of the
index and key files of the Aug.12 NIST release, against the current inventory
of TDT2 text data. Below is a listing of the index file references that need
to be "retired" (i.e. items to be removed from the index files, or which would
not show up if the index files were regenerated). It turns out that all cases
involve the "story-link" index files and key files.
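
A check of this kind can be sketched in a few lines of Python. This is only an illustration, not the script used for the check described above. It assumes that docnos can be picked out of an index or key line by their shape (three capital letters, an eight-digit date, and one or two four-digit fields, as in the examples below), and that "inventory" is a set holding the docnos of the current news stories. The function and variable names are my own.

```python
import re

# Assumed docno shape, inferred only from the examples in this message:
# e.g. CNN19980108.1130.1709 (the last ".NNNN" field is treated as optional).
DOCNO_PATTERN = re.compile(r"[A-Z]{3}\d{8}\.\d{4}(?:\.\d{4})?")


def find_stale_references(index_lines, inventory):
    """Return (line number, docno) pairs for docnos missing from inventory.

    index_lines: iterable of text lines from one index or key file.
    inventory:   set of docnos that are valid in the current corpus.
    Line numbers start at 1, matching the listing in this message.
    """
    stale = []
    for line_number, line in enumerate(index_lines, start=1):
        for docno in DOCNO_PATTERN.findall(line):
            if docno not in inventory:
                stale.append((line_number, docno))
    return stale
```

If the inventory holds only docnos whose story type is "NEWS", the same function covers both kinds of problem listed below: docnos that no longer exist and docnos that are no longer news stories.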


- References to non-existent docnos -- in each case, I'm listing the line# in
the index file and the docno involved:

indexes_oct_dry_run_as1_v1/sld_SRC=nwt+bnasr_TEST:SL=eng,CL=nat.ndx:
line 2611: CNN19980108.1130.1709

indexes_oct_dry_run_as1_v1/sld_SRC=nwt+bnasr_TEST:SL=eng,CL=nat.key:
line 77: CNN19980108.1130.1709


- References to docnos which are not news stories (i.e. corrections of
segmentation errors involved changing these stories from "NEWS" to something
else) -- again, I'm listing the line# in the index file and the docno involved:

indexes_oct_dry_run_as1_v1/sld_SRC=nwt+bnasr_TEST:SL=eng,CL=nat.ndx:
line 2841: CNN19980115.2130.1163
line 3620: PRI19980127.2000.2903
line 4678: CNN19980206.1600.1162
line 7928: VOA19980331.1700.2954

indexes_oct_dry_run_as1_v1/sld_SRC=nwt+bnman_TEST:SL=eng,CL=nat.ndx:
line 2736: CNN19980109.1130.1171
line 3161: VOA19980122.2300.1714
line 3455: CNN19980126.2130.0283
line 4420: CNN19980206.1600.1162

indexes_oct_dry_run_asr_v1/sld_SRC=nwt+bnasr_TEST:SL=eng,CL=nat.ndx:
line 2827: CNN19980114.2130.0069
line 3620: PRI19980127.2000.2903
line 4677: CNN19980206.1600.1162

indexes_oct_dry_run_asr_v1/sld_SRC=nwt+bnman_TEST:SL=eng,CL=nat.ndx:
line 2736: CNN19980109.1130.1171
line 3161: VOA19980122.2300.1714
line 3455: CNN19980126.2130.0283
line 4420: CNN19980206.1600.1162

indexes_oct_dry_run_as1_v1/sld_SRC=nwt+bnasr_TEST:SL=eng,CL=nat.key:
line 307: CNN19980115.2130.1163
line 1086: PRI19980127.2000.2903
line 2144: CNN19980206.1600.1162
line 5394: VOA19980331.1700.2954

indexes_oct_dry_run_as1_v1/sld_SRC=nwt+bnman_TEST:SL=eng,CL=nat.key:
line 202: CNN19980109.1130.1171
line 627: VOA19980122.2300.1714
line 921: CNN19980126.2130.0283
line 1886: CNN19980206.1600.1162

indexes_oct_dry_run_asr_v1/sld_SRC=nwt+bnasr_TEST:SL=eng,CL=nat.key:
line 293: CNN19980114.2130.0069
line 1086: PRI19980127.2000.2903
line 2143: CNN19980206.1600.1162

indexes_oct_dry_run_asr_v1/sld_SRC=nwt+bnman_TEST:SL=eng,CL=nat.key:
line 202: CNN19980109.1130.1171
line 627: VOA19980122.2300.1714
line 921: CNN19980126.2130.0283
line 1886: CNN19980206.1600.1162
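
If you would rather patch the existing files than wait for regenerated ones, the removal could look roughly like the sketch below. It drops every line that mentions a retired docno, matching on the docno rather than on the line numbers listed above, because line numbers shift as soon as one line is removed. It adjusts nothing else in the file. The function name is my own, and the approach is an assumption about what "retiring" a reference should mean in practice.

```python
def retire_references(index_lines, retired_docnos):
    """Return the lines that mention none of the retired docnos.

    index_lines:    iterable of text lines from one index or key file.
    retired_docnos: iterable of docno strings to be retired.
    """
    retired = set(retired_docnos)
    kept = []
    for line in index_lines:
        # Skip the whole line if any retired docno appears anywhere in it.
        if any(docno in line for docno in retired):
            continue
        kept.append(line)
    return kept
```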

(Obviously, the same "retired" stories are popping up in more than one index
or key file. There are actually eight distinct docnos in this second set.)
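
The count of distinct docnos can be reproduced from the listing itself. The sketch below assumes the layout used in this message (a file name line ending in ".ndx" or ".key", with or without a trailing colon, followed by "line N: DOCNO" entries). The function names are illustrative only.

```python
import re
from collections import defaultdict

# One listing entry, e.g. "line 2841: CNN19980115.2130.1163"
ENTRY_PATTERN = re.compile(r"^line\s+(\d+):\s+(\S+)$")


def parse_listing(text):
    """Map each file name in the listing to its (line number, docno) entries."""
    entries = defaultdict(list)
    current_file = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY_PATTERN.match(line)
        if match:
            # Entries seen before any file name line are ignored.
            if current_file is not None:
                entries[current_file].append((int(match.group(1)), match.group(2)))
        elif line.endswith((".ndx", ".key", ".ndx:", ".key:")):
            current_file = line.rstrip(":")
    return dict(entries)


def distinct_docnos(entries):
    """Return the set of docnos mentioned anywhere in the parsed listing."""
    return {docno for items in entries.values() for _, docno in items}
```

Feeding the second list above to parse_listing() and taking len(distinct_docnos(...)) should give the figure quoted here.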

Dave G.



Last updated Wed Sep 22 10:26:04 1999