To: tdt-distrib@unagi.cis.upenn.edu
From: James Allan <allan@cs.umass.edu>
Subject: Re: Example Evaluation Mismatch
Date: Wed, 02 Sep 1998 18:00:33 -0400
TDTers,
In response to Tomek's message, I did a quick check. I think GE and
NIST are both wrong. My check involved pulling out the various files
and looking at them manually (with grep and gawk and wc and all that).
I've included more information than is reasonable in this message so
that anyone can duplicate (or challenge) my method. For that
particular index file, there are:
     418 files in the test set
  10,466 stories (BOUNDARY's) in those files
   8,543 NEWS stories in those files
   1,714 MISC stories
     209 UNTRANSCRIBED stories
      22 stories (BOUNDARY's) in the first test file
      15 NEWS stories in the first test file
       1 NEWS story starting at recid 3438 in there
      14 training NEWS stories in the first file
   8,529 NEWS stories in the test set
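The breakdown above can be re-checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and do not come from any TDT tool.

```python
# Counts copied from the list above; variable names are illustrative only.
total_stories = 10_466
news, misc, untranscribed = 8_543, 1_714, 209

# The three doctypes should account for every BOUNDARY line.
assert news + misc + untranscribed == total_stories

# First test file: 15 NEWS stories, one of which (recid 3438) is tested on.
first_file_news = 15
first_file_test_news = 1
first_file_training_news = first_file_news - first_file_test_news

# Removing the training stories leaves the NEWS stories in the test set.
test_set_news = news - first_file_training_news
print(first_file_training_news, test_set_news)  # 14 8529
```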
I found stories by looking for lines containing BOUNDARY in the bndxxx
files. (I confirmed that all of them had exactly 5 or 7
space-delimited fields per line.) I found NEWS stories by grepping
that list for NEWS (I get the same results if I search for "doctype=NEWS").
HOWEVER, the asr boundary files also include a bunch of stories that,
although classified as NEWS, have no text (no Brecid is present). I
believe those get skipped. (The index file does *NOT* include those
stories as non-topic-training stories.)
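The counting procedure described above (grep for BOUNDARY, then for NEWS, then look for a missing Brecid) could be sketched in Python roughly as follows. The exact layout of the boundary files is an assumption here: the sketch supposes one story per line, with space-delimited key=value fields such as doctype=NEWS and Brecid=3438. The function name and the sample lines are made up for illustration and are not taken from the corpus.

```python
from collections import Counter

def count_stories(lines):
    """Tally BOUNDARY lines by doctype, and NEWS lines lacking a Brecid.

    Assumes (hypothetically) that each story is one line containing
    'BOUNDARY' with space-delimited key=value fields such as
    'doctype=NEWS' and 'Brecid=3438'.
    """
    counts = Counter()
    for line in lines:
        if "BOUNDARY" not in line:
            continue
        counts["stories"] += 1
        fields = line.strip().strip("<>").split()
        attrs = dict(f.split("=", 1) for f in fields if "=" in f)
        doctype = attrs.get("doctype", "UNKNOWN")
        counts[doctype] += 1
        # A NEWS story with no Brecid has no text and is presumably skipped.
        if doctype == "NEWS" and "Brecid" not in attrs:
            counts["NEWS_without_Brecid"] += 1
    return counts

# Made-up sample lines for illustration; not taken from the real corpus.
sample = [
    "<BOUNDARY docno=X1 doctype=NEWS Bsec=0.0 Esec=10.0 Brecid=1 Erecid=20>",
    "<BOUNDARY docno=X2 doctype=NEWS Bsec=10.0 Esec=12.0>",
    "<BOUNDARY docno=X3 doctype=MISC Bsec=12.0 Esec=30.0 Brecid=21 Erecid=40>",
]
result = count_stories(sample)
print(result["stories"], result["NEWS"], result["NEWS_without_Brecid"])  # 3 2 1
```

To run it over a real file, pass an open file object in place of the sample list, since the function only iterates over lines.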
      14 NEWS stories in the test files without Brecid's
       3 are in the first file's training set
Meaning:
   8,529 NEWS stories with Brecid's in all test files
      12 NEWS stories with Brecid's in the first file
       1 NEWS story with Brecid at 3438
so 8,518 NEWS stories with Brecid's in the test set
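The Brecid adjustment can be traced step by step. The following Python sketch only restates the subtraction chain from the figures in this message; the variable names are illustrative.

```python
news_all_files = 8_543        # NEWS stories in all 418 files
news_no_brecid = 14           # NEWS stories with no Brecid (no text)
first_file_news = 15          # NEWS stories in the first test file
first_file_no_brecid = 3      # of those, lacking a Brecid (all training)
tested_in_first_file = 1      # the NEWS story at recid 3438

# NEWS stories that actually have text, across all files.
news_with_brecid = news_all_files - news_no_brecid

# NEWS stories with text in the first file, and the training part of them.
first_file_with_brecid = first_file_news - first_file_no_brecid
training_with_brecid = first_file_with_brecid - tested_in_first_file

# Stories left to test on once the training stories are removed.
test_set_with_brecid = news_with_brecid - training_with_brecid
print(news_with_brecid, first_file_with_brecid, test_set_with_brecid)
# 8529 12 8518
```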
The first test file's line in the index file is:
    asrtext/19980331_1830_1900_ABC_WNT.asr 3438
The last (positive) training story ends at 3437 in that file. The
last negative training story ends at 3063 in that file. There is a
MISC story between the last negative training story and the last
positive training story. After the last positive training story,
there is one NEWS story, one MISC story, and one UNTRANSCRIBED story.
Note that my numbers aren't the same as either Tomek's (finding 8,541
stories to test on) or NIST's (finding 8,530). If NIST accidentally
counted those 12 training stories from the first test file, then our
numbers would match up.
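The size of each disagreement follows directly from the three totals. A small Python check, with illustrative names, assuming the figures quoted in this message:

```python
# Totals quoted in this message; names are illustrative only.
mine, tomek, nist = 8_518, 8_541, 8_530
first_file_news_with_brecid = 12

gap_tomek = tomek - mine
gap_nist = nist - mine
print(gap_tomek, gap_nist)  # 23 12

# NIST's surplus equals the count of NEWS stories with Brecid's
# in the first test file.
assert gap_nist == first_file_news_with_brecid
```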
-- james
Last updated Wed Sep 9 09:40:55 1998