(129) previous ~ index ~ next

To: James Allan <allan@cs.umass.edu>
From: "G. Bowden Wise" <wisegb@crd.ge.com>
Subject: Re: Example Evaluation Mismatch
Date: Wed, 02 Sep 1998 19:05:46 -0400

James,

Thanks for sharing your method for checking the numbers for
topic 50... I decided to try to duplicate (challenge!!)
your results and got something slightly different, but
I think it confirms where the NIST count of 8530 comes from.

Anyone else is free to duplicate (challenge!) as well...

I also did some greps and wcs and discovered that there
are

Bowden                                           James
   418   test files                                418
 10488   stories (BOUNDARY's) in those files     10466
  8558   NEWS stories                             8543
  1720   MISC stories                             1714
   210   UNTRANSCRIBED stories                     209
                                                 ^^^^^
Note that these are slightly different than James's counts
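
For anyone who would rather script the count than chain greps and wcs,
here is a minimal Python sketch of the same tally. It assumes each story
is a single line containing BOUNDARY, that the type appears as
doctype=NEWS (and, by analogy, doctype=MISC and doctype=UNTRANSCRIBED),
and that a story with text carries a Brecid field. The function name
count_stories and the sample lines are hypothetical, not taken from the
corpus.

```python
import re
from collections import Counter

# Assumption: the doctype is written as doctype=<WORD> on the BOUNDARY line.
DOCTYPE_RE = re.compile(r"doctype=(\w+)")


def count_stories(lines):
    """Tally BOUNDARY lines by doctype, plus NEWS lines lacking a Brecid."""
    counts = Counter()
    for line in lines:
        if "BOUNDARY" not in line:
            continue  # not a story boundary
        counts["stories"] += 1
        match = DOCTYPE_RE.search(line)
        doctype = match.group(1) if match else "UNKNOWN"
        counts[doctype] += 1
        # A NEWS story with no Brecid has no text attached.
        if doctype == "NEWS" and "Brecid" not in line:
            counts["NEWS_without_Brecid"] += 1
    return counts


# Made-up sample lines, only to show the call.
sample = [
    "<BOUNDARY doctype=NEWS Brecid=1 Erecid=20>",
    "<BOUNDARY doctype=NEWS>",
    "<BOUNDARY doctype=MISC Brecid=21 Erecid=30>",
    "some transcript text",
]
print(count_stories(sample))
```

Feeding it the lines of all 418 boundary files should give figures
comparable to the columns above, provided the format assumptions hold.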

Using the same techniques, I also duplicated James's results
on the first test file:

> 22 stories (BOUNDARY's) in the first test file
> 15 NEWS stories in the first test file
> 1 NEWS story starting at recid 3438 in there
> 14 training NEWS stories in the first file

Deducting those 14 training NEWS stories, we have
8558 - 14 = 8544 stories. Of those,
14 do not have a Brecid present. This
means that 8544 - 14 = 8530 remain:

8530 NEWS stories in the test set

Which is what NIST counted.

However, when I compute statistics for the tracking
task for topic 50 I am counting 8541 documents.
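
The subtraction above, and the gap to the tracking count, can be written
out as a short Python check. This is only a sketch of the arithmetic:
the variable names are illustrative and every number is copied from this
message, not re-measured.

```python
# Figures copied from this message; nothing here is re-measured.
news_total = 8558           # NEWS stories in all 418 test files (Bowden's count)
first_file_training = 14    # training NEWS stories in the first test file
news_without_brecid = 14    # NEWS stories with no Brecid present

after_training = news_total - first_file_training
with_brecid = after_training - news_without_brecid

nist_count = 8530           # count reported by NIST
tracking_count = 8541       # count from the tracking statistics for topic 50

print(after_training)                # 8544
print(with_brecid)                   # 8530
print(with_brecid - nist_count)      # 0
print(tracking_count - with_brecid)  # 11
```

So the derived figure agrees with NIST, and the tracking count is 11
stories higher.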

I need to look into this further, but for now can
NIST or others confirm the 8530 count?

James, can you double-check your initial counts to
be sure you don't get 10488 total stories with
8558 NEWS?

Bowden
wisegb@crd.ge.com


James Allan wrote:
>
> TDTers,
>
> In response to Tomek's message, I did a quick check. I think GE and
> NIST are both wrong. My check involved pulling out the various files
> and looking at them manually (with grep and gawk and wc and all that).
> I've included more information than is reasonable in this message so
> that anyone can duplicate (or challenge) my method. For that
> particular index file, there are:
>
> 418 files in the test set
> 10,466 stories (BOUNDARY's) in those files
> 8,543 NEWS stories in those files
> 1,714 MISC stories
> 209 UNTRANSCRIBED stories
>
> 22 stories (BOUNDARY's) in the first test file
> 15 NEWS stories in the first test file
> 1 NEWS story starting at recid 3438 in there
> 14 training NEWS stories in the first file
>
> 8,529 NEWS stories in the test set
>
> I found stories by looking for lines containing BOUNDARY in the bndxxx
> files. (I confirmed that all of them had exactly 5 or 7
> space-delimited fields per line.) I found NEWS stories by grepping
> that list for NEWS (the same results if I search for "doctype=NEWS").
>
> HOWEVER, the asr boundary files also include a bunch of stories that,
> although classified as NEWS, have no text (no Brecid is
> present). I believe those get skipped. (The index file does *NOT*
> include those stories as non-topic-training stories.)
>
> 14 NEWS stories in the test files without Brecid's
> 3 are in the first file's training set
>
> Meaning:
> 8,529 NEWS stories with Brecid's in all test files
> 12 NEWS stories with Brecid's in the first file
> 1 NEWS story with Brecid at 3438
>
> so 8,518 NEWS stories with Brecid's in the test set
>
> The first test file's line in the index file is:
>
> asrtext/19980331_1830_1900_ABC_WNT.asr 3438
>
> The last (positive) training story ends at 3437 in that file. The
> last negative training story ends at 3063 in that file. There is a
> MISC story between the last negative training story and the last
> positive training story. After the last positive training story,
> there is one NEWS story, one MISC story, and one UNTRANSCRIBED story.
>
> Note that my numbers aren't the same as either Tomek's (finding 8,541
> stories to test on) or NIST's (finding 8,530). If NIST accidentally
> counted those 12 training stories from the first test file, then our
> numbers would match up.
>
> -- james
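
James's chain of counts in the quoted message can be traced the same
way. This Python sketch only replays his figures; the variable names are
illustrative, and taking the first file's Brecid-less NEWS stories to be
exactly the 3 he mentions is an assumption that happens to reproduce his
12.

```python
# Figures copied from James's quoted message.
news_all_files = 8543           # NEWS stories in all test files
news_no_brecid = 14             # NEWS stories without a Brecid
news_first_file = 15            # NEWS stories in the first test file
news_first_file_no_brecid = 3   # assumed: only the 3 in its training set
news_at_3438 = 1                # the one test NEWS story in the first file

with_brecid_all = news_all_files - news_no_brecid
with_brecid_first = news_first_file - news_first_file_no_brecid
test_set = with_brecid_all - with_brecid_first + news_at_3438

print(with_brecid_all)               # 8529
print(with_brecid_first)             # 12
print(test_set)                      # 8518
print(test_set + with_brecid_first)  # 8530
```

The last line is his closing remark in numbers: adding the 12 first-file
stories back to 8,518 gives NIST's 8,530.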

--
-------------------------------------------------------------------
G. Bowden Wise          General Electric Company
wisegb@crd.ge.com       Corporate Research and Development
Phone: 518 387-5175   Dial Comm: 8*833-5175   FAX: 518-387-6845

Last updated Wed Sep 9 09:40:55 1998