To: "G. Bowden Wise" <wisegb@crd.ge.com>
From: "Vernon L. Warnick" <skip@Glue.umd.edu>
Subject: Re: Example Evaluation Mismatch
Date: Fri, 4 Sep 1998 09:38:29 -0400 (EDT)
---559023410-851401618-904916309=:6859
Content-Type: TEXT/PLAIN; charset=US-ASCII
Hello All,
I attempted to replicate the count exercise for topic 50 and have, much to
my dismay, produced a third set of numbers for the topic. The attached
file is a perl script that I used to pull my numbers from the files. I
used the source files tables and *tkn files from the tdt_deliv_980708/
release and the index files from tdt_deliv_980708_devtest_version2/
release. I did not modify the index files or run another set using the
script provided with that release. Here are my numbers along with the
other two sets:
                 Bowden    James     Skip
SOURCE FILES        418      418      418
DOC TOTALS        10488    10466    10488
NEWS               8558     8543     8530
*NULLNEWS                              14
MISC               1720     1714     1715
UNTRANS             210      209      210
*Those documents identified as type=NEWS but having no associated tokens
Taking my numbers a step further:
   8530  NEWS
     14  NULLNEWS
   1715  MISC
    210  UNTRANS
  -----
  10469
    +19  (stories identified as training in the first source file)
  -----
  10488
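The same bookkeeping can be checked mechanically. What follows is a minimal Python sketch and not part of the attached script: the names COUNTS and shortfall are made up for illustration, and the figures are simply copied from the comparison table in this message. It reports, for each set, how far the per-type counts fall short of the reported document total.

```python
# Figures copied from the comparison table in this message.
# COUNTS and shortfall are illustrative names, not taken from test.pl.
COUNTS = {
    "Bowden": {"total": 10488,
               "parts": {"NEWS": 8558, "MISC": 1720, "UNTRANS": 210}},
    "James":  {"total": 10466,
               "parts": {"NEWS": 8543, "MISC": 1714, "UNTRANS": 209}},
    "Skip":   {"total": 10488,
               "parts": {"NEWS": 8530, "NULLNEWS": 14,
                         "MISC": 1715, "UNTRANS": 210}},
}


def shortfall(name):
    """Return the reported total minus the sum of the per-type counts."""
    entry = COUNTS[name]
    return entry["total"] - sum(entry["parts"].values())


for name in COUNTS:
    print(name, shortfall(name))
```

Under these figures only the third set shows a nonzero shortfall, which corresponds to the 19 stories identified as training.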
I have attached the script not necessarily for you to run, but rather
so that you can see how I am compiling my data. If you see any holes in
my script or my methodology, please point them out.
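The general shape of the counting can be sketched compactly. The following Python is a rough illustration under stated assumptions and not a translation of test.pl. The attribute names doctype and Brecid come from this thread, while the exact BOUNDARY line layout, the Bsec and Esec attributes, the docno values, and the function names are hypothetical. It tallies stories by type, keeps NEWS stories without a Brecid in their own bucket, and skips stories that begin before a given starting recid.

```python
import re
from collections import Counter

# key=value pairs inside a boundary line; the line layout is an assumption.
ATTR = re.compile(r"(\w+)=([^\s>]+)")


def parse_boundary(line):
    """Return the attributes of one BOUNDARY line as a dict, or None."""
    if "<BOUNDARY" not in line:
        return None
    return dict(ATTR.findall(line))


def count_stories(lines, start_recid=0):
    """Tally stories by doctype; NEWS without a Brecid goes to NULLNEWS."""
    tally = Counter()
    for line in lines:
        attrs = parse_boundary(line)
        if attrs is None:
            continue  # e.g. the enclosing BOUNDSET lines
        doctype = attrs.get("doctype", "UNKNOWN")
        brecid = attrs.get("Brecid")
        if brecid is None:
            # No record ids at all: the story has no associated text.
            tally["NULLNEWS" if doctype == "NEWS" else doctype] += 1
        elif int(brecid) >= start_recid:
            tally[doctype] += 1
    return tally


# Hypothetical sample lines, invented purely to exercise the functions.
sample = [
    "<BOUNDSET>",
    "<BOUNDARY docno=X1 doctype=NEWS Bsec=0.0 Esec=10.0 Brecid=1 Erecid=40>",
    "<BOUNDARY docno=X2 doctype=MISC Bsec=10.0 Esec=12.0 Brecid=41 Erecid=50>",
    "<BOUNDARY docno=X3 doctype=NEWS Bsec=12.0 Esec=13.0>",
    "</BOUNDSET>",
]
print(count_stories(sample))
```

Keeping the text-less NEWS stories in a separate bucket makes it easy to see whether a disagreement between two counts comes from those stories or from the training/test cutoff.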
On Wed, 2 Sep 1998, G. Bowden Wise wrote:
> James
>
> Thanks for sharing your method for checking the numbers for
> topic 50... I decided to try to duplicate (challenge!!)
> your results and got something slightly different, but
> I think it confirms where the NIST count of 8530 comes from.
>
> Anyone else is free to duplicate (challenge!) as well...
>
> I also did some greps and wcs and discovered that there
> are
>
> Bowden                                          James
>    418  test files                                418
>  10488  stories (BOUNDARY's) in those files     10466
>   8558  NEWS stories                             8543
>   1720  MISC stories                             1714
>    210  UNTRANSCRIBED stories                     209
>  ^^^^^
> Note that these are slightly different than James's counts
>
> Using the same techniques, I also duplicated James's results
> on the first test file
>
> > 22 stories (BOUNDARY's) in the first test file
> > 15 NEWS stories in the first test file
> > 1 NEWS story starting at recid 3438 in there
> > 14 training NEWS stories in the first file
>
> Deducting those 14 training NEWS stories we have
> 8558 - 14 = 8544 stories. In those
> there are 14 which do not have a Brecid present. This
> means that there are 8544 - 14 = 8530
>
> 8530 NEWS stories in the test set
>
> Which is what NIST counted.
>
> However, when I compute statistics for the tracking
> task for topic 50 I am counting 8541 documents.
>
> I need to look into this further, but for now can
> NIST or others confirm the 8530 count?
>
> James can you double check your initial counts to
> be sure you don't get 10488 total stories with
> 8558 NEWS.
>
> Bowden
> wisegb@crd.ge.com
>
>
> James Allan wrote:
> >
> > TDTers,
> >
> > In response to Tomek's message, I did a quick check. I think GE and
> > NIST are both wrong. My check involved pulling out the various files
> > and looking at them manually (with grep and gawk and wc and all that).
> > I've included more information than is reasonable in this message so
> > that anyone can duplicate (or challenge) my method. For that
> > particular index file, there are:
> >
> > 418 files in the test set
> > 10,466 stories (BOUNDARY's) in those files
> > 8,543 NEWS stories in those files
> > 1,714 MISC stories
> > 209 UNTRANSCRIBED stories
> >
> > 22 stories (BOUNDARY's) in the first test file
> > 15 NEWS stories in the first test file
> > 1 NEWS story starting at recid 3438 in there
> > 14 training NEWS stories in the first file
> >
> > 8,529 NEWS stories in the test set
> >
> > I found stories by looking for lines containing BOUNDARY in the bndxxx
> > files. (I confirmed that all of them had exactly 5 or 7
> > space-delimited fields per line.) I found NEWS stories by grepping
> > that list for NEWS (the same results if I search for "doctype=NEWS").
> >
> > HOWEVER, the asr boundary files also include a bunch of stories that,
> > although classified as NEWS, have no text (no Brecid is
> > present). I believe those get skipped. (The index file does *NOT*
> > include those stories as non-topic-training stories.)
> >
> > 14 NEWS stories in the test files without Brecid's
> > 3 are in the first file's training set
> >
> > Meaning:
> > 8,529 NEWS stories with Brecid's in all test files
> > 12 NEWS stories with Brecid's in the first file
> > 1 NEWS story with Brecid at 3438
> >
> > so 8,518 NEWS stories with Brecid's in the test set
> >
> > The first test file's line in the index file is:
> >
> > asrtext/19980331_1830_1900_ABC_WNT.asr 3438
> >
> > The last (positive) training story ends at 3437 in that file. The
> > last negative training story ends at 3063 in that file. There is a
> > MISC story between the last negative training story and the last
> > positive training story. After the last positive training story,
> > there is one NEWS story, one MISC story, and one UNTRANSCRIBED story.
> >
> > Note that my numbers aren't the same as either Tomek's (finding 8,541
> > stories to test on) or NIST's (finding 8,530). If NIST accidentally
> > counted those 12 training stories from the first test file, then our
> > numbers would match up.
> >
> > -- james
>
> --
> -------------------------------------------------------------------
> G. Bowden Wise General Electric Company
> wisegb@crd.ge.com Corporate Research and Development
> Phone: 518 387-5175 Dial Comm: 8*833-5175 FAX: 518-387-6845
>
Skip Warnick
skip@glue.umd.edu
---559023410-851401618-904916309=:6859
Content-Type: TEXT/PLAIN; charset=US-ASCII; name="test.pl"
Content-ID: <Pine.GSO.3.95q.980904093829.6859B@oriole.umd.edu>
Content-Description:
#!/usr/local/bin/perl -w
# Usage:  test.pl <topic number>
$topicid = $ARGV[0];

#Open the index file for  topic
open (INDEX, "../tdt_deliv_980708/indexes/trk_nwt+asr_$topicid.ndx") || die "Sorry, that topic is not yet active\n";
while (<INDEX>) {
    chomp;
    if ($_ =~ /^[at]/){
        push @indexboundrys, $_;
    }
}
close (INDEX);

#Open the files you need to write to
open (SOURCE, ">./source.count") || die "no source";
open (SOURCE2, ">./boundrys.count") || die "no boundrys";
open (DOCNO, ">./docno.count") || die "no docno";

#Count the docno's
foreach $_ (@indexboundrys){
    @seperate = split /([\/\." "])/;
    $fileid = $seperate[2];
    $suffix = $seperate[4];
    $start = $seperate[-1];
    #File for counting source files for the input topic number
    print SOURCE "$_\n";
    if ($suffix =~ /^a/) {
        open (TABLE, "../tdt_deliv_980708/tables/$fileid.bnd$suffix") || die "Can't open that damned file!";
        #A check to see if all source files get opened
        #print "File opened successfully\n";
        while (<TABLE>){
            chomp;
            if (($_ !~ /^<BOUNDSET/) and ($_ !~ /^<\/BOUNDSET/)){
                print SOURCE2 "$_\n";
                @tableboundrys = split /([= " ">])/;
                $docid = $tableboundrys[4];
                $filter = $tableboundrys[2];
                $brecid = $tableboundrys[-6];
                $erecid = $tableboundrys[-2];
                $doctype = $tableboundrys[8];
                #print "$brecid $erecid\n";
                if (($filter =~ /^doc/) and ($tableboundrys[-2] !~ /\./) and ($brecid >= $start)){
                    print DOCNO "$docid $doctype\n";
                }
            }
        }
    }
    else {
        open (TABLE, "../tdt_deliv_980708/tables/$fileid.bnd$suffix") || die "Can't open that damned file!";
        #A check to see if all source files get opened
        #print "File opened successfully\n";
        while (<TABLE>){
            chomp;
            if (($_ !~ /^<BOUNDSET/) and ($_ !~ /^<\/BOUNDSET/)){
                print SOURCE2 "$_\n";
                @tableboundrys = split /([= " ">])/;
                $docid = $tableboundrys[4];
                $filter = $tableboundrys[2];
                $brecid = $tableboundrys[-6];
                $erecid = $tableboundrys[-2];
                $doctype = $tableboundrys[8];
                #print "$brecid $erecid\n";
                if (($filter =~ /^doc/) and ($tableboundrys[-2] !~ /\./) and ($brecid >= $start)){
                    print DOCNO "$docid $doctype\n";
                }
            }
        }
    }
}
---559023410-851401618-904916309=:6859--
Last updated Wed Sep 9 09:40:56 1998