
To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Re: Topic relevance error
Date: Wed, 15 Jul 1998 18:42:55 EDT

Those four stories listed by James Allan as false alarms on Topic 2
(Lewinsky) appear to have been due to some sort of "transmission
error" that took place here after the initial annotation was done --
the original annotator labeled these four stories as hits on Topic 1
(Asian economic crisis), which is basically correct.

I'll restate for clarity: the four stories cited by James should be
listed as "YES" on Topic 1 rather than on Topic 2:

topic    docid                        judgement

2 -> 1   TDT00300 APW19980105.0021    YES
2 -> 1   TDT00332 APW19980105.0549    YES
2 -> 1   TDT00333 APW19980105.0550    YES
2 -> 1   TDT00375 APW19980105.0808    YES
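
For anyone who wants to patch a local copy before the fixed release is
out, the correction above amounts to roughly the following sketch. It
assumes the judgements have already been loaded as (topic, docno,
judgement) tuples; that layout and the name `relabel` are illustrative
only and do not describe the actual format of the topic_relevance table.

```python
# Docnos of the four stories that were mislabeled as Topic 2.
MISLABELED = {
    "APW19980105.0021",
    "APW19980105.0549",
    "APW19980105.0550",
    "APW19980105.0808",
}

def relabel(rows, wrong_topic=2, right_topic=1):
    """Return a copy of rows with the mislabeled stories moved to right_topic.

    rows is an iterable of (topic, docno, judgement) tuples; this layout is
    an assumption, not the real table format.
    """
    fixed = []
    for topic, docno, judgement in rows:
        if topic == wrong_topic and docno in MISLABELED:
            topic = right_topic
        fixed.append((topic, docno, judgement))
    return fixed
```

Rows for other stories, and rows for these stories under any other
topic, pass through unchanged.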

We haven't tracked down just how this corruption occurred, or whether
the cause (whatever it was) may have had a wider effect beyond these
four stories, but we will be doing a thorough pass of QC over the
topic labels from the training-set release, on a par with what was
done for the devtest release sent out last week. We hope to have a
fixed version of text and table files for the training set available
prior to the PI meeting at BBN.

In addition to fixing the topic label errors listed by James (and any
other topic errors we find), there will also be fixes to some basic
formatting errors, and inclusion of some Jan/Feb material that had
not made it into the earlier release. To summarize the issues known
about so far:

(1) the topic_relevance table had "nyt" instead of "NYT" in the
January "docno" values.

(2) the asrtext files spanning Feb 3-15 had incorrect values for the
"fileid" field.

(3) the asrtext files for Feb 17 were mistakenly left out.
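
As a stopgap for issue (1), a local workaround could upper-case the
source prefix of each docno before matching it against the text files.
This is only a sketch: it assumes every docno starts with a run of
letters followed directly by the date digits, as in the APW docnos
quoted earlier, and `normalize_docno` is a made-up helper name.

```python
import re

# Assumed docno shape: a source prefix of letters followed by date digits,
# as in "APW19980105.0021".
_PREFIX = re.compile(r"^([A-Za-z]+)(?=\d)")

def normalize_docno(docno):
    """Upper-case the leading source prefix, leaving the rest untouched."""
    return _PREFIX.sub(lambda m: m.group(1).upper(), docno)
```

Docnos whose prefix is already upper case come back unchanged, so the
function can be applied to the whole table.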

While I'm at it, I should point out a known problem in the devtest
release, which was not mentioned in the accompanying readme file:

The topic_relevance.table in the 980708 devtest release includes
references to 24 stories in VOA programs for which the text data was
not present in the release. Here is a list of the docnos for stories
that are listed in the topic table but absent from the text
collection in that release:

VOA19980302.2300.0159
VOA19980303.2300.0142
VOA19980303.2300.2059
VOA19980304.2100.0701
VOA19980309.2100.1172
VOA19980310.2100.1516
VOA19980310.2100.3239
VOA19980310.2300.0251
VOA19980310.2300.2738
VOA19980312.2100.1042
VOA19980312.2100.2590
VOA19980312.2100.2825
VOA19980312.2300.1077
VOA19980312.2300.2214
VOA19980313.2100.2835
VOA19980313.2100.3011
VOA19980330.2100.1475
VOA19980330.2100.2066
VOA19980331.2100.1654
VOA19980331.2100.1830
VOA19980331.2100.2730
VOA19980401.2100.3268
VOA19980402.2100.1402
VOA19980402.2100.3222

The 13 sample files that contain these stories (together with another
5 files that contained no relevant stories) will be ready for release
in the next day or so.
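
To see which files the missing stories fall in, the docnos can be
grouped by everything before the final dot. This is a rough sketch that
assumes the leading "VOAyyyymmdd.hhmm" portion of a docno identifies the
sample file; `group_by_file` is a hypothetical helper, not part of any
release tooling.

```python
from collections import defaultdict

def group_by_file(docnos):
    """Group story docnos by the program file they appear to belong to.

    Assumes the file is identified by everything before the last dot,
    e.g. "VOA19980302.2300" for "VOA19980302.2300.0159".
    """
    files = defaultdict(list)
    for docno in docnos:
        file_id, _, _story = docno.rpartition(".")
        files[file_id].append(docno)
    return dict(files)
```

Applied to the 24 docnos listed above, this gives 13 distinct file
prefixes, which agrees with the number of sample files mentioned.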

Dave Graff

Last updated Wed Sep 9 09:40:52 1998