From: David Graff <email@example.com>
Subject: RE-RELEASE OF TDT2 TRAINING DATA: major bug fixes!
Date: Fri, 22 May 1998 03:55:15 EDT
Hubert Jin was the first to point out a serious flaw in the topic
relevance table that came with the previous release of TDT2 training
data (which I announced on May 16); and when I fixed that problem, I
learned there was also a serious flaw in sgml files (and associated
token stream files) for NYT February data.
I have prepared a complete REPLACEMENT for the training set release,
in which both problems have been fixed. Please DO NOT USE the data
we posted on May 16 (the "nominal" release date was 980515).
Here are instructions for retrieving the repaired version of the
training set, followed by the extra README file that I included in
the release, describing the problems in more detail.
I'm sorry about the inconvenience -- this was the combined result of
a simple programming error (producing a bad topic table) and a mixup
in our logs of the newswire sampling process (causing the released
story set to be different from the annotated story set).
[ftp instructions available on request from firstname.lastname@example.org]
May 22, 1998 UPDATE to the first full training set release of TDT2
This release REPLACES (supersedes) the delivery of TDT2 training data
made available to participants on May 15, 1998.
A programming error involved in creating the topic-relevance.table
file caused a significant number of entries to be missing from this
table. In effect, the relevance information for more than half the
target topics in the training set, including all topics defined over
the month of February, was inadvertently left out of the table.
In addition, when this error was fixed, and all the topic-story
relations were properly included in the table, we also discovered that
many of the data files from the February NYT text collection had been
produced with incorrect content: altogether, of the roughly 1900
stories present in the NYT February files that were released on May
15, only about 800 stories had actually been used in topic annotation,
while an additional 1200 stories that had been annotated were not
included in the release.
In other words, there had been a mismatch between the set of stories
covered by LDC annotators, and the set of stories released to
researchers, such that more than half of the stories in each set did
not match up. This affected only the February collection of NYT
data. The material being delivered here includes exactly the set of
stories that was annotated.
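
A mismatch of this kind can be checked by comparing the document
identifiers in the two story sets. The following is a minimal Python
sketch of that comparison, not the procedure LDC actually used; the
function name and the sample identifiers are hypothetical.

```python
def compare_story_sets(annotated, released):
    """Return (only_annotated, only_released, common) as sorted lists."""
    annotated, released = set(annotated), set(released)
    return (
        sorted(annotated - released),  # annotated but missing from the release
        sorted(released - annotated),  # released but never annotated
        sorted(annotated & released),  # present in both sets
    )


# Hypothetical document IDs, for illustration only.
annotated_ids = ["doc-a", "doc-b", "doc-c"]
released_ids = ["doc-b", "doc-c", "doc-d"]

missing, unannotated, matched = compare_story_sets(annotated_ids, released_ids)
print("annotated but not released:", missing)
print("released but not annotated:", unannotated)
print("in both sets:", matched)
```

If the first two lists are both empty, the released set is exactly the
annotated set.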
To avoid further confusion, the current release is intended to FULLY
REPLACE the release of May 15. It contains all the other materials of
the previous release not affected by the corrections described above,
including the earlier README file that describes the full content in
greater detail -- all the information in that file is still current
(though now there is a bit more truth to the closing statements in
section (4): "Care has been taken...").
We apologize for any difficulties and lost effort that you may have
suffered as a result of these problems in the May 15 release.
We wish to extend our deepest thanks to Hubert Jin of BBN, for
bringing these problems to our attention.
One additional change has been made to the topic-relevance.table file:
the earlier version used values "FULL" and "BRIEF" for the "level="
attribute; the current version uses the values "YES" and "BRIEF".
Below is a histogram that shows the total number of times each topic
appears in the table (as either YES or BRIEF):
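
For readers who want to recompute such counts from their own copy of
the table, here is a minimal Python sketch. It assumes each entry is a
single line carrying "topicid=" and "level=" attributes; only the
"level=" attribute and its values are confirmed by this note, so the
"topicid" name, the regular expressions, and the sample lines are
assumptions to adjust against the actual file.

```python
import re
from collections import Counter

# Assumed attribute syntax: name=value, optionally double-quoted.
TOPIC_RE = re.compile(r'topicid="?(\w+)"?')
LEVEL_RE = re.compile(r'level="?(\w+)"?')


def topic_histogram(lines):
    """Count table entries per topic, for level YES or BRIEF."""
    counts = Counter()
    for line in lines:
        topic = TOPIC_RE.search(line)
        level = LEVEL_RE.search(line)
        if not topic or not level:
            continue  # skip lines that are not table entries
        value = level.group(1).upper()
        if value == "FULL":
            value = "YES"  # the earlier release used FULL where this one uses YES
        if value in ("YES", "BRIEF"):
            counts[topic.group(1)] += 1
    return counts


# Hypothetical entries, for illustration only.
sample = [
    "<ONTOPIC topicid=20001 level=YES docno=doc-a>",
    "<ONTOPIC topicid=20001 level=BRIEF docno=doc-b>",
    "<ONTOPIC topicid=20002 level=FULL docno=doc-c>",
    "a line with no attributes",
]

hist = topic_histogram(sample)
for topic, n in sorted(hist.items()):
    print(f"{topic} {n:4d} {'*' * n}")
```

To run it on a real file, pass the open file object in place of the
sample list, e.g. topic_histogram(open("topic-relevance.table")).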
Last updated Wed Sep 9 09:40:50 1998