(251) previous ~ index ~ next
To: "TDT distribution list" <tdt-distrib@ldc.upenn.edu>
From: Thomas C Pierce <tp26+@andrew.cmu.edu>
Subject: oopsie!
Date: Thu, 3 Dec 1998 05:43:34 -0500 (EST)
hi folks,
i have a heads-up for everyone. i'm sending this out 'en masse' so
people can (if they want) do some quick fixes locally, while we await a
standard fix.
i found some problems with the index files circulated for the test set.
for one, the files contain many duplicate lines. for example:
% grep CNN19980517.1000.0921 trk_nwt+asr_98.ndx
# Non_topic_training_story CNN19980517.1000.0921
asrtext/19980517_1000_1030_CNN_HDL.asr 2269 2299
# Non_topic_training_story CNN19980517.1000.0921
asrtext/19980517_1000_1030_CNN_HDL.asr 2269 2299
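for anyone who wants to check their local copies, here is a minimal sketch of a duplicate finder. it assumes (going only by the excerpt above) that each record is a "# <label> <story id>" line followed by one "<file> <start> <end>" line; the function names are made up for illustration and the real index format may have more to it.

```python
from collections import Counter


def parse_records(lines):
    """Yield (label, story_id, location) tuples.

    Assumed format: a '# <label> <story_id>' line followed by a single
    '<file> <start> <end>' line. This is inferred from the excerpt only.
    """
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line[1:].split()
            # Comment lines that do not look like a record header are ignored.
            header = (parts[0], parts[1]) if len(parts) >= 2 else None
        elif header is not None:
            yield (header[0], header[1], line)
            header = None


def find_duplicates(lines):
    """Return {record: count} for every record that appears more than once."""
    counts = Counter(parse_records(lines))
    return {record: n for record, n in counts.items() if n > 1}
```

to run it on a file, something like `find_duplicates(open("trk_nwt+asr_98.ndx"))` should do, since a file object iterates line by line.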
the other problem i found is a bit more serious. for event 88, all
"on-topic" training stories are also listed as off-topic training stories.
% grep APW19980506.1942 trk_nwt+asr_88.ndx
# Topic_training_story APW19980506.1942
text/19980506_2159_2347_APW_ENG.tkn 4638 5101
# Non_topic_training_story APW19980506.1942
text/19980506_2159_2347_APW_ENG.tkn 4638 5101
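a similar sketch can flag stories that carry both labels in one index file. again, it assumes the "# <label> <story id>" header lines seen in the excerpt, and the function names are hypothetical.

```python
from collections import defaultdict


def labels_by_story(lines):
    """Map each story id to the set of labels it appears under.

    Only '# <label> <story_id>' lines are read (format assumed from the
    excerpt); all other lines are skipped.
    """
    labels = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if line.startswith("#"):
            parts = line[1:].split()
            if len(parts) >= 2:
                labels[parts[1]].add(parts[0])
    return labels


def conflicting_stories(lines):
    """Return sorted story ids listed as both on-topic and off-topic training."""
    both = {"Topic_training_story", "Non_topic_training_story"}
    found = labels_by_story(lines)
    return sorted(story for story, labs in found.items() if both <= labs)
```

running `conflicting_stories(open(name))` over each index file would show whether any event other than 88 is affected.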
hopefully, it can be verified that this didn't happen anywhere else, and
that "BRIEFS" were not similarly included as off-topic training material.
-tom
Last updated Fri Dec 4 12:05:50 1998