To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: More problems with latest TDT2 data release
Date: Fri, 27 Aug 1999 12:24:52 EDT
Folks,
Jon Fiscus has pointed out to me that there are a number of problems with the
latest cdrom of the TDT2 Multilanguage Text corpus (the silver copy).
These include:
- dual entries for 12 stories in the main topic table
(topics/tdt2_topic_rel.complete_annot): for each of these 12 stories, one
"ONTOPIC" record shows "level=BRIEF" while a second entry for the same story
shows "level=YES". Listed below are the 12 entries that would be REMOVED from
the table to make it correct:
topicid=20004 level=YES docno=NYT19980406.0474
topicid=20019 level=BRIEF docno=APW19980303.1227
topicid=20019 level=BRIEF docno=APW19980312.0676
topicid=20019 level=BRIEF docno=NYT19980420.0305
topicid=20019 level=BRIEF docno=PRI19980505.2000.0138
topicid=20019 level=BRIEF docno=VOA19980203.2300.2137
topicid=20020 level=BRIEF docno=APW19980302.1654
topicid=20020 level=BRIEF docno=APW19980318.1749
topicid=20020 level=YES docno=APW19980426.0173
topicid=20042 level=BRIEF docno=ABC19980311.1830.1390
topicid=20042 level=BRIEF docno=NYT19980312.0336
topicid=20042 level=BRIEF docno=PRI19980312.2000.0925
- there are 9 "mttkn_bnd" files for XIN_MAN data that each contain an
anomalous entry -- a NEWS story in which Brecid (first word) is greater than
Erecid (last word); these are actually cases where Systran produced no output
for a story, and the bogus table entry was due to a programming error. None
of the stories involved are actually tabular listings of some sort, which I
will eliminate from the corpus.
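Both of the checks above are easy to script. What follows is only a minimal Python sketch, not the code actually used to build or verify the release. It assumes that each table record sits on one line and carries whitespace-separated key=value fields (topicid=, level=, docno=, Brecid=, Erecid=); the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: each record is one line of key=value fields; values may be
# double-quoted, and the line may end with a closing ">".
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_fields(line):
    """Return the key=value fields found on one table line as a dict."""
    return {key: value.rstrip(">").strip('"') for key, value in FIELD.findall(line)}


def conflicting_levels(lines):
    """Find (topicid, docno) pairs that are listed with more than one level."""
    levels = defaultdict(set)
    for line in lines:
        fields = parse_fields(line)
        if {"topicid", "level", "docno"} <= fields.keys():
            levels[(fields["topicid"], fields["docno"])].add(fields["level"])
    return {pair: found for pair, found in levels.items() if len(found) > 1}


def inverted_spans(lines):
    """Return the docno of every boundary record whose Brecid exceeds its Erecid."""
    bad = []
    for line in lines:
        fields = parse_fields(line)
        if "Brecid" in fields and "Erecid" in fields:
            if int(fields["Brecid"]) > int(fields["Erecid"]):
                bad.append(fields.get("docno"))
    return bad
```

Running conflicting_levels over the lines of the topic table, and inverted_spans over the lines of each boundary table, should list the records that need attention; deciding which of two conflicting entries to drop still has to be done by hand.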
In addition to those two problems, I have since done a more careful check of
all boundary tables, and found another problem that is more widespread. Each
boundary table is supposed to account for all word tokens in the corresponding
token stream file -- that is, the union of all "Brecid..Erecid" spans from a
boundary table should include all the tokens identified in the token stream.
In quite a few cases of ASR and Systran token streams, this is not the case,
and some tokens are left out of the boundary tables.
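The coverage requirement can be checked roughly as follows. This is a sketch under stated assumptions, not the check I actually ran: it assumes the "Brecid..Erecid" spans and the token recids have already been read in as integers, and the function name is hypothetical.

```python
def uncovered_tokens(spans, token_recids):
    """Return the token recids that fall outside every Brecid..Erecid span.

    spans        -- iterable of (brecid, erecid) integer pairs, both inclusive
    token_recids -- iterable of integer recids taken from the token stream
    """
    covered = set()
    for brecid, erecid in spans:
        # Inverted spans (brecid > erecid) cover nothing and are skipped here.
        if brecid <= erecid:
            covered.update(range(brecid, erecid + 1))
    return sorted(set(token_recids) - covered)
```

An empty result means the boundary table accounts for every token in the stream; any recids returned are tokens that no story span includes.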
I will need to create yet another release, and will let you know when it will
be shipped.
Dave G.
Last updated Thu Sep 2 18:19:19 1999