To: tdt-distrib@ldc.upenn.edu
From: Christopher Cieri <ccieri@ldc.upenn.edu>
Subject: TDT2 Corpus Updates
Date: Fri, 11 Sep 1998 21:14:22 -0400
TDT Folk,
As Charles mentioned, here is a lengthy update on LDC activities with
regard to the TDT2 corpus since the Cambridge meeting. As you know, LDC
is committed to working closely with the research community to develop
what will be a unique and very useful corpus. Your feedback continues to
be extremely beneficial.
Following George's approval, topic list 5 went into production on 9/1.
Most of May and June have been annotated against topic list 5. Much of
the off-diagonal (not critical path) material prior to the months/topics
reserved for the evaluation has also been annotated.
At Cambridge, we discussed using the sites' research results to check
for human errors. We have been working with bug reports provided to us
over the past weeks as well as copies of the results you all submitted
to NIST. We will report on those at the IBM meeting.
Following Grace Crowder's identification of segmentation errors in the
April data, Dave Graff wrote a program to identify other instances of
such errors. He identified 110 suspect files, 42 of which had the
problem (in some instances, more than once). All 54 of the errors have
been fixed and Segmentation staff have been instructed to be watchful
for the category of error Grace identified. We have also instituted a
change in the interface used for 2nd pass segmentation. Segmentation
staff now check each boundary in the table by listening to 10 seconds
before and 10 seconds after the boundary and confirming that the
boundary type is correct. We will continue to look for other
improvements that would focus the segmentation staff's attention on
possible sources of error.
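
As a rough illustration of the 2nd pass check described above, the listening window around a boundary could be computed as below. This is only a sketch: the function name, its parameters, and the clamping to the start and end of the file are assumptions made for illustration, not a description of the actual segmentation interface.

```python
def review_window(boundary_s, pad_s=10.0, duration_s=None):
    """Return (start, end) in seconds of the audio to audition for one boundary.

    boundary_s: boundary time in seconds from the start of the file.
    pad_s: seconds played before and after the boundary (10 in the text above).
    duration_s: optional file length, used to clamp the end (an assumption).
    """
    start = max(0.0, boundary_s - pad_s)   # never start before the file begins
    end = boundary_s + pad_s
    if duration_s is not None:
        end = min(end, duration_s)         # never run past the end of the file
    return start, end


# Hypothetical boundary 125 seconds into a broadcast file.
window = review_window(125.0)
```

A boundary near the start or end of a file simply yields a shorter window.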
There have been requests for tools for browsing TDT2 corpus
releases. Our Zhibiao Wu has implemented a tool -- still in Beta -- that
displays TDT2 SGML files in alignment with the corresponding NIST SPH
audio files and optionally the ASR and tokenized text files plus the
appropriate portion of the topic relevance tables. An important concept
in the creation of this tool was to provide direct access to all the
components of the *released* version of the corpus. LDC currently keeps
a copy of each release on-line. The tool is a WWW front end to our copy
of the release data. The user selects a release version and source. The
program responds with a list of files (in this case broadcast episodes
or the sum of newswire stories we collect in a day). The user then
selects a file and sets some parameters for the next step. The program
then scans the file and provides a list of stories, indicating whether
any are on a particular topic. For each story, the user may view the
SGML, ASR, and token files and/or listen to the entire audio segment or
just a few seconds of the audio around the boundary. ZB finished the
first version on Friday. The program is still very new and testing is
ongoing. However, if you'd like to get an early look, the URL is:
http://www.ldc.upenn.edu/cgi-bin/tdt/webview. We would be interested in
your feedback.
As you know, LDC has also installed and configured Jitterbug software,
originally developed as a bug reporting facility for the SAMBA project.
The URL is: http://www.ldc.upenn.edu/cgi-bin/bugreport/TDT.
At the Cambridge meeting and subsequently, we have had some discussions
about the time spent in annotation and whether one could make better use
of title information and skimming techniques to improve throughput. Nii
has identified cases in which the titles of AP and PRI stories are
misleading. The cause in the case of AP is unknown; however, it may be
related to transmission or formatting errors. In the case of PRI, the
captioning services generate a title based upon the first sentence of a
story. That practice sometimes leads to misleading titles. Topic
Labelling staff have been instructed not to rely exclusively on story
titles and to be especially careful with AP and PRI titles. The process
they do use varies with the source. They read short stories completely.
They read the first paragraphs of long newswire stories and skim the
remainder. Headline information is variably useful depending upon the
source. Given the occurrence of unreliable titles in two of the sources,
none of the annotation staff felt comfortable using titles exclusively
to annotate an article. However, in sources such as the New York Times
newswire, annotation staff did combine the headline with a thorough
reading of the first paragraphs and a skimming of the remaining
paragraphs to label the story.
There has been some concern that labelling newswire data absorbs
disproportionately large amounts of time. Upon closer inspection, it
seems that the additional time required to annotate newswire files is
more a function of the number of stories in a newswire file than the
actual time required to annotate one newswire story. TV sources require
on average 0.50 hours to annotate a file of 16-20 stories; that's 32-40
stories per hour of annotation effort. Radio sources require on average
0.75 hours to annotate a file of 30 stories; that's 40 stories per hour
of annotation effort. Newswire sources require on average 2.00 hours to
annotate a file of 80 stories; that's 40 stories per hour of annotation
effort.
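
The per-source rates above follow directly from dividing stories per file by hours per file. A short sketch of that arithmetic, using only the figures quoted in this message (the dictionary layout and function name are illustrative assumptions):

```python
# (hours per file, (min stories per file, max stories per file)),
# taken from the averages quoted in the paragraph above.
SOURCES = {
    "TV": (0.50, (16, 20)),
    "Radio": (0.75, (30, 30)),
    "Newswire": (2.00, (80, 80)),
}


def stories_per_hour(hours_per_file, stories_per_file):
    """Annotation throughput for one file, in stories per hour."""
    return stories_per_file / hours_per_file


# Low and high throughput per source, e.g. rates["TV"] is (low, high).
rates = {
    name: tuple(stories_per_hour(hours, n) for n in story_range)
    for name, (hours, story_range) in SOURCES.items()
}

for name, (low, high) in rates.items():
    print(f"{name}: {low:.0f}-{high:.0f} stories per hour")
```

On these figures the per-story cost of newswire is no higher than that of the broadcast sources; a newswire file takes longer only because it holds more stories.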
We will certainly want to discuss the remaining releases at the Dry Run
meeting. In Gaithersburg, we proposed delivering the Evaluation Set in
October and the Reference Corpus -- which we defined as the cross
product of all months and topic lists -- at the end of the calendar
year. In Cambridge, there were requests for additional data sooner, which
we'd like to address. We propose the following schedule, with explanations
below:
Week of 9/28         Maintenance Release - bug fixes and dropouts
                     since recovered for Training and Dev/Test releases
Week of 10/30        Evaluation Release to NIST - May and June data
                     annotated for part of topic list 4 and all of 5
After Eval release   Pre-Evaluation Release - January through April
                     annotated for topic lists 1, 2, 3 and part of 4
End of 1998          December Reference Corpus - January through
                     June annotated for topic lists 1 through 5
Note that although the annotation for the Pre-Evaluation Release is
nearly done, the quality checks will consume a substantial amount of
time. The additional off-diagonal material amounts to twice the amount
of material in the Dev/Test release. It is unlikely that quality checks
will be done for all of this material until sometime in mid to late
November. George has suggested an alternative approach -- to order
quality checks on March and April data versus topic lists 1 and 2 before
checking January and February data versus topic list 3 and part of 4.
That would ensure that the material ready to be released by late
October/early November would be the most useful to the community since
there is a greater probability of hits with that combination.
The entire Reference Corpus, including the material to be used for the
evaluation, should be ready by the end of the calendar year though LDC
will need to hold publication until after evaluation results are
reported.
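
To make the "cross product of all months and topic lists" concrete, here is a small sketch that enumerates the month/topic-list cells and groups them by the release in which they would first appear under the proposed schedule. The names are illustrative, and the grouping is a simplification: topic list 4 is only partly covered in the Evaluation and Pre-Evaluation releases, which this sketch does not model.

```python
from itertools import product

MONTHS = ["January", "February", "March", "April", "May", "June"]
TOPIC_LISTS = [1, 2, 3, 4, 5]
EVAL_MONTHS = {"May", "June"}

# The Reference Corpus as defined above: every month against every topic list.
reference_cells = list(product(MONTHS, TOPIC_LISTS))


def first_release(month, topic_list):
    """Earliest proposed release containing a cell (simplified assumption)."""
    if month in EVAL_MONTHS and topic_list in (4, 5):
        return "Evaluation"
    if month not in EVAL_MONTHS and topic_list in (1, 2, 3, 4):
        return "Pre-Evaluation"
    return "Reference"  # remaining cells arrive only with the full corpus


# Count how many cells first appear in each release.
counts = {}
for month, topic_list in reference_cells:
    name = first_release(month, topic_list)
    counts[name] = counts.get(name, 0) + 1
```

With six months and five topic lists there are 30 cells in total, so the cells not covered by the two earlier releases are delivered only with the end-of-year Reference Corpus.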
I look forward to our next meeting in NY.
Best wishes,
Chris
Last updated Fri Oct 2 19:04:20 1998