To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: TDT2 Dev-Set Release
Date: Wed, 08 Jul 1998 06:45:53 EDT
Folks,
Here are the instructions for obtaining the portion of TDT2 that has
been designated as the "development test set". I hope you will find it
fairly well described in the accompanying README file (which I am
attaching below).
[ftp instructions available on request from graff@ldc.upenn.edu]
The file is 78,610,663 bytes in compressed form, and will expand to a tad
over 305 MB when uncompressed. As always, let me know if you have questions
or problems with the release.
Best regards,
Dave Graff
-----------------------------------------------------------
Top-level README file for the TDT2 Development Test Set
Initial Release
July 8, 1998
This release contains all annotated text data from the six TDT2 data
sources, spanning March and April of 1998. This portion of the overall
corpus is intended for use as development test data, and supplements the
earlier release of the training set, spanning January and February.
The following list shows the directories included in this distribution, and
the number of files contained within each directory:
    directory     # of files
    --------------------------
    asrtext/           350
    dtd/                 4
    sgml/              887
    tables/           1239
    tkntext/           887
In terms of data sources, the contents can be broken down as follows; the
next table shows the number of files per source in each of the data
directories:
    directory      ABC    CNN    PRI    VOA    NYT    APW
    -----------------------------------------------------------
    asrtext/        57    229     37     27    N/A    N/A
    sgml/          108    230     37     57    211    244
    tables/        165    459     74     84    211    244
    tkntext/       108    230     37     57    211    244
This tally accounts for all but two files in the "tables/" directory; those
two additional files are:
topic_relevance.table
topic_relevance.comments
These two files contain the results of the manual topic annotation done at
the LDC. These results have gone through a fairly thorough quality control
process, in which extra passes over the annotations have sought to resolve
disagreements among multiple annotators, to eliminate false alarms, and to
recover misses.
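As a quick cross-check of the two tables above, the per-source counts can be
summed and compared with the directory totals. The following is only a
sketch: the figures are copied from this README, treating the "N/A" entries
as zero is an assumption, and the function name is made up for
illustration.

```python
# Per-source file counts copied from the second table in this README,
# in the order ABC, CNN, PRI, VOA, NYT, APW.
PER_SOURCE = {
    "asrtext": [57, 229, 37, 27, 0, 0],   # assumption: "N/A" counted as 0
    "sgml":    [108, 230, 37, 57, 211, 244],
    "tables":  [165, 459, 74, 84, 211, 244],
    "tkntext": [108, 230, 37, 57, 211, 244],
}

# Directory totals copied from the first table.
DIR_TOTALS = {"asrtext": 350, "sgml": 887, "tables": 1239, "tkntext": 887}


def unaccounted_files():
    """Return, per directory, the listed total minus the sum over sources."""
    return {d: DIR_TOTALS[d] - sum(counts) for d, counts in PER_SOURCE.items()}


print(unaccounted_files())
# {'asrtext': 0, 'sgml': 0, 'tables': 2, 'tkntext': 0}
```

Only "tables/" shows a remainder, and it equals the two topic relevance
files named above.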
The topic relevance judgements provided here cover only the annotation of
March and April data against the target topics #38 through #66 (from topic
lists 3 and 4, which can be found on the LDC's main TDT web page, at
http://www.ldc.upenn.edu/TDT). At a later time, a more thorough set of
annotations will be released, covering both training and development sets
(January through April) and topics #1 through #66 (lists 1 through 4).
The various quantities of files in the second table above deserve some
additional comments.
The NYT and APW sets represent four discrete sample blocks drawn from each
date of the collection. The NYT set is smaller due to modem transmission
problems that caused a total loss of 7 days from the collection (3/1, 3/8,
4/15-4/19); also, the number of usable stories in NYT can fall short of the
desired 80 per day on weekends, so a few days have only three sample blocks
present instead of four.
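The NYT and APW file counts can be checked against this description with a
little arithmetic. The sketch below assumes that each sample block is
stored as one file and that the collection covers every date from March 1
through April 30; neither assumption is stated outright in this README, and
the names used are illustrative only.

```python
from datetime import date

# Number of calendar days in March and April 1998.
DAYS = (date(1998, 5, 1) - date(1998, 3, 1)).days   # 61

BLOCKS_PER_DAY = 4      # "four discrete sample blocks" per date
NYT_LOST_DAYS = 7       # 3/1, 3/8, 4/15-4/19
NYT_FILES_LISTED = 211  # from the per-source table
APW_FILES_LISTED = 244  # from the per-source table


def expected_files(lost_days=0):
    """Files expected if every remaining day had all four blocks."""
    return (DAYS - lost_days) * BLOCKS_PER_DAY


apw_expected = expected_files()                # 244
nyt_expected = expected_files(NYT_LOST_DAYS)   # 216
nyt_shortfall = nyt_expected - NYT_FILES_LISTED

print(apw_expected, nyt_expected, nyt_shortfall)   # 244 216 5
```

Under these assumptions the APW count matches exactly, and the NYT count
falls five files short of the full four-block figure, which would
correspond to five days with three blocks if each such day is missing
exactly one.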
With regard to the ABC data, the "sgml/" directory contains two versions of
text for most of the ABC broadcasts: one from closed captions (".ccap") and
the other from a commercial transcription service (".fdch"). The
segmentation and time-stamping of story units are identical in the two
versions of a given ABC broadcast, but the lexical content within each
story unit differs; typically, the "fdch" file is more complete and correct
in representing the spoken content of the broadcast. As a result, there are
separate tokenized text files (in "tkntext/") for each version. For a
handful of broadcasts, there is only closed-caption text in "sgml/". The
"asrtext/" directory contains only one file per broadcast.
Regarding VOA data, the March-April collection period contains the first
set of VOA broadcasts recorded via a satellite down-link at the LDC; while
recordings were being made from the satellite signal, we continued to
gather digital audio data via the internet from the VOA web site. Due to
different limitations in each of these methods, we have both forms of audio
data for some broadcasts, only web-gathered data for others, and only
satellite transmission data for yet another portion of the collection.
(The satellite broadcasts were stored by downsampling the digital audio
stream from an MPEG-encoded 48 kHz sample rate to 16 kHz; the web data were
digitized at VOA using "sound-blaster" style equipment at the studio with a
sample rate of 11 kHz. Only the satellite data are being used for ASR
output by Dragon Systems.)
A couple of problems arose from this dual-source situation, which only came
to light as we were in the final stages of preparing the release. First,
six of the satellite broadcasts were not sent out to Dragon Systems for
ASR output. Second, there were 18 cases where we had both forms of audio
data and Dragon produced ASR output from the satellite audio, but manual
time-stamping of the story boundaries was done using the web audio file;
because the two recordings of a given broadcast began at different times,
there was a mismatch of time stamps between the ASR output and the
manually-created transcript.
The six satellite files missing from the initial ASR runs are being
shipped to Dragon, and ASR output for them will be available soon. For the
18 files with timing misalignment, we have established that the two
digitizations are congruent, and that adjusting the time-stamps in the SGML
transcripts by a constant offset will fix the problem. This adjustment
will be completed soon, and the files will be made available.
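For readers who want to picture the constant-offset correction, here is a
minimal sketch. It is not the procedure used at the LDC: it assumes the
story boundaries have already been read out of a transcript as (begin, end)
pairs in seconds, and the function name, data layout and sample values are
all hypothetical.

```python
def shift_time_stamps(boundaries, offset_sec):
    """Add one constant offset (in seconds) to every (begin, end) pair.

    `boundaries` is a hypothetical list of story boundaries; how they are
    extracted from the SGML transcripts is outside this sketch.
    """
    shifted = []
    for begin, end in boundaries:
        new_begin = begin + offset_sec
        new_end = end + offset_sec
        if new_begin < 0:
            raise ValueError("offset moves a story before the start of audio")
        shifted.append((round(new_begin, 3), round(new_end, 3)))
    return shifted


# Made-up example: the second recording started 12.25 seconds earlier, so
# every boundary moves 12.25 seconds later.
stories = [(0.0, 62.5), (62.5, 130.0)]
print(shift_time_stamps(stories, 12.25))
# [(12.25, 74.75), (74.75, 142.25)]
```

Because the same offset is applied to both ends of every story, story
durations and the ordering of boundaries are unchanged.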
The VOA files included in the current release consist of the following:
  - broadcasts for which we have only the web-audio data (with no ASR output
    possible)
  - broadcasts for which the original manual time-stamping was done using
    the satellite audio data
Note that the names of the VOA files encode the audio source in their
"begin-time" and "end-time" portions, as indicated by these examples:
    satellite:  19980406_1600_1700_VOA_WRP, 19980410_1800_1900_VOA_TDY, etc.
    web-audio:  19980421_2100_2200_VOA_WRP, 19980421_2300_2400_VOA_TDY, etc.
The satellite recordings have hours ranging between 1600 and 1900, while
the web-audio files show 2100 through 2400; this is because the VOA web
site identifies files with respect to UTC (i.e. GMT), whereas the LDC's
local audio capture process from the satellite uses local time (EST or EDT)
to name the files.
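The naming convention just described can be turned into a rough classifier
for VOA file names. This is a sketch only: the pattern is inferred from the
four example names above, the hour ranges are the ones quoted in this
README, and the function name is an invention for illustration. Names that
fall outside both ranges are reported as "unknown" rather than guessed at.

```python
import re

# Pattern inferred from the example names: date, begin time, end time,
# source code, program code, separated by underscores.
NAME_RE = re.compile(r"^(\d{8})_(\d{4})_(\d{4})_([A-Z]+)_([A-Z]+)$")


def voa_audio_source(name):
    """Guess the audio source of a VOA file from its begin/end times."""
    m = NAME_RE.match(name)
    if m is None or m.group(4) != "VOA":
        raise ValueError("not a VOA file name: %r" % name)
    begin, end = int(m.group(2)), int(m.group(3))
    if 1600 <= begin and end <= 1900:
        return "satellite"   # named using local time (EST or EDT)
    if 2100 <= begin and end <= 2400:
        return "web-audio"   # named using UTC
    return "unknown"


for name in ("19980406_1600_1700_VOA_WRP", "19980421_2300_2400_VOA_TDY"):
    print(name, voa_audio_source(name))
# 19980406_1600_1700_VOA_WRP satellite
# 19980421_2300_2400_VOA_TDY web-audio
```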
There are some additional March and April broadcasts from ABC, CNN and PRI
(a total of 14 files) that failed to make it into the annotation stage in
time for inclusion in this release. These files will be made available
when the complete training/devtest annotation is released.
-----------
David Graff                   Linguistic Data Consortium
graff@ldc.upenn.edu           3615 Market St., Suite 200
voice: (215) 898-0887         University of Pennsylvania
fax:   (215) 573-2175         Philadelphia, PA 19104
Last updated Wed Sep 9 09:40:51 1998