
To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: New release of TDT1 (Pilot) Corpus
Date: Thu, 03 Dec 1998 11:29:17 EST

Folks,

I realize this may not be the best time for some of you to attend to
this matter, and I apologize for the fact that it took so long to get
this ready.

We have created a modified version of the TDT1 corpus, in which the
file and data formats have been adapted to match those of TDT2. This
means that you can now process the TDT1 data using the tools and
protocols that have been established for TDT2.

We are assigning the name "TDT1.5" to this new version of the TDT
Pilot Corpus, and it is now available via ftp in the usual manner:

[ftp instructions available on request from graff@ldc.upenn.edu]

The file size is 54250864 bytes (54 MB), and it occupies 200 MB when
uncompressed and unpacked. If you would prefer to receive this on
CD-ROM, please send me a request, and I'll see that it gets to you.

Attached below is the top-level README file that comes with the
corpus. Please let me know if you find any problems with this
version of the Pilot.

Dave Graff


------------------------------------------------------------

Main Readme File for TDT1.5 : The TDT2-style Release of TDT1
============================================================

This corpus is a reformatted version of the TDT1 (a.k.a. TDT Pilot)
Text Corpus, in which the file structure and markup format have been
adapted to match those of the TDT2 corpus. This will allow the TDT1
data to be handled by the same processes, and according to the same
test regimen, that apply to TDT2. In what follows, the term "TDT1"
will refer to the Pilot Corpus data, as opposed to "TDT2"; we will use
the term "TDT1.0" to refer to the original release, as formatted by
James Allen at the University of Massachusetts, and we will use
"TDT1.5" to refer to the current re-release of that data using the
TDT2 structure and formatting.

TDT1 comprises two sources of text: a variety of CNN news program
transcripts, and the Reuters North American general news service.
It contains a total of 15863 stories that appeared in these sources
between July 1994 and June 1995 (inclusive); each story was manually
checked for relevance to 25 different topics that were reported on
during that period.

The differences between TDT1.0 and TDT1.5 are as follows:

(1)
- TDT1.0 had all text data in a full SGML markup format.

- TDT1.5 has this SGML format (with minor changes in the tag
inventory), plus a tokenized version of the text, in which all the
descriptive SGML tags and peripheral information about stories have
been removed; there is also a corresponding set of boundary tables
that map story boundaries to the tokenized text in terms of sequential
numbers assigned to word tokens.

(2)
- TDT1.0 had all text data in a single file, with all stories from
the two sources arranged in chronological order; within a set of
stories having the same date, stories from the two sources were
interleaved at irregular intervals to represent a time-sequential
ordering.

- TDT1.5 has the stories divided into separate data files according
to date and source; since the collection has data from both sources
for each day in a one-year period, the stories are divided into 760
data files (one file per day per source).

(3)
- TDT1.0 used both a "source" document-id number (enclosed in a
"<DOCID>" SGML tag) and a corpus-internal "tdt-id" number (enclosed in
a "<TDTID>" SGML tag) to identify each story; the format of the
"source" number varied slightly between Reuters and CNN data, while
the "tdt-id" number was consistent for all stories and represented the
established chronological sequence of each story (starting at
"TDT000001" for the first story on the first day of the collection,
and ending at "TDT015863" for the last story on the last day).

- TDT1.5 assigns a "DOCNO" to each story (enclosed in a "<DOCNO>"
SGML tag), which reflects the date of collection, the source, and the
sequential order of the story in the original (interleaved) series of
stories for that date (e.g. "CNN19940701.0001" and "REU19940701.0042"
are the first and last stories from the first date in the collection).
Note that the SGML files in TDT1.5 do not contain the "<DOCID>" or
"<TDTID>" tags used in TDT1.0.

(4)
- In TDT1.0 the topic relevance judgements were presented in a file
named "tdt-corpus.judgments", with comment lines marking off each of
the 25 topics covered; each record in the table consisted of four
unlabeled fields: topic number, tdt-id number, story-id number, and
level of relevance ("YES" or "BRIEF").

- In TDT1.5, this information is in "tables/topic-relevance.table",
and is structured as an SGML-tagged file; each record in the file is
an SGML tag containing attributes for topic-id, relevance level,
docno, and fileid (the data file name where the story is found).
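
As an illustration of the DOCNO convention described in item (3), here
is a minimal Python sketch that splits a DOCNO into its source, date,
and sequence number. The pattern is an assumption inferred from the two
examples given above (a three-letter source code, an eight-digit date,
a dot, and a four-digit sequence number); the function name
"parse_docno" is hypothetical and is not part of the corpus tools.

```python
import re
from datetime import date

# Pattern assumed from the README's two examples:
# three-letter source code, YYYYMMDD date, a dot, four-digit sequence number.
DOCNO_RE = re.compile(r"^([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})$")


def parse_docno(docno):
    """Split a DOCNO into (source, collection date, sequence number)."""
    m = DOCNO_RE.match(docno.strip())
    if m is None:
        raise ValueError(f"unrecognized DOCNO: {docno!r}")
    source, year, month, day, seq = m.groups()
    return source, date(int(year), int(month), int(day)), int(seq)
```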


Note that for a given day, the original time-sequential ordering of
stories in the single interleaved text stream of TDT1.0 is NOT
directly preserved in the two data files for that date in TDT1.5. To
determine the original sequencing of all stories using the files in
TDT1.5, it is necessary to compare the "DATE_TIME" tags, and/or
the sequence-number portion of the DOCNO fields, in all stories from
the two files for that date. For example, here are the contents of
the DOCNO and DATE_TIME fields for stories in the two files for July
1, 1994 (19940701):

  CNN19940701.0001 07/01/94 00:00    REU19940701.0002 07/01/94 00:57
  CNN19940701.0004 07/01/94 02:37    REU19940701.0003 07/01/94 02:36
  CNN19940701.0005 07/01/94 02:38    REU19940701.0012 07/01/94 07:09
  CNN19940701.0006 07/01/94 02:39    REU19940701.0013 07/01/94 09:36
  CNN19940701.0007 07/01/94 02:40    REU19940701.0014 07/01/94 10:07
  CNN19940701.0008 07/01/94 02:41    REU19940701.0020 07/01/94 12:44
  CNN19940701.0009 07/01/94 02:42    REU19940701.0021 07/01/94 13:08
  CNN19940701.0010 07/01/94 02:43    REU19940701.0022 07/01/94 13:33
  CNN19940701.0011 07/01/94 02:44    REU19940701.0023 07/01/94 13:55
  CNN19940701.0015 07/01/94 10:08    REU19940701.0024 07/01/94 16:04
  CNN19940701.0016 07/01/94 10:09    REU19940701.0028 07/01/94 16:38
  CNN19940701.0017 07/01/94 10:10    REU19940701.0029 07/01/94 16:44
  CNN19940701.0018 07/01/94 10:11    REU19940701.0030 07/01/94 16:52
  CNN19940701.0019 07/01/94 10:12    REU19940701.0031 07/01/94 17:15
  CNN19940701.0025 07/01/94 16:05    REU19940701.0032 07/01/94 17:52
  CNN19940701.0026 07/01/94 16:06    REU19940701.0035 07/01/94 19:12
  CNN19940701.0027 07/01/94 16:07    REU19940701.0036 07/01/94 19:22
  CNN19940701.0033 07/01/94 17:53    REU19940701.0042 07/01/94 22:59
  CNN19940701.0034 07/01/94 17:54
  CNN19940701.0037 07/01/94 19:23
  CNN19940701.0038 07/01/94 19:24
  CNN19940701.0039 07/01/94 19:25
  CNN19940701.0040 07/01/94 19:26
  CNN19940701.0041 07/01/94 19:27

In the lists above, note for example that the first and second stories
in the Reuters file were originally ordered between the first and
second stories from CNN on this date.
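
The original interleaving for a given date can be recovered with a
short script. The following Python sketch merges the DOCNO lists from
the two files and sorts on the sequence-number portion after the dot.
It assumes that all DOCNOs passed in share the same date; the function
name "original_order" is hypothetical, and the sample lists are just
the first three entries from each column above.

```python
def original_order(*docno_lists):
    """Merge per-source DOCNO lists into the original interleaved order.

    Sorts on the sequence number after the dot, which reflects each
    story's position in the original series of stories for that date.
    Assumes every DOCNO passed in belongs to the same date.
    """
    merged = [docno for docnos in docno_lists for docno in docnos]
    return sorted(merged, key=lambda docno: int(docno.rsplit(".", 1)[1]))


# First three entries from each column of the July 1, 1994 listing.
cnn = ["CNN19940701.0001", "CNN19940701.0004", "CNN19940701.0005"]
reu = ["REU19940701.0002", "REU19940701.0003", "REU19940701.0012"]

for docno in original_order(cnn, reu):
    print(docno)
```

Sorting on DATE_TIME instead would be an alternative where the
timestamps are distinct; the sequence number is used here because it
is unique within a date.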

For convenience, we have included a table in the top-level directory
that relates the original TDTID and DOCID values to the new DOCNO
values, along with the TDT1.5 file names and the putative DATE_TIME
fields for all stories. This table is in the file "old2new.table".


Differences between TDT1.5 and TDT2:

Although this release matches the data format specifications of TDT2,
there are some features of the TDT2 data collection that are not
shared by TDT1. These differences are summarized below.

(A) In TDT2, each data file represents a specific period of time
during which the file contents were collected from the given source
(and the time period for the sample is reflected in the file name);
for broadcast sources, the time period is either 30 or 60 minutes,
while for newswire sources, the period could range from, say, 25
minutes to several hours, depending on how much time elapsed before
the desired number of stories were received for a given data sample.
In TDT1.5, each file contains ALL the stories for a given source and
date; the period of time covered by each sample file is effectively
the full 24-hour day.

To put this another way, whereas TDT2 contains up to 16 files per day
that can be ordered chronologically, TDT1.5 contains only two files
per day, and there is no true chronological ordering between these two
files.

(B) There are two sources of text data for the broadcast material in
TDT2: one is a human-produced "reference" text transcription, and the
other is the output of an automatic speech recognition (ASR) system
using the broadcast audio signal as input. In TDT1, there is no ASR
output for the CNN data, only a human-produced reference transcript.

(C) The following SGML tags, which are used in the TDT2 corpus, do not
occur in TDT1.0 or TDT1.5:

HEADER
TRAILER

The following SGML tags, which are NOT used in TDT2, were used in
TDT1.0 and have been retained in TDT1.5:

DATELINE
AUTHOR
SOURCE
SUBJECT
TITLE
COPYRIGHT
SUMMARY
TOPIC
B
I

(D) The "<B>" and "<I>" tags mentioned above appear to reflect
type-setting or font-selection information (i.e. "bold" and "italic");
there are also annotations in the TDT1 SGML data (in CNN stories) that
are enclosed in square brackets (e.g. "[sp?]", "[commercial break]",
etc.). Annotations of this sort are not present in TDT2, and
tokenization of TDT2 data was fairly simple:

- eliminate all material not enclosed within <TEXT> tags
- eliminate all text data enclosed within <ANNOTATION> tags
- eliminate all remaining SGML tags from within the <TEXT> content
(e.g. "<TURN>" tags)
- tokenize the remaining text data by splitting on white space

In order to make the tokenization of TDT1 comparable, the following
steps were added prior to splitting the text data into tokens:

- eliminate all text enclosed within <B> tags
- eliminate <I> tags, but retain any text they enclose
- eliminate all text enclosed between square brackets
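
The combined steps can be approximated as in the Python sketch below.
This is not the script that was used to build the corpus; it is an
illustration under several assumptions: tags are upper-case, <TEXT>
carries no attributes, elements are not nested inside themselves, and
each removed tag is replaced by a space before splitting. The function
name "tokenize_story" and the sample story are hypothetical.

```python
import re


def tokenize_story(sgml):
    """Approximate the listed tokenization steps for one story's SGML."""
    # Keep only material enclosed within <TEXT>...</TEXT>.
    text = " ".join(re.findall(r"<TEXT>(.*?)</TEXT>", sgml, flags=re.S))
    # Drop <ANNOTATION> and <B> elements together with their contents.
    text = re.sub(r"<(ANNOTATION|B)\b[^>]*>.*?</\1>", " ", text, flags=re.S)
    # Drop text enclosed between square brackets, e.g. "[sp?]".
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Strip remaining tags (e.g. <I>, <TURN>) but keep the text they enclose.
    text = re.sub(r"<[^>]+>", " ", text)
    # Tokenize by splitting on white space.
    return text.split()


# Hypothetical story fragment, made up for illustration only.
sample = (
    "<DOC><DOCNO> CNN19940701.0001 </DOCNO>"
    "<TEXT><B>ANCHOR</B> Good <I>evening</I> [sp?] everyone.</TEXT></DOC>"
)
print(tokenize_story(sample))
```

Note that the <B> elements must be removed before the generic tag
stripping, since otherwise the text they enclose would be retained.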

For additional details on corpus format and structure, please consult
the LDC web pages for the TDT project: http://www.ldc.upenn.edu/TDT/.

David Graff, graff@ldc.upenn.edu
Linguistic Data Consortium
December, 1998



Last updated Fri Dec 4 12:05:50 1998