To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Documentation on revisions to the TDT2 Text Corpus
Date: Thu, 19 Aug 1999 15:56:26 EDT
The following text is being provided as a documentation file
("tdt2mods.doc") on the cdrom that contains the latest version of the
TDT2 Text Corpus.
DESCRIPTION OF CHANGES IN TDT2 VERSION 3
========================================
The following sections describe how the current release of TDT2 data
differs from earlier releases. The changes involve restructuring of
the corpus directories, slight modifications to the designation of
topic-ids and to some file formats, and a variety of bug fixes.
Summary of TDT2 release history:
--------------------------------
Version 1: This was the form of the corpus that was used in the 1998
TDT2 benchmark tests, consisting of six English news sources
annotated against 100 target topics (of which only 96 topics
yielded on-topic "hits" in the text collection); training and
development test data were released in October 1998, and
evaluation test data were released in December 1998.
Version 2: This was the form of the corpus made available for the
first dry-run test for TDT3 benchmark participants, consisting
of six English news sources and three Mandarin news sources; the
Mandarin sources were annotated against 20 target topics
selected from the original 96, such that each topic had at
least four on-topic stories in each language. The full
six-month, nine-source collection was designated as training
and development test data, and released by NIST, June 6, 1999.
Version 3: This is the current release, comprising the same sources
and target topics as Version 2, plus an additional 97 new
topics that have been partially annotated against the English
sources, primarily for purposes of the "First Story Detection"
research task in the TDT3 Evaluation Plan.
Differences between Version 2 and Version 3:
--------------------------------------------
1. Directory structure and file names
Version 2 was organized into the following data directories, and the
file name extensions applied to the directory contents were as shown
here:
Path       Contents
---------------------------------------------------------------------------
sgml/      *.sgm  (reference texts including descriptive markup)
tkntext/   *.tkn  (tokenized version of reference texts)
asrtext/   *.asr  (output of Dragon ASR systems for all broadcast data)
as1text/   *.as1  (output of the BBN ASR system for English broadcast data)
mtrtext/   *.mtr  (SYSTRAN machine translation of Mandarin tkntext data)
mtatext/   *.mta  (SYSTRAN machine translation of Mandarin asrtext data)
tables/    *.bndtkn, *.bndasr, *.bndas1, *.bndmtr, *.bndmta
           (boundary tables for data files in all the "*text" paths)
           and also the file "topic_relevance.table"
In Version 3, the various boundary table files have been partitioned
into separate directories depending on the type of content they
pertain to; the directory names have been altered, and the file name
extensions are now set to be identical to the name of the directory
that contains each file; i.e.:
Path         Contents
---------------------------------------------------------------------
sgm/         *.sgm             (reference text with markup)
tkn/         *.tkn             (tokenized version of ref. text)
as0/         *.as0             (Dragon ASR output, English and Mandarin)
as1/         *.as1             (BBN ASR output, English only)
mttkn/       *.mttkn           (SYSTRAN output from Mandarin *.tkn)
mtas0/       *.mtas0           (SYSTRAN output from Mandarin *.as0)
tkn_bnd/     *.tkn_bnd         (boundary tables for *.tkn)
as0_bnd/     *.as0_bnd         (boundary tables for *.as0)
as1_bnd/     *.as1_bnd         (boundary tables for *.as1)
mttkn_bnd/   *.mttkn_bnd       (boundary tables for *.mttkn)
mtas0_bnd/   *.mtas0_bnd       (boundary tables for *.mtas0)
topics/      tdt2_topic_rel.*  (topic relevance tables)
This reorganization of boundary tables and path names is intended to
make individual files more accessible, reduce the overpopulation of
any single directory, and allow for the creation of alternative sets
of boundary tables for any given form of data. (For example, a user
could create a directory called "tkn_bnd_a" to store boundary tables
that are generated by an automatic story segmentation function applied
to the "tkn" data files, and could easily use this set of tables, in
place of the reference boundary tables in "tkn_bnd", to test system
performance.)
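For users updating scripts written against Version 2, the renaming can
be captured in a small lookup table. The following Python sketch is
illustrative only: the pairing of old and new names is inferred from the
descriptions in the two listings above, the function names are
hypothetical, and the renaming of VOA English files is not handled. It
also assumes that a user-created directory such as "tkn_bnd_a" follows
the same "extension equals directory name" convention.

```python
import posixpath

# Version 2 (directory, extension) -> Version 3 name.  In Version 3 the
# directory name and the file name extension are identical.
V2_DATA_TO_V3 = {
    ("sgml", "sgm"): "sgm",
    ("tkntext", "tkn"): "tkn",
    ("asrtext", "asr"): "as0",
    ("as1text", "as1"): "as1",
    ("mtrtext", "mtr"): "mttkn",
    ("mtatext", "mta"): "mtas0",
}

# Version 2 boundary table extension (all in "tables/") -> Version 3 name.
V2_BND_TO_V3 = {
    "bndtkn": "tkn_bnd",
    "bndasr": "as0_bnd",
    "bndas1": "as1_bnd",
    "bndmtr": "mttkn_bnd",
    "bndmta": "mtas0_bnd",
}


def v2_to_v3_path(v2_path):
    """Translate a Version 2 relative path into its Version 3 form.

    Raises KeyError for anything not covered by the two tables above
    (for example "tables/topic_relevance.table").
    """
    directory, filename = posixpath.split(v2_path)
    stem, ext = posixpath.splitext(filename)
    ext = ext.lstrip(".")
    if directory == "tables":
        new_name = V2_BND_TO_V3[ext]
    else:
        new_name = V2_DATA_TO_V3[(directory, ext)]
    return f"{new_name}/{stem}.{new_name}"


def boundary_table_path(file_id, form="tkn", variant=""):
    """Path of a boundary table; variant="_a" selects an alternative set."""
    directory = f"{form}_bnd{variant}"
    return f"{directory}/{file_id}.{directory}"
```

A system under test could then be pointed at either
boundary_table_path(file_id, "tkn") or
boundary_table_path(file_id, "tkn", "_a") without any other change.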
2. Names of VOA English files
Although the VOA English news service is described and treated as a
single source in TDT2, Version 2 used three different patterns to name
the VOA English files: from January through May, there were two news
programs that aired daily, "VOA Today" and "VOA World Report"; the
difference in program names was preserved in the corresponding file
names (VOA_TDY and VOA_WRP), even though the content and structure of
the two programs were quite similar -- both were 60-minute shows
providing "news and features". In June, VOA abandoned the use of
different names for news programs, and switched to a schedule in which
hour-long "news and features" programming made up the bulk of the
broadcast day. This schedule change was reflected in the Version 2
file names by switching to "VOA_ENG" for all June recordings.
After the Version 2 release, it was decided that the distinctions
among VOA English file names were of little or no practical use, and
were instead a hindrance to using this one source in a simple and
uniform way. The discontinuity in VOA English file names, combined
with the inclusion of VOA Mandarin data (named VOA_MAN), made it
difficult to reference all VOA English data as a coherent set.
In Version 3, all VOA English files use the string "VOA_ENG" in
their file names. In case some users may want to investigate possible
differences among the shows that used to be differently named, a table
is provided in the "corpus_info" directory that records the file name
correspondences between Version 2 and Version 3 ("voa_names.table").
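As an illustration of the selection problem described above, the
following Python sketch picks out VOA English file-ids under either
naming scheme while excluding VOA_MAN. It assumes that a file-id ends
with its source tag, as in the file-ids quoted elsewhere in this
document; the function names and the sample file-ids are hypothetical.
It does not reproduce the Version 2 to Version 3 name correspondences,
which should be taken from "voa_names.table".

```python
import re

# Source tags used for VOA English: VOA_TDY and VOA_WRP (Version 2,
# January through May), VOA_ENG (Version 2 June files, and all files in
# Version 3).  VOA_MAN (Mandarin) is deliberately not matched.
_VOA_ENGLISH = re.compile(r"_VOA_(TDY|WRP|ENG)$")


def is_voa_english(file_id):
    """True if the file-id carries one of the VOA English source tags."""
    return _VOA_ENGLISH.search(file_id) is not None


def select_voa_english(file_ids):
    """Keep only the VOA English file-ids, preserving input order."""
    return [file_id for file_id in file_ids if is_voa_english(file_id)]
```

With Version 3 names the pattern reduces to a test for the single
suffix "_VOA_ENG", which is the simplification this change was meant
to allow.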
3. Topic designations
Version 2 identified the target topics using sequential numbers, 1
through 100.
In Version 3, the topic identifiers have been expanded to fixed-length
strings of 5 digits, by adding 20000 to each original topic ID; the
original 100 topics are now identified, in the same sequence, as 20001
through 20100.
This change was intended to differentiate TDT2 topic IDs from those of
other TDT phases. The TDT Pilot corpus (TDT1) will be re-released
with a similar modification, using topic IDs 10001 through 10025, and
the main target topics in TDT3 will be designated 30001 through 30060.
This change also accommodates expansion in the set of annotated topics
for each phase, and allows for easier sorting of topic data by ID.
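The renumbering is simple arithmetic, as the following Python sketch
shows. The function names are hypothetical, and reading the TDT phase
off the leading digit is an inference from the ID ranges listed above.

```python
TDT2_OFFSET = 20000  # added to each original TDT2 topic number


def v3_topic_id(v2_topic_number):
    """Convert a Version 2 topic number (1..100) to its 5-digit Version 3 ID."""
    if not 1 <= v2_topic_number <= 100:
        raise ValueError("Version 2 topic numbers run from 1 through 100")
    return f"{TDT2_OFFSET + v2_topic_number:05d}"


def tdt_phase(topic_id):
    """Return the TDT phase (1, 2 or 3) implied by a 5-digit topic ID."""
    return int(topic_id) // 10000
```

Because the IDs are fixed-length digit strings, a plain string sort
orders them numerically.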
4. Additional topic tables
Version 2 provided a single topic_relevance.table, containing all
on-topic judgments ("YES" and "BRIEF") resulting from full annotation
of 100 target topics against all news stories.
Prior to releasing Version 3, the LDC carried out additional topic
annotations on TDT2 data to support the JHU CLSP 1999 Summer Workshop
project on First Story Detection. This effort involved selecting an
additional 97 target topics, and judging up to 60 stories against each
new topic, with a focus on finding the earliest report in the corpus
on each new topic, as well as some number of additional (subsequent)
on-topic stories and a number of off-topic stories. Only a fairly
small number of stories was judged for each new topic.
This "First Story" annotation has led to the inclusion of two
additional topic tables:
- "tdt2_topic_rel.partial_annot" contains records for all the stories
that were judged against each of the new topics (in this table, the
"level" attribute can have a value of "YES", "BRIEF" or "NO" --
stories that are not listed with a given topic in this table have
NOT been judged against that topic)
- "tdt2_topic_rel.first_story" contains a listing of just those
stories which are chronologically first for each of the 193 defined
topics (the original 96 plus the newly added 97); this can be
derived from the other two tables, and does not represent any new
information -- it is provided simply as a convenience
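The derivation of a first-story listing from relevance judgments can be
sketched as follows in Python. This is not the procedure used by the
LDC. The record layout (topic ID, docno, level, sortable timestamp) is a
hypothetical in-memory form rather than the on-disk table format, and
whether "BRIEF" stories count as candidates is left as a parameter
because the text above does not say.

```python
def first_stories(records, on_topic_levels=("YES", "BRIEF")):
    """Return {topic_id: docno} for the earliest on-topic story per topic.

    records: iterable of (topic_id, docno, level, when) tuples, where
    "when" is any value that sorts chronologically (hypothetical layout).
    """
    earliest = {}
    for topic_id, docno, level, when in records:
        if level not in on_topic_levels:
            continue  # "NO" judgments never qualify as a first story
        best = earliest.get(topic_id)
        if best is None or when < best[0]:
            earliest[topic_id] = (when, docno)
    return {topic: docno for topic, (when, docno) in earliest.items()}
```

Records from the fully annotated table and from the partially annotated
table can simply be concatenated before the call, since each record
carries its own topic ID.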
5. Format of *.as1 files
In Version 2, the token records (<W> elements) of BBN *.as1 files
contained only "recid", "Bsec" and "Dur" attributes, whereas the
Dragon *.asr files contained these attributes plus "Clust" and
"Conf" (speaker cluster and recognition confidence score information)
for each word.
In Version 3, the same attributes are used in all <W> elements of all
*.as0 and *.as1 files. In the *.as1 files, because the BBN system
does not currently provide speaker cluster or confidence information
in its output, the "Clust" and "Conf" attributes are always assigned
the constant value "NA".
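A reader for these records therefore has to accept "NA" wherever a
number might otherwise appear. The following Python sketch shows one way
to do so. The attribute names come from the description above, but the
exact line layout (unquoted attribute values, with the word following
the tag on the same line) is assumed from the <W> example given for
translated files in this document, and the function names and sample
values are hypothetical.

```python
import re

_W_LINE = re.compile(r"<W\s+([^>]*)>\s*(.*)")
_ATTR = re.compile(r"(\w+)=(\S+)")


def parse_w(line):
    """Split one <W ...> record into (attribute dict, word).

    Returns None for lines that are not <W> records.
    """
    match = _W_LINE.match(line.strip())
    if match is None:
        return None
    attrs = dict(_ATTR.findall(match.group(1)))
    return attrs, match.group(2)


def conf_or_none(value):
    """Confidence as a float, or None where the file says "NA"."""
    return None if value == "NA" else float(value)
```

For example, parse_w("<W recid=1 Bsec=0.37 Dur=0.22 Clust=NA Conf=NA> hello")
yields an attribute dictionary in which "Clust" and "Conf" both map to
the string "NA", and conf_or_none() turns the latter into None.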
6. Format of *.mttkn and *.mtas0 files
The SYSTRAN machine translation program, which is used by the LDC to
provide English renditions of Mandarin data files, has the property
that it fails to translate some strings of Mandarin text; when this
happens, it simply includes the untranslated string as part of the
translated output. As a result, the English output file may contain a
scattering of "word" tokens that consist of unmodified 16-bit GB
encoded characters intermixed among the English words.
In Version 2, these GB strings were simply treated as word tokens just
like the English words, and were not explicitly marked in any way as
being untranslated. (They were distinct from English words, in terms
of being composed of pairs of bytes in which all bytes had the 8th bit
set.)
In Version 3, an attribute has been added to each <W> element to
indicate whether the corresponding token represents a "successful"
translation to English. The attribute is "tr", and it receives a
value of "Y" if the corresponding token is English, or "N" if the
token is an untranslated GB Mandarin string. For example:
<DOCSET type=SYSTRAN fileid=19980101_0016_1116_XIN_MAN ...>
<W recid=1 tr=Y> Is
...
<W recid=53 tr=N> 尉
<W recid=54 tr=Y> healthy
...
</DOCSET>
(The character data for recid=53 consists of two bytes: 0xCE 0xBE)
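The following Python sketch shows how the "tr" attribute might be used
to pull out the untranslated tokens, together with the byte-level test
that was the only available cue in Version 2. It works on raw bytes,
since the files mix ASCII and GB data. The function names are
hypothetical and the line layout is assumed to match the example above.

```python
import re

_W_LINE = re.compile(rb"<W\s+([^>]*)>\s*(\S*)")
_ATTR = re.compile(rb"(\w+)=(\S+)")


def untranslated_tokens(lines):
    """Return [(recid, token_bytes)] for every <W> record with tr=N."""
    found = []
    for line in lines:
        match = _W_LINE.match(line.strip())
        if match is None:
            continue  # <DOCSET>, </DOCSET> and other non-token lines
        attrs = dict(_ATTR.findall(match.group(1)))
        if attrs.get(b"tr") == b"N":
            found.append((int(attrs[b"recid"]), match.group(2)))
    return found


def looks_like_gb(token):
    """Version 2 style check: byte pairs in which every byte has bit 8 set."""
    return len(token) > 0 and len(token) % 2 == 0 and all(b & 0x80 for b in token)
```

Applied to the three <W> lines of the example, untranslated_tokens()
reports only recid 53, and looks_like_gb() agrees that its two bytes
are not English.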
7. Tokenization of Mandarin *.sgm files into *.tkn
There were three issues affecting the tokenization of reference texts
in Mandarin that were not properly dealt with in Version 2:
(a) newswire articles contained "dateline" strings, "end-of-story"
strings, and various "pictorial" characters (symbols to provide
"bullet" highlighting of certain paragraphs) that should have been
eliminated from the tokenized output, but were not.
(b) newswire articles (particularly Xinhua) contained regions of
corrupted data, yielding byte codes that were uninterpretable as
either GB or ASCII characters; either the corrupted bytes, or whole
stories that contained them, should have been excluded from the
tokenized output, but were not.
(c) often (especially in Xinhua), there were 16-bit codes in the text
that mapped to a portion of the GB character table used to replicate
the standard ASCII characters -- in other words, the text contained
strings of digits and roman-alphabet letters (even spaces) that were
rendered using 16-bit codes; these should have been replaced by the
corresponding 7-bit ASCII characters, but were not.
For Version 3, the tokenization function was improved to eliminate
dateline, byline and end-of-story strings from the newswire sources,
as well as "highlighting" characters (this made Mandarin newswire
tokenization comparable to the treatment of NYT and APW in English).
Extra care was taken to isolate byte sequences that were uninterpretable
as GB or printable ASCII character sequences, and to produce only
valid, printable tokens as output (in some cases, stories were
manually inspected, and deleted from the corpus if the data corruption
was severe). Also, the new method identified GB characters with 7-bit
ASCII equivalents, made sure that these alphanumerics and punctuation marks
were rendered invariantly in ASCII form, and structured the tokenized
output so that each <W> element contains either a single GB character
or a string of one or more contiguous ASCII characters.
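The Python sketch below illustrates the general idea of folding 16-bit
GB forms of ASCII characters back to 7-bit ASCII and grouping the result
into tokens. It is not the LDC tokenizer. It assumes GB 2312 in its
usual 8-bit form, where row 0xA3 mirrors the printable ASCII range
(second byte minus 0x80) except for two positions that hold a yen sign
and an overline, and where 0xA1 0xA1 is the GB space. It also assumes
that whitespace ends an ASCII token and that uninterpretable bytes are
silently dropped; dateline and end-of-story removal is not attempted.

```python
def tokenize_gb(data):
    """Split GB-encoded bytes into tokens.

    Each token is either one 2-byte GB character or a run of one or
    more contiguous printable ASCII characters.
    """
    tokens = []
    run = bytearray()  # ASCII run currently being collected

    def flush():
        if run:
            tokens.append(bytes(run))
            run.clear()

    i = 0
    while i < len(data):
        b1 = data[i]
        if b1 < 0x80:
            if 0x21 <= b1 <= 0x7E:
                run.append(b1)      # printable ASCII extends the run
            else:
                flush()             # whitespace or control code ends it
            i += 1
        elif (0xA1 <= b1 <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            b2 = data[i + 1]
            if b1 == 0xA3 and b2 not in (0xA4, 0xFE):
                run.append(b2 - 0x80)   # 16-bit form of an ASCII character
            elif (b1, b2) == (0xA1, 0xA1):
                flush()                 # GB space acts like ASCII space
            else:
                flush()
                tokens.append(bytes((b1, b2)))  # ordinary GB character
            i += 2
        else:
            flush()                 # uninterpretable byte: drop it
            i += 1
    flush()
    return tokens
```

For instance, four 16-bit digits followed by one GB character come out
as two tokens: the ASCII string "1998" and the 2-byte character.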
8. Derivation of machine-translated text
In Version 2, the machine translation of Mandarin reference text data
was affected by the presence of dateline, byline and end-of-story
strings (as well as data corruption) in the Mandarin newswires, as
described in the previous section.
In Version 3, the machine translation used the newly tokenized
reference data files (*.tkn) as input, to assure that the translations
would be of better general quality and that there would be proper
equivalence of content between corresponding "native" and "translated"
token stream files.
9. Consistency among various boundary tables
In Version 2, there were a number of cases in which a comparison of
different boundary tables for the same file-id (e.g. comparing the
"bndtkn" file to the "bndasr" file) showed different inventories of
stories; e.g. the "bndasr" table may have included fewer story entries
than the "bndtkn" table, or the "doctype" of a given story might have
differed in the two files. Also, the treatment of story boundaries in
ASR data sometimes involved the addition of an extra entry at the end
of the "bndasr" table, with "docno=UNASSIGNED".
In Version 3, the creation of boundary tables was modified to assure
that all boundary tables sharing a given file-id would have the same
set of story entries, that there would only be entries for identified
stories, and that the doctype of each story would be constant across
all tables referring to that story. For example, there are four
distinct boundary tables for each VOA_MAN program (for tkn, mttkn, as0
and mtas0 forms of the data); in this version, the four tables for a
given file-id will have the same number of lines and the same set of
docno and doctype values.
(The "Brecid" and "Erecid" values will of course differ across tables;
in fact a story may lack these values in one table and not in another,
e.g. if an ASR system produced words where the human transcriber or
closed caption service did not. Also, the "Esec" value of the final
story in a file may differ when comparing the tkn_bnd to the as0_bnd
or as1_bnd file, because time stamps on the ASR tokens may have
extended beyond those of the manual transcription; it is still the
case that all time spans and all tokens are accounted for in each
boundary table.)
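The consistency property described above can be checked mechanically.
The Python sketch below compares the story inventories (docno and
doctype) of several boundary tables for one file-id, ignoring the
attributes that are expected to differ. The attribute names are those
mentioned in this section; the "<BOUNDARY ...>" line layout, the
function names and the sample values are assumptions made for
illustration.

```python
import re

_ATTR = re.compile(r"(\w+)=([^\s>]+)")


def story_inventory(lines):
    """Sorted list of (docno, doctype) pairs found in one boundary table."""
    inventory = []
    for line in lines:
        attrs = dict(_ATTR.findall(line))
        if "docno" in attrs:
            inventory.append((attrs["docno"], attrs.get("doctype", "")))
    return sorted(inventory)


def inconsistent_tables(tables):
    """Names of tables whose inventory differs from the first table's.

    tables: dict mapping a table name (e.g. "tkn_bnd") to its lines.
    """
    names = list(tables)
    if not names:
        return []
    reference = story_inventory(tables[names[0]])
    return [name for name in names[1:]
            if story_inventory(tables[name]) != reference]
```

An empty result means that every table lists the same stories with the
same doctype values; Brecid, Erecid, Bsec and Esec are never compared.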
10. Miscellaneous bug fixes
- Version 2 contained a set of files for 19980209_2000_2100_PRI_TWD;
these were derived from an incorrect audio recording, which was
actually a duplication of 19980216_2000_2100_PRI_TWD. The former
file set has been deleted from the corpus.
- Version 2 had bad asr and as1 data for 19980528_1600_1630_CNN_HDL,
again due to a bad audio recording; a correct recording was used
for closed-caption text and topic annotation, and NIST has provided
a corrected version of the as1 data for this file; the as0 file for
this broadcast has been deleted.
- The first story annotation and recent work at the JHU summer
workshop turned up a small number of incorrect topic labels in the
Version 2 topic_relevance.table; these have been corrected.
- The Version 2 topic_relevance.table contained a number of on-topic
stories collected in the first three days of July 1998, even though
text data for these dates were not part of the corpus; these unused
topic labels have been removed.
- All of the *.as1 files in Version 2 were lacking a final line-feed
character at the end of the last line (after "</DOCSET>"); this has
been corrected.
David Graff
LDC
August, 1999