To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: TDT2 and TDT3 Multilanguage Text Corpora -- Latest release
Date: Tue, 29 May 2001 12:24:31 -0400
Folks,
It was brought to my attention that some members of the "tdt-distrib"
email list might not have received the recent general announcement from
the LDC about the availability of these two cdrom releases (as of May 3
or thereabouts):
TDT2 Multilanguage Text Corpus, v4.0 (one cdrom)
TDT3 Multilanguage Text Corpus, v2.0 (one cdrom)
(total: two cdroms). These corpora are available for free to those who
have or purchase an LDC membership for 2001; the TDT2 cdrom is also free
to those who had an LDC membership in 1999. Each cdrom is available to
non-members for a purchase price of $2,500 (total: $5,000 for both; note
that LDC membership for 2001 is $2,000 for non-profit members, $20,000
for commercial members).
The LDC can arrange "evaluation only" membership agreements for those
who would like to participate in this year's TDT evaluation cycle, but
cannot afford either an LDC membership or the non-member purchase price;
the "evaluation only" membership will permit free access to the data
until the scheduled evaluation has been completed, after which the data
must either be purchased or returned to the LDC (user must pay for
shipping).
These cdroms contain a number of enhancements relative to earlier
releases of these corpora, including:
- A more "user-friendly" rendering of reference and ASR text data, in
which the contents of the corresponding "token stream" files have been
reformatted to a simple and consistent "TIPSTER-style" document markup
(sample provided below); the original TDT-specific token stream file
formats are also provided, as well as the "unprocessed" source
(reference) text files from which token stream and "tipsterized" formats
have been derived.
- Some patches to reference text source files to repair or eliminate
bad character data, with some resulting differences (cleaner content)
in the corresponding token stream files.
- A patch to the process that created token stream data from Chinese
newswire sources. We had discovered a bug in the earlier releases that
had caused portions of an initial paragraph to be elided in some
stories; this bug was fixed prior to creating these new releases of
TDT2 and TDT3.
These cdroms DO NOT contain the exposed sets of topic table data (to be
used for system training and development), or information about the
topics themselves. For these items, please visit the TDT web pages at
the LDC:
www.ldc.upenn.edu/Projects/TDT2/
www.ldc.upenn.edu/Projects/TDT3/
For further information about the corpora, please consult those URLs,
as well as the LDC web-catalog pages:
http://www.ldc.upenn.edu/Catalog/LDC2001T57.html (TDT2)
http://www.ldc.upenn.edu/Catalog/LDC2001T58.html (TDT3)
Please accept my apologies for not posting this information sooner on
the tdt-distrib email list.
Best regards,
Dave Graff
LDC
---------------
sample of "tipsterized" markup, provided consistently for all data
sources, in both reference text data and ASR text data:
<DOC>
<DOCNO> VOA19980501.1700.0000 </DOCNO>
<DOCTYPE> MISCELLANEOUS </DOCTYPE>
<TXTTYPE> TRANSCRIPT </TXTTYPE>
<TEXT>
It's 2100 universal time. From V.O.a. news Washington, this is world
report. It is Friday, may 1. I'm Susan Clark. Topping the news at
this hour, police in Nigeria fire into a demonstration killing at
least seven people. I'm Mimi Levich. A former prime minister of Rwanda
pleads guilty to a genocide. We'll hear about a battle against drug
smugglers and the money behind the sweet smell of success. And we'll
go round and round and round a roller rink for the last time. All
coming up in the next hour on world report.
</TEXT>
</DOC>
...
<DOC>
<DOCNO> VOA19980501.1700.0510 </DOCNO>
<DOCTYPE> NEWS </DOCTYPE>
<TXTTYPE> TRANSCRIPT </TXTTYPE>
<TEXT>
The leader of Kosovo's ethnic Albanians has welcomed
international efforts to halt violence in the ugeo slav province,
but urged stronger measures. Abrahima golvia suggested extending sanctions,
which were announced Wednesday, to sports, including a ban on Yugoslav
participation in June's world cup soccer matches in France. The international
community has announced a package of incentives and penalties for
pressuring yugoslava to open talks with Kosovo's ethnic Albanian
majority on ending the crisis.
</TEXT>
</DOC>
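
For readers who want to load the "tipsterized" files, the sample above
suggests a simple record structure that can be split with regular
expressions. The following is only a minimal sketch in Python, based
solely on the two sample documents shown here. The names iter_docs and
SAMPLE are hypothetical, the sample text is abbreviated, and the sketch
assumes that each field tag sits on one line and that tags are never
nested. It works on an already-decoded string because the file encodings
on the cdroms are not stated in this message.

```python
import re
from typing import Dict, Iterator

# One <DOC>...</DOC> record; DOTALL lets "." span line breaks.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
# Single-line header fields seen in the sample: DOCNO, DOCTYPE, TXTTYPE.
FIELD_RE = re.compile(r"<(DOCNO|DOCTYPE|TXTTYPE)>\s*(.*?)\s*</\1>")
# The story body, which may span many lines.
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def iter_docs(data: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <DOC> record found in an already-decoded string."""
    for match in DOC_RE.finditer(data):
        body = match.group(1)
        # Lower-case the tag names to use them as dictionary keys.
        record = {name.lower(): value for name, value in FIELD_RE.findall(body)}
        text = TEXT_RE.search(body)
        # Collapse line breaks and repeated spaces into single spaces.
        record["text"] = " ".join(text.group(1).split()) if text else ""
        yield record


# Abbreviated copy of the sample above (story text shortened for space).
SAMPLE = """<DOC>
<DOCNO> VOA19980501.1700.0000 </DOCNO>
<DOCTYPE> MISCELLANEOUS </DOCTYPE>
<TXTTYPE> TRANSCRIPT </TXTTYPE>
<TEXT>
It's 2100 universal time. From V.O.a. news Washington, this is world
report.
</TEXT>
</DOC>
...
<DOC>
<DOCNO> VOA19980501.1700.0510 </DOCNO>
<DOCTYPE> NEWS </DOCTYPE>
<TXTTYPE> TRANSCRIPT </TXTTYPE>
<TEXT>
The leader of Kosovo's ethnic Albanians has welcomed
international efforts to halt violence
</TEXT>
</DOC>
"""

for doc in iter_docs(SAMPLE):
    print(doc["docno"], doc["doctype"], doc["txttype"])
```

Text between records, such as the "..." separator in the sample, is
ignored because only material inside <DOC> and </DOC> is matched. Files
with additional tags or multi-line header fields would need a more
careful parser.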
-------------------------------------------------------------
To unsubscribe from tdt-distrib, email majordomo@ldc.upenn.edu
with "unsubscribe tdt-distrib" in the body of the message.
Last updated Wed Aug 22 16:07:33 2001