
To: tdt-distrib@unagi.cis.upenn.edu
From: David Graff <graff@unagi.cis.upenn.edu>
Subject: Mandarin Newswire Text
Date: Fri, 05 Mar 1999 16:41:28 EST

Folks,

The LDC has prepared an initial sample of Mandarin Newswire text,
which we want to make available to the research sites that will be
participating in the TDT3 evaluation. The readme file for this
distribution is attached below.

The data file is now available for "members_only" ftp distribution.
The file name is:

prelim-man-nw.990305.tar.gz

The file size is 16275954 bytes (16.3 MB), and when uncompressed, it
will be about 35.6 MB.
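
As a quick sanity check after retrieval, the byte count quoted above can
be compared against the downloaded file. This is only a sketch: the
helper name "size_matches" is hypothetical, and the local path is
assumed to be wherever you saved the archive.

```python
import os

# Compressed size in bytes, as stated in this announcement.
EXPECTED_BYTES = 16275954


def size_matches(path, expected=EXPECTED_BYTES):
    """Return True if the file at `path` is exactly `expected` bytes long."""
    return os.path.getsize(path) == expected


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was saved.
    archive = "prelim-man-nw.990305.tar.gz"
    if os.path.exists(archive):
        print("size ok" if size_matches(archive) else "size mismatch")
```

A mismatch most likely points to an incomplete or non-binary ftp
transfer, in which case the file should be fetched again.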

Please send me email if you need a reminder about instructions for
the "members_only" ftp retrieval.

Dave Graff

--------------------

This is a PRELIMINARY release of Chinese newswire text to be used in
support of the TDT3 project. The intention of this release is to give
the research sites a chance to work with the file format and character
encoding as soon as possible.

The content of this release represents ALL the suitable Chinese
newswire data that is available from the TDT2 collection period
(January through June, 1998); that is, the text data presented here
represents the complete pool from which the LDC will draw a sample for
topic annotation.

PLEASE NOTE that some portion of the data presented here will NOT be
selected for topic annotation by the LDC. The final, annotated
release of TDT2 Mandarin data is expected to contain about 18,000
stories from the two news sources presented here, which means that
about 25% of the stories in this preliminary release will have been
excluded.

The two sources are Xinhua News and Zao Bao News; the former is
collected via dedicated modem, and the latter is harvested via the
world-wide web. Both sources use GB character encoding; SGML tagging
is applied in a manner parallel to the English sources used in TDT2,
to mark story boundaries, to assign a unique identifier (a "DOCNO") to
each story, to indicate the date and time of the story, and to provide
other information that comes as part of the data transmission from the
source. As with the TDT2 English data, the text content of each
news story is contained between "<TEXT>" and "</TEXT>"; within this
region, there are "<P>" tags to mark paragraph boundaries. Word
segmentation has not been applied.
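
For sites that want to start reading the files right away, the tagging
described above can be handled roughly as follows. This is a minimal
sketch, not an LDC tool. It assumes that each story is wrapped in
"<DOC>" ... "</DOC>" and that the identifier sits between "<DOCNO>" and
"</DOCNO>" (as in the TDT2 English newswire files), and that Python's
"gb2312" codec is adequate for the GB-encoded text. The function name
"parse_stories" is hypothetical.

```python
import re

# Assumed layout: <DOC> wraps each story and <DOCNO>...</DOCNO> holds the ID.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
PARA_RE = re.compile(r"</?P>")


def parse_stories(raw, encoding="gb2312"):
    """Decode GB bytes and return one dict per story (docno, paragraphs)."""
    text = raw.decode(encoding, errors="replace")
    stories = []
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        content = TEXT_RE.search(body)
        paragraphs = []
        if content:
            # Split on <P> (and </P>, if present) and drop empty pieces.
            pieces = PARA_RE.split(content.group(1))
            paragraphs = [p.strip() for p in pieces if p.strip()]
        stories.append({
            "docno": docno.group(1) if docno else None,
            "paragraphs": paragraphs,
        })
    return stories
```

Because word segmentation has not been applied, each paragraph comes
back as an unsegmented string of characters; any segmentation step
would follow this one.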

Regarding time stamps, the ZBN (Zao Bao News) stories are marked with
the date of "delivery" only -- this source provides no time
information that can be used to order the stories within a given day's
collection. The XIN (Xinhua News) stories do have time stamps, but
because the data arrive via modem, and because some stories are short,
it is possible (but rare) for two consecutive stories to carry the
same time stamp. (It is not uncommon for consecutive stories to differ
by only a few seconds in their time stamps.)
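
One way to cope with missing or identical time stamps is to fall back
on the order in which stories appear in the file. The sketch below
illustrates that idea only. The "Story" record, its field formats, and
the choice to sort stories without a time ahead of timed stories on the
same date are all assumptions made for the example, not part of the
release.

```python
from typing import List, NamedTuple, Optional


class Story(NamedTuple):
    docno: str           # hypothetical identifier
    date: str            # assumed "YYYYMMDD"
    time: Optional[str]  # assumed "HH:MM:SS", or None when only a date is known


def chronological(stories: List[Story]) -> List[Story]:
    """Sort by (date, time); ties keep their original file order."""
    keyed = [((s.date, s.time or ""), index, s)
             for index, s in enumerate(stories)]
    # The original position breaks ties between equal time stamps.
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [s for _, _, s in keyed]
```

With this scheme, two XIN stories that share a time stamp stay in the
order they were received, and all ZBN stories for a given date stay in
file order.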


David Graff
LDC
March 5, 1999


Last updated Thu May 13 09:28:18 1999