The LDC receives the New York Times News Service (NYT) and the Associated Press Worldstream Service (APW) via dedicated modems, operating 24 hours/day.
The NYT service includes not only the news stories that appear in the New York Times newspaper, but also a wide range of content from other newspapers around the U.S. and from other wire services. The APW service carries worldwide news in six languages.
The TDT2 corpus uses only NYT material that originates from the New York Times proper, and only the English portion of the APW service. The NYT wire delivers over 22000 story units (about 100 MB of data) per month. Of these, only about 4000 units (about 20 MB) are from the NYT proper (i.e. material generated by NYT for use, or possible use, in that newspaper). AP delivers over 14000 stories in English per month (about 40 MB).
The data are delivered to the modems in the standard ANPA format (American Newspaper Publishers Association), containing a mixture of control characters and printable ASCII text. The format defines the data structure of a "message" (the basic unit of newswire transmission), consisting of a header, body and trailer. The modem connection is subject to various "normal" disruptions, causing occasional drop-outs or corruptions in the data.
There are two properties of newswire transmissions that work against the intent and assumptions of TDT2.
First, some story units are actually composite lists of brief reports; these reports may be loosely related by a broader topic (e.g. "business/finance") or by geographic region. These pose a problem not only for the assumption of one topic per story, but also for annotators doing topic labeling on story units. This appears somewhat more often in APW than in NYT.
Second, some stories are quite long, and are broken up for transmission as a sequence of parts. Each part has the structural appearance of an independent story unit in the newswire transmission format (see below about ANPA data format), except that the non-initial parts of the story lack a headline. This happens more often in NYT (about 20% of stories) than in APW (less than 5%).
Another property is that the same story may be posted more than once in the course of a day, or even over adjacent days. There may be minor changes to content between successive versions. Also, the services sometimes post simply a fragment of a story, to replace one or more paragraphs in a previously posted full story.
The LDC breaks the continuous stream from each modem into one-day partitions, starting at midnight each day. We then apply a four-stage filter to the daily capture files.
The first stage simply checks the stream for compliance to the ANPA transmission format, discarding portions of the stream that contain corrupted or incomplete messages, retaining messages that appear to be complete and properly formatted, and translating the significant control characters into visible SGML markup. Some transmission failures could go unnoticed at this stage; in particular, if the modem connection were broken in the middle of one story unit, and resumed noiselessly during the middle of a later story, two message fragments could appear joined as a single message in the output of this stage of filtering. (This sort of "silent" corruption is rare -- typically, a disruption of service is accompanied by noticeable noise in the text.)
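The stage-one scan can be sketched roughly as follows. This is a minimal illustration, not the LDC's actual filter: the real ANPA framing is richer than the three control characters assumed here (SOH opening the header, STX opening the body, ETX closing the message), and the SGML element names are invented for the example.

```python
import re

# Assumed (simplified) ANPA framing characters; the real format
# defines additional control codes and a trailer section.
SOH, STX, ETX = "\x01", "\x02", "\x03"

def stage_one(stream: str) -> list:
    """Keep only complete, well-framed messages from a raw capture;
    fragments lacking proper framing are silently discarded."""
    docs = []
    # A complete message is SOH ... STX ... ETX with no stray framing inside.
    pattern = re.compile(rf"{SOH}([^{SOH}{STX}{ETX}]*){STX}([^{SOH}{STX}{ETX}]*){ETX}")
    for m in pattern.finditer(stream):
        header, body = m.group(1), m.group(2)
        # Translate the control-character structure into visible markup
        # (hypothetical element names, for illustration only).
        docs.append(f"<MSG>\n<HDR>{header}</HDR>\n<BODY>{body}</BODY>\n</MSG>")
    return docs
```

Note that a message whose closing ETX was lost to a line disruption simply fails to match and is dropped, which mirrors the "discard incomplete messages" behavior described above.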
The second stage reads the SGML output of stage one and creates a reduced stream that contains, for each service, just the material of interest for TDT2. This is done using a source identification string provided by NYT in the initial part of each message, and a language identification string provided by APW in the trailing part of each message.
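A stage-two filter along these lines might look like the sketch below. The element names <SOURCE> and <LANG>, and the code values "NYT" and "ENG", are assumptions for illustration; the actual identification strings transmitted by the two services differ in detail.

```python
import re

def keep_message(sgml: str, service: str) -> bool:
    """Stage two: keep only NYT-proper stories and English APW stories.
    Field names and code values here are hypothetical."""
    if service == "NYT":
        m = re.search(r"<SOURCE>(.*?)</SOURCE>", sgml)
        return bool(m) and m.group(1) == "NYT"   # NYT-proper material only
    if service == "APW":
        m = re.search(r"<LANG>(.*?)</LANG>", sgml)
        return bool(m) and m.group(1) == "ENG"   # English portion only
    return False
```

A message with a corrupted or missing identification field fails the test and is excluded, erring on the side of a cleaner reduced stream.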
The third stage translates the original SGML markup produced by stage one (which is designed for internal LDC use and simply mimics the ANPA data structure), to produce the SGML format established for use in the TDT2 Corpus.
The fourth stage seeks to identify messages in the stream whose content is not a "narrative" news story; these "non-narrative news" messages include advisories to news editors about services or upcoming stories, formulaic tables or listings (stock prices, bestsellers), and so on. Material of this sort is not annotated or included in TDT2, and it is identified by inserting a "<DOCTYPE>" tag with the string "MISCELLANEOUS TEXT" as its content, into the SGML "<DOC>" (message) unit. Units that are not so marked are given a "<DOCTYPE>" tag containing "NEWS STORY".
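In outline, the stage-four classification amounts to pattern matching followed by tag insertion, as in this sketch. The cue patterns shown are invented examples of the kind of formulaic markers an advisory or table might carry; the actual heuristics rely on many more observed patterns.

```python
import re

# Hypothetical cues for non-narrative material (advisories, tables, lists);
# the real classifier uses a larger set of observed patterns.
MISC_CUES = re.compile(r"(EDITORS:|ADVISORY|BESTSELLERS)", re.IGNORECASE)

def tag_doctype(doc: str) -> str:
    """Insert a <DOCTYPE> tag just inside the <DOC> unit, marking the
    message as miscellaneous text or as a narrative news story."""
    label = "MISCELLANEOUS TEXT" if MISC_CUES.search(doc) else "NEWS STORY"
    return doc.replace("<DOC>", f"<DOC>\n<DOCTYPE> {label} </DOCTYPE>", 1)
```

Because the cues must match explicitly, a non-news message that lacks them falls through to the "NEWS STORY" default, which is the failure mode acknowledged two paragraphs below.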
In the case of NYT, an additional process is applied at this stage that seeks to locate and assemble the pieces of long stories that are transmitted as a sequence of separate ANPA messages. This process extracts just the text content of non-initial story pieces and pastes these cleanly into the message that contains the initial piece of the story; in this way, the entire text content of the story appears unbroken in one SGML "<TEXT>" unit; the header information that accompanied the first piece of the story, and the trailer information that accompanied the last piece (including the transmission time stamp), will apply to the entire story; intermediate header and trailer data are discarded.
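The reassembly step can be sketched as a single pass that groups parts by a shared story key. The dict layout and the use of a "slug" as the grouping key are assumptions for illustration (the actual NYT transmission conventions supply the linking cues); the sketch also assumes parts arrive in transmission order.

```python
def reassemble(messages):
    """Merge continuation parts into the message holding the first part.

    Each message is a dict: {'slug', 'part', 'header', 'text', 'trailer'}
    (hypothetical layout). The header comes from part 1, the trailer
    (with its time stamp) from the last part seen; intermediate header
    and trailer data are discarded, as described above.
    """
    by_slug = {}
    order = []
    for msg in messages:
        key = msg["slug"]
        if key not in by_slug:
            by_slug[key] = dict(msg)                 # part 1 supplies the header
            order.append(key)
        else:
            by_slug[key]["text"] += msg["text"]      # splice body text cleanly
            by_slug[key]["trailer"] = msg["trailer"] # last part's trailer wins
    return [by_slug[k] for k in order]
```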
Both tasks at this stage use heuristics based on observed patterns, and both rely on cues that may be lacking in some cases due to corruptions, typographic errors or variability in the data. The use of the "MISCELLANEOUS TEXT" classification and the piecing together of multi-message stories will apply only to cases where observed pattern conditions are explicitly met; we therefore expect that some percentage of non-news messages will not be identified as such, and some multi-part stories may not get re-assembled.
(It turns out that the transmission conventions used by NYT make it very simple to reassemble long stories from multiple message units, and these conventions are strictly followed in all but a few articles in the course of a week. In the case of APW, the conventions for multipart stories are less simple and much less consistent -- they are sufficient to allow human readers to piece stories together, but are less amenable to programmatic treatment; in any case, so little APW material is affected by this issue that it was deemed more cost-effective to ignore it for this source.)
The quantity of news data coming out of the APW and NYT pipelines exceeds the designed scope of the TDT2 Corpus in terms of the amount of material to be labeled with relevance judgements for target topics, so a selection process is applied to the output of the four-stage filter to yield approximately 80 stories per day for topic labeling. The selection is based on the average size of news stories and the total amount of news data collected from each source on a given day: if the day's collection is significantly larger than needed for labeling, four equally sized sets of contiguous stories are selected from the daily file at evenly spaced intervals. The byte sizes of the selected sets are only approximately equal, since the selection involves whole story units, and the number of stories in each set varies with the length of the stories that happen to fall within it.
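The evenly spaced selection can be illustrated as below. For simplicity this sketch counts stories rather than bytes (the actual selection is driven by story and file sizes in bytes), and the target and set counts are parameters rather than fixed values.

```python
def select_for_labeling(stories, target=80, sets=4):
    """Pick `sets` contiguous runs of whole stories at evenly spaced
    offsets through the daily file, totalling roughly `target` stories.
    Illustrative sketch: the real selection works from byte sizes."""
    if len(stories) <= target:
        return list(stories)            # small day: everything gets labeled
    per_set = target // sets            # stories per contiguous run
    stride = len(stories) // sets       # spacing between run start points
    selected = []
    for i in range(sets):
        start = i * stride
        selected.extend(stories[start:start + per_set])
    return selected
```

Because whole story units are taken, a byte-based version of this scheme yields sets of only approximately equal size, exactly as noted above.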