Once the text data from all news sources have been conditioned into the uniform SGML format established for TDT2, the text collection is scanned, using a combination of random selection and judgement, to identify topics that will be retained as targets for the detection and tracking tasks.
After topics have been identified, annotators at the LDC review the text data from all sources to assess the relevance of each story unit to each target topic. Initially, this is done in the same way as the "sequential" (i.e., not "query-based") event labeling method used for the TDT Pilot Corpus. In fact, the basic approach to sequential marking of stories, created at UMass for the Pilot Corpus, has been adapted by the LDC for use on this much larger text collection and target event set. At a later stage of the labeling task, query-based methods are used to assist with quality control.
Ideally, target topics should be drawn from news stories that are sampled evenly across both time and news sources. The LDC is responsible for selecting the events in the news (and the "seed" articles for these events) that establish the target topics for the collection. A total of 100 topics have been identified from the January-June collection period for TDT2. During the topic annotation process, LDC annotators work with one subset of topics (a list of up to 20) at a time, reading stories in sequence from a given SGML sample file and marking which topics in the chosen list (if any) are addressed in each story.
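The sequential labeling pass described above can be pictured with a small sketch. It is illustrative only and does not reproduce the LDC's actual annotation tool: the function names `topic_batches` and `label_file`, the representation of a sample file as a list of (story ID, text) pairs, and the `judge` callback standing in for the human annotator's decision are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# The text states that annotators work with "a list of up to 20" topics at a time.
MAX_TOPICS_PER_PASS = 20


def topic_batches(topic_ids: Iterable[int],
                  size: int = MAX_TOPICS_PER_PASS) -> List[List[int]]:
    """Split the full topic list into consecutive subsets of at most `size`."""
    ids = list(topic_ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def label_file(stories: List[Tuple[str, str]],
               batch: List[int],
               judge: Callable[[str, int], bool]) -> Dict[str, Set[int]]:
    """Read stories in file order and record which topics in `batch` each addresses.

    `judge(text, topic_id)` is a hypothetical stand-in for the annotator's
    relevance decision. A story that addresses none of the topics in the
    batch gets an empty set.
    """
    labels: Dict[str, Set[int]] = {}
    for story_id, text in stories:  # sequential pass, one story at a time
        labels[story_id] = {t for t in batch if judge(text, t)}
    return labels


# 100 target topics split into subsets of up to 20 gives five passes.
batches = topic_batches(range(1, 101))
print(len(batches))  # 5
```

Under this sketch, each sample file would be read once per topic subset, so covering all 100 topics takes five passes over the same file. The mapping from story ID to a set of topic IDs keeps "no topic addressed" explicit as an empty set rather than a missing entry.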