ACE 2005 Corpora

A new feature of this year's ACE training corpus is careful, targeted data selection. Files were not chosen at random for annotation, because this year's task requires a certain density of annotation across the corpus. The established target, agreed upon at the Fall 2004 ACE Workshop, is 50 examples of each entity, relation and event type/subtype within the 300,000-word corpus for each language. Note that the "50-example threshold" is simply a target and not a hard-and-fast requirement of the corpus. LDC has made a concerted effort to identify at least 50 examples of each type/subtype, but has likely fallen short of the goal in some cases. Actual counts for each type/subtype won't be known until the corpus has been fully annotated, but we are confident that this year's data is much richer in targeted entities, relations and events than previous training corpora. What follows is a brief description of our data selection process.

First, we created a small set of example documents from each language and genre that had been quickly labeled by ACE annotators as "good" or "bad" for ACE annotation. The good/bad determination was based on the document content, including the number and type of entities, relations and events. We then did a statistical analysis of the frequency of each token that appeared in the set of "good" ACE documents as compared to the set of "bad" ACE documents. This analysis was used to generate a list of positively and negatively weighted keywords to help in the identification of additional "good" ACE documents from the data pool. The list of keywords was then fed through an LDC-developed search engine and some additional formulas to generate relevance rankings for each document.
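The keyword-weighting step can be illustrated with a minimal sketch. The statistic LDC actually used is not specified here; the example assumes a smoothed log-odds ratio of token frequencies, whitespace tokenization, and a simple sum of weights as the document score. The function names (keyword_weights, score, rank) are hypothetical and do not describe the LDC search engine or its additional formulas.

```python
import math
from collections import Counter


def keyword_weights(good_docs, bad_docs, smoothing=1.0):
    """Return {token: weight}; positive weights favor "good" documents.

    Assumption: weight = log of the ratio between the token's smoothed
    relative frequency in the "good" set and in the "bad" set.
    """
    good = Counter(tok for doc in good_docs for tok in doc.lower().split())
    bad = Counter(tok for doc in bad_docs for tok in doc.lower().split())
    vocab = set(good) | set(bad)
    # Add-one style smoothing so tokens seen in only one set get a finite weight.
    good_total = sum(good.values()) + smoothing * len(vocab)
    bad_total = sum(bad.values()) + smoothing * len(vocab)
    return {
        tok: math.log((good[tok] + smoothing) / good_total)
        - math.log((bad[tok] + smoothing) / bad_total)
        for tok in vocab
    }


def score(doc, weights):
    """Sum the keyword weights over the tokens of one unjudged document."""
    return sum(weights.get(tok, 0.0) for tok in doc.lower().split())


def rank(pool, weights):
    """Order a pool of documents from most to least promising."""
    return sorted(pool, key=lambda doc: score(doc, weights), reverse=True)
```

Under these assumptions, tokens that are relatively more frequent in the "good" set receive positive weights and tokens more frequent in the "bad" set receive negative ones, so a document's score rises with its share of "good" vocabulary.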

ACE annotators then read through the relevance-ranked list, confirming which documents were suitable for annotation and providing a rough estimate of the number of entities, relations and events by type and subtype for each file. This process was supplemented by manual keyword searching focused on the rarest annotation types. Note that the language used in different data sources (genres) differs enough that it was necessary to run the searching and relevance ranking on each genre independently. Once approximate counts were in place, the final step was to run an algorithm over the judged/counted files to generate a maximal subset of documents that met word count and other requirements for each language.
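The final selection step can be sketched in the same spirit. The actual algorithm and its "other requirements" are not described here; the example assumes a greedy strategy that repeatedly picks the document contributing the most still-needed examples per word until no remaining document fits the word budget. The function select_documents and its input layout are hypothetical.

```python
def select_documents(docs, word_budget, target=50):
    """Greedily pick documents under a word budget.

    docs: list of (doc_id, word_count, {type_label: estimated_count}).
    Returns (chosen_ids, totals), where totals maps each type label to
    the summed estimated count over the chosen documents.
    Assumption: a greedy heuristic, not necessarily the method LDC used.
    """
    totals = {}
    chosen = []
    words_used = 0
    remaining = list(docs)

    def gain(doc):
        _, words, counts = doc
        # Count only examples that move an under-target type toward the target.
        useful = sum(
            min(n, max(0, target - totals.get(label, 0)))
            for label, n in counts.items()
        )
        return useful / words if words else 0.0

    while True:
        fitting = [d for d in remaining if words_used + d[1] <= word_budget]
        if not fitting:
            break
        best = max(fitting, key=gain)
        remaining.remove(best)
        chosen.append(best[0])
        words_used += best[1]
        for label, n in best[2].items():
            totals[label] = totals.get(label, 0) + n
    return chosen, totals
```

Scoring by needed examples per word favors short documents that are dense in rare types, which matches the stated goal of reaching the 50-example target within a fixed corpus size. A greedy pass of this kind does not guarantee an optimal subset.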