REFLEX ET 2007 Data selection
Data come from three languages (English, Chinese and Arabic) and two genres (newswire and weblogs). LDC will manually review each document before including it, in order to maximize the concentration of names in the selected files. An effort will also be made to include files containing infrequent names, rather than targeting only the names of the most common newsmakers.
The devtest corpus comprises two data sets:
A portion of the ACE2005 training corpus, which, like the ACE2005 eval corpus, has already been annotated for entities and TIMEX2 in the source language. Translations will be necessary for the full data set, as will entity and TIMEX2 annotations of the translated data. Some of this data is also designated as part of the Less Commonly Taught Languages core English translation corpus. Wherever possible, files that belong to that translation corpus will be selected for inclusion in the current effort.
A portion of the ACE2004 training corpus, namely files from the Chinese and Arabic Treebanks. This data has already been annotated for entities in the source language, and English translations with entity annotations also exist. The data will need to be translated into the other target language (Chinese to Arabic; Arabic to Chinese), entity annotation of those translated files will need to be completed, and TIMEX2 annotation will need to be provided for the full set of data.