TIDES Summarization

The goal of TIDES Summarization is to support summarization technology, covering both individual documents and clusters of topically related documents.

LDC supports summarization technology by creating linguistic resources for research participants. LDC annotators create summaries of topic clusters and of the documents within those clusters. Each document and cluster is summarized by four independent annotators.

Summarization in 2004 - 2005

LDC created the evaluation data for the 2005 TIDES Multilingual Summarization Evaluation (MSE). Using the TDT-4 English and Arabic corpora (including improved Arabic translations from ISI), LDC developed and summarized a total of 50 new topics, 25 of which are being used for the evaluation. Topics were drawn from the output of Columbia's NewsBlaster topic clustering system. For each of the 25 topics, annotators created 100-word manual summaries based on the English source documents plus English translations of the original on-topic Arabic documents. Each topic has four summaries, each created by an independent annotator.

2004-2005 instructions to annotators

Previous summarization efforts: 2003 - 2004

In 2003-2004, for the Document Understanding Conference (DUC), LDC created document-level and cluster-level summaries for 25 topics drawn from the TDT3 corpus. Of these, 12 were new multilanguage topics that LDC identified in the TDT3 corpus epoch; the other 13 had been created previously during TDT3. The original language of the topics was Arabic: Arabic-speaking LDC annotators performed topic selection, using the EZQuery search engine to locate previously untouched topics with ten relevant documents per topic. The relevant Arabic documents were sent to a translation agency to be translated into English. Four independent English-speaking annotators then wrote a 10-word summary of each manually translated document, and 100-word summaries were created for the topic clusters.

Links to 2003 - 2004 information:

Additional links and resources


Contact mlglenn@ldc.upenn.edu
© 1996-2000 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.