The DARPA TIDES program is developing robust technology for translingual information processing. The DARPA web page devoted to the program describes TIDES' goal: "... to revolutionize the way that information is obtained from human language by enabling people to find and interpret needed information, quickly and effectively, regardless of language or medium." TIDES tasks include information detection, extraction, summarization and translation, currently focusing on English, Chinese and Arabic, with some research on Korean and Spanish. LDC is distributing much of the text, parallel text, lexicons, annotations and other resources necessary to support TIDES research.
TIDES program participants, please read the policy on distribution of resources created for DARPA TIDES.
TIDES Data Matrix - Updated 2/9/2005
Detection
LDC provides manually-annotated corpora to support work in topic detection and
tracking as part of the TDT and HARD projects. Consult the following documents
for information about these projects and the associated annotation task
definitions and guidelines.
HARD (High Accuracy Retrieval from Documents)
TDT (Topic Detection and Tracking)
Extraction
LDC provides linguistic resources for TIDES Extraction evaluations, as well as
for the Automatic Content Extraction (ACE) program. The following pages
provide information about ACE annotation and research tasks:
Machine Translation
LDC is providing linguistic resources to support the 2004 TIDES MT evaluation. For more information, consult
Summarization
LDC is providing linguistic resources to support the 2004 DUC (Document
Understanding Conference) Evaluation. For more information, consult
The table below shows LDC resources relevant to TIDES, categorized by language and resource type. Also see Mark Liberman's March 2000 assessment of the state of the art in language resources related to TIDES. For a list of data resources in the LDC Catalog that are associated with TIDES, click here.
| English |
| Text | North American News Text Corpus |
| Parallel Text | UN Parallel Text (English) - documents provided by the United Nations for use in research on machine translation technology. |
| Lexicons | |
| Morpho-Syntactic Tagged Text | |
| Detection | TDT Pilot Study Corpus; TDT4 Multilanguage Audio: contact LDC for a pre-release copy (LDC catalog no.: LDC2003E02); TDT4 Annotations: 2002 and 2003 topics completed; TREC Cross-Language Topics: 2001 and 2002 topics completed; HARD: 2003 evaluation topics completed |
| Extraction | Message Understanding Conference (MUC) 7 |
| Summarization | |
| Translation | NA |
| Arabic |
| Text | |
| Parallel Text | |
| Lexicons | |
| Morpho-Syntactic Tagged Text | |
| Detection | TDT 3 Arabic Text - TDT participants contact LDC (LDC catalog no.: LDC2002E32); TDT4 Multilanguage Audio: contact LDC for a pre-release copy (LDC catalog no.: LDC2003E02); TDT4 Annotations: 2002 and 2003 topics completed |
| Extraction | Automatic Content Extraction (ACE) Corpora |
| Summarization | |
| Translation | |
| Chinese |
| Text | |
| Parallel Text | |
| Lexicons | |
| Morpho-Syntactic Tagged Text | |
| Detection | TDT2 Mandarin Audio; TDT4 Multilanguage Audio: contact LDC for a pre-release copy (LDC catalog no.: LDC2003E02); TDT4 Annotations: 2002 and 2003 topics completed |
| Extraction | Automatic Content Extraction (ACE) Corpora |
| Summarization | |
| Translation | |
| Korean |
| Text | |
| Parallel Text | |
| Lexicons | |
| Morpho-Syntactic Tagged Text | Korean Treebank in progress |
| Detection | |
| Extraction | |
| Summarization | |
| Translation | |
| Spanish |
| Text | |
| Parallel Text | |
| Lexicons | |
| Morpho-Syntactic Tagged Text | |
| Detection | |
| Extraction | |
| Summarization | |
| Translation | |
| Japanese |
| Text | Japanese Business News Text |
| Parallel Text | |
| Lexicons | |
| Morpho-Syntactic Tagged Text | |
| Detection | |
| Extraction | |
| Summarization | |
| Translation | |