TIDES

The DARPA TIDES program is developing robust technology for translingual information processing. The DARPA web page devoted to the program describes TIDES' goal: " ... to revolutionize the way that information is obtained from human language by enabling people to find and interpret needed information, quickly and effectively, regardless of language or medium." TIDES tasks include information detection, extraction, summarization and translation, currently focusing on English, Chinese and Arabic, with some research on Korean and Spanish. LDC is distributing much of the text, parallel text, lexicons, annotations and other resources necessary to support TIDES research.

TIDES program participants, please read the policy on distribution of resources created for DARPA TIDES.

TIDES Data Matrix - Updated 2/9/2005

Detection
LDC provides manually-annotated corpora to support work in topic detection and tracking as part of the TDT and HARD projects. Consult the following documents for information about these projects and the associated annotation task definitions and guidelines.

HARD (High Accuracy Retrieval from Documents)


TDT (Topic Detection and Tracking)

 

Extraction
LDC provides linguistic resources for TIDES Extraction evaluations, as well as for the Automatic Content Extraction (ACE) program. The following pages provide information about ACE annotation and research tasks:

 

Machine Translation

LDC is providing linguistic resources to support the 2004 TIDES MT evaluation. For more information, consult


Summarization

LDC is providing linguistic resources to support the 2004 DUC (Document Understanding Conference) Evaluation.  For more information, consult

Resources by Language and Task

The table below shows LDC resources relevant to TIDES, categorized by language and resource type. Also see Mark Liberman's March 2000 assessment of the state of the art for language resources related to TIDES. For a list of data resources in the LDC Catalog that are associated with TIDES, click here.

 

English

Text

North American News Text Corpus
North American News Text Supplement
English Gigaword - newswire text totalling 1 billion words in English
ACL/DCI - a variety of text from news sources and scientific abstracts
TDT2 Careful Transcription Audio - broadcast news audio; the transcriptions to these recordings are available in the Topic Detection and Tracking (TDT) 2 Careful Transcription Text Corpus
TDT2 Careful Transcription Text - transcription of the broadcast news audio files in the Topic Detection and Tracking (TDT) 2 Careful Transcription Audio Corpus

Parallel Text

UN Parallel Text (English) - documents provided by the United Nations for use in research on machine translation technology.

Lexicons

CELEX2

Morpho-Syntactic Tagged Text

BLLIP 1987-89 WSJ Corpus Release 1
Treebank-3

Detection

TDT Pilot Study Corpus 
TDT2 English Audio
TDT3 English Audio
TDT2 Multilanguage Text Version 4.0
TDT3 Multilanguage Text Version 2.0 
TIPSTER Complete
TDT4 Multilanguage Text Version 1.1:  contact LDC for a pre-release copy (LDC catalog no.: LDC2003E21)

TDT4 Multilanguage Audio: contact LDC for a pre-release copy (LDC catalog no.: LDC2003E02)

TDT4 Annotations: 2002 and 2003 topics completed

TREC Cross-Language Topics: 2001 and 2002 topics completed
HARD GovDocs Corpus: contact LDC for a pre-release copy (LDC catalog no.: LDC2003E15)

HARD:  2003 evaluation topics completed

Extraction

Message Understanding Conference (MUC) 7
Automatic Content Extraction (ACE) Corpora


ACE/TIDES participants only (contact LDC):

  • Name-Annotated TDT Corpus Supplement for ACE (LDC2002E50)
  • ACE 2004 Pilot Corpus V1.2 (LDC2004E03)

Summarization

 

Translation

NA

 

 

Arabic

Text

Parallel Text

  • UN Arabic English Parallel Text - TIDES participants, contact LDC for a pre-release copy (LDC catalog no.: LDC2004E13).
  • Arabic English Parallel News Text Part 1 - 7,048 story pairs, 53,411 sentence pairs, 2M Arabic words and 1.6M English words. Contact LDC for a pre-release copy (LDC catalog no.: LDC2004E08).

Lexicons

Morpho-Syntactic Tagged Text

Detection

TDT 3 Arabic Text - TDT participants, contact LDC (LDC catalog no.: LDC2002E32)
TDT4 Multilanguage Text Version 1.1:  contact LDC for a pre-release copy (LDC catalog no.: LDC2003E21)

TDT4 Multilanguage Audio: contact LDC for a pre-release copy (LDC catalog no.: LDC2003E02)

TDT4 Annotations: 2002 and 2003 topics completed


TREC Cross-Language Topics - 2001 and 2002 topics completed

Extraction

Automatic Content Extraction (ACE) Corpora


ACE/TIDES participants only (contact LDC):

  • ACE 2004 Pilot Corpus V1.2 (LDC2004E03)

Summarization

 

Translation

  • Multiple Translation Arabic Corpus Part 1 (2002 TIDES MT eval data) - 141 Arabic news articles, 10 sets of human translations, 2 sets of COTS outputs and human assessment of the two COTS outputs. TIDES MT participants, contact LDC for pre-release copy (LDC catalog no.: LDC2002E54)
  • Arabic News Translation Corpus Part 1 - 600 Arabic news articles, translated by 6 translation agencies (each translates 100 news articles). TIDES participants, contact LDC for pre-release copy (LDC catalog no.: LDC2003E05)
  • Arabic News Translation Corpus Part 2 - 740 Arabic news articles, translated by 5 translation agencies. TIDES participants, contact LDC for pre-release copy (LDC catalog no.: LDC2003E09)
  • Arabic News Translation Corpus Part 3 - 1,485 Arabic news articles, translated by 5 translation agencies. TIDES participants, contact LDC for pre-release copy (LDC catalog no.: LDC2004E07)
  • Arabic News Translation Corpus Part 4 - 414 Arabic news articles, translated by 5 translation agencies. TIDES participants, contact LDC for pre-release copy (LDC catalog no.: LDC2004E11)
  • Arabic Treebank: Part 1 - 10K-word English Translation - English translation of 49 files selected from Arabic Treebank Part 1 v2.0

 

 

Chinese

Text

Mandarin Chinese News Text

Chinese Gigaword

Parallel Text

  • Hong Kong News Parallel Text - TIDES participants, contact LDC for pre-release sentence-aligned copy (LDC catalog no.: LDC2003E25)
  • Hong Kong Hansards Parallel Text - TIDES participants, contact LDC for pre-release sentence-aligned copy (LDC catalog no.: LDC2004E09)
  • Hong Kong Laws Parallel Text
  • Xinhua Chinese-English Parallel News Text - TIDES participants contact LDC for a pre-release sentence-aligned copy (LDC catalog no.: LDC2002E18).
  • Sinorama Chinese-English Parallel Text: Articles published by Sinorama (Taiwan) from 1976 to 2001. 2,373 pairs of articles. 3.2M English words, 5.3M Chinese characters. TIDES participants, contact LDC for pre-release sentence-aligned copy (LDC catalog no.: LDC2002E58)
  • United Nations Chinese-English Parallel Text: UN documents from 1993 to 2002, containing 53,486 document pairs, 4,979,857 sentence pairs, 147M English words, 247M Chinese characters. TIDES participants, contact LDC for a pre-release copy (LDC catalog no.: LDC2004E12).

Lexicons

  • Chinese English Translation Lexicon Version 3.0
  • Chinese <-> English Name Entity Lists Version 1.0 beta now available! Created from Xinhua proper name and who's who databases, the package has almost 1 million unique proper names of various kinds (click here for the readme file). TIDES participants contact LDC for retrieval instructions (Catalog number: LDC2003E01).

Morpho-Syntactic Tagged Text

Detection

TDT2 Mandarin Audio
TDT3 Mandarin Audio
TDT2 Multilanguage Text Version 4.0 
TDT3 Multilanguage Text Version 2.0 
TREC Mandarin
TDT4 Multilanguage Text Version 1.1:  contact LDC for a pre-release copy (LDC catalog no.: LDC2003E21)

TDT4 Multilanguage Audio: contact LDC for a pre-release copy (LDC catalog no.: LDC2003E02)

TDT4 Annotations: 2002 and 2003 topics completed

Extraction

Automatic Content Extraction (ACE) Corpora


ACE/TIDES participants only (contact LDC):

  • ACE 2004 Pilot Corpus V1.2 (LDC2004E03)
  • Chinese <-> English Name Entity Lists Version 1.0 beta now available! Created from Xinhua proper name and who's who databases, the package has almost 1 million unique proper names of various kinds (click here for the readme file). TIDES participants contact LDC for retrieval instructions (Catalog number: LDC2003E01).

Summarization

 

Translation

  • Multiple-Translation Chinese Corpus 
  • Multiple Translation Chinese Corpus Part 2 (2002 TIDES MT eval data) - 100 Chinese news articles, 4 sets of human translations, 6 sets of COTS outputs and human assessment of two COTS outputs. TIDES participants, contact LDC for pre-release copy (LDC catalog no.: LDC2002E53)
  • Multiple Translation Chinese Corpus Part 3 - 100 Chinese news articles, 4 sets of human translations. TIDES participants, contact LDC for pre-release copy (LDC catalog no.: LDC2002E04)
  • Chinese News Translation Corpus Part 1 - 740 Chinese news articles, translated by 5 translation agencies. TIDES participants, contact LDC for pre-release copy (LDC catalog no.: LDC2003E08)
  • Chinese Treebank English Parallel Corpus - TIDES participants, contact LDC for pre-release copy (LDC catalog no.: LDC2003E07)


 

 

Korean

Text

Korean Newswire

Parallel Text

 

Lexicons

 

Morpho-Syntactic Tagged Text

Korean Treebank - in progress

Detection

 

Extraction

 

Summarization

 

Translation

 

 

 

Spanish

Text

Spanish News Text
Spanish News Text, Volume 2

Parallel Text

UN Parallel Text (Spanish)

Lexicons

 

Morpho-Syntactic Tagged Text

 

Detection

TREC Spanish

Extraction

 

Summarization

 

Translation

 

 

 

Japanese

Text

Japanese Business News Text
Japanese Business News Text Supplement

Parallel Text

 

Lexicons

 

Morpho-Syntactic Tagged Text

 

Detection

 

Extraction

 

Summarization

 

Translation