The allowable training data for each evaluation condition is listed below. Registered participants should contact the LDC to acquire this data:

Small Data Condition

LDC Catalog Number | Title | Description
LDC2002E17 | English Translation of Chinese Treebank |
Not Applicable | The 10k-word dictionary from CMU (S. Vogel) |
 
Large Data Condition
LDC Catalog Number | Title | Description
LDC2003E14 | FBIS Multilanguage Texts | Parallel text from FBIS; document aligned
LDC2004E12 | UN Chinese English Parallel Text Version 2 | Parallel text from UN; 147M English words; sentence aligned
* LDC2004T08 | Hong Kong Parallel Text | Parallel text from Hong Kong SAR; 59M English words; sentence aligned
LDC2002E17 | English Translation of Chinese Treebank |
LDC2002E18 | Xinhua Chinese-English Parallel News Text Version 1.0 beta 2 | 3.6M words
LDC2002L27 | Chinese English Translation Lexicon version 3.0 | 54K headwords
LDC2003E01 | Chinese-English Name Entity Lists version 1.0 beta | 1.5M+ entries
* LDC2005E47 | Chinese English News Magazine Parallel Text | Parallel text from Sinorama; 20M Chinese characters and 9M English words; sentence aligned
LDC2002T01 | Multiple-Translation Chinese (MTC) Corpus | See the catalog page for details
* LDC2003T17 | Multiple Translation Chinese (MTC) Part 2 | See the catalog page for details
* LDC2004T07 | Multiple Translation Chinese (MTC) Part 3 | See the catalog page for details
* LDC2005T06 | Chinese News Translation Text Part 1 | 474K Chinese characters, or about 240K words
* LDC2005T01 | Chinese Treebank 5.0 | 507K words
LDC2003E07 | Chinese Treebank English Parallel Corpus |
Not Applicable | NIST May 2004 MT evaluation data | Can be acquired from NIST
 
Unlimited Data Condition
All publicly available data up to November 30th, 2004

* denotes resources created after the 2004 TIDES MT Evaluation