The allowable training data for each evaluation condition is listed below. Registered participants should contact the LDC to acquire this data:
| Small Data Condition | ||
| | English Translation of Chinese Treebank | |
| Not Applicable | The 10k-word dictionary from CMU (S. Vogel) | |
| Large Data Condition | ||
| LDC Catalog Number | Title | Description |
| | FBIS Multilanguage Texts | Parallel text from FBIS; document aligned |
| | UN Chinese English Parallel Text Version 2 | Parallel text from UN; 147M English words; sentence aligned |
| * LDC2004T08 | Hong Kong Parallel Text | Parallel text from Hong Kong SAR; 59M English words, sentence aligned |
| | English Translation of Chinese Treebank | |
| LDC2002E18 | Xinhua Chinese-English Parallel News Text Version 1.0 beta 2 | 3.6M words |
| LDC2002L27 | Chinese English Translation Lexicon version 3.0 | 54K headwords |
| LDC2003E01 | Chinese-English Name Entity Lists version 1.0 beta | 1.5M+ entries |
| * LDC2005E47 | Chinese English News Magazine Parallel Text | Parallel text from Sinorama; 20M Chinese characters and 9M English words; sentence aligned |
| LDC2002T01 | Multiple-Translation Chinese (MTC) Corpus | See the catalog page for details |
| * LDC2003T17 | Multiple Translation Chinese (MTC) Part 2 | See the catalog page for details |
| * LDC2004T07 | Multiple Translation Chinese (MTC) Part 3 | See the catalog page for details |
| * LDC2005T06 | Chinese News Translation Text Part 1 | 474K Chinese characters, or about 240K words |
| * LDC2005T01 | Chinese Treebank 5.0 | 507K words |
| LDC2003E07 | Chinese Treebank English Parallel Corpus | |
| Not Applicable | NIST May 2004 MT evaluation data can be acquired from NIST | |
| Unlimited Data Condition | ||
| All publicly available data up to November 30th, 2004 | ||
* denotes resources created after the 2004 TIDES MT Evaluation