Translation Guidelines, data format and annotation specifications:

Translation Guidelines (sent to translation agencies):

·         Chinese->English translation guideline for the first set of data in 2001

·         Chinese->English translation guideline for the second set of data in 2001

·         Instructions for the 2002 Chinese->English translation evaluation data

·         Chinese->English translation guidelines for 2003 training and eval data

·         Arabic->English translation guidelines for 2002 evaluation data

·         Arabic->English translation guidelines for 2003 training and eval data

Final Data Format (of published corpora):

·         Final Data Format for LDC Multiple Translation Chinese Corpora

·         Final Data Format for LDC Multiple Translation Arabic Corpora

Specifications for human assessment of translation quality:

·         Specification for human assessment of translation quality

NIST 2003 MT Evaluation Resources:

The allowable training data for each evaluation condition is listed below. Registered participants should contact the LDC to acquire this data:

Small Data Condition

Chinese: English Translation of Chinese Treebank; the 10k-word dictionary from CMU (S. Vogel)

Arabic: Not applicable

Large Data Condition

Chinese:

LDC catalog #   Title
LDC2003E14      FBIS data
LDC2000T47      Hong Kong Laws Parallel Text
LDC2002E16      Hong Kong News Parallel Text, sentence-aligned
LDC2000T46      Hong Kong News Parallel Text -
LDC2000T50      Hong Kong Hansard Parallel Text, aligned at the document level
LDC2002E19      Hong Kong Hansard Parallel Text, aligned at the sentence level
LDC2002E17      English Translation of Chinese Treebank
LDC2002E18      Xinhua Chinese-English Parallel News Text Version 1.0 beta 2
LDC2003E11      UN Chinese-English Parallel Text Version 1.0 beta
LDC2002L27      Chinese English Translation Lexicon version 3.0
LDC2002E58      Sinorama Chinese-English Parallel Text
LDC2002T01      Multiple-Translation Chinese Corpus
LDC2002E53      Multiple Translation Chinese Corpus part 2: NIST June 2002 MT evaluation data
LDC2003E01      Chinese-English Name Entity Lists version 1.0 beta
LDC2002E04      Multiple Translation Chinese Corpus Part 3
LDC2001T11      Chinese Treebank 2.0
LDC2003E06      Chinese Treebank 3.0
LDC2003E07      Chinese Treebank English Parallel Corpus
LDC2003E08      Chinese News Translation Corpus Part 1

Arabic:

LDC catalog #   Title
LDC2003T07      Arabic Treebank Part 1 10k word English Translation
LDC2002E15      UN Arabic English parallel Text
LDC2002E48      Ummah Arabic English Parallel News Text
LDC2002L49      Buckwalter Arabic Morphological Analyzer Version 1.0
LDC2003T06      Arabic Treebank Part 1 v2.0
LDC2002E54      Multiple Translation Arabic Corpus: NIST June 2002 MT evaluation data
LDC2003E05      Arabic News Translation Corpus Part 1
LDC2003E09      Arabic News Translation Corpus Part 2

Unlimited Training Condition

Chinese: All publicly available data up to Jan. 1st, 2003

Arabic: All publicly available data up to Jan. 1st, 2003

Created: 11-Jun-2002

Last updated: 02-May-2003