ACE 2005 Corpora

Data Profile

Training data: 300,000 words/language
Test data: 50,000 words/language

Data Type                                   English  Chinese  Arabic
newswire (NW)                               20%      40%      40%
broadcast news (BN)                         20%      40%      40%
broadcast conversation (BC)                 15%      0%       0%
weblogs (blogs) (WL)                        15%      20%      20%
usenet newsgroups, discussion forums (UN)   15%      0%       0%
conversational telephone speech (CTS)       15%      0%       0%
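
The percentages above can be turned into nominal word counts per data type. The following is a minimal sketch, assuming that the same percentage split applies to both the training and the test data (the profile does not say so explicitly). The names SHARE_PERCENT and nominal_words are illustrative and not part of any corpus tooling.

```python
# Target sizes in words per language, taken from the profile above.
TRAINING_WORDS = 300_000
TEST_WORDS = 50_000

# Percentage of each language's data drawn from each data type.
SHARE_PERCENT = {
    "English": {"NW": 20, "BN": 20, "BC": 15, "WL": 15, "UN": 15, "CTS": 15},
    "Chinese": {"NW": 40, "BN": 40, "BC": 0, "WL": 20, "UN": 0, "CTS": 0},
    "Arabic": {"NW": 40, "BN": 40, "BC": 0, "WL": 20, "UN": 0, "CTS": 0},
}


def nominal_words(language, total_words):
    """Return the nominal word count per data type for one language.

    Assumption: the percentage split applies uniformly to whichever
    total (training or test) is passed in.
    """
    shares = SHARE_PERCENT[language]
    # Each language's column in the table sums to 100%.
    assert sum(shares.values()) == 100
    return {data_type: total_words * pct // 100 for data_type, pct in shares.items()}


if __name__ == "__main__":
    for language in SHARE_PERCENT:
        print(language, nominal_words(language, TRAINING_WORDS))
```

Under that assumption, English newswire would account for 60,000 training words, while Chinese and Arabic newswire would each account for 120,000.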

Training and Test Epochs

English

  • NW: training epoch March-June 2003; test epoch July-Aug 2003
  • BN: training epoch March-June 2003; test epoch July-Aug 2003
  • BC: training epoch March-June 2003; test epoch July-Aug 2003
  • WL: training epoch Nov 2004-Feb 2005; test epoch March-April 2005
  • UN: training epoch Nov 2004-Feb 2005; test epoch March-April 2005
  • CTS: training epoch Nov-Dec 2004; test epoch Nov-Dec 2004 (Note: For CTS data, test data is differentiated from training data by conversational topic rather than calendar epoch)

Chinese

  • NW: training epoch Oct-Dec 2000; test epoch Jan 2001
  • BN: training epoch Oct-Dec 2000; test epoch Jan 2001
  • WL: training epoch Nov 2004-Feb 2005; test epoch March-April 2005

Arabic

  • NW: training epoch Oct-Dec 2000; test epoch Jan 2001
  • BN: training epoch Oct-Dec 2000; test epoch Jan 2001
  • WL: training epoch Nov 2004-Feb 2005; test epoch March-April 2005
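
The epochs listed above can be used to decide whether a document's date places it in the training or the test portion. The following is a minimal sketch, assuming each epoch runs from the first day of its start month to the last day of its end month (the profile gives months only). The names EPOCHS and split_for are illustrative. CTS is deliberately not handled by date, because its training and test data are differentiated by conversational topic.

```python
from datetime import date

# Epochs as ((start_year, start_month), (end_year, end_month)), months inclusive.
NEWS_2003 = {"training": ((2003, 3), (2003, 6)), "test": ((2003, 7), (2003, 8))}
NEWS_2000 = {"training": ((2000, 10), (2000, 12)), "test": ((2001, 1), (2001, 1))}
WEB_2004 = {"training": ((2004, 11), (2005, 2)), "test": ((2005, 3), (2005, 4))}

EPOCHS = {
    "English": {"NW": NEWS_2003, "BN": NEWS_2003, "BC": NEWS_2003,
                "WL": WEB_2004, "UN": WEB_2004},
    "Chinese": {"NW": NEWS_2000, "BN": NEWS_2000, "WL": WEB_2004},
    "Arabic": {"NW": NEWS_2000, "BN": NEWS_2000, "WL": WEB_2004},
}


def split_for(language, data_type, doc_date):
    """Return "training", "test", or None for a document date.

    None means the date falls outside both epochs, or that the data
    type has no date-based split (CTS is split by topic instead).
    """
    if data_type == "CTS":
        return None
    year_month = (doc_date.year, doc_date.month)
    for split, (start, end) in EPOCHS[language].get(data_type, {}).items():
        # Tuples compare year first, then month, so this is a month-range check.
        if start <= year_month <= end:
            return split
    return None


if __name__ == "__main__":
    print(split_for("English", "NW", date(2003, 4, 15)))
    print(split_for("Chinese", "BN", date(2001, 1, 5)))
```

Comparing (year, month) tuples keeps the check correct for epochs that cross a year boundary, such as Nov 2004-Feb 2005.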