GALE


Key

  • Phase = TRN (training), DEV (development), EVL (evaluation)
  • Genre = BN (Broadcast News), BC (Broadcast Conversation), NW (Newswire), NG (Newsgroups), WL (Weblogs)
  • Volume expressed in hours for speech sources (BN, BC); tokens for text sources (NW, NG, WL)
  • Token = 1 word (Arabic, English) or 1 character (Chinese); for conversion, 1 word = 1.5 characters
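
As a worked example of the token convention in the key, the Python sketch below converts between Chinese characters and word-equivalents using the stated ratio of 1 word = 1.5 characters. The function names are illustrative assumptions and are not part of any GALE tooling. Under this ratio, the 15,000,000-character targets listed for Chinese text correspond to the 10,000,000-word targets listed for Arabic and English.

```python
from fractions import Fraction

# Ratio stated in the key: 1 word = 1.5 characters.
CHARS_PER_WORD = Fraction(3, 2)


def chars_to_words(chars: int) -> Fraction:
    """Convert a character count to word-equivalents (hypothetical helper)."""
    return Fraction(chars) / CHARS_PER_WORD


def words_to_chars(words: int) -> Fraction:
    """Convert a word count to character-equivalents (hypothetical helper)."""
    return Fraction(words) * CHARS_PER_WORD


if __name__ == "__main__":
    print(int(chars_to_words(15_000_000)))  # 10000000
    print(int(words_to_chars(10_000_000)))  # 15000000
```

Exact fractions are used here so that the conversion does not pick up floating-point rounding.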

Data Matrix for Phase Three of GALE

Updated 4.3.2008



Phase: Training
Release dates: R1 = 10.1.2007, R2 = 4.4.2008, R3 = TBD (no R3 entries are listed). A dash marks an empty cell.

Data Collection

Language  Genre  Unit   Minimum Targeted Volume  Volume Released to Date  R1 Volume    R1 Catalog   R2 Volume    R2 Catalog   Notes
Arabic    NW     Words  See Notes                See Notes                -            -            -            -            All collected NW to be released in Gigaword 4 (2009) and/or ad hoc as required for specific tasks
Arabic    BN     Hours  1000                     1103.561                 372.797      LDC2007E99   730.764      LDC2008E38   Includes both LDC- and web-harvested audio
Arabic    BC     Hours  1000                     950.616                  434.771      LDC2007E99   515.845      LDC2008E38   Includes both LDC- and web-harvested audio
Arabic    WL     Words  10,000,000               98,729,851               26,296,828   LDC2007E102  72,433,023   LDC2008E41
Arabic    NG     Words  10,000,000               70,784,544               27,838,327   LDC2007E102  42,946,217   LDC2008E41
Chinese   NW     Chars  See Notes                See Notes                -            -            -            -            All collected NW to be released in Gigaword 4 (2009) and/or ad hoc as required for specific tasks
Chinese   BN     Hours  1000                     620.824                  377.331      LDC2007E99   243.492      LDC2008E38   Includes both LDC- and web-harvested audio
Chinese   BC     Hours  1000                     697.515                  304.718      LDC2007E99   392.797      LDC2008E38   Includes both LDC- and web-harvested audio
Chinese   WL     Chars  15,000,000               206,361,199              185,550,342  LDC2007E102  20,810,857   LDC2008E41
Chinese   NG     Chars  15,000,000               271,207,196              164,658,186  LDC2007E102  106,549,010  LDC2008E41
English   NW     Words  See Notes                See Notes                -            -            -            -            All collected NW to be released in Gigaword 4 (2009) and/or ad hoc as required for specific tasks
English   BN     Hours  250                      120.59                   60.216       LDC2007E99   60.374       LDC2008E38   Includes both LDC- and web-harvested audio
English   BC     Hours  250                      120.637                  60.234       LDC2007E99   60.403       LDC2008E38   Includes both LDC- and web-harvested audio
English   WL     Words  10,000,000               4,515,219                3,024,785    LDC2007E102  1,490,434    LDC2008E41
English   NG     Words  10,000,000               827,115,809              400,201,443  LDC2007E102  426,914,366  LDC2008E41

Transcription

Language  Genre  Unit   Minimum Targeted Volume  Volume Released to Date  R1 Volume    R1 Catalog   R2 Volume    R2 Catalog   Notes
Arabic    BN     Hours  500                      449.699                  305.593      LDC2007E100  144.106      LDC2008E39   Includes manual and web-harvested transcripts
Arabic    BC     Hours  500                      543.49                   380.938      LDC2007E100  162.552      LDC2008E39   Includes manual and web-harvested transcripts
Chinese   BN     Hours  500                      341.138                  293.252      LDC2007E100  47.886       LDC2008E39   Includes manual and web-harvested transcripts
Chinese   BC     Hours  500                      280.022                  220.471      LDC2007E100  59.551       LDC2008E39   Includes manual and web-harvested transcripts
English   BN     Hours  0                        0                        46.730       LDC2007E100  53.386       LDC2008E39   All English transcripts are CCAP
English   BC     Hours  0                        0                        38.174       LDC2007E100  48.975       LDC2008E39   All English transcripts are CCAP

Translation

Language  Genre  Unit   Minimum Targeted Volume  Volume Released to Date  R1 Volume                                        R1 Catalog                R2 Volume    R2 Catalog   Notes
Arabic    NW     Words  0                        10,319,327               102,377 manual + 10,216,950 found parallel text  LDC2007E101, LDC2007E103  -            -            Includes found parallel text.
Arabic    BN     Words  See Notes                220,797                  159,480                                          LDC2007E101               61,317       LDC2008E40   Targets are unspecified; instructions from data committee are to translate up to 10Kw/source for new or previously under-represented sources.
Arabic    BC     Words  See Notes                442,641                  245,789                                          LDC2007E101               196,852      LDC2008E40   Targets are unspecified; instructions from data committee are to translate up to 10Kw/source for new or previously under-represented sources.
Arabic    WEB*   Words  Up to 250,000            -                        -                                                -                         -            -            *WEB includes both NG and WL.
Chinese   NW     Chars  0                        13,310,949               146,573 manual + 13,164,376 found parallel text  LDC2007E101, LDC2007E103  -            -            Includes found parallel text.
Chinese   BN     Chars  See Notes                501,852                  307,516                                          LDC2007E101               194,336      LDC2008E40   Targets are unspecified; instructions from data committee are to translate up to 10Kw/source for new or previously under-represented sources.
Chinese   BC     Chars  See Notes                485,475                  444,373                                          LDC2007E101               41,102       LDC2008E40   Targets are unspecified; instructions from data committee are to translate up to 10Kw/source for new or previously under-represented sources.
Chinese   WEB*   Chars  Up to 250,000            -                        -                                                -                         -            -            *WEB includes both NG and WL.
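
In the rows above, the Volume Released to Date figure appears to equal the sum of the R1 and R2 release volumes. The Python sketch below checks that relationship for a few rows copied from the matrix. The `Row` type, the `check` and `mismatches` helpers, and the 0.001 rounding tolerance are assumptions made for illustration; they do not come from the source.

```python
from decimal import Decimal
from typing import List, NamedTuple


class Row(NamedTuple):
    label: str     # task, language, genre, unit
    released: str  # "Volume Released to Date" as printed (commas removed)
    r1: str        # R1 volume as printed
    r2: str        # R2 volume as printed


# A few rows copied from the matrix.
ROWS = [
    Row("Data Collection Arabic BN (hours)", "1103.561", "372.797", "730.764"),
    Row("Data Collection Chinese BN (hours)", "620.824", "377.331", "243.492"),
    Row("Data Collection English NG (words)", "827115809", "400201443", "426914366"),
    Row("Transcription English BN (hours)", "0", "46.730", "53.386"),
]

# Assumed allowance for rounding in the hour figures.
TOLERANCE = Decimal("0.001")


def check(row: Row) -> bool:
    """Return True when released equals r1 + r2 within TOLERANCE."""
    diff = Decimal(row.released) - (Decimal(row.r1) + Decimal(row.r2))
    return abs(diff) <= TOLERANCE


def mismatches(rows: List[Row]) -> List[str]:
    """Return the labels of rows whose totals do not match."""
    return [row.label for row in rows if not check(row)]


if __name__ == "__main__":
    for row in ROWS:
        print(f"{row.label}: {'ok' if check(row) else 'MISMATCH'}")
```

Of these sample rows, only the English transcription row is reported as a mismatch, because the matrix lists 0 as its released volume while its R1 and R2 volumes are nonzero. The Chinese BN row passes only because of the tolerance: its R1 and R2 volumes sum to 620.823 against a listed 620.824.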





Data Matrix for Phase One of GALE is available here.

Data Matrix for Phase Two of GALE is available here.