| | | Language | Genre | Unit | Minimum Targeted Volume | Volume Released to Date | R1 (10.1.2007) | R2 (4.4.2008) | R3 (TBD) | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| Training Data | Collection | Arabic | NW | Words | See notes | See notes | | | | All collected NW to be released in Gigaword 4 (2009) and/or ad hoc as required for specific tasks |
| | | | BN | Hours | 1000 | 1103.561 | 372.797 (LDC2007E99) | 730.764 (LDC2008E38) | | Includes both LDC- and web-harvested audio |
| | | | BC | Hours | 1000 | 950.616 | 434.771 (LDC2007E99) | 515.845 (LDC2008E38) | | Includes both LDC- and web-harvested audio |
| | | | WL | Words | 10,000,000 | 98,729,851 | 26,296,828 (LDC2007E102) | 72,433,023 (LDC2008E41) | | |
| | | | NG | Words | 10,000,000 | 70,784,544 | 27,838,327 (LDC2007E102) | 42,946,217 (LDC2008E41) | | |
| | | Chinese | NW | Chars | See notes | See notes | | | | All collected NW to be released in Gigaword 4 (2009) and/or ad hoc as required for specific tasks |
| | | | BN | Hours | 1000 | 620.824 | 377.331 (LDC2007E99) | 243.492 (LDC2008E38) | | Includes both LDC- and web-harvested audio |
| | | | BC | Hours | 1000 | 697.515 | 304.718 (LDC2007E99) | 392.797 (LDC2008E38) | | Includes both LDC- and web-harvested audio |
| | | | WL | Chars | 15,000,000 | 206,361,199 | 185,550,342 (LDC2007E102) | 20,810,857 (LDC2008E41) | | |
| | | | NG | Chars | 15,000,000 | 271,207,196 | 164,658,186 (LDC2007E102) | 106,549,010 (LDC2008E41) | | |
| | | English | NW | Words | See notes | See notes | | | | All collected NW to be released in Gigaword 4 (2009) and/or ad hoc as required for specific tasks |
| | | | BN | Hours | 250 | 120.590 | 60.216 (LDC2007E99) | 60.374 (LDC2008E38) | | Includes both LDC- and web-harvested audio |
| | | | BC | Hours | 250 | 120.637 | 60.234 (LDC2007E99) | 60.403 (LDC2008E38) | | Includes both LDC- and web-harvested audio |
| | | | WL | Words | 10,000,000 | 4,515,219 | 3,024,785 (LDC2007E102) | 1,490,434 (LDC2008E41) | | |
| | | | NG | Words | 10,000,000 | 827,115,809 | 400,201,443 (LDC2007E102) | 426,914,366 (LDC2008E41) | | |
| | Transcription | Arabic | BN | Hours | 500 | 449.699 | 305.593 (LDC2007E100) | 144.106 (LDC2008E39) | | Includes manual and web-harvested transcripts |
| | | | BC | Hours | 500 | 543.490 | 380.938 (LDC2007E100) | 162.552 (LDC2008E39) | | Includes manual and web-harvested transcripts |
| | | Chinese | BN | Hours | 500 | 341.138 | 293.252 (LDC2007E100) | 47.886 (LDC2008E39) | | Includes manual and web-harvested transcripts |
| | | | BC | Hours | 500 | 280.022 | 220.471 (LDC2007E100) | 59.551 (LDC2008E39) | | Includes manual and web-harvested transcripts |
| | | English | BN | Hours | 0 | 0 | 46.730 (LDC2007E100) | 53.386 (LDC2008E39) | | All English transcripts are CCAP |
| | | | BC | Hours | 0 | 0 | 38.174 (LDC2007E100) | 48.975 (LDC2008E39) | | All English transcripts are CCAP |
| | Translation | Arabic | NW | Words | 0 | 10,319,327 | 102,377 manual + 10,216,950 found parallel text (LDC2007E101, LDC2007E103) | | | Includes found parallel text. |
| | | | BN | Words | See notes | 220,797 | 159,480 (LDC2007E101) | 61,317 (LDC2008E40) | | Targets are unspecified; instructions from data committee are to translate up to 10Kw/source for new or previously under-represented sources. |
| | | | BC | Words | See notes | 442,641 | 245,789 (LDC2007E101) | 196,852 (LDC2008E40) | | Targets are unspecified; instructions from data committee are to translate up to 10Kw/source for new or previously under-represented sources. |
| | | | WEB* | Words | Up to 250,000 | | | | | *WEB includes both NG and WL. |
| | | Chinese | NW | Chars | 0 | 13,310,949 | 146,573 manual + 13,164,376 found parallel text (LDC2007E101, LDC2007E103) | | | Includes found parallel text. |
| | | | BN | Chars | See notes | 501,852 | 307,516 (LDC2007E101) | 194,336 (LDC2008E40) | | Targets are unspecified; instructions from data committee are to translate up to 10Kw/source for new or previously under-represented sources. |
| | | | BC | Chars | See notes | 485,475 | 444,373 (LDC2007E101) | 41,102 (LDC2008E40) | | Targets are unspecified; instructions from data committee are to translate up to 10Kw/source for new or previously under-represented sources. |
| | | | WEB* | Chars | Up to 250,000 | | | | | *WEB includes both NG and WL. |