Data Matrix for Year One of GALE
Updated 10.31.2006
Key
- Phase = TRN (training), DEV (development), EVL (evaluation)
- Genre = BN (Broadcast News), BC (Broadcast Conversation), NW (Newswire), NG (Newsgroups), WL (Weblogs)
- Volume expressed in hours for speech sources (BN, BC); tokens for text sources (NW, NG, WL)
- Token = 1 word (Arabic, English) or 1 character (Chinese); 1 word = 1.5 characters
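The token rule in the Key can be sketched as a small conversion helper. This is a minimal illustration, assuming the 1.5 characters-per-word ratio applies to Chinese text only; the function and constant names are hypothetical and do not come from the GALE documentation.

```python
# Ratio taken from the Key: 1 word = 1.5 characters.
# Assumption: the ratio is meant for Chinese, where a token is one character.
CHARS_PER_WORD = 1.5


def chinese_chars_to_words(chars: float) -> float:
    """Convert a Chinese character (token) count to a word-equivalent count."""
    return chars / CHARS_PER_WORD


def words_to_chinese_chars(words: float) -> float:
    """Convert a word count to the equivalent number of Chinese characters."""
    return words * CHARS_PER_WORD


def word_equivalents(tokens: float, lang: str) -> float:
    """Express a token volume in words, whatever the language.

    Arabic and English tokens are already words; Chinese tokens are
    characters and are scaled by the ratio from the Key.
    """
    if lang == "Chinese":
        return chinese_chars_to_words(tokens)
    if lang in ("Arabic", "English"):
        return float(tokens)
    raise ValueError(f"unknown language: {lang}")
```

For example, a 300K-token Chinese target corresponds to 200K word-equivalents under this ratio, while a 300K-token Arabic target is unchanged.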
| Phase | Task | Data Profile | Annotations | Deliveries | Notes |
| Genre | Lang | Total Volume | Source | Epoch | Kickoff1 10.17.05 | Kickoff2 11.15.05 | Q1 12.15.05 | Q2 3.15.06 | Q3 6.22.06 | Q4 10.16.06 |
| TRN | Collect | BN | Arabic | 1000 hrs | GALE Collection | 2004.05-- | manual audit | 43.5 hrs LDC2005E62 | 213 hrs LDC2005E80 | 294 hrs LDC LDC2006E31 | 263 hrs LDC2006E83 | 355 hrs LDC2006E89 |
| Chinese | 1000 hrs | GALE Collection | 2004.11-- | manual audit | 65 hrs LDC2005E62 | 263 hrs LDC2005E80 | 265 hrs LDC; 39 hrs web LDC2006E31 | 281 hrs LDC2006E83 | 519 hrs LDC2006E89 |
| English | 50 hrs | GALE Collection | 2004.11-- | manual audit | -- | 25 hrs LDC2005E80 | 10 hrs LDC LDC2006E31 | 10 hrs LDC2006E83 | 10 hrs LDC2006E89 |
| BC | Arabic | 945 hrs | GALE Collection | 2005.10-- | manual audit | 20 hrs LDC2005E61 | 203 hrs LDC2005E80 | 130 hrs LDC; 31 hrs web LDC2006E31 | 249 hrs LDC2006E83 | 391 hrs LDC2006E89 |
| Chinese | 940 hrs | GALE Collection | 2005.03-- | manual audit | 25 hrs LDC2005E61 | 67 hrs LDC2005E80 | 188 hrs LDC; 82 hrs web LDC2006E31 | 287 hrs LDC2006E83 | 519 hrs LDC2006E89 |
| English | 50 hrs | GALE Collection | 2005.06-- | manual audit | -- | 27 hrs LDC2005E80 | 10 hrs LDC LDC2006E31 | 10 hrs LDC2006E83 | 10 hrs LDC2006E89 |
| NG | Arabic | 1M+ words | GALE Collection | open | DataScout guidelines | -- | 886Kw LDC2005E81 | 11,516Kw LDC2006E32 | 2,062Kw LDC2006E83 | 1,893Kw LDC2006E90 |
| Chinese | 1M+ words | GALE Collection | open | DataScout guidelines | -- | 12,860Kw LDC2005E81 | 12,719Kw LDC2006E32 | 17,611Kw LDC2006E77 | 21,963Kw LDC2006E90 |
| English | 1M+ words | GALE Collection | open | DataScout guidelines | -- | 9,082Kw LDC2005E81 | 91,639Kw LDC2006E32 | 13,342Kw LDC2006E77 | 13,644Kw LDC2006E90 |
| WL | Arabic | 1M+ words | GALE Collection | open | DataScout guidelines | -- | 2,833Kw LDC2005E81 | 4,254Kw LDC2006E32 | 2,083Kw LDC2006E77 | 2,666Kw LDC2006E90 |
| Chinese | 1M+ words | GALE Collection | open | DataScout guidelines | -- | 4,594Kw LDC2005E81 | 5,525Kw LDC2006E32 | 2,162Kw LDC2006E77 | 3,851Kw LDC2006E90 |
| English | 1M+ words | GALE Collection | open | DataScout guidelines | -- | 4,297Kw LDC2005E81 | 3,394Kw LDC2006E32 | 867Kw LDC2006E77 | 792Kw LDC2006E90 |
| Transcribe | BN | Arabic | 870 hrs | VOA archive; GALE collection | TBD | web-harvested transcripts; QuickTR; QuickRichTR | 10 hrs LDC2005E71 | 10 hrs QRTR LDC2005E82 | 5.305 hrs QRTR LDC2006E33 | 47 hrs LDC2006E84 | 94.789 hrs QRTR LDC2006E91 |
| Chinese | 540 hrs | GALE collection | TBD | web-harvested transcripts; QuickTR; QuickRichTR | -- | 71 hrs web; 12.7 hrs QTR; 5 hrs QRTR LDC2005E82 | 134 hrs QTR; 21 hrs QRTR; 39 hrs web LDC2006E33 | 133 hrs LDC2006E84 | 65.994 hrs QTR; 86.4 hrs QRTR; 65.677 hrs web LDC2006E91 |
| English | 50 hrs | GALE collection | TBD | QuickRichTR | -- | -- | -- | -- | 2.561 hrs QRTR LDC2006E91 |
| BC | Arabic | 1000 hrs | Al Jazeera web harvest; GALE collection | TBD | web-harvested transcripts; QuickTR; QuickRichTR | 20 hrs LDC2005E63 | 193 hrs web; 12 hrs QRTR LDC2005E82 | 31 hrs web LDC2006E33 | 64.3 hrs LDC2006E84 | 35.865 hrs QRTR; 116.141 hrs web LDC2006E91 |
| Chinese | 1000 hrs | NTDTV web harvest; GALE collection | TBD | web-harvested transcripts; QuickTR; QuickRichTR | 25 hrs LDC2005E63 | 11.9 hrs QTR; 12 hrs QRTR LDC2005E82 | 86 hrs QTR; 6 hrs QRTR; 65 hrs web LDC2006E33 | 166.5 hrs LDC2006E84 | 157.6 hrs QTR; 109.9 hrs QRTR; 7.408 hrs web LDC2006E91 |
| English | 50 hrs | GALE collection | TBD | QuickRichTR | -- | -- | -- | -- | 46.025 hrs LDC2006E91 |
| Translate | BN | Arabic | 75 hrs | TDT4; VOA Archive; GALE collection | TBD | Arabic translation guidelines | -- | 12 hrs LDC2005E83 | 1 hr | -- | 16.5 hrs LDC2006E92 |
| Chinese | 75 hrs | Hub4; GALE Collection | TBD | Chinese translation guidelines | -- | 5 hrs LDC2005E83 | 17.5 hrs LDC2006E34 | 19 hrs LDC2006E85 | -- |
| BC | Arabic | 50 hrs | GALE Collection | TBD | Arabic translation guidelines | -- | 5 hrs LDC2005E83 | 12.5 hrs LDC2006E34 | -- | -- |
| Chinese | 50 hrs | GALE Collection | TBD | Chinese translation guidelines | -- | 5 hrs LDC2005E83 | 3.6 hrs LDC2006E34 | -- | 32.3 hrs LDC2006E92 |
| NW | Arabic | 7M words | found parallel text, FBIS archive, GALE Collection | open | -- | -- | -- | 626Kw Arabic; 15Mw Chinese | -- | -- |
| NG | Arabic | 300K words | GALE collection | open | Arabic translation guidelines | -- | -- | -- | 18K words LDC2006E85 | 179.925K words LDC2006E92 |
| Chinese | 300K words | GALE collection | open | Chinese translation guidelines | -- | -- | -- | 77K words LDC2006E85 | 218.27K words LDC2006E92 |
| WL | Arabic | 200K words | ACE2005; GALE collection | open | Arabic translation guidelines | -- | 63K words LDC2005E83 | 40K words LDC2006E34 | -- | -- |
| Chinese | 200K words | ACE2005; GALE collection | open | Chinese translation guidelines | -- | 69K words LDC2005E83 | 40K words LDC2006E34 | -- | -- |
| Align | BN | Arabic | 3 hrs | subset of above | TBD | Arabic alignment guidelines | -- | -- | -- | -- | 3.9 hrs LDC2006E93 |
| Chinese | 3 hrs | subset of above | TBD | Chinese alignment guidelines | -- | -- | -- | -- | 3.4 hrs LDC2006E93 |
| BC | Arabic | 3 hrs | subset of above | TBD | Arabic alignment guidelines | -- | -- | -- | -- | 4.2 hrs LDC2006E93 |
| Chinese | 3 hrs | subset of above | TBD | Chinese alignment guidelines | -- | -- | -- | -- | 3.4 hrs LDC2006E93 |
| NW | Arabic | 100K words | subset of above | TBD | Arabic alignment guidelines | -- | -- | -- | 30K words LDC2006E86 | 105K words LDC2006E93 |
| Chinese | 100K words | TBD | TBD | Chinese alignment guidelines | -- | -- | -- | 49K words LDC2006E86 | 102K words LDC2006E93 |
| NG | Arabic | 50K words | subset of above | TBD | Arabic alignment guidelines | -- | -- | -- | -- | 51K words LDC2006E93 |
| Chinese | 50K words | subset of above | TBD | Chinese alignment guidelines | -- | -- | -- | -- | 51K words LDC2006E93 |
| WL | Arabic | 50K words | subset of above | TBD | Arabic alignment guidelines | -- | -- | -- | -- | 43K words LDC2006E93 |
| Chinese | 50K words | subset of above | TBD | Chinese alignment guidelines | -- | -- | -- | -- | 51K words LDC2006E93 |
| Treebank | BN | Arabic | 30 hrs | TDT4; VOA archive; GALE collection | Arabic Treebank Guidelines | -- | 4.5 hrs LDC2005E84 | 7.5 hrs LDC2006E35 | 7.5 hrs LDC2006E87 | 11 hrs LDC2006E94 |
| NW | English (Translation from Arabic) | 500K words | AFP translations; An-Nahar translations | English Treebank Guidelines | 50K words | 50K words LDC2005E85 | 75K words LDC2006E36 | 101K words LDC2006E82 | 140K words LDC2006E95 |
| Distill | BN | Arabic | TDT4, (subset of) above data | Distillation Guidelines draft V0.7 | -- | Incremental ad hoc releases (tentative): DRY RUN data - Chinese & English only, 2.13.2006; TRAINING data (per language), 3.15.2006 |
| Chinese | TDT4, (subset of) above data | -- |
| English | TDT4, (subset of) above data | -- |
| BC | Arabic | (subset of) above data | -- |
| Chinese | (subset of) above data | -- |
| English | (subset of) above data | -- |
| NW | Arabic | TDT4, TDT5, (subset of) above data | -- |
| Chinese | TDT4, TDT5, (subset of) above data | -- |
| English | TDT4, TDT5, (subset of) above data | -- |
| NG | Arabic | (subset of) above data | -- |
| Chinese | (subset of) above data | -- |
| English | (subset of) above data | -- |
| WL | Arabic | (subset of) above data | -- |
| Chinese | (subset of) above data | -- |
| English | (subset of) above data | -- |
| Data Profile | Distribution Plan | Notes |
| Resource Name | Size | Source | Description | Kickoff1 10.17.05 | Kickoff2 11.15.05 | Q1 12.15.05 | Q2 3.15.06 | Q3 6.15.06 | Q4 9.15.06 |
| General Resources | ACE Arabic Namelist | ACE2003-2005 training corpora | Arabic names extracted from ACE training data, plus type and frequency | LDC2005E66 |
| ATB Arabic Namelist | ATB publications | Arabic names extracted from Arabic Treebank, plus frequency | LDC2005E68 |
| Levantine Arabic CTS Audio | 40 hrs | Levantine Fisher | Conversational telephone speech | LDC2005E76 |
| Levantine Arabic CTS Transcripts | 40 hrs | Levantine Fisher | "Green" and "Yellow" layer transcripts of conversational telephone speech | LDC2005E77 |
| Levantine Arabic CTS Treebank | 33K words | Levantine Fisher | LDC2005E78 |
| English CTS Treebank with Structural Metadata | 140K words | English Switchboard and Fisher | LDC2005E79 |
| NGA Name Database | NGA | LDC2005G01 | FOUO |
| BGN Romanization Guide | NGA | LDC2005G03 | FOUO |
| Intel Transliteration Standard | Intel | LDC2005G02 | FOUO |
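
Delivery cells in the matrix follow a loose pattern: one or more volumes with a unit (hrs, Kw, Mw, K words), an optional label such as LDC, web, QTR, or QRTR, and a release ID such as LDC2006E31. The sketch below shows one way such a cell could be split into volumes and a release ID. The pattern is inferred from the rows above rather than from any documented format, and the function name `parse_delivery_cell` is hypothetical.

```python
import re

# Release IDs as they appear in the matrix, e.g. LDC2005E62 or LDC2005G01.
RELEASE_ID = re.compile(r"\bLDC\d{4}[A-Z]\d{2}\b")

# A number (commas and decimals allowed) followed by a unit used in the matrix.
VOLUME = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(hrs?|Kw|Mw|K\s*words)\b", re.IGNORECASE
)


def _normalize_unit(raw: str) -> str:
    """Map unit spellings seen in the matrix onto hrs, Kw, or Mw."""
    unit = re.sub(r"\s+", "", raw).lower()
    if unit in ("hr", "hrs"):
        return "hrs"
    if unit in ("kw", "kwords"):
        return "Kw"
    return "Mw"


def parse_delivery_cell(cell: str):
    """Return ([(amount, unit), ...], release_id_or_None) for one cell."""
    volumes = [
        (float(number.replace(",", "")), _normalize_unit(unit))
        for number, unit in VOLUME.findall(cell)
    ]
    match = RELEASE_ID.search(cell)
    return volumes, (match.group(0) if match else None)


if __name__ == "__main__":
    # Arabic BN collection deliveries, copied from the first data row.
    cells = [
        "43.5 hrs LDC2005E62",
        "213 hrs LDC2005E80",
        "294 hrs LDC LDC2006E31",
        "263hrs LDC2006E83",
        "355 hrs LDC2006E89",
    ]
    total = sum(amount for c in cells for amount, _ in parse_delivery_cell(c)[0])
    print(f"Arabic BN collected: {total} hrs against a 1000 hrs target")
```

Cells that hold only `--` yield an empty volume list and no release ID, so they drop out of any totals naturally.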