Data Matrix for Year One of GALE

Updated 10.31.2006

Key

  • Phase = TRN (training), DEV (development), EVL (evaluation)
  • Genre = BN (Broadcast News), BC (Broadcast Conversation), NW (Newswire), NG (Newsgroups), WL (Weblogs)
  • Volume is expressed in hours for speech sources (BN, BC) and in tokens for text sources (NW, NG, WL)
  • Token = 1 word (Arabic, English) or 1 character (Chinese); for Chinese, 1 word = 1.5 characters
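
The token convention in the key can be sketched as a small conversion helper. This is only an illustration: the function names and the language strings are hypothetical, and the only figure taken from this page is the 1.5 characters-per-word ratio, assumed here to apply to Chinese only.

```python
# Ratio taken from the key: 1 word = 1.5 characters (assumed to apply to Chinese only).
CHARS_PER_WORD = 1.5


def words_to_tokens(words: float, lang: str) -> float:
    """Convert a word count into tokens as defined in the key.

    Arabic and English: 1 token = 1 word.
    Chinese: 1 token = 1 character, and 1 word = 1.5 characters.
    """
    if lang.lower() == "chinese":
        return words * CHARS_PER_WORD
    return float(words)


def tokens_to_words(tokens: float, lang: str) -> float:
    """Inverse of words_to_tokens."""
    if lang.lower() == "chinese":
        return tokens / CHARS_PER_WORD
    return float(tokens)


if __name__ == "__main__":
    # A 300K-word Chinese target corresponds to 450K characters under this ratio.
    print(words_to_tokens(300_000, "Chinese"))
    print(tokens_to_words(450_000, "Chinese"))
```

Under these assumptions a volume stated in words for Chinese can be compared against character-based token counts, while Arabic and English figures pass through unchanged.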


Columns: Phase, Task, Data Profile (Genre, Lang, Total Volume, Source, Epoch), Annotations, Deliveries, Notes
Deliveries: Kickoff1 (10.17.05), Kickoff2 (11.15.05), Q1 (12.15.05), Q2 (3.15.06), Q3 (6.22.06), Q4 (10.16.06)

TRN Collect BN Arabic 1000 hrs
GALE Collection
2004.05--
manual audit 43.5 hrs
LDC2005E62

213 hrs
LDC2005E80
294 hrs LDC
LDC2006E31
263 hrs
LDC2006E83
355 hrs
LDC2006E89 


Chinese 1000 hrs
GALE Collection 2004.11--
manual audit 65 hrs
LDC2005E62

263 hrs
LDC2005E80
265 hrs LDC;
39 hrs web
LDC2006E31
281 hrs
LDC2006E83
519 hrs
LDC2006E89 


English 50 hrs
GALE Collection
2004.11--
manual audit --

25 hrs
LDC2005E80
10 hrs LDC
LDC2006E31
10 hrs
LDC2006E83
10 hrs
LDC2006E89 


BC Arabic 945 hrs
GALE Collection
2005.10--
manual audit 20 hrs
LDC2005E61

203 hrs
LDC2005E80
130 hrs LDC;
31 hrs web
LDC2006E31
249 hrs
LDC2006E83
391 hrs
LDC2006E89 


Chinese 940 hrs
GALE Collection
2005.03--
manual audit 25 hrs
LDC2005E61

67 hrs
LDC2005E80
188 hrs LDC;
82 hrs web
LDC2006E31
287 hrs
LDC2006E83
519 hrs
LDC2006E89 


English 50 hrs
GALE Collection 2005.06--
manual audit --

27 hrs
LDC2005E80
10 hrs LDC
LDC2006E31
10 hrs
LDC2006E83
10 hrs
LDC2006E89 


NG Arabic 1M+ words
GALE Collection open DataScout guidelines --
886Kw
LDC2005E81
11,516Kw LDC2006E32
2,062Kw
LDC2006E83
1,893Kw
LDC2006E90


Chinese 1M+ words
GALE Collection open DataScout guidelines --
12,860Kw
LDC2005E81
12,719Kw LDC2006E32
17,611Kw
LDC2006E77
21,963Kw
LDC2006E90


English 1M+ words
GALE Collection open DataScout guidelines --
9,082Kw
LDC2005E81
91,639Kw LDC2006E32
13,342Kw
LDC2006E77
13,644Kw
LDC2006E90


WL Arabic 1M+ words
GALE Collection open DataScout guidelines --
2,833Kw
LDC2005E81
4,254Kw LDC2006E32
2,083Kw
LDC2006E77
2,666Kw
LDC2006E90


Chinese 1M+ words
GALE Collection open DataScout guidelines --
4,594Kw
LDC2005E81
5,525Kw
LDC2006E32
2,162Kw
LDC2006E77
3,851Kw
LDC2006E90


English 1M+ words
GALE Collection open DataScout guidelines --
4,297Kw
LDC2005E81
3,394Kw
LDC2006E32
867Kw
LDC2006E77
792Kw
LDC2006E90


Transcribe BN Arabic 870 hrs
VOA archive; GALE collection
TBD web-harvested transcripts;
QuickTR;
QuickRichTR
10 hrs
LDC2005E71

10 hrs QRTR
LDC2005E82
5.305 hrs QRTR
LDC2006E33
47 hrs
LDC2006E84
94.789 hrs QRTR
LDC2006E91


Chinese 540 hrs
GALE collection TBD web-harvested transcripts;
QuickTR;
QuickRichTR
--

71 hrs web; 12.7 hrs QTR; 5 hrs QRTR
LDC2005E82
134 hrs QTR;
21 hrs QRTR;
39 hrs web LDC2006E33
133 hrs
LDC2006E84
65.994 hrs QTR; 86.4 hrs QRTR; 65.677 hrs web
LDC2006E91


English 50 hrs
GALE collection TBD QuickRichTR --

--
--
--
2.561 hrs QRTR
LDC2006E91


BC Arabic 1000 hrs
Al Jazeera web harvest; GALE collection TBD web-harvested transcripts; QuickTR;
QuickRichTR
20 hrs
LDC2005E63

193 hrs web; 12 hrs QRTR
LDC2005E82
31 hrs web
LDC2006E33
64.3 hrs
LDC2006E84
35.865 hrs QRTR; 116.141 hrs web
LDC2006E91


Chinese 1000 hrs
NTDTV web harvest;
GALE collection
TBD web-harvested transcripts; QuickTR;
QuickRichTR
25 hrs
LDC2005E63

11.9 hrs QTR; 12 hrs QRTR

LDC2005E82
86 hrs QTR;
6 hrs QRTR;
65 hrs web
LDC2006E33
166.5 hrs
LDC2006E84
157.6 hrs QTR; 109.9 hrs QRTR; 7.408 hrs web
LDC2006E91


English 50 hrs
GALE collection TBD QuickRichTR --

--
--
--
46.025 hrs
LDC2006E91


Translate BN Arabic 75 hrs
TDT4; VOA Archive; GALE collection
TBD Arabic translation guidelines
--

12 hrs
LDC2005E83
1 hr
--
16.5 hrs
LDC2006E92


Chinese 75 hrs
Hub4; GALE Collection TBD Chinese translation guidelines
--

5 hrs
LDC2005E83
17.5 hrs
LDC2006E34
19 hrs
LDC2006E85
--

BC Arabic 50 hrs
GALE Collection TBD Arabic translation guidelines
--

5 hrs
LDC2005E83
12.5 hrs
LDC2006E34
--
--


Chinese 50 hrs
GALE Collection TBD Chinese translation guidelines
--

5 hrs
LDC2005E83
3.6 hrs
LDC2006E34
--
32.3 hrs
LDC2006E92


NW Arabic 7M words
found parallel text, FBIS archive, GALE Collection open -- --

--

626Kw Arabic;15Mw Chinese
LDC2006G05

-- --

NG Arabic 300K words
GALE collection open Arabic translation guidelines
--

--
--
18K words
LDC2006E85
179.925K words
LDC2006E92


Chinese 300K words
GALE collection open Chinese translation guidelines
--

--
--
77K words
LDC2006E85
218.27K words
LDC2006E92


WL Arabic 200K words
ACE2005; GALE collection
open Arabic translation guidelines
--

63K words
LDC2005E83
40K words
LDC2006E34
--
--


Chinese 200K words
ACE2005; GALE collection open Chinese translation guidelines
--

69K words
LDC2005E83
40K words
LDC2006E34
-- --


Align BN Arabic 3 hrs
subset of above
TBD Arabic alignment guidelines
--

--
--
--
3.9 hrs
LDC2006E93


Chinese 3 hrs
subset of above TBD Chinese alignment guidelines
--

--

--
--
3.4 hrs
LDC2006E93


BC Arabic 3 hrs
subset of above TBD Arabic alignment guidelines --

--

--
--
4.2 hrs
LDC2006E93


Chinese 3 hrs
subset of above TBD Chinese alignment guidelines --

--

--
--
3.4 hrs
LDC2006E93


NW Arabic 100K words
subset of above TBD Arabic alignment guidelines --

--
--
30K words
LDC2006E86
105K words
LDC2006E93


Chinese 100K words
TBD
TBD Chinese alignment guidelines --

--
--
49K words
LDC2006E86
102K words
LDC2006E93


NG Arabic 50K words
subset of above TBD Arabic alignment guidelines --

--
--
--
51K words
LDC2006E93


Chinese 50K words
subset of above TBD Chinese alignment guidelines --

--
--
--
51K words
LDC2006E93


WL Arabic 50K words
subset of above TBD Arabic alignment guidelines --

--
--
--
43K words
LDC2006E93


Chinese 50K words
subset of above TBD Chinese alignment guidelines --

--
--
--
51K words
LDC2006E93


Treebank BN Arabic 30 hrs
TDT4;  VOA archive; GALE collection

Arabic Treebank Guidelines
--

4.5 hrs
LDC2005E84
7.5 hrs
LDC2006E35
7.5 hrs
LDC2006E87
11 hrs
LDC2006E94


NW English (Translation from Arabic)
500K words
AFP translations; An-Nahar translations

English Treebank Guidelines
50K words

50K words
LDC2005E85
75K words
LDC2006E36
101K words
LDC2006E82
140K words
LDC2006E95


Distill BN Arabic   TDT4, (subset of) above data

Distillation Guidelines (draft V0.7)
--

Incremental ad hoc releases (tentative):


DRY RUN data - Chinese & English only

2.13.2006

  • 10+ queries (snippets, nuggets)

TRAINING data (per language)

3.15.2006
  • 10+ queries (snippets)
3.30.2006
  • 25+ queries (snippets)
4.14.2006
  • 100+ queries (snippets)
5.1.2006
  • 25+ queries (nuggets, (super)nugs)
5.15.2006
  • 150+ queries (snippets)
  • 50+ queries (nuggets + (super)nugs)
6.15.2006
  • 150+ queries (snippets)
  • 50+ queries (nuggets + (super)nugs)
Final Release - November 2006
  • 200+ queries (snippets)
  • 100+ queries (nuggets + (super)nugs)



Chinese   TDT4, (subset of) above data
--



English   TDT4, (subset of) above data
--



BC Arabic   (subset of) above data
--



Chinese   (subset of) above data
--



English   (subset of) above data
--



NW Arabic   TDT4, TDT5, (subset of) above data
--



Chinese   TDT4, TDT5, (subset of) above data
--



English   TDT4, TDT5, (subset of) above data
--



NG Arabic   (subset of) above data
--



Chinese   (subset of) above data
--



English   (subset of) above data
--



WL Arabic   (subset of) above data
--



Chinese   (subset of) above data
--



English   (subset of) above data
--


Columns: Data Profile (Resource Name, Size, Source, Description), Distribution Plan, Notes
Distribution Plan: Kickoff1 (10.17.05), Kickoff2 (11.15.05), Q1 (12.15.05), Q2 (3.15.06), Q3 (6.15.06), Q4 (9.15.06)

General Resources
ACE Arabic Namelist

ACE2003-2005 training corpora
Arabic names extracted from ACE training data, plus type and frequency
LDC2005E66






ATB Arabic Namelist

ATB publications
Arabic names extracted from Arabic Treebank, plus frequency
LDC2005E68






Levantine Arabic CTS Audio 40 hrs Levantine Fisher
Conversational telephone speech

LDC2005E76






Levantine Arabic CTS Transcripts 40 hrs
Levantine Fisher "Green" and "Yellow" layer transcripts of conversational telephone speech
LDC2005E77





Levantine Arabic CTS Treebank 33Kwords Levantine Fisher


LDC2005E78





English CTS Treebank with Structural Metadata 140Kwords English Switchboard and Fisher


LDC2005E79





NGA Name Database

NGA


LDC2005G01 



FOUO

BGN Romanization Guide
NGA


LDC2005G03



FOUO

Intel Transliteration Standard
Intel


LDC2005G02



FOUO