Overview of LDC's GALE Activities
The Linguistic Data Consortium (LDC) is developing integrated linguistic resources and related infrastructure to support language exploitation technologies within the DARPA GALE (Global Autonomous Language Exploitation) Program.
There are five major tasks: Data Collection, Transcription, Translation and Alignment, XBanking, and Distillation Annotation. Infrastructure tasks include Resource Distribution and Technical Support.
Data Collection
LDC will collect Arabic, Chinese and English data in multiple genres.
The following collections have been targeted at the program’s outset:
- Broadcast News (BN) - consisting of "talking head"-style news broadcasts from radio and/or television networks.
- Broadcast Conversation (BC) - consisting of talk shows plus roundtable discussions and other interactive-style broadcasts from radio and/or television networks.
- Newswire (NW) - consisting of newswire feeds in multiple languages.
- Conversational Telephone Speech (CTS) - consisting of phone conversations of variable duration among subjects who may or may not know each other, with topics that may or may not be assigned.
- Web Newsgroups (NG) - consisting of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums.
- Weblogs (WL) - consisting of posts to informal web-based journals of varying topical content.
For all genres, collection includes infrastructure for harvesting the data, application of formatting standards to text data, and license fees to data providers so that data may be redistributed to GALE participants.
Year One activities focus on collection of all genres except Conversational Telephone Speech.
Transcription
LDC will create, purchase or otherwise acquire verbatim transcripts for
large volumes of Arabic and Chinese broadcast speech. Transcripts will
conform to a quick, rich transcription specification, including
accurate transcription of content words, speaker identification and
standardized punctuation and orthography but no additional markup. A
small amount of data will be dually transcribed and discrepancies
adjudicated to provide data for analysis of human transcription
variation.
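As a rough illustration of how variation between two independent transcripts of the same audio might be quantified before adjudication, the sketch below computes a word-level edit distance and normalizes it by the longer transcript. The function names, the lowercasing and whitespace tokenization, and the normalization are assumptions made for this example; they are not taken from LDC's transcription specification.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of word tokens."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def disagreement_rate(transcript_a, transcript_b):
    """Fraction of word positions on which two transcripts differ.

    Hypothetical measure: edit distance divided by the length of the
    longer transcript, after lowercasing and whitespace tokenization.
    """
    a = transcript_a.lower().split()
    b = transcript_b.lower().split()
    if not a and not b:
        return 0.0
    return word_edit_distance(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    first = "the president arrived in cairo today"
    second = "the president arrive in cairo"
    print(f"disagreement: {disagreement_rate(first, second):.3f}")
```

Segments with a high rate under a measure like this would be natural candidates for adjudication, although the actual adjudication criteria are not described here.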
Translation and Alignment
LDC will collect, produce or otherwise acquire manual translations of
targeted data for GALE. In addition to contracting with professional
translation agencies to provide data, LDC will continue to harvest
parallel text from the internet. Translations in Year One will focus on
broadcast sources plus weblogs and newsgroups.
Word alignment of parallel text is a critical component of statistical MT. The performance of word alignment algorithms has a direct impact on the performance of MT systems. LDC will provide manually word-aligned parallel text for training and evaluating word alignment algorithms. An easy-to-use GUI tool will be developed to speed up the annotation process, which will be carried out by bilingual speakers of the language pairs. Alignment will be performed for all genres except CTS.
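To make the evaluation use of manually aligned text concrete, the sketch below scores an automatic alignment against a manual one. It assumes alignments are written as space-separated "i-j" index pairs (source word index, target word index), a convention common in MT toolkits; the format in which LDC delivers alignments is not specified here, and the function names are hypothetical.

```python
def parse_links(line):
    """Parse 'i-j' pairs into a set of (source_index, target_index) tuples."""
    links = set()
    for pair in line.split():
        src, tgt = pair.split("-")
        links.add((int(src), int(tgt)))
    return links


def score_alignment(gold, predicted):
    """Return (precision, recall, F1) of predicted links against gold links."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative data only: manual links vs. links from an aligner.
    gold = parse_links("0-0 1-1 2-2 3-2")
    predicted = parse_links("0-0 1-1 2-3")
    p, r, f = score_alignment(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Treating each alignment as a set of links keeps the comparison independent of the order in which annotators or algorithms emit them.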
XBanks
XBanking refers to syntactic and semantic annotation including
part-of-speech tagging, Treebanking and PropBanking. LDC will create
parallel Treebanks and extend Treebanking (syntactic annotation) into
spoken modes. Year One focus will include the Treebanks for Arabic
broadcast speech, plus English translations of Arabic newswire data
that has already been Treebanked.
Distillation
LDC will provide manual and/or semi-automatic annotations of collected,
transcribed and/or translated data to support the distillation research
task. Annotation tasks may include entity, relation and event tagging
plus co-reference within and/or across documents; topic tagging; event
or incident linking; question answering and/or other tasks as specified
by program sponsors and community members.
Resource Distribution
LDC will deliver specified linguistic resources via web download, FTP,
CD, DVD or other medium to program participants, sponsor(s), and/or the
sponsor's representative(s) as required. Delivery format of the
distributions will be determined based on size of the data sets. Media
shipments will use the express service (FedEx or DHL) most appropriate
given the location of the recipient. At the program's outset,
recipients will be required to sign licenses that recognize the
copyright of the original data providers and limit recipients' use of
the data to linguistic education, research and technology development.
The number and type of distributions each year will be determined based
on final content of tasks described above. Distribution schedules will
be established in negotiations with the GALE sponsors and community.
Task Integration
Wherever possible, LDC will use the same collected material for
multiple follow-on tasks. For instance, we will make every effort to
select data for Arabic Treebanking that has also been selected for
Translation and Alignment and to perform Distillation annotation on
this same set of data. This will save costs and also offer the
opportunity to learn from the multiple annotations on the same source
data.
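One simple way to picture this overlap is as an intersection of the document sets selected for each task. The sketch below is illustrative only: the task names mirror the sections above, but the document IDs and the function name are invented for the example and do not correspond to actual GALE data.

```python
def shared_documents(selections):
    """Return the document IDs selected for every task, in sorted order.

    `selections` maps a task name to an iterable of document IDs.
    """
    if not selections:
        return []
    common = set.intersection(*(set(ids) for ids in selections.values()))
    return sorted(common)


if __name__ == "__main__":
    # Hypothetical document IDs, for illustration only.
    selections = {
        "treebanking": ["doc_001", "doc_002", "doc_003"],
        "translation_alignment": ["doc_002", "doc_003", "doc_004"],
        "distillation": ["doc_003", "doc_002", "doc_005"],
    }
    print(shared_documents(selections))
```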
General Publication
As the linguistic resources described above are distributed to GALE
program participants, LDC will wherever possible distribute the data
more broadly, for example to its members and licensees, through the
usual mechanisms. Upon sponsor request, some subsets of data may be
reserved for use within GALE only.
A visual depiction of LDC's GALE tasks is shown below.