Overview of LDC's GALE Activities

The Linguistic Data Consortium (LDC) is developing integrated linguistic resources and related infrastructure to support language exploitation technologies within the DARPA GALE (Global Autonomous Language Exploitation) Program.

There are five major tasks: Data Collection, Transcription, Translation and Alignment, XBanking, and Distillation Annotation. Infrastructure tasks include Resource Distribution and Technical Support.

Data Collection
LDC will collect Arabic, Chinese and English data in multiple genres. The following collections have been targeted at the program’s outset:

  • Broadcast News (BN) - consisting of "talking head"-style news broadcasts from radio and/or television networks.
  • Broadcast Conversation (BC) - consisting of talk shows plus roundtable discussions and other interactive-style broadcasts from radio and/or television networks.
  • Newswire (NW) - consisting of newswire feeds in multiple languages.
  • Conversational Telephone Speech (CTS) - consisting of phone conversations of variable duration among subjects who may or may not know each other, with topics that may or may not be assigned.
  • Web Newsgroups (NG) - consisting of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums.
  • Weblogs (WL) - consisting of posts to informal web-based journals of varying topical content.

For all genres, collection includes infrastructure for harvesting the data, application of formatting standards to text data, and license fees to data providers so that data may be redistributed to GALE participants.

Year One activities focus on collection of all genres except Conversational Telephone Speech.

Transcription
LDC will create, purchase or otherwise acquire verbatim transcripts for large volumes of Arabic and Chinese broadcast speech. Transcripts will conform to a quick, rich transcription specification, including accurate transcription of content words, speaker identification and standardized punctuation and orthography but no additional markup. A small amount of data will be dually transcribed and discrepancies adjudicated to provide data for analysis of human transcription variation.
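As an illustration of how variation between two human transcripts of the same audio might be quantified, the sketch below computes a word-level edit distance and normalizes it by the length of the longer transcript. The metric, the function names and the sample sentences are assumptions made for illustration only; the text above does not say how LDC measures or adjudicates discrepancies.

```python
def word_edit_distance(words_a, words_b):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn words_a into words_b (Levenshtein distance on words)."""
    prev = list(range(len(words_b) + 1))
    for i, a in enumerate(words_a, 1):
        curr = [i]
        for j, b in enumerate(words_b, 1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a word from A
                            curr[j - 1] + 1,      # insert a word from B
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def disagreement_rate(transcript_a, transcript_b):
    """Edit distance divided by the length of the longer transcript.
    Hypothetical measure; lowercasing and whitespace splitting are
    simplifications that would not suit unsegmented Chinese text."""
    a = transcript_a.lower().split()
    b = transcript_b.lower().split()
    if not a and not b:
        return 0.0
    return word_edit_distance(a, b) / max(len(a), len(b))


# Invented sample: two transcribers' versions of the same utterance.
first_pass = "the minister said the talks will resume on monday"
second_pass = "the minister says talks will resume monday"
rate = disagreement_rate(first_pass, second_pass)
```

A rate of 0.0 means the two transcripts are identical after normalization; higher values flag passages where adjudication would be most useful.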

Translation and Alignment
LDC will collect, produce or otherwise acquire manual translations of targeted data for GALE. In addition to contracting with professional translation agencies to provide data, LDC will continue to harvest parallel text from the internet. Translations in Year One will focus on broadcast sources plus weblogs and newsgroups.

Word alignment of parallel text is a critical component of statistical MT. The performance of word alignment algorithms has a direct impact on the performance of MT systems. LDC will provide manually word-aligned parallel text for training and evaluating word alignment algorithms. An easy-to-use GUI tool will be developed to speed up the annotation process, which will be carried out by bilingual speakers of the language pairs. Alignment will be performed for all genres except CTS.
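To make the evaluation use of manually aligned data concrete, here is a minimal sketch of Alignment Error Rate (AER), a metric commonly used in the statistical MT literature to score automatic alignments against a manual reference. The text above does not name a metric, so the choice of AER, the distinction between "sure" and "possible" gold links, and the toy data are all assumptions. Links are represented as (source index, target index) pairs.

```python
def alignment_error_rate(hypothesis, sure, possible=None):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), where A is the
    system alignment, S the sure gold links and P the possible gold
    links (P is taken to include S)."""
    hypothesis = set(hypothesis)
    sure = set(sure)
    possible = set(possible or ()) | sure
    if not hypothesis and not sure:
        return 0.0  # nothing to align and nothing proposed
    matched = len(hypothesis & sure) + len(hypothesis & possible)
    return 1.0 - matched / (len(hypothesis) + len(sure))


# Invented toy example: a three-word source sentence.
gold_sure = {(0, 0), (1, 1), (2, 2)}
gold_possible = {(1, 2)}
system = {(0, 0), (1, 1), (1, 2), (2, 3)}

aer = alignment_error_rate(system, gold_sure, gold_possible)
```

Lower is better: a system that reproduces exactly the sure links scores 0.0, and proposing a "possible" link is not penalized as heavily as proposing a link absent from the gold annotation.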

XBanks
XBanking refers to syntactic and semantic annotation, including part-of-speech tagging, Treebanking and PropBanking. LDC will create parallel Treebanks and extend Treebanking (syntactic annotation) into spoken modes. The Year One focus will include Treebanks for Arabic broadcast speech, plus English translations of Arabic newswire data that has already been Treebanked.

Distillation
LDC will provide manual and/or semi-automatic annotations of collected, transcribed and/or translated data to support the distillation research task. Annotation tasks may include entity, relation and event tagging plus co-reference within and/or across documents; topic tagging; event or incident linking; question answering and/or other tasks as specified by program sponsors and community members.

Resource Distribution
LDC will deliver specified linguistic resources via web download, FTP, CD, DVD or other medium to program participants, sponsor(s), and/or the sponsor's representative(s) as required. The delivery format of each distribution will be determined by the size of the data set. Media shipments will use the express service (FedEx or DHL) most appropriate given the location of the recipient. At the program's outset, recipients will be required to sign licenses that recognize the copyright of the original data providers and limit recipients' use of the data to linguistic education, research and technology development. The number and type of distributions each year will be determined by the final content of the tasks described above. Distribution schedules will be established in negotiations with the GALE sponsors and community.

Task Integration
Wherever possible, LDC will use the same collected material for multiple follow-on tasks. For instance, we will make every effort to select data for Arabic Treebanking that has also been selected for Translation and Alignment and to perform Distillation annotation on this same set of data. This will save costs and also offer the opportunity to learn from the multiple annotations on the same source data.

General Publication
As the linguistic resources described above are distributed to GALE program participants, LDC will wherever possible distribute the data more broadly, for example to its members and licensees, through the usual mechanisms. Upon sponsor request, some subsets of data may be reserved for use within GALE only.

A visual depiction of LDC's GALE tasks is shown below.

[Figure: GALE process diagram]