Linguistic Resources
LDC is involved in a number of data collection and annotation projects to support language-related education, research and technology development.
The DARPA BOLT (Broad Operational Language Translation) Program will create
new techniques for automated translation and linguistic analysis that can
be applied to informal genres of text and speech common to online and
in-person communications. LDC supports the BOLT Program by collecting
informal data sources including discussion forums, text messaging and chat
in English, Chinese and Egyptian Arabic, and applying annotations
including translation, word alignment, Treebanking, PropBanking,
co-reference and queries/responses. LDC also supports the evaluation of
BOLT technologies by post-editing machine translation system output and
assessing information retrieval (IR) system responses during annual evaluations conducted by NIST.
The DARPA DEFT (Deep Exploration and Filtering of Text) Program will
develop automated systems to process text information and enable the
understanding of connections in text that might not be readily apparent to
humans. LDC supports the DEFT Program by collecting, creating and
annotating a variety of data sources to support Smart Filtering,
Relational Analysis and Anomaly Analysis.
The HAVIC (Heterogeneous Audio Visual Internet Collection) Corpus comprises
thousands of hours of real-world amateur video data, annotated for
features including topics and events depicted in the video (or its
corresponding audio). Currently, the HAVIC corpus is being used to support
the NIST TRECVid Multimedia Event Detection (MED) and Multimedia Event
Recounting (MER) Evaluations.
LDC develops linguistic resources to support the NIST Language Recognition
Evaluation (LRE) series. The LRE-11 corpus included narrowband broadcast
news speech and conversational telephone speech in 24 languages, including
several closely related/confusable varieties. Collection of the next LRE
corpus is just getting underway.
The goal of the DARPA MADCAT (Multilingual Automatic Document
Classification, Analysis and Translation) Program is to automatically
convert foreign-language text images into English transcripts. LDC supports MADCAT
by collecting handwritten documents in Arabic and Chinese, scanning texts
at a high resolution, annotating the physical coordinates of each line and
token, and transcribing and translating the content into English. LDC also
supports the evaluation of MADCAT technologies by post-editing machine
translation system output during annual evaluations conducted by NIST.
LDC develops linguistic resources to support the NIST Open Machine
Translation (OpenMT) Evaluation series by developing test sets in
multiple languages and genres, and by sharing linguistic resources
developed in other programs including DARPA GALE and TIDES.
LDC develops linguistic resources to support the NIST Open Handwriting
Recognition Technology (OpenHaRT) Evaluation series by collecting and
annotating naturally occurring examples of handwriting in multiple
languages, genres and domains, and by sharing linguistic resources
developed in the DARPA MADCAT Program.
The DARPA RATS (Robust Automatic Transcription of Speech) Program will
develop algorithms and software for performing basic speech processing on
signals that may contain speech and are received over communication channels
that are extremely noisy and/or highly distorted. LDC supports the RATS
Program by collecting conversational data in multiple languages and
annotating collected speech to provide training, development and test data
for four tasks: Speech Activity Detection, Language ID, Speaker ID and
Keyword Spotting. LDC also supports the evaluation of RATS technologies by
adjudicating system output against human gold standard annotations, as
part of annual evaluations conducted by SAIC.
LDC develops linguistic resources to support the NIST Speaker Recognition
Evaluation (SRE) series. For the SRE-12 evaluation, LDC collected multiple
telephone calls from each of 414 English speakers who were also present in
earlier SRE corpora. All calls were audited for language, speaker identity
and other features.
The Text Analysis Conference (TAC) is a series of evaluation workshops
organized by NIST to encourage research in Natural Language Processing and
related applications. LDC provides linguistic resources including source
data, annotations and system assessment for the KBP (Knowledge Base
Population) Track, which promotes research in automated systems that can
discover information about named entities as found in a large corpus and
incorporate this information into a knowledge base.