Linguistic Resources
Current Projects
The LDC is involved in a number of
projects to support language education, research and technology
development.
ACE - In
support of the Automatic Content Extraction Program, LDC develops text
corpora in English, Chinese and Arabic annotated for entities, the
relations among them and the events in which they participate.
GALE - LDC is developing integrated linguistic
resources and related infrastructure to support language exploitation
technologies within the DARPA GALE (Global Autonomous Language
Exploitation) Program.
LCTL (Less Commonly Taught Languages) - LDC is creating and distributing linguistic resources including monolingual and
parallel text, lexicons, encoding converters, word and sentence segmenters, morphological analyzers, named entity taggers, annotations, annotation infrastructure and specifications for a number of "less commonly taught languages". The focus in Year One of LCTL is on seven languages that have more than a million speakers but do not possess adequate resources for building human language technologies: Urdu, Thai, Hungarian, Bengali, Punjabi, Tamil and Yoruba.
Mixer - LDC is collecting speech data from a large number of participants using a Mixer-style telephone collection platform. For some participants, the telephone calls are coupled with a series of sociolinguistic interviews conducted at facilities equipped with multichannel recording devices. These facilities are also used to make telephone calls that are recorded both by the platform and on the multichannel devices. The participants represent a wide range of demographics, and we are collecting both monolingual and bilingual conversations in close to 30 languages, which include a variety of dialects and accents.
Recent Projects
TIDES - LDC is collecting up to an hour per day in
each of the languages in which Voice of America broadcasts to support
this DARPA project in Translingual Information Detection, Extraction and
Summarization. The following are subprojects of TIDES.
Extraction
- Corpus creation to support extraction of entities, relations and
events from text as a TIDES technology and in collaboration with the
Automatic Content Extraction program.
HARD - LDC
creates corpora and annotations including topics, metadata and
relevance judgements to support High Accuracy Retrieval from Documents
(HARD), a new evaluation track within TREC. The objective of HARD is to
improve retrieval results by leveraging additional information about
the searcher and/or the search context, through techniques such as
passage retrieval, and using very targeted interaction with the
searcher.
TDT - LDC
has developed the new TDT-5 corpus to support the 2004 Topic Detection
and Tracking evaluation. The multilingual corpus is annotated for
relevance to over 250 topics.
EARS -
LDC is providing broadcast news and telephone conversations,
transcripts, pronouncing lexicons and texts for language modeling in
English, Chinese and Arabic to support EARS.
STT - LDC creates high-quality careful transcripts of
English, Chinese and Arabic conversational telephone and broadcast news
speech to support the EARS Speech-to-Text evaluations.
MDE - LDC creates annotated corpora and guidelines to
support the EARS Metadata Extraction (MDE) program. The goal of EARS
MDE is to enable technology that can take raw Speech-to-Text output and
refine it into forms that are of more use to humans and to downstream
automatic processes.
Other Projects
AGTK - Annotation Graphs are a formal
framework for representing linguistic annotations of time series data.
Annotation graphs abstract away from file formats, coding schemes and
user interfaces, providing a logical layer for annotation
systems.
ANC - The American National Corpus (ANC) project is
fostering the development of a corpus comparable to the British
National Corpus (BNC), covering American English.
DASL - This project investigates best practices in
the use of digital speech corpora in the study of language variation.
Our pilot study will analyze four large speech corpora for a common
sociolinguistic variable, and will develop a corpus of carefully
transcribed and annotated sociolinguistic interviews.
FORM - The goal of the FORM project is to develop a
corpus annotated with multi-modal information concerning conversational
interaction. We are currently preparing a corpus of gesture-annotated
videos. Our next step will be to add speech transcription and
intonational information.
LDC Institute - A seminar series on issues
in language data and database creation.
QLDB - This project is investigating data models and
query languages for linguistic databases.
TalkBank - TalkBank is an
interdisciplinary research project funded by a five-year NSF grant to
foster research and development in communicative behavior by providing
tools and standards for analysis and distribution of language data.
Past Projects