


Linguistic Resources
Current Projects

The LDC is involved in a number of projects to support language education, research and technology development.

ACE - In support of the Automatic Content Extraction Program, LDC develops text corpora in English, Chinese and Arabic annotated for entities, the relations among them and the events in which they participate.

GALE - LDC is developing integrated linguistic resources and related infrastructure to support language exploitation technologies within the DARPA GALE (Global Autonomous Language Exploitation) Program.

LCTL (Less Commonly Taught Languages) - LDC is creating and distributing linguistic resources including monolingual and parallel text, lexicons, encoding converters, word and sentence segmenters, morphological analyzers, named entity taggers, annotations, annotation infrastructure and specifications for a number of "less commonly taught languages". The focus in Year One of LCTL is on seven languages that have more than a million speakers but do not possess adequate resources for building human language technologies: Urdu, Thai, Hungarian, Bengali, Punjabi, Tamil and Yoruba.

Mixer - LDC is collecting speech data from a large number of participants using a Mixer-style telephone collection platform. For some participants, the telephone calls are coupled with a series of sociolinguistic interviews conducted at facilities equipped with multichannel recording devices. These facilities are also used to make telephone calls that are recorded by the platform as well as on the multichannel devices. The participants represent a wide range of demographics, and we are collecting both monolingual and bilingual conversations in close to 30 languages, which include a variety of dialects and accents.

Recent Projects

TIDES - LDC is collecting up to an hour per day in each of the languages in which Voice of America broadcasts to support this DARPA project in Translingual Information Detection, Extraction and Summarization. The following are subprojects in TIDES.

Extraction - Corpus creation to support extraction of entities, relations and events from text as a TIDES technology and in collaboration with the Automatic Content Extraction program.

HARD - LDC creates corpora and annotations including topics, metadata and relevance judgements to support High Accuracy Retrieval from Documents (HARD), a new evaluation track within TREC. The objective of HARD is to improve retrieval results by leveraging additional information about the searcher and/or the search context, through techniques such as passage retrieval and very targeted interaction with the searcher.

TDT - LDC has developed the new TDT-5 corpus to support the 2004 Topic Detection and Tracking evaluation. The multilingual corpus is annotated for relevance to over 250 topics.

EARS - LDC is providing broadcast news and telephone conversations, transcripts, pronouncing lexicons and texts for language modeling in English, Chinese and Arabic to support EARS.

STT - LDC creates high-quality careful transcripts of English, Chinese and Arabic conversational telephone and broadcast news speech to support the EARS Speech-to-Text evaluations.

MDE - LDC creates annotated corpora and guidelines to support the EARS Metadata Extraction (MDE) program. The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes.

Other Projects

AGTK - Annotation Graphs are a formal framework for representing linguistic annotations of time series data. Annotation graphs abstract away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems.

ANC - The American National Corpus (ANC) project is fostering the development of a corpus comparable to the British National Corpus (BNC), covering American English.

DASL - This project investigates best practices in the use of digital speech corpora in the study of language variation. Our pilot study will analyze four large speech corpora for a common sociolinguistic variable, and will develop a corpus of carefully transcribed and annotated sociolinguistic interviews.

FORM - The goal of the FORM project is to develop a corpus annotated with multi-modal information concerning conversational interaction. We are currently preparing a corpus of gesture-annotated videos. Our next step will be to add speech transcription and intonational information.

LDC Institute - A seminar series on issues in language data and database creation.

QLDB - This project is investigating data models and query languages for linguistic databases.

TalkBank - TalkBank is an interdisciplinary research project funded by a five-year NSF grant to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data.

Past Projects