EARS

The DARPA EARS (Effective, Affordable, Reusable Speech-to-Text) program is developing robust speech recognition technology to address a range of languages and speaking styles and "produce powerful new speech-to-text (automatic transcription) technology whose outputs are substantially richer and much more accurate than currently possible." LDC provides conversational and broadcast speech and transcripts, annotations, and lexicons and texts for language modeling in each of the EARS languages.

EARS Data Matrix Revised 11/07/04

EARS Arabic Website

RT-04 Data and Annotation Plan Revised 07/15/04

Quick Transcription
LDC uses a quick transcription approach to provide large volumes of transcribed speech for training purposes. Quick transcription limits the time an annotator spends on each file and omits some detail from the resulting transcript. The following documents provide additional information about quick transcription for STT research:

Careful Transcription
LDC creates high-accuracy verbatim transcripts with rich markup for use as Speech-to-Text devtest and evaluation data.  The following documents provide additional information about careful transcription for STT research:

Metadata Extraction
LDC is creating annotated resources to support a metadata extraction (MDE) research evaluation. The goal of MDE is to enable technology that can take the raw STT output and refine it into forms that are of more use to humans and to downstream automatic processes. The following documents provide additional information about annotation in support of MDE research:

The table below summarizes English, Chinese and Arabic resources relevant to EARS, categorized by language and resource type.

English

Broadcast Speech

1997 HUB4 English Evaluation Speech and Transcripts - 3 hours of radio and TV news stories
1996 English Broadcast News Speech (Hub-4) - 104 hours of broadcasts
1996 English Broadcast News Dev and Eval - 104 hours of broadcasts
1997 English Broadcast News Speech (Hub-4) - 97 hours of news broadcasts
TDT2 English Audio - 1036 waveform files of complete broadcasts
TDT3 English Audio - 950 hours of news broadcasts
TDT2 Careful Transcription Audio - recorded broadcasts from 1998
EARS-only training data releases (contact LDC)

  • LDC2003E02 TDT4 Multilanguage Speech
  • LDC2003E02A TDT4 Multilanguage Supplemental

Careful Transcriptions

1996 English Broadcast News Transcripts (Hub-4) - 104 hours of transcribed broadcasts
1997 English Broadcast News Transcripts (Hub-4) - 97 hours of transcribed news broadcasts
1998 HUB5 English Transcripts - transcripts for 20 Callhome English and 20 Switchboard telephone conversations
TDT2 Careful Transcription Text - transcripts of broadcasts from 1998

Quick Transcriptions

TDT Pilot Study Corpus - 16,000 news stories
TDT2 Multilanguage Text Version 4.0 - news data in Mandarin and English
TDT3 Multilanguage Text Version 2.0 - news data in Mandarin and English

TDT4 Multilanguage Text Version 1.1:  contact LDC for a pre-release copy (LDC catalog no.: LDC2003E21)


EARS-only training data releases (contact LDC)

  • LDC2003E12A Fisher Training Speech Part 1 (250 two-sided telephone conversations)
  • LDC2003E13A Fisher Quick Transcription Part 1 (250 transcribed two-sided telephone conversations)
  • LDC2003E12B Fisher Training Speech Part 2 (250 two-sided telephone conversations)
  • LDC2003E13B Fisher Quick Transcription Part 2 (250 transcribed two-sided telephone conversations)

Metadata Annotations

EARS-only training data releases (contact LDC)

  • EARS MDE RT-03F Training Corpus (LDC2003E19)
  • EARS MDE RT-03 DevTest and Evaluation Corpus (LDC2003E27)

Conversational Speech

SWITCHBOARD-1 Release 2 - 2,400 two-sided telephone conversations
Switchboard-2 Phase I - 3,638 five-minute conversations
Switchboard-2 Phase II - 4,472 five-minute conversations
Switchboard-2 Phase III - 2,728 two-sided telephone conversations
CALLHOME American English Speech - 120 telephone conversations up to thirty minutes each
CALLFRIEND American English-Non-Southern - 60 telephone conversations five to thirty minutes each
CALLFRIEND American English-Southern - 60 telephone conversations five to thirty minutes each
Switchboard Cellular Part 1 Audio - 1309 GSM cellular phone calls

Careful Transcriptions

CALLHOME American English Transcripts - 18.3 hours of transcribed speech
1998 HUB5 English Evaluation - 40 sphere files
2001 HUB5 English Evaluation - 60 sphere files
Switchboard Cellular Part 1 Transcribed Audio - 250 transcribed files
Switchboard Cellular Part 1 Transcription - 250 transcribed files

Pronouncing Lexicon

CALLHOME American English Lexicon - 90,988 lexical entries
American English Spoken Lexicon - 50,602 lexical entries of the most common English words

Language Modeling Text

North American News Text Corpus - texts formatted with TIPSTER
North American News Text Supplement - texts formatted with TIPSTER
English Gigaword - about 12 GB (over a billion words) of normalized English newswire text
EARS-only training data releases (contact LDC)

  • LDC2003E03 TDT4 Multilanguage Text
  • LDC2003E02A TDT4 Multilanguage Supplemental


Chinese

Broadcast Speech

1997 HUB-4 Broadcast News Evaluation Non English Test - 1 hour each of Spanish and Mandarin news broadcasts
1997 Mandarin Broadcast News Speech (Hub-4NE) - 30 hours of Mandarin news broadcasts
TDT2 Mandarin Audio - recorded Mandarin news broadcasts
TDT3 Mandarin Audio - recorded Mandarin news broadcasts
EARS-only training data releases (contact LDC)

  • LDC2003E02 TDT4 Multilanguage Speech
  • LDC2003E02A TDT4 Multilanguage Supplemental

Careful Transcriptions

1997 Mandarin Broadcast News Transcripts - transcripts for 30 hours of news broadcasts
1997 HUB-4 Broadcast News Evaluation Non English Test - transcripts for 1 hour each of Spanish and Mandarin broadcasts
2001 HUB5 Mandarin Transcripts - 20 Callhome Mandarin telephone conversations

Quick Transcriptions

TDT2 Multilanguage Text Version 4.0 - transcripts for recorded Mandarin news broadcasts
TDT3 Multilanguage Text Version 2.0 - transcripts for recorded Mandarin news broadcasts
TDT4 Multilanguage Text Version 1.1:  contact LDC for a pre-release copy (LDC catalog no.: LDC2003E21)

Conversational Speech

CALLFRIEND Mandarin Chinese-Mainland - 60 Mandarin telephone conversations 5-30 minutes each
CALLFRIEND Mandarin Chinese-Taiwan Dialect - 60 Taiwanese dialect telephone conversations 5-30 minutes each
CALLHOME Mandarin Chinese Speech - 120 telephone conversations between native Mandarin speakers
Hub-5 Mandarin Telephone Speech Corpus - 42 Mandarin telephone conversations
Taiwanese Putonghua Speech and Transcripts - 40 speakers using Taiwanese-accented Putonghua

Careful Transcriptions

CALLHOME Mandarin Chinese Transcripts - 120 transcribed Mandarin telephone conversations
Taiwanese Putonghua Speech and Transcripts - transcripts for 40 speakers using Taiwanese-accented Putonghua
Hub-5 Mandarin Transcripts - 42 transcribed Mandarin telephone conversations
2001 HUB5 Mandarin Evaluation - 8 hours of transcribed Mandarin conversations

Pronouncing Lexicon

CALLHOME Mandarin Chinese Lexicon - 44,405 words with phonological, morphological and frequency information

Language Modeling Text

Mandarin Chinese News Text - 250 million GB-encoded text characters
Chinese Gigaword - archive of Chinese newswire text data, totalling over 2.5 million documents or over 1.1 billion words
EARS-only training data releases (contact LDC)

  • LDC2003E03 TDT4 Multilanguage Text
  • LDC2003E02A TDT4 Multilanguage Supplemental


Arabic

Arabic Dialects Transcription Guidelines

Levantine Arabic Transcription Guidelines (under development)

Broadcast Speech

EARS-only training data releases (contact LDC)

  • LDC2003E02 TDT4 Multilanguage Speech
  • LDC2003E02A TDT4 Multilanguage Supplemental

Careful Transcriptions

Callhome Egyptian Arabic Transcripts Supplement - transcripts for Callhome Egyptian Arabic Speech Supplement

Quick Transcriptions

TDT3 Arabic Text: contact LDC for a pre-release copy (LDC catalog no.: LDC2002E32)
TDT4 Multilanguage Text Version 1.1:  contact LDC for a pre-release copy (LDC catalog no.: LDC2003E21)

Conversational Speech

CALLHOME Egyptian Arabic Speech - 120 Egyptian Colloquial Arabic telephone conversations
CALLFRIEND Egyptian Arabic - 60 telephone conversations between native speakers of the Egyptian dialect of Arabic
Callhome Egyptian Arabic Speech Supplement - 20 telephone conversations
Levantine Arabic QT Training Data Set 1 Speech - 135 phone conversations (up to 10 minutes each) between Levantine Arabic speakers (Speech)
Levantine Arabic QT Training Data Set 1 Transcripts - 305 phone conversations (up to 10 minutes each) between Levantine Arabic speakers (Transcripts)
Levantine Arabic QT Training Data Set 2 Speech - 305 phone conversations (up to 10 minutes each) between Levantine Arabic speakers (Speech)
Levantine Arabic QT Training Data Set 2 Transcripts - 135 phone conversations (up to 10 minutes each) between Levantine Arabic speakers (Transcripts)

Careful Transcriptions

CALLHOME Egyptian Arabic Transcripts - transcripts for 120 Egyptian Colloquial Arabic telephone conversations
1997 HUB5 Arabic Evaluation - 20 transcribed conversations
1997 HUB5 Arabic Transcripts - 20 transcribed conversations

Pronouncing Lexicon

Egyptian Colloquial Arabic Lexicon - electronic pronunciation dictionary of Egyptian Colloquial Arabic

Language Modeling Text

Arabic Newswire Part 1 - 2,337 Arabic text data files tagged using TIPSTER
Arabic Gigaword - archive of Arabic newswire text data, nearly 400 million words
EARS-only training data releases (contact LDC)

  • LDC2003E03 TDT4 Multilanguage Text
  • LDC2003E02A TDT4 Multilanguage Supplemental

 





(c) 1996-1999 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.
Please send technical questions to online-service@ldc.upenn.edu, Member sales questions to ldc@ldc.upenn.edu.

tmalec@ldc.upenn.edu

Last modified: Wed Jun 26 09:44:41 2002