Less Commonly Taught Languages
The goal of this project is to create and share resources to
support additional basic research and initial technology development in
what have been called Less Commonly Taught Languages (LCTLs). These
languages have also been called Low Density, not for the population of
native speakers but rather for the scarcity of resources. A typology
that distinguishes both population of native speakers and availability
of resources might label them High Density/Sparse Resource languages,
since the languages of current
focus have more than a million speakers but inadequate resources for
building human language technologies.
Resource Plan
The specific resources planned are:
- Monolingual Text: preferably news text written originally in the LCTL. Monolingual text will be detagged, tokenized and converted into a standard encoding for use in language modeling and to be annotated to create other resources. Goals are 250 Kwords that will be translated into English plus an additional 250 Kwords.
- Parallel Text: preferably news text written originally in the LCTL. Parallel text may be found and sentence aligned, or created from monolingual text by separating it into sentences and then having human translators render each source sentence as one or more sentences in the target language. Goals are 175 Kwords translated from the LCTL into English plus 75 Kwords translated from English into the LCTL, such that the text is consistent across all LCTLs to allow some comparison between the languages. The 75 Kwords are further divided into 30 Kwords from an English news corpus, 20 Kwords from an Elicitation Corpus and 25 Kwords from genres other than news. For two "large" languages, Thai and Urdu, the new target is 100K sentences.
- Bilingual Lexicon: containing a minimum of 10k lemmas but targeting larger lexicons that provide 90-95% coverage over the monolingual text corpus.
- Encoding Converters: convert all raw text and lexicon encodings into the standard encoding selected for the LCTL, most likely Unicode.
- Sentence Segmenter: divide text into sentences. For many languages, sentence segmentation means dividing the text at certain punctuation characters. However, in some languages the functions of sentence-breaking characters are context dependent (such as the period in English). Other languages do not mark sentence boundaries explicitly.
- Word Segmenter: necessary for languages that do not already show word segmentation in their writing systems, and useful for languages that do separate words with spaces but use punctuation ambiguously.
- POS Tagset, Tagger and Tagged Text
- Morphological Analyzer: found or created, tagset coordinated to Bilingual Lexicon
- Morphologically Tagged Text
- Named Entity Tagged Text
- Named Entity Tagger
- Personal Name Transliterator
- Descriptive Grammar: grammatical outline based upon existing
grammars of each LCTL and experiences garnered in this resource
gathering exercise. Target audience will be Reflex sites.
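
The 90-95% coverage target for the Bilingual Lexicon can be checked with a short script. The following is a minimal sketch, not the project's own tooling: the plan does not say whether coverage is counted over running words (tokens) or distinct words (types), so both are computed, and the function name `lexicon_coverage` is hypothetical. It also assumes the corpus has already been tokenized and reduced to lemmas so that entries can be matched directly against the lexicon.

```python
from collections import Counter


def lexicon_coverage(tokens, lexicon):
    """Return (token coverage, type coverage) of `lexicon` over `tokens`.

    Assumption: `tokens` are already lemmatized, so they can be looked up
    directly in the set of lexicon lemmas.
    """
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return 0.0, 0.0
    # Running words whose lemma is in the lexicon.
    covered_tokens = sum(n for word, n in counts.items() if word in lexicon)
    # Distinct words whose lemma is in the lexicon.
    covered_types = sum(1 for word in counts if word in lexicon)
    return covered_tokens / total_tokens, covered_types / len(counts)


# Toy English data standing in for an LCTL corpus and lexicon.
tokens = "the cat sat on the mat near the door".split()
lexicon = {"the", "cat", "sat", "on", "mat"}
token_cov, type_cov = lexicon_coverage(tokens, lexicon)
print(f"token coverage: {token_cov:.2%}, type coverage: {type_cov:.2%}")
```

Token coverage is usually the higher of the two figures because frequent words dominate the count, so it matters which one a target such as 90-95% refers to.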
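
The context dependence noted for the Sentence Segmenter can be illustrated with English. The sketch below splits at `.`, `!` or `?` when followed by whitespace, but skips the split when the preceding word is a known abbreviation. It is only an illustration under stated assumptions: the abbreviation list and the name `split_sentences` are hypothetical, and a segmenter for an actual LCTL would need its own punctuation inventory and rules.

```python
import re

# Hypothetical abbreviation list; a real one would be compiled per language.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split `text` at sentence-final punctuation, skipping abbreviations."""
    sentences = []
    start = 0
    # Candidate boundary: '.', '!' or '?' immediately followed by whitespace.
    for match in re.finditer(r"[.!?](?=\s)", text):
        end = match.end()
        words = text[start:end].split()
        last_word = words[-1].lower() if words else ""
        if last_word in abbreviations:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:end].strip())
        start = end
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


result = split_sentences("Dr. Khan arrived late. The meeting had started! Was it over?")
print(result)
# ['Dr. Khan arrived late.', 'The meeting had started!', 'Was it over?']
```

A rule list like this does not help with languages that do not mark sentence boundaries explicitly; those need a different approach.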
Summary
A summary of resources planned with volumes for each LCTL follows.

| Task | Urdu | Thai | Hungarian | Bengali | Punjabi | Tamil | Yoruba |
|---|---|---|---|---|---|---|---|
| News Text | 2,000 | 2,000 | 500 | 500 | 500 | 500 | 500 |
| LCTL->English News | 130 | 130 | 130 | 130 | 130 | 130 | 130 |
| LCTL->English Blogs | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
| LCTL->English Conversation | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
| English->LCTL News | 40 | 40 | 40 | 40 | 40 | 40 | 40 |
| English->LCTL Elicitation | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
| English->LCTL Blogs | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| English->LCTL Phrasebook | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| Lexicon | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| Encoding Converter | X | X | X | X | X | X | X |
| Sentence Segmenter | X | X | X | X | X | X | X |
| Word Segmenter | X | X | X | X | X | X | X |
| POS Tagset | X | X | X | X | X | X | X |
| POS Tagger | X | X | X | X | X | X | X |
| POS Tagged Text | 5 | 5 | 5 | | | | |
| Morphological Analyzer | X | X | X | X | X | X | X |
| Morphologically Analyzed Text | 5 | 5 | | | | | |
| Named Entity Tagged Text | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Named Entity Tagger | X | X | X | X | X | X | X |
| Name Transliterator | X | X | X | X | X | X | X |
| Narrative Grammar | X | X | X | X | X | X | X |














