LCTL

Less Commonly Taught Languages

The goal of this project is to create and share resources that support additional basic research and initial technology development in what have been called Less Commonly Taught Languages (LCTLs). These languages have also been called Low Density, a term that refers not to the population of native speakers but to the scarcity of resources. A typology that distinguishes both speaker population and resource availability might label them High Density/Sparse Resource languages, since the languages of current focus have more than a million speakers but inadequate resources for building human language technologies.

Resource Plan

The specific resources planned are:

  • Monolingual Text: preferably news text written originally in the LCTL. Monolingual text will be detagged, tokenized and converted into a standard encoding for use in language modeling and as a basis for annotation to create other resources. Goals are 250 Kwords that will be translated into English plus an additional 250 Kwords.
  • Parallel Text: preferably news text written originally in the LCTL. Parallel text may be found and sentence-aligned, or created from monolingual text by separating it into sentences and then having a human translate each source sentence into one or more sentences in the target language. Goals are 175 Kwords translated from the LCTL into English plus 75 Kwords translated from English into the LCTL, with the English source text kept consistent across all LCTLs to allow some comparison between the languages. The 75 Kwords are further divided into 30 Kwords from an English news corpus, 20 Kwords from an Elicitation Corpus and 25 Kwords from genres other than news. For two "large" languages, Thai and Urdu, the new target is 100K sentences.
  • Bilingual Lexicon: containing a minimum of 10k lemmas but targeting larger lexicons that provide 90-95% coverage over the monolingual text corpus.
  • Encoding Converters: convert all raw text and lexicon encodings into the standard encoding selected for the LCTL, most likely Unicode.
  • Sentence Segmenter: divide text into sentences. For many languages, sentence segmentation means dividing the text at certain punctuation characters. However, in some languages the function of sentence-breaking characters is context dependent (such as the period in English). Other languages do not mark sentence boundaries explicitly.
  • Word Segmenter: necessary for languages that do not already show word segmentation in their writing systems, and useful for languages that do separate words with spaces but use punctuation ambiguously.
  • POS Tagset, Tagger and Tagged Text:
  • Morphological Analyzer: found or created, tagset coordinated to Bilingual Lexicon
  • Morphologically Tagged Text:
  • Named-Entity Tagged Text
  • Named-Entity Tagger
  • Personal Name Transliterator
  • Descriptive Grammar: grammatical outline based upon existing grammars of each LCTL and experiences garnered in this resource gathering exercise. Target audience will be Reflex sites.
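
The context dependence of the English period mentioned under Sentence Segmenter can be illustrated with a minimal sketch. This is not the project's segmenter: the function name split_sentences and the abbreviation list are hypothetical, and the approach assumes space-separated text with explicit sentence-final punctuation, so it would not apply to languages that leave sentence boundaries unmarked.

```python
# Hypothetical abbreviation list (an assumption for illustration); a real
# segmenter would need a much larger, language-specific inventory.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}
SENTENCE_FINAL = (".", "!", "?")


def split_sentences(text):
    """Split whitespace-tokenized text at sentence-final punctuation,
    skipping tokens that are known abbreviations."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        # A token ending in ".", "!" or "?" closes the sentence unless
        # the whole token is a listed abbreviation such as "Dr.".
        if token.endswith(SENTENCE_FINAL) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that lacks final punctuation.
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith met Prof. Jones. They spoke at noon."):
        print(sentence)
```

Splitting naively at every period would break this input after "Dr." and "Prof."; the abbreviation check is the simplest way to show why the period needs context, though it still fails on abbreviations missing from the list.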

Summary

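A summary of the planned resources, with target volumes for each LCTL, follows. Text volumes are in Kwords and lexicon size is in thousands of lemmas; X marks a planned tool or resource.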
Task                        Urdu   Thai   Hungarian  Bengali  Punjabi  Tamil  Yoruba
News Text                   2,000  2,000  500        500      500      500    500
LCTL->English News          130    130    130        130      130      130    130
LCTL->English Blogs         20     20     20         20       20       20     20
LCTL->English Conversation  20     20     20         20       20       20     20
English->LCTL News          40     40     40         40       40       40     40
English->LCTL Elicitation   20     20     20         20       20       20     20
English->LCTL Blogs         10     10     10         10       10       10     10
English->LCTL Phrasebook    10     10     10         10       10       10     10
Lexicon                     10     10     10         10       10       10     10
Encoding Converter          X      X      X          X        X        X      X
Sentence Segmenter          X      X      X          X        X        X      X
Word Segmenter              X      X      X          X        X        X      X
POS Tagset                  X      X      X          X        X        X      X
POS Tagger                  X      X      X          X        X        X      X
POS Tagged Text             5      5      -          -        5        -      -
Morphological Analyzer      X      X      X          X        X        X      X
Morph'ly Analyzed Text      5      5      -          -        -        -      -
Named Entity Tagged Text    100    100    100        100      100      100    100
Named Entity Tagger         X      X      X          X        X        X      X
Name Transliterator         X      X      X          X        X        X      X
Narrative Grammar           X      X      X          X        X        X      X
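
As a quick arithmetic check on the translation rows of the summary, the per-genre targets can be summed to give the parallel-text total for a single language. The sketch below copies the figures from the rows above (identical for all seven languages); the dictionary and function names are illustrative, and the unit is assumed to be Kwords.

```python
# Per-language translation targets copied from the summary rows.
# Unit assumed to be Kwords; the names below are illustrative only.
LCTL_TO_ENGLISH = {"News": 130, "Blogs": 20, "Conversation": 20}
ENGLISH_TO_LCTL = {"News": 40, "Elicitation": 20, "Blogs": 10, "Phrasebook": 10}


def direction_total(targets):
    """Sum the per-genre targets for one translation direction."""
    return sum(targets.values())


def parallel_text_total():
    """Total parallel-text volume per language across both directions."""
    return direction_total(LCTL_TO_ENGLISH) + direction_total(ENGLISH_TO_LCTL)


if __name__ == "__main__":
    print("LCTL->English:", direction_total(LCTL_TO_ENGLISH))   # 170
    print("English->LCTL:", direction_total(ENGLISH_TO_LCTL))   # 80
    print("Total:", parallel_text_total())                      # 250
```

The per-direction sums from the table are 170 and 80 Kwords, for 250 Kwords of parallel text per language.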