Overview

Motivation

Although the current state of information technology permits, if not demands, the integration of multi-modal, authentic language data into lesson authoring, progress in this area is hampered by the absence of an adequate supply of raw and annotated language data with appropriate distribution rights, and of tools for searching and browsing that data. The pilot project on Source Media Authoring Resources & Tools (SMART) addresses that problem. SMART combines raw and annotated data sets with software resources for browsing, searching, extracting and preparing material for use in authored lessons, all distributed via an efficient infrastructure that provides licensing and computing support as well as an archive.

Although it is often repeated that ongoing, intensive and engaged interaction with real language is crucial to one's success as a language learner, such access is not readily available to all language learners. Language teachers may provide that interaction in a classroom setting but naturally want their students to have additional opportunities to read, listen, speak and write in the language under study. The Internet provides increasing access to foreign-language material but no mechanism for deciding what is appropriate to the student's level or area of interest. The language teacher may provide additional material to read or exercises to complete outside of class. However, such material, if naturally occurring, is expensive to collect and prepare and, if created ad hoc by the teacher, is expensive and possibly unnatural.

Notwithstanding some specific annotations, large-scale data collections offer a different kind of flexibility. Authored learning materials necessarily impose a pedagogical approach on the user. Source media, on the other hand, is theory neutral. The same material can be annotated as a hypertext collection to permit random access, or accessed via learning tools that guide learners through a curriculum.

Data

Source Media Resources are databases of raw and annotated language data and related graphics collected from existing sources. The language sources may include published texts, topical WWW sites, broadcast television and radio, cable television and conversations. The broadcast subset may include news, talk shows, dramas, science and nature programs and situation comedies. Indeed, it may include the full range of available program types, in order to exemplify the broad range of modes, regions, registers and vocabulary domains in which the language is used. The graphics may include maps, photos and illustrations relevant to the texts and spoken resources. For example, a report on crop failures in some geographic region might also include a map showing the extent of the affected areas. These resources can be collected in digital form and stored in a distributed archive for subsequent annotation and use.

Formats

Where the data is originally digital text, the text can be collected, converted into a standard encoding, either Unicode or a preferred national encoding, and stored digitally with very little overhead.

Where the data is available as audio, it can be collected digitally and stored in files corresponding to a conversation, broadcast or other event. However, to facilitate searching the audio broadcasts, SMART will provide transcripts. At the very least these transcripts can be the results of a forced recognition performed by an automatic speech recognition (ASR) system. The best such automatic transcripts approach a word error rate of 20%. Although the errors ASR systems make are disconcerting to a human reader, the transcripts provide adequate input to systems that categorize data topically and permit word searching with reasonable accuracy. SMART will also provide high-quality transcripts of a relatively large percentage of the material. Time alignment of either type of transcript, automatic or human generated, will permit users to search the audio with fine granularity. On average, segments will not exceed 10 seconds.

Where video data is available, the audio, video and text will be separable though time-aligned. This will allow applications to make use of any individual mode or combination of modes. Video data divides into video and audio tracks that share the same timeline. Both can be time-aligned to a transcript via time stamps placed in the transcript file. Video data gathered from broadcast television may also include closed-captioning. The closed-captioning "transcripts" provide a modest improvement over ASR output.
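The word error rate quoted above for automatic transcripts is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the automatic transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not a description of any SMART tool; the function name and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length.

    Hypothetical helper for illustration; words are split on whitespace
    and compared exactly, with no normalization of case or punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (word-level Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one wrong word out of five gives a 20% word error rate.
rate = word_error_rate("the crop failed this year", "the crop fail this year")
```

At a rate like this, roughly four of every five words remain searchable, which is consistent with the claim that such transcripts are adequate for topical categorization and word search.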

To assure maximum flexibility, the text, audio, video and graphics will be available in a variety of commonly used formats. Graphics will be available in GIF, JPEG and PNG; audio in WAVE, AIFF and MP3; video in MPEG; and text in Unicode and in national encodings.

Tools

Although commercial software developers provide authoring solutions that enjoy a significant installed base among language teachers, these solutions routinely assume that data is available in small chunks (a sentence or a paragraph of text, a few seconds of audio or video) and is authored interactively. Using SMART tools, materials developers can identify, extract and prepare SMART data for use with commercial authoring packages. SMART tools will include linguistically sophisticated search routines, annotation and alignment tools, and routines for converting among the formats described in the Formats section above. Together these allow the author to identify sections of SMART data that exemplify the language construction under study, extract them from the archive and convert them into the form required for authoring.
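The identify-and-extract workflow described above can be sketched against a time-aligned transcript. The example below is only an illustration under stated assumptions: the `Segment` record, the `find_word` and `clip_range` helpers, the padding value and the sample transcript are all hypothetical and do not describe the actual SMART tools or file formats. It uses a plain word match, where the real search routines are meant to be linguistically more sophisticated.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One time-aligned transcript segment (hypothetical layout)."""
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def find_word(segments: List[Segment], word: str) -> List[Segment]:
    """Return the segments whose text contains `word` as a whole token."""
    target = word.lower()
    hits = []
    for seg in segments:
        # Strip common punctuation so "failures." still matches "failures".
        tokens = [t.strip(".,;:!?\"'").lower() for t in seg.text.split()]
        if target in tokens:
            hits.append(seg)
    return hits


def clip_range(seg: Segment, padding: float = 0.5) -> Tuple[float, float]:
    """Return the (start, end) times to cut from the audio or video track,
    widened by `padding` seconds on each side and never starting below 0."""
    return (max(0.0, seg.start - padding), seg.end + padding)


# Invented transcript of a short news item, in segments under 10 seconds.
transcript = [
    Segment(0.0, 6.5, "Good evening, here is the news."),
    Segment(6.5, 14.0, "Officials report widespread crop failures in the region."),
    Segment(14.0, 21.0, "A map shows the extent of the affected areas."),
]

matches = find_word(transcript, "failures")
ranges = [clip_range(seg) for seg in matches]
```

Because audio and video share the transcript's timeline, the same time range could be used to cut either track before converting the clip into the form an authoring package expects.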

Papers

Coming Soon.

Annotated Data

Lessons






Contact Christopher.Cieri@ldc.upenn.edu

Contact David.Miller@ldc.upenn.edu

Last modified: Tuesday, 27-Jul-01 16:26:30
© 2000 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.