The LDC Institute is a seminar series on issues in language data and database creation. The goals of the seminar series are to create a forum of UPenn affiliates with an interest in language data, to communicate LDC and UPenn experience in data collections, standards, and annotation, and to work with researchers and others who may be interested in LDC data or who may wish to contribute new data to the archives. The topics covered will be descriptions of data-driven projects, both at the LDC and elsewhere, and tutorials on specific methodologies and systems for dealing with language data. We expect attendees to be UPenn students, both undergraduate and graduate, as well as interested LDC staff. |
|||||
| Upcoming
Presentations and Notes |
|||||
| Date | Time | Location | Title and Instructor |
||
| May 18, 2012 | 12-2PM | LDC |
"The Sociolinguistic Archive and Analysis Project: Data, Tools and Applications" presented by Tyler Kendall, Assistant Professor, Department of Linguistics, University of Oregon | ||
In his presentation, Dr. Kendall describes his work on the Sociolinguistic Archive and Analysis Project (SLAAP), a web-based data preservation and exploration initiative centered at North Carolina State University. SLAAP houses audio recordings and associated materials from over 2,500 sociolinguistic interviews. In addition to its basic preservational, organizational, and access-related features, a centerpiece of the archive is a time-aligned, databased transcription model which allows for dynamic, corpus-like analysis of the transcript data as well as real-time phonetic processing of the audio from the transcripts. The presentation describes the background of the project and its architecture, provides a demonstration of several of the web-based features, and discusses some of the ways that the archive and tools are being used in current sociolinguistic research. |
|||||
| Previous
Presentations and Notes by Year | |||||
| 2011| 2010| 2009| 2008| 2007| 2006| 2005| 2004| 2003| 2002| | |||||
| 2011 Presentations | Date | Time | Location | Title and Instructor |
|
| Jul 20, 2011 | 12:30-2:30pm | LDC |
"Building a universal corpus of the world's languages" Steven Bird, Department of Computer Science, University of Melbourne; LDC, University of Pennsylvania | ||
This talk will report ongoing work on developing a corpus that includes all of the world's languages, represented in a consistent structure that permits large-scale cross-linguistic processing (Abney & Bird 2010). The focal data types, bilingual texts and lexicons, relate each language to a reference language. The ability to train systems to translate into and out of a given language is proposed as the yardstick for determining when that language is adequately documented. Bird will report on recent efforts to incorporate datasets built by the language resources community, via the "Language Commons". He will also describe a new project that is recording and transcribing material from a large number of unwritten languages in Papua New Guinea. Steven Abney and Steven Bird (2010), The Human Language Project: Building a universal corpus of the World's languages, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. A pdf version of the presentation is available.
|
|||||
| Jun 29, 2011 | 12:30-2:30pm | LDC |
"Coding Conventions for Archival Sharing" by Malcah Yaeger-Dror, Senior Research Scientist, Speech Perception Laboratory, Cognitive Science Program, University of Arizona and LDC consultant | ||
Dr. Yaeger-Dror will discuss her recent work at LDC formulating coding conventions for speech archives in three areas: (1) coding for the social situation; (2) demographic coding that could prove relevant to future studies; and (3) the influence of interpersonal attitudes on speech variation. Most of the talk will center on this last point, focusing on the speakers' attitudes toward their interlocutors and how one might go about determining this information without recourse to Gilesian psychological studies. A pdf version of the presentation is available.
|
|||||
| Jun 15, 2011 | 1:30-3pm | LDC |
"Free recall of word lists; empirical and theoretical issues" by Michael J. Kahana, Professor, Department of Psychology, and Director, Computational Memory Lab, University of Pennsylvania | ||
In this talk, Professor Kahana will discuss the major empirical phenomena concerning recall of common words and the theoretical issues raised by these phenomena. He will show how memory researchers have devised theories to explain these data and present some critical tests of those theories.
|
|||||
| Jan 14, 2011 | 12pm-2pm | LDC |
"Contact, Restructuring, and Decreolization: The Case of Tunisian Arabic"
by Thomas A. Leddy-Cecere, B.A. 2010 Dartmouth College; participant in the State Department's Critical Language Scholarship Program for advanced Arabic; and author of "New England Borderlands: A New Investigation of the East-West Dialect Boundary", presented at NWAV 39 and accepted for publication by Penn Working Papers in Linguistics. | ||
The modern Arabic colloquial dialects stand out in the world of dialectology and historical linguistics. Though all languages display dialectal variation, Arabic represents a special case -- attempts to classify and trace its varieties by standard linguistic techniques do not produce satisfactory results. It has been suggested (Versteegh, 1984) that this failure could be explained by positing that modern Arabic is a product of creolization. This study represents a focused, cross-dialectal examination of a single Arabic dialect area (Tunisia) in search of evidence of creolization and subsequent effects. A PowerPoint version of the presentation is available as well as Leddy-Cecere's honors thesis, which was the basis for this presentation.
|
|||||
| 2010 Presentations | Jul 22, 2010 | 12pm-2pm | LDC |
"Language Technology Resources for Sanskrit and other Indian Languages at Jawaharlal Nehru University, India"
by Girish Nath Jha, Associate Professor, Computational Linguistics, Jawaharlal Nehru University, India, and the Mukesh and Priti Chatter Distinguished Professor of History of Science, University of Massachusetts Dartmouth. | |
The introductory section of the talk presents the complex linguistic scenario of India and the role Sanskrit has to play in it. The next section discusses the language technology resources being developed at Jawaharlal Nehru University (JNU), New Delhi, the premier university of India for Sanskrit and other major Indian languages under Technology Development for Indian Languages (TDIL) - an initiative of the Indian government’s Department of Information Technology (DIT). The concluding section highlights the issues and challenges being faced and scope for future collaboration. A pdf version of the presentation is available.
|
|||||
| Apr 28, 2010 | 3pm-5pm | LDC |
"Bibliotheca Alexandrina: The oldest library in the digital age"
by Magdy Nagi and "Ibrahim Shihata Arabic UNL Center at Bibliotheca Alexandrina" by Ahmed Bhargout | ||
The topic of this session will be the Bibliotheca Alexandrina, Alexandria, Egypt. Guest speakers Professors Magdy Nagi and Ahmed Bhargout will discuss the library's holdings and the current Universal Networking Language project, respectively.
|
|||||
| Jan 26, 2010 | 10am-12pm | LDC |
"U.S. Supreme Court Corpus (SCOTUS)" by Daniel Katz, J.D., M.P.P., Fellow in Empirical Legal Studies, Michigan Law School and Michael Bommarito, PhD Student, Political Science: Methods & Modeling, University of Michigan | ||
The corpus of Supreme Court written opinions is a rich linguistic resource. Not only does this corpus provide a longitudinal sample of formal American English, but it is also a source of text with identified authors and vote-coded sentiment. Despite this value and years of qualitative and quantitative material of the United States Supreme Court, no compiled corpus of these opinions is currently available to researchers. The purpose of this talk is (1) to describe efforts to compile both the complete corpus of Supreme Court Opinions and associated metadata, (2) to outline a number of our current research projects utilizing this data, and (3) to discuss any criticism, potential projects, or possible collaboration. A pdf version of the presentation is available. |
|||||
| 2009 Presentations | Date | Time | Location | Title and Instructor |
|
| Nov 12, 2009 | 12-2pm | LDC |
"Variations Across Languages, Divisions Within Communities: Languages, Schools and the Internet in Tunisia," by Simon Hawkins, Franklin and Marshall College | ||
Simon Hawkins is an Assistant Professor of Anthropology at Franklin and Marshall College in Lancaster, PA. Professor Hawkins has studied language learning in Tunisia and will be sharing some aspects of that work. ABSTRACT: Despite the convenient categorization of languages, particularly English, into national varieties, some specific discourses cross national and linguistic borders. For some of these varieties, such as academic writing and Internet conventions, what is linguistically important is not the language used, but the global discourses and language ideologies of which they are a part. Multilingual practices in Tunisia illustrate these examples. Lunch will be provided. Hope to see you there! |
|||||
| Jun 10, 2009 | 12-2pm | LDC |
The LDC Standard Arabic Morphological Tagger, by Rushin Shah, LDC | ||
Rushin Shah is a visiting scholar working at Penn and LDC this year. He received his bachelor's and master's degrees in Computer Science from the Indian Institute of Technology (IIT) Kharagpur. In addition to his work at LDC under the direction of Mark Liberman and Mohamed Maamouri, Rushin has worked on machine learning and named entity tagging systems with Lyle Ungar at Penn's department of Computer and Information Science. ABSTRACT: The current process of Arabic corpus annotation at LDC relies on using the Standard Arabic Morphological Analyzer (SAMA) to generate various morphology and lemma choices, and supplying these to manual annotators who then pick the correct choice. However, a major constraint of this process is that SAMA can generate dozens of choices for each word, each of which must be examined by the annotator. Moreover, SAMA does not provide any information about the likelihood of a particular choice being correct. A system that ranks these choices in order of their probabilities, and manages to assign the highest or second-highest rank to the correct choice with a high degree of accuracy would hence be very useful in accelerating the rate of annotation of corpora. Such a system would also be able to aid intermediate Arabic language learners by creating annotated versions of news articles or other web pages submitted by them. We describe such a model that simultaneously performs morphological analysis and lemmatization for Arabic, using choices supplied by SAMA. We convert morphological labels into vectors of various morphosyntactic features, such as basic part-of-speech, gender, number, mood, prefixes, suffixes, case, etc. We then use these various attributes of the supplied Arabic data to create models for lemmas and MSFs, and combine these individual models into one aggregate model that simultaneously predicts lemmas and complete morphological analyses. Our model achieves accuracy in the high nineties. 
This is better than what prior approaches to this problem have been able to obtain, and it will allow us to not only significantly accelerate the corpus annotation process at LDC but also create an effective helper interface for Arabic language learners. In this presentation, I will discuss the problem, our approach to it, and the process of designing and creating the final model. A PowerPoint version of this presentation is available. |
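The ranking idea in the abstract above can be illustrated with a small sketch. This is not the model presented in the talk: it assumes a naive combination in which a lemma model and one model per morphosyntactic feature are simple smoothed frequency counts whose log-probabilities are added. The lemmas, feature names, and training examples are invented for illustration, and the candidate list merely stands in for the choices an analyzer such as SAMA would supply.

```python
from collections import Counter
from math import log

# Hypothetical gold analyses: a lemma plus a dict of morphosyntactic
# features (MSFs). All values here are invented for illustration.
TRAINING = [
    ("kitAb", {"pos": "NOUN", "gender": "M", "number": "SG"}),
    ("kitAb", {"pos": "NOUN", "gender": "M", "number": "SG"}),
    ("katab", {"pos": "VERB", "gender": "M", "number": "SG"}),
    ("kAtib", {"pos": "NOUN", "gender": "M", "number": "PL"}),
]


def train(examples):
    """Count lemmas and, separately, the values of each feature."""
    lemma_counts = Counter()
    feature_counts = {}
    for lemma, feats in examples:
        lemma_counts[lemma] += 1
        for name, value in feats.items():
            feature_counts.setdefault(name, Counter())[value] += 1
    return lemma_counts, feature_counts


def smoothed_log_prob(counter, key):
    """Add-one smoothed log-probability, with one slot for unseen values."""
    total = sum(counter.values())
    vocab = len(counter) + 1
    return log((counter[key] + 1) / (total + vocab))


def score(candidate, model):
    """Aggregate score: lemma log-prob plus the log-prob of every feature."""
    lemma_counts, feature_counts = model
    lemma, feats = candidate
    total = smoothed_log_prob(lemma_counts, lemma)
    for name, value in feats.items():
        total += smoothed_log_prob(feature_counts.get(name, Counter()), value)
    return total


def rank(candidates, model):
    """Return the candidate analyses ordered from most to least likely."""
    return sorted(candidates, key=lambda c: score(c, model), reverse=True)


if __name__ == "__main__":
    model = train(TRAINING)
    candidates = [
        ("katab", {"pos": "VERB", "gender": "M", "number": "SG"}),
        ("kitAb", {"pos": "NOUN", "gender": "M", "number": "SG"}),
    ]
    for lemma, feats in rank(candidates, model):
        print(lemma, feats)
```

An annotator shown the ranked list would check the top one or two candidates first instead of scanning every analysis, which is the acceleration the abstract describes.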
|||||
| Jan 15, 2009 | 3-5pm | LDC |
Building an ASL Corpus Project by Gaurav Mathur, Gene Mirus, and Paul Didus, Gallaudet University | ||
ABSTRACT: There has been a need for the present American Deaf communities to preserve a sample of their language so that they can appreciate the richness of their language for generations. There is also a practical need for materials that can be used in a wide variety of settings, ranging from American Sign Language (ASL) instruction for deaf children to training of people who work with deaf communities. This talk describes a long-term project that meets those needs, namely, the establishment of an ASL corpus by collecting a comprehensive and representative sample of ASL from around the nation. The talk opens with a description of other sign language corpora outside the U.S. that have been successful and then offers an outline of the ASL corpus project that is currently underway, including the kinds of data to be collected and the methodology to be used for the data collection. |
|||||
| 2008 Presentations | Date | Time | Location | Title and Instructor |
Jul 17, 2008 | 11:30-1:30pm | LDC |
Development of Resources and Techniques for Processing of Some Indian Languages by Shyam S. Agrawal, KIIT College of Engineering, Gurgaon, India |
Shyam S. Agrawal is the Executive Director of KIIT College of Engineering, Gurgaon, India and Advisor to the Centre for Development of Advanced Computing (CDAC), Noida, India. ABSTRACT: In the past few decades there has been a pressing demand and need to develop speech and language corpora for training, testing and benchmarking of speech technology systems for various applications. Richly annotated corpora and labeled databases are needed to develop models of spoken languages and also to understand the structure of speech and the variability that occurs in the speech signals due to a variety of factors. There are twenty-two official languages and more than 300 dialects used in India. These languages have their specific structure and properties which have significant differences from European languages. This talk presents some of the phonetic differences in Hindi compared to English and presents an overview of the efforts made by CDAC, Noida and some other institutions to develop text and speech databases, tools, and some techniques for the processing of some Indian languages. For the collection of speech databases, issues and procedures related to text selection, e.g. multiform units and variables such as demographic, dialectal, environmental, emotional, linguistic background etc. have been included. Special tools developed for the analysis of text are described. The objective has been that tools should be adaptive to other Indian languages and also to other languages. Some of the application-oriented, task-specific databases, such as the ELDA-sponsored database for Hindi and the CFSL Speaker Identification database for forensic applications, will be described in detail. A PowerPoint version of this presentation is available. | |||||
| 2007 Presentations | Date | Time | Location | Title and Instructor |
|
| Jul 19, 2007 | 2:30-4:30pm | LDC |
HTML Templates for LDC Sponsored Projects by Shawn Medero, LDC | ||
Introduction to existing resources for communicating your LDC projects over the web. The templates provide a professional, consistent and attractive design while allowing creativity and variation in each project's web site approach. The templates will help you define a predictable and organized navigation structure for LDC employees, project sponsors, researchers, and general visitors. Questions concerning use of web standards, such as CSS and HTML, are encouraged throughout the presentation. |
|||||
| Jul 13, 2007 | 2:30-4:30pm | LDC |
Speaking Arabic in Iraq and the Middle East: Reflections on Three Tours of Duty by Kenneth Gardner, USMC, ret. | ||
| Kenneth Gardner served in the U.S. military for 22 years. He learned Arabic at the Defense Language Institute Foreign Language Center in 1995. His missions included infantry and intelligence assignments in, or training troops from, the following countries, among others: Egypt, Saudi Arabia, Kuwait, Jordan, United Arab Emirates, Bahrain, Oman, Qatar, Afghanistan and Iraq. Most recently, Gardner served as an advisor to an Iraqi Army Infantry Battalion in Operation Al-Fajr (the November 2004 battle for Fallujah) and as base commander of an Iraqi Security Forces training base. Gardner will share his experiences as a non-native Arabic speaker communicating with native speakers in a variety of settings as well as the issues faced by monolingual U.S. troops in the field.
|
Jun 20, 2007 | 3-5pm | LDC |
Programming Specifications Procedures and Practices by Andy Cole, LDC | |
| Virtually all tasks at LDC depend on programmer input. It's been LDC's policy to require that most requests for programming assistance be accompanied by a specification that describes the desired output and includes some estimate of the programming time needed to complete the tasks in the specification. Andy Cole will outline a simple set of guidelines for LDC staff to follow when making programming requests and he will illustrate how those guidelines work by using them to develop a business systems specification.
|
|||||
| 2006 Presentations | Date | Time | Location | Title and Instructor |
|
| Oct 26, 2006 | 3-5pm | LDC |
Comparing Linguistic Annotations -- Issues in Harmonization and Quality Control by Christopher R. Walker | ||
| Consistency analysis was an important aspect of quality assessment in 2005 ACE data creation. Within this framework, Christopher Walker, LDC's Project Manager, Information Extraction, became quite interested in the various assumptions and applications of annotation scoring infrastructure. As he attempted to better understand the bounds of the problem and solution spaces, it quickly became clear that there was no existing discussion of these issues -- and very little documentation of general Best Practices.
In this talk Walker seeks to reduce this gap by outlining the apparent problem and solution spaces; and by opening for discussion the utility of annotation scoring metrics in the various domains of empirical, computational and corpus linguistics -- and more cogently in the domain of quality control for linguistic data creation. The talk will be organized into four main sections, followed by a brief discussion:
1. ACE Annotation: An Illustrative Introduction | |||||
| Sep 22, 2006 | 10:30-11:30am | LDC |
Recording and Annotation of Speech Data via the WWW - A Case Study by Dr. Christoph Draxler, Institute of Phonetics and Speech Communication, Ludwig-Maximilians University, Munich, Germany | ||
| The German Ph@ttSessionz project will create a database of 1000 adolescent speakers balanced by gender and covering all major German dialect areas. The project employs a novel approach to collecting speech data: all recordings are performed via the WWW -- using a web application in a standard web browser -- in more than thirty-five German public schools. Speech is recorded using a standardized audio setup on the school's PC, and the signal and administrative data are immediately transferred to the BAS server in Munich. Using this approach, geographically distributed recordings in high bandwidth quality can be made efficiently and reliably.
Dr. Draxler will describe the Ph@ttSessionz web application and its major components, SpeechRecorder and WebTranscribe, and he will outline the infrastructure developed at BAS for WWW-based speech recordings. Dr. Draxler will also discuss the strategies employed to enlist schools in the project and present preliminary analyses of the Ph@ttSessionz speech database.
| |||||
| Jul 27, 2006 | 3-5pm | LDC |
LDC Online by Dave Graff | ||
| The topic of this session will be LDC Online. Dave Graff, LDC's Lead Programmer Analyst, will present an overview of LDC Online's corpora coverage, search methodology and future plans for growth.
| |||||
| May 4, 2006 | 3-5pm | LDC |
Pros and Cons of Different Annotation Workflow Systems by Seth Kulick, Julie Medero, Hubert Jin, Dave Graff, and Kevin Walker | ||
| This LDC Institute is part of an effort to determine the desirable properties for a single workflow system that can be used (or extended as appropriate) in the various annotation projects at LDC and IRCS (Penn's Institute for Research in Cognitive Science). Because the several different workflow systems that are currently used were designed for projects with different needs, they handle many issues differently, among them support for local and remote annotation, sophistication of report capability, use of automated tagging, and flexibility in the specification of workflow stages.
The speakers are LDC/IRCS programmers who designed some of the current systems. They will discuss the following topics: (1) the properties of the different systems; (2) why some characteristics of a particular workflow system might make it unsuitable for a particular project; (3) properties that should be added to the workflow systems; and (4) alternative ways of setting up a workflow system. | |||||
| Apr 20, 2006 | 3-5pm | LDC |
Recent Trends in Annotation Tool Development at LDC by Kazuaki Maeda, Julie Medero and Haejoong Lee | ||
The Linguistic Data Consortium (LDC) has created large amounts of annotated linguistic data for a variety of evaluation programs and projects using highly customized annotation tools developed at the LDC. LDC programmers Kazuaki Maeda, Julie Medero and Haejoong Lee will discuss the history of annotation tool development at the LDC and share some of the approaches they are currently taking. Two tools in particular will be highlighted: (1) the LDC's model for decision point-based annotation and adjudication, which was used effectively in the ACE 2005 annotation effort; and (2) XTrans, a new speech transcription and annotation tool particularly suited for transcribing meeting speech that was used by the LDC in the NIST meeting recognition evaluation and Mixer Spanish and Russian telephone conversation studies.
| |||||
| 2005 Presentations | Date | Time | Location | Title and Instructor |
|
| 12.08.2005 | 3:00-5:00pm | LDC |
Building a Lexicon Database for Arabic Dialects by Dave Graff | ||
|
Download the presentation slides (PowerPoint) | ABSTRACT: One of the major problems in creating a lexicon database for Arabic dialects is the fact that standardized orthographic (spelling) conventions do not generally exist. The word forms generated by transcribers from recorded conversations are based on relatively loose conventions and show significant variability within any given dialect. We describe how that problem is being resolved by creating a relational database design that makes the transcripts a key part of the database so that repairs to word forms in the lexicon table are propagated automatically to the transcripts. We also review some earlier approaches to lexicon building, describe the annotation tools developed specifically for the current lexicon project, and briefly consider some possible extensions to the database structure and annotation methods the LDC currently uses to cover tasks such as treebank annotation. |
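The design described in this abstract, where transcripts are part of the database so that a repair in the lexicon table reaches every transcript, can be sketched as follows. This is a minimal illustration using Python's standard `sqlite3` module, not LDC's actual schema: the table names (`lexicon`, `transcript_token`), column names, and romanized word forms are all hypothetical. The assumption is that each transcript token stores a reference to a lexicon row instead of its own copy of the spelling.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a lexicon and tokenized transcripts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lexicon (
            id   INTEGER PRIMARY KEY,
            form TEXT NOT NULL
        );
        -- Tokens point at lexicon rows; they never store a spelling.
        CREATE TABLE transcript_token (
            transcript_id INTEGER NOT NULL,
            position      INTEGER NOT NULL,
            lexicon_id    INTEGER NOT NULL REFERENCES lexicon(id),
            PRIMARY KEY (transcript_id, position)
        );
        """
    )
    # Hypothetical word forms as a transcriber might have typed them.
    conn.executemany(
        "INSERT INTO lexicon (id, form) VALUES (?, ?)",
        [(1, "mrHbA"), (2, "kyf"), (3, "HAlk")],
    )
    conn.executemany(
        "INSERT INTO transcript_token VALUES (?, ?, ?)",
        [(1, 0, 1), (1, 1, 2), (1, 2, 3), (2, 0, 2), (2, 1, 3)],
    )
    return conn


def render(conn, transcript_id):
    """Rebuild a transcript's text by joining its tokens to the lexicon."""
    rows = conn.execute(
        "SELECT l.form FROM transcript_token AS t "
        "JOIN lexicon AS l ON l.id = t.lexicon_id "
        "WHERE t.transcript_id = ? ORDER BY t.position",
        (transcript_id,),
    )
    return " ".join(form for (form,) in rows)


def repair(conn, lexicon_id, new_form):
    """Correct a word form once, in the lexicon table only."""
    conn.execute(
        "UPDATE lexicon SET form = ? WHERE id = ?", (new_form, lexicon_id)
    )


if __name__ == "__main__":
    db = build_db()
    print(render(db, 1))
    repair(db, 2, "kayf")
    print(render(db, 1))
    print(render(db, 2))
```

Because the spelling lives in exactly one row, a single `UPDATE` changes how every transcript containing that word is rendered, with no separate search-and-replace pass over transcript files.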
||||
| 11.17.2005 | 3:00-5:00pm | LDC |
Less Commonly Taught Languages (LCTLs) by Mark Mandel | ||
| The topic of this session will be the LDC’s work on the Less Commonly Taught Languages (LCTLs) portion of the REFLEX (Research on English and Foreign Language Processing) Project. Mark Mandel will describe the scope of the current work which is focusing on Thai, Urdu, Bengali, Panjabi, Hungarian, Tamil, and Yoruba. The talk will take about an hour and will be followed by questions from the audience.
Refreshments will be provided. | ABSTRACT: One of the LDC’s principal tasks in the REFLEX project is to discover, produce, and maintain language resources for the target languages. Those resources include, among other things, linguistic information, writing systems, converters, word segmenters, electronic lexicons, monolingual texts, bilingual texts, morphological parsers, and tools for producing and using annotated resources. Mark Mandel will describe the challenges associated with assembling those language resources, as well as the current progress of the LDC’s work. |
||||
| 07.07.2005 | 2:00-4:00pm | LDC |
The Teaching of Berber in Morocco: Reality and Perspectives by Fatima Agnaou, IRCAM | ||
| Fatima Agnaou will give an informative talk primarily discussing IRCAM (The Royal Institute of Amazigh Culture) and its achievements with regard to the integration of Berber in Moroccan schools. Agnaou will address the aims and objectives of teaching in Berber and the standardization of the Berber language as the language faces its new functions. A presentation of the textbooks and other teaching materials will also be included in this talk along with teacher training and methodology. The talk will take about an hour and will be followed by questions from the audience.
| |||||
| 02.03.2005 | 12:30-2:30 | LDC |
Functional Morphology by Otakar Smrz, Charles University | ||
| Computational morphological models are usually implemented as finite-state transducers. The morphologies of natural languages are, however, better described in terms of inflectional paradigms, lexicons, and categories (inflectional and inherent parameters).
Markus Forsberg and Aarne Ranta have recently introduced a framework called Functional Morphology (FM) smoothly reconciling both of these viewpoints. Linguists can model their systems without any 'finite-state restrictions', using the full power of the functional language Haskell and delegating actual computational issues to FM. The morphological models become clearer, reusable, (ex)portable, and even more efficient.
I would like to highlight the noticeable features of FM/Haskell and outline our plans to use it for Arabic. I will be referring to languages like Latin, Swedish, Sanskrit, Spanish and Russian.
.pdf, postscript and Powerpoint versions of this presentation are available. | |||||
| 2004 Presentations | Date | Time | Location | Title and Instructor |
|
| 05.13.2004 | 12:30-2:30 | LDC |
Arabic Propbank by Mona Diab, Stanford University | ||
| The holy grail of computational linguistics, from the time of the
field's inception, has been (automatic) natural language understanding.
Semantic parsing appears as a significant stride in that direction
giving researchers a glimpse into the world of concepts in a functional
and operational manner. Thanks to the English PropBank, a jumpstart for
the task of semantic role assignment is noted in several community wide
standardised evaluations such as CoNLL and Senseval3. In fact, the
porting of PropBank style annotation to 10 Chinese verbs has achieved
very interesting results in a relatively short period of time (Sun &
Jurafsky, 2004) but more importantly has shown some relevant
quantifiable linguistic variation between English and Chinese.
This is where my story begins. I am interested in building a semantic role assignment system for Modern Standard Arabic (MSA). Unlike the Chinese effort, however, no PropBank guidelines or data exist for Arabic yet. Therefore, I decided to manually annotate some of the data. This proved to be a very interesting and challenging task. In line with the Chinese experience, I took on the annotation of the 20 most frequent Arabic verbs in the Arabic treebank v2. The verbs have to have occurred in more than 69 sentences. The work is still in progress with an expected completion date of June 30 or so. In this talk, I will focus on 3 of the verbs which are fully annotated, discussing the different frames and roles associated with their arguments and adjuncts. The main issue that arises frequently is that of consistency and variation, especially with respect to assigning ARGM and ARG3 roles to constituents. I would like to achieve a more generalised consistent annotation across verbs as well as within a verb (sometimes it shifts even within a single verb [potentially sense variation]). I will discuss some of the rudimentary guidelines I have set for myself, inspired by the annotation guidelines from both the English and Chinese PropBanks. | |||||
| 03.23.2004 | 1:30-2:30 | LDC |
Project Santiago by Colonel Stephen A. LaRocca of the Center for Technology Enhanced Language Learning | ||
| The Center for Technology Enhanced Language Learning (CTELL), organized within the Department of Foreign Languages at the U.S.
Military Academy, collects speech data and turns it into speech recognition software primarily for the benefit of cadets
learning languages at West Point. Since 1997 CTELL has collected broadband speech corpora for Arabic, Russian, Portuguese,
American English, Spanish, Croatian, Korean, German and French. | |||||
| 02.26.2004 | 12:30-2:30 | LDC |
Tongue-Tied in Singapore: A Language Policy for Tamil? by Harold F. Schiffman, Department of South Asia Studies | ||
| The Tamil situation in Singapore is one that lends itself ideally to the
study of minority language maintenance. The Tamil community is small and
its history and demographics are well known. The Singapore educational
system supports a well-developed and comprehensive bilingual education
program for its three major linguistic communities on an egalitarian
basis, so Tamil is a sort of test-case for how well a small language
community can survive in a multilingual society where larger groups are
doing well. But Tamil is acknowledged by many to be facing a number of
crises; Tamil as a home language is not being maintained by the
better-educated, and Indian education in Singapore is also not living up
to the expectations many people have for it. Educated people who love
Tamil are upset that Tamil is becoming thought of as a 'coolie' language
and regret this very much. Since Tamil is a language characterized by
extreme diglossia, there is the additional pedagogical problem of trying
to maintain a language with two variants, but with a strong cultural bias
on the part of the educational establishment for maintaining the literary
dialect to the detriment of the spoken one.
This paper examines these attempts to maintain a highly-diglossic language in emigration and concludes that the well-meaning bilingual educational system actually produces a situation of subtractive bilingualism. | |||||
| 02.10.2004 | 1:00-3:00 | LDC |
The Contextualization of Linguistic Forms across Timescales by Stanton Wortham, Penn Graduate School of Education | ||
| When people speak, they socially identify themselves and others. The use of linguistic forms is one
important means through which social identities get established. But the implications of any utterance for social
identity depend on relevant context. Decontextualized sociolinguistic regularities cannot fully explain how a given
utterance establishes a given social identity, although such regularities certainly play an important role. The centrality
of context seems to imply, methodologically, that a full linguistic analysis of social identification must rely on case
studies of particular utterances in context. Social identification, however, is not a phenomenon of isolated cases.
An individual gets socially identified across a series of interrelated events, not a series of unique, unrelated contexts.
This paper describes an empirical research project which traces the identity development of a ninth grade student across the academic year in one classroom. The student’s identity develops in substantial part through speech events that position her in socially recognizable ways. The paper presents a methodological strategy for analyzing the interrelated events across which this individual gets socially identified, focusing on specific kinds of speech events that play a central role in the emergence of her social identity across time. This project provides an opportunity to reflect on the kinds of data necessary for studying social identification. |
|||||
| 01.26.2004 | 1:30-3:30 | LDC |
Interfaces for Parser and Dictionary Access by Malcolm D. Hyman, Harvard University | ||
| The subject of this presentation is "linguistic middleware" — software designed to
mediate between backend linguistic tools and data sources (for instance, tokenizers, morphological
analyzers, and parsers) and frontend user agents (browsers and editors). Making linguistic data
available within graphical user agents will allow for rich, next-generation working environments
that can offer substantial benefits to language researchers and students. Simple but powerful
interfaces will allow for interoperability between diverse technologies, including legacy systems.
Current web-services and XML standards provide the basis for the development of such interfaces. The
goal is distributed networks that connect arbitrary tools, databases, reference works, and corpora;
ultimately, this architecture will help to break down barriers between scholarly communities and to
enrich the work of linguists, philologists, technologists, historians, and literary scholars.
In order to realize the vision of generalized linguistic middleware, we need to address a range of challenges encountered in typologically diverse languages and writing systems. This presentation will focus on:
* multiple approaches to tokenization required by different writing systems
* orthographic normalization
* handling different "window sizes" needed for context-sensitive analysis
* strategies for identifying lexical items that are realized discontinuously
* metalanguages for morphosemantic and syntactic category labels
The discussion will be accompanied by demonstrations of some prototype implementations and solutions. | |||||
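As a rough illustration of the first three topics in the abstract above (script-dependent tokenization, orthographic normalization, and context windows), here is a minimal Python sketch. It is not taken from the talk: the function names, the "spaced"/"unspaced" labels, and the choice of NFC plus case folding as the normalization step are all assumptions made for this example.

```python
import unicodedata
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Orthographic normalization: canonical composition (NFC) plus case folding."""
    return unicodedata.normalize("NFC", text).casefold()


def whitespace_tokenize(text: str) -> List[str]:
    """Tokenizer for writing systems that separate words with spaces."""
    return text.split()


def character_tokenize(text: str) -> List[str]:
    """Fallback tokenizer for writing systems without word spacing."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry: one tokenizer per writing-system label.
TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "spaced": whitespace_tokenize,
    "unspaced": character_tokenize,
}


def context_windows(tokens: List[str], size: int) -> List[List[str]]:
    """Return every run of `size` adjacent tokens (empty if too few tokens)."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


def analyze(text: str, system: str, window: int) -> List[List[str]]:
    """Normalize, tokenize with the tokenizer for `system`, then cut windows."""
    tokens = TOKENIZERS[system](normalize(text))
    return context_windows(tokens, window)
```

A frontend could call `analyze("Arma virumque cano", "spaced", 2)` to obtain overlapping two-token contexts, swapping in a different tokenizer or window size per language without changing the calling code.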
| 2003 Presentations | Date | Time | Location | Title and Instructor |
|
| 12.09.2003 | 12:00-2:00 | LDC |
Finite State Morphology using Xerox Software by Kenneth Beesley of XRCE | Morphological analysis (word analysis) and generation are foundation technologies used in many kinds of natural-language processing, including dictionary lookup, language teaching, part-of-speech disambiguation (tagging), syntactic parsing, etc. Successful and publicly available software implementations based on "finite-state" theory include Koskenniemi's Two-Level Morphology, AT&T Lextools, Groningen University's Fsm Utils, and Xerox's lexc and xfst languages. The Xerox tools are now available on CD-ROM in a book entitled "Finite State Morphology", Beesley & Karttunen, 2003, CSLI Publications. The talk will briefly and gently cover the history and underlying theory of finite-state morphology, and introduce lexc and xfst syntax. Finite-state morphological analyzers are excellent projects for a Master's Thesis, and for field linguists they are a practical way to encode, computerize and test morphotactic grammars, alternation rules and lexicons that would otherwise remain inert on paper. Finite-state morphology has been successfully applied to languages around the world, including the obviously "commercial" European languages, Finnish, Hungarian, Basque, Turkish, Korean, Arabic, Syriac, Hebrew, several Bantu languages, a variety of American Indian languages, etc. Beesley will also be available for consultation at the LDC 8-9 December. |
|
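To make the idea of finite-state analysis and generation concrete, the following is a toy transducer written in plain Python. It is only a sketch: it does not use lexc or xfst syntax, and the states, tag symbols (`+N`, `+Sg`, `+Pl`), and two-noun lexicon are invented for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Transition table: (state, lexical symbol) -> (surface output, next state).
# States, symbols, and the tiny lexicon are hypothetical.
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("start", "cat"): ("cat", "noun"),
    ("start", "fox"): ("fox", "noun_es"),
    ("noun", "+N"): ("", "num"),
    ("noun_es", "+N"): ("", "num_es"),
    ("num", "+Sg"): ("", "final"),
    ("num", "+Pl"): ("s", "final"),
    ("num_es", "+Sg"): ("", "final"),
    ("num_es", "+Pl"): ("es", "final"),
}
FINAL_STATES: Set[str] = {"final"}


def generate(symbols: List[str]) -> Optional[str]:
    """Map a lexical symbol sequence to a surface form, or None if rejected."""
    state = "start"
    surface: List[str] = []
    for symbol in symbols:
        step = TRANSITIONS.get((state, symbol))
        if step is None:
            return None
        output, state = step
        surface.append(output)
    return "".join(surface) if state in FINAL_STATES else None


def analyze(word: str) -> List[List[str]]:
    """Run the same table in the other direction: surface form -> analyses."""
    results: List[List[str]] = []

    def walk(state: str, path: List[str], produced: str) -> None:
        # Prune paths whose output is no longer a prefix of the word.
        if not word.startswith(produced):
            return
        if state in FINAL_STATES and produced == word:
            results.append(path)
        for (src, symbol), (output, dst) in TRANSITIONS.items():
            if src == state:
                walk(dst, path + [symbol], produced + output)

    walk("start", [], "")
    return results
```

The same transition table serves both directions: `generate(["fox", "+N", "+Pl"])` produces a surface form, while `analyze("foxes")` recovers the lexical symbols. That bidirectionality is the property that makes the finite-state approach attractive for testing morphotactic descriptions.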
| 10.15.2003 | 1:00-3:00 | LDC |
Searching through Prague Dependency Treebank by Jiri Mirovsky and Roman Ondruska, Charles University, Prague | ||
| Netgraph is a searching tool for annotated treebanks. It was
originally developed for the Prague Dependency Treebank but it is not limited
to this treebank. Netgraph is a multiuser system with a net architecture. This means that more than one user can access it at the same time and its components may be located in different nodes of the Internet. After the introductory part of the demonstration, we will show NetGraph in use. Although work with NetGraph is easy, the client-server architecture requires actions from the user which a stand-alone application does not. This needs to be explained and understood. We will describe four parts of working with NetGraph -- connecting to the server, selection of files, creating a query, and viewing the results. The main part is the creation of queries. NetGraph provides a simple-to-use and still powerful query language. The basic query is a tree structure with a few evaluated attributes. Searching a corpus given a query means searching for all trees which contain the query tree as a subtree. This basic functionality is improved by so-called meta attributes -- an easy way to add more restrictions to found trees, e.g. size, orientation, position of query tree, forbidden nodes etc. We will show several examples of queries, from the simplest to more complex. |
|||||
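The basic query semantics described above (a query tree with a few attribute values, matched against every tree that contains it as a subtree) can be sketched in a few lines of Python. This is not NetGraph's query language or implementation: the `Node` class, the attribute names `func` and `lemma`, and the decision to match children in any order are assumptions made for this example, and meta attributes are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    attrs: Dict[str, str]
    children: List["Node"] = field(default_factory=list)


def node_matches(query: Node, target: Node) -> bool:
    """A query node matches when every attribute it sets agrees with the target."""
    return all(target.attrs.get(k) == v for k, v in query.attrs.items())


def assign(q_children: List[Node], t_children: List[Node]) -> bool:
    """Give each query child its own distinct target child (backtracking)."""
    if not q_children:
        return True
    first, rest = q_children[0], q_children[1:]
    for i, candidate in enumerate(t_children):
        remaining = t_children[:i] + t_children[i + 1:]
        if matches_at(first, candidate) and assign(rest, remaining):
            return True
    return False


def matches_at(query: Node, target: Node) -> bool:
    """True if the query tree can be embedded with its root at `target`."""
    return node_matches(query, target) and assign(query.children, target.children)


def contains(query: Node, tree: Node) -> bool:
    """True if the query matches at the root or anywhere below it."""
    return matches_at(query, tree) or any(contains(query, c) for c in tree.children)


def search(query: Node, corpus: List[Node]) -> List[int]:
    """Return the indices of all corpus trees that contain the query tree."""
    return [i for i, tree in enumerate(corpus) if contains(query, tree)]
```

Leaving an attribute out of a query node acts as a wildcard, so a query such as `Node({"func": "Pred"}, [Node({"func": "Obj"})])` finds every tree in which some predicate node has an object child, whatever the lemmas are.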
| 11.12.2003 | 1:30-3:30 | LDC |
The Pennsylvania Sumerian Dictionary Project by Stephen Tinney, Penn Museum | ||
| The Pennsylvania Sumerian Dictionary project is developing an online
dictionary which will combine lexicon and text-corpora within an
interface which offers multiple entry points to the Dictionary,
multiple views of the lexicon or individual items and reverse
navigation back from the text-corpora. The framework within which
this system is being implemented is a generic XML data structure for
corpus-based dictionaries. The data structure binds control lists with the references drawn from the text-corpora. These references are tagged with morphological and semantic information to enable programmatic generation of lexical articles containing exhaustive information on orthography, chronological and geographical distribution and usage. This presentation will describe the particular problems associated with writing a dictionary of Sumerian, describe the corpus-based dictionary model and demonstrate the state of the current implementation. | |||||
| 04.10.2003 | 12:00-2:00 | LDC | Mohamed Maamouri & Tim Buckwalter: Arabic Language: Issues and Perspectives | ||
| The presentation will start from the early standardization of Arabic and lead to the emergence of 'diglossia' and its linguistic and sociolinguistic consequences. The dominant attitudes towards linguistic reforms will also be presented. The second focal point will be on the Arabic reading process, its challenges and its consequences for reading performance and for education in general. In a connected second part of the same presentation, Tim Buckwalter will focus on Arabic NLP issues. He will then present his morphological analyzer and his lexicon. A brief overview of the LDC Arabic Treebank project will follow. | |||||
| 03.20.2003 | 12:00-2:00 | LDC | David Miller: Collections | ||
| This presentation will discuss the various speech collection projects undertaken at the Linguistic Data Consortium at the University of Pennsylvania during the years 1995-present. Included will be a discussion of telephone speech data collection projects and on-site speech collection projects. Data collection projects covered in this talk will include: CallHome, CallFriend, Switchboard, ROAR and FISHER. |
|||||
| 01.16.2003 | 12:00-2:00 | LDC | Stephanie Strassel: DASL | ||
| Data and Annotations for Sociolinguistics (DASL): Using digital data to address issues in sociolinguistic theory
A longstanding focus of sociolinguistic research has been the quantitative analysis of language variation and change, an endeavor that necessarily begins with the empirical observation and statistical description of linguistic behavior. Current technology encourages the collection and analysis of such data, and even the presentation and publication of research findings, wholly within the digital domain. Within the field of human language technology, such benchmark data has proven to be an essential ingredient for progress, as it reduces the cost of analytical infrastructure within research communities, frees researchers to focus on their interests, encourages collaboration and reduces impediments to new participants. However, most empirical data within sociolinguistics continues to be collected and analyzed by individuals or individual research groups, and is never made available to the wider research community. This proprietary approach to data hampers collaboration, the replication of studies and the comparison of models, methods and results, all necessary components of rigorous science. The prospect of digital data sharing in sociolinguistics also raises theoretical and methodological questions: Can sociolinguists make effective use of existing corpora? Do the insights gained from corpus data differ qualitatively from data most commonly used in quantitative sociolinguistics, namely recordings of sociolinguistic interviews? What are the best practices for the creation of new digital resources for sociolinguistic research? The project on Data and Annotations for Sociolinguistics (DASL), currently underway at the Linguistic Data Consortium with support from NSF via the Talkbank project, begins to address these issues. DASL investigates the use of digital data in sociolinguistics through a series of case studies involving both analysis of variation in existing corpora and the creation of new data sets.
This paper will introduce DASL's goals, assumptions, data and tools and will review annotation and corpus creation efforts and results to date. |
|||||
| 01.16.2003 | 12:00-2:00 | LDC | Chris Cieri: SocioLinguistics | ||
| Towards a Comprehensive, Empirical Analysis of Linguistic Data: the case of Regional Italian vowel systems
Any empirical study of language relies necessarily upon a body of observations of linguistic behavior even if the study fails to formally acknowledge its corpus. The decisions one makes in approaching data affect research profoundly by opening some avenues of inquiry while blocking others. Looking across research communities as diverse as sociolinguistics and speech technologies, one finds methods that may be integrated in order to both broaden research possibilities and to perform research more efficiently. This presentation explores the relationship among data, tools, annotation (or coding) processes and the research they support. It will focus specifically on the quantitative analysis of linguistic variation. The data come from a series of sociolinguistic interviews undertaken to investigate the modeling of variation in the regional speech of central Italy. After describing the motivation for this study, I will demonstrate a series of tools, processes and data formats that permit a comprehensive yet rapid analysis of vowel systems. Specifically, I will demonstrate tools for transcription and segmentation, lexicons and search tools that automatically select and categorize tokens of interest from the transcripts, batch processes that perform acoustic analyses of the selected tokens and an interface for managing and adding human judgments to these analyses. In the process I will offer a particular perspective on tool development, favoring information retention and annotator efficiency over computational efficiency and portability. The tools and processes I will demonstrate were designed to interact with each other without losing precious information from the original audio such as the phonetic and lexical context of each vowel, the time and situation in which it was uttered, the demographic features of the speaker and so on.
As a result, they make it possible or even trivial to explore issues that are otherwise impossible or at least difficult such as the interaction of phonological variation as a function of speech rate and time within an interview.
|
|||||
| 2002 Presentations | Date | Time | Location | Title and Instructor |
|
| 06.14.2002 | 12:00-2:00 | LDC |
New Methods for Constructing Annotated Speech Corpora by Steven Bird, LDC |
||
| Over the past decade of creating and managing speech corpora, LDC staff have developed literally hundreds of utilities, user interfaces and file formats. These databases are becoming increasingly complex in their structure, with rich standoff annotations organized across multiple layers. At the same time, the range of contributing specialties has become more diverse, as illustrated by LDC's future publication plans in such areas as field linguistics, sociolinguistics, gesture, and animal communication. In this talk I will outline the traditional corpus production process and catalog the problems we have experienced. This provides the backdrop for LDC's R&D effort over the last four years, which has created new software infrastructure and a suite of annotation tools. I will introduce the principles and key concepts of the annotation graph toolkit (AGTK), describe the current tools, and give a brief overview of the tool development process. Finally, I will introduce OLAC, the Open Language Archives Community, and demonstrate how it is being used for describing and discovering language resources of the kind we create at the LDC. The talk will be followed by an informal demonstration session. More information about AGTK and OLAC is available at agtk.sf.net and www.language-archives.org. Note: This presentation will be accessible to a general audience. Developers of the work I will report include Kazuaki Maeda, Xiaoyi Ma, Haejoong Lee, Beth Randall, Salim Zayat, Craig Martell, Chris Osborn, John Breen, Jonathan Dick, Eva Banik, Alan Lee and Jonathan Wright. A PDF version of this presentation is available: |
|||||
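The "standoff annotations organized across multiple layers" mentioned in the abstract above can be pictured as a graph whose nodes are (optionally time-stamped) points in the signal and whose arcs carry labels. The Python sketch below illustrates that data model only; it is not AGTK's API, and the class, method, and layer names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Arc:
    start: str   # id of the node where the annotation begins
    end: str     # id of the node where it ends
    layer: str   # annotation layer, e.g. "word" or "speaker"
    label: str   # annotation content


class AnnotationGraph:
    """Standoff annotations: arcs refer to nodes, nodes refer to signal times."""

    def __init__(self) -> None:
        self.offsets: Dict[str, Optional[float]] = {}
        self.arcs: List[Arc] = []

    def add_node(self, node_id: str, offset: Optional[float] = None) -> None:
        """Register a node; the time offset (in seconds) may be unknown."""
        self.offsets[node_id] = offset

    def add_arc(self, start: str, end: str, layer: str, label: str) -> None:
        """Add a labeled arc between two previously registered nodes."""
        if start not in self.offsets or end not in self.offsets:
            raise KeyError("both endpoints must be added as nodes first")
        self.arcs.append(Arc(start, end, layer, label))

    def layer(self, name: str) -> List[Arc]:
        """Return the arcs of one layer in insertion order."""
        return [arc for arc in self.arcs if arc.layer == name]

    def span(self, arc: Arc) -> Tuple[Optional[float], Optional[float]]:
        """Return the (start, end) time offsets of an arc."""
        return (self.offsets[arc.start], self.offsets[arc.end])
```

Because every layer shares the same nodes, a word-level arc and a speaker-turn arc can be compared by their endpoints without either annotation being embedded in the other or in the audio file.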
| 06.27.2002 | 1:00-3:00 | LDC | Corpus Development for the ACE (Automatic Content Extraction) Program by Alexis Mitchell and Stephanie Strassel, LDC | ||
| The objective of the ACE Program is to develop core automatic content extraction technology to enable text processing through the detection and characterization of entities, relations, and events. As part of the DARPA TIDES program, ACE supports technology research and development for various classification, filtering, and selection applications by extracting and representing language content (i.e., the meaning conveyed by the data). The ultimate goal of ACE is the development of technologies that will automatically detect and characterize this meaning. For the past three years, the Linguistic Data Consortium has been developing annotated corpora for the ACE program. Data for ACE consists of newspaper, newswire and broadcast news transcripts. To support Entity Detection and Characterization, ACE annotators label selected types of entities (Persons, Organizations, etc.) mentioned in text data. The textual references to these entities are then characterized and multiple entity mentions are co-referenced. The Relation Detection and Characterization task requires annotators to identify and characterize relations between the labeled set of entities. The LDC's role in ACE has recently expanded to encompass annotation of all data for the ACE program as well as development and maintenance of annotation guidelines and annotation tools. In this talk, we describe corpus development for the ACE program, focusing on annotation procedures and guidelines as well as quality assurance measures. In addition, we touch on particular annotation challenges including classifying generic entities, metonymic entity mentions (including the concept of GeoPolitical Entities) and identifying the temporal attributes of relations. A PowerPoint version of this presentation is available:
|
|||||
| 07.25.2002 | 1:00-3:00 | LDC | Dictionary Creation by Mike Maxwell |
||
| What is a bilingual dictionary? Most of us have used bilingual dictionaries, so the answer seems obvious. But when it comes to defining the structure of a dictionary as a database on a computer, the obvious becomes non-obvious. I will talk about the structure of a bilingual lexicon, and in particular that of a lexical entry, from a computational and linguistic viewpoint. There are (at least) three levels at which one might define such structure. Proceeding from the most concrete to the most abstract, these are: the file format level (e.g. in terms of an XML structure); a model (using a modeling language such as UML); and an ontology of concepts. The talk will in fact begin with the abstract ontology and proceed to a model, which I have been developing in collaboration with several computational linguists in SIL (the Summer Institute of Linguistics). I will have little to say about the file format level. Some issues which arise include:
For the brave, a model for lexicons, morphology and other linguistic components is located at http://fieldworks.sil.org/ModelDoc/. (However, when I tried to go there today, the site was down.) A PowerPoint version of this presentation is available: | |||||
| 10.10.2002 | 12:00-2:00 | LDC | Dave Graff: Scripts/Programs for Large Data Sets | ||
| In any sort of corpus-based language research, the efficiency
and usefulness of the research will be limited by the consistency and usefulness
of the corpus. In this talk, I'll focus on establishing consistency in terms
of how language corpora are presented to researchers as input for their
work: the directory structure, file structure, document structure, character encoding, the amount and nature of meta-data (information about the corpus content) and how this information is incorporated. Virtually all text corpora are drawn from "found" data -- material that already exists in electronic form to serve some purpose other than corpus-based language research, such as: publication of books, periodicals or daily news; archival preservation of public, commercial or government transactions; online discussions on various topics among diverse interest groups; and so on. The problem is that each data source has its own unique set of needs and conventions that dictate the data formats used to store and transport its particular content -- as well as its own rate of failure in making sure the data satisfy its needs and conventions. The task for the LDC, working on behalf of corpus researchers, is to
design and apply the tools needed to distill each source into a common,
standardized form that will (1) maximize the usability of the data on
any researcher's chosen computer system, (2) preserve as much information
as possible from the source, and (3) discard as much interference and
noise as possible -- and do all this with a minimum |
|||||
| 10.31.2002 | 1:00-3:00 | LDC | Xiaoyi Ma: BITS and other Machine Translation Collection Projects Shudong Huang: Overview of Machine Translations |
||
| BITS and other Machine Translation Collection Projects
Parallel corpora are a valuable resource for machine translation, multilingual text retrieval, language education and other applications, but for various reasons their availability is very limited at present. Having noticed that the World Wide Web is a potential source of parallel text, researchers are making efforts to explore the Web in order to build a large collection of bitext. This paper presents BITS. PostScript and PDF versions of this paper are available. |
|||||
| 12.19.2002 | 12:00-2:00 | LDC | Mark Liberman: Mining the Bibliome: Information Extraction from Biomedical Text | ||
| Our goal is qualitatively better methods for automatically extracting information from the biomedical literature, relying on recent progress and new research in three areas: high-accuracy parsing, shallow semantic analysis, and integration of large volumes of diverse data. We are focusing initially on two applications: drug development, in collaboration with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline, and pediatric oncology, in collaboration with researchers in the eGenome group at Children's Hospital of Philadelphia. These applications, worthwhile in their own right, provide excellent test beds for broader research efforts in natural language processing and data integration. In particular, we propose to develop and test new general methods for information extraction from text, based on on-going research at Penn in corpus-based algorithms for parsing, predicate-argument analysis and reference resolution. We will collaborate with groups led by Robert Gaizauskas at the University of Sheffield (UK) and by Jun-ichi Tsujii at the University of Tokyo (Japan), as well as with the GSK and CHOP groups, in applying these general methods to particular problems in biomedical information extraction. The GSK group has already made effective use of the best available information-extraction technology, so that the new techniques can be assessed in the drug-development application for their added value relative to the state of the art. To give a simple, concrete example, we want a program that will read a phrase like
and add to a database a set of entries whose ordinary-language presentation is
This project will also address several database research problems, including methods for modeling complex, incomplete and changing information using semistructured data, and also ways to connect the text analysis process to an information integration environment that can deal with the wide variety of extant bioinformatic data models, formats, languages and interfaces. Such problems are central to current database research at Penn. Our work will build on the progress represented by Penn's K2/Kleisli data integration environment, by extending it to deal with semi-structured data. This will be important for managing the information that is extracted, and also in making use of the many specialized data resources that can be brought to bear in generating the high-accuracy text analysis required to extract the information effectively in the first place. Because the K2/Kleisli system is widely used in bioinformatics applications, and in particular has been used for several years at GSK, these extensions will facilitate collaboration with biomedical researchers as well as supporting the IE research itself. The engine of recent progress in language processing research has been linguistic data: text corpora, treebanks, lexicons, test corpora for information retrieval and information extraction, and so on. Much of this data has been created by Penn researchers and published by Penn's Linguistic Data Consortium. As part of the proposed project, we will develop and publish new linguistic resources in three categories: a large corpus of biomedical text annotated with syntactic structures (Treebank) and shallow semantic structures ("proposition bank" or Propbank); a large set of biomedical abstracts and full-text articles annotated with entities and relations of interest to researchers, such as enzyme inhibition by various compounds, or genotype/phenotype connections (Factbanks); and broad-coverage lexicons and tools for the analysis of biomedical texts. |
|||||
|
| |||||