| The LDC Institute is a seminar series on issues in language
data and database creation. The goals of the seminar series are to create
a forum of UPenn affiliates with an interest in language data, to communicate
LDC and UPenn experience in data collections, standards, and annotation,
and to work with researchers and others who may be interested in LDC data
or who may wish to contribute new data to the archives.
The topics covered will be descriptions of data-driven projects, both
at the LDC and elsewhere, and tutorials on specific methodologies and systems
for dealing with language data. We expect attendees to be UPenn students,
both undergraduate and graduate, as well as interested LDC staff.
The seminars will be scheduled on Thursday afternoons.
For most seminars, food will be provided, since they will take place at
noon or shortly after. |
|
Upcoming
Presentations and Notes |
| Date |
Time |
Location |
Title and Instructor |
| Jul 17, 2008 |
11:30-1:30pm |
LDC |
Development of Resources and Techniques for Processing of Some Indian Languages by Shyam S. Agrawal, KIIT College of Engineering, Gurgaon, India
|
|
|
|
|
|
Shyam S. Agrawal. Shyam is the Executive Director of KIIT College of Engineering, Gurgaon, India and Advisor to the Centre for Development of Advanced Computing (CDAC), Noida, India.
ABSTRACT: In the past few decades there has been a pressing demand and need to develop speech and language corpora for training, testing and benchmarking of speech technology systems for various applications. Richly annotated corpora and labeled databases are needed to develop models of spoken languages and also to understand the structure of speech and the variability that occurs in the speech signals due to a variety of factors.
There are twenty-two official languages and more than 300 dialects used in India. These languages have specific structures and properties that differ significantly from those of European languages. This talk presents some of the phonetic differences between Hindi and English and gives an overview of the efforts made by CDAC, Noida and some other institutions to develop text and speech databases, tools, and techniques for processing some Indian languages. For the collection of speech databases, issues and procedures related to text selection, e.g. multiform units, and variables such as demographic, dialectal, environmental, emotional and linguistic background have been included. Special tools developed for the analysis of text are described. The objective has been that the tools should be adaptable to other Indian languages and also to other languages.
Some application-oriented, task-specific databases, such as the ELDA-sponsored database for Hindi and the CFSL Speaker Identification database for forensic applications, will be described in detail.
A PowerPoint version of this presentation is available:
Development of Resources and Techniques for Processing of some Indian Languages
|
|
Previous
Presentations and Notes |
| Date |
Time |
Location |
Title and Instructor |
| Jul 19, 2007 |
2:30-4:30pm |
LDC |
HTML Templates for LDC Sponsored Projects by Shawn Medero, LDC
|
|
|
|
|
Introduction to existing resources for communicating your LDC projects
over the web. The templates provide a professional, consistent and
attractive design while allowing creativity and variation in each
project's web site approach. The templates will help you define a predictable
and organized navigation structure for LDC employees, project sponsors,
researchers, and general visitors. Questions concerning use of the web
standards, such as CSS and HTML, are encouraged throughout the
presentation.
|
| Jul 13, 2007 |
2:30-4:30pm |
LDC |
Speaking Arabic in Iraq and the Middle East: Reflections on Three Tours of Duty by Kenneth Gardner, USMC, ret.
|
|
|
|
|
Kenneth Gardner served in the U.S. military for 22 years. He learned Arabic at the Defense Language Institute Foreign Language Center in 1995. His missions included infantry and intelligence assignments in, or training troops from, the following countries, among others: Egypt, Saudi Arabia, Kuwait, Jordan, United Arab Emirates, Bahrain, Oman, Qatar, Afghanistan and Iraq. Most recently, Gardner served as an advisor to an Iraqi Army Infantry Battalion in Operation Al-Fajr (the November 2004 battle for Fallujah) and as base commander of an Iraqi Security Forces training base. Gardner will share his experiences as a non-native Arabic speaker communicating with native speakers in a variety of settings as well as the issues faced by monolingual U.S. troops in the field.
|
| Jun 20, 2007 |
3-5pm |
LDC |
Programming Specifications Procedures and Practices by Andy Cole, LDC
|
|
|
|
|
Virtually all tasks at LDC depend on programmer input. It's been LDC's policy to require that most requests for programming assistance be accompanied by a specification that describes the desired output and includes some estimate of the programming time needed to complete the tasks in the specification. Andy Cole will outline a simple set of guidelines for LDC staff to follow when making programming requests and he will illustrate how those guidelines work by using them to develop a business systems specification.
|
| Oct 26, 2006 |
3-5pm |
LDC |
Comparing Linguistic Annotations -- Issues in Harmonization and Quality Control by Christopher R. Walker
|
|
|
|
|
Consistency analysis was an important aspect of quality assessment in 2005 ACE data creation. Within this framework, Christopher Walker, LDC's Project Manager, Information Extraction, became quite interested in the various assumptions and applications of annotation scoring infrastructure. As he attempted to better understand the bounds of the problem and solution spaces, it quickly became clear that there was no existing discussion of these issues -- and very little documentation of general Best Practices.
In this talk Walker seeks to reduce this gap by outlining the apparent problem and solution spaces; and by opening for discussion the utility of annotation scoring metrics in the various domains of empirical, computational and corpus linguistics -- and more cogently in the domain of quality control for linguistic data creation.
The talk will be organized into four main sections, followed by a brief discussion:
1. ACE Annotation: An Illustrative Introduction
2. Exploring the problem space
3. Exploring the solution space
4. Extending the metaphor: Building better linguistic data
|
| Sep 22, 2006 |
10:30-11:30am |
LDC |
Recording and Annotation of Speech Data via the WWW - A Case Study by Dr. Christoph Draxler, Institute of Phonetics and Speech Communication, Ludwig-Maximilians University, Munich, Germany
|
|
|
|
|
The German Ph@ttSessionz project will create a database of 1000 adolescent speakers balanced by gender and covering all major German dialect areas. The project employs a novel approach to collecting speech data: all recordings are performed via the WWW -- using a web application in a standard web browser -- in more than thirty-five German public schools. Speech is recorded using a standardized audio setup on the school's PC, and the signal and administrative data are immediately transferred to the BAS server in Munich. Using this approach, geographically distributed recordings in high bandwidth quality can be made efficiently and reliably.
Dr. Draxler will describe the Ph@ttSessionz web application and its major components, SpeechRecorder and WebTranscribe, and he will outline the infrastructure developed at BAS for WWW-based speech recordings. Dr. Draxler will also discuss the strategies employed to enlist schools in the project and present preliminary analyses of the Ph@ttSessionz speech database.
|
| Jul 27, 2006 |
3-5pm |
LDC |
LDC Online by Dave Graff
|
|
|
|
|
The topic of this session will be LDC Online. Dave Graff, LDC's Lead Programmer Analyst, will present an overview of LDC Online's corpora coverage, search methodology and future plans for growth.
|
| May 4, 2006 |
3-5pm |
LDC |
Pros and Cons of Different Annotation Workflow Systems by Seth Kulick, Julie Medero, Hubert Jin, Dave Graff, and Kevin Walker
|
|
|
|
|
This LDC Institute is part of an effort to determine the desirable properties for a single workflow system that can be used (or extended as appropriate) in the various annotation projects at LDC and IRCS (Penn's Institute for Research in Cognitive Science). Because the several different workflow systems that are currently used were designed for projects with different needs, they handle many issues differently, among them support for local and remote annotation, sophistication of report capability, use of automated tagging, and flexibility in the specification of workflow stages.
The speakers are LDC/IRCS programmers who designed some of the current systems. They will discuss the following topics: (1) the properties of the different systems; (2) why some characteristics of a particular workflow system might make it unsuitable for a particular project; (3) properties that should be added to the workflow systems; and (4) alternative ways of setting up a workflow system.
|
|
|
|
|
| Apr 20, 2006 |
3-5pm |
LDC |
Recent Trends in Annotation Tool Development at LDC by Kazuaki Maeda, Julie Medero and Haejoong Lee |
|
|
The Linguistic Data Consortium (LDC) has created large amounts of annotated
linguistic data for a variety of evaluation programs and
projects using highly customized annotation tools developed at the LDC. LDC programmers Kazuaki Maeda, Julie Medero and Haejoong Lee will
discuss the history of annotation tool development at the LDC and
share some of the approaches they are currently taking. Two tools in particular will be highlighted: (1) the LDC's model for decision point-based annotation and adjudication,
which was used effectively in the ACE 2005 annotation effort; and (2) XTrans, a new speech transcription and annotation tool particularly suited for
transcribing meeting speech that was used by the LDC in
the NIST meeting recognition evaluation and Mixer Spanish and
Russian telephone conversation studies.
|
| 12.08.2005 |
3:00-5:00pm |
LDC |
Building a Lexicon Database for Arabic Dialects by Dave Graff |
|
|
Download the presentation slides (PowerPoint)
|
|
ABSTRACT: One of the major problems in creating a lexicon database for Arabic dialects is the fact that standardized orthographic (spelling) conventions do not generally exist. The word forms generated by transcribers from recorded conversations are based on relatively loose conventions and show significant variability within any given dialect. We describe how that problem is being resolved by creating a relational database design that makes the transcripts a key part of the database so that repairs to word forms in the lexicon table are propagated automatically to the transcripts. We also review some earlier approaches to lexicon building, describe the annotation tools developed specifically for the current lexicon project, and briefly consider some possible extensions to the database structure and annotation methods the LDC currently uses to cover tasks such as treebank annotation.
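The propagation idea in this abstract can be illustrated with a minimal sketch. It assumes a much simpler schema than the one the project used: the table names, column names, and romanized word forms below are invented placeholders, and SQLite stands in for whatever database system the LDC actually employed. Transcript tokens store a reference to a lexicon row rather than a spelling, so correcting the form once in the lexicon changes every transcript that is rendered afterwards.

```python
import sqlite3

# Hypothetical two-table schema: tokens point at lexicon rows, not at spellings.
SCHEMA = """
CREATE TABLE lexicon (
    id   INTEGER PRIMARY KEY,
    form TEXT NOT NULL
);
CREATE TABLE transcript_token (
    utterance_id INTEGER NOT NULL,
    position     INTEGER NOT NULL,
    lexicon_id   INTEGER NOT NULL REFERENCES lexicon(id),
    PRIMARY KEY (utterance_id, position)
);
"""


def build_demo_db():
    """Create an in-memory database with one two-word utterance."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Placeholder word forms, invented for illustration only.
    conn.executemany(
        "INSERT INTO lexicon (id, form) VALUES (?, ?)",
        [(1, "hada"), (2, "ktaab")],
    )
    conn.executemany(
        "INSERT INTO transcript_token (utterance_id, position, lexicon_id) "
        "VALUES (?, ?, ?)",
        [(1, 0, 1), (1, 1, 2)],
    )
    return conn


def render(conn, utterance_id):
    """Rebuild the transcript text by joining tokens to their lexicon forms."""
    rows = conn.execute(
        "SELECT l.form FROM transcript_token AS t "
        "JOIN lexicon AS l ON l.id = t.lexicon_id "
        "WHERE t.utterance_id = ? ORDER BY t.position",
        (utterance_id,),
    ).fetchall()
    return " ".join(row[0] for row in rows)


def repair_form(conn, lexicon_id, new_form):
    """Correct a spelling once, in the lexicon table only."""
    conn.execute("UPDATE lexicon SET form = ? WHERE id = ?", (new_form, lexicon_id))


conn = build_demo_db()
before = render(conn, 1)
repair_form(conn, 2, "kitaab")
after = render(conn, 1)
print(before, "->", after)
```

Because the transcript text is derived by a join rather than stored as a string, no separate pass over the transcripts is needed after a repair.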
|
| 11.17.2005 |
3:00-5:00pm |
LDC |
Less Commonly Taught Languages (LCTLs) by Mark Mandel
|
|
|
The topic of this session will be the LDC’s work on the Less Commonly Taught Languages (LCTLs) portion of the REFLEX (Research on English and Foreign Language Processing) Project. Mark Mandel will describe the scope of the current work which is focusing on Thai, Urdu, Bengali, Panjabi, Hungarian, Tamil, and Yoruba. The talk will take about an hour and will be followed by questions from the audience.
Refreshments will be provided.
|
|
ABSTRACT: One of the LDC’s principal tasks in the REFLEX project is to discover, produce, and maintain language resources for the target languages. Those resources include, among other things, linguistic information, writing systems, converters, word segmenters, electronic lexicons, monolingual texts, bilingual texts, morphological parsers, and tools for producing and using annotated resources. Mark Mandel will describe the challenges associated with assembling those language resources, as well as the current progress of the LDC’s work.
|
|
|
|
|
| 07.07.2005 |
2:00-4:00pm |
LDC |
The Teaching of Berber in Morocco: Reality and Perspectives by Fatima Agnaou, IRCAM
|
|
|
Fatima Agnaou will discuss IRCAM (The Royal Institute of Amazigh Culture) and its achievements with regard to the integration of Berber in Moroccan schools. Agnaou will address the aims and objectives of teaching in Berber and the standardization of the Berber language as the language takes on new functions. The talk will also present the textbooks and other teaching materials, along with teacher training and methodology. The talk will take about an hour and will be followed by questions from the audience.
|
| postponed |
12:00-2:00 |
LDC |
"Metadata" Annotation for EARS by Stephanie Strassel and Chris Walker of the LDC
|
|
|
|
|
| 02.03.2005 |
12:30-2:30 |
LDC |
Functional Morphology by Otakar Smrz, Charles University
|
|
|
Computational morphological models are usually implemented as finite-state transducers. The morphologies of natural languages are, however, better described in terms of inflectional paradigms, lexicons, and categories (inflectional and inherent parameters).
Markus Forsberg and Aarne Ranta have recently introduced a framework called Functional Morphology (FM) smoothly reconciling both of these viewpoints. Linguists can model their systems without any 'finite-state restrictions', using the full power of the functional language Haskell and delegating actual computational issues to FM. The morphological models become clearer, reusable, (ex)portable, and even more efficient.
I would like to highlight the noticeable features of FM/Haskell and outline our plans to use it for Arabic. I will be referring to languages like Latin, Swedish, Sanskrit, Spanish and Russian.
PDF, PostScript and PowerPoint versions of this presentation are available.
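The notion of an inflectional paradigm as a function can be sketched in a few lines. The example below is written in Python rather than Haskell and does not use the FM library. The function names are invented for illustration, and the Latin first-declension endings are simplified (no vocative, no vowel length). A paradigm takes a lemma and returns a function from inflectional parameters to word forms, and a full-form table is obtained by enumerating the parameter space.

```python
from itertools import product

# Inflectional parameters for a simplified Latin noun.
NUMBERS = ("sg", "pl")
CASES = ("nom", "acc", "gen", "dat", "abl")

# First-declension endings, vowel length ignored.
FIRST_DECLENSION = {
    ("sg", "nom"): "a",  ("sg", "acc"): "am", ("sg", "gen"): "ae",
    ("sg", "dat"): "ae", ("sg", "abl"): "a",
    ("pl", "nom"): "ae", ("pl", "acc"): "as", ("pl", "gen"): "arum",
    ("pl", "dat"): "is", ("pl", "abl"): "is",
}


def first_declension(lemma):
    """Paradigm: map a lemma such as 'rosa' to an inflection function."""
    if not lemma.endswith("a"):
        raise ValueError("expected a lemma ending in -a")
    stem = lemma[:-1]

    def inflect(number, case):
        return stem + FIRST_DECLENSION[(number, case)]

    return inflect


def full_form_table(inflect):
    """Enumerate every parameter combination to get a full-form lexicon."""
    return {(n, c): inflect(n, c) for n, c in product(NUMBERS, CASES)}


rosa = first_declension("rosa")
print(rosa("pl", "gen"))
print(full_form_table(rosa))
```

The linguist writes only the paradigm function; enumeration, lookup, and export to other formats can be handled generically from the table.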
|
|
|
|
|
|
Previous
Presentations and Notes for Fall 2003- Spring 2004 |
| 05.13.2004 |
12:30-2:30 |
LDC |
Arabic Propbank by Mona Diab, Stanford University
|
|
|
The holy grail of computational linguistics, from the time of the
field's inception, has been (automatic) natural language understanding.
Semantic parsing appears as a significant stride in that direction
giving researchers a glimpse into the world of concepts in a functional
and operational manner. Thanks to the English PropBank, a jumpstart for
the task of semantic role assignment is noted in several community wide
standardised evaluations such as CoNLL and Senseval3. In fact, the
porting of PropBank style annotation to 10 Chinese verbs has achieved
very interesting results in a relatively short period of time (Sun &
Jurafsky, 2004) but more importantly has shown some relevant
quantifiable linguistic variation between English and Chinese.
This is where my story begins. I am interested in building a semantic
role assignment system for Modern Standard Arabic (MSA). Different from
the Chinese effort however, no PropBank guidelines or data exist for
Arabic yet. Therefore, I decided to manually annotate some of the data.
This proved to be a very interesting and challenging task.
In line with the Chinese experience, I took on the annotation of the 20
most frequent Arabic verbs in the Arabic treebank v2. The verbs have to
have occurred in more than 69 sentences. The work is still in progress
with an expected completion date of June 30 or so.
In this talk, I will focus on 3 of the verbs which are fully annotated
discussing the different frames and roles associated with their
arguments and adjuncts. The main issue that arises frequently is that of
consistency and variation, especially with respect to assigning ARGM and ARG3 roles to constituents.
I would like to achieve a more generalised, consistent annotation both across verbs and within a single verb
(where the annotation sometimes shifts, potentially because of sense variation). I will discuss some of the
rudimentary guidelines I have set for myself, inspired by the annotation guidelines of both the English
and Chinese PropBanks.
|
| 03.23.2004 |
1:30-2:30 |
LDC |
Project Santiago by Colonel Stephen A. LaRocca of the Center for Technology Enhanced Language Learning
|
|
|
The Center for Technology Enhanced Language Learning (CTELL), organized within the Department of Foreign Languages at the U.S.
Military Academy, collects speech data and turns it into speech recognition software primarily for the benefit of cadets
learning languages at West Point. Since 1997 CTELL has collected broadband speech corpora for Arabic, Russian, Portuguese,
American English, Spanish, Croatian, Korean, German and French.
|
| 02.26.2004 |
12:30-2:30 |
LDC |
Tongue-Tied in Singapore:
A Language Policy for Tamil? by Harold F. Schiffman, Department of South Asia Studies
|
|
The Tamil situation in Singapore is one that lends itself ideally to the
study of minority language maintenance. The Tamil community is small and
its history and demographics are well known. The Singapore educational
system supports a well-developed and comprehensive bilingual education
program for its three major linguistic communities on an egalitarian
basis, so Tamil is a sort of test-case for how well a small language
community can survive in a multilingual society where larger groups are
doing well. But Tamil is acknowledged by many to be facing a number of
crises; Tamil as a home language is not being maintained by the
better-educated, and Indian education in Singapore is also not living up
to the expectations many people have for it. Educated people who love
Tamil are upset that Tamil is becoming thought of as a 'coolie' language
and regret this very much. Since Tamil is a language characterized by
extreme diglossia, there is the additional pedagogical problem of trying
to maintain a language with two variants, but with a strong cultural bias
on the part of the educational establishment for maintaining the literary
dialect to the detriment of the spoken one.
This paper examines these
attempts to maintain a highly-diglossic language in emigration and
concludes that the well-meaning bilingual educational system actually
produces a situation of subtractive bilingualism.
|
| 02.10.2004 |
1:00-3:00 |
LDC |
The Contextualization of Linguistic Forms across Timescales by Stanton Wortham, Penn Graduate School of Education
|
|
When people speak, they socially identify themselves and others. The use of linguistic forms is one
important means through which social identities get established. But the implications of any utterance for social
identity depend on relevant context. Decontextualized sociolinguistic regularities cannot fully explain how a given
utterance establishes a given social identity, although such regularities certainly play an important role. The centrality
of context seems to imply, methodologically, that a full linguistic analysis of social identification must rely on case
studies of particular utterances in context. Social identification, however, is not a phenomenon of isolated cases.
An individual gets socially identified across a series of interrelated events, not a series of unique, unrelated contexts.
This paper describes an empirical research project which traces the identity development of a ninth grade
student across the academic year in one classroom. The student’s identity develops in substantial part through speech events
that position her in socially recognizable ways. The paper presents a methodological strategy for analyzing the
interrelated events across which this individual gets socially identified, focusing on specific kinds of speech events
that play a central role in the emergence of her social identity across time. This project provides an opportunity to
reflect on the kinds of data necessary for studying social identification.
|
| 01.26.2004 |
1:30-3:30 |
LDC |
Interfaces for Parser and Dictionary Access
by Malcolm D. Hyman, Harvard University
|
|
The subject of this presentation is "linguistic middleware" — software designed to
mediate between backend linguistic tools and data sources (for instance, tokenizers, morphological
analyzers, and parsers) and frontend user agents (browsers and editors). Making linguistic data
available within graphical user agents will allow for rich, next-generation working environments
that can offer substantial benefits to language researchers and students. Simple but powerful
interfaces will allow for interoperability between diverse technologies, including legacy systems.
Current web-services and XML standards provide the basis for the development of such interfaces. The
goal is distributed networks that connect arbitrary tools, databases, reference works, and corpora;
ultimately, this architecture will help to break down barriers between scholarly communities and to
enrich the work of linguists, philologists, technologists, historians, and literary scholars.
In order to realize the vision of generalized linguistic middleware, we need to address a range of challenges
encountered in typologically diverse languages and writing systems. This presentation will focus on:
* multiple approaches to tokenization required by different writing systems
* orthographic normalization
* handling different "window sizes" needed for context-sensitive analysis
* strategies for identifying lexical items that are realized discontinuously
* metalanguages for morphosemantic and syntactic category labels
The discussion will be accompanied by demonstrations of some prototype implementations and solutions.
|
| 12.09.2003 |
12:00-2:00 |
LDC |
Finite State Morphology using Xerox Software
by Kenneth Beesley of XRCE
|
|
Morphological analysis (word analysis) and generation are foundation technologies used in many kinds of
natural-language processing, including dictionary lookup, language teaching, part-of-speech disambiguation (tagging), syntactic parsing,
etc. Successful and publicly available software implementations
based on "finite-state" theory include Koskenniemi's Two-Level
Morphology, AT&T Lextools, Groningen University's Fsm Utils, and
Xerox's lexc and xfst languages. The Xerox tools are now available
on CD-ROM in a book entitled "Finite State Morphology", Beesley &
Karttunen, 2003, CSLI Publications.
The talk will briefly and gently cover the history and underlying
theory of finite-state morphology, and introduce lexc and xfst syntax.
Finite-state morphological analyzers are excellent projects for
a Master's Thesis, and for field linguists they are a practical
way to encode, computerize and test morphotactic grammars, alternation
rules and lexicons that would otherwise remain inert on paper.
Finite-state morphology has been successfully applied to languages
around the world, including the obviously "commercial" European
languages, Finnish, Hungarian, Basque, Turkish, Korean, Arabic,
Syriac, Hebrew, several Bantu languages, a variety of American Indian
languages, etc.
Beesley will also be available for consultation at the LDC 8-9 December.
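As a rough illustration of what a finite-state morphological analyzer does, the sketch below builds a toy transducer in plain Python. It is not lexc or xfst syntax and does not reproduce the Xerox tools. The class, the function names, the two-noun lexicon, and the `+N`, `+Sg`, `+Pl` tags are all assumptions made for this example. Each arc pairs an input symbol with an output string, and analysis follows every path whose input side spells out the surface word.

```python
from collections import defaultdict


class ToyFST:
    """A tiny nondeterministic transducer: arcs carry (input, output, target)."""

    def __init__(self, start, finals):
        self.start = start
        self.finals = set(finals)
        self.arcs = defaultdict(list)

    def add_arc(self, src, inp, out, dst):
        arc = (inp, out, dst)
        if arc not in self.arcs[src]:
            self.arcs[src].append(arc)


def build_noun_fst(stems):
    """Lexicon as a trie of stem letters, followed by number suffix arcs."""
    fst = ToyFST(start=("stem", ""), finals={"END"})
    for stem in stems:
        for i, ch in enumerate(stem):
            fst.add_arc(("stem", stem[:i]), ch, ch, ("stem", stem[: i + 1]))
        # Empty-input arc: emit the category tag once a full stem is read.
        fst.add_arc(("stem", stem), "", "+N", "N")
    fst.add_arc("N", "", "+Sg", "END")   # no suffix: singular
    fst.add_arc("N", "s", "+Pl", "END")  # surface "s": plural
    return fst


def analyze(fst, word):
    """Return all analyses of a surface word (assumes no empty-input cycles)."""
    results = []
    stack = [(fst.start, 0, "")]
    while stack:
        state, i, out = stack.pop()
        if i == len(word) and state in fst.finals:
            results.append(out)
        for inp, o, nxt in fst.arcs.get(state, ()):
            if inp == "":
                stack.append((nxt, i, out + o))
            elif i < len(word) and word[i] == inp:
                stack.append((nxt, i + 1, out + o))
    return sorted(results)


fst = build_noun_fst(["cat", "dog"])
print(analyze(fst, "cats"))
print(analyze(fst, "dog"))
```

Real systems add alternation rules (for example, spelling changes at morpheme boundaries) by composing further transducers with the lexicon; this sketch covers only concatenative morphotactics.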
|
| 10.15.2003 |
1:00-3:00 |
LDC |
Searching through Prague Dependency Treebank
by Jiri Mirovsky and Roman Ondruska, Charles University,
Prague
|
|
Netgraph is a searching tool for annotated treebanks. It was
originally developed for the Prague Dependency Treebank but it is not limited
to this treebank.
Netgraph is a multiuser system with a networked architecture. This means that
more than one user can access it at the same time and its components may
be located on different nodes of the Internet.
After the introductory part of the demonstration, we will show NetGraph
in use. Although work with NetGraph is easy, the client-server architecture
requires actions from the user which a stand-alone application does not.
It needs to be explained and understood.
We will describe four parts of working with NetGraph -- connecting to the
server, selection of files, creating a query, and viewing the results. The
main part is the creation of queries. NetGraph provides a simple-to-use
and still powerful query language. The basic query is a tree structure with
a few evaluated attributes. Searching a corpus given a query means searching
for all trees which contain the query tree as a subtree.
This basic functionality is improved by so-called meta attributes -- an
easy way to add more restrictions to found trees, e.g. size, orientation,
position of query tree, forbidden nodes etc.
We will show several examples of queries, from the simplest to more complex.
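The basic search described above, finding every tree that contains the query tree as a subtree, can be sketched as follows. This is not the NetGraph query language or its implementation. The `Node` class, the attribute names `pos` and `func`, and the matching rule are assumptions for illustration. In particular, each query child is only required to match some child of the candidate node, which is a simplification, and meta attributes are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    attrs: dict
    children: list = field(default_factory=list)


def matches_at(query, node):
    """True if the query tree matches with its root placed on this node."""
    # Every attribute given in the query must have the same value in the node.
    if any(node.attrs.get(key) != value for key, value in query.attrs.items()):
        return False
    # Simplification: each query child must match at least one child of node.
    return all(
        any(matches_at(q_child, child) for child in node.children)
        for q_child in query.children
    )


def contains(query, tree):
    """True if the query matches at the root or anywhere below it."""
    return matches_at(query, tree) or any(
        contains(query, child) for child in tree.children
    )


def search(query, corpus):
    """Return the indices of all corpus trees that contain the query tree."""
    return [i for i, tree in enumerate(corpus) if contains(query, tree)]


# Two toy dependency trees with hypothetical attribute values.
corpus = [
    Node({"pos": "V"}, [Node({"pos": "N", "func": "Sb"}),
                        Node({"pos": "N", "func": "Obj"})]),
    Node({"pos": "V"}, [Node({"pos": "N", "func": "Sb"})]),
]
# Query: a verb node that has a child with func=Obj.
query = Node({"pos": "V"}, [Node({"func": "Obj"})])
print(search(query, corpus))
```

A query node lists only the attributes the user cares about, so leaving an attribute out of the query acts as a wildcard for it.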
|
| 11.12.2003 |
1:30-3:30 |
LDC |
The Pennsylvania Sumerian Dictionary Project by Stephen Tinney, Penn Museum
|
|
The Pennsylvania Sumerian Dictionary project is developing an online
dictionary which will combine lexicon and text-corpora within an
interface which offers multiple entry points to the Dictionary,
multiple views of the lexicon or individual items and reverse
navigation back from the text-corpora. The framework within which
this system is being implemented is a generic XML data structure for
corpus-based dictionaries.
The data structure binds control lists with the references drawn from
the text-corpora. These references are tagged with morphological and
semantic information to enable programmatic generation of lexical
articles containing exhaustive information on orthography,
chronological and geographical distribution and usage.
This presentation will describe the particular problems associated
with writing a dictionary of Sumerian, describe the corpus-based
dictionary model and demonstrate the state of the current
implementation.
|
|
|
|
|
|
Previous
Presentations and Notes for Summer 2002- Spring 2003 |
| 06.14.2002 |
12:00-2:00 |
LDC |
New Methods for Constructing Annotated Speech Corpora by Steven Bird,
LDC
|
|
Over the past decade of creating and managing speech corpora,
LDC staff have developed literally hundreds of utilities, user interfaces
and file formats. These databases are becoming increasingly complex in
their structure, with rich standoff annotations organized across multiple
layers. At the same time, the range of contributing specialties
has become more diverse, as illustrated by LDC's future publication plans
in such areas as field linguistics, sociolinguistics, gesture, and animal
communication.
In this talk I will outline the traditional corpus production process
and catalog the problems we have experienced. This provides the backdrop
for LDC's R&D effort over the last four years, which has created new software
infrastructure and a suite of annotation tools. I will introduce the principles
and key concepts of the annotation graph toolkit (AGTK), describe the
current tools, and give a brief overview of the tool development process.
Finally, I will introduce OLAC, the Open Language Archives Community,
and demonstrate how it is being used for describing and discovering language
resources of the kind we create at the LDC. The talk will be followed
by an informal demonstration session.
More information about AGTK and OLAC is available at agtk.sf.net and
www.language-archives.org.
Note: This presentation will be accessible to a general audience.
Developers of the work I will report include Kazuaki Maeda, Xiaoyi Ma,
Haejoong Lee, Beth Randall, Salim Zayat, Craig Martell, Chris Osborn,
John Breen, Jonathan Dick, Eva Banik, Alan Lee and Jonathan Wright.
A PDF version of this presentation is available:
New Methods
for Constructing Annotated Speech Corpora
|
| 06.27.2002 |
1:00-3:00 |
LDC |
Corpus Development for the ACE (Automatic Content Extraction) Program
by Alexis Mitchell and Stephanie Strassel, LDC |
|
The objective of the ACE Program is to develop core automatic
content extraction technology to enable text processing through the
detection and characterization of entities, relations, and events. As
part of the DARPA TIDES program, ACE supports technology research and
development for various classification, filtering, and selection applications
by extracting and representing language content (i.e., the meaning conveyed
by the data). The ultimate goal of ACE is the development of technologies
that will automatically detect and characterize this meaning.
For the past three years, the Linguistic Data Consortium has been developing
annotated corpora for the ACE program. Data for ACE consists of newspaper,
newswire and broadcast news transcripts. To support Entity Detection and
Characterization, ACE annotators label selected types of entities (Persons,
Organizations, etc.) mentioned in text data. The textual references to
these entities are then characterized and multiple entity mentions are
co-referenced. The Relation Detection and Characterization task requires
annotators to identify and characterize relations between the labeled
set of entities. The LDC's role in ACE has recently expanded to encompass
annotation of all data for the ACE program as well as development and
maintenance of annotation guidelines and annotation tools.
In this talk, we describe corpus development for the ACE program, focusing
on annotation procedures and guidelines as well as quality assurance measures.
In addition, we touch on particular annotation challenges including classifying
generic entities, metonymic entity mentions (including the concept of
GeoPolitical Entities) and identifying the temporal attributes of relations.
A PowerPoint version of this presentation is available:
Corpus
Development for the ACE (Automatic Content Extraction)
|
| 07.25.2002 |
1:00-3:00 |
LDC |
Dictionary Creation
by Mike Maxwell |
|
What is a bilingual dictionary? Most of us have used bilingual
dictionaries, so the answer seems obvious. But when it comes to defining
the structure of a dictionary as a database on a computer, the obvious
becomes non-obvious.
I will talk about the structure of a bilingual lexicon, and in particular
that of a lexical entry, from a computational and linguistic viewpoint.
There are (at least) three levels at which one might define such structure.
Proceeding from the most concrete to the most abstract, these are: the
file format level (e.g. in terms of an XML structure); a model (using
a modeling language such as UML); and an ontology of concepts.
The talk will in fact begin with the abstract ontology and proceed to
a model, which I have been developing in collaboration with several computational
linguists in SIL (the Summer Institute of Linguistics). I will have little
to say about the file format level. Some issues which arise include:
- The users of the lexicon (human users vs. computational linguistics
systems, such as parsers)
- On-line vs. printed dictionaries
- Data integrity issues (what happens when you delete the lexical entry
of a word that was the synonym of another lexical entry?)
- Modeling areas outside traditional lexicography, especially morphology
- Multi-dialectal lexicons
For the brave, a model for lexicons, morphology and other linguistic
components is located at http://fieldworks.sil.org/ModelDoc/. (However,
when I tried to go there today, the site was down.)
A PowerPoint version of this presentation is available:
Modeling
Lexical Entries in Bilingual Dictionaries
|
| 10.10.2002 |
12:00-2:00 |
LDC |
Dave Graff: Scripts/Programs for Large Data Sets |
| |
In any sort of corpus-based language research, the efficiency
and usefulness of the research will be limited by the consistency and usefulness
of the corpus. In this talk, I'll focus on establishing consistency in terms
of how language corpora are presented to researchers as input for their
work: the directory structure, file
structure, document structure, character encoding, the amount and nature
of meta-data (information about the corpus content) and how this information
is incorporated.
Virtually all text corpora are drawn from "found" data -- material
that already exists in electronic form to serve some purpose other than
corpus-based language research, such as: publication of books, periodicals
or daily news; archival preservation of public, commercial or government
transactions; online discussions on various topics among diverse interest
groups; and so on. The problem is that each data source has its own unique
set of needs and conventions that dictate the data formats used to store
and transport its particular content -- as well as its own rate of failure
in making sure the data satisfy its needs and conventions.
The task for the LDC, working on behalf of corpus researchers, is to
design and apply the tools needed to distill each source into a common,
standardized form that will (1) maximize the usability of the data on
any researcher's chosen computer system, (2) preserve as much information
as possible from the source, and (3) discard as much interference and
noise as possible -- and do all this with a minimum
of manual effort. The present talk will discuss strategies and tools that
have been developed and used at LDC over the years for this purpose.
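To illustrate what distilling a source into a common form might involve, here is a minimal Python sketch that decodes a raw document from its source encoding, normalizes line endings and whitespace, drops control characters, and keeps a little metadata. The function name `normalize_document` and the metadata fields are assumptions made for this example; they do not describe the actual LDC tools.

```python
import unicodedata


def normalize_document(raw: bytes, source_encoding: str, doc_id: str) -> dict:
    """Convert one 'found' document to a simple standardized form."""
    # Decode to Unicode; undecodable bytes become U+FFFD so they can be counted.
    text = raw.decode(source_encoding, errors="replace")
    n_undecodable = text.count("\ufffd")

    # Use one canonical Unicode form and one line-ending convention.
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Discard control characters (noise), keeping newlines and tabs for now.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )

    # Collapse runs of whitespace within lines and drop empty lines.
    lines = [" ".join(line.split()) for line in text.split("\n")]
    body = "\n".join(line for line in lines if line)

    return {
        "id": doc_id,
        "source_encoding": source_encoding,  # preserved as meta-data
        "undecodable_chars": n_undecodable,  # a rough data-quality signal
        "text": body,
    }


doc = normalize_document(b"Caf\xe9  au lait\r\n\r\n\r\nSecond\x00 line \r\n",
                         "latin-1", "doc001")
print(doc["text"])
```

Counting undecodable characters is one cheap way to notice when a source fails to follow its own conventions, for example when a file is not in the encoding it claims.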
|
| 10.31.2002 |
1:00-3:00 |
LDC |
Xiaoyi Ma: BITS and other Machine Translation Collection Projects
Shudong Huang: Overview of Machine Translations |
|
|
|
|
|
BITS and other Machine Translation Collection Projects
Xiaoyi Ma, Mark Y. Liberman
Parallel corpora are a valuable resource for machine translation, multilingual
text retrieval, language education and other applications, but for various
reasons their availability is very limited at present. Noting that the
World Wide Web is a potential source of parallel text, researchers
are working to mine the Web for large collections
of bitext. This paper presents BITS
(Bilingual Internet Text Search), a system which harvests multilingual
texts over the World Wide Web with virtually no human intervention. The
technique is simple, easy to port to any language pair, and highly
accurate. Experiments on the German-English pair showed
that the method is very successful.
PostScript and PDF
versions of this paper are available.
|
| 12.19.2002 |
12:00-2:00 |
LDC |
Mark Liberman: Mining the Bibliome: Information Extraction from Biomedical
Text |
|
Our goal is qualitatively better methods for automatically
extracting information from the biomedical literature, relying on recent
progress and new research in three areas: high-accuracy parsing, shallow
semantic analysis, and integration of large volumes of diverse data. We
are focusing initially on two applications: drug development, in collaboration
with researchers in the Knowledge Integration and Discovery Systems group
at GlaxoSmithKline, and pediatric oncology, in collaboration with researchers
in the eGenome group at The Children's Hospital of Philadelphia. These applications,
worthwhile in their own right, provide excellent test beds for broader
research efforts in natural language processing and data integration.
In particular, we propose to develop and test new general methods for
information extraction from text, based on on-going research at Penn in
corpus-based algorithms for parsing, predicate-argument analysis and reference
resolution. We will collaborate with groups led by Robert Gaizauskas at
the University of Sheffield (UK) and by Jun-ichi Tsujii at the University
of Tokyo (Japan), as well as with the GSK and CHOP groups, in applying
these general methods to particular problems in biomedical information
extraction. The GSK group has already made effective use of the best available
information-extraction technology, so that the new techniques can be assessed
in the drug-development application for their added value relative to
the state of the art.
To give a simple, concrete example, we want a program that will read
a phrase like
Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated
activities with Ki values of 45.1--271.6 μM
and add to a database a set of entries whose ordinary-language presentation
is
amiodarone inhibits CYP2C9 with Ki=45.1--271.6
amiodarone inhibits CYP2D6 with Ki=45.1--271.6
amiodarone inhibits CYP3A4 with Ki=45.1--271.6
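The mapping from the phrase to the three entries above can be sketched with a toy pattern matcher in Python. This is only an illustration of the desired input and output: the regular expression is written for this one sentence shape, and the function name `extract_inhibitions` is hypothetical. It is not the parsing and predicate-argument approach the project proposes.

```python
import re

# Hand-written pattern for sentences of the form
# "<drug> [adverb] inhibited <targets>-mediated activities with Ki values of <range>".
PATTERN = re.compile(
    r"(?P<drug>\w+)\s+(?:\w+\s+)?inhibited\s+"
    r"(?P<targets>.+?)-mediated\s+activities\s+"
    r"with\s+Ki\s+values\s+of\s+(?P<ki>[\d.]+--[\d.]+)"
)


def extract_inhibitions(sentence: str) -> list:
    """Return one 'drug inhibits target with Ki=range' string per target."""
    # Join line-wrapped text into a single line before matching.
    sentence = " ".join(sentence.split())
    match = PATTERN.search(sentence)
    if match is None:
        return []
    drug = match.group("drug").lower()
    # Distribute the shared predicate over each coordinated enzyme name.
    targets = re.findall(r"CYP\w+", match.group("targets"))
    return [f"{drug} inhibits {t} with Ki={match.group('ki')}" for t in targets]


phrase = (
    "Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated\n"
    "activities with Ki values of 45.1--271.6 \u03bcM"
)
for entry in extract_inhibitions(phrase):
    print(entry)
```

A pattern like this breaks as soon as the wording changes, which is one reason to base extraction on syntactic and shallow semantic analysis instead.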
This project will also address several database research problems, including
methods for modeling complex, incomplete and changing information using
semistructured data, and also ways to connect the text analysis process
to an information integration environment that can deal with the wide
variety of extant bioinformatic data models, formats, languages and interfaces.
Such problems are central to current database research at Penn. Our work
will build on the progress represented by Penn's K2/Kleisli data integration
environment, by extending it to deal with semi-structured data. This will
be important in managing the information that is extracted, and also in making
use of the many specialized data resources that can be brought to bear
in generating the high-accuracy text analysis required to extract the
information effectively in the first place. Because the K2/Kleisli system
is widely used in bioinformatics applications, and in particular has been
used for several years at GSK, these extensions will facilitate collaboration
with biomedical researchers as well as supporting the IE research itself.
The engine of recent progress in language processing research has been
linguistic data: text corpora, treebanks, lexicons, test corpora for information
retrieval and information extraction, and so on. Much of this data has
been created by Penn researchers and published by Penn's Linguistic Data
Consortium. As part of the proposed project, we will develop and publish
new linguistic resources in three categories: a large corpus of biomedical
text annotated with syntactic structures (Treebank) and shallow
semantic structures ("proposition bank" or Propbank);
a large set of biomedical abstracts and full-text articles annotated with
entities and relations of interest to researchers, such as enzyme inhibition
by various compounds, or genotype/phenotype connections (Factbanks);
and broad-coverage lexicons and tools for the analysis of biomedical texts.
|
| 01.16.2003 |
12:00-2:00 |
LDC |
Stephanie Strassel: DASL |
|
Data and Annotations for Sociolinguistics (DASL): Using
digital data to address issues in sociolinguistic theory
Stephanie Strassel, Linguistic Data Consortium
A longstanding focus of sociolinguistic research has been the quantitative
analysis of language variation and change, an endeavor that necessarily
begins with the empirical observation and statistical description of linguistic
behavior. Current technology encourages the collection and analysis of
such data, and even the presentation and publication of research findings,
wholly within the digital domain. Within the field of human language technology,
such benchmark data has proven to be an essential ingredient for progress,
as it reduces the cost of analytical infrastructure within research communities,
frees researchers to focus on their interests, encourages collaboration
and reduces impediments to new participants. However, most empirical data
within sociolinguistics continues to be collected and analyzed by individuals
or individual research groups, and is never made available to the wider
research community. This proprietary approach to data hampers collaboration,
the replication of studies and the comparison of models, methods and results,
all necessary components of rigorous science. The prospect of digital
data sharing in sociolinguistics also raises theoretical and methodological
questions: Can sociolinguists make effective
use of existing corpora? Do the insights gained from corpus
data differ qualitatively from data most commonly used in quantitative
sociolinguistics, namely recordings of sociolinguistic interviews?
What are the best practices for the
creation of new digital resources for sociolinguistic
research?
The project on Data and Annotations for Sociolinguistics (DASL), currently
underway at the Linguistic Data Consortium with support from NSF via the
Talkbank project, begins to address these issues. DASL investigates the
use of digital data in sociolinguistics through a series of case studies
involving both analysis of variation in existing corpora and the creation
of new data sets. This paper will introduce DASL's goals, assumptions,
data and tools and will review annotation and
corpus creation efforts and results to date.
|
| 01.16.2003 |
12:00-2:00 |
LDC |
Chris Cieri: SocioLinguistics |
|
Towards a Comprehensive, Empirical Analysis of Linguistic
Data: the case of Regional Italian vowel systems
Any empirical study of language relies necessarily upon a body of observations
of linguistic behavior even if the study fails to formally acknowledge
its corpus. The decisions one makes in approaching data affect research
profoundly by opening some avenues of inquiry while blocking others. Looking
across research communities as diverse as sociolinguistics and speech
technologies, one finds methods that may be integrated in order both to
broaden research possibilities and to perform research more efficiently.
This presentation explores the relationship among data, tools, annotation
(or coding) processes and the research they support. It will focus specifically
on the quantitative analysis of linguistic variation. The data come from
a series of sociolinguistic interviews undertaken to investigate the modeling
of variation in the regional speech of central Italy.
After describing the motivation for this study, I will demonstrate a
series of tools, processes and data formats that permit a comprehensive
yet rapid analysis of vowel systems. Specifically, I will demonstrate
tools for transcription and segmentation, lexicons and search tools that
automatically select and categorize tokens of interest from the transcripts,
batch processes that perform acoustic analyses of the selected tokens
and an interface for managing and adding human judgments to these analyses.
In the process I will offer a particular perspective on tool development,
favoring information retention and annotator efficiency over computational
efficiency and portability.
The tools and processes I will demonstrate were designed to interact
with each other without losing precious information from the original
audio such as the phonetic and lexical context of each vowel, the time
and situation in which it was uttered, the demographic features of the
speaker and so on. As a result, they make it possible or even trivial
to explore issues that are otherwise impossible or at least difficult
to investigate, such as the interaction of phonological variation as a
function of speech rate and time within an interview.
|
| 03.20.2003 |
12:00-2:00 |
LDC |
David Miller: Collections |
|
This presentation will discuss the various speech collection
projects undertaken at the Linguistic Data Consortium at the University
of Pennsylvania from 1995 to the present. Included will be a discussion
of telephone speech data collection projects and on-site speech collection
projects. Data collection projects covered in this talk will include:
CallHome, CallFriend, Switchboard, ROAR and FISHER.
|
| 04.10.2003 |
12:00-2:00 |
LDC |
Mohamed Maamouri & Tim Buckwalter: Arabic Language:
Issues and Perspectives |
|
The presentation will start from the early standardization
of Arabic and lead to the emergence of 'diglossia' and its linguistic
and sociolinguistic consequences. The dominant attitudes towards linguistic
reforms will also be presented. The second focal point will be on the
Arabic reading process, its challenges and its consequences for reading
performance and for education in general.
In a connected second part of the same presentation, Tim Buckwalter will
focus on Arabic NLP issues. He will then present his morphological
analyzer and his lexicon. A brief overview of the LDC Arabic Treebank
project will follow.
| |
| |
|
|
|
| |
|