Data and Annotations for Sociolinguistics: A Corpus-Based Approach to Sociolinguistic Research
Stephanie Strassel and Christopher Cieri
Presented to the Penn Linguistics Colloquium, March 3, 2001
The Data and Annotations for Sociolinguistics (DASL) project investigates
best practices in the use of digital speech corpora to address problems in
sociolinguistic theory.
The quantitative study of linguistic variation is necessarily based upon
empirical observation and
statistical description of linguistic
behavior. Collecting and annotating databases plays a crucial role in quantitative
sociolinguistics. The current state of computing technology encourages
the collection, annotation, analysis and even summarization and presentation
of linguistic behavior wholly within the digital domain. Digital data is
easily shared and that in turn encourages a whole range of positive practices.
However, the use of speech corpora in sociolinguistics also raises questions
both theoretical and methodological. The goal of the DASL Project is to
begin to address these issues via a case study involving the analysis of
a well-documented sociolinguistic variable as it appears (or does not)
in several large, well-documented speech corpora.
This paper reports on the first phase of
DASL: an investigation of -t/d deletion in four large digital
speech corpora spanning a range of speaking styles, from read speech
to casual conversations between intimates. The corpora were collected
for purposes other than sociolinguistic research but can be
re-annotated to fit our needs. -t/d deletion is a well-understood,
stable variable common in
multiple varieties of English. This variable
shows similar patterns of stratification across the many diverse speech
communities in which it
has been studied. A team of non-specialist
annotators, working under the direction of sociolinguists, identifies and
codes tokens of potential
deletion. The team approach allows for
the evaluation of inter-annotator consistency in coding. The annotation
interface lets linguists interact with the corpora and the resulting
annotations via the World Wide Web, so the project can extend to
include multiple sites. The structure of
the DASL Project also encourages collaborative data development and analysis
by providing sociolinguists with raw and annotated data and tools for browsing,
searching, (re)annotating and distributing that data via the Internet.
While we are contributing the four corpora annotated for -t/d deletion
to DASL, other researchers will be encouraged to do the same via "data
exchange"; those who contribute their own data and annotations
will have access to the entire pool of data.
The ability to easily share digital data encourages collaboration via:
The paper reports results from annotation
of the first corpus, the TIMIT Acoustic-Phonetic Corpus of read speech.
This data set contains recordings of over
600 speakers, each reading a set of 10
phonetically rich sentences selected from a larger pool. The corpus (along
with the other corpora to
be analyzed) has already been transcribed
and segmented so that individual speaker turns can be retrieved separately.
Before coding begins, we use a custom-designed sociolinguistic annotation
interface to search the orthographic transcripts via a regular expression
query, identifying potential tokens of interest. Other filters are
applied to further reduce the list of tokens, excluding words whose spelling
wrongly suggests a candidate for deletion (e.g., would). Using this
approach, the 54,387-word TIMIT corpus was quickly reduced to a review
list of 2059 words; from the review list annotators identified 1578 actual
-t/d tokens.
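The search-and-filter step described above might be sketched as follows. This is only an illustration: the pattern, the exclusion list, and the function name review_list are assumptions for this example, not the project's actual query or filters, which the paper does not specify.

```python
import re

# Hypothetical pattern (not the DASL query): a word spelled with a final
# consonant letter followed by t or d, or ending in the suffix -ed.
CANDIDATE = re.compile(
    r"\b[a-z']*(?:[bcdfghjklmnpqrstvwxz][td]|ed)\b", re.IGNORECASE
)

# Hypothetical exclusion list: the spelling suggests a final cluster,
# but the pronunciation has none (the paper's own example is "would").
EXCLUDE = {"would", "could", "should"}


def review_list(transcript):
    """Return candidate words in order of appearance, minus excluded forms."""
    return [
        word
        for word in CANDIDATE.findall(transcript)
        if word.lower() not in EXCLUDE
    ]


print(review_list("She would have missed the last bus."))
```

An orthographic search of this kind over-generates by design (it would also return words such as "bed"), which is why the review list still has to be checked by annotators before coding.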
Once the corpora have been concordanced,
filtered and prepared for annotation, an interactive web-based display
allows annotators to view
each token, listen to the utterance and
view the corresponding waveform, access demographic data and code linguistic
factors. The annotator can simply click on the word to hear it spoken.
Following each token, the interface displays the factors to be coded. Each
factor is shown as a
radio button, and coding a token entails
clicking on the button corresponding to the relevant factor within each
factor group. A comment field also appears after each token for the annotator
to record notes. Results are easily exported to a spreadsheet or statistical
analysis package. Using this approach, annotators have completed coding
the 1578 tokens of potential -t/d deletion in the TIMIT corpus with respect
to four factor groups: status of the dependent variable, morphological
category, preceding segment, and following segment. A VARBRUL analysis of
the TIMIT data considers social factors (speaker age, sex, region, education)
along with the linguistic factors. Additionally, 5% of the tokens
in TIMIT have been re-coded by an independent annotator in order to establish
a measure of inter-annotator consistency.
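The paper does not say which consistency measure is used. As one possibility, agreement between the original and the independent coding of the same tokens could be summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below assumes that choice; the function name and the code labels "deleted" and "retained" are hypothetical.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two annotators' codes on the same ordered tokens."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two non-empty code lists of equal length")
    n = len(codes_a)
    # Proportion of tokens on which the two annotators chose the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each annotator's code frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical code throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes for one factor group (status of the dependent variable).
first_pass = ["deleted", "retained", "deleted", "retained"]
second_pass = ["deleted", "retained", "retained", "retained"]
print(cohen_kappa(first_pass, second_pass))  # 0.5
```

In this toy case the annotators agree on 3 of 4 tokens (0.75) against a chance expectation of 0.5, giving a kappa of 0.5. The same function could be applied to each factor group separately.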
In addition to the empirical study of -t/d deletion and the methodological questions concerning the use of published speech corpora in sociolinguistics, the paper addresses several other questions: