Data and Annotations for Sociolinguistics: A Corpus-Based Approach to Sociolinguistic Research
Stephanie Strassel and Christopher Cieri
Presented to the Penn Linguistics Colloquium, March 3 2001
The Data and Annotations for Sociolinguistics (DASL) project investigates best practices in the use of digital speech corpora to address problems in sociolinguistic theory. The quantitative study of linguistic variation is necessarily based upon empirical observation and statistical description of linguistic behavior, so collecting and annotating databases plays a crucial role in quantitative sociolinguistics. The current state of computing technology encourages the collection, annotation, analysis and even summarization and presentation of linguistic behavior wholly within the digital domain. Digital data is easily shared, and that in turn encourages a whole range of positive practices. However, the use of speech corpora in sociolinguistics also raises both theoretical and methodological questions. The goal of the DASL Project is to begin to address these issues via a case study involving the analysis of a well-documented sociolinguistic variable as it appears (or does not) in several large, well-documented speech corpora.
This paper reports on the first phase of DASL: an investigation of -t/d deletion in four large digital speech corpora spanning a range of speaking styles, from read speech to casual conversations between intimates. The corpora were collected for purposes other than sociolinguistic research but can be re-annotated to fit our needs. -t/d deletion is a well-understood, stable variable common to multiple varieties of English, and it shows similar patterns of stratification across the many diverse speech communities in which it has been studied. A team of non-specialist annotators, working under the direction of sociolinguists, identifies and codes tokens of potential deletion. The team approach allows for the evaluation of inter-annotator consistency in coding. The annotation interface allows linguists to interact with the corpora and the resulting annotations via the World Wide Web, so that the project can be extended to multiple sites. The structure of the DASL Project also encourages collaborative data development and analysis by providing sociolinguists with raw and annotated data and tools for browsing, searching, (re)annotating and distributing that data via the Internet. While we are contributing the four corpora annotated for -t/d deletion to DASL, other researchers will be encouraged to do the same via "data exchange": those who make their own contributions of data and annotations will have access to the entire pool of data.
The ability to easily share digital data encourages collaboration via:
The paper reports results from annotation of the first corpus, the TIMIT Acoustic-Phonetic Corpus of read speech. This data set consists of over 600 speakers, each reading a set of 10 phonetically rich sentences selected from a larger pool. The corpus (along with the other corpora to be analyzed) has already been transcribed and segmented so that individual speaker turns can be retrieved separately. Before coding begins, we use a custom-designed sociolinguistic annotation interface to search the orthographic transcripts via a regular expression query, identifying potential tokens of interest. Additional filters then reduce the list further, excluding words that only superficially resemble candidates for deletion (e.g., would). Using this approach, the 54,387-word TIMIT corpus was quickly reduced to a review list of 2059 words; from the review list annotators identified 1578 actual -t/d tokens.
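The search-and-filter step described above can be illustrated with a minimal Python sketch. The abstract does not give the actual query or filter list, so the pattern `CANDIDATE`, the stop list `EXCLUDE` and the function `find_candidates` are all hypothetical. The pattern is purely orthographic and deliberately over-generates; as in the workflow above, it only produces a review list that annotators would then check by ear.

```python
import re

# Hypothetical query (not the DASL project's actual expression): words that
# end in a consonant letter followed by t or d, or in the spelling "-ed".
CANDIDATE = re.compile(
    r"\b[a-z]+(?:[bcdfghjklmnpqrstvwxz][td]|ed)\b", re.IGNORECASE
)

# Hypothetical stop list. "would" is the example given in the text: it ends
# in "ld" in spelling but has no final consonant cluster in speech.
EXCLUDE = {"would", "could", "should"}


def find_candidates(transcript):
    """Return (word, character offset) pairs for potential -t/d tokens."""
    hits = []
    for match in CANDIDATE.finditer(transcript):
        word = match.group(0).lower()
        if word not in EXCLUDE:
            hits.append((word, match.start()))
    return hits


if __name__ == "__main__":
    sentence = "He missed the last bus and would not rest"
    for word, offset in find_candidates(sentence):
        print(offset, word)
```

Because the match is on spelling alone, forms such as "bed" or "need" would also land on the review list; separating those from true tokens is the job of the human review pass, not of the query.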
Once the corpora have been concordanced, filtered and prepared for annotation, an interactive web-based display allows annotators to view each token, listen to the utterance and view the corresponding waveform, access demographic data and code linguistic factors. The annotator can simply click on the word to hear it spoken. Following each token, the interface displays the factors to be coded. Each factor is shown as a radio button, and coding a token entails clicking on the button corresponding to the relevant factor within each factor group. A comment field also appears after each token for the annotator to record notes. Results are easily exported to a spreadsheet or statistical analysis package. Using this approach, annotators have completed coding the 1578 tokens of potential -t/d deletion in the TIMIT corpus with respect to four factor groups: status of the dependent variable, morphological category, preceding segment and following segment. A VARBRUL analysis of the TIMIT data considers social factors (speaker age, sex, region, education) along with the linguistic factors. Additionally, 5% of the tokens in TIMIT have been re-coded by an independent annotator in order to establish a measure of inter-annotator consistency.
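The abstract does not say which consistency measure is used for the re-coded sample. As one possible illustration, the sketch below computes raw percent agreement and Cohen's kappa for a single factor group coded by two annotators. The function names and the labels "deleted" and "retained" are assumptions made for the example, not the project's actual coding scheme.

```python
from collections import Counter


def percent_agreement(first, second):
    """Share of tokens on which two annotators chose the same code."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty code lists of equal length")
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohens_kappa(first, second):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(first)
    observed = percent_agreement(first, second)
    counts_first, counts_second = Counter(first), Counter(second)
    # Chance agreement: product of each annotator's marginal proportions,
    # summed over the codes.
    expected = sum(
        counts_first[code] * counts_second[code] for code in counts_first
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical code throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical codes for the dependent variable on four re-coded tokens.
    original = ["deleted", "retained", "deleted", "retained"]
    recoded = ["deleted", "retained", "retained", "retained"]
    print(percent_agreement(original, recoded))
    print(cohens_kappa(original, recoded))
```

Kappa is shown alongside raw agreement because a factor group dominated by one value can yield high raw agreement by chance alone. Whether that correction suits a given factor group is a judgment for the analyst.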
In addition to the empirical study of -t/d deletion and the methodological questions concerning the use of published speech corpora in sociolinguistics, the paper addresses several other questions: