The SLX Corpus of Classic Sociolinguistic Interviews

The SLX Corpus comprises 8 sociolinguistic interviews with a total of 9 speakers, conducted between 1963 and 1973. The primary interviewer in this collection is William Labov. These interviews were selected as examples of classic sociolinguistic interview methodology, and they represent a range of regional and social dialects. In addition, these speakers were chosen for inclusion because publishing the audio data and corresponding transcripts does not violate any agreement Labov made with them.
 
Speaker      Age  Speech community    Occupation         Tapes  Others present  Minutes  Words  Unique words
Adolphus H.  82   Hillsboro, NC       Farmer             2      3               85       9660   1494
Bobbie A.    32   Ayr, Scotland       Saw Doctor         1      1               44       8990   1769
Henry G.     61   E. Atlanta, GA      Railroad Mechanic  3      5               112      20012  2372
Jerry T.     20   Leakey, TX          Gas Attendant      2      1               66       11264  1700
Joe D.       21   Liverpool, England  Docker             2      0               100      19798  2515
Eddie M.     21   Liverpool, England  Docker             2      0               100      19798  2515
Kathy D.     15   Rochester, NY       Student            2      2               64       29001  1938
Louise A.    53   Knoxville, TN       Mother             3      0               76       11348  1521
Rose B.      36   New York, NY (LES)  Seamstress         3      3               60       12184  1938
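
The Minutes, Words and Unique words columns can be combined into rough per-speaker measures. The following is a minimal Python sketch, not part of the corpus tooling: the function names are hypothetical, and it assumes that Minutes is the recording length and that Words and Unique words count the target speaker's tokens and types. If Minutes includes interviewer speech, words per minute will understate the speaker's actual rate.

```python
# Figures copied from the speaker table: (speaker, minutes, words, unique words).
# Joe D. and Eddie M. are listed with identical totals in the table,
# so they are kept here as separate rows with the same numbers.
SPEAKERS = [
    ("Adolphus H.", 85, 9660, 1494),
    ("Bobbie A.", 44, 8990, 1769),
    ("Henry G.", 112, 20012, 2372),
    ("Jerry T.", 66, 11264, 1700),
    ("Joe D.", 100, 19798, 2515),
    ("Eddie M.", 100, 19798, 2515),
    ("Kathy D.", 64, 29001, 1938),
    ("Louise A.", 76, 11348, 1521),
    ("Rose B.", 60, 12184, 1938),
]


def words_per_minute(minutes, words):
    """Transcribed words divided by recording minutes."""
    return words / minutes


def type_token_ratio(words, unique_words):
    """Unique words (types) divided by total words (tokens)."""
    return unique_words / words


def summarize(rows):
    """Return (speaker, words per minute, type/token ratio), rounded for display."""
    return [
        (name, round(words_per_minute(m, w), 1), round(type_token_ratio(w, u), 3))
        for name, m, w, u in rows
    ]


if __name__ == "__main__":
    for name, wpm, ttr in summarize(SPEAKERS):
        print(f"{name:<12} {wpm:>6} words/min  type/token {ttr}")
```

Type/token ratios fall as a sample grows, so they are best compared between speakers with similar word counts.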
 

Processing: Digitization, Segmentation and Transcription

The interview sessions were digitized from open reel tapes onto DAT/disk at 16-bit resolution and a 44 kHz sampling rate. The monaural signal was passed through 2 channels at levels differing by 20% to capture the best digital copy in a single pass. Technicians monitored the recording process and adjusted for sustained changes in speech levels. The resulting files show no significant digital clipping.
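
The source does not say how the better of the two channels was chosen. One plausible reading is that the higher-level channel is preferred unless it clips, in which case the lower-level channel is used instead. The Python sketch below illustrates that reading only; the function names, the "hot"/"safe" labels and the zero-tolerance threshold are all assumptions.

```python
FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def clipped_count(samples, full_scale=FULL_SCALE):
    """Count samples at or beyond full scale, positive or negative."""
    return sum(1 for s in samples if abs(s) >= full_scale)


def pick_channel(hot, safe):
    """Prefer the higher-level ("hot") channel unless it contains clipped samples.

    Assumption: any clipped sample disqualifies the hot channel. A real
    workflow might tolerate a small number of clipped samples.
    """
    return "hot" if clipped_count(hot) == 0 else "safe"
```

Recording both levels at once gives a fallback for loud passages without a second pass over the tape.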

In the next stage, the files were processed to produce a single file for each speaker. The target speaker was distinguished from the interviewer, other speakers, silence and noise. Each audio file was transcribed. A first pass of segmentation identified basic utterance boundaries. Sentence or phrase boundaries, significant pauses and breath groups were identified and marked in the second pass. 

The transcripts are written in standard orthography with normal punctuation. No attempt has been made to correct speakers' pronunciation or grammar, and problematic or incomprehensible sections are transcribed according to special conventions. Unintelligible speech and doubtful best guesses, for example, ((are enclosed in double parentheses)). Non-standard lexical items are marked with special symbols. All transcripts have undergone at least two further passes: the first to catch errors and revisit unintelligible sections, the second to allow a native speaker of British English to check for errors arising from unfamiliarity with cultural items, slang or pronunciation.
 
Is that ((Hugh Potty))? Is that how you put it?
She done her lovely! She done a wobbler!
Bloody (( )) uh. Bloody nutters, youse are.
All ((amber)) heads. All them birds.
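
The ((...)) convention in the examples above is easy to process automatically. The following Python sketch is an illustration only, not an LDC tool: the function names are hypothetical, and it assumes that an empty (( )) span marks wholly unintelligible speech and that spans are never nested.

```python
import re

# Matches one ((...)) span; the inner text may be empty or whitespace only.
UNCERTAIN = re.compile(r"\(\(\s*(.*?)\s*\)\)")


def uncertain_spans(line):
    """Return the text inside each ((...)) span; '' means nothing was guessed."""
    return [m.group(1) for m in UNCERTAIN.finditer(line)]


def strip_markers(line):
    """Keep the transcriber's best guesses, drop empty spans, tidy the spacing."""
    text = UNCERTAIN.sub(lambda m: m.group(1), line)
    return re.sub(r"\s{2,}", " ", text).strip()


if __name__ == "__main__":
    for line in [
        "Is that ((Hugh Potty))? Is that how you put it?",
        "Bloody (( )) uh. Bloody nutters, youse are.",
    ]:
        print(uncertain_spans(line), "|", strip_markers(line))
```

Keeping extraction separate from stripping lets a word count either include or exclude doubtful material, depending on the analysis.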
 

Variables

An experimental first pass through the data is in progress. Researchers have examined a number of dialect-specific and general variables for each speaker. Currently 90 phonological variables and 60 grammatical variables are being considered, a subset of which appears below as an illustration:

Phonological, phonetic, prosodic: 90 variables
Category        Subcategory example
Consonants      (DH) - voiced interdental fricative
Front vowels    (ae-NAS) - tensing of short-a before nasals
Back vowels     (ahr) - realization of /ahr/ sequence
General vowels  (SCHWA) - realization of schwa
Prosody         (RISE) - rising final intonation
 

Grammatical: 60 variables
Category      Subcategory example
Prepositions  (PREP-DEL) - preposition deletion
Adjectives    (ADJ-WO) - non-standard adjective word order
Determiners   (DET-DEL) - determiner deletion
Negation      (NEG-AINT) - use of ain't in negative constructions
Word order    (WO-LEFTDIS) - left dislocation of initial NP
Pronouns      (POS-LEV) - levelling of possessives to the "mine" paradigm
Verbs         (COP-DEL) - copula deletion
Quantifiers   (Q-BUT) - but as a quantifier
Agreement     (PLURAL) - singular ending on plural noun
 

This approach not only builds up a profile of each speaker's speech but also suggests how the DASL project could eventually allow sociolinguists to examine a general variable across many speech communities.
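
A cross-community comparison of this kind can be sketched in a few lines of Python. The source does not describe the project's coding format, so everything below is assumed for illustration: the record layout and function name are hypothetical, and the speakers, communities and tokens are invented placeholders, not corpus data. Only the variable codes (COP-DEL, NEG-AINT) come from the list above.

```python
from collections import defaultdict

# INVENTED placeholders, not corpus data: speaker -> speech community.
COMMUNITY = {"speaker_a": "community_1", "speaker_b": "community_2"}

# INVENTED placeholder tokens: (speaker, variable code, non-standard variant used?).
TOKENS = [
    ("speaker_a", "COP-DEL", True),
    ("speaker_a", "COP-DEL", False),
    ("speaker_b", "COP-DEL", False),
    ("speaker_b", "NEG-AINT", True),
]


def rate_by_community(tokens, variable, community=COMMUNITY):
    """Share of tokens of `variable` coded as non-standard, per speech community."""
    applied = defaultdict(int)
    total = defaultdict(int)
    for speaker, code, is_applied in tokens:
        if code != variable:
            continue
        place = community[speaker]
        total[place] += 1
        applied[place] += int(is_applied)
    return {place: applied[place] / total[place] for place in total}
```

Communities with no tokens of the requested variable are simply absent from the result, which avoids reporting a rate that rests on no data.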

Value of the Corpus

The SLX corpus is an example of a multi-community sociolinguistic corpus. Containing classic interview material collected by William Labov, it is a valuable teaching tool for sociolinguists that demonstrates successful interviewing techniques. The sound quality is high, and the digitization, segmentation and transcription represent best practice in these areas. Most importantly for the goals of this project, the SLX corpus forms a stable benchmark for training in sociolinguistic interviewing, transcribing and coding.

The corpus is slated for publication in early 2003 by the Linguistic Data Consortium.



Contact Christopher.Cieri@ldc.upenn.edu

Contact Stephanie.Strassel@ldc.upenn.edu

Last modified: Tuesday, 27-Mar-01 16:26:30
© 2000 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.