The SLX Corpus of Classic Sociolinguistic Interviews
The SLX Corpus comprises 8 sociolinguistic
interviews with a total of 9 speakers, conducted between 1963 and 1973.
The primary interviewer in this collection is William Labov. These interviews
were selected as examples of classic sociolinguistic interview methodology,
and represent a range of regional and social dialects. In addition,
these speakers have been targeted for inclusion within the collection because
publication of the audio data and corresponding transcripts does not violate
any agreement Labov made with the speakers.
Processing: Digitization, Segmentation and Transcription
The interview sessions were digitized from open reel tapes onto DAT/disk at 16-bit, 44 kHz sampling. The monaural signal was passed through 2 channels at levels differing by 20% to capture the best digital copy in a single pass. Technicians monitored the recording process and adjusted for sustained changes in speech levels. The resulting files show no significant clipping in the digital domain.
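The benefit of recording one monaural signal at two levels can be illustrated with a minimal sketch. The text does not say how the better channel was chosen, so the rule below (prefer the channel with fewer full-scale samples, and on a tie the one with the higher peak) is an assumption, and the function names `count_clipped`, `peak` and `pick_channel` are hypothetical.

```python
FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def count_clipped(samples, full_scale=FULL_SCALE):
    """Count samples sitting at the 16-bit limits (+32767 or -32768)."""
    return sum(1 for s in samples if s >= full_scale or s <= -full_scale - 1)


def peak(samples):
    """Largest absolute sample value, or 0 for an empty channel."""
    return max((abs(s) for s in samples), default=0)


def pick_channel(channel_a, channel_b):
    """Assumed rule: fewer clipped samples wins; on a tie, the higher peak wins."""
    clipped_a = count_clipped(channel_a)
    clipped_b = count_clipped(channel_b)
    if clipped_a != clipped_b:
        return "a" if clipped_a < clipped_b else "b"
    return "a" if peak(channel_a) >= peak(channel_b) else "b"
```

With invented sample values, a channel that reaches full scale loses to its quieter twin, while two clean channels resolve in favour of the louder one.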
In the next stage, the files were processed to produce a single file for each speaker. The target speaker was distinguished from the interviewer, other speakers, silence and noise. Each audio file was transcribed. A first pass of segmentation identified basic utterance boundaries. Sentence or phrase boundaries, significant pauses and breath groups were identified and marked in the second pass.
The transcripts are written in standard
orthography with normal punctuation. No attempt has been made to correct
speakers' pronunciation or grammar, and problematic or incomprehensible
sections are transcribed according to special conventions. Unintelligible
speech and doubtful best guesses are enclosed in double brackets, ((as in this example)).
Non-standard lexical items are marked with special symbols. All transcripts
have undergone at least two further passes: the first to catch errors and
revisit unintelligible sections, the second to allow a native speaker of
British English to check for errors arising from unfamiliarity with cultural
items, slang or pronunciation.
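The double-bracket convention lends itself to simple automatic checks, for instance listing the doubtful stretches that a later pass should revisit. The following is a minimal sketch, assuming that the double brackets are written as `((` and `))` and that they are never nested. The function names and the sample lines are hypothetical and are not taken from the corpus.

```python
import re

# Non-greedy match of anything between "((" and "))" on a single line.
UNCERTAIN = re.compile(r"\(\((.*?)\)\)")


def uncertain_spans(line):
    """Return the text of each double-bracketed stretch, trimmed of spaces."""
    return [m.group(1).strip() for m in UNCERTAIN.finditer(line)]


def strip_uncertain_markup(line):
    """Drop the double brackets but keep the transcriber's best guess."""
    return UNCERTAIN.sub(lambda m: m.group(1).strip(), line)


# Invented example line, for illustration only.
sample = "I went down to ((the corner)) and then (( )) came back"
for span in uncertain_spans(sample):
    print(repr(span))
```

An empty result for a span, as with the second pair of brackets in the invented line, is treated here as wholly unintelligible speech. That reading is also an assumption.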
An experimental first pass of the data is in progress. Researchers have examined a number of dialect-specific and general variables for each speaker. Currently, 90 phonological variables and 60 grammatical variables are being considered, of which a subset appears below as an illustration:
Phonological, phonetic, prosodic: 90 variables
Grammatical: 60 variables
This approach not only builds up a profile of each speaker's speech, but also gives us an idea of how the DASL project could eventually allow sociolinguists to examine a general variable across many speech communities.
Value of the Corpus
The SLX corpus is an example of a multi-community sociolinguistic corpus. Containing classic interview material collected by William Labov, it is a valuable teaching tool for sociolinguists that demonstrates successful interviewing techniques. The sound quality is high, and the digitization, segmentation and transcription represent best practice in these areas. Most importantly for the goals of this project, the SLX corpus forms a stable benchmark for training in sociolinguistic interviewing, transcribing and coding.
The corpus is slated for publication in early 2003 by the Linguistic Data Consortium.