Situation:
Recent discussions between
LDC staff and sociolinguists have revealed growing interest within that
community in using widely available, pre-existing linguistic corpora ('public
access databases') to investigate sociolinguistic variation on a large
scale. While many sociolinguists are informed about the advantages
of shared resources, they remain largely skeptical of the practicality
of this approach, particularly regarding the availability, accessibility
and usability of the data. The LDC has a number of corpora that should
appeal to sociolinguists (especially the CallHome and CallFriend corpora,
Switchboard and TIMIT); we also have tools (LDC-Online) that have the potential
to make this data easily accessible for a non-technical audience.
The LDC is in a position to act as a leader in the promotion of this new
approach to sociolinguistic research, by creating a small annotated sociolinguistic
corpus, and by publishing the results on the web as an example and invitation
for other sociolinguists to follow.
Charge:
The Sociolinguistic Corpus
Team is charged with creating an annotated corpus using existing tools
and resources, documenting the corpus creation effort, publishing the results
of these efforts on the web and publicizing these efforts within the sociolinguistics
community. Once these tasks have been accomplished, the same or another
Team may be charged with creating a written version of the results for
publication in an appropriate journal, and with publicizing the results
of the Team's efforts to a larger community of linguists and other researchers
through the TalkBank project (and any other appropriate venues).
Membership:
Chris Cieri, Christine Lattin,
Stephanie Strassel, Zhibiao Wu. Additional team members, to be decided
by Team members at the first meeting, will be added as annotators.
Strassel will act as team owner; Cieri will act as sponsor.
Deliverables:
1. Choose an appropriate
sociolinguistic variable to investigate and a corpus/corpora in which to
examine this variation.
2. Develop a coding
scheme (annotation guidelines).
3. Modify LDC-Online to
allow for easy searching of the corpus/corpora, easy audio playback of
examples, and coding/annotation of relevant tokens. Additional modifications
must allow for easy exporting of the coding string/annotations to an external
program, and the inclusion of speaker demographics within the coding string.
4. Coordinate with part-time
annotation staff to complete annotation. This involves training,
annotation and QC.
5. Produce documentation
of the corpus creation effort: annotation guide, tools & resources,
QC efforts, results, comparison with other studies of the variable.
6. Create a website
containing corpus documentation and results.
7. Publicize the website
within the sociolinguistics community and solicit feedback from sociolinguists.
Timeline:
To be decided at the first team
meeting.