Charge to the Sociolinguistic Corpus Team

Situation:
Recent discussions between LDC staff and sociolinguists have revealed growing interest within that community in using widely available, pre-existing linguistic corpora ('public access databases') to investigate sociolinguistic variation on a large scale.  While many sociolinguists are aware of the advantages of shared resources, they remain largely skeptical of the practicality of this approach, particularly with regard to the availability, accessibility and usability of the data.  The LDC has a number of corpora that should appeal to sociolinguists (especially the CallHome and CallFriend corpora, Switchboard and TIMIT); we also have tools (LDC-Online) that have the potential to make this data easily accessible to a non-technical audience.  The LDC is in a position to act as a leader in the promotion of this new approach to sociolinguistic research, by creating a small annotated sociolinguistic corpus, and by publishing the results on the web as an example and an invitation for other sociolinguists to follow.

Charge:
The Sociolinguistic Corpus Team is charged with creating an annotated corpus using existing tools and resources, documenting the corpus creation effort, publishing the results of these efforts on the web and publicizing these efforts within the sociolinguistics community.  Once these tasks have been accomplished, the same or another Team may be charged with creating a written version of the results for publication in an appropriate journal, and with publicizing the results of the team's efforts to a larger community of linguists and other researchers through the TalkBank project (and any other appropriate venues).

Membership:
Chris Cieri, Christine Lattin, Stephanie Strassel, Zhibiao Wu.  Additional team members will be added to act as annotators (to be decided by Team members at the first meeting).  Strassel will act as team owner; Cieri will act as sponsor.

Deliverables:
1.  Choose an appropriate sociolinguistic variable to investigate and a corpus/corpora in which to examine this variation.
2.  Develop a coding scheme (annotation guidelines).
3.  Modify LDC-Online to allow for easy searching of the corpus/corpora, easy audio playback of examples, and coding/annotation of relevant tokens.  Additional modifications must allow for easy exporting of the coding string/annotations to an external program, and the inclusion of speaker demographics within the coding string.
4.  Coordinate with part-time annotation staff to complete annotation.  This involves training, annotation and QC.
5.  Produce documentation of the corpus creation effort: annotation guide, tools & resources, QC efforts, results, comparison with other studies of the variable.
6.  Create a website containing corpus documentation and results.
7.  Publicize the website within the sociolinguistics community and solicit feedback from sociolinguists.
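
Illustration for deliverable 3:
The export requirement in deliverable 3 (a coding string that carries speaker demographics and can be read by an external program) could look roughly like the sketch below.  This is only an illustration, not the LDC-Online design: the tab-delimited layout, the function export_tokens, and every field name (speaker_id, variant, context, sex, age, region) are hypothetical and would be settled by the Team along with the coding scheme.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Speaker:
    # Hypothetical demographic fields; the real set depends on the corpus.
    speaker_id: str
    sex: str
    age: int
    region: str


@dataclass
class Token:
    # One coded instance of the sociolinguistic variable (hypothetical fields).
    speaker_id: str
    variant: str
    context: str


def export_tokens(tokens, speakers):
    """Return tab-delimited text with one row per coded token,
    with the speaker's demographics appended to each row."""
    by_id = {s.speaker_id: s for s in speakers}
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["speaker_id", "variant", "context", "sex", "age", "region"])
    for t in tokens:
        s = by_id[t.speaker_id]  # raises KeyError if the speaker is unknown
        writer.writerow([t.speaker_id, t.variant, t.context, s.sex, s.age, s.region])
    return out.getvalue()


if __name__ == "__main__":
    speakers = [Speaker("sp01", "F", 34, "North")]
    tokens = [Token("sp01", "deleted", "pre-consonantal")]
    print(export_tokens(tokens, speakers), end="")
```

A flat, delimited file of this kind is one option because most statistical and spreadsheet programs can read it without custom tooling; the Team may prefer a different format once the external program is chosen.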

Timeline:
To be decided at the first team meeting.