LDC Transcribers' Guide
For each file, you will be provided with an initial transcript (created by FDCH) to work with. The annotator's job is to produce a perfect verbatim (word-for-word) transcription of each file, complete with accurate timestamps.
FDCH transcripts provide basic timestamps for the start and end time
of each speaker utterance. (A pause of greater than 1.5 seconds was
considered the start of a new utterance, unless it was obvious that the
speaker had not finished speaking.) Regions in between the end and
start time of a speaker turn indicate a pause (no speech sounds of any
kind). LDC annotators should use the xwaves interface to modify the
FDCH timestamps to provide cleaner, more accurate timestamps.
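The 1.5-second pause rule described above can be illustrated with a short Python sketch. The input format (a list of (start, end, text) tuples for one speaker) and the function name are assumptions made for illustration only; this is not FDCH's actual procedure, and the sketch does not model the exception for speakers who had obviously not finished speaking.

```python
PAUSE_THRESHOLD = 1.5  # seconds; a longer pause starts a new utterance


def group_into_utterances(segments, threshold=PAUSE_THRESHOLD):
    """Group (start, end, text) segments from one speaker into utterances.

    Hypothetical helper: a new utterance begins whenever the silence between
    the end of one segment and the start of the next exceeds `threshold`.
    """
    utterances = []
    for start, end, text in segments:
        if utterances and start - utterances[-1][1] <= threshold:
            # Pause is short enough: extend the current utterance.
            prev_start, _, prev_text = utterances[-1]
            utterances[-1] = (prev_start, end, prev_text + " " + text)
        else:
            # First segment, or pause longer than the threshold.
            utterances.append((start, end, text))
    return utterances
```

A human annotator still has to review the resulting boundaries, since the rule depends on a judgment about whether the speaker had finished.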
Special Transcription Conventions
Some special conventions are used to indicate particular kinds of speech:
Partial Words (audio cuts out)
Partial words (words cut off at the beginning, middle or end of the
word) will be marked with + at the beginning of the word (no space separating
the + from the word). The whole word will be spelled out in standard orthography;
do not try to represent how the word was pronounced. This convention
is used only for words that have been cut off or interrupted by the audio
signal. This notation does not pertain to those examples where
the speaker trails off or does not complete the word.
e.g. Say that again +please.
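As a rough aid for checking this convention, the following Python sketch lists the words marked with a leading +. The pattern and function name are illustrative assumptions, not part of any LDC tool.

```python
import re

# A '+' attached directly to the front of a word marks an audio-clipped word.
# The '+' must start a token (no non-space character immediately before it).
CLIPPED_WORD = re.compile(r"(?<!\S)\+([A-Za-z][A-Za-z'-]*)")


def clipped_words(line):
    """Return the words marked with a leading '+' in a transcript line."""
    return CLIPPED_WORD.findall(line)
```

A + separated from the word by a space is not matched, which makes such formatting slips easy to spot by comparing against a manual count.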
Speaker Noise
For speaker noise that cannot be cut out of spoken phrases or documented
with another convention, curly brackets are used with the type of noise
specified inside.
e.g. {cough} {laugh} {breath}
Background Non-Speaker Noise
Background noise (extended) is marked with [text/] [/text] for the
start / end of non-speaker noise. In the case of the SPINE files, it has
been used to denote strong static.
e.g. [noise/] Armed. Firing [/noise]
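Because every opening tag needs a matching closing tag, a small consistency check can help. The Python sketch below is a hypothetical helper; it assumes that noise spans do not overlap one another (the guide does not say either way) and it ignores bracketed tags without a slash.

```python
import re

# Matches [label/] (opening) and [/label] (closing) noise tags.
NOISE_TAG = re.compile(r"\[(/?)([A-Za-z]+)(/?)\]")


def unbalanced_noise_tags(text):
    """Return a list of problems with [label/] ... [/label] noise spans."""
    problems = []
    open_labels = []  # stack of labels opened but not yet closed
    for lead, label, trail in NOISE_TAG.findall(text):
        if trail and not lead:        # opening tag, e.g. [noise/]
            open_labels.append(label)
        elif lead and not trail:      # closing tag, e.g. [/noise]
            if open_labels and open_labels[-1] == label:
                open_labels.pop()
            else:
                problems.append("unexpected [/%s]" % label)
        # Tags with no slash (such as [INAUDIBLE]) are ignored here.
    problems.extend("unclosed [%s/]" % label for label in open_labels)
    return problems
```

An empty result means every background-noise span in the text is opened and closed in order.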
Partial Words (speaker caused)
For these partial words, that part of the word that is heard should
be transcribed followed by a dash.
e.g. Let's tr- Let's try that again.
Spelled Out Words
If a speaker spells out the letters of a word, each individual letter
of the word should be preceded by a tilde (~) and written with a capital
letter. Each spelled-out letter should be space-separated. The example below
indicates that the speaker said the word 'fear' and then spelled it out.
e.g. It's fear, ~F ~E ~A ~R.
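To show how this markup can be read back, here is a minimal Python sketch that joins each run of tilde-marked letters into a single string. The function name and pattern are assumptions for illustration.

```python
import re

# One or more "~X" tokens (capital letter) separated by single spaces.
SPELLED_RUN = re.compile(r"~[A-Z](?: ~[A-Z])*")


def spelled_words(line):
    """Return each run of ~-marked letters joined into one string."""
    return [run.replace("~", "").replace(" ", "")
            for run in SPELLED_RUN.findall(line)]
```

Letters written in lower case or without separating spaces are not picked up as part of a run, so a result that differs from what was spelled aloud points to a markup error.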
Punctuation and Capitalization
Punctuation and capitalization are not required. Transcribers should
do whatever feels most comfortable.
Hesitation sounds, filled pauses
Transcribers will indicate filled pauses using a standardized spelling.
These hesitation words do not require any special markup.
(*If you believe a speaker uses a word that does not appear on this
list, let us know.)
ach
hee
uh
uh-oh
geez
ay-yi-yi
yuh
ah
hm
uh-huh
whoops
yep
yay
nah
eee
huh
um
oop
yup
er
eh
oh
yeah
oops
jeepers
whoo-hoo
ew
mm
duh
he-hem
oof
ah-ha
ha
mm-hm
whoa
whew
ooh
ay
Mispronounced Words
Mispronounced words will be marked with an asterisk (*). The word should
be spelled in standard orthography. Do not try to represent
how the word was pronounced.
Speaker Noises
Sometimes speakers will make noises in between words. These sounds
are not "words" like our hesitation words. Examples are things like
sshhhhhhhhh, ssssssssssss, pssssssss. Note these sounds with a forward slash (/)
and the first two letters of the sound heard. (Put spaces around
these sounds - do not connect them to the previous/following word).
e.g. Well, I /sh I don't know.
/ss
/ps
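The slash notation illustrated above can be located with a simple pattern. The Python sketch below is a hypothetical checker; it assumes the two letters are written in lower case, as in the examples.

```python
import re

# A slash plus exactly two lower-case letters, standing alone between spaces.
SPEAKER_NOISE = re.compile(r"(?<!\S)/([a-z]{2})(?!\S)")


def speaker_noises(line):
    """Return the two-letter codes of speaker noises marked in a line."""
    return SPEAKER_NOISE.findall(line)
```

Because the pattern requires whitespace (or a line edge) on both sides, a noise attached to a neighbouring word is not matched and can be flagged for correction.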
These sounds should not be confused with elongated words, such
as ssshoot, which should be transcribed in standard orthography -
"shoot".
Hard-to-understand passages
(ph)
FDCH transcribers use (ph) to indicate a phonetic spelling of a word.
This was used when they weren't sure of what was being said.
Pay special attention to anything marked with (ph). You
must remove the (ph) and give your best approximation of what is being
said.
If you cannot determine what is said, place your best guess in double parentheses.
e.g. ((I mean))
[INAUDIBLE]
FDCH transcribers use [INAUDIBLE] when they were completely unable
to even guess at what was being said. These tags must be removed.
Listen to these passages closely and transcribe what you think you hear.
If you're unable to even guess, tag the passage with double empty parentheses,
with one space.
e.g. (( ))
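Since (ph) and [INAUDIBLE] must not survive into the final transcript, a quick scan for leftovers is useful. The following Python sketch is a hypothetical review helper that also collects the double-parenthesis guesses so they can be rechecked.

```python
import re

LEFTOVER_FDCH_TAG = re.compile(r"\(ph\)|\[INAUDIBLE\]")
UNCERTAIN_SPAN = re.compile(r"\(\((.*?)\)\)")


def review_line(line):
    """Return (leftover_tags, uncertain_guesses) for one transcript line.

    leftover_tags: FDCH tags that should already have been removed.
    uncertain_guesses: text inside (( )); an empty string means no guess.
    """
    leftovers = LEFTOVER_FDCH_TAG.findall(line)
    guesses = [guess.strip() for guess in UNCERTAIN_SPAN.findall(line)]
    return leftovers, guesses
```

A finished file should produce an empty leftover list for every line.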
Things to look out for/Miscellaneous
Keywords
FDCH transcribers sometimes use a dash between two keywords (focus
words) in a file. Remove the connecting dash from the keyword
pair. They should be two separate words.
Map Directions
FDCH transcribers sometimes use a dash between map directions (South-Southwest).
Remove the connecting dash from the pair. They should be two separate
words.
Some arbitrary spelling decisions
| Use this | Not this |
| all right | alright |
| OK | okay, ok |
| alrighty | all righty |
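The spelling decisions in the table above could be applied mechanically. This Python sketch is an illustrative assumption rather than an LDC tool; it matches the lower-case variants of "alright" and "all righty" only, and it matches "okay" and "ok" in any letter case.

```python
import re

# (pattern, replacement) pairs taken from the table above.
# "all righty" is handled first so the later rules never see it.
SPELLING_RULES = [
    (re.compile(r"\ball righty\b"), "alrighty"),
    (re.compile(r"\balright\b"), "all right"),
    (re.compile(r"\bokay\b|\bok\b", re.IGNORECASE), "OK"),
]


def normalize_spelling(line):
    """Rewrite the non-preferred spellings in a transcript line."""
    for pattern, replacement in SPELLING_RULES:
        line = pattern.sub(replacement, line)
    return line
```

The word-boundary anchors keep "alrighty" from being rewritten as "all righty" by the second rule.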
Contractions Used
can't
didn't
don't
hasn't
he's
here's
how's
I'd
I'll
I'm
I've
It's
let's
she's
someone's
something's
that's
there's
we'll
we're
we've
what's
where's
won't
you're
you've
can't
couldn't
didn't
doesn't
don't
hasn't
he'll
he's
isn't
it'll
it's
let's
must've
one's
she's
should've
that'll
wasn't
that's
there's
they're
weren't
we'll
we're
we've
you're
what's
won't
wouldn't
could've
you've
Punctuation Used
Checking Keywords
To check the spelling of keywords, go to this directory:
cd /mnt/unagi/speechd10/NRL
and run this command (where keyword is replaced by the word you're searching for):
grep -i keyword fdch_logs
e.g. grep -i peen fdch_logs
Sample Completed Transcript