Guidelines for Hub4/TDT Transcription 2000
NEW!! Second Passing
- At the Unix prompt, enter
  bc-type tdt train
- Double-click or highlight the desired file, and paste it in at the prompt.
The xwave window will not appear until the first attempt is made to
listen to a segment. Use the "Again" key to start. The transcription
shortcuts show the keyboard commands for listening, scrolling and changing
tags and timestamps in the editor.
Conventions
- word-for-word transcription, using standard orthography and standard capitalization
- remove all text that is not present in the audio, e.g. copyright information
  and speaker identification
- use end-of-sentence punctuation (periods, question marks), commas, and
  hyphens for hyphenated words, but no other punctuation (please remove all
  other extraneous marks, such as quotation marks and double hyphens)
- every speaker turn has to be indicated and timestamped, e.g.
  <t 129.54>
Broadcast Labels
| sr | start of <section type=report>; all files will start with this |
| t  | start of (non-initial) turn within section |
| b  | turn interval breakpoints |
| e  | end of turn within section, followed by a non-speech region |
| o  | start of overlap region (speaker one is interrupted by speaker two) |
| e1 | end of overlap where speaker one stops and speaker two continues |
| e2 | end of overlap where speaker two stops and speaker one continues |
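The labels above could be recognised mechanically. The following is a minimal sketch, assuming every timestamped tag has the form `<label seconds>` as in the examples in these guidelines. The function name `parse_tag` is hypothetical and is not part of the transcription tools. Whether `sr` carries a timestamp is not stated here, so the sketch only accepts tags that do.

```python
import re

# Labels taken from the Broadcast Labels table.
# Longer labels (e1, e2) are listed before "e" so they are tried first.
TAG_RE = re.compile(r"<(sr|e1|e2|t|b|e|o)\s+(\d+(?:\.\d+)?)\s*>")


def parse_tag(line):
    """Return (label, seconds) for the first well-formed tag in line, or None.

    Assumption: label and timestamp are separated by whitespace; a tag
    without that separator is treated as malformed.
    """
    match = TAG_RE.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2))
```

A line with no well-formed tag, such as ordinary transcript text, yields `None`, which makes the function usable as a simple format check.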
for interruptions, use <o> and a timestamp to indicate the beginning of the
overlapping speech region. Overlapping speech is determined by overlapping
word boundaries, rather than by the exact point in the waveform, which may
sever a word in two, e.g.
<t 1122.443>
about the same thing I
<o 1123.276>
SPEAKER1:understand {laugh}
SPEAKER2:well,
<e1 1124.256>
oceanography is new exploration
and we're not
- <e1> indicates that the first speaker has now stopped and the second
  speaker has continued to speak. If BOTH speakers STOP at the same point
  in time, the next speaker turn is indicated by a <t>.
- if the overlap ends with a non-speech section (silence, music, etc.), mark
  the beginning of the non-speech section with <e> and a timestamp
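The overlap rules can be checked on a sequence of tags that has already been read as (label, seconds) pairs. This is only a sketch under assumptions: the name `check_overlaps` is hypothetical, nested overlaps are assumed to be disallowed, and a following `<t>` or `<e>` is assumed to close an open overlap.

```python
def check_overlaps(tags):
    """Return a list of problem descriptions for (label, seconds) pairs."""
    problems = []
    in_overlap = False
    previous_time = None
    for label, seconds in tags:
        # Timestamps are expected to run forward through the file.
        if previous_time is not None and seconds < previous_time:
            problems.append(f"{label} at {seconds} precedes {previous_time}")
        previous_time = seconds
        if label == "o":
            if in_overlap:
                problems.append(f"nested overlap at {seconds}")
            in_overlap = True
        elif label in ("e1", "e2"):
            if not in_overlap:
                problems.append(f"{label} at {seconds} without an open <o>")
            in_overlap = False
        elif label in ("t", "e"):
            # Assumption: a new turn or a non-speech region closes the overlap.
            in_overlap = False
    if in_overlap:
        problems.append("overlap still open at end of input")
    return problems
```

For the overlap example above, `check_overlaps([("t", 1122.443), ("o", 1123.276), ("e1", 1124.256)])` reports no problems.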
The [[NS]] tag can be used when there is an area within a turn that
contains no speech, e.g. a musical interruption or extended background
noise.
<b 123.456>
The crowd was furious.
<b 124.567>
[[NS]]
<b 128.987>
Calm was soon restored by the arrival of the riot police.
indicate disfluencies by using a hyphen to mark partial words; transcribe
pause fillers, e.g.
We're jus- just waiting for that uh tha- that report to to come in.
transcribe standard English contractions as they're spoken: they're, won't,
isn't, don't, etc.
for non-standard contractions like "gonna" and "wanna" spell out the entire
word: going to, want to.
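The spelling-out of non-standard contractions could be automated as a clean-up pass. The sketch below covers only the two forms named above; the function name `expand_nonstandard` is hypothetical, and any further entries in the table would be assumptions to be agreed on separately.

```python
import re

# Only the two forms named in the guideline.
EXPANSIONS = {"gonna": "going to", "wanna": "want to"}

PATTERN = re.compile(r"\b(" + "|".join(EXPANSIONS) + r")\b", re.IGNORECASE)


def expand_nonstandard(text):
    """Spell out non-standard contractions, keeping an initial capital."""
    def replace(match):
        word = match.group(1)
        expanded = EXPANSIONS[word.lower()]
        return expanded.capitalize() if word[0].isupper() else expanded
    return PATTERN.sub(replace, text)
```

Standard contractions such as "they're" and "don't" are left untouched, since they are not in the table.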
identify extended non-speech sections (music, dead air, sound effects)
with <e> and a timestamp at the beginning of the section, followed by <t>
and a timestamp when speech resumes, e.g.
<t 148.57> Sounds of gunfire filled the air.
<e 154.50>
<t 170.89> That sound greeted early morning visitors.
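The length of each non-speech section follows from the `<e>` timestamp and the next `<t>` timestamp. A minimal sketch, assuming the tags are already available as (label, seconds) pairs; the name `non_speech_gaps` is hypothetical.

```python
def non_speech_gaps(tags):
    """Return (start, end, duration) for each <e> that is followed by a <t>."""
    gaps = []
    gap_start = None
    for label, seconds in tags:
        if label == "e":
            gap_start = seconds
        elif label == "t" and gap_start is not None:
            # Round to milliseconds to hide floating-point noise.
            gaps.append((gap_start, seconds, round(seconds - gap_start, 3)))
            gap_start = None
    return gaps
```

For the example above, the gap runs from 154.50 to 170.89, a non-speech section of 16.39 seconds.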
NOTE: Several speakers
In situations where several people are speaking at once and it is very
difficult to make them out, insert an <e> tag at the start of the
confused section. Then start the new turn at the next available clear
section.
<t 223.456> <<male>>
<b 225.678>
<e 230.302>
<t 232.563> <<female>>
use (( )) to indicate words or passages that are hard to understand or
difficult to transcribe accurately
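Passages marked with (( )) can be collected for later review. The sketch below assumes that the double parentheses are never nested; the name `uncertain_passages` is hypothetical.

```python
import re

# Non-greedy match so that two marked passages on one line stay separate.
UNCERTAIN_RE = re.compile(r"\(\((.*?)\)\)")


def uncertain_passages(text):
    """Return the contents of every (( )) span, with outer spaces removed."""
    return [span.strip() for span in UNCERTAIN_RE.findall(text)]
```

An empty (( )) comes back as an empty string, which distinguishes unintelligible speech from a transcriber's best guess.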
spell out all numerical sequences, e.g.
twenty-five, sixty-six, one oh seven
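Digits left in the transcript text can be flagged for spelling out. This is a sketch that assumes timestamps appear only inside angle-bracket tags, which are removed before searching; the name `find_unspelled_numbers` is hypothetical. It deliberately does not convert numbers, because the spoken form ("one oh seven") cannot be recovered from the digits alone.

```python
import re

# Tags such as <t 129.54> legitimately contain digits, so drop them first.
TAG_RE = re.compile(r"<[^>]*>")
# Whole numbers, optionally with "." or "," groups; the word boundaries
# skip digits glued to letters, as in a speaker label like SPEAKER1.
DIGITS_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def find_unspelled_numbers(line):
    """Return the digit sequences in line that still need to be spelled out."""
    return DIGITS_RE.findall(TAG_RE.sub(" ", line))
```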
spell out all titles like "doctor" (instead of Dr.) and "junior" (instead
of Jr.), EXCEPT for Mr., Ms. and Mrs.
indicate proper names with a ^ (this is done so that we can standardize the
spelling of proper names after transcription; the tags will be stripped out
before delivery)
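Collecting the marked names for spelling review and stripping the markers before delivery might look like the sketch below. It assumes the caret is attached directly to the start of the name (for example `^Washington`), which these guidelines do not spell out; both function names are hypothetical.

```python
import re

# Assumption: "^" immediately precedes the name, with no space between.
NAME_RE = re.compile(r"\^([\w'-]+)")


def proper_names(text):
    """Return the distinct marked names, sorted, for spelling review."""
    return sorted(set(NAME_RE.findall(text)))


def strip_name_markers(text):
    """Remove the ^ markers while keeping the names themselves."""
    return NAME_RE.sub(r"\1", text)
```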
acronyms and spoken strings of letters will be indicated with ~, e.g.
~FBI
Washington ~DC
we will use the following set of non-lexemes:
ah
eee
eh
ew
ha
hee
huh
hm
oh
oo
um
uh
other "special" words like interjections and acronyms, which were specially
tagged in some versions of Hub4 transcriptions, will be transcribed as normal
words, i.e. not specially marked, for instance:
yeah, uh-huh, okay, gee, hey, AIDS, NAFTA