| Linguistic Data Consortium | |
<sr 115.45> <male, spkr_1>
<t 129.54> <male, Bill_Clinton>
<t 135.67> <female, spkr_2>
| start of <section type=report> news story section |
| start of <section type=filler> filler section that was classified as < sn |
| start of <section type=non-news> non-news section - commercials, etc. |
| start of (non-initial) speaker turn within section |
| turn interval breakpoints (not accurate) |
| end of turn within section, followed by a non-speech region |
| start of overlap region (speaker one is interrupted by speaker two) |
| end of overlap where speaker one stops and speaker two continues |
| end of overlap where speaker two stops and speaker one continues |
| Punctuation | Include punctuation for ease of transcription | ,.? |
| Numerical sequences | Written in full | twenty-five, one hundred and six |
| Non-standard contractions | spell out in full - gonna, wanna | going to, want to |
| Misc speaker noises | indicate occurrences | {breath}, {laugh}, {cough}, {sneeze} |
| Titles of address | Written in full | Doctor, Mister |
| Proper names | ?? | ?? |
| Acronyms, pronounced letters | use tilde ~ | ~FBI, Washington ~DC |
| Interjections | uh-huh, mhm, yeah, uh-oh, whoa | |
| Non-lexemes | % | ah, eee, eh, ew, ha, hee, huh, hm, oh, oo, um, uh |
| Word fragments | indicate partial word with - followed by restart -- | inspecto- -- |
| Pronounced Acronyms | @ | @NAFTA, @AIDS |
| Made-up word | + | +poodlish |
| Difficult-to-understand words: make a best-guess attempt | (( )) | we went to (( )) to grab a bite to eat. |
| Background noise, non-speech events within a turn (also see usage of <e convention) | Indicated | [[NS]] |
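
The markers in the table above can be recognized mechanically. The following is a minimal sketch, not part of the LDC tools: the category names and the function names `classify_token` and `classify_line` are hypothetical, and it assumes that `%` is written as a prefix on non-lexemes (for example `%um`), which the table does not state explicitly.

```python
import re

# Marker patterns taken from the conventions table; category names are illustrative.
TOKEN_PATTERNS = [
    ("speaker_noise", re.compile(r"^\{[a-z]+\}$")),       # {breath}, {laugh}
    ("background_noise", re.compile(r"^\[\[NS\]\]$")),    # [[NS]]
    ("spelled_acronym", re.compile(r"^~[A-Za-z]+$")),     # ~FBI
    ("pronounced_acronym", re.compile(r"^@[A-Za-z]+$")),  # @NAFTA
    ("made_up_word", re.compile(r"^\+[A-Za-z]+$")),       # +poodlish
    ("non_lexeme", re.compile(r"^%[a-z]+$")),             # assumed prefix form: %um
    ("restart", re.compile(r"^--$")),                     # restart marker
    ("fragment", re.compile(r"^[A-Za-z']+-$")),           # inspecto-
]


def classify_token(token):
    """Return the first matching category name, or "word" for plain text."""
    for name, pattern in TOKEN_PATTERNS:
        if pattern.match(token):
            return name
    return "word"


def classify_line(line):
    """Split a transcript line on whitespace and classify each token."""
    return [(tok, classify_token(tok)) for tok in line.split()]
```

For instance, `classify_line("inspecto- -- {laugh}")` labels the three tokens as a fragment, a restart, and a speaker noise. Tokens with attached punctuation (such as `~DC,`) fall through to "word" in this sketch.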
Gender
The possibilities are "male", "female", "child",
"altered", and "unison". "Unison" occurs in situations where two or
more individuals say the same thing at the same
time.
SHORTCUTS for gender insertion
Proper names
Whenever possible, include the proper name of the
speaker. Examples of proper names include Jacques_Cousteau, William_Cohen,
and Madeleine_Albright.
If a speaker is not identified within a recording, a unique numerical index is used. For the convenience of the transcribers, a broad categorical identification may be used. The two categories currently supported are Reporter and Speaker.
Reporter refers to either the anchor of the news broadcast or the reporter on location giving the story. Speaker, on the other hand, refers to anyone interviewed on tape by the Reporter, when that person is not identified by name. Numbers do not overlap: each successive anonymous speaker has a unique number, regardless of the category the speaker is assigned to. For example, the following sequence is entirely possible:
reporter_1
reporter_2
spkr_3
spkr_4
reporter_2 (assuming it is the same voice as the previous reporter_2)
reporter_6 (a new reporter distinct from the two above)
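
The shared numbering across categories can be sketched as follows. This is an illustration only: the class name `SpeakerIndex` and the `voice_id` key are hypothetical, and the sketch numbers speakers consecutively, which is one way (not the only way) to keep every index unique.

```python
class SpeakerIndex:
    """Assign labels such as reporter_1 or spkr_3 from a single shared counter."""

    def __init__(self):
        self._next = 1       # next unused index, shared by all categories
        self._labels = {}    # voice_id -> label already assigned

    def label(self, voice_id, category="spkr"):
        """Return the existing label for a known voice, or mint a new one."""
        if voice_id not in self._labels:
            self._labels[voice_id] = f"{category}_{self._next}"
            self._next += 1
        return self._labels[voice_id]
```

A voice heard again keeps its earlier label, so a returning reporter is still `reporter_2` rather than receiving a new number.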
Native, non-native, and altered
In English broadcasts, "native" speakers are speakers of standard
North American dialects. These are not marked.
"Non-native" speakers are speakers with a foreign
accent, including British English speakers.
"Altered" is used to tag deliberately altered
voice patterns, for instance a disguised
informant's speech, or machine-generated speech.
Examples
<sr 1.402> <<male, Leon_Harris>>
<sr 158.244> <<female, Joie_Chen>>
<t 196.813> <<male, spkr_1>>
<t 498.314> <<male, non-native, spkr_3>>
<t 567.215> <<male, altered, spkr_4>>
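
Tags of this shape can be split into their fields with a regular expression. The sketch below is an assumption-laden illustration rather than the LDC parser: the function name `parse_turn_tag` and the field names are hypothetical, it accepts both the single and the double angle-bracket forms that appear on this page, and it assumes the gender comes first and the speaker name last.

```python
import re

TAG_RE = re.compile(
    r"<(?P<kind>sr|t|o)\s+(?P<time>\d+(?:\.\d+)?)\s*>"  # e.g. <t 196.813>
    r"\s*<{1,2}(?P<attrs>[^<>]+)>{1,2}"                 # e.g. <<male, spkr_1>>
)


def parse_turn_tag(line):
    """Parse a tag line into a dict, or return None if it does not match."""
    m = TAG_RE.match(line.strip())
    if m is None:
        return None
    attrs = [a.strip() for a in m.group("attrs").split(",")]
    return {
        "kind": m.group("kind"),
        "time": float(m.group("time")),
        "gender": attrs[0],
        # anything between gender and name, e.g. "non-native" or "altered"
        "qualifiers": attrs[1:-1],
        "speaker": attrs[-1] if len(attrs) > 1 else None,
    }
```

Applied to `<t 498.314> <<male, non-native, spkr_3>>`, this yields gender `male`, qualifier `non-native`, and speaker `spkr_3`.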
- During interruptions, <o> and a timestamp are used to indicate the start of the overlapping speech region corresponding to the interrupting speaker.
- An overlapping speech region is determined by overlapping word boundaries rather than by the exact point in the waveform, which might require splitting words.
<b 123.456>
The crowd was furious.
<b 124.567>
[[NS]]
<b 128.987>
Calm was soon restored by the arrival of the riot police.
We're jus-- just waiting for that uh tha-- that report
to to come in.
<t 148.57>
Gunfire filled the air.
<e 154.50>
<t 170.89>
That sound greeted early morning visitors on Tuesday.
<t 223.456> <<male, spkr_1>>
<b 225.678>
<e 230.302>
<t 232.563> <<female, spkr_3>>
Speakers start simultaneously
Create start times for the speakers that are about one tenth of a second apart, and insert the overlap tag.
<t 1123.176> <male, Jacques_Cousteau>
about the same thing I
<o 1123.276> <female, spkr_2>
SPEAKER1:understand {laugh}
SPEAKER2:well,
<e1 1124.256>
oceanography is new exploration and we're not
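
The one-tenth-of-a-second offset used in the example above can be generated as follows. This is a hypothetical helper (`simultaneous_start_tags` is not part of the transcription tools) that only builds the two tag strings; the three-decimal timestamp format is an assumption based on the examples on this page.

```python
def simultaneous_start_tags(start, first, second, gap=0.1):
    """Build the <t> and <o> tags for two speakers who start together.

    `first` and `second` are (gender, name) pairs; `gap` is the offset in
    seconds between the two start times.
    """
    t_tag = f"<t {start:.3f}> <{first[0]}, {first[1]}>"
    o_tag = f"<o {start + gap:.3f}> <{second[0]}, {second[1]}>"
    return t_tag, o_tag
```

Calling it with a start time of 1123.176 and the two speakers above reproduces the `<t 1123.176>` and `<o 1123.276>` tags shown in the example.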
Checking and separation of unintelligible (( )) speech
<t 859.405> <<male>>
Not only do we methodically destroy the coastal fringe
<b 863.598>
but we also throw back our toxic *1(( ))*2 directly in the sea
<b 868.453>
or under the sea when we feel ashamed.
i) Use Ctrl-c s to "send" the segment to the waveform window. Then find the section that cannot be understood and isolate it as you would if you were placing breakpoints.
ii) Return to the text area and place the cursor in front of the brackets (*1).
iii) Hold Alt and the middle mouse button (M2), drag the cursor over the (( )) region, and release M2 (*2). The region is not highlighted to show that it is selected, but if the selection succeeded you will be prompted to apply the change.
Syntax Checking
Common messages include:
- time-stamp without text data?
  The timestamp has no corresponding transcript data.
- time-stamp follows non-empty line
  An empty line should follow each transcribed timestamp.
- turn should be on single line
  Only one turn is permitted per line.
- <English ...> should not be inside (( ))
  Foreign speech should not be contained within "guess" brackets.
- <English (()) > has no text content
  Rather, the "guess" should be contained within the foreign-language bracket.
- closing angle (`>') should be followed by space
  Self-evident.
- bracket error with '[]'
  There may be a number of possible causes.
- bracket error with '()'
  There may be a number of possible causes.
- punctuation should be inside `((...))'
  If absolutely necessary, punctuation should go inside these brackets; in most cases, no punctuation will be necessary.
- punctuation should be inside `<... >'
  If there is punctuation immediately outside a <...> bracket, please place it inside the bracket.
- bad spacing around punctuation `.'
  There should be a space after punctuation.
- bad spacing around punctuation `?'
  There should be a space after punctuation.
- closing paren (`))') should be followed by space
  Self-explanatory.
- turn contains ILLEGAL CHARACTER `!'
  Some characters are not allowed within the text, for instance exclamation points. Please let your language leader know when you come across this error warning.
- digits found in text
  There should not be any numerals in the text outside of the timestamps.
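
A few of these checks are simple enough to sketch. The code below is not the LDC syntax checker; it is a minimal illustration, with the hypothetical function name `check_line`, of how four of the messages above (digits, the illegal `!`, spacing after `.` and `?`, and the space after a closing angle) might be produced for a single line. It treats anything between `<` and `>` as a tag and ignores it when looking at the text.

```python
import re

TAG_RE = re.compile(r"<[^<>]*>")  # any angle-bracket tag, e.g. <b 123.456>


def check_line(line):
    """Return a list of checker-style messages for one transcript line."""
    messages = []
    text = TAG_RE.sub(" ", line)  # drop tags so timestamps are not flagged

    if re.search(r"\d", text):
        messages.append("digits found in text")
    if "!" in text:
        messages.append("turn contains ILLEGAL CHARACTER `!'")
    for mark in ".?":
        # punctuation must be followed by whitespace, end of line,
        # or a closing bracket
        if re.search(re.escape(mark) + r"(?=[^\s)>])", text):
            messages.append(f"bad spacing around punctuation `{mark}'")
    if re.search(r">(?=[^\s>])", line):
        messages.append("closing angle (`>') should be followed by space")
    return messages
```

A clean line such as `The crowd was furious.` returns an empty list, while `We saw 25 people!` triggers both the digits message and the illegal-character message.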