In-house Transcription Conventions
A description of the characteristics of Hub4 and Hub5 transcripts can be found here.
| | | | |
| Numerals | write out numerals in full: twenty-two | write out numerals in full: twenty-two | write out numerals in full: twenty-two |
| Acronyms pronounced as single letters | ~VCR | N/A | N/A |
| Acronyms pronounced as full words | @NATO | N/A | N/A |
| Pronounced individual letters (spelled-out words) | ~S ~I ~M ~P ~S ~O ~N | ~S ~I ~M ~P ~S ~O ~N | ~S ~I ~M ~P ~S ~O ~N |
| Proper names/places | ^Homer | ^George ^Allen ^Burns | ^George ^Allen ^Burns |
| Partial words (speaker) | absolu- | absolu- | absolu- |
| Partial words (audio signal cuts out) | N/A | +please | #please |
| Mispronounced words | word written in standard orthography: +probably | word written in standard orthography: *probably | word written in standard orthography: +probably |
| Idiosyncratic words | *poodleish | N/A | N/A |
| Speaker noise | {noise written within curly brackets} | first two letters of noise preceded by a slash: /ps | {noise written within curly brackets} |
| Background noise | [text] | N/A | N/A |
| Background noise (extended) | [text/] [/text] | N/A | N/A |
| Semi-intelligible speech | ((text)) | ((text)) | ((text)) |
| Unintelligible speech (token) | (( )) | (( )) | (( )) |
| Unintelligible speech (long span) | [[skip]] | N/A | N/A |
| Repeated section of speech | [[repeat]] | N/A | N/A |
| Foreign language | <language text> | N/A | N/A |
| Speaker aside | <as> text </as> | N/A | N/A |
| Overlapping speech (same channel) | <ov> text </ov> | N/A | N/A |
| Non-lexemes | list of non-lexemes; marked with % | use Hub5 list of non-lexemes plus additional; no markup | use Hub5 list of non-lexemes plus additional; no markup |
| Interjections | list of interjections; no markup | list of interjections; no markup | list of interjections; no markup |
| Punctuation | limited to . , ? | variable | limited to . , ? |
| Capitalization | Standard English | Standard English | Standard English |
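
The markers in the first column of the table lend themselves to simple pattern matching. The following is a minimal sketch, not an official tool: the function names, the category labels, and the order in which patterns are tried are all assumptions made for illustration. It covers only the first-column markup and does not handle angle-bracket spans such as foreign-language text, asides, or overlapping speech.

```python
import re

# Split a transcript line into tokens, keeping bracketed spans together
# so that "(( ))" or "{noise with spaces}" is not broken on whitespace.
TOKEN_RE = re.compile(r"\(\(.*?\)\)|\[\[.*?\]\]|\{.*?\}|\[.*?\]|\S+")

# (label, pattern) pairs, tried in order. Labels are illustrative only.
PATTERNS = [
    ("unintelligible", re.compile(r"^\(\(\s*\)\)$")),      # (( ))
    ("semi_intelligible", re.compile(r"^\(\(.+\)\)$")),    # ((text))
    ("span_marker", re.compile(r"^\[\[.+\]\]$")),          # [[skip]], [[repeat]]
    ("speaker_noise", re.compile(r"^\{.+\}$")),            # {noise}
    ("background", re.compile(r"^\[[^\[\]]+\]$")),         # [text], [text/], [/text]
    ("letter", re.compile(r"^~[A-Z]$")),                   # ~S
    ("letter_acronym", re.compile(r"^~[A-Z]{2,}$")),       # ~VCR
    ("word_acronym", re.compile(r"^@\w+$")),               # @NATO
    ("proper_name", re.compile(r"^\^\w+$")),               # ^Homer
    ("idiosyncratic", re.compile(r"^\*\w+$")),             # *poodleish
    ("mispronounced", re.compile(r"^\+\w+$")),             # +probably
    ("non_lexeme", re.compile(r"^%\w+$")),                 # % prefix; the word list is not given here
    ("partial_word", re.compile(r"^\w+-$")),               # absolu-
]


def tokenize(line):
    """Return the tokens of a transcript line, bracketed spans intact."""
    return TOKEN_RE.findall(line)


def classify(token):
    """Return an illustrative category label for a single token."""
    core = token.rstrip(".,?")  # the table limits punctuation to . , ?
    if not core:
        return "punctuation"
    for label, pattern in PATTERNS:
        if pattern.match(core):
            return label
    return "word"


if __name__ == "__main__":
    sample = "^Homer said absolu- {laugh} (( )) ~VCR @NATO, *poodleish."
    for tok in tokenize(sample):
        print(tok, classify(tok))
```

Trying the bracket patterns before the prefix patterns is a design choice in this sketch: it keeps a span such as ((text)) from being misread when its contents happen to start with a marker character.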