TIMIT Annotation Overview


    Basic Info
    Summary statistics
    Linguistic Factors
    Social Factors
    Interaction of Factors
    VARBRUL Results
    Dual Annotation


    Corpus information
     
    • The corpus contains 6300 sentences (54,387 words)
    • Without filters, our regular expression produced 3154 candidate tokens
    • With filters applied, 2059 tokens remained
    • Of these, 1578 were annotated for -t/d deletion (the others were coded N/A)




    Summary statistics for TIMIT

    Overall Deletion Rate
    Total number of tokens deleted: 518 (32.8%)
    Total number of tokens retained: 1060 (67.2%)
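
    The two percentages follow directly from the annotated counts. A minimal Python sketch of the arithmetic (the variable names are illustrative and do not come from the project's own scripts):

```python
# Counts reported above for the annotated tokens.
deleted = 518
retained = 1060
total = deleted + retained

# Percentages rounded to one decimal place, as in the summary.
deletion_rate = round(100 * deleted / total, 1)
retention_rate = round(100 * retained / total, 1)

print(total, deletion_rate, retention_rate)
```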

    Unique tokens in TIMIT
    The TIMIT corpus consists of 630 speakers, each reading 10 phonetically rich sentences (selected from a larger set). Even though the speech is read, few tokens occur multiple times relative to the total number of tokens coded.
     
    Appearances in corpus    Number/Percentage of tokens
    1                        35.4% (n=558)
    2                        3.6% (n=58)
    6                        4.2% (n=66)
    7                        50.6% (n=798)
    14                       1.7% (n=28)
    All other numbers        4.4% (n=69)
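
    A tally like the one above can be sketched as follows, assuming each coded token carries some item key (for instance, the word in its sentence) that recurs when several speakers read the same material. The function name and the sample data are hypothetical and are not TIMIT data:

```python
from collections import Counter

def tokens_by_appearance(items):
    """Map 'times an item appears' -> number of coded tokens in that bucket."""
    per_item = Counter(items)      # item -> number of appearances
    tally = Counter()
    for n in per_item.values():
        tally[n] += n              # all n tokens of this item fall in bucket n
    return dict(tally)

# Hypothetical mini-sample: one entry per coded token.
sample = ["dark", "dark", "dark", "wash", "wash", "kept", "and"]
print(tokens_by_appearance(sample))
```

    Under this reading, the token count in each row is a multiple of its appearance count, which holds for the rows reported above (for example, 798 tokens in the "7" row).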



    Linguistic Factors

    Morphological Category
                                                                  Deletion Rate   Total N
    Monomorphemes                                                 36.6%           1081
    Irregular verbs (all-inclusive)**                             41.1%           107
    Irregular verbs (excluding must)**                            24.7%           73
    Irregular verbs (excluding must, went & all strong verbs)**   15.4%           39
    Regular Verbs                                                 22.6%           513
    **See coding scheme for details of treatment of irregular verbs

    Preceding Environment
                          Deletion Rate   Total N
    Alveolar Nasal        52.8%           432
    Alveolar Fricative    41.7%           391
    Other Fricative       24.7%           73
    Stop                  23.4%           244
    Lateral               16%             162
    Other Nasal           16%             25
    Rhotic                8.7%            252

    Following Environment
                          Deletion Rate   Total N
    Obstruent             52.9%           607
    Rhotic                48.2%           56
    Clustering Glide      41.9%           105
    Lateral               29.4%           17
    Other Glide           21.4%           14
    Pause                 18.3%           252
    Vowel                 13.7%           527

    Identical Preceding/Following Environment
                                             Deletion Rate   Total N
    s_s (processed_soybeans)                 88.9%           36
    obstruent_obstruent (stopped_passing)*   87.0%           46
    liquid_liquid (guard_rail, old_lady)     0%              10
    overall                                  71.4%           56
    *includes s_s environment



    Non-Linguistic Factors

    Education

    Race/Ethnicity

    Age

    Sex
               Deletion Rate   Total N
    Males      31.6%           1077
    Females    35.5%           501

    Region
                         Deletion Rate   Total N
    Southern             38.2%           254
    New York             37.8%           119
    Mixed (Army Brat)    36.8%           76
    New England          35.4%           113
    South Midland        33.2%           229
    North Midland        32.4%           278
    Northern             28.2%           262
    Western              27.5%           247
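
    As a consistency check, the Sex and Region breakdowns should each partition the same 1578 annotated tokens and pool back to the overall 518 deletions. The sketch below recovers approximate deleted counts from the rounded percentages; the helper name `pooled` is illustrative only:

```python
# (deletion rate in %, total N) per group, copied from the Sex and Region tables.
sex = {"Males": (31.6, 1077), "Females": (35.5, 501)}
region = {
    "Southern": (38.2, 254), "New York": (37.8, 119),
    "Mixed (Army Brat)": (36.8, 76), "New England": (35.4, 113),
    "South Midland": (33.2, 229), "North Midland": (32.4, 278),
    "Northern": (28.2, 262), "Western": (27.5, 247),
}

def pooled(table):
    """Return (estimated deleted tokens, total tokens) summed over a table."""
    deleted = sum(round(rate / 100 * n) for rate, n in table.values())
    total = sum(n for _, n in table.values())
    return deleted, total

print(pooled(sex), pooled(region))
```

    Because the counts are reconstructed from rounded rates, agreement with the overall figures is a sanity check on the tables, not a substitute for the underlying token counts.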

     
     



    Interaction of Factors

    Rate of Deletion before PAUSE by Geographical Region


    VARBRUL Results

    First Run
    Eight Factor Groups; 44 Factors    Selected?
    Rule Application                   n/a
    Morphology                         yes
    Preceding                          yes
    Following                          yes
    Sex                                no
    Region                             no
    Age                                no
    Race                               yes
    Education                          yes


    Second Run
    -eliminated non-selected factor groups from first run (Sex, Region, Age)
    -recodes

      1) Factor Group Race: recode Asian/American Indian/Hispanic --> Other (15 tokens)
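
    The recode amounts to collapsing three sparse categories into one. A minimal sketch, assuming the race labels are stored as plain strings (the exact label spellings and the function name are hypothetical):

```python
# Hypothetical label strings; the source names only the categories themselves.
RECODE_TO_OTHER = {"Asian", "American Indian", "Hispanic"}

def recode_race(label):
    """Collapse the sparse categories into 'Other'; leave the rest unchanged."""
    return "Other" if label in RECODE_TO_OTHER else label

labels = ["White", "Black", "Asian", "Hispanic", "Unknown", "American Indian"]
print([recode_race(x) for x in labels])
```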


    RESULTS
    Summary Statistics and Factor Weights
     
    Group              Factor              Code  Retained (%)  Deleted (%)  Total N (%)  Factor Weight
    Morphology         monomorpheme        m     630 (62)      394 (38)     1024 (65)    0.535
                       reg. past tense     p     397 (77)      116 (23)     513 (33)     0.428
                       irregular           i     33 (80)       8 (20)       41 (3)       0.531

    Preceding Segment  lateral             l     135 (84)      26 (16)      161 (10)     0.240
                       alveolar fricative  f     228 (58)      163 (42)     391 (25)     0.635
                       stop                k     187 (77)      57 (23)      244 (15)     0.426
                       rhotic              r     230 (91)      22 (9)       252 (16)     0.161
                       alveolar nasal      n     204 (47)      228 (53)     432 (27)     0.756
                       other fricative     c     55 (75)       18 (25)      73 (5)       0.433
                       other nasal         s     21 (84)       4 (16)       25 (2)       0.390

    Following Segment  rhotic              h     29 (52)       27 (48)      56 (4)       0.650
                       vowel               v     455 (86)      72 (14)      527 (33)     0.245
                       obstruent           b     286 (47)      321 (53)     607 (38)     0.767
                       pause               q     206 (82)      46 (18)      252 (16)     0.305
                       lateral             a     12 (71)       5 (29)       17 (1)       0.380
                       cluster glide       g     61 (58)       44 (42)      105 (7)      0.645
                       other glide         o     11 (79)       3 (21)       14 (1)       0.330

    Race               white               W     991 (68)      464 (32)     1455 (92)    0.489
                       black               L     33 (49)       34 (51)      67 (4)       0.753
                       unknown             U     25 (61)       16 (39)      41 (3)       0.433
                       other               O     11 (73)       4 (27)       15 (1)       0.552

    Education          Bachelors           B     587 (67)      289 (33)     876 (56)     0.514
                       High School         H     133 (64)      74 (36)      207 (13)     0.524
                       Masters             T     250 (71)      100 (29)     350 (22)     0.436
                       PhD                 P     45 (78)       13 (22)      58 (4)       0.357
                       Unknown             K     13 (42)       18 (58)      31 (2)       0.752
                       Associates          A     32 (57)       24 (43)      56 (4)       0.616

    TOTAL                                        1060 (67%)    518 (33%)    1578
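
    To make the factor weights easier to read, the sketch below shows how weights are usually combined in the logistic variable rule model, in which the log-odds of deletion is the log-odds of an input probability plus the log-odds of one weight per factor group. This assumes that the run used that standard model. The input probability is not reported on this page, so `INPUT_PROB` is a placeholder chosen only for illustration:

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))

def predicted_probability(input_prob, weights):
    """Combine an input probability with one factor weight per group."""
    total = logit(input_prob) + sum(logit(w) for w in weights)
    return 1 / (1 + math.exp(-total))

# Second-run weights: monomorpheme, preceding alveolar nasal, following obstruent.
favoring = [0.535, 0.756, 0.767]
# Second-run weights: regular past tense, preceding rhotic, following vowel.
disfavoring = [0.428, 0.161, 0.245]

INPUT_PROB = 0.3  # assumed for illustration only; not reported on this page
print(predicted_probability(INPUT_PROB, favoring))
print(predicted_probability(INPUT_PROB, disfavoring))
```

    Weights above 0.5 push the predicted deletion probability above the input value and weights below 0.5 pull it down. A factor group left out of the list behaves like a weight of exactly 0.5, which contributes nothing to the sum.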
     

     

    Third Run
    -split the irregular verb category into four new categories:

    • semi-weak verbs: includes verbs that have a stem change AND t/d affixation in their past tense (keep > kept; lose > lost)
    • strong verbs: includes verbs that only have a stem change in their past tense, no t/d affixation (hold > held; wind > wound)
    • must: this is a modal verb, and has only one form in the past and present
    • went: there is no transparent connection between this verb's present and past forms (go > went)
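
    The four-way split can be pictured as a lookup from past-tense form to category. The sketch below uses only the example verbs listed above. The dictionary and function names are hypothetical, and a real coding pass would need a much fuller lexicon:

```python
# Forms taken from the examples in the list above (illustrative lexicon only).
IRREGULAR_CLASS = {
    "kept": "semi-weak", "lost": "semi-weak",   # stem change AND t/d affixation
    "held": "strong", "wound": "strong",        # stem change only
    "must": "must",                             # modal with a single form
    "went": "went",                             # suppletive past of "go"
}

def morph_category(word):
    """Return the third-run irregular category, or None if the word is not listed."""
    return IRREGULAR_CLASS.get(word.lower())

for w in ["kept", "held", "must", "went", "walked"]:
    print(w, morph_category(w))
```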


    RESULTS
    Summary Statistics and Factor Weights
     
    Group              Factor              Code  Retained (%)  Deleted (%)  Total N (%)  Factor Weight
    Morphology         monomorpheme        m     611 (64)      347 (36)     958 (61)     0.521
                       reg. past tense     p     392 (76)      121 (24)     513 (33)     0.436
                       strong verb         s     18 (62)       11 (38)      29 (2)       0.476
                       semi-weak verbs     i     26 (67)       13 (33)      39 (2)       0.553
                       must                t     11 (32)       23 (68)      34 (2)       0.747
                       went                w     2 (40)        3 (60)       5 (0)        0.837

    Preceding Segment  lateral             l     135 (84)      26 (16)      161 (10)     0.244
                       alveolar fricative  f     228 (58)      163 (42)     391 (25)     0.635
                       stop                k     187 (77)      57 (23)      244 (15)     0.395
                       rhotic              r     230 (91)      22 (9)       252 (16)     0.161
                       alveolar nasal      n     204 (47)      228 (53)     432 (27)     0.768
                       other fricative     c     55 (75)       18 (25)      73 (5)       0.436
                       other nasal         s     21 (84)       4 (16)       25 (2)       0.383

    Following Segment  rhotic              h     29 (52)       27 (48)      56 (4)       0.649
                       vowel               v     455 (86)      72 (14)      527 (33)     0.248
                       obstruent           b     286 (47)      321 (53)     607 (38)     0.759
                       pause               q     206 (82)      46 (18)      252 (16)     0.313
                       lateral             a     12 (71)       5 (29)       17 (1)       0.372
                       cluster glide       g     61 (58)       44 (42)      105 (7)      0.655
                       other glide         o     11 (79)       3 (21)       14 (1)       0.354

    Race               white               W     991 (68)      464 (32)     1455 (92)    0.489
                       black               L     33 (49)       34 (51)      67 (4)       0.751
                       unknown             U     25 (61)       16 (39)      41 (3)       0.430
                       other               O     11 (73)       4 (27)       15 (1)       0.556

    Education          Bachelors           B     587 (67)      289 (33)     876 (56)     0.512
                       High School         H     133 (64)      74 (36)      207 (13)     0.523
                       Masters             T     250 (71)      100 (29)     350 (22)     0.439
                       PhD                 P     45 (78)       13 (22)      58 (4)       0.361
                       Unknown             K     13 (42)       18 (58)      31 (2)       0.725
                       Associates          A     32 (57)       24 (43)      56 (4)       0.622

    TOTAL                                        1060 (67%)    518 (33%)    1578
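
    One informal way to compare factor groups is the range of each group, that is, its highest factor weight minus its lowest. The sketch below computes the ranges from the third-run weights above. Treating the range as a measure of effect strength is a reading convention assumed here, not a result reported by the run itself:

```python
# Factor weights from the third run, grouped as in the table above.
weights = {
    "Morphology": [0.521, 0.436, 0.476, 0.553, 0.747, 0.837],
    "Preceding Segment": [0.244, 0.635, 0.395, 0.161, 0.768, 0.436, 0.383],
    "Following Segment": [0.649, 0.248, 0.759, 0.313, 0.372, 0.655, 0.354],
    "Race": [0.489, 0.751, 0.430, 0.556],
    "Education": [0.512, 0.523, 0.439, 0.361, 0.725, 0.622],
}

# Range = highest weight minus lowest weight within a group.
ranges = {g: round(max(w) - min(w), 3) for g, w in weights.items()}
for group, r in sorted(ranges.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group:18s} {r:.3f}")
```

    Ranges should be read with the cell sizes in mind: a weight based on very few tokens (for example, went with N=5) can stretch a group's range without being a stable estimate.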
     


    Dual Annotation Results
    Reannotation of 5% of the TIMIT corpus is complete. Details coming soon...



    Updated 3/28/2001
