Supplemental Materials

McMurray, B., & Jongman, A. (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review. doi:10.1037/a0022325

Note 1: Phonetic Analysis

Jongman, Wayland, and Wong (2000) report extensive analyses of the measures from their database that we use here as the basis of our models. They show that, individually, each cue differed as a function of place, sibilance, and/or voicing, and most of these cues also differed as a function of the vowel context and/or the gender of the speaker. However, no attempt was made to compare the amount of variance in each cue due to each factor (although partial η² was reported for many comparisons). Moreover, such an analysis has not been conducted for any of the new cues we measured here. Thus, we evaluated 1) which cues contribute to each categorical distinction; and 2) the contributions of contextual factors (speaker and vowel). This was done with a series of regression analyses that provide a standard effect size measure that can be compared across cues and effects. Crucially, we also used these analyses to highlight and explore the contributions of the newly proposed cues.

In each analysis, a single cue was the dependent variable, and the independent variables were a combination of dummy codes reflecting a single factor of interest, such as fricative identity (7 variables), voicing (1 variable), sibilance (1 variable), or place of articulation (3 variables). In each regression, we first partialed out the effects of speaker (19 dummy codes) and vowel (5 dummy codes) before entering the effect of interest into the model. These regression analyses are necessarily exploratory, and we do not intend to draw broad conclusions from them. They are intended to provide an overall view of these cues and the factors that contribute to their variance.
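To make the hierarchical regression concrete, the following is a minimal sketch of one such analysis in Python (statsmodels), assuming a hypothetical long-format data frame with one row per token and columns named cue, speaker, vowel, and fricative; it illustrates the R²change computation described above and is not the original analysis script.

import pandas as pd
import statsmodels.formula.api as smf

def r2_change(df, cue):
    """R2change for fricative identity after partialing out speaker and vowel."""
    # Step 1: context-only model (speaker and vowel entered as dummy-coded factors).
    base = smf.ols(f"{cue} ~ C(speaker) + C(vowel)", data=df).fit()
    # Step 2: add the factor of interest (here, the 8 fricative categories).
    full = smf.ols(f"{cue} ~ C(speaker) + C(vowel) + C(fricative)", data=df).fit()
    # R2change is the increase in variance explained when fricative identity is added.
    return full.rsquared - base.rsquared

# Example use (df is a hypothetical long-format table of the measured tokens):
# df = pd.read_csv("fricative_cues.csv")
# print(r2_change(df, "M1"))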
Results and Discussion

The results of the regression analyses are summarized in Table S1 (which shows the overall effects of fricative and context) and Table S2 (which shows the specific effects of each feature). There are a number of important results worth highlighting.

Fricative Identity. Every cue was affected by fricative identity. While effect sizes ranged from very large (10 of 24 cues had R²change > .40) to very small (vowel RMS, the smallest: R²change = .011), all were highly significant. Even cues that were originally measured to compensate for variance in other cues (e.g., vowel duration was measured to normalize fricative duration) had significant effects. Spectral moments (especially the mean and variance) were particularly important, but, surprisingly, F2 (which has received substantial attention in the literature) had only a moderate effect (R²change = .119). Interestingly, its effect size was similar to those for F4 (R²change = .121) and F5 (R²change = .117), two cues which have not been previously examined.

Table S1: Summary of regression analyses examining effects of speaker (20), vowel (6), and fricative (8) for each cue. Shown are R²change values. Cells marked "–" were not significant at the p < .05 level.

Cue       Speaker (df=19,2860)   Vowel (df=5,2855)   Fricative Identity (df=7,2848)   Unexplained Variance
MaxPF     .084*                  –                   .493*                            .423
DURF      .158*                  .021*               .469*                            .353
DURV      .475*                  .316*               .060*                            .149
RMSF      .081*                  –                   .657*                            .260
RMSV      .570*                  .043*               .011*                            .376
F3AMPF    .070*                  .028*               .483*                            .419
F3AMPV    .140*                  .156*               .076*                            .628
F5AMPF    .077*                  .012*               .460*                            .451
F5AMPV    .203*                  .040*               .046*                            .712
LF        .117*                  .004+               .607*                            .272
F0        .838*                  .007*               .023*                            .132
F1        .064*                  .603*               .082*                            .251
F2        .109*                  .514*               .119*                            .257
F3        .341*                  .128*               .054*                            .477
F4        .428*                  .050*               .121*                            .401
F5        .294*                  .045*               .117*                            .544
M1        .122*                  –                   .425*                            .451
M2        .036*                  –                   .678*                            .284
M3        .064*                  –                   .387*                            .548
M4        .031*                  –                   .262*                            .704
M1trans   .066*                  .043*               .430*                            .461
M2trans   .084*                  .061*               .164*                            .692
M3trans   .029*                  .079*               .403*                            .490
M4trans   .031*                  .069*               .192*                            .709
+ p < .05   * p < .0001

Some cues could clearly be attributed to one feature more than to another, although no cue was associated with only a single feature. Duration cues were clearly related to voicing (DURF: R²change = .403; DURV: R²change = .055), not to place of articulation (DURF: R²change = .052; DURV: R²change = .004) or sibilance (DURF: R²change = .052; DURV: R²change = .002). The same was true of low-frequency energy (voicing: R²change = .482), although this cue may also be involved in sibilance detection (R²change = .120).
Other cues were clearly related to sibilance. RMSF and F5AMPF were highly correlated with sibilance (RMSF: R²change = .419; F5AMPF: R²change = .394). They were also correlated with place of articulation (RMSF: R²change = .425; F5AMPF: R²change = .401), but when place was considered separately in sibilants and nonsibilants, little effect was seen in either, suggesting that these cues primarily signal sibilance. However, several other cues that were strongly associated with sibilance were also related to place of articulation. MaxPF and F3AMPF, for example, were strongly associated with sibilance (MaxPF: R²change = .260; F3AMPF: R²change = .239), but within sibilants they were also useful for distinguishing alveolars from postalveolars (MaxPF: R²change = .504; F3AMPF: R²change = .444). Thus, these cues seem to be available to make two independent distinctions (sibilance in general, and place of articulation within sibilants).

Of the formant frequencies, F2, F4, and F5 had moderate effects that were primarily limited to place of articulation (F2: R²change = .114; F4: R²change = .119; F5: R²change = .116), and these cues appeared to be similarly useful for sibilants (F2: R²change = .060; F4: R²change = .132; F5: R²change = .101) and nonsibilants (F2: R²change = .057; F4: R²change = .083; F5: R²change = .082).

Separate analyses examined place of articulation in sibilants (alveolar vs. postalveolar) and in nonsibilants (labiodental vs. interdental) (Table S2). While there was a wealth of cues that were highly sensitive to place of articulation in sibilants, few were related to place in nonsibilants, and these showed only moderate to low effect sizes. Of these, the best were F4 (R²change = .083) and F5 (R²change = .082) (two new cues for nonsibilants) and the skewness and kurtosis during the transition (M3trans: R²change = .061; M4trans: R²change = .062). As Table S1 shows, all of these cues are also highly context dependent (F4: R²change = .478; F5: R²change = .339; M3trans: R² = .108; M4trans: R² = .10), suggesting that to take advantage of what little information there is for nonsibilants, listeners may need a compensatory mechanism.

Table S2: Summary of more fine-grained analyses of each cue. Shown is R²change after speaker and vowel have been partialed out of the dataset (these have the same R² values as in Table S1). Missing values were not significant at p < .05. The "Overall" column is the effect of fricative identity in general (8 categories). Other columns show specific features: sibilance, voicing, and place of articulation (4 categories). The effect of place of articulation for sibilants and for nonsibilants was computed on only that subset of the data; all other effects reflect analysis of the entire dataset. Within each column, values are listed in the cue order of Table S1 (MaxPF, DURF, DURV, RMSF, RMSV, F3AMPF, F3AMPV, F5AMPF, F5AMPV, LF, F0, F1, F2, F3, F4, F5, M1, M2, M3, M4, M1trans, M2trans, M3trans, M4trans), with nonsignificant cells omitted.

Overall (df=7,2848): 0.493* 0.469* 0.060* 0.657* 0.011* 0.483* 0.076* 0.460* 0.046* 0.607* 0.023* 0.082* 0.119* 0.054* 0.121* 0.117* 0.425* 0.678* 0.387* 0.262* 0.430* 0.164* 0.403* 0.192*
Sibilance (df=1,2854): 0.260* 0.052* 0.002* 0.419* 0.001+ 0.239* 0.008* 0.394* 0.029* 0.120* 0.001* 0.031* 0.060* 0.034* 0.005* 0.026* 0.010* 0.441* 0.137* 0.022* 0.163* 0.012* 0.193* 0.052*
Voicing (df=1,2854): 0.004+ 0.403* 0.055* 0.180* 0.002* 0.002+ 0.017* 0.024* 0.003+ 0.482* 0.021* 0.045* 0.002+ 0.002+ 0.085* 0.105* 0.017* 0.158* 0.022* 0.124* 0.035*
Place of Articulation, Overall (df=3,2852) / Non-Sibilants (df=1,1414) / Sibilants (df=1,1414): 0.483* 0.006+ 0.504* * 0.052 0.004* 0.004* 0.425* 0.004+ 0.025* * + 0.004 0.003 0.001+ * 0.450 0.444* 0.056* 0.043* 0.057* * 0.401 0.020* * * 0.038 0.014 0.005+ * + 0.124 0.003 0.005+ 0.001* 0.001+ * + 0.036 0.001 0.012* * * 0.114 0.057 0.060* 0.050* 0.038* * 0.119 0.083* 0.132* * * 0.116 0.082 0.101* * 0.269 0.552* 0.494* 0.015* 0.335* * 0.304 0.369* * * 0.159 0.021 0.189* 0.227* 0.026* 0.104* * * 0.072 0.038 0.100* * * 0.253 0.061 0.097* 0.106* 0.062* 0.504*
+ p < .05   * p < .0001

Context Effects. Contextual factors (speaker and vowel) accounted for a significant portion of the variance in every cue.
Not surprisingly, speaker and vowel accounted for a massive amount of variance in cues like vowel duration (R²change = .792) and vowel amplitude (R² = .612), which were measured to capture some of the contextual variance. F0 and all five formants were also highly related to context, with average effect sizes in the 40–60% range. For F1 and F2 this was largely due to the vowel (F1: R²change = .603; F2: R²change = .514), while for the other formants it was largely due to the speaker. Most other cues showed much smaller effects, in the 8–10% range.

Unexplained Variance. Finally, as Table S1 shows, there was a substantial amount of unexplained variance in each cue. For seven of the cues (F3AMPV, F5AMPV, F5, M3, M4, M2trans, and M4trans) there was more unexplained variance than the combined variance accounted for by speaker, vowel, and fricative. For the others, it was still substantial. Even some of the best (most nearly invariant) cues showed large amounts of unexplained variance, for example MaxPF (42.3%), M2 (28.4%), M3 (54.8%), and M4 (70.4%). Of course, the presence of unaccounted-for variance is common in regression analyses, but it raises an interesting question here. This variance means that even fricatives that were spoken by the same person, in the same sentence context, and in the same recording session (e.g., repetitions 1, 2, and 3) differed substantially in their acoustic realization. Across this data set, the only factors that systematically varied were speaker, vowel, and fricative identity, and these effects have been accounted for. Thus, the unexplained variance suggests that a large portion of the variance in speech cues may actually be due to random factors, that is, noise (see Newman, Clouse, & Burnham, 2001), that the listener must deal with. It does not appear that, even if we knew all of the relevant factors, we could account for all of the variance in these cues.
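As an arithmetic check on Table S1, the unexplained variance for each cue is simply what remains after the three R²change components are removed; a small worked example using the MaxPF row (values taken from Table S1):

# Unexplained variance = 1 - (R2change for speaker + vowel + fricative identity).
# For MaxPF the vowel effect was not significant, so it contributes 0 here.
speaker, vowel, fricative = 0.084, 0.0, 0.493
unexplained = 1.0 - (speaker + vowel + fricative)
print(round(unexplained, 3))  # 0.423, matching the "Unexplained Variance" column for MaxPF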
Discussion. There are several candidates for primary cues to place of articulation or sibilance, though none are completely invariant with respect to context, and it is not clear that any are strongly related to place of articulation in nonsibilants. There were also no unique, contextually invariant cues to voicing. Fricative identity as well as vowel and speaker affected virtually every cue we studied. This may present problems for models that conflate these sources of information, like exemplar models. At the same time, some cues are likely to be more informative than others: peak frequency, the narrow-band amplitudes, and the spectral moments are strongly related to place of articulation; RMSF, the narrow-band amplitudes, and M2 are strongly related to sibilance; and DURF and LF are strongly related to voicing (though they were also strongly affected by context). Most of the place cues were helpful for sibilants, while nonsibilants showed only weak relationships with primarily context-dependent cues. Finally, we found only a handful of cues that come close to invariance: MaxPF, the narrow-band amplitudes in the fricative, and spectral moments 2–4 seemed somewhat context independent, and cued various aspects of sibilance and place of articulation.

As a whole, this strongly reinforces the notion that fricative identification requires the integration of many cues, and that there are few, if any, that are invariant with respect to context and other factors.

Note 2: Analysis of Perceptual Data

The primary analysis used generalized estimating equations (GEE) with a logistic linking function to approximate a mixed-design ANOVA with a binary dependent variable. The model included syllable type as a between-subjects variable, along with vowel, speaker, place of articulation, and voicing as repeated measures. Vowel and speaker were included in the model as main effects only. Accuracy was the dependent variable. This analysis was fully reported in the paper, but follow-up analyses were also run separating the data by syllable type in order to understand the two-way interactions of place and voicing with syllable type. Each analysis included place, voicing, and their interaction as primary factors while also including independent (noninteracting) effects of vowel and speaker. (A sketch of such a model appears at the end of this note.)

In the complete-syllable condition we found a significant main effect of speaker (Wald χ²(9) = 135.5, p < .0001). Z was not significant individually (Wald χ²(1) = 2.0, p = .156; Figure S1C). In the fricative-noise condition, speaker was still significant (Wald χ²(9) = 196.9, p < .0001), but vowel was no longer significant (Wald χ²(2) = .9, p = .6; Figure S1B). This implies that the vowel effect seen in the complete-syllable condition was not due to the fact that the particular frication produced before an /i/ (or any other vowel) was more (or less) ambiguous: heard alone, there was no effect of vowel. Rather, the vowel contributes something beyond simply altering the cues in the frication. As before, place of articulation was significant (Wald χ²(3) = 189.8, p < .0001; Figure S1A), with all three places differing significantly from postalveolars (labiodentals: Wald χ²(1) = 38.1, p < .0001; interdentals: Wald χ²(3) = 65.0, p < .0001; alveolars: Wald χ²(1) = 10.9, p = .001). This time, voicing was significant (Wald χ²(1) = 15.0, p < .0001), with better performance on voiceless than voiced sounds.

Figure S1: Listeners' performance (proportion correct). A) Performance on each of the eight fricatives as a function of condition. B) Performance across fricatives as a function of vowel and condition. C) Performance as a function of place of articulation and voicing for the complete-syllable condition. D) The same for the noise-only condition.
The voicing × place interaction was also significant (Wald χ²(1) = 34.5, p < .0001; Figure S1D), due to a significant effect of voicing in interdentals (Wald χ²(1) = 12.0, p = .001), but not in labiodentals (Wald χ²(1) = .4, p = .5) or alveolars (Wald χ²(1) = .5, p = .48).

To summarize our findings: 1) performance without the vocalic portion was substantially worse than with it; 2) performance varied substantially across speakers; 3) sibilants were easier to identify than nonsibilants, but there were place differences even within the sibilants; 4) voicing effects were largely restricted to the interdental fricatives; and 5) the identity of the vowel affected performance, but only in the complete-syllable condition. Thus, either particular vowels alter the secondary cues in the vocalic portion in ways that mislead (or help) listeners, or the identity of the vowel causes subjects to treat the cues in the frication noise differently. Most likely it is the latter: the lip rounding created by /u/ has a particularly strong effect on the frication, and listeners' ability to identify the vowel (and thus account for these effects) may offer a large benefit for fricatives preceding /u/ that is not seen for the unrounded vowels.
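The follow-up analyses in this note can be approximated with a generalized estimating equation fit in Python; the sketch below assumes a hypothetical long-format data frame with one row per trial and columns named correct, place, voicing, vowel, speaker, and subject. It illustrates the model structure described above and is not the original analysis code.

import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_condition(df):
    """GEE with a logistic link: place, voicing, and their interaction,
    plus noninteracting effects of vowel and speaker; subjects are the clusters."""
    model = smf.gee(
        "correct ~ C(place) * C(voicing) + C(vowel) + C(speaker)",
        groups="subject",
        data=df,
        family=sm.families.Binomial(),            # logistic linking function
        cov_struct=sm.cov_struct.Exchangeable(),
    )
    result = model.fit()
    # Wald tests on the fitted coefficients (result.wald_test) give statistics
    # analogous to those reported above.
    return result

# Usage (df_complete is the hypothetical complete-syllable subset):
# print(fit_condition(df_complete).summary())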
Note 3: Confusion Matrices

While our primary analyses of the empirical data (and the models) focused on overall accuracy as a function of fricative, vowel, and speaker, listeners' (and models') responses were not dichotomous. Rather, listeners (and models) selected which of the eight fricatives was their response for each stimulus. In this section, we present the confusion matrices (the likelihood of responding with a given fricative given the one that was heard) as an alternative metric for evaluating the experiment and the models. This necessarily ignores the context effects, but it paints a picture parallel to the analysis presented in the manuscript: the compensation / C-CuRE model performs like listeners in the complete-syllable condition, while the cue-integration model succeeds in the frication-only condition.

3.1 Listener Data

Table S3 shows confusion matrices for each condition in the perceptual experiment. In the complete-syllable condition, listeners were accurate overall (M = 91.2%), particularly on the sibilants (M = 97.4%). The only systematic confusions were for nonsibilants, and within these, participants typically chose the wrong place of articulation but maintained voicing. For example, when /f/ was miscategorized, /v/ was selected 1% of the time, but /θ/ 10.4% of the time. Similarly, /v/ was confused with /ð/ 8.5% of the time, but with /f/ only 0.8% of the time. The only exception was /ð/, the most difficult fricative (M = 74.4%), which showed confusions for both voicing (/θ/: 7.5%) and place (/v/: 17.0%).

In the frication-only condition, performance dropped substantially (M = 76.3%), but the overall pattern remained. For nonsibilants, the majority of confusions were in terms of place of articulation (M = 27.0% across all four), though there were now more confusions in voicing as well (M = 8.1%). The error rate for sibilants was higher than in the complete-syllable condition, but errors were still few, and they slightly favored voicing (M = 3.9%) over place (M = 1.0%).

Across both conditions, confusions respected sibilance. Nonsibilants tended to be confused with other nonsibilants (complete-syllable: M = 14.5%; frication-only: M = 38.4%) rather than with sibilants (complete-syllable: M = 0.6%; frication-only: M = 2.2%); and while sibilants were rarely confused in the complete-syllable condition, in the noise-only condition sibilants tended to be confused with other sibilants (M = 5%) and were rarely labeled as nonsibilants (M = 1.7%).

Table S3: Confusion matrices for each condition of the perceptual experiment. Columns are the fricative presented (f, v, θ, ð, s, z, ʃ, ʒ); rows are the fricative with which listeners responded. Shaded cells represent a response rate of greater than 5%.

Complete syllables:
f 88.2 0.8 8.2 0.3 0.1
v 1.0 89.9 0.1 17.0 0.2
θ 10.4 0.5 87.3 7.5 2.3 0.3 0.1 0.1
ð 0.2 8.5 3.4 74.4 0.7 2.5 0.1 0.1
s z ʃ ʒ 0.3 0.3 1.0 0.1 94.1 0.1 0.2 0.5 2.6 96.4 0.2 0.2 0.2 99.4 0.1 0.5 0.3 99.6
%correct 88.2 89.9 87.3 74.4 94.1 96.4 99.4 99.6

Frication noise only:
f 71.8 8.1 21.3 3.2 0.4 0.1 0.3
v 0.9 72.3 0.8 50.6 0.1 0.6 0.1
θ 23.0 5.4 62.4 12.0 1.4 0.4 0.6
ð 3.6 13.2 11.5 30.9 0.3 1.5 0.3 0.7
s 0.5 0.1 3.7 0.3 92.9 8.1 0.2 0.2
z 0.8 0.3 2.5 3.0 87.6 0.4
ʃ ʒ 0.1 0.2 0.1 0.1 1.9 0.1 97.4 3.1 0.5 0.2 1.6 1.3 95.6
%correct 71.8 72.3 62.4 30.9 92.9 87.6 97.4 95.6
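Confusion matrices like Table S3 can be derived directly from the trial-level responses; a minimal sketch, assuming a hypothetical data frame with one row per trial and columns named presented and responded (the column names are illustrative, not from the original materials):

import pandas as pd

def confusion_matrix(trials):
    """Percentage of trials on which each fricative was given as the response,
    broken down by the fricative that was presented (each column sums to 100)."""
    counts = pd.crosstab(trials["responded"], trials["presented"])
    return 100 * counts / counts.sum(axis=0)

# Usage (trials is a hypothetical table of listener responses):
# cm = confusion_matrix(trials)
# accuracy = pd.Series({f: cm.loc[f, f] for f in cm.columns})  # the %correct row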
3.2 Invariance Model

The confusion matrix for this model (Table S4) shows some similarities to listeners but also major differences. Like listeners, this model was more likely to confuse place of articulation than voicing. However, unlike listeners, /f/ was classified as /v/ 4.8% of the time (listeners: M = 1.0%). Also like listeners, the model's errors tended to respect sibilance, though /f/ and /ð/ were exceptions. More surprisingly, all the sibilants showed a small but noticeable rate of confusion with nonsibilants. Finally, there were also a number of confusions that did not resemble listeners at all. /ð/ was classified as /f/ and /z/ at much higher rates (5.2% and 5.4%, respectively) than by listeners (.3% and .5%). Conversely, /ð/ was rarely classified as /θ/ (1.7%), while that confusion was common for listeners (7.5%). Finally, when listeners heard /s/, virtually all errors were /z/ responses, while the model's errors were evenly distributed across the other three sibilants. Thus, the pattern of errors in the invariance model does not seem well correlated with that of listeners.

Table S4: Confusion matrix of the invariance model based on the probabilistic decision rule. Columns are the fricative presented (f, v, θ, ð, s, z, ʃ, ʒ); rows are the fricative given as the model's response.

f 62.6 1.8 30.8 5.2 .6 .2 1.2
v 4.8 61.8 .9 29.1 .2 .2 .1 .8
θ 23.2 .7 62.4 1.7 1.1 .5 .3
ð 3.7 35.7 2.4 58.2 .3 .2 .6
s .5 z 4.9 3.4 .1 87.1 2.9 5.1 5.4 3.3 92.0 .2 1.7 .1 7.1 ʒ %correct 88.4 .2 4.1 3.9 4.2 62.6 61.7 62.4 58.2 87.1 92.0 88.4 4.7 85.6 85.5 ʃ .2 3.2

3.3 Cue-Integration Model

Our analysis of the confusion data for the cue-integration model revealed a closer match to listeners, particularly in the frication-only condition (Table S5). In the complete-syllable condition, errors on nonsibilants tended to respect voicing (like listeners, and unlike the invariance model), and nonsibilants were rarely classified as sibilants. Within sibilants, the model's errors better reflected listeners': /s/ was most frequently identified as /z/ and then /ʃ/ by both this model and listeners (but not the invariance model). There were still differences. For listeners, /ʃ/ and /ʒ/ were the best identified tokens, while for the cue-integration model it was /z/. There were also a few odd confusions between sibilants and nonsibilants that the listeners did not display (/z/ classified as /ð/, /ʃ/ as /f/, and /f/ as /s/).

Table S5: Confusion matrices of the cue-integration model with complete syllables (A) and frication only (B), based on the probabilistic decision rule. Columns are the fricative presented (f, v, θ, ð, s, z, ʃ, ʒ); rows are the model's response.

A. Complete-Syllable
f v θ ð s 77.3 1.3 32.9 4.6 .5 .4 3.6 3.6 71.3 0.5 25.9 .1 1.7 .1 14.5 .6 62.0 3.4 .8 .2 .5 1.9 26.8 2.9 63.0 .4 2.2 1.7 2.6 .4 z ʃ ʒ %correct 88.4 .1 2.3 1.5 62.6 61.7 62.4 58.2 87.1 92.0 88.4 4.0 89.9 85.5 ʃ ʒ %correct 54.0 54.9 59.8 47.1 81.9 83.0 91.0 85.9 .1 1.8 92.0 3.3 4.2 .7 3.2 3.6 89.9 4.9 2.4

B. Frication-Only
f v θ ð 55.1 4.1 38.8 5.9 0.5 0.5 1.7 4.0 55.4 1.0 41.6 36.4 1.8 55.7 3.3 0.8 1.2 1.1 4.5 38.7 1.7 47.6 0.1 0.7 0.1 1.4 1.2 0.0 1.0 s 2.8 z 85.8 3.9 3.0 .1 1.5 6.0 91.7 0.2 87.4 2.3 0.8 6.6 0.0 7.6 1.0 89.0 4.5

An analysis of the confusion data for the frication-only condition (Table S5B) revealed similar results: the model captured the broad pattern of results, with a less close fit on the specifics. Like listeners, errors generally stayed within sibilance class. Within nonsibilants, errors tended to concern place of articulation (not voicing), and for sibilants there were more errors on voicing than on place of articulation. There were some differences in the details. Listeners made substantial voicing errors on nonsibilants, hearing /v/ as /f/ 8.1% of the time and /θ/ as /ð/ 11.5% of the time. Yet the model rarely made such errors, classifying /f/ as /v/ only 4.0% of the time and /θ/ as /ð/ only 1.6% of the time. This was despite somewhat worse performance on the nonsibilants overall (model: 53.9%; listeners: 59.4%); the model simply made more place errors.
3.4 Compensation / C-CuRE Model

The analysis of the confusion data for the compensation / C-CuRE model yielded the closest match to listener performance (Table S6). The C-CuRE model never confused a nonsibilant with a sibilant, and was highly unlikely to classify a sibilant as a nonsibilant. Its errors on voicing were minimal, with the exception of /s/ being classified as /z/ (a pattern the listeners also showed). In short, there were no large errors that were not also shown by the listeners, and the large errors that the listeners made were reflected here.

Table S6: Confusion matrix of the C-CuRE model with complete syllables. Columns are the fricative presented (f, v, θ, ð, s, z, ʃ, ʒ); rows are the model's response.

f v θ ð 78.5 1.0 20.9 .2 2.2 74.4 .9 19.2 17.5 .6 76.0 2.5 1.7 24.0 2.1 78.1 .1 97.1 .8 .9 .1 .4 s .1 z 1.2 98.6 2.6 ʃ ʒ %correct 97.1 .2 .6 1.8 78.5 74.4 76.0 78.1 97.1 98.6 97.1 .6 96.4 96.4 1.4

Note 4: The Effect of Mis-parsing

The C-CuRE simulations make the assumption that listeners can perfectly identify the speaker and vowel as a precursor to compensation. This assumption was made in order to capture the upper limit of the information available in a set of compensated cues, and it was clear that this was sufficient to predict listener performance overall. However, it is not clear how robust the C-CuRE parsing mechanism is to error (or rather, how robust the categorization architecture is to mis-parsed cues). The present simulations are intended to address that.

In these simulations, we selected a probability that the system would misidentify the speaker, and an independent probability that it would misidentify the vowel. For each token in the set of fricatives in the perceptual experiment, we then randomly selected (based on that probability) whether it would be parsed correctly or incorrectly for speaker, and independently for vowel. If it was incorrect (on either speaker or vowel), a random speaker (or vowel) was chosen and the cue values were parsed as if this were the speaker or vowel that had been identified. This yielded a new data set of residuals in which some proportion of the cues had been mis-parsed. This data set was used as input to the original C-CuRE model (including the original estimated weights, which were not retrained on the mis-parsed data) to determine how well it would perform under these circumstances. Given that the mis-parsing was random, we simulated multiple runs of the model, although it was clear after the first few that with this many tokens (240) only a handful of runs were needed to estimate overall performance.

Initial simulations suggested that the effects of mis-parsing vowel and speaker were additive. Increasing the likelihood of either resulted in a fixed decrement in performance, and the effect of mis-parsing both was approximately the sum of either alone. Thus, for the simulations reported here, we covaried the likelihood of making a speaker and a vowel error in steps of 5%, ranging from 0 (the original C-CuRE model) to 50% (half the trials mis-parsed). At each step, four simulations were run, and we recorded the proportion correct as a function of the fricative, the speaker, and the vowel.
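A minimal sketch of the label-corruption step just described, assuming hypothetical inputs (arrays of true speaker and vowel labels, and parse_cues and score_ccure functions standing in for the regression-based parsing and the categorization model); it illustrates the procedure, not the original simulation code.

import numpy as np

def misparse_labels(true_labels, all_labels, p_error, rng):
    """With probability p_error, replace each token's label with a randomly chosen
    different label; otherwise keep the correct one."""
    labels = np.array(true_labels, dtype=object).copy()
    for i in range(len(labels)):
        if rng.random() < p_error:
            wrong = [x for x in all_labels if x != labels[i]]
            labels[i] = rng.choice(wrong)
    return labels

# rng = np.random.default_rng(0)
# For each 5% step from 0 to 50%, corrupt speaker and vowel labels independently,
# re-parse the cues against the (possibly wrong) labels, and score the original model:
# for p in np.arange(0.0, 0.55, 0.05):
#     speakers = misparse_labels(true_speakers, speaker_set, p, rng)
#     vowels = misparse_labels(true_vowels, vowel_set, p, rng)
#     residual_cues = parse_cues(raw_cues, speakers, vowels)   # hypothetical parsing step
#     accuracy = score_ccure(residual_cues)                    # hypothetical scoring step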
Results

Figure S2 shows the model's accuracy (the gray range, bounded by the discrete-choice and probabilistic decision rules) as a function of the likelihood of misidentifying both speaker and vowel. For reference, the performance range of the cue-integration model is represented by the two thick lines, and that of the original (perfectly parsed) C-CuRE model is at 0 on the x-axis. There is a clear effect of mis-parsing on performance, with a dip of about 2.5% for each 10% increase in the likelihood of mis-parsing. However, it is not until the likelihood of an error on either factor reaches about 30–35% that the C-CuRE model's performance drops to the level of the cue-integration model. Normal-hearing listeners are far better than 30% at identifying unambiguous vowels and speakers, suggesting that at the range of performance likely shown by listeners, the C-CuRE model will still be the superior categorizer.

Figure S2: Effect of the proportion of trials on which speaker or vowel is misidentified on performance of the parsing model. The gray range is the parsing model's performance, bounded by the discrete-choice rule (top) and the probabilistic rule (bottom). Thick lines represent the same for the cue-integration model.

Figure S3: Performance of the C-CuRE model when identification of speaker and vowel is imperfect. A) Overall performance (probabilistic rule) as a function of speaker and misidentification rate. B) Overall performance (probabilistic rule) as a function of vowel and misidentification rate.

Moreover, even with a noisy parser, the C-CuRE model still shows many of the same context effects that were diagnostic of human performance (and that the cue-integration and invariance models were unable to show). While its performance as a function of speaker is lower overall when parsing is noisy, it is still robustly correlated with listener data (Figure S3A), with average correlations of R = .43 across misidentification rates of 5–15%, R = .29 at 20–30%, and R = .24 at 35–50% (for comparison, the cue-integration model had a correlation of R = -.01). More dramatically, the effect of vowel context showed the same pattern as listeners (i < u < ɑ), and this effect remained even at misidentification rates of 40% and 50% (Figure S3B).

Discussion

While there were clear performance decrements in the compensation / C-CuRE model when speaker and vowel identification were not perfect, it is clear that this model remains superior to the cue-integration model. Overall performance of the two does not converge until we assume that the C-CuRE model misidentified both speaker and vowel at least 30% of the time, which is far worse than listeners are likely to be. The correlations between the model and the listener data across speakers are robust even out to 50%, and the correct effect of vowel context can be seen at all rates of misidentification tested. Moreover, we made two simplifying assumptions in this simulation that may swing the pendulum too far in the other direction: this task is more challenging than what would likely be faced by real listeners or a scaled-up model. First, when a misidentification occurs, errors were completely random.
For example, when the model misidentified a male speaker, it was just as likely to select a female as another male, despite the fact that within-gender pairs of voices are likely to be substantially more similar than across-gender pairs. If the model's confusions reflected real similarity relations between speakers and between vowels, the effect of mis-parsing would likely be much less deleterious. Second, the categorization component of this model used the original coefficients of the C-CuRE model, a model that was trained on perfectly parsed data. Had we trained the model on data that had been occasionally mis-parsed, it might have developed a subtly different parameter set that would be more effective at coping with poorly compensated cues. Nonetheless, despite these simplifying assumptions, it is clear that by all of our criteria the C-CuRE model is superior to the cue-integration model, even when parsing is substantially imperfect.

Note 5: Parsing by Coarse-Grained Categories

In Note 4 we challenged the simplifying assumption of the C-CuRE model that the speaker and vowel can be accurately identified for every token. Those simulations showed that even when parsing got the wrong speaker or vowel 25% of the time, the model still outperformed the cue-integration model. An alternative approach to this issue is to make the categorization of context easier, such that near-ceiling identification performance is likely. For example, if the model categorized speakers as male or female instead of identifying individuals, it is much less problematic to assume that this can be done perfectly, as listeners are substantially less likely to err in such judgments. The flip side, however, is that partialing out coarser-grained sources of information like gender may not be as informative as partialing out fine-grained sources of variance like individual speakers. Thus, in this simulation we test this hypothesis, limiting the C-CuRE model to parsing out speaker gender and only two properties of the vowel (height and backness), and evaluating its performance.

This simulation also indirectly addresses the issue of the number of parameters in the model. We describe in Section 3.2.3 how the linear regressions used for parsing introduce extra parameters into the complete model: specifically, 19 parameters to account for speaker, 5 to account for vowel, and an intercept (25 parameters total for each of the 24 cues). As we describe there, we did not count these parameters against the C-CuRE model when computing BIC for model evaluation, as they were not truly degrees of freedom for the categorization model; they were not available to the optimizer when the categories were acquired. Nonetheless, it would be useful to determine whether successful parsing could be implemented using regressions with many fewer parameters. In this case, each regression has only 4 parameters (for each of the 24 cues), a substantial decrease in the number of parameters devoted to parsing, from 25 × 24 = 600 to 4 × 24 = 96. Thus, the success of this model could suggest a route to maintaining parsing in a model with far fewer parameters.

Methods

Similarly to the full C-CuRE model, we started by running 24 individual linear regressions.
These were run hierarchically, first partialing out the gender of the speaker and then adding two variables describing the vowel. The first was dichotomous, indicating vowel backness; the second had three values (–1, 0, 1) and indicated height. Table S7 shows the results of these regressions. Gender accounted for substantially less variance across cues than the full set of 19 speaker codes. While the speaker codes accounted for an average of 18.8% of the variance across cues and were significant for every cue, gender accounted for only 8.9% and was not significant for a number of cues. This broke down unevenly across cues because of the variety of ways that gender and speaker can influence articulation and acoustics. For example, the bulk of the variance associated with speaker for cues like M1 or F0 is related to factors like vocal tract and larynx size, which are also related to gender; for such cues one might expect speaker codes and gender to account for similar amounts of variance. For other cues, however, the speaker codes are capturing things like mean syllable duration or amplitude, which are not directly correlated with gender.

Vowel information was well preserved when we used only two parameters to describe the six vowels instead of all five. The average R² using all five vowel codes was 0.123, while it was 0.112 when only two were used. Moreover, every cue was significantly related to vowel in both regressions. This suggests that parsing by phonetic features (e.g., height) may be just as effective as parsing by individual phonemes, with lower dimensionality as an added bonus.

Table S7: Comparison of regression analyses examining the effects of speaker and vowel on each cue (the complete C-CuRE model) vs. the regression analyses presented here using only gender and vowel height and backness (the limited C-CuRE model). Shown are R²change values. Missing values were not significant at the p < .05 level. For each cue the values are listed in the order: Speaker (19), Vowel (5), Gender (2), Vowel Height & Backness (2).

MaxPF 0.084* .024*
DURF 0.158 0.021 .001+ .02*
DURV 0.475 0.316 .075 .312*
RMSF 0.081* .010*
RMSV 0.570 0.043 .039* .008*
F3AMPF 0.070 0.028 .005 .024*
F3AMPV 0.140 0.156 .120*
F5AMPF 0.077* 0.012* .003+ .011*
F5AMPV 0.203 0.040 .018*
LF 0.117 0.004 .002 .003*
F0 0.838* 0.007* .715* .007*
F1 0.064 0.603 .01 .602*
F2 0.109 0.514 .071 .489*
F3 0.341* 0.128* .213* .103*
F4 0.428 0.050 .301 .045*
F5 0.294 0.045 .204 .03*
M1 0.122 .011
M2 0.036
M3 0.064* .001+
M4 0.031 .003+
M1trans 0.066* 0.043* .002+ .042*
M2trans 0.084 0.061 .009 .053*
M3trans 0.029 0.079 .078*
M4trans 0.031 0.069 .063*
+ p < .05   * p < .0001
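A minimal sketch of this reduced parsing step in Python (statsmodels), assuming a hypothetical data frame with one row per token and columns for the cue, speaker gender (0/1), vowel backness (0/1), and vowel height (–1/0/1); the residual from each regression is what the limited C-CuRE model then categorizes. The column names are illustrative, not from the original code.

import statsmodels.formula.api as smf

def parse_cue_by_features(df, cue):
    """Regress a cue on gender, vowel backness, and vowel height (4 parameters:
    intercept plus 3 slopes), and return the residuals as the 'parsed' cue values."""
    fit = smf.ols(f"{cue} ~ gender + backness + height", data=df).fit()
    return fit.resid

# Usage (df is a hypothetical token-by-cue table):
# df["M1_parsed"] = parse_cue_by_features(df, "M1")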
Results

The reduced C-CuRE model did surprisingly well. It performed at 90.8% correct on the tokens from the perceptual experiment when measured with the discrete-choice rule, and 83.5% with the probabilistic rule. This was lower than the full C-CuRE model but still better than the cue-integration model (Table S8). Similarly, its BIC was midway between those of the two models. Finally, while the cue-integration model failed to use a number of the cues and the complete C-CuRE model was able to use them all, this more limited C-CuRE model made significant use of every cue except the narrow-band amplitudes in the vowel (F3AMPV and F5AMPV).

Table S8: Model comparison.
Model               % Correct (Discrete)   % Correct (Prob.)   BIC
Listeners           91.2
Cue Integration     85.0                   79.2                3381
C-CuRE (complete)   92.9                   87.0                2990
C-CuRE (limited)    90.8                   83.5                3267

Clearly, by accuracy alone the more limited C-CuRE model is a substantial improvement over the cue-integration model, though it does not reach the complete C-CuRE model. However, we have also stressed the ability of the model to qualitatively fit the listener data, and the effects of vowel and speaker context on overt performance were particularly diagnostic. Figure S4 shows a summary of these data. Breaking performance down by fricative (Figure S4A) showed a fair match to the data, although the model oddly appeared to overperform on /f/, /v/, and /θ/. This may have been due to the particular tokens selected at test, as it did not appear to be the case for the training tokens (on which the model performed appropriately worse). However, the effect of vowel context (Figure S4B) was exactly as predicted, with performance worst for /i/, better for /ɑ/, and best for /u/. Perhaps not surprisingly, however, the effect of speaker was not as well correlated with the listener data for the limited C-CuRE model (Figure S4C), although the correlation was still positive (R = .25).

Figure S4: Performance of the limited C-CuRE model and human listeners on the complete-syllable condition. Model performance is represented by the gray range bounded by performance using the probabilistic and discrete-choice rules. A) As a function of fricative. B) As a function of vowel context. C) As a function of speaker.

Discussion

Our initial linear regressions suggested that collapsing speakers into gender deleted a lot of useful information (an indirect argument for the sort of speaker-specific encoding seen in exemplar models), while collapsing the vowel from five variables to two deleted very little information. Thus, parsing by features may be almost as effective as parsing by phonemes. When these values were used in the categorization model, it was clear that there was a performance boost to be gained by simply parsing out gender and the features of the vowels. While there were idiosyncrasies in the factors that affected accuracy, the effect of vowel, which was not seen in the cue-integration and invariance models, remained. Thus, a limited version of parsing which collapses phonemic influences on acoustic cues into features, and collapses individuals into categories (e.g., gender), may be of value: it reduces the number of free parameters in the model and, more importantly, ensures that the identifications required for parsing can be more accurate.
Together with Note 4, these results suggest that there are at least two ways to retreat from the complete C-CuRE model and still achieve a benefit in both performance and the ability to fit the human data.

Note 6: Compound Cues

The naive invariance model emphasized first-order cues: single sources of information that directly cue a distinction. None of these cues met our strict criteria for invariance, but they were statistically the best set of single cues available in this large corpus. However, work on invariance has also emphasized compound cues: measurements in which different properties of the signal are related to each other, for example, relating frication duration to speaking rate by using the ratio of frication duration to vowel duration. Another way of describing such compound cues is that the combination instantiates a form of bottom-up compensation in which only the cues themselves participate. In this section, we examine a number of such compound cues, both to substantiate our claims regarding invariance and to compare the interactive compensation mechanism proposed by parsing to some of the best-known purely bottom-up approaches (for fricative cues).

Perhaps the best known compound cues are locus equations (Sussman et al., 1998). Locus equations are constructed by computing a line connecting a formant frequency (typically F2) at the onset and at the center of a vowel. This line is defined by its slope and intercept (the latter of which is equivalent to F2 at onset), offering two cues for stop consonant identity. It compensates for the effect of speaker and vowel on formant frequencies by taking into account the frequency at the steady state. F2 locus equations have been applied to fricatives in a number of studies (Fowler, 1994; Sussman & Shore, 1996; Jongman et al., 2000), and F2onset (the intercept) has also been used alone (e.g., Nittrouer, Studdert-Kennedy, & McGowan, 1989; Maniwa et al., 2009).

There are a number of other compound cues that may also contribute to fricative identity. The duration of frication is often related to the broader duration of the syllable or the vocalic portion as a ratio (DURF / DURV) to compensate for variance in speaking rate. Similarly, the RMS amplitude of the vowel can be used as a baseline to normalize the amplitude of the fricative. The narrow-band amplitudes are also typically transformed into relative amplitudes by subtracting the amplitude in the relevant frequency range of the vowel from that of the fricative. Thus, compound cues can compensate for variation due to speaker and vowel, as well as variation in amplitude and speaking rate. As such, they should be more invariant with respect to context than their first-order counterparts. However, using such cues comes at a cost. While first-order cues can be identified readily in the signal, compound cues require substantially more effort to uncover, and as a result only a handful of them have been identified. They may also require the system to store multiple bits of information for some time during online speech perception (e.g., storing frication duration until the end of the vowel). This would require a buffer of some kind, and may also delay listeners' ability to make a commitment.
It is also contrary to recent empirical data suggesting that, for voicing at least, there does not appear to be such a buffer (McMurray, Clayards, Aslin, & Tanenhaus, 2008).

Interestingly, in many cases compound cues appear to be doing something similar to parsing. Consider the slope of the F2 locus equation. This cue is defined as

MF2 = (F2onset − F2vowel) / (0.5 × DURV)                (1)

where MF2 is the slope of the locus equation and DURV is the duration of the vowel. Here, MF2 will be negative if F2onset is lower than the steady-state value and positive otherwise. Now consider a version of F2onset in which the vowel has been parsed out. To accomplish this, we first develop a regression equation predicting F2onset from the vowel:

F2onsetP = β0 + β1V1 + β2V2 + β3V3                (2)

Here, F2onsetP is the predicted F2onset and β0 is the y-intercept. V1–V3 are dummy-coded variables such that V1 is 1 if the vowel was /i/ (and 0 otherwise), V2 is 1 if the vowel was /u/, and so forth. β1–β3 are the regression weights on these terms. Since for any given token only one of the dummy codes is 1, this regression equation can be rewritten (if the vowel was /i/) as

F2onsetP = β0 + β1V1                (3)

Under this scheme, β1 will be equal to the average difference in F2 between the /i/ tokens and all the other vowels, and β0 + β1 will be the mean F2 of all the /i/ tokens. Now, to parse the effects of vowel from the original F2, we simply subtract the predicted F2 from the actual value:

F2onsetR = F2onset − F2onsetP = F2onset − (β0 + β1V1) = F2onset − mean F2onset for /i/                (4)

This is quite similar to the locus equation. In both cases, the relative (or parsed) value (F2onsetR) is linearly related to the difference between the actual value and some estimate of the vowel. The only substantive difference is that the locus equation uses the actual F2 of the vowel, while the parsed F2 uses the mean F2 across all tokens of that vowel. Thus, parsing in C-CuRE may be a proxy for the compensation built into these compound cues.

However, since we only need to know one measurement to parse, C-CuRE can be used on any cue. This is certainly useful for speech scientists (since we do not need to exhaustively search all possible cue combinations). More importantly, it may also be useful for listeners. During real-time processing, when the vowel may not yet have been heard at early points in the syllable (e.g., McMurray et al., 2008), relying on a memory of the mean value of that cue for the speaker or vowel (rather than waiting for the actual value) may be more efficient than storing the fricative cues and waiting for the rest of the information. There may be tradeoffs, though: the actual F2vowel may be a better estimate than the mean, and the listener would still have to track mean cue values over the long term rather than computing them on the fly. On the other hand, mean values may be more robust against noise than single values. Thus, the advantage (or disadvantage) of compound cues over cues parsed with C-CuRE is somewhat uncertain, and there are clear arguments on both sides. The present analysis examined this issue.
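To make the parallel concrete, the sketch below computes both quantities for a hypothetical data frame of tokens (columns F2onset, F2vowel, DURV, and vowel); it mirrors equations (1) and (4) above and is an illustration, not the original analysis code.

import pandas as pd

def locus_slope(df):
    """Equation (1): slope of the F2 locus equation, relativized against the
    token's own vowel steady state."""
    return (df["F2onset"] - df["F2vowel"]) / (0.5 * df["DURV"])

def parsed_f2onset(df):
    """Equation (4): C-CuRE-style parsing, relativized against the mean F2onset
    of all tokens of that vowel (the prediction of a vowel-only regression)."""
    vowel_means = df.groupby("vowel")["F2onset"].transform("mean")
    return df["F2onset"] - vowel_means

# Both measures subtract an estimate of the vowel's contribution from F2onset; the
# locus equation uses the observed F2 of this particular vowel token, while parsing
# uses the expected (mean) value for that vowel category.
# df["MF2"] = locus_slope(df)
# df["F2onsetR"] = parsed_f2onset(df)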
In our data set, there were five potential compound cues: F2onset can be relativized as a locus equation; duration can be relativized as the ratio DURF / DURV; and RMS and the narrow-band amplitudes (F3 and F5) can be relativized as the difference between the frication and the vowel. Our goal was to determine whether the compound values of these cues would outperform the parsed values (or vice versa). Of course, a model based on only five cues was not likely to perform well: there are too few cues, and these cues were selected based on empirical work, not their coverage of the relevant phonetic features. Thus, we also included five absolute cues to achieve broader coverage: the first four spectral moments and low-frequency energy.

We then compared several models. The first used the five potentially compound cues as first-order cues in completely raw form (F2onset, DURF, RMSF, F3AMPF, F5AMPF). The second used each of these cues in compound form: MF2 and BF2 (the slope and intercept of the F2 locus equation), DURR (DURF / DURV), RMSR (RMSF − RMSV), F3AMPREL (F3AMPF − F3AMPV), and F5AMPREL (F5AMPF − F5AMPV). Finally, we considered several models examining expanded cue sets and/or compensation with C-CuRE. As before, the data were converted to Z-scores, and the perceptual tokens were used only for test. Unlike before, we did not replace missing data points with means, since for a handful of data points we were missing only one of the two values for a compound cue. Thus, any record that was missing data for any of the measures was excluded, resulting in 2,775 (out of 2,880) records in the analysis.

Results

All five models fit the data well (see Table S9). The first-order cue model was highly significant (χ²(70) = 8,666, p < .001), and reduced the BIC from 11,593 in the intercept-only model to 3,483. On the perceptual tokens, it averaged 80.0% correct using the discrete-choice rule and 73.2% using the probabilistic rule. While this is not great performance compared with either of the other first-order cue models (the invariance and cue-integration models), this was expected, since these cues were chosen only because they had compound-cue counterparts.

Not surprisingly, the compound-cue model did better. It fit the data significantly (χ²(70) = 8,803, p < .001), and its BIC was lower, 3,401. It also performed better than the first-order model by both the discrete-choice (83.3% vs. 80.0%) and probabilistic (75.4% vs. 73.2%) rules. Despite this better performance, the compound-cue model did not use all of the cues. Likelihood ratio tests showed that the first-order cue model used all five cues of interest (all χ²(7) > 26.5, p < .002), while the compound-cue model was unable to use the relative amplitude at F5 (χ²(7) = 9.6, p = .21), though it did use the others (all χ²(7) > 67.0, p < .001). Thus, as a whole, compound cues offer a benefit to perception over the equivalent first-order cues, although the utility of any individual compound cue could be more variable.
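The model fits reported here can be reproduced in spirit with a multinomial logistic regression over standardized cues (8 fricative categories, so each added cue contributes 7 parameters, consistent with the χ²(7) and χ²(70) tests above). A minimal sketch, assuming a hypothetical matrix X of z-scored cues and integer fricative labels y; this is an illustration of the approach, not the original model code.

import numpy as np
import statsmodels.api as sm

def fit_categorizer(X, y):
    """Multinomial logistic regression: 8 fricative categories from z-scored cues.
    With k cues there are 7*(k+1) free parameters (7 intercepts plus 7 slopes per cue)."""
    model = sm.MNLogit(y, sm.add_constant(X))
    result = model.fit(disp=False)
    n, k_params = X.shape[0], result.params.size
    bic = k_params * np.log(n) - 2 * result.llf   # Bayesian Information Criterion
    return result, bic

# Usage (X is a hypothetical n-by-10 array of z-scored cues; y codes the 8 fricatives 0-7):
# result, bic = fit_categorizer(X, y)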
The comparison between the first-order and compound-cue models is somewhat unfair, however, as the compound cues benefit from two sources of information (e.g., DURR uses both DURF and DURV), while the first-order model used only one (DURF), on the assumption that the compound cue is a cleaned-up version of the same thing. However, without the need to relativize, both pieces of information could be entered as independent first-order cues and might offer a benefit. Thus, as a second comparison for the compound cues, we reran the first-order model but added all of the component values as first-order cues. So, for example, where the compound-cue model used DURR (DURF / DURV), the extended first-order model used both DURF and DURV as first-order cues. This model showed a good fit to the data (χ²(105) = 9,061, p < .001) and had a lower BIC than either of the other two models (3,365), despite the additional cues. This was reflected in its performance: it averaged 86.3% correct by the discrete-choice rule and 78.0% by the probabilistic rule. Interestingly, all cues except F5AMPV were highly significant (χ²(7) > 15); even the contextual cues offer some first-order discrimination between fricatives. Thus, given this set of cues, a cue-integration approach may be more effective than a compound-cue approach.

Next, we examined a compensation / C-CuRE model using the smaller cue set (e.g., for RMS, using only RMSF, not RMSV). Its performance was slightly better than that of the equivalent compound-cue model. Its overall fit was significant (χ²(70) = 9,189, p < .0001), and its BIC was 2,959 (substantially lower than the prior three models). It was 85.0% correct by the discrete-choice rule and 78.7% by the probabilistic one. Thus, compensation in the C-CuRE scheme is more effective than compound cues.

Finally, we asked whether the addition of extra variables to the C-CuRE model offers any benefit. A final model using parsing on the cues of the extended first-order model showed a highly significant fit (χ²(105) = 9,386, p < .001). While its BIC (3,040) was higher than that of the simpler C-CuRE model, it was lower than that of the compound-cue model. More importantly, its performance jumped to 87.9% by the discrete-choice rule and 80.9% by the probabilistic rule. Thus, it appears that simply using every available information source and normalizing with the parsing operation (rather than more complicated cue-combination rules) offers the best performance.

Table S9: Summary of models examining relative cues. The "# Cues" column lists only the number of cues of primary interest (those that could participate in relative relationships); all models also included five absolute cues (M1–M4 and LF).

Model                  # Cues   Parsing   BIC    % Correct (Discrete)   % Correct (Probabilistic)
First-order            5        No        3483   80.0                   73.2
Compound cues          6        No        3401   83.3                   75.4
First-order extended   10       No        3365   86.3                   78.0
C-CuRE                 5        Yes       2959   85.0                   78.7
C-CuRE extended        10       Yes       3040   87.9                   80.9

Discussion

Why did the compound-cue model fare so poorly against C-CuRE? This may have been due to a number of factors. First, as we described, for many of the compound cues parsing offers a fairly similar way to normalize the data, so we might have expected similar performance a priori. Second, Table S10 shows a comparison of the simple regression results for the compound cues and their absolute counterparts. One thing that is striking is that, with the exception of F2locus, most of the relative measures show similar if not larger effects of context than the absolute ones.
While this is counterintuitive, the reason is that measures like F3AMPREL consist of the difference between two components, one related to the fricative and one to the vowel. By combining them, we end up pooling their variance, essentially corrupting a more invariant measure (F3AMPF) with variance from a less invariant one (F3AMPV).

Finally, since many of these compound measures are linear combinations, the first-order cue model can take advantage of them under some circumstances. For example, if the compound cue is a difference of two cues (e.g., F3AMPREL = F3AMPF − F3AMPV) and both are included in the model (as in the extended first-order model), the model could set the β-weight of one cue (F3AMPF) to +1 and the other (F3AMPV) to −1. That component of the regression equation is then equivalent to the compound cue (F3AMPREL). If this were going to be helpful, the cue-integration model should be able to achieve some version of this relativization as its optimal weights. However, by fixing this relationship into a compound cue, we force the model to use only coefficients of +1 and −1; if the components needed to be weighted differently, this would be impossible. In contrast, the cue-integration and C-CuRE models, with independent weights on both cues, can be more flexible.
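A tiny numerical illustration of this point, showing that a linear cue-integration term with weights of +1 and −1 reproduces the difference-based compound cue exactly, while freely estimated weights generalize it (the amplitude values are made up for illustration):

import numpy as np

# Hypothetical narrow-band amplitudes for three tokens (made-up values).
F3AMPF = np.array([48.0, 52.5, 45.2])
F3AMPV = np.array([50.0, 49.0, 47.5])

# The compound cue is the fixed difference...
F3AMPREL = F3AMPF - F3AMPV

# ...which is what a linear model computes when its weights happen to be +1 and -1.
w_f, w_v = 1.0, -1.0
assert np.allclose(w_f * F3AMPF + w_v * F3AMPV, F3AMPREL)

# A cue-integration or C-CuRE model with independently estimated weights is free to
# choose other values (e.g., 0.8 and -0.3), which a fixed compound cue cannot.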
Note 8: On Cue Sharing

A recurring issue in speech perception is whether categorization decisions for one feature or phoneme are dependent on decisions made for another phoneme or feature (Miller & Nicely, 1955; Pisoni & Sawusch, 1974; Oden, 1978; Mermelstein, 1978; Whalen, 1989; Nearey, 1990). For example, in the context of fricatives, does a listener's decision about the vowel affect his decision about the fricative? This is particularly relevant in instances of cue-sharing, where a single cue is used for more than one classification dimension, for example, when F2 must be used to determine the identity of both the fricative and the vowel. Such cue-sharing across place, voicing, and sibilance as well as vowel and speaker was observed in virtually all of the cues we studied here, and our C-CuRE model makes the fundamental claim that listeners actively identify contextual factors (speaker/vowel) as part of the compensation processes.

A number of experiments have attempted to disentangle these issues using multidimensional speech continua, and the results have been mixed. Pisoni and Sawusch (1974) showed that place and voicing judgments are not independent (VOT, for example, is affected by both voicing and place: Lisker & Abramson, 1964), although Oden (1978) showed that the Fuzzy Logical Model of Perception (FLMP), which assumes independence, can fit these data. Mermelstein (1978) examined vowel length (which partially distinguishes both word-final voicing and /æ/ from /ɛ/) and showed what appeared to be additive effects. Whalen (1989), however, replicated Mermelstein's study with greater power and more precise analyses and found that listeners' vowel judgments were dependent on their voicing judgments (and vice versa). When listeners identified a token as voiceless, they attributed the shorter vowel to voicelessness and were more likely to identify it as /æ/ (a long vowel), a nice example of operations like parsing and C-CuRE. His second experiment showed a similar relationship between fricative (/s/ vs. /ʃ/) and vowel (/i/ vs. /u/) identification for the same stimulus (the rounded /u/ lowers the M1 of the fricative). If listeners interpreted the vowel as /u/, they were more likely to assume that an ambiguous M1 was due to rounding and to classify the fricative as /s/.

There has been considerable debate about the nature of the perceptual mechanisms that give rise to this effect (Nearey, 1990; Whalen, 1992; Smits, 2001a, 2001b). Critically, this debate has been informed by logistic regression models that are formally similar to the present approach. Nearey (1990) applied a logistic regression analysis to Whalen's data, comparing models in which the interaction between categories was based on a vowel×cue interaction to models in which it was captured simply by a vowel×fricative bias. The former is related to our C-CuRE model, in which a cue is interpreted relative to the vowel; the latter is more of a decision-stage process, reflecting simply that subjects are more likely to respond with particular pairs of phonemes (a diphone bias), and is consistent with independent perceptual processing but dependent decision making. Nearey (1990) found that this latter model was sufficient to account for the data, and that the addition of a vowel×cue term offered little benefit (though he did not assess it alone, so it is unclear whether a vowel×cue term can do similar work to the diphone bias).
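The logic of Nearey's comparison can be expressed as a pair of nested logistic regressions. The sketch below is our own schematic rendering, not Nearey's actual analysis; the data frame and column names (fric_resp, M1, vowel_resp) are hypothetical.

```python
# A schematic rendering (ours, not Nearey's actual analysis) of the comparison
# described above, assuming a trial-level data frame with hypothetical columns:
# fric_resp (0 = /s/, 1 = /sh/), M1 (spectral mean), vowel_resp (listener's vowel).
import pandas as pd
import statsmodels.formula.api as smf

def compare_nearey_models(df: pd.DataFrame) -> None:
    # Cue only: perception and decision fully independent of the vowel.
    cue_only = smf.logit("fric_resp ~ M1", data=df).fit(disp=False)
    # Additive vowel term: a decision-stage (diphone-like) response bias.
    diphone = smf.logit("fric_resp ~ M1 + C(vowel_resp)", data=df).fit(disp=False)
    # Vowel x cue interaction: the vowel changes how the cue is interpreted,
    # closer in spirit to parsing / C-CuRE.
    vowel_cue = smf.logit("fric_resp ~ M1 * C(vowel_resp)", data=df).fit(disp=False)

    for name, fit in [("cue only", cue_only), ("diphone bias", diphone),
                      ("vowel x cue", vowel_cue)]:
        print(f"{name:>12}  logLik = {fit.llf:9.2f}   BIC = {fit.bic:9.2f}")
```

Comparing the additive and interactive models by log-likelihood or BIC mirrors the question at issue: does the vowel merely bias the response, or does it change how the cue itself is interpreted?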
Smits (2001a, 2001b) later extended Nearey's (1990) model by allowing the vowel category to affect how the secondary feature is interpreted in multiple ways. For example, a rounded vowel could shift the position of the spectral mean boundary (for an s/ʃ judgment), its slope, or its orientation (in a 2D cue-space). He found much better fits to the data, including to a new corpus of Dutch fricatives (in which rounding and fricative place are independent). While this offers considerably more flexibility, it also requires that phoneme decisions be hierarchical, in the sense that fricative decisions are affected by the vowel but vowel decisions are not affected by the fricative. Smits' model fits support this in the fricative-vowel sequences he examined, but this may not hold up for domains like place assimilation (e.g., Gow, 2003) and vowel-to-vowel coarticulation (Cole et al., 2010), where listeners appear to work in both directions.

For both models, interactions between cues and categories are either modeled as a decision-stage (diphone bias) process or as a modification of how cues are interpreted with respect to the fricative. Though we didn't include such terms in our model, it could certainly accommodate such biases. However, C-CuRE offers considerably more power, greater integration with a body of perceptual work, and more flexible processing.

Table S11: Analysis of our listener data based on Nearey (1990; see also Smits, 2001a). Shown is the likelihood of choosing an alveolar or postalveolar as a function of vowel context and experiment. Only trials in which the stimulus was alveolar or postalveolar (and the response was one or the other) are included. Nearey (1990) showed that listeners were more likely to answer /s/ after /u/ and /ʃ/ after /i/. We find no evidence of that here.

Condition             Vowel   Stimulus Place   Response: Alveolar   Response: Postalveolar
Complete syllables     /i/    alveolar               .998                  .0025
                              postalveolar           .0013                 .999
                       /u/    alveolar               .995                  .0054
                              postalveolar           .0025                 .998
Frication-only         /i/    alveolar               .997                  .0025
                              postalveolar           .0012                 .999
                       /u/    alveolar               .995                  .0054
                              postalveolar           .0025                 .9975

Consider Nearey's diphone bias. First, it is not clear that it can account for our data. Unlike Whalen (1989), our vowels were unambiguous and equally likely with all fricatives. As a result, we didn't observe a diphone bias in our perceptual experiment (Table S11). To examine just the subset of phonemes in our studies that Nearey examined: in the context of /u/, listeners correctly identified /s/ 99.7% of the time and /ʃ/ 99.7% of the time; similarly, in the context of /i/, they identified /s/ 100% of the time and /ʃ/ 99.7% of the time (Table S7). So the parsing benefit is more likely to derive from listeners' use of the vowel to better interpret the fricative cues than from particular biased pairs. However, introducing such bias terms in a C-CuRE model could be beneficial. They could account for transition probabilities between phonemes. Likewise, by treating speaker similarly to vowels and fricatives, one could use a speaker×fricative or speaker×vowel bias term to account for many of the indexical effects on speech perception (see Pisoni & Levi, 2007, for a review). In this case, such terms could account for differences in the frequency of different phonemes produced by different speakers. Thus, diphone bias was unlikely to play a role in the parsing benefit we observed, but it is not inconsistent with our approach and may be quite useful.
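To illustrate how such a bias term could operate, the following fragment (an illustration of ours under the stated assumptions, not part of the authors' model) simply shifts the cue-derived log-odds by the log of a context-conditional prior, such as diphone transition probabilities or speaker-specific phoneme frequencies, before a response is selected.

```python
# An illustration of ours (not part of the authors' model): a decision-stage bias
# is just a shift of the cue-derived log-odds by the log of a context-conditional
# prior (e.g., diphone transition probabilities or a speaker's phoneme frequencies).
import numpy as np

def biased_response_probs(cue_logits, context_prior):
    """cue_logits: log-odds over fricative categories from the cue-based model.
       context_prior: prior probability of each category given the vowel or speaker."""
    shifted = cue_logits + np.log(context_prior)     # add the bias at the decision stage
    expd = np.exp(shifted - shifted.max())           # numerically stable softmax
    return expd / expd.sum()
```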
Next, consider Smits' use of vowel-sensitive weights on the cues to frication. Smits embeds these as part of the fricative decision. That is, if lip rounding leads to a lower spectral mean for fricatives, this information is stored as a component of the fricative decision process. However, spectral mean may also be useful for deducing speaker gender, and the coarticulatory effects of lip rounding could interfere with that as well. Under Smits' model, such effects would have to be stored as independent factors in the gender-decision model. In contrast, C-CuRE recodes the continuous cue as a function of expectations derived from the vowel (e.g., lip rounding), making this "corrected" cue value simultaneously available for all perceptual decisions. It can modify not only the location of the cue value (shifting the boundary) but also the scale (if standardized or studentized residuals, which are sensitive to the variation, are used), allowing it to capture effects on the slope as well. Thus, it can account for many of the classes of effects that Smits (2001b) models, without the redundancy of building the compensation into each phoneme decision, and with greater flexibility. More importantly, it also allows for a much more scalable model, as only first-order cue×category relationships need to be retained, and we don't need to assume any hierarchical ordering of the categories. While the difference between models may be slight when only two cues are considered (as in Nearey, 1990; Smits, 2001a, 2001b), the failure of our invariance model on even unambiguous fricatives suggests that only large numbers of cues will suffice, making scalability an important issue.

Moreover, in contrast to the Smits (2001a, 2001b) approach, C-CuRE predicts that the encoding of the continuous cues is changed by the categories (and the compensation). This is consistent with a large body of work on compensation for coarticulation. Fowler and Brown (2000), for example, showed that listeners actually perceive nasalized vowels as more oral when the nasality can be attributed to an upcoming nasal stop (see also Pardo & Fowler, 1997; Beddor, Harnsberger, & Lindemann, 2002). Critically, this was tested with a 4IAX task, which is known to be sensitive to differences in continuous cues, not categories (Pisoni & Lazarus, 1974; Gerrits & Schouten, 2004). Finally, C-CuRE can account for speaker×cue effects such as the finding that ambiguous fricatives are more likely to be heard as /s/ with a male voice and as /ʃ/ with a female voice (Strand, 1999). Such effects were not considered by the Smits (2001a, 2001b) model, although they are potentially consistent with it. However, adding such terms in that framework would require a substantially more complex hierarchy; adding them to a C-CuRE model is simple.

With respect to the Nearey (1990) and Smits (2001a, 2001b) models, the addition of bias terms could help account for things like phoneme-to-phoneme transitions or indexical effects, while our parsing approach can account for what appear to be normalization effects or context dependencies, without the need to duplicate these efforts for every phoneme. Crucially, the model also illustrates how treating these fairly distinct factors (speaker and vowel) as simply sources of variance applied to cues may bring together exemplar and normalization approaches to perception.
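The parsing operation at the heart of this argument is statistically simple. The sketch below is a minimal illustration of ours (not the authors' implementation), assuming a trial-level data frame with hypothetical columns for speaker, vowel, and each cue: every cue is regressed on the contextual categories, and the standardized residual (the cue value relative to what speaker and vowel predict) is what the category decisions then operate on.

```python
# A minimal sketch of the parsing operation (ours, not the authors' implementation),
# assuming a trial-level data frame with hypothetical columns 'speaker', 'vowel',
# and one column per cue. Each cue is regressed on the contextual categories and
# replaced by its standardized residual: the cue value relative to expectation.
import pandas as pd
import statsmodels.formula.api as smf

def parse_cue(df: pd.DataFrame, cue: str) -> pd.Series:
    fit = smf.ols(f"{cue} ~ C(speaker) + C(vowel)", data=df).fit()
    resid = fit.resid                               # cue minus speaker/vowel expectation
    return resid / resid.std()                      # one simple way to standardize

# e.g., recode a set of cues and hand the recoded values to any classifier:
# recoded = pd.DataFrame({c: parse_cue(df, c) for c in ["M1", "M2", "DURF", "F2"]})
```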
The present study offers a new way to frame the debate over the independence of phonetic identification processes. It demonstrates a concrete and quantifiable benefit to the idea of interpreting cues in light of other categories: parsing offered a 7–8% improvement over the cue-integration model. This compensation can be done without fitting a vowel×cue or vowel×fricative term to the fricative decision; one only needs to examine the direct relationships between context and cues (not their joint relation to the fricative). Given the statistical structure of a real speech corpus, there is a real benefit to a perceptual system that is sensitive to dependencies between categories (mediated through their influences on shared cues), and this benefit can be realized with simple statistical models. The classic question in this domain has been whether the perception of one phoneme category is influenced by another. Our model reframes this: is the encoding of the continuous cue dependent on a category (either phonetic or indexical)?

References

Beddor, P. S., Harnsberger, J. D., & Lindemann, S. (2002). Language-specific patterns of vowel-to-vowel coarticulation: Acoustic structures and their perceptual correlates. Journal of Phonetics, 30, 591–627.
Fowler, C. A. (1994). Invariants, specifiers, cues: An investigation of locus equations as information for place of articulation. Perception & Psychophysics, 55, 597–611.
Fowler, C., & Brown, J. (2000). Perceptual parsing of acoustic consequences of velum lowering from information for vowels. Perception & Psychophysics, 62, 21–32.
Gerrits, E., & Schouten, M. E. H. (2004). Categorical perception depends on the discrimination task. Perception & Psychophysics, 66, 363–376.
Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America, 108, 1252–1263.
Lisker, L., & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20, 384–422.
Maniwa, K., Jongman, A., & Wade, T. (2008). Perception of clear fricatives by normal-hearing and simulated hearing-impaired listeners. Journal of the Acoustical Society of America, 123, 1114–1125.
McMurray, B., Clayards, M., Tanenhaus, M., & Aslin, R. (2008). Tracking the timecourse of phonetic cue integration during spoken word recognition. Psychonomic Bulletin & Review, 15, 1064–1071.
Mermelstein, P. (1978). On the relationship between vowel and consonant identification when cued by the same acoustic information. Perception & Psychophysics, 23, 331–336.
Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338–352.
Nearey, T. M. (1990). The segment as a unit of speech perception. Journal of Phonetics, 18, 347–373.
Newman, R., Clouse, S., & Burnham, J. (2001). The perceptual consequences of within-talker variability in fricative production. Journal of the Acoustical Society of America, 109, 1181–1196.
Nittrouer, S., Studdert-Kennedy, M., & McGowan, R. S. (1989). The emergence of phonetic segments: Evidence from the spectral structure of fricative-vowel syllables spoken by children and adults. Journal of Speech and Hearing Research, 32, 102–132.
Oden, G. (1978). Integration of place and voicing information in the identification of synthetic stop consonants. Journal of Phonetics, 6, 82–93.
Pardo, J. S., & Fowler, C. A. (1997). Perceiving the causes of coarticulatory acoustic variation: Consonant voicing and vowel pitch. Perception & Psychophysics, 59, 1141–1152.
Pisoni, D. B., & Lazarus, J. H. (1974). Categorical and noncategorical modes of speech perception along the voicing continuum. Journal of the Acoustical Society of America, 55, 328–333.
Pisoni, D. B., & Levi, S. (2007). Representations and representational specificity in speech perception and spoken word recognition. In M. G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 3–18). Oxford, England: Oxford University Press.
Pisoni, D. B., & Sawusch, J. R. (1974). On the identification of place and voicing features in synthetic stop consonants. Journal of Phonetics, 2, 181–194.
Smits, R. (2001a). Evidence for hierarchical categorization of coarticulated phonemes. Journal of Experimental Psychology: Human Perception and Performance, 27, 1145–1162.
Smits, R. (2001b). Hierarchical categorization of coarticulated phonemes: A theoretical analysis. Perception & Psychophysics, 63, 1109–1139.
Strand, E. (1999). Uncovering the role of gender stereotypes in speech perception. Journal of Language and Social Psychology, 18, 86–100.
Sussman, H., Fruchter, D., Hilbert, J., & Sirosh, J. (1998). Linear correlates in the speech signal: The orderly output constraint. Behavioral and Brain Sciences, 21, 241–299.
Sussman, H. M., & Shore, J. (1996). Locus equations as phonetic descriptors of consonantal place of articulation. Perception & Psychophysics, 58, 936–946.
Whalen, D. (1989). Vowel and consonant judgments are not independent when cued by the same information. Perception & Psychophysics, 46, 284–292.
Whalen, D. (1992). Perception of overlapping segments: Thoughts on Nearey's model. Journal of Phonetics, 20, 493–496.