Supplemental Materials
McMurray, B., & Jongman, A. (2011). What information is necessary for speech categorization? Harnessing variability in the
speech signal by integrating cues computed relative to expectations. Psychological Review. doi:10.1037/a0022325
Note 1: Phonetic Analysis
Jongman, Wayland, and Wong (2000) report extensive analyses on the measures from their database that
we use here as the basis of our models. They show that, individually, each cue differed as a function of
place, sibilance and/or voicing, and most of these cues also differed as a function of the vowel context
and/or the gender of the speaker. However, no attempt was made to compare the amount of variance in
each cue due to each factor (although partial η² was reported for many comparisons). Moreover, such an
analysis has not been conducted for any of the new cues we measured here. Thus, we evaluated 1)
which cues contribute to each categorical distinction; and 2) the contributions of contextual factors
(speaker and vowel). This was done with a series of regression analyses that provide a standard effect
size measure that can be compared across cues and effects. Crucially, we also used these analyses to
highlight and explore the contributions of the newly proposed cues.
In each analysis, a single cue was the dependent variable, and the independent variables were a
combination of dummy codes reflecting a single factor of interest, such as fricative identity (7
variables), voicing (1 variable), sibilance (1 variable), or place of articulation (3 variables). In each
regression, we first partialed out the effect of speaker (19 dummy codes) and vowel (5 dummy codes),
before entering the effect of interest into the model. These regression analyses are necessarily
exploratory, and we do not
intend to draw broad conclusions from them. They are intended to provide an overall view of these cues and the factors that contribute to their variance.
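To make the procedure concrete, the following minimal Python sketch (not the authors' code) shows the hierarchical R²change logic for a single cue; it assumes the speaker, vowel, and fricative factors have already been dummy-coded into numeric arrays, and all variable names are illustrative.

```python
import numpy as np

def r_squared(y, X):
    """R^2 of an ordinary least-squares fit of y on X (X includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

def r2_change(y, context_dummies, factor_dummies):
    """R^2-change for a factor of interest after speaker and vowel are partialed out."""
    intercept = np.ones((len(y), 1))
    X_context = np.hstack([intercept, context_dummies])   # step 1: speaker (19) + vowel (5) codes
    X_full = np.hstack([X_context, factor_dummies])       # step 2: add the factor of interest
    return r_squared(y, X_full) - r_squared(y, X_context)

# e.g., r2_change(m1_values, np.hstack([speaker_codes, vowel_codes]), fricative_codes)
```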
Results and Discussion
The results of the regression analyses are summarized in Table S1 (which shows the overall effects of fricative and context) and Table S2 (which shows the specific effects of each feature). There are a number of important results worth highlighting.

Table S1: Summary of regression analyses examining effects of speaker (20), vowel (6) and fricative (8) for each cue. Shown are R²change values. Missing values (–) were not significant at the p<.05 level.

Cue        Speaker         Vowel          Fricative Identity   Unexplained
           (df=19,2860)    (df=5,2855)    (df=7,2848)          Variance
MaxPF      0.084*          –              0.493*               .423
DURF       0.158*          0.021*         0.469*               .353
DURV       0.475*          0.316*         0.060*               .149
RMSF       0.081*          –              0.657*               .260
RMSV       0.570*          0.043*         0.011*               .376
F3AMPF     0.070*          0.028*         0.483*               .419
F3AMPV     0.140*          0.156*         0.076*               .628
F5AMPF     0.077*          0.012*         0.460*               .451
F5AMPV     0.203*          0.040*         0.046*               .712
LF         0.117*          0.004+         0.607*               .272
F0         0.838*          0.007*         0.023*               .132
F1         0.064*          0.603*         0.082*               .251
F2         0.109*          0.514*         0.119*               .257
F3         0.341*          0.128*         0.054*               .477
F4         0.428*          0.050*         0.121*               .401
F5         0.294*          0.045*         0.117*               .544
M1         0.122*          –              0.425*               .451
M2         0.036*          –              0.678*               .284
M3         0.064*          –              0.387*               .548
M4         0.031*          –              0.262*               .704
M1trans    0.066*          0.043*         0.430*               .461
M2trans    0.084*          0.061*         0.164*               .692
M3trans    0.029*          0.079*         0.403*               .490
M4trans    0.031*          0.069*         0.192*               .709
+ p<.05   * p<.0001

Fricative Identity.
Every cue was affected by fricative identity. While effect sizes ranged from very large (10/24 cues had R²change > .40) to very small (vowel RMS, the smallest: R²change = .011), all were highly significant. Even cues that were originally measured to compensate for variance in other cues (e.g., vowel duration was measured to normalize fricative duration) had significant effects. Spectral moments (especially the mean and variance) were particularly
important, but, surprisingly, F2 (which has received substantial attention in the literature) had only a moderate effect (R²change=.119). Interestingly, its effect size was similar to those for F4 (R²change=.121) and F5 (R²change=.117), two cues which have not been previously examined.
Some cues could clearly be attributed to one feature more than to another, although no cue was associated with only a single feature. Duration cues were clearly related to voicing (DURF: R²change=.403; DURV: R²change=.055), not place of articulation (DURF: R²change=.052; DURV: R²change=.004) or sibilance (DURF: R²change=.052; DURV: R²change=.002). The same was true for low-frequency energy (voicing: R²change=.482), although this cue may also be involved in sibilance detection (R²change=.120).
Other cues were clearly about sibilance. RMSF and F5AMPF were highly correlated with sibilance (RMSF: R²change=.419; F5AMPF: R²change=.394). They were also correlated with place of articulation (RMSF: R²change=.425; F5AMPF: R²change=.401), but when place was considered separately in sibilants and nonsibilants, little effect was seen in either, suggesting that these cues primarily signal sibilance. However, several other cues that were strongly associated with sibilance were also related to place of articulation. MaxPF and F3AMPF, for example, were strongly associated with sibilance (MaxPF: R²change=.260; F3AMPF: R²change=.239), but within sibilants they were also useful for distinguishing alveolars and postalveolars (MaxPF: R²change=.504; F3AMPF: R²change=.444). Thus, these cues seem to be available to make two independent distinctions (sibilance in general, and place of articulation within sibilants).
Of the formant frequencies, F2, F4 and F5 had moderate effects that were primarily limited to place of articulation (F2: R²change=.114; F4: R²change=.119; F5: R²change=.116), and these cues appeared to be similarly useful for both sibilants (F2: R²change=.060; F4: R²change=.132; F5: R²change=.101) and nonsibilants (F2: R²change=.057; F4: R²change=.083; F5: R²change=.082).

Table S2: Summary of more fine-grained analyses of each cue. Shown is R²change after speaker and vowel have been partialed out of the dataset (these have the same R² values as in Table S1). Missing values (–) were not significant at p<.05. The "Overall" column is the effect of fricative identity in general (8 categories). Other columns show specific features: sibilance, voicing and place of articulation (4 categories). The effect of place of articulation for sibilants and for non-sibilants was computed on only that subset of the data; all other effects reflect analysis of the entire dataset.

Cue        Overall        Sibilance      Voicing        Place of Articulation
           (df=7,2848)    (df=1,2854)    (df=1,2854)    Overall         Non-Sibilants   Sibilants
                                                        (df=3,2852)     (df=1,1414)     (df=1,1414)
MaxPF      0.493*         0.260*         0.004+         0.483*          0.006+          0.504*
DURF       0.469*         0.052*         0.403*         0.052*          –               0.004*
DURV       0.060*         0.002*         0.055*         0.004*          –               –
RMSF       0.657*         0.419*         0.180*         0.425*          0.004+          0.025*
RMSV       0.011*         0.001+         0.002*         0.004*          0.003+          0.001+
F3AMPF     0.483*         0.239*         0.002+         0.450*          –               0.444*
F3AMPV     0.076*         0.008*         0.017*         0.056*          0.043*          0.057*
F5AMPF     0.460*         0.394*         0.024*         0.401*          –               0.020*
F5AMPV     0.046*         0.029*         0.003+         0.038*          0.014*          0.005+
LF         0.607*         0.120*         0.482*         0.124*          0.003+          0.005+
F0         0.023*         0.001*         0.021*         0.001*          0.001+          –
F1         0.082*         0.031*         0.045*         0.036*          0.001+          0.012*
F2         0.119*         0.060*         0.002+         0.114*          0.057*          0.060*
F3         0.054*         0.034*         0.002+         0.050*          –               0.038*
F4         0.121*         0.005*         –              0.119*          0.083*          0.132*
F5         0.117*         0.026*         –              0.116*          0.082*          0.101*
M1         0.425*         0.010*         0.085*         0.269*          –               0.552*
M2         0.678*         0.441*         0.105*         0.494*          0.015*          0.335*
M3         0.387*         0.137*         0.017*         0.304*          –               0.369*
M4         0.262*         0.022*         –              0.159*          0.021*          0.189*
M1trans    0.430*         0.163*         0.158*         0.227*          0.026*          0.104*
M2trans    0.164*         0.012*         0.022*         0.072*          0.038*          0.100*
M3trans    0.403*         0.193*         0.124*         0.253*          0.061*          0.097*
M4trans    0.192*         0.052*         0.035*         0.106*          0.062*          0.504*
+ p<.05   * p<.0001
Separate analyses examined place of articulation in sibilants (alveolar vs. postalveolar) and nonsibilants (labiodentals vs. interdentals) (Table S2). While there was a wealth of cues that were highly sensitive to place of articulation in sibilants, there were few that were related to place in nonsibilants, and these showed only moderate to low effect sizes. Of these, the best were F4 (R²change=.083) and F5 (R²change=.082) (two new cues for nonsibilants) and the skewness and kurtosis during the transition (M3trans: R²change=.061; M4trans: R²change=.062). As shown in Table S1, all of these cues are also highly context-dependent (F4: R²=.478; F5: R²=.339; M3trans: R²=.108; M4trans: R²=.100), suggesting that to take advantage of what little information there is for nonsibilants, listeners may need a compensatory mechanism.
Context Effects.
Contextual factors (speaker and vowel) accounted for a significant portion of the variance in every cue. Not surprisingly, speaker and vowel accounted for a massive amount of variance in cues like vowel duration (R²=.792) and vowel amplitude (R²=.612), which were measured to capture some of the contextual variance. F0 and all five formants were also highly related to context, with average effect sizes in the 40–60% range. For F1 and F2, this was largely due to the vowel (F1: R²change=.603; F2: R²change=.514), while for the other formants it was largely due to the speaker. Most other cues showed much smaller effects, in the 8–10% range.
Unexplained Variance.
Finally, as Table S1 shows, there was a substantial amount of unexplained variance in each cue. For seven of these cues (F3AMPV, F5AMPV, F5, M3, M4, M2trans, and M4trans) there was more unexplained variance than the combined variance accounted for by speaker, vowel and fricative. For the others, it was still substantial. Even some of the best (near-invariant) cues showed large amounts of unexplained variance, for example MaxPF (42.3%), M2 (28.4%), M3 (54.8%) and M4 (70.4%). Of course, the presence of unaccounted-for variance is common in regression analyses, but it raises an interesting question here. This variance means that even fricatives that were spoken by the same person, in the same sentence context, and in the same recording session (e.g., repetitions 1, 2, and 3) differed substantially in their acoustic realization. Across this data set, the only factors that systematically varied were speaker, vowel, and fricative identity, and these effects have been accounted for. Thus, the unexplained variance suggests that a large portion of the variance in speech cues may actually be due to random factors, that is, noise (see Newman, Clouse, & Burnham, 2001), that the listener must deal with. It does not appear that we could account for all of the variance in these cues even if we knew all of the relevant factors.
Discussion.
There are several candidates for primary cues to place of articulation or sibilance, though none is completely invariant to context, and it is not clear that any are strongly related to place of articulation in nonsibilants. There were also no unique, contextually invariant cues to voicing.
Fricative identity as well as vowel and speaker affect virtually every cue we studied. This may present problems for models, like exemplar models, that conflate these sources of information. At the same time, some cues are likely to be more informative than others: peak frequency, the narrow-band amplitudes, and the spectral moments are strongly related to place of articulation; RMSF, the narrow-band amplitudes and M2 are strongly related to sibilance; and DURF and LF are strongly related to voicing (though they were also strongly affected by context). Most of the place cues were helpful with sibilants, and nonsibilants showed only weak relationships with primarily context-dependent cues. Finally, we found only a handful of cues that come close to invariance: MaxPF, the narrow-band amplitudes in the fricative, and spectral moments 2–4 seemed somewhat context-independent, and cued various aspects of sibilance and place of articulation. As a whole, this strongly reinforces the notion that fricative identification requires the integration of many cues, and that there are few, if any, cues that are invariant with respect to context and other factors.
Note 2: Analysis of Perceptual Data
The primary analysis used generalized estimating equations with a logistic linking function to approximate a mixed-design ANOVA with a binary dependent variable. The model included syllable-type as a between-subjects variable, along with vowel, speaker, place of articulation and voicing as repeated measures. Vowel and speaker were included in the model as main effects only. Accuracy was the dependent variable.
This analysis was fully reported in the paper, but follow-up analyses were also run separating the data by syllable-type in order to understand the two-way interactions of place and voicing with syllable-type. Each analysis included place, voicing and their interaction as primary factors while also including independent (noninteracting) effects of vowel and speaker.
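As a rough illustration of this kind of model (not the authors' code), the sketch below fits a comparable GEE with a logistic link in Python using statsmodels; the data frame and its column names (correct, subject, syllable_type, place, voicing, vowel, speaker) are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per trial; all column names here are assumptions about how the data might be coded.
trials = pd.read_csv("fricative_identification_trials.csv")

# Binary accuracy with a logistic link; trials are clustered within subject, which the
# exchangeable working correlation treats as the repeated-measures structure.
model = smf.gee(
    "correct ~ C(syllable_type) * C(place) * C(voicing) + C(vowel) + C(speaker)",
    groups="subject",
    data=trials,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())  # per-coefficient z tests are Wald tests; omnibus factor tests need contrasts
```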
In the complete-syllable condition we found a significant main effect of speaker (Wald χ²(9)=135.5, p<.0001). Z was not significant individually (Wald χ²(1)=2.0, p=.156; Figure S1C).
In the fricative-noise condition, speaker was still significant (Wald χ²(9)=196.9, p<.0001), but vowel was no longer significant (Wald χ²(2)=.9, p=.6; Figure S1B). This implies that the vowel effect seen in the complete-syllable condition was not due to the fact that the particular frication produced before an /i/ (or any other vowel) was more (or less) ambiguous. Heard alone, there was no effect of vowel. Rather, the vowel contributes something beyond simply altering the cues in the frication. As before, place of articulation was significant (Wald χ²(3)=189.8, p<.0001; Figure S1A), with all three places differing significantly from postalveolars (labiodentals: Wald χ²(1)=38.1, p<.0001; interdentals: Wald χ²(3)=65.0, p<.0001; alveolars: Wald χ²(1)=10.9, p=.001). This time, voicing was significant (Wald χ²(1)=15.0, p<.0001), with better performance on voiceless than voiced sounds. The voicing × place interaction was also significant (Wald χ²(1)=34.5, p<.0001; Figure S1D), due to a significant effect of voicing in interdentals (Wald χ²(1)=12.0, p=.001), but not in labiodentals (Wald χ²(1)=.4, p=.5) or alveolars (Wald χ²(1)=.5, p=.48).

Figure S1: Listeners' performance (proportion correct). A) Performance on each of the eight fricatives as a function of condition. B) Performance across fricatives as a function of vowel and condition. C) Performance as a function of place of articulation and voicing for the complete-syllable condition. D) The same for the noise-only condition.
To summarize our findings, we found that 1) performance without the vocalic portion was
substantially worse than with it; 2) performance varied substantially across speakers; 3) sibilants were
easier to identify than nonsibilants but there were place differences even within the sibilants; 4) voicing
effects were largely restricted to the interdental fricatives; and 5) the identity of the vowel affected
performance, but only in the complete-syllable condition. Thus, either particular vowels alter the
secondary cues in the vocalic portion that mislead (or help) listeners, or the identity of the vowel causes
subjects to treat the cues in the frication noise differently. Most likely it is the latter—the lip rounding
created by /u/ has a particularly strong effect on the frication, and listeners’ ability to identify the vowel
(and thus account for these effects) may thus offer a large benefit for fricatives preceding a /u/ that is not
seen for the unrounded vowels.
Note 3: Confusion Matrices
While our primary analysis of the empirical data (and the model) focused on the overall accuracy as a
function of fricative, vowel and speaker, listeners’ (and models’) responses were not dichotomous.
Rather, listeners (and models) selected which of the eight fricatives was their response for each stimulus.
In this section, we present the confusion matrices (the likelihood of responding with a given fricative
given the one that was heard) as an alternative metric for evaluating the experiment and models. This
necessarily ignores the context effects, but it paints a parallel picture to the analysis presented in the manuscript: the compensation / C-CuRE model performs like listeners in the complete-syllable condition, and the cue-integration model performs like listeners in the frication-only condition.
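For reference, here is a small sketch (not from the original materials) of how such a confusion matrix can be tallied, with columns for the fricative presented and rows for the fricative reported, so that each column sums to roughly 100%.

```python
import numpy as np

def confusion_matrix_pct(presented, responded, labels):
    """Percent of responses (rows) for each presented fricative (columns)."""
    counts = np.zeros((len(labels), len(labels)))
    index = {lab: i for i, lab in enumerate(labels)}
    for p, r in zip(presented, responded):
        counts[index[r], index[p]] += 1
    return 100.0 * counts / counts.sum(axis=0, keepdims=True)

# fricatives = ["f", "v", "ɵ", "ð", "s", "z", "ʃ", "ʒ"]
# cm = confusion_matrix_pct(presented_trials, response_trials, fricatives)
# np.diag(cm) then gives percent correct per fricative (the diagonals of Tables S3-S6).
```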
3.1 Listener Data
Table S3 shows confusion matrices for each condition in the perceptual experiment. In the complete-syllable condition, listeners were accurate overall (M=91.2%), particularly on the sibilants (M=97.4%). The only systematic confusions were for nonsibilants, and within these, participants typically chose the wrong place of articulation but maintained voicing. For example, when /f/ was miscategorized, /v/ was
Table S3: Confusion matrices for each condition of the perceptual experiment. Shaded cells
represent a response rate of greater than 5%.
Complete syllables
Fricative Presented
Fricative Responded
f
v
ɵ
ð
s
z
ʃ
ʒ
f
88.2
0.8
8.2
0.3
0.1
v
1.0
89.9
0.1
17.0
0.2
ɵ
10.4
0.5
87.3
7.5
2.3
0.3
0.1
0.1
ð
0.2
8.5
3.4
74.4
0.7
2.5
0.1
0.1
s
0.3
z
ʃ
ʒ
0.3
1.0
0.1
94.1
0.1
0.2
0.5
2.6
96.4
0.2
0.2
0.2
99.4
0.1
0.5
0.3
99.6
%correct
88.2
89.9
87.3
74.4
94.1
96.4
99.4
99.6
Frication Noise Only
Fricative Presented
Fricative Responded
f
v
ɵ
ð
s
z
ʃ
ʒ
f
71.8
8.1
21.3
3.2
0.4
0.1
0.3
v
0.9
72.3
0.8
50.6
0.1
0.6
0.1
ɵ
23.0
5.4
62.4
12.0
1.4
0.4
0.6
ð
3.6
13.2
11.5
30.9
0.3
1.5
0.3
0.7
s
0.5
0.1
3.7
0.3
92.9
8.1
0.2
0.2
z
0.8
0.3
2.5
3.0
87.6
0.4
ʃ
0.1
ʒ
0.2
0.1
0.1
1.9
0.1
97.4
3.1
0.5
0.2
1.6
1.3
95.6
%correct
71.8
72.3
62.4
30.9
92.9
87.6
97.4
95.6
selected 1% of the time, but /ɵ/ 10.4% of the time. Similarly, /v/ was confused with /ð/ 8.5% of the time, but with /f/ only 0.8% of the time. The only exception was /ð/, the most difficult fricative (M=74.4%), which showed confusions for both voicing (/ɵ/: 7.5%) and place (/v/: 17.0%).
In the frication-only condition, performance dropped substantially (M=76.3%), but the overall pattern remained. For nonsibilants, the majority of confusions were in terms of place of articulation (M=27.0% across all four), though there were also more voicing confusions than in the complete-syllable condition (M=8.1%). The error rate for sibilants was higher than in the complete-syllable condition, but errors were still few, and they slightly favored voicing (M=3.9%) over place (M=1.0%).
Across both conditions, confusions respected sibilance. Nonsibilants tended to be confused with other nonsibilants (complete-syllable: M=14.5%; frication-only: M=38.4%) rather than with sibilants (complete-syllable: M=0.6%; frication-only: M=2.2%); and while sibilants were rarely confused in the complete-syllable condition, in the noise-only condition they tended to be confused with other sibilants (M=5%) and were rarely labeled as nonsibilants (M=1.7%).
3.2 Invariance Model
The confusion matrix for this model (Table S4) shows some similarities to listeners but also major
differences. Like listeners, this model was more likely to confuse place of articulation than voicing.
However, unlike listeners, /f/ was classified as /v/ 4.8% of the time (listeners: M=1.0%). Also, like
listeners, the model’s errors tended to respect sibilance, yet /f/ and /ð/ were exceptions to this. More
surprisingly, all the sibilants showed a small but noticeable rate of confusion with nonsibilants.
Finally, there were also a number of confusions that did not seem to resemble listeners at all. /ð/
was classified as /f/ and /z/ at very high rates (5.2% and 5.4%, respectively) compared with listeners
(.3% and .5%). Conversely, /ð/ was rarely classified as /ɵ/ (1.7%) while that was common for listeners
(7.5%). Finally, when listeners heard /s/, virtually all errors were to /z/ while the model’s errors were
evenly distributed across all three other sibilants. Thus, the pattern of errors in the invariance model
does not seem well correlated with those of listeners.
3.3 Cue-Integration Model
Our analysis of the confusion data for the cue-integration model revealed a closer match to listeners, particularly for the frication-only condition (Table S5). In the complete-syllable condition, errors on nonsibilants tended to respect voicing (like listeners, and unlike the invariance model), and nonsibilants were rarely classified as sibilants. Within sibilants, the model's errors better reflected listeners': /s/ was
Table S4: Confusion matrix of the invariance model based on the probabilistic decision rule.
Fricative Presented
Fricative Responded
f
v
ɵ
ð
s
z
ʃ
ʒ
f
62.6
1.8
30.8
5.2
.6
.2
1.2
v
4.8
61.8
.9
29.1
.2
.2
.1
.8
ɵ
23.2
.7
62.4
1.7
1.1
.5
.3
ð
3.7
35.7
2.4
58.2
.3
.2
.6
s
.5
z
4.9
3.4
.1
87.1
2.9
5.1
5.4
3.3
92.0
.2
1.7
.1
7.1
ʒ
%
Correct
88.4
.2
4.1
3.9
4.2
62.6
61.7
62.4
58.2
87.1
92.0
88.4
4.7
85.6
85.5
ʃ
.2
3.2
Table S5: Confusion matrix of the cue-integration model with complete syllables (top) and frication-only
(bottom) based on the probabilistic decision rule.
A. Complete-Syllable
Fricative Presented
Fricative Responded
f
v
ɵ
ð
s
z
ʃ
ʒ
f
v
ɵ
ð
s
77.3
1.3
32.9
4.6
.5
.4
3.6
3.6
71.3
0.5
25.9
.1
1.7
.1
14.5
.6
62.0
3.4
.8
.2
.5
1.9
26.8
2.9
63.0
.4
2.2
1.7
2.6
.4
z
ʃ
ʒ
%correct
88.4
.1
2.3
1.5
62.6
61.7
62.4
58.2
87.1
92.0
88.4
4.0
89.9
85.5
ʃ
ʒ
%correct
54.0
54.9
59.8
47.1
81.9
83.0
91.0
85.9
.1
1.8
92.0
3.3
4.2
.7
3.2
3.6
89.9
4.9
2.4
B. Frication-Only
Fricative Presented
Fricative Responded
f
v
ɵ
ð
s
z
ʃ
ʒ
f
v
ɵ
ð
55.1
4.1
38.8
5.9
0.5
0.5
1.7
4.0
55.4
1.0
41.6
36.4
1.8
55.7
3.3
0.8
1.2
1.1
4.5
38.7
1.7
47.6
0.1
0.7
0.1
1.4
1.2
0.0
1.0
s
2.8
z
85.8
3.9
3.0
.1
1.5
6.0
91.7
0.2
87.4
2.3
0.8
6.6
0.0
7.6
1.0
89.0
4.5
most frequently identified as /z/ and then /ʃ/ by both this model and listeners (but not the invariance model). There were still differences. For listeners, /ʃ/ and /ʒ/ were the best identified tokens, while for the cue-integration model it was /z/. There were also a few odd confusions between sibilants and nonsibilants that the listeners did not display (/z/ classified as /ð/; /ʃ/ as /f/; and /f/ as /s/).
An analysis of the confusion data for the frication-only condition (Table S5B) revealed similar results: the model captured the broad pattern of results, with a less close fit on the specifics. Like listeners, errors generally stayed within sibilance class. Within nonsibilants, errors tended to be by place of articulation (not voicing), and for sibilants there were more errors on voicing than on place of articulation.
There were some differences in the details. Listeners made substantial errors on voicing for nonsibilants, hearing /v/ as /f/ 8.1% of the time, and /ɵ/ as /ð/ 11.5% of the time. Yet the model rarely made such errors, classifying /f/ as /v/ only 4.0% of the time and /ɵ/ as /ð/ only 1.6% of the time. This was despite somewhat worse performance on the nonsibilants overall (model: 53.9%; listeners: 59.4%)—the model simply made more place errors.
3.4 Compensation / C-CuRE Model
The analysis of the confusion data for the compensation / C-CuRE model yielded the closest match to
Table S6: Confusion matrix of the C-CuRE model with complete syllables.
Fricative Presented
Fricative Responded
f
v
ɵ
ð
s
z
ʃ
ʒ
f
v
ɵ
ð
78.5
1.0
20.9
.2
2.2
74.4
.9
19.2
17.5
.6
76.0
2.5
1.7
24.0
2.1
78.1
.1
97.1
.8
.9
.1
.4
s
.1
z
1.2
98.6
2.6
ʃ
ʒ
%correct
97.1
.2
.6
1.8
78.5
74.4
76.0
78.1
97.1
98.6
97.1
.6
96.4
96.4
1.4
listener performance (Table S6). The C-CuRE model never confused a nonsibilant with a sibilant, and
was highly unlikely to classify a sibilant as a nonsibilant. Its errors on voicing were minimal, with the
exception of /s/ being classified as /z/ (a pattern the listeners also showed). In short, there were no large
errors that were not also shown by the listeners, and the large errors that the listeners made were
reflected here.
Note 4: The Effect of Mis-parsing
The C-CuRE simulations make the assumption that listeners can perfectly identify the speaker and
vowel as a precursor to compensation. This assumption was made in order to capture the upper limit of
the information available in a set of compensated cues and it was clear that this was sufficient to predict
listener performance overall. However, it is not clear how robust the C-CuRE parsing mechanism is to
error (or rather, how robust the categorization architecture is to mis-parsed cues). The present
simulations are intended to address that.
In these simulations, we selected a probability that the system would misidentify the speaker, and
an independent probability that it misidentified the vowel. For each token in the set of fricatives in the
perceptual experiment, we then randomly selected (based on that probability) whether it would be
parsed correctly or incorrectly for speaker, and independently for vowel. If it was incorrect (on either
speaker or vowel), a random speaker (or vowel) was chosen and the cue values were parsed as if this
was the speaker or vowel that was identified. This yielded a new data set of residuals in which some
proportion of the cues had been mis-parsed. This was used as input to the original C-CuRE model
(including the original estimated weights which were not retrained on the mis-parsed data) to determine
how well it would perform under these circumstances. Given that the mis-parsing was random we
simulated multiple runs of the model, although it was clear after the first few that with this many tokens
(240) only a handful were needed to estimate overall performance.
Initial simulations suggested that the effects of mis-parsing the vowel and the speaker were additive. Increasing the likelihood of either resulted in a fixed decrement in performance, and the effect of mis-parsing both was approximately the sum of the two. Thus, for the simulations reported here, we covaried the likelihood of making a speaker and a vowel error in steps of 5%, ranging from 0 (the original C-CuRE model) to 50% (half the trials were mis-parsed). At each step, four simulations were run, and we recorded the proportion correct as a function of the fricative, the speaker and the vowel.
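The corruption step can be sketched as follows (a simplified illustration, not the authors' code); it stands in for the regression-based parsing by subtracting additive speaker and vowel means, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def parse_with_errors(cue, speaker, vowel, speaker_means, vowel_means, grand_mean,
                      p_speaker_err, p_vowel_err):
    """Residualize one cue value against a (possibly misidentified) speaker and vowel."""
    n_speakers, n_vowels = len(speaker_means), len(vowel_means)
    if rng.random() < p_speaker_err:            # wrong speaker chosen completely at random
        speaker = rng.integers(n_speakers)
    if rng.random() < p_vowel_err:              # wrong vowel chosen completely at random
        vowel = rng.integers(n_vowels)
    expected = grand_mean + (speaker_means[speaker] - grand_mean) + (vowel_means[vowel] - grand_mean)
    return cue - expected                        # the mis-parsed residual fed to the classifier

# The residuals for all 24 cues on each token would then be passed to the original
# (untrained-on-errors) C-CuRE categorization weights, and accuracy recorded per run.
```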
Results
Figure S2 shows the model's accuracy (the gray range, bounded by the discrete-choice and probabilistic rules) as a function of the likelihood of misidentifying both speaker and vowel. For reference, the performance range of the cue-integration model is represented by the two lines, and that of the original (perfect) C-CuRE model is at 0 on the x-axis. There is a clear effect of mis-parsing on performance, with a dip of about 2.5% for each 10% increase in the likelihood of mis-parsing. However, it is not until the likelihood of an error on either factor reaches about 30–35% that the C-CuRE model's performance drops to the level of the cue-integration model. Normal-hearing listeners are far better than 30% at identifying unambiguous vowels and speakers, suggesting that at the range of performance likely shown by listeners the C-CuRE model will still be the superior categorizer.

Figure S2: Effect of the proportion of trials on which speaker or vowel is mis-identified on performance of the parsing model. The gray range is the parsing model's performance, bounded by the discrete-choice rule (top) and the probabilistic rule (bottom). Thick lines represent the same for the cue-integration model.
Figure S3: Performance of the C-CuRE model when identification of speaker and vowel is imperfect. A) Overall performance (probabilistic rule) as a function of speaker and misidentification rate. B) Overall performance (probabilistic rule) as a function of vowel and misidentification rate.
Moreover, even with a noisy parser, the C-CuRE model still shows many of the same context effects that were diagnostic of human performance (and that the cue-integration and invariance models were unable to show). While its performance as a function of speaker is lower overall when parsing is noisy, it is still robustly correlated with the listener data (Figure S3A), with average correlations of R=.43 across misidentification rates of 5–15%, R=.29 at 20–30% and R=.24 at 35–50% (for comparison, the cue-integration model had a correlation of R=-.01). More dramatically, the effect of vowel context showed the same pattern as listeners (i<u<ɑ), and this effect remained even at misidentification rates of 40 and 50% (Figure S3B).
Discussion
While there were clear performance decrements in the compensation / C-CuRE model when speaker and
vowel identification were not perfect, it is clear that this model remains superior to the cue-integration
model. Overall performance of the two does not converge until we assume that the C-CuRE model
misidentified both speaker and vowel at least 30% of the time, which is far worse than listeners are
likely to be. The correlations between the model and the listener data across speakers are robust even
out to 50%; and the correct effect of vowel context can be seen at all rates of misidentification tested.
Moreover, we made two simplifying assumptions in this simulation that may swing the
pendulum too far in the other direction: this task is more challenging than what would likely be faced by
real listeners or a scaled-up model. First, when a misidentification occurred, errors were completely random. For example, when the model misidentified a male speaker it was just as likely to select a male as a female, despite the fact that within-gender pairs of voices are likely to be significantly more similar than across-gender pairs. If the model's confusions reflected real similarity data between speakers and between vowels, the effect of misparsing would likely be much less deleterious. Second, the
categorization component of this model used the original coefficients of the C-CuRE model, a model
that was trained on perfectly parsed data. However, if we trained the model on data that had been
occasionally misparsed, it may have developed a subtly different parameter-set that would be more
effective at coping with poorly compensated cues.
Nonetheless, despite these simplifying assumptions it is clear that by all of our criteria, the C-CuRE model is superior to the cue-integration model, even when parsing is substantially imperfect.
Note 5: Parsing by Coarse-Grained Categories
In Note 4 we challenged the simplifying assumption of the C-CuRE model that the speaker and vowel
can be accurately identified for every token. These simulations showed that even when parsing got the
wrong speaker or vowel 25% of the time it outperformed the cue-integration model. An alternative
approach to this issue is to make the categorization of context easier, such that near-ceiling
identification performance is likely. For example, if the model categorized speakers as male or female,
instead of identifying individuals, it is much less problematic to assume that this can be done perfectly
as listeners are substantially less likely to err in such judgments. The flip side of this, however, is that
partialing out coarser-grained sources of information like gender may not be as informative as partialing
out fine-grained sources of variance like individual speakers. Thus, in this simulation we test this
hypothesis, limiting the C-CuRE model to parse out speaker gender, and only two properties of the
vowel (height and backness), and evaluating its performance.
This simulation also indirectly addresses the issue of the number of parameters in the model. We
describe in Section 3.2.3 how the linear regressions used for parsing introduce extra parameters into the
complete model, specifically, 19 parameters to account for subject, 5 to account for vowel, and an
intercept (25 parameters total for each of the 24 cues). As we describe, we did not count these
parameters against the C-CuRE model when computing BIC for model evaluation as these were not
truly degrees of freedom for the categorization model—they were not available to the optimizer when
the categories were acquired. Nonetheless, it would be useful to determine if successful parsing could
be implemented using regressions with many fewer parameters. In this case, each regression only has 4
parameters (for each of the 24 cues), a substantial decrease in the number of parameters in the C-CuRE
model, from 25*24=600 parameters (for parsing alone) to 96. Thus, the success of this model could
suggest a route to maintain parsing but in a model with fewer parameters.
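For reference, BIC penalizes model fit by the number of parameters; a minimal sketch (with the parameter counts taken from the text above) is shown below.

```python
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; smaller values indicate a better trade-off."""
    return n_params * np.log(n_obs) - 2.0 * log_likelihood

# Parsing with speaker + vowel uses 25 parameters per cue (19 + 5 + intercept), or 600 across
# the 24 cues; parsing with gender + height/backness uses 4 per cue, or 96 in total.
```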
Methods
Similarly to the full C-CuRE model, we started by running 24 individual linear regressions. These were
run hierarchically first partialing out the gender of the speaker, and then adding two variables. The first
was dichotomous indicating vowel backness; the second had three values (–1, 0, 1) and indicated height.
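A minimal sketch of one such regression (illustrative, not the authors' code): each cue is residualized against an intercept, a gender code, a backness code, and the three-valued height code, for four parameters per cue.

```python
import numpy as np

def limited_parse(cue, is_male, is_back, height):
    """Residualize one cue against gender and two vowel features (4 parameters per cue).
    is_male and is_back are 0/1 codes; height is coded -1, 0, or 1."""
    X = np.column_stack([np.ones_like(cue), is_male, is_back, height])
    beta, *_ = np.linalg.lstsq(X, cue, rcond=None)
    return cue - X @ beta   # the parsed (relative) cue value used by the limited C-CuRE model

# Repeating this for all 24 cues uses 4 x 24 = 96 parsing parameters instead of 25 x 24 = 600.
```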
Table S7 shows the results of these regressions. Gender accounted for substantially less variance across cues than using all 19 speaker codes. While the speaker codes accounted for an average of 18.8% of the variance across cues and were significant for every cue, gender accounted for only 8.9% and was not significant for a number of cues. This broke down unevenly between cues because of the variety of ways that gender and speaker can influence articulation and acoustics. For example, the bulk of variance associated with speaker for cues like M1 or F0 is related to factors like vocal tract and larynx size, which are also related to gender. Thus, one might expect speaker codes and gender to account for similar amounts of variance for such cues. However, for other cues, the speaker codes are capturing things like mean syllable duration or amplitude, things that are not directly correlated with gender.
Vowel information was well preserved when we used only two parameters to describe the six vowels instead of all five dummy codes. The average R² using all five vowel codes was 0.123, while it was 0.112 when only two were used. Moreover, every cue was significantly related to vowel in both regressions. This suggests that parsing by phonetic features (e.g., height) may be just as effective as parsing by individual phonemes, and will have lower dimensionality as an added bonus.
Results
The reduced C-CuRE model did surprisingly well. It performed at 90.8% correct on the tokens from the
perceptual experiment when measured with the discrete choice rule, and 83.5% with the probabilistic
rule. This was lower than the full C-CuRE model but still better than the cue-integration model (Table S8). Similarly, its BIC was midway between the two models. Finally, while the cue-integration model failed to use a number of the cues, and the C-CuRE model was able to use them all, this more limited C-CuRE model significantly used every cue except the narrow-band amplitudes in the vowel (F3AMPV and F5AMPV).
Clearly, by accuracy alone the more limited C-CuRE model is a substantial improvement over the cue-integration model, though not up to the complete C-CuRE model. However, we have also stressed the ability of the model to qualitatively fit the listener data, and the effect of vowel and speaker context on overt performance was particularly diagnostic.
Figure S4 shows a summary of these data. Breaking performance down by fricative (Figure S4A) showed a fair match to the data, although the model oddly appeared to overperform on /f/, /v/, and /ɵ/. This may have been due to the particular tokens selected at test, as this did not appear to be the case for the training tokens (on which the model performed appropriately worse). However, the effect of vowel context (Figure S4B) was exactly as predicted, with performance worst for /i/, better for /ɑ/ and best for /u/. Perhaps not surprisingly, however, the effect of speaker was not as well correlated with listener data for the limited C-CuRE model (Figure S4C), although the correlation was still positive (R=.25).

Figure S4: Performance of the limited C-CuRE model and human listeners on the complete-syllable condition. Model performance is represented by the gray range bounded by performance using the probabilistic and discrete-choice rules. A) As a function of fricative. B) As a function of vowel context. C) As a function of speaker.

Table S7: Comparison of regression analyses examining effects of speaker and vowel on each cue (the complete C-CuRE model) vs. regression analyses using only gender and vowel height and backness (the limited C-CuRE model). Shown are R²change values. Missing values (–) were not significant at the p<.05 level.

           Complete C-CuRE Model         Limited C-CuRE Model
Cue        Speaker (19)   Vowel (5)      Gender (2)   Vowel Height & Backness (2)
MaxPF      0.084*         –              .024*        –
DURF       0.158*         0.021*         .001+        .02*
DURV       0.475*         0.316*         .075         .312*
RMSF       0.081*         –              .010*        –
RMSV       0.570*         0.043*         .039*        .008*
F3AMPF     0.070*         0.028*         .005         .024*
F3AMPV     0.140*         0.156*         –            .120*
F5AMPF     0.077*         0.012*         .003+        .011*
F5AMPV     0.203*         0.040*         –            .018*
LF         0.117*         0.004+         .002         .003*
F0         0.838*         0.007*         .715*        .007*
F1         0.064*         0.603*         .01          .602*
F2         0.109*         0.514*         .071         .489*
F3         0.341*         0.128*         .213*        .103*
F4         0.428*         0.050*         .301         .045*
F5         0.294*         0.045*         .204         .03*
M1         0.122*         –              .011         –
M2         0.036*         –              –            –
M3         0.064*         –              .001+        –
M4         0.031*         –              .003+        –
M1trans    0.066*         0.043*         .002+        .042*
M2trans    0.084*         0.061*         .009         .053*
M3trans    0.029*         0.079*         –            .078*
M4trans    0.031*         0.069*         –            .063*
+ p<.05   * p<.0001

Table S8: Model comparison.

Model                % Correct (Discrete)   % Correct (Prob.)   BIC
Listeners            91.2
Cue Integration      85.0                    79.2               3381
C-CuRE (complete)    92.9                    87.0               2990
C-CuRE (limited)     90.8                    83.5               3267
Discussion
Our initial linear regressions suggested that collapsing speakers into gender deleted a lot of useful information (an indirect argument for the sort of speaker-specific encoding seen in exemplar models), while collapsing the vowel from five variables to two deleted very little information. Thus, parsing by features may be almost as effective as parsing by phonemes.
When these values were used in the categorization model it was clear that there was a
performance boost to be gained by simply parsing out gender and the features of the vowels. While
there were idiosyncrasies in the factors that affected accuracy, the effect of vowel, which was not seen in
the cue-integration and invariance models, remained. Thus, a limited version of parsing which collapses
phonemic influences on acoustic cues into features, and collapses individuals into categories (e.g.,
gender), may be of value as it reduces the number of free parameters in the model, and more
importantly, ensures that the identifications required for parsing can be more accurate.
Together with Note 4, these results suggest that there are at least two ways to retreat from the
complete C-CuRE model and still achieve a benefit in both performance and in the ability to fit the
human data.
Note 6: Compound Cues
The naïve invariance model emphasized first-order cues: single sources of information that directly cue
a distinction. None of these cues met our strict criteria for invariance, but they were statistically the best set of single cues available in this large corpus. However, work on invariance has also emphasized compound cues: measurements in which different properties of the signal are related to each other, for example relating frication duration to speaking rate by using the ratio of frication duration to vowel duration. Another way of describing such compound cues is that the combination instantiates a form of bottom-up compensation in which only the cues themselves participate.
In this section, we examine a number of such compound cues to both substantiate our claims
regarding invariance, and to compare the interactive compensation mechanism proposed by parsing to
some of the best-known purely bottom-up approaches (for fricative cues).
Perhaps the best known compound cues are locus equations (Sussman et al., 1998). Locus equations are constructed by computing a line connecting a formant frequency (typically F2) at the onset and center of a vowel. This line is defined by its slope and intercept (the latter of which is equivalent to F2 at onset), offering two cues for stop consonant identity. It compensates for the effect of speaker and vowel on formant frequencies by taking into account the frequency at the steady state. F2 locus equations have been applied to fricatives in a number of studies (Fowler, 1994; Sussman & Shore, 1996; Jongman et al., 2000), and F2onset (the intercept) has also been used alone (e.g., Nittrouer, Studdert-Kennedy, & McGowan, 1989; Maniwa et al., 2009).
There are a number of other compound cues that may also contribute to fricative identity. The
duration of frication is often related to the broader duration of the syllable or the vocalic portion as a
ratio (DURF / DURV) to compensate for variance in speaking rate. Similarly, the RMS amplitude of the
vowel can be used as a baseline to normalize the amplitude of the fricative. The narrow-band
amplitudes are also typically transformed to relative amplitude by subtracting the amplitude in the
frequency range in the vowel from that of the fricative.
Thus, compound cues can compensate for variation due to speaker and vowel, as well as
variation in amplitude and speaking rate. As such, they should be more invariant with respect to context
and their first-order counterparts. However, using such cues comes at a cost—while first-order cues can
be identified readily in the signal, compound cues require substantially more effort to uncover and as a
result we have only identified a handful of them. They may also require the system to store multiple bits
of information for some time during online speech perception (e.g., storing frication duration till the end
of the vowel). This would require a buffer of some kind, and may also delay listeners’ ability to make a
commitment. It is also contrary to recent empirical data suggesting that for voicing at least there does
not appear to be such a buffer (McMurray, Clayards, Aslin, & Tanenhaus, 2008).
Interestingly, in many cases, compound cues appear to be doing similar things to parsing. Consider the slope of the F2 locus equation. This cue is defined as:
MF2 = (F2onset − F2vowel) / (0.5 × DURV)    (1)
where MF2 is the slope of the locus equation and DURV is the duration of the vowel. Here, MF2 will be negative if F2onset is lower than the steady-state value and positive otherwise.
Now, consider a version of F2onset in which the vowel has been parsed out. To accomplish this, we first develop a regression equation predicting F2onset from the vowel:
F2onsetP = β0 + β1V1 + β2V2 + β3V3    (2)
Here, F2onsetP is the predicted F2onset, and β0 is the y-intercept. V1–3 are dummy-coded variables such that
V1 is 1 if the vowel was /i/ (and 0 otherwise), V2 is 1 if the vowel was /u/ (and so forth). β1–3 are the regression weights on these terms. Since for any given token only one of the dummy codes is 1, this regression equation can be rewritten as
F2onsetP = β0 + β1V1    (3)
(if the vowel was /i/). Under this, β1 will be equal to the average difference in F2 between the /i/'s and all the other vowels, and β0 + β1 will be the mean F2 of all the /i/ tokens. Now, to parse the effects of
vowel from the original F2 we simply subtract the predicted F2 from the actual.
F2onsetR = F2onset − F2onsetP
         = F2onset − (β0 + β1V1)
         = F2onset − mean F2onset for /i/    (4)
This is quite similar to the locus equation. In both cases, the relative (or parsed) value (F2onsetR) is
linearly related to the difference between the actual value and some estimate of the vowel. The only
substantive difference is that the locus equation uses the actual F2 of the vowel, while the parsed F2 uses
the mean F2 across all tokens of that vowel.
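A small sketch (not the authors' code) makes the parallel concrete: the compound cue uses the token's own vowel measurement, while the parsed cue subtracts the category mean, which is exactly what the dummy-coded regression of Equations 2-4 predicts.

```python
import numpy as np

def locus_slope(f2_onset, f2_vowel, dur_v):
    """Compound cue: slope of the F2 transition, as in Equation 1."""
    return (f2_onset - f2_vowel) / (0.5 * dur_v)

def parsed_f2_onset(f2_onset, vowels):
    """C-CuRE-style parsing: subtract each vowel category's mean F2 onset
    (the regression prediction from Equations 2-4)."""
    f2_onset = np.asarray(f2_onset, dtype=float)
    vowels = np.asarray(vowels)
    parsed = np.empty_like(f2_onset)
    for v in np.unique(vowels):
        mask = vowels == v
        parsed[mask] = f2_onset[mask] - f2_onset[mask].mean()
    return parsed
```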
Thus, parsing in the C-CuRE framework may be a proxy for the compensation built into these compound
cues. However, since we only need to know one measurement to parse, C-CuRE can be used on any cue.
This is certainly useful for speech scientists (since we don’t need to exhaustively search all possible
cues). More importantly, it may also be useful for listeners. During real-time processing when the
vowel may not have been heard at early points in the syllable (e.g., McMurray, et al., 2008), relying on a
memory of the mean value of that cue for the speaker/vowel (rather than waiting to get it) may be more
efficient than storing the fricative cues, and waiting for the rest of the information. There may be tradeoffs, though: the actual F2vowel may be a better estimate than the mean; and the listener would still have
to track mean cue values long-term, rather than computing them on the fly. But on the other hand, mean
values may be more robust against noise than single values. Thus, the advantage (or disadvantage) of
compound cues over cues parsed with C-CuRE is somewhat uncertain but there are clear arguments for
both sides.
The present analysis examined this issue. In our data set, there were five potential compound
cues: F2onset can be relativized as a locus equation; duration can be relativized as the ratio DURF / DURV;
and RMS and the narrowband amplitudes (F3 and F5) can be relativized as the difference between the
frication and the vowel. Our goal was to determine if the compound values of these cues would
outperform the parsed values (or vice versa).
Of course, a model based on only five cues was not likely to perform well. There are too few
cues, and the cues were selected based on empirical work, not their coverage of the relevant phonetic
features. Thus, we also included five absolute cues to achieve broader coverage: the first four spectral
moments and low frequency energy. We then compared several models. The first added five potentially
compound cues as first-order cues in completely raw form (F2onset, DURF, RMSF, F3AMPF, F5AMPF).
The second model used each of the cues in compound form: MF2, BF2 (the slope and intercept of the F2
locus equation), DURR (DURF / DURV), RMSR (RMSF - RMSV), F3AMPREL (F3AMPF – F3AMPV) and
F5AMPREL (F5AMPF – F5AMPV). Finally, we considered several models examining expanded cue-sets
and/or compensation with C-CuRE. As before, data was converted to Z-scores, and the perceptual
tokens were used only for test. Unlike before, we did not replace missing data points with means, since
for a handful of data points we were missing only one of the two values for a compound cue. Thus, any
record that was missing data for any of the measures was excluded, resulting in 2,775 (out of 2,880)
records in the analysis.
Results
All five models fit the data well (see Table S9). The first-order cue model was highly significant (χ²(70)=8,666, p<.001), and reduced the BIC from 11,593 in the intercept-only model to 3,483. On the perceptual tokens, it averaged 80.0% correct using the discrete-choice rule and 73.3% using the probabilistic rule. While this is not great performance compared with either of the other first-order cue models (the invariance and cue-integration models), this was expected, since these cues were chosen only because they had compound-cue counterparts.
Not surprisingly, the compound-cue model did better. It fit the data significantly (χ²(70)=8,803, p<.001), and its BIC was lower, at 3,401. It also performed better than the first-order model on the discrete-choice (83.3% vs. 80.0%) and probabilistic (75.4% vs. 73.2%) rules. Despite this better performance, the compound-cue model did not use all of the cues. Likelihood ratio tests showed that the first-order cue model used all five cues of interest (all χ²(7)>26.5, p<.002), while the compound-cue model was unable to use the relative amplitude at F5 (χ²(7)=9.6, p=.21), though it did use the others (all χ²(7)>67.0, p<.001). Thus, as a whole, compound cues offer a benefit to perception over the equivalent first-order cues. However, the utility of any individual compound cue could be more variable.
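The "does the model use this cue" test can be sketched as follows (illustrative, not the authors' code): the model is refit without the cue's seven weights (one per non-reference fricative) and the change in log likelihood is referred to a chi-square distribution.

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik_full, loglik_without_cue, df=7):
    """Likelihood ratio test for dropping one cue from the multinomial model."""
    lr = 2.0 * (loglik_full - loglik_without_cue)
    return lr, chi2.sf(lr, df)   # chi2.sf gives the p value for the observed statistic

# lr, p = likelihood_ratio_test(ll_all_cues, ll_without_one_cue, df=7)
```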
The comparison between the first-order and compound-cue models is somewhat unfair, however, as the compound cues benefit from two sources of information (e.g., DURR used both DURF and DURV). The first-order model used only one (DURF), assuming that the compound cue is a cleaned-up version of the same thing. However, without the need to relativize, both pieces of information could be entered as independent first-order cues and may offer a benefit. Thus, as a second comparison for the compound cues, we reran the first-order model, but added all of the component values as first-order cues. So, for example, where the compound-cue model used DURR (DURF / DURV), the extended first-order model used both DURF and DURV as first-order cues.
This model showed a good fit to the data (χ²(105)=9,061, p<.001) and had a lower BIC than either of the other two (3,365), despite the additional cues. This was reflected in its performance, as it averaged 86.3% correct on the discrete-choice rule and 78% by the probabilistic rule. Interestingly, all cues except F5AMPV were highly significant (χ²(7)>15)—even the contextual cues offer some first-order discrimination between fricatives. Thus, given this set of cues, a cue-integration approach may be more effective than a compound-cue approach.
Next, we examined a compensation / C-CuRE model using the smaller cue-set (e.g., for RMS, using only RMSF, but not RMSV). Its performance was slightly better than the equivalent compound-cue model. Its overall fit was significant (χ²(70)=9,189, p<.0001) and its BIC was 2,959 (substantially lower than the prior three models). It was 85.0% correct on the discrete-choice rule and 78.7% on the
Table S9: Summary of models examining relative cues. The cue column lists only the number of cues of primary interest (those that could participate in relative relationships). All models included 5 absolute cues (M1–4 and LF) in addition to the cues of interest.

Model                  # Cues   Parsing   BIC    % Correct (Discrete)   % Correct (Probabilistic)
First-order            5        No        3483   80.0                   73.2
Compound cues          6        No        3401   83.3                   75.4
First-order extended   10       No        3365   86.3                   78.0
C-CuRE                 5        Yes       2959   85.0                   78.7
C-CuRE extended        10       Yes       3040   87.9                   80.9
probabilistic one. Thus, compensation in the C-CuRE scheme is more effective than compound cues.
Finally, we asked whether the addition of extra variables to the C-CuRE model offers any benefit. A final model using parsing on the cues of the extended first-order model showed a highly significant fit (χ²(105)=9,386, p<.001). While its BIC (3,040) was higher than the simpler C-CuRE model's, it was lower than the compound-cue model's. More importantly, its performance jumped to 87.9% on the discrete-choice rule and 80.9% on the probabilistic one. Thus, it appears that simply using every available information source and normalizing with the parsing operation (rather than more complicated cue-combination rules) offers the best performance.
Discussion
Why did the compound-cue model fare so poorly against C-CuRE? There may be a number of reasons. First, as we described, for many of the compound cues, parsing offers a fairly similar way to normalize the data. Thus, we might have expected similar performance a priori.
Second, Table S10 shows a comparison of the simple regression results for the compound cues
and their absolute counterparts. One thing that is striking is that with the exception of F2locus , most of
the relative measures show similar if not larger effects of context than the absolute ones. While this is
counterintuitive, the reason is that measures like F3AMPREL consist of the difference between two
components, one that is related to the fricative and one to the vowel. By combining them, we end up
pooling their variance, essentially corrupting a more invariant measure (F3AMPF) with variance from a
less invariant one (F3AMPV).
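A small simulation makes this variance-pooling point concrete. The numbers below are invented purely for illustration; the point is simply that the difference between a relatively invariant measure and a context-dominated one can show a larger context effect than the invariant measure alone.

import numpy as np

rng = np.random.default_rng(0)
n = 2000                                     # arbitrary number of tokens
context = rng.normal(0, 1, n)                # shared speaker/vowel influence

f3amp_f = 0.3 * context + rng.normal(0, 1, n)    # weak context effect (more invariant)
f3amp_v = 1.0 * context + rng.normal(0, 1, n)    # strong context effect
f3amp_rel = f3amp_f - f3amp_v                    # compound (difference) cue

def r2_context(cue):
    # Proportion of the cue's variance explained by the context variable.
    return np.corrcoef(cue, context)[0, 1] ** 2

print(r2_context(f3amp_f))    # small (about .08 here)
print(r2_context(f3amp_v))    # large (about .50 here)
print(r2_context(f3amp_rel))  # larger than f3amp_f alone (about .20 here):
                              # the difference inherits the vowel measure's
                              # context variance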
Finally, since many of these compound measures are linear combinations, the first-order cue
model can take advantage of them under some circumstances. For example, if the compound cue is a
difference of two cues (e.g., F3AMPREL = F3AMPF – F3AMPV) and both are included in the model (as
in the extended first-order model), the model could set the β-weight of one cue (F3AMPF) to +1 and the
other (F3AMPV) to –1. That component of the regression equation is then equivalent to the compound cue
(F3AMPREL). If this were going to be helpful, the cue-integration model should be able to achieve some
version of this relativization with its optimal weights. However, by fixing this relationship into a
compound cue, we force the model to use coefficients of only +1 and –1. If the cues needed to be weighted
differently, this would be impossible. In contrast, the cue-integration and C-CuRE models, with
independent weights on both cues, could be more flexible.
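This nesting argument can be illustrated directly: a model given both components with free weights can always reproduce the compound cue, but not vice versa. The sketch below uses invented data in which the optimal weights are unequal; all variable names are hypothetical and the example is not drawn from the corpus.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
f3amp_f = rng.normal(0, 1, n)
f3amp_v = rng.normal(0, 1, n)

# Invented generating process in which the optimal weights are unequal
# (+2.0 and -0.5), not the +1 / -1 implied by the difference cue.
logit = 2.0 * f3amp_f - 0.5 * f3amp_v
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_compound = (f3amp_f - f3amp_v).reshape(-1, 1)   # fixed +1 / -1 combination
X_free = np.column_stack([f3amp_f, f3amp_v])      # both components, free weights

compound = LogisticRegression(C=1e6).fit(X_compound, y)
free = LogisticRegression(C=1e6).fit(X_free, y)

print(compound.score(X_compound, y))   # typically lower accuracy
print(free.score(X_free, y))           # typically higher accuracy
print(free.coef_)                      # recovers roughly [ 2.0, -0.5 ]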
Table S10: Summary of regression analyses examining effects of speaker (20), vowel (6) and fricative (8)
for each of the relative cues. Also shown are their first-order counterparts from the previous analysis.
Shown are R2change values. Missing values were not significant at the p<.05 level.

                                 Contextual Factors         Fricative
Cue        Version          Speaker         Vowel           Identity
DUR        DURF             0.158*          0.021*          0.469*
           DURREL           0.146*          0.165*          0.389*
RMS        RMSF             0.081*                          0.657*
           RMSREL           0.065*          0.006+          0.684*
F3AMP      F3AMPF           0.070*          0.028*          0.483*
           F3AMPREL         0.133*          0.034*          0.315*
F5AMP      F5AMPF           0.077*          0.012*          0.460*
           F5AMPREL         0.227*          0.019*          0.219*
F2         F2onset          0.109*          0.514*          0.119*
           F2locus          0.013*          0.245*          0.041*
+ p<.05    * p<.0001

To conclude, then, while compensation in the C-CuRE framework appears to offer a similar operation to
compound cues like locus equations, in practice it supports much better categorization performance. Given
its advantages of not having to wait for all the necessary information, and the fact that C-CuRE can be
done on any cue without extensive phonetic analysis, it may be the superior approach to relativization or
compensation.
Note 8: On Cue Sharing
A recurring issue in speech perception is whether categorization decisions for one feature or phoneme
are dependent on decisions made for another phoneme or feature (Miller & Nicely, 1955; Pisoni &
Sawusch, 1974; Oden, 1978; Mermelstein, 1978; Whalen, 1989; Nearey, 1990). For example, in the
context of fricatives, does a listener’s decision about the vowel affect his decision about the fricative?
This is particularly relevant in instances of cue-sharing, where a single cue is used for more than one
classification dimension, for example, when F2 must be used to determine the identity of both the
fricative and the vowel. Such cue-sharing across place, voicing, and sibilance, as well as vowel and
speaker, was observed in virtually all of the cues we studied here, and our C-CuRE model makes the
fundamental claim that listeners actively identify contextual factors (speaker/vowel) as part of the
compensation process.
A number of experiments have attempted to disentangle these issues using multidimensional
speech continua. Results have been mixed. Pisoni and Sawusch (1974) showed that place and voicing
judgments are not independent (VOT is affected by both voicing and place; Lisker & Abramson,
1964), although Oden (1978) showed that the Fuzzy Logical Model of Perception (FLMP), which
assumes independence, can fit these data.
Mermelstein (1978) examined vowel length (which partially distinguishes both word-final
voicing and /æ/ from /ɛ/) and showed what appeared to be additive effects. Whalen (1989), however,
replicated Mermelstein’s study with greater power and more precise analyses and found that listeners’
vowel judgments were dependent on their voicing judgments (and vice versa). When listeners identified
a token as voiceless, they attributed the shorter vowel to voicelessness and were more likely to identify it
as /æ/, a long vowel (a nice example of operations like parsing and C-CuRE). His second experiment
showed a similar relationship between fricative (/s/ vs. /ʃ/) and vowel (/i/ vs. /u/) identification of the
same stimulus (the rounded /u/ lowers the M1 of the fricative). If listeners interpreted the vowel as
/u/, they were more likely to assume that an ambiguous M1 was due to rounding and classify the
fricative as /s/.
There has been considerable debate about the nature of the perceptual mechanisms that give rise
to this effect (Nearey, 1990; Whalen, 1992; Smits, 2001a, 2001b). Critically, this debate has been
informed by logistic regression models that are formally similar to the present approach. Nearey (1990)
applied a logistic regression analysis to Whalen’s data, comparing models in which the interaction
between categories was based on a vowel × cue interaction to models in which it was captured simply
by a vowel × fricative bias. The former is related to our C-CuRE model, in which a cue is interpreted relative
to the vowel; the latter is more of a decision-stage process, reflecting simply that subjects are more
likely to respond with particular pairs of phonemes (a diphone bias), and is consistent with independent
perceptual processing but dependent decision making.
Nearey (1990) found that this latter model was sufficient to account for the data, and that the
addition of a vowel × cue term offered little benefit (though he did not assess it alone, so it is unclear whether a
vowel × cue term can do similar work to the diphone bias). Smits (2001a, 2001b) later extended Nearey's
(1990) model by allowing the vowel category to affect how the secondary feature is interpreted in
multiple ways. For example, a rounded vowel could shift the position of the spectral mean boundary (for an
s/ʃ judgment), its slope, or its orientation (in a 2D cue-space). He found much better fits to the data,
including to a new corpus of Dutch fricatives (in which rounding and fricative place are independent).
While this offers considerably more flexibility, it also requires that phoneme decisions be hierarchical
in the sense that fricative decisions are affected by the vowel but vowel decisions are not affected by the
fricative. Smits’ model-fits support this in the fricative-vowel sequences he examined, but this may not
hold up for domains like place assimilation (e.g., Gow, 2003) and vowel-to-vowel coarticulation (Cole
et al., 2010) where listeners appear to work in both directions.
For both models, interactions between cues and categories are either modeled as a decision-stage
(diphone bias) process, or as a modification of how cues are interpreted with respect to the fricative.
Though we didn't include such terms in our model, our model could certainly accommodate such biases.
However, C-CuRE offers considerably more power, greater integration with a body of perceptual work, and
more flexible processing.

Table S11: Analysis of our listener data based on Nearey (1990; see also Smits, 2001a). Shown is the
likelihood of choosing an alveolar or postalveolar as a function of vowel context and experiment. Only
trials in which the stimulus was alveolar or postalveolar (and the response was one or the other) are
included. Nearey (1990) showed that listeners were more likely to answer /s/ after /u/ and /ʃ/ after /i/.
We find no evidence of that here.

                                  Stimulus             Response
Condition            Vowel       Place           Alveolar   Postalveolar
Complete Syllables     i         alveolar          .998        .0025
                                 postalveolar      .0013       .999
                       u         alveolar          .995        .0054
                                 postalveolar      .0025       .998
Frication-only         i         alveolar          .997        .0025
                                 postalveolar      .0012       .999
                       u         alveolar          .995        .0054
                                 postalveolar      .0025       .9975

Consider Nearey's diphone bias. First, it is not clear that it can account for our data. Unlike
Whalen (1989), our vowels were unambiguous, and equally
likely with all fricatives. As a result, we didn't observe a diphone bias in our perceptual experiment.
Considering just the subset of phonemes that Nearey examined: in the context of /u/, listeners
correctly identified /s/ 99.7% and /ʃ/ 99.7% of the time; similarly, in the context of /i/, they identified /s/
100% of the time and /ʃ/ 99.7% of the time (Table S7). So the parsing benefit is more likely to derive
from listeners' use of the vowel to better interpret the fricative cues than from particular biased pairs.
However, introducing such bias terms in a C-CuRE model could be beneficial. It could account for
transition probabilities between phonemes. Likewise, by treating speaker similarly to vowels and
fricatives, one could use a speaker × fricative or speaker × vowel bias term to account for many of the
indexical effects on speech perception (see Pisoni & Levi, 2007, for a review). In this case, such terms could
account for differences in the frequency of different phonemes produced by different speakers. Thus, a
diphone bias was unlikely to play a role in the parsing benefit we observed, but it is not inconsistent with
our approach and may be quite useful.
Next, consider Smits’ use of vowel-sensitive weights on the cues to frication. Smits embeds
these as part of the fricative decision. That is, if lip rounding leads to a lower spectral mean for
fricatives, this information is stored as a component of the fricative decision process. However, spectral
mean may also be useful for deducing speaker gender, and the coarticulatory effects of lip rounding
could interfere with that as well. Under Smits’ model, such effects would have to be stored as
independent factors in the gender-decision model. In contrast, C-CuRE recodes the continuous cue as a
function of expectations derived from the vowel (e.g., lip rounding), making this "corrected" cue value
simultaneously available for all perceptual decisions. It can modify not only the location of the cue value
(shifting the boundary) but also its scale (if standardized or studentized residuals, which are sensitive to
the variation, are used), allowing it to capture effects on the slope as well. Thus, it can account for many
of the classes of effects that Smits (2001b) models, without the redundancy of building the
compensation into each phoneme decision, and with greater flexibility.
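As an illustration, parsing in this sense reduces to a residual computation: the cue's expected value given the identified context is subtracted out, and (optionally) the residual is scaled by the within-context variability. The sketch below is a minimal, hypothetical rendering of that idea, with expectations taken as context means (equivalent to a regression on dummy codes for a single categorical predictor); it is not the authors' implementation, and the names cue and vowel are placeholders.

import numpy as np

def parse_cue(cue, vowel):
    # cue: array of a continuous measure (e.g., spectral mean M1);
    # vowel: array of integer codes for the identified vowel category.
    expected = np.zeros_like(cue, dtype=float)
    scale = np.ones_like(cue, dtype=float)
    for v in np.unique(vowel):
        mask = vowel == v
        expected[mask] = cue[mask].mean()   # expectation given the vowel
        scale[mask] = cue[mask].std()       # within-vowel variability
    # The raw residual (cue - expected) shifts the effective boundary;
    # dividing by the within-category SD (a standardized residual) also
    # adjusts the scale, capturing slope-like effects.
    return (cue - expected) / scale

The parsed value is then available to every downstream decision (fricative, speaker gender, and so on) rather than being folded into any single phoneme's decision rule.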
More importantly, it also allows for a much more scalable model, as only first-order
cue × category relationships need to be retained, and we don't need to assume any hierarchical ordering
of the categories. While the difference between models may be slight when only two cues are
considered (as in Nearey, 1990; Smits, 2001a, 2001b), the failure of our invariance model on even
unambiguous fricatives suggests that only large numbers of cues will suffice, making scalability an
important issue.
Moreover, in contrast to the Smits (2001a, 2001b) approach, C-CuRE predicts that the encoding
of the continuous cues is changed by the categories (and the compensation). This is consistent with a
large body of work on compensation for coarticulation. Fowler and Brown (2000), for example, showed
that listeners actually perceive nasalized vowels as more oral when the nasality can be attributed to an
upcoming nasal stop (see also Pardo & Fowler, 1997; Beddor, Harnsberger, & Lindemann, 2002).
Critically, this was tested with a 4IAX task, which is known to be sensitive to differences in continuous
cues, not categories (Pisoni & Lazarus, 1974; Gerrits & Schouten, 2004).
Finally, C-CuRE can account for speaker × cue effects, such as the finding that ambiguous
fricatives are more likely to be heard as /s/ with a male voice and /ʃ/ with a female voice (Strand, 1999).
Such effects were not considered by the Smits (2001a, 2001b) model, although they are potentially consistent with
it. However, adding such terms in that framework would require a substantially more complex
hierarchy; adding them to a C-CuRE model is simple.
With respect to the Nearey (1990) and Smits (2001a, 2001b) models, the addition of bias terms
could help account for things like phoneme-to-phoneme transitions or indexical effects, while our
parsing approach can account for what appear to be normalization effects or context dependencies,
without the need to duplicate these efforts for every phoneme. Crucially, the model also illustrates how
treating these fairly distinct factors (speaker and vowel) as simply sources of variance applied to cues
may bring together exemplar and normalization approaches to perception.
The present study offers a new way to frame the debate over the independence of phonetic
identification processes. It demonstrates a concrete and quantifiable benefit to the idea of interpreting
cues in light of other categories: parsing offered a 7–8% improvement over the cue-integration model.
This compensation can be done without fitting a vowel × cue or vowel × fricative term to the fricative
decision; one only needs to examine the direct relationships between context and cues (not their joint
relation to the fricative). Given the statistical structure of a real speech corpus, there is a real benefit to a
perceptual system that is sensitive to dependencies between categories (mediated through their
influences on shared cues), and this benefit can be realized with simple statistical models. The classic
question in this domain has been whether the perception of one phoneme category is influenced by
another. Our model reframes this: is the encoding of the continuous cue dependent on a category (either
phonetic or indexical)?
References
Beddor, P. S., Harnsberger, J. D., & Lindemann, S. (2002). Language-specific patterns of vowel-to-vowel
coarticulation: Acoustic structures and their perceptual correlates. Journal of Phonetics, 30, 591–627.
Fowler, C. A. (1994). Invariants, specifiers, cues: An investigation of locus equations as information for place of
articulation. Perception & Psychophysics, 55, 597–611.
Fowler, C., & Brown, J. (2000). Perceptual parsing of acoustic consequences of velum lowering from information for
vowels. Perception & Psychophysics, 62, 21–32.
Gerrits, E., & Schouten, M. E. H. (2004). Categorical perception depends on the discrimination task. Perception &
Psychophysics, 66, 363–376.
Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical
Society of America, 108, 1252–1263.
Lisker, L., & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements.
Word, 20, 384–422.
Maniwa, K., Jongman, A., & Wade, T. (2008). Perception of clear fricatives by normal-hearing and simulated
hearing-impaired listeners. Journal of the Acoustical Society of America, 123, 1114–1125.
McMurray, B., Clayards, M., Tanenhaus, M., & Aslin, R. (2008). Tracking the timecourse of phonetic cue integration
during spoken word recognition. Psychonomic Bulletin & Review, 15, 1064–1071.
Mermelstein, P. (1978). On the relationship between vowel and consonant identification when cued by the same
acoustic information. Perception & Psychophysics, 23, 331–336.
Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. Journal of
the Acoustical Society of America, 27, 338–352.
Nearey, T. M. (1990). The segment as a unit of speech perception. Journal of Phonetics, 18, 347–373.
Newman, R., Clouse, S., & Burnham, J. (2001). The perceptual consequences of within-talker variability in fricative
production. Journal of the Acoustical Society of America, 109, 1181–1196.
Nittrouer, S., Studdert-Kennedy, M., & McGowan, R. S. (1989). The emergence of phonetic segments: Evidence from
the spectral structure of fricative-vowel syllables spoken by children and adults. Journal of Speech and
Hearing Research, 32, 102–132.
Oden, G. (1978). Integration of place and voicing information in the identification of synthetic stop consonants.
Journal of Phonetics, 6, 82–93.
Pardo, J. S., & Fowler, C. A. (1997). Perceiving the causes of coarticulatory acoustic variation: Consonant voicing and
vowel pitch. Perception & Psychophysics, 59, 1141–1152.
Pisoni, D. B., & Lazarus, J. H. (1974). Categorical and noncategorical modes of speech perception along the voicing
continuum. Journal of the Acoustical Society of America, 55, 328–333.
Pisoni, D. B., & Levi, S. (2007). Representations and representational specificity in speech perception and spoken word
recognition. In M. G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 3–18). Oxford, England:
Oxford University Press.
Pisoni, D. B., & Sawusch, J. R. (1974). On the identification of place and voicing features in synthetic stop consonants.
Journal of Phonetics, 2, 181–194.
Smits, R. (2001a). Evidence for hierarchical categorization of coarticulated phonemes. Journal of Experimental
Psychology: Human Perception and Performance, 27, 1145–1162.
Smits, R. (2001b). Hierarchical categorization of coarticulated phonemes: A theoretical analysis. Perception &
Psychophysics, 63, 1109–1139.
Strand, E. (1999). Uncovering the role of gender stereotypes in speech perception. Journal of Language and Social
Psychology, 18, 86–100.
Sussman, H., Fruchter, D., Hilbert, J., & Sirosh, J. (1998). Linear correlates in the speech signal: The orderly output
constraint. Behavioral and Brain Sciences, 21, 241–299.
Sussman, H. M., & Shore, J. (1996). Locus equations as phonetic descriptors of consonantal place of articulation.
Perception & Psychophysics, 58, 936–946.
Whalen, D. (1989). Vowel and consonant judgments are not independent when cued by the same information.
Perception & Psychophysics, 46, 284–292.
Whalen, D. (1992). Perception of overlapping segments: Thoughts on Nearey’s model. Journal of Phonetics, 20, 493–
496.