CS 551/651: Structure of Spoken Language Lecture 12: Tests of Human Speech Perception

John-Paul Hosom
Fall 2008
• Recommended Reading:
Chapter 5: Strong/Weak Forms, Intonation, and Stress
Chapter 11, pp. 267 − 275: Balance between Phonetic
Forces and “Physical Phonetics”
• Final Exam will be a take-home exam with about 10 questions
(same style as midterm, but may require use of calculator)
and a number of spectrograms to be deciphered. It will be
handed out at the end of class on Wednesday December 3.
The exam will be due back to me by Friday December 12.
This is worth about 30% of your grade.
• The final will cover material from Lecture 7 (“Syllable
Structure…”) until the end of the term. Material covered on the
midterm will probably not be covered on the final.
• The spectrogram reading exercises will be similar to the midterm,
but will include the other classes of speech that we’ve been
studying (nasals, approximants, and affricates) as well as the
usual vowels (and diphthongs), fricatives, and stops.
The Perceptual Second Formant: F2'
Most vowels can be simulated using two resonances:
[Figure: example with resonances at 400 Hz and 2200 Hz]
In one study, the lower resonance was fixed at the frequency
of a vowel formant, and the subject was asked to vary the
higher resonance (F2') until the perceived sound most closely
matched the target vowel.
[Figure: lower resonance fixed at 400 Hz; target vowel /ih/]
For back vowels and central vowels, subjects adjusted F2' to a
frequency near the vowel’s F2
For front vowels except for /iy/, F2' was between the vowel’s
F2 and F3; for /iy/, F2' was at or above the vowel’s F3
The Perceptual Second Formant: F2'
These findings suggest that when formants are close in
frequency, they are integrated so that there is a single
“effective” formant equivalent to an average of the two peaks
It has also been shown that when two or more formants occur
within 3 to 3.5 Barks, the perceived vowel quality is
equivalent to a resonance pattern with a single formant at the
center of gravity of the two formants
So, for two formants within 3 Barks, the formant positions
affect a center of gravity measure of a single perceived
resonance; beyond 3 Barks, two formants are heard as
perceptually distinct.
These results suggest that for steady vowels, there is an
internal representation that has fairly low resolution.
\[ \mathrm{Bark}(f) = \frac{26.81\, f}{1960 + f} - 0.53 \qquad (f \text{ in Hz}) \]
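The 3-Bark integration rule can be sketched in Python. This is only a minimal illustration, not the procedure used in the studies above: the function names are hypothetical, the 3.0-Bark limit is the lower end of the 3 to 3.5 Bark range, and the merged formant is placed at the unweighted midpoint on the Bark scale, whereas a true center of gravity would also weight each formant by its amplitude.

```python
def hz_to_bark(f_hz):
    """Convert a frequency in Hz to Bark: 26.81 f / (1960 + f) - 0.53."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def effective_formants(f_low_hz, f_high_hz, limit_bark=3.0):
    """Return the formant(s) a listener is assumed to hear.

    ASSUMPTION: formants closer than `limit_bark` merge into a single
    peak at their unweighted midpoint in Bark; amplitudes are ignored.
    """
    z_low, z_high = hz_to_bark(f_low_hz), hz_to_bark(f_high_hz)
    if abs(z_high - z_low) <= limit_bark:
        return [bark_to_hz((z_low + z_high) / 2.0)]
    return [f_low_hz, f_high_hz]


if __name__ == "__main__":
    # Two formants about 2 Bark apart: predicted to merge into one peak.
    print(effective_formants(2200, 3000))
    # Two formants about 5 Bark apart: predicted to stay distinct.
    print(effective_formants(1000, 2200))
```

Under these assumptions, formants at 2200 Hz and 3000 Hz fall within the limit and yield one effective peak between them, which is in line with F2' lying between F2 and F3 for most front vowels.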
Perception of Coarticulation
In most cases, vowels are affected by coarticulation. In some
cases, the vowel does not reach its “target” formant pattern.
How does the brain deal with this variation in the signal?
The acoustic effects of coarticulation are referred to by Lindblom
as “target undershoot”; the amount of undershoot depends on
syllable duration, as well as on speaking style, and varies both
across and within speakers.
In vowel perception, Lindblom hypothesized that people
compensate for target undershoot, and attempt to recover the
canonical vowel targets.
In an experiment, synthetic speech stimuli in a wVw and yVy
context were presented to listeners, with the F2 of V varying
from high (for an /ih/ vowel) to low (for an /uh/ vowel).
Perception of Coarticulation
The boundary for perception of /ih/ and /uh/ (given the
varying F2 values) was different in the wVw context and the yVy
context.
In yVy contexts, mid-level values of F2 were heard as /uh/,
and in wVw contexts, mid-level values of F2 were heard as /ih/.
Perception of Coarticulation
This demonstrates perceptual overshoot; subjects are relying
on the direction and slope of formant transitions to classify
the vowel.
Lindblom proposed a Perceptual Compensation model, which
“normalizes” formant frequencies based on formants of the
surrounding consonants, canonical vowel targets, and syllable
duration.
However, many factors may account for target undershoot,
and so a simple model is not effective in this case.
Also, if applied to automatic speech recognition, determining
locations of consonants and vowels is a non-trivial problem.
Are Formant Targets Important??
Strange et al. performed an experiment in which target information,
dynamic information (in formant transitions), and duration
information were manipulated independently in CVC
syllables.
Given a CVC, the middle region of the V was removed, or the
transition regions were removed, or the duration was
normalized, or some combination of these was applied.
The CVCs were presented to subjects, who were asked to
identify the vowel.
Versions with no target information are “Silent-Center”,
versions with no transitions are “Centers-Alone”, and
time-normalized versions are referred to as “Neutral-Duration”.
Are Formant Targets Important??
Identification of Silent-Center vowels was “remarkably
accurate”; in some cases, as good as identification of
unmodified CVC.
Neutral-Duration Silent-Center vowels were not correctly identified
as often as Silent-Center vowels.
However, Neutral-Duration Silent-Center vowels were still more
often correctly identified than Neutral-Duration Center-Alone
vowels.
(1) when vowel transition and duration information is present,
recognition is highly accurate
(2) with no duration information, transition information is
more useful than nucleus information for vowel ID.
(3) vowel targets alone are neither sufficient nor necessary
Are Formant Targets Important??
In another study by Furui, CV syllables were truncated either
from the beginning or from the ending, and perception of the
truncated syllable was measured
In another experiment, both initial and final sections of the
syllable were truncated, with a minimum duration of 40 msec
The “perceptual critical point” was defined as the truncation
position at which there was 80% correct recognition.
Furui found:
(a) The 10 msec during the point of greatest spectral transition
is most important for identification of CV syllables, and
(b) The crucial information for both vowels and consonants is
in this 10-msec region; consonants can be mainly
perceived by the spectral transition into the following
vowel.
Are Formant Targets Important??
Tekieli and Cullinan showed that
(a) Given first 10 msec of isolated vowel, Place and Height
can be distinguished at levels above chance; the tense-lax
feature requires 30 msec.
(b) Place of articulation in CV can be identified based on 10
msec after release, but the voicing feature requires 20-30
msec.
In short, timing information is critical for tense-lax and
voiced-unvoiced distinctions, and making these distinctions
requires about 30 msec of speech; other features can be
identified in 10 msec.
Finally, DiBenedetto demonstrated that the F1 trajectory
influenced perception of front vowels; synthetic syllables in
which F1 targets are reached earlier than normal are perceived
as lower in Height (/iy/ → /ih/, /ih/ → /eh/, /eh/ → /ae/).
Perception of Place of Articulation
Acoustic cues to perception of place of articulation reside
primarily in spectral transitions between phonemes (with
some exceptions, notably weak /f, th/ vs. strong /s, sh/)
In perceptual experiments with two synthetic formants,
different bursts can be heard by changing the slope of the
initial part of F2; a locus of 720 Hz causes perception of /b/, a
locus of 1800 Hz causes perception of /d/, and a locus of 3000
Hz often causes perception of /g/.
Different plosives can also be perceived based on the shape of
the burst (see next slide).
Perception of Place of Articulation
Categorical Perception
In labeling speech, we use a fixed symbol set (e.g. Worldbet,
IPA, etc.) to record what is spoken
But what do we hear? Do we hear discrete symbols, or a
continuum of sounds? In other words, is perception
categorical, or continuous?
If categorical, then there will be a range of stimuli that will
yield no perceptual difference, a boundary at which the
perception will change, and another range of stimuli with no
perceptual difference.
One example of a categorically-perceived feature is voice-onset
time (VOT); if VOT is long, people hear unvoiced
plosives, if VOT is short, people hear voiced plosives. But
people don’t hear ambiguous plosives at the boundary
between short and long VOT (30 msec).
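The prediction of a strictly categorical account can be sketched as follows. This is an idealized illustration only: the function names are hypothetical, the boundary is the 30 msec value given above, and treating a VOT of exactly 30 msec as voiced is an arbitrary assumption. Real listeners show a steep but not perfectly sharp labeling function.

```python
VOT_BOUNDARY_MS = 30.0  # boundary between short and long VOT, from the slide


def label_plosive(vot_ms, boundary_ms=VOT_BOUNDARY_MS):
    """Idealized categorical labeling: long VOT is unvoiced, short is voiced.

    ASSUMPTION: a VOT exactly at the boundary is labeled voiced.
    """
    return "unvoiced" if vot_ms > boundary_ms else "voiced"


def heard_as_different(vot_a_ms, vot_b_ms, boundary_ms=VOT_BOUNDARY_MS):
    """Under strict categorical perception, two stimuli are perceptually
    different only if they receive different labels."""
    return label_plosive(vot_a_ms, boundary_ms) != label_plosive(vot_b_ms, boundary_ms)


if __name__ == "__main__":
    # Same 15 msec step in VOT, placed within a category and across the boundary.
    print(heard_as_different(10, 25))  # both fall on the short-VOT side
    print(heard_as_different(25, 40))  # the pair straddles the boundary
```

The same physical step in VOT is predicted to be inaudible inside a category and audible across the boundary, which is the pattern described above.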
Categorical Perception
In another experiment, the F2 transition was varied along a
continuous scale, but what was heard were “essentially
quantal jumps from one perceptual category to another”
(namely /b/, /d/, and /g/). (Moore, p. 283)
On the other hand, small changes in the formants of vowels
are easily perceived, leading to perception of “blended”
vowels.
However, for continuous-speech vowels, perception may be
more categorical (Stevens, 1968) and there is evidence that
vowels are encoded in memory using distinctive features
(when vowels are forgotten, other vowels with similar features
are more likely to be remembered, Cole 1968).
Other evidence for categorical perception is in second-language
learning; e.g. Japanese speakers distinguishing /r/ and /l/ (by
the age of 6, perception of speech is altered)
Categorical Perception
However, another study presented subjects with a range of
stimuli between /b/, /d/, and /g/, but subjects were asked to
respond with either /b/ or /g/. If perception were completely
categorical, the responses in the /d/ region should have been
random, but in fact there were systematic responses. (Barclay,
Perception may be continuous but have sharp category
boundaries, e.g.
(Massaro, 1998)
Cue Trading
Perception of “slit” vs. “split”, with
duration of silence between /s/ and /l/ varied, and
formant transitions of /l/ varied to be flat or more toward /p/
Long silence durations yield “split”; however, words with
formant transitions closer to /p/ required less silence to be
heard as “split”.
Conclusion: both acoustic cues are integrated by the listener
into a single phonemic perception; cues can be “traded” so
that more of one cue requires less of another for one type of
perception (e.g. “split”)
Cue Trading
As Moore stated, “within limits, a change in the setting or
value of one cue, which leads to a change in the phonetic
percept, can be offset by an opposed setting of a change in
another cue so as to maintain the original phonetic percept.”
(p. 291)
McGurk Effect:
(1) audio signal contains /ba/, video signal contains /ga/,
perceived sound is /da/
(2) audio signal contains /ma/, video signal contains /ta/,
perceived sound is /na/
Subjects are not aware of the conflicting cues.
Fuzzy-Logic Model of Perception (FLMP)
Massaro has proposed the FLMP, in which cues are:
(a) evaluated according to their degree of presence; this
evaluation returns a high number (up to 1.0) if the feature
is present, and a low number (as low as 0.0) if the feature
is absent.
(b) matched to a prototype higher-level feature, such as a high
degree of lip rounding matching a bilabial sound.
(c) incorporated in a pattern-classification step, to determine
which higher-level feature best matches the available cues
The “best” high-level feature is selected as the actual feature.
For example, given the following prototypes:
phn(labial, voiced) = /b/
phn(labial, not_voiced) = /p/
phn(alveolar, voiced) = /d/
phn(alveolar, not_voiced) = /t/
Fuzzy-Logic Model of Perception (FLMP)
And then given measurements of place of articulation and
voicing along scales of
0.0 = bilabial, 1.0 = alveolar,
0.0 = not_voiced, 1.0 = voiced
Then the probability of identifying the sound as /b/ is:
\[ p(\text{/b/} \mid A, V) = \frac{(1 - A)\,V}{(1 - A)\,V + (1 - A)(1 - V) + A\,V + A\,(1 - V)} \]
where A is the evidence for alveolar, and V is the evidence for
voicing.
This assumes that all of the cues (pieces of evidence) are independent.
This is equivalent to Bayes’ rule if the “fuzzy” scales are
interpreted as probabilities
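The four-prototype example can be worked through in a few lines of Python. This is a minimal sketch of the formula for p(/b/ | A, V) extended to all four prototypes; the function name and the sample cue values are hypothetical, and no exponential weighting of the cues is applied.

```python
def flmp_probabilities(a, v):
    """Return identification probabilities for /b/, /p/, /d/, /t/.

    a: evidence for alveolar (0.0 = bilabial, 1.0 = alveolar)
    v: evidence for voicing  (0.0 = not_voiced, 1.0 = voiced)
    The cues are treated as independent, so support values multiply.
    """
    support = {
        "/b/": (1.0 - a) * v,          # labial, voiced
        "/p/": (1.0 - a) * (1.0 - v),  # labial, not voiced
        "/d/": a * v,                  # alveolar, voiced
        "/t/": a * (1.0 - v),          # alveolar, not voiced
    }
    # For a and v in [0, 1] the total is always 1, but the division is
    # kept so the code mirrors the formula.
    total = sum(support.values())
    return {phone: s / total for phone, s in support.items()}


if __name__ == "__main__":
    # Hypothetical cue values: mostly bilabial, strongly voiced.
    probs = flmp_probabilities(a=0.2, v=0.9)
    for phone, p in probs.items():
        print(phone, round(p, 3))
```

With these sample values the largest probability goes to /b/, the prototype that matches both cues best.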
Fuzzy-Logic Model of Perception (FLMP)
With exponential weights on the pieces of evidence, the
predicted probabilities of identification agree well with actual
probabilities of identification, varying place of articulation and
voice-onset-time of synthetic speech sounds: