Puzzles and Patterns in 50 years of Research on Speech Perception

Sarah Hawkins
University of Cambridge
sh110@cam.ac.uk
Three periods
1. 1950-1965 Broad-based exploration
2. 1965-1990s Narrowed to focus on the
search for invariance in the relationship
between speech signal and its percept: THEORY
3. 1995…. This focus is broadening again
– to include ‘discrepant’ data & new understanding
– which requires changes in conceptualization of
• task goals
• processes involved
The Main Message
• Speech perception is at an exciting stage:
we are beginning to
integrate areas of old research
with the mainstream theoretical
work of the last 30 years or so
• A paradigm shift?
Early work
Glorious Discovery
Early work
• often looked at effects on the whole signal
• but as puzzles arose, and we looked more
closely, then attention became focused on
small domains in an effort both to simplify
and to clarify
Early work: source separation
Cocktail party effect / multi-talker perception
Cherry (1953)
• continuous natural speech, with different types
of content, presented in different ways
• a huge wealth of observations relevant to
– memory
– attention
– transitional probabilities
– speaker vs message
Cherry (1953) JASA 25, 975-979
Early work: source separation
Cocktail party effect / multi-talker perception
Broadbent & Ladefoged (1957)
• separate synthetic formants fuse to sound like
a single vowel when presented to the same or
different ears, only if they have the same f0
• compared ‘natural’ and ‘sustained’ formants
• extensions to theories of hearing (e.g. Licklider)
ASA special session, 2004
Broadbent & Ladefoged (1957) JASA 29, 708-710
Darwin (1981) QJEP 33, 185-207
Bregman (1990) Auditory Scene Analysis
Cooke & Ellis (2001) Sp. Comm. 35, 141–177
Early work: source integration
Sumby & Pollack (1954)
Especially in high levels of noise:
• audiovisual presentation increases intelligibility
(visual contribution is relative to the available
auditory contribution)
Sumby & Pollack (1954) JASA 26: 212-215
Massaro (1998) Perceiving Talking Faces
Widespread AV groups and applications
• in auditory-only presentations, polysyllables are
more intelligible than monosyllables
(overall shape... neighborhoods…cohorts…)
Richard Warren, Paul Luce, Marslen-Wilson
Early work: brain function
Kimura (1961)
• speech is processed more efficiently by the ear that is
contralateral to the language-dominant hemisphere
• independent of handedness and right/left focus of
damage due to epilepsy
→ complexities of auditory pathways, cerebral
dominance, and speech processing
Kimura (1961) Canadian J. Psychol., 15, 166-171
The new ‘cognitive neuroscience/psychology’…
Early work: memory
Miller (1956)
• short term memory span for unrelated items
– The Magical Number Seven ± Two
• can increase this span by:
– making relative rather than absolute judgments
– increasing the number of dimensions
– chunking into larger items
• recoding is a crucial process
Miller (1956) Psychological Review 63, 81-97
Serial learning and recall (e.g. Underwood)
Lashley (1951) Serial order in behavior
Pisoni (1973) and later
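The chunking and recoding bullets above can be illustrated with a minimal sketch. It assumes a simple 3-to-1 recoding of binary digits into octal digits; the function name and the digit string are illustrative choices, not taken from the slides.

```python
def recode_binary_to_octal(bits: str) -> str:
    """Recode a string of binary digits into octal digits, three bits per chunk."""
    if set(bits) - {"0", "1"}:
        raise ValueError("input must contain only '0' and '1'")
    if len(bits) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    # Each group of three binary digits becomes one octal digit (one chunk).
    return "".join(str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3))


binary_items = "101000100111001110"      # 18 unrelated binary digits
chunks = recode_binary_to_octal(binary_items)
print(len(binary_items), "items ->", len(chunks), "chunks:", chunks)
```

Eighteen items exceed a span of about seven, but the six recoded chunks fit within it, which is the sense in which recoding extends the span.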
Early work: intelligibility
Context of Possible Responses
Miller, Heise & Lichten (1951)
• monosyllables
• size of test vocabulary affects identification
• 2…256…all monosyllables
• though presumably there are limits:
– two vs six
– five vs nine !
Miller, Heise & Lichten, (1951) J.Exp.Psych. 41, 329-335
Early work: intelligibility
Phonetic Context
Pickett & Pollack (1963)
• excerpts from connected speech must be
≥ 800 ms long to be fully intelligible
• regardless of rate:
– faster rates need more syllables to be understood
(slowing the speech down does not help)
→ crucial role of coarticulation & style
(‘connected speech processes’)
Pickett & Pollack (1963) Language & Speech 6, 165-171
Early work: preceding context
affects the interpretation of the current sound
Ladefoged and Broadbent (1957)
• "Please say what this word is: …"
• test words: bit, bet, bat, but
• F1 of CARRIER:
– bet: 200-380 Hz
– bit: 380-660 Hz
Ladefoged and Broadbent (1957) JASA 29, 98-104
Early work: immediate context
determines the interpretation of the current stimulus
Synthesizing bursts and transitionless vowels
Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606
Early work: immediate context
determines the interpretation of the current stimulus
Identification of bursts and transitionless vowels:
the CV is identified as the minimal acoustic unit
Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606
Early work: immediate context
determines the interpretation of the current stimulus
Identification of burstless stops with different vowels:
transitions are all you need!
Delattre, Liberman, & Cooper (1955) JASA 27, 769-773
Categorical Perception
of obstruent consonants
Equal acoustic changes → unequal auditory percepts
place of articulation of stops: /b/ vs /d/ vs /g/
[Figure labels: b, d, g]
Liberman, Harris, Hoffman, and Griffith (1957)
Journal of Experimental Psychology 54, 358-368
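The idea that equal acoustic steps give unequal percepts can be sketched numerically. This is a minimal illustration under stated assumptions, not the analysis from the cited paper: the logistic identification function, its boundary and slope, and the 7-step continuum are hypothetical, and the discrimination prediction assumes listeners rely only on covert labels and guess when the labels agree.

```python
import math


def p_label_b(step: float, boundary: float = 4.0, slope: float = 2.5) -> float:
    """Hypothetical probability of labeling a stimulus as /b/ (logistic in step number)."""
    return 1.0 / (1.0 + math.exp(slope * (step - boundary)))


def predicted_abx(p1: float, p2: float) -> float:
    """Labeling-only prediction of proportion correct in ABX for a two-category case."""
    return 0.5 + 0.5 * (p1 - p2) ** 2


identification = {s: p_label_b(s) for s in range(1, 8)}
# Every pair is two steps apart, so the acoustic difference is the same for all pairs.
pairs = [(s, s + 2) for s in range(1, 6)]
prediction = {pair: predicted_abx(identification[pair[0]], identification[pair[1]])
              for pair in pairs}

for pair in pairs:
    print(pair, round(prediction[pair], 3))
```

With these assumed parameters, predicted discrimination stays near chance within a category and peaks for the pair that straddles the boundary, although every pair is the same acoustic distance apart.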
Categorical Perception
of obstruent consonants
• together with a theoretical bias in favor of
binary oppositions
• encouraged a focused search for simple
transformations from the encoded signal to an
unambiguous, formal linguistic mental
representation
This narrower focus
• required clear conceptualisation of
– identity of the important unit(s) of perception
– process of abstraction
• On the whole, the units and levels of linguistic
description were rather uncritically adopted
…units of linguistic description
were rather uncritically adopted
“we….had undertaken to find the ‘invariants’
of speech, a term which implies, at least in its
simplest interpretation, a one-to-one
correspondence between something half-hidden in the spectrogram and the successive
phonemes of the message.”
Cooper, Delattre, Liberman, Borst & Gerstman,
Perception of synthetic speech sounds
JASA (1952) 24, 604-5
…though not without some
misgivings
“…one should not expect always to be able to
find acoustic invariants for the individual
phonemes…we are trying to [compile] the
code book, one in which there is one
column for acoustic entries and another
column for message units, whether these be
phonemes, syllables, words, or whatever.”
Cooper, Delattre, Liberman, Borst & Gerstman,
Perception of synthetic speech sounds
JASA (1952) 24, 604-5
Middle period
The search for essence:
‘invariance’
Middle period:
the search for essence
• Impose order on the chaos!
• Focus: non-linearity between variation in
acoustic signal and perceptual response
Categorical Perception
(of consonants)
• Context becomes seen as variability, so we
control for it ever more stringently
• to discover the crucial—invariant—properties
requires a view of what is fundamental
• The basic syllable! ba
– CV
– in isolation
– stressed
– possibly with only one V if we’re looking at Cs, and only one C if we’re looking at Vs
Imposing order on chaos
• The basic syllable: ba (context: silence)
• What was lost?
– polysyllables
– unstressed syllables
– prosody
– accounting for rate changes
– connected speech
– informativeness of variation esp. in connected speech
– meaning
– communication
– (most things really)
Development of theory and the
search for essence
• Two main approaches
– The Motor Theory
– Quantal Theory, leading to Acoustic/Auditory Invariance
The Motor Theory of Speech Perception
Liberman, Cooper, Shankweiler & Studdert-Kennedy (1967) Psychological Review 74, 431–461
Liberman & Mattingly (1985) Cognition 21, 1-36
• Listeners interpret speech sounds in terms of
– the motor gestures they would use to produce them (1967)
– intended gestures of the speaker (1985)
• Gestural unit: ‘phonetic category’
Quantal Theory of Speech Perception (and production)
Stevens (1972, 1989)
• Regions of stability in the acoustic signal, or auditory response, provide a basis for forming categories of sounds
• Unit: distinctive feature (Chomsky & Halle 1968)
Stevens (1989) Journal of Phonetics 17, 3-45
Stevens (1972) In David & Denes Human Communication. 51-66
Quantal Theory becomes Acoustic/Auditory invariance theory
Stevens & Blumstein (1978) ……. Stevens (2002)
• For each distinctive feature (DF) there is a binary response to an invariant acoustic or auditory property
• e.g. particular changes in spectral shape over short time periods at crucial parts of the signal
– segment boundaries
– vowel steady states
[Figure labels: +consonantal: change; -consonantal: little change]
Stevens (2002) JASA 111, 1872-1891
Stevens & Blumstein (1978) JASA 64, 1358-1368
Acoustic/Auditory invariance theory
Stevens (2002)
[Figure labels: +strident, -strident]
• landmarks:
– islands of reliability
– built-in local context
• connected speech…
Stevens (2002) JASA 111, 1872-1891
Common properties
• Motor and Acoustic Invariance theories
have much in common
– dynamic
– early abstraction
– discrete units
– phonological
• These common properties allowed psycholinguistic theories to assume an input that is abstract and discrete: to ignore phonetic information
Psycholinguistic theories
• Focus on word segmentation & identification
• Top-down knowledge compensates for
impoverished (phonemic) input
– metrical stress, possible words, phonotactics….
• Statistical, probabilistic
• Some names:
– McClelland & Elman (TRACE)
– Cutler, Norris, McQueen (Race, Shortlist, Merge)
– Marslen-Wilson, Gaskell… (Cohort)
extensions, questions:
is simplicity the best answer?
Kewley-Port (1983)
• better identification with overall pattern (more detail?)
Klatt (1979)
• Lexical Access From Spectra (LAFS)
• whole-word patterns?
Kewley-Port (1983) JASA 73, 322-335
Klatt (1979) Journal of Phonetics 7, 279-312
extensions, questions:
wider influences
Ganong (1980)
• identification expt
• VOT continuum
• word at one end, nonword at the other
– nonword-word: dask-task
– word-nonword: dash-tash
• perception is more forgiving when the sound means something!
[Figure: % /d/ (0 to 100) along the VOT continuum, from short VOT (d) to long VOT (t)]
Ganong (1980) J. Exp. Psych: HPP 6, 110-125
Summary:
‘context’ and ‘signal’
• ‘Units’ functionally inseparable from ‘context’
• The context and the signal together determine
whether the signal is coherent
– and hence what each unit ‘is’
Recent developments
(since early-to-mid 90s)
systematic subtle variation as
linguistically informative:
classify the contexts in a more
linguistically-sophisticated way
Combining old and new themes
• re-examination and extension of information
provided by systematic phonetic variation
• new areas, e.g.
– cross-linguistic work (Best, Beddor, Bradlow...)
– memory & learning (Goldinger, Pisoni...)
– functional brain imaging (Sophie Scott)
Listeners use fine phonetic detail
Allen & Miller (2004)
• speaker identity: listeners generalize talker-specific VOT information to a novel word
Smith (2004)
• lexical identity: slightly inappropriate
allophones in a sentence disrupt word-spotting
only when speaker is familiar to listener
• familiarization to speakers is fast
Allen & Miller (2004) JASA 116, 3171-3183
Smith (2004) PhD Dissertation, Cambridge University
• Spoken word recognition test, which is used to establish cerebral dominance
• large groups of native speakers of Chinese/English/Spanish
• coronal MRI slices, data for 3 Ss, >200 ms post-stimulus onset
[Figure panels: Chinese, English, Spanish]
• Lateralisation (%Ss):
– Spanish 100% left
– English 80% left
– Chinese 79% bilateral (tone lang.)
Valaki et al. (2004) Neuropsychologia 42, 967–979
What sort of model?
• biologically plausible
• roles of attention, memory & learning
• focus on meaning (‘sound to sense’)
• multiple potential ‘units of perception’
no obligatory units?
• structure from incomplete information
Adaptive Resonance Theory (ART) ?
Grossberg 1986…
Grossberg (2003) Journal of Phonetics 31, 423-445
A key issue
• what is a phonetic category?
(Carol Fowler, May 2004: ‘never been sure’)
• mental representations of phonetic
categories are dynamic, relational, & plastic
– Repp, Lindblom, Studdert-Kennedy
– Bradlow, Pisoni, Hawkins…..
Hawkins (2003) Journal of Phonetics 31, 373-405
bottom-up vs top-down?
• phonetic variation that systematically indicates
linguistic structure makes many ‘top-down’
processes unnecessary
– e.g. allophonic detail vs Possible Word Constraint
• and blurs the traditional distinction between
signal & knowledge
A Challenge
• to define and refine new questions in testable
ways – i.e. to refocus, but to do it in ways that:
– are rigorous yet focus on meaning and
communication
– avoid the ‘new understanding’ becoming
doctrinaire
– build on past contributions
Some topics I haven’t mentioned
but should have…
and could have, if I’d told the same story in a different way
• infants’ & animals’ perception (periods 2 & 3)
• vowel perception (dynamics; center of gravity)
• sine wave speech
• more theories (direct perception, auditory enhancement, FLMP)
• more on memory (incl. associations) & learning
• connections with psychoacoustics
• production-perception connections
Categorical Perception
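• Run an identification experiment (% /b/ for each stimulus)
• Run a discrimination experiment (% difft, e.g. stimulus 1 versus 3)
[Figure: % /b/ and % difft, 0 to 100, across stimuli 1 … 3 … 5 … 7, with the discrimination peak marked]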
Courtesy Chris Darwin’s web site
Valaki et al. (2004) Neuropsychologia 42, 967–979
• Monolingual/near monolingual native speakers:
– 30 Mandarin-Chinese
– 20 Spanish speakers
– 42 American English
all right-handed
• Whole-head MEG, auditory word recognition test,
used clinically to establish hemispheric dominance
for receptive language: 63 abstract words/language
– 33 target words, each in 3 lists, with 10 novel nontarget words in each list
– lift finger when you recognize a target word
Patterns of dominance (%)
Laterality Index: (LH – RH) / (LH + RH)
Columns: LH, RH, bilateral
Spanish: 100 (LH)
English: 80 (LH), 20
Mandarin: 14 (LH), 7 (RH), 79 (bilateral)
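The Laterality Index formula above, (LH – RH) / (LH + RH), can be sketched as a small function. This is a minimal illustration: it assumes LH and RH are non-negative activity measures for the left and right hemispheres, and the example values are hypothetical, not data from Valaki et al.

```python
def laterality_index(lh: float, rh: float) -> float:
    """Return (LH - RH) / (LH + RH); +1 is fully left, -1 fully right, 0 balanced."""
    if lh < 0 or rh < 0:
        raise ValueError("activity measures must be non-negative")
    total = lh + rh
    if total == 0:
        raise ValueError("index is undefined when both measures are zero")
    return (lh - rh) / total


# Hypothetical activity measures, for illustration only.
for lh, rh in [(30, 10), (10, 30), (5, 5)]:
    print(lh, rh, laterality_index(lh, rh))
```

Values near zero correspond to a bilateral pattern; how far from zero counts as left or right dominant depends on a threshold that the slide does not state.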
Vowel-to-vowel coarticulation
/ibbi/ vs /AbbA/
Naturally spoken
Schwas exchanged
/ibbi/
/AbbA/