Puzzles and Patterns in 50 years of Research on Speech Perception
Sarah Hawkins
University of Cambridge
sh110@cam.ac.uk

Three periods
1. 1950-1965 Broad-based exploration
2. 1965-1990s Narrowed to focus on the search for invariance in the relationship between speech signal and its percept: THEORY
3. 1995…. This focus is broadening again
– to include ‘discrepant’ data & new understanding which requires changes in conceptualization of
• task goals
• processes involved

The Main Message
• Speech perception is at an exciting stage: we are beginning to integrate areas of old research with the mainstream theoretical work of the last 30 years or so
• A paradigm shift?

Early work
Glorious Discovery

Early work
• often looked at effects on the whole signal
• but as puzzles arose, and we looked more closely, attention became focused on small domains in an effort both to simplify and to clarify

Early work: source separation
Cocktail party effect / multi-talker perception
Cherry (1953)
• continuous natural speech, with different types of content, presented in different ways
• a huge wealth of observations relevant to
– memory
– attention
– transitional probabilities
– speaker vs message
Cherry (1953) JASA 25, 975-979

Early work: source separation
Cocktail party effect / multi-talker perception
Broadbent & Ladefoged (1957)
• separate synthetic formants fuse to sound like a single vowel when presented to the same or different ears, only if they have the same f0
• compared ‘natural’ and ‘sustained’ formants
• extensions to theories of hearing (e.g. Licklider)
ASA special session, 2004
Broadbent & Ladefoged (1957) JASA 29, 708-710
Darwin (1981) QJEP 33, 185-207
Bregman (1990) Auditory Scene Analysis
Cooke & Ellis (2001) Sp. Comm.
35, 141–177

Early work: source integration
Sumby & Pollack (1954)
Especially in high levels of noise:
• audiovisual presentation increases intelligibility (visual contribution is relative to the available auditory contribution)
Sumby & Pollack (1954) JASA 26: 212-215
Massaro (1998) Perceiving Talking Faces
Widespread AV groups and applications
• in auditory-only presentations, polysyllables are more intelligible than monosyllables (overall shape... neighborhoods…cohorts…)
Richard Warren, Paul Luce, Marslen-Wilson

Early work: brain function
Kimura (1961)
• speech is processed more efficiently by the ear that is contralateral to the language-dominant hemisphere
• independent of handedness and right/left focus of damage due to epilepsy
complexities of auditory pathways, cerebral dominance, and speech processing
Kimura (1961) Canadian J. Psychol., 15, 166-171
The new ‘cognitive neuroscience/psychology’…

Early work: memory
Miller (1956)
• short term memory span for unrelated items
– The Magical Number Seven ± Two
• can increase this span by:
– making relative rather than absolute judgments
– increasing the number of dimensions
– chunking into larger items
• recoding is a crucial process
Miller (1956) Psychological Review 63, 81-97
Serial learning and recall (e.g. Underwood)
Lashley (1951) Serial order in behavior
Pisoni (1973) and later

Early work: intelligibility
Context of Possible Responses
Miller, Heise & Lichten (1951)
• monosyllables
• size of test vocabulary affects identification
• 2…256…all monosyllables
• though presumably there are limits:
– two vs six
– five vs nine !
Miller, Heise & Lichten (1951) J.Exp.Psych.
41, 329-335

Early work: intelligibility
Phonetic Context
Pickett & Pollack (1963)
• excerpts from connected speech must be ≥ 800 ms long to be fully intelligible
• regardless of rate:
– faster rates need more syllables to be understood (slowing the speech down does not help)
crucial role of coarticulation & style (‘connected speech processes’)
Pickett & Pollack (1963) Language & Speech 6, 165-171

Early work: preceding context affects the interpretation of the current sound
Ladefoged and Broadbent (1957)
• "Please say what this word is:" bit bet bat but
F1 of CARRIER
bet 200-380 Hz
bit 380-660 Hz
Ladefoged and Broadbent (1957) JASA 29, 98-104

Early work: immediate context determines the interpretation of the current stimulus
Synthesizing bursts and transitionless vowels
Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606

Early work: immediate context determines the interpretation of the current stimulus
Identification of bursts and transitionless vowels: the CV is identified as the minimal acoustic unit
Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606

Early work: immediate context determines the interpretation of the current stimulus
Identification of burstless stops with different vowels: transitions are all you need!
Delattre, Liberman, & Cooper (1955) JASA 27, 769-773

Categorical Perception of obstruent consonants
Equal acoustic changes, unequal auditory percepts
place of articulation of stops: /b/ vs /d/ vs /g/
[Figure labels: b, d, g]
Liberman, Harris, Hoffman, and Griffith (1957) Journal of Experimental Psychology 54, 358-368

Categorical Perception of obstruent consonants
• together with a theoretical bias in favor of binary oppositions
• encouraged a focused search for simple transformations from the encoded signal to an unambiguous, formal linguistic mental representation

This narrower focus
• required clear conceptualisation of
– identity of the important unit(s) of perception
– process of abstraction
• On the whole, the units and levels of linguistic description were rather uncritically adopted

…units of linguistic description were rather uncritically adopted
“we….had undertaken to find the ‘invariants’ of speech, a term which implies, at least in its simplest interpretation, a one-to-one correspondence between something half-hidden in the spectrogram and the successive phonemes of the message.”
Cooper, Delattre, Liberman, Borst & Gerstman, Perception of synthetic speech sounds, JASA (1952) 24, 604-5

…though not without some misgivings
“…one should not expect always to be able to find acoustic invariants for the individual phonemes…we are trying to [compile] the code book, one in which there is one column for acoustic entries and another column for message units, whether these be phonemes, syllables, words, or whatever.”
Cooper, Delattre, Liberman, Borst & Gerstman, Perception of synthetic speech sounds, JASA (1952) 24, 604-5

Middle period
The search for essence: ‘invariance’

Middle period: the search for essence
• Impose order on the chaos!
• Focus: non-linearity between variation in acoustic signal and perceptual response
Categorical Perception (of consonants)
• Context becomes seen as variability, so we control for it ever more stringently
• to discover the crucial (invariant) properties requires a view of what is fundamental
• The basic syllable! ba
– CV
– in isolation
– stressed
– possibly with only one V if we’re looking at Cs, and only one C if we’re looking at Vs

Imposing order on chaos
• The basic syllable: ba (context: silence)
• What was lost?
– polysyllables
– unstressed syllables
– prosody
– accounting for rate changes
– connected speech
– informativeness of variation esp. in connected speech
– meaning
– communication
– (most things really)

Development of theory and the search for essence
• Two main approaches
– The Motor Theory
– Quantal Theory leading to Acoustic/Auditory Invariance

The Motor Theory of Speech Perception
Liberman, Cooper, Shankweiler & Studdert-Kennedy (1967) Psychological Review 74, 431–461
Liberman & Mattingly (1985) Cognition 21, 1-36
• Listeners interpret speech sounds in terms of
– motoric gestures they would make them with (1967)
– intended gestures of the speaker (1985)
• Gestural unit: ‘phonetic category’

Quantal Theory of Speech Perception (and production)
Stevens (1972, 1989)
• Regions of stability in the acoustic signal, or auditory response, provide a basis for forming categories of sounds
• Unit: distinctive feature (Chomsky & Halle 1968)
Stevens (1989) Journal of Phonetics 17, 3-45
Stevens (1972) In David & Denes Human Communication. 51-66

Quantal Theory becomes Acoustic/Auditory invariance theory
Stevens & Blumstein (1978) ……. Stevens (2002)
• For each distinctive feature (DF) there is a binary response to an invariant acoustic or auditory property
• e.g.
particular changes in spectral shape over short time periods at crucial parts of the signal
– segment boundaries
– vowel steady states
+consonantal: change
-consonantal: little change
Stevens (2002) JASA 111, 1872-1891
Stevens & Blumstein (1978) JASA 64, 1358-1368

Acoustic/Auditory invariance theory
Stevens (2002)
[Figure labels: +strident, -strident]
• landmarks:
– islands of reliability
– built-in local context
• connected speech…
Stevens (2002) JASA 111, 1872-1891

Common properties
• Motor and Acoustic Invariance theories have much in common
– dynamic
– early abstraction
– discrete units
– phonological
allowed psycholinguistic theories to assume an input that is abstract and discrete: to ignore phonetic information

Psycholinguistic theories
• Focus on word segmentation & identification
• Top-down knowledge compensates for impoverished (phonemic) input
– metrical stress, possible words, phonotactics….
• Statistical, probabilistic
• Some names:
– McClelland & Elman (TRACE)
– Cutler, Norris, McQueen (Race, Shortlist, Merge)
– Marslen-Wilson, Gaskell… (Cohort)

extensions, questions: is simplicity the best answer?
Kewley-Port (1983)
• better identification with overall pattern (more detail?)
Klatt (1979)
• Lexical Access From Spectra (LAFS)
• whole-word patterns?
Kewley-Port (1983) JASA 73, 322-335
Klatt (1979) Journal of Phonetics 7, 279-312

extensions, questions: wider influences
Ganong (1980)
• identification expt
• VOT continuum
• word at one end, nonword at the other
• perception is more forgiving when the sound means something!
[Figure: % /d/ responses (0 to 100) along the VOT continuum from short VOT (d) to long VOT (t); nonword-word: dask-task; word-nonword: dash-tash]
Ganong (1980) J. Exp.
Psych: HPP 6, 110-125

Summary: ‘context’ and ‘signal’
• ‘Units’ functionally inseparable from ‘context’
• The context and the signal together determine whether the signal is coherent
– and hence what each unit ‘is’

Recent developments (since early-to-mid 90s)
systematic subtle variation as linguistically informative: classify the contexts in a more linguistically-sophisticated way

Combining old and new themes
• re-examination and extension of information provided by systematic phonetic variation
• new areas, e.g.
– cross-linguistic work (Best, Beddor, Bradlow...)
– memory & learning (Goldinger, Pisoni...)
– functional brain imaging (Sophie Scott)

Listeners use fine phonetic detail
Allen & Miller (2004)
• speaker identity: listeners generalize talker-specific VOT information to a novel word
Smith (2004)
• lexical identity: slightly inappropriate allophones in a sentence disrupt word-spotting only when speaker is familiar to listener
• familiarization to speakers is fast
Allen & Miller (2004) JASA 116, 3171-3183
Smith (2004) PhD Dissertation, Cambridge University

• Spoken word recognition test, which is used to establish cerebral dominance
[Figure panels: Chinese, English, Spanish]
• large groups of native speakers of Chinese/English/Spanish
• coronal MRI slices, data for 3 Ss, >200 ms post-stimulus onset
• Lateralisation (%Ss):
– Spanish 100% left
– English 80% left
– Chinese 79% bilateral (tone lang.)
Valaki et al. (2004) Neuropsychologia 42, 967–979

What sort of model?
• biologically plausible
• roles of attention, memory & learning
• focus on meaning (‘sound to sense’)
• multiple potential ‘units of perception’
– no obligatory units?
• structure from incomplete information
Adaptive Resonance Theory (ART) ?
Grossberg 1986…
Grossberg (2003) Journal of Phonetics 31, 423-445

A key issue
• what is a phonetic category?
(Carol Fowler, May 2004: ‘never been sure’)
• mental representations of phonetic categories are dynamic, relational, & plastic
– Repp, Lindblom, Studdert-Kennedy
– Bradlow, Pisoni, Hawkins…..
Hawkins (2003) Journal of Phonetics 31, 373-405

bottom-up vs top-down?
• phonetic variation that systematically indicates linguistic structure makes many ‘top-down’ processes unnecessary
– e.g. allophonic detail vs Possible Word Constraint
• and blurs the traditional distinction between signal & knowledge

A Challenge
• to define and refine new questions in testable ways
– i.e. to refocus, but to do it in ways that:
– are rigorous yet focus on meaning and communication
– avoid the ‘new understanding’ becoming doctrinaire
– build on past contributions

Some topics I haven’t mentioned but should have… and could have, if I’d told the same story in a different way
• infants’ & animals’ perception (periods 2 & 3)
• vowel perception (dynamics; center of gravity)
• sine wave speech
• more theories (direct perception, auditory enhancement, FLMP)
• more on memory (incl. associations) & learning
• connections with psychoacoustics
• production-perception connections

Categorical Perception
Run a discrimination experiment
Run an identification experiment
[Figure: % /b/ identification and % different (1 versus 3) on a 0 to 100 scale, across continuum steps 1 ... 3 … 5 … 7, with a discrimination peak]
Courtesy Chris Darwin’s web site

Valaki et al.
(2004) Neuropsychologia 42, 967–979
• Monolingual/near monolingual native speakers:
– 30 Mandarin-Chinese
– 20 Spanish speakers
– 42 American English
all right handed
• Whole-head MEG, auditory word recognition test, used clinically to establish hemispheric dominance for receptive language: 63 abstract words/language
– 33 target words, each in 3 lists, with 10 novel nontarget words in each list
– lift finger when you recognize a target word

Patterns of dominance (%)
Laterality Index: (LH – RH) / (LH + RH)
[Table: columns LH, RH, bilateral; rows Spanish, English, Mandarin; cell values in extracted order: 100 80 14 20 7 79]

Vowel-to-vowel coarticulation
/ibbi/ vs /AbbA/
[Figure: naturally spoken and schwas-exchanged versions of /ibbi/ and /AbbA/]
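
Worked example: Laterality Index
The Valaki et al. slide gives the index as (LH – RH) / (LH + RH). A minimal Python sketch of that formula follows. The function name and the example values are hypothetical, and it assumes LH and RH are non-negative activity measures for the left and right hemispheres; the slide does not say how they were measured.

```python
def laterality_index(lh: float, rh: float) -> float:
    """Compute (LH - RH) / (LH + RH).

    Assumes lh and rh are non-negative hemisphere activity measures.
    Under that assumption the result is +1 for purely left-hemisphere
    activity, -1 for purely right, and 0 when the two are equal.
    """
    total = lh + rh
    if total == 0:
        # The index is undefined when there is no activity on either side.
        raise ValueError("LH + RH must be non-zero")
    return (lh - rh) / total


if __name__ == "__main__":
    # Hypothetical values, for illustration only (not data from the study).
    print(laterality_index(3, 1))   # left-leaning
    print(laterality_index(1, 1))   # balanced
    print(laterality_index(0, 2))   # fully right
```

Any cut-off for calling a listener left-dominant, right-dominant or bilateral would be a separate choice, which the slide does not state.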