When do which sounds tell you who says what? A phonetic investigation of the familiar talker advantage in word recognition.
University of Calgary Linguistics Brown Bag presentation
May 25, 2011
Steve Winters

What’s the Big Idea?
• The “Familiar Talker Advantage” =
  • Speech is more intelligible when produced by familiar talkers, rather than unfamiliar talkers.
• First elaborated by Nygaard et al. (1994) and Nygaard and Pisoni (1998):
  1. Trained listeners to identify 10 voices (5 male, 5 female) over 9 days.
  2. Tested trained listeners’ ability to identify words produced by:
    • Trained “familiar” voices
    • Novel “unfamiliar” voices
• Word recognition scores: familiar > unfamiliar

Who Says What?
[Figure labels: Talker 1, Talker 2]

Conditions
• Nygaard and Pisoni (1998) recognized that there were conditions on the emergence of the Familiar Talker Advantage:
  • Only exhibited by listeners who had performed well (> 70%) on the talker identification training task.
• Nonetheless, similar interactions between talker identity and speech perception have been observed in:
  • infants, who prefer to listen to their mother’s voice (DeCasper & Fifer, 1980)
  • sinewave speech, which supports word recognition and talker identification (Remez, Fellowes & Rubin, 1997)

Why Do We Care?
• The Nygaard et al. studies emphasized the intersection of indexical and linguistic information in the signal.
• Abercrombie (1967):
  • The linguistic properties of speech support the identification of linguistic (phonemic, etc.) contrasts.
  • The indexical properties of speech support the identification of “extralinguistic” aspects of the speaker:
    • Physical characteristics, dialect/group membership, gender, emotional/mental state.
• An old idea: speech perception must filter out indexical properties to extract the linguistic message.

Normalization
• A speech signal stripped of its indexical properties is highly abstract.
  • = “Normalized”
• “...when we learn a new word we practically never remember most of the salient acoustic properties that must have been present in the signal that struck our ears; for example, we do not remember the voice quality, speed of utterance, and other properties directly linked to the unique circumstances surrounding every utterance.” -- Morris Halle (1985)

Unfiltered
• In contrast, exemplar theories of speech perception (Johnson, 2007) emphasize the utility of not breaking down the signal into separate components.
• Conjecture: listeners store unanalyzed representations of speech that are “rich” with informative detail:
  • linguistic representations might include indexical (talker-specific) information;
  • and indexical representations include linguistic information.
• Generalizations emerge on the fly, from summed activations of similar exemplars.
• The “Familiar Talker Advantage” effect seemingly supports this view.

Adaptation
• Contemporary versions of the “normalization” theory emphasize the active role of the listener in speech processing…
  • Rather than the (abstract) content of representations.
• “The lack of invariance in the mapping of acoustic patterns onto phonetic categories is computationally nondeterministic…the nondeterministic mapping must be solved by mechanisms incorporating active control structures.” (Magnuson & Nusbaum, 2007)
• Q: Is speech perception even possible with the unanalytical approach of exemplar theory?
• Roughly: rules vs. representations

Which One?
• Experiment 1 attempts to adjudicate between these competing theories by exploiting a known asymmetry in the processing of indexical information in speech.
• Winters et al. (2008): tested identification of bilingual talkers across languages.
  1. English-speaking listeners trained to identify:
    • English-speaking bilinguals
    • German-speaking bilinguals
  2. Tested on same talkers speaking other language:
    • English → German: loss of ID accuracy
    • German → English: no loss in ID accuracy

Implications
• Winters et al. (2008) concluded:
  • Talker representations from a known language are language-specific
  • Talker representations from an unknown language are language-independent
• Exemplar-style representations of voices (= integrated linguistic and indexical information) only emerged within a known language.
• Q: The Familiar Talker Advantage emerges (for good listeners) within a known language;
  • Will it emerge for the same talkers across languages, as well?

Predictions
• Known: listeners show complete generalization of talker knowledge from German to English.
  • These listeners identify talkers based on language-independent information in speech.
• Exemplar-based prediction:
  • Learning to identify talkers in German will not facilitate word recognition in English.
  • (Listeners do not develop integrated representations.)
• Normalization-based prediction:
  • Listeners filter same talker properties in both languages → Familiar Talker Advantage should transfer across languages.

Experiment 1
• Q: Does knowledge of a talker in one language facilitate linguistic processing of that talker in another?
• Training task: talker identification
  • English-speaking listeners (monolingual)
  • Bilingual talkers, speaking in either English or German.
• Testing task: English word recognition in noise
  • Three talker groups:
    • Familiar bilinguals
    • Unfamiliar bilinguals
    • Native English talkers

Experiment 1: Materials
• 10 L1 German / L2 English talkers
  • All female
• These talkers produced
  • 360 CVC English words (e.g., buzz, cheek)
  • 360 CVC German words (e.g., hoch, Rahm)
• 5 talkers were designated as “Group 1”;
  • The other 5 were “Group 2”
  • Groups were balanced for intelligibility
• Also: 5 female monolingual English talkers
  • produced the same set of 360 CVC English words

Experiment 1: Training
• 3 days of training
  • 2 sessions per day (~30 min each)
• Each session involved:
  • Familiarization: same 5 words from each talker
  • Re-familiarization: same word from each talker
  • Recognition: 5 words/talker, heard twice
    • with feedback
  • Testing: 10 words/speaker
    • no feedback
• Half trained in German; half trained in English

Experiment 1: Task Demo
[Training demo]

Experiment 1: Word Recognition
• Trained listeners identified 24 words each from 15 different talkers:
  • Group 1: 5 unfamiliar English talkers
  • Group 2: 5 familiar German-English bilinguals
  • Group 3: 5 unfamiliar German-English bilinguals
• Words were presented in four levels of white noise:
  • Clear, +10 dB SNR, +5 dB SNR, 0 dB SNR
• Responses scored in terms of words, phonemes, features correct…

Experiment 1: Training
• “Good” learners reached 70% accuracy at some point in training
• No difference in learning rate between language groups (again)

Results: Phoneme Recognition
• Familiar Talker Advantage for good English-trained listeners only
  • (p = .008)
• No effect for German-trained listeners (good or poor)
• Modest trend in poor learners towards an Unfamiliar Talker Advantage!

Discussion
• (Good) English-trained listeners exhibited better word recognition scores for familiar talkers.
• (Good) German-trained listeners did not.
• Familiar Talker Advantage was not supported by language-independent talker representations.
• Familiar talker effect is based on rich, talker-specific linguistic representations…
  • rather than a filtering of “extra-linguistic” talker information.
• Caveat: some listeners develop these representations better than others.

Patterns
1. English-trained listeners displayed:
  • Interactions between linguistic and talker categories in both experiments.
2. German-trained listeners:
  • No interactions between linguistic and talker categories in either experiment.
Implications:
• English-trained listeners develop richly detailed, exemplar-like representations of voices.
• German-trained listeners develop sparser, language-independent representations of voices.

Life’s Persistent Questions, Part 2
• When learning to identify voices within a known language, listeners develop rich, exemplar-style representations.
• Will the Familiar Talker Advantage (within a known language) be affected by a change in the phonetic quality of the familiar voices?
• Note: listeners attend closely to phonetic cues that are consistently associated with a talker’s voice.
  • E.g., for the German bilingual voices, F0 patterns were an (unintentionally) consistent cue.

Talker Identity Cues
• Winters (submitted) trained listeners to identify Thai-speaking voices consistently associated with particular phonetic cues:
  1. Lexical tones
  2. VOT (voiced, unaspirated, aspirated)
  3. Vowel categories (front, central, back)
• Trained listeners were then tested on stimuli without these talker-cue associations.
• Associated cue salience hierarchy:
  • Tones > VOT, Vowel

However…
• Voice quality seemed to be an even more distinctive cue to talker identity.
[Figure: Talker Distinctiveness, Tone Training. y-axis: D-Prime; x-axis: Testing Session; series: English, Thai]

Background: Voice Quality
• Note that there are three primary types of vocal fold vibration:
  1. modal
    • vocal folds lightly adducted; flow of air causes periodic opening and closing of folds (“trilling”)
  2. breathy
    • vocal folds slightly apart; flow of air makes folds “wave” in the wind
  3. creaky
    • vocal folds tensely adducted; low airflow causes irregular, low frequency voicing

Experiment 2
• Two groups of English listeners were trained to identify English-speaking talkers:
  1. Each talker produced stimuli only in a particular voice quality (Quality-dependent training)
  2. Each talker produced stimuli in a variety of voice qualities (Quality-neutral training)
• After training, listeners completed a generalization task:
  1. Talkers only produced (novel) words in voice qualities not presented in training
  2. Talkers only produced novel words not presented in training--still in a variety of voice qualities

Experiment 2
• Listeners also completed a word recognition task:
  • Words produced by familiar and unfamiliar talkers
  1. Familiar-talker words in both trained and untrained voice qualities
  2. Familiar-talker words in a variety of voice qualities.
• Exemplar-based Predictions:
  • Voice quality will form part of the representation of the talkers’ voice
  • Familiar talkers will be more intelligible than unfamiliar talkers
  • Familiar voice qualities will be more intelligible than unfamiliar voice qualities (for the same talker)

Experiment 2: Materials
• Phonetically trained talkers produced the same list of 360 English CVC words in three different voice qualities:
  • modal
  • breathy
  • creaky
• In all, I recorded 6 female talkers and 6 male talkers
  • Only the female talkers were presented in the experiment.
• Two recording sessions, lasting about an hour each;
  • Talkers were paid $40 for their time and effort.

Experiment 2: Materials
• The “unfamiliar” talkers consisted of six female talkers recorded for the database used in Experiment #1.
  • Note: these talkers were from a different dialect region of the United States (Indiana)
• Also note: the breathy and creaky tokens tended to be longer in duration.
  • (Longer duration reflects less fluency with the articulation)
[Figure: token durations shown: 443 ms, 562 ms, 654 ms]

Experiment 2: Methods
• Training methods were identical to those used in Experiment 1.
• Listeners learned to identify six different (female) voices over the course of three days
  • Two training sessions on each day
  1. Quality-dependent: each talker only produced words in a particular voice quality
    • 2 modal talkers, 2 creaky talkers, 2 breathy talkers
  2. Quality-neutral: all talkers produced words in a variety of voice qualities
• The relationship between talker and voice quality was randomized in each group.

Experiment 2: Participants
• 16 participants in each group
• Listeners were recruited from introductory linguistics classes
  • (so they had some, but not a lot of, phonetics knowledge)
• On the fourth day of the experiment, listeners completed two tasks:
  1. Generalization
  2. Word Recognition
• Order of tasks was counterbalanced across listeners
• Listeners were paid $60 for their time and trouble.

Experiment 2: Generalization
• Task: talker identification
• Quality-dependent listeners:
  • All talkers produced words in the two voice qualities that they did not produce in training (5 in each)
• Quality-neutral listeners:
  • The relationship between talker and voice quality was still random (10 words/talker)
• Both groups identified talkers from words that were not presented in training.

Experiment 2: Word Recognition
• Listeners identified words, presented in pink noise (0 dB SNR), as produced by two sets of talkers:
  1. Familiar (6 voices; 12 words each)
  2. Unfamiliar (6 voices; 12 words each)
• For the quality-dependent listeners, words were evenly split between the voice quality associated with each talker in training (6) and the two voice qualities the talker did not produce in training (3 each).
• For example: [figure]
• Analysis: responses were scored in terms of words correct and phonemes correct (onset, nucleus, coda)

Results: Training
• QN (quality-neutral) listeners learned consistently over the six training sessions
• QD (quality-dependent) listeners effectively performed at ceiling, right from the start.

Results: Generalization
• No change for the QN group
• Catastrophic collapse for the QD group
• Voice quality was a highly salient cue for talker identity
  • (and one that listeners relied on heavily in training)

Results: Word Recognition
• Strong effect of voice quality: modal > creaky > breathy
• Familiar (modal) voices more intelligible than unfamiliar voices
• No effect of training condition, however.
• Also: no poor vs. good learner problem

Results: Word Recognition
• Tendency for word recognition accuracy to be higher for trained voice qualities
  • …but it was nowhere near significant.
• Familiar Talker Advantage does not depend on salient cues to talker identity. (?!)

Discussion: What?
• Combined results: what listeners are attending to most closely in the talker identification task is not useful to the word recognition task.
  • …and yet the Familiar Talker Advantage emerges anyway.
• Basic idea: the Familiar Talker Advantage is supported by that which is meaningful in the signal (i.e., that which supports word recognition) to the listeners
  • Crucial: hearing how a talker produces particular sequences of segments.
• Non-contrastive phonetic details may contribute to talker identification--and may even affect word recognition--but do not support talker-based word recognition.

Discussion: Huh?
• Perhaps Abercrombie (1967) was right: voice quality is “extralinguistic” in a language like English.
  • = noise in the linguistic signal
• Note: listening to “meaningless” words in another language also does not induce the Familiar Talker Advantage.
• Two possible interpretations:
  • Exemplar-based representations may depend on the meaningfulness of particular phonetic details.
  • Word recognition may be an automatic process, whereas talker identification is not.

Where to Next?
• Q: Did the same Familiar Talker Advantage emerge in this study?
  • Dialect issues.
  • Make sure that both groups of talkers are equivalent in intelligibility
• Computational modeling of unanalyzed vs. analyzed (source-filter) spectral similarity matching.
  • STRAIGHT
• Also: How do listeners learn from “pre-filtered” stimuli?
• Alison Harding (2011):
  • F0 vs. segmental contributions to tone perception.

Results: Word Recognition
• Go over the general voice quality results first
• Modal > Creaky > Breathy
• Why?
  • I guess because there is an inherently higher noise-to-signal ratio in breathy voice.
  • And creaky voice? Not entirely sure about an explanation, other than that it’s more unusual than modal voice.

Results: Word Recognition
• Training interactions: there were none.
• Show the training interaction graph regardless.

Results: Word Recognition
• Word recognition scores for familiar vs. unfamiliar voices.

Future Directions
• Test raw word recognition scores to make sure that both groups of talkers are equivalent in intelligibility (for modal voice)
• Experiment idea: determine whether it’s more difficult to distinguish male from female voices in creaky and/or breathy voice.
• Computational modeling of unanalyzed vs. analyzed (source-filter) spectral similarity matching.
• Oh also: maybe mention Alison’s thesis
  • “Analysis” of F0 vs. segmental contributions to tone perception.

Results: Generalization
• No change for the QN group;
• catastrophic collapse for the QD group
• Voice quality was a highly salient cue for talker identity
  • (and one that listeners relied on heavily)
• Note the two or three QD listeners who didn’t bomb out completely in generalization.
  • One (Hamish) told me that he noticed over time that there were other differences between speakers than just voice quality.

Indexical and Linguistic
• Abercrombie quote?

The Split
• Morris Halle quote.
The Familiar Talker Advantage
• Describe the Nygaard et al. series of studies.
• Also mention the stuff that Suzanne has found on babies’ tendency to demonstrate the same ability.
• Other stuff to think about:
  • The same finding in sinewave speech
  • The Remez business about looking for the phonetic locus of the facilitation.

Theoretical Implications
• Exemplar-based representational spin.

An Alternative View
• Present the basics of the “analytical” model of speech perception.
  • Which should no longer be considered a “normalization” model, apparently.

Rules vs. Representations
• Exemplar models focus more on the details in the signal;
  • they assume that generalizations can emerge from those details, working in concert with one another
  • Operations on the signal are minimal and deemphasized
• Analytical models focus more on the operations of the listener

What I/we have found
• The Familiar Talker Advantage is fragile.
  • It does not transfer across languages.
  • It does not encompass all phonetic aspects of the speech signal.
• The objective here: change the “voice” in two different ways:
  • Its linguistic content
  • Its acoustic (phonetic?) content
• Q: does either of these changes affect the emergence of the familiar talker advantage?
• A: Yes, the linguistic change does. This suggests that the FTA is a product of higher-level speech processing--

Experiment 1: Motivation
• Basically: perhaps the familiar talker advantage can help us adjudicate between the exemplar and analytical models of speech perception.
• What we’re trying to find out is--does the familiar talker advantage emerge because:
  1. Talker properties are stripped away from the signal, thereby making the linguistic properties clearer?
  2. More robust representations are formed of the interaction of the linguistic and indexical properties in the signal?
Experiment 1: Theoretical Predictions
• Walk through the exemplar story in detail
  • (I.e., see if you can figure it out for yourself)
• The analytical story:
  • the familiar talker advantage emerges from a perceptual clarification of which aspects of the signal are speaker-based (indexical), and which are segment-based (linguistic).
• The earlier data suggest that, in a familiar language, indexical processing is language-dependent
• But in an unfamiliar language, talker identification is

Experiment 1: Predictions
• Identification of voices transfers completely from an unfamiliar language to a familiar one:
  • whatever “filtering” methods are used in one language apply (without loss) to the other
  • familiarity with a voice in one (unknown) language should lead to a word recognition advantage for that voice in a known language.

Persistent Questions, part 2
• The Familiar Talker Advantage emerges (for good listeners) within a known language;
  • Will it emerge for the same talkers across languages, as well?
• Experiment 2: Does the Familiar Talker Advantage depend on particular qualities of a talker’s voice?

Experiment 2: Motivation
• Known: ability to identify a talker’s voice facilitates recognition of words spoken by that talker. (Nygaard et al., 1994)
  1. Exemplar-based account: linguistic representations include talker-specific information.
    • Processing is facilitated by similarity to traces in memory.
  2. Property-based account: listeners learn how to filter indexical properties of particular talkers.
    • …thereby becoming more adept at revealing the linguistic core of the spoken word.

Experiment 2: Predictions
• Known: listeners show complete generalization of talker knowledge from German to English. (Experiment 1)
  • These listeners identify talkers based on language-independent information in speech.
• Exemplar-based prediction:
  • Learning to identify talkers in German will not facilitate word recognition in English.
  • (Listeners do not develop integrated representations.)
• Property-based prediction:
  • Listeners filter same talker properties in both languages → facilitation should occur across languages.

Experiment 1: Training
• Listeners were trained to identify voices of either:
  • Group 1 (five German L1 female talkers)
  • Group 2 (five German L1 female talkers)
• Half trained in German; half trained in English
• Three days of training
  • Two sessions per day

Listener Split
• Some listeners performed better on the talker identification task than others.

Experiment 2: Testing
• Trained listeners identified 24 words each from 15 different talkers:
  • Group 1: 5 unfamiliar English talkers
  • Group 2: 5 familiar German-English bilinguals
  • Group 3: 5 unfamiliar German-English bilinguals
• Words were presented in four levels of white noise:
  • Clear, +10 dB SNR, +5 dB SNR, 0 dB SNR
• Responses scored in terms of words, phonemes, features correct…

Word Recognition, Untrained Listeners
[Figure: Word Recognition Across Groups. y-axis: % Words Correct; x-axis: Talker Group (English, Group One, Group Two)]

Results: Word Recognition, all listeners
[Figure: Percent Whole Words Correctly Identified, Gp 1 vs. Gp 2 talkers, for listener Groups 1 and 2 among English Learners and German Learners]
• Interaction between listener and talker groups is not significant.

Goats and Sheep
• Review of literature revealed that Nygaard et al. (1994) split listeners up into “good” and “poor” listeners.
  • Good listeners = 70% correct or better in training.
  • Poor listeners = < 70% correct in training.
• Splitting listeners in the same way yielded significant interactions in Experiment 2 data.

Results: Word Recognition, English Listeners
[Figure: Percent Whole Words Correctly Identified, Gp 1 vs. Gp 2 talkers, for listener Groups 1 and 2 among Good Learners and Poor Learners]
• Interaction (Good learners): p = .008; Interaction (Poor learners): p = .025.
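
The “good” vs. “poor” split described on the Goats and Sheep slide can be sketched in a few lines. This is only an illustration under stated assumptions: each listener is summarized by a list of per-session proportions correct, and a listener counts as “good” if any session reaches the 70% criterion (the “at some point in training” reading). The function name, listener IDs, and scores are hypothetical and are not data from the study.

```python
from typing import Dict, List

CRITERION = 0.70  # 70% correct, the threshold named on the slide


def split_learners(scores: Dict[str, List[float]],
                   criterion: float = CRITERION) -> Dict[str, List[str]]:
    """Label each listener "good" or "poor".

    Assumption: a listener is "good" if at least one training session
    reaches the criterion; otherwise the listener is "poor".
    """
    groups: Dict[str, List[str]] = {"good": [], "poor": []}
    for listener, sessions in scores.items():
        label = "good" if max(sessions) >= criterion else "poor"
        groups[label].append(listener)
    return groups


# Hypothetical per-session proportions correct (not data from the study).
example = {
    "L01": [0.40, 0.55, 0.62, 0.71, 0.75, 0.80],
    "L02": [0.30, 0.35, 0.41, 0.48, 0.52, 0.60],
}
groups = split_learners(example)
```

A stricter reading (for example, requiring the criterion in the final session) would only change the `max(sessions)` expression; which reading matches the original analysis is not stated in the slides.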
Results: Word Recognition, German Listeners
[Figure: Percent Whole Words Correctly Identified, Gp 1 vs. Gp 2 talkers, for listener Groups 1 and 2 among Good Learners and Poor Learners]
• Interaction between listener and talker groups is not significant.

Some More Implications
• Certain properties of the signal are only informative for talker identification, and not word recognition
  • (i.e., Abercrombie was right.)
• Maybe mention the “Who” and the “What” streams in the brain.
• It doesn’t seem like exemplar theory can get the job done with comparisons of unpacked spectral slices.
  • Minimally, I might suggest a perceptual unraveling of the signal into source + filter characteristics.
  • Maximally, I might suggest that exemplar-based perception starts with an articulatory model of what gestures produced a particular acoustic sequence.

Voice Quality Description?
• Examples and explanation of the three different voice qualities?
• The laryngeal settings necessary to produce these three different qualities are largely under a speaker’s control;
  • however, female voices tend to be a bit breathier (all other things being equal) due to the relative thinness of their vocal folds (which makes complete closure more difficult to attain).
• That being said, it is quite common these days to hear young, female speakers of American/Canadian English use creaky voice.
• Point: voice quality is sub-phonemic in English; it does not signal meaningful segmental contrasts in any way.

Experiment 2: Stimuli
• Some discussion, perhaps, of the difficulties in recording the stimuli, and the acoustic differences that resulted.
• Specifically: the breathy and creaky tokens tended to be longer in duration.
  • (Longer duration reflects less fluency with the articulation)
• Note that it is possible that the extended durations of these articulations made the particular segments in them easier to understand.
Experiment 2: Methods
• These methods will effectively be the same as in Experiment 1.
• Listeners learned to identify six different (female) voices over the course of four days
  • Two training sessions on each day
• Each training session consisted of:
  1. Familiarization/re-familiarization (5 words/talker, presented once--same words for everybody)
  2. Recognition (5 words/talker, presented twice, with feedback)
  3. Test (10 words/talker, presented once, without feedback)

Experiment 2: Conditions
• On the three days of training, listeners were split into one of two groups:
  1. Quality-dependent: each talker only produced words in a particular voice quality
    • 2 modal talkers, 2 creaky talkers, 2 breathy talkers
  2. Quality-neutral: all talkers produced words in a variety of voice qualities
• The relationship between talker and voice quality was strictly randomized.

What?
• Why did we not have to split up the learners into good and poor learners in order to get the Familiar Talker Advantage in this second study?
• Is the Familiar Talker Advantage the same thing in this study?
  • Maybe it only appears to be so, because the talkers from a different dialect area are not actually as intelligible to the listeners as the Canadian talkers.
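
The quality-dependent condition on the Experiment 2: Conditions slide (six talkers, two per voice quality, with a randomized talker-to-quality mapping) can be sketched as below. This is a hypothetical illustration, not the script used in the study: the function names, talker labels, and the use of a seed per listener are all assumptions.

```python
import random
from typing import Dict, List, Optional

QUALITIES = ["modal", "creaky", "breathy"]


def assign_qualities(talkers: List[str],
                     seed: Optional[int] = None) -> Dict[str, str]:
    """Randomly give each talker one voice quality, two talkers per quality.

    Assumption: the mapping is drawn once per listener (or per group);
    the slides say only that it was randomized.
    """
    if len(talkers) != 2 * len(QUALITIES):
        raise ValueError("expected six talkers (two per voice quality)")
    rng = random.Random(seed)
    slots = QUALITIES * 2  # two slots for each quality
    rng.shuffle(slots)
    return dict(zip(talkers, slots))


def untrained_qualities(trained: str) -> List[str]:
    """Qualities a talker did not produce in training (heard at generalization)."""
    return [q for q in QUALITIES if q != trained]


# Hypothetical talker labels.
mapping = assign_qualities([f"T{i}" for i in range(1, 7)], seed=1)
```

In the generalization task, `untrained_qualities(mapping[talker])` would give the two qualities presented for each talker to quality-dependent listeners.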