When do which sounds tell you
who says what?
A phonetic investigation of the familiar talker
advantage in word recognition.
University of Calgary Linguistics
Brown Bag presentation
May 25, 2011
Steve Winters
What’s the Big Idea?
• The “Familiar Talker Advantage” =
• Speech is more intelligible when produced by
familiar talkers, rather than unfamiliar talkers.
• First elaborated by Nygaard et al. (1994) and
Nygaard and Pisoni (1998):
1. Trained listeners to identify 10 voices (5 male, 5
female) over 9 days.
2. Tested trained listeners’ ability to identify words
produced by:
• Trained “familiar” voices
• Novel “unfamiliar” voices
• Word recognition scores: familiar > unfamiliar
Who Says What?
[Figure: three pairs of panels labeled Talker 1 and Talker 2]
Conditions
• Nygaard and Pisoni (1998) recognized that there were
conditions on the emergence of the Familiar Talker
Advantage:
• Only exhibited by listeners who had performed well (>
70%) on the talker identification training task.
• Nonetheless, similar interactions between talker identity
and speech perception have been observed in:
• infants, who prefer to listen to their mother’s voice
(DeCasper & Fifer, 1980)
• sinewave speech, which supports word recognition
and talker identification (Remez, Fellowes & Rubin,
1997)
Why Do We Care?
• The Nygaard et al. studies emphasized the intersection
of indexical and linguistic information in the signal.
• Abercrombie (1967):
• The linguistic properties of speech support the
identification of linguistic (phonemic, etc.) contrasts.
• The indexical properties of speech support the
identification of “extralinguistic” aspects of the
speaker:
• Physical characteristics, dialect/group
membership, gender, emotional/mental state.
• An old idea: speech perception must filter out indexical
properties to extract the linguistic message.
Normalization
• A speech signal stripped of its indexical properties is
highly abstract.
• = “Normalized”
• “...when we learn a new word we practically
never remember most of the salient acoustic
properties that must have been present in the
signal that struck our ears; for example, we do
not remember the voice quality, speed of
utterance, and other properties directly linked to
the unique circumstances surrounding every
utterance.” -- Morris Halle (1985)
Unfiltered
• In contrast, exemplar theories of speech perception
(Johnson, 2007) emphasize the utility of not breaking
down the signal into separate components.
• Conjecture: listeners store unanalyzed representations of
speech that are “rich” with informative detail:
• linguistic representations might include indexical
(talker-specific) information;
• and indexical representations include linguistic
information.
• Generalizations emerge on the fly, from summed
activations of similar exemplars.
• The “Familiar Talker Advantage” effect seemingly
supports this view.
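The "summed activations" idea can be made concrete with a small sketch. This is not the model used in the talk: it assumes an exponentially decaying similarity function (a common choice in exemplar models), and the two-dimensional feature values, the talker labels, and the function names are all invented for illustration. Each stored exemplar keeps both its word and its talker, so the same memory traces can answer either question.

```python
import math

def similarity(a, b, sensitivity=1.0):
    """Similarity decays exponentially with Euclidean distance (an assumption)."""
    return math.exp(-sensitivity * math.dist(a, b))

def summed_activations(probe, exemplars, label_of, sensitivity=1.0):
    """Sum the probe's similarity to every stored exemplar, grouped by label."""
    totals = {}
    for exemplar in exemplars:
        label = label_of(exemplar)
        activation = similarity(probe, exemplar["features"], sensitivity)
        totals[label] = totals.get(label, 0.0) + activation
    return totals

# Hypothetical stored exemplars: unanalyzed features plus word AND talker.
stored = [
    {"features": (1.0, 1.0), "word": "buzz", "talker": "talker_1"},
    {"features": (1.2, 0.9), "word": "buzz", "talker": "talker_2"},
    {"features": (3.0, 3.1), "word": "cheek", "talker": "talker_1"},
    {"features": (2.9, 3.3), "word": "cheek", "talker": "talker_2"},
]

probe = (1.05, 1.0)  # an incoming token, close to talker_1's "buzz"

# The same exemplars support a linguistic and an indexical decision.
word_totals = summed_activations(probe, stored, lambda ex: ex["word"])
talker_totals = summed_activations(probe, stored, lambda ex: ex["talker"])

best_word = max(word_totals, key=word_totals.get)
best_talker = max(talker_totals, key=talker_totals.get)
```

In this toy setup no category is stored as an abstraction; "buzz" and "talker_1" win only because the probe is close to the traces that carry those labels.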
Adaptation
• Contemporary versions of the “normalization” theory
emphasize the active role of the listener in speech
processing…
• Rather than the (abstract) content of representations.
• “The lack of invariance in the mapping of acoustic
patterns onto phonetic categories is computationally nondeterministic…the nondeterministic mapping must be
solved by mechanisms incorporating active control
structures.” (Magnuson & Nusbaum, 2007)
• → Q: Is speech perception even possible with the
unanalytical approach of exemplar theory?
• Roughly: rules vs. representations
Which One?
• Experiment 1 attempts to adjudicate between these
competing theories by exploiting a known asymmetry in
the processing of indexical information in speech.
• Winters et al. (2008): tested identification of bilingual
talkers across languages.
1. English-speaking listeners trained to identify:
• English-speaking bilinguals
• German-speaking bilinguals
2. Tested on same talkers speaking other language:
• English → German: loss of ID accuracy
• German → English: no loss in ID accuracy
Implications
• Winters et al. (2008) concluded:
• Talker representations from a known language are
language-specific
• Talker representations from an unknown language are
language-independent
• → Exemplar-style representations of voices (= integrated
linguistic and indexical information) only emerged within a
known language.
• Q: The Familiar Talker Advantage emerges (for good
listeners) within a known language;
• Will it emerge for the same talkers across languages,
as well?
Predictions
• Known: listeners show complete generalization of talker
knowledge from German to English.
• → These listeners identify talkers based on language-independent information in speech.
• Exemplar-based prediction:
• Learning to identify talkers in German will not facilitate
word recognition in English.
• (Listeners do not develop integrated representations.)
• Normalization-based prediction:
• Listeners filter same talker properties in both
languages → Familiar Talker Advantage should transfer
across languages.
Experiment 1
• Q: Does knowledge of a talker in one language
facilitate linguistic processing of that talker in another?
• Training task: talker identification
• English-speaking listeners (monolingual)
• Bilingual talkers, speaking in either English or
German.
• Testing task: English word recognition in noise
• Three talker groups:
• Familiar bilinguals
• Unfamiliar bilinguals
• Native English talkers
Experiment 1: Materials
• 10 L1 German / L2 English talkers
• All female
• These talkers produced
• 360 CVC English words (e.g., buzz, cheek)
• 360 CVC German words (e.g., hoch, Rahm)
• 5 talkers were designated as “Group 1”;
• The other 5 were “Group 2”
• Groups were balanced for intelligibility
• Also: 5 female monolingual English talkers
• produced the same set of 360 CVC English words
Experiment 1: Training
• 3 days of training
• 2 sessions per day (~30 min each)
• Each session involved:
• Familiarization: same 5 words from each talker
• Re-familiarization: same word from each talker
• Recognition: 5 words/talker, heard twice
• with feedback
• Testing: 10 words/speaker
• no feedback
• Half trained in German; half trained in English
x2
Experiment 1: Task Demo
Training Demo
Experiment 1: Word Recognition
• Trained listeners identified 24 words each from 15
different talkers:
• Group 1: 5 unfamiliar English talkers
• Group 2: 5 familiar German-English bilinguals
• Group 3: 5 unfamiliar German-English bilinguals
• Words were presented in four levels of white noise:
• Clear, +10 dB SNR, +5 dB SNR, 0 dB SNR
• Responses scored in terms of words, phonemes,
features correct…
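For readers unfamiliar with the SNR levels listed above, the following sketch shows one way to scale a noise track so that a word is presented at a target signal-to-noise ratio. It is an illustration only: the talk does not say how the stimuli were mixed, and the sine-wave "word", the sample rate, and the function names are assumptions.

```python
import math
import random

def mean_power(samples):
    """Average squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)

def noise_gain_for_snr(signal, noise, target_db):
    """Gain to apply to the noise so that the SNR equals target_db."""
    target_noise_power = mean_power(signal) / (10 ** (target_db / 10))
    return math.sqrt(target_noise_power / mean_power(noise))

def mix_at_snr(signal, noise, target_db):
    """Add scaled noise to the signal, sample by sample."""
    gain = noise_gain_for_snr(signal, noise, target_db)
    return [s + gain * n for s, n in zip(signal, noise)]

def measure_snr_db(signal, noise):
    """SNR in dB between a signal and an (already scaled) noise track."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))

# Stand-in stimulus: one second of a 220 Hz tone at an assumed 16 kHz rate.
rng = random.Random(0)
signal = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
white_noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

achieved = {}
for level in (10, 5, 0):  # the three noisy conditions; "Clear" adds no noise
    gain = noise_gain_for_snr(signal, white_noise, level)
    achieved[level] = measure_snr_db(signal, [gain * n for n in white_noise])

noisy_word = mix_at_snr(signal, white_noise, 0)
```

At 0 dB SNR the scaled noise has the same average power as the speech, which is why that condition is the hardest of the four.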
Experiment 1: Training
• “Good” learners reached 70% accuracy at some point in
training
• No difference in learning rate between language groups
(again)
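The good/poor split can be sketched as a simple threshold on training accuracy. This is a minimal illustration, assuming that "reached 70% at some point" means the best session score is at or above 0.70; the listener IDs and accuracy values are invented, not data from the study.

```python
GOOD_LEARNER_THRESHOLD = 0.70  # criterion taken from the slides

def is_good_learner(session_accuracies, threshold=GOOD_LEARNER_THRESHOLD):
    """True if the listener reached the threshold in any training session."""
    return max(session_accuracies) >= threshold

def split_learners(training_scores, threshold=GOOD_LEARNER_THRESHOLD):
    """Partition listeners into (good, poor) lists by their best session."""
    good, poor = [], []
    for listener, accuracies in training_scores.items():
        if is_good_learner(accuracies, threshold):
            good.append(listener)
        else:
            poor.append(listener)
    return good, poor

# Hypothetical talker-identification accuracy over six training sessions.
training_scores = {
    "L01": [0.45, 0.55, 0.62, 0.71, 0.68, 0.74],
    "L02": [0.30, 0.41, 0.48, 0.52, 0.60, 0.66],
}

good, poor = split_learners(training_scores)
```

Whether the boundary case of exactly 70% counts as "good" is a choice made here, not something the slides settle.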
Results: Phoneme Recognition
• Familiar Talker Advantage for Good English listeners only
• (p = .008)
• No effect for German listeners (good or bad)
• Modest trend in poor learners towards an Unfamiliar Talker
Advantage!
Discussion
• (Good) English-trained listeners exhibited better word
recognition scores for familiar talkers.
• (Good) German-trained listeners did not.
• Familiar Talker Advantage was not supported by
language-independent talker representations.
• → Familiar talker effect is based on rich, talker-specific
linguistic representations…
• rather than a filtering of “extra-linguistic” talker
information.
• Caveat: some listeners develop these representations
better than others.
Patterns
1. English-trained listeners displayed:
• Interactions between linguistic and talker
categories in both experiments.
2. German-trained listeners:
• No interactions between linguistic and talker
categories in either experiment.
• Implications:
• English-trained listeners develop richly detailed,
exemplar-like representations of voices.
• German-trained listeners develop sparser,
language-independent representations of voices.
Life’s Persistent Questions, Part 2
• → When learning to identify voices within a known
language, listeners develop rich, exemplar-style
representations.
• Will the Familiar Talker Advantage (within a known
language) be affected by a change in the phonetic quality of
the familiar voices?
• Note: listeners attend closely to phonetic cues that are
consistently associated with a talker’s voice.
• E.g., for the German bilingual voices, F0 patterns were
an (unintentionally) consistent cue.
Talker Identity Cues
• Winters (submitted) trained listeners to identify Thai-speaking
voices consistently associated with particular
phonetic cues:
1. Lexical tones
2. VOT (voiced, unaspirated, aspirated)
3. Vowel categories (front, central, back)
• Trained listeners were then tested on stimuli without
these talker-cue associations.
• Associated cue salience hierarchy:
• Tones > VOT, Vowel
However…
• Voice quality seemed to be an even more distinctive cue
to talker identity.
Talker Distinctiveness, Tone Training
[Figure: D-Prime (y-axis, 0.00 to 4.00) by Testing Session (x-axis, 1 to 6); other labels: 1 to 5, English, Thai]
Background: Voice Quality
• Note that there are three primary types of vocal fold
vibration:
1. modal
• vocal folds lightly adducted; flow of air causes
periodic opening and closing of folds (“trilling”)
2. breathy
• vocal folds slightly apart; flow of air makes folds
“wave” in the wind
3. creaky
• vocal folds tensely adducted; low airflow causes
irregular, low frequency voicing
Experiment 2
• Two groups of English listeners were trained to
identify English-speaking talkers:
1. Each talker produced stimuli only in a particular voice
quality (Quality-dependent training)
2. Each talker produced stimuli in a variety of voice
qualities (Quality-neutral training)
• After training, listeners completed a generalization
task:
1. Talkers only produced (novel) words in voice
qualities not presented in training
2. Talkers only produced novel words not presented in
training--still in a variety of voice qualities
Experiment 2
• Listeners also completed a word recognition task:
• Words produced by familiar and unfamiliar talkers
1. Familiar-talker words in both trained and untrained
voice qualities
2. Familiar-talker words in a variety of voice qualities.
• Exemplar-based Predictions:
• Voice quality will form part of the representation of
the talkers’ voice
• → Familiar talkers will be more intelligible than
unfamiliar talkers
• → Familiar voice qualities will be more intelligible
than unfamiliar voice qualities (for the same talker)
Experiment 2: Materials
• Phonetically trained talkers produced the same list of 360
English CVC words in three different voice qualities:
• modal
• breathy
• creaky
• In all, I recorded 6 female talkers and 6 male talkers
• Only the female talkers were presented in the
experiment.
• Two recording sessions, lasting about an hour each;
• Talkers were paid $40 for their time and effort.
Experiment 2: Materials
• The “unfamiliar” talkers consisted of six female talkers
recorded for the database used in Experiment #1.
• Note: these talkers were from a different dialect region
of the United States (Indiana)
• Also note: the breathy and creaky tokens tended to be
longer in duration.
• (Longer duration reflects less fluency with the
articulation)
[Figure labels: 443 ms, 562 ms, 654 ms]
Experiment 2: Methods
• Training methods were identical to those used in
Experiment 1.
• Listeners learned to identify six different (female) voices
over the course of three days
• Two training sessions on each day
1. Quality-dependent: each talker only produced words in
a particular voice quality
• 2 modal talkers, 2 creaky talkers, 2 breathy talkers
2. Quality-neutral: all talkers produced words in a variety
of voice qualities
• The relationship between talker and voice quality was
randomized in each group.
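The two training conditions differ only in how voice qualities are assigned to talkers, which can be sketched as follows. This is a hypothetical reconstruction: the slides do not say whether the randomization was redone per listener, and the talker labels, the seed, and the function names are invented.

```python
import random
from collections import Counter

VOICE_QUALITIES = ("modal", "breathy", "creaky")

def assign_quality_dependent(talkers, rng):
    """Give each talker one fixed quality, with equal numbers per quality."""
    if len(talkers) % len(VOICE_QUALITIES) != 0:
        raise ValueError("talkers must divide evenly across voice qualities")
    per_quality = len(talkers) // len(VOICE_QUALITIES)
    pool = [q for q in VOICE_QUALITIES for _ in range(per_quality)]
    rng.shuffle(pool)  # randomize which talker gets which quality
    return dict(zip(talkers, pool))

def assign_quality_neutral(talkers, words_per_talker, rng):
    """Draw a quality independently for every word from every talker."""
    return {
        talker: [rng.choice(VOICE_QUALITIES) for _ in range(words_per_talker)]
        for talker in talkers
    }

rng = random.Random(2011)  # arbitrary seed so the sketch is repeatable
talkers = ["T1", "T2", "T3", "T4", "T5", "T6"]  # six female voices

qd_plan = assign_quality_dependent(talkers, rng)    # one quality per talker
qn_plan = assign_quality_neutral(talkers, 10, rng)  # quality varies by word
qd_counts = Counter(qd_plan.values())
```

In the quality-dependent plan voice quality is a perfectly reliable cue to talker identity; in the quality-neutral plan it carries no information about the talker at all.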
Experiment 2: Participants
• 16 participants in each group
• Listeners were recruited from introductory linguistics
classes
• (so they had some, but not a lot, of phonetics
knowledge)
• On the fourth day of the experiment, listeners completed
two tasks:
1. Generalization
2. Word Recognition
• Order of tasks was counterbalanced across listeners
• Listeners were paid $60 for their time and trouble.
Experiment 2: Generalization
• Task: talker identification
• Quality-dependent listeners:
• All talkers produced words in the two voice qualities
that they did not produce in training (5 in each)
• Quality-neutral listeners:
• The relationship between talker and voice quality
was still random (10 words/talker)
• Both groups identified talkers from words that were not
presented in training.
Experiment 2: Word Recognition
• Listeners identified words, presented in pink noise (0 dB
SNR), as produced by two sets of talkers:
1. Familiar (6 voices; 12 words each)
2. Unfamiliar (6 voices; 12 words each)
• For the quality-dependent listeners, words were evenly
split between the voice quality associated with each
talker in training (6) and the two voice qualities the talker
did not produce in training (3 each).
• For example:
• Analysis: responses were scored in terms of words
correct and phonemes correct (onset, nucleus, coda)
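The scoring scheme in the last bullet can be illustrated with a short sketch. It assumes that each CVC target and each response has already been transcribed as an (onset, nucleus, coda) triple, and that a word counts as correct only when all three positions match; the transcriptions and the function name are illustrative, not the scoring script used in the study.

```python
POSITIONS = ("onset", "nucleus", "coda")

def score_response(target, response):
    """Score one CVC response against its target, position by position."""
    by_position = {
        position: t == r
        for position, t, r in zip(POSITIONS, target, response)
    }
    return {
        "word_correct": all(by_position.values()),
        "phonemes_correct": sum(by_position.values()),
        **by_position,
    }

# Hypothetical trial: target "buzz" heard as "bus".
target = ("b", "ʌ", "z")
response = ("b", "ʌ", "s")
result = score_response(target, response)
# word_correct is False, phonemes_correct is 2, and only the coda is wrong
```

Phoneme-level scoring gives partial credit for near misses like this one, so it can pick up effects that whole-word scoring misses.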
Results: Training
• QN listeners learned consistently over the six training
sessions
• QD listeners effectively performed at ceiling, right from
the start.
Results: Generalization
• No change for the QN group
• Catastrophic collapse for the QD group
• → Voice quality was a highly salient cue for talker identity
• (and one that listeners relied on heavily in training)
Results: Word Recognition
• Strong effect of voice quality: modal > creaky > breathy
• Familiar (modal) voices more intelligible than unfamiliar
voices
• No effect of training condition, however.
• Also: no poor vs. good learner problem
Results: Word Recognition
• Tendency for word recognition accuracy to be higher for
trained voice qualities
• …but it was nowhere near significant.
• → Familiar Talker Advantage does not depend on salient
cues to talker identity. (?!)
Discussion: What?
• Combined results: what listeners are attending to most
closely in the talker identification task is not useful to the
word recognition task.
• …and yet the Familiar Talker Advantage emerges
anyway.
• Basic idea: the Familiar Talker Advantage is supported by
that which is meaningful in the signal (i.e., that which
supports word recognition) to the listeners
• Crucial: hearing how a talker produces particular
sequences of segments.
• Non-contrastive phonetic details may contribute to talker
identification--and may even affect word recognition--but do
not support talker-based word recognition.
Discussion: Huh?
• Perhaps Abercrombie (1967) was right: voice quality is
“extralinguistic” in a language like English.
• = noise in the linguistic signal
• Note: listening to “meaningless” words in another
language also does not induce the Familiar Talker
Advantage.
• Two possible interpretations:
• Exemplar-based representations may depend on
the meaningfulness of particular phonetic details.
• Word recognition may be an automatic process,
whereas talker identification is not.
Where to Next?
• Q: Did the same Familiar Talker Advantage emerge in
this study?
• Dialect issues.
• Make sure that both groups of talkers are equivalent
in intelligibility
• Computational modeling of unanalyzed vs. analyzed
(source-filter) spectral similarity matching.
• STRAIGHT
• Also: How do listeners learn from “pre-filtered” stimuli?
• Alison Harding (2011):
• F0 vs. segmental contributions to tone perception.
Results: Word Recognition
• Go over the general voice quality results first
• Modal > Creaky > Breathy
• Why?
• I guess because there is an inherently higher noise-to-signal ratio in breathy voice.
• And creaky voice? Not entirely sure about an explanation,
other than that it’s more unusual than modal voice.
Results: Word Recognition
• Training interactions: there were none.
• Show the training interaction graph regardless.
Results: Word Recognition
• Word recognition scores for familiar vs. unfamiliar voices.
Future Directions
• Test raw word recognition scores to make sure that both
groups of talkers are equivalent in intelligibility (for modal
voice)
• Experiment idea: determine whether it’s more difficult to
distinguish male from female voices in creaky and/or
breathy voice.
• Computational modeling of unanalyzed vs. analyzed
(source-filter) spectral similarity matching.
• Oh also: maybe mention Alison’s thesis
• “Analysis” of F0 vs. segmental contributions to tone
perception.
Results: Generalization
• No change for the QN group;
• catastrophic collapse for the QD group
• → Voice quality was a highly salient cue for talker identity
• (and one that listeners relied on heavily)
• Note the two or three QD listeners who didn’t bomb out
completely in generalization.
• One (Hamish) told me that he noticed over time that
there were other differences between speakers than
just voice quality.
Indexical and Linguistic
• Abercrombie quote?
The Split
• Morris Halle quote.
The Familiar Talker Advantage
• Describe the Nygaard et al. series of studies.
• Also mention the stuff that Suzanne has found on babies’
tendency to demonstrate the same ability.
• Other stuff to think about:
• The same finding in sinewave speech
• The Remez business about looking for the phonetic
locus of the facilitation.
Theoretical Implications
• Exemplar-based representational spin.
An Alternative View
• Present the basics of the “analytical” model of speech
perception.
• Which should no longer be considered a
“normalization” model, apparently.
• Rules vs. Representations
• Exemplar models focus more on the details in the signal;
• they assume that generalizations can emerge from
those details, working in concert with one another
• Operations on the signal are minimal and deemphasized
• Analytical models focus more on the operations of the
listener
What I/we have found
• The Familiar Talker Advantage is fragile.
• It does not transfer across languages.
• It does not encompass all phonetic aspects of the
speech signal.
• The objective here: change the “voice” in two different
ways:
• Its linguistic content
• Its acoustic (phonetic?) content
• Q: do either of these changes affect the emergence of
the familiar talker advantage?
• A: Yes, the linguistic change does. This suggests that
the FTA is a product of higher-level speech processing--
Experiment 1: Motivation
• Basically: perhaps the familiar talker advantage can
help us adjudicate between the exemplar and analytical
models of speech perception.
• What we’re trying to find out is--does the familiar talker
advantage emerge because:
1. Talker properties are stripped away from the signal,
thereby making the linguistic properties clearer?
2. More robust representations are formed of the
interaction of the linguistic and indexical properties
in the signal?
Experiment 1: Theoretical
Predictions
• Walk through the exemplar story in detail
• (I.e., see if you can figure it out for yourself)
• The analytical story:
• the familiar talker advantage emerges from a
perceptual clarification of which aspects of the signal
are speaker-based (indexical), and which are
segment-based (linguistic).
• The earlier data suggest that, in a familiar language,
indexical processing is language-dependent
• But in an unfamiliar language, talker identification is
Experiment 1: Predictions
• Identification of voices transfers completely from an
unfamiliar language to a familiar one:
• → whatever “filtering” methods are used in one
language apply (without loss) to the other
• → familiarity with a voice in one (unknown) language
should lead to a word recognition advantage for that
voice in a known language.
Persistent Questions, part 2
• The Familiar Talker Advantage emerges (for good
listeners) within a known language;
• Will it emerge for the same talkers across languages,
as well?
• Experiment 2: Does the Familiar Talker Advantage
depend on particular qualities of a talker’s voice?
Experiment 2: Motivation
• Known: ability to identify a talker’s voice facilitates
recognition of words spoken by that talker. (Nygaard
et al., 1994)
1. Exemplar-based account: linguistic representations
include talker-specific information.
• Processing is facilitated by similarity to traces in
memory.
2. Property-based account: listeners learn how to filter
indexical properties of particular talkers.
• …thereby becoming more adept at revealing the
linguistic core of the spoken word.
Experiment 2: Predictions
• Known: listeners show complete generalization of talker
knowledge from German to English. (Experiment 1)
• → These listeners identify talkers based on language-independent information in speech.
• Exemplar-based prediction:
• Learning to identify talkers in German will not facilitate
word recognition in English.
• (Listeners do not develop integrated representations.)
• Property-based prediction:
• Listeners filter same talker properties in both
languages → facilitation should occur across
languages.
Experiment 1: Training
• Listeners were trained to identify voices of either:
• Group 1 (five German L1 female talkers)
• Group 2 (five German L1 female talkers)
• Half trained in German; half trained in English
• Three days of training
• Two sessions per day
Listener Split
• Some listeners performed better on the talker
identification task than others.
Experiment 2: Testing
• Trained listeners identified 24 words each from 15
different talkers:
• Group 1: 5 unfamiliar English talkers
• Group 2: 5 familiar German-English bilinguals
• Group 3: 5 unfamiliar German-English bilinguals
• Words were presented in four levels of white noise:
• Clear, +10 dB SNR, +5 dB SNR, 0 dB SNR
• Responses scored in terms of words, phonemes,
features correct…
Word Recognition, Untrained Listeners
[Figure: Word Recognition Across Groups; y-axis: % Words Correct (0.40 to 0.50); x-axis: Talker Group (English, Group One, Group Two)]
Results: Word Recognition, all listeners
[Figure: Percent Whole Words Correctly Identified (10% to 60%) for Gp 1 Talkers and Gp 2 Talkers; listener Groups 1 and 2 within English Learners and German Learners]
Interaction between listener and talker groups is not significant.
Goats and Sheep
• Review of literature revealed that Nygaard et al. (1994)
split listeners up into “good” and “poor” listeners.
• Good listeners =
• 70% correct or better in training.
• Poor listeners =
• < 70% correct in training.
• Splitting listeners in the same way yielded significant
interactions in Experiment 2 data.
Results: Word Recognition, English Listeners
[Figure: Percent Whole Words Correctly Identified (10% to 60%) for Gp 1 Talkers and Gp 2 Talkers; listener Groups 1 and 2 within Good Learners and Poor Learners]
Interaction (Good learners): p = .008; Interaction (Poor learners): p = .025.
Results: Word Recognition, German Listeners
[Figure: Percent Whole Words Correctly Identified (10% to 60%) for Gp 1 Talkers and Gp 2 Talkers; listener Groups 1 and 2 within Good Learners and Poor Learners]
Interaction between listener and talker groups is not significant.
Some More Implications
• Certain properties of the signal are only informative for
talker identification, and not word recognition
• (i.e., Abercrombie was right.)
• Maybe mention the “Who” and the “What” streams in the
brain.
• It doesn’t seem like exemplar theory can get the job
done with comparisons of unpacked spectral slices.
• Minimally, I might suggest a perceptual unraveling of the
signal into source + filter characteristics.
• Maximally, I might suggest that exemplar-based
perception starts with an articulatory model of what
gestures produced a particular acoustic sequence.
Voice Quality Description?
• Examples and explanation of the three different voice
qualities?
• The laryngeal settings necessary to produce these three
different qualities are largely under a speaker’s control;
• however, female voices tend to be a bit breathier (all
other things being equal) due to the relative thinness of
their vocal folds (which makes complete closure more
difficult to attain).
• That being said, it is quite common these days to hear
young, female speakers of American/Canadian English
use creaky voice.
• Point: voice quality is sub-phonemic in English; it does
not signal meaningful segmental contrasts in any way.
Experiment 2: Stimuli
• Some discussion, perhaps, of the difficulties in recording
the stimuli, and the acoustic differences that resulted.
• Specifically: the breathy and creaky tokens tended to
be longer in duration.
• (Longer duration reflects less fluency with the
articulation)
• Note that it is possible that the extended durations of
these articulations made the particular segments in
them easier to understand.
Experiment 2: Methods
• These methods will effectively be the same as in
Experiment 1.
• Listeners learned to identify six different (female) voices
over the course of four days
• Two training sessions on each day
• Each training session consisted of:
1. Familiarization/re-familiarization (5 words/talker,
presented once--same words for everybody)
2. Recognition (5 words/talker, presented twice, with
feedback)
3. Test (10 words/talker, presented once, without
feedback)
Experiment 2: Conditions
• On the three days of training, listeners were split into
one of two groups:
1. Quality-dependent: each talker only produced words
in a particular voice quality
• 2 modal talkers, 2 creaky talkers, 2 breathy talkers
2. Quality-neutral: all talkers produced words in a variety
of voice qualities
• The relationship between talker and voice quality
was strictly randomized.
What?
• Why did we not have to split up the learners into good
and poor learners in order to get the Familiar Talker
Advantage in this second study?
• Is the Familiar Talker Advantage the same thing in this
study?
• Maybe it only appears to be so, because the talkers
from a different dialect area are not actually as
intelligible to the listeners as the Canadian talkers.