Lecture 4-04

Phonetic Context Effects

Major Theories of Speech Perception
Motor Theory:
A specialized module (in the theory's later version) represents speech sounds in terms of intended gestures, through a model of, or knowledge of, the vocal tract.
Direct Realism:
The perceptual system recovers (phonetically relevant) gestures by picking up the specifying information in the speech signal.
Explanatory level = gesture
General Approaches:
Speech is processed in the same way as other sounds. Representation is a function of the auditory system and experience with language.
Explanatory level = sound
Fluent Speech Production
The vocal tract is subject
to physical constraints...
Mass
Inertia
Radical Context Dependency
Also a result of the motor plan
Coarticulation = Assimilation
adjacent speech sounds become more similar
An Example
Place of Articulation in stops
Say /da/
Say /ga/
Anterior
Posterior
An Example
Place of Articulation in stops
Say /al/
Say /ar/
Anterior
Posterior
An Example
Place of Articulation in stops
Say /al da/
Say /ar da/
Say /al ga/
Say /ar ga/
Place of articulation changes = Coarticulation
An Example
Place of Articulation in stops
Say /al ga/
Say /ar da/
Coarticulation has acoustical consequences
[Figure: spectrograms (frequency vs. time) of /al da/, /al ga/, /ar da/, and /ar ga/; asterisks mark the context-dependent acoustic difference at the stop]
How does the listener deal with this?
Speech Perception
[Figure: identification functions for a 7-step [ga]-[da] series; x-axis: step (1 = [ga] to 7 = [da]); y-axis: percent “g” responses (0-100); three precursor conditions: none, /al/, /ar/]
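The shape of these identification functions can be sketched as a logistic whose category boundary moves with context; the boundary, slope, and shift values below are illustrative choices, not parameters fitted to the data shown.

```python
import math

def p_ga(step, context_shift=0.0, boundary=4.0, slope=1.5):
    """Probability of a "g" response at a continuum step (1 = [ga], 7 = [da]).

    A logistic identification function; a positive context_shift moves the
    boundary toward the [da] end, yielding more "g" responses (as after an
    /al/ precursor). All parameter values are illustrative.
    """
    return 1.0 / (1.0 + math.exp(slope * (step - (boundary + context_shift))))

# Hypothetical shifts: /al/ biases responses toward /ga/, /ar/ toward /da/.
for context, shift in [("none", 0.0), ("/al/", 0.7), ("/ar/", -0.7)]:
    print(context, [round(100 * p_ga(s, shift)) for s in range(1, 8)])
```

With these toy values the three curves cross 50% at steps 4.0, 4.7, and 3.3 respectively, reproducing the ordering in the figure: more “g” responses after /al/, fewer after /ar/.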
Identifying in Context: /al/ vs. /ar/

Direction of Effect
Production: after /al/, the stop becomes more /da/-like; after /ar/, more /ga/-like.
Perception: after /al/, the stop is heard as more /ga/-like; after /ar/, more /da/-like.
Perceptual Compensation for Coarticulation
What happens when
there is no coarticulation?
AT&T Natural Voices Text-To-Speech Engine
“ALL DA”
“ARE GA”
Further Findings
• 4½-month-old infants (Fowler et al., 1990)
• Native Japanese listeners who do not discriminate /al/ from /ar/ (Mann, 1986)
Theoretical Interpretations
“There may exist a universally shared level where
representation of speech sounds more closely corresponds
to articulatory gestures that give rise to the speech signal.”
(Mann, 1986)
Motor Theory
“Presumably human listeners possess implicit knowledge
of coarticulation.” (Repp, 1982)
Major Theories of Speech Perception
Motor Theory:
“Knowledge” of coarticulation allows the perceptual system to compensate for its predicted effects on the speech signal.
Direct Realism:
Coarticulation is information for the gestures involved. The signal is parsed along gestural lines, and coarticulation is assigned to its gesture.
General Approaches:
Those other guys are wrong.
Theoretical Interpretations
Common Thread:
Detailed correspondence between
speech production and perception
Special Module for Speech Perception
Two Predictions:
• Talker-Specific
• Speech-Specific
Testing Hypothesis #1
Talker-specific
Should only compensate for
the speech coming from a
single speaker
Testing Hypothesis #1
Talker-specific
Male /al/
Male /da/ - /ga/
Male /ar/
Female /al/
Female /ar/
[Figure: mean context shift (0-40) for male vs. female precursor voices]
Testing Hypothesis #2
Speech-specific
Compensation should only occur
for speech sounds
Contexts: SPEECH (/al/ or /ar/) vs. TONES
Testing Hypothesis #2
[Figure: mean % /ga/ responses (0-80) after /al/ vs. /ar/ contexts, in the Speech and Non-Speech conditions]
Does this rule out motor theory?
It may be that the special speech module is broadly tuned: if a sound acts like speech, it went through the speech module; if not, not.
SPEECH PRECURSORS
/al/
/ar/
Training the Quail
[Figure: a CV series varying in F3 onset frequency, from /ga/ (step 1) to /da/ (step 7); parts of the series and the /al/ and /ar/ contexts were withheld from training]
Context-Dependent Speech Perception
by an avian species
Conclusions
1. Links to speech production are not necessary: neither speech-specific nor species-specific.
2. Learning is not necessary: the quail had no experience with the covariation.
3. General auditory processes play a substantive role in maintaining perceptual compensation for coarticulation.
Major Theories of Speech Perception
Motor Theory:
“Knowledge” of coarticulation allows the perceptual system to compensate for its predicted effects on the speech signal.
Direct Realism:
Coarticulation is information for the gestures involved. The signal is parsed along gestural lines, and coarticulation is assigned to its gesture.
General Approaches:
General Auditory Processes (GAP)
Effects of Context: a familiar example
How well does this analogy hold up for context effects in speech?
[Figure: the spectrograms of /al da/, /al ga/, /ar da/, and /ar ga/ again]
Hypothesis: Spectral Contrast
the case of [ar]
Production: F3 is assimilated toward a lower frequency.
Perception: the following F3 is perceived as a higher frequency.
[Figure: spectrogram of /ar da/ over time, and identification functions (percent “g” responses, 0-100, across F3 steps 1-7) for the none, /al/, and /ar/ contexts]
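A minimal numeric sketch of the spectral contrast account: the perceived F3 onset of the stop is pushed away from the F3 of the preceding context, so a low-F3 /ar/ context makes an ambiguous stop sound more /da/-like. The frequencies, gain, and boundary below are illustrative assumptions, not measurements from the lecture.

```python
def perceived_f3(f3_onset_hz, context_f3_hz, contrast_gain=0.15):
    """Toy contrast rule: perceived F3 onset is displaced away from the
    context's F3. The gain value is an arbitrary illustration."""
    return f3_onset_hz + contrast_gain * (f3_onset_hz - context_f3_hz)

def categorize(f3_onset_hz, context_f3_hz, boundary_hz=2500.0):
    """Label a stop /da/ (high perceived F3 onset) or /ga/ (low)."""
    return "da" if perceived_f3(f3_onset_hz, context_f3_hz) > boundary_hz else "ga"

ambiguous = 2500.0  # F3 onset exactly at the assumed /da/-/ga/ boundary
print(categorize(ambiguous, context_f3_hz=1800.0))  # low-F3 /ar/ context -> da
print(categorize(ambiguous, context_f3_hz=3200.0))  # high-F3 /al/ context -> ga
```

Note the direction matches both the perception data and the production facts on the slide: the context assimilates in production, and perception shifts contrastively away from it.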
Evidence for General Approach
The Empire Strikes Back
Fowler et al. (2000)
Audio: an ambiguous precursor, then a test syllable from a /ga/-/da/ series.
Video: a visual cue (a face articulating “AL” or “AR”).
Precursor conditions differed only in visual information.
Results of Fowler et al. (2000)
• More /ga/ responses when the video cued /al/
[Figure: percent /ga/ responses across the 10-step test series, video /al/ vs. video /ar/]
Experiment 1: Results
• No context effect on the test syllable: F(1,8) = 3.2, p = .111
[Figure: % /ga/ responses by condition (video /al/ vs. video /ar/) across the 10-step test series, for 9 participants]
A closer look…
• 2 videos: /alda/ and /arda/
• The video information during test-syllable presentation should be the same in both conditions.
• But is the /alda/ video more consistent with /ga/, and the /arda/ video more consistent with /da/?
Results
[Figure: % /ga/ responses across the 10-step test series, for the video /da/ taken from /alda/ vs. from /arda/]
Comparisons
[Figures, side by side: percent /ga/ responses across the 10-step test series for video /al/ vs. video /ar/, and for video /da/ from /alda/ vs. video /da/ from /arda/]
Conclusions
1. No evidence of a visually mediated phonetic context effect
2. No evidence that gestural information is required
3. Spectral contrast is the best current account
But what about backwards effects???
The Stimulus Paradigm
[Figure: frequency (Hz) vs. time (ms) layout of a sine-wave tone context (high or low frequency), a noise burst (/t/ or /k/), and the target speech stimulus /da-ga/ (yielding Got, Dot, Gawk, Dock)]




Speaker Normalization
CARRIER SENTENCE: “Please say what this word is…” (original, ↑F1, ↓F1)
+ TARGET: “bit”, “bet”, “bat”, “but”
Ladefoged & Broadbent (1957)
• TARGET acoustics were constant
• TARGET perception shifted with changes in “speaker”
• Spectral characteristics of the preceding sentence predicted perception
‘Talker/Speaker Normalization’
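The normalization result can be caricatured as evaluating the target's F1 relative to the carrier sentence's average F1; the frequencies and the ratio boundary below are illustrative assumptions, not values from Ladefoged & Broadbent.

```python
def classify_vowel(target_f1_hz, carrier_mean_f1_hz, boundary_ratio=1.0):
    """Toy extrinsic normalization: hear "bit" (the lower-F1 vowel) when the
    target's F1 is low relative to the carrier's mean F1, otherwise "bet".
    All numbers are illustrative, not measurements from the 1957 study."""
    return "bit" if target_f1_hz / carrier_mean_f1_hz < boundary_ratio else "bet"

target = 450.0  # the target word's F1 is held constant (Hz)
print(classify_vowel(target, carrier_mean_f1_hz=500.0))  # prints bit
print(classify_vowel(target, carrier_mean_f1_hz=400.0))  # lowered-F1 carrier: prints bet
```

The same acoustic token flips category when only the carrier changes, which is the pattern the slide describes.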
Sensitivity to Accent, Etc.
Experiment Model
• Natural speech
• F2 & F3 onsets edited to create a 9-step series
• Varying perceptually from /ga/ to /da/
[Figure: timeline of the 589-ms speech token, steps 1-9 of the /ga/-/da/ series]
No Effect of Adjacent Context
with intermediate spectral characteristics
[Stimulus: standard tone (2300 Hz, 70 ms) + silent interval (50 ms) + speech token (589 ms)]
PILOT TEST: no context effect on speech perception (t(9) = 1.35, p = .21)
Acoustic Histories
[Stimulus: acoustic history (2100 ms) + standard tone (70 ms) + silent interval (50 ms) + speech token (589 ms)]
ACOUSTIC HISTORY: The critical context stimulus for these experiments is not a single sound, but a distribution of sounds.
• 21 70-ms tones, sampled from a distribution
• 30-ms silent interval between tones
Acoustic History Distributions
[Figure: frequency-of-presentation distributions over tone frequency (1300-3300 Hz); Low Mean = 1800 Hz, High Mean = 2800 Hz]
Example Stimuli
[Figure: spectrograms (frequency vs. time in ms) of two example histories: A, 2800 Hz mean; B, 1800 Hz mean]
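Generating one such acoustic history is easy to sketch. The sample rate and the width of the flat sampling distribution below are assumptions; the slides specify only the distribution means, the 70-ms tones, and the 30-ms gaps.

```python
import math
import random

SR = 22050             # sample rate in Hz (assumed; not stated in the lecture)
TONE_MS, GAP_MS = 70, 30
N_TONES = 21

def acoustic_history(mean_hz, half_range_hz=500, seed=None):
    """Sample 21 tone frequencies from a flat distribution centered on
    mean_hz (e.g. 1800 or 2800 Hz) and synthesize 70-ms sine tones with
    30-ms silences between them. Returns samples in [-1, 1]. The +/-500 Hz
    sampling range is an assumption read off the distribution figure."""
    rng = random.Random(seed)
    samples = []
    for i in range(N_TONES):
        f = rng.uniform(mean_hz - half_range_hz, mean_hz + half_range_hz)
        n_tone = int(SR * TONE_MS / 1000)
        samples += [math.sin(2 * math.pi * f * t / SR) for t in range(n_tone)]
        if i < N_TONES - 1:                      # gaps only between tones
            samples += [0.0] * int(SR * GAP_MS / 1000)
    return samples

history = acoustic_history(2800, seed=1)
print(len(history) / SR * 1000)  # ~2070 ms: 21 x 70-ms tones + 20 x 30-ms gaps
```

The 21 tones plus 20 gaps come to about 2070 ms, consistent with the roughly 2100-ms context duration on the timeline slides; because each trial draws fresh frequencies, only the distribution, not any particular tone sequence, is constant across trials.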
Characteristics of the Context
[Stimulus: acoustic history (2100 ms) + standard tone (70 ms) + silent interval (50 ms) + speech token (589 ms)]
• Context is not local: the standard tone immediately precedes each stimulus, independent of condition, and on its own has no context effect on the /ga/-/da/ stimuli.
• Context is defined by distribution characteristics: the sampling of the distribution varies on each trial, so the precise acoustic characteristics vary from trial to trial.
• Context unfolds over a broad time course.
Results
[Figure: percent /ga/ responses across stimulus steps 1-9 for High Mean vs. Low Mean histories; the effect is contrastive, p < .0001]
Notched Noise Histories
[Figure: notched-noise versions of the two histories (A and B); 100 Hz bandwidth for each notch]
Results
[Figures: percent “GA” responses across stimulus steps 1-9. Tones: High Mean vs. Low Mean, p < 0.01. Notched Noise: High Mean vs. Low Mean, p < 0.04. N = 10]
Joint Effects?
[Stimuli: acoustic history (2100 ms) + standard tone (70 ms) + silent interval (50 ms) + speech token (589 ms); or acoustic history (2100 ms) + /al/ or /ar/ (300 ms) + silent interval (50 ms) + speech token (589 ms)]
Conflicting: e.g., High Mean + /ar/
Cooperating: e.g., High Mean + /al/
Interaction of Speech/N.S.
[Figures: percent “ga” responses across speech-target steps 1-9 for /al/ vs. /ar/ contexts in three conditions: Conflicting, Cooperating, and Speech Only (context effects: p < .0001, p = .007, p = .009)]
Cooperating: significantly greater than speech alone.
Conflicting: same magnitude as Speech Only, but opposite direction; follows the nonspeech spectra.