The importance of auditory illusions Computational Auditory Scene Analysis: Auditory Scene Analysis

The importance of auditory illusions
for artificial listeners
Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
Outline
1 Computational Auditory Scene Analysis
2 A survey of CASA
3 Illusions & prediction-driven CASA
4 CASA and speech recognition
5 Implications for duplex perception
6 Conclusions
CASA talk - Haskins/NUWC - Dan Ellis
1997oct24/5 - 1
Computational Auditory Scene Analysis:
An overview and some observations
Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
Outline
1 Modeling Auditory Scene Analysis
2 A survey of CASA
3 Prediction-driven CASA
4 CASA and speech recognition
5 Implications for other domains
6 Conclusions
1 Auditory Scene Analysis
“The organization of complex sound scenes according to their inferred sources”
• Sounds rarely occur in isolation
- getting useful information from real-world sound requires auditory organization
• Human audition is very effective
- unexpectedly difficult to model
• ‘Correct’ analysis defined by goal
- human beings have particular interests...
- (in)dependence as the key attribute of a source
- ecological constraints enable organization
Computational Auditory Scene Analysis (CASA)
• Automatic sound organization?
- convert an undifferentiated signal into a description in terms of different sources
• Psychoacoustics defines grouping ‘rules’
- e.g. [Bregman 1990]
- translate into computer programs?
• Motivations & Applications
- it’s a puzzle: new processing principles?
- real-world interactive systems (speech, robots)
- hearing prostheses (enhancement, description)
- advanced processing (remixing)
- multimedia indexing (movies etc.)
2 CASA survey
• Early work on co-channel speech
- listeners benefit from pitch difference
- algorithms for separating periodicities
• Utterance-sized signals need more
- cannot predict number of signals (0, 1, 2 ...)
- birth/death processes
• Ultimately, more constraints needed
- nonperiodic signals
- masked cues
- ambiguous signals
CASA1: Periodic pieces
• Weintraub 1985
- separate male & female voices
- find periodicities in each frequency channel by auto-coincidence
- number of voices is ‘hidden state’
• Cooke & Brown (1991-3)
- divide time-frequency plane into elements
- apply grouping rules to form sources
- pull single periodic target out of noise
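The per-channel periodicity search in the Weintraub bullet can be illustrated with a toy example. This is only a sketch: it uses plain autocorrelation as a stand-in for auto-coincidence, skips the filterbank entirely, and treats two synthetic harmonic tones as if they were the outputs of two channels, each dominated by one voice. The sample rate, the pitches and every function name are assumptions made for illustration.

```python
import math

SAMPLE_RATE = 8000  # Hz, assumed for this toy example


def harmonic_tone(f0, n_samples, n_harmonics=3):
    """Synthetic periodic signal: a few harmonics of f0 with 1/k amplitudes."""
    return [
        sum(math.sin(2 * math.pi * k * f0 * t / SAMPLE_RATE) / k
            for k in range(1, n_harmonics + 1))
        for t in range(n_samples)
    ]


def autocorrelation(signal, max_lag):
    """Unnormalized autocorrelation for lags 0..max_lag."""
    n = len(signal)
    return [
        sum(signal[t] * signal[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def dominant_period(signal, min_lag, max_lag):
    """Lag (in samples) with the largest autocorrelation in the search range."""
    acf = autocorrelation(signal, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda lag: acf[lag])


# Stand-ins for two channel outputs, each dominated by a different voice.
low_voice = harmonic_tone(100.0, 800)   # true period: 80 samples
high_voice = harmonic_tone(200.0, 800)  # true period: 40 samples

period_low = dominant_period(low_voice, 20, 120)
period_high = dominant_period(high_voice, 20, 120)
print(SAMPLE_RATE / period_low, SAMPLE_RATE / period_high)
```

Channels whose dominant periods agree could then be assigned to the same voice. How many voices are present at any moment remains the 'hidden state' noted above, which this sketch does not attempt to track.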
[Figure: two spectrogram panels, brn1h.aif and brn1h.fi.aif; axes frq/Hz (100 to 3000) against time/s (0.2 to 1.0)]
CASA2: Hypothesis systems
• Okuno et al. (1994-)
- ‘tracers’ follow each harmonic + noise ‘agent’
- residue-driven: account for whole signal
• Klassner 1996
- search for a combination of templates
- high-level hypotheses permit front-end tuning
[Figure, panels (a) and (b): frequency components from 420 Hz to 3760 Hz against TIME (1.0 to 4.0 sec), labeled Buzzer-Alarm, Glass-Clink, Phone-Ring and Siren-Chirp]
• Ellis 1996
- model for events perceived in dense scenes
- prediction-driven: observations - hypotheses
CASA3: Other approaches
• Blind source separation (Bell & Sejnowski)
- find exact separation parameters by maximizing statistic e.g. signal independence
• HMM decomposition (RK Moore)
- recover combined source states directly
• Neural models (Malsburg, Wang & Brown)
- avoid implausible AI methods (search, lists)
- oscillators substitute for iteration?
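The idea of finding separation parameters by maximizing a statistic can be shown on a deliberately small problem. The sketch below is not the Bell & Sejnowski algorithm: it assumes two sources mixed by a pure rotation (so no whitening step is needed), uses summed absolute excess kurtosis as a stand-in independence statistic, and finds the unmixing angle by a brute-force grid search. The source waveforms, the 30 degree mixing angle and all names are illustrative assumptions.

```python
import math


def standardize(values):
    """Return a zero-mean, unit-variance copy of values."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    std = math.sqrt(sum(c * c for c in centered) / n)
    return [c / std for c in centered]


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    z = standardize(values)
    return sum(v ** 4 for v in z) / len(z) - 3.0


def rotate(a, b, angle):
    """Apply a 2-D rotation by angle (radians) to the signal pair (a, b)."""
    c, s = math.cos(angle), math.sin(angle)
    return ([c * u - s * v for u, v in zip(a, b)],
            [s * u + c * v for u, v in zip(a, b)])


def correlation(a, b):
    za, zb = standardize(a), standardize(b)
    return sum(u * v for u, v in zip(za, zb)) / len(za)


# Two non-Gaussian sources with coprime periods, observed over one full
# joint cycle so that they are empirically independent.
N = 37 * 53
saw = standardize([float(i % 37) for i in range(N)])
square = standardize([1.0 if (i % 53) < 27 else -1.0 for i in range(N)])

# Assumed mixing: a rotation by 30 degrees.
mix_deg = 30
x1, x2 = rotate(saw, square, math.radians(mix_deg))


def contrast(deg):
    """Non-Gaussianity of the outputs after undoing a rotation of deg degrees."""
    y1, y2 = rotate(x1, x2, -math.radians(deg))
    return abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))


# Grid search for the unmixing angle that maximizes the statistic.
best_deg = max(range(90), key=contrast)
y1, y2 = rotate(x1, x2, -math.radians(best_deg))
print(best_deg, correlation(y1, saw), correlation(y2, square))
```

Real mixtures are not rotations, so a practical system would whiten the observations first and use a gradient method rather than a grid. The point here is only that the separation parameter is chosen by a statistic of the outputs, with no model of what the sources sound like.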
3 Prediction-driven CASA
Perception is not direct but a search for plausible hypotheses
• Data-driven...
[Diagram: input mixture → Front end → signal features → Object formation → discrete objects → Grouping rules → Source groups]
vs. Prediction-driven
[Diagram: input mixture → Front end → signal features → Compare & reconcile → prediction errors → Hypothesis management → hypotheses (Noise components, Periodic components) → Predict & combine → predicted features → Compare & reconcile]
• Novel features
- reconcile complete explanation to input
- ‘vocabulary’ of noise/transient/periodic
- multiple hypotheses
- sufficient detail for reconstruction
- explanation hierarchy
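The predict, compare and reconcile cycle can be sketched on a single band of energy values. This is a toy under strong assumptions, not the system described in the talk: each hypothesized element predicts a constant level, predictions combine by simple addition, a fixed tolerance decides when a prediction error matters, and only one hypothesis set is kept rather than several. The `Element` class, the `explain` function and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    level: float                 # energy this element predicts per frame
    start: int                   # frame where the element was hypothesized
    end: Optional[int] = None    # frame where it was retired (None = active)


def explain(observed: List[float], tolerance: float = 0.5) -> List[Element]:
    """Account for every frame of observed energy with a set of elements."""
    elements: List[Element] = []
    for frame, energy in enumerate(observed):
        active = [e for e in elements if e.end is None]
        predicted = sum(e.level for e in active)   # predict & combine
        error = energy - predicted                 # compare
        if error > tolerance:
            # Unexplained energy: hypothesize a new element to cover it.
            elements.append(Element(level=error, start=frame))
        elif error < -tolerance and active:
            # Prediction refuted: retire the element that best matches
            # the shortfall.
            victim = min(active, key=lambda e: abs(e.level + error))
            victim.end = frame
    return elements


# A steady background (level 1) with a louder event (extra level 3) on top.
explanation = explain([1.0, 1.0, 4.0, 4.0, 4.0, 1.0, 1.0, 0.0, 0.0])
for element in explanation:
    print(element)
```

In this run the background element is kept alive underneath the louder event, because nothing in the observations refutes it, and it is retired only when the band goes silent.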
Analyzing the continuity illusion
• Interrupted tone heard as continuous
- ... if the interruption could be a masker
[Figure: spectrogram of ptshort; axes f/Hz (1000 to 4000) against time/s (0.0 to 1.4)]
• Data-driven just sees gaps
• Prediction-driven can accommodate
- special case or general principle?
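The contrast between the two views of an interrupted tone can be written down as two small tests on the level in the tone's frequency band. This is a sketch under assumed numbers (a 60 dB tone, a 3 dB tolerance) with hypothetical function names; it ignores masking spread and everything outside the one band.

```python
def tone_like_runs(band_levels_db, tone_level_db, tolerance_db=3.0):
    """Data-driven view: count separate runs of frames that look like the tone."""
    runs, in_run = 0, False
    for level in band_levels_db:
        tone_like = abs(level - tone_level_db) <= tolerance_db
        if tone_like and not in_run:
            runs += 1
        in_run = tone_like
    return runs


def tone_heard_as_continuous(band_levels_db, tone_level_db, margin_db=0.0):
    """Prediction-driven view: the continuing-tone hypothesis survives as long
    as every frame holds at least the energy the tone would predict."""
    return all(level >= tone_level_db - margin_db for level in band_levels_db)


TONE_DB = 60.0
masked = [60.0, 60.0, 60.0, 75.0, 75.0, 60.0, 60.0, 60.0]  # gap filled by loud noise
silent = [60.0, 60.0, 60.0, 0.0, 0.0, 60.0, 60.0, 60.0]    # gap left empty

runs_masked = tone_like_runs(masked, TONE_DB)
heard_masked = tone_heard_as_continuous(masked, TONE_DB)
heard_silent = tone_heard_as_continuous(silent, TONE_DB)
print(runs_masked, heard_masked, heard_silent)
```

The data-driven count reports two tone pieces in both cases. The prediction-driven test separates them: the tone is allowed to continue through the noise burst, which could have masked it, but not through silence, which refutes it.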
PDCASA example: Construction-site ambience
[Figure: spectrogram panels for the Construction mixture and for Noise1, Noise2, Wefts1−6, Wefts7,9, Wefts8,10, Click1, Clicks2,3, Click4, Clicks5,6 and Clicks7,8; axes f/Hz against time/s (0 to 9), level scale −30 to −60 dB; event labels: Saw (10/10), Voice (6/10), Wood hit (7/10), Metal hit (8/10), Wood drop (10/10), Clink1 (4/10), Clink2 (7/10)]
• Problems
- error allocation
- source hierarchy
- rating hypotheses
- resynthesis
4 CASA for speech recognition
• Speech recognition is very fragile
- lots of motivation to use ‘source separation’
• Recognize combined states? (Moore)
- ‘state’ becomes very complex
• Data-driven: CASA as preprocessor
- problems with ‘holes’ (Cooke, Okuno)
- doesn’t exploit knowledge of speech structure
• Prediction-driven: speech as component
- same ‘reconciliation’ of speech hypotheses
- need to express ‘predictions’ in signal domain
[Diagram: input mixture → Front end → Compare & reconcile → Hypothesis management → Speech components, Noise components, Periodic components → Predict & combine → Compare & reconcile]
Example of speech & nonspeech
[Figure: spectrogram of 223cl with panels labeled Speech1, Click5 and Clicks1−7; axes f/Hz (200 to 4000) against time/s (0.0 to 1.4), level scale −10 to 20 dB]
• Problems:
- undoing classification & normalization
- finding a starting hypothesis
- granularity of integration
5 Prediction-driven analysis and duplex perception
• Single element → 2 percepts?
- e.g. contralateral formant transition
- doesn’t fit into exclusive support hierarchy
• But: two elements at same position
- hypotheses suggest overlap
- predictions combine
- reconciliation is OK
• Order debate is sidestepped
- ... not a left-to-right data path
Duplex perception as masking & restoration
• Account for masking could ‘work’ for duplex
- bilateral masking levels?
- masking spread?
- tolerable colorations?
• Sinewave speech as a plausible masker?
- formants hiding under each whistle?
- greedy speech hypothesis generator
• Problems:
- where do hypotheses come from? (priming)
- what limits on illusory speech?
[Figure: S1−env.pf:0; axes f/Bark (5 to 15) against time (0.0 to 1.8), level scale 40 to 80]
5 Lessons for other domains
• Problem: inadequate signal data
- hearing: masking
- vision: occlusion
- other sensor domains: noise/limits
• General answer: employ constraints
- high-level prior expectations
- mid-level regularities
- low-level continuity
• Hearing is an admirable solution
• Prediction-driven approach suggests priorities
Essential features of PDCASA
• Prediction-reconciliation of hypotheses
- specific hypotheses are pursued
- lack-of-refutation standard
• Provide a complete explanation
- keeping track of the obstruction can help in compensating for its effects
• Hierarchic representation
- useful constraints occur at many levels: want to be able to apply where appropriate
• Preserve detail
- even when resynthesis is not a goal
- helps gauge goodness-of-fit
6 Conclusions
• Auditory organization is indispensable in real environments
• We don’t know how listeners do it!
- plenty of modeling interest
• Prediction-reconciliation can account for ‘illusions’
- use ‘knowledge’ when signal is inadequate
- important in a wider range of circumstances?
• Speech recognizers are a good source of knowledge
• Wider implications of the prediction-driven approach
- understanding perceptual paradoxes
- applications in other domains