Computational Auditory Scene Analysis: Principles, Practice and Applications

advertisement
Computational Auditory Scene Analysis:
Principles, Practice and Applications
Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
Outline
1
Auditory Scene Analysis (ASA)
2
Computational ASA (CASA)
3
Context, expectation & predictions
4
Applications: speech recognition, indexing
5
Conclusions and open issues
CASA for TICSP - Dan Ellis
1999sep01 - 1
Auditory Scene Analysis
1
“The organization of sound scenes
according to their inferred sources”
f/Hz
•
Sounds rarely occur in isolation
- need to ‘separate’ for useful information
•
Human audition is very effective
- it’s a shock that modeling it is so hard
city22
4000
−40
2000
−50
1000
400
−60
200
−70
0
1
2
CASA for TICSP - Dan Ellis
3
4
5
6
7
8
9
dB
time/s
1999sep01 - 2
How can we separate sources?
•
In general, we can’t
- mathematically, too few degrees of freedom
- sensor noise limits separation
•
‘Correct’ analysis is subjectively defined
- sources exhibit independence, continuity, ...
→ecological constraints enable organization
•
Goal not complete separation (reconstruction)
but organization of available information
into useful structures
- telling us useful things about the outside world
•
‘The auditory system as a separation machine’
(Alain de Cheveigné)
CASA for TICSP - Dan Ellis
1999sep01 - 3
Psychology of ASA
•
Extensive experimental research
- perception of simplified stimuli (sinusoids, noise)
•
“Auditory Scene Analysis” [Bregman 1990]
- first: break mixture into small elements
- elements are grouped in to sources using cues
- sources have aggregate attributes
•
Grouping ‘rules’ (Darwin, Carlyon, ...):
- common onset/offset/modulation,harmonicity,
spatial location, ...
(from
Darwin 1996)
CASA for TICSP - Dan Ellis
1999sep01 - 4
Thinking about information processing:
Marr’s levels-of-explanation
•
Three distinct aspects to info. processing
Computational
Theory
‘what’ and ‘why’;
the overall goal
Sound
source
organization
Algorithm
‘how’;
an approach to
meeting the goal
Auditory
grouping
Implementation
practical
realization of the
process.
Feature
calculation &
binding
Why bother?
CASA for TICSP - Dan Ellis
- helps organize interpretation
- it’s OK to consider levels
separately, one at a time
1999sep01 - 5
Cues to grouping
•
Common onset/offset/modulation (“fate”)
•
Common periodicity (“pitch”)
Common onset
Periodicity
Computational
theory
Acoustic
consequences tend
to be synchronized
(Nonlinear) cyclic
processes are
common
Algorithm
Group elements that
start in a time range
? Place patterns
? Autocorrelation
Implementation
Onset detector cells
Synchronized osc’s?
? Delay-and-mult
? Modulation spect
•
Spatial location (ITD, ILD, spectral cues)
•
Sequential cues...
•
Source-specific cues...
CASA for TICSP - Dan Ellis
1999sep01 - 6
Simple grouping
•
E.g. isolated tones
freq
time
Computational
theory
Algorithm
Implementation
CASA for TICSP - Dan Ellis
• common onset
• common period (harmonicity)
• locate elements (tracks)
• group by shared features
? exhaustive search
• evolution in time
1999sep01 - 7
Complications for grouping:
1: Cues in conflict
•
Mistuned harmonic (Moore, Darwin..):
freq
time
- harmonic usually groups by onset & periodicity
- can alter frequency and/or onset time
- ‘degree of grouping’ from overall pitch match
•
Gradual, various results:
pitch shift
3%
mistuning
- heard as separate tone, still affects pitch
CASA for TICSP - Dan Ellis
1999sep01 - 8
Complications for grouping:
2: The effect of time
•
Added harmonics:
freq
time
- onset cue initially segregates;
periodicity eventually fuses
•
The effect of time
- some cues take time to become apparent
- onset cue becomes increasingly distant...
•
What is the impetus for fission?
- e.g. double vowels
- depends on what you expect .. ?
CASA for TICSP - Dan Ellis
1999sep01 - 9
Outline
1
Auditory Scene Analysis (ASA)
2
Computational ASA (CASA)
- A simple model of grouping
- Other systems
3
Context, expectation & predictions
4
Applications: speech recognition, indexing
5
Conclusions and open issues
CASA for TICSP - Dan Ellis
1999sep01 - 10
2 Computational Auditory Scene Analysis
(CASA)
•
Automatic sound organization?
- convert an undifferentiated signal into a
description in terms of different sources
•
Translate psych. rules into programs?
- representations to reveal common onset,
harmonicity ...
•
Motivations & Applications
- it’s a puzzle: new processing principles?
- real-world interactive systems (speech, robots)
- hearing prostheses (enhancement, description)
- advanced processing (remixing)
- multimedia indexing
CASA for TICSP - Dan Ellis
1999sep01 - 11
A simple model of grouping
•
input
mixture
“Bregman at face value” (e.g. Brown 1992):
Front end
signal
features
(maps)
Object
formation
discrete
objects
Grouping
rules
Source
groups
freq
onset
time
period
frq.mod
-
feature maps
periodicity cue
common-onset boost
resynthesis
CASA for TICSP - Dan Ellis
1999sep01 - 12
Grouping model results
•
frq/Hz
Able to extract voiced speech:
brn1h.aif
frq/Hz
3000
3000
2000
1500
2000
1500
1000
1000
600
600
400
300
400
300
200
150
200
150
100
brn1h.fi.aif
100
0.2
0.4
0.6
0.8
1.0
time/s
0.2
0.4
•
Periodicity is the primary cue
- how to handle aperiodic energy?
•
Limitations
- resynthesis via filter-mask
- only periodic targets
- robustness of discrete objects
CASA for TICSP - Dan Ellis
0.6
0.8
1999sep01 - 13
1.0
time/s
Other CASA systems (1)
•
Weintraub 1985
- separate male & female voices
- find periodicities in each frequency channel by
auto-coincidence
- number of voices is ‘hidden state’
•
Cooke 1991
- ‘Synchrony strands’ auditory model
- Fusing resolved harmonics and AM formants
- led to [Brown 1992]
CASA for TICSP - Dan Ellis
1999sep01 - 14
Other CASA systems (2)
•
Okuno, Nakatani &c (1994-)
- ‘tracers’ follow each harmonic + noise ‘agent’
- residue-driven: account for whole signal
•
Klassner 1996
- search for a combination of templates
- high-level hypotheses permit front-end tuning
3760 Hz
Buzzer-Alarm
2540 Hz
2230 Hz
2350 Hz
Glass-Clink
1675 Hz
1475 Hz
950 Hz
500 Hz
460 Hz
420 Hz
Phone-Ring
1.0
•
Siren-Chirp
2.0
3.0
4.0 sec
1.0
2.0
TIME
TIME
(a)
(b)
3.0
4.0 sec
Ellis 1996
- model for events perceived in dense scenes
- prediction-driven: observations - hypotheses
CASA for TICSP - Dan Ellis
1999sep01 - 15
Other signal-separation approaches
•
HMM decomposition (RK Moore ’86)
- recover combined source states directly
•
Blind source separation (Bell & Sejnowski ’94)
- find exact separation parameters by maximizing
statistic e.g. signal independence
•
Neural models (Malsburg, Wang & Brown)
- avoid implausible AI methods (search, lists)
- oscillators substitute for iteration?
CASA for TICSP - Dan Ellis
1999sep01 - 16
Outline
1
Auditory Scene Analysis (ASA)
2
Computational ASA (CASA)
3
Context, expectation & predictions
- the effect of context
- streaming, illusions and restoration
- prediction-driven (PD) CASA
4
Applications: speech recognition, indexing
5
Conclusions and open issues
CASA for TICSP - Dan Ellis
1999sep01 - 17
3
Context, expectations & predictions
Perception is not direct
but a search for plausible hypotheses
•
Data-driven...
input
mixture
Front end
signal
features
Object
formation
discrete
objects
Grouping
rules
Source
groups
vs. Prediction-driven
hypotheses
Noise
components
Hypothesis
management
prediction
errors
input
mixture
•
Front end
signal
features
Compare
& reconcile
Periodic
components
Predict
& combine
predicted
features
Motivations
- detect non-tonal events (noise & clicks)
- support ‘restoration illusions’...
→ hooks for high-level knowledge
+ ‘complete explanation’, multiple hypotheses,
resynthesis
CASA for TICSP - Dan Ellis
1999sep01 - 18
The effect of context
•
Context can create an ‘expectation’:
i.e. a bias towards a particular interpretation
•
e.g. Bregman’s “old-plus-new” principle:
A change in a signal will be interpreted as an
added source whenever possible
freq/kHz
2
1
+
0
0.0
0.4
0.8
1.2
time/s
- a different division of the same energy
depending on what preceded it
CASA for TICSP - Dan Ellis
1999sep01 - 19
Streaming
•
Successive tone events form separate streams
freq.
TRT: 60-150 ms
1 kHz
∆f:
±2 octaves
time
•
Order, rhythm &c within, not between, streams
Computational
theory
Algorithm
Implementation
CASA for TICSP - Dan Ellis
Consistency of properties for
successive source events
• ‘expectation window’ for known
streams (widens with time)
• competing time-frequency
affinity weights...
1999sep01 - 20
Restoration & illusions
•
Direct evidence may be masked or distorted
→make best guess using available information
•
E.g. the ‘continuity illusion’:
f/Hz
ptshort
4000
2000
1000
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
time/s
- tones alternates with noise bursts
- noise is strong enough to mask tone
... so listener discriminate presence
- continuous tone distinctly perceived
for gaps ~100s of ms
→ Inference acts at low, preconscious level
CASA for TICSP - Dan Ellis
1999sep01 - 21
Speech restoration
•
Speech provides very strong bases for
inference (coarticulation, grammar, semantics):
frq/Hz
nsoffee.aif
3500
3000
2500
•
2000
Phonemic
restoration
1500
1000
500
0
1.2
1.3
1.4
1.5
1.6
1.7
time/s
Temporal compound (1998jul10)
20
•
Temporal
compounds
40
60
80
100
120
•
Sinewave
speech
(duplex?)
50
f/Bark
15
80
60
100
150
200
250
300
time / ms
350
400
450
500
550
S1−env.pf:0
10
5
40
0.0
CASA for TICSP - Dan Ellis
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1999sep01 - 22
1.8
Modeling top-down processing:
‘Prediction-driven’ CASA (PDCASA):
hypotheses
Noise
components
Hypothesis
management
prediction
errors
input
mixture
Front end
signal
features
Compare
& reconcile
Periodic
components
Predict
& combine
predicted
features
•
An approach as well as an implementation...
•
Key features:
- ‘complete explanation’ of all scene energy
- vocabulary of periodic/noise/transient elements
- multiple hypotheses
- explanation hierarchy
CASA for TICSP - Dan Ellis
1999sep01 - 23
PDCASA for old-plus-new
•
Incremental analysis
t1
t2
t3
Input signal
Time t1:
initial element
created
Time t2:
Additional
element required
Time t3:
Second element
finished
CASA for TICSP - Dan Ellis
1999sep01 - 24
PDCASA for the continuity illusion
•
Subjects hear the tone as continuous
... if the noise is a plausible masker
f/Hz
ptshort
4000
2000
1000
0.0
0.2
0.4
0.6
0.8
1.0
1.2
i
1.4
/
•
Data-driven analysis gives just visible portions:
•
Prediction-driven can infer masking:
CASA for TICSP - Dan Ellis
1999sep01 - 25
PDCASA analysis of a complex scene
f/Hz
City
4000
2000
1000
400
200
1000
400
200
100
50
0
f/Hz
1
2
3
Wefts1−4
4
5
Weft5
6
7
Wefts6,7
8
Weft8
9
Wefts9−12
4000
2000
1000
400
200
1000
400
200
100
50
Horn1 (10/10)
Horn2 (5/10)
Horn3 (5/10)
Horn4 (8/10)
Horn5 (10/10)
f/Hz
Noise2,Click1
4000
2000
1000
400
200
Crash (10/10)
f/Hz
Noise1
4000
2000
1000
−40
400
200
−50
−60
Squeal (6/10)
Truck (7/10)
−70
0
CASA for TICSP - Dan Ellis
1
2
3
4
5
6
7
8
9
time/s
dB
1999sep01 - 26
Problems in PDCASA
•
Subjective ground-truth in mixtures?
- listening tests collect ‘perceived events’:
•
Other problems
- error allocation
- source hierarchy
CASA for TICSP - Dan Ellis
- rating hypotheses
- resynthesis
1999sep01 - 27
Marrian analysis of PDCASA
•
Marr invoked to separate high-level function
from low-level details
Computational
theory
Algorithm
Implementation
• Objects persist predictably
• Observations interact irreversibly
• Build hypotheses from generic
elements
• Update by prediction-reconciliation
???
“It is not enough to be able to describe the response of single
cells, nor predict the results of psychophysical experiments.
Nor is it enough even to write computer programs that perform
approximately in the desired way:
One has to do all these things at once, and also be very aware
of the computational theory...”
CASA for TICSP - Dan Ellis
1999sep01 - 28
Outline
1
Auditory Scene Analysis (ASA)
2
Computational ASA (CASA)
3
Context, expectation & predictions
4
Applications
- CASA for speech recognition
- CASA for audio indexing
5
Conclusions and open issues
CASA for TICSP - Dan Ellis
1999sep01 - 29
4 .1 Applications: Speech recognition
•
Conventional speech recognition:
signal
Feature
extraction
low-dim.
features
Phone
classifier
phone
probabilities
HMM
decoder
words
- signal assumed entirely speech
- find valid segmentation using discrete labels
- class models from training data
•
Some problems:
- need to ignore lexically-irrelevant variation
(microphone, voice pitch etc.)
- compact feature space → everything speech-like
•
Very fragile to nonspeech, background
- scene-analysis methods very attractive...
CASA for TICSP - Dan Ellis
1999sep01 - 30
CASA for speech recognition
•
Data-driven: CASA as preprocessor
- problems with ‘holes’ (but: Okuno)
- doesn’t exploit knowledge of speech structure
•
Missing data (Cooke &c, de Cheveigné)
- CASA cues distinguish present/absent
- use ‘aware’ classifier
•
Prediction-driven: speech as component
- same ‘reconciliation’ of speech hypotheses
- need to express ‘predictions’ in signal domain
Speech
components
Hypothesis
management
input
mixture
CASA for TICSP - Dan Ellis
Noise
components
Predict
& combine
Periodic
components
Front end
Compare
& reconcile
1999sep01 - 31
Example of speech & nonspeech
f/Bark
15
(a) Clap
10
5
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
(b) Speech plus clap
dB
60
40
20
(c) Recognizer output
h#
w n
h#
ay
n
n
ay
tcl t
uw
n
t
f
uw
ay
ah
f
s
ay
ay ow
h# v
s eh
v
ow
h#
five
oh
<SIL>
s
v ah
eh
v
n
ax
h#
n
tcl
<SIL>
nine
two
seven
(d) Reconstruction from labels alone
(e) Slowly−varying portion of original
(f) Predicted speech element ( = (d)+(e) )
(g) Click5 from nonspeech analysis
(h) Spurious elements from nonspeech analysis
0.0
•
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
Problems:
- undoing classification & normalization
- finding a starting hypothesis
- granularity of integration
CASA for TICSP - Dan Ellis
1999sep01 - 32
CASA & Missing Data
•
CASA indicates source energy regions
- e.g. the time-frequency mask of [Brown 1992]
•
‘Missing data’ theory permits inference:
- skip dimensions of an uncorrelated Gaussian
- perform full integral over unknown range
- ‘data imputation’ e.g. for deltas, cepstra
•
... or just weighting of information streams
- 4 band recognizer [Berthommier et al. 1998]
•
RESPITE project
CASA
Signal
quality
labels
SNR
signal
Feature
stream
division
per-stream
quality labels
Confidence
estimation
per-stream
confidences
Missing-data
classifier
multiple information
streams
CASA for TICSP - Dan Ellis
Missing data/
multistream
recombination
per-stream
class probabilites
1999sep01 - 33
words
The RESPITE CASA Toolkit
(Barker, Ellis & Cooke 1999)
CASA toolkit block diagram
dpwe@icsi.berkeley.edu
1999mar19
Interaural
analysis
Modulation
analysis
• simple cycle
• blackboard
• ASR-style
decoding
Frequency
analysis
CxFxP
Periodicity
tracking
Object periodicity
hypotheses
• summary & tracking
• channel voting
• modulation filtering
• stabilized image
• cancellation
• autocorrelation (FFT/delay&mult)
CxF
Spectral:
• Gammatone
• FFT
• general FIR/IIR
Envelope:
• IHC model
• HWR/FWR + LPF
• analytic envelope
Spectral
modeling
• t-f samples
• LPC
Crossspectral
analysis
• local maxima
• synchrony
Sinusoid
tracking
Multidimensional continuous signal
Semi-discrete object hypotheses
CASA for TICSP - Dan Ellis
Object location
hypotheses
• cross-correlation(FFT/sampled)
• interaural level difference
Execution/
hypothesis
control
Sound
C
CxFxA
Spatial
tracking
Spectral envelope
hypotheses
Onset hypotheses
Integration
& object
formation
• evidence
weighting
• ‘re-entry’
Spectral partition
hypotheses
Sinusoid
hyps
Sinusoid
grouping
Sinusoid group
hypotheses
C = number of input channels (monaural/binaural)
F = number of frequency channels (2-1024)
P = number of periodicity bins (25-500)
A = number of spatial (azimuth) bins (16-256)
1999sep01 - 34
4 .2
f/Hz
Applications: audio indexing
city22
4000
horn
−40
2000
yell
−60
400
200
car noise
−70
0
1
2
3
4
5
6
7
8
9
horn
door crash
−50
1000
dB
time/s
•
Current approaches
- speech recognition (Informedia etc.)
- whole-sample statistics (Muscle Fish)
•
What are the ‘objects’ in a soundtrack?
- i.e. the analog of words in text IR
- subjective definition → need auditory model
•
Problems
- parts vs. wholes
- general vs. specific
- how to be ‘data-driven’
CASA for TICSP - Dan Ellis
1999sep01 - 35
Element-based audio indexing
•
Segment-level features
Segment
feature
analysis
Sound segment
database
Feature vectors
Seach/
comparison
Results
Segment
feature
analysis
Query example
•
Using ‘generic sound elements’
Segment
feature
analysis
Continuous audio
archive
Element representations
Seach/
comparison
Results
Segment
feature
analysis
Query example
- search for subset (but: masked features?)
- how to generalize?
- how to use segment-style features?
CASA for TICSP - Dan Ellis
1999sep01 - 36
Object-based audio indexing
•
Organizing elements into objects reveals
higher-order properties
Segment
feature
analysis
Continuous audio
archive
Object
formation
Seach/
comparison
Element representations
Segment
feature
analysis
Query example
Object
formation
(?)
Objects + properties
•
How to form objects?
- heuristics (onset, harmonicity, continuity)
- machine learning:
associative recall, clustering, ‘data mining’
•
Which higher-order properties?
- current wisdom (brightness, roughness...)
- psychoacoustics
- (semi) data-driven hierarchies
CASA for TICSP - Dan Ellis
1999sep01 - 37
Results
Open issues in automatic indexing
•
How to do CASA for element descriptions?
- PDCASA: ‘generic’ primitives
+ constraining hierarchy
- (semi?) automatic learning of object structure
•
Classification
- connecting subjective & objective properties
→ finding subjective invariants, prominence
- representation of sound-object ‘classes’
- matching incompletely-described objects
•
Queries
- .. by example (which part?)
- .. by symbolic descriptions of classes?
•
Related applications
- ‘structured audio encoder’
- semantic hearing aid / robot listener
CASA for TICSP - Dan Ellis
1999sep01 - 38
Outline
1
Auditory Scene Analysis (ASA)
2
Computational ASA (CASA)
3
Context, expectation & predictions
4
Applications: speech recognition, indexing
5
Conclusions and open issues
CASA for TICSP - Dan Ellis
1999sep01 - 39
CASA: Open issues
5
•
We’re still looking for the right perspective
- bottom up vs. top down
- physiology, psychology, levels of description
•
What is the goal?
- simulating listeners on contrived tasks?
- solving practical engineering problems?
- laying the conceptual groundwork
•
How to evaluate CASA work?
- evaluation is critical for a healthy field
- .. but people have to agree on a task
- subjectively defined → listening tests
•
Looming on the horizon...
- learning
- attention
CASA for TICSP - Dan Ellis
1999sep01 - 40
Conclusions
•
Auditory organization is required in real
environments
•
We don’t know how listeners do it!
- plenty of modeling interest
•
Prediction-reconciliation can account for
‘illusions’
- use ‘knowledge’ when signal is inadequate
- important in a wider range of circumstances?
•
Speech & speech recognizers
- urgent application for CASA
- good source of signal knowledge?
•
Automatic indexing implies ‘synthetic listener’
- need to solve a lot of modeling issues
- the next big thing?
CASA for TICSP - Dan Ellis
1999sep01 - 41
Download