Computational Models of Auditory Organization

Dan Ellis
Electrical Engineering, Columbia University
http://www.ee.columbia.edu/~dpwe/
Outline
1. Sound organization
2. Human Auditory Scene Analysis (ASA)
3. Computational ASA (CASA)
4. CASA issues & applications
5. Summary
CASA @ ICTP - Dan Ellis - 2001-08-09 - LabROSA
Outline
1. Sound organization
   - the information in sound
   - Marr's levels of explanation
2. Human Auditory Scene Analysis (ASA)
3. Computational ASA (CASA)
4. CASA issues & applications
5. Summary
Sound organization

[Spectrogram, 0-12 s, 0-4000 Hz, with labeled events: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]

• Central operation:
  - continuous sound mixture → distinct objects & events
• Perceptual impression is very strong
  - but hard to 'see' in the signal
Bregman's lake

"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)

• Received waveform is a mixture
  - two sensors, N signals ...
• Disentangling mixtures is the primary goal
  - a perfect solution is not possible
  - need knowledge-based constraints
The information in sound

[Spectrograms "Steps 1" and "Steps 2", 0-4 s, 0-4000 Hz]

• A sense of hearing is evolutionarily useful
  - gives organisms 'relevant' information
• Auditory perception is ecologically grounded
  - scene analysis is preconscious (→ illusions)
  - special-purpose processing reflects 'natural scene' properties
  - subjective, not canonical (ambiguity)
Sound mixtures

• Sound 'scene' is almost always a mixture
  - always stuff going on
  - sound is 'transparent'

[Spectrograms: Speech + Noise = Speech+Noise, 0-1 s, 0-4000 Hz]

• Need information related to our 'world model'
  - i.e. separate objects
  - a wolf howling in a blizzard is the same as a wolf howling in a rainstorm
  - whole-signal statistics won't do this
• 'Separateness' is similar to independence
  - objects/sounds that change in isolation
  - but: depends on the situation, e.g. passing car vs. mechanic's diagnosis
Vision and hearing

• Hearing and seeing are complementary
  - hearing is omnidirectional
  - hearing works in the dark
• They reveal different things about the world
  - vision is good for examining static situations
  - physical motion almost always makes sound
Thinking about information processing:
Marr's levels of explanation

• Three distinct aspects to information processing:

  Computational theory — 'what' and 'why'; the overall goal (here: sound source organization)
  Algorithm — 'how'; an approach to meeting the goal (here: auditory grouping)
  Implementation — practical realization of the process (here: feature calculation & binding)

• Why bother?
  - helps organize interpretation
  - it's OK to consider the levels separately, one at a time
An example: Neural inhibition

  Computational theory — frequency-domain processing X(f)
  Algorithm — discrete-time filtering (subtraction)
  Implementation — neurons with GABAergic inhibition
Outline
1. Sound organization
2. Human Auditory Scene Analysis (ASA)
   - experimental psychoacoustics
   - grouping and cues
   - perception of complex scenes
3. Computational ASA (CASA)
4. CASA issues & applications
5. Summary
Human Sound Organization

• "Auditory Scene Analysis" [Bregman 1990]
  - break the mixture into small elements (in time-frequency)
  - elements are grouped into sources using cues
  - sources have aggregate attributes
• Grouping 'rules' (Darwin, Carlyon, ...):
  - cues: common onset/offset/modulation, harmonicity, spatial location, ... (after Darwin 1996)
Cues to simultaneous grouping

• Common attributes and common 'fate', e.g. common onset and periodicity:

  Level                  Common onset                        Periodicity
  Computational theory   Acoustic consequences tend          (Nonlinear) cyclic
                         to be synchronized                  processes are common
  Algorithm              Group elements that start           Place patterns?
                         in a time range                     Autocorrelation?
  Implementation         Onset detector cells;               Delay-and-multiply?
                         synchronized oscillators?           Modulation spectrum?

• Plus spatial location (ITD, ILD, spectral cues)...
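The 'group elements that start in a time range' algorithm row above can be sketched in a few lines. The window size, data layout, and function name here are invented for illustration, not taken from any of the cited models:

```python
import numpy as np

def group_by_onset(onsets, window=0.03):
    """Group elements whose onset times fall within `window` seconds of a
    group's first onset (a hypothetical sketch of the common-onset cue)."""
    order = np.argsort(onsets)
    groups, current, t0 = [], [], None
    for i in order:
        if t0 is None or onsets[i] - t0 <= window:
            if t0 is None:
                t0 = onsets[i]
            current.append(int(i))
        else:
            groups.append(current)
            current, t0 = [int(i)], onsets[i]
    if current:
        groups.append(current)
    return groups

# Elements at 0.10/0.11/0.12 s share an onset; 0.50 s starts a new group.
print(group_by_onset([0.10, 0.50, 0.11, 0.12]))  # → [[0, 2, 3], [1]]
```

Real models would use a graded affinity rather than a hard window, but the hard threshold keeps the sketch short.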
Complications for grouping 1: Cues in conflict

• Mistuned harmonic (Moore, Darwin, ...):
  [Schematic: harmonic complex with one component mistuned by ∆f and onset-shifted by ∆t]
  - determine how ∆t and ∆f affect
    • segregation of the harmonic
    • pitch of the complex
• Gradual, varied results:
  [Plot: pitch shift as a function of mistuning, on a ~3% mistuning scale]
  http://www.dcs.shef.ac.uk/~martin/MAD/docs/mad.htm
Complications for grouping 2: The effect of time

• Added harmonics:
  [Schematic: harmonics added one by one over time]
  - the onset cue initially segregates; periodicity eventually fuses
• The effect of time
  - some cues take time to become apparent
  - the onset cue becomes increasingly distant...
• What is the impetus for fission?
  - e.g. double vowels
  - depends on what you expect ... ?
Sequential grouping: Streaming

• Successive tone events form separate streams
  [Schematic: alternating tones around 1 kHz; tone repetition time (TRT) 60-150 ms, ∆f up to ±2 octaves]
• Order, rhythm etc. are judged within, not between, streams

  Computational theory — consistency of properties for successive source events
  Algorithm — 'expectation window' for known streams (widens with time)
  Implementation — competing time-frequency affinity weights...
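A toy version of this sequential-grouping idea (my own simplification, not the affinity-weight implementation above): each tone joins the existing stream whose last event is within a frequency-jump threshold, otherwise it founds a new stream. The threshold is illustrative and tempo (TRT) is ignored:

```python
import numpy as np

def assign_streams(freqs_hz, max_jump_octaves=0.5):
    """Label each successive tone event with a stream index, grouping by
    frequency proximity to each stream's most recent event."""
    last = []      # last frequency seen in each stream
    labels = []
    for f in freqs_hz:
        dists = [abs(np.log2(f / g)) for g in last]
        if dists and min(dists) <= max_jump_octaves:
            k = int(np.argmin(dists))
        else:
            k = len(last)          # found a new stream
            last.append(f)
        last[k] = f
        labels.append(k)
    return labels

# Alternating 1 kHz / 2.5 kHz tones split into two streams:
print(assign_streams([1000, 2500, 1000, 2500]))  # → [0, 1, 0, 1]
```

With a smaller ∆f (e.g. 1000 vs. 1100 Hz) every event would land in stream 0, mirroring the fusion end of the streaming continuum.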
The effect of context

• Context can create an 'expectation': i.e. a bias towards a particular interpretation
• e.g. Bregman's "old-plus-new" principle: a change in a signal will be interpreted as an added source whenever possible
  [Spectrogram schematic, 0-1.2 s, 0-2 kHz: the same energy divided differently depending on what preceded it]
Restoration & illusions

• 'Illusions' illuminate the algorithm
  - what model would 'misbehave' this way?
• E.g. the 'continuity illusion': making a 'best guess' for masked information
  [Spectrogram "ptshort", 0-1.4 s, 0-4000 Hz]
  - a tone alternates with noise bursts
  - the noise is strong enough to mask the tone
  - a continuous tone is distinctly perceived for gaps of ~100s of ms
→ Inference acts at a low, preconscious level
Speech restoration

• Speech provides a very strong basis for inference (coarticulation, grammar, semantics):
• Phonemic restoration
  [Spectrogram "nsoffee.aif", 1.2-1.7 s, 0-3500 Hz]
• Sinewave speech (duplex?)
  [Spectrogram "S1-env.pf", 0-1.8 s, frequency in Bark]
Ground truth in complex scenes

• What do people hear in sound mixtures?
  - do interpretations match?
→ Listening tests to collect 'perceived events'
Ground-truth results

[City-scene spectrogram, 0-9 s, 0-4000 Hz, with perceived events marked. Listener agreement (out of 10 subjects) and example labels:]

- Horn1 (10/10): "horn 2", "car horn", "double horn", "honk, honk", ...
- Crash (10/10): "gunshot", "large object crash", "slam", "door slam?", "trash can", "crash (not car)"
- Horn2 (5/10): "horn 5", "car horns", "horn during crash", "doppler horn"
- Truck (7/10): "truck engine", "truck accelerating", "rev up/passing", "closeup car", "wheels on road"
- Horn3 (5/10): "horn 3", "2nd horn", "car horn"
- Squeal (6/10): "airbrake", "squeal", "squeek", "brake squeaks"
- Horn4 (8/10): "horn 4", "horn during squeek", "honk 2"
- Horn5 (10/10)
Outline
1. Sound organization
2. Human Auditory Scene Analysis (ASA)
3. Computational ASA (CASA)
   - bottom-up models
   - top-down predictions
   - other approaches
4. CASA issues & applications
5. Summary
Computational ASA

[Diagram: sound mixture → CASA → Object 1 / Object 2 / Object 3 ... descriptions]

• Goal: automatic sound organization; systems to 'pick out' sounds in a mixture
  - ... like people do
• E.g. a voice against a noisy background
  - to improve speech recognition
• Approach:
  - psychoacoustics describes grouping 'rules'
  - ... just implement them?
CASA front-end processing

• Correlogram: loosely based on known/possible physiology

[Diagram: sound → cochlea filterbank → frequency channels → envelope follower → delay line → short-time autocorrelation → correlogram slice (freq × lag, over time)]

  - linear filterbank cochlear approximation
  - static nonlinearity
  - the zero-delay slice is like a spectrogram
  - periodicity from delay-and-multiply detectors
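The core of this front end, a short-time autocorrelation per frequency channel, can be sketched in numpy. The cochlear filterbank, rectification, and envelope follower that precede it are omitted for brevity, so this is only the final stage, fed with an already-filtered signal:

```python
import numpy as np

def correlogram_slice(channels, max_lag):
    """channels: (n_channels, n_samples) array of filtered, rectified
    signals. Returns a (n_channels, max_lag + 1) autocorrelation slice."""
    n_ch, n = channels.shape
    out = np.zeros((n_ch, max_lag + 1))
    for lag in range(max_lag + 1):
        # delay-and-multiply: correlate the channel with itself at `lag`
        out[:, lag] = np.sum(channels[:, lag:] * channels[:, :n - lag], axis=1)
    return out

# A 100 Hz pulse train sampled at 8 kHz should peak at lag 80 samples.
sr, f0 = 8000, 100
x = np.zeros(1600)
x[::sr // f0] = 1.0                       # impulse every 80 samples
ac = correlogram_slice(x[np.newaxis, :], max_lag=200)
print(int(np.argmax(ac[0, 1:])) + 1)      # → 80
```

The lag of the peak (excluding lag 0) recovers the period, which is exactly how periodicity cues are read off a correlogram slice.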
The Representational Approach (Brown & Cooke 1994)

• Implement psychoacoustic theory
  [Diagram: input mixture → front end (maps: freq, onset, time, period, freq-mod) → object formation → discrete objects → grouping rules → source groups]
  - 'bottom-up' processing
  - uses common onset & periodicity cues
• Able to extract voiced speech:
  [Spectrograms "brn1h.aif" and resynthesis "brn1h.fi.aif", 0-1 s, 100-3000 Hz]
Problems with 'bottom-up' CASA

[Schematic: time-frequency elements]

• Circumscribing time-frequency elements
  - need to have 'regions', but they are hard to find
• Periodicity is the primary cue
  - how to handle aperiodic energy?
• Resynthesis via masked filtering
  - cannot separate within a single t-f element
• Bottom-up leaves no ambiguity or context
  - how to model illusions?
Adding top-down cues

Perception is not direct, but a search for plausible hypotheses

• Data-driven (bottom-up):
  input mixture → front end → signal features → object formation → discrete objects → grouping rules → source groups

  vs. prediction-driven (top-down; PDCASA):
  input mixture → front end → signal features → compare & reconcile → prediction errors → hypothesis management → hypotheses (noise components, periodic components) → predict & combine → predicted features (fed back to compare & reconcile)

• Motivations
  - detect non-tonal events (noise & click elements)
  - support 'restoration illusions'...
  → hooks for high-level knowledge
  + 'complete explanation', multiple hypotheses, ...
Generic sound elements for PDCASA

• The goal is a representational space that
  - covers real-world perceptual sounds
  - has a minimal parameterization (sparseness)
  - keeps separate attributes in separate parameters

• NoiseObject: spectral profile H[k] times magnitude contour A[n]:
  XN[n,k] = H[k]·A[n]
• ClickObject: spectral profile H[k] with per-channel decay times T[k] from onset n0:
  XC[n,k] = H[k]·exp(−(n − n0) / T[k])
• WeftObject: time-varying spectral envelope HW[n,k] times a periodicity track P[n,k] from correlogram analysis:
  XW[n,k] = HW[n,k]·P[n,k]

• Object hierarchies are built on top...
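The three element equations can be exercised directly as array expressions. All parameter values below are invented toy numbers, not figures from the talk, and the decaying exponent for the click element follows the reconstruction above:

```python
import numpy as np

# Toy realizations of the three PDCASA element models on a small
# time-frequency grid of n frames x k channels.
n_frames, n_chan = 50, 4
n_idx = np.arange(n_frames)[:, None]              # time index n, column vector
H = np.array([1.0, 0.8, 0.5, 0.2])                # spectral profile H[k]
A = np.exp(-n_idx / 20.0)                         # magnitude contour A[n]
T = np.array([5.0, 8.0, 12.0, 20.0])              # decay times T[k]
P = 0.5 * (1 + np.cos(2 * np.pi * n_idx / 10.0))  # periodicity track P[n,k]

X_noise = H * A                      # XN[n,k] = H[k]·A[n]
X_click = H * np.exp(-n_idx / T)     # XC[n,k] = H[k]·exp(-(n - n0)/T[k]), n0 = 0
X_weft = (H * np.ones((n_frames, 1))) * P   # XW[n,k] = HW[n,k]·P[n,k], flat HW

print(X_noise.shape, X_click.shape, X_weft.shape)  # → (50, 4) (50, 4) (50, 4)
```

Note how each attribute lives in its own parameter (H for spectrum, A or T for time course, P for periodicity), which is the sparseness property the slide asks for.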
PDCASA for old-plus-new

• Incremental analysis
  [Schematic of the input signal at times t1 < t2 < t3:]
  - Time t1: initial element created
  - Time t2: additional element required
  - Time t3: second element finished
PDCASA for the continuity illusion

• Subjects hear the tone as continuous ... if the noise is a plausible masker
  [Spectrogram "ptshort", 0-1.4 s, 0-4000 Hz]
• Data-driven analysis gives just the visible portions
• Prediction-driven analysis can infer the masking
PDCASA and complex scenes

[City-scene spectrogram, 0-9 s, and its PDCASA explanation: Wefts 1-4, Weft 5, Wefts 6-7, Weft 8, and Wefts 9-12 account for Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10); Noise2 + Click1 account for the Crash (10/10); Noise1 accounts for the Squeal (6/10) and Truck (7/10); level scale −40 to −70 dB]
Marrian analysis of PDCASA

• Marr is invoked to separate high-level function from low-level details:

  Computational theory — objects persist predictably; observations interact irreversibly
  Algorithm — build hypotheses from generic elements; update by prediction-reconciliation
  Implementation — ???

"It is not enough to be able to describe the response of single cells, nor predict the results of psychophysical experiments. Nor is it enough even to write computer programs that perform approximately in the desired way: One has to do all these things at once, and also be very aware of the computational theory..."
Other approaches: ICA (Bell & Sejnowski etc.)

• General idea: drive a parameterized separation algorithm to maximize the independence of its outputs
  [Diagram: mixtures (m1, m2) = x → unmixing matrix [a11 a12; a21 a22] → outputs (s1, s2); weights adapted by ∆a ∝ −∂MutInfo/∂a]
• Attractions:
  - mathematically rigorous, minimal assumptions
• Problems:
  - limitations of the separation algorithm (N × N)
  - essentially bottom-up
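A minimal 2×2 ICA run illustrates the idea. For brevity this uses a FastICA-style fixed-point update rather than Bell & Sejnowski's infomax gradient, and the sources, mixing matrix, and seed are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 4000)
s = np.vstack([np.sign(np.sin(2 * np.pi * 13 * t)),  # square wave
               rng.uniform(-1, 1, t.size)])          # uniform noise
A = np.array([[1.0, 0.6], [0.4, 1.0]])               # mixing matrix
x = A @ s                                            # observed mixtures

# Whiten the mixtures (zero mean, identity covariance)
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / x.shape[1])
z = (E / np.sqrt(d)) @ E.T @ x

# One-unit fixed-point iterations with deflation against found rows
W = np.eye(2)
for i in range(2):
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        wx = w @ z
        w_new = (z * np.tanh(wx)).mean(axis=1) - (1 - np.tanh(wx) ** 2).mean() * w
        for j in range(i):
            w_new -= (w_new @ W[j]) * W[j]
        w_new /= np.linalg.norm(w_new)
        done = abs(abs(w_new @ w) - 1) < 1e-9
        w = w_new
        if done:
            break
    W[i] = w
y = W @ z                                            # recovered sources

# Each output should match one source, up to sign and order
corr = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
print(np.round(corr.max(axis=1), 2))
```

This also makes the slide's criticisms concrete: the algorithm assumes as many sensors as sources (the N × N limitation) and has no hook for top-down expectations.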
Other approaches: Neural oscillators (von der Malsburg; Wang & Brown)

• Locally-excited, globally-inhibited networks form separate phases of synchrony
• Advantages:
  - avoid implausible AI methods (search, lists)
  - oscillators substitute for iteration
• But does this only concern the implementation level?
Outline
1. Sound organization
2. Human Auditory Scene Analysis (ASA)
3. Computational ASA (CASA)
4. CASA issues & applications
   - learning
   - missing data
   - audio information retrieval
   - the machine listener
5. Summary
Learning & acquisition

• The speech recognition lesson: how to exploit large databases?
• 'Maximum likelihood' sound organization (e.g. Roweis):
  - learn model→sound distributions P(X|M) by analyzing isolated-sound databases
  - combine models with physics: P(X|{Mi})
  - learn patterns of model combinations P({Mi})
  - search for the most likely combination of models to explain the observed sound mixture:
    max over {Mi} of P({Mi}|X) ∝ P(X|{Mi})·P({Mi})
• Short-term learning
  - hearing a particular source can alter short-term interpretations of mixtures
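The Bayesian search above can be toy-coded by enumerating model subsets and scoring each by P(X|{Mi})·P({Mi}). The models, priors, noise level, and additive-combination 'physics' below are all invented for illustration:

```python
import itertools
import numpy as np

# Each hypothetical model predicts a 3-bin spectrum; combining models
# adds their spectra (the stand-in for "physics").
models = {"tone":  np.array([0.0, 5.0, 0.0]),
          "noise": np.array([1.0, 1.0, 1.0]),
          "click": np.array([4.0, 0.0, 0.0])}
prior = {"tone": 0.5, "noise": 0.4, "click": 0.1}
sigma = 0.5                        # observation noise std

X = np.array([1.0, 6.0, 1.0])      # observed mixture (tone + noise here)

def log_posterior(subset):
    """log P(X|{Mi}) + log P({Mi}) under a Gaussian observation model
    and independent per-model priors."""
    pred = sum((models[m] for m in subset), np.zeros(3))
    log_lik = -np.sum((X - pred) ** 2) / (2 * sigma ** 2)
    log_prior = sum(np.log(prior[m]) for m in subset)
    return log_lik + log_prior

subsets = [c for r in range(1, 4) for c in itertools.combinations(models, r)]
best = max(subsets, key=log_posterior)
print(best)  # → ('tone', 'noise')
```

Exhaustive enumeration only works for a handful of models; the real challenge the slide points at is searching this combinatorial space efficiently.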
Missing data recognition (Cooke, Green, Barker... @ Sheffield)

• Energy overlaps in time-frequency hide features
  - some observations are effectively missing
• Use missing-feature theory: integrate over the missing data dimensions xm

  p(x|q) = ∫ p(xg | xm, q) p(xm | q) dxm

• Effective in speech recognition
  - the trick is finding the good/bad data mask
  ["1754" + noise → mask based on stationary noise estimate → missing-data recognition → "1754"]
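With a diagonal-covariance Gaussian class model, the marginalization above reduces to simply dropping the missing dimensions from the likelihood product, which a few lines can demonstrate. The means, variances, and mask below are illustrative values, not from any recognizer:

```python
import numpy as np

def log_lik_present(x, mask, mean, var):
    """Log-likelihood of a diagonal Gaussian using only the reliable
    ('good') dimensions; mask[i] is True where the observation is clean."""
    g = mask
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var[g])
                                + (x[g] - mean[g]) ** 2 / var[g])))

mean = np.array([1.0, 2.0, 3.0])        # class model
var = np.array([1.0, 1.0, 1.0])
x = np.array([1.0, 9.9, 3.0])           # dim 1 corrupted by noise
mask = np.array([True, False, True])    # the good/bad data mask

full = log_lik_present(x, np.ones(3, dtype=bool), mean, var)
partial = log_lik_present(x, mask, mean, var)
print(partial > full)  # → True: ignoring the corrupted dim raises the score
```

As the slide says, the hard part is not this marginalization but estimating the mask itself from the noisy signal.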
Multi-source decoding (Jon Barker @ Sheffield)

• Search over sound-fragment interpretations:
  ["1754" + noise → mask based on stationary noise estimate → mask split into subbands → 'grouping' applied within bands (spectro-temporal proximity, common onset/offset) → multisource decoder → "1754"]
• CASA for masks/fragments
  - larger fragments → quicker search
• Use with nonspeech models?
Audio Information Retrieval

• Searching in a database of audio
  - speech: use ASR
  - text annotations: search them
  - a sound-effects library?
• E.g. the Muscle Fish "SoundFisher" browser
  - define multiple 'perceptual' feature dimensions
  - search by proximity in a weighted feature space
  [Diagram: sound segment database → segment feature analysis → feature vectors; query example → segment feature analysis → search/comparison → results]
  - features are 'global' for each soundfile; no attempt to separate mixtures
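The 'proximity in weighted feature space' search might look like the following sketch. The feature dimensions, weights, and database entries are made up for illustration and are not SoundFisher's actual features:

```python
import numpy as np

# Tiny feature database: [pitchiness, roughness, brightness] per sound.
db = {"doorbell": np.array([0.9, 0.2, 0.7]),
      "rain":     np.array([0.1, 0.8, 0.4]),
      "car horn": np.array([0.8, 0.4, 0.6])}
weights = np.array([2.0, 1.0, 1.0])   # user-tuned dimension weights

def nearest(query):
    """Return the database entry closest to `query` under a weighted
    Euclidean distance."""
    dist = {name: float(np.sqrt(np.sum(weights * (query - v) ** 2)))
            for name, v in db.items()}
    return min(dist, key=dist.get)

print(nearest(np.array([0.82, 0.38, 0.61])))  # → car horn
```

Since every vector summarizes a whole file, a mixture of rain and a horn gets one blurred vector, which is exactly the weakness the next slide addresses.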
CASA for audio retrieval

• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
  [Diagram: continuous audio archive → generic element analysis → element representations → object formation → objects + properties; query example → generic element analysis (→ object formation); symbolic query, e.g. "rushing water" → word-to-class mapping → properties alone; all feeding search/comparison → results]
  - features are calculated over grouped subsets
Alarm sound detection

• Alarm sounds have a particular structure
  - people 'know them when they hear them'
• Isolate alarms in sound mixtures
  [Three spectrograms, 1-2.5 s, 0-5000 Hz: an alarm in increasing background noise]
  - representation of energy in time-frequency
  - formation of atomic elements
  - grouping by common properties (onset etc.)
  - classification by attributes...
• Key: recognize despite the background
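One crude reading of the pipeline above (my own sketch, not the talk's actual system): flag spectrogram bins whose energy stays above threshold for many consecutive frames, exploiting the fact that alarms are long, stable, and narrowband. The thresholds and synthetic scene are invented:

```python
import numpy as np

def persistent_bins(spec, threshold=0.5, min_frames=20):
    """spec: (n_bins, n_frames) magnitude array. Return the bins whose
    longest above-threshold run reaches min_frames consecutive frames."""
    active = spec > threshold
    hits = []
    for b in range(spec.shape[0]):
        run = best = 0
        for v in active[b]:
            run = run + 1 if v else 0
            best = max(best, run)
        if best >= min_frames:
            hits.append(b)
    return hits

# Synthetic scene: bin 7 holds a steady tone; other bins get sparse bursts.
rng = np.random.default_rng(1)
spec = (rng.uniform(0, 1, (16, 100)) > 0.9).astype(float)  # sparse clutter
spec[7, 10:60] = 1.0                                       # 50-frame tone
print(persistent_bins(spec))  # → [7]
```

A real detector would work on grouped elements rather than raw bins, so that a siren sweeping across frequencies still counts as one persistent object.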
Future prosthetic listening devices

• CASA to replace lost hearing ability
  - sound mixtures are difficult for the hearing-impaired
• Signal enhancement
  - resynthesize a single source without the background
  - (needs very good resynthesis)
• Signal understanding
  - monitor for particular sounds (doorbell, knocks) & translate them into an alternative mode (vibrating alarm)
  - real-time textual descriptions, i.e. "automatic subtitles for real life":
    [thunder] → S: I THINK THE WEATHER'S CHANGING
The 'machine listener'

• Goal: an auditory system for machines
  - use the same environmental information as people do
• Aspects:
  - recognize spoken commands (but not others)
  - track 'acoustic channel' quality (for responses)
  - categorize the environment (conversation, crowd, ...)
• Scenarios:
  - personal listener → summary of your day
  - autonomous robots: need awareness
Outline
1. Information from sound
2. Human Auditory Scene Analysis (ASA)
3. Computational ASA (CASA)
4. CASA issues & applications
5. Summary
Summary

• Sound contains lots of information ... but it's not easy to extract
• We know a little about how humans hear ... at least for simplified sounds
• We have some ways to copy it ... which we hope to improve
• CASA would have many useful applications ... machines to listen and remember for us
Further reading

[BarkCE00] J. Barker, M.P. Cooke & D. Ellis (2000). "Decoding speech in the presence of other sound sources," Proc. ICSLP-2000, Beijing. ftp://ftp.icsi.berkeley.edu/pub/speech/papers/icslp00-msd.pdf
[Breg90] A.S. Bregman (1990). Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press.
[BrowC94] G.J. Brown & M.P. Cooke (1994). "Computational auditory scene analysis," Computer Speech and Language 8, 297-336.
[Chev00] A. de Cheveigné (2000). "The Auditory System as a Separation Machine," Proc. Intl. Symposium on Hearing. http://www.ircam.fr/pcm/cheveign/sh/ps/ATReats98.pdf
[CookE01] M. Cooke & D. Ellis (2001). "The auditory organization of speech and other sources in listeners and computational models," Speech Communication (accepted for publication). http://www.ee.columbia.edu/~dpwe/pubs/tcfkas.pdf
[DarC95] C.J. Darwin & R.P. Carlyon (1995). "Auditory Grouping," in The Handbook of Perception and Cognition, Vol. 6: Hearing (ed. B.C.J. Moore), Academic Press, 387-424.
[Ellis99] D.P.W. Ellis (1999). "Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis, and its application to speech/nonspeech mixtures," Speech Communication 27. http://www.icsi.berkeley.edu/~dpwe/research/spcomcasa98/spcomcasa98.pdf