Sound, Mixtures, and Learning
Dan Ellis
<dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/

Snowbird, 2002-04-04
Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
3 Speech models and knowledge
4 Sound mixture recognition
5 Learning opportunities
1 Human sound organization
[Figure: spectrogram of a continuous sound mixture (0-12 s, 0-4000 Hz, level in dB) analyzed into distinct events: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Analyzing and describing complex sounds:
- continuous sound mixture → distinct events
• Hearing is ecologically grounded
- reflects ‘natural scene’ properties
- subjective, not canonical (ambiguity)
- mixture analysis as the primary goal
Sound mixtures
• The sound ‘scene’ is almost always a mixture
- there is always stuff going on
- sound is ‘transparent’, but covers a big energy range
[Figure: spectrograms (0-4 kHz, 0-1 s) of Speech and Noise, and their sum Speech + Noise]
• Need information related to our ‘world model’
- i.e. separate objects
- a wolf howling in a blizzard is the same as a wolf howling in a rainstorm
- whole-signal statistics won’t do this
• ‘Separateness’ is similar to independence
- objects/sounds that change in isolation
- but: depends on the situation, e.g. passing car vs. mechanic’s diagnosis
Auditory scene analysis
(Bregman 1990)
• How do people analyze sound mixtures?
- break the mixture into small elements (in time-frequency)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram: frequency analysis feeds onset, harmonicity, and position maps into a grouping mechanism that yields source properties (after Darwin, 1996)]
Cues to simultaneous grouping
• Elements + attributes
[Figure: time-frequency display of sound elements, 0-8 kHz, 0-9 s]
• Common onset (sketched in code below)
- simultaneous energy suggests a common source
• Periodicity (sketched in code below)
- energy in different bands with the same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
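To make the onset and periodicity cues concrete, here is a minimal numpy sketch (my illustration, not code from the talk; the uniform FFT-bin "bands", frame sizes, and the AM test signal are all invented stand-ins):

```python
import numpy as np

def grouping_cues(x, sr, n_bands=16, frame=512, hop=256):
    """Crude per-band onset and periodicity cues.

    Uniform groups of FFT bins stand in for a cochlear filterbank,
    purely to keep the sketch short.
    """
    n_frames = 1 + (len(x) - frame) // hop
    win = np.hanning(frame)
    spec = np.array([np.abs(np.fft.rfft(win * x[i*hop:i*hop+frame]))
                     for i in range(n_frames)])            # (frames, bins)
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    bands = np.array([spec[:, a:b].sum(axis=1)
                      for a, b in zip(edges[:-1], edges[1:])]).T  # (frames, bands)

    # Onset cue: half-wave-rectified increase of log energy in each band;
    # bands that jump up together suggest a common source.
    onset = np.maximum(0.0, np.diff(np.log(bands + 1e-9), axis=0))

    # Periodicity cue: dominant lag of each band envelope's autocorrelation;
    # bands sharing the same cycle group together.
    def best_lag(env):
        env = env - env.mean()
        ac = np.correlate(env, env, mode='full')[len(env) - 1:]
        return 1 + int(np.argmax(ac[1:]))
    lags = np.array([best_lag(bands[:, b]) for b in range(n_bands)])
    return onset, lags

# Demo: a 440 Hz tone with 4 Hz amplitude modulation; bands carrying the
# tone share an onset and a common envelope period (~8 frames at this hop).
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
onset, lags = grouping_cues(x, sr)
print(onset.shape, lags)
```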
The effect of context
Context can create an ‘expectation’,
i.e. a bias towards a particular interpretation
• e.g. Bregman’s “old-plus-new” principle:
a change in a signal will be interpreted as an added source whenever possible
[Figure: tone + noise sequences, 0-2 kHz, 0.0-1.2 s]
- a different division of the same energy, depending on what preceded it
Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
- sound source separation
- bottom-up models
- top-down constraints
3 Speech models and knowledge
4 Sound mixture recognition
5 Learning opportunities
2 Computational Auditory Scene Analysis
(CASA)
[Diagram: CASA takes a sound mixture and produces Object 1, Object 2, Object 3, ... descriptions]
• Goal: automatic sound organization;
systems to ‘pick out’ sounds in a mixture
- ... like people do
• E.g. a voice against a noisy background
- to improve speech recognition
• Approach:
- psychoacoustics describes grouping ‘rules’
- ... so just implement them?
CASA front-end processing
• Correlogram: loosely based on known/possible physiology
[Diagram: sound → cochlea filterbank → frequency channels → envelope follower → short-time autocorrelation via a delay line → correlogram slice (freq × lag, over time)]
- linear filterbank as a cochlear approximation
- static nonlinearity
- the zero-delay slice is like a spectrogram
- periodicity from delay-and-multiply detectors (see the sketch below)
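A minimal correlogram along the lines of the diagram above; this is a sketch under stated assumptions, with a two-pole resonator standing in for a cochlear filter and arbitrary channel/frame parameters:

```python
import numpy as np
from scipy.signal import lfilter

def correlogram(x, sr, n_chan=32, fmin=100.0, fmax=3200.0,
                frame=400, max_lag=200):
    """Short-time autocorrelation of the rectified output of each channel."""
    cfs = fmin * (fmax / fmin) ** (np.arange(n_chan) / (n_chan - 1))
    starts = range(0, len(x) - frame, frame)
    cgram = np.zeros((len(starts), n_chan, max_lag))
    for c, cf in enumerate(cfs):
        # Two-pole resonator as a crude cochlear-filter substitute.
        r = np.exp(-2 * np.pi * (cf / 8.0) / sr)           # bandwidth ~ cf/8
        b, a = [1 - r], [1, -2 * r * np.cos(2 * np.pi * cf / sr), r * r]
        env = np.maximum(lfilter(b, a, x), 0.0)            # static nonlinearity
        for i, s0 in enumerate(starts):
            seg = env[s0:s0 + frame] - env[s0:s0 + frame].mean()
            ac = np.correlate(seg, seg, mode='full')[frame - 1:frame - 1 + max_lag]
            cgram[i, c] = ac           # the zero-lag slice is like a spectrogram
    return cgram, cfs

sr = 8000
t = np.arange(sr // 2) / sr
x = 0.5 * np.sign(np.sin(2 * np.pi * 125 * t))   # crude 125 Hz periodic source
cg, cfs = correlogram(x, sr)
summary = cg.mean(axis=(0, 1))                    # pool lags across channels
print("peak lag:", int(np.argmax(summary[20:])) + 20,
      "samples (125 Hz at 8 kHz -> 64, or a multiple)")
```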
The Representational Approach
(Brown & Cooke 1993)
• Implement psychoacoustic theory
[Diagram: input mixture → front end (maps: freq × time, onset, period, freq-modulation) → signal features → object formation → discrete objects → source groups, steered by grouping rules]
- ‘bottom-up’ processing
- uses common onset & periodicity cues
• Able to extract voiced speech:
[Figure: spectrograms (100-3000 Hz, 0.2-1.0 s) of the mixture brn1h.aif and the extracted voice brn1h.fi.aif]
Problems with ‘bottom-up’ CASA
• Circumscribing time-frequency elements
- need to have ‘regions’, but they are hard to find
• Periodicity is the primary cue
- how to handle aperiodic energy?
• Resynthesis via masked filtering
- cannot separate within a single t-f element
• Bottom-up leaves no room for ambiguity or context
- how to model illusions?
Restoration in sound perception
• Auditory ‘illusions’ = hearing what’s not there
• The continuity illusion
[Figure: the continuity illusion: a tone heard continuing through noise bursts (f/Hz, 0.0-1.4 s)]
• Sine-wave speech (SWS)
[Figure: SWS tracks, f/Bark, 0.0-1.8 s]
- duplex perception
Adding top-down constraints
Perception is not direct,
but a search for plausible hypotheses
• Data-driven (bottom-up)...
[Diagram: input mixture → front end → signal features → object formation → discrete objects → source groups, steered by grouping rules]
- objects irresistibly appear
vs. Prediction-driven (top-down)
[Diagram: input mixture → front end → signal features → compare & reconcile → hypothesis management → hypotheses (noise components, periodic components) → predict & combine → predicted features, with prediction errors fed back]
- match observations with the parameters of a world-model
- need world-model constraints...
Aside: Optimal techniques (ICA, ABF)
(Bell & Sejnowski etc.)
• General idea: drive a parameterized separation algorithm to maximize the independence of its outputs
[Diagram: mixture (m1, m2) = [[a11, a12], [a21, a22]] × sources (s1, s2); parameters adapted by −δMutInfo/δa]
• Attractions:
- mathematically rigorous, minimal assumptions
• Problems:
- limitations of the separation algorithm (N × N)
- essentially bottom-up (see the sketch below)
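To make the idea concrete, a tiny natural-gradient Infomax-style sketch for a 2 × 2 instantaneous mixture (a generic textbook update in the Bell & Sejnowski / Amari family, not any specific published system; the sources, mixing matrix, learning rate, and iteration count are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
s = rng.laplace(size=(2, n))                 # two independent sources
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])                   # unknown mixing matrix
m = A @ s                                    # observed 2-channel mixture

W = np.eye(2)                                # separation matrix to adapt
lr = 0.01
for _ in range(200):
    y = W @ m
    g = np.tanh(y)                           # score for super-Gaussian sources
    # Natural-gradient update that pushes outputs toward independence.
    W += lr * ((np.eye(2) - g @ y.T / n) @ W)

y = W @ m
# Each recovered output should correlate strongly with one true source
# (up to permutation and scale).
C = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
print(np.round(C, 2))
```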
Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
3 Speech models and knowledge
- automatic speech recognition
- subword states
- cepstral coefficients
4 Sound mixture recognition
5 Learning opportunities
3 Speech models & knowledge
• Standard speech recognition:
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters) → phone probabilities → HMM decoder (word models, e.g. s-ah-t; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application...]
• ‘State of the art’ word-error rates (WERs):
- 2% (dictation) to 30% (telephone conversations)
Speech units
• Speech is highly variable
- simple templates won’t do
- spectral variation (voice quality)
- time-warp problems
• Match short segments (states), allow repeats
- model with pseudo-stationary slices of ~10 ms
[Figure: spectrogram (0-4 kHz, 0-0.45 s) segmented into states: sil g w eh n sil]
• Speech models are distributions p(X | q)
Speech features: Cepstra
• Idea: decorrelate & summarize spectral slices:
X_m[l] = IDFT{ log S[mH, k] }
- easier to model:
[Figure: auditory spectrum vs. cepstral coefficients over ~150 frames; the covariance matrix of the cepstra is near-diagonal, and an example joint distribution of coefficients (10, 15) shows little correlation]
- C0 ‘normalizes out’ average log energy
• Decorrelated pdfs fit diagonal Gaussians
- the DCT is close to PCA for log spectra (see the sketch below)
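In code, the recipe is just a log magnitude spectrum followed by an inverse transform per frame; this sketch uses a DCT (consistent with the PCA remark above) and a plain FFT spectrum as a stand-in for the auditory spectrum, with invented frame parameters:

```python
import numpy as np
from scipy.fft import dct

def cepstra(x, frame=400, hop=160, n_cep=13):
    """Frame-wise cepstra: DCT of the log magnitude spectrum.

    A plain FFT spectrum stands in for the 'auditory spectrum';
    a mel-style filterbank would normally sit in between.
    """
    win = np.hamming(frame)
    n_frames = 1 + (len(x) - frame) // hop
    C = np.zeros((n_frames, n_cep))
    for m in range(n_frames):
        S = np.abs(np.fft.rfft(win * x[m*hop:m*hop + frame]))
        C[m] = dct(np.log(S + 1e-9), type=2, norm='ortho')[:n_cep]
    return C

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2*np.pi*200*t) + 0.3 * np.sin(2*np.pi*1400*t)
C = cepstra(x)
print(C.shape)     # (frames, 13); C[:, 0] tracks overall log energy (C0)
```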
Acoustic model training
• Goal: describe p(X | q) with e.g. GMMs (sketched below)
[Diagram: labeled training examples {xn, ωxn} → sort according to class → estimate the conditional pdf p(x | ω1) for class ω1]
• Training data labels from:
- manual phonetic annotation
- ‘best path’ from earlier classifier (Viterbi)
- EM: joint estimation of labels & pdfs
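A minimal version of this loop on synthetic labeled data, using scikit-learn's GaussianMixture (the two classes, their distributions, and the mixture sizes are invented for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Fake labeled 'feature vectors' standing in for annotated frames.
X = np.vstack([rng.normal(0, 1, (500, 2)),
               rng.normal(3, 1, (500, 2))])
y = np.repeat([0, 1], 500)

# Sort by class, then fit one conditional pdf p(x | q) per class.
models = {}
for q in np.unique(y):
    gmm = GaussianMixture(n_components=2, covariance_type='diag',
                          random_state=0)
    models[q] = gmm.fit(X[y == q])

x_new = np.array([[2.5, 3.1]])
print({q: float(m.score_samples(x_new)[0])   # log p(x | q) per class
       for q, m in models.items()})
```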
HMM decoding
• Feature vectors cannot be reliably classified into phonemes
[Figure: spectrogram (0-4 kHz, ~2.5 s) of “we do have just a few days left ...” with competing word and phone alignments]
• Use top-down constraints to get good results:
- allowable phonemes
- a dictionary of known words
- a grammar of possible sentences
• The decoder searches all possible state sequences (see the sketch below)
- at least notionally; pruning makes it possible
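A toy Viterbi sketch of that search, with an allowed-transition matrix playing the role of the phoneme/dictionary/grammar constraints; the three 'phones', transition probabilities, and fake frame likelihoods are invented:

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely state sequence given per-frame log p(x_t | q)."""
    T, Q = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, Q), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Three 'phones'; the grammar allows only self-loops and 0 -> 1 -> 2.
log_trans = np.log(np.array([[0.8, 0.2, 0.0],
                             [0.0, 0.8, 0.2],
                             [0.0, 0.0, 1.0]]) + 1e-12)
log_init = np.log(np.array([1.0, 1e-12, 1e-12]))
rng = np.random.default_rng(2)
log_obs = np.log(rng.dirichlet([1, 1, 1], size=10))   # fake frame likelihoods
print(viterbi(log_obs, log_trans, log_init))   # constraints force 0 <= ... <= 2
```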
Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
3 Speech models and knowledge
4 Sound mixture recognition
- feature invariance
- mixtures including speech
- general mixtures
5 Learning opportunities
4 Sound mixture recognition
• The biggest problem in speech recognition is background noise interference
• The feature-invariance approach:
- use features that reflect only the speech
- e.g. normalization, mean subtraction
- but: what about non-static noise?
• Or: more complex models of the signal
- HMM decomposition
- missing-data recognition
• Generalize to other, multiple sounds
Feature normalization
• Idea: use feature variations, not absolute level
• Hence: calculate the average level & subtract it:
X[k] = S[k] − mean{S[k]}   (mean taken over time)
• This factors out a fixed channel frequency response:
s[n] = h[n] * e[n]  ⇒  log S[k] = log H[k] + log E[k]
• Normalize variance to handle added noise? (see the sketch below)
[Figure: spectrograms and cepstra (0-4 kHz, 0-2.5 s) of utterance MGG95958, clean vs. noise at 5 dB SNR (N2_SNR5)]
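The normalization itself is one line per moment; a sketch over any (frames × coefficients) feature array, with random numbers standing in for real cepstra:

```python
import numpy as np

def cmvn(C, variance=False):
    """Cepstral mean (and optional variance) normalization over time.

    Subtracting the per-coefficient temporal mean cancels a fixed
    log-channel term log H[k]; dividing by the standard deviation is
    the variance-normalization option raised above.
    """
    C = C - C.mean(axis=0, keepdims=True)
    if variance:
        C = C / (C.std(axis=0, keepdims=True) + 1e-9)
    return C

C = np.random.default_rng(3).normal(2.0, 1.5, (100, 13))   # fake cepstra
Cn = cmvn(C, variance=True)
print(Cn.mean(axis=0).round(3)[:3], Cn.std(axis=0).round(3)[:3])
```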
HMM decomposition
(e.g. Varga & Moore 1991, Roweis 2000)
• The total signal model has independent state sequences for 2+ component sources
[Diagram: state lattices of model 1 and model 2 evolving jointly against the observations over time]
• New combined state space q′ = {q1, q2}
- new observation pdfs for each combination: p(X | q1, q2) (see the sketch below)
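A sketch of that combined space for two toy sources: the product transition matrix is the Kronecker product of the per-source matrices, and the combined observation means here use a max of log spectra (the 'log-max' approximation, one common assumption, not necessarily the choice in the systems cited above):

```python
import numpy as np

# Two toy sources, each a 2-state HMM over 2-band log-spectral templates.
trans1 = np.array([[0.9, 0.1], [0.1, 0.9]])
trans2 = np.array([[0.7, 0.3], [0.3, 0.7]])
means1 = np.array([[0.0, 4.0], [4.0, 0.0]])    # states x frequency bands
means2 = np.array([[2.0, 2.0], [5.0, 1.0]])

# Combined state q' = (q1, q2): transitions factor into a Kronecker product.
trans = np.kron(trans1, trans2)                # (4, 4); row index = q1*2 + q2

# Combined observation means per state pair via elementwise log-max.
combined = np.array([np.maximum(m1, m2)
                     for m1 in means1 for m2 in means2])   # (4, bands)
print(trans.shape, combined)
```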
Problems with HMM decomposition
• The combined state space O(q_k)^N is exponentially large...
• Normalization no longer holds!
- each source has a different gain → model at various SNRs?
- models typically don’t use overall energy C0
- each source has a different channel H[k]
• Modeling every possible sub-state combination is inefficient, inelegant and impractical
Missing data recognition
(Cooke, Green, Barker @ Sheffield)
• Energy overlaps in time-frequency hide features
- some observations are effectively missing
• Use missing-feature theory...
- integrate over the missing data x_m under model M:
p(x | M) = ∫ p(x_p | x_m, M) p(x_m | M) dx_m
• Effective in speech recognition:
"1754" + noise
90
AURORA 2000 - Test Set A
80
Missing-data
Recognition
WER
70
MD Discrete SNR
MD Soft SNR
HTK clean training
HTK multi-condition
60
50
40
"1754"
Mask based on
stationary noise estimate
30
20
10
0
-5
•
0
5
10
Problem: finding the missing data mask
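For a diagonal-Gaussian state model, the integral above reduces, in the simplest variant, to just dropping the missing dimensions (full 'bounded marginalization' would also use the masked energies as upper bounds); a sketch with invented means and an artificially corrupted observation:

```python
import numpy as np

def md_loglik(x, mask, mean, var):
    """log p(x_present | state) for a diagonal Gaussian.

    mask[i] = True marks dimension i as reliable (target dominates);
    for a diagonal Gaussian the missing dimensions integrate out to 1,
    so they can simply be skipped.
    """
    xp, mp, vp = x[mask], mean[mask], var[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * vp) + (xp - mp) ** 2 / vp)

mean = np.array([0.0, 2.0, 4.0, 1.0])
var = np.ones(4)
x = np.array([0.1, 9.0, 3.8, 1.2])            # dimension 1 hit by 'noise'
full = md_loglik(x, np.ones(4, dtype=bool), mean, var)
masked = md_loglik(x, np.array([1, 0, 1, 1], dtype=bool), mean, var)
print(full, masked)    # masking the corrupted dimension rescues the match
```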
Maximum-likelihood data mask
(Jon Barker @ Sheffield)
• Search over sound-fragment interpretations
[Diagram: "1754" + noise → mask based on a stationary noise estimate → mask split into subbands → ‘grouping’ applied within bands (spectro-temporal proximity, common onset/offset) → multisource decoder → "1754"]
• The decoder searches over data masks K:
p(M, K | x) ∝ p(x | K, M) p(K | M) p(M)
- how to estimate p(K)?
Multi-source decoding
• Search for more than one source
[Diagram: observations O(t) jointly explained by state sequences q1(t), q2(t) with data masks K1(t), K2(t)]
• Mutually-dependent data masks
• Use CASA processing to propose masks
- locally coherent regions
- p(K | q)
• Theoretical vs. practical limits
General sound mixtures
• Search for a generative explanation:
[Diagram: generation model: source models M1, M2 (with model dependence Θ) → source signals Y1, Y2 → channel parameters C1, C2 → received signals X1, X2 → observations O. Analysis structure: observations O → fragment formation → mask allocation {Ki} → model fitting {Mi} → likelihood evaluation p(X | Mi, Ki)]
Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
3 Speech models and knowledge
4 Sound mixture recognition
5 Opportunities for learning
- learnable aspects of modeling
- tractable decoding
- some examples
5 Opportunities for learning
• Per-model feature distributions P(Y | M)
- e.g. by analyzing isolated-sound databases
• Channel modifications P(X | Y)
- e.g. by comparing multi-microphone recordings
• Signal combinations P(O | {Xi})
- determined by acoustics
• Patterns of model combinations P({Mi})
- loose dependence between sources
• Search for the most likely explanations:
P({Mi} | O) ∝ P(O | {Xi}) P({Xi} | {Mi}) P({Mi})
• Short-term structure: repeating events
Source models
• The speech recognition lesson: use the data as much as possible
- what can we do with unlimited data feeds?
• Data sources
- clean data corpora
- identify near-clean segments in real sound
• Model types
- templates
- parametric/constraint models
- HMMs
What are the HMM states?
• No sub-units are defined for nonspeech sounds
• The final states depend on the EM initialization:
- labels
- clusters
- transition matrix
• We have ideas of what we’d like to get
- investigate features/initializations that get there
[Figure: spectrogram of dogBarks2 (0-8 kHz, ~1.15-2.25 s) overlaid with an EM-derived state sequence: s7 s3 s2 s4 s2 s3 s7 s2 s3 s5 s2 s3 s2 s4 s3 s4 s3 s7 s4]
Tractable decoding
• The speech decoder notionally searches all states
• Parametric models give an infinite space
- need closed-form partial explanations
- examine the residual, iterate, converge
• Need general cues to get started
- return to Auditory Scene Analysis: onsets, harmonic patterns
- then parametric fitting
• Need multiple-hypothesis search, pruning, efficiency tricks (see the sketch below)
• Learning? Parameters for new source events
- e.g. from artificial (hence labeled) mixtures
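One generic form of those efficiency tricks is beam pruning over a hypothesis list; a skeleton sketch in which the hypothesis type and scoring rule are placeholders, not a model of any decoder described above:

```python
import heapq

def beam_search(extend, initial, n_steps, beam=8):
    """Keep only the `beam` best-scoring partial explanations per step.

    `extend(hyp)` yields (new_hyp, incremental_log_score) pairs.
    """
    hyps = [(0.0, initial)]
    for _ in range(n_steps):
        candidates = [(score + inc, new_hyp)
                      for score, hyp in hyps
                      for new_hyp, inc in extend(hyp)]
        hyps = heapq.nlargest(beam, candidates, key=lambda c: c[0])
    return hyps

# Toy use: hypotheses are strings of 'events'; the score favors alternation.
def extend(hyp):
    for ev in 'ab':
        yield hyp + ev, (1.0 if not hyp or hyp[-1] != ev else -1.0)

print(beam_search(extend, '', n_steps=5)[0])   # (best score, best hypothesis)
```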
Example: Alarm sound detection
• Alarm sounds have a particular structure
- people ‘know them when they hear them’
• Isolate alarms in sound mixtures:
[Figure: spectrograms of three alarm sounds, 0-5 kHz, 1-2.5 s]
- sinusoid peaks have invariant properties (see the sketch below)
[Figure: spectrogram of Speech + alarm (0-4 kHz, 0-9 s) with a Pr(alarm) trace below]
• Learn model parameters from examples
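A minimal sketch of the 'invariant sinusoid peaks' idea: flag frames where a strong narrowband spectral peak persists across frames; the peak threshold, tolerance, and persistence length are invented, and a synthetic tone-in-noise signal stands in for a real alarm:

```python
import numpy as np

def stable_peaks(x, sr, frame=512, hop=256, min_run=5, tol_bins=1):
    """Flag frames whose dominant spectral peak stays put for >= min_run frames."""
    n_frames = 1 + (len(x) - frame) // hop
    win = np.hanning(frame)
    peaks = []
    for m in range(n_frames):
        S = np.abs(np.fft.rfft(win * x[m*hop:m*hop + frame]))
        k = int(np.argmax(S))
        peaks.append(k if S[k] > 5 * np.median(S) else -1)   # peaky frames only
    flags = np.zeros(n_frames, dtype=bool)
    run = 0
    for m in range(1, n_frames):
        stable = peaks[m] >= 0 and abs(peaks[m] - peaks[m - 1]) <= tol_bins
        run = run + 1 if stable else 0
        if run >= min_run:
            flags[m - run:m + 1] = True
    return flags

sr = 8000
t = np.arange(2 * sr) / sr
x = 0.05 * np.random.default_rng(4).normal(size=len(t))
x[sr//2:sr] += np.sin(2 * np.pi * 1000 * t[sr//2:sr])   # 0.5 s 'alarm' tone
print(stable_peaks(x, sr).astype(int))   # 1s cluster around the tone interval
```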
Example: Music transcription
(e.g. Masataka Goto)
• High-quality training material: synthesizer sample kits
• Ground truth available: musical scores
• Find ML explanations for scores
- guided by multiple pitch tracking (hypothesis search)
• Applications in similarity matching
Summary
• Sound contains lots of information
... but it’s always mixed up
• Psychologists describe ASA
... but bottom-up computer models don’t work
• Speech recognition works for isolated speech
... by exploiting top-down context constraints
• Speech in mixtures via multiple-source models
... practical combinatorics are the main problem
• Generalize this idea for all sounds
... need models of ‘all sounds’
... plus models of channel modification
... plus ways to propose segmentations
... plus missing-data recognition
Further reading
[BarkCE00] J. Barker, M.P. Cooke & D. Ellis (2000), “Decoding speech in the presence of other sound sources,” Proc. ICSLP-2000, Beijing.
ftp://ftp.icsi.berkeley.edu/pub/speech/papers/icslp00-msd.pdf
[Breg90] A.S. Bregman (1990), Auditory Scene Analysis: the perceptual organization of sound, MIT Press.
[Chev00] A. de Cheveigné (2000), “The Auditory System as a Separation Machine,” Proc. Intl. Symposium on Hearing.
http://www.ircam.fr/pcm/cheveign/sh/ps/ATReats98.pdf
[CookE01] M. Cooke & D. Ellis (2001), “The auditory organization of speech and other sources in listeners and computational models,” Speech Communication (accepted for publication).
http://www.ee.columbia.edu/~dpwe/pubs/tcfkas.pdf
[Ellis99] D.P.W. Ellis (1999), “Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis...,” Speech Communication 27.
http://www.ee.columbia.edu/~dpwe/pubs/spcomcasa98.pdf
[Roweis00] S. Roweis (2000), “One microphone source separation,” Proc. NIPS 2000.
http://www.ee.columbia.edu/~dpwe/papers/roweis-nips2000.pdf