Lecture 10: Signal Separation

EE E6820: Speech & Audio Processing & Recognition
Lecture 10:
Signal Separation
Dan Ellis <dpwe@ee.columbia.edu>
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
April 14, 2009
1. Sound mixture organization
2. Computational auditory scene analysis
3. Independent component analysis
4. Model-based separation
E6820 (Ellis & Mandel), L10: Signal separation, April 14, 2009
Sound Mixture Organization
[Figure: spectrogram of a 12 s sound mixture (0-4 kHz, level in dB), analyzed into sources: voice (evil), voice (pleasant), stab, rumble, choir, strings]
Auditory Scene Analysis: describing a complex sound in terms of high-level sources / events
- . . . like listeners do
Hearing is ecologically grounded
- reflects 'natural scene' properties
- subjective, not absolute
Sound, mixtures, and learning
[Figure: spectrograms (0-4 kHz, ~1 s each): Speech + Noise = Speech-plus-Noise mixture]
Sound
- carries useful information about the world
- complements vision
Mixtures
- . . . are the rule, not the exception
- medium is 'transparent', sources are many
- must be handled!
Learning
- the 'speech recognition' lesson: let the data do the work
- like listeners
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake,
with handkerchiefs stretched across each one. Looking only at the
motion of the handkerchiefs, you are to answer questions such as:
How many boats are there on the lake and where are they?” (after
Bregman, 1990)
Received waveform is a mixture
- two sensors, N signals . . . underconstrained
Disentangling mixtures as the primary goal?
- perfect solution is not possible
- need experience-based constraints
Approaches to sound mixture recognition
Separate signals, then recognize
- e.g. Computational Auditory Scene Analysis (CASA), Independent Component Analysis (ICA)
- nice, if you can do it
Recognize combined signal
- 'multicondition training'
- combinatorics. . .
Recognize with parallel models
- full joint-state space?
- divide signal into fragments, then use missing-data recognition
What is the goal of sound mixture analysis?
Separate signals?
- output is unmixed waveforms
- underconstrained, very hard . . .
- too hard? not required?
Source classification?
- output is set of event-names
- listeners do more than this. . .
Something in-between?
- identify independent sources + characteristics
- standard task, results?
Segregation vs. Inference
Source separation requires attribute separation
- sources are characterized by attributes (pitch, loudness, timbre, and finer details)
- need to identify and gather different attributes for different sources. . .
Need representation that segregates attributes
- spectral decomposition
- periodicity decomposition
Sometimes values can't be separated (e.g. unvoiced speech)
- maybe infer factors from probabilistic model? p(O, x, y) → p(x, y | O)
- or: just skip those values & infer from higher-level context
Auditory Scene Analysis (Bregman, 1990)
How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
Grouping 'rules' (Darwin and Carlyon, 1995)
- cues: common onset/offset/modulation, harmonicity, spatial location, . . .
[Diagram: Sound → Frequency analysis → Elements → Onset map / Harmonicity map / Spatial map → Grouping mechanism → Sources → Source properties]
Cues to simultaneous grouping
[Figure: spectrogram (0-8 kHz, ~9 s) with elements and their attributes marked]
Common onset
- simultaneous energy has common source
Periodicity
- energy in different bands with same cycle
Other cues
- spatial (ITD/IID), familiarity, . . .
The effect of context
Context can create an 'expectation'
- i.e. a bias towards a particular interpretation
- a change in a signal will be interpreted as an added source whenever possible
e.g. Bregman's "old-plus-new" principle:
[Figure: spectrograms (0-2 kHz, ~1.2 s) showing the same energy divided differently]
- a different division of the same energy depending on what preceded it
Computational Auditory Scene Analysis (CASA)
[Diagram: sound mixture → object 1 / object 2 / object 3 percepts]
Goal: automatic sound organization
- systems to 'pick out' sounds in a mixture . . . like people do
- e.g. voice against a noisy background, to improve speech recognition
Approach
- psychoacoustics describes grouping 'rules' . . . just implement them?
"#$#!%&'()*+(,!-&'.+//0(1
CASA front-end processing
2 "'&&+3'1&456
Correlogram:
Loosely based on known/possible physiology
7''/+38!94/+,!'(!:(';(<-'//093+!-=8/0'3'18
e
lin
lay
short-time
autocorrelation
de
sound
Cochlea
filterbank
frequency channels
"'&&+3'1&45
/30.+
envelope
follower
freq
lag
time
I
I
I
I
- linear filterbank cochlear approximation
linear filterbank cochlear approximation
static nonlinearity
static -nonlinearity
zero-delay
slice is likeslice
spectrogram
- zero-delay
is like spectrogram
periodicity
from
delay-and-multiply
detectors
- periodicity from delay-and-multiply
detectors
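As a concrete sketch of the periodicity analysis above, the short-time autocorrelation of each filterbank channel can be computed directly. This is a toy illustration, not the lecture's implementation: the cochlear filterbank output is assumed precomputed, and a single sinusoidal channel stands in for it.

```python
import numpy as np

def correlogram_frame(subbands, max_lag):
    """Short-time autocorrelation of each filterbank channel.

    subbands: array (n_channels, n_samples) -- one frame of (assumed
    precomputed) cochlear filterbank output.
    Returns (n_channels, max_lag): one autocorrelation row per channel.
    """
    n_ch, n = subbands.shape
    ac = np.zeros((n_ch, max_lag))
    for lag in range(max_lag):
        # delay-and-multiply periodicity detector at this lag
        ac[:, lag] = np.sum(subbands[:, :n - lag] * subbands[:, lag:], axis=1)
    return ac

# A 100 Hz periodic channel at 8 kHz sampling: expect a peak at lag 80
fs = 8000
t = np.arange(fs // 10) / fs
band = np.sin(2 * np.pi * 100 * t)
ac = correlogram_frame(band[np.newaxis, :], 120)
```

Scanning `ac` away from the trivial zero-lag peak picks out the 80-sample period, which is how a correlogram slice exposes pitch.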
Bottom-Up Approach (Brown and Cooke, 1994)
Implement psychoacoustic theory
[Diagram: input mixture → front end (signal feature maps: freq, onset, time, period, freq. mod) → object formation (discrete objects) → grouping rules → source groups]
- left-to-right processing
- uses common onset & periodicity cues
Able to extract voiced speech
[Figure: spectrograms (0-8 kHz, ~1.4 s, level in dB) of the 'v3n7' original mix and the voiced speech extracted by Hu & Wang '02]
Problems with 'bottom-up' CASA
Circumscribing time-frequency elements
- need to have 'regions', but hard to find
Periodicity is the primary cue
- how to handle aperiodic energy?
Resynthesis via masked filtering
- cannot separate within a single t-f element
Bottom-up leaves no ambiguity or context
- how to model illusions?
Restoration in sound perception
Auditory 'illusions' = hearing what's not there
The continuity illusion & Sinewave Speech (SWS)
[Figure: spectrograms (~3.5 s) of continuity-illusion and sinewave-speech stimuli]
- duplex perception
What kind of model accounts for this?
- is it an important part of hearing?
Adding top-down constraints: Prediction-Driven CASA
Perception is not direct
- but a search for plausible hypotheses
Data-driven (bottom-up). . .
[Diagram: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
- objects irresistibly appear
vs. Prediction-driven (top-down)
[Diagram: input mixture → front end (signal features) → compare & reconcile ↔ predict & combine (noise components, periodic components → predicted features), with hypothesis management feeding prediction errors back to the hypotheses]
- match observations with a 'world-model'
- need world-model constraints. . .
Generic sound elements for PDCASA (Ellis, 1996)
Goal is a representational space that
- covers real-world perceptual sounds
- minimal parameterization (sparseness)
- separate attributes in separate parameters
Three generic elements (as time-frequency magnitude models over frame n, channel k):
- NoiseObject: spectral profile H[k] times magnitude contour A[n]:
  XN[n,k] = H[k]·A[n]
- ClickObject: spectral profile H[k] with per-band decay times T[k] after onset n0:
  XC[n,k] = H[k]·exp(-(n - n0) / T[k])
- WeftObject: smooth spectral envelope HW[n,k] modulated by a periodogram P[n,k] (from correlogram analysis):
  XW[n,k] = HW[n,k]·P[n,k]
Object hierarchies built on top. . .
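A minimal numeric sketch of the first two element types. All sizes and parameter values here are invented for illustration, and the sign in the click's exponent is chosen so the element decays after its onset:

```python
import numpy as np

# Hypothetical sizes: 4 frequency channels (k), 5 time frames (n)
H = np.array([1.0, 0.8, 0.5, 0.2])       # spectral profile H[k]
A = np.array([0.0, 1.0, 0.9, 0.8, 0.7])  # magnitude contour A[n]
T = np.array([0.05, 0.04, 0.03, 0.02])   # decay times T[k] in seconds
hop = 0.01                               # frame hop in seconds
n0 = 1                                   # click onset frame

# NoiseObject: X_N[n,k] = H[k] * A[n] -- outer product of contour and profile
X_N = np.outer(A, H)

# ClickObject: X_C[n,k] = H[k] * exp(-(n - n0) * hop / T[k]) for n >= n0
n = np.arange(len(A))[:, None]
X_C = np.where(n >= n0, H * np.exp(-(n - n0) * hop / T), 0.0)
```

Each element is a sparse parameterization: a few hundred spectrogram cells described by one profile vector plus one contour (or decay) vector.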
PDCASA for old-plus-new
Incremental analysis of the input signal (times t1 < t2 < t3):
- Time t1: initial element created
- Time t2: additional element required
- Time t3: second element finished
PDCASA for the continuity illusion
Subjects hear the tone as continuous
- . . . if the noise is a plausible masker
[Figure: spectrogram ('ptshort', 0-4 kHz, ~1.4 s) of tone segments alternating with noise bursts]
Data-driven analysis gives just the visible portions; prediction-driven analysis can infer the masking.
Prediction-Driven CASA
Explain a complex sound with basic elements
[Figure: 10 s 'City' sound spectrogram (50 Hz-4 kHz, -70 to -40 dB) decomposed into elements, with listener agreement: Wefts 1-4, Weft 5, Wefts 6-7, Weft 8, Wefts 9-12 (Horn1 10/10, Horn2 5/10, Horn3 5/10, Horn4 8/10, Horn5 10/10); Noise2 + Click1 (Crash 10/10); Noise1 (Squeal 6/10, Truck 7/10)]
Aside: Ground Truth
What do people hear in sound mixtures?
- do interpretations match?
→ Listening tests to collect 'perceived events'
Aside: Evaluation
Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?
SNR improvement
- tricky to derive from before/after signals: correspondence problem
- can do with fixed filtering mask
- differentiate removing signal from adding noise
Speech Recognition (ASR) improvement
- recognizers often sensitive to artifacts
'Real' task?
- mixture corpus with specific sound events. . .
Independent Component Analysis (ICA)
(Bell and Sejnowski, 1995, etc.)
If mixing is like matrix multiplication, then separation is searching for the inverse matrix:

  [x1]   [a11 a12] [s1]        [s^1]   [w11 w12] [x1]
  [x2] = [a21 a22] [s2]        [s^2] = [w21 w22] [x2]

with the wij adapted along -∂MutInfo/∂wij (minimizing the mutual information of the outputs), i.e. W ≈ A^-1
- with N different versions of the mixed signals (microphones), we can find N different input contributions (sources)
- how to rate quality of outputs? i.e. when do outputs look separate?
Gaussianity, Kurtosis, & Independence
A signal can be characterized by its PDF p(x)
- i.e. as if successive time values are drawn from a random variable (RV)
- Gaussian PDF is 'least interesting'
- sums of independent RVs (PDFs convolved) tend to a Gaussian PDF (central limit theorem)
Measures of deviations from Gaussianity: 4th moment is kurtosis ("bulging")

  kurt(y) = E[((y - µ) / σ)^4] - 3

- kurtosis of Gaussian is zero (this def.)
- 'heavy tails' → kurt > 0 (e.g. exp(-|x|): kurt = 2.5)
- closer to uniform dist. → kurt < 0 (e.g. exp(-x^4): kurt = -1)
- directly related to KL divergence from Gaussian PDF
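The kurtosis measure is easy to check numerically. A quick sketch (sample sizes and the particular heavy-tailed/flat distributions are chosen arbitrarily):

```python
import numpy as np

def excess_kurtosis(y):
    """kurt(y) = E[((y - mu) / sigma)^4] - 3: zero for a Gaussian."""
    z = (y - y.mean()) / y.std()
    return np.mean(z ** 4) - 3

rng = np.random.default_rng(0)
gauss = rng.normal(size=100_000)          # kurtosis ~ 0
heavy = rng.laplace(size=100_000)         # heavy tails -> kurtosis > 0
flat = rng.uniform(-1, 1, size=100_000)   # flatter than Gaussian -> kurtosis < 0
```

Summing independent draws of `heavy` and re-measuring would show the kurtosis shrinking toward zero, the central-limit effect that ICA exploits.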
Independence in Mixtures
Scatter plots & kurtosis values
[Figure: scatter plots of the source pair (s1 vs. s2) and the mixture pair (x1 vs. x2), with marginal PDFs]
- p(s1) kurtosis = 27.90, p(s2) kurtosis = 53.85
- p(x1) kurtosis = 18.50, p(x2) kurtosis = 27.90
- mixing leaves the observations less kurtotic, i.e. closer to Gaussian, than the sources
Finding Independent Components
Sums of independent RVs are more Gaussian
→ minimize Gaussianity to undo sums
- i.e. search over wij terms in inverse matrix
[Figure: mixture scatter plot, and kurtosis of the projection vs. angle θ/π: kurtosis peaks at the source directions s1, s2]
Solve by gradient descent or Newton-Raphson ("Fast ICA", Hyvärinen and Oja, 2000):

  w+ = E[x g(wT x)] - E[g'(wT x)] w
  w = w+ / ||w+||

http://www.cis.hut.fi/projects/ica/fastica/
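The one-unit fixed-point iteration above can be sketched directly with g(u) = tanh(u). This is a toy run, not the FastICA package: the mixing matrix and source distributions are invented, and the data are first whitened (zero mean, identity covariance), which the update assumes:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 20_000))              # two non-Gaussian sources
A = np.array([[0.6, 0.4], [0.4, 0.6]])         # instantaneous mixing
X = A @ S

# Whiten: zero mean, identity covariance
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d ** -0.5) @ E.T @ X

# Fixed point: w+ = E[x g(w'x)] - E[g'(w'x)] w, then renormalize
w = rng.normal(size=2)
w /= np.linalg.norm(w)
for _ in range(200):
    u = w @ Z
    w_new = (Z * np.tanh(u)).mean(axis=1) - (1 - np.tanh(u) ** 2).mean() * w
    w_new /= np.linalg.norm(w_new)
    converged = abs(abs(w_new @ w) - 1) < 1e-10   # stable up to sign
    w = w_new
    if converged:
        break

s_hat = w @ Z  # one recovered source, up to scale and sign
```

The recovered component should correlate strongly with one of the true sources; running a second unit constrained orthogonal to `w` would recover the other.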
Limitations of ICA
Assumes instantaneous mixing
- real-world mixtures have delays & reflections
- STFT domain? x1(t) = a11(t) ∗ s1(t) + a12(t) ∗ s2(t) ⇒ X1(ω) = A11(ω)S1(ω) + A12(ω)S2(ω)
- solve ω subbands separately, match up answers
Searching for best possible inverse matrix
- cannot find more than N outputs from N inputs
- but: "projection pursuit" ideas + time-frequency masking. . .
Cancellation inherently fragile
- ŝ1 = w11 x1 + w12 x2 relies on cancelling out s2
- sensitive to noise in x channels
- time-varying mixtures are a problem
Model-Based Separation: HMM decomposition
(e.g. Varga and Moore, 1990; Gales and Young, 1993)
Independent state sequences for 2+ component source models
[Diagram: model 1 and model 2 state lattices evolving jointly against the observations over time]
New combined state space q' = q1 × q2
- need pdfs for combinations p(X | q1, q2)
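A toy sketch of the combined state space. The state names and spectra below are invented, and the elementwise "log-max" rule (the louder source dominates each channel) is one common approximation for deriving p(X | q1, q2) from the component models; the lecture does not commit to a particular combination rule:

```python
import itertools
import numpy as np

# Hypothetical per-state mean log-power spectra (dB) over 2 freq channels
model1 = {"sil": np.array([-60.0, -60.0]), "voiced": np.array([-10.0, -30.0])}
model2 = {"off": np.array([-60.0, -60.0]), "noise": np.array([-25.0, -15.0])}

# Combined state space q' = q1 x q2: log-max approximation per channel
combined = {
    (q1, q2): np.maximum(m1, m2)
    for (q1, m1), (q2, m2) in itertools.product(model1.items(), model2.items())
}
```

Even this 2×2 toy shows the cost: the combined space is the product of the component spaces, which is why full HMM decomposition scales badly.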
One-channel Separation: Masked Filtering
Multichannel → ICA: inverse filter and cancel
[Diagram: mix of target s and interference n → processing estimates n^, subtracted to leave s^]
One channel: find a time-frequency mask
[Diagram: mix s+n → STFT → spectrogram → mask → multiply → inverse STFT → s^]
Cannot remove overlapping noise in t-f cells, but surprisingly effective (psych masking?):
[Figure: spectrograms (0-8 kHz, ~3 s) of noise mixes at 0, -12, and -24 dB SNR (top row) and their mask resyntheses (bottom row)]
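The one-channel mask-and-resynthesize chain can be sketched with an oracle mask. This cheats by using the clean signals to build the mask, purely to show the pipeline; real systems must estimate the mask. It assumes SciPy's `stft`/`istft`, and sinusoids stand in for speech and interference:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)    # stand-in for speech
noise = np.sin(2 * np.pi * 2000 * t)    # stand-in for interference
mix = target + noise

# STFT of the mixture, a binary time-frequency mask, then inverse STFT
_, _, X = stft(mix, fs=fs, nperseg=256)
_, _, T = stft(target, fs=fs, nperseg=256)
_, _, N = stft(noise, fs=fs, nperseg=256)
mask = (np.abs(T) > np.abs(N)).astype(float)   # keep cells the target dominates
_, s_hat = istft(X * mask, fs=fs, nperseg=256)

err_before = np.mean((mix - target) ** 2)                  # = noise power
err_after = np.mean((s_hat[:len(target)] - target) ** 2)   # after masking
```

With well-separated sources the masked resynthesis recovers the target almost exactly; when target and noise share t-f cells, whatever lands in a kept cell comes through, which is the limitation noted above.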
“One microphone source separation”
(Roweis, 2001)
State sequences → t-f estimates → mask
[Figure: spectrograms (0-4 kHz, 6 s) of the original voices (speaker 1, speaker 2), their mixture, the per-state means, and the resynthesis masks]
- 1000 states/model (→ 10^6 transition probs.)
- simplify by subbands (coupled HMM)?
Speech Fragment Recognition
(Barker et al., 2005)
Signal separation is too hard! Instead:
- segregate features into partially-observed sources
- then classify
Made possible by missing data recognition
- integrate over uncertainty in observations for true posterior distribution
Goal: relate clean speech models P(X | M) to speech-plus-noise mixture observations
- . . . and make it tractable
"#$$#%&!'()(!*+,-&%#)#-%
!! !
!!!
. Speech
/0++,1!2-3+4$!p(x|m)!(5+!264)#3#2+%$#-%(4777
models p(x | m) are multidimensional. . .
9i.e.'<2<!=2#$(-!>#1'#$?2(!@*1!2>21A!@12B<!?C#$$2&
means, variances
for every freq. channel
"#$$#%&!'()(!*+,-&%#)#-%
I need values for all dimensions to get p(·)
9 $22,!>#&+2(!@*1!#&&!,'=2$('*$(!0*!520!p(•)
. /0++,1!2-3+4$!p(x|m)!(5+!264)#3#2+%$#-%(4777
Missing Data Recognition
.
But: can evaluate over a subset of dimensions xk
9 '<2<!=2#$(-!>#1'#$?2(!@*1!2>21A!@12B<!?C#$$2&
86)9!,(%!+:(46()+!-:+5!(!
xu
9 $22,!>#&+2(!@*1!#&&!,'=2$('*$(!0*!520!
p(•)
p(x
k,xu)
$6;$+)!-<!3#2+%$#-%$!xk
. 86)9!,(%!+:(46()+!-:+5!(!
p ! x k m " =Z $ p ! $6;$+)!-<!3#2+%$#-%$!
x # x u m " dx u xk
p(xk | m) = p(xk , kxu | m)
dx
p ! x k m " = $ p ! x k#ux u m " dx u
.
!!
y
p(xk|xu<y )
p(xk|xu<y )
xk
p(x
xk k )
p(xk )
P(x P(x
| q)| q)= =
P(x1 | q)
dimension !
P(x·1P(x
| q)
2 | q)
· P(x3 q)
· P(x
2 | | q)
· P(x4 | q)
· P(x
3 |5 q)
· P(x
| q)
· P(x
· P(x
| q)
4 |6 q)
time !
· P(x5 | q)
· P(x6 | q)
hard part9 isC#1,!D#10!'(!!$,'$5!0C2!=#(E!F(25125#0'*$G
finding the mask (segregation)
dimension !
I
p(xk,xu)
y
=+%,+>!
. =+%,+>!
2#$$#%&!3()(!5+,-&%#)#-%9
2#$$#%&!3()(!5+,-&%#)#-%9
Hence, missing data recognition:
?5+$+%)!3()(!2($@
?5+$+%)!3()(!2($@
xu
"#$!%&&'(
Missing Data Results
Estimate static background noise level N(f)
- cells with energy close to background are considered "missing"
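A sketch of that mask estimation. The threshold, array sizes, and the assumption that the first frames are noise-only are all invented for illustration:

```python
import numpy as np

def present_data_mask(log_spec, margin_db=3.0, init_frames=10):
    """True where a cell is reliably above the static noise floor.

    log_spec: (frames, freqs) log-power spectrogram in dB.  N(f) is
    estimated from the first init_frames frames (assumed noise-only);
    cells within margin_db of it are treated as 'missing'.
    """
    noise_floor = log_spec[:init_frames].mean(axis=0)
    return log_spec > noise_floor + margin_db

rng = np.random.default_rng(2)
spec = -40 + rng.normal(0, 1, size=(50, 8))  # flat background near -40 dB
spec[20:30, 2] = -10.0                       # a loud, speech-like region
mask = present_data_mask(spec)
```

The static-floor assumption is the weak point: nonstationary noise rises above N(f) and produces the spurious mask bits noted below.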
[Figure: "1754" + factory noise → SNR mask → missing data classifier → "1754"; digit recognition accuracy (%) vs. SNR (5-20 dB) for a priori masks, missing data, and MFCC+CMN]
- must use spectral features!
But: nonstationary noise → spurious mask bits
- can we try removing parts of mask?
Comparing different segregations
Standard classification chooses between models M to match source features X:

  M∗ = argmaxM p(M | X) = argmaxM p(X | M) p(M)

Mixtures: observed features Y, segregation S, all related by p(X | Y, S)
[Diagram: observation Y(f) as source X(f) under segregation mask S, as functions of freq]
Joint classification of model and segregation:

  p(M, S | Y) = p(M) ∫ p(X | M) p(X | Y, S) / p(X) dX · p(S | Y)

- p(X) no longer constant
Calculating fragment matches
  p(M, S | Y) = p(M) ∫ p(X | M) p(X | Y, S) / p(X) dX · p(S | Y)

- p(X | M) – the clean-signal feature model
- p(X | Y, S) / p(X) – is X 'visible' given segregation? Integration collapses some bands. . .
- p(S | Y) – segregation inferred from observation: just assume uniform and find S for the most likely M, or use extra information in Y to distinguish Ss. . .
Result:
- probabilistically-correct relation between clean-source models p(X | M) and the inferred, recognized source + segregation p(M, S | Y)
Using CASA features
p(S | Y) links acoustic information to segregation
- is this segregation worth considering? how likely is it?
Opportunity for CASA-style information to contribute
- periodicity/harmonicity: these different frequency bands belong together
- onset/continuity: this time-frequency region must be whole
[Figure: examples of frequency proximity, common onset, and harmonicity cues]
Fragment decoding
Limiting S to whole fragments makes hypothesis search tractable:
[Diagram: time-frequency fragments w1-w8 assembled into competing hypotheses over time segments T1-T6]
- choice of fragments reflects p(S | Y) p(X | M), i.e. the best combination of segregation and match to speech models
Merging hypotheses limits space demands
- . . . but erases specific history
Speech fragment decoder results
Simple p(S | Y) model forces contiguous regions to stay together
- big efficiency gain when searching S space
[Figure: "1754" + noise → SNR mask → fragments → fragment decoder → "1754"]
Clean-models-based recognition rivals trained-in-noise recognition
Multi-source decoding
Search for more than one source
[Diagram: observations Y(t) divided into masks S1(t), S2(t) driving state sequences q1(t), q2(t)]
Mutually-dependent data masks
- disjoint subsets of cells for each source
- each model match p(Mx | Sx, Y) is independent
- masks are mutually dependent: p(S1, S2 | Y)
Huge practical advantage over full search
Summary
Auditory Scene Analysis:
- hearing: partially understood, very successful
Independent Component Analysis:
- simple and powerful, some practical limits
Model-based separation:
- real-world constraints, implementation tricks
Parting thought
- mixture separation is the main obstacle in many applications, e.g. soundtrack recognition
References
Albert S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound.
The MIT Press, 1990. ISBN 0262521954.
C.J. Darwin and R.P. Carlyon. Auditory grouping. In B.C.J. Moore, editor, The
Handbook of Perception and Cognition, Vol 6, Hearing, pages 387–424. Academic
Press, 1995.
G. J. Brown and M. P. Cooke. Computational auditory scene analysis. Computer
speech and language, 8:297–336, 1994.
D. P. W. Ellis. Prediction-driven computational auditory scene analysis. PhD thesis,
Department of Electrical Engineering and Computer Science, M.I.T., 1996.
Anthony J. Bell and Terrence J. Sejnowski. An information-maximization approach to
blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159,
1995. ftp://ftp.cnl.salk.edu/pub/tony/bell.blind.ps.Z.
A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and
applications. Neural Networks, 13(4–5):411–430, 2000.
http://www.cis.hut.fi/aapo/papers/IJCNN99_tutorialweb/.
A. Varga and R. Moore. Hidden Markov model decomposition of speech and noise. In
Proceedings of the 1990 IEEE International Conference on Acoustics, Speech and
Signal Processing, pages 845–848, 1990.
M.J.F. Gales and S.J. Young. HMM recognition in noise using parallel model
combination. In Proc. Eurospeech-93, volume 2, pages 837–840, 1993.
S. Roweis. One-microphone source separation. In NIPS, volume 11, pages 609–616.
MIT Press, Cambridge MA, 2001.
Jon Barker, Martin Cooke, and Daniel P. W. Ellis. Decoding speech in the presence of
other sources. Speech Communication, 45(1):5–25, 2005. URL
http://www.ee.columbia.edu/~dpwe/pubs/BarkCE05-sfd.pdf.