Lecture 14: Source Separation

ELEN E4896 MUSIC SIGNAL PROCESSING
1. Sources, Mixtures, & Perception
2. Spatial Filtering
3. Time-Frequency Masking
4. Model-Based Separation
Dan Ellis
Dept. Electrical Engineering, Columbia University
dpwe@ee.columbia.edu
E4896 Music Signal Processing (Dan Ellis)
http://www.ee.columbia.edu/~dpwe/e4896/
2013-04-29 - 1 /19
1. Sources, Mixtures, & Perception
• Sound is a linear process (superposition)
no “opacity” (unlike vision)
sources → “auditory scenes” (polyphony)
[Spectrogram of “02_m+s-15-evil-goodvoice-fade”, 0–4 kHz over 0–12 s, level in dB, with the analyzed sources labeled: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Humans perceive discrete sources
.. a subjective construct
Spatial Hearing
• People perceive sources based on spatial (binaural) cues: ITD, ILD (Blauert ’96)
[Diagram: path-length difference and head shadow (high freq) between the two ears for an off-axis source; left/right waveforms of “shatr78m3” over 2.2–2.235 s showing the interaural time difference]
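The ITD cue can be computed directly: it is the lag of the cross-correlation peak between the two ear signals. A minimal numpy sketch (the function name and the synthetic 20-sample delay are illustrative, not from the slides):

```python
import numpy as np

def estimate_itd(left, right, fs):
    """Estimate the interaural time difference (seconds) as the lag
    that maximizes the cross-correlation of the two ear signals."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)  # samples; > 0 means `left` lags
    return lag / fs

# Synthetic check: the same noise burst, delayed 20 samples at the left ear
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
left = np.concatenate([np.zeros(20), s])
right = np.concatenate([s, np.zeros(20)])
itd = estimate_itd(left, right, fs)  # ~ 20/16000 s = 1.25 ms
```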
Auditory Scene Analysis
• Spatial cues may not be enough/available
single-channel signal
• Brain uses signal-intrinsic cues to form sources (Bregman ’90)
onset, harmonicity
[Spectrogram of the Reynolds-McAdams oboe, 0–4 kHz over 0–4 s, level in dB]
Auditory Scene Analysis
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Quite a challenge!
Audio Mixing
• Studio recording combines separate tracks
into, e.g., 2 channels (stereo)
different levels
panning
other effects
• Stereo intensity
[Plot: channel intensity (0–1) vs. pan position (−1 to +1)]
Panning
manipulating ILD only
constant power
more channels: use just the nearest pair?
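The constant-power law fits in a few lines of numpy (the cos/sin law below is the standard one; the function name is illustrative):

```python
import numpy as np

def constant_power_pan(x, pan):
    """Pan mono x into stereo. pan in [-1, 1]: -1 = hard left,
    0 = center, +1 = hard right. The cos/sin law keeps
    gL**2 + gR**2 = 1, so total power stays constant."""
    theta = (pan + 1) * np.pi / 4  # map [-1, 1] onto [0, pi/2]
    return np.cos(theta) * x, np.sin(theta) * x

x = np.ones(8)                     # dummy mono signal
L, R = constant_power_pan(x, 0.0)  # center: both gains = sqrt(1/2)
```

Note that this manipulates only ILD, as the slide says: studio panning carries no ITD cue.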
2. Spatial Filtering
• N sources detected by M sensors: need M ≥ N degrees of freedom
(else need other constraints)
• Consider the 2 × 2 case:
[Diagram: sources s1, s2 picked up by directional mics m1, m2 with gains a11, a21, a12, a22]
→ mixing matrix:
[m1; m2] = [a11 a12; a21 a22] [s1; s2]
estimate sources by inverting: [ŝ1; ŝ2] = Â⁻¹ m
Source Cancelation
• Simple 2 × 2 example:
[m1; m2] = [1 0.5; 0.8 1] [s1; s2]
m1(t) = s1(t) + 0.5 s2(t)
m2(t) = 0.8 s1(t) + s2(t)
m1(t) − 0.5 m2(t) = 0.6 s1(t)
if there is no delay and the sums are linearly independent,
can cancel one source per combination
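The cancellation is easy to verify numerically; a short numpy check of the slide's 2 × 2 example:

```python
import numpy as np

# Mixing matrix from the example: m = A s
A = np.array([[1.0, 0.5],
              [0.8, 1.0]])

rng = np.random.default_rng(1)
s = rng.standard_normal((2, 1000))  # two source waveforms
m = A @ s                           # two mic signals

# Cancel s2 in m1: m1 - 0.5*m2 = (1 - 0.5*0.8) s1 = 0.6 s1
s1_hat = (m[0] - 0.5 * m[1]) / 0.6

# Or invert the whole (linearly independent) mixing matrix
s_hat = np.linalg.inv(A) @ m
```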
Independent Component Analysis
• Can separate “blind” combinations by maximizing independence of outputs (Bell & Sejnowski ’95)
[Diagram: sources s1, s2 mixed by A = [a11 a12; a21 a22] into observations x = (m1, m2); unmixing output y adapted by the gradient −δMutInfo/δa]
kurtosis as a proxy for independence:
kurt(y) = E[((y − µ)/σ)⁴] − 3
[Scatter plots: source distribution (s1 vs. s2) and mixture distribution (mix 1 vs. mix 2); kurtosis surface vs. unmixing parameters]
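The kurtosis criterion can be illustrated directly: a super-Gaussian source has large excess kurtosis, while (by the central limit theorem) a sum of independent sources is closer to Gaussian. A sketch, using Laplacian sources purely for the demo:

```python
import numpy as np

def excess_kurtosis(y):
    """kurt(y) = E[((y - mu)/sigma)**4] - 3; zero for a Gaussian."""
    z = (y - y.mean()) / y.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
n = 100_000
source = rng.laplace(size=n)                  # super-Gaussian source
mix = (source + rng.laplace(size=n)) / np.sqrt(2)

k_source = excess_kurtosis(source)            # ~3 for a Laplacian
k_mix = excess_kurtosis(mix)                  # smaller: mixing Gaussianizes
```

ICA exploits this in reverse: searching for the unmixing direction that maximizes |kurtosis| undoes the Gaussianization.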
Microphone Arrays
• If interference is diffuse, can simply boost energy from the target direction (Benesty, Chen & Huang ’08)
e.g. shotgun mic: delay-and-sum
[Diagram: delay-and-sum array with element spacing D and path difference ∆x; beam patterns (0 to −40 dB) for λ = 4D, λ = 2D, λ = D]
off-axis spectral coloration
many variants - filter & sum, sidelobe cancelation ...
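A delay-and-sum beamformer in numpy (integer-sample steering delays only, for brevity; the signals and geometry below are synthetic):

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Steer an array: undo each channel's integer-sample arrival
    delay, then average. Coherent for the target direction,
    incoherent (so attenuated) for everything else."""
    out = np.zeros(mics.shape[1])
    for x, d in zip(mics, delays):
        out += np.roll(x, -d)
    return out / len(mics)

rng = np.random.default_rng(0)
n, n_mics, d0 = 4096, 4, 3          # d0: inter-mic delay for the target
target = rng.standard_normal(n)
# each mic: delayed copy of the target plus independent noise (0 dB SNR)
mics = np.stack([np.roll(target, i * d0) + rng.standard_normal(n)
                 for i in range(n_mics)])
y = delay_and_sum(mics, [i * d0 for i in range(n_mics)])
```

Averaging M aligned channels leaves the target untouched but cuts independent noise power by M, a 10·log10(M) dB array gain (about 6 dB here).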
3. Time-Frequency Masking (Brown & Cooke ’94; Roweis ’01)
• What if there is only one channel?
cannot have a fixed cancellation
but could have fast time-varying filtering:
[Spectrograms: mixture and time-frequency-masked version, 0–8 kHz over 0–3 s]
• The trick is finding the right mask...
Time-Frequency Masking
• Works well for overlapping voices
[Spectrograms: Original (Female, Male); Mix + Oracle Labels; Oracle-based Resynth (cooke-v3n7.wav, cooke-v3msk-ideal.wav, cooke-n7msk-ideal.wav); 0–8 kHz over 0–1 s, level in dB]
time-frequency resolution?
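An oracle (ideal binary) mask is easy to sketch with a plain numpy STFT: keep each cell where the known target out-powers the interference, then overlap-add. Everything here (frame size, the two test tones) is illustrative; real systems must estimate the mask.

```python
import numpy as np

def oracle_mask_separate(target, interference, n=256):
    """Ideal binary mask: zero every STFT cell where the (known)
    interference is stronger than the target, then resynthesize.
    An upper bound on masking performance, since it needs the truth."""
    hop, w = n // 2, np.hanning(n)
    mix = target + interference
    starts = np.arange(0, len(mix) - n + 1, hop)

    def frame(x):
        return np.stack([x[i:i + n] for i in starts])

    T = np.fft.rfft(frame(target) * w, axis=1)
    I = np.fft.rfft(frame(interference) * w, axis=1)
    X = np.fft.rfft(frame(mix) * w, axis=1)
    masked = np.fft.irfft(X * (np.abs(T) > np.abs(I)), n, axis=1)
    # overlap-add, normalizing by the summed analysis windows
    y, norm = np.zeros(len(mix)), np.zeros(len(mix))
    for f, i in zip(masked, starts):
        y[i:i + n] += f
        norm[i:i + n] += w
    return y / np.maximum(norm, 1e-12)

fs = 8000
t = np.arange(4096) / fs
target = np.sin(2 * np.pi * 220 * t)         # low tone
interference = np.sin(2 * np.pi * 1900 * t)  # high tone
y = oracle_mask_separate(target, interference)
```

With spectrally disjoint sources like these two tones, the masked resynthesis is nearly identical to the target; overlapping cells are where the mask must lose one source or the other.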
Pan-Based Filtering (Avendano 2003)
• Can use time-frequency masking even for stereo
e.g. calculate a “panning index” as the per-cell ILD; mask cells matching the target ILD
[Figure: per-cell ILD vs. time; ILD mask (1024-pt window, −2.5 .. +1.0 dB); input and masked spectrograms, 0–6 kHz over 0–20 s, level in dB]
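Given STFT matrices for the two channels, the panning-index mask is only a couple of lines; a sketch (the ±6 dB toy example and function name are assumptions for illustration):

```python
import numpy as np

def ild_mask(L, R, lo_db, hi_db, eps=1e-12):
    """Binary mask over STFT cells whose interchannel level
    difference 20*log10(|L|/|R|) falls in [lo_db, hi_db]."""
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    return (ild >= lo_db) & (ild <= hi_db)

# Toy check: a source panned ~6 dB toward the left channel
rng = np.random.default_rng(0)
S = rng.standard_normal((64, 100)) + 1j * rng.standard_normal((64, 100))
L, R = 2.0 * S, 1.0 * S              # |L|/|R| = 2  ->  ILD ~ +6 dB
mask = ild_mask(L, R, 5.0, 7.0)      # selects every cell of this source
```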
Harmonic-based Masking
• Time-frequency masking can be used to pick out harmonics (Denbigh & Zhao 1992)
given a pitch track, know where to expect harmonics
[Spectrograms: mixture and harmonic-masked output, 0–4 kHz over 0–8 s]
Harmonic Filtering (Avery Wang 1995)
• Given a pitch track, could use a time-varying comb filter to get harmonics
or: isolate each harmonic by heterodyning:
x̂(t) = Σk âk(t) cos(k ω̂0(t) t)
âk(t) = | LPF{ x(t) e^(−jk ω̂0(t) t) } |
[Spectrograms: input and harmonic-filtered outputs, 0–4 kHz over 0–8 s]
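A heterodyning sketch for a constant pitch (the slide's ω̂0(t) is time-varying; here it is held fixed, and a moving average stands in for the LPF). Note that keeping only the magnitude discards each harmonic's phase:

```python
import numpy as np

def moving_average(x, n):
    # crude low-pass filter: length-n moving average (complex-safe)
    return np.convolve(x, np.ones(n) / n, mode="same")

def heterodyne_harmonics(x, f0, fs, n_harm, lpf_len=801):
    """Per-harmonic envelopes by heterodyning: multiply by
    exp(-j*2*pi*k*f0*t) to shift harmonic k down to DC, low-pass,
    take the magnitude; then resynthesize the harmonic stack."""
    t = np.arange(len(x)) / fs
    y = np.zeros_like(x)
    for k in range(1, n_harm + 1):
        base = moving_average(x * np.exp(-2j * np.pi * k * f0 * t), lpf_len)
        a_k = 2.0 * np.abs(base)  # x real: each tone's power splits in two
        y += a_k * np.cos(2 * np.pi * k * f0 * t)
    return y

fs = 8000
t = np.arange(4096) / fs
x = 0.5 * np.cos(2 * np.pi * 200 * t) + 0.3 * np.cos(2 * np.pi * 400 * t)
y = heterodyne_harmonics(x, 200.0, fs, n_harm=2)
```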
Nonnegative Matrix Factorization
• Decomposition of spectrograms into templates + activations
(Lee & Seung ’99; Abdallah & Plumbley ’04; Smaragdis & Brown ’03; Virtanen ’07)
X = W · H
fast & forgiving gradient-descent algorithm
fits neatly with time-frequency masking
[Figure: rows of H (activations vs. time, DFT slices) and bases from all W (vs. frequency, DFT index); examples after Smaragdis ’04, Virtanen ’03]
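The Lee & Seung multiplicative updates (Euclidean version) fit in a dozen lines; a sketch on a toy two-note “spectrogram” (the shapes and toy data are invented for illustration):

```python
import numpy as np

def nmf(X, rank, n_iter=500, eps=1e-9, seed=0):
    """Euclidean NMF by Lee & Seung multiplicative updates:
    X (F x T, nonnegative) ~= W (F x rank) @ H (rank x T).
    The update ratios are nonnegative, so W and H stay nonnegative."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + eps
    H = rng.random((rank, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy spectrogram: two "notes" = fixed spectra with distinct activations
w1, w2 = np.array([1.0, 0.0, 0.5, 0.0]), np.array([0.0, 1.0, 0.0, 0.5])
h1 = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
h2 = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
X = np.outer(w1, h1) + np.outer(w2, h2)
W, H = nmf(X, rank=2)
err = np.linalg.norm(X - W @ H)
```

The tie-in to masking: after fitting, component k's reconstruction (W[:, k:k+1] @ H[k:k+1]) divided by W @ H gives a soft time-frequency mask for that component.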
4. Model-Based Separation
• When N (sources) > M (sensors), need
additional constraints to solve problem
e.g. assumption of single dominant pitch
• Can assemble into a model M of source si
defines set of “possible” waveforms
..probabilistically: Pr(si | M)
• Source separation from a mixture as inference:
ŝ = {ŝi} = argmax_s Pr(x | s, A) P(A) Πi Pr(si | M)
where Pr(x | s, A) = N(x; As, Σ)
Source Models
Ozerov, Vincent & Bimbot 2010
http://bass-db.gforge.inria.fr/fasst/
• Can constrain:
source spectra (e.g. harmonic, noisy, smooth)
temporal evolution (piecewise-continuous)
spatial arrangements (point-source, diffuse)
[Spectrograms: stereo instantaneous mix and separated sources 1–3 (Frequency 0–4000 Hz vs. Time 0–3 s). Music: Shannon Hurley / Mix: Michel Desnoues & Alexey Ozerov / Separations: Alexey Ozerov]
• Factored decomposition:
Vj^ex = Wj^ex Uj^ex Gj^ex Hj^ex
Summary
• Acoustic Source Mixtures
The normal situation in real-world sounds
• Spatial filtering
Canceling sources by subtracting channels
• Time-Frequency Masking
Selecting spectrogram cells
• Model-Based Separation
Exploiting regularities in source signals
References
S. Abdallah & M. Plumbley, “Polyphonic transcription by non-negative sparse coding of power spectra,” Proc. Int. Symp. Music Info. Retrieval, 2004.
C. Avendano, “Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications,” Proc. IEEE WASPAA, Mohonk, pp. 55-58, 2003.
A. Bell & T. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, vol. 7 no. 6, pp. 1129-1159, 1995.
J. Benesty, J. Chen & Y. Huang, Microphone Array Signal Processing, Springer, 2008.
J. Blauert, Spatial Hearing, MIT Press, 1996.
A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
G. Brown & M. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8 no. 4, pp. 297-336, 1994.
P. Denbigh & J. Zhao, “Pitch extraction and separation of overlapping speech,” Speech Communication, vol. 11 no. 2-3, pp. 119-125, 1992.
D. Lee & S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788-791, 1999.
A. Ozerov, E. Vincent & F. Bimbot, “A general flexible framework for the handling of prior information in audio source separation,” INRIA Tech. Rep. 7453, Nov. 2010.
S. Roweis, “One microphone source separation,” Adv. Neural Info. Proc. Sys., pp. 793-799, 2001.
P. Smaragdis & J. Brown, “Non-negative matrix factorization for polyphonic music transcription,” Proc. IEEE WASPAA, pp. 177-180, Oct. 2003.
T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Tr. Audio, Speech, & Lang. Proc., vol. 15 no. 3, pp. 1066-1074, 2007.
A. Wang, Instantaneous and frequency-warped signal processing techniques for auditory source separation, Ph.D. dissertation, Stanford CCRMA, 1995.