ELEN E4896 MUSIC SIGNAL PROCESSING
Lecture 14: Source Separation

1. Sources, Mixtures, & Perception
2. Spatial Filtering
3. Time-Frequency Masking
4. Model-Based Separation

Dan Ellis
Dept. of Electrical Engineering, Columbia University
dpwe@ee.columbia.edu
http://www.ee.columbia.edu/~dpwe/e4896/
2013-04-29

1. Sources, Mixtures, & Perception

• Sound is a linear process (superposition)
  - no “opacity” (unlike vision)
  - sources → “auditory scenes” (polyphony)
• Humans perceive discrete sources... a subjective construct
[Figure: spectrogram (0-4 kHz, 0-12 s) of a six-layer mixture: evil voice, pleasant voice, stab, rumble, choir, strings; audio: 02_m+s-15-evil-goodvoice-fade]

Spatial Hearing (Blauert ’96)

• People perceive sources based on spatial (binaural) cues: ITD, ILD
  - head shadow (high frequencies)
  - path-length difference from the source to each ear
[Figure: left/right waveforms of a recorded source, showing small interaural time and level differences]

Auditory Scene Analysis (Bregman ’90)

• Spatial cues may not be enough, or may be unavailable
  - e.g. a single-channel signal
• The brain uses signal-intrinsic cues to form sources
  - common onset, harmonicity
[Figure: spectrogram of the Reynolds-McAdams oboe example]

Auditory Scene Analysis

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman ’90)

• Quite a challenge!
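The interaural time difference (ITD) cue above can be illustrated with a tiny simulation. This is only a sketch: the sample rate, the 8-sample lag, and the broadband noise source are invented for illustration, and the lag is recovered as the peak of a cross-correlation between the two "ear" signals.

```python
import numpy as np

# Toy ITD illustration: the same source reaches the far ear a few
# samples later; cross-correlation recovers that interaural lag.
fs = 16000                       # sample rate (Hz) -- assumed for the example
delay = 8                        # true interaural lag in samples (~0.5 ms)
rng = np.random.default_rng(0)
src = rng.standard_normal(4000)  # broadband "source" signal

left = src
right = np.concatenate([np.zeros(delay), src[:-delay]])  # delayed copy

# Cross-correlate over a small window of candidate lags
lags = np.arange(-20, 21)
xcorr = [np.dot(left[20:-20], right[20 + k:len(right) - 20 + k]) for k in lags]
itd_samples = lags[int(np.argmax(xcorr))]
print(itd_samples)  # 8 -> right channel lags left by 8 samples (~0.5 ms)
```

A positive lag here means the right-ear signal is delayed, i.e. the source is closer to the left ear; real binaural ITDs are at most around 0.7 ms.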
Audio Mixing

• Studio recording combines separate tracks into, e.g., 2 channels (stereo)
  - different levels
  - panning
  - other effects
• Stereo intensity panning
  - manipulates ILD only
  - constant total power across pan positions
  - more channels: use just the nearest pair?
[Figure: constant-power pan law, L and R gains vs. pan position]

2. Spatial Filtering

• N sources detected by M sensors
  - need M ≥ N degrees of freedom (else need other constraints)
• Consider the 2 × 2 case: two sources s1, s2 picked up by two directional mics m1, m2
  - mixing matrix:
      [m1]   [a11  a12] [s1]
      [m2] = [a21  a22] [s2]
  - recover source estimates by inverting: [ŝ1; ŝ2] = Â⁻¹ m

Source Cancelation

• Simple 2 × 2 example:
      m1(t) = s1(t) + 0.5 s2(t)
      m2(t) = 0.8 s1(t) + s2(t)
  so m1(t) − 0.5 m2(t) = 0.6 s1(t)
• If there is no delay and the sums are linearly independent, each combination can cancel one source

Independent Component Analysis (Bell & Sejnowski ’95)

• Can separate “blind” combinations by maximizing the independence of the outputs
  - adjust the unmixing weights by gradient descent on mutual information: −∂MutInfo/∂a
  - or use kurtosis as a proxy for independence:
      kurt(y) = E[((y − µ)/σ)⁴] − 3
[Figure: scatter plot of the two mixtures, and kurtosis vs. unmixing direction]

Microphone Arrays (Benesty, Chen, Huang ’08)

• If interference is diffuse, can simply boost energy from the target direction
  - e.g. shotgun mic: delay-and-sum over elements spaced D apart
  - off-axis sound is attenuated, at the cost of spectral coloration (the response depends on λ relative to D)
  - many variants: filter-and-sum, sidelobe cancelation, ...
[Figure: delay-and-sum array geometry and directional responses for λ = 4D, 2D, D]
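The 2 × 2 cancelation arithmetic above can be checked numerically. A minimal sketch using the same example mixing matrix (the test signals are arbitrary random waveforms, not the lecture's audio):

```python
import numpy as np

# Two sources mixed by the slide's example matrix A = [[1, 0.5], [0.8, 1]]:
#   m1 = s1 + 0.5*s2,  m2 = 0.8*s1 + s2
rng = np.random.default_rng(1)
s1 = rng.standard_normal(1000)
s2 = rng.standard_normal(1000)
A = np.array([[1.0, 0.5], [0.8, 1.0]])
m1, m2 = A @ np.vstack([s1, s2])

# Cancel s2 with one combination: m1 - 0.5*m2 = (1 - 0.5*0.8)*s1 = 0.6*s1
out = m1 - 0.5 * m2
print(np.allclose(out, 0.6 * s1))  # True

# Full unmixing: invert the (known) mixing matrix, s_hat = A^-1 m
s_hat = np.linalg.inv(A) @ np.vstack([m1, m2])
print(np.allclose(s_hat, np.vstack([s1, s2])))  # True
```

In practice A is unknown and possibly frequency-dependent, which is exactly what ICA and beamforming address.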
3. Time-Frequency Masking (Brown & Cooke ’94; Roweis ’01)

• What if there is only one channel?
  - cannot have a fixed cancelation
  - but could have fast time-varying filtering: keep the spectrogram cells dominated by the target, zero the rest
• The trick is finding the right mask...
[Figure: spectrograms of a mixture before and after masking]

Time-Frequency Masking

• Works well for overlapping voices
  - mix female and male voices; “oracle” masks derived from the clean sources label each cell, and resynthesis from the masked spectrograms recovers each voice
  - open question: what time-frequency resolution?
[Figure: spectrograms of the original voices, the mix with oracle labels, and the oracle-based resyntheses; audio: cooke-v3n7.wav, cooke-v3msk-ideal.wav, cooke-n7msk-ideal.wav]

Pan-Based Filtering (Avendano 2003)

• Can use time-frequency masking even for stereo
  - e.g. calculate a “panning index” (per-cell ILD), then mask the cells matching the target’s ILD
  - example: 1024-pt window, ILD mask −2.5 .. +1.0 dB
[Figure: panning-index (ILD) display and the resulting masked spectrogram]

Harmonic-Based Masking (Denbigh & Zhao 1992)

• Time-frequency masking can be used to pick out harmonics
  - given a pitch track, we know where to expect the harmonics
[Figure: spectrograms before and after harmonic masking]

Harmonic Filtering (Avery Wang 1995)

• Given a pitch track, could use a time-varying comb filter to extract the harmonics
• Or isolate each harmonic by heterodyning:
      x̂(t) = Σ_k â_k(t) cos(k ω̂0(t) t)
      â_k(t) = | LPF{ x(t) e^(−jk ω̂0(t) t) } |
[Figure: spectrograms of the original, the extracted harmonics, and the resynthesis]

Nonnegative Matrix Factorization (Lee & Seung ’99; Smaragdis & Brown ’03; Abdallah & Plumbley ’04; Virtanen ’07)

• Decomposition of a spectrogram into templates + activations:
      X = W · H
  - fast & forgiving gradient-descent algorithm
  - fits neatly with time-frequency masking
[Figure: spectrogram X (frequency vs. DFT slices), basis spectra from the columns of W, and the rows of H as activations]
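The factorization X = W·H can be sketched with Lee & Seung's multiplicative updates on a synthetic nonnegative "spectrogram". This is an illustrative toy under assumed settings (matrix sizes, rank 2, Euclidean cost, 500 iterations are all choices for the demo, not the lecture's):

```python
import numpy as np

# Minimal NMF sketch: multiplicative updates for the Euclidean cost
# ||X - WH||^2, on a synthetic nonnegative "spectrogram" of exact rank 2.
rng = np.random.default_rng(0)
F, T, R = 64, 100, 2             # freq bins, time frames, components
W_true = rng.random((F, R))      # spectral templates
H_true = rng.random((R, T))      # activations
X = W_true @ H_true              # synthetic nonnegative data

W = rng.random((F, R))           # random nonnegative initialization
H = rng.random((R, T))
eps = 1e-9                       # guard against division by zero
for _ in range(500):
    H *= (W.T @ X) / (W.T @ W @ H + eps)   # update activations
    W *= (X @ H.T) / (W @ H @ H.T + eps)   # update templates

err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(err)  # small relative error: X is well approximated by W @ H
```

Each component's contribution W[:, r] @ H[r, :] can then be turned into a soft time-frequency mask by dividing it by the full reconstruction W @ H, connecting NMF back to masking-based separation.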
[Audio examples: Smaragdis ’04, Virtanen ’03]

4. Model-Based Separation

• When N (sources) > M (sensors), need additional constraints to solve the problem
  - e.g. the assumption of a single dominant pitch
• Can assemble the constraints into a model M of source s_i
  - defines the set of “possible” waveforms
  - ...probabilistically: Pr(s_i | M)
• Source separation from a mixture as inference:
      ŝ = {ŝ_i} = argmax_s Pr(x | s, A) P(A) Π_i Pr(s_i | M)
  where Pr(x | s, A) = N(x | As, Σ)

Source Models (Ozerov, Vincent & Bimbot 2010; http://bass-db.gforge.inria.fr/fasst/)

• Can constrain:
  - source spectra (e.g. harmonic, noisy, smooth)
  - temporal evolution (piecewise-continuous)
  - spatial arrangement (point-source, diffuse)
• Factored decomposition:
      V_j^ex = W_j^ex U_j^ex G_j^ex H_j^ex
[Figure: spectrograms of a stereo instantaneous mix and three separated sources. Music: Shannon Hurley / Mix: Michel Desnoues & Alexey Ozerov / Separations: Alexey Ozerov]

Summary

• Acoustic source mixtures: the normal situation in real-world sounds
• Spatial filtering: canceling sources by subtracting channels
• Time-frequency masking: selecting spectrogram cells
• Model-based separation: exploiting regularities in source signals

References

S. Abdallah & M. Plumbley, “Polyphonic transcription by non-negative sparse coding of power spectra,” Proc. Int. Symp. Music Info. Retrieval, 2004.
C. Avendano, “Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications,” Proc. IEEE WASPAA, Mohonk, pp. 55-58, 2003.
A. Bell & T. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995.
J. Benesty, J. Chen, & Y. Huang, Microphone Array Signal Processing, Springer, 2008.
J. Blauert, Spatial Hearing, MIT Press, 1996.
A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
G. Brown & M. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, no. 4, pp. 297-336, 1994.
P. Denbigh & J. Zhao, “Pitch extraction and separation of overlapping speech,” Speech Communication, vol. 11, no. 2-3, pp. 119-125, 1992.
D. Lee & S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788-791, 1999.
A. Ozerov, E. Vincent, & F. Bimbot, “A general flexible framework for the handling of prior information in audio source separation,” INRIA Tech. Rep. 7453, Nov. 2010.
S. Roweis, “One microphone source separation,” Adv. Neural Info. Proc. Sys., pp. 793-799, 2001.
P. Smaragdis & J. Brown, “Non-negative matrix factorization for polyphonic music transcription,” Proc. IEEE WASPAA, pp. 177-180, Oct. 2003.
T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Trans. Audio, Speech, & Lang. Proc., vol. 15, no. 3, pp. 1066-1074, 2007.
A. Wang, Instantaneous and frequency-warped signal processing techniques for auditory source separation, Ph.D. dissertation, Stanford CCRMA, 1995.