Model-Based Separation in Humans and Machines

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
2006-03-08

1. Audio Source Separation
2. Human Performance
3. Model-Based Separation


1. Audio Source Separation

• Sounds rarely occur in isolation
  - but organizing mixtures is a problem
  - for humans and machines
[Figure: waveforms and spectrograms of a multi-channel recording (chan 0, 1, 3, 9), freq/kHz and level/dB vs. time/sec; sound example: mr-20001102-1440-cE+1743.wav]


Audio Separation Scenarios

• Interactive voice systems
  - human-level understanding is expected
• Speech prostheses
  - crowds: #1 complaint of hearing aid users
• Multimedia archive analysis
  - identifying and isolating speech and other events
• Surveillance...
[Figure: spectrogram of a personal-audio recording, freq/kHz and level/dB vs. time/sec; sound example: pa-2004-09-10-131540.wav]


How Can We Separate?

• By between-sensor differences (spatial cues)
  - 'steer a null' onto a compact interfering source (a minimal sketch follows below)
• By finding a 'separable representation'
  - spectral? but speech is broadband
  - periodicity? maybe – for voiced speech
  - something more signal-specific...
• By inference (based on knowledge/models)
  - speech is redundant → use part to guess the remainder
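To make the 'steer a null' bullet concrete, here is a minimal two-sensor sketch (not from the slides): if the interferer reaches sensor 2 a known d samples after sensor 1, delaying channel 1 by d and subtracting cancels it exactly, while the target, which arrives with a different delay, survives as a filtered residue. The synthetic signals, delays, and function name are illustrative assumptions.

```python
import numpy as np

def steer_null(x1, x2, d):
    """Cancel a source that reaches sensor 2 exactly d samples after sensor 1:
    y[n] = x2[n] - x1[n-d]; the first d output samples are left at zero."""
    y = np.zeros_like(x2)
    y[d:] = x2[d:] - x1[:-d]
    return y

# Toy mixture: target and interferer arrive with different inter-sensor delays.
fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)            # stand-in for the wanted voice
interf = np.sign(np.sin(2 * np.pi * 97 * t))    # stand-in for the interferer
x1 = target + interf                            # sensor 1
x2 = np.roll(target, 3) + np.roll(interf, 8)    # sensor 2 (circular delay, for brevity)

y = steer_null(x1, x2, 8)   # interferer nulled; a comb-filtered target remains
```

Real arrays replace the integer delay with filters estimated from the data (as in the ICA systems on the 'Separation Approaches' slide later), but the geometry is the same: a two-sensor pair buys one null, aimed at one compact source.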
Outline

1. Audio Source Separation
2. Human Performance
   - scene analysis
   - speech separation by location
   - speech separation by voice characteristics
3. Model-Based Separation


Auditory Scene Analysis                              [Bregman'90, Darwin & Carlyon'95]

• Listeners organize sound mixtures into discrete perceived sources, based on within-signal cues (audio + ...):
  - common onset + continuity
  - harmonicity
  - spatial, modulation, ...
  - learned "schema"
[Figure: spectrogram of the Reynolds-McAdams oboe example, freq/kHz and level/dB vs. time/sec; sound example: reynolds-mcadams-dpwe.wav]


Speech Mixtures: Spatial Separation                  [Brungart et al.'02]

• Task: Coordinate Response Measure (CRM)
  - "Ready Baron go to green eight now"
  - 256 variants, 16 speakers
  - correct = color and number for "Baron"
  - sound example: crm-11737+16515.wav
• Accuracy as a function of spatial separation:
[Figure: CRM accuracy vs. spatial separation, including a condition with A and B the same speaker; note the range effect]


Separation by Vocal Differences                      [Brungart et al.'01]

• CRM again (same spatial location), varying the level and voice character
  - energetic vs. informational masking
• Level difference can help in speech segregation:
  - listen for the quieter talker!
[Figure: "Monaural Informational Masking: Level Differences"]


Varying the Number of Voices                         [Brungart et al.'01]

• Two voices OK; more than two voices harder (same spatial origin)
  - performance drops dramatically when a 3rd or 4th talker is added to the stimulus
• Mix of N voices tends to speech-shaped noise...
[Figure: "Monaural Informational Masking: Number of Talkers"; 2 vs. 3 vs. 4 talkers]


Outline

1. Audio Source Separation
2. Human Performance
3. Model-Based Separation
   - Separation vs. Inference
   - The Speech Fragment Decoder


Separation Approaches

• ICA
  - multi-channel
  - fixed filtering
  - perfect separation – maybe!
  [Diagram: target + interference → n mics → fixed processing → estimate]
• CASA / Model-based
  - single-channel
  - time-varying filtering
  - approximate separation
  [Diagram: mixture → single mic → time-varying mask → estimate]
• Very different approaches...


Separation vs. Inference                             [Ellis'96]

• Ideal separation is rarely possible
  - i.e. no projection can completely remove overlaps
• Overlaps → ambiguity
  - scene analysis = find the "most reasonable" explanation
• Ambiguity can be expressed probabilistically
  - i.e. posteriors of sources {Si} given observations X:
      P({Si} | X) ∝ P(X | {Si}) · P({Si})
    (combination physics × source models)
• Better source models → better inference
  - learn from examples?


Model-Based Separation                               [Varga & Moore'90, Roweis'03...]

• Central idea: employ strong learned constraints to disambiguate possible sources:
      {Si} = argmax_{Si} P(X | {Si})
• e.g. fit a speech-trained vector quantizer to the mixed spectrum, then separate via a time-frequency mask (a minimal sketch follows below)
• MAXVQ results: denoising (from Roweis'03)
  - training: 300 sec of isolated speech (TIMIT) to fit 512 codewords, and 100 sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0 dB SNR
  - sound examples: mpgr+noise.wav, mpgr+noise+tfmask.wav
[Figure: spectrograms (frequency vs. time) of the mixture, the fitted codewords, and the masked resynthesis]
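A minimal sketch of the vector-quantizer idea on the slide above, loosely in the spirit of Roweis's MAXVQ refiltering: fit separate codebooks to speech and noise log-spectra, explain each mixture frame as the element-wise max of one (speech, noise) codeword pair, and keep the cells where the speech codeword dominates. The codebook sizes echo the slide (512/32); the log-max approximation, the sklearn k-means fit, and all names here are illustrative assumptions, not the published algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(train_logspec, k):
    """Fit a k-codeword VQ codebook to training log-magnitude spectra (rows = frames)."""
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(train_logspec)
    return km.cluster_centers_

def maxvq_mask(mix_logspec, speech_cb, noise_cb):
    """Explain each mixture frame as the element-wise max of one speech and one
    noise codeword (the 'log-max' mixing approximation), then return a binary
    time-frequency mask marking cells where the speech codeword is louder."""
    # Precompute every (speech, noise) codeword pairing: shape (Ks*Kn, n_bins).
    pair_max = np.maximum(speech_cb[:, None, :], noise_cb[None, :, :])
    Ks, Kn, n_bins = pair_max.shape
    pair_max = pair_max.reshape(Ks * Kn, n_bins)
    mask = np.zeros_like(mix_logspec, dtype=bool)
    for t, frame in enumerate(mix_logspec):
        best = np.argmin(((frame - pair_max) ** 2).sum(axis=1))  # best-fitting pair
        s, n = divmod(best, Kn)
        mask[t] = speech_cb[s] > noise_cb[n]   # speech-dominant cells in this frame
    return mask

# Usage sketch (log-magnitude spectrogram matrices, frames x bins, assumed given):
# speech_cb = fit_codebook(speech_train, 512)   # cf. 512 speech codewords on the slide
# noise_cb  = fit_codebook(noise_train, 32)     # cf. 32 noise codewords
# mask = maxvq_mask(mix, speech_cb, noise_cb)   # mask the mixture STFT, resynthesize
```

Masked resynthesis then applies `mask` to the mixture STFT and inverts it, which is the "separate via T-F mask" step on the slide.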
Separation or Description?

• Are isolated waveforms required?
  - clearly sufficient, but may not be necessary
  - not part of perceptual source separation!
• Integrate separation with application? e.g. speech recognition:
  - separation route: mix → identify target energy → t-f masking + resynthesis → ASR (speech models) → words
  - inference route: mix → find best words model (speech models + source knowledge) → words
  - output = abstract description of signal (words)


The Speech Fragment Decoder                          [Barker et al.'05]

• Standard classification chooses between models M to match source features X:
      M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)
• Mixtures: observed features Y, segregation S, all related by P(X | Y, S)
  [Diagram: observation Y(f), source X(f), and segregation S along the frequency axis]
• Match the 'uncorrupt' spectrum to ASR models using missing data
• Joint classification of model M and segregation S, maximizing:
      P(M, S | Y) = P(M) · ∫ P(X | M) · P(X | Y, S) / P(X) dX · P(S | Y)
  - isolated source model P(X | M); segregation model P(S | Y)
  - with P(X) no longer constant!


Using CASA cues

• CASA can help search
  - consider only segregations made from CASA chunks
• CASA can rate segregations
  - is this segregation worth considering? how likely is it?
  - construct P(S | Y) to reward CASA qualities (a toy sketch follows below):
    - frequency proximity
    - common onset (onset/continuity: frequency bands belong together)
    - harmonicity (periodicity/harmonicity: a time-frequency region must be whole)
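A toy rendering of the joint score from the last two slides, under diagonal-Gaussian source models and hard segregation masks: cells assigned to the target are scored against the model directly; the remaining cells only bound the clean level from above (X ≤ Y), the usual missing-data trick; and a stand-in P(S|Y) rewards CASA-style cue agreement. The Gaussian form, the bounded-marginalization shortcut via scipy's norm, and every name here are assumptions for illustration, not the Barker et al. implementation.

```python
import numpy as np
from scipy.stats import norm

def missing_data_loglik(y, seg_mask, mu, sigma):
    """log P(Y | M, S) under a diagonal-Gaussian model (mu, sigma) of the clean
    source: cells the segregation assigns to the target are scored directly;
    in the other cells the clean level is only bounded above by the observation
    (X <= Y), so we integrate the Gaussian up to y via its log-CDF."""
    ll = norm.logpdf(y[seg_mask], mu[seg_mask], sigma[seg_mask]).sum()
    ll += norm.logcdf(y[~seg_mask], mu[~seg_mask], sigma[~seg_mask]).sum()
    return ll

def segregation_logprior(seg_mask, harmonicity, onset_coherence, w=(1.0, 1.0)):
    """Stand-in log P(S | Y): reward segregations whose selected cells score
    high on per-cell CASA cues (harmonicity, common onset), per the slide above."""
    return (w[0] * harmonicity[seg_mask].sum()
            + w[1] * onset_coherence[seg_mask].sum())

def score_hypothesis(y, seg_mask, model, harmonicity, onset_coherence):
    """One (model M, segregation S) hypothesis from the joint search:
    log P(M) + log P(Y | M, S) + log P(S | Y). The decoder keeps the argmax."""
    mu, sigma, log_prior_m = model
    return (log_prior_m
            + missing_data_loglik(y, seg_mask, mu, sigma)
            + segregation_logprior(seg_mask, harmonicity, onset_coherence))
```

In the real decoder the models are HMM state sequences and the candidate segregations are assembled from CASA fragments, so the argmax over (M, S) is folded into the Viterbi search rather than enumerated as above.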
Speech-Fragment Recognition

• CASA-based fragments give extra gain over missing-data recognition
[Figure: recognition accuracy results, from Barker et al.'05]


Evaluating Separation

• Real-world speech tasks
  - crowded environments
  - M. Cooke & T.-W. Lee: "Speech Separation Challenge"
• Metric
  - human intelligibility?
  - 'diarization' annotation (but not transcription)
[Figure: "Personal Audio - Speech + Noise": spectrogram (freq/kHz, level/dB) plus pitch track and speaker-active ground truth (lags) vs. time/sec; sound example: ks-noisyspeech.wav]


Summary & Conclusions

• Listeners do well separating speech
  - using spatial location
  - using source-property variations
• Machines do less well
  - difficult to apply enough constraints
  - need to exploit signal detail
• Models capture constraints
  - learn from the real world
  - adapt to sources
• Inferring state (≈ recognition) is a promising approach to separation


Sources / See Also

• NSF/AFOSR Montreal Workshops '03, '04
  - www.ebire.org/speechseparation/
  - labrosa.ee.columbia.edu/Montreal2004/
  - as well as the resulting book...
• Hanse meeting: www.lifesci.sussex.ac.uk/home/Chris_Darwin/Hanse/
• DeLiang Wang's ICASSP'04 tutorial: www.cse.ohio-state.edu/~dwang/presentation.html
• Martin Cooke's NIPS'02 tutorial: www.dcs.shef.ac.uk/~martin/nips.ppt


References 1/2

[Barker et al. '05] J. Barker, M. Cooke, D. Ellis, "Decoding speech in the presence of other sources," Speech Communication 45, 5–25, 2005.
[Bell & Sejnowski '95] A. Bell & T. Sejnowski, "An information maximization approach to blind separation and blind deconvolution," Neural Computation 7, 1129–1159, 1995.
[Blin et al. '04] A. Blin, S. Araki, S. Makino, "A sparseness mixing matrix estimation (SMME) solving the underdetermined BSS for convolutive mixtures," ICASSP, IV-85–88, 2004.
[Bregman '90] A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
[Brungart '01] D. Brungart, "Informational and energetic masking effects in the perception of two simultaneous talkers," JASA 109(3), March 2001.
[Brungart et al. '01] D. Brungart, B. Simpson, M. Ericson, K. Scott, "Informational and energetic masking effects in the perception of multiple simultaneous talkers," JASA 110(5), Nov. 2001.
[Brungart et al. '02] D. Brungart & B. Simpson, "The effects of spatial separation in distance on the informational and energetic masking of a nearby speech signal," JASA 112(2), Aug. 2002.
[Brown & Cooke '94] G. Brown & M. Cooke, "Computational auditory scene analysis," Computer Speech & Language 8(4), 297–336, 1994.
[Cooke et al. '01] M. Cooke, P. Green, L. Josifovski, A. Vizinho, "Robust automatic speech recognition with missing and uncertain acoustic data," Speech Communication 34, 267–285, 2001.
[Cooke '06] M. Cooke, "A glimpsing model of speech perception in noise," submitted to JASA.
[Darwin & Carlyon '95] C. Darwin & R. Carlyon, "Auditory grouping," Handbook of Perception & Cognition 6: Hearing, 387–424, Academic Press, 1995.
[Ellis '96] D. Ellis, "Prediction-Driven Computational Auditory Scene Analysis," Ph.D. thesis, MIT EECS, 1996.
[Hu & Wang '04] G. Hu & D.L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Networks 15(5), Sep. 2004.
[Okuno et al. '99] H. Okuno, T. Nakatani, T. Kawabata, "Listening to two simultaneous speeches," Speech Communication 27, 299–310, 1999.


References 2/2

[Ozerov et al. '05] A. Ozerov, P. Phillippe, R. Gribonval, F. Bimbot, "One microphone singing voice separation using source-adapted models," Worksh. on Apps. of Sig. Proc. to Audio & Acous., 2005.
[Pearlmutter & Zador '04] B. Pearlmutter & A. Zador, "Monaural Source Separation using Spectral Cues," Proc. ICA, 2004.
[Parra & Spence '00] L. Parra & C. Spence, "Convolutive blind source separation of non-stationary sources," IEEE Trans. Speech & Audio, 320–327, 2000.
[Reyes et al. '03] M. Reyes-Gómez, B. Raj, D. Ellis, "Multi-channel source separation by beamforming trained with factorial HMMs," Worksh. on Apps. of Sig. Proc. to Audio & Acous., 13–16, 2003.
[Roman et al. '02] N. Roman, D.-L. Wang, G. Brown, "Location-based sound segregation," ICASSP, I-1013–1016, 2002.
[Roweis '03] S. Roweis, "Factorial models and refiltering for speech separation and denoising," EuroSpeech, 2003.
[Schimmel & Atlas '05] S. Schimmel & L. Atlas, "Coherent Envelope Detection for Modulation Filtering of Speech," ICASSP, I-221–224, 2005.
[Slaney & Lyon '90] M. Slaney & R. Lyon, "A Perceptual Pitch Detector," ICASSP, 357–360, 1990.
[Smaragdis '98] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Intl. Workshop on Independence & Artificial Neural Networks, Tenerife, Feb. 1998.
[Seltzer et al. '02] M. Seltzer, B. Raj, R. Stern, "Speech recognizer-based microphone array processing for robust hands-free speech recognition," ICASSP, I-897–900, 2002.
[Varga & Moore '90] A. Varga & R. Moore, "Hidden Markov Model decomposition of speech and noise," ICASSP, 845–848, 1990.
[Vincent et al. '06] E. Vincent, R. Gribonval, C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Speech & Audio, in press.
[Yilmaz & Rickard '04] O. Yilmaz & S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Sig. Proc. 52(7), 1830–1847, 2004.