Using Speech Models for Separation
Dan Ellis
Comprising the work of Michael Mandel and Ron Weiss
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/
2010-04-20

1. Eigenvoice Speaker Models
2. Spatial Parameter Models in Reverb
3. Combining Source + Spatial Speech Models

1. Speech Separation Using Models
• Cooke & Lee's Monaural Speech Separation Task
  - pairs of short, grammatically-constrained utterances:
    <command:4> <color:4> <preposition:4> <letter:25> <number:10> <adverb:4>
  - e.g. "bin white by R 8 again"
  - task: report the letter + number spoken with "white"
• Separation depends on source constraints - the more the better
  [Figure: two model-based routes from the mix, both drawing on speech models / source knowledge: ASR-driven separation (t-f masking + resynthesis, identify target energy) vs. direct identification (find the best word models, report the words)]

Speech Mixture Recognition (Kristjansson, Hershey et al. '06)
• Speech recognizers contain speech models
  - ASR is just argmax_W P(W | X)
• Recognize mixtures with a Factorial HMM
  - one model + state sequence for each voice
  - exploit sequence constraints, speaker differences
  [Figure: two state sequences (model 1, model 2) jointly explaining the observations over time]
  - separation relies on a detailed speaker model

Eigenvoices (Kuhn et al. '98, '00; Weiss & Ellis '07, '08, '09)
• Idea: find a speaker model parameter space
  - generalize without losing detail?
  [Figure: individual speaker models projected onto speaker subspace bases]
• Eigenvoice model (a sketch of this adaptation appears at the end of this section):
    µ = µ̄ + Uw + Bh
  - µ: adapted model, µ̄: mean voice, U: eigenvoice bases, w: weights, B: channel bases, h: channel weights

Eigenvoice Bases
• Mean model: 280 states x 320 bins = 89,600 dimensions
• Eigencomponents shift formants / coloration
• Additional components for the acoustic channel
  [Figure: mean voice and eigenvoice dimensions 1-3 shown as frequency (kHz) vs. phone-state spectrograms]

Speaker-Adapted Separation (Weiss & Ellis '08)
• Factorial HMM analysis with tuning of the source model parameters = eigenvoice speaker adaptation

Speaker-Adapted Separation
• Eigenvoices for the Speech Separation task
  - speaker-adapted (SA) performs midway between speaker-dependent (SD) & speaker-independent (SI)
  [Figure: separation performance for SI, SA, and SD models]
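The eigenvoice slides above reduce a roughly 90,000-dimensional model mean to a handful of subspace weights. The following is a minimal numpy sketch of just that reconstruction, µ = µ̄ + Uw + Bh, plus a simple least-squares weight estimate. It is not the adaptation procedure used in the factorial-HMM separation system; the number of bases, the random stand-in data, and the helper names are illustrative assumptions.

```python
import numpy as np

# Sketch of eigenvoice adaptation: an adapted model mean is a point in a
# speaker subspace, mu = mu_bar + U @ w + B @ h.  Dimensions follow the
# slides (280 states x 320 bins = 89,600); everything else is illustrative.

D = 280 * 320            # flattened model mean (states x frequency bins)
n_voice_bases = 10       # illustrative number of eigenvoice bases
n_chan_bases = 3         # illustrative number of channel bases

rng = np.random.default_rng(0)
mu_bar = rng.standard_normal(D)              # mean voice (stand-in values)
U = rng.standard_normal((D, n_voice_bases))  # eigenvoice bases
B = rng.standard_normal((D, n_chan_bases))   # channel bases

def adapt(w, h):
    """Reconstruct an adapted speaker model mean from subspace weights."""
    return mu_bar + U @ w + B @ h

def estimate_weights(mu_obs):
    """Least-squares projection of an observed mean onto the joint
    eigenvoice + channel subspace (a toy stand-in for adaptation)."""
    A = np.hstack([U, B])
    wh, *_ = np.linalg.lstsq(A, mu_obs - mu_bar, rcond=None)
    return wh[:n_voice_bases], wh[n_voice_bases:]

# Round-trip check on synthetic weights
w_true = rng.standard_normal(n_voice_bases)
h_true = rng.standard_normal(n_chan_bases)
w_est, h_est = estimate_weights(adapt(w_true, h_true))
print(np.allclose(w_est, w_true), np.allclose(h_est, h_true))
```

In the system described above, the weights are not fit to a known speaker mean like this; they are tuned as part of the factorial-HMM analysis of the mixture itself.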
2. Spatial Models & Reverb (Mandel & Ellis '07)
• 2 or 3 sources in reverberation
  - assume just 2 'ears'
• Model the interaural spectrum of each source as stationary
  - interaural level and time differences (ILD, IPD)

IPD, ILD Distributions
• Source at 75° in reverberation
  [Figure: distributions of IPD, IPD residual, and ILD across frequency]
  - the IPD residual offsets the phase by ωτ (constant delay τ)
  - IPD can be fit by a single Gaussian
  - ILD needs frequency dependence

Model-Based EM Source Separation and Localization (MESSL) (Mandel & Ellis '09)
• Re-estimate source parameters
• Assign spectrogram points to sources (a simplified sketch of these E/M steps appears after the Future Work slide)
  - can model more sources than sensors
  - flexible initialization

MESSL Results
• Modeling uncertainty improves results
  - tradeoff between constraints & noisiness
  [Figure: example output SNRs of 2.45, 0.22, 12.35, 8.77, 9.12, and -2.72 dB]

MESSL Results
• Speech recognizer (Digits)
  [Figure: percent correct vs. target-to-masker ratio (dB); ground-truth masks (Human, DP-Oracle, Oracle, OracleAllRev, Mixes) vs. algorithmic masks (Human, Sawada, Mouba, MESSL-G, MESSL-ΩΩ, DUET, Mixes)]

3. Combining Spatial + Speech Models (Weiss, Mandel & Ellis '08)
• Interaural parameters give ILD_i(ω), ITD_i, Pr(X(t, ω) = S_i(t, ω))
• A speech source model can give Pr(S_i(t, ω) is a speech signal)
• Can combine into one big EM framework...
  - E-step / M-step
  - u is: Pr(cell from source i), phoneme sequence
  - Θ is: interaural params, speaker params

MESSL-SP (Source Prior): Parameter Estimation and Source Separation
(from Ron Weiss, Underdetermined Source Separation Using Speaker Subspace Models, May 4, 2009)

MESSL-SP Results
• Source models function as priors
• Interaural parameters give spatial separation
  - the source model prior improves the spatial estimate
  [Figure: time-frequency masks and SNRs - Ground truth (12.04 dB), DUET (3.84 dB), 2D FD BSS (5.41 dB), MESSL (5.66 dB), MESSL-SP (10.01 dB), MESSL-EV (10.37 dB)]

MESSL-SP Results: Experiments
• Performance as a function of distractor angle: SNR improvement vs. source separation angle
  [Figure: SNR improvement (dB) vs. separation (degrees) for 2 and 3 sources, anechoic and reverberant; methods: Ground truth, MESSL-EV, MESSL-SP, MESSL, 2S-FD-BSS, DUET]

Future Work
• Better parametric speaker models
  - limitations of eigenvoices
  - varying style
• Understanding reverb & ASR
  - early echoes: what spoils ASR?
• Models of other sources
  - eigeninstruments?
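As promised on the MESSL slide, here is a deliberately simplified sketch of the E-step/M-step idea: each source's IPD is modeled as a single Gaussian around ω·τ_i, and EM softly assigns spectrogram cells to sources. This is not the published algorithm; it keeps the candidate delays fixed and omits the ILD model, the delay grid, and the garbage source of Mandel & Ellis '09, and the function name and defaults are illustrative.

```python
import numpy as np

def messl_ipd_em(L, R, taus, sr=16000, n_iter=10):
    """Toy IPD-only variant of MESSL-style EM mask estimation.

    L, R : complex STFTs (freq x time) of the left / right channels
    taus : candidate interaural delays in seconds, one per source
           (kept fixed here; MESSL proper searches a grid of delays)
    Returns posterior masks of shape (n_src, freq, time).
    """
    n_freq, n_time = L.shape
    # Angular frequency of each bin (rad/s), assuming a one-sided STFT
    # spanning 0..Nyquist.
    omega = 2 * np.pi * np.arange(n_freq) * sr / (2 * (n_freq - 1))
    ipd = np.angle(L * np.conj(R))                # observed interaural phase difference

    taus = np.asarray(taus, dtype=float)[:, None, None]   # (n_src, 1, 1)
    n_src = taus.shape[0]
    sigma2 = np.ones(n_src)                       # IPD residual variance per source
    prior = np.full(n_src, 1.0 / n_src)           # mixing weights

    for _ in range(n_iter):
        # E-step: phase-wrapped residual around each source's predicted IPD,
        # scored under a per-source Gaussian, then normalised across sources.
        resid = np.angle(np.exp(1j * (ipd[None] - omega[None, :, None] * taus)))
        log_lik = (-0.5 * resid**2 / sigma2[:, None, None]
                   - 0.5 * np.log(2 * np.pi * sigma2[:, None, None]))
        log_post = np.log(prior)[:, None, None] + log_lik
        log_post -= np.logaddexp.reduce(log_post, axis=0, keepdims=True)
        post = np.exp(log_post)

        # M-step: re-estimate residual variances and source priors.
        sigma2 = (post * resid**2).sum(axis=(1, 2)) / post.sum(axis=(1, 2))
        prior = post.mean(axis=(1, 2))

    return post   # soft time-frequency masks, one per source
```

The returned posteriors act as soft time-frequency masks. The full model additionally fits frequency-dependent ILD parameters and includes a garbage source intended to absorb reverberant energy, which is what gives the tradeoff between constraints and noisiness shown in the MESSL results above.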
Summary & Conclusions
• Source models provide the constraints that make scene analysis possible
• Eigenvoices (a model subspace) can provide detailed models that still generalize
• Spatial parameters can identify more sources than sensors in reverb (MESSL)
• Source + spatial models can be combined