Using Source Models in Speech Separation Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu 1. 2. 3. 4. http://labrosa.ee.columbia.edu/ Mixtures, Separation, and Models Monaural Speech Separation Binaural Speech Separation Conclusions Source Models in Separation - Dan Ellis 2007-11-09 - 1 /17 LabROSA Overview Information Extraction Music Recognition Environment Separation Machine Learning Retrieval Signal Processing Speech Source Models in Separation - Dan Ellis 2007-11-09 - 2 /17 1. Mixtures, Separation, and Models • Sounds rarely occur in isolation .. so analyzing mixtures is a problem .. for humans and machines freq / kHz chan 3 freq / kHz chan 1 4 freq / kHz chan 0 mr-2000-11-02-14:57:43 4 4 2 20 0 -20 -40 -60 level / dB 2 2 chan 9 freq / kHz mr-20001102-1440-cE+1743.wav 4 2 0 0 1 2 3 4 5 Source Models in Separation - Dan Ellis 6 7 8 9 time / sec 2007-11-09 - 3 /17 Mixture Organization Scenarios • Interactive voice systems human-level understanding is expected • Speech prostheses crowds: #1 complaint of hearing aid users • Archive analysis freq / kHz identifying and isolating sound events dpwe-2004-09-10-13:15:40 8 20 6 0 4 -20 2 0 -40 0 1 2 3 4 5 6 7 8 9 level / dB time / sec • Unmixing/remixing/enhancement... Source Models in Separation - Dan Ellis 2007-11-09 - 4 /17 Separation vs. Inference • Ideal separation is rarely possible many situations where overlaps cannot be removed • Overlaps → Ambiguity scene analysis = find “most reasonable” explanation • Ambiguity can be expressed probabilistically i.e. posteriors of sources {Si} given observations X: P({Si}| X) P(X |{Si}) P({Si}) combination physics source models search over {Si} ?? source models → better inference • Better .. learn from examples? Source Models in Separation - Dan Ellis 2007-11-09 - 5 /17 Approaches to Separation ICA CASA •Multi-channel •Single-channel •Fixed filtering •Time-var. filter •Perfect separation •Approximate – maybe! separation target x interference n - mix proc x^ + mix x+n ^ n s stft proc mask istft x^ Model-based •Any domain •Param. search •Synthetic output mix x+n params x^ param. fit synthesis source models or combinations ... Source Models in Separation - Dan Ellis 2007-11-09 - 6 /17 EM for Model-based Separation • Expectation-Maximization algorithm – for solving partially-unknown problems (only local optimality guaranteed) • EM for model-based separation E-step: find distribution of unknowns p(u) E-step given current model parameters Θ and observations x M-step M-step: optimize Θ to maximize fit to u is... GMM mixture assignment x given current p(u) ... T-F cell dominance ... current phone of voice i ... Source Models in Separation - Dan Ellis 2007-11-09 - 7 /17 What is a Source Model? • Source Model describes signal behavior encapsulates constraints on form of signal (any such constraint can be seen as a model...) • A model has parameters model + parameters → instance Excitation source g[n] • What is not a source model? Resonance filter H(ej!) n detail not provided in instance e.g. using phase from original mixture constraints on interaction between sources e.g. independence, clustering attributes Source Models in Separation - Dan Ellis 2007-11-09 - 8 /17 n ! 2. Monaural Speech Separation • Cooke & Lee’s Speech Separation Challenge short, grammatically-constrained utterances: <command:4><color:4><preposition:4><letter:25><number:10><adverb:4> e.g. "bin white by R 8 again" task: report letter + number for “white” special session at Interspeech ’06 • Separation or Description? separation mix t-f masking + resynthesis identify target energy ASR words speech models mix vs. identify find best words model speech models source knowledge Source Models in Separation - Dan Ellis 2007-11-09 - 9 /17 words Codebook Models Roweis ’01, ’03 Kristjannson ’04, ’06 • Given models for sources, find “best” (most likely) states for spectra: combination p(x|i1, i2) = N (x; ci1 + ci2, Σ) model inference of {i1(t), i2(t)} = argmaxi1,i2 p(x(t)|i1, i2) source state can include sequential constraints... different domains for combining c and defining • E.g. stationary noise: In speech-shaped noise (mel magsnr = 2.41 dB) freq / mel bin Original speech 80 80 80 60 60 60 40 40 40 20 20 20 0 1 2 time / s 0 1 Source Models in Separation - Dan Ellis 2 VQ inferred states (mel magsnr = 3.6 dB) 0 1 2 2007-11-09 - 10/17 Speech Recognition Models Kristjansson, Hershey et al. ’06 • Decode with Factorial HMM i.e. two state sequences, one model for each voice exploit sequence constraints, speaker differences? model 2 • IBM “superhuman” Iroquois system model 1 observations / time fewer errors than people for same speaker, level exploit grammar constraints - higher-level dynamics wer / % Same Gender Speakers 100 50 0 6 dB SDL Recognizer 3 dB 0 dB No dynamics Source Models in Separation - Dan Ellis - 3 dB Acoustic dyn. - 6 dB - 9 dB Grammar dyn. Human 2007-11-09 - 11/17 Speaker-Adapted (SA) Models • Factorial HMM needs distinct speakers )*+,-.*/0,1 Weiss & Ellis ’07 (%% Mixture: t32_swil2a_m18_sbar9n 0 6 4 2 0 '% !20 !40 Adaptation iteration 1 % 0 !20 !40 0 6 4 2 0 (%% 23341*35 Adaptation iteration 3 !20 '% !40 Adaptation iteration 5 0 6 4 2 0 !20 !"# $"# %"# !$"# !!"# 89::-6,7",1 !40 !&"# SD - (%% SD model separation 0 6 4 2 0 % !20 acc % Frequency (kHz) 6 4 2 0 use “eigenvoice” speaker space iterate estimating voice!!"# & !&"# !"# $"# %"# !$"# )*+,-6,7",1 separating speech SI performs midway between speaker-independent (SI) and SA speaker-dependent (SD) '% !40 0 0.5 1 Time (sec) 1.5 %;1*3/, Source Models in Separation - Dan Ellis )8 )2 )< 2007-11-09 - 12/17 #*=,/97, 3. Binaural Speech Separation • 2 or 3 sources in reverberation assume just 2 ‘ears’ • Tasks: identify positions of sources (and number?) recover source signals Source Models in Separation - Dan Ellis 2007-11-09 - 13/17 Spatial Estimation in Reverb Mandel & Ellis ’07 • Model interaural spectrum of each source as stationary level and time differences: converge via EM to a(), τ for each source mask is Pr(X(t,ω) dominated by source i) Left Left Right Right ILD ILD IPD IPD Mask Mask 22 Mask Mask 11 8.77dB 8.77dB Source Models in Separation - Dan Ellis 2007-11-09 - 14/17 Spatial Estimation Results • Modeling uncertainty improves results tradeoff between constraints & noisiness 2.45 dB 0.22 dB 12.35 dB 8.77 dB 9.12 dB Source Models in Separation - Dan Ellis -2.72 dB 2007-11-09 - 15/17 Combining Spatial + Speech Model • Interaural parameters give • • ILDi(ω), ITDi, Pr(X(t, ω) = Si(t, ω)) Speech source model can give Pr(Si(t, ω) is speech signal) Can combine into one big EM framework... E-step M-step Source Models in Separation - Dan Ellis u is: Pr(cell from source i) phoneme sequence Θ is: interaural params speaker params 2007-11-09 - 16/17 Summary & Conclusions • Inferring model parameters is very general .. and very difficult, in general • Speech models can separate single channels .. better match to individual → better results • Spatial cues can separate binaural signals .. but account for uncertainty from e.g. reverb • EM-type approach can integrate them both Source Models in Separation - Dan Ellis 2007-11-09 - 17/17