Using Source Models in Speech Separation 1. 2.

Using Source Models in Speech Separation Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu 1. 2. 3. 4. http://labrosa.ee.columbia.edu/ Mixtures, Separation, and Models Monaural Speech Separation Binaural Speech Separation Conclusions Source Models in Separation - Dan Ellis 2007-11-09 - 1 /17 LabROSA Overview Information Extraction Music Recognition Environment Separation Machine Learning Retrieval Signal Processing Speech Source Models in Separation - Dan Ellis 2007-11-09 - 2 /17 1. Mixtures, Separation, and Models • Sounds rarely occur in isolation .. so analyzing mixtures is a problem .. for humans and machines freq / kHz chan 3 freq / kHz chan 1 4 freq / kHz chan 0 mr-2000-11-02-14:57:43 4 4 2 20 0 -20 -40 -60 level / dB 2 2 chan 9 freq / kHz mr-20001102-1440-cE+1743.wav 4 2 0 0 1 2 3 4 5 Source Models in Separation - Dan Ellis 6 7 8 9 time / sec 2007-11-09 - 3 /17 Mixture Organization Scenarios • Interactive voice systems human-level understanding is expected • Speech prostheses crowds: #1 complaint of hearing aid users • Archive analysis freq / kHz identifying and isolating sound events dpwe-2004-09-10-13:15:40 8 20 6 0 4 -20 2 0 -40 0 1 2 3 4 5 6 7 8 9 level / dB time / sec • Unmixing/remixing/enhancement... Source Models in Separation - Dan Ellis 2007-11-09 - 4 /17 Separation vs. Inference • Ideal separation is rarely possible many situations where overlaps cannot be removed • Overlaps → Ambiguity scene analysis = find “most reasonable” explanation • Ambiguity can be expressed probabilistically i.e. posteriors of sources {Si} given observations X: P({Si}| X) P(X |{Si}) P({Si}) combination physics source models search over {Si} ?? source models → better inference • Better .. learn from examples? Source Models in Separation - Dan Ellis 2007-11-09 - 5 /17 Approaches to Separation ICA CASA •Multi-channel •Single-channel •Fixed filtering •Time-var. filter •Perfect separation •Approximate – maybe! separation target x interference n - mix proc x^ + mix x+n ^ n s stft proc mask istft x^ Model-based •Any domain •Param. search •Synthetic output mix x+n params x^ param. fit synthesis source models or combinations ... Source Models in Separation - Dan Ellis 2007-11-09 - 6 /17 EM for Model-based Separation • Expectation-Maximization algorithm – for solving partially-unknown problems (only local optimality guaranteed) • EM for model-based separation E-step: find distribution of unknowns p(u) E-step given current model parameters Θ and observations x M-step M-step: optimize Θ to maximize fit to u is... GMM mixture assignment x given current p(u) ... T-F cell dominance ... current phone of voice i ... Source Models in Separation - Dan Ellis 2007-11-09 - 7 /17 What is a Source Model? • Source Model describes signal behavior encapsulates constraints on form of signal (any such constraint can be seen as a model...) • A model has parameters model + parameters → instance Excitation source g[n] • What is not a source model? Resonance filter H(ej!) n detail not provided in instance e.g. using phase from original mixture constraints on interaction between sources e.g. independence, clustering attributes Source Models in Separation - Dan Ellis 2007-11-09 - 8 /17 n ! 2. Monaural Speech Separation • Cooke & Lee’s Speech Separation Challenge short, grammatically-constrained utterances: <command:4><color:4><preposition:4><letter:25><number:10><adverb:4> e.g. "bin white by R 8 again" task: report letter + number for “white” special session at Interspeech ’06 • Separation or Description? separation mix t-f masking + resynthesis identify target energy ASR words speech models mix vs. identify find best words model speech models source knowledge Source Models in Separation - Dan Ellis 2007-11-09 - 9 /17 words Codebook Models Roweis ’01, ’03 Kristjannson ’04, ’06 • Given models for sources, find “best” (most likely) states for spectra: combination p(x|i1, i2) = N (x; ci1 + ci2, Σ) model inference of {i1(t), i2(t)} = argmaxi1,i2 p(x(t)|i1, i2) source state can include sequential constraints... different domains for combining c and defining  • E.g. stationary noise: In speech-shaped noise (mel magsnr = 2.41 dB) freq / mel bin Original speech 80 80 80 60 60 60 40 40 40 20 20 20 0 1 2 time / s 0 1 Source Models in Separation - Dan Ellis 2 VQ inferred states (mel magsnr = 3.6 dB) 0 1 2 2007-11-09 - 10/17 Speech Recognition Models Kristjansson, Hershey et al. ’06 • Decode with Factorial HMM i.e. two state sequences, one model for each voice exploit sequence constraints, speaker differences? model 2 • IBM “superhuman” Iroquois system model 1 observations / time fewer errors than people for same speaker, level exploit grammar constraints - higher-level dynamics wer / % Same Gender Speakers 100 50 0 6 dB SDL Recognizer 3 dB 0 dB No dynamics Source Models in Separation - Dan Ellis - 3 dB Acoustic dyn. - 6 dB - 9 dB Grammar dyn. Human 2007-11-09 - 11/17 Speaker-Adapted (SA) Models • Factorial HMM needs distinct speakers )*+,-.*/0,1 Weiss & Ellis ’07 (%% Mixture: t32_swil2a_m18_sbar9n 0 6 4 2 0 '% !20 !40 Adaptation iteration 1 % 0 !20 !40 0 6 4 2 0 (%% 23341*35 Adaptation iteration 3 !20 '% !40 Adaptation iteration 5 0 6 4 2 0 !20 !"# $"# %"# !$"# !!"# 89::-6,7",1 !40 !&"# SD - (%% SD model separation 0 6 4 2 0 % !20 acc % Frequency (kHz) 6 4 2 0 use “eigenvoice” speaker space iterate estimating voice!!"# & !&"# !"# $"# %"# !$"# )*+,-6,7",1 separating speech SI performs midway between speaker-independent (SI) and SA speaker-dependent (SD) '% !40 0 0.5 1 Time (sec) 1.5 %;1*3/, Source Models in Separation - Dan Ellis )8 )2 )< 2007-11-09 - 12/17 #*=,/97, 3. Binaural Speech Separation • 2 or 3 sources in reverberation assume just 2 ‘ears’ • Tasks: identify positions of sources (and number?) recover source signals Source Models in Separation - Dan Ellis 2007-11-09 - 13/17 Spatial Estimation in Reverb Mandel & Ellis ’07 • Model interaural spectrum of each source as stationary level and time differences: converge via EM to a(), τ for each source mask is Pr(X(t,ω) dominated by source i) Left Left Right Right ILD ILD IPD IPD Mask Mask 22 Mask Mask 11 8.77dB 8.77dB Source Models in Separation - Dan Ellis 2007-11-09 - 14/17 Spatial Estimation Results • Modeling uncertainty improves results tradeoff between constraints & noisiness 2.45 dB 0.22 dB 12.35 dB 8.77 dB 9.12 dB Source Models in Separation - Dan Ellis -2.72 dB 2007-11-09 - 15/17 Combining Spatial + Speech Model • Interaural parameters give • • ILDi(ω), ITDi, Pr(X(t, ω) = Si(t, ω)) Speech source model can give Pr(Si(t, ω) is speech signal) Can combine into one big EM framework... E-step M-step Source Models in Separation - Dan Ellis u is: Pr(cell from source i) phoneme sequence Θ is: interaural params speaker params 2007-11-09 - 16/17 Summary & Conclusions • Inferring model parameters is very general .. and very difficult, in general • Speech models can separate single channels .. better match to individual → better results • Spatial cues can separate binaural signals .. but account for uncertainty from e.g. reverb • EM-type approach can integrate them both Source Models in Separation - Dan Ellis 2007-11-09 - 17/17

Using Source Models in Speech Separation 1. 2.

Related documents

Products

Support

Using Source Models in Speech Separation 1. 2.

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib