Using Speech Models for Separation
Dan Ellis
comprising the work of Michael Mandel and Ron Weiss
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu · http://labrosa.ee.columbia.edu/
2009-10-13

1. Source Models and Scene Analysis
2. Eigenvoice Speaker Models
3. Spatial Parameter Models in Reverb
4. Combining Source + Spatial Models

LabROSA Overview
[Diagram: LabROSA sits at the intersection of Machine Learning and Signal Processing, doing information extraction (recognition, separation, retrieval) across music, speech, and environmental audio.]

1. Source Models and Scene Analysis
[Figure: spectrogram (freq/Hz vs. time/s, level/dB) of the mixture "02_m+s-15-evil-goodvoice-fade", analyzed into Voice (evil), Voice (pleasant), Stab, Rumble, Choir, and Strings.]
• Sounds rarely occur in isolation
  .. so analyzing mixtures ("scenes") is a problem
  .. for humans and machines

Approaches to Separation
• ICA: multi-channel, fixed filtering, perfect separation – maybe!
  [target x + interference n → mix → proc → x̂ + n̂]
• CASA: single-channel, time-varying filter, approximate separation
  [mix x+n → stft → mask (from proc) → istft → x̂]
• Model-based: any domain, parameter search, synthetic output?
  [mix x+n → param. fit (against source models) → params → synthesis → x̂]

Separation vs. Inference
• Ideal separation is rarely possible
  many situations where overlaps cannot be removed
• Overlaps → ambiguity
  scene analysis = find the "most reasonable" explanation
• Ambiguity can be expressed probabilistically,
  i.e. posteriors of sources {Si} given observations X:
  P({Si} | X) ∝ P(X | {Si}) · ∏i P(Si | Mi)
  (combination physics × source models, searched over all source-signal sets {Si})
• Better source models → better inference

2. Speech Separation Using Models
• Cooke & Lee's Speech Separation Challenge
  pairs of short, grammatically constrained utterances:
  <command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
  e.g. "bin white by R 8 again"
  task: report letter + number for "white"
  (special session at Interspeech '06)
• Separation or Description?
  separate: mix → identify target energy (speech models) → t-f masking + resynthesis → ASR → words
  vs. identify: mix → find best words model (speech models + source knowledge) → words

Codebook Models (Roweis '01, '03; Kristjansson '04, '06)
• Given models for sources, find the "best" (most likely) states for the spectra:
  combination model: p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)
  inference of source state: {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)
  can include sequential constraints...
• E.g. stationary noise:
  [Figure: mel spectrograms (freq/mel bin vs. time/s) of the original speech, the same speech in speech-shaped noise (mel mag-SNR = 2.41 dB), and the VQ inferred states (mel mag-SNR = 3.6 dB).]
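To make the codebook search concrete, here is a minimal sketch (not the original authors' code) of the argmax over pairs of codebook entries, assuming prototype spectra that combine additively as in the combination model above, toy random codebooks, and a shared diagonal variance sigma2:

```python
import numpy as np

# Toy random codebooks: K spectral prototypes (log-magnitude bins) per
# source; in practice these come from VQ/GMM training on each speaker.
rng = np.random.default_rng(0)
K, B = 16, 64                      # codebook size, spectral bins
codebook1 = rng.normal(size=(K, B))
codebook2 = rng.normal(size=(K, B))
sigma2 = 0.1                       # shared diagonal variance (assumed)

def best_state_pair(x):
    """Return (i1, i2) maximizing N(x; c_i1 + c_i2, sigma2 * I)."""
    combo = codebook1[:, None, :] + codebook2[None, :, :]   # (K, K, B)
    # Log-likelihood of x under every pair, up to an additive constant
    ll = -0.5 * np.sum((x - combo) ** 2, axis=-1) / sigma2
    i1, i2 = np.unravel_index(np.argmax(ll), ll.shape)
    return int(i1), int(i2)

x = codebook1[3] + codebook2[7] + 0.01 * rng.normal(size=B)
print(best_state_pair(x))          # expect (3, 7) at this low noise level
```

The exhaustive K×K scoring is what makes the sequential-constraint (factorial HMM) extensions on the next slides expensive: the joint state space grows with the product of the per-source codebooks.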
Speech Recognition Models (Varga & Moore '90)
• Speech recognizers contain speech models
  ASR is just argmax_W P(W | X)
• Recognize mixtures with a Factorial HMM,
  i.e. two state sequences, one model for each voice (see the sketch below)
  exploit sequence constraints, speaker differences
  [Diagram: two independent HMM state chains (model 1, model 2) jointly explaining the observations over time.]

Speech Factorial Separation (Kristjansson, Hershey et al. '06)
• IBM's 2006 "Iroquois" speech separation system
  Key features:
  detailed state combinations
  large speech recognizer
  exploits grammar constraints
  34 per-speaker models
• "Superhuman" performance ... in some conditions
  [Slide shows the first page of the paper, which exploits the harmonic structure of speech via a high-frequency-resolution model in the log-spectrum domain: an 8.38 dB SNR gain enhancing speech at 0 dB, and, on Aurora 2 at 0 dB SNR, a 43.75% relative word-error-rate reduction over the baseline (15.90% over the equivalent low-resolution algorithm).]
  [Fig. 1: noisy input vector, corresponding clean vector, and clean-speech estimate with one-standard-deviation uncertainty; the uncertainty is considerably larger in the valleys between the harmonic peaks, reflecting the lower SNR there.]
  [Figure: WER (%) vs. SNR (−9 to 6 dB) for same-gender speakers, comparing human listeners against the SDL recognizer and systems with no dynamics, acoustic dynamics, and grammar dynamics.]
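To show the shape of the factorial-HMM computation, here is a toy brute-force joint Viterbi over all state pairs; the real systems are vastly larger and rely on pruning and grammar constraints, and every parameter below is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(1)
S, B, T = 4, 8, 20                 # states per voice, spectral bins, frames

def toy_hmm():
    """Random transition matrix and state mean spectra for one voice."""
    A = rng.dirichlet(np.ones(S), size=S)       # rows sum to 1
    mu = rng.normal(size=(S, B))
    return np.log(A), mu

logA1, mu1 = toy_hmm()
logA2, mu2 = toy_hmm()

def factorial_viterbi(X, sigma2=0.1):
    """Jointly decode two state sequences from mixture frames X: (T, B)."""
    combo = mu1[:, None, :] + mu2[None, :, :]                  # (S, S, B)
    ll = -0.5 * ((X[:, None, None, :] - combo) ** 2).sum(-1) / sigma2
    # Joint transitions: the two chains move independently
    logA = logA1[:, None, :, None] + logA2[None, :, None, :]   # (S, S, S, S)
    delta = ll[0].copy()                       # uniform initial-state prior
    back = np.zeros((T, S, S, 2), dtype=int)
    for t in range(1, T):
        scores = delta[:, :, None, None] + logA  # prev (i,j) -> next (k,l)
        flat = scores.reshape(S * S, S, S)
        best = flat.argmax(0)                    # best predecessor per (k,l)
        back[t, :, :, 0], back[t, :, :, 1] = np.unravel_index(best, (S, S))
        delta = flat.max(0) + ll[t]
    i, j = np.unravel_index(delta.argmax(), (S, S))
    path = [(int(i), int(j))]
    for t in range(T - 1, 0, -1):              # trace back the joint path
        i, j = back[t, i, j]
        path.append((int(i), int(j)))
    return path[::-1]

print(factorial_viterbi(rng.normal(size=(T, B)))[:5])
```

Even this toy version scores S² joint states per frame and S⁴ transitions, which is why per-speaker models and grammar constraints matter so much at realistic model sizes.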
Adapting Source Models
• The power of model-based separation depends on the detail of the model
• Does speech separation rely on prior knowledge of every speaker?
  [Figure: SNR (dB) vs. number of training speakers (32–250) for different-gender pairs, in and out of the training set, comparing Oracle, speaker-dependent (SD), speaker-adapted (SA), and SA (SD init) models.]
• Can this be practical?

Eigenvoices (Kuhn et al. '98, '00; Weiss & Ellis '07, '08, '09)
• Idea: find speaker-subspace bases in the model parameter space
  generalize without losing detail?
• Eigenvoice model:
  µ = µ̄ + U w + B h
  adapted model = mean voice + eigenvoice bases · weights + channel bases · channel weights
  (a numerical sketch of this update closes this section)

Eigenvoice Bases
• Mean model: 280 states × 320 bins = 89,600 dimensions
• Eigencomponents shift formants / coloration
  additional components for channel
  [Figure: mean voice and eigenvoice dimensions 1–3, shown as frequency (kHz) vs. phone state (b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax).]

Speaker-Adapted Separation (Weiss & Ellis '08)
• Factorial HMM analysis with tuning of the source model parameters
  = eigenvoice speaker adaptation
  [Figures: separation examples with speaker-adapted models.]
• Eigenvoices for the Speech Separation task
  speaker-adapted (SA) models perform midway between speaker-dependent (SD) and speaker-independent (SI)
  [Figure: task performance for SI, SA, and SD models.]
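The adaptation equation µ = µ̄ + U w + B h has a direct numerical reading. Below is a schematic with made-up dimensions (the real models have 280 states × 320 bins = 89,600 dimensions); the least-squares fit is only illustrative, since in the actual system the weights are re-estimated inside the factorial EM against the mixture:

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, C = 1000, 3, 2     # model dims (toy), eigenvoices, channel bases
mu_bar = rng.normal(size=D)          # mean voice
U = rng.normal(size=(D, K))          # eigenvoice bases
Bc = rng.normal(size=(D, C))         # channel bases

def adapt(w, h):
    """Speaker-adapted model mean: mu = mu_bar + U w + B h."""
    return mu_bar + U @ w + Bc @ h

# Pretend we observed statistics from a speaker with known weights, then
# recover those weights by least squares against the subspace bases:
target = adapt(np.array([0.5, -1.0, 0.2]), np.array([0.3, 0.0]))
M = np.hstack([U, Bc])                           # (D, K + C)
wh, *_ = np.linalg.lstsq(M, target - mu_bar, rcond=None)
print(wh.round(2))                               # [ 0.5 -1.   0.2  0.3  0. ]
```

The design point: only K + C = 5 numbers are fit per speaker instead of 89,600, which is what lets a detailed model generalize to unseen voices.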
3. Spatial Models & Reverb (Mandel & Ellis '07)
• 2 or 3 sources in reverberation
  assume just 2 'ears'
• Model the interaural spectrum of each source as stationary level and time differences
  [Equations shown on slide.]

ILD and IPD
• Sources at 0° and 75° in reverb
  [Figure: left- and right-ear spectrograms of source 1, source 2, and the mixture, with the resulting ILD and IPD maps.]

IPD, ILD Distributions
• Source at 75° in reverberation
  IPD residual = IPD with the constant phase offset ωτ removed
  the residual can be fit by a single Gaussian
  the ILD needs frequency dependence
  [Figure: per-frequency distributions of IPD, IPD residual, and ILD.]

Model-Based EM Source Separation and Localization (MESSL) (Mandel & Ellis '09)
• EM loop: assign spectrogram points to sources (E-step), then re-estimate the source parameters (M-step)
  can model more sources than sensors
  flexible initialization
  (a minimal sketch of one iteration follows the results below)

MESSL Results
• Modeling uncertainty improves results
  tradeoff between constraints & noisiness
  [Figure: example masks/reconstructions at 2.45, 0.22, 12.35, 8.77, 9.12, and −2.72 dB.]
• Signal-to-Distortion Ratio (SDR)
  [Figure: SDR (dB) vs. separation angle (deg) for anechoic 2 talkers, reverberant 2 talkers, and reverberant 3 talkers, comparing DP-Wiener, Wiener, MESSL-G, MESSL-ΩΩ, TRINICON, Mouba, Sawada, DUET, and random masks.]
• Speech recognizer (Digits)
  [Figure: percent correct vs. target-to-masker ratio (dB); ground-truth masks (Human, DP-Oracle, Oracle, OracleAllRev, Mixes) and algorithmic masks (Human, Sawada, Mouba, MESSL-G, MESSL-ΩΩ, DUET, Mixes).]
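To ground the E/M boxes above, here is a minimal single-iteration sketch under simplifying assumptions: a frequency-independent ILD Gaussian, one fixed delay hypothesis tau per source rather than the paper's delay grid, no garbage source, and an ad-hoc bin-frequency mapping. The function name and initialization are illustrative, not the published implementation:

```python
import numpy as np

def messl_em_iteration(L, R, tau, mu_ipd, s2_ipd, mu_ild, s2_ild, prior,
                       fs=16000):
    """One E+M step for I sources given a stereo STFT pair L, R: (F, T)."""
    F, T = L.shape
    omega = 2 * np.pi * np.arange(F) * fs / (2 * (F - 1))  # bin freqs, rad/s
    ipd = np.angle(L * np.conj(R))                         # interaural phase
    ild = 20 * np.log10((np.abs(L) + 1e-9) / (np.abs(R) + 1e-9))  # dB

    I = len(tau)
    res = np.empty((I, F, T))
    logp = np.empty((I, F, T))
    for i in range(I):
        # IPD residual: remove this source's delay, wrap to [-pi, pi)
        res[i] = np.angle(np.exp(1j * (ipd - omega[:, None] * tau[i])))
        logp[i] = (-0.5 * (res[i] - mu_ipd[i]) ** 2 / s2_ipd[i]
                   - 0.5 * (ild - mu_ild[i]) ** 2 / s2_ild[i]
                   + np.log(prior[i]))
    # E-step: per-cell posterior mask, Pr(cell belongs to source i)
    mask = np.exp(logp - logp.max(0))
    mask /= mask.sum(0)
    # M-step: re-fit each source's IPD/ILD Gaussians under its mask
    for i in range(I):
        w = mask[i]
        mu_ipd[i] = (w * res[i]).sum() / w.sum()
        s2_ipd[i] = (w * (res[i] - mu_ipd[i]) ** 2).sum() / w.sum() + 1e-6
        mu_ild[i] = (w * ild).sum() / w.sum()
        s2_ild[i] = (w * (ild - mu_ild[i]) ** 2).sum() / w.sum() + 1e-6
        prior[i] = w.mean()
    return mask

# Toy usage: random "stereo STFT" frames and two delay hypotheses
rng = np.random.default_rng(3)
F, T, I = 257, 100, 2
L = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))
R = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))
mask = messl_em_iteration(L, R, tau=np.array([0.0, 3e-4]),
                          mu_ipd=np.zeros(I), s2_ipd=np.ones(I),
                          mu_ild=np.zeros(I), s2_ild=np.ones(I),
                          prior=np.full(I, 0.5))
print(mask.shape)   # (2, 257, 100); masks sum to 1 over sources per cell
```

Because the per-source parameters are just a handful of Gaussian statistics, nothing ties the number of sources to the number of sensors, which is how MESSL handles more talkers than ears.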
4. Combining Spatial + Speech Models (Weiss, Mandel & Ellis '08)
• Interaural parameters give
  ILD_i(ω), ITD_i, Pr(X(t,ω) = S_i(t,ω))
• A speech source model can give
  Pr(S_i(t,ω) is a speech signal)
• Can combine into one big EM framework...
  [Diagram: the E-step infers u (Pr(cell from source i), phoneme sequence); the M-step updates Θ (interaural params, speaker params).]

MESSL-SP (Source Prior)
• Parameter estimation and source separation with speaker-subspace models
  (a schematic of the combined E-step closes these slides)

MESSL-SP Results
• Source models function as priors
• Interaural parameters give the spatial separation;
  the source-model prior improves the spatial estimate
  [Figure: frequency (kHz) vs. time (sec) masks: Ground truth (12.04 dB), DUET (3.84 dB), 2S-FD-BSS (5.41 dB), MESSL (5.66 dB), MESSL-SP (10.01 dB), MESSL-EV (10.37 dB).]
• SNR improvement vs. source separation angle
  [Figure: SNR improvement (dB) vs. separation (degrees) for 2 and 3 sources, anechoic and reverberant, comparing Ground truth, MESSL-EV, MESSL-SP, MESSL, 2S-FD-BSS, and DUET.]

Future Work
• Better parametric speaker models
  limitations of eigenvoices
  varying style
• Understanding reverb & ASR
  early echoes
  what spoils ASR?
• Models of other sources
  eigeninstruments?

Summary & Conclusions
• Source models provide the constraints that make scene analysis possible
• Eigenvoices (a model subspace) can provide detailed models that generalize
• Spatial parameters can identify more sources than models in reverb (MESSL)
• Source + spatial models can be combined
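Finally, to make the combination concrete, a closing schematic (not the MESSL-SP implementation; names and shapes are illustrative) of how per-source speech-model log-likelihoods would fold into the spatial E-step responsibilities:

```python
import numpy as np

def combined_posterior(log_spatial, log_source, log_prior):
    """Pr(cell from source i), mixing spatial and source-model evidence.

    log_spatial, log_source: (I, F, T) per-source log-likelihoods; in the
    combined model the spatial term comes from the IPD/ILD fit and the
    source term from the (eigenvoice-adapted) speech model's state
    posteriors. log_prior: (I,) per-source log mixing weights.
    """
    logp = log_spatial + log_source + log_prior[:, None, None]
    mask = np.exp(logp - logp.max(axis=0))      # stabilized softmax over i
    return mask / mask.sum(axis=0)
```

The point of the combination is visible in the sum: where the interaural cues are ambiguous, the speech-model term dominates the per-cell posterior, which is why the source prior improves the spatial estimate in the results above.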