Model-Based Scene Analysis
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

Outline
1. Separation and Inference
2. Model-based separation
3. Speech Fragment Decoder

Model-Based Scene Analysis - Dan Ellis 2005-06-30 - 1/12

1. Separation and Inference
• Full separation requires a "separable dimension", e.g. spatial filtering; but for a single channel, overlap is inevitable
• Signal knowledge provides extra constraints... for inference of the missing parts
• Separation vs. recognition: separation is sufficient, but too hard; recognition is easier, but too coarse; in between, class plus parameters: adequate for resynthesis?

Pattern Recognition Perspective
• Inferring the source signal set {s_i} from the mixture signal x:

    argmax_{s_i} p(x | {s_i}) · ∏_i p(s_i | M_i)

  where p(x | {s_i}) gives the physics of combination (summation), and p(s_i | M_i) limits which s_i are considered
• How to acquire/evaluate p(s_i | M_i)? Generalize from observations of solo sources to search over {s_i}?
• How to search the full joint space? Clever pruning tricks

Factorial HMM - Toy Example
• Two sources with the same underlying model
  [Figure: shared state-transition matrix and per-state emission distributions; the observed combination of the two chains' emissions over time, and the inferred state paths for Source A and Source B]
• Sequence constraints can disambiguate identical emissions

MAXVQ Results: Denoising - Disambiguating with Knowledge (Roweis '03)
• Use strength of match to models as a "reasonableness" measure for control
• e.g.
MAXVQ: learn a dictionary of spectrogram slices (templates); find the ones that "fit" the mixture, or the max() of a combination... then filter out the excess energy
  [Figure: noise-corrupted speech, the matching templates, and the denoised output spectrograms]
• Training: 300 sec of isolated speech (TIMIT) to fit 512 codewords, and 100 sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0 dB SNR
  (from Sam Roweis's Montreal 2003 presentation)

Full Mixture Inference
• From the abstract of the cited paper: "We present a framework for speech enhancement and robust speech recognition that exploits the harmonic structure of speech... The method employs a high frequency resolution speech model in the log-spectrum domain and reconstructs the signal from the estimated posteriors of the clean signal and the phases from the original noisy signal. We achieve a gain in signal to noise ratio of 8.38 dB for enhancement of speech at 0 dB. [On] the Aurora 2 data-set... at 0 dB SNR, we achieve a reduction of relative word error rate of 43.75% over the baseline, and 15.90% over the equivalent low-resolution algorithm."
• Motivating observation: most noise sources do not have harmonic structure similar to that of voiced speech, so voiced speech sounds should be more easily distinguishable from environmental noise in a high-dimensional signal space.
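The MAXVQ scheme described above (learn VQ templates for each source, explain the mixture frame as the elementwise max of one speech and one noise codeword, then keep only the speech-dominant cells) can be illustrated with a toy NumPy sketch. The 4-bin log-spectra, tiny codebooks, and exhaustive pair search below are illustrative assumptions, not Roweis's actual configuration:

```python
import numpy as np

def maxvq_denoise(mix, speech_cb, noise_cb):
    """Toy MAXVQ: find the (speech, noise) codeword pair whose
    elementwise max best matches the log-spectral mixture frame,
    then keep only the cells where speech dominates."""
    best_pair, best_err = None, np.inf
    for s in speech_cb:                      # exhaustive pair search
        for n in noise_cb:
            err = np.sum((mix - np.maximum(s, n)) ** 2)
            if err < best_err:
                best_err, best_pair = err, (s, n)
    s, n = best_pair
    mask = s >= n                            # speech-dominant cells
    return np.where(mask, mix, -np.inf), mask   # -inf = filtered out (log domain)

# toy 4-bin log-magnitude spectra (invented values)
speech_cb = np.array([[10., 0., 8., 0.], [0., 9., 0., 7.]])
noise_cb  = np.array([[3., 3., 3., 3.]])
mix = np.maximum(speech_cb[0], noise_cb[0])  # an ideal max-combined mixture
denoised, mask = maxvq_denoise(mix, speech_cb, noise_cb)
```

The real system searches 512 speech x 32 noise codewords per frame, which is why pruning tricks matter.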
(Kristjansson, Attias, Hershey '04)
• Can model the combination of magnitude spectra with a stochastic model: phase cancellation is treated as noise...
• Precise inference of the clean components by iterative linearization
• Works well (for small domains?)
• Paper background: the source-filter model treats speech as a harmonic excitation (the vocal cords) shaped by a filter (the vocal tract); the fine harmonic structure varies strongly across utterances of the same phone, which traditionally motivated smooth low-resolution features, but high-resolution models can exploit it, since most noise lacks it.
  [Fig. 1 of the paper: the noisy input vector (dot-dash), the corresponding clean vector (solid), and the estimate of the clean speech (dotted) over 0-1000 Hz, with a shaded area indicating the uncertainty of the estimate (one standard deviation). The uncertainty is considerably larger in the valleys between the harmonic peaks, reflecting the lower SNR in those regions.]
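The "iterative linearization" idea can be sketched for a single log-spectral bin: with Gaussian priors on clean speech x and noise n, and the (phase-ignoring) mixing model y = log(e^x + e^n), repeatedly linearize the mixing function around the current estimates and do one Gaussian conditioning step. This is a hypothetical one-bin sketch; the paper's full algorithm also models phase effects and uses rich mixture priors:

```python
import numpy as np

def infer_clean(y, mu_x, v_x, mu_n, v_n, iters=10):
    """Estimate clean log-spectrum x (and noise n) from noisy y under
    y = log(exp(x) + exp(n)), with priors x~N(mu_x,v_x), n~N(mu_n,v_n).
    Each pass linearizes the mixing function at the current estimate
    and conditions the Gaussian on the linearized observation."""
    x0, n0 = mu_x, mu_n
    for _ in range(iters):
        a = np.exp(x0) / (np.exp(x0) + np.exp(n0))   # dy/dx at (x0, n0)
        b = 1.0 - a                                  # dy/dn
        # linearized observation: a*x + b*n = r (zero observation noise)
        r = y - (np.logaddexp(x0, n0) - a * x0 - b * n0)
        S = a * a * v_x + b * b * v_n                # innovation variance
        innov = r - (a * mu_x + b * mu_n)
        x0 = mu_x + (v_x * a / S) * innov            # conditioned means
        n0 = mu_n + (v_n * b / S) * innov
    return x0, n0

y = np.logaddexp(1.0, -0.5)                  # "observed" noisy log-power
x_hat, n_hat = infer_clean(y, 0.8, 0.5, -0.3, 0.5)
```

At the fixed point the estimates exactly explain the observation, i.e. logaddexp(x_hat, n_hat) = y, with posterior variance largest where speech and noise powers are comparable, matching the widened uncertainty bands in the harmonic valleys of Fig. 1.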
2. Missing Data Recognition
• Speech models p(x|M) are multidimensional... i.e. means and variances for every frequency channel; need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions x_k by marginalizing out the unobserved ones x_u:

    p(x_k | M) = ∫ p(x_k, x_u | M) dx_u

• Hence, missing data recognition: given a mask of present data, the state likelihood uses only the present dimensions,

    P(x | q) = ∏_{k present} P(x_k | q)

  and masked dimensions, whose true source value must lie below the observed energy y, can still contribute a bound: p(x_k | x_u < y)
• Hard part is finding the mask (segregation)
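A minimal sketch of the masked likelihood above, for diagonal-Gaussian states: marginalizing the unreliable dimensions of a diagonal Gaussian simply drops their terms from the sum (the bounded-marginalization refinement p(x_k | x_u < y) is omitted here). The two-state "quiet"/"loud" setup is invented for illustration:

```python
import numpy as np

def missing_data_loglik(x, mask, means, vars_):
    """Per-state log-likelihood over only the present (reliable)
    dimensions: log P(x|q) = sum over present k of log N(x_k; mu_qk, v_qk)."""
    d, mu, v = x[mask], means[:, mask], vars_[:, mask]
    ll = -0.5 * (np.log(2 * np.pi * v) + (d - mu) ** 2 / v)
    return ll.sum(axis=1)                    # one score per state q

# two states: "quiet" (means 0) and "loud" (means 5); bin 1 of x is corrupt
means = np.array([[0., 0., 0.], [5., 5., 5.]])
vars_ = np.ones((2, 3))
x = np.array([0., 9., 0.])
full = missing_data_loglik(x, np.array([True, True, True]), means, vars_)
reliable = missing_data_loglik(x, np.array([True, False, True]), means, vars_)
```

With all dimensions scored, the corrupt bin drags the decision to the "loud" state; masking it restores the correct "quiet" decision, which is exactly why finding the mask is the hard part.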
The Speech Fragment Decoder (Barker, Cooke, Ellis '04)
• Match the "uncorrupt" spectrum to ASR models using missing data:

    M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)

• Mixtures: observed features Y, segregation S; all related by P(X | Y, S)
• Joint search over both the model M and the segregation S to maximize:

    P(M, S | Y) = P(M) · ∫ [ P(X | M) / P(X) ] · P(X | Y, S) dX · P(S | Y)

  i.e. an isolated-source model term (with the prior P(X) as a model-independent normalizing constant) times a segregation model term

Using CASA cues
• P(S | Y) links acoustic information to segregation: is this segregation worth considering? How likely is it?
• CASA helps search: consider only segregations built from CASA chunks
• CASA rates segregations: construct P(S | Y) to reward CASA qualities such as harmonicity (frequency bands belong together), common onset and continuity (a time-frequency region must be whole), and frequency proximity

Learning for Separation
• Control: learn what is "reasonable"
• Input: discriminant features, learned subspaces
• Engine: clustering parameters
• Output: restoration...

Can Machine Learning Subsume CASA?
• ASA grouping cues describe real sounds... "anecdotally"
• Machine Learning is another way to find regularities in large datasets: can, e.g., Roweis's templates subsume harmonicity, onset, etc.? ... and handle schema at the same time? "Cut out the (grouping-cue) middleman"
• The trick is how to represent/generalize: listeners can organize novel sounds

Conclusions
• Source separation needs constraints, e.g. prior knowledge of signal form
• Memorized signals (HMMs) can be powerful, but can get very large
• Speech recognition models can be co-opted, e.g. to identify plausible subsets of regions
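As a closing illustration of the last point, co-opting recognition models to identify plausible subsets of regions, here is a toy fragment-decoder search: enumerate labelings of time-frequency fragments as source/background, and score each (model, segregation) pair by a masked likelihood ratio against a background prior. This is a discrete stand-in for the P(X|M)/P(X) integral with a flat P(S|Y); all models, fragments, and values are invented for illustration:

```python
import numpy as np
from itertools import product

def fragment_decode(x, fragments, models, prior):
    """Toy speech-fragment decoder: jointly choose a source model M and
    a labeling S of fragments (True = belongs to the target source),
    scoring each hypothesis by sum over the source-labeled dimensions
    of log p(x_k | M) - log p(x_k), with a uniform segregation prior."""
    mu0, v0 = prior
    best = (-np.inf, None, None)
    for labels in product([False, True], repeat=len(fragments)):
        mask = np.zeros(len(x), dtype=bool)
        for lab, frag in zip(labels, fragments):
            if lab:
                mask[list(frag)] = True
        for name, (mu, v) in models.items():
            d = x[mask]
            num = -0.5 * (np.log(2 * np.pi * v[mask]) + (d - mu[mask]) ** 2 / v[mask])
            den = -0.5 * (np.log(2 * np.pi * v0[mask]) + (d - mu0[mask]) ** 2 / v0[mask])
            score = float(np.sum(num - den))
            if score > best[0]:
                best = (score, name, labels)
    return best

# fragment A (bins 0-1) matches word model "one"; fragment B (bins 2-3)
# matches the background
x = np.array([5., 5., 9., 9.])
fragments = [(0, 1), (2, 3)]
models = {"one": (np.full(4, 5.), np.ones(4)),
          "two": (np.zeros(4), np.ones(4))}
prior = (np.array([0., 0., 9., 9.]), np.ones(4))
score, word, labels = fragment_decode(x, fragments, models, prior)
```

The likelihood-ratio scoring is what makes hypotheses with different mask sizes comparable: a dimension only raises the score when the source model explains it better than the background prior does. Restricting the labelings to CASA-derived fragments, as the slides describe, is what keeps this search tractable.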