Sound Organization by Source Models in Humans and Machines
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
4. Research Questions

The Problem of Mixtures
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)
• The received waveform is a mixture
  2 sensors, N sources - underconstrained
• Undoing mixtures: hearing's primary goal?
  .. by any means available

Sound Organization Scenarios
• Interactive voice systems
  human-level understanding is expected
• Speech prostheses
  crowds: the #1 complaint of hearing-aid users
• Archive analysis
  identifying and isolating sound events
  [Figure: spectrogram of a personal-audio recording, pa-2004-09-10-131540.wav; freq / kHz vs. time / sec, level / dB]
• Unmixing / remixing / enhancement ...

How Can We Separate?
• By between-sensor differences (spatial cues)
  'steer a null' onto a compact interfering source
  the filtering / signal-processing paradigm
• By finding a 'separable representation'
  spectral? sources are broadband but sparse
  periodicity? maybe - for pitched sounds
  something more signal-specific ...
• By inference (based on knowledge / models)
  acoustic sources are redundant
  → use part to guess the remainder - limits the possible solutions

Separation vs. Inference
• Ideal separation is rarely possible
  i.e. no projection can completely remove overlaps
• Overlaps → ambiguity
  scene analysis = find the "most reasonable" explanation
• Ambiguity can be expressed probabilistically
  i.e. the posterior of sources {Si} given observations X:
  P({Si} | X) ∝ P(X | {Si}) · P({Si})
  (combination physics × source models)
• Better source models → better inference
  .. learn from examples?

A Simple Example
• Source models are codebooks drawn from separate subspaces (a code sketch follows)
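To make the codebook picture concrete, here is a minimal sketch (my own illustration, not code from the talk): each source contributes one codeword per frame, the powers of independent sources add, and the most likely pair is found by exhaustive search, directly instantiating P({Si} | X) ∝ P(X | {Si}) · P({Si}) with flat priors. The function names and the Gaussian observation noise sigma are assumptions.

```python
import numpy as np

def best_pair(x, codebook_a, codebook_b, sigma=1.0):
    """x: observed mixture power spectrum, shape (bins,).
    codebook_a: (Na, bins), codebook_b: (Nb, bins) source prototypes.
    Returns indices (ia, ib) of the most likely codeword pair."""
    # combination physics: powers of independent sources add, so the
    # observation model is p(x | ia, ib) = N(x; c_a[ia] + c_b[ib], sigma^2 I)
    combo = codebook_a[:, None, :] + codebook_b[None, :, :]   # (Na, Nb, bins)
    log_lik = -0.5 * np.sum((x - combo) ** 2, axis=-1) / sigma**2
    # flat prior P({Si}) -> MAP reduces to maximum likelihood over pairs
    return np.unravel_index(np.argmax(log_lik), log_lik.shape)
```

Exhaustive search over Na × Nb pairs is fine for small codebooks; the Markov version on the next slide replaces this per-frame argmax with a joint Viterbi search.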
A Slightly Less Simple Example
• Sources with Markov transitions between codebook states

What is a Source Model?
• A source model describes signal behavior
  encapsulates constraints on the form of the signal
  (any such constraint can be seen as a model ...)
• A model has parameters
  model + parameters → instance
  e.g. an excitation source g[n] driving a resonance filter H(e^jω)
• What is not a source model?
  detail not provided in the instance
  e.g. using the phase from the original mixture
  constraints on the interaction between sources
  e.g. independence, clustering attributes

Outline
1. Mixtures and Models
2. Human Sound Organization
   Auditory Scene Analysis
   Using source characteristics
   Illusions
3. Machine Sound Organization
4. Research Questions

Auditory Scene Analysis (Bregman '90; Darwin & Carlyon '95)
• How do people analyze sound mixtures?
  break the mixture into small elements (in time-frequency)
  elements are grouped into sources using cues
  sources have aggregate attributes
• Grouping rules (Darwin, Carlyon, ...):
  cues: common onset / modulation, harmonicity, ...
  [Schematic, after Darwin 1996: sound → frequency analysis → elements; onset, harmonicity, and spatial maps feed a grouping mechanism that forms sources and their properties]
• Also learned "schema" (for speech etc.)

Perceiving Sources
• Harmonics are distinct in the ear, but perceived as one source ("fused"):
  depends on common onset
  depends on harmonicity
• Experimental techniques
  ask subjects "how many" sources
  match attributes, e.g. pitch, vowel identity
  brain recordings (EEG "mismatch negativity")

Auditory "Illusions"
• How do we explain illusions?
  [Spectrogram panels, frequency vs. time: the pulsation threshold, sinewave speech, phonemic restoration]
• Something is providing the missing (illusory) pieces
  ... source models

Human Speech Separation
• Task: Coordinate Response Measure (Brungart et al. '02)
  "Ready Baron go to green eight now"
  256 variants, 16 speakers
  correct = color and number for "Baron"
  (crm-11737+16515.wav)
• Accuracy as a function of spatial separation
  A, B same speaker
  note the "range effect"

Separation by Vocal Differences
• Monaural informational masking (Brungart et al. '01)
  CRM varying the level and voice character (same spatial location)
  level differences can help in speech segregation: listen for the quieter talker!
  energetic vs. informational masking
  more than pitch .. source models

Outline
1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
   Computational Auditory Scene Analysis
   Dictionary Source Models
4. Research Questions

Source Model Issues
• Domain
  parsimonious expression of constraints
  nice combination physics
• Tractability
  size of the search space
  tricks to speed search / inference
• Acquisition
  hand-designed vs. learned
  static vs. short-term
• Factorization
  independent aspects
  hierarchy & specificity

Computational Auditory Scene Analysis (Brown & Cooke '94; Okuno et al. '99; Hu & Wang '04; ...)
• Central idea: segment time-frequency into sources based on perceptual grouping cues
  front end: signal features / maps (onset, period, frequency modulation, ...)
  object formation: discrete objects from the segmented input mixture
  grouping rules → source groups
  the principal cue is harmonicity

CASA Limitations (example from Hu & Wang '04, huwang-v3n7.wav)
• Limitations of time-frequency masking
  cannot undo overlaps - leaves gaps
• Driven by local features
  limited model scope → no inference or illusions
• Does not learn from data
  (a sketch of the underlying harmonic-mask idea follows)
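The harmonicity cue that drives these CASA systems can be sketched in a few lines. This is a hypothetical illustration, not the Hu & Wang system: label each time-frequency cell by whether it lies near a harmonic of a per-frame pitch track, and use the resulting binary mask to resynthesize the target. The function name and the 4% tolerance are assumptions.

```python
import numpy as np

def harmonic_mask(spec_freqs, f0_track, tol=0.04):
    """spec_freqs: (bins,) center frequencies in Hz.
    f0_track: (frames,) pitch estimates in Hz (0 = unvoiced).
    Returns a (frames, bins) boolean mask of cells near a harmonic."""
    mask = np.zeros((len(f0_track), len(spec_freqs)), dtype=bool)
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:
            continue  # unvoiced frame: no cells assigned to the target
        harmonic_num = np.round(spec_freqs / f0)      # nearest harmonic index
        dev = np.abs(spec_freqs - harmonic_num * f0) / f0
        mask[t] = (harmonic_num >= 1) & (dev < tol)
    return mask

# usage sketch: masked_stft = stft * harmonic_mask(freqs, f0_track).T
```

Such a mask also makes the first limitation above concrete: wherever the interference dominates, the mask simply zeroes the cell, leaving a gap rather than recovering the overlapped target energy.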
Basic Dictionary Models (Roweis '01, '03; Kristjansson '04, '06)
• Given models for the sources, find the "best" (most likely) states for each spectrum:
  combination: p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)
  inference: {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)
  the source state can include sequential constraints ... (sketched below)
  different domains are possible for combining the c's and defining Σ
• E.g. stationary noise:
  [Mel-spectrogram panels: original speech; speech in stationary noise; VQ-inferred states (with mel-magnitude SNRs)]
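The "sequential constraints" line can be realized by giving each source a Markov chain over its codebook states and decoding the two state sequences jointly. A sketch under that assumption (a factorial model flattened into one big HMM; all names are mine, not the talk's):

```python
import numpy as np

def joint_viterbi(log_obs, log_A, log_B):
    """log_obs: (T, Na, Nb) frame log-likelihoods log p(x_t | i1=a, i2=b).
    log_A: (Na, Na) source-1 log transitions; log_B: (Nb, Nb) for source 2.
    Returns arrays (i1, i2) giving the most likely joint state path."""
    T, Na, Nb = log_obs.shape
    # factorial HMM -> single HMM: flatten joint state (a, b) to a*Nb + b
    log_T = (log_A[:, None, :, None]
             + log_B[None, :, None, :]).reshape(Na * Nb, Na * Nb)
    obs = log_obs.reshape(T, Na * Nb)
    delta = obs[0].copy()                      # best log-prob ending in state
    psi = np.zeros((T, Na * Nb), dtype=int)    # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_T          # (prev state, next state)
        psi[t] = np.argmax(cand, axis=0)
        delta = cand[psi[t], np.arange(Na * Nb)] + obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):              # trace back the best path
        path[t - 1] = psi[t, path[t]]
    return np.unravel_index(path, (Na, Nb))
```

Flattening costs O((Na·Nb)²) per frame, which is one reason systems like the one on the next slide need tricks to speed the search.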
Deeper Models: Iroquois (Kristjansson, Hershey et al. '06)
• Optimal inference on mixed spectra
  speaker-specific models (e.g. 512-mixture GMMs)
  Algonquin inference
  [The slide reproduces Kristjansson & Hershey's ASRU 2003 paper on high-frequency-resolution speech models: most noise lacks the harmonic structure of voiced speech, and speech power concentrates near harmonics of the fundamental, so a high-resolution log-spectrum model (rather than the usual smooth MFCC-style representation) distinguishes speech from noise better; reconstructing from the estimated clean-signal posteriors plus the noisy phases gave an 8.38 dB SNR gain at 0 dB input and, on Aurora 2 at 0 dB SNR, a 43.75% relative word-error-rate reduction over the baseline and 15.90% over the equivalent low-resolution algorithm; its Fig. 1 shows the estimate's uncertainty is largest in the valleys between harmonic peaks, where the SNR is lowest]
• .. used for the Speech Separation Challenge (Cooke / Lee '06)
  exploits grammar constraints - higher-level dynamics
  [Results chart, same-gender speakers: word error rate vs. SNR from 6 dB down to -9 dB for the SDL recognizer with no dynamics, acoustic dynamics, and grammar dynamics, compared to human listeners]

Faster Search: Fragment Decoder (Barker et al. '05)
• Match the 'uncorrupt' spectrum to ASR models using missing-data recognition (a likelihood sketch follows the next slide)
  standard classification chooses between models M to match source features X:
  M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)
  with mixtures: observed features Y and segregation S, all related by P(X | Y, S)
  easy if you know the segregation mask S
• Joint classification of model M and segregation S, maximizing:
  P(M, S | Y) = P(M) · [ ∫ P(X | M) · P(X | Y, S) / P(X) dX ] · P(S | Y)
  (isolated source model × segregation model; note P(X) is no longer constant)

CASA in the Fragment Decoder
• Using CASA cues to evaluate fragments
  P(S | Y) embodies how plausible a segregation is: is this segregation worth considering? how likely is it?
• CASA can help the search
  consider only segregations built from CASA chunks
  periodicity / harmonicity: a time-frequency region must be whole
• CASA can rate a segregation
  construct P(S | Y) to reward CASA qualities: frequency proximity, common onset, harmonicity (pitch)
  frequency bands that belong together; onset / continuity
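For the missing-data likelihood at the heart of the fragment decoder, one common form is bounded marginalization as described by Cooke, Barker and colleagues (a sketch of that idea, not necessarily the exact Barker et al. '05 formulation): reliable cells use the ordinary Gaussian density, while masked cells are integrated up to the observed mixture level, since the target energy must lie below it.

```python
import numpy as np
from scipy.stats import norm

def missing_data_loglik(y, mask, mean, var):
    """y: observed mixture log-spectrum, shape (bins,).
    mask: True where the target dominates (reliable cells).
    mean, var: (bins,) Gaussian parameters of one model state."""
    std = np.sqrt(var)
    # reliable cells: ordinary Gaussian log-likelihood
    ll_present = norm.logpdf(y[mask], mean[mask], std[mask]).sum()
    # masked cells: target lies somewhere below the mixture level,
    # so integrate the Gaussian from -inf up to y (bounded marginalization)
    ll_missing = norm.logcdf(y[~mask], mean[~mask], std[~mask]).sum()
    return ll_present + ll_missing
```

The decoder can then score each candidate segregation mask S against each model state and combine the result with P(S | Y), as in the joint-classification formula above.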
Factored Dictionaries (Gandhi & Hasegawa-Johnson '04; Radfar et al. '06)
• Separate representations for "source" (pitch) and "filter"
  N·M codewords from only N+M entries
  but: overgeneration ...
• Faster search
  direct extraction of pitches
  immediate separation of (most of) the spectra

Discriminant Models for Music (Poliner & Ellis '06)
• Transcribe piano recordings by classification
  train an SVM detector for every piano note
  88 separate detectors, independently smoothed
• Trained on player-piano recordings
  [Figure: classifier transcription of Bach BWV 847 from a Disklavier recording; pitch (A1-A6) vs. time / sec, level / dB]
• Can resynthesize from the transcript ...

Piano Transcription Results
• Significant improvement from the classifier - frame-level results:

  Table 1: Frame-level transcription results.
  Algorithm          | Errs  | False Pos | False Neg | d'
  SVM                | 43.3% | 27.9%     | 15.4%     | 3.44
  Klapuri & Ryynänen | 66.6% | 28.1%     | 38.5%     | 2.71
  Marolt             | 84.6% | 36.5%     | 48.1%     | 2.35

  [Bar chart: classification error %, broken down into false positives and false negatives]
• Overall accuracy is a frame-level version of the metric proposed by Dixon [Dixon, 2000], defined as:
  Acc = N / (FP + FN + N)
  where N is the number of correctly transcribed frames, FP is the number of unvoiced frames transcribed as voiced, and FN is the number of voiced frames transcribed as unvoiced (a worked example closes this deck)

Outline
1. Mixtures & Models
2. Human Sound Organization
3. Machine Sound Organization
4. Research Questions
   Task and Evaluation
   Generic vs. Specific

Task & Evaluation
• How do we measure separation performance?
  it depends on what you are trying to do
• SNR?
  energy (and distortions) are not created equal
  different nonlinear components [Vincent et al. '06]
• Human intelligibility?
  rare for nonlinear processing to improve intelligibility
  listening tests are expensive
• ASR performance?
  separate-then-recognize is too simplistic; ASR needs to accommodate separation
  [Schematic: net effect vs. aggressiveness of processing - reduced interference trades against increased artefacts and transmission errors, with an optimum in between]

How Many Models?
• More specific models → better separation
  need individual dictionaries for "everything"??
• Model adaptation and hierarchy
  speaker-adapted models: base + parameters
  generic → specific: pitched → piano → this piano
  [Figure, after Smith, Patterson et al. '05: vowels /a/ /e/ /i/ /o/ /u/ against mean pitch / Hz, showing the domain of normal speakers and extrapolation beyond the normal vocal-tract length ratio]
• Time scales of model acquisition
  innate / evolutionary (hair-cell tuning)
  developmental (mother-tongue phones)
  dynamic - the "slung mugs" effect; Ozerov

Summary & Conclusions
• Listeners do well at separating sound mixtures
  using signal cues (location, periodicity)
  using source-property variations
• Machines do less well
  difficult to apply enough constraints
  need to exploit signal detail
• Models capture constraints
  learn from the real world
  adapt to sources
• Separation is feasible only sometimes
  describing source properties is easier
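Closing with a worked example of the evaluation metric quoted on the Piano Transcription Results slide: frame-level accuracy Acc = N / (FP + FN + N), computed here over binary piano rolls. A minimal sketch; the array layout is an assumption.

```python
import numpy as np

def frame_accuracy(ref, est):
    """ref, est: boolean (frames, 88) piano rolls (True = note sounding)."""
    n = np.logical_and(ref, est).sum()     # correctly transcribed frames
    fp = np.logical_and(~ref, est).sum()   # unvoiced reported as voiced
    fn = np.logical_and(ref, ~est).sum()   # voiced reported as unvoiced
    return n / (n + fp + fn)
```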