Auditory Scene Analysis in Humans and Machines
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, New York, USA
dpwe@ee.columbia.edu — http://labrosa.ee.columbia.edu/

1. The ASA Problem
2. Human ASA
3. Machine Source Separation
4. Systems & Examples
5. Concluding Remarks

1. Auditory Scene Analysis
• Sounds rarely occur in isolation
  - but recognizing sources in mixtures is a problem
  - for humans and machines
[figure: spectrograms of several channels of a multi-microphone recording]

Sound Mixture Organization
• Goal: recover individual sources from scenes
  - duplicating the perceptual effect
[figure: spectrogram of a complex sound scene analyzed into sources: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Problems: competing sources, channel effects
• Dimensionality loss: need additional constraints

The Problem of Mixtures
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman'90)
• Received waveform is a mixture
  - 2 sensors, N sources: underconstrained
• Undoing mixtures: hearing's primary goal?
  - by any means available

Source Separation Scenarios
• Interactive voice systems
  - human-level understanding is expected
• Speech prostheses
  - crowds: the #1 complaint of hearing-aid users
• Archive analysis
  - identifying and isolating sound events
[figure: spectrogram of an ambient personal-audio recording]
• Unmixing / remixing / enhancement ...

How Can We Separate?
• By between-sensor differences (spatial cues)
  - "steer a null" onto a compact interfering source
• By finding a "separable representation"
  - spectral? sources are broadband but sparse
  - periodicity? maybe, for pitched sounds
  - something more signal-specific ...
• By inference (based on knowledge/models)
  - acoustic sources are redundant → use part to guess the remainder

Outline
1. The ASA Problem
2. Human ASA
  - scene analysis
  - separation by location
  - separation by source characteristics
3. Machine Source Separation
4. Systems & Examples
5. Concluding Remarks

Auditory Scene Analysis
• Listeners organize sound mixtures into discrete perceived sources
  - based on within-signal cues (audio + ...)
  - common onset + continuity
  - harmonicity
  - spatial, modulation, ...
  - learned "schema"
[figure: spectrogram of the Reynolds-McAdams oboe example]

Perceiving Sources
• Harmonics are distinct in the ear, but perceived as one source ("fused")
  - depends on common onset (a stimulus sketch follows below)
  - depends on harmonicity
• Experimental techniques
  - ask subjects "how many"
  - match attributes, e.g. pitch, vowel identity
  - brain recordings (EEG "mismatch negativity")
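To make the common-onset cue concrete, here is a minimal numpy sketch (my own illustration, not from the slides; the function name and parameter values are arbitrary) of the kind of stimulus used in these experiments: a harmonic complex in which one harmonic starts a fraction of a second early. With a sufficient onset asynchrony, listeners tend to hear that harmonic as a separate source, and its contribution to the fused complex is reduced.

```python
import numpy as np

def harmonic_complex(f0=200.0, n_harmonics=10, dur=1.0, sr=16000,
                     async_harmonic=4, onset_lead=0.24):
    """Harmonic complex in which one harmonic starts `onset_lead` s early.
    With a lead of tens to hundreds of ms, that harmonic tends to
    segregate perceptually from the rest of the complex."""
    n_lead = int(onset_lead * sr)
    n_main = int(dur * sr)
    t = np.arange(n_lead + n_main) / sr
    x = np.zeros_like(t)
    for h in range(1, n_harmonics + 1):
        tone = np.sin(2 * np.pi * h * f0 * t)
        if h == async_harmonic:
            x += tone                      # starts at time 0 (early)
        else:
            tone[:n_lead] = 0.0            # the others share a later onset
            x += tone
    # 10 ms raised-cosine ramps to avoid clicks at the outer edges
    ramp = int(0.01 * sr)
    env = np.ones_like(x)
    env[:ramp] = 0.5 * (1 - np.cos(np.pi * np.arange(ramp) / ramp))
    env[-ramp:] = env[:ramp][::-1]
    return (x * env / np.max(np.abs(x))).astype(np.float32), sr

sig, sr = harmonic_complex()   # write to a wav file or play to listen
```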
Auditory Scene Analysis  (Bregman'90; Darwin & Carlyon'95)
• How do people analyze sound mixtures?
  - break the mixture into small elements (in time-frequency)
  - elements are grouped into sources using cues
  - sources have aggregate attributes
• Grouping rules (Darwin, Carlyon, ...):
  - cues: common onset/offset/modulation, harmonicity, spatial location, ...
[diagram, after Darwin 1996: Sound → Frequency analysis → Elements → Grouping mechanism → Sources → Source properties, with onset, harmonicity and spatial maps feeding the grouping mechanism]

Streaming
• Sound event sequences are organized into streams
  - i.e. distinct perceived sources
  - it is difficult to make comparisons between streams
• Two-tone streaming experiments (a stimulus generator is sketched below):
  - tone repetition time (TRT): 60-150 ms
  - A tone at 1 kHz, Δf up to 2 octaves
  - ecological relevance?
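The streaming demonstration is easy to synthesize: the classic ABA– triplet sequence with adjustable Δf and tone repetition time. The sketch below is my own illustration using the parameter ranges quoted on the slide; the function name and defaults are arbitrary.

```python
import numpy as np

def aba_sequence(f_a=1000.0, df_octaves=0.5, trt=0.12, n_triplets=10,
                 tone_dur=0.05, sr=16000):
    """ABA- triplet sequence for two-tone streaming demos.
    f_a: A-tone frequency (Hz); df_octaves: A-B separation in octaves;
    trt: tone repetition time (onset to onset), e.g. 0.06-0.15 s."""
    f_b = f_a * 2.0 ** df_octaves
    n_tone = int(tone_dur * sr)
    n_trt = int(trt * sr)
    t = np.arange(n_tone) / sr
    ramp = int(0.005 * sr)
    env = np.ones(n_tone)
    env[:ramp] = np.linspace(0, 1, ramp)
    env[-ramp:] = np.linspace(1, 0, ramp)
    tone_a = np.sin(2 * np.pi * f_a * t) * env
    tone_b = np.sin(2 * np.pi * f_b * t) * env
    out = np.zeros(4 * n_trt * n_triplets)       # A B A - : four slots per triplet
    for i in range(n_triplets):
        base = 4 * n_trt * i
        for slot, tone in enumerate((tone_a, tone_b, tone_a)):
            start = base + slot * n_trt
            out[start:start + n_tone] += tone
    return out.astype(np.float32), sr

# Small Δf / slow rate tends to be heard as one "galloping" stream;
# large Δf / fast rate splits into separate A and B streams.
sig, sr = aba_sequence(df_octaves=1.0, trt=0.08)
```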
Illusions & Restoration
• Illusion = hearing more than is "there"
  - e.g. the "pulsation threshold" example: a tone alternates with a noise burst loud enough to mask it, and is heard to continue through the noise
  - "old-plus-new" heuristic: existing sources continue
• Need to infer the most likely real-world events
  - the observation is an equally good match to either case
  - but the prior likelihood of continuity is much higher

Human Performance: Spatial Separation  (Brungart et al.'02)
• Task: Coordinate Response Measure (CRM)
  - "Ready Baron go to green eight now"
  - 256 variants, 16 speakers
  - correct = color and number reported for "Baron"
• Accuracy as a function of spatial separation
  - A, B same speaker
  - range effect

Separation by Vocal Differences  (Brungart et al.'01)
• CRM task varying the level and voice character (same spatial location)
  - energetic vs. informational masking
  - level differences can help in speech segregation: listen for the quieter talker

Varying the Number of Voices  (Brungart et al.'01)
• Two voices OK; more than two voices much harder (same spatial origin)
  - performance drops dramatically when a 3rd or 4th talker is added to the stimulus
  - a mix of N voices tends toward speech-shaped noise ...

Outline
1. The ASA Problem
2. Human ASA
3. Machine Source Separation
  - Independent Component Analysis
  - Computational Auditory Scene Analysis
  - Model-Based Separation
4. Systems & Examples
5. Concluding Remarks

Scene Analysis Systems
• "Scene Analysis" is not necessarily separation, recognition, ...
  - scene = overlapping objects, ambiguity
• General framework:
  - distinguish input and output representations
  - distinguish engine (algorithm) and control (constraints, the "computational model")

Human and Machine Scene Analysis
• CASA (e.g. Brown'92):
  - Input: periodicity, continuity, onset "maps"
  - Output: waveform (or mask)
  - Engine: time-frequency masking
  - Control: "grouping cues" from the input, or spatial features (Roman, ...)
• ICA (Bell & Sejnowski et seq.):
  - Input: waveform (or STFT)
  - Output: waveform (or STFT)
  - Engine: cancellation
  - Control: statistical independence of the outputs, or energy minimization for beamforming
• Human listeners:
  - Input: excitation patterns ...
  - Output: percepts ...
  - Engine: ?
  - Control: find a plausible explanation

Machine Separation
• Problem: features of combinations are not combinations of features
  - a voice is easy to characterize when in isolation
  - redundancy is needed for real-world communication
[figure: spectrograms illustrating how mixing obscures the features of the individual voices]

Separation Approaches
• ICA
  - multi-channel
  - fixed filtering
  - perfect separation — maybe!
• CASA / model-based
  - single-channel
  - time-varying filtering
  - approximate separation (STFT → spectrogram → mask → inverse STFT)
• Very different approaches!

Independent Component Analysis  (Bell & Sejnowski'95; Smaragdis'98)
• Central idea: search the unmixing space to maximize the independence of the outputs
  - observations x = A s; estimate an unmixing matrix W so that ŝ = W x, e.g. by gradient ascent on ∂(MutInfo)/∂w_ij
  - simple mixing → a good solution (usually) exists

Mixtures, Scatters, Kurtosis
• Mixtures of sources become more Gaussian
  - can be measured e.g. via kurtosis (4th moment); a numerical check follows below
[figure: scatter plots and marginal densities of two sources and their mixtures; the source kurtoses (≈27.9 and 53.9) exceed those of the mixtures (≈18.5 and 27.9)]
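The claim that mixtures are "more Gaussian" than their components is easy to check numerically. The sketch below is an illustration (not the data behind the slide's scatter plots): it mixes two sparse Laplacian sources with an arbitrary matrix and compares excess kurtosis before and after mixing.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    x = x - np.mean(x)
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
# Two independent sparse ("super-Gaussian") sources, e.g. Laplacian
s = rng.laplace(size=(2, 100_000))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])           # arbitrary mixing matrix
x = A @ s                            # observed mixtures

for i in range(2):
    print(f"source {i}: kurtosis {excess_kurtosis(s[i]):5.2f}   "
          f"mixture {i}: kurtosis {excess_kurtosis(x[i]):5.2f}")
# The mixtures have lower kurtosis (closer to Gaussian) than the sources,
# which is why ICA can search for unmixing weights that maximize the
# non-Gaussianity of the outputs.
```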
ICA Limitations
• Cancellation is very finicky
  - hard to get more than ~10 dB rejection
[figure, from Parra & Spence'00: mixture scatter plot, and kurtosis as a function of the unmixing angle]
• The world is not instantaneous, fixed, or linear
  - subband models for reverberation
  - continuous adaptation
• Needs spatially compact interfering sources

Computational Auditory Scene Analysis  (Brown & Cooke'94; Okuno et al.'99; Hu & Wang'04; ...)
• Central idea: segment time-frequency into sources based on perceptual grouping cues
  - front end extracts signal features (maps): onset, period, frequency modulation, ...
  - object formation produces discrete elements, which grouping rules assemble into source groups
  - the principal cue is harmonicity

CASA Front-End Processing  (Slaney & Lyon'90)
• Correlogram: loosely based on known/possible physiology
  - cochlea-approximating linear filterbank + static nonlinearity, envelope follower, then short-time autocorrelation in each frequency channel
  - adds a third "periodicity" (lag) axis
  - the envelope of wideband channels follows pitch
  - the zero-delay slice is like the spectrogram
  - compare Modulation Filtering [Schimmel & Atlas'05]

"Weft" Periodic Elements  (Ellis'96)
• Represent harmonics without grouping?
  - correlogram → common-period feature → period track + smooth spectrum slices
  - hard to separate multiple pitch tracks

Time-Frequency (T-F) Masking
• "Local dominance" assumption
  - label each T-F cell with the source that dominates it, then resynthesize (an oracle-mask sketch follows below)
• Oracle masks are remarkably effective!
  - |mix − max(male, female)| < 3 dB for ~80% of cells
[figure: spectrograms of the original female and male voices, the mixture with oracle labels, and the oracle-based resyntheses]
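Given the premixed sources, the oracle ("ideal binary") mask is straightforward to construct. The sketch below is a minimal illustration using scipy's STFT; the function and variable names (including the example voices) are hypothetical.

```python
import numpy as np
from scipy.signal import stft, istft

def oracle_masks(sources, nperseg=512, sr=16000):
    """Ideal binary masks from premixed sources ("local dominance").
    sources: list of equal-length waveforms; returns the mixture and one
    masked resynthesis of the mixture per source."""
    mix = np.sum(sources, axis=0)
    _, _, MIX = stft(mix, fs=sr, nperseg=nperseg)
    mags = np.array([np.abs(stft(s, fs=sr, nperseg=nperseg)[2])
                     for s in sources])
    winner = np.argmax(mags, axis=0)          # dominant source per T-F cell
    outs = []
    for i in range(len(sources)):
        mask = (winner == i).astype(float)
        _, y = istft(MIX * mask, fs=sr, nperseg=nperseg)
        outs.append(y[:len(mix)])
    return mix, outs

# e.g. with two premixed voices v_male, v_female (same length, same sr):
# mix, (est_male, est_female) = oracle_masks([v_male, v_female], sr=sr)
```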
Combining Spatial + T-F Masking
• T-F masks based on inter-channel properties [Roman et al.'02; Yilmaz & Rickard'04]
  - multiple channels make CASA-like masks better
• T-F masking after ICA [Blin et al.'04]
  - cancellation can remove energy within T-F cells

CASA Limitations
• Driven by local features
  - problems with masking, aperiodic sources, ...
• Limitations of T-F masking
  - need to identify single-source regions
  - cannot undo overlaps, and leaves gaps
[figure, from Hu & Wang'04: resynthesis with gaps where the target was masked]

Auditory "Illusions"
• How do we explain illusions?
  - pulsation threshold
  - sinewave speech
  - phonemic restoration
• Something is providing the missing (illusory) pieces ... source models

Adding Top-Down Constraints  (Ellis'96)
• Bottom-up CASA is limited to what's "there"
  - front end → signal features → object formation (grouping rules) → source groups
• Top-down predictions allow illusions
  - hypotheses are managed by predicting features (noise + periodic components), comparing them with the input, and reconciling the prediction errors
  - match observations to a "world model" ...

Separation vs. Inference
• Ideal separation is rarely possible
  - i.e. no projection can completely remove the overlaps
• Overlaps → ambiguity
  - scene analysis = find the "most reasonable" explanation
• Ambiguity can be expressed probabilistically
  - i.e. posteriors of the sources {Si} given observations X:
    P({Si} | X) ∝ P(X | {Si}) · P({Si})
    (combination physics × source models)
• Better source models → better inference
  - learn from examples?

Simple Source Separation  (Roweis'01,'03; Kristjansson'04,'06)
• Given models for the sources, find the "best" (most likely) states for each spectrum (a sketch follows below):
  - combination model: p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)
  - inference: {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)
  - the source state can include sequential constraints ...
  - different domains can be used for combining the c's and for defining the noise
• E.g. in stationary, speech-shaped noise:
[figure: original speech, the noisy mel spectrogram (mel magnitude SNR 2.41 dB), and the VQ-inferred states (mel magnitude SNR 3.6 dB)]
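Under the combination model above, per-frame inference is just an exhaustive search over codeword pairs. The sketch below is a minimal illustration (not Roweis's or Kristjansson's actual systems): it assumes log-power-spectrum codebooks, combines codewords in the power domain, uses a spherical Gaussian in place of Σ, and omits sequential constraints; all names and parameters are illustrative.

```python
import numpy as np

def factorial_vq_separate(X, C1, C2, sigma=1.0):
    """X:  (T, F) mixture log-power spectra (frames x frequency bins)
    C1: (K1, F) codebook of log-power spectra for source 1
    C2: (K2, F) codebook for source 2
    Scores every (i1, i2) codeword pair per frame under a spherical
    Gaussian and returns the best state pair plus a binary mask that
    assigns each T-F cell to whichever codeword dominates."""
    # (K1, K2, F): log of the sum of the two codeword powers
    comb = np.logaddexp(C1[:, None, :], C2[None, :, :])
    states, mask = [], []
    for x in X:
        err = np.sum((x - comb) ** 2, axis=-1) / (2 * sigma ** 2)
        i1, i2 = np.unravel_index(np.argmin(err), err.shape)
        states.append((i1, i2))
        mask.append(C1[i1] > C2[i2])      # cells where source 1 dominates
    return states, np.array(mask)
```

The mask can then be applied to the mixture STFT and inverted, as in the earlier T-F masking sketch; a real system would add state transition (sequential) constraints and a gain term.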
Can Models Do CASA?
• Source models can learn harmonicity, onset, ...
  - ... and thereby subsume the rules/representations of CASA
  - they can capture spatial information too [Pearlmutter & Zador'04]
[figure: an 800-entry VQ codebook of spectra, trained with a linear distortion measure]
• They can also capture sequential structure
  - e.g. consonants follow vowels
  - ... like people do?
• But: need source-specific models
  - ... for every possible source
  - use model adaptation? [Ozerov et al.'05]

Separation or Description?
• Are isolated waveforms required?
  - clearly sufficient, but may not be necessary
  - not part of perceptual source separation!
• Integrate separation with the application? e.g. speech recognition
  - separate: t-f masking + resynthesis to identify the target energy, then ASR with speech models → words
  - or identify: find the best words directly, using speech models and source knowledge → words
  - output = an abstract description of the signal

Missing Data Recognition  (Cooke et al.'01)
• Speech models p(x|M) are multidimensional ...
  - i.e. a mean and variance for every frequency channel
  - so values are needed for all dimensions to evaluate p(•)
• But: inferences can be made over just a subset of dimensions x_k (the "present data" mask):
  p(x_k | M) = ∫ p(x_k, x_u | M) dx_u
  - in practice P(x | q) is a product of per-dimension terms, evaluated only over the reliable cells (a sketch of the per-state computation follows below)
• The hard part is finding the mask, i.e. the segregation
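For a diagonal-Gaussian state the missing-data likelihood has a simple closed form. The sketch below is my own illustration of the per-state computation, assuming a binary "present-data" mask and bounded marginalization (the observed energy in an unreliable cell is treated as an upper bound on the clean value); the names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def missing_data_loglik(x, mask, mu, var):
    """Log p(x_reliable | state) for one diagonal-Gaussian state.
    x:    (F,) observed (possibly corrupted) log-spectrum
    mask: (F,) boolean, True where the cell is reliable (target-dominated)
    mu, var: (F,) state mean and variance.
    Reliable dims use the usual Gaussian density; unreliable dims use
    bounded marginalization, integrating the Gaussian from -inf up to
    the observed value (its log-CDF)."""
    sd = np.sqrt(var)
    ll_reliable = norm.logpdf(x, mu, sd)
    ll_bounded = norm.logcdf(x, mu, sd)     # P(clean value <= observation)
    return np.sum(np.where(mask, ll_reliable, ll_bounded))

# The recognizer accumulates this over frames and states as usual; as the
# slide notes, the hard part is finding the mask (the segregation).
```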
The Speech Fragment Decoder  (Barker et al.'05)
• Standard classification chooses between models M to match the source features X:
  M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)
• With mixtures we have observed features Y and a segregation S, all related by P(X | Y, S)
  - match the "uncorrupted" spectrum X to the ASR models using missing data
• Joint classification of model and segregation:
  P(M, S | Y) = P(M) · ∫ P(X | M) · P(X | Y, S) / P(X) dX · P(S | Y)
  - isolated source model × segregation model
  - P(X) is no longer a constant
• Joint search over M and S to maximize this posterior

Using CASA Cues
• Using CASA features: P(S | Y) brings acoustic information into the segregation
  - is this segregation worth considering? how likely is it?
• CASA can help the search
  - consider only segregations built from CASA chunks
• CASA can rate a segregation
  - construct P(S | Y) to reward CASA qualities: frequency proximity, common onset, harmonicity
• Tuning for CASA-style local features
  - periodicity/harmonicity: frequency bands belong together
  - onset/continuity: a time-frequency region must be whole

Outline
1. The ASA Problem
2. Human ASA
3. Machine Source Separation
4. Systems & Examples
  - periodicity-based
  - model-based
  - music signals
5. Concluding Remarks

Current CASA  (Hu & Wang'03)
• State-of-the-art bottom-up separation
  - noise-robust pitch track
  - label T-F cells by pitch
  - extensions to unvoiced transients by onset
[figure: spectrograms of the mixture before and after segregation]

Prediction-Driven CASA  (Ellis'96)
• Identify objects in real-world scenes using "sound elements"
[figure: a city-ambience scene decomposed into weft, noise and click elements, with listener ratings such as Horn1 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10)]

Singing Voice Separation  (Avery Wang'94)
• Pitch tracking + harmonic separation
[figure: spectrograms of the mixture and the extracted voice]

Periodic/Aperiodic Separation  (Virtanen'03)
• Harmonic structure + repetition of the drums
  - (a crude pitch-based harmonic/aperiodic split is sketched below)
[figure: spectrograms of the mixture, the harmonic part and the drum part]
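As a rough single-channel illustration of the pitch-based separations above (this is not Wang's or Virtanen's actual algorithm, just a generic sketch with made-up parameter choices), one can track a per-frame f0 by autocorrelation and keep only the STFT bins near its harmonics; the residual then holds the aperiodic (e.g. drum) part.

```python
import numpy as np
from scipy.signal import stft, istft

def harmonic_mask_separate(x, sr, fmin=80.0, fmax=400.0, nperseg=1024,
                           tol_hz=40.0):
    """Crude pitch-based harmonic/aperiodic split of a single channel.
    Estimates one f0 per frame by autocorrelation, keeps bins within
    tol_hz of multiples of f0 (harmonic part); the rest is aperiodic."""
    f, t, X = stft(x, fs=sr, nperseg=nperseg, boundary=None, padded=False)
    mask = np.zeros_like(X, dtype=float)
    hop = nperseg // 2
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    for j in range(X.shape[1]):
        frame = x[j * hop : j * hop + nperseg]
        ac = np.correlate(frame, frame, mode='full')[nperseg - 1:]
        if ac[0] <= 0:
            continue                         # silent frame
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        if ac[lag] < 0.3 * ac[0]:
            continue                         # weak periodicity: treat as unvoiced
        f0 = sr / lag
        for h in np.arange(f0, f[-1], f0):   # mark bins near each harmonic
            mask[np.abs(f - h) < tol_hz, j] = 1.0
    _, harmonic = istft(X * mask, fs=sr, nperseg=nperseg, boundary=False)
    _, aperiodic = istft(X * (1 - mask), fs=sr, nperseg=nperseg, boundary=False)
    return harmonic[:len(x)], aperiodic[:len(x)]
```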
"Speech Separation Challenge"
• A mixed and noisy speech ASR task defined by Martin Cooke and Te-Won Lee
  - short, grammatically constrained utterances:
    <command:4> <color:4> <preposition:4> <letter:25> <number:10> <adverb:4>
    e.g. "bin white at M 5 soon"
  - results to be presented at Interspeech'06
  - http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm
• See also the "Statistical And Perceptual Audition" workshop
  - http://www.sapa2006.org/

IBM's "Superhuman" Separation  (Kristjansson et al., Interspeech'06)
• Optimal inference on mixed spectra
  - model each speaker with a 512-mixture GMM in a high-frequency-resolution log-spectrum domain
[embedded page: first page of a related IBM paper on exploiting the harmonic structure of speech for enhancement and robust recognition (ASRU 2003), including a plot of a noisy spectrum, the clean spectrum, and the clean-speech estimate with its uncertainty]
• Applied to the Speech Separation Challenge:
  - infer the speakers and gain
  - reconstruct the speech
  - recognize as normal ...
• Use grammar constraints (the SDL recognizer)
[figure: word error rate for same-gender speaker mixtures at target-to-masker ratios from 6 dB down to −9 dB, comparing no dynamics, acoustic dynamics and grammar dynamics against human listeners]

Transcription as Separation
• Transcribe piano recordings by classification
  - train SVM detectors for every piano note
  - 88 separate detectors, independent smoothing
• Trained on player-piano recordings
[figure: note-detector output for Bach 847 on a Disklavier]
• Use the transcription to resynthesize ...

Piano Transcription Results
• Significant improvement from the classifier — frame-level results:
  Algorithm            Errs    False Pos   False Neg   d'
  SVM                  43.3%   27.9%       15.4%       3.44
  Klapuri & Ryynänen   66.6%   28.1%       38.5%       2.71
  Marolt               84.6%   36.5%       48.1%       2.35
• Overall accuracy is a frame-level version of the metric proposed by Dixon [Dixon, 2000]:
  Acc = N / (FP + FN + N)
  where N is the number of correctly transcribed frames, FP the number of unvoiced frames transcribed as voiced, and FN the number of voiced frames that were missed
[figure: breakdown of false positives, false negatives and classification errors by the number of simultaneous notes]
• http://labrosa.ee.columbia.edu/projects/melody/

Outline
1. The ASA Problem
2. Human ASA
3. Machine Source Separation
4. Systems & Examples
5. Concluding Remarks
  - evaluation

Evaluation
• How to measure separation performance?
  - depends what you are trying to do
• SNR?
  - energy (and distortions) are not created equal
  - better to distinguish the different nonlinear components [Vincent et al.'06] (a projection-based sketch follows below)
• Intelligibility?
  - it is rare for nonlinear processing to improve intelligibility
  - listening tests are expensive
• ASR performance?
  - separate-then-recognize is too simplistic; the ASR needs to accommodate the separation
[figure: net effect of processing as a trade-off between reduced interference and increased artefacts/transmission errors, with an optimum at an intermediate aggressiveness of processing]
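One step beyond raw SNR is to project the estimate onto the true source and measure the residual, in the spirit of the measures of [Vincent et al. '06]. The sketch below is a simplified illustration of such a projection-based signal-to-distortion ratio; the full BSS_EVAL measures further split the residual into interference, noise and artefact terms.

```python
import numpy as np

def sdr(estimate, reference):
    """Projection-based signal-to-distortion ratio in dB.
    The estimate is split into the component collinear with the true
    source and everything else; pure gain differences therefore do not
    hurt, unlike a plain SNR on the raw waveforms."""
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference            # part explained by the source
    distortion = estimate - target        # interference + noise + artefacts
    return 10 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))

# e.g. sdr(separated_voice, clean_voice)  -- higher is better
```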
Evaluating Scene Analysis
• Need to establish ground truth
  - subjective?
  - what are the "true" sources in real sound mixtures?

More Realistic Evaluation
• Real-world speech tasks
  - crowded environments
  - applications: communication, command/control, transcription
• Metric?
  - human intelligibility?
  - "diarization"-style annotation (not transcription)
[figure: personal-audio recording of speech + noise, with pitch track and speaker-activity ground truth]

Summary & Conclusions
• Listeners do well at separating sound mixtures
  - using signal cues (location, periodicity)
  - using source-property variations
• Machines do less well
  - it is difficult to apply enough constraints
  - need to exploit signal detail
• Models capture constraints
  - learn from the real world
  - adapt to the sources
• Separation is feasible in certain domains
  - describing source properties is easier

Sources / See Also
• NSF/AFOSR Montreal Workshops '03, '04
  - www.ebire.org/speechseparation/
  - labrosa.ee.columbia.edu/Montreal2004/
  - as well as the resulting book ...
• Hanse meeting:
  - www.lifesci.sussex.ac.uk/home/Chris_Darwin/Hanse/
• DeLiang Wang's ICASSP'04 tutorial
  - www.cse.ohio-state.edu/~dwang/presentation.html
• Martin Cooke's NIPS'02 tutorial
  - www.dcs.shef.ac.uk/~martin/nips.ppt

References 1/2
[Barker et al. '05] J. Barker, M. Cooke, D. Ellis, "Decoding speech in the presence of other sources," Speech Communication 45, 5-25, 2005.
[Bell & Sejnowski '95] A. Bell & T. Sejnowski, "An information maximization approach to blind separation and blind deconvolution," Neural Computation 7, 1129-1159, 1995.
[Blin et al. '04] A. Blin, S. Araki, S. Makino, "A sparseness mixing matrix estimation (SMME) solving the underdetermined BSS for convolutive mixtures," ICASSP, IV-85-88, 2004.
[Bregman '90] A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
[Brungart '01] D. Brungart, "Informational and energetic masking effects in the perception of two simultaneous talkers," JASA 109(3), March 2001.
[Brungart et al. '01] D. Brungart, B. Simpson, M. Ericson, K. Scott, "Informational and energetic masking effects in the perception of multiple simultaneous talkers," JASA 110(5), Nov. 2001.
[Brungart et al. '02] D. Brungart & B. Simpson, "The effects of spatial separation in distance on the informational and energetic masking of a nearby speech signal," JASA 112(2), Aug. 2002.
[Brown & Cooke '94] G. Brown & M. Cooke, "Computational auditory scene analysis," Computer Speech & Language 8(4), 297-336, 1994.
[Cooke et al. '01] M. Cooke, P. Green, L. Josifovski, A. Vizinho, "Robust automatic speech recognition with missing and uncertain acoustic data," Speech Communication 34, 267-285, 2001.
[Cooke '06] M. Cooke, "A glimpsing model of speech perception in noise," submitted to JASA.
[Darwin & Carlyon '95] C. Darwin & R. Carlyon, "Auditory grouping," Handbook of Perception & Cognition 6: Hearing, 387-424, Academic Press, 1995.
[Ellis '96] D. Ellis, "Prediction-Driven Computational Auditory Scene Analysis," Ph.D. thesis, MIT EECS, 1996.
[Hu & Wang '04] G. Hu & D.L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Networks 15(5), Sep. 2004.
[Okuno et al. '99] H. Okuno, T. Nakatani, T. Kawabata, "Listening to two simultaneous speeches," Speech Communication 27, 299-310, 1999.

References 2/2
[Ozerov et al. '05] A. Ozerov, P. Phillippe, R. Gribonval, F. Bimbot, "One microphone singing voice separation using source-adapted models," Worksh. on Apps. of Sig. Proc. to Audio & Acoust., 2005.
[Pearlmutter & Zador '04] B. Pearlmutter & A. Zador, "Monaural source separation using spectral cues," Proc. ICA, 2005.
[Parra & Spence '00] L. Parra & C. Spence, "Convolutive blind source separation of non-stationary sources," IEEE Trans. Speech & Audio, 320-327, 2000.
[Reyes et al. '03] M. Reyes-Gómez, B. Raj, D. Ellis, "Multi-channel source separation by beamforming trained with factorial HMMs," Worksh. on Apps. of Sig. Proc. to Audio & Acoust., 13-16, 2003.
[Roman et al. '02] N. Roman, D.-L. Wang, G. Brown, "Location-based sound segregation," ICASSP, I-1013-1016, 2002.
[Roweis '03] S. Roweis, "Factorial models and refiltering for speech separation and denoising," EuroSpeech, 2003.
[Schimmel & Atlas '05] S. Schimmel & L. Atlas, "Coherent envelope detection for modulation filtering of speech," ICASSP, I-221-224, 2005.
[Slaney & Lyon '90] M. Slaney & R. Lyon, "A perceptual pitch detector," ICASSP, 357-360, 1990.
[Smaragdis '98] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Intl. Workshop on Independence & Artificial Neural Networks, Tenerife, Feb. 1998.
[Seltzer et al. '02] M. Seltzer, B. Raj, R. Stern, "Speech recognizer-based microphone array processing for robust hands-free speech recognition," ICASSP, I-897-900, 2002.
[Varga & Moore '90] A. Varga & R. Moore, "Hidden Markov Model decomposition of speech and noise," ICASSP, 845-848, 1990.
[Vincent et al. '06] E. Vincent, R. Gribonval, C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Speech & Audio, in press.
[Yilmaz & Rickard '04] O. Yilmaz & S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Sig. Proc. 52(7), 1830-1847, 2004.