Sound, Mixtures, and Learning

Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/

Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
3 Speech models and knowledge
4 Sound mixture recognition
5 Learning opportunities

1 Human sound organization

[Figure: spectrogram (0-4 kHz, 0-12 s, level -60..0 dB) of a complex sound scene, analyzed into events: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]

• Analyzing and describing complex sounds:
- continuous sound mixture → distinct events
• Hearing is ecologically grounded
- reflects 'natural scene' properties
- subjective, not canonical (ambiguity)
- mixture analysis as the primary goal

Sound mixtures

• The sound 'scene' is almost always a mixture
- there is always stuff going on
- sound is 'transparent'
- but the energy range is large

[Figure: spectrograms (0-4 kHz, 0-1 s) of Speech, Noise, and their sum Speech + Noise]

• We need information related to our 'world model'
- i.e. separate objects
- a wolf howling in a blizzard is the same as a wolf howling in a rainstorm
- whole-signal statistics won't capture this
• 'Separateness' is similar to independence
- objects/sounds that change in isolation
- but it depends on the situation: e.g. a passing car vs. a mechanic's diagnosis

Auditory scene analysis (Bregman 1990)

• How do people analyze sound mixtures?
- break the mixture into small elements (in time-frequency)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping 'rules' (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...

[Figure: frequency analysis feeding onset, harmonicity, and position maps, combined by a grouping mechanism into source properties (after Darwin, 1996)]

Cues to simultaneous grouping

• Elements + attributes

[Figure: spectrogram (0-8 kHz, 0-9 s) of a sequence of sound events]

• Common onset
- simultaneous energy has a common source
• Periodicity
- energy in different bands with the same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...

The effect of context

• Context can create an 'expectation', i.e. a bias towards a particular interpretation
• e.g. Bregman's "old-plus-new" principle:
a change in a signal will be interpreted as an added source whenever possible

[Figure: spectrograms (0-2 kHz, 0-1.2 s) showing a different division of the same energy depending on what preceded it]

Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
- sound source separation
- bottom-up models
- top-down constraints
3 Speech models and knowledge
4 Sound mixture recognition
5 Learning opportunities

2 Computational Auditory Scene Analysis (CASA)

[Figure: CASA system turning an input mixture into Object 1/2/3... descriptions]

• Goal: automatic sound organization;
systems to 'pick out' the sounds in a mixture
- ... like people do
• E.g. a voice against a noisy background
- to improve speech recognition
• Approach:
- psychoacoustics describes grouping 'rules'
- ... just implement them? (a minimal sketch of one such rule follows)
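As a taste of what 'just implementing' a grouping rule might look like, here is a minimal sketch (not from the talk) of a common-onset map: half-wave-rectified frame-to-frame level rises in each frequency channel. The function name, input layout, and threshold are illustrative assumptions.

    import numpy as np

    def onset_map(log_spec, threshold_db=3.0):
        # log_spec: (freq_channels, time_frames) log-magnitude spectrogram in dB.
        # Common-onset cue: energy that starts abruptly at the same instant in
        # many channels probably belongs to a single new source.
        rise = np.diff(log_spec, axis=1)    # frame-to-frame level change
        rise = np.maximum(rise, 0.0)        # half-wave rectify: keep rises only
        return rise > threshold_db          # binary map of onset elements

    # Columns with many simultaneous True cells mark moments where
    # time-frequency elements could be grouped by common onset.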
CASA front-end processing

• Correlogram: loosely based on known/possible physiology

[Figure: sound → cochlea filterbank → frequency channels → envelope follower → delay line → short-time autocorrelation, giving a correlogram slice (frequency x lag) evolving through time]

- a linear filterbank approximates the cochlea, followed by a static nonlinearity
- the zero-delay slice is like a spectrogram
- periodicity comes from delay-and-multiply detectors

The representational approach (Brown & Cooke 1993)

• Implement psychoacoustic theory
- 'bottom-up' processing: input mixture → front end (feature maps: frequency, onset time, period, frequency modulation) → object formation (discrete objects) → grouping rules → source groups
- uses the common-onset and periodicity cues
• Able to extract voiced speech:

[Figure: spectrograms (100-3000 Hz, 0-1 s) of the mixture (brn1h.aif) and the extracted voice (brn1h.fi.aif)]

Problems with 'bottom-up' CASA

• Circumscribing time-frequency elements
- we need 'regions', but they are hard to find
• Periodicity is the primary cue
- how should aperiodic energy be handled?
• Resynthesis is via masked filtering
- cannot separate within a single time-frequency element
• Bottom-up processing leaves no room for ambiguity or context
- how can illusions be modeled?

Restoration in sound perception

• Auditory 'illusions' = hearing what's not there
• The continuity illusion

[Figure: spectrogram (0-4 kHz, 0-1.4 s) of a tone alternating with noise bursts, heard as continuing through them]

• Sine-wave speech (SWS)

[Figure: auditory (Bark-scale) spectrogram (0-1.8 s) of a sine-wave speech utterance]

- duplex perception

Adding top-down constraints

• Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up):
- input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups
- objects irresistibly appear
• vs. prediction-driven (top-down):
- input mixture → front end (signal features) → compare & reconcile (prediction errors) → hypothesis management → hypotheses (noise components, periodic components) → predict & combine (predicted features), fed back to the comparison
- match observations against the parameters of a world model
- needs world-model constraints...

Aside: optimal techniques (ICA, ABF) (Bell & Sejnowski etc.)

• General idea: drive a parameterized separation algorithm to maximize the independence of its outputs

[Figure: mixtures m1, m2 unmixed through a 2x2 matrix {a11 a12; a21 a22} to give s1, s2, with the weights adapted by -δMutInfo/δa]

• Attractions:
- mathematically rigorous, minimal assumptions
• Problems:
- limitations of the separation algorithm (N x N)
- essentially bottom-up

Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
3 Speech models and knowledge
- automatic speech recognition
- subword states
- cepstral coefficients
4 Sound mixture recognition
5 Learning opportunities

3 Speech models & knowledge

• Standard speech recognition:
sound → feature calculation → feature vectors → acoustic classifier → phone probabilities → HMM decoder → phone/word sequence → understanding/application
- the acoustic model parameters, word models (e.g. s-ah-t), and language model (p("sat"|"the","cat"), p("saw"|"the","cat")) constrain the decoder
• 'State of the art' word-error rates (WERs):
- ~2% (dictation)
- ~30% (telephone conversations)
(a toy sketch of the decoder's core search follows)
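To make the HMM decoder concrete, here is a minimal sketch (assumed for illustration, not the talk's code) of the Viterbi search at its core: per-frame acoustic log-likelihoods are combined with state priors and transition log-probabilities to find the single best state sequence.

    import numpy as np

    def viterbi(frame_loglik, log_trans, log_prior):
        # frame_loglik: (T, Q) log p(x_t | q) from the acoustic classifier.
        # log_trans: (Q, Q) transition log-probabilities (dictionary/grammar).
        # log_prior: (Q,) initial-state log-probabilities.
        T, Q = frame_loglik.shape
        delta = log_prior + frame_loglik[0]      # best score ending in each state
        back = np.zeros((T, Q), dtype=int)       # best-predecessor pointers
        for t in range(1, T):
            scores = delta[:, None] + log_trans  # scores[i, j]: i at t-1 → j at t
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + frame_loglik[t]
        path = [int(delta.argmax())]             # trace back from the best end state
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]                        # most likely state sequence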
Speech units

• Speech is highly variable, so simple templates won't do
- spectral variation (voice quality)
- time-warp problems
• Match short segments (states), allowing repeats
- model with pseudo-stationary slices of ~10 ms

[Figure: spectrogram (0-4 kHz, 0-0.45 s) aligned to the state sequence sil g w eh n sil]

• Speech models are distributions p(X|q)

Speech features: cepstra

• Idea: decorrelate & summarize spectral slices:
X_m[l] = IDFT{ log S[mH, k] }
- easier to model

[Figure: auditory spectrum vs. cepstral features over ~150 frames, the features' near-diagonal covariance matrix, and an example joint distribution of coefficients (10, 15)]

- C0 'normalizes out' the average log energy
• Decorrelated pdfs fit diagonal Gaussians
- the DCT is close to PCA for log spectra

Acoustic model training

• Goal: describe p(X|q) with e.g. GMMs

[Figure: labeled training examples {x_n, ω_n} are sorted according to class, and a conditional pdf such as p(x|ω1) is estimated for each class]

• Training-data labels come from:
- manual phonetic annotation
- the 'best path' from an earlier classifier (Viterbi)
- EM: joint estimation of labels & pdfs

HMM decoding

• Feature vectors cannot be reliably classified into phonemes

[Figure: spectrogram (0-4 kHz, 0-2.5 s) of "we do have just a few days left...", with the word-level and recognized phone-level alignments]

• Use top-down constraints to get good results:
- allowable phonemes
- a dictionary of known words
- a grammar of possible sentences
• The decoder searches all possible state sequences
- at least notionally; pruning makes it practical

Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
3 Speech models and knowledge
4 Sound mixture recognition
- feature invariance
- mixtures including speech
- general mixtures
5 Learning opportunities

4 Sound mixture recognition

• The biggest problem in speech recognition is background-noise interference
• The feature-invariance approach:
- use features that reflect only the speech
- e.g. normalization, mean subtraction
- but: what about non-static noise?
• Or: use more complex models of the signal
- HMM decomposition
- missing-data recognition
• Generalize to other, multiple sounds

Feature normalization

• Idea: capture feature variations, not absolute level
• Hence: calculate the average level & subtract it:
X[k] = S[k] − mean{ S[k] }
• This factors out a fixed channel frequency response:
s[n] = h[n] * e[n]  ⇒  log S[k] = log H[k] + log E[k]
• Normalize the variance as well, to handle added noise?

[Figure: spectrograms and cepstra (0-2.5 s) of the same utterance clean and in noise at 5 dB SNR]

(a minimal sketch of this normalization follows)
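Here is a minimal sketch of that normalization, assuming the features arrive as a (frames x coefficients) cepstral matrix; the variance option mirrors the slide's open question rather than an established fix.

    import numpy as np

    def cepstral_mean_norm(cepstra, normalize_variance=False):
        # cepstra: (T, D) array, one D-dimensional cepstral vector per frame.
        # Since log S[k] = log H[k] + log E[k], subtracting the utterance mean
        # cancels any fixed channel term log H[k].
        out = cepstra - cepstra.mean(axis=0)
        if normalize_variance:
            # tentative: scale each coefficient to unit variance against noise
            out = out / (cepstra.std(axis=0) + 1e-8)
        return out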
HMM decomposition (e.g. Varga & Moore 1991, Roweis 2000)

• The total-signal model has independent state sequences for two or more component sources

[Figure: two HMM state lattices (model 1 and model 2) evolving jointly against the observations through time]

• New combined state space q' = {q1, q2}
- new observation pdfs for each combination: p(X | q1, q2)

Problems with HMM decomposition

• The combined state space is exponentially large: N sources of q states each give q^N joint states
• Normalization no longer holds!
- each source has a different gain → model at various SNRs?
- models typically don't use the overall energy C0
- each source has a different channel H[k]
• Modeling every possible sub-state combination is inefficient, inelegant, and impractical

Missing-data recognition (Cooke, Green, Barker @ Sheffield)

• Energy overlaps in time-frequency hide features
- some observations are effectively missing
• Use missing-feature theory:
- integrate over the missing data x_m under model M:
p(x | M) = ∫ p(x_p | x_m, M) p(x_m | M) dx_m
(a minimal numerical sketch of this marginalization appears at the end of this part)
• Effective in speech recognition:

[Figure: spectrogram of "1754" + noise and the data mask from a stationary-noise estimate; WER on AURORA 2000 Test Set A from -5 dB SNR to clean, comparing missing-data recognition (discrete- and soft-SNR masks) against HTK trained on clean and multi-condition data]

• Problem: finding the missing-data mask

Maximum-likelihood data mask (Jon Barker @ Sheffield)

• Search over sound-fragment interpretations

[Figure: "1754" + noise → mask from a stationary-noise estimate → mask split into subbands → 'grouping' applied within bands (spectro-temporal proximity, common onset/offset) → multisource decoder]

• The decoder searches over the data mask K as well:
p(M, K | x) ∝ p(x | K, M) p(K | M) p(M)
- how to estimate p(K)?

Multi-source decoding

• Search for more than one source

[Figure: observations O(t) explained by state sequences q1(t) and q2(t) through masks K1(t) and K2(t)]

• Mutually-dependent data masks
• Use CASA processing to propose masks
- locally coherent regions
- p(K|q)
• Theoretical vs. practical limits

General sound mixtures

• Search for a generative explanation:

[Figure: generation model: source models M1 and M2 (with model dependence Θ) produce source signals Y1 and Y2, which pass through channels C1 and C2 to give received signals X1 and X2, combining into the observations O. Analysis structure: observations O → fragment formation → mask allocation {Ki} → model fitting {Mi} → likelihood evaluation p(X|Mi,Ki)]

Outline
1 Human sound organization
2 Computational Auditory Scene Analysis
3 Speech models and knowledge
4 Sound mixture recognition
5 Opportunities for learning
- learnable aspects of modeling
- tractable decoding
- some examples

5 Opportunities for learning

• Per-model feature distributions P(Y|M)
- e.g. by analyzing isolated-sound databases
• Channel modifications P(X|Y)
- e.g. by comparing multi-microphone recordings
• Signal combinations P(O|{Xi})
- determined by acoustics
• Patterns of model combinations P({Mi})
- loose dependence between sources
- short-term structure: repeating events
• Search for the most likely explanations:
P({Mi} | O) ∝ P(O | {Xi}) P({Xi} | {Mi}) P({Mi})
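Before turning to source models, here is the minimal numerical sketch of missing-data marginalization promised above. For a diagonal-Gaussian state model, integrating out the missing dimensions x_m simply drops them from the product, leaving a likelihood over the present (reliable) dimensions alone; the interface is an assumption for illustration.

    import numpy as np

    def missing_data_loglik(x, present, mean, var):
        # x, mean, var: (D,) feature vector and diagonal-Gaussian parameters.
        # present: (D,) boolean mask marking the reliable (unmasked) dimensions.
        # For a diagonal Gaussian, p(x | M) = ∫ p(x_p | x_m, M) p(x_m | M) dx_m
        # reduces to the product of the per-dimension Gaussians on x_p alone.
        xp, mp, vp = x[present], mean[present], var[present]
        return -0.5 * np.sum(np.log(2 * np.pi * vp) + (xp - mp) ** 2 / vp)

    # A missing-data recognizer scores each state with this in place of the
    # full-vector likelihood, given a mask of the cells the noise has hidden.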
Source models

• The speech recognition lesson: use the data as much as possible
- what can we do with unlimited data feeds?
• Data sources
- clean-data corpora
- identifying near-clean segments in real sound
• Model types
- templates
- parametric/constraint models
- HMMs

What are the HMM states?

• No sub-units are defined for nonspeech sounds
• The final states depend on the EM initialization:
- labels
- clusters
- the transition matrix
• We have ideas of what we'd like to get
- investigate features/initialization to get there

[Figure: spectrogram (0-8 kHz, ~1.15-2.25 s) of dog barks (dogBarks2) with the learned state sequence (s7 s3 s2 s4 ...) overlaid]

Tractable decoding

• A speech decoder notionally searches all states
• Parametric models give an infinite space
- need closed-form partial explanations
- examine the residual, iterate, converge
• Need general cues to get started
- return to auditory scene analysis: onsets, harmonic patterns
- then parametric fitting
• Need multiple-hypothesis search, pruning, and efficiency tricks
• Learning? Parameters for new source events
- e.g. from artificial (hence labeled) mixtures

Example: alarm sound detection

• Alarm sounds have a particular structure
- people 'know them when they hear them'
• Isolate alarms in sound mixtures
- sinusoid peaks have invariant properties (a crude sketch of this cue appears after the summary)

[Figure: spectrograms (0-5 kHz, 1-2.5 s) of alarms in three mixtures; below, speech + alarm (0-4 kHz, 0-9 s) with the detector's Pr(alarm) trace]

• Learn the model parameters from examples

Example: music transcription (e.g. Masataka Goto)

• High-quality training material: synthesizer sample kits
• Ground truth available: musical scores
• Find ML explanations for scores
- guided by multiple-pitch tracking (hypothesis search)
• Applications in similarity matching

Summary

• Sound contains lots of information
... but it's always mixed up
• Psychologists describe auditory scene analysis
... but bottom-up computer models don't work
• Speech recognition works for isolated speech
... by exploiting top-down context constraints
• Speech in mixtures via multiple-source models
... practical combinatorics are the main problem
• Generalize this idea to all sounds
... need models of 'all sounds'
... plus models of channel modification
... plus ways to propose segmentations
... plus missing-data recognition
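And the crude sketch of the alarm 'invariant sinusoid peak' cue promised above; the statistic and tolerance are illustrative assumptions, not the detector from the talk (which learns its parameters from examples).

    import numpy as np

    def stable_peak_fraction(log_spec, tol_bins=1):
        # log_spec: (freq_bins, time_frames) log-magnitude spectrogram.
        # Steady-tone alarms keep their dominant spectral peak (nearly) fixed,
        # unlike speech, whose peaks move with the pitch and formants.
        peaks = log_spec.argmax(axis=0)                # dominant bin per frame
        steady = np.abs(np.diff(peaks)) <= tol_bins    # does the peak hold still?
        return steady.mean()                           # near 1.0 for beep-like tones

    # Thresholding this fraction over a sliding window gives a toy Pr(alarm) trace.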
Further reading

[BarkCE00] J. Barker, M.P. Cooke & D. Ellis (2000). "Decoding speech in the presence of other sound sources," Proc. ICSLP-2000, Beijing. ftp://ftp.icsi.berkeley.edu/pub/speech/papers/icslp00-msd.pdf

[Breg90] A.S. Bregman (1990). Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press.

[Chev00] A. de Cheveigné (2000). "The auditory system as a separation machine," Proc. Intl. Symposium on Hearing. http://www.ircam.fr/pcm/cheveign/sh/ps/ATReats98.pdf

[CookE01] M. Cooke & D. Ellis (2001). "The auditory organization of speech and other sources in listeners and computational models," Speech Communication (accepted for publication). http://www.ee.columbia.edu/~dpwe/pubs/tcfkas.pdf

[Ellis99] D.P.W. Ellis (1999). "Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis...," Speech Communication 27. http://www.ee.columbia.edu/~dpwe/pubs/spcomcasa98.pdf

[Roweis00] S. Roweis (2000). "One microphone source separation," Proc. NIPS 2000. http://www.ee.columbia.edu/~dpwe/papers/roweis-nips2000.pdf