Scene Analysis for Speech and Audio Recognition

Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/

Outline
1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
3. Recognizing Speech in Noise
4. Using Models in Parallel
5. The Listening Machine

1. Sound, Mixtures & Learning

[Figure: spectrogram of a 12-second everyday sound scene, 0-4 kHz, level in dB]

• Sound
  - carries useful information about the world
  - complements vision
• Mixtures
  - are the rule, not the exception
  - the acoustic medium is 'transparent', so many sources are heard at once
  - mixtures must be handled!
• Learning
  - the speech recognition lesson: let the data do the work
  - ... as listeners do

The problem with recognizing mixtures

"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman 1990)

• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
  - ... as listeners do
• Hearing is ecologically grounded
  - it reflects natural scene properties, which act as constraints
  - it is subjective, not absolute

Auditory Scene Analysis (Bregman 1990)

• How do people analyze sound mixtures?
  - break the mixture into small elements (in time-frequency)
  - group the elements into sources using cues
  - sources then have aggregate attributes
• Grouping 'rules' (Darwin, Carlyon, ...)
  - cues: common onset/offset/modulation, harmonicity, spatial location, ...

[Diagram, after Darwin 1996: frequency analysis feeds onset, harmonicity and position maps; a grouping mechanism combines them into source properties]

Cues to simultaneous grouping

[Figure: spectrogram of example sounds, 0-8 kHz]

• Elements + attributes
• Common onset
  - energy that starts at the same time is likely to come from the same source
• Periodicity
  - energy in different bands sharing the same repetition cycle groups together
• Other cues
  - spatial (ITD/IID), familiarity, ...

The effect of context

• Context can create an 'expectation', i.e. a bias towards a particular interpretation
• Bregman's old-plus-new principle:
  - a change is preferably interpreted as the addition of a new component to what was already present
• E.g. the continuity illusion: a tone interrupted by a sufficiently loud noise burst is heard as continuing through the noise

[Figure: schematic spectrograms illustrating the old-plus-new principle and the continuity illusion]

Approaches to sound mixture recognition

• Separate the signals, then recognize
  - e.g. CASA, ICA
  - nice, if you can do it
• Recognize the combined signal
  - 'multicondition training'
  - but the combinatorics of possible mixtures explode
• Recognize with parallel models
  - a full joint state space?
  - or divide the signal into fragments, then use missing-data recognition

Independent Component Analysis (ICA) (Bell & Sejnowski 1995 etc.)

• Drive a parameterized separation algorithm to maximize the independence of its outputs
  [Diagram: observed channels m1, m2 pass through an adaptive 2x2 matrix a11..a22 to give source estimates s1, s2; the matrix is adapted by gradient steps -δ(MutInfo)/δa]
• Advantages:
  - mathematically rigorous, minimal assumptions
  - does not rely on prior information from models
• Disadvantages:
  - may converge to local optima
  - gives separation, not recognition
  - does not exploit prior information from models
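For a concrete picture of what such an adaptation arrives at, here is a minimal Python sketch using scikit-learn's FastICA (a standard ICA routine standing in for the slide's mutual-information gradient); the synthetic sources and the mixing matrix are arbitrary choices for the example:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Two synthetic sources and an arbitrary mixing matrix (unknown to the algorithm)
n = 20000
s = np.vstack([rng.laplace(size=n),                               # impulsive source
               np.sign(np.sin(2 * np.pi * np.arange(n) / 100))])  # square-wave 'tone'
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
m = (A @ s).T                      # observed 2-channel mixture, shape (samples, channels)

# Fit an unmixing transform whose outputs are as independent as possible
ica = FastICA(n_components=2, random_state=0)
u = ica.fit_transform(m)           # estimated sources, up to permutation and scaling

# Check: each estimate should correlate strongly with exactly one true source
corr = np.corrcoef(np.vstack([u.T, s]))[:2, 2:]
print(np.round(np.abs(corr), 2))
```

Note that nothing in this procedure knows what the sources are, which is exactly the 'separation, not recognition' limitation listed above.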
Outline
2. Computational Auditory Scene Analysis
  - Data-driven
  - Top-down constraints

2. Computational Auditory Scene Analysis: the representational approach (Cooke & Brown 1993)

• A direct implementation of the psychological theory
  [Diagram: input mixture → front end (time-frequency maps of frequency, onset, period and frequency modulation) → object formation into discrete objects → grouping rules → source groups]
  - 'bottom-up' processing
  - uses the common-onset and periodicity cues
• Able to extract voiced speech
  [Figure: spectrograms of the mixture (brn1h.aif) and the extracted voiced speech (brn1h.fi.aif)]

Adding top-down constraints

• Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up):
  [Diagram: input mixture → front end → signal features → object formation → discrete objects → grouping rules → source groups]
  - objects irresistibly appear from the input
• vs. prediction-driven (top-down):
  [Diagram: front-end signal features are compared and reconciled with predicted features; the prediction errors drive hypothesis management over noise and periodic components, which are predicted and combined to form the next set of predicted features]
  - match the observations with the parameters of a world model
  - needs world-model constraints...

Prediction-Driven CASA (Ellis 1996)

• Explain a complex sound with basic elements
  [Figure: a city-street recording decomposed into wefts (periodic elements), noise beds and clicks; event labels such as Horn1 (10/10), Crash (10/10), Squeal (6/10) and Truck (7/10) indicate how many of ten listeners reported each event]

Aside: Evaluation

• Evaluation is a big problem for CASA
  - what is the goal, really?
  - what is a good test domain?
  - how do you measure performance?
• SNR improvement
  - tricky to derive from before/after signals: there is a correspondence problem
  - it can be computed with a fixed filtering mask (see the sketch below), but that rewards removing signal as well as noise
• Speech recognition (ASR) improvement
  - recognizers are typically very sensitive to processing artefacts
• A 'real' task?
  - e.g. a mixture corpus with specific sound events to detect...
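The fixed-mask measurement can be made concrete with a short sketch; the mask construction and the signal names below are illustrative assumptions, not the evaluation protocol of any particular system. The same binary time-frequency mask is applied to the clean target and to the noise, and the energy ratios are compared before and after:

```python
import numpy as np
from scipy.signal import stft

def masked_snr(target, noise, fs, thresh_db=0.0):
    """Apply one binary time-frequency mask (target-dominant cells) to both the
    clean target and the noise, and report the SNR before and after masking.
    Any target energy falling outside the mask is simply discarded, which is
    why this figure of merit also rewards removing signal along with noise."""
    _, _, T = stft(target, fs, nperseg=512)
    _, _, N = stft(noise, fs, nperseg=512)
    mask = (20 * np.log10(np.abs(T) + 1e-12)
            - 20 * np.log10(np.abs(N) + 1e-12)) > thresh_db
    before = 10 * np.log10(np.sum(np.abs(T) ** 2) / np.sum(np.abs(N) ** 2))
    after = 10 * np.log10(np.sum(np.abs(T * mask) ** 2)
                          / (np.sum(np.abs(N * mask) ** 2) + 1e-12))
    return before, after

# Toy example: a 440 Hz tone in white noise
fs = 8000
t = np.arange(2 * fs) / fs
print(masked_snr(np.sin(2 * np.pi * 440 * t), 0.5 * np.random.randn(len(t)), fs))
```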
Outline
3. Recognizing Speech in Noise
  - Conventional ASR
  - Tandem modeling

3. Recognizing Speech in Noise

• Standard speech recognition structure:
  sound → feature calculation → feature vectors → acoustic classifier → phone probabilities → HMM decoder → phone/word sequence → understanding/application
  - acoustic model parameters drive the classifier, word models spell out phone sequences ("s ah t"), and a language model supplies priors such as p("sat"|"the","cat") vs. p("saw"|"the","cat")
• How to handle additive noise?
  - just train on noisy data: 'multicondition training'

Tandem speech recognition (with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)

• A neural net estimates phone posteriors, but Gaussian mixtures model finer detail
• Combine them!
  [Diagram: hybrid connectionist-HMM ASR (speech features → neural net classifier → phone probabilities → Noway decoder → words), conventional ASR (speech features → Gaussian mixture models → subword likelihoods → HTK decoder → words), and tandem modeling (speech features → neural net classifier → phone probabilities → Gaussian mixture models → subword likelihoods → HTK decoder → words)]
• Train the net, then train the GMMs on the net's output (a minimal code sketch of this two-stage pipeline appears at the end of this section)
  - the GMM is ignorant of the 'meaning' of the net outputs

Tandem system results

• It works very well on the 'Aurora' noisy-digits task:
  [Figure: word error rate vs. SNR, averaged over four noises, for the baseline, hybrid and tandem systems]

  System / features       Avg. WER (20-0 dB SNR)   WER ratio vs. baseline
  HTK GMM, mfcc           13.7%                    100%
  Hybrid (net), mfcc       9.3%                     84.5%
  Tandem, mfcc             7.4%                     64.5%
  Tandem, msg+plp          6.4%                     47.2%

Inside Tandem systems: what's going on?

• Visualizations of the net outputs for "one eight three" (MFP_183A), clean and at 5 dB SNR in 'Hall' noise
  [Figure: spectrogram, cepstrally-smoothed mel spectrum, phone posterior estimates and hidden-layer linear outputs, clean vs. noisy]
• Does the neural net normalize away the noise?
  - ... or is it just a successful way to build a classifier?

Tandem vs. other approaches

• Aurora 2 Eurospeech 2001 evaluation, multicondition training
  [Figure: average relative improvement over the baseline for the submitted systems (Philips, UPC Barcelona, Bell Labs, IBM, Motorola, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada)]
• 50% of word errors corrected over the baseline
• Beat a 'bells and whistles' system that used many large-vocabulary techniques
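Here is the promised sketch of the tandem feature pipeline in Python with scikit-learn; the stand-in random data, layer size and 24-dimensional projection are illustrative assumptions, and the log-plus-decorrelation step reflects the general tandem recipe rather than the exact configuration reported above:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in data: per-frame acoustic features and phone labels
# (a real system would use PLP/MSG features with temporal context)
X_train = rng.standard_normal((5000, 39))
y_train = rng.integers(0, 40, size=5000)        # ~40 phone classes

# Stage 1: a discriminatively trained net estimates per-frame phone posteriors
net = MLPClassifier(hidden_layer_sizes=(400,), max_iter=30)
net.fit(X_train, y_train)

# Stage 2 inputs: log-compress and decorrelate the posteriors so they suit the
# diagonal-covariance Gaussian mixtures of a conventional HMM back end
log_post = np.log(net.predict_proba(X_train) + 1e-8)
pca = PCA(n_components=24).fit(log_post)

def tandem_features(X):
    """Map raw acoustic frames to tandem features for the GMM-HMM stage."""
    return pca.transform(np.log(net.predict_proba(X) + 1e-8))

feats = tandem_features(X_train)
# A conventional GMM-HMM recognizer (e.g. HTK) is then trained on `feats`,
# treating them as ordinary feature vectors.
```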
Outline
4. Using Models in Parallel
  - HMM decomposition/factoring
  - Speech fragment decoding

4. Using Models in Parallel: HMM decomposition (e.g. Varga & Moore 1991, Gales & Young 1996)

• Independent state sequences for two or more component source models
  [Diagram: two models' state sequences evolving in parallel over the same observations]
• New combined state space q' = {q1, q2}
  - needs an observation pdf p(X | q1, q2) for every state combination

"One microphone source separation" (Roweis 2000, Manuel Reyes)

• State sequences → time-frequency estimates → mask
  [Figure: spectrograms of the two original voices, their mixture, the state means inferred for each speaker, and the resulting resynthesis masks]
  - 1000 states per model (→ 10^6 transition probabilities)
  - simplify by modeling subbands (a coupled HMM)?

Speech Fragment Recognition (Jon Barker & Martin Cooke, Sheffield)

• Signal separation is too hard! Instead:
  - segregate the features into partially-observed sources
  - then classify
• Made possible by missing-data recognition
  - integrate over the uncertainty in the observations to get the optimal posterior distribution
• Goal: relate clean speech models P(X|M) to speech-plus-noise mixture observations
  - ... and make it tractable

Comparing different segregations

• Standard classification chooses between models M to match the source features X:

  M* = argmax_M P(M|X) = argmax_M P(X|M) · P(M) / P(X)

• With mixtures we have observed features Y and a segregation S, related to the source features by P(X|Y,S)
  [Figure: observed spectrum Y(f), source spectrum X(f), and the segregation S marking which frequency channels the source dominates]
  - spectral features allow a clean relationship between the three
• Joint classification of model and segregation:

  P(M,S|Y) = P(M) · ∫ P(X|M) · [P(X|Y,S) / P(X)] dX · P(S|Y)

  - the integral collapses in several cases...

Calculating fragment matches

  P(M,S|Y) = P(M) · ∫ P(X|M) · [P(X|Y,S) / P(X)] dX · P(S|Y)

• P(X|M): the clean-signal feature model
• P(X|Y,S) / P(X): is X 'visible' given the segregation?
  - the integration collapses in some bands (a small numerical sketch follows below)
• P(S|Y): the segregation inferred from the observation
  - either assume it uniform and find the S giving the most likely M
  - or use extra information in Y to distinguish segregations, e.g. harmonicity or onset grouping
• Result:
  - a probabilistically correct relation between clean-source models P(X|M) and the inferred, recognized source plus segregation P(M,S|Y)

Speech fragment decoder results

• A simple P(S|Y) model forces contiguous regions to stay together
  - a big efficiency gain when searching the space of segregations S
  [Figure: spectrogram, SNR mask and fragments for "1754" + noise, and word error rate vs. SNR on Aurora 2000 Test Set A for clean-trained HTK, soft-mask missing-data decoding, multicondition-trained HTK and the fragment decoder]
• Recognition with clean models rivals trained-in-noise recognition
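To make the 'integral collapses' step concrete, here is a minimal sketch of missing-data scoring for a single frame under a diagonal-Gaussian state model; the function name, its arguments and the toy numbers are illustrative assumptions, not the Sheffield decoder's actual interface:

```python
import numpy as np
from scipy.stats import norm

def missing_data_loglik(y, mask, mean, var):
    """Log-likelihood of one spectral frame y under a diagonal-Gaussian state,
    given a binary segregation mask (1 = channel assigned to the target).
    Present channels contribute an ordinary Gaussian likelihood; in masked
    channels the target is only known to lie at or below the observed mixture
    energy, so the Gaussian is integrated up to y (the 'collapsed' integral)."""
    std = np.sqrt(var)
    present = mask.astype(bool)
    ll = norm.logpdf(y[present], mean[present], std[present]).sum()
    ll += np.log(norm.cdf(y[~present], mean[~present], std[~present]) + 1e-12).sum()
    return ll

# Toy example: 8 log-energy channels, the upper half masked by an interferer
y = np.array([2.0, 1.5, 1.0, 0.5, 3.0, 3.2, 2.9, 3.1])
mask = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(missing_data_loglik(y, mask, mean=np.zeros(8), var=np.ones(8)))
```

Evaluating a score like this per state and per hypothesized mask is what lets the decoder compare different segregations S against the same clean-speech models.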
Multi-source decoding

• Search for more than one source at a time
  [Diagram: the observations Y(t) are shared between two masks S1(t), S2(t), each explained by its own state sequence q1(t), q2(t)]
• The data masks are mutually dependent
• Use e.g. CASA features to propose the masks
  - locally coherent regions
  - more powerful than the Roweis masks
• A huge practical advantage over a full search

Outline
5. The Listening Machine
  - Everyday sound
  - Alarms
  - Music

5. The Listening Machine

• A smart PDA records everything
• Only useful if we have indexes and summaries
  - monitor for particular sounds
  - real-time description
• Scenarios
  - a personal listener → a summary of your day
  - a future prosthetic hearing device
  - autonomous robots
• Meeting data, ambulatory audio

Alarm sound detection (Ellis 2001)

• Alarm sounds have a particular structure
  - people 'know them when they hear them'
  - they remain clear even at low SNRs
  [Figure: spectrogram of the mixture s0n6a8+20 with the alarm events hrn01, bfr02 and buz01 marked]
• Why investigate alarm sounds?
  - they are supposed to be easy
  - there are potential applications...
• Contrast two systems (a rough sketch related to the second appears after the results below):
  - standard, global features, P(X|M)
  - sinusoidal model, fragments, P(M,S|Y)

Alarms: Results

[Figure: MLP classifier output and sound-object classifier output for the restaurant mixture (SNR 0 dB, noise 6, alarms 8)]

• Both systems commit many insertions at 0 dB SNR, but in different circumstances (Del = deletions out of 25 targets, Ins = insertions, Tot = total error rate):

  Noise       Neural net system         Sinusoid model system
              Del      Ins   Tot        Del      Ins   Tot
  1 (amb)     7/25       2    36%       14/25      1    60%
  2 (bab)     5/25      63   272%       15/25      2    68%
  3 (spe)     2/25      68   280%       12/25      9    84%
  4 (mus)     8/25      37   180%        9/25    135   576%
  Overall    22/100    170   192%       50/100   147   197%
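As a loose illustration of the sinusoidal-model idea (not the system evaluated above), here is a sketch that flags alarm-like tones by tracking the strongest spectral peak per frame; the thresholds, frame sizes and the peak-continuity rule are arbitrary choices for the example:

```python
import numpy as np
from scipy.signal import stft

def tone_candidates(x, fs, min_dur=0.3, level_thresh_db=-30.0):
    """Flag steady narrowband tones as runs of frames in which the strongest
    spectral peak stays at (nearly) the same frequency, above a level
    threshold, for at least `min_dur` seconds."""
    f, t, Z = stft(x, fs, nperseg=1024, noverlap=768)
    mag_db = 20 * np.log10(np.abs(Z) + 1e-10)
    mag_db -= mag_db.max()                        # 0 dB = loudest cell
    peak_bin = mag_db.argmax(axis=0)              # strongest bin in each frame
    peak_lvl = mag_db.max(axis=0)

    runs, start = [], None
    for i in range(1, len(t)):
        steady = (abs(int(peak_bin[i]) - int(peak_bin[i - 1])) <= 1
                  and peak_lvl[i] > level_thresh_db)
        if steady and start is None:
            start = i - 1
        elif not steady and start is not None:
            if t[i - 1] - t[start] >= min_dur:
                runs.append((float(t[start]), float(t[i - 1]), float(f[peak_bin[start]])))
            start = None
    if start is not None and t[-1] - t[start] >= min_dur:
        runs.append((float(t[start]), float(t[-1]), float(f[peak_bin[start]])))
    return runs                                   # (onset, offset, frequency) tuples

# Toy test: a 1 kHz beep from 0.5 s to 1.5 s in white noise
fs = 16000
t_ax = np.arange(2 * fs) / fs
sig = 0.3 * np.random.randn(len(t_ax))
sig[fs // 2: 3 * fs // 2] += np.sin(2 * np.pi * 1000 * t_ax[fs // 2: 3 * fs // 2])
print(tone_candidates(sig, fs))
```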
Music Applications

• Music as a complex, information-rich sound
• Applications of separation & recognition:
  - note/chord detection & classification
    [Figure: DYWMB, alignments to MIDI, note 57 mapped onto the original audio]
  - singing detection (→ genre identification, ...)
    [Figure: frame-by-frame singing detection for Track 117 (Aimee Mann) compared across a set of artists including Michael Penn, The Roots, Beck, XTC, Wilco, The Flaming Lips, DJ Shadow, Mercury Rev, Belle & Sebastian and Boards of Canada]

Summary

• Sound
  - contains much valuable information, at many levels
  - intelligent systems need to use this information
• Mixtures
  - are an unavoidable complication when using sound
  - look in the right time-frequency place to find points of dominance
• Learning
  - constraints must be acquired from the environment
  - recognition/classification is the real task

References

A. Bregman. Auditory Scene Analysis. MIT Press, 1990.

A. Bell and T. Sejnowski. "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, 7: 1129-1159, 1995. http://citeseer.nj.nec.com/bell95informationmaximization.html

A. Berenzweig, D. Ellis and S. Lawrence. "Using Voice Segments to Improve Artist Classification of Music," Proc. AES-22 Intl. Conf. on Virtual, Synthetic and Entertainment Audio, Espoo, Finland, June 2002. http://www.ee.columbia.edu/~dpwe/pubs/aes02-aclass.pdf

A. Berenzweig, D. Ellis and S. Lawrence. "Anchor Space for Classification and Similarity Measurement of Music," Proc. ICME-03, Baltimore, July 2003. http://www.ee.columbia.edu/~dpwe/pubs/icme03-anchor.pdf

M. Cooke and G. Brown. "Computational auditory scene analysis: Exploiting principles of perceived continuity," Speech Communication, 13: 391-399, 1993.

D. Ellis. Prediction-driven computational auditory scene analysis. Ph.D. dissertation, MIT, 1996. http://www.ee.columbia.edu/~dpwe/pubs/pdcasa.pdf

D. Ellis. "Detecting Alarm Sounds," Proc. Workshop on Consistent & Reliable Acoustic Cues (CRAC-01), Denmark, Sept. 2001. http://www.ee.columbia.edu/~dpwe/pubs/crac01-alarms.pdf

M. Gales and S. Young. "Robust continuous speech recognition using parallel model combination," IEEE Trans. Speech and Audio Processing, 4(5): 352-359, Sept. 1996. http://citeseer.nj.nec.com/gales96robust.html

H. Hermansky, D. Ellis and S. Sharma. "Tandem connectionist feature extraction for conventional HMM systems," Proc. ICASSP, Istanbul, June 2000. http://citeseer.nj.nec.com/hermansky00tandem.html