Computational Auditory Scene Analysis Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/ 1. The Scene Analysis problem 2. ASA and CASA 3. Issues in CASA Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 1 /21 LabROSA Overview Information Extraction Music Environment Scene Analysis Signal Processing Machine Learning Speech Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 2 /21 1. Scene Analysis • Recover individual sources from scenes .. duplicate the perceptual effect 4000 frq/Hz 3000 0 2000 -20 1000 -40 0 0 2 4 6 8 10 12 time/s -60 level / dB Analysis Voice (evil) Voice (pleasant) Stab Rumble Choir Strings • Problems competing sources, channel effects • Dimensionality loss need additional constraints Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 3 /21 Scene Analysis Systems • “Scene Analysis” not necessarily separation, recognition, ... scene = overlapping objects, ambiguity • General Framework: distinguish input and output representations distinguish engine (algorithm) and control (constraints, “computational model”) Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 4 /21 Human and Machine Scene Analysis • CASA (Brown’92 et seq.): Input: Periodicity, continuity, onset “maps” Output: Waveform (or mask) Engine: Time-frequency masking Control: “Grouping cues” from input - or: spatial features (Roman, ...) Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 5 /21 Human and Machine Scene Analysis • CASA (e.g. Brown’92): • ICA (Bell & Sejnowski et seq.): Input: waveform (or STFT) Output: waveform (or STFT) Engine: cancellation Control: statistical independence of outputs - or energy minimization for beamforming Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 6 /21 Human and Machine Scene Analysis • CASA (e.g. Brown’92): • ICA (Bell & Sejnowski et seq.): • Human Listeners: Input: excitation patterns ... Output: percepts ... Engine: ? Control: find a plausible explanation Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 7 /21 2. Auditory Scene Analysis (Bregman 1990) • How do people analyze sound mixtures? break mixture into small elements (in time-freq) elements are grouped in to sources using cues sources have aggregate attributes • Grouping rules (Darwin, Carlyon, ...): cues: common onset/offset/modulation, harmonicity, spatial location, ... Onset map Sound Frequency analysis Elements Harmonicity map Sources Grouping mechanism Source properties Spatial map (after Darwin 1996) Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 8 /21 Grouping cues freq / Hz • Main cues: Harmonicity + Onset 8000 6000 4000 2000 0 0 1 2 3 4 5 6 not necessarily consistent! 7 8 9 time / s (from Pierce 1980) • Other cues: spatial information ‘schema’ – learned patterns • Cues ≈ constraints Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 9 /21 Bottom-up CASA (Brown’92, Hu & Wang’02) Segment Group input mixture Front end signal features (maps) Object formation discrete objects Grouping rules Source groups freq onset time period frq.mod • Literal implementation of psychoacoustics segment time-frequency into elements group into sources • Output via time-frequency masking i.e. time-varying filter Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 10/21 Correlogram front-end • (Slaney’90 et seq.) "#$#!%&'()*+(,!-&'.+//0(1 Periodic modulation as 3rd separating axis 2 "'&&+3'1&456 envelope to handle unresolved harmonics 7''/+38!94/+,!'(!:(';(<-'//093+!-=8/0'3'18 y ela lin e short-time autocorrelation d sound Cochlea filterbank frequency channels "'&&+3'1&45 /30.+ envelope follower freq lag time - linear filterbank cochlear approximation Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 11/21 - static nonlinearity “Weft” Periodic Elements (Ellis’96) • Represent harmonics without grouping? Correlogram Smooth Spectrum slices freq freq lag time Period Track commonperiod feature period time time hard to separate multiple pitch tracks Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 12/21 CASA Output • Time-Frequency masked reconstruction /012%B%CD%E%F514%8: )*$+%&%,-. )*$+%&%,-. /012%3*"4"156%#"7 = = 2 :8 < < >8 ? ? 8 ; ; B>8 0 0 B:8 : : B08 > > B;8 2 8 8 89: 89; 89< 89= > >9: >9; !"#$%&%'$( 8 8 89: 89; 89< 89= > >9: >9; !"#$%&%'$( 6$/$6%9&%@A works surprisingly well (for speech?) cannot undo overlapping energy (< 20%?) applicable to reverberation also? • Or: parametric resynthesis e.g. ‘wefts’, speech synthesizer Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 13/21 Challenges for CASA "#$%&'()!*+,-!.%$,,$(/012!3454 freq • Circumscribing time-frequency elements need to have ‘regions’, but hard to find • Periodicity is the primary cue how to handle aperiodic energy? • Bottom-up leaves no ambiguity or context how to model illusions/interpolations? • Need to group over longer timespans time 6 3+#70()7#+%+89!,+('/:#';0'87<!'&'('8,) - need to have ‘regions’, but hard to find 6 "'#+$=+7+,<!+)!,-'!1#+(>#<!70' - how to handle aperiodic energy? 6 ?')<8,-')+)!@+>!(>)A'=!!&,'#+89 - cannot separate within a single t-f element 6 B$,,$(/01!&'>@')!8$!>(%+90+,<!$#!7$8,'C, - how to model illusions? local properties not enough E6820 SAPR - Dan Ellis L11 - Signal Separation Comp. Aud. Scene Analysis - Dan Ellis 2005-04-13 - 15 2005-06-30 - 14/21 Model-based integration !"#$%&'()*&+,*-'$ high-level constraints? • How to represent How to integrate disparate fragments? • “Speech fragment decoder” (Barker et al. ’05) . /-%-(-'$)S)(,)01,2&)3"#$%&'(4) %#5&4)167,(1&4-4)4&#"+1)("#+(#82&9 Fragments Hypotheses w1 w5 w2 w6 w3 w7 w8 w4 time ;< ;= ;> ;? ;@ ;A model 9of=>*'=2$*?$?1"5@2#0($12!2=0($P(S|Y)$A$P(X|M) source (e.g. speech recognition HMM) 'B2B$C2(0$=*@C'#"0'*#$*?$(25125"0'*# to say which parts go together "#,$@"0=>$0*$(D22=>$@*,2&( Comp. Aud. Scene Analysis - Dan Ellis . 2005-06-30 - 15/21 :&"$-'$)167,(1&4&4)2-%-(4)47#+&)*&%#'*4 9 BB$C+0$21"(2($(D2='"=$>'(0*1E Disambiguation • Scene multiple possible explanations Analysis choose most reasonable one • Most reasonable means... consistent with grouping cues (CASA) independent sources (ICA) consistent with experience ... (human) max P({Si}| X) ∝ P(X |{Si}) P({Si}) combination physics source models • i.e. some kind of constraints to disambiguate Learning as the source of this disambiguation knowledge Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 16/21 Recognition for Separation • Speech recognizers embody knowledge trained on 100s of hours of speech useEstimation them as a ‘reasonableness’ measure Target Iteration & Filter Training 12 MIC1 ASR Alignment Target 1 MICM Filter 1 Speaker 1 Filter M Speaker 1 ! FE My Ms Error MIC1 MICM ! speech recognizer’s best-match provides optimization target MIC1 Comp. Aud. Scene Analysis Target 2 - Dan Ellis Factorial Processing MICM Filter 1 Speaker 2 Filter M ! My 2005-06-30 - 17/21Ms FE Error from Manuel Reyes’s WASPAA 2003 presentation • e.g. Seltzer, Raj, Reyes: Obliteration and Outputs • Perfect separation is rarely possible e.g. cancellation after psychoacoustic coding? strong interference will obliterate part of target • What should the output be? can fill-in missing-data holes using source models - ‘pretend’ we observed the full signal - but: hides observed/inferred distinction output internal model state instead? - e.g. ASR output - synthesize with “minimally informative noise” Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 18/21 Practical CASA? • When will CASA be useful? • Obstacles: Transmission errors no agreed way to measure progress! intelligibility is a novel idea Net effect graceful degradation - effect of distortions unpitched sounds optimum computation - look-ahead integration with multichannel techniques - sequential or all-at-once? Comp. Aud. Scene Analysis - Dan Ellis Increased artefacts Reduced interference Agressiveness of processing 2005-06-30 - 19/21 Current CASA work • Handling unvoiced events (OSU) • Partial recognition & grouping (Sheffield) • Model-based separation (Columbia) • Spatial cues? • Dereverberation? Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 20/21 Conclusions • Framework for scene analysis Input, Output, Engine, Control • Auditory Scene Analysis in humans and machines • Scene analysis as Disambiguation finding the additional constraints • Big problems still to overcome Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 21/21