Computational Auditory Scene Analysis 1. 2.

Computational Auditory Scene Analysis Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/ 1. The Scene Analysis problem 2. ASA and CASA 3. Issues in CASA Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 1 /21 LabROSA Overview Information Extraction Music Environment Scene Analysis Signal Processing Machine Learning Speech Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 2 /21 1. Scene Analysis • Recover individual sources from scenes .. duplicate the perceptual effect 4000 frq/Hz 3000 0 2000 -20 1000 -40 0 0 2 4 6 8 10 12 time/s -60 level / dB Analysis Voice (evil) Voice (pleasant) Stab Rumble Choir Strings • Problems competing sources, channel effects • Dimensionality loss need additional constraints Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 3 /21 Scene Analysis Systems • “Scene Analysis” not necessarily separation, recognition, ... scene = overlapping objects, ambiguity • General Framework: distinguish input and output representations distinguish engine (algorithm) and control (constraints, “computational model”) Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 4 /21 Human and Machine Scene Analysis • CASA (Brown’92 et seq.): Input: Periodicity, continuity, onset “maps” Output: Waveform (or mask) Engine: Time-frequency masking Control: “Grouping cues” from input - or: spatial features (Roman, ...) Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 5 /21 Human and Machine Scene Analysis • CASA (e.g. Brown’92): • ICA (Bell & Sejnowski et seq.): Input: waveform (or STFT) Output: waveform (or STFT) Engine: cancellation Control: statistical independence of outputs - or energy minimization for beamforming Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 6 /21 Human and Machine Scene Analysis • CASA (e.g. Brown’92): • ICA (Bell & Sejnowski et seq.): • Human Listeners: Input: excitation patterns ... Output: percepts ... Engine: ? Control: find a plausible explanation Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 7 /21 2. Auditory Scene Analysis (Bregman 1990) • How do people analyze sound mixtures? break mixture into small elements (in time-freq) elements are grouped in to sources using cues sources have aggregate attributes • Grouping rules (Darwin, Carlyon, ...): cues: common onset/offset/modulation, harmonicity, spatial location, ... Onset map Sound Frequency analysis Elements Harmonicity map Sources Grouping mechanism Source properties Spatial map (after Darwin 1996) Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 8 /21 Grouping cues freq / Hz • Main cues: Harmonicity + Onset 8000 6000 4000 2000 0 0 1 2 3 4 5 6 not necessarily consistent! 7 8 9 time / s (from Pierce 1980) • Other cues: spatial information ‘schema’ – learned patterns • Cues ≈ constraints Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 9 /21 Bottom-up CASA (Brown’92, Hu & Wang’02) Segment Group input mixture Front end signal features (maps) Object formation discrete objects Grouping rules Source groups freq onset time period frq.mod • Literal implementation of psychoacoustics segment time-frequency into elements group into sources • Output via time-frequency masking i.e. time-varying filter Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 10/21 Correlogram front-end • (Slaney’90 et seq.) "#$#!%&'()*+(,!-&'.+//0(1 Periodic modulation as 3rd separating axis 2 "'&&+3'1&456 envelope to handle unresolved harmonics 7''/+38!94/+,!'(!:(';(<-'//093+!-=8/0'3'18 y ela lin e short-time autocorrelation d sound Cochlea filterbank frequency channels "'&&+3'1&45 /30.+ envelope follower freq lag time - linear filterbank cochlear approximation Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 11/21 - static nonlinearity “Weft” Periodic Elements (Ellis’96) • Represent harmonics without grouping? Correlogram Smooth Spectrum slices freq freq lag time Period Track commonperiod feature period time time hard to separate multiple pitch tracks Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 12/21 CASA Output • Time-Frequency masked reconstruction /012%B%CD%E%F514%8: )*$+%&%,-. )*$+%&%,-. /012%3*"4"156%#"7 = = 2 :8 < < >8 ? ? 8 ; ; B>8 0 0 B:8 : : B08 > > B;8 2 8 8 89: 89; 89< 89= > >9: >9; !"#$%&%'$( 8 8 89: 89; 89< 89= > >9: >9; !"#$%&%'$( 6$/$6%9&%@A works surprisingly well (for speech?) cannot undo overlapping energy (< 20%?) applicable to reverberation also? • Or: parametric resynthesis e.g. ‘wefts’, speech synthesizer Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 13/21 Challenges for CASA "#$%&'()!*+,-!.%$,,$(/012!3454 freq • Circumscribing time-frequency elements need to have ‘regions’, but hard to find • Periodicity is the primary cue how to handle aperiodic energy? • Bottom-up leaves no ambiguity or context how to model illusions/interpolations? • Need to group over longer timespans time 6 3+#70()7#+%+89!,+('/:#';0'87<!'&'('8,) - need to have ‘regions’, but hard to find 6 "'#+$=+7+,<!+)!,-'!1#+(>#<!70' - how to handle aperiodic energy? 6 ?')<8,-')+)!@+>!(>)A'=!!&,'#+89 - cannot separate within a single t-f element 6 B$,,$(/01!&'>@')!8$!>(%+90+,<!$#!7$8,'C, - how to model illusions? local properties not enough E6820 SAPR - Dan Ellis L11 - Signal Separation Comp. Aud. Scene Analysis - Dan Ellis 2005-04-13 - 15 2005-06-30 - 14/21 Model-based integration !"#$%&'()*&+,*-'$ high-level constraints? • How to represent How to integrate disparate fragments? • “Speech fragment decoder” (Barker et al. ’05) . /-%-(-'$)S)(,)01,2&)3"#$%&'(4) %#5&4)167,(1&4-4)4&#"+1)("#+(#82&9 Fragments Hypotheses w1 w5 w2 w6 w3 w7 w8 w4 time ;< ;= ;> ;? ;@ ;A model 9of=>*'=2$*?$?1"5@2#0($12!2=0($P(S|Y)$A$P(X|M) source (e.g. speech recognition HMM) 'B2B$C2(0$=*@C'#"0'*#$*?$(25125"0'*# to say which parts go together "#,$@"0=>$0*$(D22=>$@*,2&( Comp. Aud. Scene Analysis - Dan Ellis . 2005-06-30 - 15/21 :&"$-'$)167,(1&4&4)2-%-(4)47#+&)*&%#'*4 9 BB$C+0$21"(2($(D2='"=$>'(0*1E Disambiguation • Scene multiple possible explanations Analysis choose most reasonable one • Most reasonable means... consistent with grouping cues (CASA) independent sources (ICA) consistent with experience ... (human) max P({Si}| X) ∝ P(X |{Si}) P({Si}) combination physics source models • i.e. some kind of constraints to disambiguate Learning as the source of this disambiguation knowledge Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 16/21 Recognition for Separation • Speech recognizers embody knowledge trained on 100s of hours of speech useEstimation them as a ‘reasonableness’ measure Target Iteration & Filter Training 12 MIC1 ASR Alignment Target 1 MICM Filter 1 Speaker 1 Filter M Speaker 1 ! FE My Ms Error MIC1 MICM ! speech recognizer’s best-match provides optimization target MIC1 Comp. Aud. Scene Analysis Target 2 - Dan Ellis Factorial Processing MICM Filter 1 Speaker 2 Filter M ! My 2005-06-30 - 17/21Ms FE Error from Manuel Reyes’s WASPAA 2003 presentation • e.g. Seltzer, Raj, Reyes: Obliteration and Outputs • Perfect separation is rarely possible e.g. cancellation after psychoacoustic coding? strong interference will obliterate part of target • What should the output be? can fill-in missing-data holes using source models - ‘pretend’ we observed the full signal - but: hides observed/inferred distinction output internal model state instead? - e.g. ASR output - synthesize with “minimally informative noise” Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 18/21 Practical CASA? • When will CASA be useful? • Obstacles: Transmission errors no agreed way to measure progress! intelligibility is a novel idea Net effect graceful degradation - effect of distortions unpitched sounds optimum computation - look-ahead integration with multichannel techniques - sequential or all-at-once? Comp. Aud. Scene Analysis - Dan Ellis Increased artefacts Reduced interference Agressiveness of processing 2005-06-30 - 19/21 Current CASA work • Handling unvoiced events (OSU) • Partial recognition & grouping (Sheffield) • Model-based separation (Columbia) • Spatial cues? • Dereverberation? Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 20/21 Conclusions • Framework for scene analysis Input, Output, Engine, Control • Auditory Scene Analysis in humans and machines • Scene analysis as Disambiguation finding the additional constraints • Big problems still to overcome Comp. Aud. Scene Analysis - Dan Ellis 2005-06-30 - 21/21

Computational Auditory Scene Analysis 1. 2.

Related documents

Products

Support

Computational Auditory Scene Analysis 1. 2.

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib