Computational Auditory Scene Analysis: Principles, Practice and Applications Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline 1 Auditory Scene Analysis (ASA) 2 Computational ASA (CASA) 3 Context, expectation & predictions 4 Applications: speech recognition, indexing 5 Conclusions and open issues CASA for TICSP - Dan Ellis 1999sep01 - 1 Auditory Scene Analysis 1 “The organization of sound scenes according to their inferred sources” f/Hz • Sounds rarely occur in isolation - need to ‘separate’ for useful information • Human audition is very effective - it’s a shock that modeling it is so hard city22 4000 −40 2000 −50 1000 400 −60 200 −70 0 1 2 CASA for TICSP - Dan Ellis 3 4 5 6 7 8 9 dB time/s 1999sep01 - 2 How can we separate sources? • In general, we can’t - mathematically, too few degrees of freedom - sensor noise limits separation • ‘Correct’ analysis is subjectively defined - sources exhibit independence, continuity, ... →ecological constraints enable organization • Goal not complete separation (reconstruction) but organization of available information into useful structures - telling us useful things about the outside world • ‘The auditory system as a separation machine’ (Alain de Cheveigné) CASA for TICSP - Dan Ellis 1999sep01 - 3 Psychology of ASA • Extensive experimental research - perception of simplified stimuli (sinusoids, noise) • “Auditory Scene Analysis” [Bregman 1990] - first: break mixture into small elements - elements are grouped in to sources using cues - sources have aggregate attributes • Grouping ‘rules’ (Darwin, Carlyon, ...): - common onset/offset/modulation,harmonicity, spatial location, ... (from Darwin 1996) CASA for TICSP - Dan Ellis 1999sep01 - 4 Thinking about information processing: Marr’s levels-of-explanation • Three distinct aspects to info. processing Computational Theory ‘what’ and ‘why’; the overall goal Sound source organization Algorithm ‘how’; an approach to meeting the goal Auditory grouping Implementation practical realization of the process. Feature calculation & binding Why bother? CASA for TICSP - Dan Ellis - helps organize interpretation - it’s OK to consider levels separately, one at a time 1999sep01 - 5 Cues to grouping • Common onset/offset/modulation (“fate”) • Common periodicity (“pitch”) Common onset Periodicity Computational theory Acoustic consequences tend to be synchronized (Nonlinear) cyclic processes are common Algorithm Group elements that start in a time range ? Place patterns ? Autocorrelation Implementation Onset detector cells Synchronized osc’s? ? Delay-and-mult ? Modulation spect • Spatial location (ITD, ILD, spectral cues) • Sequential cues... • Source-specific cues... CASA for TICSP - Dan Ellis 1999sep01 - 6 Simple grouping • E.g. isolated tones freq time Computational theory Algorithm Implementation CASA for TICSP - Dan Ellis • common onset • common period (harmonicity) • locate elements (tracks) • group by shared features ? exhaustive search • evolution in time 1999sep01 - 7 Complications for grouping: 1: Cues in conflict • Mistuned harmonic (Moore, Darwin..): freq time - harmonic usually groups by onset & periodicity - can alter frequency and/or onset time - ‘degree of grouping’ from overall pitch match • Gradual, various results: pitch shift 3% mistuning - heard as separate tone, still affects pitch CASA for TICSP - Dan Ellis 1999sep01 - 8 Complications for grouping: 2: The effect of time • Added harmonics: freq time - onset cue initially segregates; periodicity eventually fuses • The effect of time - some cues take time to become apparent - onset cue becomes increasingly distant... • What is the impetus for fission? - e.g. double vowels - depends on what you expect .. ? CASA for TICSP - Dan Ellis 1999sep01 - 9 Outline 1 Auditory Scene Analysis (ASA) 2 Computational ASA (CASA) - A simple model of grouping - Other systems 3 Context, expectation & predictions 4 Applications: speech recognition, indexing 5 Conclusions and open issues CASA for TICSP - Dan Ellis 1999sep01 - 10 2 Computational Auditory Scene Analysis (CASA) • Automatic sound organization? - convert an undifferentiated signal into a description in terms of different sources • Translate psych. rules into programs? - representations to reveal common onset, harmonicity ... • Motivations & Applications - it’s a puzzle: new processing principles? - real-world interactive systems (speech, robots) - hearing prostheses (enhancement, description) - advanced processing (remixing) - multimedia indexing CASA for TICSP - Dan Ellis 1999sep01 - 11 A simple model of grouping • input mixture “Bregman at face value” (e.g. Brown 1992): Front end signal features (maps) Object formation discrete objects Grouping rules Source groups freq onset time period frq.mod - feature maps periodicity cue common-onset boost resynthesis CASA for TICSP - Dan Ellis 1999sep01 - 12 Grouping model results • frq/Hz Able to extract voiced speech: brn1h.aif frq/Hz 3000 3000 2000 1500 2000 1500 1000 1000 600 600 400 300 400 300 200 150 200 150 100 brn1h.fi.aif 100 0.2 0.4 0.6 0.8 1.0 time/s 0.2 0.4 • Periodicity is the primary cue - how to handle aperiodic energy? • Limitations - resynthesis via filter-mask - only periodic targets - robustness of discrete objects CASA for TICSP - Dan Ellis 0.6 0.8 1999sep01 - 13 1.0 time/s Other CASA systems (1) • Weintraub 1985 - separate male & female voices - find periodicities in each frequency channel by auto-coincidence - number of voices is ‘hidden state’ • Cooke 1991 - ‘Synchrony strands’ auditory model - Fusing resolved harmonics and AM formants - led to [Brown 1992] CASA for TICSP - Dan Ellis 1999sep01 - 14 Other CASA systems (2) • Okuno, Nakatani &c (1994-) - ‘tracers’ follow each harmonic + noise ‘agent’ - residue-driven: account for whole signal • Klassner 1996 - search for a combination of templates - high-level hypotheses permit front-end tuning 3760 Hz Buzzer-Alarm 2540 Hz 2230 Hz 2350 Hz Glass-Clink 1675 Hz 1475 Hz 950 Hz 500 Hz 460 Hz 420 Hz Phone-Ring 1.0 • Siren-Chirp 2.0 3.0 4.0 sec 1.0 2.0 TIME TIME (a) (b) 3.0 4.0 sec Ellis 1996 - model for events perceived in dense scenes - prediction-driven: observations - hypotheses CASA for TICSP - Dan Ellis 1999sep01 - 15 Other signal-separation approaches • HMM decomposition (RK Moore ’86) - recover combined source states directly • Blind source separation (Bell & Sejnowski ’94) - find exact separation parameters by maximizing statistic e.g. signal independence • Neural models (Malsburg, Wang & Brown) - avoid implausible AI methods (search, lists) - oscillators substitute for iteration? CASA for TICSP - Dan Ellis 1999sep01 - 16 Outline 1 Auditory Scene Analysis (ASA) 2 Computational ASA (CASA) 3 Context, expectation & predictions - the effect of context - streaming, illusions and restoration - prediction-driven (PD) CASA 4 Applications: speech recognition, indexing 5 Conclusions and open issues CASA for TICSP - Dan Ellis 1999sep01 - 17 3 Context, expectations & predictions Perception is not direct but a search for plausible hypotheses • Data-driven... input mixture Front end signal features Object formation discrete objects Grouping rules Source groups vs. Prediction-driven hypotheses Noise components Hypothesis management prediction errors input mixture • Front end signal features Compare & reconcile Periodic components Predict & combine predicted features Motivations - detect non-tonal events (noise & clicks) - support ‘restoration illusions’... → hooks for high-level knowledge + ‘complete explanation’, multiple hypotheses, resynthesis CASA for TICSP - Dan Ellis 1999sep01 - 18 The effect of context • Context can create an ‘expectation’: i.e. a bias towards a particular interpretation • e.g. Bregman’s “old-plus-new” principle: A change in a signal will be interpreted as an added source whenever possible freq/kHz 2 1 + 0 0.0 0.4 0.8 1.2 time/s - a different division of the same energy depending on what preceded it CASA for TICSP - Dan Ellis 1999sep01 - 19 Streaming • Successive tone events form separate streams freq. TRT: 60-150 ms 1 kHz ∆f: ±2 octaves time • Order, rhythm &c within, not between, streams Computational theory Algorithm Implementation CASA for TICSP - Dan Ellis Consistency of properties for successive source events • ‘expectation window’ for known streams (widens with time) • competing time-frequency affinity weights... 1999sep01 - 20 Restoration & illusions • Direct evidence may be masked or distorted →make best guess using available information • E.g. the ‘continuity illusion’: f/Hz ptshort 4000 2000 1000 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 time/s - tones alternates with noise bursts - noise is strong enough to mask tone ... so listener discriminate presence - continuous tone distinctly perceived for gaps ~100s of ms → Inference acts at low, preconscious level CASA for TICSP - Dan Ellis 1999sep01 - 21 Speech restoration • Speech provides very strong bases for inference (coarticulation, grammar, semantics): frq/Hz nsoffee.aif 3500 3000 2500 • 2000 Phonemic restoration 1500 1000 500 0 1.2 1.3 1.4 1.5 1.6 1.7 time/s Temporal compound (1998jul10) 20 • Temporal compounds 40 60 80 100 120 • Sinewave speech (duplex?) 50 f/Bark 15 80 60 100 150 200 250 300 time / ms 350 400 450 500 550 S1−env.pf:0 10 5 40 0.0 CASA for TICSP - Dan Ellis 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1999sep01 - 22 1.8 Modeling top-down processing: ‘Prediction-driven’ CASA (PDCASA): hypotheses Noise components Hypothesis management prediction errors input mixture Front end signal features Compare & reconcile Periodic components Predict & combine predicted features • An approach as well as an implementation... • Key features: - ‘complete explanation’ of all scene energy - vocabulary of periodic/noise/transient elements - multiple hypotheses - explanation hierarchy CASA for TICSP - Dan Ellis 1999sep01 - 23 PDCASA for old-plus-new • Incremental analysis t1 t2 t3 Input signal Time t1: initial element created Time t2: Additional element required Time t3: Second element finished CASA for TICSP - Dan Ellis 1999sep01 - 24 PDCASA for the continuity illusion • Subjects hear the tone as continuous ... if the noise is a plausible masker f/Hz ptshort 4000 2000 1000 0.0 0.2 0.4 0.6 0.8 1.0 1.2 i 1.4 / • Data-driven analysis gives just visible portions: • Prediction-driven can infer masking: CASA for TICSP - Dan Ellis 1999sep01 - 25 PDCASA analysis of a complex scene f/Hz City 4000 2000 1000 400 200 1000 400 200 100 50 0 f/Hz 1 2 3 Wefts1−4 4 5 Weft5 6 7 Wefts6,7 8 Weft8 9 Wefts9−12 4000 2000 1000 400 200 1000 400 200 100 50 Horn1 (10/10) Horn2 (5/10) Horn3 (5/10) Horn4 (8/10) Horn5 (10/10) f/Hz Noise2,Click1 4000 2000 1000 400 200 Crash (10/10) f/Hz Noise1 4000 2000 1000 −40 400 200 −50 −60 Squeal (6/10) Truck (7/10) −70 0 CASA for TICSP - Dan Ellis 1 2 3 4 5 6 7 8 9 time/s dB 1999sep01 - 26 Problems in PDCASA • Subjective ground-truth in mixtures? - listening tests collect ‘perceived events’: • Other problems - error allocation - source hierarchy CASA for TICSP - Dan Ellis - rating hypotheses - resynthesis 1999sep01 - 27 Marrian analysis of PDCASA • Marr invoked to separate high-level function from low-level details Computational theory Algorithm Implementation • Objects persist predictably • Observations interact irreversibly • Build hypotheses from generic elements • Update by prediction-reconciliation ??? “It is not enough to be able to describe the response of single cells, nor predict the results of psychophysical experiments. Nor is it enough even to write computer programs that perform approximately in the desired way: One has to do all these things at once, and also be very aware of the computational theory...” CASA for TICSP - Dan Ellis 1999sep01 - 28 Outline 1 Auditory Scene Analysis (ASA) 2 Computational ASA (CASA) 3 Context, expectation & predictions 4 Applications - CASA for speech recognition - CASA for audio indexing 5 Conclusions and open issues CASA for TICSP - Dan Ellis 1999sep01 - 29 4 .1 Applications: Speech recognition • Conventional speech recognition: signal Feature extraction low-dim. features Phone classifier phone probabilities HMM decoder words - signal assumed entirely speech - find valid segmentation using discrete labels - class models from training data • Some problems: - need to ignore lexically-irrelevant variation (microphone, voice pitch etc.) - compact feature space → everything speech-like • Very fragile to nonspeech, background - scene-analysis methods very attractive... CASA for TICSP - Dan Ellis 1999sep01 - 30 CASA for speech recognition • Data-driven: CASA as preprocessor - problems with ‘holes’ (but: Okuno) - doesn’t exploit knowledge of speech structure • Missing data (Cooke &c, de Cheveigné) - CASA cues distinguish present/absent - use ‘aware’ classifier • Prediction-driven: speech as component - same ‘reconciliation’ of speech hypotheses - need to express ‘predictions’ in signal domain Speech components Hypothesis management input mixture CASA for TICSP - Dan Ellis Noise components Predict & combine Periodic components Front end Compare & reconcile 1999sep01 - 31 Example of speech & nonspeech f/Bark 15 (a) Clap 10 5 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 (b) Speech plus clap dB 60 40 20 (c) Recognizer output h# w n h# ay n n ay tcl t uw n t f uw ay ah f s ay ay ow h# v s eh v ow h# five oh <SIL> s v ah eh v n ax h# n tcl <SIL> nine two seven (d) Reconstruction from labels alone (e) Slowly−varying portion of original (f) Predicted speech element ( = (d)+(e) ) (g) Click5 from nonspeech analysis (h) Spurious elements from nonspeech analysis 0.0 • 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 Problems: - undoing classification & normalization - finding a starting hypothesis - granularity of integration CASA for TICSP - Dan Ellis 1999sep01 - 32 CASA & Missing Data • CASA indicates source energy regions - e.g. the time-frequency mask of [Brown 1992] • ‘Missing data’ theory permits inference: - skip dimensions of an uncorrelated Gaussian - perform full integral over unknown range - ‘data imputation’ e.g. for deltas, cepstra • ... or just weighting of information streams - 4 band recognizer [Berthommier et al. 1998] • RESPITE project CASA Signal quality labels SNR signal Feature stream division per-stream quality labels Confidence estimation per-stream confidences Missing-data classifier multiple information streams CASA for TICSP - Dan Ellis Missing data/ multistream recombination per-stream class probabilites 1999sep01 - 33 words The RESPITE CASA Toolkit (Barker, Ellis & Cooke 1999) CASA toolkit block diagram dpwe@icsi.berkeley.edu 1999mar19 Interaural analysis Modulation analysis • simple cycle • blackboard • ASR-style decoding Frequency analysis CxFxP Periodicity tracking Object periodicity hypotheses • summary & tracking • channel voting • modulation filtering • stabilized image • cancellation • autocorrelation (FFT/delay&mult) CxF Spectral: • Gammatone • FFT • general FIR/IIR Envelope: • IHC model • HWR/FWR + LPF • analytic envelope Spectral modeling • t-f samples • LPC Crossspectral analysis • local maxima • synchrony Sinusoid tracking Multidimensional continuous signal Semi-discrete object hypotheses CASA for TICSP - Dan Ellis Object location hypotheses • cross-correlation(FFT/sampled) • interaural level difference Execution/ hypothesis control Sound C CxFxA Spatial tracking Spectral envelope hypotheses Onset hypotheses Integration & object formation • evidence weighting • ‘re-entry’ Spectral partition hypotheses Sinusoid hyps Sinusoid grouping Sinusoid group hypotheses C = number of input channels (monaural/binaural) F = number of frequency channels (2-1024) P = number of periodicity bins (25-500) A = number of spatial (azimuth) bins (16-256) 1999sep01 - 34 4 .2 f/Hz Applications: audio indexing city22 4000 horn −40 2000 yell −60 400 200 car noise −70 0 1 2 3 4 5 6 7 8 9 horn door crash −50 1000 dB time/s • Current approaches - speech recognition (Informedia etc.) - whole-sample statistics (Muscle Fish) • What are the ‘objects’ in a soundtrack? - i.e. the analog of words in text IR - subjective definition → need auditory model • Problems - parts vs. wholes - general vs. specific - how to be ‘data-driven’ CASA for TICSP - Dan Ellis 1999sep01 - 35 Element-based audio indexing • Segment-level features Segment feature analysis Sound segment database Feature vectors Seach/ comparison Results Segment feature analysis Query example • Using ‘generic sound elements’ Segment feature analysis Continuous audio archive Element representations Seach/ comparison Results Segment feature analysis Query example - search for subset (but: masked features?) - how to generalize? - how to use segment-style features? CASA for TICSP - Dan Ellis 1999sep01 - 36 Object-based audio indexing • Organizing elements into objects reveals higher-order properties Segment feature analysis Continuous audio archive Object formation Seach/ comparison Element representations Segment feature analysis Query example Object formation (?) Objects + properties • How to form objects? - heuristics (onset, harmonicity, continuity) - machine learning: associative recall, clustering, ‘data mining’ • Which higher-order properties? - current wisdom (brightness, roughness...) - psychoacoustics - (semi) data-driven hierarchies CASA for TICSP - Dan Ellis 1999sep01 - 37 Results Open issues in automatic indexing • How to do CASA for element descriptions? - PDCASA: ‘generic’ primitives + constraining hierarchy - (semi?) automatic learning of object structure • Classification - connecting subjective & objective properties → finding subjective invariants, prominence - representation of sound-object ‘classes’ - matching incompletely-described objects • Queries - .. by example (which part?) - .. by symbolic descriptions of classes? • Related applications - ‘structured audio encoder’ - semantic hearing aid / robot listener CASA for TICSP - Dan Ellis 1999sep01 - 38 Outline 1 Auditory Scene Analysis (ASA) 2 Computational ASA (CASA) 3 Context, expectation & predictions 4 Applications: speech recognition, indexing 5 Conclusions and open issues CASA for TICSP - Dan Ellis 1999sep01 - 39 CASA: Open issues 5 • We’re still looking for the right perspective - bottom up vs. top down - physiology, psychology, levels of description • What is the goal? - simulating listeners on contrived tasks? - solving practical engineering problems? - laying the conceptual groundwork • How to evaluate CASA work? - evaluation is critical for a healthy field - .. but people have to agree on a task - subjectively defined → listening tests • Looming on the horizon... - learning - attention CASA for TICSP - Dan Ellis 1999sep01 - 40 Conclusions • Auditory organization is required in real environments • We don’t know how listeners do it! - plenty of modeling interest • Prediction-reconciliation can account for ‘illusions’ - use ‘knowledge’ when signal is inadequate - important in a wider range of circumstances? • Speech & speech recognizers - urgent application for CASA - good source of signal knowledge? • Automatic indexing implies ‘synthetic listener’ - need to solve a lot of modeling issues - the next big thing? CASA for TICSP - Dan Ellis 1999sep01 - 41