Auditory Scene Analysis: phenomena, theories and computational models July 1998 Dan Ellis International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu> Outline 1 The computational theory of ASA 2 Cues & grouping 3 Expectations & inference 4 Big issues ASA - Dan Ellis 1998jul11 - 1 Auditory Scene Analysis What does our sense of hearing do? - recover useful information ... about objects of interest ... in a wide range of circumstances Measuring objects in an auditory scene: ASA - Dan Ellis 1998jul11 - 2 Subjective analysis of auditory scenes f/Hz City 4000 2000 1000 400 200 0 1 2 3 4 5 6 7 8 9 Horn1 (10/10) S9−horn 2 S10−car horn S4−horn1 S6−double horn S2−first double horn S7−horn S7−horn2 S3−1st horn S5−Honk S8−car horns S1−honk, honk Crash (10/10) S7−gunshot S8−large object crash S6−slam S9−door Slam? S2−crash S4−crash S10−door slamming S5−Trash can S3−crash (not car) S1−slam Horn2 (5/10) S9−horn 5 S8−car horns S2−horn during crash S6−doppler horn S7−horn3 Truck (7/10) S8−truck engine S2−truck accelerating S5−Acceleration S1−rev up/passing S6−acceleration S3−closeup car S10−wheels on road Horn3 (5/10) S7−horn4 S9−horn 3 S8−car horns S3−2nd horn S10−car horn • ASA - Dan Ellis Subjects identify structures in dense scenes with high agreement 1998jul11 - 3 Outline 1 The computational theory of ASA - ASA and CASA - The grouping paradigm - Marr’s three levels of explanation 2 Cues & grouping 3 Expectations & inference 4 Big issues ASA - Dan Ellis 1998jul11 - 4 Auditory Scene Analysis (ASA) “The organization of sound scenes according to their inferred sources” ASA - Dan Ellis • Real-world sounds rarely occur in isolation →a useful sense of hearing must be able to segregate mixtures - people (and ...) do this very well; unexpectedly difficult to model - depends on: subjective definition of relevant sources regularity/constraints of real-world sounds • Studied via experimental psychology - characterize ‘rules’ for organizing simple pieces (tones, noise bursts, clicks) i.e. ‘reductive’ approach 1998jul11 - 5 Computational Auditory Scene Analysis (CASA) • Psychological ‘rules’ suggest computer implementation - .. but many practical problems arise! • Motivations: Practical applications - real-world interactive systems - indexing of media databases - hearing prostheses Crossover opportunities - unknown signal/information processing principles? Benefits for theory - implementations are very revealing ASA - Dan Ellis 1998jul11 - 6 The grouping paradigm • Standard theory of ASA (Bregman, Darwin &c): - sound mixture is broken up into small elements e.g. time-frequency ‘cells’ - each element has a number of feature dimensions (amplitude, ITD, period) - elements are grouped together according to their features to form larger structures - resulting groups have overall attributes (pitch, location) (from Darwin 1996) ASA - Dan Ellis 1998jul11 - 7 Marr’s levels-of-explanation of information processing • Three distinct aspects to info. processing Computational Theory ‘what’ and ‘why’; the overall goal Sound source organization Algorithm ‘how’; an approach to meeting the goal Auditory grouping Implementation practical realization of the process. Feature calculation & binding Why bother? - to help organize understanding - avoid confusion/wasted effort →use as an analysis tool... ASA - Dan Ellis 1998jul11 - 8 Level 1: Computational theory • The underlying regularities that make the problem possible - i.e. the ‘ecological’ facts • Implicit definition of “what is a source?”: Independence of attributes between sources Continuity of attributes for each source + other source-specific constraints ASA - Dan Ellis 1998jul11 - 9 Level 2: Algorithm ASA - Dan Ellis • A particular approach to exploiting the constraints of the computational theory - both process & representation • Audition: the “elements-then-grouping” approach - could have been otherwise e.g. templates • Often the focus of analysis - but: debate is muddled without a clear computational theory 1998jul11 - 10 Level 3: Implementation ASA - Dan Ellis • A specific realization of the algorithm - computer programs - neurons - ... • Can be analyzed separately? - provided epiphenomena are correctly assigned • Needs context of algorithm, computational theory “You cannot understand stereopsis simply by thinking about neurons” 1998jul11 - 11 The advantage of the appropriate level ASA - Dan Ellis • Computational theory - determines the purpose of the process; provides focus necessary for analysis e.g. biosonar: benefit of hyperresolution • Algorithm - abstraction that is still specific, transferable e.g. autocorrelation for pitch • Implementation - explain ‘epiphenomena’ e.g. ‘subjective octave’ from refractory period 1998jul11 - 12 An example: Neural inhibition Computational theory Frequencydomain processing X(f) f Algorithm Discrete-time filtering (subtraction) Implementation Neurons with GABAergic inhibitions ASA - Dan Ellis 1998jul11 - 13 Summary 1 ASA - Dan Ellis • Acoustic scenes are very complex • .. but the auditory system extracts useful information • Grouping is the main focus of Auditory Scene Analysis • .. but it fits into a larger Marrian framework 1998jul11 - 14 Outline 1 The computational theory of ASA 2 Cues & grouping - Cue analysis - Simple scenes - Models - Complications: interaction, ambiguity, time 3 Expectations & inference 4 Big issues ASA - Dan Ellis 1998jul11 - 15 Cues to grouping • Common onset/offset/modulation (“fate”) • Common periodicity (“pitch”) Common onset Periodicity Computational theory Acoustic consequences tend to be synchronized (Nonlinear) cyclic processes are common Algorithm Group elements that start in a time range ? Place patterns ? Autocorrelation Implementation Onset detector cells Synchronized osc’s? ? Delay-and-mult ? Modulation spect ASA - Dan Ellis • Spatial location (ITD, ILD, spectral cues) • Sequential cues... • Source-specific cues... 1998jul11 - 16 Simple grouping • E.g. isolated tones freq time Computational theory Algorithm Implementation ASA - Dan Ellis • common onset • common period (harmonicity) • locate elements (tracks) • group by shared features ? exhaustive search • evolution in time 1998jul11 - 17 Computer models of grouping • input mixture “Bregman at face value” (e.g. Brown 1992): Front end signal features (maps) Object formation discrete objects Grouping rules Source groups freq onset time period frq.mod - ASA - Dan Ellis feature maps periodicity cue common-onset boost resynthesis 1998jul11 - 18 Grouping model results • frq/Hz Able to extract voiced speech: brn1h.aif frq/Hz 3000 3000 2000 1500 2000 1500 1000 1000 600 600 400 300 400 300 200 150 200 150 100 brn1h.fi.aif 100 0.2 ASA - Dan Ellis 0.4 0.6 0.8 1.0 time/s 0.2 0.4 • Periodicity is the primary cue - how to handle aperiodic energy? • Limitations - resynthesis via filter-mask - only periodic targets - robustness of discrete objects 0.6 0.8 1998jul11 - 19 1.0 time/s Complications for grouping: 1: Cues in conflict • Mistuned harmonic (Moore, Darwin..): freq time - harmonic usually groups by onset & periodicity - can alter frequency and/or onset time - ‘degree of grouping’ from overall pitch match • Gradual, various results: pitch shift 3% mistuning - heard as separate tone, still affects pitch ASA - Dan Ellis 1998jul11 - 20 Complications for grouping: 2: The effect of time • Added harmonics: freq time - onset cue initially segregates; periodicity eventually fuses ASA - Dan Ellis • The effect of time - some cues take time to become apparent - onset cue becomes increasingly distant... • What is the impetus for fission? - e.g. double vowels - depends on what you expect .. ? 1998jul11 - 21 Summary 2 ASA - Dan Ellis • Known grouping cues make sense • Simple examples are straightforward • Models can be implemented directly • .. but problematic situations abound 1998jul11 - 22 Outline 1 The computational theory of ASA 2 Cues & grouping 3 Expectations & inference - “Old-plus-new” - Streaming - Restoration & illusions - Top-down models 4 Big issues ASA - Dan Ellis 1998jul11 - 23 The effect of context • Context can create an ‘expectation’: i.e. a bias towards a particular interpretation • e.g. Bregman’s “old-plus-new” principle: A change in a signal will be interpreted as an added source whenever possible freq/kHz 2 1 + 0 0.0 0.4 0.8 1.2 time/s - a different division of the same energy depending on what preceded it ASA - Dan Ellis 1998jul11 - 24 Streaming • Successive tone events form separate streams freq. TRT: 60-150 ms 1 kHz ∆f: ±2 octaves time • Order, rhythm &c within, not between, streams Computational theory Algorithm Implementation ASA - Dan Ellis Consistency of properties for successive source events • ‘expectation window’ for known streams (widens with time) • competing time-frequency affinity weights... 1998jul11 - 25 Restoration & illusions • Direct evidence may be masked or distorted →make best guess using available information • E.g. the ‘continuity illusion’: f/Hz ptshort 4000 2000 1000 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 time/s - tones alternates with noise bursts - noise is strong enough to mask tone ... so listener discriminate presence - continuous tone distinctly perceived for gaps ~100s of ms → Inference acts at low, preconscious level ASA - Dan Ellis 1998jul11 - 26 Speech restoration • Speech provides very strong bases for inference (coarticulation, grammar, semantics): frq/Hz nsoffee.aif 3500 3000 2500 • 2000 Phonemic restoration 1500 1000 500 0 1.2 1.3 1.4 1.5 1.6 1.7 time/s Temporal compound (1998jul10) 20 • Temporal compounds 40 60 80 100 120 • Sinewave speech (duplex?) 50 f/Bark 15 80 60 100 150 200 250 300 time / ms 350 400 450 500 550 S1−env.pf:0 10 5 40 0.0 ASA - Dan Ellis 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1998jul11 - 27 1.8 Models of top-down processing Perception as a search for plausible explanations • ‘Prediction-driven’ CASA (PDCASA): hypotheses Noise components Hypothesis management prediction errors input mixture ASA - Dan Ellis Front end signal features Compare & reconcile Periodic components Predict & combine predicted features • An approach as well as an implementation... • Key features: - ‘complete explanation’ of all scene energy - vocabulary of periodic/noise/transient elements - multiple hypotheses - explanation hierarchy 1998jul11 - 28 PDCASA for old-plus-new • t1 Incremental analysis t2 t3 Input signal Time t1: initial element created Time t2: Additional element required Time t3: Second element finished ASA - Dan Ellis 1998jul11 - 29 PDCASA for the continuity illusion • Subjects hear the tone as continuous ... if the noise is a plausible masker f/Hz ptshort 4000 2000 1000 0.0 0.2 0.4 0.6 0.8 1.0 1.2 i ASA - Dan Ellis 1.4 / • Data-driven analysis gives just visible portions: • Prediction-driven can infer masking: 1998jul11 - 30 PDCASA analysis of a complex scene f/Hz City 4000 2000 1000 400 200 1000 400 200 100 50 0 1 2 3 Wefts1−4 f/Hz 4 5 Weft5 6 7 Wefts6,7 8 Weft8 9 Wefts9−12 4000 2000 1000 400 200 1000 400 200 100 50 Horn1 (10/10) Horn2 (5/10) Horn3 (5/10) Horn4 (8/10) Horn5 (10/10) f/Hz Noise2,Click1 4000 2000 1000 400 200 Crash (10/10) f/Hz Noise1 4000 2000 1000 −40 400 200 −50 −60 Squeal (6/10) Truck (7/10) −70 0 ASA - Dan Ellis 1 2 3 4 5 6 7 8 9 time/s dB 1998jul11 - 31 Marrian analysis of PDCASA • Marr invoked to separate high-level function from low-level details Computational theory Algorithm Implementation • Objects persist predictably • Observations interact irreversibly • Build hypotheses from generic elements • Update by prediction-reconciliation ??? “It is not enough to be able to describe the response of single cells, nor predict the results of psychophysical experiments. Nor is it enough even to write computer programs that perform approximately in the desired way: One has to do all these things at once, and also be very aware of the computational theory...” ASA - Dan Ellis 1998jul11 - 32 Summary 3 ASA - Dan Ellis • Perceptual processing is highly context-dependent • Auditory system will use prior knowledge to fill-in gaps (subconsciously) • Prediction-reconciliation models can encompass this behavior 1998jul11 - 33 Outline 1 The computational theory of ASA 2 Cues & grouping 3 Expectations & inference 4 Big issues - the state of ASA and CASA - outstanding issues - discussion points ASA - Dan Ellis 1998jul11 - 34 The current state of ASA and CASA ASA - Dan Ellis • ASA - detailed descriptions of “in vitro” tests - some quite subtle effects explained (DV beats) but: how to extend to complex scenarios? • CASA - numerous models, some convergence (mainly periodicity-based) - best results sound impressive (least plausible systems!) - applications in speech recognition? but: domains limited, poor robustness 1998jul11 - 35 Big issues in CASA: ASA - Dan Ellis • Plausibility - correct level for human correspondence? - which phenomena are important to match? - how to implement symbolic-style processing? • Top-down vs. bottom-up - different approaches to ambiguity, latency - how far down for top-down? - how far ‘up’ for high level? - choice between extraction & inference? • Integrating multiple cues (e.g. binaural) • Other debates: - what is the real goal? - resynthesis - evaluation 1998jul11 - 36 Big issues in ASA & CASA: ASA - Dan Ellis • Knowledge: how to acquire, represent & store ... - short-term: context - long-term: memories - abstract: classes, generalities • Attention: - what does it mean in these models? - limitation or important principle? 1998jul11 - 37 Conclusions ASA - Dan Ellis • Real-world sounds are complex; scene-analysis is required • We know certain cues & some rules, but real situations raise contradictions • Current models handle ‘obvious’ cases; robustness & generality are hard • Many issues remain 1998jul11 - 38 Discussion points ASA - Dan Ellis • Are Marr’s levels important? Useful? Can you study levels in isolation? • What do restoration phenomena imply about internal representations? • Do we have an adequate account of an ASA algorithm? e.g. where do hypotheses come from? • How important/challenging are phenomena like duplex perception, sinewave speech etc.? 1998jul11 - 39