Computational Models of Auditory Organization
Dan Ellis
Electrical Engineering, Columbia University
http://www.ee.columbia.edu/~dpwe/

CASA @ ICTP - Dan Ellis 2001-08-09 - 1 Lab ROSA

Outline
1 Sound organization
  - the information in sound
  - Marr's levels of explanation
2 Human Auditory Scene Analysis (ASA)
3 Computational ASA (CASA)
4 CASA issues & applications
5 Summary

1 Sound organization

[figure: spectrogram (0-4 kHz, 0-12 s) of a continuous sound mixture, with the perceived events labeled: voice (evil), voice (pleasant), stab, rumble, choir, strings]

• Central operation:
  - continuous sound mixture → distinct objects & events
• Perceptual impression is very strong
  - but hard to 'see' in the signal

Bregman's lake

"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman 1990)

• Received waveform is a mixture
  - two sensors, N signals ...
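Bregman's lake can be sketched numerically: each 'handkerchief' (sensor) observes only a weighted sum of the source signals. A minimal numpy sketch, with invented source frequencies and mixing weights:

```python
# Two sources reaching two sensors as weighted sums: x = A @ s.
# The tone frequencies and mixing matrix A are illustrative only.
import numpy as np

sr = 8000                      # sample rate, Hz
t = np.arange(sr) / sr         # 1 second of time samples

def make_tone(freq):
    return np.sin(2 * np.pi * freq * t)

s = np.stack([make_tone(440), make_tone(625)])   # 2 source waveforms
A = np.array([[1.0, 0.6],                        # source -> sensor weights
              [0.4, 1.0]])
x = A @ s                                        # 2 observed sensor waveforms

# With as many sensors as sources, the mixing is invertible...
s_hat = np.linalg.inv(A) @ x
print(np.allclose(s_hat, s))                     # True
```

With more sources than sensors (N > 2 here), the same system is underdetermined: no inverse exists, which is why a perfect solution is impossible and knowledge-based constraints are needed.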
• Disentangling mixtures is the primary goal
  - a perfect solution is not possible
  - need knowledge-based constraints

The information in sound

[figure: spectrograms (0-4 kHz, 0-4 s) of two sets of footsteps, "Steps 1" and "Steps 2"]

• A sense of hearing is evolutionarily useful
  - gives organisms 'relevant' information
• Auditory perception is ecologically grounded
  - scene analysis is preconscious (→ illusions)
  - special-purpose processing reflects 'natural scene' properties
  - subjective, not canonical (ambiguity)

Sound mixtures

• The sound 'scene' is almost always a mixture
  - always stuff going on
  - sound is 'transparent'

[figure: spectrograms (0-4 kHz, 0-1 s) showing Speech + Noise = Speech+Noise]

• Need information related to our 'world model'
  - i.e. separate objects
  - a wolf howling in a blizzard is the same as a wolf howling in a rainstorm
  - whole-signal statistics won't do this
• 'Separateness' is similar to independence
  - objects/sounds that change in isolation
  - but: depends on the situation, e.g. passing car vs. mechanic's diagnosis

Vision and hearing

• Hearing and seeing are complementary
  - hearing is omnidirectional
  - hearing works in the dark
• They reveal different things about the world
  - vision is good for examining static situations
  - physical motion almost always makes sound

Thinking about information processing: Marr's levels of explanation

• Three distinct aspects of info. processing:
  - Computational theory: 'what' and 'why'; the overall goal (here: sound source organization)
  - Algorithm: 'how'; an approach to meeting the goal (auditory grouping)
  - Implementation: practical realization of the process (feature calculation & binding)

Why bother?
  - it helps organize interpretation
  - it's OK to consider the levels separately, one at a time

An example: neural inhibition
  - Computational theory: frequency-domain processing (weighting the spectrum by X(f))
  - Algorithm: discrete-time filtering (subtraction)
  - Implementation: neurons with GABAergic inhibition

2 Human Auditory Scene Analysis (ASA)
  - experimental psychoacoustics
  - grouping and cues
  - perception of complex scenes

Human sound organization

• "Auditory Scene Analysis" [Bregman 1990]
  - break the mixture into small elements (in time-frequency)
  - elements are grouped into sources using cues
  - sources have aggregate attributes
• Grouping 'rules' (Darwin, Carlyon, ...):
  - cues: common onset/offset/modulation, harmonicity, spatial location, ... (from Darwin 1996)

Cues to simultaneous grouping

• Common attributes and common 'fate', for two cues (common onset; periodicity) at Marr's three levels:
  - Computational theory: acoustic consequences tend to be synchronized; (nonlinear) cyclic processes are common
  - Algorithm: group elements that start within a time range; place patterns? autocorrelation?
  - Implementation: onset detector cells; synchronized oscillators? delay-and-multiply? modulation spectrum?
• + Spatial location (ITD, ILD, spectral cues)...
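The "group elements that start in a time range" algorithm above can be sketched in a few lines. This is a toy illustration: the synthetic envelopes, channel count, and ~25 ms tolerance are all invented.

```python
# Toy sketch of onset-based simultaneous grouping: channels whose
# energy switches on at nearly the same time are assigned to one group.
import numpy as np

frame_ms = 10                  # one analysis frame = 10 ms
n_chan, n_frames = 6, 100

# Synthetic per-channel energy envelopes: channels 0-2 switch on at
# frame 20, channels 3-5 at frame 60 (two distinct 'events').
env = np.zeros((n_chan, n_frames))
env[:3, 20:] = 1.0
env[3:, 60:] = 1.0

def onset_frame(e, thresh=0.5):
    """Crude onset detector: first frame where energy exceeds thresh."""
    above = np.flatnonzero(e > thresh)
    return int(above[0]) if above.size else None

onsets = [onset_frame(env[c]) for c in range(n_chan)]

# Group channels whose onsets fall within a common tolerance window.
tol = 25 // frame_ms           # ~25 ms expressed in frames
groups = []
for c in sorted(range(n_chan), key=lambda c: onsets[c]):
    for g in groups:
        if abs(onsets[c] - onsets[g[0]]) <= tol:
            g.append(c)
            break
    else:
        groups.append([c])

print(groups)                  # [[0, 1, 2], [3, 4, 5]]
```

Real systems must of course cope with gradual onsets, overlapping events, and noisy envelopes; this only shows the shape of the grouping step.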
Complications for grouping, 1: cues in conflict

• Mistuned harmonic (Moore, Darwin, ...):
  [figure: harmonic complex in which one component is mistuned by ∆f and onsets ∆t early]
  - determine how ∆t and ∆f affect
    • segregation of the harmonic
    • pitch of the complex
• Gradual, various results
  [figure: pitch shift of the complex vs. mistuning, out to ~3% mistuning]
  http://www.dcs.shef.ac.uk/~martin/MAD/docs/mad.htm

Complications for grouping, 2: the effect of time

• Added harmonics:
  [figure: harmonics added one at a time, staggered in onset]
  - the onset cue initially segregates; periodicity eventually fuses
• The effect of time
  - some cues take time to become apparent
  - the onset cue becomes increasingly distant...
• What is the impetus for fission?
  - e.g. double vowels
  - depends on what you expect ... ?

Sequential grouping: streaming

• Successive tone events form separate streams
  [figure: alternating tone sequence around 1 kHz; tone repetition time (TRT) 60-150 ms, ∆f up to ±2 octaves]
• Order, rhythm &c are perceived within, not between, streams
  - Computational theory: consistency of properties for successive source events
  - Algorithm: 'expectation window' for known streams (widens with time)
  - Implementation: competing time-frequency affinity weights...

The effect of context

• Context can create an 'expectation': i.e. a bias towards a particular interpretation
• e.g. Bregman's "old-plus-new" principle: a change in a signal will be interpreted as an added source whenever possible
  [figure: tone complexes (0-2 kHz, 0-1.2 s) whose energy is divided differently depending on context]
  - a different division of the same energy depending on what preceded it

Restoration & illusions

• 'Illusions' illuminate the algorithm
  - what model would 'misbehave' this way?
• E.g. the 'continuity illusion':
  - making a 'best guess' for masked information
  [figure: spectrogram ("ptshort", 0-4 kHz, 0-1.4 s) of tone bursts alternating with noise bursts]
  - a tone alternates with noise bursts
  - the noise is strong enough to mask the tone
  - a continuous tone is distinctly perceived across gaps of ~100s of ms
  → inference acts at a low, preconscious level

Speech restoration

• Speech provides a very strong basis for inference (coarticulation, grammar, semantics):
  [figure: spectrogram ("nsoffee.aif") of a word with a noise burst replacing one phoneme; auditory spectrogram ("S1-env.pf") of sinewave speech]
• Phonemic restoration
• Sinewave speech (duplex?)

Ground truth in complex scenes

• What do people hear in sound mixtures?
  - do interpretations match?
→ Listening tests to collect 'perceived events'

Ground-truth results

[figure: spectrogram of a 9-second "City" street recording, with events marked and the labels given by 10 subjects (S1-S10) clustered under consensus events: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10); e.g. the Crash event drew labels "gunshot", "large object crash", "slam", "door slam?", "crash", "door slamming", "trash can", "crash (not car)"]

3 Computational ASA (CASA)
  - bottom-up models
  - top-down predictions
  - other approaches

Computational ASA

[diagram: CASA system mapping an input mixture to Object 1 / Object 2 / Object 3 descriptions ...]

• Goal: automatic sound organization;
  systems to 'pick out' sounds in a mixture
  - ... like people do
• E.g. a voice against a noisy background
  - to improve speech recognition
• Approach:
  - psychoacoustics describes grouping 'rules'
  - ... just implement them?
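As a taste of what 'just implementing' a grouping cue involves, here is a rough numpy sketch of correlogram-style periodicity analysis, the front end described below. The brick-wall FFT filterbank, band edges, and test signal are invented stand-ins, not a physiological model:

```python
# Correlogram sketch: a crude FFT-masking filterbank stands in for the
# cochlear filters, half-wave rectification for the static nonlinearity,
# and autocorrelation for the delay-and-multiply periodicity detectors.
import numpy as np

sr = 8000
t = np.arange(2000) / sr                         # 250 ms of signal
f0 = 125.0                                       # fundamental of a harmonic complex
x = sum(np.sin(2 * np.pi * f0 * h * t) for h in range(1, 17))

def bandpass(sig, lo, hi):
    """Crude brick-wall bandpass via FFT masking (not a cochlear model)."""
    spec = np.fft.rfft(sig)
    freqs = np.fft.rfftfreq(len(sig), 1 / sr)
    spec[(freqs < lo) | (freqs >= hi)] = 0
    return np.fft.irfft(spec, len(sig))

corr = []
for lo, hi in [(100, 300), (300, 600), (600, 1200), (1200, 2400)]:
    y = np.maximum(bandpass(x, lo, hi), 0)       # half-wave rectification
    ac = np.correlate(y, y, mode='full')[len(y) - 1:]
    corr.append(ac / ac[0])                      # normalized autocorrelation

# Summary autocorrelation across channels: its strongest peak away from
# lag 0 falls at the period shared by every channel, 1/f0.
summary = np.sum(corr, axis=0)
lag = int(np.argmax(summary[20:200])) + 20       # search lags 2.5-25 ms
print(sr / lag)                                  # ≈ 125.0 Hz
```

The zero-lag column of such an analysis reduces to an energy spectrogram, which is why the correlogram is often described as a spectrogram augmented with a periodicity axis.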
CASA front-end processing

• Correlogram: loosely based on known/possible physiology
  [diagram: sound → cochlea filterbank → frequency channels → envelope follower → delay line → short-time autocorrelation → correlogram slice (frequency × lag, over time)]
  - linear filterbank cochlear approximation
  - static nonlinearity
  - zero-delay slice is like the spectrogram
  - periodicity from delay-and-multiply detectors

The representational approach (Brown & Cooke 1994)

• Implement psychoacoustic theory
  [diagram: input mixture → front end (signal feature maps: frequency, onset time, period, freq. modulation) → object formation (discrete objects) → grouping rules → source groups]
  - 'bottom-up' processing
  - uses common onset & periodicity cues
• Able to extract voiced speech:
  [figure: spectrograms (0-1.0 s) of the mixture ("brn1h.aif") and the extracted voiced speech ("brn1h.fi.aif")]

Problems with 'bottom-up' CASA

• Circumscribing time-frequency elements
  - need to have 'regions', but they are hard to find
• Periodicity is the primary cue
  - how to handle aperiodic energy?
• Resynthesis via masked filtering
  - cannot separate within a single t-f element
• Bottom-up leaves no ambiguity or context
  - how to model illusions?

Adding top-down cues

• Perception is not direct but a search for plausible hypotheses
• Data-driven (bottom-up)...
  [diagram: input mixture → front end → signal features → object formation → discrete objects → grouping rules → source groups]
  ... vs. prediction-driven (top-down) (PDCASA)
  [diagram: hypotheses (noise components, periodic components) → predict & combine → predicted features; front end → signal features → compare & reconcile → prediction errors → hypothesis management]
• Motivations
  - detect non-tonal events (noise & click elements)
  - support 'restoration illusions'...
  → hooks for high-level knowledge
  + 'complete explanation', multiple hypotheses, ...

Generic sound elements for PDCASA

• Goal is a representational space that
  - covers real-world perceptual sounds
  - has a minimal parameterization (sparseness)
  - keeps separate attributes in separate parameters
• NoiseObject: spectral profile H[k] and magnitude contour A[n]:
  XN[n,k] = H[k]·A[n]
• ClickObject: spectral profile H[k], onset time n0, and per-channel decay times T[k]:
  XC[n,k] = H[k]·exp(−(n − n0) / T[k])
• WeftObject (pitched energy, from correlogram analysis): smooth spectral envelope HW[n,k] and periodogram excitation P[n,k]:
  XW[n,k] = HW[n,k]·P[n,k]
• Object hierarchies built on top...

PDCASA for old-plus-new

• Incremental analysis of the input signal:
  - time t1: initial element created
  - time t2: additional element required
  - time t3: second element finished

PDCASA for the continuity illusion

• Subjects hear the tone as continuous ... if the noise is a plausible masker
  [figure: spectrogram ("ptshort", 0-4 kHz, 0-1.4 s) of the tone/noise alternation]
• Data-driven analysis gives just the visible portions
• Prediction-driven analysis can infer the masking

PDCASA and complex scenes

[figure: PDCASA analysis of the 9-second "City" scene: Wefts 1-12, Noise1, Noise2, and Click1, shown against the listeners' ground-truth events Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]

Marrian analysis of PDCASA

• Marr invoked to separate high-level function from low-level details:
  - Computational theory: objects persist predictably; observations interact irreversibly
  - Algorithm: build hypotheses from generic elements; update by prediction-reconciliation
  - Implementation: ???
• "It is not enough to be able to describe the response of single cells, nor predict the results of psychophysical experiments. Nor is it enough even to write computer programs that perform approximately in the desired way: One has to do all these things at once, and also be very aware of the computational theory..."

Other approaches: ICA (Bell & Sejnowski etc.)
• General idea: drive a parameterized separation algorithm to maximize the independence of its outputs
  [diagram: mixtures m1, m2 formed from sources s1, s2 via mixing weights a11, a12, a21, a22; unmixing parameters adjusted along −∂MutInfo/∂a]
• Attractions:
  - mathematically rigorous, minimal assumptions
• Problems:
  - limitations of the separation algorithm (N x N)
  - essentially bottom-up

Other approaches: neural oscillators (von der Malsburg; Wang & Brown)

• Locally-excited, globally-inhibited networks form separate phases of synchrony
• Advantages:
  - avoid implausible AI methods (search, lists)
  - oscillators substitute for iteration
• But: only concerns the implementation level?

4 CASA issues & applications
  - learning
  - missing data
  - audio information retrieval
  - the machine listener

Learning & acquisition

• The speech recognition lesson: how to exploit large databases?
• 'Maximum likelihood' sound organization (e.g. Roweis)
  - learn model→sound distributions P(X | M) by analyzing isolated-sound databases
  - combine models with physics: P(X | {Mi})
  - learn patterns of model combinations P({Mi})
  - search for the most likely combination of models to explain the observed sound mixture:
    max over {Mi} of P({Mi} | X) ∝ P(X | {Mi}) · P({Mi})
• Short-term learning
  - hearing a particular source can alter short-term interpretations of mixtures

Missing data recognition (Cooke, Green, Barker... @ Sheffield)

• Energy overlaps in time-frequency hide features
  - some observations are effectively missing
• Use missing feature theory...
  - integrate over the missing data dimensions x_m:
    p(x | q) = ∫ p(x_g | x_m, q) · p(x_m | q) dx_m
• Effective in speech recognition
  - the trick is finding the good/bad data mask
  [diagram: "1754" + noise → mask based on stationary noise estimate → missing-data recognition → "1754"]

Multi-source decoding (Jon Barker @ Sheffield)

• Search over sound-fragment interpretations
  [diagram: "1754" + noise → mask based on stationary noise estimate → mask split into subbands → 'grouping' applied within bands (spectro-temporal proximity, common onset/offset) → multisource decoder → "1754"]
• CASA for masks/fragments
  - larger fragments → quicker search
• Use with nonspeech models?

Audio information retrieval

• Searching in a database of audio
  - speech: use ASR
  - text annotations: search them
  - sound effects library?
• e.g. Muscle Fish "SoundFisher" browser
  - define multiple 'perceptual' feature dimensions
  - search by proximity in a weighted feature space
  [diagram: sound segment database → segment feature analysis → feature vectors; query example → segment feature analysis → search/comparison → results]
  - features are 'global' for each soundfile; no attempt to separate mixtures

CASA for audio retrieval

• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
  [diagram: continuous audio archive → generic element analysis → element representations → object formation → objects + properties; query example (or a symbolic query such as "rushing water", via word-to-class mapping) → search/comparison → results]
  - features are calculated over grouped subsets

Alarm sound detection

• Alarm sounds have a particular structure
  - people 'know them when they hear them'
• Isolate alarms in sound mixtures
  [figure: spectrograms (0-5 kHz, 1-2.5 s) of an alarm sound in increasing levels of background noise]
  - representation of energy in time-frequency
  - formation of atomic elements
  - grouping by common properties (onset &c.)
  - classify by attributes...
• Key: recognize despite the background

Future prosthetic listening devices

• CASA to replace lost hearing ability
  - sound mixtures are difficult for the hearing impaired
• Signal enhancement
  - resynthesize a single source without the background
  - (needs very good resynthesis)
• Signal understanding
  - monitor for particular sounds (doorbell, knocks) & translate into an alternative mode (vibrating alarm)
  - real-time textual descriptions, i.e. "automatic subtitles for real life": [thunder] S: I THINK THE WEATHER'S CHANGING

The 'machine listener'

• Goal: an auditory system for machines
  - use the same environmental information as people do
• Aspects:
  - recognize spoken commands (but not others)
  - track 'acoustic channel' quality (for responses)
  - categorize the environment (conversation, crowd, ...)
• Scenarios
  - personal listener → summary of your day
  - autonomous robots: need awareness

5 Summary

• Sound contains lots of information
  ... but it's not easy to extract
• We know a little about how humans hear
  ... at least for simplified sounds
• We have some ways to copy it
  ... which we hope to improve
• CASA would have many useful applications
  ... machines to listen and remember for us

Further reading

[BarkCE00] J. Barker, M.P. Cooke & D. Ellis (2000). "Decoding speech in the presence of other sound sources," Proc. ICSLP-2000, Beijing. ftp://ftp.icsi.berkeley.edu/pub/speech/papers/icslp00-msd.pdf
[Breg90] A.S. Bregman (1990). Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press.
[BrowC94] G.J. Brown & M.P. Cooke (1994). "Computational auditory scene analysis," Computer Speech and Language 8, 297-336.
[Chev00] A. de Cheveigné (2000). "The auditory system as a separation machine," Proc. Intl. Symposium on Hearing. http://www.ircam.fr/pcm/cheveign/sh/ps/ATReats98.pdf
[CookE01] M. Cooke & D. Ellis (2001). "The auditory organization of speech and other sources in listeners and computational models," Speech Communication (accepted for publication). http://www.ee.columbia.edu/~dpwe/pubs/tcfkas.pdf
[DarC95] C.J. Darwin & R.P. Carlyon (1995). "Auditory grouping," in The Handbook of Perception and Cognition, Vol. 6, Hearing (ed. B.C.J. Moore), Academic Press, 387-424.
[Ellis99] D.P.W. Ellis (1999). "Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis, and its application to speech/nonspeech mixtures," Speech Communication 27. http://www.icsi.berkeley.edu/~dpwe/research/spcomcasa98/spcomcasa98.pdf