Automatic audio analysis for content description & indexing
Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>

Outline
1 Auditory Scene Analysis (ASA)
2 Computational ASA (CASA)
3 Prediction-driven CASA
4 Speech recognition & sound mixtures
5 Implications for content analysis

1 Auditory Scene Analysis
"The organization of complex sound scenes according to their inferred sources"
[Figure: spectrogram of the 'city22' street-ambience recording, 0-9 s, 200-4000 Hz]
• Sounds rarely occur in isolation
- organization is required to extract useful information
• Human audition is very effective
- yet unexpectedly difficult to model
• The 'correct' analysis is defined by the goal
- each source shows independence and continuity
→ ecological constraints enable organization

Psychology of ASA
• Extensive experimental research
- organization of 'simple pieces' (sinusoids & white noise)
- streaming, pitch perception, 'double vowels'
• "Auditory Scene Analysis" [Bregman 1990] → grouping 'rules':
- common onset/offset/modulation, harmonicity, spatial location
• Debated... (Darwin, Carlyon, Moore, Remez)
[Figure: grouping illustration, from Darwin 1996]

2 Computational Auditory Scene Analysis (CASA)
• Automatic sound organization?
- convert an undifferentiated signal into a description in terms of different sources
[Figure: 'city22' spectrogram annotated with events: horn, horn, door crash, yell, car noise]
• Translate the psychological rules into programs?
- representations to reveal common onset, harmonicity, ...
• Motivations & applications
- it's a puzzle: new processing principles?
- real-world interactive systems (speech, robots)
- hearing prostheses (enhancement, description)
- advanced processing (remixing)
- multimedia indexing...

CASA survey
• Early work on co-channel speech
- listeners benefit from a pitch difference
- algorithms for separating periodicities
• Utterance-sized signals need more
- cannot predict the number of signals (0, 1, 2, ...)
- birth/death processes
• Ultimately, more constraints are needed
- nonperiodic signals
- masked cues
- ambiguous signals

CASA1: Periodic pieces
• Weintraub 1985
- separate male & female voices
- find periodicities in each frequency channel by auto-coincidence (see the sketch below this slide)
- the number of voices is a 'hidden state'
• Cooke & Brown (1991-3)
- divide the time-frequency plane into elements
- apply grouping rules to form sources
- pull a single periodic target out of noise
[Figure: spectrograms of the mixture 'brn1h.aif' and the extracted target 'brn1h.fi.aif', 0-1 s, 100-3000 Hz]
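The per-channel auto-coincidence idea is easy to make concrete. Below is a minimal Python/numpy sketch, not Weintraub's actual algorithm: it stands in a crude envelope and a normalized autocorrelation for his auto-coincidence measure, pools the per-channel functions to pick one common period, and reports each channel's salience at that lag. The function name and parameters are illustrative choices.

    import numpy as np

    def correlogram_frame(channels, sr, fmin=80.0, fmax=400.0):
        """Per-channel periodicity by short-time autocorrelation.

        channels: (n_channels, n_samples) filterbank outputs for one frame
        sr:       sample rate in Hz
        Returns (best_f0_hz, per-channel salience at the winning lag).
        """
        lag_min, lag_max = int(sr / fmax), int(sr / fmin)
        n_ch, n = channels.shape
        env = np.abs(channels)                 # crude envelope follower
        acs = np.zeros((n_ch, lag_max + 1))
        for lag in range(lag_min, lag_max + 1):
            # auto-coincidence: how well each channel repeats at this lag
            acs[:, lag] = np.sum(env[:, lag:] * env[:, :n - lag], axis=1)
        acs /= np.sum(env * env, axis=1, keepdims=True) + 1e-9
        summary = acs.sum(axis=0)              # pool evidence across channels
        best_lag = lag_min + int(np.argmax(summary[lag_min:]))
        return sr / best_lag, acs[:, best_lag]

Channels whose salience at the winning lag exceeds a threshold would be grouped as one periodic source; handling two simultaneous voices, and tracking how many are present, is the 'hidden state' machinery this sketch omits.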
CASA2: Hypothesis systems
• Okuno et al. (1994-)
- 'tracers' follow each harmonic, plus a noise 'agent'
- residue-driven: account for the whole signal
• Klassner 1996
- search for a combination of templates
- high-level hypotheses permit front-end tuning
[Figure: template tracks for Buzzer-Alarm, Glass-Clink, Phone-Ring and Siren-Chirp, 420-3760 Hz over 4 s]
• Ellis 1996
- model for the events perceived in dense scenes
- prediction-driven: reconcile observations with hypotheses

CASA3: Other approaches
• Blind source separation (Bell & Sejnowski)
- find exact separation parameters by maximizing a statistic, e.g. signal independence
• HMM decomposition (RK Moore)
- recover the combined source states directly
• Neural models (Malsburg, Wang & Brown)
- avoid implausible AI methods (search, lists)
- oscillators substitute for iteration?

3 Prediction-driven CASA
Perception is not direct but a search for plausible hypotheses.
• Data-driven:
input mixture → Front end → signal features → Object formation → discrete objects → Grouping rules → Source groups
• vs. prediction-driven:
input mixture → Front end → signal features → Compare & reconcile ← predicted features ← Predict & combine ← hypotheses (Noise components, Periodic components); prediction errors feed Hypothesis management, which revises the hypotheses
(a toy sketch of this loop closes this section, below)
• Motivations
- detect non-tonal events (noise & clicks)
- support 'restoration illusions'...
→ hooks for high-level knowledge
+ 'complete explanation', multiple hypotheses, resynthesis

Analyzing the continuity illusion
• An interrupted tone is heard as continuous
- ...if the interruption could be a masker
[Figure: 'ptshort' spectrogram: a steady tone interrupted by noise bursts, 0-1.4 s]
• A data-driven analysis just sees the gaps
• A prediction-driven analysis can accommodate the interruption
- special case or general principle?

Phonemic restoration (Warren 1970)
• Another 'illusion' instance
• Inference relies on high-level semantics
[Figure: spectrogram of 'nsoffee.aif': speech with a segment replaced by noise, 1.2-1.7 s]
• Incorporating knowledge into models?

Subjective ground-truth in mixtures?
• Listening tests collect the 'perceived events'
[Figure: 'City' spectrogram, 0-9 s]
• Consistent answers (10 subjects):
- Horn1 (10/10): "horn", "car horn(s)", "double horn", "honk"
- Crash (10/10): "crash", "slam", "door slamming", "gunshot", "large object crash", "trash can"
- Horn2 (5/10): "horn during crash", "doppler horn"
- Truck (7/10): "truck engine", "truck accelerating", "rev up/passing", "wheels on road"

PDCASA example: City-street ambience
[Figure: PDCASA explanation of the City recording: periodic 'weft' elements (Wefts 1-4, 5, 6-7, 8, 9-12) matching Horn1 (10/10) through Horn5 (10/10); Noise2 and Click1 matching the Crash (10/10); Noise1 matching the Squeal (6/10) and Truck (7/10)]
• Problems
- error allocation
- rating hypotheses
- source hierarchy
- resynthesis
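To pull section 3 together before moving on, here is a deliberately toy Python sketch of the predict / compare / reconcile loop over a dB spectrogram. It is not the PDCASA implementation: the only component type is a steady noise bed, whereas the real system also has transient and periodic ('weft') elements, multiple hypotheses, and resynthesis. All names and thresholds are illustrative.

    import numpy as np

    def pdcasa_toy(spectrogram, birth_db=6.0, decay=0.98):
        """spectrogram: (n_freq, n_time) array of levels in dB."""
        n_f, n_t = spectrogram.shape
        floor = spectrogram.min()
        components = []                        # each: {'spec', 'born'}
        for t in range(n_t):
            obs = spectrogram[:, t]
            # predict & combine: each component predicts its stored spectrum
            pred = np.full(n_f, floor)
            for c in components:
                pred = np.maximum(pred, c['spec'])
            # compare: positive error is energy no current hypothesis explains
            if (obs - pred).max() > birth_db:
                # reconcile: spawn a new component for the unexplained energy
                components.append({'spec': obs.copy(), 'born': t})
            for c in components:
                # update a component only where it is plausibly audible;
                # where the observation is far above its own prediction,
                # another component may be masking it, so it keeps
                # predicting unchanged; this accommodation is what
                # licenses the continuity illusion
                audible = obs <= c['spec'] + birth_db
                c['spec'][audible] = (decay * c['spec'][audible]
                                      + (1 - decay) * obs[audible])
        return components

Run on the interrupted tone of the continuity-illusion slide, the tone's component keeps predicting straight through the loud bursts (which spawn their own components) instead of being cut into pieces at the gaps; a silent gap, by contrast, pulls the component's spectrum down.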
4 Speech recognition & sound mixtures
• Conventional speech recognition:
signal → Feature extraction → low-dim. features → Phoneme classifier → phoneme probabilities → HMM decoder → words
(a decoding sketch appears at the end of this deck)
- the signal is assumed to be entirely speech
- find a valid labelling in terms of discrete classes
- class models are learned from training data
• Some problems:
- need to ignore lexically-irrelevant variation (microphone, voice pitch etc.)
- compact feature space → everything looks speech-like
• Very fragile to nonspeech and background sound
- scene-analysis methods are very attractive...

CASA for speech recognition
• Data-driven: CASA as a preprocessor
- problems with 'holes' in the separated signal (but: Cooke, Okuno)
- doesn't exploit knowledge of speech structure
• Prediction-driven: speech as a component
- the same 'reconciliation', now of speech hypotheses
- need to express the 'predictions' in the signal domain
[Diagram: the prediction-driven loop of section 3 with Speech components added alongside the Noise and Periodic components feeding Predict & combine]

Example of speech & nonspeech
[Figure, panels (a)-(h): (a) a clap alone; (b) speech plus clap; (c) recognizer output, aligned phone and word labels; (d) reconstruction from the labels alone; (e) slowly-varying portion of the original; (f) predicted speech element = (d)+(e); (g) the click recovered by the nonspeech analysis; (h) spurious elements from the nonspeech analysis]
• Problems:
- undoing classification & normalization
- finding a starting hypothesis
- granularity of integration

5 Implications for content analysis: Using CASA to index soundtracks
[Figure: 'city22' spectrogram annotated with events: horn, horn, door crash, yell, car noise]
• What are the 'objects' in a soundtrack?
- subjective definition → need an auditory model
• Segmentation vs. classification
- low-level cues → locate events
- higher-level 'learned' knowledge gives the semantic label (footstep, crash) ... AI-complete?
• But: hard to separate
- the illusion phenomena suggest auditory organization depends on interpretation

Using speech recognition for indexing
• Active research area: access to broadcast-news databases
- e.g. Informedia (CMU), ThisL (BBC + ...)
- use LVCSR to transcribe, then text retrieval to find (sketched at the end of this deck)
- 30-40% word error rate, but retrieval still works OK
• Several systems at the NIST TREC workshop
• Tricks to 'ignore' nonspeech and poorly-recognized speech

Open issues in automatic indexing
• How to do ASA?
• Explanation/description hierarchy
- PDCASA: 'generic' primitives + a constraining hierarchy
- subjective & task-dependent
• Classification
- connecting subjective & objective properties
→ finding subjective invariants, prominence
- representation of sound-object 'classes'
• Resynthesis?
- a 'good' description should be adequate to resynthesize from
- provided in PDCASA, but at low quality
- requires good knowledge-based constraints

6 Conclusions
• Auditory organization is required in real environments
• We don't know how listeners do it!
- plenty of modeling interest
• Prediction-reconciliation can account for the 'illusions'
- use 'knowledge' when the signal is inadequate
- important in a wider range of circumstances?
• Speech recognizers are a good source of knowledge
• Automatic indexing implies a 'synthetic listener'
- need to solve a lot of modeling issues
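Two closing sketches. First, section 4's pipeline ends in an HMM decoder, whose core is a Viterbi search over the per-frame phoneme probabilities. A minimal numpy sketch, assuming a generic state-transition matrix in place of a real lexicon and language model; all names are illustrative.

    import numpy as np

    def viterbi(log_obs, log_trans, log_init):
        """Most probable state path given per-frame log-probabilities.

        log_obs:   (n_frames, n_states), e.g. log phoneme posteriors
                   from the classifier stage
        log_trans: (n_states, n_states) transition log-probabilities
        log_init:  (n_states,) initial log-probabilities
        """
        n_frames, n_states = log_obs.shape
        delta = log_init + log_obs[0]          # best score ending in each state
        back = np.zeros((n_frames, n_states), dtype=int)
        for t in range(1, n_frames):
            scores = delta[:, None] + log_trans    # (from_state, to_state)
            back[t] = scores.argmax(axis=0)        # best predecessor per state
            delta = scores.max(axis=0) + log_obs[t]
        path = [int(delta.argmax())]
        for t in range(n_frames - 1, 0, -1):       # trace back
            path.append(int(back[t, path[-1]]))
        return path[::-1]

Note that the argmax always returns some labelling: whatever lands in the compact feature space is forced into the nearest speech-like explanation, which is exactly the fragility to nonspeech flagged in section 4.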
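Second, the transcribe-then-search recipe of 'Using speech recognition for indexing', as a minimal sketch. It assumes the recognizer delivers time-stamped words; the data format and function names are illustrative, and real systems (Informedia, ThisL) add retrieval scoring that tolerates the 30-40% word error rate.

    from collections import defaultdict

    def build_index(transcripts):
        """transcripts: {doc_id: [(time_sec, word), ...]} from a recognizer."""
        index = defaultdict(list)
        for doc_id, words in transcripts.items():
            for time_sec, word in words:
                index[word.lower()].append((doc_id, time_sec))
        return index

    def find(index, query):
        """All (doc, time) hits for any query term (OR semantics)."""
        hits = []
        for term in query.lower().split():
            hits.extend(index.get(term, []))
        return sorted(hits)

    # usage: jump straight to where a broadcast mentions the query
    idx = build_index({'news1': [(12.3, 'horn'), (40.1, 'crash')],
                       'news2': [(5.0, 'crash')]})
    print(find(idx, 'crash'))    # [('news1', 40.1), ('news2', 5.0)]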