Sound, Mixtures, and Learning: LabROSA overview
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
Electrical Engineering Dept., Columbia University, New York
http://labrosa.ee.columbia.edu/

Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Music Analysis & Similarity
4 General Sound Organization
5 Future Work

1 Auditory Scene Analysis
[Figure: spectrogram (0-4 kHz, 12 s, level in dB) analyzed into labeled sources: voice (evil), voice (pleasant), stab, rumble, choir, strings]
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects 'natural scene' properties
- subjective, not absolute

Sound, mixtures, and learning
[Figure: spectrograms (0-4 kHz) showing Speech + Noise = Speech+Noise]
• Sound
- carries useful information about the world
- complements vision
• Mixtures
- ... are the rule, not the exception
- the medium is 'transparent', the sources are many
- must be handled!
• Learning
- the 'speech recognition' lesson: let the data do the work
- ... like listeners do

The problem with recognizing mixtures
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)
• The received waveform is a mixture
- two sensors, N signals ... underconstrained
• Is disentangling the mixture the primary goal?
- a perfect solution is not possible
- we need experience-based constraints

Human Auditory Scene Analysis (Bregman 1990)
• How do people analyze sound mixtures?
- break the mixture into small elements (in time-frequency)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping 'rules' (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Figure: frequency analysis feeding onset, harmonicity, and position maps into a grouping mechanism that yields source properties (after Darwin 1996)]

Cues to simultaneous grouping
• Elements + attributes
[Figure: spectrogram, 0-8 kHz over 9 s]
• Common onset
- simultaneous energy has a common source
• Periodicity
- energy in different bands with the same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...

The effect of context
• Context can create an 'expectation', i.e. a bias towards a particular interpretation
• e.g. Bregman's "old-plus-new" principle: a change in a signal will be interpreted as an added source whenever possible
[Figure: spectrograms (0-2 kHz, 1.2 s) showing a different division of the same energy depending on what preceded it]

Computational Auditory Scene Analysis (CASA)
[Figure: a CASA system turns an input mixture into descriptions of objects 1, 2, 3, ...]
• Goal: automatic sound organization, i.e. systems to 'pick out' sounds in a mixture
- ... like people do
• E.g. a voice against a noisy background
- to improve speech recognition
• Approach:
- psychoacoustics describes grouping 'rules'
- ... just implement them? (a minimal sketch of one such rule follows)
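To make "just implement them" concrete, here is a minimal sketch, not from the original slides, of one grouping rule: time-frequency channels whose energy rises sharply at the same frame are grouped as a likely common source. The STFT parameters, thresholds, and function name are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the slides' system):
# group spectrogram channels by common onset.
import numpy as np
from scipy.signal import stft

def common_onset_groups(x, sr, onset_db=12.0, min_channels=10):
    """Return {onset_frame: [channels]} for onsets shared across channels."""
    f, t, X = stft(x, fs=sr, nperseg=512, noverlap=384)
    logmag = 20 * np.log10(np.abs(X) + 1e-9)      # dB magnitude spectrogram
    rise = np.diff(logmag, axis=1)                # frame-to-frame energy change
    onsets = rise > onset_db                      # sharp-rise flags per cell
    groups = {}
    for ch, frame in zip(*np.nonzero(onsets)):
        groups.setdefault(frame, []).append(ch)   # collect channels per frame
    # An onset shared by many channels is evidence of a single new source
    return {fr: chs for fr, chs in groups.items() if len(chs) >= min_channels}

# Usage: groups = common_onset_groups(mixture, 16000)
```

A real system would add the other cues (offset, modulation, harmonicity, location) and resolve conflicts between them, which is where the difficulties begin.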
The Representational Approach (Brown & Cooke 1993)
• Implement psychoacoustic theory
[Figure: input mixture → front end (maps: frequency, onset time, period, frequency modulation) → object formation (discrete objects) → grouping rules → source groups]
- 'bottom-up' processing
- uses common onset & periodicity cues
• Able to extract voiced speech:
[Figure: spectrograms of the mixture (brn1h.aif) and the extracted voiced speech (brn1h.fi.aif)]

Restoration in sound perception
• Auditory 'illusions' = hearing what's not there
• The continuity illusion
[Figure: spectrogram (ptshort) of a tone alternating with noise bursts]
• Sine-wave speech (SWS)
[Figure: Bark-scale spectrogram of sine-wave speech]
- duplex perception
• How to model this in CASA?

Adding top-down constraints
• Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up):
[Figure: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
- objects irresistibly appear
• vs. prediction-driven (top-down):
[Figure: hypothesis management drives 'predict & combine' over noise and periodic components; the predicted features are compared & reconciled with the front end's signal features, and the prediction errors update the hypotheses]
- match observations against the parameters of a world-model
- needs world-model constraints ...

Prediction-Driven CASA (Ellis 1996)
• Explain a complex sound with basic elements
[Figure: a 9 s city-street ambience decomposed into wefts (periodic elements), noise beds, and clicks; listeners' identifications include five horns (10/10, 5/10, 5/10, 8/10, 10/10), a crash (10/10), a squeal (6/10), and a truck (7/10)]

Approaches to sound mixture recognition
• Recognize the combined signal
- 'multicondition training'
- combinatorics ...
• Separate the signals
- e.g. CASA, ICA
- nice, if you can do it
• Segregate features into fragments
- then use missing-data recognition

Aside: Evaluation
• Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?
• SNR improvement
- not easy given only before-and-after signals: the correspondence problem
- can be done with a fixed filtering mask, but this rewards removing signal as well as noise
• ASR improvement
- recognizers are typically very sensitive to artifacts
• A 'real' task?
- a mixture corpus with specific sound events ...

Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
- the information in speech
- the Meeting Recorder project
- speech fragment decoding
3 Music Analysis & Similarity
4 General Sound Organization
5 Future Work

2 The information in speech (Patricia Scanlon)
• Mutual information identifies where the information is in time/frequency:
[Figure: mutual information between time-frequency cells and phoneme labels]
- little temporal structure when averaged over all sounds
• Better with just vowels:
[Figure: phoneme-information and speaker-information maps for vowel segments]
(a sketch of such a mutual-information map follows)
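A hedged sketch of how such a map might be estimated, not the original analysis: quantize each time-frequency feature and compute its mutual information with the phoneme label from a joint histogram. Array shapes, bin counts, and function names are assumptions.

```python
# Illustrative sketch: mutual information (bits) between each spectral channel
# and a phoneme label, estimated from quantized joint histograms.
import numpy as np

def mi_map(features, labels, n_bins=16):
    """features: (n_frames, n_channels) log-spectra; labels: (n_frames,) int ids."""
    n_frames, n_channels = features.shape
    mi = np.zeros(n_channels)
    for ch in range(n_channels):
        # Equal-population quantization of this channel
        edges = np.quantile(features[:, ch], np.linspace(0, 1, n_bins + 1)[1:-1])
        q = np.digitize(features[:, ch], edges)
        joint = np.zeros((n_bins, labels.max() + 1))
        np.add.at(joint, (q, labels), 1)           # joint histogram
        p = joint / n_frames
        px = p.sum(axis=1, keepdims=True)          # feature marginal
        py = p.sum(axis=0, keepdims=True)          # label marginal
        with np.errstate(divide="ignore", invalid="ignore"):
            mi[ch] = np.nansum(p * np.log2(p / (px * py)))
    return mi
```

Computing such per-channel MI at a range of time offsets relative to the phone boundary gives a time-frequency information map of the kind shown on the slide.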
The best subword units? (Eric Fosler)
• Speech recognizers typically use phonemes
- inherited from linguistics
• An alternative approach is 'articulatory features'
- orthogonal attributes defining subwords
• Can we infer a feature set from the data?
- using e.g. Independent Component Analysis
[Figure: spectrogram of TIMIT utterance test/dr1/faks0/sa2, with principal-component and independent-component basis vectors over frequency (Bark) and the phone labels over time]

The Meeting Recorder Project (CompSci, ICSI, UW, IDIAP, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, IDIAP):
- 100 hours collected, transcription ongoing
• NSF 'Mapping Meetings' project
- also interest from NIST, DARPA, EU

Speaker turn detection (Huan Wei Hee, Jerry Liu)
• Acoustic: triangulate tabletop-mic timing differences
- use the normalized peak value for confidence
[Figure: PZM cross-correlation lags for meeting mr-2000-11-02-1440, and an example cross-coupling response, chan3 to chan0]
• Behavioral: look for patterns of speaker turns
[Figure: hand-marked speaker turns vs. time for meeting mr04, with automatic/manual boundaries]

Speech Fragment recognition (Barker & Cooke / Sheffield)
• Standard classification chooses between models M to match source features X:
  $M^* = \arg\max_M P(M|X) = \arg\max_M P(X|M)\,\frac{P(M)}{P(X)}$
• Mixtures → observed features Y, segregation S, all related by P(X|Y,S)
[Figure: observation Y(f) versus source X(f) and segregation mask S on the frequency axis]
- spectral features allow a clean relationship
• Joint classification of model and segregation:
  $P(M,S|Y) = P(M)\int P(X|M)\,\frac{P(X|Y,S)}{P(X)}\,dX \cdot P(S|Y)$

Multi-source decoding
• Search for more than one source
[Figure: observations Y(t) jointly explained by state sequences q1(t), q2(t) through segregation masks S1(t), S2(t)]
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
• Theoretical vs. practical limits

Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Music Analysis & Similarity
- musical structure analysis
- similarity browsing
4 General Sound Organization
5 Future Work

3 Music Structure Analysis (Alex Sheh)
• Fine-level information from music
- for searching
- for modeling/statistics
• e.g. chord sequences via Pitch Class Profiles (PCPs); a minimal chroma sketch follows:
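As a hedged illustration of the PCP idea, not the actual system: fold STFT bin energies onto the 12 pitch classes to get one chroma vector per frame. The FFT size, tuning reference, and normalization are assumptions.

```python
# Illustrative Pitch Class Profile / chroma sketch (assumed parameters,
# not the original system): fold STFT bin power onto 12 pitch classes.
import numpy as np
from scipy.signal import stft

def pcp(x, sr, n_fft=4096):
    f, t, X = stft(x, fs=sr, nperseg=n_fft)
    power = np.abs(X) ** 2
    midi = 69 + 12 * np.log2(f[1:] / 440.0)   # bin → MIDI pitch (A440), skip DC
    pc = np.round(midi).astype(int) % 12      # MIDI pitch → pitch class
    chroma = np.zeros((12, power.shape[1]))
    for k in range(12):
        chroma[k] = power[1:][pc == k].sum(axis=0)
    return chroma / (chroma.sum(axis=0, keepdims=True) + 1e-9)

# Chord labels could then come from correlating each frame's PCP against
# 24 major/minor templates, or from an HMM over such templates.
```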
Ground truth for Music Recordings (Rob Turetsky)
• Machine learning algorithms need labels
- but real recordings don't have labels
• MIDI 'replicas' exist
• Alignment locates the MIDI notes in the real sound:
[Figure: "Don't You Want Me" (Human League), verse 1: spectrograms of the original recording and the synthesized MIDI replica, with the aligned MIDI notes vs. time below]

Music Similarity Browsing (Adam Berenzweig)
• 'Anchor models': placing music on subjective axes
[Figure: artists laid out in an anchor-model similarity space]

Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Music Analysis & Similarity
4 General Sound Organization
- alarm detection
- sound texture modeling
- recognition of multiple sources
5 Future Work

4 Alarm sound detection
• Alarm sounds have a particular structure
- people 'know them when they hear them'
• Isolate alarms in sound mixtures
[Figure: spectrograms (0-5 kHz) of alarm sounds in mixtures]
- sinusoid peaks have invariant properties
[Figure: spectrogram of speech + alarm, with the detector's Pr(alarm) trace below]
- cepstral coefficients are easy to model

Sound Texture Modeling (Marios Athineos)
• The best sound models are based on sinusoids
- the noise residual is modeled quite simply
• Noise 'textures' have extra temporal structure
- need a more detailed model
• Linear prediction of the spectrum defines a parametric temporal envelope (see the sketch below):
[Figure: temporal envelope from frequency-domain LPC (60 poles / 300 ms) overlaid on the waveform of utterance mpgr1-sx419]
• High-quality noise-excited resynthesis:
- original vs. resynthesis; 2x time-scale modification; compared with a phase vocoder
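A hedged sketch of the core idea, with an assumed transform and pole count rather than the original implementation: applying linear prediction to a frequency-domain transform of a frame (here a DCT) gives an all-pole model whose 'spectrum' is a smooth parametric envelope of the frame's energy over time.

```python
# Frequency-domain linear prediction sketch (illustrative assumptions):
# LP on the DCT of a frame models the frame's temporal envelope.
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def lp_coeffs(x, order):
    """Autocorrelation-method linear prediction coefficients [1, -a1, ..., -ap]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])   # Yule-Walker equations
    return np.concatenate(([1.0], -a))

def temporal_envelope(frame, order=60, n_points=None):
    """Smooth envelope of one frame's energy over time, via LP on its DCT."""
    n_points = n_points or len(frame)
    c = dct(frame, type=2, norm="ortho")          # frequency-domain view of frame
    a = lp_coeffs(c, order)
    # The LP 'spectrum' of the DCT coefficients is an envelope over *time*
    w = np.exp(-2j * np.pi * np.outer(np.arange(n_points) / (2 * n_points),
                                      np.arange(len(a))))
    return 1.0 / np.abs(w @ a)

# Usage: env = temporal_envelope(signal_frame)   # e.g. a 300 ms frame
```

Shaping noise with such an envelope, rather than with a flat frame gain, is what preserves the fine temporal texture in the resynthesis.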
Sound mixture decomposition (Manuel Reyes)
• Full or approximate Bayesian inference to model multiple, independent sound sources:
[Figure: generation model: source models M1, M2 (with model dependence Θ) generate source signals Y1, Y2, which pass through channels C1, C2 to the received signals X1, X2 and the observations O; analysis structure: observations O → fragment formation → mask allocation {Ki} → model fitting {Mi} → likelihood evaluation p(X|Mi,Ki)]

Outline
1 Auditory Scene Analysis
2 Speech Recognition & Mixtures
3 Music Analysis & Similarity
4 General Sound Organization
5 Future Work
- audio-visual information
- real-world sound indexing

5 Future work: Automatic audio-video analysis (Shih-Fu Chang, Kathy McKeown)
• Documentary archive management
- huge ratio of raw to finished material
- costly manual logging
• Problem: term ↔ signal mapping
- training corpus of past annotations
- interactive semi-automatic learning
[Figure: annotations feed text processing (terms) and A/V data feeds segmentation and feature extraction (A/V features); concept discovery and unsupervised clustering yield generic detectors, feeding multimedia fusion (mining concepts & relationships), complex spatio-temporal classification, and question answering, organized around a multimedia concept network]

The 'Listening Machine'
• A smart PDA records everything
• Only useful if we have indexes and summaries
- monitoring for particular sounds
- real-time description
• Scenarios
- personal listener → a summary of your day
- future prosthetic hearing devices
- autonomous robots
• Meeting data, ambulatory audio

LabROSA Summary
• Domains: meetings, personal recordings, location monitoring, broadcast, movies, lectures
• ROSA: object-based structure discovery & learning
- speech recognition, speech characterization, nonspeech recognition
- scene analysis, audio-visual integration, music analysis
• Applications: structuring, search, summarization, awareness, understanding