Sound, Mixtures, and Learning: LabROSA overview
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval

Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/
Dan Ellis, Sound, Mixtures & Learning, 2003-07-21

1 Sound Content Analysis
[Figure: spectrogram (frq/Hz vs. time/s, level/dB) and its analysis into labeled events: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Sound understanding: the key challenge
  - what listeners do
  - understanding = abstraction
• Applications
  - indexing/retrieval
  - robots
  - prostheses

The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman ’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
  - ... like listeners do
• Hearing is ecologically grounded
  - reflects natural scene properties = constraints
  - subjective, not absolute

Auditory Scene Analysis (Bregman 1990)
• How do people analyze sound mixtures?
  - break mixture into small elements (in time-freq)
  - elements are grouped into sources using cues
  - sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
  - cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Figure: frequency analysis feeding an onset map, a harmonicity map, and a position map, followed by a grouping mechanism that yields source properties (after Darwin, 1996)]

Cues to simultaneous grouping
[Figure: spectrogram, freq/Hz vs. time/s]
• Elements + attributes
• Common onset
  - simultaneous energy has common source
• Periodicity
  - energy in different bands with same cycle
• Other cues
  - spatial (ITD/IID), familiarity, ...
• But: Context ...

Outline
1 Sound Content Analysis
2 Recognizing sounds
  - Clean speech
  - Speech-in-noise
  - Nonspeech
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval

2 Recognizing Sounds: Speech
• Standard speech recognition structure:
[Figure: sound → feature calculation → feature vectors → acoustic classifier → phone probabilities → HMM decoder → phone / word sequence → understanding/application...; other labels: DATA, acoustic model parameters, word models (e.g. "s ah t"), language model (e.g. p("sat"|"the","cat"), p("saw"|"the","cat"))]
• How to handle additive noise?
  - just train on noisy data: ‘multicondition training’

How ASR Represents Speech
• Markov model structure: states + transitions
[Figure: two models, M'1 and M'2, over states S, A, T, K, E, O, shown as state transition probabilities and state models (means), freq/kHz vs. time/sec]
• A generative model
  - but not a good speech generator!
  - only meant for inference of p(X|M)

General Audio Recognition (with Manuel Reyes)
• Searching audio databases
  - speech .. use ASR
  - text annotations .. search them
  - sound effects library?
• e.g.
Muscle Fish “SoundFisher” browser
  - define multiple ‘perceptual’ feature dimensions
  - search by proximity in (weighted) feature space
[Figure: sound segment database → segment feature analysis → feature vectors; query example → segment feature analysis → search/comparison → results]
  - features are global for each soundfile, no attempt to separate mixtures

Audio Recognition: Results
• Musclefish corpus
  - most commonly reported set
• Features
  - MFCC, brightness, bandwidth, pitch ...
  - no temporal structure
• Results:
  - 208 examples, 16 classes
  - global features: 41% correct
  - HMM models: 81% correct
[Table: confusion matrices for the two systems over the classes Musical, Speech, Environmental, Animals, Mechanical]

What are the HMM states?
• No sub-units defined for nonspeech sounds
• Final states depend on structure, initialization
  - number of states
  - initial clusters / labels / transition matrix
  - EM update objective
• Have ideas of what we’d like to get
  - investigate features/initialization to get there
[Figure: spectrogram of dogBarks2 (freq/kHz vs. time) labeled with the decoded state sequence s7 s3 s2 s4 ...]

Alarm sound detection (Ellis 2001)
• Alarm sounds have particular structure
  - people ‘know them when they hear them’
  - clear even at low SNRs
[Figure: spectrogram of alarm examples hrn01, bfr02, buz01 (freq/kHz vs. time/s, level/dB)]
• Why investigate alarm sounds?
  - they’re supposed to be easy
  - potential applications...
• Contrast two systems:
  - standard, global features, P(X|M)
  - sinusoidal model, fragments, P(M,S|Y)

Alarms: Results
[Figure: Restaurant + alarms (snr 0 ns 6 al 8): spectrogram (freq/kHz vs. time/sec), MLP classifier output, and sound object classifier output]
• Both systems commit many insertions at 0dB SNR, but in different circumstances:

Noise      Neural net system          Sinusoid model system
           Del        Ins    Tot      Del        Ins    Tot
1 (amb)    7 / 25     2      36%      14 / 25    1      60%
2 (bab)    5 / 25     63     272%     15 / 25    2      68%
3 (spe)    2 / 25     68     280%     12 / 25    9      84%
4 (mus)    8 / 25     37     180%     9 / 25     135    576%
Overall    22 / 100   170    192%     50 / 100   147    197%

Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
  - Auditory Scene Analysis
  - Parallel model inference
4 Accessing large datasets
5 Music Information Retrieval

3 Organizing mixtures: Approaches to handling overlapped sound
• Separate signals, then recognize
  - e.g. CASA, ICA
  - nice, if you can do it
• Recognize combined signal
  - ‘multicondition training’
  - combinatorics..
• Recognize with parallel models
  - full joint-state space?
  - or: divide signal into fragments, then use missing-data recognition

Computational Auditory Scene Analysis: The Representational Approach (Cooke & Brown 1993)
• Direct implementation of psych.
theory
[Figure: input mixture → front end (maps: freq, onset, time, period, frq.mod) → signal features → object formation → discrete objects → grouping rules → source groups]
  - ‘bottom-up’ processing
  - uses common onset & periodicity cues
• Able to extract voiced speech:
[Figure: spectrograms of brn1h.aif and brn1h.fi.aif, frq/Hz vs. time/s]

Adding top-down constraints
Perception is not direct but a search for plausible hypotheses
• Data-driven (bottom-up)...
[Figure: input mixture → front end → signal features → object formation → discrete objects → grouping rules → source groups]
  - objects irresistibly appear
vs.
• Prediction-driven (top-down)
[Figure: input mixture → front end → signal features → compare & reconcile → prediction errors → hypothesis management → hypotheses (noise components, periodic components) → predict & combine → predicted features]
  - match observations with parameters of a world-model
  - need world-model constraints...

Prediction-Driven CASA
[Figure: analysis of the ‘City’ sound scene (f/Hz vs. time/s, level in dB) into elements Wefts1−4, Weft5, Wefts6,7, Weft8, Wefts9−12, Noise1, Noise2, Click1, with labeled events Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10)]

Segregation vs. Inference
• Source separation requires attribute separation
  - sources are characterized by attributes (pitch, loudness, timbre + finer details)
  - need to identify & gather different attributes for different sources ...
• Need representation that segregates attributes
  - spectral decomposition
  - periodicity decomposition
• Sometimes values can’t be separated
  - e.g. unvoiced speech
  - maybe infer factors from probabilistic model?
    $p(O, x, y) \rightarrow p(x, y \mid O)$
  - or: just skip those values, infer from higher-level context
  - do both: missing-data recognition

Missing Data Recognition
• Speech models p(x|m) are multidimensional...
  - i.e. means, variances for every freq. channel
  - need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions $x_k$:
    $p(x_k \mid m) = \int p(x_k, x_u \mid m) \, dx_u$
• Hence, missing data recognition:
    $P(x \mid q) = P(x_1 \mid q) \cdot P(x_2 \mid q) \cdot P(x_3 \mid q) \cdot P(x_4 \mid q) \cdot P(x_5 \mid q) \cdot P(x_6 \mid q)$
[Figure: present data mask over dimension × time; joint density $p(x_k, x_u)$ with its marginal $p(x_k)$ and the bounded conditional $p(x_k \mid x_u < y)$]
  - hard part is finding the mask (segregation)

Comparing different segregations
• Standard classification chooses between models M to match source features X:
    $M^* = \arg\max_M P(M \mid X) = \arg\max_M P(X \mid M) \cdot \frac{P(M)}{P(X)}$
• Mixtures → observed features Y, segregation S, all related by $P(X \mid Y, S)$
[Figure: observation Y(f), source X(f), and segregation S plotted against freq]
  - spectral features allow clean relationship
• Joint classification of model and segregation:
    $P(M, S \mid Y) = P(M) \int P(X \mid M) \cdot \frac{P(X \mid Y, S)}{P(X)} \, dX \cdot P(S \mid Y)$
  - probabilistic relation of models & segregation

Multi-source decoding
• Search for more than one source
[Figure: observation Y(t) explained by two sources, each with its own state sequence q1(t), q2(t) and mask S1(t), S2(t)]
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
  - locally coherent regions
• Lots of issues in models, representations, matching, inference...
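The marginalization on the Missing Data Recognition slide becomes very simple when each model is a diagonal-covariance Gaussian: integrating out the unreliable dimensions just drops their factors from the product. The sketch below illustrates that special case only. The two models, the observation, the mask, and all function names are hypothetical, and the bounded form $p(x_k \mid x_u < y)$ is not covered.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def missing_data_loglik(x, mask, means, variances):
    """Log p(x_k | m) for a diagonal-covariance Gaussian model m.

    With independent dimensions, integrating the joint density over the
    masked-out dimensions x_u leaves only the factors of the present ones.
    """
    return sum(
        log_gauss(xi, mu, var)
        for xi, present, mu, var in zip(x, mask, means, variances)
        if present
    )

def classify(x, mask, models):
    """Return the name of the model with the highest present-data likelihood."""
    return max(models, key=lambda name: missing_data_loglik(x, mask, *models[name]))

# Hypothetical models: (means, variances) over four frequency channels.
models = {
    "speech": ([1.0, 2.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]),
    "alarm": ([4.0, 3.0, 6.0, 6.0], [1.0, 1.0, 1.0, 1.0]),
}

# Channels 2 and 3 are assumed to be dominated by an interfering sound.
observation = [1.1, 2.2, 5.0, 5.0]
mask = [True, True, False, False]  # present-data mask: trust channels 0 and 1

best_masked = classify(observation, mask, models)      # uses reliable channels only
best_full = classify(observation, [True] * 4, models)  # treats every channel as clean

print(best_masked, best_full)
```

In this toy case the masked decision follows the two reliable channels, while the unmasked decision is pulled toward the other model by the corrupted channels. As the slide notes, the hard part in practice is finding the mask.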

Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
  - Spoken documents
  - The Listening Machine
  - Music preference modeling
5 Music Information Retrieval

4 Accessing large datasets: The Meeting Recorder Project (with ICSI, UW, IDIAP, SRI, Sheffield)
• Microphones in conventional meetings
  - for summarization / retrieval / behavior analysis
  - informal, overlapped speech
• Data collection (ICSI, UW, IDIAP, NIST):
  - ~100 hours collected & transcribed
• NSF ‘Mapping Meetings’ project

Meeting IR tool
• IR on (ASR) transcripts from meetings
  - ASR errors have limited impact on retrieval

Speaker Turn detection (Huan Wei Hee, Jerry Liu)
• Acoustic: Triangulate tabletop mic timing differences
  - use normalized peak value for confidence
[Figure: mr-2000-11-02-1440: PZM xcorr lags (lag 3-4 / ms vs. lag 1-2 / ms) and an example cross coupling response, chan3 to chan0]
• Behavioral: Look for patterns of speaker turns
[Figure: mr04: hand-marked speaker turns vs. time + auto/manual boundaries, participant vs. time/min]

Speech/nonspeech detection (Williams & Ellis 1999)
• ASR run over entire soundtracks?
  - for nonspeech, result is nonsense
• Watch behavior of speech acoustic model:
  - average per-frame entropy
  - ‘dynamism’: mean-squared 1st-order difference
Dynamism vs.
Entropy for 2.5 second segments of speech/music
[Figure: spectrogram (frq/Hz vs. time/s) and posteriors for speech, music, and speech+music, with a scatter plot of entropy against dynamism for the three classes]
• 1.3% error on 2.5 second speech-music testset

The Listening Machine
• Smart PDA records everything
• Only useful if we have index, summaries
  - monitor for particular sounds
  - real-time description
• Scenarios
  - personal listener → summary of your day
  - future prosthetic hearing device
  - autonomous robots
• Meeting data, ambulatory audio

Personal Audio
• LifeLog / MyLifeBits / Remembrance Agent: Easy to record everything you hear
• Then what?
  - prohibitively time consuming to search
  - but .. applications if access easier
• Automatic content analysis / indexing...

Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
5 Music Information Retrieval
  - Anchor space
  - Playola browser

5 Music Information Retrieval
• Transfer search concepts to music?
  - “musical Google”
  - finding something specific / vague / browsing
  - is anything more useful than human annotation?
• Most interesting area: finding new music
  - is there anything on mp3.com that I would like?
  - audio is only information source for new bands
• Basic idea: Project music into a space where neighbors are “similar”
• Also need models of personal preference
  - where in the space is the stuff I like
  - relative sensitivity to different dimensions
• Evaluation problems
  - requires large, shareable music corpus!

Artist Classification (Berenzweig et al.
2001)
• Artists’ oeuvres as similarity-sets
• Train MLP to classify frames among 21 artists
• Using (detected) voice segments: Song-level accuracy improves 56.7% → 64.9%
[Figure: per-frame artist classifier outputs vs. time/sec, with true voice segments marked, for Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee) and Track 4 - Arto Lindsay (dynvox=Arto, unseg=Oval). Artists: Michael Penn, The Roots, The Moles, Eric Matthews, Arto Lindsay, Oval, Jason Falkner, Built to Spill, Beck, XTC, Wilco, Aimee Mann, The Flaming Lips, Mouse on Mars, Dj Shadow, Richard Davies, Cornelius, Mercury Rev, Belle & Sebastian, Sugarplastic, Boards of Canada]

Artist Similarity
• Recognizing work from each artist is all very well...
• But: what is similarity between artists?
  - pattern recognition systems give a number...
[Figure: 2-D map of artists placed by similarity, including roxette, toni_braxton, ron_carter, erasure, lara_fabian, jessica_simpson, mariah_carey, janet_jackson, eiffel_65, celine_dion, pet_shop_boys, christina_aguilera, aqua, lauryn_hill, sade, all_saints, backstreet_boys, madonna, spice_girls, belinda_carlisle, nelly_furtado, annie_lennox]
• Need subjective ground truth: Collected via web site www.musicseer.com
• Results:
  - 1,000 users, 22,300 judgments collected over 6 months

Music similarity from Anchor space
• A classifier trained for one artist (or genre) will respond partially to a similar artist
• Each artist evokes a particular pattern of responses over a set of classifiers
• We can treat these classifier outputs as a new feature space in which to estimate similarity
[Figure: audio input (class i) and audio input (class j) each pass through a bank of anchor classifiers, giving an n-dimensional vector [p(a1|x), p(a2|x), ..., p(an|x)] in "Anchor Space" (conversion to anchor space), followed by GMM modeling and a similarity computation (KL-d, EMD, etc.)]
• “Anchor space” reflects subjective qualities?

Anchor space visualization
• Comparing 2D projections of per-frame feature points in cepstral and anchor spaces:
[Figure: cepstral features (third vs. fifth cepstral coef) and anchor space features (Electronica vs. Country) for madonna and bowie]
  - each artist represented by 5GMM
  - greater separation under MFCCs!
  - but: relevant information?

Playola interface ( www.playola.org )
• Browser finds closest matches to single tracks or entire artists in anchor space
• Direct manipulation of anchor space axes

Evaluation
• Are recommendations good or bad?
• Subjective evaluation is the ground truth
  - ..
but subjects aren’t familiar with the bands being recommended
  - can take a long time to decide if a recommendation is good
• Measure match to other similarity judgments
  - e.g. musicseer data:
[Figure: top rank agreement (%) of the measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK against SrvKnw 4789x3.58, SrvAll 6178x8.93, GamKnw 7410x3.96, GamAll 7421x8.92]

Summary
• Sound
  - .. contains much, valuable information at many levels
  - intelligent systems need to use this information
• Mixtures
  - .. are an unavoidable complication when using sound
  - looking in the right time-frequency place to find points of dominance
• Learning
  - need to acquire constraints from the environment
  - recognition/classification as the real task

LabROSA Summary
DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
APPLICATIONS
• Structuring
• Search
• Summarization
• Awareness
• Understanding

Extra Slides

Independent Component Analysis (ICA) (Bell & Sejnowski 1995 et seq.)
• Drive a parameterized separation algorithm to maximize independence of outputs
[Figure: mixtures m1, m2 and signals s1, s2 related through weights a11, a12, a21, a22, with the weights adapted along −δ MutInfo / δa]
• Advantages:
  - mathematically rigorous, minimal assumptions
  - does not rely on prior information from models
• Disadvantages:
  - may converge to local optima...
  - separation, not recognition
  - does not exploit prior information from models
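The idea of driving a parameterized unmixing toward independent outputs can be illustrated on a two-channel toy problem. The slide describes adapting the weights along the gradient of mutual information. The sketch below swaps in a simpler stand-in, which is not the Bell & Sejnowski algorithm: it whitens the mixtures, then grid-searches a single rotation angle to maximize squared excess kurtosis as a rough proxy for independence. The sources, mixing weights, and function names are all made up for illustration.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def kurtosis(v):
    """Excess kurtosis, which is zero for a Gaussian."""
    m = mean(v)
    var = mean([(a - m) ** 2 for a in v])
    return mean([(a - m) ** 4 for a in v]) / var ** 2 - 3.0

def corr(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

def whiten(x1, x2):
    """Centre two signals and transform them to identity covariance."""
    m1, m2 = mean(x1), mean(x2)
    x1 = [a - m1 for a in x1]
    x2 = [b - m2 for b in x2]
    c11 = mean([a * a for a in x1])
    c22 = mean([b * b for b in x2])
    c12 = mean([a * b for a, b in zip(x1, x2)])
    # Rotation angle that diagonalizes the symmetric 2x2 covariance matrix.
    phi = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    c, s = math.cos(phi), math.sin(phi)
    p1 = [c * a + s * b for a, b in zip(x1, x2)]
    p2 = [-s * a + c * b for a, b in zip(x1, x2)]
    sd1 = math.sqrt(mean([a * a for a in p1]))
    sd2 = math.sqrt(mean([b * b for b in p2]))
    return [a / sd1 for a in p1], [b / sd2 for b in p2]

# Two hypothetical independent sources: uniform noise and a sine tone.
random.seed(0)
n = 2000
s1 = [random.uniform(-1.0, 1.0) for _ in range(n)]
s2 = [math.sin(0.05 * t) for t in range(n)]

# Hypothetical mixing into two microphone signals m1, m2.
m1 = [1.0 * a + 0.6 * b for a, b in zip(s1, s2)]
m2 = [0.4 * a + 1.0 * b for a, b in zip(s1, s2)]

z1, z2 = whiten(m1, m2)

def rotate(theta):
    """Apply a rotation to the whitened signals."""
    c, s = math.cos(theta), math.sin(theta)
    return ([c * a + s * b for a, b in zip(z1, z2)],
            [-s * a + c * b for a, b in zip(z1, z2)])

def contrast(theta):
    """Sum of squared excess kurtosis of the two outputs (independence proxy)."""
    y1, y2 = rotate(theta)
    return kurtosis(y1) ** 2 + kurtosis(y2) ** 2

# After whitening, only a rotation in [0, pi/2) remains to be found.
angles = [k * (math.pi / 2) / 180 for k in range(180)]
best = max(angles, key=contrast)
y1, y2 = rotate(best)

# Each output should line up with one source, up to sign and ordering.
match1 = max(abs(corr(y1, s1)), abs(corr(y1, s2)))
match2 = max(abs(corr(y2, s1)), abs(corr(y2, s2)))
print(round(match1, 3), round(match2, 3))
```

The exhaustive angle search sidesteps the local-optimum issue listed among the disadvantages, but only because this toy has a single free parameter. As the slide also notes, the result is separation, not recognition: nothing here says which output is which source.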