Extracting Information from Sound
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Machine Listening
2. Global Classification
3. Foreground & Transients
4. Outstanding Issues

Information from Sound - Dan Ellis 2011-03-01 1/19

1. Machine Listening
• Extracting useful information from sound
  ... like animals do
[Figure: map of machine listening work by Task (Detect, Classify, Describe) and Domain (Environmental Sound, Speech, Music). Entries: VAD, Speech/Music, ASR, Music Transcription, Environment Awareness, Emotion, Music Recommendation, Automatic Narration, “Sound Intelligence”]

Listening to Mixtures
• The world is cluttered & sound is transparent
  mixtures are inevitable
• Useful information is structured by ‘sources’
  specific definition of a ‘source’: intentional independence

Applications
• Audio Lifelog Diarization
[Figure: lifelog timeline for 2004-09-13 and 2004-09-14, 09:00 to 18:00, segmented into labeled episodes. Labels include preschool, cafe, lecture, office, outdoor, lab, group, meeting2, DSP03, compmtg, postlec, and names of people]
• Consumer Video Classification

Consumer Video Dataset
• 25 “concepts” from Kodak user study
  boat, crowd, cheer, dance, ...
[Figure panel title: 1G+KL2 (10/15)]
• Grab top 200 videos from YouTube search
  then filter for quality, unedited = 1873 videos
  manually relabel with concepts
• Concept overlap:
[Figure: overlap between labeled concepts, scale 0 to 1. Concepts: museum, picnic, wedding, animal, birthday, sunset, ski, graduation, sports, boat, parade, playground, baby, park, beach, dancing, show, group of two, night, one person, singing, cheer, crowd, music, group of 3+. Other panel titles: 8GMM+Bha (9/15), pLSA500+lognorm (12/15)]

2. Global Classification
• Baseline for soundtrack classification
  divide sound into short frames (e.g. 30 ms)
  calculate features (e.g. MFCC) for each frame
  describe clip by statistics of frames (mean, covariance) = “bag of features”
• Classify by e.g. Mahalanobis distance + SVM
[Figure: video soundtrack VTS_04_0001. Panels: spectrogram (freq / kHz vs. time / sec, level / dB), MFCC features (MFCC bin vs. time / sec), MFCC covariance matrix (MFCC dimension vs. MFCC dimension)]

Codebook Histograms
• Convert nonplanar distributions to multinomial
• Classify by distance on histograms
  KL, Chi-squared + SVM
[Figure: global Gaussian mixture model over MFCC features (MFCC(0) vs. MFCC(1)) and per-category mixture component histogram (count vs. GMM mixture, 1 to 15)]

Latent Semantic Analysis (LSA)
• Probabilistic LSA (pLSA) models each histogram as a mixture of several ‘topics’
  ... each clip may have several things going on
• Topic sets optimized through EM
  p(ftr | clip) = ∑_{topics} p(ftr | topic) p(topic | clip)
  use (normalized?) p(topic | clip) as per-clip features
[Figure: matrix of GMM histogram features per AV clip, p(ftr | clip), shown as the product p(ftr | topic) * p(topic | clip)]

Global Classification Results (Lee & Ellis ’10)
• Wide range in performance
  audio (music, ski) vs. non-audio (group, night)
  large AP uncertainty on infrequent classes
[Figure: average precision (0 to 1) per concept class, for the 25 concepts plus MEAN. Legend: Guessing; MFCC + GMM Classifier; Single-Gaussian + KL2; 8-GMM + Bha; 1024-GMM Histogram + 500-log(P(z|c)/Pz)]

3. Foreground & Transients
• Global vs. local class models
  tell-tale acoustics may be ‘washed out’ in statistics
  try iterative realignment of HMMs:
  “background” model shared by all clips
[Figure: spectrograms of YT baby 002 (freq / kHz vs. time / s). Old Way: All frames contribute (labels: voice, baby, laugh). New Way: Limited temporal extents (segments labeled voice, baby, laugh, bg)]

Landmark-based Fingerprints (Shazam ’03)
• Sound characterized by time-frequency peaks
  robust to channel, background
  relies on precise timing
[Figure: landmarks on spectrograms (freq / Hz, 0 to 4000, vs. time / sec) for the query audio and for the match: 05-Full Circle at 0.032 sec]

Soundtrack Fingerprint Matching (Cotton & Ellis ’10)
• Landmark pairs are a noise-robust fingerprint
• Use to match distinct videos with same sound ambience
[Figure: spectrograms (freq / kHz vs. time / sec) of Video IMpLQaiHWbE at 195 s and Video Yi1hkNkqHBc at 218 s]

Event Landmark Signatures (Cotton & Ellis ’09)
• Build index of Gabor neighbor pairs
  recognize repeated events with similar pairs

Transient Features (Cotton, Ellis, Loui ’11)
• Onset detector finds energy bursts
  best SNR
• PCA basis to represent each
  300 ms x auditory frq
• “bag of transients”

4. Outstanding Issues
• How to define “transients”?
• How to separate foreground & background?
• How to exploit prior knowledge of sounds?
• How to make classification discriminative?
• Large-scale soundtrack classification

Nonnegative Matrix Factorization (Smaragdis & Brown ’03; Abdallah & Plumbley ’04; Virtanen ’07)
• Decompose spectrograms into templates + activation
  X = W·H
  fast & forgiving gradient descent algorithm
  2D patches
  sparsity control...
[Figure: spectrogram of the original mixture (freq / Hz vs. time / s, 0 to 10) with Basis 1 (L2), Basis 2 (L1), Basis 3 (L1)]

Sound Textures (McDermott & Simoncelli ’09; Ellis, Zhang, McDermott ’11)
• Characterize sounds by perceptually-sufficient statistics
[Figure: block diagram with stages: Sound; Automatic gain control; mel filterbank (18 chans); Histogram giving mean, var, skew, kurt (18 x 4); FFT into octave bins 0.5, 1, 2, 4, 8, 16 Hz; Envelope correlation giving cross-band correlations (318 samples); Mahalanobis distance ...]
[Figure: spectrograms (freq / Hz vs. time / s, 0 to 10) and texture features for two examples: 1159_10 urban cheer clap; 1062_60 quiet dubbed speech music]
• Subband distributions & env x-corrs
  .. verified by matched resynthesis
[Figure: texture features by mel band: moments (M, V, S, K) and modulation energy (18 x 6) vs. mod frq / Hz]

Real-World Dictionary
• BBC Sound Effects as reference library
  1000+ examples ... comprehensive?
  similarity via normalized textures (over 10s chunks)
[Figure: BBC Sound Effects collections arranged by similarity. Labels include Footsteps 1, Footsteps 2, Trains, Farm Machinery, Livestock 1, Livestock 2, Hospitals, Babies, Aircraft, Weather 1, Electronically Generated Sounds, Rural Soundscapes, European Soundscapes, Household, Transportation, Interior Atmosphere, Exterior Atmosphere]

Summary
• Machine Listening: Getting useful information from sound
• Environmental sound classification
  ... from whole-clip statistics?
• Transients & energy peaks
  ... separate foreground & background
• Useful classification of unconstrained audio
  ... to combine with video analysis
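
The whole-clip statistics mentioned in the summary (each clip described by the mean and covariance of its frame features) could be sketched roughly as below. This is only an illustration under stated assumptions: the frame features are random stand-ins rather than real MFCCs, the function names (clip_gaussian, kl_gauss, symmetric_kl) are hypothetical, and comparing clips by a symmetrized KL divergence between single Gaussians is one possible reading of the "Single-Gaussian + KL2" label, not a confirmed description of the system in the slides.

```python
import numpy as np


def clip_gaussian(frames, reg=1e-6):
    """Summarize a (n_frames, n_dims) feature matrix by mean and covariance.

    The small diagonal term `reg` is an assumption added here to keep the
    covariance invertible; it is not taken from the slides.
    """
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False) + reg * np.eye(frames.shape[1])
    return mean, cov


def kl_gauss(m0, c0, m1, c1):
    """KL divergence KL(N(m0, c0) || N(m1, c1)) for full-covariance Gaussians."""
    d = m0.shape[0]
    c1_inv = np.linalg.inv(c1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(c0)
    _, logdet1 = np.linalg.slogdet(c1)
    return 0.5 * (np.trace(c1_inv @ c0) + diff @ c1_inv @ diff
                  - d + logdet1 - logdet0)


def symmetric_kl(a, b):
    """Symmetrized KL between two (mean, cov) clip models."""
    return kl_gauss(*a, *b) + kl_gauss(*b, *a)


# Stand-in "clips": 300 frames of 20-dimensional features each.
# Real use would substitute per-frame MFCCs computed from audio.
rng = np.random.default_rng(0)
clip_a = rng.normal(0.0, 1.0, size=(300, 20))
clip_b = rng.normal(0.0, 1.0, size=(300, 20))   # same distribution as clip_a
clip_c = rng.normal(2.0, 1.5, size=(300, 20))   # shifted and wider

ga, gb, gc = (clip_gaussian(c) for c in (clip_a, clip_b, clip_c))
d_ab = symmetric_kl(ga, gb)
d_ac = symmetric_kl(ga, gc)
print("distance a-b:", d_ab)
print("distance a-c:", d_ac)
```

In this toy setup the two clips drawn from the same distribution come out much closer than the shifted one. A distance like this could then feed a kernel for an SVM, as the slides suggest, though that step is not shown here.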