Environmental Sound Recognition and Classification
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu · http://labrosa.ee.columbia.edu/

Outline
1. Machine Listening
2. Background Classification
3. Foreground Event Recognition
4. Speech Separation
5. Open Issues

1. Machine Listening

• Extracting useful information from sound ... like animals do
• Tasks span detect / classify / describe, across speech, music, and environmental sound:
  - Detect: voice activity detection (VAD), speech/music discrimination
  - Classify: automatic speech recognition (ASR), music recommendation, environment awareness
  - Describe: music transcription, automatic narration, emotion ... "sound intelligence"

Environmental Sound Applications

• Audio lifelog diarization
  [Figure: diarization of two days of personal audio (2004-09-13/14, 09:00-18:00) into episodes such as preschool, cafe, office, lecture, outdoor, lab, and meetings with named colleagues]
• Consumer video classification & search

Prior Work

• Environment classification
  - speech / music / silent / machine (Zhang & Kuo '01)
• Nonspeech sound recognition
  - meeting-room acoustic event classification (Temko & Nadeu '06)
  - sports events: cheers, bat/ball sounds, ...

Consumer Video Dataset

• 25 "concepts" from a Kodak user study
  - animal, baby, beach, birthday, boat, cheer, crowd, dancing, graduation, group of two, group of 3+, museum, music, night, one person, parade, park, picnic, playground, show, singing, ski, sports, sunset, wedding
  [Figure: frequencies of labeled and overlapped concepts across the dataset]
• Grab top 200 videos from YouTube search for each concept
  - filter for quality, unedited → 1873 videos
  - manually relabel with concepts

Obtaining Labeled Data

• Amazon Mechanical Turk (Y-G Jiang et al. 2011)
  - 10 s clips
  - 9,641 videos labeled in 4 weeks

2. Background Classification

• Baseline for soundtrack classification:
  - divide sound into short frames (e.g. 30 ms)
  - calculate features (e.g. MFCC) for each frame
  - describe clip by statistics of frames (mean, covariance) = "bag of features"
  [Figure: video soundtrack spectrogram → per-frame MFCC features → clip-level MFCC covariance matrix]
• Classify by e.g. KL distance + SVM (a minimal sketch follows)
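This baseline is simple enough to sketch in code. The following is a minimal illustration, not the exact system from the talk: it assumes the librosa package for MFCC extraction, and the file paths, sample rate, and kernel mapping are illustrative choices. Each clip is summarized as a Gaussian (mean + covariance) over its frame MFCCs, and clips are compared with the symmetrized KL divergence ("KL2").

```python
import numpy as np
import librosa  # assumed here for MFCC extraction

def clip_model(path, sr=16000, n_mfcc=20):
    """Bag-of-features clip model: mean and covariance of per-frame MFCCs."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # frames x dims
    return mfcc.mean(axis=0), np.cov(mfcc, rowvar=False)

def kl_gauss(m0, S0, m1, S1):
    """KL divergence KL(N0 || N1) between two full-covariance Gaussians."""
    d = len(m0)
    S1inv = np.linalg.inv(S1)
    dm = m1 - m0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1inv @ S0) + dm @ S1inv @ dm - d + logdet1 - logdet0)

def kl2(a, b):
    """Symmetrized KL ("KL2") between two (mean, cov) clip models."""
    return kl_gauss(*a, *b) + kl_gauss(*b, *a)

# An SVM can then be trained on a precomputed kernel such as
# K[i, j] = exp(-kl2(clips[i], clips[j]) / gamma).
```

Exponentiating the negative divergence is a common heuristic for turning KL2 into an SVM kernel; it is not guaranteed positive definite, so the kernel matrix may need regularization in practice.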
Codebook Histograms

• Convert high-dimensional distributions to multinomials:
  - fit a global Gaussian mixture model (GMM) over all frames' MFCCs
  - describe each clip by a histogram of counts over the mixture components
  [Figure: MFCC features quantized against a 16-component global GMM → per-category mixture-component histogram]
• Classify by distance between histograms
  - KL, Chi-squared + SVM

Latent Semantic Analysis (LSA)

• Probabilistic LSA (pLSA) models each histogram as a mixture of several "topics": each clip may have several things going on
• Topic sets optimized through EM:
  p(ftr | clip) = Σ_topics p(ftr | topic) p(topic | clip)
  [Figure: GMM-histogram features factored as p(ftr | topic) × p(topic | clip) over AV clips]
• Use (normalized?) p(topic | clip) as per-clip features

Background Classification Results (K Lee & Ellis '10)

• [Figure: per-concept Average Precision for guessing, Single-Gaussian + KL2, 8-GMM + Bha, and 1024-GMM histogram + 500-topic log(P(z|c)/P(z)) classifiers]
• Wide range in performance
  - audio concepts (music, ski) vs. non-audio concepts (group, night)
  - large AP uncertainty on infrequent classes

Sound Texture Features (McDermott & Simoncelli '09; Ellis, Zheng, McDermott '11)

• Characterize sounds by perceptually-sufficient statistics (sketched in code below):
  - automatic gain control → mel filterbank (18 channels) → subband envelopes
  - envelope moments per band: mean, var, skew, kurtosis (18 × 4)
  - cross-band envelope correlations
  - modulation energy in octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 × 6)
  - compare clips by Mahalanobis distance
• Subband distributions & envelope cross-correlations
  - verified by matched resynthesis
  [Figure: texture features for two example clips, "urban cheer clap" vs. "quiet dubbed speech music"]

Sound Texture Features: Results

• Test on MED 2010 development data
  - 10 specially-collected manual labels: clap, cheer, music, speech, dubbed, other, noisy, quiet, urban, rural
  - feature subsets compared: moments (MVSK), modulation energies (ms), cross-band correlations (cbc), and their combinations
  [Figure: normalized means and standard deviations of texture features per label]
• Contrasts in feature sets reflect correlation of labels...
• Perform ~ same as MFCCs; combine well with them
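Not the authors' exact implementation, but a minimal sketch of the per-band moments and cross-band correlations, assuming precomputed mel subband envelopes; the AGC stage, modulation energies, and the Mahalanobis comparison are omitted here.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def texture_stats(env):
    """Texture statistics from subband envelopes (bands x time):
    per-band moments plus cross-band envelope correlations."""
    m = env.mean(axis=1)
    v = env.var(axis=1)
    s = skew(env, axis=1)
    k = kurtosis(env, axis=1)
    # Cross-band correlations: upper triangle of the band x band matrix
    corr = np.corrcoef(env)
    iu = np.triu_indices_from(corr, k=1)
    return np.concatenate([m, v, s, k, corr[iu]])

# Example: 18 mel-band envelopes, 10 s at a 100 Hz envelope frame rate
env = np.abs(np.random.randn(18, 1000))  # stand-in for real envelopes
feats = texture_stats(env)               # 18*4 + 18*17/2 = 225 values
```

With 18 bands this yields the 18 × 4 moment block from the slide plus 153 correlation values per clip.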
3. Foreground Event Recognition

Global vs. Local Class Models (K Lee, Ellis, Loui '10)

• Tell-tale acoustics may be "washed out" in whole-clip statistics
• Try iterative realignment of HMMs:
  - old way: all frames contribute to the class model
  - new way: limited temporal extents, with a "background" model shared by all clips
  [Figure: YT baby 002 spectrogram realigned from whole-clip "baby"/"voice" labels into voice / baby / laugh / background segments]

Foreground Event HMMs (Lee & Ellis '10)

• Training labels only at clip level
  - 25 concepts + 2 global background (BG) models
• Refine models by EM realignment
• Use for classifying the entire video... or seeking to the relevant part
  [Figure: YT_beach_001 spectrogram; manual clip-level labels "beach", "group of 3+", "cheer"; manual frame-level annotation (wind, cheer, voice, water wave, drum); Viterbi path through a 27-state Markov model]

Transient Features (Cotton, Ellis, Loui '11)

• Transients = foreground events?
• Onset detector finds energy bursts (best SNR)
• PCA basis to represent each 300 ms × auditory-frequency patch
• → "bag of transients"

Nonnegative Matrix Factorization (Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07)

• Decompose spectrograms into templates + activations: X = W · H
  - fast, forgiving gradient-descent algorithm (sketched below)
  - 2D patches
  - sparsity control...
  [Figure: original mixture spectrogram and three learned bases (L2, L1, L1)]

NMF Transient Features (Cotton & Ellis '11)

• Learn 20 patches from Meeting Room Acoustic Event data
  [Figure: the 20 learned spectro-temporal patches]
• Compare to an MFCC-HMM detector
  - NMF features are more noise-robust; combine well
  [Figure: Acoustic Event Error Rate (AEER %) vs. SNR (0-30 dB) for MFCC, NMF, and combined systems]
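The slides do not spell out the update rules, so here is a minimal sketch using the standard multiplicative updates for generalized-KL NMF (Lee & Seung style, the family the cited spectrogram factorizations build on); sparsity control and 2D patches are not included.

```python
import numpy as np

def nmf_kl(X, rank, n_iter=200, eps=1e-9):
    """Factor a nonnegative matrix X (freq x time) as W @ H using
    multiplicative updates for the generalized KL divergence."""
    F, T = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        V = W @ H + eps
        H *= (W.T @ (X / V)) / W.sum(axis=0, keepdims=True).T
        V = W @ H + eps
        W *= ((X / V) @ H.T) / H.sum(axis=1, keepdims=True).T
    return W, H

# e.g. W, H = nmf_kl(spectrogram_magnitude, rank=20) learns 20 templates,
# analogous to the 20 patches learned from the meeting-room data.
```

The "forgiving" character comes from the multiplicative form: updates preserve nonnegativity automatically and monotonically reduce the divergence.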
4. Speech Separation

• Speech recognition finds best-fit parameters: argmax_W P(W | X)
• Recognize mixtures with a Factorial HMM:
  - one model + state sequence for each voice/source
  - exploit sequence constraints, speaker differences
  [Figure: two HMMs (model 1, model 2) jointly explaining the observations over time]
• Separation relies on a detailed speaker model

Eigenvoices (Kuhn et al. '98, '00; Weiss & Ellis '07, '08, '09)

• Idea: span the speaker-model parameter space with a small set of bases
  - generalize without losing detail?
• Eigenvoice model:
  µ = µ̄ + U w + B h
  where µ is the adapted model, µ̄ the mean voice, U the eigenvoice (speaker-subspace) bases with weights w, and B the channel bases with channel weights h

Eigenvoice Bases

• Mean model: 280 states × 320 frequency bins = 89,600 dimensions
• Eigencomponents shift formants / coloration
• Additional components for the acoustic channel
  [Figure: mean voice and first three eigenvoice dimensions as state × frequency images over the phone set b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax]

Eigenvoice Speech Separation (Weiss & Ellis '10)

• Factorial HMM analysis with tuning of the source-model parameters
  = eigenvoice speaker adaptation
• On the Speech Separation task, speaker-adapted (SA) models perform midway between speaker-dependent (SD) and speaker-independent (SI) models
  [Figure: separation examples for Mix, SI, SA, SD]

Binaural Cues

• Model the interaural spectrum of each source as stationary level and time differences
• e.g. at 75°, in reverb:
  [Figure: IPD, IPD residual, and ILD as functions of frequency]

Model-Based EM Source Separation and Localization (MESSL) (Mandel & Ellis '09)

• EM loop: re-estimate source parameters ↔ assign spectrogram points to sources
• Can model more sources than sensors
• Flexible initialization

MESSL Results

• Modeling uncertainty improves results
  - tradeoff between constraints & noisiness
  [Figure: separated spectrograms with output SNRs ranging from -2.72 dB to 12.35 dB across model settings]

MESSL with Source Priors (Weiss, Mandel & Ellis '11)

• Fixed or adaptive speech models

MESSL-EigenVoice Results

• Source models function as priors
• Interaural parameters provide spatial separation
  - the source-model prior improves the spatial estimate
  [Figure: Ground truth (12.04 dB), DUET (3.84 dB), 2D FD BSS (5.41 dB), MESSL (5.66 dB), MESSL SP (10.01 dB), MESSL EV (10.37 dB)]

Noise-Robust Pitch Tracking (Wu & Wang 2003; BS Lee & Ellis '11)

• Important for voice detection & separation
• Based on channel selection:
  - pitch from summary autocorrelation over "good" bands (sketched below)
  - a trained classifier decides which channels to include
• Improves over the simple Wu criterion, especially at mid SNR
  [Figure: mean bit error rate (log scale, 1-100%) vs. SNR (-10 to 25 dB) for the Wu tracker and trained channel-selection (CS-GT) variants]
• Trained selection includes more off-harmonic channels
  [Figure: channels selected by Wu vs. CS-GT for speech in pink noise at 5 dB SNR, autocorrelation period vs. time]
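A simplified sketch of the summary-autocorrelation step. The real system runs on an auditory (e.g. gammatone) filterbank and uses a trained classifier to produce the channel mask; here both the subband signals and the mask are taken as given.

```python
import numpy as np

def summary_acf_pitch(chans, mask, sr, fmin=60.0, fmax=400.0):
    """Pitch by summary autocorrelation over selected channels.
    chans: (n_channels, n_samples) per-channel subband signals
    mask:  boolean per-channel "good band" decisions
    Returns a pitch estimate in Hz for the whole frame."""
    n = chans.shape[1]
    summary = np.zeros(n)
    for sig in chans[mask]:
        acf = np.correlate(sig, sig, mode="full")[n - 1:]  # lags 0..n-1
        summary += acf / (acf[0] + 1e-12)                  # per-channel norm
    lo, hi = int(sr / fmax), int(sr / fmin)                # plausible lags
    lag = lo + np.argmax(summary[lo:hi])
    return sr / lag
```

The point of the trained mask is exactly what the slide shows: it admits channels (including off-harmonic ones) whose autocorrelations still vote for the correct period, rather than relying on Wu's fixed selection criterion.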
5. Outstanding Issues

• Better object/event separation
  - parametric models
  - spatial information?
  - computational auditory scene analysis: frequency proximity, common onset, harmonicity (Barker et al. '05)
• Large-scale analysis
• Integration with video

Audio-Visual Atoms (Jiang et al. '09)

• Object-related features from both audio (transients) & video (patches)
  [Figure: audio-spectrum atoms and visual-region atoms tracked over time → joint audio-visual atom discovery]
• Multi-instance learning of A-V co-occurrences
  - codebook learning over M visual atoms × N audio atoms → M×N A-V atoms
• Learned pairings match semantics:
  - black suit + romantic music → Wedding
  - marching people + parade sound → Parade
  - sand + beach sounds → Beach
  [Figure: concept classification results for audio only, visual only, and visual + audio]

Summary

• Machine listening: getting useful information from sound
• Background sound classification ... from whole-clip statistics?
• Foreground event recognition ... by focusing on peak-energy patches
• Speech content is very important ... separate with pitch, models, ...

References

• Jon Barker, Martin Cooke, & Dan Ellis, "Decoding Speech in the Presence of Other Sources," Speech Communication 45(1): 5-25, 2005.
• Courtenay Cotton, Dan Ellis, & Alex Loui, "Soundtrack classification by transient events," IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, "Spectral vs. Spectro-Temporal Features for Acoustic Event Classification," submitted to IEEE WASPAA, 2011.
• Dan Ellis, Xiaohong Zheng, & Josh McDermott, "Classifying soundtracks with audio texture features," IEEE ICASSP, Prague, May 2011.
• Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis, & Alex Loui, "Short-Term Audio-Visual Atoms for Generic Video Concept Classification," ACM MultiMedia, 5-14, Beijing, Oct 2009.
• Keansub Lee & Dan Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug 2010.
• Keansub Lee, Dan Ellis, & Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, 2278-2281, Dallas, Apr 2010.
• Byung-Suk Lee & Dan Ellis, "Noise-robust pitch tracking by trained channel selection," submitted to IEEE WASPAA, 2011.
• Andriy Temko & Climent Nadeu, "Classification of acoustic events using SVM-based clustering schemes," Pattern Recognition 39(4): 682-694, 2006.
• Ron Weiss & Dan Ellis, "Speech separation using speaker-adapted Eigenvoice speech models," Computer Speech & Lang. 24(1): 16-29, 2010.
• Ron Weiss, Michael Mandel, & Dan Ellis, "Combining localization cues and source model constraints for binaural source separation," Speech Communication 53(5): 606-621, May 2011.
• Tong Zhang & C.-C. Jay Kuo, "Audio content analysis for on-line audiovisual data segmentation," IEEE TSAP 9(4): 441-457, May 2001.