Using the Soundtrack to Classify Videos
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Machine Listening
2. Global Classification
3. Foreground Classification
4. Outstanding Issues

Soundtrack Classification - Dan Ellis 2011-05-04

1. Machine Listening
• Extracting useful information from sound
[Figure: machine-listening applications arranged by Task (Detect, Classify, Describe) and Domain (Environmental Sound, Speech, Music). Applications shown: VAD, Speech/Music, ASR, Music Transcription, Environment Awareness, “Sound Intelligence”, Automatic Narration, Emotion, Music Recommendation.]

The Information in Audio
• Environmental recordings contain info on:
   location – type (restaurant, street, ...) and specifics
   activity – talking, walking, typing, ...
   people – generic (2 males), specific (Chuck & John)
   spoken content (sometimes)
• but not:
   what people and things “looked like”
   day/night ...
   ... except when correlated with audible features

Applications
• Audio Lifelog Diarization
[Figure: lifelog timelines for 2004-09-13 and 2004-09-14, 09:00 to 18:00, segmented into labeled episodes such as preschool, cafe, lecture, office, outdoor, lab, and meeting, some tagged with the names of people present.]
• Consumer Video Classification

Consumer Video Dataset
[Figure label: 1G+KL2 (10/15)]
• 25 “concepts” from Kodak user study
   boat, crowd, cheer, dance, ...
• Grab top 200 videos from YouTube search
   filter for raw, unedited = 1873 videos
   manually relabel with concepts
• Concept overlap:
[Figure: concept overlap matrix, Labeled Concepts vs. Overlapped Concepts, color scale 0 to 1. Concepts: museum, picnic, wedding, animal, birthday, sunset, ski, graduation, sports, boat, parade, playground, baby, park, beach, dancing, show, group of two, night, one person, singing, cheer, crowd, music, group of 3+. Other figure labels: 8GMM+Bha (9/15), pLSA500+lognorm (12/15).]

2. Global Classification
• Baseline for soundtrack classification
   divide sound into short frames (e.g. 30 ms)
   calculate features (e.g. MFCC) for each frame
   describe clip by statistics of frames (mean, covariance) = “bag of features”
[Figure: video soundtrack VTS_04_0001. Panels: Spectrogram (freq / kHz vs. time / sec, level / dB); MFCC features (MFCC bin vs. time / sec); MFCC Covariance Matrix (MFCC dimension vs. MFCC dimension).]
• Classify by, e.g.,
Mahalanobis distance + SVM

Codebook Histograms
• Convert nonplanar distributions to multinomial
[Figure: Global Gaussian Mixture Model over MFCC features (MFCC(0) vs. MFCC(1), components numbered 1 to 15) and Per-Category Mixture Component Histogram (count vs. GMM mixture).]
• Classify by distance on histograms
   KL, Chi-squared + SVM

Global Classification Results (Lee & Ellis ’10)
[Figure: Average Precision per concept class, plus MEAN. Legend: Guessing; MFCC + GMM Classifier; Single-Gaussian + KL2; 8-GMM + Bha; 1024-GMM Histogram + 500-log(P(z|c)/P(z)).]
• Wide range in performance
   audio (music, ski) vs. non-audio (group, night)
   large AP uncertainty on infrequent classes

Combining with Video
• Video classification by SIFT codebooks (Chang et al. ’07)
[Figure: Average Precision (AP) per concept. Legend: Random Baseline; Video Only; Audio Only; A+V Fusion.]
• Audio adds most for dancing, baby, museum ...

3. Foreground Classification
• Global vs.
local class models (Lee, Ellis & Loui ’10)
   tell-tale acoustics may be ‘washed out’ in statistics
   try iterative realignment of HMMs:
[Figure: spectrograms of YT baby 002 (freq / kHz vs. time / s) with segment labels voice, baby, laugh, bg. Old Way: All frames contribute. New Way: Limited temporal extents.]
   “background” model shared by all clips

Transient Features (Cotton, Ellis, Loui ’11)
• Onset detector finds energy bursts
   best SNR
• PCA basis to represent each 300 ms x auditory frequency patch
• “bag of transients”
• Object-related...

Nonnegative Matrix Factorization
• Decompose spectrograms into templates + activation
   X = W·H
   fast forgiving gradient descent algorithm
   2D patches
   sparsity control...
   Smaragdis Brown ’03, Abdallah Plumbley ’04, Virtanen ’07
[Figure: Original mixture spectrogram (freq / Hz vs. time / s) with panels Basis 1 (L2), Basis 2 (L1), Basis 3 (L1).]

4. Outstanding Issues
• How to separate foreground & background?
• How to exploit prior knowledge of sounds?
• How to make classification credible?

Classifier Insight
• How can we understand classification results? ...
to inspire confidence & enrich results
• Temporal breakdown
[Figure: Spectrogram of YT_beach_001.mpeg (freq / Hz); Manual Clip-level Labels: “beach”, “group of 3+”, “cheer”; Manual Frame-level Annotation (drum, wind, cheer, voice, water wave); Viterbi Path by 27-state Markov Model (25 concepts + 2 global BGs).]
• Cluster examples
[Figure: cluster 123 (baby).]

Summary
• Machine Listening: Getting useful information from sound
• Environmental sound classification ... from whole-clip statistics?
• Transient energy peaks ... separate foreground from background
• Classifier insight ... for confidence & analysis

References
• [Chang et al. 2007] S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, & J. Luo, “Large-scale multimodal semantic concept detection for consumer video,” Proc. ACM MIR, 255-264, 2007.
• [Lee & Ellis 2010] K. Lee & D. Ellis, “Audio-Based Semantic Concept Classification for Consumer Video,” IEEE Tr. Audio, Speech, Lang. Process. 18 (6): 1406-1416, Aug. 2010.
• [Lee, Ellis & Loui 2010] K. Lee, D. Ellis, & A. Loui, “Detecting Local Semantic Concepts in Environmental Sounds using Markov Model based Clustering,” Proc. ICASSP, 2278-2281, Dallas, 2010.
• [Cotton, Ellis & Loui 2011] C. Cotton, D. Ellis, & A. Loui, “Soundtrack classification by transient events,” Proc. ICASSP, to appear, Prague, 2011.
• [Ellis, Zhang & McDermott 2011] D. Ellis, X. Zheng, & J. McDermott, “Classifying soundtracks with audio texture features,” Proc. ICASSP, to appear, Prague, 2011.
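
The global baseline from Section 2 (per-frame features, then a clip-level mean and covariance, then a distance between clips) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code behind the reported results: random vectors stand in for real MFCC frames, the function names are hypothetical, and the distance is a symmetrised KL divergence between single Gaussians, which is one plausible reading of the “Single-Gaussian + KL2” label on the results slide.

```python
import numpy as np

def clip_statistics(frames):
    """Summarise a clip by the mean and covariance of its per-frame features.

    frames: array of shape (n_frames, n_dims), e.g. one MFCC vector per frame.
    """
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)
    # Small diagonal load (an assumption) keeps the covariance invertible.
    cov = cov + 1e-6 * np.eye(frames.shape[1])
    return mean, cov

def gaussian_kl(mean_p, cov_p, mean_q, cov_q):
    """KL(p || q) between two full-covariance Gaussians."""
    d = mean_p.shape[0]
    inv_q = np.linalg.inv(cov_q)
    diff = mean_q - mean_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff
                  - d + logdet_q - logdet_p)

def kl2_distance(stats_a, stats_b):
    """Symmetrised KL: KL(a || b) + KL(b || a)."""
    return gaussian_kl(*stats_a, *stats_b) + gaussian_kl(*stats_b, *stats_a)

# Synthetic stand-ins for MFCC frames (assumption: 300 frames, 5 dimensions).
rng = np.random.default_rng(0)
clip_a = rng.normal(0.0, 1.0, size=(300, 5))
clip_b = rng.normal(0.0, 1.0, size=(300, 5))  # same statistics as clip_a
clip_c = rng.normal(3.0, 2.0, size=(300, 5))  # different statistics

stats_a, stats_b, stats_c = map(clip_statistics, (clip_a, clip_b, clip_c))
d_ab = kl2_distance(stats_a, stats_b)
d_ac = kl2_distance(stats_a, stats_c)
print(f"similar clips: {d_ab:.3f}   dissimilar clips: {d_ac:.3f}")
```

To feed an SVM, pairwise distances like these would typically be turned into a kernel matrix, for example by exponentiating the negated, scaled distance; that step is an assumption here and is left out of the sketch.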
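
The decomposition X = W·H from the Nonnegative Matrix Factorization slide can be illustrated with the standard multiplicative updates for a Euclidean cost. This is a sketch only: the slides do not say which cost function or update rule was used, the `nmf` function and its parameters are hypothetical, and a random low-rank matrix stands in for a magnitude spectrogram. The 2D patches and sparsity control mentioned on the slide are not covered.

```python
import numpy as np

def nmf(X, n_basis, n_iter=500, seed=0, eps=1e-9):
    """Factor a nonnegative matrix X (freq x time) as W @ H.

    W holds spectral templates (freq x n_basis);
    H holds their activations over time (n_basis x time).
    Uses multiplicative updates for the squared-error cost.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = X.shape
    W = rng.random((n_freq, n_basis)) + eps
    H = rng.random((n_basis, n_time)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(X, W, H):
    """Frobenius-norm reconstruction error relative to the norm of X."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# Toy "spectrogram": 40 frequency bins x 100 frames built from 3 templates.
rng = np.random.default_rng(1)
X = rng.random((40, 3)) @ rng.random((3, 100))

W, H = nmf(X, n_basis=3)
W1, H1 = nmf(X, n_basis=3, n_iter=1)  # same start, a single update
err_final = relative_error(X, W, H)
err_first = relative_error(X, W1, H1)
print(f"relative error after 1 update: {err_first:.4f}, after 500: {err_final:.4f}")
```

Each column of W can be read as a spectral template and each row of H as its activation over time, which matches the “templates + activation” wording on the slide.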