Subband Autocorrelation Features for Video Soundtrack Classification

Courtenay Cotton & Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{cvcotton,dpwe}

1. Soundtrack Classification
2. Auditory Model Features
3. Results & Future Work

Subband Autocorrelation Features - Cotton & Ellis 2013-05-29


1. Soundtrack Classification

• Goal: Describe soundtracks with a vocabulary of user-relevant acoustic events/sources

[Figure: spectrogram of YT_beach_001.mpeg (freq / Hz, 0 to 4k, vs. time/sec, 0 to 10) with layers sp1 to sp4 and Bgd; segments labeled cheer, voice, water wave]

• Challenges:
    Defining acoustic event vocabulary
    Overlapping sounds
    Ground-truth training data
    Classifier accuracy


Environmental Sound Applications

• Audio Lifelog Diarization

[Figure: lifelog timelines for 2004-09-13 and 2004-09-14 (09:00 to 18:00), segmented into labeled episodes such as preschool, cafe, lecture, office, outdoor, lab, meeting]

• Consumer Video Classification & Search
• Live hearing prosthesis app
• Robot environment sensitivity


Consumer Video Dataset (Y-G. Jiang et al. 2011)

• Columbia Consumer Video (CCV) set
    9,317 videos / 210 hours
    20 concepts based on consumer user study
    Labeled via Amazon Mechanical Turk


Soundtrack Classification (K. Lee & Ellis 2010)

• Baseline soundtrack classification system:
    divide sound into short frames (e.g. 30 ms)
    calculate MFCC features for each frame
    describe clip by statistics of frames (mean, covariance) = “bag of features”

[Figure: VTS_04_0001: video soundtrack → spectrogram (freq / kHz vs. time / sec, level / dB) → MFCC features (MFCC bin vs. time / sec) → MFCC covariance matrix (MFCC dimension vs. MFCC dimension)]

• Classify by e.g. Mahalanobis distance + SVM


Codebook Histograms

• Instead of Gaussian model, convert high-dim. distributions to multinomial
    Vector Quantization (VQ)

[Figure: MFCC features (MFCC(0) vs. MFCC(1)) → Global Gaussian Mixture Model → Per-Category Mixture Component Histogram (count vs. GMM mixture)]

• Classify by distance on histograms
    KL, Chi-squared + SVM


Baseline Limitations

• MFCCs model broad spectral shape but strip away fine detail
    transients, pitch, texture

[Figure: spectrograms of Original and MFCC resynthesis (freq / kHz vs. time / s)]

• BoF loses temporal structure...
• Channel & Noise


2. Auditory Model Features (Lyon et al. 2010)

• Lyon’s Stabilized Auditory Image (SAI) model
    fine structure stabilized by ‘strobing’ on transients

[Figure 1: The systems for generating sparse codes from audio using an auditory frontend. Stages: Audio in → Cochlea simulation → Strobe detection → Temporal integration → Auditory image → Multiscale segmentation → Sparse coding → Sparse code → Aggregate features → Document feature vector]


Subband Autocorrelation (SBPCA) (B-S Lee & Ellis 2012)

• Simplified version of Lyon model
    10x faster (RTx5 → RT/2)
• Captures fine time structure in multiple bands
    .. the information that is lost in MFCC features

[Figure: SBPCA block diagram: Sound → Cochlea filterbank → frequency channels → delay line / short-time autocorrelation → Correlogram slice (freq vs. lag, over time) → Subband VQ (x4) → Histogram → Feature Vector]


Filterbank

• Simple four-pole, two-zero bandpass filters
• Constant-Q, log-spacing
    very rough approximation to cochlea


Subband Autocorrelation

• Autocorrelation stabilizes fine time structure
    25 ms window, lags up to 25 ms
    calculated every 10 ms
    normalized to max (zero lag)


PCA Summarization

• Autocorrelation of bandpass signals is constrained
    full image = 24 bands x 200 lags
    but intrinsic information is much lower
• Reduce dimensionality with Principal Component Analysis (PCA)
    calculated on individual bands to retain separability
    per-subband PCA bases have little dependence on training material
    10 dimensions adequate
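
The Subband Autocorrelation and PCA Summarization steps can be sketched roughly as follows. This is a minimal illustration and not the authors' released code. The function names are hypothetical, the filterbank outputs are assumed to be already available as a (bands x samples) array, and the 8 kHz sample rate is an assumption inferred from 200 lags spanning 25 ms. The PCA bases are fit on the clip itself purely for demonstration; in practice they would be estimated once on training audio and reused.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def short_time_autocorr(x, sr=8000, win_ms=25, hop_ms=10, max_lag_ms=25):
    """Short-time autocorrelation of one subband, normalized by zero lag.

    Returns an array of shape (n_frames, n_lags).
    """
    win = int(sr * win_ms / 1000)         # 25 ms window (200 samples at 8 kHz)
    hop = int(sr * hop_ms / 1000)         # new frame every 10 ms
    n_lags = int(sr * max_lag_ms / 1000)  # lags up to 25 ms
    span = win + n_lags - 1               # samples needed by one frame
    n_frames = max(0, 1 + (len(x) - span) // hop)
    ac = np.zeros((n_frames, n_lags))
    for t in range(n_frames):
        start = t * hop
        base = x[start:start + win]
        # Row k of `shifted` is the window delayed by k samples.
        shifted = sliding_window_view(x[start:start + span], win)
        ac[t] = shifted @ base
        ac[t] /= ac[t, 0] + 1e-12         # divide by the zero-lag value
    return ac


def fit_pca(frames, n_components=10):
    """Return (mean, basis); basis rows are the leading principal directions."""
    mean = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:n_components]


def sbpca(subbands, n_components=10):
    """Per-band autocorrelation followed by per-band PCA.

    subbands: array (n_bands, n_samples) of bandpass-filtered signals.
    Returns an array (n_frames, n_bands, n_components).
    """
    feats = []
    for band in subbands:
        ac = short_time_autocorr(band)
        # Assumption: bases fit on this clip; normally learned on training data.
        mean, basis = fit_pca(ac, n_components)
        feats.append((ac - mean) @ basis.T)
    return np.stack(feats, axis=1)


# Stand-in for 24 filterbank channels of a 1-second clip (random noise).
rng = np.random.default_rng(0)
demo_subbands = rng.standard_normal((24, 8000))
demo_feats = sbpca(demo_subbands)
print(demo_feats.shape)
```

Running PCA separately on each band keeps the 10 coefficients of one band independent of the others, which is what allows the bands to be grouped and quantized separately afterwards.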
Subband VQ

• Summarize features across segments by VQ histogram
• 4 separate bands to provide overlap resistance
    non-overlapping groups of 6 subbands x 10 PCA coefs
• Each band quantized into 1000 codewords; whole soundtrack → codeword histogram
• 4000 dimensions, sparse
    SVM classifier with Chi2 kernel


Auditory Model Feature Results

• SAI and SBPCA close to MFCC baseline
• Fusing MFCC and SBPCA improves mAP by 15% rel
    mAP: 0.35 → 0.40
• Calculation time
    MFCC: 6 hours
    SAI: 1087 hours
    SBPCA: 110 hours

[Figure: Avg. Precision (AP), 0 to 1, per concept and MAP, for feature sets MFCC, SAI, SBPCA, MFCC + SAI, MFCC + SBPCA, SAI + SBPCA, All 3. Concepts: Basketball, Baseball, Soccer, IceSkating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday, WeddingRecep, WeddingCerem, WeddingDance, MusicPerfmnce, NonMusicPerf, Parade, Beach, Playground]


Retrieval Examples

• Browsing results are sometimes surprising!
    file:///u/drspeech/data/aladdin/code/genVidClassif/html/mfcc230/Cat-max.html


3. Open Issues

• Foreground & Events
    transients at a coarser scale
• Better object/event separation
    parametric models
    spatial information?
    computational auditory scene analysis...

[Figure (Barker et al. ’05): grouping cues: Frequency Proximity, Common Onset, Harmonicity]

• Better Annotation
    granularity in time & concept


Summary

• Soundtrack classification
    Acoustic properties of different events
• Standard model
    MFCC + bag-of-features + classifier
• Auditory model features to recapture the fine time structure
    combination improves retrieval
    code available:


References

• Jon Barker, Martin Cooke, & Dan Ellis, “Decoding Speech in the Presence of Other Sources,” Speech Communication 45(1): 5-25, 2005.
• Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A.C. Loui, “Consumer video understanding: A benchmark database and an evaluation of human and machine performance,” in Proc. ACM International Conference on Multimedia Retrieval (ICMR), Apr. 2011, p. 29.
• Byung-Suk Lee & Dan Ellis, “Noise robust pitch tracking by subband autocorrelation classification,” in Proc. INTERSPEECH-12, Sept. 2012.
• Keansub Lee & Dan Ellis, “Audio-Based Semantic Concept Classification for Consumer Video,” IEEE Tr. Audio, Speech and Lang. Proc. 18(6): 1406-1416, Aug. 2010.
• R.F. Lyon, M. Rehn, S. Bengio, T.C. Walters, and G. Chechik, “Sound retrieval and ranking using sparse auditory representations,” Neural Computation, vol. 22, no. 9, pp. 2390–2416, Sept. 2010.


Acknowledgment

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20070. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.

Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.