Analysis of Everyday Sounds
Dan Ellis and Keansub Lee
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu

Outline
1. Personal and Consumer Audio
2. Segmenting & Clustering
3. Special-Purpose Detectors
4. Generic Concept Detectors
5. Challenges & Future

LabROSA Overview
• Signal processing + machine learning for information extraction from audio: recognition, separation, and retrieval across speech, music, and environmental sound

1. Personal Audio Archives
• Easy to record everything you hear: < 2 GB/week at 64 kbps
• Hard to find anything: how to scan? how to visualize? how to index?
• Need automatic analysis
• Need minimal impact

Personal Audio Applications
• Automatic appointment-book history: fills in the when & where of movements
• "Life statistics": how long did I spend in meetings this week? most frequent conversations? favorite phrases?
• Retrieving details: what exactly did I promise? (privacy issues...)
• Nostalgia
• ... or what?

Consumer Video
• Short video clips as the evolution of snapshots: 10-60 s, one location, no editing; how to browse?
• More information for indexing: video + audio, foreground + background

Information in Audio
• Environmental recordings contain info on:
  location – type (restaurant, street, ...) and specific
  activity – talking, walking, typing
  people – generic (2 males), specific (Chuck & John)
  spoken content ... maybe
• but not: what people and things "looked like", day/night ...
  ... except when correlated with audible features

A Brief History of Audio Processing
• Environmental sound classification draws on earlier sound classification work as well as source separation: speech recognition and speaker ID (GMM-HMMs), music audio genre & artist ID, and one-channel (model-based, cue-based) and multi-channel source separation all feed into soundtrack & environmental recognition

2. Segmentation & Clustering
• Top-level structure for long recordings: where are the major boundaries? e.g. for the diary application; support for manual browsing
• Length of the fundamental time-frame: 60 s rather than 10 ms? Background more important than foreground; average out uncharacteristic transients
• Perceptually-motivated features, so results have perceptual relevance: broad spectrum + some detail

MFCC Features
• Need "timbral" features: log-domain, auditory-like frequency warping
• Mel-Frequency Cepstral Coefficients (MFCCs): the discrete cosine transform acts as an orthogonalization (a minimal computation is sketched below)
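As a concrete illustration (not the authors' exact front end), a minimal MFCC computation using the librosa library might look like this; the parameter choices are assumptions for the sketch:

```python
# Sketch of an MFCC front end using librosa; window/hop/filter
# settings are illustrative, not those of the original system.
import librosa

y, sr = librosa.load("personal_audio.wav", sr=16000, mono=True)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=20,       # cepstral coefficients to keep
    n_fft=400,       # 25 ms analysis window at 16 kHz
    hop_length=160,  # 10 ms hop
    n_mels=40,       # mel (auditory-like) frequency warping
)
# mfcc has shape (20, n_frames): the DCT of the log-mel spectrum,
# a decorrelated "timbral" summary of each frame.
```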
Long-Duration Features (Ellis & Lee '04)
[Figure: six panels over a full day of recording (frequency in Bark vs. time in minutes): average linear energy and normalized energy deviation, average log energy and log energy deviation (dB), average spectral entropy and spectral entropy deviation (bits).]
• Capture both average and variation
• Capture a little more detail in subbands...

Spectral Entropy
• Auditory spectrum: collect the FFT magnitudes X[n, k] into auditory bands with weights w_jk:
  A[n, j] = ∑_{k=0..N_F} w_jk · X[n, k]
• Spectral entropy ≈ "peakiness" of each band:
  H[n, j] = −∑_{k=0..N_F} (w_jk · X[n, k] / A[n, j]) · log (w_jk · X[n, k] / A[n, j])
[Figure: FFT spectral magnitude and auditory spectrum (0-8 kHz, energy in dB), with the resulting per-band spectral entropies (relative entropy in bits) for band centers from 30 Hz to 7380 Hz.]
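A minimal sketch of these per-band entropies, assuming a precomputed magnitude spectrogram X and a band-weight matrix W (neither is specified in this form in the talk):

```python
# Per-band spectral entropy: H[n, j] over frames n and bands j.
import numpy as np

def spectral_entropy(X, W, eps=1e-10):
    """X: (n_frames, n_bins) magnitude spectrogram.
       W: (n_bands, n_bins) band-collection weights w_jk.
       Returns (n_frames, n_bands) entropies in bits."""
    A = X @ W.T                        # auditory spectrum A[n, j]
    H = np.zeros_like(A)
    for j in range(W.shape[0]):
        P = X * W[j]                   # band-j energy per FFT bin
        P = P / (A[:, [j]] + eps)      # normalize to a distribution
        H[:, j] = -np.sum(P * np.log2(P + eps), axis=1)
    return H
```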
BIC Segmentation (Chen & Gopalakrishnan '98)
• BIC (Bayesian Information Criterion) compares one model M0 for the whole window X against separate models M1, M2 for the parts X1, X2 either side of a candidate boundary:
  log [ L(X1; M1) · L(X2; M2) / L(X; M0) ] ≷ (λ/2) · log(N) · Δ#(M)
  where N is the number of frames and Δ#(M) is the number of extra parameters in the two-model fit
• Search runs from the last segmentation point to the current context limit: if a candidate boundary passes BIC it is declared; if no boundary is found with the shorter context, the context grows and the search repeats
[Figure: average log auditory spectrum (2004-09-10) and the corresponding BIC score from 13:30 to 16:00, marking "boundary passes BIC" vs. "no boundary with shorter context".]
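A sketch of the BIC test with full-covariance Gaussian segment models, the usual instantiation of this criterion (λ is a tuning parameter):

```python
# BIC change detection: positive score favors a boundary after frame t.
import numpy as np

def gauss_loglik(X):
    """Max log-likelihood of X (n_frames, n_dims) under one Gaussian
       fit to X itself; needs at least a few frames per segment."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)   # regularize
    _, logdet = np.linalg.slogdet(cov)
    # For an ML-fit Gaussian the quadratic term sums to n*d, giving:
    return -0.5 * n * (logdet + d * np.log(2 * np.pi) + d)

def bic_score(X, t, lam=1.0):
    n, d = X.shape
    k = d + d * (d + 1) / 2        # parameters of one full-cov Gaussian
    ratio = gauss_loglik(X[:t]) + gauss_loglik(X[t:]) - gauss_loglik(X)
    return ratio - 0.5 * lam * k * np.log(n)   # Δ#(M) = k
```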
BIC Segmentation Results
• Evaluated on a 62 hr hand-marked dataset: 8 days, 139 segments, 16 categories
• Measure: Correct Accept % at False Accept = 2%

  Feature            Correct Accept
  μdB                80.8%
  μH                 81.1%
  σH/μH              81.6%
  μdB + σH/μH        84.0%
  μdB + σH/μH + μH   83.6%
  mfcc               73.6%

[Figure: ROC curves (sensitivity vs. 1 − specificity) for the same feature sets.]

Segment Clustering
• Daily activity has lots of repetition: automatically cluster similar segments
• "Affinity" of segments computed as KL2 (symmetrized Kullback-Leibler) distances
[Figure: segment-affinity matrix with segments grouped by hand label (lecture, meeting, street, restaurant, library, campus, ...).]

Spectral Clustering (Ng, Jordan, Weiss '01)
• Eigenanalysis of the affinity matrix: A = U·S·Vᵀ, with SVD components u_k·s_kk·v_kᵀ
• The eigenvectors v_k give cluster memberships (see the sketch below)
• Number of clusters?
[Figure: segment affinity matrix and its first four SVD components (k = 1...4).]
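A sketch of this pipeline under stated assumptions: each segment summarized by a Gaussian (mean, covariance) of its features, KL2 distances mapped to affinities, then clustering in the space of the top eigenvectors after Ng, Jordan & Weiss '01; the kernel width sigma is a made-up tuning knob:

```python
# KL2 affinities between segment Gaussians + spectral clustering.
import numpy as np
from scipy.cluster.vq import kmeans2

def kl2(mu0, C0, mu1, C1):
    """Symmetrized KL divergence between two full-cov Gaussians."""
    d = len(mu0)
    iC0, iC1 = np.linalg.inv(C0), np.linalg.inv(C1)
    dmu = mu1 - mu0
    ld0, ld1 = np.linalg.slogdet(C0)[1], np.linalg.slogdet(C1)[1]
    kl01 = 0.5 * (np.trace(iC1 @ C0) + dmu @ iC1 @ dmu - d + ld1 - ld0)
    kl10 = 0.5 * (np.trace(iC0 @ C1) + dmu @ iC0 @ dmu - d + ld0 - ld1)
    return kl01 + kl10

def cluster_segments(mus, covs, n_clusters, sigma=1.0):
    n = len(mus)
    D = np.array([[kl2(mus[i], covs[i], mus[j], covs[j])
                   for j in range(n)] for i in range(n)])
    A = np.exp(-D / sigma)                  # distances -> affinities
    deg = A.sum(axis=1)
    L = A / np.sqrt(np.outer(deg, deg))     # normalized affinity matrix
    _, vecs = np.linalg.eigh(L)             # eigenvectors, ascending
    V = vecs[:, -n_clusters:]               # top eigenvectors
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    _, labels = kmeans2(V, n_clusters, minit="++", seed=0)
    return labels
```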
Clustering Results
• Clustering of automatic segments gives "anonymous classes": a BIC criterion chooses the number of clusters, then clusters are matched to the 16 ground-truth categories
• Frame-level scoring gives ~70% correct; errors arise when the same "place" has multiple ambiences

Browsing Interface / Diary Interface
• Browsing links to other information (diary, email, photos); synchronize with note taking? (Stifelman & Arons); audio thumbnails
• Release tools + a "how to" for capture
[Figure: diary-style browser showing five days (2004-09-13 to 2004-09-17) laid out by time of day, with clustered segments labeled e.g. preschool, cafe, office, meeting, lecture, outdoor, group, lab.]

3. Special-Purpose Detectors: Speech
• Speech emerges as the most interesting content
• Just identifying speech would be useful; the goal is speaker identification / labeling
• Lots of background noise: conventional Voice Activity Detection is inadequate
• Insight: listeners detect the pitch track (the "melody"), so look for voice-like periodicity in noise
[Figure: spectrogram of a coffeeshop excerpt (0-4 kHz, ~4.5 s).]

Voice Periodicity Enhancement (Lee & Ellis '06)
• Noise-robust subband autocorrelation
• Subtract the local average autocorrelation: suppresses steady background, e.g. machine noise
• 15 min test set: 88% accuracy (79% without suppression)
• Also useful for enhancing speech by harmonic filtering

Detecting Repeating Events (Ogle & Ellis '07)
• Recurring sound events can be informative: they indicate similar circumstances... but:
  define "event" – a sound-organization problem
  define "recurring event" – how similar is similar enough?
  ... and how to find them – tractably?
• Idea: use hashing (fingerprints): the index points to other occurrences of each hash, and the intersection of hashes points to a match - a much quicker search (see the sketches below)
• Use a fingerprint insensitive to background?

Shazam Fingerprints (A. Wang '06)
• Prominent spectral onsets are landmarks; use pairwise relations {f1, f2, Δt} as hashes
• Intrinsically robust to background noise
[Figure: spectrogram of a phone ring (0-4 kHz, ~3 s) with its Shazam fingerprint landmarks.]
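A sketch of landmark hashing in this style; the peak picking and hash packing are simplified assumptions for illustration, not Wang's production algorithm:

```python
# Landmark fingerprints: pick prominent spectrogram peaks, pair
# nearby peaks, and pack {f1, f2, dt} into one integer hash.
import numpy as np
from scipy.signal import stft

def landmarks(y, sr, fan_out=3):
    _, _, Z = stft(y, fs=sr, nperseg=512, noverlap=256)
    S = np.log(np.abs(Z) + 1e-6)
    thresh = S.mean() + 2 * S.std()
    peaks = []                                   # (frame, freq bin)
    for j in range(S.shape[1]):                  # crude local-max test
        col = S[:, j]
        for i in range(2, len(col) - 2):
            if col[i] == col[i-2:i+3].max() and col[i] > thresh:
                peaks.append((j, i))
    hashes = []
    for n, (t1, f1) in enumerate(peaks):
        for (t2, f2) in peaks[n+1:n+1+fan_out]:  # pair with next peaks
            dt = t2 - t1
            if 0 < dt < 64:                      # dt fits in 6 bits
                h = (f1 << 16) | (f2 << 6) | dt  # pack {f1, f2, dt}
                hashes.append((t1, h))
    return hashes
```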
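And a companion sketch of the hash-intersection search for repeats: index every hash by its time of occurrence, then histogram the time offsets of matching hashes; a repeat of an event shows up as an offset collecting many consistent votes (the off-diagonal points on the next slide). The min_votes threshold is an illustrative assumption:

```python
# Find repeated events by voting over relative time offsets of
# matching hashes.
from collections import defaultdict

def build_index(hashes):
    """hashes: (time, hash) pairs -> dict hash -> list of times."""
    index = defaultdict(list)
    for t, h in hashes:
        index[h].append(t)
    return index

def find_repeats(query_hashes, index, min_votes=5):
    votes = defaultdict(int)
    for t, h in query_hashes:
        for t2 in index.get(h, []):
            votes[t2 - t] += 1               # consistent offset = match
    matches = [(n, dt) for dt, n in votes.items() if n >= min_votes]
    return sorted(matches, reverse=True)     # strongest repeats first
```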
Exhaustive Search for Repeats
• More selective hashes → few hits required to confirm a match (faster search, better precision), but less robust to background (reduced recall)
• Works well when exact structure repeats: recorded music, electronic alerts; no good for "organic" sounds, e.g. a garage door
[Figure (Ogle & Ellis '07): repeated sound events found in a scripted test recording containing repeats of two songs, three telephone rings, two electronic bells, and garage-door and car-door noises over varied backgrounds. Query-event time is plotted against repeat time: the straight diagonal is the "self" match and repeats appear symmetrically off-diagonal; the produced music and telephone rings are found easily, while the single-tone bell and noise sounds are missed.]

Music Detector (Lee & Ellis '08)
• Two characteristic features for music: strong, sustained periodicity (notes) and clear, rhythmic repetition (beat); at least one should be present!
• Pipeline: pitch-range subband autocorrelation → local stability measure; rhythm-range envelope autocorrelation → perceptual rhythm model; the two streams feed a fused music classifier
• The noise-robust pitch detector looks for high-order autocorrelation
• The beat tracker comes from Music IR work

4. Generic Concept Detectors (Chang, Ellis et al. '07)
• Consumer video application: how to assist browsing? The system automatically tags recordings; tags chosen by usefulness and feasibility
• Initial set of 25 tags defined: "animal", "baby", "cheer", "dancing" ...; human annotation of 1300+ videos; evaluated by average precision
• Multimodal detection: separate audio + visual low-level detectors (then fused...)

MFCC Covariance Representation
• Each clip/segment → fixed-size statistics, as in speaker ID and music genre classification
• The full covariance matrix of a clip's MFCCs maps the kinds of spectral shapes present
• Clip-to-clip distances feed an SVM classifier, via KL divergence or a second Gaussian model
[Figure: soundtrack spectrogram, MFCC features, and MFCC covariance matrix for a sample clip (VTS_04_0001).]

GMM Histogram Representation
• Want a more "discrete" description: to accommodate nonuniformity in MFCC space, and to enable other kinds of models...
• Divide up the feature space with a single global Gaussian Mixture Model, then represent each clip by a histogram of the components it uses
[Figure: a 16-component global GMM over the first two MFCC dimensions, and a per-category mixture-component histogram.]

Latent Semantic Analysis (LSA)
• Probabilistic LSA (pLSA, Hofmann '99) models each histogram as a mixture of several "topics": each clip may have several things going on
• Topic sets optimized through EM:
  p(ftr | clip) = ∑_topics p(ftr | topic) · p(topic | clip)
• Use p(topic | clip) as per-clip features
[Figure: the clip-by-feature matrix p(ftr | clip) over GMM-histogram features factored into p(ftr | topic) × p(topic | clip).]

Audio-Only Results
• Wide range of results across models (single Gaussian + KL, single Gaussian + Mahalanobis, GMM histograms + pLSA) and concepts
• Audio-definite concepts (music, ski) beat non-audio concepts (group, night)
• Large AP uncertainty on infrequent classes
[Figure: average precision per concept - dancing, boat, baby, beach, wedding, museum, parade, playground, sunset, picnic, birthday, one person, group of 3+, music, cheer, group of 2, sports, park, crowd, night, animal, singing, show - for each model vs. guessing.]

How Does It "Feel"?
• Browser impressions: how wrong is wrong?
[Figure: top 8 hits for "Baby".]

Confusion Analysis
• Where are the errors coming from?
[Figure: (a) overlapped manual labels and (b) confusion matrix of classified labels across the concept set.]

Fused Results - AV Joint Boosting
• Audio helps in many classes
[Figure: per-concept AP for random baseline, video only, audio only, and A+V fusion, plus mean AP (MAP).]

5. Future: Temporal Focus
• Global vs. local class models: tell-tale acoustics may be "washed out" in whole-clip statistics
• Try iterative realignment of HMMs: old way, all frames contribute to the class model; new way, labels get limited temporal extents, with a "background" (bg) model shared by all clips
[Figure: spectrograms of YouTube "baby" clip 002: voice/baby/laugh labels spanning all frames vs. realigned voice, baby, and laugh events against bg.]

Handling Sound Mixtures
• MFCCs of mixtures ≠ mix of MFCCs: how to recognize despite widely varying background?
• Candidate approaches: factorial models / Non-negative Matrix Factorization; sinusoidal / landmark techniques
[Figure: spectrograms of a solo voice and a male+female voice mix, original vs. MFCC noise-excited resynthesis.]

Larger Datasets
• Many detectors are visibly data-limited: getting data is hard; labeling data is expensive
• Bootstrap from YouTube etc.: but lots of web video is edited/dubbed... need a "consumer video" detector?
• Preliminary YouTube results disappointing: downloaded data needed extensive clean-up, and the models did not match the Kodak data
• (Freely available data!)

Conclusions
• Environmental sound contains information: that's why we hear! Computers can hear it too
• Personal audio can be segmented and clustered; specific sounds can be found to help navigation/retrieval
• Consumer video can be "tagged", even in unpromising cases; audio is complementary to video
• Interesting directions for better models

References
• D. Ellis and K.S. Lee, "Minimal-Impact Audio-Based Personal Archives," Proc. 1st ACM Workshop on Continuous Archival and Retrieval of Personal Experiences (CARPE-04), New York, Oct 2004, pp. 39-47.
• S. Chen and P. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion," Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
• A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Advances in NIPS, MIT Press, Cambridge MA, 2001.
• K. Lee and D. Ellis, "Voice Activity Detection in Personal Audio Recordings Using Autocorrelogram Compensation," Proc. Interspeech ICSLP-06, Pittsburgh, Oct 2006, pp. 1970-1973.
• J. Ogle and D. Ellis, "Fingerprinting to Identify Repeated Sound Events in Long-Duration Personal Audio Recordings," Proc. ICASSP-07, Hawai'i, 2007, pp. I-233-236.
• A. Wang, "The Shazam music recognition service," Communications of the ACM, 49(8):44-48, Aug 2006.
• K. Lee and D. Ellis, "Detecting Music in Ambient Audio by Long-Window Autocorrelation," Proc. ICASSP-08, Las Vegas, April 2008, pp. 9-12.
• S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, and J. Luo, "Large-scale multimodal semantic concept detection for consumer video," Proc. ACM Multimedia Information Retrieval Workshop, Augsburg, Germany, Sep 2007, pp. 255-264.
• T. Hofmann, "Probabilistic latent semantic indexing," Proc. SIGIR, 1999.