Recognizing and Classifying Environmental Sounds

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu / http://labrosa.ee.columbia.edu/
Recognizing Environmental Sounds - Dan Ellis, 2012-10-24

Outline
1. Machine Listening
2. Background Classification
3. Foreground Event Recognition
4. Open Issues

1. Machine Listening

• Extracting useful information from sound ... like animals do
• Tasks span detect / classify / describe across the speech, music, and environmental-sound domains:
  - Detect: VAD; speech/music discrimination
  - Classify: ASR; music transcription; environment awareness
  - Describe: emotion; automatic narration; music recommendation; "sound intelligence"

Environmental Sound Recognition

• Goal: describe soundtracks with a vocabulary of user-relevant acoustic events/sources
  [Figure: spectrogram of YT_beach_001.mpeg, 0-4 kHz over 10 s, with per-source tracks (background, water/wave, cheers, several overlapping voices) marked along the timeline]
• Challenges:
  - defining the acoustic event vocabulary
  - overlapping sounds
  - ground-truth training data
  - classifier accuracy

Environmental Sound Applications

• Audio lifelog diarization
  [Figure: two days of a personal audio diary (2004-09-13 and 2004-09-14), 09:00-18:00, automatically segmented into episodes such as office, cafe, lecture, lab, outdoor, preschool, and meetings]
• Consumer video classification & search
• Live hearing prosthesis app
• Robot environment sensitivity

Prior Work

• Environment classification: speech / music / silent / machine (Zhang & Kuo '01)
• Nonspeech sound recognition: meeting-room audio event classification; sports events - cheers, bat/ball sounds, ... (Temko & Nadeu '06)

Consumer Video Dataset

• 25 "concepts" from a Kodak user study: boat, crowd, cheer, dance, ...
  - full label set: museum, picnic, wedding, animal, birthday, sunset, ski, graduation, sports, boat, parade, playground, baby, park, beach, dancing, show, group of two, night, one person, singing, cheer, crowd, music, group of 3+
  - concepts may overlap within a clip
• Grab the top 200 videos from a YouTube search, then filter for quality and unedited content = 1873 videos; manually relabel with the concepts
  [Figure: concept co-occurrence matrix, plus per-concept previews for 1G+KL2 (10/15), 8GMM+Bha (9/15), and pLSA500+lognorm (12/15)]

Obtaining Labeled Data

• Amazon Mechanical Turk (Y.-G. Jiang et al. 2011)
  - 10 s clips
  - 9,641 videos labeled in 4 weeks
  - the Columbia Consumer Video (CCV) set

2. Background Classification

• Baseline for soundtrack classification:
  - divide the sound into short frames (e.g. 30 ms)
  - calculate features (e.g. MFCC) for each frame
  - describe the clip by statistics of its frames (mean, covariance) = "bag of features"
  [Figure: soundtrack spectrogram → per-frame MFCC features → clip-level MFCC covariance matrix for VTS_04_0001]
• Classify by e.g. Mahalanobis distance + SVM

Codebook Histograms

• Convert high-dimensional distributions to multinomials:
  - fit a global Gaussian mixture model to all frames
  - describe each clip by its histogram of per-frame mixture-component assignments
  [Figure: MFCC(0)/MFCC(1) scatter with 16 global GMM components, and a per-category mixture-component histogram]
• Classify by distance between histograms: KL, Chi-squared + SVM

Latent Semantic Analysis (LSA)

• Probabilistic LSA (pLSA) models each histogram as a mixture of several 'topics'
  - each clip may have several things going on
• Topic sets optimized through EM:
  p(ftr | clip) = Σ_topics p(ftr | topic) p(topic | clip)
• Use (normalized?) p(topic | clip) as per-clip features

Background Classification Results (K. Lee & Ellis '10)

• Classifiers compared: guessing; MFCC + GMM classifier; single-Gaussian + KL2; 8-GMM + Bhattacharyya; 1024-GMM histogram + 500-topic log(P(z|c)/P(z))
  [Figure: average precision per concept class for each classifier]
• Wide range in performance across concepts.
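A minimal sketch of the "bag of features" baseline described above: summarize a clip's frames as a single Gaussian (mean + covariance), then compare clips with the symmetrized KL divergence (the "KL2" of the results slide). Random toy features stand in for real MFCCs; this is an illustration, not the actual system.

```python
import numpy as np

def clip_stats(frames):
    """Summarize a clip's frame features (n_frames x d) as a single Gaussian."""
    mu = frames.mean(axis=0)
    # small ridge keeps the covariance invertible for short clips
    sigma = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
    return mu, sigma

def kl_gauss(a, b):
    """KL divergence KL(N_a || N_b) between two Gaussians given as (mean, cov)."""
    mu0, s0 = a
    mu1, s1 = b
    d = len(mu0)
    s1inv = np.linalg.inv(s1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(s1inv @ s0) + diff @ s1inv @ diff - d
                  + np.log(np.linalg.det(s1) / np.linalg.det(s0)))

def kl2(a, b):
    """Symmetrized KL ("KL2"): a clip-to-clip distance for 1-NN or an SVM kernel."""
    return kl_gauss(a, b) + kl_gauss(b, a)

# toy stand-ins for per-frame MFCC vectors of three clips
rng = np.random.default_rng(0)
clip_a = rng.normal(0.0, 1.0, (200, 4))   # two clips from the same "class"
clip_b = rng.normal(0.0, 1.0, (200, 4))
clip_c = rng.normal(3.0, 2.0, (200, 4))   # an acoustically different clip
A, B, C = clip_stats(clip_a), clip_stats(clip_b), clip_stats(clip_c)
assert kl2(A, B) < kl2(A, C)   # same-class clips are closer under KL2
```

Plugging such a distance into an SVM (e.g. via exp(-kl2/σ)) gives the "Single-Gaussian + KL2" system compared in the results.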
• Audio-driven concepts (music, ski) fare much better than visually-defined ones (group of 3+, night)
• Large AP uncertainty on infrequent classes

Sound Texture Features (McDermott & Simoncelli '09; Ellis, Zheng & McDermott '11)

• Characterize sounds by perceptually sufficient statistics ... verified by matched resynthesis
• Pipeline: automatic gain control → mel filterbank (18 channels) → subband envelopes, summarized by:
  - per-band distribution moments: mean, var, skew, kurtosis (18 x 4)
  - subband modulation energy in octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 x 6)
  - cross-band envelope correlations
• Compare clips by Mahalanobis distance
  [Figure: texture features for clip 1062_60 "quiet dubbed speech music" vs. clip 1159_10 "urban cheer clap"]

Sound Texture Features

• Test on MED 2010 development data: 10 specially-collected manual labels (clap, cheer, music, speech, dubbed, other, noisy, quiet, urban, rural)
  [Figure: per-label AP for the moment (MVSK), modulation-spectrum (ms), and cross-band-correlation (cbc) feature subsets and their combinations, plus per-label normalized feature means and stddevs]
• Contrasts between feature sets reflect correlations among the labels
• Perform about the same as MFCCs, but combine well with them

Auditory Model Features

• Based on the Lyon/Patterson Auditory Image Model
  - a simplified version is 10x faster (from 5x real time to half real time)
• Captures fine time structure in multiple bands: the information that is lost in MFCC features
• Pipeline: cochlea filterbank → per-channel short-time autocorrelation (correlogram slice, frequency x lag) → subband VQ → histogram feature vector

Auditory Model Feature Results

• Results vary with class, but fusing MFCC and SBPCA features helps ~15%
  [Figure: per-class AP for MFCC, SBPCA, and their fusion]
• Results for CCV (9317 videos)

Real-World Dictionary

• BBC Sound Effects library as a reference collection
  [Figure: map of the albums - Footsteps, Crowds, Birds, Horses, Livestock, Farm Machinery, Trains, Aircraft, Ships and Boats, Weather, Explosions, Guns, Alarms, Electronically Generated Sounds, and rural/urban atmospheres from many countries]
• 1000+ examples ... comprehensive?
• Similarity via normalized textures (over 10 s chunks)
  [Figure: dendrogram clustering album excerpts into groups such as Audiences/Children/Crowds, Animals and Birds, Transportation, Interior Atmosphere, Household, and Exterior Atmosphere]

BBC Audio Semantic Classes

• Use the BBC Sound Effects Library: 2238 sound files with short keyword descriptions
  - e.g. "Wood Fire Inside Stove", "City Skyline", "High Street With Traffic, Footsteps", "Car Wash - Automatic, Wash Phase", "Motor Cycle Yamaha RD 350"
• Train classifiers for the 100 most common keywords (footsteps, animals, ambience, interior, walking, stairs, stone, siren, pavement, warfare, emergency, babies, backgrounds, ...; counts range from ~290 down to ~53)
• Select the 45 best-performing
• Added as "semantic units" .. useful for description?
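A hypothetical sketch of the "semantic units" idea above: train one scorer per keyword, then describe any clip by its vector of keyword scores. The keyword names, synthetic centroid features, and the nearest-centroid scorer are all illustrative stand-ins, not the classifiers actually trained on the BBC library.

```python
import numpy as np

rng = np.random.default_rng(1)
keywords = ["footsteps", "crowds", "animals"]   # tiny illustrative keyword subset

# well-separated synthetic "clip feature" centroids, one per keyword
base = 8.0 * np.eye(len(keywords))
centroids = {kw: np.concatenate([base[i], np.zeros(5)])
             for i, kw in enumerate(keywords)}

def train_keyword_scorer(pos, neg):
    """Nearest-centroid scorer: positive when closer to the keyword's examples."""
    c_pos, c_neg = pos.mean(axis=0), neg.mean(axis=0)
    return lambda x: np.linalg.norm(x - c_neg) - np.linalg.norm(x - c_pos)

scorers = {}
for kw in keywords:
    pos = centroids[kw] + rng.normal(0, 1, (20, 8))   # clips tagged with kw
    neg = rng.normal(0, 1, (20, 8))                   # background clips
    scorers[kw] = train_keyword_scorer(pos, neg)

def semantic_units(x):
    """Map raw clip features to the vector of keyword ('semantic unit') scores."""
    return np.array([scorers[kw](x) for kw in keywords])

test_clip = centroids["crowds"] + rng.normal(0, 1, 8)
u = semantic_units(test_clip)
assert keywords[int(np.argmax(u))] == "crowds"
```

The score vector `u`, not the raw features, is what gets passed to the downstream concept classifiers, which is the fusion evaluated on the next slide.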
[Figure: mutual classification AP for the top 25 BBC keywords]

Audio Semantic Results

• Semantic-level feature fusion helps ~15% relative
  - systems: mfccs-230, sema100mfcc, sbpcas-4000, sema100sbpca, sema100sum
  [Figure: per-event AP on the TRECVID MED events E001-E015 (attempting a board trick, feeding an animal, landing a fish, wedding ceremony, woodworking project, birthday party, changing a vehicle tire, flash mob gathering, getting a vehicle unstuck, grooming an animal, making a sandwich, parade, parkour, repairing an appliance, sewing project)]
• Results on TRECVID MED 2011 DEVT1 (6314 videos)

Audio Semantic Results

• Browsing results is surprisingly good (sometimes)

3. Foreground: Transient Features (Cotton, Ellis & Loui '11)

• Transients = foreground events?
• An onset detector finds energy bursts - the moments of best SNR
• PCA basis to represent each 300 ms x auditory-frequency patch
• "Bag of transients"

Nonnegative Matrix Factorization (Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07)

• Decompose spectrograms into templates + activations: X = W · H
  [Figure: original mixture spectrogram and three recovered bases, one under an L2 penalty and two sparse (L1)]
• Fast & forgiving gradient-descent algorithm
• Design choices: 2D patches, sparsity control, computation time ...
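The decomposition X = W · H can be illustrated with the classic multiplicative update rules for the Euclidean objective; this is a simple stand-in for the "fast & forgiving" algorithm the slide alludes to, with sparsity control omitted.

```python
import numpy as np

def nmf(V, r, n_iter=500, eps=1e-9):
    """Multiplicative-update NMF (Euclidean loss): V (f x t) ≈ W (f x r) @ H (r x t)."""
    rng = np.random.default_rng(0)
    f, t = V.shape
    W = rng.random((f, r)) + eps
    H = rng.random((r, t)) + eps
    for _ in range(n_iter):
        # these updates keep W, H nonnegative and monotonically reduce ||V - WH||^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy "spectrogram": two spectral templates with different activation patterns
templates = np.array([[1, 0], [1, 0], [0, 1], [0, 1.0]])       # freq x component
activations = np.array([[1, 1, 0, 0, 1], [0, 0, 1, 1, 0.0]])   # component x time
V = templates @ activations
W, H = nmf(V, r=2)
assert np.allclose(W @ H, V, atol=0.05)   # near-exact reconstruction
```

On real spectrograms the learned columns of W play the role of event templates and the rows of H their onset activations, which is how the transient features on the next slide are built.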
NMF Transient Features (Cotton & Ellis '11)

• Learn 20 spectro-temporal patches from the Meeting Room Acoustic Event data
  [Figure: the 20 learned patches, each ~0.5 s over ~440-3500 Hz]
• Compare to an MFCC-HMM detector
  [Figure: Acoustic Event Error Rate (AEER%) vs. noise level, 10-30 dB SNR, for MFCC, NMF, and combined systems]
• NMF is more noise-robust, and combines well with MFCC

Foreground Event Localization (K. Lee, Ellis & Loui '10)

• Global vs. local class models: tell-tale acoustics may be 'washed out' in whole-clip statistics
• Try iterative realignment of HMMs:
  - old way: all frames contribute
  - new way: limited temporal extents, with a "background" model shared by all clips
  [Figure: YT baby 002 spectrogram aligned into voice / baby / laugh / background segments]

Foreground Event HMMs (Lee & Ellis '10)

• Training labels only at clip level: 25 concepts + 2 global backgrounds
• Refine models by EM realignment
• Use for classifying the entire video ... or seeking to the relevant part
  [Figure: YT_beach_001.mpeg spectrogram with manual clip-level labels ("beach", "group of 3+", "cheer"), manual frame-level annotation (wind, cheer, voices, water/wave, drum), and the Viterbi path from the 27-state Markov model]

Refining Ground Truth

• Audio ground truth comes at coarse time resolution
  - better-focused labels should give better classifiers?
  - but there is little information in very short time frames
• Train classifiers on shorter (2 sec) segments?
  - iteration 0: initial labels apply to the whole clip
  - then relabel based on the most likely segments in each clip, retrain the classifier, and repeat

Refining Ground Truth

• Refining labels is "Multiple Instance Learning":
  - "positive" clips have at least one +ve frame
  - "negative" clips are all -ve
• Refine based on the previous classifier's scores: threshold from the CDFs of +ve and -ve frames

Refining Ground Truth

• Score by aggregating frame labels over the whole clip
• Systems improve for 3-5 epochs, ~10% total
  [Figure: per-class AP on CCV for relabeling epochs 1-3 (mfcc-230-2s-E1/E2/E3)]
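A toy sketch of the multiple-instance relabeling loop above: every frame starts with its clip's label, then on each epoch the positive clips keep only their most classifier-like frames. A trivial class-mean "classifier" and a median threshold stand in for the SVM and CDF-based threshold used in the slides.

```python
import numpy as np

def refine_labels(clips, clip_labels, n_epochs=3):
    """MIL-style refinement: frames inherit the clip label, then positive clips
    are repeatedly relabeled to keep only their most positive-looking frames.
    Frames are scalar features here; the 'classifier' is a class-mean score."""
    frame_labels = [np.full(len(c), y) for c, y in zip(clips, clip_labels)]
    for _ in range(n_epochs):
        pos = np.concatenate([c[l == 1] for c, l in zip(clips, frame_labels)])
        neg = np.concatenate([c[l == 0] for c, l in zip(clips, frame_labels)])
        mu_pos, mu_neg = pos.mean(), neg.mean()
        for c, l, y in zip(clips, frame_labels, clip_labels):
            if y == 1:   # only positive clips get relabeled; negatives stay all -ve
                score = -(c - mu_pos) ** 2 + (c - mu_neg) ** 2
                # median threshold guarantees at least one frame stays positive
                l[:] = score > np.median(score)
    return frame_labels

# a positive clip contains a short burst among background-like frames
clip_pos = np.array([0.1, 0.2, 5.0, 5.2, 0.1, 0.3])
clip_neg = np.array([0.2, 0.1, 0.3, 0.2, 0.1, 0.2])
labels = refine_labels([clip_pos, clip_neg], [1, 0])
assert labels[0][2] == 1 and labels[0][3] == 1   # the burst stays positive
assert labels[1].sum() == 0                      # negative clip stays all -ve
```

Retraining on the refined frame labels, then re-scoring, is one epoch of the loop whose gains level off after 3-5 epochs in the slides.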
4. Open Issues

• Better object/event separation
  - parametric models
  - spatial information?
  - computational auditory scene analysis: frequency proximity, common onset, harmonicity (Barker et al. '05)
• Integration with video
• Large-scale analysis

Audio Annotation (Burger & Metze, CMU, 2012; (c) IBM Corporation and Columbia University)

• Co-ordinating a program-wide effort to share labels
  - defining the label set; creating labeled data
  - CMU 43-label set as basis: animal, anim_bird, anim_cat, anim_ghoat, anim_horse, human_noise, child, singing, music_sing, music, clap, applause, laugh, whistle, scream, mumble, speech, speech_ne, cheer, crowd, clatter, rustle, scratch, knock, hammer, thud, washboard, click, bang, squeak, beep, tone, sirene, radio, micro_blow, engine_quiet, engine_light, engine_heavy, power_tool, water, wind, white_noise, other_creak
• Efficiency of labeling is key
  - audio-only vs. audio+video?
  - coarse-level human labels + automatic refinement?

Audio-Visual Atoms (Jiang et al. '09)

• Object-related features from both audio (transients) & video (region patches)
• Joint audio-visual atom discovery pairs audio atoms with visual atoms over time
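The joint audio-visual atom discovery just introduced can be sketched as a joint codebook histogram: count which (visual atom, audio atom) pairs co-occur in the same time slice and use the flattened counts as a clip feature. The atom indices and codebook sizes here are illustrative only.

```python
import numpy as np

def av_atom_histogram(visual_atoms, audio_atoms, n_vis, n_aud):
    """Count co-occurrences of (visual atom, audio atom) pairs falling in the
    same time slice, giving an n_vis x n_aud joint-codebook feature per clip."""
    hist = np.zeros((n_vis, n_aud))
    for v, a in zip(visual_atoms, audio_atoms):  # one (v, a) pair per time slice
        hist[v, a] += 1
    return hist.ravel() / max(len(visual_atoms), 1)  # normalized M*N-dim feature

# toy clip: visual atom 2 ("marching people") co-occurs with audio atom 1 ("parade sound")
vis = [2, 2, 0, 2, 1]
aud = [1, 1, 0, 1, 0]
feat = av_atom_histogram(vis, aud, n_vis=3, n_aud=2)
assert feat.argmax() == 2 * 2 + 1   # the (2, 1) pair dominates the clip feature
```

In the actual system the pairings themselves are mined with multi-instance learning rather than simple counting, as the next slide describes.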
Audio-Visual Atoms

• Multi-instance learning of A-V co-occurrences
  - M visual atoms x N audio atoms → joint codebook learning → M x N A-V atoms

Audio-Visual Atoms

• Discovered pairings match intuition:
  - black suit + romantic music → Wedding
  - marching people + parade sound → Parade
  - sand + beach sounds → Beach
  [Figure: retrieval results using audio only, visual only, and visual + audio]

Summary

• Machine listening: getting useful information from sound
• Background sound classification ... from whole-clip statistics
• Foreground event recognition ... by focusing on peak-energy patches
• Speech content is very important ... separate it with pitch, models, ...

References

• Jon Barker, Martin Cooke & Dan Ellis, "Decoding Speech in the Presence of Other Sources," Speech Communication 45(1): 5-25, 2005.
• Courtenay Cotton, Dan Ellis & Alex Loui, "Soundtrack classification by transient events," IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, "Spectral vs. Spectro-Temporal Features for Acoustic Event Classification," submitted to IEEE WASPAA, 2011.
• Dan Ellis, Xiaohong Zheng & Josh McDermott, "Classifying soundtracks with audio texture features," IEEE ICASSP, Prague, May 2011.
• Wei Jiang, Courtenay Cotton, Shih-Fu Chang, Dan Ellis & Alex Loui, "Short-Term Audio-Visual Atoms for Generic Video Concept Classification," ACM MultiMedia, 5-14, Beijing, Oct 2009.
• Keansub Lee & Dan Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Tr. Audio, Speech & Lang. Proc. 18(6): 1406-1416, Aug 2010.
• Keansub Lee, Dan Ellis & Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, 2278-2281, Dallas, Apr 2010.
• Byung-Suk Lee & Dan Ellis, "Noise-robust pitch tracking by trained channel selection," submitted to IEEE WASPAA, 2011.
• Andriy Temko & Climent Nadeu, "Classification of acoustic events using SVM-based clustering schemes," Pattern Recognition 39(4): 682-694, 2006.
• Ron Weiss & Dan Ellis, "Speech separation using speaker-adapted Eigenvoice speech models," Computer Speech & Lang. 24(1): 16-29, 2010.
• Ron Weiss, Michael Mandel & Dan Ellis, "Combining localization cues and source model constraints for binaural source separation," Speech Communication 53(5): 606-621, May 2011.
• Tong Zhang & C.-C. Jay Kuo, "Audio content analysis for on-line audiovisual data segmentation," IEEE TSAP 9(4): 441-457, May 2001.

Acknowledgment

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20070. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.

Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.