Recognizing and Classifying Environmental Sounds
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY, USA
dpwe@ee.columbia.edu — http://labrosa.ee.columbia.edu/

1. Environmental Sound Recognition
2. Foreground Events
3. Background Retrieval
4. Labels & Annotation
5. Future Directions

Environmental Sounds - Dan Ellis — 2013-06-01

1. What is hearing for?
http://news.bbc.co.uk/2/hi/science/nature/7130484.stm
• Hearing = getting information from sound: predators/prey, communication
• Environmental sound recognition is fundamental

Environmental Sound Perception — Ellis 1996
• What do people hear?
[Figure: spectrogram (0–4 kHz, ~9 s) of a city-street recording, with ten listeners' labels for each event — e.g. Horn1 (10/10 heard: "horn", "car horn", "double horn", "honk"); Crash (10/10: "crash", "slam", "door slamming", "gunshot", "trash can"); Horn2 (5/10); Truck (7/10: "truck engine", "truck accelerating", "acceleration", "wheels on road"); Horn3 (5/10); Squeal (6/10: "squeal", "airbrake", "brake squeaks"); Horn4 (8/10); Horn5 (10/10)]
• Listeners separate sources from ambience
• Mixtures are the rule

Sound Scene Evaluations
• Evaluations are good for research: they help researchers and help funders
• A decade of evaluations (2003–2013):
- Meeting rooms: NIST MtgRm
- Acoustic events: CHIL/CLEAR, D-CASE
- Music transcription: ADC, MIREX
- Speech separation: SSC/PASCAL, CHiME
- Source separation: SASSEC, SiSEC
- Segmentation: TRECVID 2003, Albayzin
- Video (soundtrack) classification: TRECVID MED,
VideoCLEF / MediaEval
• Metrics: SNR, frame accuracy, event error rate, mAP

DCASE (2012–2013) — Giannoulis et al. 2013
• 2012 IEEE AASP Challenge: "Detection and Classification of Acoustic Scenes and Events"
systems submitted Mar 2013; results at WASPAA, Oct 2013; two tasks
• Task 1: Scene classification
10 classes × 10 examples × 30 s — street, supermarket, restaurant, office, park, ...
evaluate on 100 examples (~1 hour total) by classification accuracy
[Figure: 30-s spectrograms of the ten scene classes — bus, office, park, restaurant, tube, busy street, open-air market, quiet street, supermarket, tube station]

DCASE Event Detection
• Task 2: Event detection ("office live")
16 events × 20 training examples (~20 min total): knock, laugh, drawer, keys, phone, ...
evaluate on ~15 min (?)
metrics: frame-level AEER & event-level precision/recall

TRECVID MED (2010–2013) — Over et al. 2011
• "Multimedia Event Detection"
e.g. MED 2011: 15 events × 200 example videos (~60 s each) — Making a sandwich, Birthday party, Parade, Flash mob, ...
evaluate by mean Average Precision over 10k–100k videos (200–2,000 hours)
audio and video; participants have annotated ~1,000 videos (> 10 h)
[Figure: frames from event E009, "Getting a Vehicle Unstuck"]

Consumer Video Dataset
• Columbia Consumer Video (CCV) — Y.-G. Jiang et al.
2011
9,317 videos / 210 hours
20 concepts based on a consumer user study
labeled via Amazon Mechanical Turk

Environmental Sound Motivations
• Audio lifelog diarization
[Figure: two days of personal audio (2004-09-13/14, 09:00–18:00) automatically segmented into episodes labeled cafe, office, preschool, lecture, meeting, outdoor, lab, ...]
• Consumer video classification & search
• Real-time hearing prosthesis
• Robot environment sensitivity
• Understanding hearing

2. Foreground Event Recognition
• "Events" are what we hear / notice
• ASR approach? — Reyes-Gomez & Ellis 2003
[Figure: spectrogram of "dogBarks2" (0–8 kHz, ~1.15–2.25 s) with an HMM state alignment (s2, s3, s4, s5, s7) overlaid]
events = words? what are the subwords?
needs labeled data, but the mature tools are great

Transient Features — Cotton, Ellis, Loui '11
• Transients = foreground events?
• An onset detector finds energy bursts — the best-SNR moments
• PCA basis to represent each 300 ms × auditory-frequency patch
• "bag of transients"

Transient Features
• Results show a small benefit — similar to the MFCC baseline?
• Examine clusters looking for semantic consistency... can a cluster be linked to a label class?
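The onset-plus-PCA recipe on the Transient Features slides can be sketched roughly as below. This is an illustrative reconstruction, not the authors' code: the function names, the simple energy-flux onset rule, and the threshold constant are all assumptions.

```python
import numpy as np

def onset_patches(spec, hop_s=0.010, patch_s=0.300, thresh=3.0):
    """Find energy bursts in a (freq x time) spectrogram and cut a
    fixed-length (300 ms) patch starting at each onset."""
    env = spec.sum(axis=0)                                # total energy per frame
    flux = np.maximum(np.diff(env, prepend=env[0]), 0.0)  # positive energy jumps
    peaks = np.where(flux > thresh * (flux.mean() + 1e-9))[0]
    plen = int(round(patch_s / hop_s))                    # 300 ms -> frames
    return [spec[:, p:p + plen] for p in peaks if p + plen <= spec.shape[1]]

def pca_codes(patches, n_components=20):
    """Project flattened patches onto a PCA basis -> a 'bag of transients'."""
    X = np.stack([p.ravel() for p in patches])            # (n_patches, freq*time)
    X = X - X.mean(axis=0)
    # SVD gives the principal axes without forming the covariance matrix
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T                        # (n_patches, n_components)
```

Using the SVD of the centered patch matrix avoids building the (freq·time)² covariance matrix, which matters when the patches are high-dimensional.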
[Figure: per-class average precision on consumer video for fusion, event features, MFCC features, and a guess baseline; classes include group of 3+, music, crowd, cheer, singing, 1 person, night, group of 2, show, dancing, beach, park, baby, playground, parade, boat, sports, graduation, birthday, ski, sunset, animal, wedding, picnic, museum]

NMF Transient Features — Smaragdis & Brown '03; Abdallah & Plumbley '04; Virtanen '07
• Decompose spectrograms into templates + activations: X = W · H
well-behaved gradient-descent algorithm
2D patches; sparsity control
computation time...
[Figure: original mixture spectrogram (0–10 s) and three learned bases — Basis 1 (L2), Basis 2 (L1), Basis 3 (L1)]

NMF Transient Features — Cotton & Ellis '11
• Learn 20 patches from CLEAR Meeting Room events
[Figure: the 20 learned spectro-temporal patches, ~0.5 s each]
• Compare to an MFCC-HMM detector by Acoustic Event Error Rate (AEER%)
[Figure: error rate in noise, 0–30 dB SNR, for the MFCC, NMF, and combined systems]
• NMF is more noise-robust, and combines well with MFCC

Why Are Events Hard?
• Events are short
target sounds may occupy only a few % of the time
• Events are varied
what is the vocabulary? what are the prototypes?
source & channel variability
• Critical information is in fine time structure
onset transients etc. are a poor match to classic frame-level spectral-envelope features

3. Background Retrieval
• Baseline for soundtrack classification — K.
Lee & Ellis 2010
divide the sound into short frames (e.g. 30 ms)
calculate features (e.g. MFCC) for each frame
describe the clip by statistics of the frames (mean, covariance) = "bag of features"
[Figure: video soundtrack VTS_04_0001 — spectrogram (0–8 kHz, 9 s), 20-dimensional MFCC features, and the 20×20 MFCC covariance matrix]
• Classify by e.g. Mahalanobis distance + SVM

Retrieval Evaluation
• Rank a large test set by match to the category
• Precision-recall
[Figure: CCV precision-recall curves (MFCC+SBPCA) — MusicPerformance 0.69, Birthday 0.49, Graduation 0.46, IceSkating 0.42, Swimming 0.37, Dog 0.27, WeddingReception 0.18]
• mean Average Precision
[Figure: per-class AP on the CCV test set (mfc230 features, mean AP = 0.345), from Basketball through Playground, plus a RAND baseline]

Retrieval Examples
• High precision for top hits (in-domain)

Sound Texture Features — McDermott et al. '09; Ellis, Zheng, McDermott '11
• Characterize sounds by perceptually sufficient statistics, verified by matched resynthesis
[Pipeline: sound → mel filterbank (18 channels) → automatic gain control → subband envelopes → per-band histograms; modulation energy via FFT in octave bins at 0.5, 1, 2, 4, 8, 16 Hz; classify by Mahalanobis distance]
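The first stage of the texture statistics — per-band envelope moments, giving the 18 × 4 "MVSK" block — can be sketched as below. This is a toy stand-in, not the published pipeline: the rectangular FFT-bin split replaces a real mel filterbank, and there is no gain control.

```python
import numpy as np

def subband_envelopes(x, sr, n_bands=18, frame=0.010):
    """Crude stand-in for a mel filterbank + envelope follower: split the
    short-time FFT magnitudes into n_bands contiguous bins and sum each."""
    n = int(sr * frame)
    nfrm = len(x) // n
    frames = x[:nfrm * n].reshape(nfrm, n)
    mags = np.abs(np.fft.rfft(frames * np.hanning(n), axis=1))
    bands = np.array_split(mags, n_bands, axis=1)
    return np.stack([b.sum(axis=1) for b in bands])        # (n_bands, n_frames)

def texture_moments(env):
    """Mean, variance, skew, kurtosis of each band envelope -> (n_bands, 4)."""
    m = env.mean(axis=1)
    c = env - m[:, None]
    v = (c ** 2).mean(axis=1)
    s = (c ** 3).mean(axis=1) / np.maximum(v, 1e-12) ** 1.5
    k = (c ** 4).mean(axis=1) / np.maximum(v, 1e-12) ** 2
    return np.stack([m, v, s, k], axis=1)
```

The modulation-energy and cross-band-correlation blocks would be computed from the same `env` matrix.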
• Statistics: mel-band distributions — mean, variance, skew, kurtosis (18 × 4); subband modulation energy (18 × 6); cross-band envelope correlations (318 samples)
[Figure: texture features for two clips — 1159_10 ("urban cheer clap") and 1062_60 ("quiet dubbed speech music") — spectrograms, per-band moments (M, V, S, K), modulation energies (0.5–32 Hz), and envelope correlations]

Sound Texture Features
• Test on MED 2010 development data
10 audio-oriented manual labels: clap, cheer, music, speech, dubbed, other, noisy, quiet, urban, rural
[Figure: per-label AP for feature sets M, MVSK, ms, cbc, MVSK+ms, MVSK+cbc, MVSK+ms+cbc, and for mfcc, txtr, mfcc+txtr]
• Performance is about the same as MFCCs — covariance ~ texture?
• Per-class statistics: can feature dimensions be related to classes?
[Figure: normalized per-class feature means and standard deviations for Rural, Urban, Quiet, Noisy, Dubbed, Speech, Music, Cheer, Clap]

Auditory Model Features — Lyon et al. 2010; Cotton & Ellis 2013
• Subband Autocorrelation PCA (SBPCA)
a simplified version of the Lyon et al. system, 10× faster (5× real time → half real time)
• Captures fine time structure in multiple bands,
missing in MFCC features
[Figure: SBPCA pipeline — sound → cochlea filterbank → per-channel short-time autocorrelation (delay line) → correlogram slice (frequency × lag) → per-subband VQ → histogram → feature vector]

Subband Autocorrelation
• Autocorrelation stabilizes fine time structure
25 ms window, lags up to 25 ms
calculated every 10 ms
normalized by the maximum (zero-lag) value

Auditory Model Feature Results
• SAI and SBPCA come close to the MFCC baseline
[Figure: per-class AP on CCV for MFCC, SAI, SBPCA and their fusions (all 3, MFCC+SBPCA, MFCC+SAI, SAI+SBPCA)]
• Fusing MFCC and SBPCA improves mAP by 15% relative: 0.35 → 0.40
• Calculation time: MFCC 6 hours; SAI 1,087 hours; SBPCA 110 hours

What is Being Recognized?
• Soundtracks are represented by global features
MFCC covariance, codebook histograms
• What are the critical parts of the sound?
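The short-time autocorrelation stage above (25 ms windows every 10 ms, normalized at zero lag) can be sketched for one filterbank channel as below; this is a minimal illustration under those stated parameters, not the LabROSA implementation, and the quadratic lag loop is far from the fast version.

```python
import numpy as np

def subband_autocorr(band, sr, win_s=0.025, hop_s=0.010):
    """Short-time autocorrelation of one filterbank output: 25 ms windows
    every 10 ms, lags up to the window length, normalized at zero lag."""
    w, h = int(sr * win_s), int(sr * hop_s)
    rows = []
    for start in range(0, len(band) - 2 * w, h):
        seg = band[start:start + 2 * w]            # window plus maximum lag
        ref = seg[:w]
        # r[l] = sum_t ref[t] * seg[t + l], for lags 0..w-1
        r = np.array([np.dot(ref, seg[l:l + w]) for l in range(w)])
        rows.append(r / max(r[0], 1e-12))          # normalize by zero lag
    return np.array(rows)                          # (n_frames, n_lags)
```

For a periodic band signal the normalized autocorrelation shows a peak near 1 at the period, which is the fine-time-structure cue that frame-level spectral envelopes discard.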
[Figure: video k5tiYqzvuoc ("Dog") — spectrogram (0–4 kHz, 70 s) with the per-class posteriors of the 20 CCV classifiers, Basketball through Playground, over time]

Semantic Audio Features
• Train classifiers on related labeled data
sound → MFCCs → segment statistics → 55 pre-trained SVMs → semantic features → event classifiers
this defines a new "semantic" feature space
• Use it for the target classifier, or in combination
[Figure: mean AP for mfccs-230, sema100mfcc, sbpcas-4000, sema100sbpca, and sema100sum features]

4. Labels & Annotation — Burger et al. '12
• "Semantic features" are a promising approach
but we need good coverage... how do we learn more categories?
• Annotation is expensive
fine time annotation takes > 10× real time
only a few hours are available
• What to label? generic vs. task-specific
[Figure: the "noiseme" label set — animal, anim_bird, anim_cat, anim_ghoat, anim_horse, human_noise, clap, laugh, whistle, scream, child, mumble, singing, music_sing, music, speech, speech_ne, radio, cheer, crowd, applause, clatter, rustle, scratch, knock, hammer, thud, washboard, click, bang, squeak, beep, tone, sirene, engine_quiet, engine_light, engine_heavy, water, power_tool, micro_blow, wind, white_noise, other_creak]

BBC Audio Semantic Classes
• BBC Sound Effects Library
2,238 tracks (60 h) with short descriptions
[Examples: tracks SFX001-04-01 through SFX001-09-01; the most frequent description keywords occur 290, 267, 240, 197, 193, ... times]
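Building a keyword vocabulary from the track descriptions — count words across all descriptions and keep the most frequent — is simple to sketch. The descriptions and stop list below are toy stand-ins, not the BBC library metadata.

```python
import re
from collections import Counter

def top_keywords(descriptions, n=45,
                 stop=frozenset({"the", "a", "in", "on", "with"})):
    """Count words across all track descriptions; keep the n most frequent."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in stop)
    return counts.most_common(n)

# Toy descriptions standing in for the library metadata:
tracks = [
    "Ambience, Crowds In The Piazza, Busker's Music In The Distance",
    "High Street With Traffic, Footsteps",
    "City Skyline, Traffic Ambience",
]
print(top_keywords(tracks, n=3))   # here 'ambience' and 'traffic' lead with 2 each
```

Each retained keyword then becomes a weak binary label for training one "semantic unit" classifier.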
• Use the top 45 keywords
voices, crowds, household, children, street, traffic, electric, crowd, train, urban, transport, boat, car, cars, speech, aircraft, horse, electronic, rural, birds, animal, rhythms, animals, siren, vocals, cat, steam, sports, babies, ambience, futuristic, construction, emergency, horses, men, wood, country, footsteps, wooden, warfare, running, women, walking, pavement, stairs
example tracks: "Wood Fire Inside Stove 5:07", "City Skyline 9:46", "High Street With Traffic, Footsteps", "Car Wash Automatic, Wash Phase Inside", "Motor Cycle Yamaha Rd 350", "Motor Cycle Yamaha Rd 350, Rider Runs U..."
further keyword counts: 59 men, 59 general, 55 switch, 53 starts, 53 crowds, ...
• Added as "semantic units"
[Figure: per-keyword AP on BBC2238 traindev1 (mean = 0.678), with a RAND baseline]
some redundancy is visible in the mutual APs

BBC Audio Semantic Classes
• Limited semantic correspondence
[Figure: for track SFX051-05 ("ambience crowds": "Ambience, London Covent Garden Exterior, Crowds In The Piazza On A Busy Saturday Afternoon, Occasional Busker's Music In The Distance"), the 45 keyword-classifier outputs over 250 s, with the spectrogram (0–4 kHz)]

Label Temporal Refinement — K. Lee, Ellis, Loui '10
• Audio ground truth is at coarse time resolution
would better-focused labels give better classifiers?
• Train classifiers on shorter (2 s) segments?
but there is little information in very short time frames
• Iteration 0: initial labels apply to the whole clip
• Iteration 1+: relabel based on the most likely segments in each clip, then retrain the classifier
[Figure: the refinement pipeline for videos 0EwtL2yr0y8 and 03FF1Sz53YU tagged "Parade" — features → SVM train → SVM classify → posteriors → refined time labels]

Label Temporal Refinement
• Refining labels is "Multiple Instance Learning"
"positive" clips have at least one positive frame
"negative" clips are all negative
• Refine based on the previous classifier's scores
threshold chosen from the CDFs of positive and negative frames
• mAP improves ~10% after a few iterations

5. Future: Tasks & Metrics
• Environmental sound recognition: what is it good for?
media content description ("recounting")
environmental awareness
• What are the right ways to evaluate?
task-specific metrics: AEER, F-measure
downstream tasks: WER, mAP
real applications: archive search, aware devices

Labels & Annotations
• Training data: quality vs. quantity
quality costs: DCASE ~0.3 h; TRECVID MED ~10 h
quantity always wins
• Opportunistic labeling (Susanne Burger, CMU)
e.g. sound effects libraries, subtitles, ...
needs refinement strategies
• Existing annotations indicate interest

Source Separation
• Separated sources make event detection easy
the "separate then recognize" paradigm: mix → time-frequency masking + resynthesis (identify the target energy) → ASR with speech models → words
vs. "find the best words model": mix → identify directly, with speech models and source knowledge integrated → words
the integrated solution is more powerful...
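One step of the multiple-instance relabeling loop from the Label Temporal Refinement slides might be sketched as below. This is an assumption-laden simplification: a plain percentile stands in for the CDF-based threshold, and the classifier is reduced to the per-frame scores it produced.

```python
import numpy as np

def refine_labels(frame_scores, clip_labels, pos_pct=50):
    """One multiple-instance refinement step: in positive clips, keep only
    the most confident frames as positive; negative clips stay all-negative."""
    pos_scores = np.concatenate(
        [s for s, y in zip(frame_scores, clip_labels) if y == 1])
    thresh = np.percentile(pos_scores, pos_pct)   # stand-in for the CDF rule
    frame_labels = []
    for scores, y in zip(frame_scores, clip_labels):
        if y == 0:
            lab = np.zeros(len(scores), dtype=int)
        else:
            lab = (scores >= thresh).astype(int)
            if not lab.any():                     # a positive clip keeps >= 1 frame
                lab[np.argmax(scores)] = 1
        frame_labels.append(lab)
    return frame_labels
```

Retraining on the refined frame labels and repeating for a few iterations is what yields the ~10% relative mAP gain reported on the slide.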
• Environmental source separation is ill-defined
the relevant "sources" are listener-defined
environment description addresses this
environment recognition for source separation

Summary
• (Machine) listening: getting useful information from sound
• Foreground event recognition
... by focusing on peak-energy patches
• Background sound retrieval
... from long-time statistics
• Data, labels, and task
... what are the sources of interest?

References 1/2
• Samer Abdallah & Mark Plumbley, "Polyphonic music transcription by non-negative sparse coding of power spectra," ISMIR, 2004, pp. 10-14.
• Susanne Burger, Qin Jin, Peter F. Schulam, Florian Metze, "Noisemes: Manual annotation of environmental noise in audio streams," Technical Report LTI-12-017, CMU, 2012.
• Courtenay Cotton, Dan Ellis, Alex Loui, "Soundtrack classification by transient events," IEEE ICASSP, Prague, May 2011.
• Courtenay Cotton & Dan Ellis, "Spectral vs. spectro-temporal features for acoustic event classification," IEEE WASPAA, 2011, pp. 69-72.
• Courtenay Cotton & Dan Ellis, "Subband autocorrelation features for video soundtrack classification," IEEE ICASSP, Vancouver, May 2013, pp. 8663-8666.
• Dan Ellis, "Prediction-driven computational auditory scene analysis," Ph.D. thesis, MIT Dept. of EECS, 1996.
• Dan Ellis, Xiaohong Zheng, Josh McDermott, "Classifying soundtracks with audio texture features," IEEE ICASSP, Prague, May 2011.
• Dimitrios Giannoulis, Emmanouil Benetos, Dan Stowell, Mathias Rossignol, Mathieu Lagrange, Mark Plumbley, "Detection and classification of acoustic scenes and events," Technical Report EECSRR-13-01, QMUL School of EE and CS, 2013.
• Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Dan Ellis, Alex Loui, "Consumer video understanding: A benchmark database and an evaluation of human and machine performance," ACM ICMR, Apr. 2011, p. 29.
References 2/2
• Keansub Lee, Dan Ellis, Alex Loui, "Detecting local semantic concepts in environmental sounds using Markov model based clustering," IEEE ICASSP, Dallas, Apr. 2010, pp. 2278-2281.
• Keansub Lee & Dan Ellis, "Audio-based semantic concept classification for consumer video," IEEE TASLP 18(6): 1406-1416, Aug. 2010.
• R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, G. Chechik, "Sound retrieval and ranking using sparse auditory representations," Neural Computation 22(9): 2390-2416, Sept. 2010.
• Josh McDermott, Andrew Oxenham, Eero Simoncelli, "Sound texture synthesis via filter statistics," IEEE WASPAA, 2009, pp. 297-300.
• Paul Over, George Awad, Jon Fiscus, Brian Antonishek, Martial Michel, Alan Smeaton, Wessel Kraaij, Georges Quénot, "TRECVID 2010 — An overview of the goals, tasks, data, evaluation mechanisms, and metrics," Technical Report, NIST, 2011.
• Manuel Reyes-Gomez & Dan Ellis, "Selection, parameter estimation, and discriminative training of hidden Markov models for general audio modeling," IEEE ICME, Baltimore, 2003, pp. I-73-76.
• Paris Smaragdis & Judith Brown, "Non-negative matrix factorization for polyphonic music transcription," IEEE WASPAA, 2003, pp. 177-180.
• Tuomas Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE TASLP 15(3): 1066-1074, 2007.

Acknowledgment
Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20070. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.