Mining Audio Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu 1. 2. 3. 4. Mining Audio - Dan Ellis http://labrosa.ee.columbia.edu/ Real-World Sound Music Audio Environmental Audio Outstanding Issues 2012-09-14 1 /41 LabROSA Overview • Getting information from sound Information Extraction Music Recognition Environment Separation Machine Learning Retrieval Signal Processing Speech Mining Audio - Dan Ellis 2012-09-14 2 /41 1. Real-World Sound 4000 frq/Hz 3000 0 2000 -20 1000 -40 0 -60 level / dB 0 2 4 6 8 10 12 time/s 02_m+s-15-evil-goodvoice-fade Analysis Voice (evil) Voice (pleasant) Stab Rumble Choir Strings • Sounds rarely occur in isolation .. so analyzing mixtures (“scenes”) is a problem .. for humans and machines Mining Audio - Dan Ellis 2012-09-14 3 /41 Auditory Scene Analysis “Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman’90) • Received waveform is a mixture 2 sensors, N sources - underconstrained • Use prior knowledge (models) to constrain Mining Audio - Dan Ellis 2012-09-14 4 /41 Machine Listening • Extracting useful information from sound Describe Automatic Narration Emotion Music Recommendation Classify Environment Awareness ASR Music Transcription “Sound Intelligence” VAD Speech/Music Environmental Sound Speech Dectect Task ... like animals do Mining Audio - Dan Ellis Music Domain 2012-09-14 5 /41 2. Mining Music Audio Query track One Million Songs 4000 Frequency 3000 2000 1000 0 0 1 2 3 4 5 Time 6 7 8 9 10 “Similar” tracks 4000 Frequency 3000 2000 4000 1000 4000 0 3000 1 2 3 4 5 Time 6 7 8 9 10 Frequency 0 Frequency 3000 2000 2000 1000 1000 0 0 Mining Audio - Dan Ellis 0 1 2 3 4 5 Time 6 7 8 9 0 1 2 3 4 5 Time 6 7 8 9 10 10 2012-09-14 6 /41 Music Audio Representations • We need a description of the audio that contains “suitable” detail • Speech Recognition uses Mel-Frequency Cepstral Coefficients (MFCCs) Let It Be (LIB-1) - log-freq specgram freq / Hz 5915 1396 329 Coefficient MFCCs 12 10 8 6 4 2 Noise excited MFCC resynthesis (LIB-2) freq / Hz 5915 1396 329 0 Mining Audio - Dan Ellis 5 10 15 20 25 time / sec 2012-09-14 7 /41 Warren et al. 2003 Chroma Features • We’d like to preserve the notes at least within one octave Let It Be - log-freq specgram (LIB-1) freq / Hz 6000 1400 chroma bin 300 Chroma features B A G E D C Shepard tone resynthesis of chroma (LIB-3) freq / Hz 6000 1400 300 MFCC-filtered shepard tones (LIB-4) freq / Hz 6000 1400 300 0 Mining Audio - Dan Ellis 5 10 15 20 25 time / sec 2012-09-14 8 /41 Beat-Synchronous Chroma • Perform Beat Tracking .. and record one chroma vector per beat compact representation of harmonies Let It Be - log-freq specgram (LIB-1) freq / Hz 6000 1400 300 chroma bin Onset envelope + beat times Beat-synchronous chroma B A G E D C Beat-synchronous chroma + Shepard resynthesis (LIB-6) freq / Hz 6000 1400 300 0 Mining Audio - Dan Ellis 5 10 15 20 25 time / sec 2012-09-14 9 /41 Million Song Dataset (MSD) Thierry Bertin-Mahieux • Commercial-scale dataset available to MIR researchers 1M pop songs 250 GB of features (6 years of listening) • Many facets: Features, Lyrics, Tags, Covers, Listeners ... http://labrosa.ee.columbia.edu/millionsong Mining Audio - Dan Ellis 2012-09-14 10/41 MSD Audio Features • Use Echo Nest “Analyze” features segment audio into variable-length “events” supports a crude resynthesis: freq / Hz 2416 Origina 761 240 B A G EN Featur E D C freq / Hz represent by 12 chroma + 12 “timbre” TRKUYPW128F92E1FC0 - Tori Amos - Smells Like Teen Spirit 2416 Resyn 761 240 0 Mining Audio - Dan Ellis 2 4 6 8 10 12 time / sec 14 2012-09-14 11/41 MSD Metadata EN Metadata Last.fm Tags artist: release: title: id: key: mode: loudness: tempo: time_signature: duration: sample_rate: audio_md5: 7digitalid: familiarity: year: 100.0 – cover 57.0 – covers 43.0 – female vocalists 42.0 – piano 34.0 – alternative 14.0 – singer-songwriter 11.0 – acoustic 8.0 – tori amos 7.0 – beautiful 6.0 – rock 6.0 – pop 6.0 – Nirvana 6.0 – female vocalist 6.0 – 90s 5.0 – out of genre covers 'Tori Amos' 'LIVE AT MONTREUX' 'Smells Like Teen Spirit' 'TRKUYPW128F92E1FC0' 5 0 -16.6780 87.2330 4 216.4502 22050 '8' 5764727 0.8500 1992 SHS Covers %5489,4468, Smells TRTUOVJ128E078EE10 TRFZJOZ128F4263BE3 TRJHCKN12903CDD274 TRELTOJ128F42748B7 TRJKBXL128F92F994D TRIHLAW128F429BBF8 TRKUYPW128F92E1FC0 Mining Audio - Dan Ellis 5.0 4.0 4.0 4.0 4.0 3.0 3.0 3.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 – – – – – – – – – – – – – – – cover songs soft rock nirvana cover Mellow alternative rock chick rock Ballad Awesome Covers melancholic k00l ch1x indie female vocalistist female cover song american MxM Lyric Bag-of-Words Like Teen Spirit Nirvana Weird Al Yankovic Pleasure Beach The Flying Pickets Rhythms Del Mundo feat. Shanade The Bad Plus Tori Amos 12 11 10 9 7 6 6 6 hello i a and it are we now 6 6 6 4 4 4 3 3 here us entertain the feel yeah to my 3 3 3 3 3 3 3 3 is with oh out an light less danger 2012-09-14 12/41 Finding Similar Items Avery Wang ’03 • If it’s really the same, we can use freq / Hz fingerprinting (e.g. Shazam): Query audio 4000 3500 3000 2500 find local “landmarks” quantize by pairs inverted index 2000 1500 1000 500 0 Match: 05−Full Circle at 0.032 sec 4000 3500 3000 2500 2000 1500 1000 500 0 Mining Audio - Dan Ellis 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 time / sec 2012-09-14 13/41 Dynamic Programming Alignment Serra, Gomez et al. ’08 • We can match the chroma representations by DP alignment robust to timing changes quite efficient best performer in MIREX Cover Song evaluations Mining Audio - Dan Ellis 2012-09-14 14/41 Large-Scale Cover Matching Bertin-Mahieux & Ellis ’11 • How can we find covers in 1M songs? @ 1 sec / comparison, one search = 11.5 CPU-days full N2 mining = 16,000 CPU-years • Hashing? landmarks from chroma patches (like Shazam) represent item by distribution of contour fragments one stage in a filtering hierarchy... Mining Audio - Dan Ellis 2012-09-14 15/41 Euclidean Cover Matching Ellis & Poliner ’08 Bertin-Mahieux & Ellis ’12 • Reduce cover matching to nearest-neighbor search in fixed-size beat-chroma patches • Right “distance”? 10 8 principal components learned weightings 6 4 2 120 Alignment/ segmentation? music segmentation multiple probes framing-insensitive representation Mining Audio - Dan Ellis 130 140 150 160 170 Between the Bars − Glenn Phillips − pwr .25 − −17 beats − trsp 2 12 10 8 6 4 2 120 chroma bin • Between the Bars − Elliot Smith − pwr .25 12 130 140 150 160 170 160 170 pointwise product (sum(12x300) = 19.34) 12 10 8 6 4 2 120 130 140 150 time / beats 2012-09-14 16/41 Pattern Mining • Cluster beat-synchronous chroma patches #32273 - 13 instances chroma bins G F D C A G F D C A G F D C A G F D C A Mining Audio - Dan Ellis 5 a20-top10x5-cp4-4p0 #51917 - 13 instances #65512 - 10 instances #9667 - 9 instances #10929 - 7 instances #61202 - 6 instances #55881 - 5 instances #68445 - 5 instances 10 15 20 5 10 15 time / beats 2012-09-14 17/41 Other Work: MSD Challenge with Brian McFee • Using “Taste Profile” data User-Song-playcount b80344d063b5ccb3212f76538f3d9e43d87dca9e b80344d063b5ccb3212f76538f3d9e43d87dca9e b80344d063b5ccb3212f76538f3d9e43d87dca9e b80344d063b5ccb3212f76538f3d9e43d87dca9e b80344d063b5ccb3212f76538f3d9e43d87dca9e b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 SOAPDEY12A81C210A9 SOBBMDR12A8C13253B SOCNMUH12A6D4F6E6D SODACBL12A8C13C273 SODDNQT12A6D4F5F7E 1 1 2 1 1 5 • Challenge: Rank 1M songs to complete half a playlist using audio, metadata, lyrics, whatever • Kaggle.com based contest self-service evaluation, leader board • Results at MIREX, AdMIRe Mining Audio - Dan Ellis 2012-09-14 18/41 Music: Summary • Finding Musical Similarity at Large Scale • Beat Chroma Representation • Million Song Dataset • MSD Challenge Mining Audio - Dan Ellis 2012-09-14 19/41 3. Environmental Audio • Audio Lifelog Diarization 09:00 09:30 10:00 10:30 11:00 11:30 2004-09-14 preschool cafe preschool cafe Ron lecture 12:00 office 12:30 office outdoor group 13:00 L2 cafe office outdoor lecture outdoor DSP03 compmtg meeting2 13:30 lab 14:00 cafe meeting2 Manuel outdoor office cafe office Mike Arroyo? outdoor Sambarta? 14:30 15:00 15:30 • 2004-09-13 16:00 office office office postlec office Lesser 16:30 Consumer Video Classification & Search Mining Audio - Dan Ellis 17:00 17:30 18:00 outdoor lab cafe 2012-09-14 20/41 Prior Work • Environment Classification speech/music/ silent/machine Zhang & Kuo ’01 • Nonspeech Sound Recognition Temko & Nadeu ’06 Mining Audio - Dan Ellis Meeting room Audio Event Classification sports events - cheers, bat/ball sounds, ... 2012-09-14 21/41 Consumer Video Dataset • 25 “concepts” from 1G+KL2 (10/15) Kodak user study boat, crowd, cheer, dance, ... 1 Labeled Concepts museum picnic wedding animal birthday sunset ski graduation sports boat parade playground baby park beach dancing show group of two night one person singing cheer crowd music group of 3+ 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 3mc c s o n g s d b p b p p b s g s s b a w pm 0 Overlapped Concepts • Grab top 200 videos 8GMM+Bha (9/15) from YouTube search then filter for quality, unedited = 1873 videos manually relabel with concepts pLSA500+lognorm (12/15) Mining Audio - Dan Ellis 2012-09-14 22/41 Obtaining Labeled Data • Amazon Mechanical Turk Y-G Jiang et al. 2011 10s clips 9,641 videos in 4 weeks Mining Audio - Dan Ellis 2012-09-14 23/41 2. Background Classification • Baseline for soundtrack classification 8 7 6 5 4 3 2 1 0 VTS_04_0001 - Spectrogram MFCC Covariance Matrix 30 20 10 0 -10 MFCC covariance -20 1 2 3 4 5 6 7 8 9 time / sec 20 18 16 14 12 10 8 6 4 2 1 2 3 4 5 6 7 8 9 time / sec level / dB 18 16 20 15 10 5 0 -5 -10 -15 -20 value 14 12 0 10 8 6 4 2 • Classify by e.g. KL distance + SVM Mining Audio - Dan Ellis 50 20 MFCC dimension MFCC features MFCC bin Video Soundtrack freq / kHz divide sound into short frames (e.g. 30 ms) calculate features (e.g. MFCC) for each frame describe clip by statistics of frames (mean, covariance) = “bag of features” 5 10 15 MFCC dimension 20 -50 2012-09-14 24/41 Codebook Histograms • Convert high-dim. distributions to multinomial 8 150 6 4 7 2 MFCC features Per-Category Mixture Component Histogram count MFCC(1) Global Gaussian Mixture Model 0 2 5 -2 6 1 10 14 8 100 15 9 12 13 3 4 -4 50 11 -6 -8 -10 -20 -10 0 10 20 0 1 2 3 MFCC(0) 4 5 6 7 8 9 10 11 12 13 14 15 GMM mixture • Classify by distance on histograms KL, Chi-squared + SVM Mining Audio - Dan Ellis 2012-09-14 25/41 Latent Semantic Analysis (LSA) • Probabilistic LSA (pLSA) models each histogram as a mixture of several ‘topics’ .. each clip may have several things going on • Topic sets optimized through EM p(ftr | clip) = ∑topics p(ftr | topic) p(topic | clip) = GMM histogram ftrs * “Topic” p(topic | clip) p(ftr | clip) “Topic” AV Clip AV Clip GMM histogram ftrs p(ftr | topic) use (normalized?) p(topic | clip) as per-clip features Mining Audio - Dan Ellis 2012-09-14 26/41 Background Classification Results Average Precision K Lee & Ellis ’10 1 Guessing MFCC + GMM Classifier Single−Gaussian + KL2 8−GMM + Bha 1024−GMM Histogram + 500−log(P(z|c)/Pz)) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 pl gr • Wide range in performance sk su i ns e bi rth t da an y im w al ed di ng pi cn ic m us eu m M EA N a sp t o gr ad rts ua tio n bo de ra nd pa ou ay gr ba by rk h pa ac g be in da nc ow o sh tw t p of ni gh on ou e pe rs g in r si ng ee d ch cr us m ow on gr ou p of 3+ 0 ic 0.1 Concept class audio (music, ski) vs. non-audio (group, night) large AP uncertainty on infrequent classes Mining Audio - Dan Ellis 2012-09-14 27/41 Sound Texture Features McDermott Simoncelli ’09 Ellis, Zheng, McDermott ’11 • Characterize sounds by perceptually-sufficient statistics \x\ Sound \x\ Automatic gain control mel filterbank (18 chans) Octave bins 0.5,1,2,4,8,16 Hz FFT \x\ \x\ Histogram \x\ \x\ 1159_10 urban cheer clap distributions & env x-corrs Mahalanobis distance ... 2404 1273 1 617 0 2 4 6 8 10 0 2 4 6 8 10 time / s Texture features 0 level 15 10 5 M V S K 0.5 2 8 32 moments Mining Audio - Dan Ellis 1062_60 quiet dubbed speech music mel band freq / Hz • Subband mean, var, skew, kurt (18 x 4) Cross-band correlations (318 samples) Envelope correlation .. verified by matched resynthesis Modulation energy (18 x 6) mod frq / Hz 5 10 15 mel band M V S K 0.5 2 8 32 moments mod frq / Hz 5 10 15 mel band 2012-09-14 28/41 Sound Texture Features • Test on MED 2010 development data cLap Cheer Music Speech Dubbed other noisy quiet urban rural M cLap Cheer Music Speech Dubbed other noisy quiet urban rural M MED 2010 Txtr Feature (normd) means 0.5 0 −0.5 V S K ms cbc MED 2010 Txtr Feature (normd) stddevs 1 0.5 0 V S K ms cbc MED 2010 Txtr Feature (normd) means/stddevs 0.5 0 −0.5 V S K ms Mining Audio - Dan Ellis cbc 0.9 0.8 10 specially-collected manual labels cLap Cheer Music Speech Dubbed other noisy quiet urban rural M M MVSK ms cbc MVSK+ms MVSK+cbc MVSK+ms+cbc 1 0.7 0.6 0.5 Rural Urban Quiet Noisy Dubbed Speech Music Cheer Clap Avg • Contrasts in feature sets correlation of labels... • Perform ~ same as MFCCs combine well 2012-09-14 29/41 Real-World Dictionary • BBC Sound Effects as reference library The Czech Republic - Slovakia Hungary Rural South America Urban South America Footsteps 2 Footsteps 1 India Pakistan Nepal-Countrysid India & Nepal-City Life Exterior Atmospheres-Rural Background England France Suburbia Crowds Birds Istanbul Emergency Cats Construction Age Of Steam Trains Spain Schools & Crowds Horses & Dogs Horses Farm Machinery Livestock 2 Livestock 1 Adventure Sports Greece Equestrian Events Africa The Natural World Africa The Human World Hospitals Babies China Aircraft America Ships And Boats 2 Ships And Boats 1 Weather 1 Electronically Generated Sounds Explosions Guns Alarms Sport Leisure Auto Rural Soundscapes Big Ben Taxi Bus Atmospheres Industry Birds Rivers Streams Water Computers Printers Phones European Soundscapes Misc 1000+ examples ... comprehensive? similarity via normalized textures (over 10s chunks) Mining Audio - Dan Ellis Audiences Children Crowds Foots Animals and Birds Transportation Interior Atmosphere Household Exterior Atmosphere BBC Sound Effects BEH I T A A MEC RB I BRAS EE W S S A A C B H AAEGA L L F HHSST A C CE I B CSF EE I I F FURHT 2012-09-14 30/41 3. Foreground Event Recognition • Global vs. local class models K Lee, Ellis, Loui ’10 tell-tale acoustics may be ‘washed out’ in statistics try iterative realignment of HMMs: YT baby 002: voice baby laugh 4 New Way: Limited temporal extents freq / kHz freq / kHz Old Way: All frames contribute 3 4 2 1 1 5 10 voice 15 0 time / s baby 3 2 0 voice bg 5 voice baby 10 laugh 15 bg time / s laugh baby laugh “background” model shared by all clips Mining Audio - Dan Ellis 2012-09-14 31/41 Foreground Event HMMs freq / Hz Spectrogram : YT_beach_001.mpeg, Manual Clip-level Labels : “beach”, “group of 3+”, “cheer” • Training labels only • Refine models by EM realignment • Use for classifying entire video... or seeking to relevant part Mining Audio - Dan Ellis etc speaker4 speaker3 speaker2 unseen1 Background 25 concepts + 2 global BGs at clip-level 4k 3k 2k 1k 0 global BG2 global BG1 music cheer ski singing parade dancing wedding sunset sports show playground picnic park one person night museum group of 2 group of 3+ graduation crowd boat birthday beach baby animal 0 0 −50 drum Manual Frame-level Annotation wind cheer voice voice voice water wave voice voice voice voice cheer Viterbi Path by 27−state Markov Model 1 2 3 4 5 6 7 8 Lee & Ellis ’10 9 10 time/sec 2012-09-14 32/41 Transient Features Cotton, Ellis, Loui ’11 • Transients = • foreground events? Onset detector finds energy bursts best SNR • PCA basis to represent each 300 ms x auditory freq • “bag of transients” Mining Audio - Dan Ellis 2012-09-14 33/41 Nonnegative Matrix Factorization templates + activation freq / Hz • Decompose spectrograms into 3474 2203 883 442 Basis 1 (L2) Basis 2 (L1) Basis 3 (L1) 0 Mining Audio - Dan Ellis Original mixture 1398 X=W·H fast forgiving gradient descent algorithm 2D patches sparsity control... 5478 Smaragdis Brown ’03 Abdallah Plumbley ’04 Virtanen ’07 1 2 3 4 5 6 7 8 9 10 time / s 2012-09-14 34/41 NMF Transient Features • Learn 20 patches from Acoustic Event Error Rate (AEER%) • Compare to MFCC-HMM detector 3474 Error Rate in Noise Patch 1 Patch 2 Patch 3 Patch 4 Patch 5 Patch 6 Patch 7 Patch 8 Patch 9 Patch 10 Patch 11 Patch 12 Patch 13 Patch 14 Patch 15 Patch 16 Patch 17 Patch 18 Patch 19 Patch 20 1398 442 3474 1398 442 3474 1398 442 3474 1398 442 0.0 250 0.5 0.0 0.5 0.0 • NMF more 200 150 MFCC NMF combined 15 Mining Audio - Dan Ellis 0.5 0.0 0.5 0.0 0.5 time (s) noise-robust 100 0 10 freq (Hz) Meeting Room Acoustic Event data 300 50 Cotton, Ellis ’11 combines well ... 20 SNR (dB) 25 30 2012-09-14 35/41 Matching Videos via Fingerprints are a noise-robust fingerprint freq / kHz • Landmark pairs Cotton & Ellis ’10 VIdeo IMpLQaiHWbE at 195s 4 3 2 • Use to match 0 freq / kHz distinct videos with same sound ambience 1 195.5 196 196.5 197 197.5 198 198.5 199 time / sec VIdeo Yi1hkNkqHBc at 218 s 4 3 2 1 0 Mining Audio - Dan Ellis 218.5 219 219.5 220 220.5 221 221.5 222 time / sec 2012-09-14 36/41 4. Outstanding Issues • Better object/event separation parametric models spatial information? computational auditory scene analysis... Frequency Proximity • Large-scale analysis • Integration with video Mining Audio - Dan Ellis Common Onset Harmonicity Barker et al. ’05 2012-09-14 37/41 Audio-Visual Atoms Jiang et al. ’09 • Object-related features from both audio (transients) & video (patches) time visual atom ... visual atom ... time Mining Audio - Dan Ellis audio atom frequency frequency audio spectrum audio atom Joint Audio-Visual Atom discovery ... visual atom ... audio atom time 2012-09-14 38/41 Audio-Visual Atoms • Multi-instance learning of A-V co-occurrences M Visual Atoms visual atom visual atom time … … … time codebook learning Joint A-V Codebook … audio atom audio atom M×N A-V Atoms … N audio atoms Mining Audio - Dan Ellis 2012-09-14 39/41 Audio-Visual Atoms black suit + romantic music marching people + parade sound Wedding sand + beach sounds Parade Beach Audio only Visual only Visual + Audio Mining Audio - Dan Ellis 2012-09-14 40/41 Summary • Machine Listening: Getting useful information from sound • Background sound classification ... from whole-clip statistics? • Foreground event recognition ... by focusing on peak energy patches • Speech content is very important ... separate with pitch, models, ... Mining Audio - Dan Ellis 2012-09-14 41/41