LabROSA Research Overview
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Real-World Sound
2. Speech Separation
3. Environmental Audio Classification
4. Music Audio Analysis

LabROSA Overview - Dan Ellis 2011-09-09 1/17

LabROSA Overview
• Getting information from sound
[Figure: diagram relating Information Extraction, Recognition, Separation, Retrieval, Machine Learning, and Signal Processing across Music, Environment, and Speech]

1. Real-World Sound
[Figure: spectrogram of 02_m+s-15-evil-goodvoice-fade (frq/Hz, 0-4000; time/s, 0-12; level / dB) with its analysis into Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Sounds rarely occur in isolation
  .. so analyzing mixtures (“scenes”) is a problem
  .. for humans and machines

Auditory Scene Analysis
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman ’90)
• Received waveform is a mixture
  2 sensors, N sources - underconstrained
• Use prior knowledge (models) to constrain

2. Speech Separation
Roweis ’01, ’03; Kristjansson ’04, ’06
• Given models for sources, find “best” (most likely) states for spectra:
  combination model:
  $$p(x \mid i_1, i_2) = \mathcal{N}(x;\, c_{i_1} + c_{i_2},\, \Sigma)$$
  inference of source state:
  $$\{i_1(t), i_2(t)\} = \arg\max_{i_1, i_2}\, p(x(t) \mid i_1, i_2)$$
  can include sequential constraints...
• E.g.
stationary noise:
[Figure: three mel spectrograms (freq / mel bin vs. time / s): Original speech; In speech-shaped noise (mel magsnr = 2.41 dB); VQ inferred states (mel magsnr = 3.6 dB)]

Eigenvoices
Weiss & Ellis ’09, ’10
• Idea: Find speaker model parameter space
  generalize without losing detail?
  Speaker models: 280 states x 320 bins = 89,600 dimensions
  Speaker subspace bases: 10-30 dimensions
• Eigenvoice model:
  $$\mu = \bar{\mu} + U w + B h$$
  adapted model = mean voice + eigenvoice bases x weights + channel bases x channel weights
[Figure: Mean Voice and Eigenvoice dimensions 1-3, each shown as Frequency (kHz) vs. phone: b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax]

Speaker-Adapted Separation
• Eigenvoices for Speech Separation task
  speaker adapted (SA) performs midway between speaker-dependent (SD) & speaker-indep (SI)
[Figure: separation results; recoverable labels: Mix, SA]

3. Soundtrack Classification
• Short video clips as the evolution of snapshots
  10-100 sec, one location, no editing
  browsing?
• Need information for indexing...
  video + audio
  foreground + background

MFCC Covariance Representation
• Each clip/segment → fixed-size statistics
  similar to speaker ID and music genre classification
• Full Covariance matrix of MFCCs
  maps the kinds of spectral shapes present
[Figure: Video Soundtrack VTS_04_0001: Spectrogram (freq / kHz vs. time / sec, level / dB); MFCC features (MFCC bin vs. time / sec); MFCC Covariance Matrix (MFCC dimension vs. MFCC dimension)]
• Clip-to-clip distances for SVM classifier
  by KL or 2nd Gaussian model

Classification Results
Chang, Ellis et al. ’07; Lee & Ellis ’10
• All classifiers vs. all labels
  some concepts are more audio-related
• Mutual Information Proportion
  $$\mathrm{MIP} = \frac{I(\text{classifier};\, \text{label})}{H(\text{label})}$$
[Figure: CCV Average Precision per classifier (mean = 0.300) and Mutual Info Prop for Classifiers vs. Labels (mean = 0.175). Concepts: RAND, Playground, Beach, Parade, NonMusicPerf, MusicPerf, WedDance, WedCerem, WedRecep, Birthday, Graduation, Bird, Dog, Cat, Biking, Swimming, Skiing, IceSkating, Soccer, Baseball, Basketball]

Matching Videos via Fingerprints
Cotton & Ellis ’10
• Landmark pairs are a noise-robust fingerprint
• Use to match distinct videos with same sound ambience
[Figure: spectrograms (freq / kHz vs. time / sec) with matched landmarks: Video IMpLQaiHWbE at 195 s; Video Yi1hkNkqHBc at 218 s]
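The clip-to-clip distance mentioned on the MFCC Covariance Representation slide can be illustrated with a short sketch. It assumes each clip is summarized by a single full-covariance Gaussian over its MFCC frames and compared with a symmetrized KL divergence. The function names, the small ridge term, and the random stand-in data are assumptions for illustration only and are not taken from the slides; real MFCCs would come from an audio front end.

```python
import numpy as np

def clip_gaussian(mfcc, ridge=1e-6):
    """Summarize a (frames x dims) MFCC matrix by its mean and full covariance.

    The ridge term is an assumption added here to keep the covariance invertible.
    """
    mean = mfcc.mean(axis=0)
    cov = np.cov(mfcc, rowvar=False) + ridge * np.eye(mfcc.shape[1])
    return mean, cov

def kl_gauss(m0, S0, m1, S1):
    """KL divergence KL(N(m0, S0) || N(m1, S1)) between two Gaussians."""
    k = m0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + logdet1 - logdet0)

def symmetric_kl(g0, g1):
    """Symmetrized KL: sum of the divergences in both directions."""
    return kl_gauss(*g0, *g1) + kl_gauss(*g1, *g0)

# Stand-in data: two "clips" of 500 frames with 20 MFCC dimensions each.
rng = np.random.default_rng(0)
clip_a = rng.normal(size=(500, 20))
clip_b = rng.normal(loc=0.5, scale=2.0, size=(500, 20))

g_a, g_b = clip_gaussian(clip_a), clip_gaussian(clip_b)
d_ab = symmetric_kl(g_a, g_b)   # distance between different clips
d_aa = symmetric_kl(g_a, g_a)   # distance of a clip to itself
print(f"d(a, b) = {d_ab:.2f}, d(a, a) = {d_aa:.2e}")
```

A matrix of such pairwise distances is one possible input to a distance-based SVM kernel; how the distances are turned into a kernel is not specified on the slide.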
4. Music Audio Analysis
• ... at all levels from notes to genres
[Figure: Let it Be (final verse), 162-174 s: Signal spectrogram (freq / kHz, level / dB); Melody and Piano transcriptions (C2-C5); Onsets & Beats; Per-frame chroma; Per-beat normalized chroma (time / beats, intensity)]

Polyphonic Transcription
Grindlay & Ellis ’09
• Apply the Eigenvoice idea to music
  eigeninstruments?
• Subspace NMF

Melodic-Harmonic Mining
Bertin-Mahieux et al. ’10, ’11
• Million Song Dataset as Echo Nest Analyze
• Frequent clusters of 12 x 8 binarized event-chroma
[Figure: processing stages: Music audio, Beat tracking, Chroma features, Key normalization, Landmark identification, Locality Sensitive Hash Table; panels labeled Original and Reconstruction]
[Figure: the 60 most frequent patch clusters, labeled #rank (count):
  #1 (3491), #2 (2775), #3 (2255), #4 (1241), #5 (1224), #6 (1218), #7 (1092), #8 (1084), #9 (1080), #10 (1035),
  #11 (1021), #12 (1005), #13 (974), #14 (942), #15 (936), #16 (924), #17 (920), #18 (913), #19 (901), #20 (897),
  #21 (887), #22 (882), #23 (881), #24 (881), #25 (879), #26 (875), #27 (875), #28 (874), #29 (868), #30 (844),
  #31 (839), #32 (839), #33 (794), #34 (786), #35 (785), #36 (747), #37 (731), #38 (714), #39 (706), #40 (698),
  #41 (682), #42 (678), #43 (675), #44 (657), #45 (656), #46 (651), #47 (647), #48 (638), #49 (610), #50 (593),
  #51 (592), #52 (591), #53 (589), #54 (572), #55 (571), #56 (550), #57 (549), #58 (534), #59 (534), #60 (531)]

Results - Beatles
• Over 86 Beatles tracks
• All beat offsets = 41,705 patches
  LSH takes 300 sec - approx NlogN in patches?
• High-pass along time
  to avoid sustained notes
• Song filter
  remove hits in same track
[Figure: four matching chroma patches (chroma bin vs. beat): 02-I Should Have Known Better 92.4-97.7s; 05-Here There And Everywhere 12.1-20.5s; 09-Martha My Dear 90.9-98.6s; 12-Piggies 22.0-29.6s]

Summary
• LabROSA: getting information from sound
• Speech
  monaural separation using eigenvoices
  binaural + reverb using MESSL
• Environmental
  classification of consumer video
  landmark-based events and matching
• Music
  transcription of notes, chords, ...
  large corpus mining
• http://labrosa.ee.columbia.edu/
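As an illustration of the model-based state inference from the Speech Separation slides, the sketch below picks, for each frame, the pair of codebook states whose summed spectra best explain the observed mixture. It is a minimal sketch under stated assumptions: the covariance is taken to be isotropic (a scalar variance times the identity), the search over state pairs is exhaustive, and no sequential constraints are used. The function name, codebook sizes, and synthetic data are hypothetical and do not come from the slides.

```python
import numpy as np

def infer_states(x, codebook1, codebook2, var=1.0):
    """Per-frame argmax over (i1, i2) of N(x; c_i1 + c_i2, var * I).

    x:         (frames, bins) observed mixture spectra
    codebook1: (n1, bins) state spectra for source 1
    codebook2: (n2, bins) state spectra for source 2
    Returns two integer arrays of length `frames`.
    """
    # combos[i1, i2] = c_i1 + c_i2 for every state pair: (n1, n2, bins)
    combos = codebook1[:, None, :] + codebook2[None, :, :]
    # Squared error between each frame and each combination: (frames, n1, n2)
    err = ((x[:, None, None, :] - combos[None]) ** 2).sum(axis=-1)
    # Gaussian log-likelihood up to an additive constant (isotropic assumption)
    loglik = -0.5 * err / var
    flat = loglik.reshape(len(x), -1).argmax(axis=1)
    i1, i2 = np.unravel_index(flat, loglik.shape[1:])
    return i1, i2

# Synthetic check: build mixtures from known states plus a little noise.
rng = np.random.default_rng(1)
c1 = rng.normal(size=(5, 16))            # hypothetical source-1 codebook
c2 = rng.normal(size=(5, 16))            # hypothetical source-2 codebook
i1_true = rng.integers(0, 5, size=50)
i2_true = rng.integers(0, 5, size=50)
x = c1[i1_true] + c2[i2_true] + 0.01 * rng.normal(size=(50, 16))

i1_hat, i2_hat = infer_states(x, c1, c2)
print("states recovered:",
      np.array_equal(i1_hat, i1_true) and np.array_equal(i2_hat, i2_true))
```

The exhaustive search costs one likelihood evaluation per state pair per frame, so it scales with the product of the codebook sizes; adding sequential constraints, as the slide suggests, would replace the per-frame argmax with a search over state sequences.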