LabROSA Research Overview
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Real-World Sound
2. Speech Separation
3. Environmental Audio Classification
4. Music Audio Analysis

LabROSA Overview - Dan Ellis 2010-09-10

LabROSA Overview
[Figure: overlapping research areas: Information Extraction, Machine Learning, Signal Processing, Speech, Music, Environment, Recognition, Separation, Retrieval]

1. Real-World Sound
[Figure: spectrogram of 02_m+s-15-evil-goodvoice-fade (freq / Hz vs. time / s, level / dB), with an analysis into Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Sounds rarely occur in isolation
  .. so analyzing mixtures (“scenes”) is a problem
  .. for humans and machines

Auditory Scene Analysis
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman ’90)
• Received waveform is a mixture
  2 sensors, N sources - underconstrained
• Use prior knowledge (models) to constrain

2. Speech Separation
Roweis ’01, ’03; Kristjansson ’04, ’06
• Given models for sources, find “best” (most likely) states for spectra:
  combination model: $p(x \mid i_1, i_2) = \mathcal{N}(x;\, c_{i_1} + c_{i_2},\, \Sigma)$
  inference of source state: $\{i_1(t), i_2(t)\} = \arg\max_{i_1, i_2} p(x(t) \mid i_1, i_2)$
  can include sequential constraints...
• E.g. stationary noise:
[Figure: mel spectrograms (freq / mel bin vs. time / s): Original speech; In speech-shaped noise (mel magsnr = 2.41 dB); VQ inferred states (mel magsnr = 3.6 dB)]

Eigenvoices
Weiss & Ellis ’09, ’10
• Idea: Find speaker model parameter space
  generalize without losing detail?
• Eigenvoice model:
  $\mu = \bar{\mu} + U w + B h$
  adapted model = mean voice + eigenvoice bases × weights + channel bases × channel weights
  Speaker models: 280 states x 320 bins = 89,600 dimensions
  Speaker subspace bases: 10-30 dimensions
[Figure: Mean Voice and Eigenvoice dimensions 1-3, Frequency (kHz) vs. phone (b d g p t k jh ch s z f th v dh m n l h r w y iy ih eh ey ae aa aw ay ah ao ow uw ax)]

Speaker-Adapted Separation
• Eigenvoices for Speech Separation task
  speaker adapted (SA) performs midway between speaker-dependent (SD) & speaker-indep (SI)
[Figure: separation results for SI and SA models]

3. Soundtrack Classification
• Short video clips as the evolution of snapshots
  10-100 sec, one location, no editing
  browsing?
• Need information for indexing...
  video + audio
  foreground + background

MFCC Covariance Representation
• Each clip/segment → fixed-size statistics
  similar to speaker ID and music genre classification
• Full Covariance matrix of MFCCs
  maps the kinds of spectral shapes present
• Clip-to-clip distances for SVM classifier
  by KL or 2nd Gaussian model
[Figure: VTS_04_0001: Video Soundtrack spectrogram (freq / kHz vs. time / sec, level / dB); MFCC features (MFCC bin vs. time / sec); MFCC Covariance Matrix (MFCC dimension vs. MFCC dimension)]

AP Classification Results
• Wide range:
Chang, Ellis et al.
’07; Lee & Ellis ’10
[Figure: average precision (AP, 0 to 0.8) per concept for 1G + KL, 1G + Mah, GMM Hist. + pLSA, and Guess. Concepts: Dancing, Ski, Boat, Baby, Beach, Wedding, Museum, Parade, Playground, Sunset, Picnic, Birthday, One person, Group of 3+, Music, Cheer, Group of 2, Sports, Park, Crowd, Night, Animal, Singing, Show]
  audio (music, ski) vs. non-audio (group, night)
  large AP uncertainty on infrequent classes

Matching Videos via Fingerprints
Cotton & Ellis ’10
• Landmark pairs are a noise-robust fingerprint
• Use to match distinct videos with same sound ambience
[Figure: spectrograms (freq / kHz vs. time / sec) of Video IMpLQaiHWbE at 195 s and Video Yi1hkNkqHBc at 218 s]

4. Music Audio Analysis
• ... at all levels from notes to genres
[Figure: Let it Be (final verse): Signal spectrogram (freq / kHz vs. time / s, level / dB); Melody and Piano transcriptions (C2 to C5); Onsets & Beats; Per-frame chroma; Per-beat normalized chroma (intensity 0 to 1, time / beats)]

High-Resolution Transcription
Smit & Ellis ’09
• Extract notes from audio
  despite overlapping voices
  with precise tuning, timing
• Use musical score
  time alignment & approximate match
• Find most likely parameters
  fundamental, amplitudes of all harmonics
[Figure: Spectrogram of triplum track: YIN (red); Spectrogram of combined tracks: YIN (red), Prob (blue); Frequency (Hz) vs. Time (s), level in dB]

Polyphonic Transcription
• Apply the Eigenvoice idea to music
  eigeninstruments?
Grindlay & Ellis ’09
• Subspace NMF

Chord Recognition
Weller, Ellis, Jebara ’09
• StructSVM vs. HMM: better results

Melodic-Harmonic Mining
Bertin-Mahieux et al. ’10
• 100,000 tracks from morecowbell.dj
  as Echo Nest Analyze
• Frequent clusters of 12 x 8 binarized event chroma
[Figure: processing blocks: Music audio, Beat tracking, Chroma features, Key normalization, Landmark identification, Locality Sensitive Hash Table]
[Figure: the 80 most frequent clusters, each labeled with rank and count, from #1 (3491) to #80 (440)]

Results - Beatles
• Over 86 Beatles tracks
• All beat offsets = 41,705 patches
  LSH takes 300 sec - approx N log N in patches?
• High-pass along time
  to avoid sustained notes
• Song filter
  remove hits in same track
[Figure: matching patches (chroma bin vs. beat): 02-I Should Have Known Better 92.4-97.7s; 05-Here There And Everywhere 12.1-20.5s; 09-Martha My Dear 90.9-98.6s; 12-Piggies 22.0-29.6s]

Summary
• LabROSA: getting information from sound
• Speech
  monaural separation using eigenvoices
  binaural + reverb using MESSL
• Environmental
  classification of consumer video
  landmark-based events and matching
• Music
  transcription of notes, chords, ...
  large corpus mining
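
Sketch: per-frame state inference (Speech Separation slide)
The model-based inference on the Speech Separation slide picks, for each frame x(t), the pair of source states (i1, i2) that maximizes N(x; c_i1 + c_i2, Σ). The following is a minimal sketch of that exhaustive search, not the authors' implementation. It assumes a diagonal Σ, treats each frame independently (the sequential constraints mentioned on the slide are left out), and uses made-up 3-bin codebooks; the names `log_gaussian` and `infer_states` are hypothetical.

```python
import math
from itertools import product


def log_gaussian(x, mean, var):
    """Log density of N(x; mean, diag(var)); diagonal covariance is an assumption."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v)
        for xi, mi, v in zip(x, mean, var)
    )


def infer_states(x, codebook1, codebook2, var):
    """Return the (i1, i2) pair maximizing p(x | i1, i2) = N(x; c_i1 + c_i2, diag(var))."""
    best, best_ll = None, -math.inf
    # Exhaustive search over every pair of codewords, one from each source model.
    for (i1, c1), (i2, c2) in product(enumerate(codebook1), enumerate(codebook2)):
        combined = [a + b for a, b in zip(c1, c2)]  # combination model: c_i1 + c_i2
        ll = log_gaussian(x, combined, var)
        if ll > best_ll:
            best, best_ll = (i1, i2), ll
    return best


# Toy data (hypothetical values): two codewords per source, three frequency bins.
codebook1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
codebook2 = [[0.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
var = [0.1, 0.1, 0.1]

frame = [0.1, 0.9, 2.1]  # close to codebook1[1] + codebook2[0]
states = infer_states(frame, codebook1, codebook2, var)
print(states)
```

The search costs one likelihood evaluation per pair of states, so it grows with the product of the two codebook sizes; adding sequential constraints would replace the per-frame argmax with a search over state sequences.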