Audio & Music Research at LabROSA

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Eigenrhythms: Representing drum tracks
2. Frequency-Domain Linear Prediction
3. Segmenting meeting turns
4. Analyzing ‘personal audio’ recordings

Audio/Music @ LabROSA - Dan Ellis 2004-08-24

LabROSA Projects Overview

[Diagram labels: Information Extraction, Machine Learning, Signal Processing; Music, Speech, Environment; Eigenrhythms, Meeting turns, Personal audio, FDLP]

1. Eigenrhythms: Drum Pattern Space
with John Arroyo

• Pop songs built on repeating “drum loop”
    bass drum, snare, hi-hat
    small variations on a few basic patterns
• Eigen-analysis (PCA) to capture variations?
    by analyzing lots of (MIDI) data
• Applications
    music categorization
    “beat box” synthesis

Aligning the Data

• Need to align patterns prior to PCA...
    tempo (stretch): by inferring BPM & normalizing
    downbeat (shift): correlate against ‘mean’ template

Eigenrhythms

• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)
• Top patterns: [figure]

Eigenrhythms for Classification

• Clusters in eigenspace:
[Figure: "All tracks projected onto 1st two eigenrhythms". Axes: Eigenrhythm 1 vs. Eigenrhythm 2, each about -6 to 6. Points are labeled genre:track, with genre prefixes bl, co, di, hh, ho, nw, pp, pu, rb, rc.]
• Genre classification? (10 way)
    nearest neighbor in 4D eigenspace: 21% correct

Eigenrhythm BeatBox

• Resynthesize rhythms from eigen-space

2. Frequency-Domain Lin. Pred.
with Marios Athineos

• (Time-domain) Linear Prediction
    the well-known spectral estimator
    TDLP:
    \[ y[n] = \sum_{i=1}^{p} a_i \, y[n-i] + e[n] \]
• Apply to a ‘frequency domain’ signal
    dual: estimates temporal envelope (frequency is time and vice versa)
    DCT, then FDLP:
    \[ Y[k] = \sum_{i=1}^{p} b_i \, Y[k-i] + E[k] \]

Athineos & Ellis - Music processing with FDLP 2004-05-25

Aside: Spectrogram of the DCT

• DCT gives a pure-real signal: [figure]
• Looks like a mirror image over time = freq axis
    Can we treat it like a waveform?

FDLP and TDLP Duality

[Figure: labels lost in extraction]

Subband FDLP

• Temporal envelopes without 25 ms windows
• Time-frequency slicing:
    Auditory STFT (10-25 ms + Bark bin)
    TDLP (per time frame)
    Subband FDLP (per frequency subband)

Cascade Time-Frequency LP

• Analysis: DCT, 1 sec up to whole sample, overlap
• Filtering by inverse FDLP = flat temporal envelopes (temporal equalization)
• Residual
• Synthesis: OLA & iDCT

FDLP Applications

• Time-scale modification
• Modulation-domain filtering in frequency: “temporal equalization”
• Perceptual audio features...

PLP-squared
Marios Athineos, Hynek Hermansky

• FDLP fits temporal envelope with LP
    Perceptual Linear Prediction (PLP) smooths across frequency
    can we do both... iteratively?
• Speech features without ST windows
[Figure: Bark band (5 to 15) vs. t / sec (0.02 to 0.22)]

3. Meeting Turns
with Jerry Liu and ICSI

• Multi-mic recordings for speaker turns
    every voice reaches every mic... (?)
    ... but with differing coupling filters (delays, gains)
• Find turns with minimal assumptions
    e.g. ad-hoc sensor setups (multiple PDAs)
    differences to remove effect of source signal
    - no spectral models, < 1xRT

Between-channel cues: Timing (ITD) & Level

[Figure: five panels against time/s (125 to 175):
    Speaker activity: speaker ground-truth
    Timing diffs (ITD) (2 mic pairs, 250 ms win): xcorr peak lags (5pt med filt), skew/samp
    Peak correlation coefficient r: norm xcorr pk val
    Per-channel energy, dB
    Between-channel energy differences, dB]

Pre-whitening for ITD

• Inverse-filter by 12-pole LPC models (32 ms windows) to remove local resonances
• Filter out noise < 500 Hz, > 6 kHz
• Then cross-correlate...
[Figure: short-time xcorr (lag / samps vs. time / sec) for raw signals and for whitened+filtered signals, each above a speaker ground truth panel (spkr ID)]

Choosing “Good” Frames

• Correlation coef. r ~ channel similarity:
    \[ r_{ij}[\ell] = \frac{\sum_n m_i[n] \, m_j[n+\ell]}{\sqrt{\sum_n m_i^2[n] \, \sum_n m_j^2[n]}} \]
• Select frames with r in top 50% in both pairs
    about 35% of points
• Cleaner basis for models
[Figure: Skew34 / samples vs. Skew12 / samples; panels "ITD - all points" and "ITD - high-correlation points (435/1201)"]

Spectral clustering

• Eigenvectors of “affinity matrix” A to pick out similar points:
    \[ a_{mn} = \exp\{ -\|x[m] - x[n]\|^2 / 2\sigma^2 \} \]
[Figure: "Affinity matrix A" and "first 12 eigenvectors (normalized)", both against point index]
• Ad-hoc mapping to clusters
    Number of clusters K from eigenvalues ≈ points

Speaker Models & Classification

• Actual clusters depend on σ and K heuristic
• Fit Gaussians to each cluster, assign that class to all frames within radius
    or: consider dimensions independently, choose best
[Figure: three scatter panels: "ICSI0: good points", "All pts: nearest class", "All pts: closest dimension"]

Performance Analysis

• Compare reference & system activity maps:
    system misses quiet speakers 2,3,4 (deletions)
    system splits speaker 6 (deletions+insertions)
    many short gaps (deletions)
• ~52% avg. error on NIST 2004 dev set
    speaker-characteristic-based systems ~25%

4. Segmenting Personal Audio
with Keansub Lee

• Easy to record everything you hear
    ~100GB / year @ 64 kbps
• Very hard to find anything
    how to scan?
    how to visualize?
    how to index?
• Starting point: Collect data
    ~ 60 hours (8 days, ~7.5 hr/day)
    hand-mark 139 segments (26 min/seg avg.)
    assign to 16 classes (8 have multiple instances)

Features for Long Recordings

• Feature frames = 1 min (not 25 ms!)
• Characterize variation within each frame...
[Figure: six panels against time / min (50 to 450) and freq / bark: Average Linear Energy, Normalized Energy Deviation, Average Log Energy, Log Energy Deviation (dB scales); Average Spectral Entropy, Spectral Entropy Deviation (bits)]
• ... and structure within coarse auditory bands

BIC Segmentation

• Untrained segmentation technique
    statistical test indicates good change points:
    \[ \log \frac{L(X_1; M_1)\, L(X_2; M_2)}{L(X; M_0)} \gtrless \frac{\lambda}{2} \log(N)\, \Delta\#(M) \]
• Evaluate: 60hr hand-marked boundaries
    different features & combinations
    Correct Accept % @ False Accept = 2%:

    Features                 Correct Accept
    µdB                      80.8%
    µH                       81.1%
    σH/µH                    81.6%
    µdB + σH/µH              84.0%
    µdB + σH/µH + µH         83.6%

[Figure: Sensitivity (0 to 0.8) vs. 1 - Specificity (0 to 0.04) for the same five feature sets]

Segment clustering

• Daily activity has lots of repetition:
    Automatically cluster similar segments
[Figure: matrix over segment classes, scale 0 to 1: supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus]
• Spectral clustering achieves ~70% correct
    16-way ground truth labels
    KL distance, smoothed covariance estimates

Future Work

• Visualization / browsing / diary inference
    link to other information sources
• Privacy protection
    speaker/speech “search and destroy”

LabROSA Summary

• LabROSA
    signal processing + machine learning + information extraction
• Applications
    Eigenrhythms: drum pattern models
    FDLP temporal envelopes
    Meeting recordings
    Personal audio analysis
• Also...
    music similarity, signal separation, ...
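
Backup: FDLP envelope sketch

The FDLP slides apply linear prediction to the DCT of a signal, so that the all-pole model traces the temporal envelope instead of the spectrum. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. The helper names (dct2, levinson, fdlp_envelope), the model order, the small white-noise correction on r[0] and the synthetic test burst are all assumptions made for illustration. The predictor is fitted with the autocorrelation method, which is one standard choice; the slide's coefficients b_i correspond to -a[i] here.

```python
import math

def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_n x[n] cos(pi (2n+1) k / (2N))."""
    n_len = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_len))
                for n in range(n_len))
            for k in range(n_len)]

def levinson(r, order):
    """Levinson-Durbin recursion.

    Returns the prediction-error filter a (with a[0] = 1) and the residual
    energy, given autocorrelation values r[0..order].
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def fdlp_envelope(x, order, n_points):
    """All-pole estimate of the squared temporal envelope of x."""
    big_y = dct2(x)                         # treat the DCT as a 'waveform'
    r = [sum(big_y[k] * big_y[k + lag] for k in range(len(big_y) - lag))
         for lag in range(order + 1)]
    r[0] *= 1.0 + 1e-6                      # assumed white-noise correction
    a, err = levinson(r, order)
    env = []
    for t in range(n_points):
        # Model 'frequency' w in (0, pi) plays the role of time.
        w = math.pi * (t + 0.5) / n_points
        re = sum(a[j] * math.cos(w * j) for j in range(order + 1))
        im = sum(a[j] * math.sin(w * j) for j in range(order + 1))
        env.append(err / (re * re + im * im))
    return env

# Hypothetical test signal: a short tone burst centered on sample 40.
N = 128
x = [math.exp(-0.5 * ((n - 40) / 3.0) ** 2) * math.cos(1.0 * n) for n in range(N)]
env = fdlp_envelope(x, order=8, n_points=N)
peak = max(range(N), key=lambda t: env[t])
print("envelope peak at sample", peak)
```

The envelope peak should land close to the center of the burst, which illustrates the duality on the slides: a sharp event in time becomes a sinusoid in the DCT, and a sinusoid is exactly what a linear predictor models well.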
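
Backup: between-channel timing sketch

The meeting-turn slides score each frame by the peak of the normalized cross-correlation r_ij between two microphone channels, and read the lag of that peak as the timing difference (ITD). A minimal sketch of that computation follows, assuming hypothetical helper names (norm_xcorr, best_lag) and a synthetic delayed, attenuated copy in place of real microphone signals. The pre-whitening, band-limiting and median filtering from the slides are left out.

```python
import math
import random

def norm_xcorr(mi, mj, lag):
    """r_ij[lag] = sum_n mi[n] mj[n+lag] / sqrt(sum mi^2 * sum mj^2)."""
    n_lo = max(0, -lag)
    n_hi = min(len(mi), len(mj) - lag)
    num = sum(mi[n] * mj[n + lag] for n in range(n_lo, n_hi))
    den = math.sqrt(sum(v * v for v in mi) * sum(v * v for v in mj))
    return num / den if den > 0 else 0.0

def best_lag(mi, mj, max_lag):
    """Return (lag, r) at the peak of the normalized cross-correlation."""
    r = {lag: norm_xcorr(mi, mj, lag) for lag in range(-max_lag, max_lag + 1)}
    peak = max(r, key=r.get)
    return peak, r[peak]

# Synthetic frame: channel 2 hears the same source 7 samples later, at half gain.
rng = random.Random(0)
src = [rng.gauss(0.0, 1.0) for _ in range(400)]
delay = 7
mic1 = src
mic2 = [0.0] * delay + [0.5 * s for s in src[:-delay]]

lag, r_peak = best_lag(mic1, mic2, max_lag=20)
print("estimated lag:", lag, "peak r:", round(r_peak, 3))
```

Because r is normalized by both channel energies, the gain difference between channels does not change the peak value, which is why it can serve as a frame-quality score for the "top 50%" selection.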
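
Backup: BIC change-point sketch

The BIC segmentation slide compares the likelihood of modeling a window as two segments against modeling it as one, penalized by (λ/2) log(N) Δ#(M). A minimal sketch of that test is given here under simplifying assumptions: one scalar feature per frame, a single ML-fitted Gaussian per segment (so Δ#(M) = 2, one extra mean and one extra variance), λ = 1 and synthetic data. The function names are hypothetical, and the talk's multi-dimensional features are not reproduced.

```python
import math
import random

def gauss_loglik(x):
    """Maximized log-likelihood of 1-D samples under one Gaussian."""
    n = len(x)
    mean = sum(x) / n
    var = max(sum((v - mean) ** 2 for v in x) / n, 1e-12)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def bic_gain(x, t, lam=1.0):
    """Log-likelihood ratio for a boundary at t, minus the BIC penalty.

    A positive value favors splitting x into x[:t] and x[t:].
    """
    llr = gauss_loglik(x[:t]) + gauss_loglik(x[t:]) - gauss_loglik(x)
    extra_params = 2                      # extra mean + extra variance
    penalty = 0.5 * lam * math.log(len(x)) * extra_params
    return llr - penalty

def best_boundary(x, margin=5, lam=1.0):
    """Scan candidate boundaries and return (t, gain) for the best one."""
    candidates = range(margin, len(x) - margin)
    t = max(candidates, key=lambda c: bic_gain(x, c, lam))
    return t, bic_gain(x, t, lam)

# Synthetic 'recording': 60 frames of one environment, then 60 of another.
rng = random.Random(1)
frames = ([rng.gauss(0.0, 1.0) for _ in range(60)] +
          [rng.gauss(4.0, 1.0) for _ in range(60)])
t_hat, gain = best_boundary(frames)
print("boundary near frame", t_hat, "gain", round(gain, 1))
```

Raising λ makes the test more conservative, which is how an operating point such as the 2% false-accept rate on the slide can be selected.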