Multimedia Applications of Audio Recognition

Dan Ellis
LabROSA, Columbia EE
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Finding speaker turns in meetings
2. Segmenting ‘personal audio’ recordings
3. Eigenrhythms: representing drum tracks

MM workshop ’04 - Dan Ellis Columbia University 2004-06-18


1. Finding Speaker Turns in Meetings

• Multiple mic recordings carry information on speaker turns
    every voice reaches every mic... (?)
    ... but with differing coupling filters (delays, gains)
• Find turns with minimal assumptions
    e.g. ad-hoc sensor setups (multiple PDAs)
    use between-channel differences to remove the effect of the source signal
    - no spectral models, < 1xRT


Between-Channel Cues: Timing (ITD) & Level

[Figure: five stacked panels against time/s (125 to 175 s)
    Speaker activity: speaker ground-truth
    xcorr peak lags (5pt med filt), skew/samp: timing diffs (ITD) (2 mic pairs, 250ms win)
    norm xcorr pk val: peak correlation coefficient r
    per-chan E: per-channel energy, dB
    chan E diffs: between-channel energy differences, dB]


Choosing “Good” Frames

• Correlation coef. r ~ channel similarity:

    $r_{ij}[\ell] = \frac{\sum_n m_i[n] \cdot m_j[n+\ell]}{\sqrt{\sum m_i^2 \, \sum m_j^2}}$

• Select frames with r in top 50% in both pairs
    about 35% of points
• Cleaner basis for models

[Figure: two scatter plots, Skew12 / samples vs. Skew34 / samples
    ITD - all points
    ITD - high-correlation points (435/1201)]


Spectral Clustering

• Eigenvectors of “affinity matrix” A to pick out similar points:

    $a_{mn} = \exp\{-\|x[m] - x[n]\|^2 / 2\sigma^2\}$

• Ad-hoc mapping to clusters
    Number of clusters K from eigenvalues ≈ points

[Figure: two panels against point index
    Affinity matrix A
    first 12 eigenvectors (normalized)]


Speaker Models & Classification

• Actual clusters depend on σ and K heuristic
• Fit Gaussians to each cluster, assign that class to all frames within radius
    or: consider each dimension independently, choose best

[Figure: three scatter panels
    ICSI0: good points
    All pts: nearest class
    All pts: closest dimension]


Performance Analysis

• Compare reference & system activity maps:
    system misses quiet speakers 2,3,4 (deletions)
    system splits speaker 6 (deletions+insertions)
    many short gaps (deletions)
• NIST RT04s dev set (80 min):
    correct: 34.4%
    ins: 16.5%
    del: 63.7%


2. Segmenting Personal Audio

• Easy to record everything you hear
    ~100GB / year @ 64 kbps
• Very hard to find anything
    how to scan?
    how to visualize?
    how to index?
• Starting point: Collect data
    ~ 60 hours (8 days, ~7.5 hr/day)
    hand-mark 139 segments (26 min/seg avg.)
    assign to 41 classes (8 have multiple instances)


Features for Long Recordings

• Feature frames = 1 min (not 25 ms!)
• Characterize variation within each frame...
    ... and structure within coarse auditory bands

[Figure: six spectrogram-style panels, freq / bark vs. time / min (50 to 450 min)
    Average Linear Energy (dB)
    Normalized Energy Deviation (dB)
    Average Log Energy (dB)
    Log Energy Deviation (dB)
    Average Spectral Entropy (bits)
    Spectral Entropy Deviation (bits)]


BIC Segmentation

• Untrained segmentation technique
    statistical test indicates good change points:

    $\log \frac{L(X_1; M_1)\, L(X_2; M_2)}{L(X; M_0)} \gtrless \frac{\lambda}{2} \log(N)\, \Delta\#(M)$

• Evaluate against hand-marked boundaries
    different features & combinations

    Features                Correct Accept % @ False Accept = 2%
    µdB                     80.8%
    µH                      81.1%
    σH/µH                   81.6%
    µdB + σH/µH             84.0%
    µdB + σH/µH + µH        83.6%

[Figure: ROC curves, Sensitivity vs. 1 - Specificity (0 to 0.04), one curve per feature set listed above]


Future Work

• Clustering of segments
    .. using spectral clustering
    compare to ground-truth, confusions
• Visualization / browsing:

[Figure: spectrogram-style display, freq / Bark vs. Clock time (21:00 to 4:00), with hand-marked boundaries]

• Privacy protection
    speaker/speech “search and destroy”


3. Eigenrhythms: Representing Drum Tracks

• Pop songs built on repeating “drum loop”
    bass drum, snare, hi-hat
    small variations on a few basic patterns
• Eigen-analysis (PCA) to capture variations?
    by analyzing lots of (MIDI) data
• Applications
    music categorization
    “beat box” synthesis


Aligning the Data

• Need to align patterns prior to PCA...
    tempo (stretch): by inferring BPM & normalizing
    downbeat (shift): correlate against ‘mean’ template


Eigenrhythms

• Need 20+ Eigenvectors for good coverage of 100 training patterns (1200 dims)
• Top patterns:

[Figure: top patterns]


Eigenrhythms for Classification

• Clusters in Eigenspace:

[Figure: All tracks projected onto 1st two eigenrhythms; scatter plot, Eigenrhythm 1 vs. Eigenrhythm 2 (both -6 to 6), each point labeled genre:track with genre prefixes bl, co, di, hh, ho, nw, pp, pu, rb, rc]

• Genre classification? (10 way)
    nearest neighbor in 4D eigenspace: 21% correct


Summary: Audio Analysis for Multimedia

• Spatial cues for meeting recordings
    recover meeting ‘structure’ from ad-hoc recordings
• Segmenting ‘personal audio’ recordings
    what are good features for 1 minute frames?
• Eigenrhythm analysis of drum patterns
    define the subspace of ‘musical’ rhythms
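
Note on “Choosing ‘Good’ Frames”: the normalized cross-correlation and the top-50% frame selection can be sketched roughly as below. This is a minimal illustration, not the code behind the slides. The function names, the equal-length channel frames, the brute-force lag search, and the use of the median as the 50% cut are all assumptions made for the example.

```python
import numpy as np

def norm_xcorr_peak(mi, mj, max_lag):
    """Peak of r_ij[l] = sum_n mi[n]*mj[n+l] / sqrt(sum mi^2 * sum mj^2).

    mi, mj: equal-length frames from two microphones (assumed layout).
    Returns (lag, r) at the largest correlation within +/- max_lag samples.
    """
    mi = np.asarray(mi, dtype=float)
    mj = np.asarray(mj, dtype=float)
    denom = np.sqrt(np.sum(mi ** 2) * np.sum(mj ** 2))
    if denom == 0.0:
        return 0, 0.0  # silent frame: no usable timing cue
    n = len(mi)
    best_lag, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            num = np.dot(mi[:n - lag], mj[lag:])
        else:
            num = np.dot(mi[-lag:], mj[:n + lag])
        r = num / denom
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

def select_good_frames(r12, r34):
    """Keep frames whose peak r is in the top half for BOTH mic pairs."""
    r12 = np.asarray(r12, dtype=float)
    r34 = np.asarray(r34, dtype=float)
    return (r12 >= np.median(r12)) & (r34 >= np.median(r34))
```

Applied per 250 ms window to each mic pair, the returned lags would give the skew values plotted as ITD, and the boolean mask would pick out the high-correlation points.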
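
Note on “Spectral Clustering”: a small sketch of building the affinity matrix a_mn = exp{-||x[m] - x[n]||^2 / 2σ^2} and taking its leading eigenvectors. The function names and the toy data are hypothetical, and the ad-hoc mapping from eigenvectors to clusters mentioned on the slide is not reproduced here.

```python
import numpy as np

def affinity_matrix(x, sigma):
    """a_mn = exp(-||x[m] - x[n]||^2 / (2 sigma^2)) for points x of shape (N, d)."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def top_eigenvectors(a, k):
    """Return the k largest eigenvalues of symmetric a and their eigenvectors."""
    vals, vecs = np.linalg.eigh(a)          # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]      # indices of the k largest
    return vals[order], vecs[:, order]

# Toy data (assumption): two tight, well-separated groups of ITD-like points.
points = np.array([[0.0, 0.0]] * 5 + [[100.0, 100.0]] * 3)
A = affinity_matrix(points, sigma=1.0)
eigvals, eigvecs = top_eigenvectors(A, k=2)
```

In this idealized case A is block-diagonal with all-ones blocks, so the two leading eigenvalues equal the group sizes (5 and 3). That is one way to read the slide's remark that the eigenvalues indicate the number of clusters; real data would only approximate it.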
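
Note on “BIC Segmentation”: the change-point test compares the log-likelihood gain from splitting a window into two models against the penalty (λ/2) log(N) Δ#(M). The sketch below assumes single diagonal-covariance Gaussians for M0, M1, M2, so Δ#(M) = 2d. The slides do not state the model family, and the function names, the variance floor, and the `margin` parameter are illustrative choices.

```python
import numpy as np

def gauss_loglik(x):
    """Log-likelihood of frames x (n, d) under their own ML diagonal Gaussian."""
    n, d = x.shape
    var = x.var(axis=0) + 1e-9              # small floor avoids log(0)
    return -0.5 * n * (d * np.log(2.0 * np.pi) + np.sum(np.log(var)) + d)

def delta_bic(x, t, lam=1.0):
    """log[L(X1;M1) L(X2;M2) / L(X;M0)] - (lam/2) log(N) * extra params.

    Positive values favor a boundary at frame t.
    """
    n, d = x.shape
    llr = gauss_loglik(x[:t]) + gauss_loglik(x[t:]) - gauss_loglik(x)
    extra_params = 2 * d                    # one extra mean and variance vector
    return llr - 0.5 * lam * np.log(n) * extra_params

def best_change_point(x, margin=5, lam=1.0):
    """Scan candidate boundaries and return (t, score) with the largest score."""
    candidates = range(margin, len(x) - margin)
    scores = [delta_bic(x, t, lam) for t in candidates]
    i = int(np.argmax(scores))
    return margin + i, scores[i]
```

With 1-minute feature frames as rows of `x`, a boundary would be accepted when the returned score is positive; raising `lam` trades sensitivity for fewer false accepts.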
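
Note on “Eigenrhythms” and “Eigenrhythms for Classification”: a rough sketch of PCA over aligned drum patterns followed by nearest-neighbor lookup in the reduced eigenspace. It assumes each pattern has already been tempo-normalized and downbeat-aligned and flattened to one row (the slides mention 1200 dims). All names are hypothetical, and the SVD route to PCA is a choice made for this example.

```python
import numpy as np

def fit_eigenrhythms(patterns, k):
    """PCA via SVD: return the mean pattern and the top-k eigenvectors (k, dims)."""
    patterns = np.asarray(patterns, dtype=float)
    mean = patterns.mean(axis=0)
    _, _, vt = np.linalg.svd(patterns - mean, full_matrices=False)
    return mean, vt[:k]

def project(patterns, mean, basis):
    """Coordinates of each pattern in the k-dimensional eigenspace."""
    return (np.asarray(patterns, dtype=float) - mean) @ basis.T

def reconstruct(coords, mean, basis):
    """Map eigenspace coordinates back to full-length patterns."""
    return mean + coords @ basis

def nearest_neighbor_label(query, train_coords, train_labels):
    """Label of the training pattern closest to query in eigenspace."""
    dists = np.linalg.norm(train_coords - query, axis=1)
    return train_labels[int(np.argmin(dists))]
```

For the 10-way genre experiment on the slide, `k` would be 4 and the labels would be the genre prefixes; evaluating a track fairly would require leaving it out of the training set before the lookup.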