Multimedia Applications of Audio Recognition 1. 2.

Multimedia Applications of Audio Recognition Dan Ellis LabROSA Columbia EE dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/ 1. Finding speaker turns in meetings 2. Segmenting ‘personal audio’ recordings 3. Eigenrhythms: representing drum tracks MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 1. Finding Speaker Turns in Meetings mic recordings carry information • Multiple on speaker turns every voice reaches every mic... (?) ... but with differing coupling filters (delays, gains) • Find turns with minimal assumptions e.g. ad-hoc sensor setups (multiple PDAs) differences to remove effect of source signal - no spectral models, < 1xRT MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 Between-Channel Cues: Timing (ITD) & Level Speaker activity Speaker ground-truth xocrr peak lags (5pt med filt) skew/samp 50 Timing diffs (ITD) (2 mic pairs, 250ms win) 0 -50 125 130 135 140 145 150 norm xcorr pk val 155 160 165 170 175 1 Peak correlation coefficient r 0.5 0 125 130 135 140 145 150 per-chan E 155 160 165 170 175 -40 Per-channel energy dB -50 -60 -70 -80 125 130 135 140 145 150 chan E diffs 155 160 165 170 175 10 Between-channel energy differences dB 5 0 -5 -10 125 130 135 140 145 150 time/s 155 MM workshop ’04 - Dan Ellis Columbia University 160 165 170 175 2004-06-18 Choosing “Good” Frames coef. r • Correlation ~ channel similarity: !n mi[n] · m j [n + !] ri j [!] = ! ! m2i ! m2j • Select frames with r in top 50% in both pairs • ITD - high-correlation points (435/1201) 0 Skew34 / samples Skew34 / samples ITD - all points -50 -100 0 -50 -100 • • Cleaner basis for models -150 -100 -50 Skew12 / samples 0 MM workshop ’04 - Dan Ellis Columbia University about 35% of points -150 -100 -50 Skew12 / samples 0 2004-06-18 Spectral Clustering of “affinity matrix” A • Eigenvectors to pick out similar points: Affinity matrix A first 12 eigenvectors (normalized) 2 400 0 350 300 -2 250 -4 200 -6 150 -8 100 -10 50 100 200 300 point index 400 -12 0 100 • • Ad-hoc mapping to clusters 200 300 point index 400 amn = exp{−"x[m] − x[n]"2/2!2} Number of clusters K from eigenvalues ≈ points MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 Speaker Models & Classification • Actual clusters depend on ! and K heuristic Gaussians to each cluster, • Fit assign that class to all frames within radius or: consider dimension independently, choose best ICSI0: good points All pts: nearest class All pts: closest dimension 0 0 0 -20 -20 -20 -40 -40 -40 -60 -60 -60 -80 -80 -80 -100 -100 -50 0 -100 -100 MM workshop ’04 - Dan Ellis Columbia University -50 0 -100 -100 -50 0 2004-06-18 Performance Analysis • Compare reference & system activity maps: system misses quiet speakers 2,3,4 (deletions) system splits speaker 6 (deletions+insertions) many short gaps (deletions) • NIST RT04s dev set (80 min): correct: 34.4% ins: 16.5% del: 63.7% MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 2. Segmenting Personal Audio • Easy to record everything you hear ~100GB / year @ 64 kbps • Very hard to find anything how to scan? how to visualize? how to index? • Starting point: Collect data ~ 60 hours (8 days, ~7.5 hr/day) hand-mark 139 segments (26 min/seg avg.) assign to 41 classes (8 have multiple instances) MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 Features for Long Recordings • Feature frames = 1 min (not 25 ms!) • Characterize variation within each frame... Normalized Energy Deviation Average Linear Energy 120 15 100 10 80 15 40 10 20 5 5 dB Average Log Energy 60 dB Log Energy Deviation 120 15 100 10 80 20 freq / bark 20 freq / bark 60 20 freq / bark freq / bark 20 5 15 15 10 10 5 5 60 dB dB Spectral Entropy Deviation Average Spectral Entropy 0.9 0.8 15 0.7 10 5 • 0.6 0.5 bits 0.5 15 0.4 10 0.3 0.2 5 0.1 50 100 150 200 250 300 350 400 450 time / min and structure within coarse auditory bands MM workshop ’04 - Dan Ellis Columbia University 20 freq / bark freq / bark 20 2004-06-18 bits BIC Segmentation • Untrained segmentation technique statistical test indicates good change points: log L(X1 ;M1 )L(X2 ;M2 ) L(X;M0 ) ≷ λ 2 log(N )∆#(M ) • Evaluate against hand-marked boundaries different features & combinations Correct Accept % @ False Accept = 2%: 80.8% 81.1% 81.6% 84.0% 83.6% 0.8 0.7 Sensitivity µdB µH σH/µH µdB + σH/µH µdB + σH/µH + µH 0.6 0.5 0.3 0.2 0 MM workshop ’04 - Dan Ellis Columbia University µdB µH !H/µH µdB + !H/µH µdB + µH + !H/µH 0.4 0.005 0.01 0.015 0.02 0.025 1 - Specificity 0.03 0.035 2004-06-18 0.04 Future Work • Clustering of segments .. using spectral clustering compare to ground-truth, confusions • Visualization / browsing: freq / Bark Hand-marked boundaries 20 15 10 5 21:00 22:00 23:00 • Privacy protection 0:00 1:00 2:00 3:00 4:00 Clock time speaker/speech “search and destroy” MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 3. Eigenrhythms: Representing Drum Tracks • Pop songs built on repeating “drum loop” bass drum, snare, hi-hat small variations on a few basic patterns • Eigen-analysis (PCA) to capture variations? by analyzing lots of (MIDI) data • Applications music categorization “beat box” synthesis MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 Aligning the Data • Need to align patterns prior to PCA... tempo (stretch): by inferring BPM & normalizing downbeat (shift): correlate against ‘mean’ template MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 Eigenrhythms 20+ Eigenvectors for good coverage • Need of 100 training patterns (1200 dims) • Top patterns: MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 Eigenrhythms for Classification 6 in • Clusters Eigenspace: All tracks projected onto 1st two eigenrhythms ho:inside hh:gThang rb:honey 4 rc:whteroom pp:dlla l hh:rufryder bl:hideaway rb:heyLover Eigenrhythm 2 2 0 rc:californ nw:evcount s nw:psboysi n ho:pvandyk di:danqueen pp:distance di:boot yrc:zztop nw:dontyou rb:mgirlsat hh:1mChance rb:downlow nw:pur e di:satnight nw:amadeus pu:blitzkr gpu:bSedated rc:jump di:funkytwn hh:nEpisode co:alabama hh:bigpimpn pu:rubysoho rc:money hh:stan hh:jackson bl:crosfire rc:tuesdays bl:thrill pp:lkvirgin pp:fly hh:slmshady pu:beatbrat rc:hardday rc:blackdog nw:deserve pu:waitinRm pp:lvprayer co:SArose hh:superst rdi:lafreak pp:mjBeatit di:dontstop co:walkline nw:bmonday nw:whipi trb:chgWorld pp:loveshck rc:rolstone di:carwash bl:meanwoma nw:dbdance bl:blues2gm co:aftermid co:walkmi d ho:modjo pu:happyguy pu:bombshel co:goodlook bl:onebeer hh:bigPoppa bl:dimples co:byYrMan bl:chicken co:texas rc:layl a co:tennesse rb:volove di:boogient ho:bemylove pu:aWal k ho:dpworld rb:lsaround -2 di:discoinf di:boogiewl pp:bholly ho:onemore bl:boomboom -4 co:ringfire rb:bismine nw:banvenus ho:badtouch -6 -6 -4 -2 0 Eigenrhythm 1 pp:onemore pu:anarchy pp:downundr 2 4 • Genre classification? (10 way) nearest neighbor in 4D eigenspace: 21% correct MM workshop ’04 - Dan Ellis Columbia University 2004-06-18 6 Summary: Audio Analysis for Multimedia • Spatial cues for meeting recordings recover meeting ‘structure’ from ad-hoc recordings • Segmenting ‘personal audio’ recordings what are good features for 1 minute frames? • Eigenrhythm analysis of drum patterns define the subspace of ‘musical’ rhythms MM workshop ’04 - Dan Ellis Columbia University 2004-06-18

Multimedia Applications of Audio Recognition 1. 2.

Products

Support

Multimedia Applications of Audio Recognition 1. 2.

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib