Audio & Music Research at LabROSA

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Eigenrhythms: representing drum tracks
2. Frequency-Domain Linear Prediction
3. Anchor-Space Music Similarity Browsing
4. Transformation-based generative models
5. Analyzing ‘personal audio’ recordings

Audio/Music @ LabROSA - Dan Ellis 2004-07-29

LabROSA Projects Overview
[Diagram labels: Information Extraction, Machine Learning, Signal Processing; Music, Environment, Speech; projects: Eigenrhythms, Anchor space, Personal audio, Transform model, FDLP]

1. Eigenrhythms: Drum Pattern Space
with John Arroyo
• Pop songs built on repeating “drum loop”
  bass drum, snare, hi-hat
  small variations on a few basic patterns
• Eigen-analysis (PCA) to capture variations?
  by analyzing lots of (MIDI) data
• Applications
  music categorization
  “beat box” synthesis

Aligning the Data
• Need to align patterns prior to PCA...
  tempo (stretch): by inferring BPM & normalizing
  downbeat (shift): correlate against ‘mean’ template

Eigenrhythms
• Need 20+ Eigenvectors for good coverage of 100 training patterns (1200 dims)
• Top patterns:
[Figure: top eigenrhythm patterns]

Eigenrhythms for Classification
• Clusters in Eigenspace:
[Figure: scatter plot "All tracks projected onto 1st two eigenrhythms"; axes Eigenrhythm 1 and Eigenrhythm 2, ticks from -6 to 6; each point labeled genre:track, with genre prefixes bl, co, di, hh, ho, nw, pp, pu, rb, rc]
• Genre classification? (10 way)
  nearest neighbor in 4D eigenspace: 21% correct

Eigenrhythm BeatBox
• Resynthesize rhythms from eigen space

2. Frequency-Domain Lin. Pred.
with Marios Athineos
• (Time-domain) Linear Prediction
  the well-known spectral estimator
  TDLP: $y[n] = \sum_{i=1}^{p} a_i \, y[n-i] + e[n]$
• Apply to a ‘frequency domain’ signal (DCT)
  dual: frequency is time and vice versa
  estimates temporal envelope
  FDLP: $Y[k] = \sum_{i=1}^{p} b_i \, Y[k-i] + E[k]$

Athineos & Ellis - Music processing with FDLP 2004-05-25

Aside: Spectrogram of the DCT
• DCT gives a pure-real signal:
[Figure: DCT spectrogram]
• Looks like a mirror image over time = freq axis
  Can we treat it like a waveform?

FDLP and TDLP Duality
[Figure: FDLP and TDLP duality; labels not recoverable]

Subband FDLP
• Time-frequency: temporal envelopes without slicing into 25 ms windows
[Figure panels: Auditory STFT (10-25 ms + Bark bin); TDLP (per time frame); Subband FDLP (per frequency subband)]

FDLP Applications
• Time-scale modification
• Temporal equalization
• Perceptual audio features... “PLP-squared”

Cascade Time-Frequency LP
[Diagram; recoverable labels: Analysis, DCT, 1 sec up to whole sample, Overlap, Modulation-domain, Filtering in frequency (“temporal equalization”), Flat Temporal Envelopes, Residual in freq., Filtering by inverse FDLP (temporal equalization), Synthesis, OLA & iDCT]

3. Music Similarity Browsing
with Adam Berenzweig
• Musical information overload
  record companies filter/categorize music
  an automatic system would be less odious
• Connecting audio and preference
  map to a ‘semantic space’?
[Diagram: Audio Input (Class i) and Audio Input (Class j) each pass through anchor models giving p(a1|x), p(a2|x), ..., p(an|x), an n-dimensional vector in "Anchor Space"; stages: Conversion to Anchorspace, GMM Modeling, Similarity Computation (KL-d, EMD, etc.)]

Anchor Space
• Frame-by-frame high-level categorizations
  compare to raw features?
  properties in distributions? dynamics?
[Figure panels: Cepstral Features and Anchor Space Features; axis labels: third cepstral coef, fifth cepstral coef, Electronica, Country; legend: madonna, bowie]

‘Playola’ Similarity Browser

Ground-truth data
• Hard to evaluate Playola’s ‘accuracy’
  user tests...
  ground truth?
• “Musicseer” online survey:
  ran for 9 months in 2002
  > 1,000 users, > 20k judgments
  http://labrosa.ee.columbia.edu/projects/musicsim/

Evaluation
• Compare Anchor Space measures against Musicseer subjective results
  “triplet” agreement percentage
  Top-N ranking agreement score:
    $s_i = \sum_{r=1}^{N} \alpha_r^{\,r} \, \alpha_c^{\,k_r}$, with $\alpha_r = \left(\tfrac{1}{2}\right)^{1/3}$ and $\alpha_c = \alpha_r^{2}$
  First-place agreement percentage
  - simple significance test
[Figure: "Top rank agreement" bar chart; y-axis % (0 to 80); measures: cei, cmb, erd, e3d, opn, kn2, rnd, ANK; legend: SrvKnw 4789x3.58, SrvAll 6178x8.93, GamKnw 7410x3.96, GamAll 7421x8.92]

4. Transformation-based models
with Manuel Reyes and Nebojsa Jojic
• HMMs are poor generative models
  accurate modeling requires 1000s of states
• Observation: Speech spectra undergo minor deformations
  suggests a different generative model?
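The figure for this slide shows a 3x5 matrix with rows 00100, 00010, 00001 (NP=5 input bins, NC=3 output bins) mapping a patch of the previous frame onto a patch of the current frame. A minimal sketch of that single step follows, assuming the matrix is a pure 0/1 shift; the function names, the `offset` parameter, and the example frame values are hypothetical and not taken from the slides, and the full model (learned transformations, state inference) is not attempted here.

```python
def shift_matrix(n_out, n_in, offset):
    """Build an n_out x n_in 0/1 matrix whose row r copies input bin r + offset."""
    if offset < 0 or offset + n_out > n_in:
        raise ValueError("shifted window must fit inside the input patch")
    return [[1 if c == r + offset else 0 for c in range(n_in)]
            for r in range(n_out)]


def apply_transform(matrix, patch):
    """Plain matrix-vector product: one output bin per matrix row."""
    return [sum(m * x for m, x in zip(row, patch)) for row in matrix]


def predict_patch(prev_frame, start, n_in, n_out, offset):
    """Predict n_out bins of the current frame from n_in bins of the previous one."""
    patch = prev_frame[start:start + n_in]
    return apply_transform(shift_matrix(n_out, n_in, offset), patch)


# Matrix from the slide: NC=3 rows, NP=5 columns (offset=2 is an assumed encoding).
T = shift_matrix(3, 5, 2)

# Hypothetical 9-bin log-spectrum for frame t-1.
prev = [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.3, 0.1, 0.0]
predicted = predict_patch(prev, start=2, n_in=5, n_out=3, offset=2)
```

With `offset=1` the same helper gives the centered (identity) case, so one small integer per patch is enough to describe a frame as a deformation of its predecessor in this simplified setting.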
[Figure: frames X_{t-1} and X_t, each with frequency bins 1 to 9. A patch of NC=3 bins of X_t equals a transformation matrix T times a patch of NP=5 bins of X_{t-1}; the example T has rows (0 0 1 0 0), (0 0 0 1 0), (0 0 0 0 1)]

States+Transformation Model
• Time-frequency state grid
• State → explicit prototype
  or a transformation on prior frame
• Infer underlying states
[Figure: grid of states X over time and frequency, linked by transformations T; panels: a) Signal, b) Transformation Map. Green: identity transform; Yellow/Orange: upward motion (darker is steeper); Blue: downward motion (darker is steeper)]

Two-layer model
• Source-filter decomposition
  pitch and formants have different dynamics
• Apply transformation models for both
  log-spectra: sum of excitation & filter
  inference does separation
[Figure: Signal = Harmonics + Formants; panel labels: Selected Bin, Harmonic Tracking, Formant Tracking]

Transformation model applications
• Compact, accurate source descriptions
  only a few explicit states needed
[Figure panels: a) States; b) Reconstruction, Iter. 1; c) Reconstruction, Iter. 3; d) Reconstruction, Iter. 5; e) Reconstruction, Iter. 8]
• Belief propagation can infer missing values
  .. of state grid, hence magnitude spectrum
[Figure panels: a) Original; b) Missing Data; c) After iteration 10; d) After iteration 30]

5. Segmenting Personal Audio
with Keansub Lee
• Easy to record everything you hear
  ~100GB / year @ 64 kbps
• Very hard to find anything
  how to scan?
  how to visualize?
  how to index?
• Starting point: Collect data
  ~ 60 hours (8 days, ~7.5 hr/day)
  hand-mark 139 segments (26 min/seg avg.)
  assign to 41 classes (8 have multiple instances)

Features for Long Recordings
• Feature frames = 1 min (not 25 ms!)
• Characterize variation within each frame...
[Figure panels: Average Linear Energy (dB), Normalized Energy Deviation (dB), Average Log Energy (dB), Log Energy Deviation (dB), Average Spectral Entropy (bits), Spectral Entropy Deviation (bits); axes: freq / bark against time / min]
• ... and structure within coarse auditory bands

BIC Segmentation
• Untrained segmentation technique
  statistical test indicates good change points:
    $\log \dfrac{L(X_1; M_1)\, L(X_2; M_2)}{L(X; M_0)} \gtrless \dfrac{\lambda}{2} \log(N)\, \Delta\#(M)$
• Evaluate: 60hr hand-marked boundaries
  different features & combinations
  Correct Accept % @ False Accept = 2%:
    µdB                  80.8%
    µH                   81.1%
    σH/µH                81.6%
    µdB + σH/µH          84.0%
    µdB + σH/µH + µH     83.6%
[Figure: ROC curves for the five feature sets; axes: Sensitivity against 1 - Specificity (0 to 0.04)]

Segment clustering
• Daily activity has lots of repetition:
  Automatically cluster similar segments
[Figure: class labels on both axes (supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus); scale 0 to 1]
• Spectral clustering achieves ~60% correct
  16-way ground truth labels

Future Work
• Visualization / browsing / diary inference
  link to other information sources
• Privacy protection
  speaker/speech “search and destroy”

Summary
• Today’s topics:
[Diagram labels: Information Extraction, Machine Learning, Signal Processing; Music, Environment, Speech; projects: Eigenrhythms, Anchor space, Personal audio, Transform model, FDLP]
• + Speech recognition, Meeting recordings

LabROSA Summary
• LabROSA
  signal processing + machine learning + info extraction
• Applications
  Eigenrhythms: drum pattern models
  FDLP temporal envelope models
  Music Similarity Browsing
  Transformation-based generative models
  Personal audio analysis
• Also... speech recognition, meeting recordings, ...
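Worked illustration for the BIC Segmentation slide: the test there compares the log-likelihood ratio of a two-model fit (M1, M2) against a one-model fit (M0) with a penalty of (λ/2) log(N) Δ#(M). The sketch below applies it to a single 1-D feature track, assuming each model is a single Gaussian with ML mean and variance (so Δ#(M) = 2). The Gaussian choice, the function names, the `margin` parameter, and the toy data are assumptions for illustration, not the features or models used in the talk.

```python
import math


def gaussian_loglik(xs):
    """Maximized log-likelihood of xs under a 1-D Gaussian (ML mean and variance)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    var = max(var, 1e-12)  # guard against log(0) for constant segments
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)


def delta_bic(xs, split, lam=1.0):
    """Log-likelihood ratio minus the BIC penalty for a boundary before index split."""
    left, right = xs[:split], xs[split:]
    log_ratio = gaussian_loglik(left) + gaussian_loglik(right) - gaussian_loglik(xs)
    extra_params = 2  # two (mean, var) models instead of one
    penalty = 0.5 * lam * math.log(len(xs)) * extra_params
    return log_ratio - penalty


def best_change_point(xs, lam=1.0, margin=2):
    """Return (split, score); split is None when no candidate beats the penalty."""
    candidates = range(margin, len(xs) - margin + 1)
    best = max(candidates, key=lambda s: delta_bic(xs, s, lam))
    score = delta_bic(xs, best, lam)
    return (best, score) if score > 0 else (None, score)


# Toy per-frame feature values: a level change after the 8th frame.
track = [0.0, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.0,
         5.0, 5.1, 4.9, 5.05, 4.95, 5.1, 4.9, 5.0]
split, score = best_change_point(track)

# A track with no level change should yield no boundary.
flat = [0.1, -0.1] * 8
flat_split, flat_score = best_change_point(flat)
```

Raising `lam` makes the test more conservative, which is one way to trade correct accepts against false accepts when tuning against hand-marked boundaries.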