ELEN E4896 MUSIC SIGNAL PROCESSING
Lecture 15: Research at LabROSA

1. Sources, Mixtures, & Perception
2. Spatial Filtering
3. Time-Frequency Masking
4. Model-Based Separation

Dan Ellis
Dept. Electrical Engineering, Columbia University
dpwe@ee.columbia.edu
http://www.ee.columbia.edu/~dpwe/e4896/
E4896 Music Signal Processing (Dan Ellis), 2014-05-05

Sparse + Low-Rank + NMF (Zhuo Chen)
• Optimization to decompose spectrogram:
    minimize $|S|_1 + |L|_* + D_{KL}(Y - S - L \,\|\, H \cdot W)$
    s.t. $Y = S + L + H \cdot W$
[Figure: spectrogram panels Y, H·W, L, S]

Beta Process NMF (Liang, Hoffman)
• Automatically choose how many components to use
[Paper excerpt] … case bold letters (e.g. x, d, s, and z) denote vectors. $f \in \{1, 2, \cdots, F\}$ is used to index frequency. $t \in \{1, 2, \cdots, T\}$ is used to index time. $k \in \{1, 2, \cdots, K\}$ is used to index dictionary components. BP-NMF is formulated as:
    $X = D(S \odot Z) + E$    (1)
[Figure caption] (a) The selected components learned from single-track instrument. For each instrument, the components are sorted by approximated fundamental frequency. The dictionary is cut off above 5512.5 Hz for visualization purposes.

Music Complexity (Colin Raffel)
• How can we capture musical patterns in the Million Song Dataset?
• Network analysis of quantized simultaneities
    after Serrà et al. 2012
[Figure from Serrà, Corral, Boguña, Haro, & Arcos, 2012]

Large-Scale Cover Recognition 1 (Thierry Bertin-Mahieux)
• How can we find covers in 1M songs?
    @ 1 sec / comparison, one search = 11.5 CPU-days
    full N² mining = 16,000 CPU-years
• Need a hashing technique
    landmark-based description of chroma patches
    Euclidean space projection?

Large-Scale Cover Recognition 2 (Thierry Bertin-Mahieux)
• 2D Fourier Transform Magnitude (2DFTM)
    fixed-size feature to capture “essence” of chromagram
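A minimal sketch of a 2DFTM-style feature, to illustrate why it is fixed-size and insensitive to transposition. It assumes a 12-bin, beat-synchronous chroma matrix. The function name twod_ftm, the patch length of 75 columns, and the median pooling across patches are illustrative assumptions not stated on the slide, and the PCA step implied by “50 PC” in the results is omitted. The random matrix stands in for real chroma.

```python
import numpy as np

def twod_ftm(chroma, patch_len=75):
    """Fixed-size 2DFTM-style feature from a (n_bins, n_frames) chroma matrix.

    patch_len is a hypothetical choice; the slide does not state one.
    """
    chroma = np.asarray(chroma, dtype=float)
    n_bins, n_frames = chroma.shape
    if n_frames < patch_len:
        # Zero-pad short inputs so that one full patch exists.
        chroma = np.pad(chroma, ((0, 0), (0, patch_len - n_frames)))
        n_frames = patch_len
    patches = []
    for start in range(n_frames - patch_len + 1):
        patch = chroma[:, start:start + patch_len]
        # Magnitude of the 2D Fourier transform, flattened to a vector.
        patches.append(np.abs(np.fft.fft2(patch)).ravel())
    # Median over patches gives one vector whatever the song length.
    return np.median(np.array(patches), axis=0)

rng = np.random.default_rng(0)
chroma = rng.random((12, 200))           # stand-in for real chroma
feat = twod_ftm(chroma)

# Transposing by 3 semitones is a circular shift of the chroma bins.
transposed = np.roll(chroma, 3, axis=0)
feat_t = twod_ftm(transposed)
```

The magnitude of a 2D Fourier transform ignores circular shifts, so rotating the chroma bins leaves each patch's magnitude unchanged and np.allclose(feat, feat_t) holds. The output length depends only on the patch size (12 × 75 values here), not on the song length.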
• First results on finding covers in 1M songs

    Method            Average rank    meanAP
    random                 500,000     0.000
    jumpcodes 2            308,369     0.002
    2DFTM (50 PC)          137,117     0.020

Jazz Discography Project
• How can MIR help organize jazz collections?
    our tools are quite genre-specific
    e.g. beat tracker is fine for pop, useless for jazz

Local Tagging
• MFCC-statistics classifiers on 5 sec windows
    trained from MajorMiner data
[Figure: “01 Soul Eyes”; axes freq / Hz and time / s, with one row per MajorMiner tag (e.g. jazz, piano, saxophone, drum)]

Onset Correlation (Brian McFee)
• “Ahead of” or “behind” the beat?
[Figure panels: Tony Williams, Elvin Jones]

Structural Similarity (Diego Silva, Helene Papadopoulos)
• Self-similarity shows repeating structure in music
• Can we find similar pieces by finding similar structures?
[Figure from Bello 2011] Fig. 5. Comparison of recurrence plots for two performances of W. A. Mozart’s Symphony # 40, movement 3.
The figures illustrate how beat-tracking …

Ordinal LDA Segmentation (McFee)
• Low-rank decomposition of skewed self-similarity to identify repeats
• Learned weighting of multiple factors to segment
    Linear Discriminant Analysis between adjacent segments
[Figure panels: Self-similarity, Skewed self-sim., Filtered self-sim., Latent repetition; axes: Beat, Lag, Factor]

Lyric Recognition (Matt McVicar)
• Speech Recognition for Songs
    lots of interference
    atypical speech
[Figure panels: Polyphonic Audio, Acapella Audio, Natural Speech, Synthesized Speech; axes: Frequency (kHz), Time (seconds)]
Figure 1: Comparison of vocal types used in this paper, example clip ‘This Love’, Levine-Carmichael. Top row: full polyphonic audio (including vocals, two electric guitars, bass guitar, piano and drums), Acapella audio (voice only). Bottom row: Natural speech performed by the authors, synthesized speech using the ‘say’ command in Mac OSX.

Singing ASR (McVicar)
• Speech recognition adapted to singing needs aligned data
• Align scraped “acapellas” and full mix
    including jumps!

“Remixavier” (Raffel)
• Optimal align-and-cancel of mix and acapella
    timing and channel may differ

Million Song Dataset (Bertin-Mahieux, McFee)
• Many Facets
    Echo Nest audio features + metadata
    Echo Nest “taste profile” user-song-listen count
    Second Hand Song covers
    musiXmatch lyric BoW
    last.fm tags
• Now with audio?
    resolving artist / album / track / duration against what.cd

MIDI-to-MSD (Raffel, Shi)
• Aligned MIDI to Audio is a nice transcription

De-DTMF
• Problem: Stationary tones confuse speech detector
    Adaptively filter sinusoids with steady amplitude
[Figure: processing block diagram with blocks labeled Input audio, Framing, LPC fit, Find roots, Transform radii, Map to zeros, Add poles, Filter audio frames, Overlap-add, Output audio; plot panels: Spectrum and LPC fit, LPC poles, LPC poles detail, Transformed filter, Transformed filter detail, Filter response & spectrum, Filtered signal; axes: Freq / Hz, Gain / dB, Time, Real Part, Imaginary Part, Mapped radius; example file: tcp_d1_02_counting_cia_irdial]

Pitch-based Filtering
• Resample to flatten pitch, then filter

Summary
• Signal Separation
    NMF, RPCA, cancellation, filtering
• Music Information
    Beat tracking, segmentation
    Large datasets
    Indexing & retrieval
• Speech
    Lyric recognition
    Speech detection & enhancement

References
[Bello 2011] J. P. Bello, “Measuring structural similarity in music”, IEEE Tr. Audio, Speech, & Lang., 19(7): 2013-2025, 2011.
[Serrà et al. 2012] J. Serrà, A. Corral, M. Boguña, M. Haro, & J. Arcos, “Measuring the evolution of contemporary western popular music”, Scientific Reports, 2:521, 2012.