LabROSA Research Overview Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA ! dpwe@ee.columbia.edu http://labrosa.ee.columbia.edu/ 1. Music 2. Environmental sound 3. Speech Enhancement Laboratory for the Recognition and Organization of Speech and Audio LabROSA Research Overview - Dan Ellis COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK 2014-06-12 - 1 /20 LabROSA • Getting information from sound Information Extraction Music Recognition Environment Separation Machine Learning Retrieval Signal Processing Speech LabROSA Research Overview - Dan Ellis 2014-06-12 - 2 /20 1. Music Audio Analysis Poliner & Ellis ’06 • Trained classifiers for low-level information • notes, chords, beats, section boundaries • E.g. Polyphonic transcription ! ! ! ! ! ! • feature agnostic • needs training data LabROSA Research Overview - Dan Ellis 2014-06-12 - 3 /20 Million Song Dataset • Industrial-scale database for music information research • Many facets: Bertin-Mahieux McFee • Echo Nest audio features + metadata • Echo Nest “taste profile” user-song-listen count • Second Hand Song covers • musiXmatch lyric BoW • last.fm tags • Now with audio? • resolving artist / album / track / duration against what.cd LabROSA Research Overview - Dan Ellis 2014-06-12 - 4 /20 MIDI-to-MSD Raffel • Aligned MIDI to Audio is a nice transcription ! ! ! ! ! ! ! ! ! • Can we find matches in large databases? LabROSA Research Overview - Dan Ellis 2014-06-12 - 5 /20 Singing ASR McVicar • Speech recognition adapted to singing • needs aligned data • Extensive work to line up scraped “acapellas” and full mix • including jumps! LabROSA Research Overview - Dan Ellis 2014-06-12 - 6 /20 4, one can hear some high frequency isolated coefficients superimposed to the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as Papadopoulos proposed in [20]. • Ground versus estimated voice activity location. ImRPCAtruth separates vocals and background perfect voice location still allows an improvebasedactivity on low rankinformation optimization ment, although to a lesser extent than with ground-truth voice acparameter • single trade-off tivity information. The decrease in the results mainly comes from basedclassified on higher-level features? • adjust background segments as vocalmusical segments. Block Structure RPCA • e. Fig. 4. Separated for various LabROSA Research Overview voice - Dan Ellis values of λ for the Pink Noise Party song - 7 /20 2014-06-12 Ordinal LDA Segmentation McFee • Low-rank decomposition of skewed self-similarity to identify repeats • Learned weighting of multiple factors to segment ! • Linear Discriminant Analysis between adjacent segments LabROSA Research Overview - Dan Ellis 2014-06-12 - 8 /20 2. Environmental Sound • Extracting useful information from soundtracks • e.g. TRECVID Multimedia Event Detection (MED) • “Making a Sandwich”, “Getting a Vehicle Unstuck” • 100 examples, find matches in 100k videos • manual annotations for ~10 h E009 Getting a Vehicle Unstuck LabROSA Research Overview - Dan Ellis 2014-06-12 - 9 /20 Foreground Event Recognition Cotton, Ellis, Loui ’11 • Transients = foreground events? • Onset detector finds energy bursts • best SNR • PCA basis to represent each • 300 ms x auditory freq • “bag of transients” LabROSA Research Overview - Dan Ellis 2014-06-12 - 10/20 NMF Transient Features Smaragdis & Brown ’03 Abdallah & Plumbley ’04 Virtanen ’07 Cotton & Ellis’ 11 • Decompose spectrograms into ! ! ! !X ! freq / Hz templates + activation Original mixture 5478 3474 2203 1398 883 =W•H • well-behaved gradient descent • 2D patches • sparsity control • computation time… LabROSA Research Overview - Dan Ellis 442 Basis 1 (L2) Basis 2 (L1) Basis 3 (L1) 0 1 2 3 4 5 6 7 8 9 10 time / s 2014-06-12 - 11/20 Background Retrieval McDermott et al. ’09 Ellis, Zheng, McDermott ’11 • Classify soundtracks by statistics of ambience • E.g. Texture features ! ! Sound \x\ Automatic gain control \x\ mel filterbank (18 chans) \x\ \x\ \x\ mean, var, skew, kurt (18 x 4) 1159_10 urban cheer clap 1062_60 quiet dubbed speech music 2404 1273 1 617 0 2 4 6 8 10 0 2 4 6 8 10 time / s Texture features mel band freq / Hz Modulation energy (18 x 6) Cross-band correlations (318 samples) Envelope correlation ! • Subband distributions • Envelope cross-corrs Histogram \x\ ! ! Octave bins 0.5,1,2,4,8,16 Hz FFT 0 level 15 10 5 M V S K 0.5 2 8 32 moments LabROSA Research Overview - Dan Ellis mod frq / Hz 5 10 15 mel band M V S K 0.5 2 8 32 moments mod frq / Hz 5 10 15 mel band 2014-06-12 - 12/20 Auditory Model Features • Subband Autocorrelation PCA Lyon et al. 2010 Lee & Ellis 2012 Cotton & Ellis 2013 • Simplified version of autocorrelogram • 10x faster than Lyon original • Capture fine time structure in multiple bands • information lost in MFCCs de lay short-time autocorrelation e lin frequency channels Sound Cochlea filterbank Subband VQ Subband VQ Histogram Feature Vector Subband VQ Subband VQ freq lag Correlogram slice time LabROSA Research Overview - Dan Ellis 2014-06-12 - 13/20 Subband Autocorrelation lin short-time autocorrelation e Subband VQ Sound Cochlea filterbank Subband VQ Subband VQ Histogram Feature Vector Autocorrelation stabilizes fine time structure lay frequency channels • de Subband VQ freq lag Correlogram slice time • 25 ms window, lags up to 25 ms • calculated every 10 ms • normalized to max (zero lag) LabROSA Research Overview - Dan Ellis 2014-06-12 - 14/20 Retrieval Examples • High precision for in-domain top hits LabROSA Research Overview - Dan Ellis 2014-06-12 - 15/20 3. Speech Enhancement • Noisy speech scenarios • Ambient recording (background noise) • Communication channel (processing distortion) dB CAR KIT - BP 101 43086 20111025 160708 in 100 50 0 0 1000 2000 3000 freq / Hz 0 4000 1000 2000 3000 freq / Hz 100 2000 level / dB freq / Hz HOME LAND - BP 101 84943 20111020 144955 in 50 100 0 level / dB 1500 Hz chan 50 0 87 90 93 96 LabROSA Research Overview - Dan Ellis 99 time / s 67 70 73 76 79 time / s 2014-06-12 - 16/20 RPCA Enhancement • Decompose spectrogram into sparse + low-rank • Sparse activation H of Chen, McFee & Ellis ’14 dictionary W min ! H,L,S H kHk1 L kLk⇤ + + ! + I+ (H) ! s.t. Y != W H + L + S S kSk1 • ASR benefits: C S D I Orig 6.8 10.6 82.6 0.7 RPCA 10.8 36.5 52.7 0.5 wie+RPCA 10.4 40.1 49.5 2.1 LabROSA Research Overview - Dan Ellis 2014-06-12 - 17/20 Classification Pitch Tracker Lee & Ellis ’12 • SAcC: MLP trained on noisy speech with ground-truth pitch track targets ! ! ! ! ! • Large benefits for in-domain noisy speech YIN Wu get_f0 10 SAcC 10 −10 LabROSA Research Overview - Dan Ellis FDA − RBF and pink noise PTE (%) ! 100 0 10 20 SNR (dB) 30 2014-06-12 - 18/20 Pitch-Normalized Enhancement • Use noise-robust pitch tracker for enhancement? Clean signal 1000 500 • Normalize voice pitch 0 ! • Reimpose pitch 0.5 1 1.5 2 Noisy signal 1000 2.5 Frequency 0 3 pitch smoothed pvx 500 ! • Fixed-pitch enhancement 0 0 0.5 1 1.5 2 resampled to pitch = 100 Hz 1000 2.5 3 500 0 0 0.5 1 1.5 2 2.5 Filtered − pvsmooth 0 0.5 1 1.5 2 2.5 3 Resampled back to original pitch 1000 3 3.5 4 4.5 3.5 4 4.5 500 0 1000 500 0 LabROSA Research Overview - Dan Ellis 0 0.5 1 1.5 2 2.5 3 Time 2014-06-12 - 19/20 Summary • Music • transcription, segmentation, … • alignment for ground truth ! • Soundtracks • foreground events, background ambience ! • Noisy Speech • classification pitch tracking • spectrogram enhancement LabROSA Research Overview - Dan Ellis 2014-06-12 - 20/20