Searching and Describing Audio Databases
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Music browsing
2. Meeting recordings
3. 'Personal Audio'

Searching/Describing Audio - Dan Ellis - 2005-05-23

LabROSA Overview
[Diagram: LabROSA combines signal processing, machine learning, and information extraction, applied to Music (similarity, browsing), Speech (meeting recordings), and the Environment (audio lifelog).]

Sound Description
• Challenge: indexing continuous content
  - what are the equivalents of words and phrases?
• Perception: sounds as discrete events
• Goal: duplicate the perceptual representation

1. Music Signal Analysis
• A lot of music data available
  - e.g. 60 GB of MP3 ≈ 1000 hr of audio / 15k tracks
• What can we do with it?
  - identify implicit structure...
• Quality vs. quantity
  - speech recognition lesson: 10x the data with 1/10th the annotation is twice as useful
• Motivating applications
  - music search / browsing by similarity
  - insight into music

Musical Information Extraction
[Diagram: music audio feeds melody extraction, drums extraction, and event extraction, which in turn drive similarity/recommendation (anchor models, semantic bases), fragment clustering, eigenrhythms, and synthesis/generation.]
• Lots of data + noisy transcription + weak clustering → musical insights?

Transcription as Classification (with Graham Poliner)
• Signal models are typically used for transcription
  - harmonic spectrum, superposition
• But ... we can trade domain knowledge for data: transcription as a pure classification problem
  - audio → trained classifiers giving p("C0"|audio), p("C#0"|audio), p("D0"|audio), p("D#0"|audio), ...
  - a single N-way discrimination for "melody"
  - per-note classifiers for polyphonic transcription

Classifier Transcription Results
• Trained on MIDI syntheses (32 songs)
  - SMO SVM (Weka)
• Tested on the MIREX set: frame-level pitch concordance

  system                         "jazz3"   overall
  fg+bg                          71.5%     44.3%
  just fg (fg/bg separation)     56.1%     45.4%

Chord Transcription (with Alex Sheh)
• Basic problem: recover chord-sequence labels from audio
  - easier than note transcription?
  - more relevant to listener perception?

EM HMM Re-Estimation
• Estimate 'soft' labels using the current models
• Update the model parameters from the new labels
• Repeat until convergence (to a local maximum)
[Diagram: standard EM training loop over a model inventory and labelled training data — initialize parameters Θ_init (uniform alignments); E-step: compute probabilities p(q_i^n | X_1^N, Θ_old) of the unknowns; M-step: choose Θ to maximize E[log p(X, Q | Θ)]; repeat until convergence.]

Chord Sequence Data Sources
• All we need are the chord sequences for our training examples
  - Hal Leonard "Paperback Song Series", manually retyped for 20 songs: "Beatles for Sale", "Help!", "A Hard Day's Night"
  [Figure: spectrogram of The Beatles, "A Hard Day's Night", with its chord chart (G, Cadd9, F6, C, D, C9, Bm, Em, C7, Fadd9, ...)]
  - hand-align the chords for 2 test examples
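The chord recognizer described here uses PCP (pitch class profile) features, which fold spectral energy onto the 12 semitone classes. A minimal NumPy sketch, assuming a simple magnitude-weighted mapping of FFT bins to pitch classes (the function name and bin mapping are illustrative, not the talk's implementation):

```python
import numpy as np

def pcp(frame, sr, n_fft=4096, fmin=100.0, fmax=5000.0):
    """Map one audio frame to a 12-bin pitch class profile (chroma)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs, spec):
        if fmin <= f <= fmax:
            # semitones relative to A440, folded onto one octave (A = class 0)
            pc = int(round(12 * np.log2(f / 440.0))) % 12
            chroma[pc] += m
    return chroma / (chroma.sum() + 1e-12)

# A 440 Hz sine should put most of its energy in the 'A' bin (index 0 here)
sr = 8000
t = np.arange(4096) / sr
c = pcp(np.sin(2 * np.pi * 440 * t), sr)
print(c.argmax())  # -> 0
```

Chord models are then HMM states over these 12-dimensional vectors; because PCPs discard absolute pitch height, the same profile pattern recurs wherever the chord recurs, which is what makes the EM alignment above workable with only chord-sequence (not time-aligned) training data.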
Chord Results
• Recognition is weak, but forced alignment is OK
• Frame-level accuracy (random ≈ 3%):

  Feature    Recognition   Alignment
  MFCC       8.7%          22.0%
  PCP_ROT    21.7%         76.0%

• MFCCs are poor (can overtrain); PCPs are better (the ROT variant helps generalization)
[Figure: pitch class vs. time for The Beatles, "Eight Days a Week" (4096-pt frames), comparing true, aligned, and recognized chord labels (E, G, D, Bm, Am, Em7, ...)]

Eigenrhythms: Drum Pattern Space (with John Arroyo)
• Pop songs are built on a repeating "drum loop"
  - bass drum, snare, hi-hat
  - small variations on a few basic patterns
• Eigen-analysis (PCA) to capture the variations?
  - by analyzing lots of (MIDI) data
• Applications
  - music categorization
  - "beat box" synthesis

Eigenrhythms
• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dimensions)
• Top patterns: [figure]

Eigenrhythms for Classification
• Projections in eigenspace / LDA space
[Figure: PCA(1,2) projection (16% correct) vs. LDA(1,2) projection (33% correct), scattering ten genres: blues, country, disco, hiphop, house, newwave, pop, punk, rnb, rock]
• 10-way genre classification (nearest neighbor):
  - PCA3: 20% correct
  - LDA4: 36% correct

Music Similarity Browsing (with Adam Berenzweig)
• Musical information overload
  - record companies filter/categorize music
  - an automatic system would be less odious
• Connecting audio and preference
  - map to a 'semantic space'?
[Diagram: audio input → GMM models of anchor classes → an n-dimensional vector of posteriors p(a1|x) ... p(an|x) in "Anchor Space" → similarity computation (KL divergence, EMD, etc.)]
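The similarity computation compares the distributions of anchor-posterior vectors with measures like KL divergence. An illustrative simplification with a single diagonal Gaussian per artist (the talk uses GMMs; the data here is synthetic, and "KL2" denotes the symmetrized divergence):

```python
import numpy as np

def kl2_gaussian(mu1, var1, mu2, var2):
    """Symmetrized KL divergence (KL2) between two diagonal Gaussians."""
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu2 - mu1) ** 2) / var1 - 1)
    return kl12 + kl21

# Toy "anchor space" posterior frames for two artists (rows = frames)
rng = np.random.default_rng(0)
a = rng.dirichlet([5, 1, 1], size=200)   # frames mostly near anchor 0
b = rng.dirichlet([1, 5, 1], size=200)   # frames mostly near anchor 1
d_ab = kl2_gaussian(a.mean(0), a.var(0), b.mean(0), b.var(0))
d_aa = kl2_gaussian(a[:100].mean(0), a[:100].var(0), a[100:].mean(0), a[100:].var(0))
print(d_aa < d_ab)  # two halves of the same artist are closer than two artists
```

The design point is that the comparison happens in the low-dimensional, semantically meaningful anchor space rather than on raw cepstra, so distances line up better with listener notions of "similar artists".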
Anchor Space
• Frame-by-frame high-level categorizations
  - compare to raw features?
[Figure: Madonna vs. Bowie frames scattered in anchor space (Electronica vs. Country posteriors) and in cepstral space (third vs. fifth cepstral coefficients)]
• Properties in the distributions? dynamics?

'Playola' Similarity Browser
[Screenshot of the Playola interface]

2. Meeting Recordings (with Jerry Liu and ICSI)
• A novel application for speech processing
  - summarize & review the content of meetings
• What really happens in meetings?
  - analysis of decision making, participant roles
• Multi-mic recordings for speaker turns
  - e.g. ad-hoc sensor setups (multiple PDAs)

Speaker Turns from Timing Differences
• Find the best timing skew between mic pairs
• Find clusters among the high-confidence points
• Fit a Gaussian to each cluster; assign that class to all frames within its radius
[Figure: ICSI0 inter-mic skew scatter plots — good points only; all points labelled by nearest class; all points labelled by closest dimension]

Browsing Meeting Recordings
• Information in the patterns of speaker turns
[Figure: speaker-turn timeline display]

3. "Personal Audio" (with Keansub Lee)
• Easy to record everything you hear
  - ~100 GB / year @ 64 kbps
• Very hard to find anything
  - how to scan? how to visualize? how to index?
• Starting point: collect data
  - ~60 hours (8 days, ~7.5 hr/day)
  - hand-mark 139 segments (26 min/segment average)
  - assign to 16 classes (8 have multiple instances)

Applications
• Automatic appointment-book history
  - fills in the when & where of movements
• "Life statistics"
  - how long did I spend in meetings this week vs. last?
  - most frequent conversations? favorite phrases??
• Retrieving details
  - what exactly did I promise?
  - privacy issues...
• Nostalgia?

Features for Long Recordings
• Feature frames = 1 minute (not 25 ms!)
• Characterize the variation within each frame...
[Figure: 8-hour traces per bark band of average linear energy, normalized energy deviation, average log energy, log energy deviation, average spectral entropy, and spectral entropy deviation]
• ... and the structure within coarse auditory bands

Segment Clustering
• Daily activity has lots of repetition: automatically cluster similar segments
• 'Affinity' of segments as KL2 distances
[Figure: affinity matrix over hand-labelled segments — supermarket, meeting, karaoke, barber, lecture, billiards, break, car/taxi, home, bowling, street, restaurant, library, campus]

Spectral Clustering
• Eigen-analysis of the affinity matrix: A = U·S·V'
[Figure: the affinity matrix and its SVD components u_k·s_kk·v_k' for k = 1...4]
  - the eigenvectors v_k give the cluster memberships
• Number of clusters?

Clustering Results
• Clustering automatic segments gives 'anonymous classes'
  - BIC criterion to choose the number of clusters
  - make the best correspondence to the 16 ground-truth classes
• Frame-level scoring gives ~70% correct
  - errors when the same 'place' has multiple ambiences

Privacy
• Recordings affront expectations of privacy
  - a critical barrier to progress
• Technical solutions?
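The spectral-clustering step (eigen-analysis of the affinity matrix, with eigenvectors giving memberships) can be seen on a toy example: for two equally sized groups of segments, the second singular vector separates the groups by sign. A hypothetical 6-segment sketch, not the talk's data:

```python
import numpy as np

# Toy affinity matrix: segments 0-2 share one ambience, segments 3-5 another
n = 6
A = np.full((n, n), 0.05)   # weak background affinity
A[:3, :3] = 1.0
A[3:, 3:] = 1.0

U, S, Vt = np.linalg.svd(A)         # A = U * diag(S) * Vt
labels = (Vt[1] > 0).astype(int)    # sign of the 2nd singular vector
print(labels)  # splits {0,1,2} from {3,4,5}; which group gets '1' is arbitrary
```

The leading singular vector is roughly uniform (overall affinity level); it is the second one that encodes the two-way split, and further vectors would encode finer splits, which is where a criterion like BIC comes in to decide how many clusters are real.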
Speech Scrambling
• Scramble 200 ms segments of the speech (long-term properties are preserved)
• High-confidence speaker ID to bypass the scrambling
[Figure: spectrograms of the original signal and the scrambled version (200 ms windows shuffled within 1 s)]

Visualization and Browsing
• Visualization / browsing / diary inference
  - link in other information sources: diary, email
• NoteTaker interface: "what was I hearing?"

Personal Audio: Current Work
• Voice segment detection / identification
  - poor and variable SNR; channel variation
• Novelty detection
  - segments that don't cluster against the archive
• Spatial information
  - identifying relative motion of head/sources
  - .. for motion detection
  - .. for source separation

LabROSA Summary
• LabROSA: signal processing + machine learning + information extraction
• Applications
  - Music: transcription, recommendation
  - Speech: recognition, organization
  - Environment: detection, description
• Also... signal separation, compression, dolphins...
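The speech-scrambling idea above amounts to a windowed permutation: within each roughly 1 s window, 200 ms segments are shuffled, destroying the word sequence while keeping long-term level and spectral statistics intact. A minimal sketch under those assumptions (parameter names and the window/segment sizes are illustrative):

```python
import numpy as np

def scramble(x, sr, seg_dur=0.2, win_dur=1.0, seed=0):
    """Shuffle seg_dur-second pieces of x within win_dur-second windows,
    so long-term texture survives but the words become unintelligible."""
    rng = np.random.default_rng(seed)
    seg = int(seg_dur * sr)
    win = int(win_dur * sr) // seg * seg   # whole number of segments per window
    y = x.copy()
    for start in range(0, len(y) - win + 1, win):
        pieces = y[start:start + win].reshape(-1, seg)
        rng.shuffle(pieces)                # permute the segments
        y[start:start + win] = pieces.reshape(-1)
    return y

sr = 8000
x = np.arange(sr * 2, dtype=float)         # stand-in for 2 s of audio
y = scramble(x, sr)
print(np.array_equal(np.sort(y), np.sort(x)))  # True: same samples, new order
```

Any trailing samples shorter than one window are left untouched in this sketch; a real system would also need the speaker-ID bypass mentioned on the slide so that authorized listeners can recover the original.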