Searching and Describing Audio Databases
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Music browsing
2. Meeting recordings
3. 'Personal Audio'

Searching/Describing Audio - Dan Ellis - 2005-05-23

LabROSA Overview
[Diagram: LabROSA combines signal processing, machine learning, and information extraction, applied to Music (similarity, browsing), Speech (meeting recordings), and the Environment (audio lifelog).]

Sound Description
• Challenge: indexing continuous content
  - what are the equivalents of words and phrases?
• Perception: sounds as discrete events
• Goal: duplicate the perceptual representation

1. Music Signal Analysis
• A lot of music data available
  - e.g. 60 GB of MP3 ≈ 1000 hr of audio / 15k tracks
• What can we do with it?
  - identify implicit structure...
• Quality vs. quantity
  - speech recognition lesson: 10x the data with 1/10th the annotation is twice as useful
• Motivating applications
  - music search / browsing by similarity
  - insight into music

Musical Information Extraction
[Diagram: music audio feeds melody extraction, drums extraction, and event extraction, which in turn drive similarity/recommendation (anchor models, semantic bases), fragment clustering, eigenrhythms, and synthesis/generation.]
• Lots of data + noisy transcription + weak clustering → musical insights?

Transcription as Classification (with Graham Poliner)
• Signal models are typically used for transcription
  - harmonic spectrum, superposition
• But ... we can trade domain knowledge for data: transcription as a pure classification problem
  - audio → trained classifiers giving p("C0"|audio), p("C#0"|audio), p("D0"|audio), p("D#0"|audio), ...
  - a single N-way discrimination for "melody"
  - per-note classifiers for polyphonic transcription

Classifier Transcription Results
• Trained on MIDI syntheses (32 songs)
  - SMO SVM (Weka)
• Tested on the MIREX set: frame-level pitch concordance

  system                         "jazz3"   overall
  fg+bg                          71.5%     44.3%
  just fg (fg/bg separation)     56.1%     45.4%

Chord Transcription (with Alex Sheh)
• Basic problem: recover chord-sequence labels from audio
  - easier than note transcription?
  - more relevant to listener perception?

EM HMM Re-Estimation
• Estimate 'soft' labels using the current models
• Update the model parameters from the new labels
• Repeat until convergence (to a local maximum)
[Diagram: standard EM training loop over a model inventory and labelled training data — initialize parameters Θ_init (uniform alignments); E-step: compute probabilities p(q_i^n | X_1^N, Θ_old) of the unknowns; M-step: choose Θ to maximize E[log p(X, Q | Θ)]; repeat until convergence.]

Chord Sequence Data Sources
• All we need are the chord sequences for our training examples
  - Hal Leonard "Paperback Song Series", manually retyped for 20 songs: "Beatles for Sale", "Help!", "A Hard Day's Night"
  [Figure: spectrogram of The Beatles, "A Hard Day's Night", with its chord chart (G, Cadd9, F6, C, D, C9, Bm, Em, C7, Fadd9, ...)]
  - hand-align the chords for 2 test examples
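The chord recognizer described here uses PCP (pitch class profile) features, which fold spectral energy onto the 12 semitone classes. A minimal NumPy sketch, assuming a simple magnitude-weighted mapping of FFT bins to pitch classes (the function name and bin mapping are illustrative, not the talk's implementation):

```python
import numpy as np

def pcp(frame, sr, n_fft=4096, fmin=100.0, fmax=5000.0):
    """Map one audio frame to a 12-bin pitch class profile (chroma)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs, spec):
        if fmin <= f <= fmax:
            # semitones relative to A440, folded onto one octave (A = class 0)
            pc = int(round(12 * np.log2(f / 440.0))) % 12
            chroma[pc] += m
    return chroma / (chroma.sum() + 1e-12)

# A 440 Hz sine should put most of its energy in the 'A' bin (index 0 here)
sr = 8000
t = np.arange(4096) / sr
c = pcp(np.sin(2 * np.pi * 440 * t), sr)
print(c.argmax())  # -> 0
```

Chord models are then HMM states over these 12-dimensional vectors; because PCPs discard absolute pitch height, the same profile pattern recurs wherever the chord recurs, which is what makes the EM alignment above workable with only chord-sequence (not time-aligned) training data.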
Chord Results
• Recognition is weak, but forced alignment is OK
• Frame-level accuracy (random ≈ 3%):

  Feature    Recognition   Alignment
  MFCC       8.7%          22.0%
  PCP_ROT    21.7%         76.0%

• MFCCs are poor (can overtrain); PCPs are better (the ROT variant helps generalization)
[Figure: pitch class vs. time for The Beatles, "Eight Days a Week" (4096-pt frames), comparing true, aligned, and recognized chord labels (E, G, D, Bm, Am, Em7, ...)]

Eigenrhythms: Drum Pattern Space (with John Arroyo)
• Pop songs are built on a repeating "drum loop"
  - bass drum, snare, hi-hat
  - small variations on a few basic patterns
• Eigen-analysis (PCA) to capture the variations?
  - by analyzing lots of (MIDI) data
• Applications
  - music categorization
  - "beat box" synthesis

Eigenrhythms
• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dimensions)
• Top patterns: [figure]

Eigenrhythms for Classification
• Projections in eigenspace / LDA space
[Figure: PCA(1,2) projection (16% correct) vs. LDA(1,2) projection (33% correct), scattering ten genres: blues, country, disco, hiphop, house, newwave, pop, punk, rnb, rock]
• 10-way genre classification (nearest neighbor):
  - PCA3: 20% correct
  - LDA4: 36% correct

Music Similarity Browsing (with Adam Berenzweig)
• Musical information overload
  - record companies filter/categorize music
  - an automatic system would be less odious
• Connecting audio and preference
  - map to a 'semantic space'?
[Diagram: audio input → GMM models of anchor classes → an n-dimensional vector of posteriors p(a1|x) ... p(an|x) in "Anchor Space" → similarity computation (KL divergence, EMD, etc.)]
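The similarity computation compares the distributions of anchor-posterior vectors with measures like KL divergence. An illustrative simplification with a single diagonal Gaussian per artist (the talk uses GMMs; the data here is synthetic, and "KL2" denotes the symmetrized divergence):

```python
import numpy as np

def kl2_gaussian(mu1, var1, mu2, var2):
    """Symmetrized KL divergence (KL2) between two diagonal Gaussians."""
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu2 - mu1) ** 2) / var1 - 1)
    return kl12 + kl21

# Toy "anchor space" posterior frames for two artists (rows = frames)
rng = np.random.default_rng(0)
a = rng.dirichlet([5, 1, 1], size=200)   # frames mostly near anchor 0
b = rng.dirichlet([1, 5, 1], size=200)   # frames mostly near anchor 1
d_ab = kl2_gaussian(a.mean(0), a.var(0), b.mean(0), b.var(0))
d_aa = kl2_gaussian(a[:100].mean(0), a[:100].var(0), a[100:].mean(0), a[100:].var(0))
print(d_aa < d_ab)  # two halves of the same artist are closer than two artists
```

The design point is that the comparison happens in the low-dimensional, semantically meaningful anchor space rather than on raw cepstra, so distances line up better with listener notions of "similar artists".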
Anchor Space
• Frame-by-frame high-level categorizations
  - compare to raw features?
[Figure: Madonna vs. Bowie frames scattered in anchor space (Electronica vs. Country posteriors) and in cepstral space (third vs. fifth cepstral coefficients)]
• Properties in the distributions? dynamics?

'Playola' Similarity Browser
[Screenshot of the Playola interface]

2. Meeting Recordings (with Jerry Liu and ICSI)
• A novel application for speech processing
  - summarize & review the content of meetings
• What really happens in meetings?
  - analysis of decision making, participant roles
• Multi-mic recordings for speaker turns
  - e.g. ad-hoc sensor setups (multiple PDAs)

Speaker Turns from Timing Differences
• Find the best timing skew between mic pairs
• Find clusters among the high-confidence points
• Fit a Gaussian to each cluster; assign that class to all frames within its radius
[Figure: ICSI0 inter-mic skew scatter plots — good points only; all points labelled by nearest class; all points labelled by closest dimension]

Browsing Meeting Recordings
• Information in the patterns of speaker turns
[Figure: speaker-turn timeline display]

3. "Personal Audio" (with Keansub Lee)
• Easy to record everything you hear
  - ~100 GB / year @ 64 kbps
• Very hard to find anything
  - how to scan? how to visualize? how to index?
• Starting point: collect data
  - ~60 hours (8 days, ~7.5 hr/day)
  - hand-mark 139 segments (26 min/segment average)
  - assign to 16 classes (8 have multiple instances)

Applications
• Automatic appointment-book history
  - fills in the when & where of movements
• "Life statistics"
  - how long did I spend in meetings this week vs. last?
  - most frequent conversations? favorite phrases??
• Retrieving details
  - what exactly did I promise?
  - privacy issues...
• Nostalgia?

Features for Long Recordings
• Feature frames = 1 minute (not 25 ms!)
• Characterize the variation within each frame...
[Figure: 8-hour traces per bark band of average linear energy, normalized energy deviation, average log energy, log energy deviation, average spectral entropy, and spectral entropy deviation]
• ... and the structure within coarse auditory bands

Segment Clustering
• Daily activity has lots of repetition: automatically cluster similar segments
• 'Affinity' of segments as KL2 distances
[Figure: affinity matrix over hand-labelled segments — supermarket, meeting, karaoke, barber, lecture, billiards, break, car/taxi, home, bowling, street, restaurant, library, campus]

Spectral Clustering
• Eigen-analysis of the affinity matrix: A = U·S·V'
[Figure: the affinity matrix and its SVD components u_k·s_kk·v_k' for k = 1...4]
  - the eigenvectors v_k give the cluster memberships
• Number of clusters?

Clustering Results
• Clustering automatic segments gives 'anonymous classes'
  - BIC criterion to choose the number of clusters
  - make the best correspondence to the 16 ground-truth classes
• Frame-level scoring gives ~70% correct
  - errors when the same 'place' has multiple ambiences

Privacy
• Recordings affront expectations of privacy
  - a critical barrier to progress
• Technical solutions?
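The spectral-clustering step (eigen-analysis of the affinity matrix, with eigenvectors giving memberships) can be seen on a toy example: for two equally sized groups of segments, the second singular vector separates the groups by sign. A hypothetical 6-segment sketch, not the talk's data:

```python
import numpy as np

# Toy affinity matrix: segments 0-2 share one ambience, segments 3-5 another
n = 6
A = np.full((n, n), 0.05)   # weak background affinity
A[:3, :3] = 1.0
A[3:, 3:] = 1.0

U, S, Vt = np.linalg.svd(A)         # A = U * diag(S) * Vt
labels = (Vt[1] > 0).astype(int)    # sign of the 2nd singular vector
print(labels)  # splits {0,1,2} from {3,4,5}; which group gets '1' is arbitrary
```

The leading singular vector is roughly uniform (overall affinity level); it is the second one that encodes the two-way split, and further vectors would encode finer splits, which is where a criterion like BIC comes in to decide how many clusters are real.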
Speech Scrambling
• Scramble 200 ms segments of the speech (long-term properties are preserved)
• High-confidence speaker ID to bypass the scrambling
[Figure: spectrograms of the original signal and the scrambled version (200 ms windows shuffled within 1 s)]

Visualization and Browsing
• Visualization / browsing / diary inference
  - link in other information sources: diary, email
• NoteTaker interface: "what was I hearing?"

Personal Audio: Current Work
• Voice segment detection / identification
  - poor and variable SNR; channel variation
• Novelty detection
  - segments that don't cluster against the archive
• Spatial information
  - identifying relative motion of head/sources
  - .. for motion detection
  - .. for source separation

LabROSA Summary
• LabROSA: signal processing + machine learning + information extraction
• Applications
  - Music: transcription, recommendation
  - Speech: recognition, organization
  - Environment: detection, description
• Also... signal separation, compression, dolphins...
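The speech-scrambling idea above amounts to a windowed permutation: within each roughly 1 s window, 200 ms segments are shuffled, destroying the word sequence while keeping long-term level and spectral statistics intact. A minimal sketch under those assumptions (parameter names and the window/segment sizes are illustrative):

```python
import numpy as np

def scramble(x, sr, seg_dur=0.2, win_dur=1.0, seed=0):
    """Shuffle seg_dur-second pieces of x within win_dur-second windows,
    so long-term texture survives but the words become unintelligible."""
    rng = np.random.default_rng(seed)
    seg = int(seg_dur * sr)
    win = int(win_dur * sr) // seg * seg   # whole number of segments per window
    y = x.copy()
    for start in range(0, len(y) - win + 1, win):
        pieces = y[start:start + win].reshape(-1, seg)
        rng.shuffle(pieces)                # permute the segments
        y[start:start + win] = pieces.reshape(-1)
    return y

sr = 8000
x = np.arange(sr * 2, dtype=float)         # stand-in for 2 s of audio
y = scramble(x, sr)
print(np.array_equal(np.sort(y), np.sort(x)))  # True: same samples, new order
```

Any trailing samples shorter than one window are left untouched in this sketch; a real system would also need the speaker-ID bypass mentioned on the slide so that authorized listeners can recover the original.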