Sound, Mixtures, and Learning:
LabROSA overview
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/
Sound Content Analysis
[Figure: spectrogram (frq / Hz vs. time / s, level / dB) passed through Analysis to yield labeled sound events: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Sound understanding: the key challenge
- what listeners do
- understanding = abstraction
• Applications
- indexing/retrieval
- robots
- prostheses
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a
lake, with handkerchiefs stretched across each one.
Looking only at the motion of the handkerchiefs, you are
to answer questions such as: How many boats are there
on the lake and where are they?” (after Bregman’90)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects natural scene properties = constraints
- subjective, not absolute
Approaches to handling sound mixtures
• Separate signals, then recognize
- e.g. CASA, ICA
- nice, if you can do it
• Recognize combined signal
- ‘multicondition training’
- combinatorics...
• Recognize with parallel models
- full joint-state space?
- or: divide signal into fragments, then use missing-data recognition
Outline
1 Sound Content Analysis
2 Recognizing sounds
- Speech recognition
- Nonspeech
3 Organizing mixtures
4 Accessing large datasets
Recognizing Sounds: Speech
• Standard speech recognition structure:
sound → Feature calculation → feature vectors → Acoustic classifier → phone probabilities → HMM decoder → phone / word sequence → Understanding / application...
- acoustic model parameters, word models (s-ah-t) and a language model (e.g. p("sat"|"the","cat") vs. p("saw"|"the","cat")) are trained from data
• How to handle additive noise?
- just train on noisy data: ‘multicondition training’
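To make the classifier → decoder chain concrete, here is a minimal Viterbi search in numpy over precomputed log-probabilities; an illustrative sketch (state space, transitions, and observation scores are placeholders), not the recognizer used here.

import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most-likely state path through an HMM.
    log_obs:   (T, S) log p(x_t | state), e.g. acoustic classifier output
    log_trans: (S, S) log transition probabilities
    log_init:  (S,)   log initial-state probabilities
    """
    T, S = log_obs.shape
    delta = log_init + log_obs[0]           # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans   # (from-state, to-state)
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(S)] + log_obs[t]
    path = np.zeros(T, dtype=int)           # trace back the best path
    path[-1] = np.argmax(delta)
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path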
Novel speech signal representations
(with Marios Athineos)
• Common sound models use 10 ms frames
- but: sub-10 ms envelope is perceptible
• Use a parametric (LPC) model on the spectrum
[Figure: spectrograms (Frequency vs. Time) of "Fireplace - orig" and its TDLP, CTFLP, and CTFLP 4x TSM resyntheses; subband envelopes (2-4 kHz, 1-2 kHz, 0.5-1 kHz, 0-0.5 kHz) plotted above the waveform]
• Convert to features for ASR
- improvements esp. for stops
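A rough sketch of the underlying idea, linear prediction applied across the spectrum so the all-pole fit models the temporal envelope at sub-10 ms resolution; an illustration under stated assumptions, not the TDLP/CTFLP implementation itself.

import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(x, order=20):
    """All-pole model of the temporal envelope of signal segment x:
    ordinary LPC fit to the cosine transform of the signal, so the
    resulting 'spectrum' lives in the time domain."""
    c = dct(x, type=2, norm='ortho')                  # signal -> 'spectrum'
    r = np.correlate(c, c, mode='full')[len(c) - 1:]  # autocorrelation
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    a = np.concatenate(([1.0], -a))                   # prediction polynomial
    n = len(x)
    grid = np.exp(-1j * np.pi * np.outer(np.arange(n) / n, np.arange(order + 1)))
    return 1.0 / np.abs(grid @ a) ** 2                # smooth temporal envelope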
Finding the Information in Speech
(with Patricia Scanlon)
• Mutual Information in time-frequency:
[Figure: mutual information (MI / bits) between time-frequency cells (freq / Bark vs. time offset / ms, -200 to +200) and labels, for All phones, Stops, Vowels, and Speaker (vowels)]
• Use to select classifier input features
[Figure: selected feature masks (freq / Bark vs. time / ms) for decreasing feature counts (133, 95, 85, 57, 47, 28, 19, 9 ftrs), and frame accuracy (%) vs. number of features for IRREG+CHKBD, RECT+CHKBD, IRREG, RECT, and RECT-all-frames selections]
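The selection step itself is simple once the MI map exists; a toy sketch using scikit-learn's MI estimator on stand-in data (the real features were phone-labeled time-frequency cells):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 133))      # stand-in: 133 t-f cells per frame
y = rng.integers(0, 40, size=5000)    # stand-in: phone labels

mi = mutual_info_classif(X, y, random_state=0)   # MI of each cell vs. label
keep = np.argsort(mi)[::-1][:47]                 # retain the 47 best cells
X_sel = X[:, keep]                               # reduced classifier input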
Alarm sound detection
(Ellis 2001)
• Alarm sounds have particular structure
- people ‘know them when they hear them’
- clear even at low SNRs
[Figure: spectrogram (freq / kHz vs. time / s, level / dB) of alarms hrn01, bfr02, buz01 in noise (s0n6a8+20)]
• Why investigate alarm sounds?
- they’re supposed to be easy
- potential applications...
• Contrast two systems:
- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
Alarms: Results
[Figure: spectrograms (freq / kHz vs. time / sec) of Restaurant + alarms (snr 0, ns 6, al 8), with MLP classifier output and sound object classifier output]
• Both systems commit many insertions at 0 dB SNR, but in different circumstances:

  Noise    | Neural net system     | Sinusoid model system
           | Del       Ins   Tot   | Del       Ins   Tot
  1 (amb)  | 7 / 25      2    36%  | 14 / 25     1    60%
  2 (bab)  | 5 / 25     63   272%  | 15 / 25     2    68%
  3 (spe)  | 2 / 25     68   280%  | 12 / 25     9    84%
  4 (mus)  | 8 / 25     37   180%  |  9 / 25   135   576%
  Overall  | 22 / 100  170   192%  | 50 / 100  147   197%
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
- Auditory Scene Analysis
- Missing data recognition
- Parallel model inference
4 Accessing large datasets
Auditory Scene Analysis
• How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram: frequency analysis feeds onset, harmonicity, and position maps into a grouping mechanism that yields source properties (after Darwin, 1996)]
Cues to simultaneous grouping
• Elements + attributes
[Figure: spectrogram (freq / Hz vs. time / s) overlaid with discrete elements and their attributes]
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
• But: Context ...
Computational Auditory Scene Analysis:
The Representational Approach
(Cooke & Brown 1993)
• Direct implementation of psych. theory
input mixture → Front end (maps: onset, period, frq.mod over freq × time) → signal features → Object formation → discrete objects → Grouping rules → Source groups
- ‘bottom-up’ processing
- uses common onset & periodicity cues
• Able to extract voiced speech:
[Figure: spectrograms (frq / Hz vs. time / s) of the input mixture brn1h.aif and the extracted voiced speech brn1h.fi.aif]
Adding top-down constraints
Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up)...
input mixture → Front end → signal features → Object formation → discrete objects → Grouping rules → Source groups
- objects irresistibly appear
• vs. Prediction-driven (top-down)
input mixture → Front end → signal features → Compare & reconcile → prediction errors → Hypothesis management → hypotheses (Noise components, Periodic components) → Predict & combine → predicted features (fed back to Compare & reconcile)
- match observations with parameters of a world-model
- need world-model constraints...
Prediction-Driven CASA
[Figure: prediction-driven analysis of a City recording (f / Hz vs. time / s, level / dB): the input spectrogram is explained as Wefts 1-4, Weft 5, Wefts 6-7, Weft 8, and Wefts 9-12 (Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10)), Noise2 + Click1 (Crash (10/10)), and Noise1 (Squeal (6/10), Truck (7/10))]
Segregation vs. Inference
• Source separation requires attribute separation
- sources are characterized by attributes (pitch, loudness, timbre + finer details)
- need to identify & gather different attributes for different sources ...
• Need representation that segregates attributes
- spectral decomposition
- periodicity decomposition
• Sometimes values can’t be separated
- e.g. unvoiced speech
- maybe infer factors from probabilistic model? p(O, x, y) → p(x, y | O)
- or: just skip those values, infer from higher-level context
- do both: missing-data recognition
Missing Data Recognition
• Speech models p(x|m) are multidimensional...
- i.e. means, variances for every freq. channel
- need values for all dimensions to get p(•)
• But: can evaluate over a subset of dimensions xk:
      p(xk | m) = ∫ p(xk, xu | m) dxu
• Hence, missing data recognition:
[Figure: a present-data mask over dimension × time selects which factors P(x1|q) · P(x2|q) · ... · P(x6|q) contribute to P(x|q); for masked dimensions only the bound p(xk | xu < y) is available]
- hard part is finding the mask (segregation)
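For diagonal-covariance Gaussian state models, marginalizing over missing dimensions amounts to dropping their factors; a minimal sketch (the bounded-marginalization variant using the observed ceiling y is omitted, and all names are illustrative):

import numpy as np

def missing_data_loglik(x, mask, means, variances, log_prior):
    """Per-state log-likelihood using only the 'present' dimensions.
    x: (D,) feature vector; mask: (D,) bool, True where target dominates;
    means, variances: (S, D); log_prior: (S,)"""
    d = x[mask] - means[:, mask]          # residual on present dims only
    v = variances[:, mask]
    ll = -0.5 * np.sum(d * d / v + np.log(2 * np.pi * v), axis=1)
    return ll + log_prior                 # absent dims integrate out to 1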
Comparing different segregations
• Standard classification chooses between models M to match source features X:
      M* = argmax_M P(M | X) = argmax_M P(X | M) · P(M) / P(X)
• Mixtures → observed features Y, segregation S, all related by P(X | Y, S)
[Figure: observed spectrum Y(f) decomposed into source X(f) and segregation mask S along freq]
- spectral features allow clean relationship
• Joint classification of model and segregation:
      P(M, S | Y) = P(M) · ∫ P(X | M) · P(X | Y, S) / P(X) dX · P(S | Y)
- probabilistic relation of models & segregation
Multi-source decoding
• Search for more than one source
[Diagram: observations Y(t) jointly explained by state sequences q1(t), q2(t) through mutually-dependent masks S1(t), S2(t)]
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
• Lots of issues in models, representations, matching, inference...
What a speech HMM contains
• Markov model structure: states + transitions
[Figure: word models M'1 and M'2 as state transition probabilities over phone states S, A, T, K, E, O (self-loops ~0.8-0.9), state models (means) as spectral templates (freq / kHz), and a spectrogram generated from the model]
• A generative model
- but not a good speech generator!
- only meant for inference of p(X|M)
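One way to see why it is a poor generator: sampling from the model produces frames that are conditionally independent given the state path. A hedged sketch of such sampling (all parameters illustrative):

import numpy as np

def sample_hmm(trans, means, stds, n_frames, seed=0):
    """Draw a state path and spectral frames from a Gaussian HMM."""
    rng = np.random.default_rng(seed)
    S = trans.shape[0]
    q = np.zeros(n_frames, dtype=int)
    q[0] = rng.integers(S)
    for t in range(1, n_frames):
        q[t] = rng.choice(S, p=trans[q[t - 1]])   # follow transitions
    X = rng.normal(means[q], stds[q])             # independent frame draws
    return q, X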
“One microphone source separation”
(Roweis 2000, Manuel Reyes)
• State sequences → t-f estimates → mask
[Figure: spectrograms (freq / Hz vs. time / sec) of the original voices (Speaker 1, Speaker 2), their mixture, the resynthesis masks, and the state means]
- 1000 states/model (→ 10^6 transition probs.)
- simplify by modeling subbands (coupled HMM)?
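The mask construction is the simple part; a sketch of Roweis-style 'refiltering' given per-speaker spectrogram estimates (e.g. the state means along each decoded path):

import numpy as np

def refilter(mix_spec, est1, est2):
    """Binary t-f mask from competing estimates, applied to the mixture.
    mix_spec, est1, est2: (F, T) magnitude spectrograms."""
    mask1 = est1 > est2                        # cells where speaker 1 dominates
    return mix_spec * mask1, mix_spec * ~mask1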
Subband models
(Reyes, Jojic)
• Reduce the number of states required
- 4000 states × 1 band → 30 states × 19 bands
• Train coupled HMMs via variational approx
[Diagram: coupled-HMM lattices of subband states S_b,t (bands b = 1..5, frames t = 1..6), with dependencies along time and between adjacent bands]
Tracking source normalization
• Standard HMM sound models try to normalize away energy variation
- but: each source in mixture has different ‘gain’
[Diagram: Source 1 and Source 2 data are separately normalized (÷ k1, ÷ k2) to match Model 1 and Model 2; for the composed data no single normalization fits both models: a normalization mismatch]
• Instead, factor out scalar gain for each source
[Diagram: Markov chain S1...S6 with per-frame gains a1...a6 generating observations x1...x6]
- solve with var. approx. to P(S,a)
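In a log-spectral domain a scalar gain becomes an additive offset, and for diagonal Gaussians the best per-state offset has a closed form; the variational machinery is needed for the joint inference over states and gains, but the inner step looks roughly like this sketch:

import numpy as np

def best_gain_loglik(x, means, variances):
    """Per-state log-likelihood after factoring out a scalar (log-domain) gain.
    x: (D,) log-spectral frame; means, variances: (S, D)."""
    w = 1.0 / variances                              # per-dimension precisions
    r = x - means                                    # residual before gain
    a = np.sum(w * r, axis=1) / np.sum(w, axis=1)    # optimal offset per state
    d = r - a[:, None]
    ll = -0.5 * np.sum(d * d * w + np.log(2 * np.pi * variances), axis=1)
    return ll, a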
Outline
1 Sound Content Analysis
2 Recognizing sounds
3 Organizing mixtures
4 Accessing large datasets
- Meeting Recordings
- The Listening Machine
- Music Information Retrieval
Accessing large datasets:
The Meeting Recorder Project
• Microphones in conventional meetings
- for summarization / retrieval / behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, IDIAP, NIST):
- ~100 hours collected & transcribed
• NSF ‘Mapping Meetings’ project
Speaker Turn detection
(Huan Wei Hee, Jerry Liu)
• Acoustic: triangulate tabletop mic timing differences, as in the sketch below
- use normalized peak value for confidence
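A minimal sketch of the timing-difference estimate for one channel pair (names and windowing are illustrative, not the project's code):

import numpy as np

def xcorr_lag(a, b, sr, max_lag_ms=3.0):
    """Lag (ms) of the cross-correlation peak between two mic channels,
    plus the normalized peak height as a confidence score."""
    max_lag = int(sr * max_lag_ms / 1000)
    a = a - a.mean()
    b = b - b.mean()
    xc = np.correlate(a, b, mode='full')
    mid = len(b) - 1                                  # zero-lag index
    win = xc[mid - max_lag: mid + max_lag + 1]
    k = int(np.argmax(np.abs(win)))
    conf = abs(win[k]) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 1000.0 * (k - max_lag) / sr, conf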
[Figure: PZM cross-correlation lags for meeting mr-2000-11-02-1440, plotted as lag 3-4 / ms vs. lag 1-2 / ms and colored by time / s; inset: example cross-coupling response, chan3 to chan0 (100×R skew / samps)]
• Behavioral: look for patterns of speaker turns
[Figure: mr04 hand-marked speaker turns vs. time (participants 1-10 vs. time / min) with automatic and manual boundaries]
The Listening Machine
• Smart PDA records everything
• Only useful if we have index, summaries
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
• Meeting data, ambulatory audio
Personal Audio
• LifeLog / MyLifeBits / Remembrance Agent: easy to record everything you hear
• Then what?
- prohibitively time consuming to search
- but .. applications if access easier
• Automatic content analysis / indexing...
[Figure: spectrograms of a day of personal audio (freq / Hz over ~250 min; freq / Bark vs. clock time 14:30-18:30)]
Music Information Retrieval
• Transfer search concepts to music?
- “musical Google”
- finding something specific / vague / browsing
- is anything more useful than human annotation?
• Most interesting area: finding new music
- is there anything on mp3.com that I would like?
- audio is the only information source for new bands
• Basic idea: project music into a space where neighbors are “similar”
• Also need models of personal preference
- where in the space is the stuff I like
- relative sensitivity to different dimensions
• Evaluation problems
- requires large, shareable music corpus!
Music similarity from Anchor space
• A classifier trained for one artist (or genre) will respond partially to a similar artist
• Each artist evokes a particular pattern of responses over a set of classifiers
• We can treat these classifier outputs as a new feature space in which to estimate similarity, as sketched below
[Diagram: audio input (Class i, Class j) → anchor classifiers p(a1|x), p(a2|x), ..., p(an|x) → n-dimensional vectors in "Anchor Space" → GMM modeling → similarity computation (KL-d, EMD, etc.)]
• “Anchor space” reflects subjective qualities?
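A sketch of the mapping into anchor space, assuming one GMM per anchor class (scikit-learn names; the actual system compares distributions via KL-divergence or EMD rather than plain vector distance):

import numpy as np
from sklearn.mixture import GaussianMixture

def anchor_vector(frames, anchor_gmms):
    """Track coordinates in anchor space: mean per-anchor posterior
    over the track's frames (equal anchor priors assumed)."""
    ll = np.stack([g.score_samples(frames) for g in anchor_gmms], axis=1)
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)      # p(a_i | x) per frame
    return post.mean(axis=0)                     # n-dimensional vector

def similarity(v1, v2):
    return -np.linalg.norm(v1 - v2)              # toy stand-in for KL-d / EMD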
Playola interface ( www.playola.org )
• Browser finds closest matches to single tracks or entire artists in anchor space
• Direct manipulation of anchor space axes
Evaluation
• Are recommendations good or bad?
• Subjective evaluation is the ground truth
- .. but subjects aren’t familiar with the bands being recommended
- can take a long time to decide if a recommendation is good
• Measure match to other similarity judgments
- e.g. musicseer data:
[Figure: top rank agreement (%) for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK against judgment sets SrvKnw (4789×3.58), SrvAll (6178×8.93), GamKnw (7410×3.96), GamAll (7421×8.92)]
Summary
• Sound
- .. contains much valuable information at many levels
- intelligent systems need to use this information
• Mixtures
- .. are an unavoidable complication when using sound
- look in the right time-frequency place to find points of dominance
• Learning
- need to acquire constraints from the environment
- recognition/classification as the real task
LabROSA Summary
DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures
ROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis
APPLICATIONS
• Structuring
• Search
• Summarization
• Awareness
• Understanding
Extra Slides
Music Applications
• Music as a complex, information-rich sound
• Applications of separation & recognition:
- note/chord detection & classification
[Figure: "DYWMB: Alignments to MIDI note 57 mapped to Orig Audio": spectrogram excerpts (freq / kHz vs. time / sec) around the aligned note events]
- singing detection (→ genre identification ...)
[Figure: voice detection for Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee): true-voice regions vs. per-artist classifier outputs for 21 artists (Michael Penn, The Roots, The Moles, ..., Boards of Canada) over time / sec]
Artist Similarity
• Recognizing work from each artist is all very well...
• But: what is similarity between artists?
- pattern recognition systems give a number...
[Figure: 2-D projection of pop artists (roxette, toni_braxton, ron_carter, erasure, lara_fabian, jessica_simpson, mariah_carey, janet_jackson, eiffel_65, celine_dion, pet_shop_boys, christina_aguilera, aqua, lauryn_hill, sade, all_saints, backstreet_boys, madonna, spice_girls, belinda_carlisle, nelly_furtado, annie_lennox, ...)]
• Need subjective ground truth: collected via web site www.musicseer.com
• Results:
- 1,000 users, 22,300 judgments collected over 6 months
Independent Component Analysis (ICA)
(Bell & Sejnowski 1995 et seq.)
• Drive a parameterized separation algorithm to maximize independence of outputs
[Diagram: mixtures (m1, m2) pass through unmixing matrix [[a11, a12], [a21, a22]] to give outputs (s1, s2); the mutual information of the outputs is fed back to adapt a]
• Advantages:
- mathematically rigorous, minimal assumptions
- does not rely on prior information from models
• Disadvantages:
- may converge to local optima...
- separation, not recognition
- does not exploit prior information from models
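For reference, the standard recipe on a toy instantaneous mixture, using scikit-learn's FastICA (synthetic sources; recovery is only up to permutation and scale):

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 8000)
s = np.c_[np.sin(2 * np.pi * 440 * t),            # tone
          np.sign(np.sin(2 * np.pi * 3 * t))]     # square wave
A = np.array([[1.0, 0.6], [0.4, 1.0]])            # mixing matrix
m = s @ A.T                                       # observed mixtures m1, m2

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(m)                      # estimated sources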