Semantic Audio Analysis 1 Semantic Audio Analysis 2 Organizing Sound Mixtures 3 Applications for Audio Semantics 4 Open Questions Dan Ellis <dpwe@ee.columbia.edu> Laboratory for Recognition and Organization of Speech and Audio (LabROSA) Columbia University, New York http://labrosa.ee.columbia.edu/ Dan Ellis Semantic Audio Analysis (1 of 19) 2003-10-13 Semantic Audio Analysis 1 • Audio Semantics = what is the meaning / message? - “AI complete”? • “Semantics” is broad! - used for the-stuff-we-can’t-do-yet • How about speech recognition? 4000 frq/Hz 3000 0 2000 -20 1000 -40 0 0 2 4 6 8 10 12 time/s -60 level / dB IT'S NICE TO HAVE THIS FRIDAY WAS IN DOLLARS IN THE BOMBING RAIDS WAS CLOSING THOUSAND TO FOUR GUNMEN CAUSING CONDITIONING SAID THE STOCK ROSE SMOKERS FROM THE TWENTY NINE NINETY FIVE PLUS THREE ON THE EXPERT TECHNICIANS EXPECTED TO REACH - no good even if it worked! Dan Ellis Semantic Audio Analysis (2 of 19) 2003-10-13 Towards Semantic Audio Analysis • What do we want from SAA? - describe sound in human-recognizable terms - “automatic subtitles for real life”? [thunder] S: I THINK THE WEATHER'S CHANGING • Key step in speech recognition is classification - convert continuous signal into discrete classes - need a rich, application-dependent vocabulary - hierarchy of pattern recognition • Listeners perceive sound sources - If SAA primitives are to be subjective percepts... - ...source segregation is the first problem? Dan Ellis Semantic Audio Analysis (3 of 19) 2003-10-13 SAA Applications • Subjective descriptions are the ultimate sound representation 4000 frq/Hz 3000 0 2000 -20 1000 -40 0 0 - • Dan Ellis 2 4 6 8 10 12 time/s -60 level / dB data compression: “Radio ad for AARCO” signal enhancement: “... without noise” modification: “.. woman’s voice ..” .. needs both analysis and synthesis? Sound understanding useful for: - indexing/retrieval - robots - prostheses? Semantic Audio Analysis (4 of 19) 2003-10-13 Outline 1 Semantic Audio Analysis 2 Organizing Sound Mixtures - Human Auditory Scene Analysis - Organizing Mixtures by Computer 3 Applications for Audio Semantics 4 Open Questions Dan Ellis Semantic Audio Analysis (5 of 19) 2003-10-13 Auditory Scene Analysis 2 freq / Hz (Bregman 1990) • How do people analyze sound mixtures? - break mixture into small elements (in time-freq) - elements are grouped in to sources using cues - sources have aggregate attributes • Elements + attributes 8000 6000 4000 2000 0 0 1 2 3 4 5 6 7 8 9 time / s - common onset ( = dependent origins) - periodicity ( = single process) - + spatial cues etc. + familiarity, context ... Dan Ellis Semantic Audio Analysis (6 of 19) 2003-10-13 Computational Auditory Scene Analysis: The Representational Approach (Cooke & Brown 1993) • input mixture Direct implementation of psych. theory signal features Front end (maps) Object formation discrete objects Source groups Grouping rules freq onset time period frq.mod - ‘bottom-up’ processing - uses common onset & periodicity cues • frq/Hz Able to extract voiced speech: brn1h.aif frq/Hz 3000 3000 2000 1500 2000 1500 1000 1000 600 600 400 300 400 300 200 150 200 150 100 brn1h.fi.aif 100 0.2 0.4 Dan Ellis 0.6 0.8 1.0 time/s Semantic Audio Analysis (7 of 19) 0.2 0.4 0.6 0.8 2003-10-13 1.0 time/s Approaches to handling sound mixtures • Separate signals, then recognize - Computational Auditory Scene Analysis (CASA), Independent Component Analysis - nice, if you can make it work • Recognize combined signal - ‘multicondition training’ - combinatorics seem daunting • Recognize with parallel models - optimal inference from full joint state-space p ( O, x , y ) Æ p ( x , y O ) - or: skip obscured fragments, infer from higher-level context - or do both: missing-data recognition Dan Ellis Semantic Audio Analysis (8 of 19) 2003-10-13 Multi-source decoding (Barker, Cooke & Ellis 2003) • Missing Data recognizes from a subset; Can search for more than one source q2(t) S2(t) Y(t) S1(t) q1(t) • Mutually-dependent data masks • Use e.g. CASA features to propose masks - locally coherent regions • Issues in models, representations, inference... Dan Ellis Semantic Audio Analysis (9 of 19) 2003-10-13 Outline 1 Semantic Audio Analysis 2 Organizing Sound Mixtures 3 Applications for Audio Semantics - Meeting recordings - Audio diary analysis - Semantics of musical signals 4 Open Questions Dan Ellis Semantic Audio Analysis (10 of 19) 2003-10-13 The Meeting Recorder Project 3 (with ICSI, UW, IDIAP, SRI, Sheffield) • Microphones in conventional meetings - for summarization / retrieval / behavior analysis - informal, overlapped speech (Æ ASR...) • Behavioral: Look for patterns of speaker turns Participant mr04: Hand-marked speaker turns vs. time + auto/manual boundaries 10: 9: 8: 7: 5: 3: 2: 1: 0 5 Dan Ellis 10 15 20 25 30 35 40 Semantic Audio Analysis (11 of 19) 45 50 55 60 time/min 2003-10-13 freq / Hz Personal Audio: The Listening Machine • Smart PDA records everything you hear • Only useful if we have index, summaries - semantic descriptions (real time?) • Features appropriate for 1 minute segments... 4 2 freq / Bark 0 50 100 150 200 250 time / min 15 10 5 14:30 15:00 15:30 16:00 16:30 17:00 17:30 18:00 18:30 clock time - Bark band variance, spectral entropy, ... Dan Ellis Semantic Audio Analysis (12 of 19) 2003-10-13 Music Information Retrieval (Berenzweig & Ellis 2003) • Apply search concepts to music? - “musical Google” – beat human annotation? - application: finding new music • Construct music space where near = “similar” Dan Ellis Semantic Audio Analysis (13 of 19) 2003-10-13 Outline 1 Semantic Audio Analysis 2 Organizing Sound Mixtures 3 Applications for Audio Semantics 4 Open Questions Dan Ellis Semantic Audio Analysis (14 of 19) 2003-10-13 Open Questions 4 • Semantics - What are the abstract perceptual attributes of a sound? How can we describe them? • Mixtures - How do people organize sound mixtures into separate source percepts? - How can we represent generic sound knowledge? • Applications - Search/retrieval: What terminology is most natural for users querying sound databases? - What is the best balance between machine and human listening? - What are the problems for which machine listening can be most useful? Dan Ellis Semantic Audio Analysis (15 of 19) 2003-10-13 Extra slides Dan Ellis Semantic Audio Analysis (16 of 19) 2003-10-13 Missing Data Recognition (Barker, Cooke & Ellis ’03) • Can evaluate speech models p(x|m) over a subset of dimensions xk p ( xk m ) = • Ú p ( x k, x u m ) dx u xu p(xk,xu) y p(xk|xu<y ) xk p(xk ) Hence, missing data recognition: Present data mask P(x | q) = dimension P(x1 | q) · P(x2 | q) · P(x3 | q) · P(x4 | q) · P(x5 | q) · P(x6 | q) .. but need segregation mask time • Fit model and segregation given obs’n: P( X Y , S ) P ( M , S Y ) = P ( M ) Ú P ( X M ) ◊ ------------------------- dX ◊ P ( S Y ) P( X ) Dan Ellis Semantic Audio Analysis (17 of 19) 2003-10-13 What a speech HMM contains • Markov model structure: states + transitions S A 0.1 0.02 T 0.9 4 Model M'2 0.8 0.9 K freq / kHz 0.8 0.18 0.2 3 0.8 0.05 2 A 0.15 20 T 0.2 30 E 40 O 0 • 10 0.8 0.1 1 O 0.1 0.9 K S E 0.1 freq / kHz M'1 State Transition Probabilities State models (means) 0.9 50 10 A generative model - but not a good speech generator! 4 3 2 1 0 0 1 2 3 4 5 time / sec - only meant for inference of p(X|M) Dan Ellis Semantic Audio Analysis (18 of 19) 2003-10-13 20 30 40 50 Music similarity from Anchor space • A classifier trained for one artist (or genre) will respond partially to a similar artist • Each artist evokes a particular pattern of responses over a set of classifiers • We can treat these classifier outputs as a new feature space in which to estimate similarity n-dimensional vector in "Anchor Space" Anchor Anchor p(a1|x) Audio Input (Class i) AnchorAnchor Anchor Audio Input (Class j) |x) p(a2n-dimensional vector in "Anchor Space" GMM Modeling Similarity Computation p(a1|x)p(an|x) p(a2|x) Anchor Conversion to Anchorspace GMM Modeling KL-d, EMD, etc. p(an|x) Conversion to Anchorspace • Dan Ellis “Anchor space” reflects subjective qualities? Semantic Audio Analysis (19 of 19) 2003-10-13