Recognition & Organization of Speech and Audio Dan Ellis Electrical Engineering, Columbia University <dpwe@ee.columbia.edu> Outline 1 Sound ‘organization’ 2 Background & related work 3 Existing projects 4 Future projects 5 Summary & conclusions ROSAtalk - Dan Ellis 2000-10-13 - 1 Lab ROSA 1 Organization of sound mixtures frq/Hz 4000 2000 0 0 2 4 6 8 10 12 time/s Analysis Voice (evil) Voice (pleasant) Stab Rumble • ROSAtalk - Dan Ellis Choir Strings Core operation: Converting continuous, scalar signal into discrete, symbolic representation 2000-10-13 - 2 Lab ROSA Positioning sound organization Speech recognition Image/video analysis Sound organization Audio processing Psychoacoustics & physiology Signal processing Pattern recognition Machine learning Artificial intelligence • Draws on many techniques • Abuts/overlaps various areas ROSAtalk - Dan Ellis 2000-10-13 - 3 Lab ROSA About auditory perception • Received waveform is a mixture - two sensors, N signals ... - need knowledge-based constraints • Psychoacoustics: the study of human sound organization - ‘auditory scene analysis’ (Bregman’90) • Auditory perception is ecologically grounded - scene analysis is preconscious (→ illusions) - perceived organization: real-world objects + events (transient) - subjective not canonical (ambiguity) ROSAtalk - Dan Ellis 2000-10-13 - 4 Lab ROSA Key themes for LabROSA • Sound organization - recovering/constructing abstraction hierarchy - at an instant (sources) - along time (segmentation) • Scene analysis - need to find attributes according to objects - use attributes to form objects - ... plus constraints of knowledge • Exploiting large data sets (the ASR lesson) - supervised/labelled: pattern recognition - unsupervised: structure discovery, clustering • Special-purpose cases: - speech recognition - source-specific recognizers • ... within a ‘complete explanation’ ROSAtalk - Dan Ellis 2000-10-13 - 5 Lab ROSA Applications for sound organization What do people do with their ears? • Robots - intelligence requires awareness - Sony’s AIBO: dog-hearing • Human-computer interface - .. includes knowing when (& why) you’ve failed • Archive indexing & retrieval - pure audio archives - true multimedia content analysis • Content ‘understanding’ - intelligent classification & summarization • Autonomous monitoring • Broader ‘structure discovery’ algorithms ROSAtalk - Dan Ellis 2000-10-13 - 6 Lab ROSA Outline 1 Sound ‘organization’ 2 Background & related work - Audio coding & compression - Automatic Speech Recognition - Computational Auditory Scene Analysis - Multimedia information retrieval 3 Existing projects 4 Future projects 5 Summary & conclusions ROSAtalk - Dan Ellis 2000-10-13 - 7 Lab ROSA Audio coding & compression • Goal is reconstruction, not abstraction • But criteria are ‘subjective’: want same percept, not same waveform • MPEG-Audio: - filterbanks - information-theoretic coding - psychoacoustic masking of quantization noise • MPEG-4 ‘Structured Audio’ - computer music synthesis model - instrument definition + control stream - automatic analysis? ROSAtalk - Dan Ellis 2000-10-13 - 8 Lab ROSA Automatic Speech Recognition (ASR) • Standard speech recognition structure: sound Feature calculation D AT A feature vectors Acoustic model parameters Word models s ah t Language model p("sat"|"the","cat") p("saw"|"the","cat") Acoustic classifier phone probabilities HMM decoder phone / word sequence Understanding/ application... • ‘State of the art’ word-error rates (WERs): - 2% (dictation) - 30% (telephone conversations) • Segmentation of speech & nonspeech - ... recognizer wouldn’t notice! ROSAtalk - Dan Ellis 2000-10-13 - 9 Lab ROSA Spoken document retrieval • Text-based IR on ASR transcripts - e.g. news broadcasts (CMU’s Informedia, Thisl) • Recognition errors are not the limiting factor - TREC-98 results: average precision 0.5→0.4 • Weak at word level, but OK over paragraphs - replay the audio, don’t show the text! ROSAtalk - Dan Ellis 2000-10-13 - 10 Lab ROSA Computational Auditory Scene Analysis (CASA) • input mixture Implement psychoacoustic theory? (Brown’92) Front end signal features (maps) Object formation discrete objects Source groups Grouping rules freq onset time period frq.mod - what are the features? how are they used? • Top down constraints are needed (Klassner’96) 3760 Hz Buzzer-Alarm 2540 Hz 2230 Hz 2350 Hz Glass-Clink 1675 Hz 1475 Hz 950 Hz 500 Hz 460 Hz 420 Hz Phone-Ring 1.0 Siren-Chirp 2.0 TIME ROSAtalk - Dan Ellis 3.0 4.0 sec 1.0 2.0 3.0 4.0 sec TIME 2000-10-13 - 11 Lab ROSA Audio Information Retrieval • Searching in a database of audio - speech .. use ASR - text annotations .. search them - sound effects library? • e.g. Muscle Fish “SoundFisher” browser - define multiple ‘perceptual’ feature dimensions - search by proximity in (weighted) feature space Segment feature analysis Sound segment database Feature vectors Seach/ comparison Results Segment feature analysis Query example - features are ‘global’ for each soundfile, no attempt to separate mixtures - segmentation... ROSAtalk - Dan Ellis 2000-10-13 - 12 Lab ROSA Music analysis • Automatic transcription (score recovery) - classic ‘hard problem’: can people do it even? - recent success in reduced forms e.g. melody, drum track (Goto’00) • Instrument identification - ideas from speaker identification (basic PR) + instrument family hierarchies (Martin’99) • Fingerprinting - spot recordings despite noise, distortion - relies on perceptual invariants • Music clustering - e.g. music recommendation based on signal - correlate objective features with user ratings? ROSAtalk - Dan Ellis 2000-10-13 - 13 Lab ROSA Multimedia description • MPEG-7 ‘Metadata’ - MPEG is known for audio/video compression standards; - also develop standards for search and indexing • MPEG-7 is a standard format for metadata: Well-defined categories for content description • Focus is on framework & infrastructure • Audio descriptor categories: - from ASR - from computer music community - uses still to emerge ROSAtalk - Dan Ellis 2000-10-13 - 14 Lab ROSA Outline 1 Sound ‘organization’ 2 Background & related work 3 Existing projects - Acoustic change detection - Robust speech recognition - Nonspeech event detection - Prediction-driven CASA 4 Future projects 5 Summary & conclusions ROSAtalk - Dan Ellis 2000-10-13 - 15 Lab ROSA Acoustic change detection (with Williams/Sheffield, Ferreiros/UPMadrid) • Approaches: - ‘metric’: find instants of maximal change - ‘model-based’: best alignment of model set - ‘bayesian’: generate models when warranted Dynamism vs. Entropy for 2.5 second segments of speecn/music 3.5 Spectrogram frq/Hz Speech Music Speech+Music 3 4000 2.5 2000 2 speech music Entropy 0 speech+music 1.5 40 1 20 0 0 2 4 6 8 10 12 time/s 0.5 Posteriors 0 0 0.05 0.1 0.15 0.2 Dynamism 0.25 • Typically agnostic about underlying problem - use any features, find any changes • Good for ASR adaptation, otherwise... ROSAtalk - Dan Ellis 2000-10-13 - 16 0.3 Lab ROSA Speech feature combination (with Bilmes/UW, Hermansky/OGI, ICSI) • ‘Multistream’ approaches Feature 1 calculation Acoustic classifier HMM decoder Posterior combination Feature 2 calculation Input sound Acoustic classifier Word hypotheses Phone probabilities Speech features - streams can correct each other → big gains • Which feature streams to combine? - low mutual information between classifiers indicates complementary streams PLPb MSGa MSGb Conditional Mutual Information between feature streams 0.5 PLPa 0.4 0.3 0.2 0.1 0 PLPa ROSAtalk - Dan Ellis PLPb MSGa MSGb CMI/bits 2000-10-13 - 17 Lab ROSA Tandem speech recognition (with Hermansky, Sharma & Sivadas/OGI, Singh/CMU) • Neural net estimates phone posteriors; but Gaussian mixtures model finer detail • Combine them! Hybrid Connectionist-HMM ASR Feature calculation Conventional ASR (HTK) Noway decoder Neural net classifier Feature calculation HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t s ah t tn+w Input sound Speech features Words Phone probabilities Input sound Subword likelihoods Speech features Words Tandem modeling Feature calculation Neural net classifier HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t tn+w Input sound • ROSAtalk - Dan Ellis Speech features Phone probabilities Subword likelihoods Words 50% relative improvement over GMMs alone - different statistical modeling schemes get different info from same training data 2000-10-13 - 18 Lab ROSA Alarm sound detection freq / Hz • Deconstructing sound mixtures 5000 5000 5000 4000 4000 4000 3000 3000 3000 2000 2000 2000 1000 1000 1000 0 1 1.5 2 2.5 0 1 1.5 2 time / sec 2.5 0 1 1 1.5 2 2.5 - representation of energy in time-frequency - formation of atomic elements - grouping by common properties (onset &c.) • ROSAtalk - Dan Ellis Alarm sounds have particular structure - people ‘know them when they hear them’ - build a generic detector? 2000-10-13 - 19 Lab ROSA Prediction-driven CASA • input mixture Data-driven (bottom-up) fails for noisy, ambiguous sounds (most mixtures!) Front end signal features (maps) • Object formation discrete objects Grouping rules Source groups Need top-down constraints: hypotheses Noise components Hypothesis management prediction errors input mixture Front end signal features Compare & reconcile Periodic components Predict & combine predicted features - fit vocabulary of generic elements to sound ... bottom of a hierarchy? - account for entire scene - driven by prediction failures - pursue alternative hypotheses ROSAtalk - Dan Ellis 2000-10-13 - 20 Lab ROSA PDCASA example f/Hz City 4000 2000 1000 400 200 1000 400 200 100 50 0 f/Hz 1 2 3 Wefts1−4 4 5 Weft5 6 7 Wefts6,7 8 Weft8 9 Wefts9−12 4000 2000 1000 400 200 1000 400 200 100 50 Horn1 (10/10) Horn2 (5/10) Horn3 (5/10) Horn4 (8/10) Horn5 (10/10) f/Hz Noise2,Click1 4000 2000 1000 400 200 Crash (10/10) f/Hz Noise1 4000 2000 1000 −40 400 200 −50 −60 Squeal (6/10) Truck (7/10) −70 0 ROSAtalk - Dan Ellis 1 2 3 4 5 6 7 8 9 time/s dB 2000-10-13 - 21 Lab ROSA Outline 1 Sound ‘organization’ 2 Background & related work 3 Existing projects 4 Future projects - Meeting Recorder - Missing-data recognition & CASA for ASR - Structure from audio-video features - Speech & speaker recognition - Music organization - Audio archive structure discovery somewhat concrete provisional 5 ROSAtalk - Dan Ellis Summary & conclusions 2000-10-13 - 22 Lab ROSA Meeting recorder (with ICSI, UW, SRI, IBM) • Microphones in conventional meetings - for transcription/summarization/retrieval - informal, overlapped speech • Data collection (ICSI and ...): • Research: ASR, nonspeech, organization - unprecedented data, new applications ROSAtalk - Dan Ellis 2000-10-13 - 23 Lab ROSA Missing data recognition & CASA (with Barker, Cooke, Green/Sheffield) • Missing-data recognition - integrate across ‘don’t-know’ values - ‘perfect’ mask → excellent performance in noise • Multi-source decoder - Viterbi search of sound-fragment interpretations "1754" + noise Mask based on stationary noise estimate Mask split into subbands `Grouping' applied within bands: Spectro-Temporal Proximity Common Onset/Offset Multisource Decoder • ROSAtalk - Dan Ellis "1754" CASA for masks/fragments - larger fragments → quicker search 2000-10-13 - 24 Lab ROSA Structure from audio-video features (Peng Xu) • HMM modeling of sports video inter replay start middle end • Distribution of camera motion labels, color features - also need within-state sequential structure • Add features from audio - could be orthogonal/complementary • Audio feature toolkit? - simple feature vectors, boundaries, classes - wealth of potential applications! ROSAtalk - Dan Ellis 2000-10-13 - 25 Lab ROSA Speech & speaker recognition • Words are not enough; Confidence-tagged alternate word hypotheses • Other useful information: - speaker change detection - speaker characterization - phrasing & timing - prosodic cues to dialog state - laughter, pauses, etc. • Integration with other analyses - segmentation for adaptation - nonspeech events to ignore - video-derived information... ROSAtalk - Dan Ellis 2000-10-13 - 26 Lab ROSA Music organization • Music is a special case - lots of structure - highly significant • Trick is to find meaningful, tractable questions - boundary between speech and music? • New (counter-intuitive) approaches? - perceive as whole, not by voice (Scheirer’00) → global features for chord structures - generic ‘event’ cues + local feature classification - more provisional notion of instruments/voices ROSAtalk - Dan Ellis 2000-10-13 - 27 Lab ROSA CASA for audio retrieval • Muscle Fish system uses global features: Segment feature analysis Sound segment database Feature vectors Seach/ comparison Results Segment feature analysis Query example • Mixtures → need elements & objects: Generic element analysis Continuous audio archive Object formation Objects + properties Element representations Generic element analysis Seach/ comparison Results (Object formation) Query example rushing water Symbolic query ROSAtalk - Dan Ellis Word-to-class mapping Properties alone features calculated on grouped subsets 2000-10-13 - 28 Lab ROSA Audio archive structure discovery • What can you do with a large unlabeled training set (e.g. multimedia clips from the web)? - bootstrap learning: look for common patterns - have to learn generalizations in parallel: e.g. self-organizing maps, EM HMMs - post-filtering by humans may find ‘meaning’ in clusters • Associated text annotations provide a very small amount of labeling - .. but for a very large number of examples – sufficient to obtain purchase? - maximize label utility through NLP-type operations (expansion, disambiguation etc.) - goal is automatic term-to-feature mapping for term-based content queries ROSAtalk - Dan Ellis 2000-10-13 - 29 Lab ROSA Outline 1 Sound ‘organization’ 2 Background & related work 3 Existing projects 4 Future projects 5 Summary & conclusions ROSAtalk - Dan Ellis 2000-10-13 - 30 Lab ROSA DOMAINS Summary • Meetings • Personal recordings • Location monitoring • Broadcast • Movies • Lectures ROSA • Object-based structure discovery & learning APPLICATIONS • Speech recognition • Speech characterization • Nonspeech recognition ROSAtalk - Dan Ellis • • • • • • Scene analysis • Audio-visual integration • Music analysis Structuring Search Summarization Awareness Understanding 2000-10-13 - 31 Lab ROSA Conclusions • Sound is more than just speech! - speech is a special case - most auditory perceivers don’t understand speech • Object-based analysis is critical - it’s what people do - the world presents acoustic mixtures • Whole-scene representation is the way - it’s what people do - provides mutual constraints of overlap • Broad range of approaches for a broad range of phenomena ROSAtalk - Dan Ellis 2000-10-13 - 32 Lab ROSA