Recognition and Organization of Speech and Audio

Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/

Outline
1 Audio Organization
2 Speech, music, and other
3 Future work

Dan Ellis    Advent Overview    2002-09-20

1 Audio Organization

[Figure: spectrogram of a sound mixture (freq / Hz vs. time / s, level / dB) and its analysis into labeled objects: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]

• Analyzing and describing complex sounds:
  - continuous sound mixture → distinct objects & events
• Human listeners as the prototype
  - strong subjective impression when listening
  - ..but hard to ‘see’ in signal

Human Sound Organization

• Sound percepts depend on ‘grouping cues’
  - common onset across frequency
  - common periodicity (fundamental)
  - spatial (binaural) cues, familiarity, ...

[Figure: spectrogram, freq / Hz vs. time / s]

• Hearing confers evolutionary advantage
  - optimized to get ‘useful’ information from sound
• Auditory perception is ecologically grounded
  - scene analysis is preconscious (→ illusions)

2 Speech & Speaker recognition
(Patricia Scanlon)

• Mutual Information identifies where the information is in time/frequency:
  - little temporal structure averaged over all sounds
  - better with just vowels

[Figure: mutual-information panels labeled "Phoneme info", "Phoneme info" (just vowels), and "Speaker info"]

Speech Fragment recognition
(Barker & Cooke/Sheffield)

• Standard classification chooses between models M to match source features X

$$ M^* = \arg\max_M P(M \mid X) = \arg\max_M P(X \mid M) \cdot \frac{P(M)}{P(X)} $$

• Mixtures → observed features Y, segregation S, all related by $P(X \mid Y, S)$

[Figure: Observation Y(f), Source X(f), and Segregation S plotted against freq]

  - spectral features allow clean relationship
• Joint classification of model and segregation:

$$ P(M, S \mid Y) = P(M) \int P(X \mid M) \cdot \frac{P(X \mid Y, S)}{P(X)} \, dX \cdot P(S \mid Y) $$

The Meeting Recorder Project
(CompSci, ICSI, UW, IDIAP, SRI, IBM)

• Microphones in conventional meetings
  - for summarization/retrieval/behavior analysis
  - informal, overlapped speech
• Data collection (ICSI, UW, IDIAP):
  - 100 hours collected, ongoing transcription
• NSF ‘Mapping Meetings’ project
  - also interest from NIST, DARPA, EU

Tabletop mics: Turn detection
(Huan Wei Hee, Jerry Liu)

• 4 mics ~ 1m separated along center of table
  - 3 timing differences
  - slight U/D offset to disambiguate
• Hi-res cross-correlation for timings
  - use normalized peak value for confidence
  - cluster results

[Figure: two panels, "mr-2000-11-02-1440: PZM xcorr lags" and "Example cross coupling response, chan3 to chan0"; axis labels: lag 1-2 / ms, lag 3-4 / ms, time / s, skew/samps, 100xR]

Visualization: transPlotter
(Jerry Liu)

• Speaker turn patterns are informative
• Browser for ‘high-level’ view, quick examination
  - snack, iwidgets based

Music analysis: Structure recovery
(Rob Turetsky, Alex Sheh)

• Identify & match segment-level structure in music
  - Foote similarity matrices point to repeated sections
  - Chord segmentation using ASR ‘hidden state’ model (train with chord transcriptions)
  - Note transcription/timing by aligning audio to resynthesized MIDI versions

Music Similarity & Recommendation
(Adam Berenzweig, Lawrence/NECI)

Sound texture modeling
(Marios Athineos)

• Sound textures are important but neglected
  - fire, rain, paper: no clear pitch, onsets, shape
  - typically modeled with filtered white noise
• LPC on spectrum captures temporal structure:
  - better parameterization of texture structure?
  - generate variations?

Audio Information Retrieval
(Manuel Reyes)

• Searching in a database of audio
  - speech .. use ASR
  - text annotations .. search them
  - sound effects library?
• e.g. Muscle Fish “SoundFisher” browser
  - define multiple ‘perceptual’ feature dimensions
  - search by proximity in (weighted) feature space

[Diagram: Sound segment database → Segment feature analysis → Feature vectors → Search/comparison → Results; Query example → Segment feature analysis → Search/comparison]

  - features are ‘global’ for each soundfile, no attempt to separate mixtures

Alarm sound detection

• Alarm sounds have particular structure
  - people ‘know them when they hear them’
• Isolate alarms in sound mixtures

[Figure: three spectrogram panels, freq / Hz vs. time / sec]

  - sinusoid peaks have invariant properties

[Figure: "Speech + alarm" spectrogram (freq / Hz) with Pr(alarm) plotted against time / sec]

  - cepstral coefficients are easy to model

3 Future work: Automatic audio-video analysis
(Shih-Fu Chang, Kathy McKeown)

• Documentary archive management
  - huge ratio of raw-to-finished material
  - costly manual logging
• Problem: term ↔ signal mapping
  - training corpus of past annotations
  - interactive semi-automatic learning

[Diagram: MM Concept Network. Blocks: annotations, text processing, terms; A/V data, A/V segmentation and feature extraction, A/V features; concept discovery; A/V feature unsupervised clustering; generic detectors; Multimedia Fusion: Mining Concepts & Relationships; Complex Spatio-Temporal Classification; Question Answering]

The ‘Machine listener’

• Goal: An auditory system for machines
  - use same environmental information as people
• Signal understanding
  - monitor for particular sounds
  - real-time description
• Scenarios
  - personal listener → summary of your day
  - future prosthetic hearing device
  - autonomous robots

LabROSA Summary

DOMAINS
• Meetings
• Personal recordings
• Location monitoring
• Broadcast
• Movies
• Lectures

ROSA
• Object-based structure discovery & learning
• Speech recognition
• Speech characterization
• Nonspeech recognition
• Scene analysis
• Audio-visual integration
• Music analysis

APPLICATIONS
• Structuring
• Search
• Summarization
• Awareness
• Understanding
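
Appendix sketch: timing differences by cross-correlation

The turn-detection slide mentions "hi-res cross-correlation for timings" with the "normalized peak value for confidence". A minimal sketch of that idea follows, under stated assumptions: the function name `estimate_lag`, the use of NumPy, and the parabolic interpolation used for sub-sample resolution are illustrative choices, not the method actually used in the Meeting Recorder work (which the slides do not specify).

```python
import numpy as np

def estimate_lag(ref, sig, fs):
    """Estimate the delay of `sig` relative to `ref`.

    Returns (lag_seconds, confidence). A positive lag means `sig` arrives
    later than `ref`. Confidence is the normalized cross-correlation peak
    (close to 1 for a clean delayed copy). Hypothetical helper for
    illustration only.
    """
    ref = np.asarray(ref, dtype=float) - np.mean(ref)
    sig = np.asarray(sig, dtype=float) - np.mean(sig)

    # Normalize by the signal energies so the peak is a correlation coefficient.
    norm = np.sqrt(np.sum(ref ** 2) * np.sum(sig ** 2))
    if norm == 0.0:
        return 0.0, 0.0
    xc = np.correlate(sig, ref, mode="full") / norm

    # Index k of the "full" output corresponds to lag k - (len(ref) - 1).
    k = int(np.argmax(xc))
    peak = float(xc[k])

    # Assumed refinement: fit a parabola through the peak and its two
    # neighbours to get a sub-sample ("hi-res") lag estimate.
    frac = 0.0
    if 0 < k < len(xc) - 1:
        a, b, c = xc[k - 1], xc[k], xc[k + 1]
        denom = a - 2.0 * b + c
        if denom != 0.0:
            frac = 0.5 * (a - c) / denom

    lag_samples = (k - (len(ref) - 1)) + frac
    return lag_samples / fs, peak

# Synthetic check: white noise and a copy delayed by 5 samples.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(2000)
sig = np.concatenate([np.zeros(5), ref[:-5]])
lag, conf = estimate_lag(ref, sig, fs)
print(f"lag = {lag * 1e3:.3f} ms, confidence = {conf:.3f}")
```

With four microphones, calling such a function on each adjacent pair would give the three timing differences listed on the slide; low-confidence frames could be dropped before clustering the results.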