Recognition and Organization of Speech and Audio ROSA

Recognition and Organization of Speech and Audio Dan Ellis <dpwe@ee.columbia.edu> Laboratory for Recognition and Organization of Speech and Audio (LabROSA) Electrical Engineering, Columbia University http://labrosa.ee.columbia.edu/ Outline 1 Audio Organization 2 Speech, music, and other 3 Future work Dan Ellis Advent Overview 2002-09-20 - 1 Audio Organization 1 4000 frq/Hz 3000 0 2000 -20 1000 -40 0 0 2 4 6 8 10 12 time/s -60 level / dB Analysis Voice (evil) Voice (pleasant) Stab Rumble Choir Strings • Analyzing and describing complex sounds: - continuous sound mixture → distinct objects & events • Human listeners as the prototype - strong subjective impression when listening - ..but hard to ‘see’ in signal Dan Ellis Advent Overview 2002-09-20 - 2 Human Sound Organization freq / Hz • Sound percepts depend on ‘grouping cues’ - common onset across frequency - common periodicity (fundamental) - spatial (binaural) cues, familiarity, ... 8000 6000 4000 2000 0 0 1 2 3 4 5 6 7 8 9 time / s • Hearing confers evolutionary advantage - optimized to get ‘useful’ information from sound • Auditory perception is ecologically grounded - scene analysis is preconscious (→ illusions) Dan Ellis Advent Overview 2002-09-20 - 3 Speech & Speaker recognition 2 (Patricia Scanlon) • - Mutual Information identifies where the information is in time/ frequency: - little temporal structure averaged over all sounds Phoneme info Better with just vowels: Phoneme info Dan Ellis Speaker info Advent Overview 2002-09-20 - 4 Speech Fragment recognition (Barker & Cooke/Sheffield) • Standard classification chooses between models M to match source features X P( M ) M ∗ = argmax P ( M X ) = argmax P ( X M ) ⋅ -------------P( X ) M M • Mixtures → observed features Y, segregation S, all related by P ( X Y , S ) Observation Y(f ) Source X(f ) Segregation S freq - spectral features allow clean relationship • Joint classification of model and segregation: P( X Y , S ) P ( M , S Y ) = P ( M ) ∫ P ( X M ) ⋅ ------------------------- dX ⋅ P ( S Y ) P( X ) Dan Ellis Advent Overview 2002-09-20 - 5 The Meeting Recorder Project (CompSci, ICSI, UW, IDIAP, SRI, IBM) • Microphones in conventional meetings - for summarization/retrieval/behavior analysis - informal, overlapped speech • Data collection (ICSI, UW, IDIAP): - 100 hours collected, ongoing transcription • Dan Ellis NSF ‘Mapping Meetings’ project - also interest from NIST, DARPA, EU Advent Overview 2002-09-20 - 6 Tabletop mics: Turn detection (Huan Wei Hee, Jerry Liu) • 4 mics ~ 1m separated along center of table - 3 timing differences - slight U/D offset to disambiguate 5 4 3 2 1 0 -1 -2 -3 -4 -5 -2.5 • -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 Hi-res cross-correlation for timings - use normalized peak value for confidence - cluster results mr-2000-11-02-1440: PZM xcorr lags Example cross coupling response, chan3 to chan0 1 4 250 0 lag 3-4 / ms 100xR skew/samps 300 200 150 100 3 -2 50 0 -1 2 1 0 50 Dan Ellis 100 150 time / s 200 250 Advent Overview 300 -3 -3 -2 2002-09-20 - 7 -1 0 1 lag 1-2 / ms 2 3 Visualization: transPlotter (Jerry Liu) • Speaker turn patterns are informative • Browser for ‘high-level’ view, quick examination - snack, iwidgets based Dan Ellis Advent Overview 2002-09-20 - 8 Music analysis: Structure recovery (Rob Turetsky, Alex Sheh) • Identify & match segment-level structure in music - Foote similarity matrices point to repeated sections - Chord segmentation using ASR ‘hidden state’ model (train with chord transcriptions) - Note transcription/timing by aligning audio to resynthesized MIDI versions Dan Ellis Advent Overview 2002-09-20 - 9 Music Similarity & Recommendation (Adam Berenzweig, Lawrence/NECI) Dan Ellis Advent Overview 2002-09-20 - 10 Sound texture modeling (Marios Athineos) • Sound textures are important but neglected - fire, rain, paper: no clear pitch, onsets, shape - typically modeled with filtered white noise • LPC on spectrum captures temporal structure: - better parameterization of texture structure? 40 30 0 20 -5 10 -10 50 100 150 200 250 300 350 -15 -20 8 6 -25 4 -30 -100 40 2 50 100 150 200 250 300 -90 20 -80 -70 -60 0 350 - generate variations? Dan Ellis Advent Overview 2002-09-20 - 11 Audio Information Retrieval (Manuel Reyes) • Searching in a database of audio - speech .. use ASR - text annotations .. search them - sound effects library? • e.g. Muscle Fish “SoundFisher” browser - define multiple ‘perceptual’ feature dimensions - search by proximity in (weighted) feature space Segment feature analysis Sound segment database Feature vectors Seach/ comparison Segment feature analysis Query example - features are ‘global’ for each soundfile, no attempt to separate mixtures Dan Ellis Advent Overview 2002-09-20 - 12 Results freq / Hz Alarm sound detection • Alarm sounds have particular structure - people ‘know them when they hear them’ • Isolate alarms in sound mixtures 5000 5000 5000 4000 4000 4000 3000 3000 3000 2000 2000 2000 1000 1000 1000 0 1 1.5 2 2.5 0 1 1.5 2 time / sec 2.5 0 1 1 1.5 2 2.5 freq / Hz - sinusoid peaks have invariant properties Speech + alarm 4000 2000 Pr(alarm) 0 1 0.5 0 1 2 3 4 5 6 7 8 9 time / sec - cepstral coefficients are easy to model Dan Ellis Advent Overview 2002-09-20 - 13 2 Future work: Automatic audio-video analysis (Shih-Fu Chang, Kathy McKeown) • Documentary archive management - huge ratio of raw-to-finished material - costly manual logging • Problem: term ↔ signal mapping - training corpus of past annotations - interactive semi-automatic learning MM Concept Network annotations A/V data text processing A/V segmentation and feature extraction terms A/V features concept discovery A/V feature unsupervised clustering Multimedia Fusion: Mining Concepts & Relationships generic detectors Dan Ellis Advent Overview Complex SpatioTemporal Classification Question Answering 2002-09-20 - 14 The ‘Machine listener’ • Goal: An auditory system for machines - use same environmental information as people • Signal understanding - monitor for particular sounds - real-time description • Scenarios - personal listener → summary of your day - future prosthetic hearing device - autonomous robots Dan Ellis Advent Overview 2002-09-20 - 15 DOMAINS LabROSA Summary • Meetings • Personal recordings • Location monitoring • Broadcast • Movies • Lectures ROSA • Object-based structure discovery & learning APPLICATIONS • Speech recognition • Speech characterization • Nonspeech recognition • • • • • Dan Ellis • Scene analysis • Audio-visual integration • Music analysis Structuring Search Summarization Awareness Understanding Advent Overview 2002-09-20 - 16

Recognition and Organization of Speech and Audio ROSA

Related documents

Products

Support

Recognition and Organization of Speech and Audio ROSA

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib