Recognition & Organization of Speech and Audio Dan Ellis Electrical Engineering, Columbia University <dpwe@ee.columbia.edu> http://www.ee.columbia.edu/~dpwe/ Outline 1 Introducing LabROSA 2 Tandem modeling for robust ASR 3 Other current projects 4 Future projects 5 Summary & conclusions ROSAtalk - Dan Ellis 2001-01-11 - 1 Lab ROSA 1 Organization of sound mixtures frq/Hz 4000 2000 0 0 2 4 6 8 10 12 time/s Analysis Voice (evil) Voice (pleasant) Stab Rumble • ROSAtalk - Dan Ellis Choir Strings Core operation: Converting continuous, scalar signal into discrete, symbolic representation 2001-01-11 - 2 Lab ROSA Positioning sound organization Speech recognition Image/video analysis Sound organization Audio processing Comp. Aud. Scene Analysis Psychoacoustics & physiology Signal processing Pattern recognition Machine learning Artificial intelligence • Draws on many techniques • Abuts/overlaps various areas ROSAtalk - Dan Ellis 2001-01-11 - 3 Lab ROSA About auditory perception • Received waveform is a mixture - two sensors, N signals ... - need knowledge-based constraints • Psychoacoustics: the study of human sound organization - ‘auditory scene analysis’ (Bregman’90) • Auditory perception is ecologically grounded - scene analysis is preconscious (→ illusions) - perceived organization: real-world objects + events (transient) - subjective not canonical (ambiguity) ROSAtalk - Dan Ellis 2001-01-11 - 4 Lab ROSA Key themes for LabROSA http://www.ee.columbia.edu/~dpwe/LabROSA/ • Sound organization: construct hierarchy - at an instant (sources) - along time (segmentation) • Scene analysis - find attributes according to objects - use attributes to form objects - ... plus constraints of knowledge • Exploiting large data sets (the ASR lesson) - supervised/labeled: pattern recognition - unsupervised: structure discovery, clustering • Special cases: - speech recognition - other source-specific recognizers • ... within a ‘complete explanation’ ROSAtalk - Dan Ellis 2001-01-11 - 5 Lab ROSA Outline 1 Introducing LabROSA 2 Tandem modeling for robust ASR - ASR overview - Tandem modeling - Investigating the benefits 3 Other current projects 4 Future projects 5 Summary & conclusions ROSAtalk - Dan Ellis 2001-01-11 - 6 Lab ROSA Automatic Speech Recognition (ASR) • Standard speech recognition structure: sound Feature calculation D AT A feature vectors Acoustic model parameters Word models s ah t Language model p("sat"|"the","cat") p("saw"|"the","cat") Acoustic classifier phone probabilities HMM decoder phone / word sequence Understanding/ application... • ‘State of the art’ word-error rates (WERs): - 2% (dictation) - 30% (telephone conversations) • Can use multiple streams... ROSAtalk - Dan Ellis 2001-01-11 - 7 Lab ROSA Tandem speech recognition (with Hermansky, Sharma & Sivadas/OGI, Singh/CMU) • Neural net estimates phone posteriors; but Gaussian mixtures model finer detail • Combine them! Hybrid Connectionist-HMM ASR Feature calculation Conventional ASR (HTK) Noway decoder Neural net classifier Feature calculation HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t s ah t tn+w Input sound Words Phone probabilities Speech features Input sound Subword likelihoods Speech features Tandem modeling Feature calculation Neural net classifier HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t tn+w Input sound • ROSAtalk - Dan Ellis Speech features Phone probabilities Subword likelihoods Words Train net, then train GMM on net output - GMM is ignorant of net output ‘meaning’ 2001-01-11 - 8 Lab ROSA Words Tandem system results • It works very well (‘Aurora’ noisy digits): WER as a function of SNR for various Aurora99 systems 100 WER / % (log scale) 50 20 Average WER ratio to baseline: HTK GMM: 100% Hybrid: 84.6% Tandem: 64.5% Tandem + PC: 47.2% 10 5 HTK GMM baseline Hybrid connectionist Tandem Tandem + PC 2 1 clean -20 System-features -15 -10 -5 SNR / dB (averaged over 4 noises) 0 5 Avg. WER 20-0 dB Baseline WER ratio HTK-mfcc 13.7% 100% Neural net-mfcc 9.3% 84.5% Tandem-mfcc 7.4% 64.5% Tandem-msg+plp 6.4% 47.2% ROSAtalk - Dan Ellis 2001-01-11 - 9 Lab ROSA Relative contributions • Approx relative impact on baseline WER ratio for different component: Combo over msg: +20% plp Neural net classifier h# pcl bcl tcl dcl C0 C1 C2 Ck tn tn+w Input sound Pre-nonlinearity over + posteriors: +12% PCA orthog'n HTK decoder Gauss mix models s msg Neural net classifier h# pcl bcl tcl dcl C0 C1 C2 Ck KLT over direct: +8% ah Combo-into-HTK over Combo-into-noway: +15% t Words tn Combo over plp: +20% tn+w Combo over mfcc: +25% NN over HTK: +15% Tandem over HTK: +35% Tandem over hybrid: +25% Tandem combo over HTK mfcc baseline: +53% ROSAtalk - Dan Ellis 2001-01-11 - 10 Lab ROSA Inside Tandem systems: What’s going on? • Visualizations of the net outputs Clean 3 3 2 2 1 1 0 0 10 10 5 5 0 0 q h# ax uw ow ao ah ay ey eh ih iy w r n v th f z s kcl tcl k t q h# ax uw ow ao ah ay ey eh ih iy w r n v th f z s kcl tcl k t q h# ax uw ow ao ah ay ey eh ih iy w r n v th f z s kcl tcl k t q h# ax uw ow ao ah ay ey eh ih iy w r n v th f z s kcl tcl k t freq / kHz 4 phone Spectrogram 5dB SNR to ‘Hall’ noise 4 phone “one eight three” (MFP_183A) 10 0 -10 -20 -30 -40 dB Cepstral-smoothed mel spectrum Phone posterior estimates Hidden layer linear outputs freq / mel chan 7 ROSAtalk - Dan Ellis 5 4 0 • 6 0.2 0.4 0.6 time / s 0.8 1 3 1 0.5 0 20 10 0 -10 0 0.2 0.4 0.6 time / s 0.8 1 -20 Neural net normalizes away noise 2001-01-11 - 11 Lab ROSA Tandem feature space ‘magnification’ • Neural net performs a nonlinear remapping of the feature space Smoothed spectrum 12 10 8 6 4 2 20 Linear outputs for /r/, /iy/, /w/ 10 0 Posteriors for /r/, /iy/, /w/ 1 0 0.7 0.75 0.8 time / s - small changes across critical boundaries result in large output changes ROSAtalk - Dan Ellis 2001-01-11 - 12 Lab ROSA Outline 1 Introducing LabROSA 2 Tandem modeling for robust ASR 3 Other current projects - Alarm sound detection - Computational Auditory Scene Analysis - Multi-source and missing-data recognition - The Meeting Recorder project 4 Future projects 5 Summary & conclusions ROSAtalk - Dan Ellis 2001-01-11 - 13 Lab ROSA freq / Hz Alarm sound detection • Alarm sounds have particular structure - people ‘know them when they hear them’ - build a generic detector? • Isolate alarms in sound mixtures 5000 5000 5000 4000 4000 4000 3000 3000 3000 2000 2000 2000 1000 1000 1000 0 1 1.5 ROSAtalk - Dan Ellis 2 2.5 0 1 1.5 2 time / sec 2.5 0 1 1 1.5 2 2.5 representation of energy in time-frequency formation of atomic elements grouping by common properties (onset &c.) classify by attributes... 2001-01-11 - 14 Lab ROSA Computational Auditory Scene Analysis (CASA) • input mixture Implement psychoacoustic theory? (Brown’92) Front end signal features (maps) Object formation discrete objects Source groups Grouping rules freq onset time period frq.mod - what are the features? how are they used? • Additional ‘knowledge’ needed (Klassner’96) 3760 Hz Buzzer-Alarm 2540 Hz 2230 Hz 2350 Hz Glass-Clink 1675 Hz 1475 Hz 950 Hz 500 Hz 460 Hz 420 Hz Phone-Ring 1.0 Siren-Chirp 2.0 TIME ROSAtalk - Dan Ellis 3.0 4.0 sec 1.0 2.0 3.0 4.0 sec TIME 2001-01-11 - 15 Lab ROSA Prediction-driven CASA • Data-driven (bottom-up) fails for noisy, ambiguous sounds (most mixtures!) • Need top-down constraints: hypotheses Noise components Hypothesis management prediction errors input mixture Front end signal features Compare & reconcile Periodic components Predict & combine predicted features - fit vocabulary of generic elements to sound ... bottom of a hierarchy? - account for entire scene - driven by prediction failures - pursue alternative hypotheses ROSAtalk - Dan Ellis 2001-01-11 - 16 Lab ROSA PDCASA example f/Hz City 4000 2000 1000 400 200 1000 400 200 100 50 0 f/Hz 1 2 3 Wefts1−4 4 5 Weft5 6 7 Wefts6,7 8 Weft8 9 Wefts9−12 4000 2000 1000 400 200 1000 400 200 100 50 Horn1 (10/10) Horn2 (5/10) Horn3 (5/10) Horn4 (8/10) Horn5 (10/10) f/Hz Noise2,Click1 4000 2000 1000 400 200 Crash (10/10) f/Hz Noise1 4000 2000 1000 −40 400 200 −50 −60 Squeal (6/10) Truck (7/10) −70 0 ROSAtalk - Dan Ellis 1 2 3 4 5 6 7 8 9 time/s dB 2001-01-11 - 17 Lab ROSA Missing data recognition & CASA (with Barker, Cooke, Green/Sheffield) • Missing-data recognition - integrate across ‘don’t-know’ values - ‘perfect’ mask → excellent performance in noise • Multi-source decoder - Viterbi search of sound-fragment interpretations "1754" + noise Mask based on stationary noise estimate Mask split into subbands `Grouping' applied within bands: Spectro-Temporal Proximity Common Onset/Offset Multisource Decoder • ROSAtalk - Dan Ellis "1754" CASA for masks/fragments - larger fragments → quicker search 2001-01-11 - 18 Lab ROSA Meeting recorder (with ICSI, UW, SRI, IBM) • Microphones in conventional meetings - for transcription/summarization/retrieval - informal, overlapped speech • Data collection (ICSI and ...): - 10s of hours collected, ongoing - now being transcribed ROSAtalk - Dan Ellis 2001-01-11 - 19 Lab ROSA Meeting recorder: Research issues • Preliminary analysis - transcription & forced alignment - ground truth in turns/overlaps - preliminary distant-mic recordings • Research areas - meeting dialog: overlaps, turns etc. - language modeling for meetings - feature design for distant acoustics • Applications - information retrieval from meetings - ‘mapping’ meeting content - sociological analysis of meeting behavior ROSAtalk - Dan Ellis 2001-01-11 - 20 Lab ROSA Outline 1 Introducing LabROSA 2 Tandem modeling for robust ASR 4 Other current projects 4 Future projects - Audio Content-Based Retrieval - A ‘machine listener’ - Audio-video-text content analysis 5 Summary & conclusions ROSAtalk - Dan Ellis 2001-01-11 - 21 Lab ROSA Audio Information Retrieval • Searching in a database of audio - speech .. use ASR - text annotations .. search them - sound effects library? • e.g. Muscle Fish “SoundFisher” browser - define multiple ‘perceptual’ feature dimensions - search by proximity in (weighted) feature space Segment feature analysis Sound segment database Feature vectors Seach/ comparison Results Segment feature analysis Query example - features are ‘global’ for each soundfile, no attempt to separate mixtures ROSAtalk - Dan Ellis 2001-01-11 - 22 Lab ROSA CASA for audio retrieval • When audio material contains mixtures, global features are insufficient • Retrieval based on element/object analysis: Generic element analysis Continuous audio archive Object formation Objects + properties Element representations Generic element analysis Seach/ comparison Results (Object formation) Query example rushing water Symbolic query Word-to-class mapping Properties alone - features are calculated over grouped subsets ROSAtalk - Dan Ellis 2001-01-11 - 23 Lab ROSA A ‘machine listener’ • Goal: Unsupervised structure discovery Broadcast receiver Audio Acoustic event extraction Segment features Unsupervised clustering Class definitions Event templates Interactive refinement • ROSAtalk - Dan Ellis What can you do with a large unlabeled training set (e.g. broadcast)? - bootstrap learning: look for common patterns - have to learn generalizations in parallel: e.g. self-organizing maps, EM HMMs - post-filtering by humans may find ‘meaning’ in clusters 2001-01-11 - 24 Lab ROSA Audio-video-text content analysis (with Shih-Fu Chang, Kathleen McKeown) Audio/Video material Scene alignment and segmentation A/V object-based feature extraction Clustering and structure discovery • Question answering • Browsing • Search • Summary Multimedia Lexicon Term extraction Text sources Interactive refinement • Audio and video provide complementary info - correlate object features to define templates? • Associated text annotations provide a very small amount of labeling - .. but for a very large number of examples – sufficient to obtain purchase? - build a ‘multimedia lexicon’ for questionanswering ROSAtalk - Dan Ellis 2001-01-11 - 25 Lab ROSA Outline 1 Introducing LabROSA 2 Tandem modeling for robust ASR 4 Other current projects 4 Future projects 5 Summary & conclusions ROSAtalk - Dan Ellis 2001-01-11 - 26 Lab ROSA Applications for sound organization What do people do with their ears? • Human-computer interface - .. includes knowing when (& why) you’ve failed • Robots - intelligence requires perceptual awareness - Sony’s AIBO: dog-hearing • Archive indexing & retrieval - pure audio archives - true multimedia content analysis • Content ‘understanding’ - intelligent classification & summarization • Autonomous monitoring • Broader ‘structure discovery’ algorithms ROSAtalk - Dan Ellis 2001-01-11 - 27 Lab ROSA DOMAINS Summary • Meetings • Personal recordings • Location monitoring • Broadcast • Movies • Lectures ROSA • Object-based structure discovery & learning APPLICATIONS • Speech recognition • Speech characterization • Nonspeech recognition ROSAtalk - Dan Ellis • • • • • • Scene analysis • Audio-visual integration • Music analysis Structuring Search Summarization Awareness Understanding 2001-01-11 - 28 Lab ROSA Conclusions • New classification schemes for ASR - ... combining multiple approaches/sources • But sound is more than just speech! - speech is a special case - need to deal with the ‘other stuff’ • Object-based analysis - it’s what people do - the world presents acoustic mixtures • Whole-scene representation - it’s what people do - provides mutual constraints of overlap • Broad range of approaches for a broad range of phenomena ROSAtalk - Dan Ellis 2001-01-11 - 29 Lab ROSA