Recognition & Organization of Speech and Audio Dan Ellis Electrical Engineering, Columbia University <dpwe@ee.columbia.edu> http://www.ee.columbia.edu/~dpwe/ Outline 1 Introducing LabROSA 2 Robust speech recognition 3 General audio analysis 4 Summary ROSAtalk - Dan Ellis 2001-06-21 - 1 Lab ROSA 1 Organization of sound mixtures frq/Hz 4000 2000 0 0 2 4 6 8 10 12 time/s Analysis Voice (evil) Voice (pleasant) Stab Rumble • Choir Strings Core operation: Converting continuous, scalar signal into discrete, symbolic representation ROSAtalk - Dan Ellis 2001-06-21 - 2 Lab ROSA Positioning sound organization Speech recognition Image/video analysis Sound organization Audio processing Comp. Aud. Scene Analysis Psychoacoustics & physiology Signal processing Pattern recognition Machine learning Artificial intelligence • Draws on many techniques • Abuts/overlaps various areas ROSAtalk - Dan Ellis 2001-06-21 - 3 Lab ROSA About auditory perception • Received waveform is a mixture - two sensors, N signals ... - need knowledge-based constraints • Psychoacoustics: the study of human sound organization - ‘auditory scene analysis’ (Bregman’90) • Auditory perception is ecologically grounded - scene analysis is preconscious (→ illusions) - perceived organization: real-world objects + events (transient) - subjective not canonical (ambiguity) ROSAtalk - Dan Ellis 2001-06-21 - 4 Lab ROSA Key themes for LabROSA http://www.ee.columbia.edu/~dpwe/LabROSA/ • Sound organization: construct hierarchy - at an instant (sources) - along time (segmentation) • Scene analysis - find attributes according to objects - use attributes to form objects - ... plus constraints of knowledge • Exploiting large data sets (the ASR lesson) - supervised/labeled: pattern recognition - unsupervised: structure discovery, clustering • Special cases: - speech recognition - other source-specific recognizers • ... within a ‘complete explanation’ ROSAtalk - Dan Ellis 2001-06-21 - 5 Lab ROSA Outline 1 Introducing LabROSA 2 Robust speech recognition - ASR overview - Tandem modeling - Missing data and multisource decoding 3 General audio analysis 4 Summary ROSAtalk - Dan Ellis 2001-06-21 - 6 Lab ROSA Automatic Speech Recognition (ASR) • Standard speech recognition structure: sound Feature calculation D AT A feature vectors Acoustic model parameters Acoustic classifier Word models s ah phone probabilities t HMM decoder Language model p("sat"|"the","cat") p("saw"|"the","cat") phone / word sequence Understanding/ application... • ‘State of the art’ word-error rates (WERs): - 2% (dictation) - 30% (telephone conversations) • Can use multiple streams... ROSAtalk - Dan Ellis 2001-06-21 - 7 Lab ROSA Tandem speech recognition (with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI) • Neural net estimates phone posteriors; but Gaussian mixtures model finer detail • Combine them! Hybrid Connectionist-HMM ASR Feature calculation Conventional ASR (HTK) Noway decoder Neural net classifier Feature calculation HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t s ah t tn+w Input sound Words Phone probabilities Speech features Input sound Subword likelihoods Speech features Tandem modeling Feature calculation Neural net classifier HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t tn+w Input sound • Speech features Phone probabilities Subword likelihoods Words Train net, then train GMM on net output - GMM is ignorant of net output ‘meaning’ ROSAtalk - Dan Ellis 2001-06-21 - 8 Lab ROSA Words Tandem system results • It works very well (‘Aurora’ noisy digits): WER as a function of SNR for various Aurora99 systems 100 WER / % (log scale) 50 20 Average WER ratio to baseline: HTK GMM: 100% Hybrid: 84.6% Tandem: 64.5% Tandem + PC: 47.2% 10 5 HTK GMM baseline Hybrid connectionist Tandem Tandem + PC 2 1 clean -20 System-features -15 -10 -5 SNR / dB (averaged over 4 noises) 0 5 Avg. WER 20-0 dB Baseline WER ratio HTK-mfcc 13.7% 100% Neural net-mfcc 9.3% 84.5% Tandem-mfcc 7.4% 64.5% Tandem-msg+plp 6.4% 47.2% ROSAtalk - Dan Ellis 2001-06-21 - 9 Lab ROSA Relative contributions • Approx relative impact on baseline WER ratio for different component: Combo over msg: +20% plp Neural net classifier h# pcl bcl tcl dcl C0 C1 C2 Ck tn tn+w Input sound Pre-nonlinearity over + posteriors: +12% PCA orthog'n HTK decoder Gauss mix models s msg Neural net classifier h# pcl bcl tcl dcl C0 C1 C2 Ck KLT over direct: +8% ah Combo-into-HTK over Combo-into-noway: +15% t Words tn Combo over plp: +20% tn+w Combo over mfcc: +25% NN over HTK: +15% Tandem over HTK: +35% Tandem over hybrid: +25% Tandem combo over HTK mfcc baseline: +53% ROSAtalk - Dan Ellis 2001-06-21 - 10 Lab ROSA Inside Tandem systems: What’s going on? • Visualizations of the net outputs Clean 3 3 2 2 1 1 0 0 10 10 5 5 0 0 q h# ax uw ow ao ah ay ey eh ih iy w r n v th f z s kcl tcl k t q h# ax uw ow ao ah ay ey eh ih iy w r n v th f z s kcl tcl k t q h# ax uw ow ao ah ay ey eh ih iy w r n v th f z s kcl tcl k t q h# ax uw ow ao ah ay ey eh ih iy w r n v th f z s kcl tcl k t freq / kHz 4 phone Spectrogram 5dB SNR to ‘Hall’ noise 4 phone “one eight three” (MFP_183A) 10 0 -10 -20 -30 -40 dB Cepstral-smoothed mel spectrum Phone posterior estimates Hidden layer linear outputs freq / mel chan 7 5 4 0 • 6 0.2 0.4 0.6 time / s 0.8 1 3 1 0.5 0 20 10 0 -10 0 0.2 0.4 0.6 time / s 0.8 1 -20 Neural net normalizes away noise ROSAtalk - Dan Ellis 2001-06-21 - 11 Lab ROSA Tandem for large vocabulary recognition • CI Tandem front end + CD LVCSR back end PLP feature calculation Neural net classifier 1 h# pcl bcl tcl dcl C0 C1 Tandem feature calculation SPHINX recognizer C2 Ck tn tn+w + Pre-nonlinearity outputs Input sound MSG feature calculation Subword likelihoods Neural net classifier 2 C1 s ah t Words MLLR adaptation h# pcl bcl tcl dcl C0 HMM decoder GMM classifier PCA decorrelation C2 Ck tn tn+w • Tandem benefits reduced: 75 mfc plp tandem1 tandem2 70 65 WER / % 60 55 50 45 40 35 30 ROSAtalk - Dan Ellis Context Independent Context Dependent Context Dep. +MLLR 2001-06-21 - 12 Lab ROSA ‘Tandem-domain’ processing (with Manuel Reyes) • Can we improve the ‘tandem’ features with conventional processing (deltas, normalization)? Neural net ... Norm / deltas ? h# pcl bcl tcl dcl C0 C1 C2 Ck tn PCA decorrelation Rank reduce ? Norm / deltas ? GMM classifier ... tn+w • Somewhat.. Processing Avg. WER 20-0 dB Baseline WER ratio Tandem PLP mismatch baseline (24 els) 11.1% 70.3% Rank reduce @ 18 els 11.8% 77.1% Delta → PCA 9.7% 60.8% PCA → Norm 9.0% 58.8% Delta → PCA → Norm 8.3% 53.6% ROSAtalk - Dan Ellis 2001-06-21 - 13 Lab ROSA Missing data recognition (with Cooke, Green, Barker @ Sheffield) • Noisy training seems to miss the point - rather have single ‘clean’ models • Use missing feature theory... - integrate over missing data dimensions xm p( x q) = ∫ p ( xm x g, q ) p ( x g q ) dx m - trick is finding good/bad data mask - soft classification improves "1754" + noise 90 AURORA 2000 - Test Set A 80 Missing-data Recognition WER 70 MD Discrete SNR MD Soft SNR HTK clean training HTK multi-condition 60 50 40 "1754" Mask based on stationary noise estimate 30 20 10 0 -5 ROSAtalk - Dan Ellis 0 5 10 15 2001-06-21 - 14 20 Lab ROSA Clean Multi-source decoding (Jon Barker @ Sheffield) • Search of sound-fragment interpretations "1754" + noise Mask based on stationary noise estimate Mask split into subbands `Grouping' applied within bands: Spectro-Temporal Proximity Common Onset/Offset Multisource Decoder • "1754" CASA for masks/fragments - larger fragments → quicker search ROSAtalk - Dan Ellis 2001-06-21 - 15 Lab ROSA Outline 1 Introducing LabROSA 2 Robust speech recognition 3 General audio analysis - Alarm sound detection - Computational Auditory Scene Analysis - Music analysis - The Meeting Recorder project 4 Summary ROSAtalk - Dan Ellis 2001-06-21 - 16 Lab ROSA freq / Hz Alarm sound detection • Alarm sounds have particular structure - people ‘know them when they hear them’ - build a generic detector? • Isolate alarms in sound mixtures 5000 5000 5000 4000 4000 4000 3000 3000 3000 2000 2000 2000 1000 1000 1000 0 1 1.5 - 2 2.5 0 1 1.5 2 time / sec 2.5 0 1 1 1.5 2 2.5 representation of energy in time-frequency formation of atomic elements grouping by common properties (onset &c.) classify by attributes... ROSAtalk - Dan Ellis 2001-06-21 - 17 Lab ROSA Computational Auditory Scene Analysis (CASA) • input mixture Implement psychoacoustic theory? (Brown’92) signal features Front end (maps) Object formation discrete objects Source groups Grouping rules freq onset time period frq.mod - what are the features? how are they used? • Additional ‘knowledge’ needed (Klassner’96) 3760 Hz Buzzer-Alarm 2540 Hz 2230 Hz 2350 Hz Glass-Clink 1675 Hz 1475 Hz 950 Hz 500 Hz 460 Hz 420 Hz Phone-Ring 1.0 Siren-Chirp 2.0 TIME ROSAtalk - Dan Ellis 3.0 4.0 sec 1.0 2.0 3.0 4.0 sec TIME 2001-06-21 - 18 Lab ROSA Prediction-driven CASA • Data-driven (bottom-up) fails for noisy, ambiguous sounds (most mixtures!) • Need top-down constraints: hypotheses Noise components Hypothesis management prediction errors input mixture Front end signal features Compare & reconcile Periodic components Predict & combine predicted features - fit vocabulary of generic elements to sound ... bottom of a hierarchy? - account for entire scene - driven by prediction failures - pursue alternative hypotheses ROSAtalk - Dan Ellis 2001-06-21 - 19 Lab ROSA PDCASA example f/Hz City 4000 2000 1000 400 200 1000 400 200 100 50 0 f/Hz 1 2 3 Wefts1−4 4 5 Weft5 6 7 Wefts6,7 8 Weft8 9 Wefts9−12 4000 2000 1000 400 200 1000 400 200 100 50 Horn1 (10/10) Horn2 (5/10) Horn3 (5/10) Horn4 (8/10) Horn5 (10/10) f/Hz Noise2,Click1 4000 2000 1000 400 200 Crash (10/10) f/Hz Noise1 4000 2000 1000 −40 400 200 −50 −60 Squeal (6/10) Truck (7/10) −70 0 1 ROSAtalk - Dan Ellis 2 3 4 5 6 7 8 9 time/s dB 2001-06-21 - 20 Lab ROSA Music analysis: Structure recovery (with Rob Turetsky) • Structure recovery by similarity matrices (after Foote) - similarity distance measure? - segmentation & repetition structure - interpretation at different scales: notes, phrases, movements - incorporating musical knowledge: ‘theme similarity’ ROSAtalk - Dan Ellis 2001-06-21 - 21 Lab ROSA Music analysis: Lyrics extraction (with Adam Berenzweig) • Vocal content is highly salient, useful for retrieval • Can we find the singing? Use an ASR classifier: freq / kHz music (no vocals #1) singing (vocals #17 + 10.5s) 4 2 0 singing phone class Posteriogram Spectrogram speech (trnset #58) 0 1 2 time / sec 0 1 2 time / sec 0 1 2 time / sec • Frame error rate ~20% for segmentation based on posterior-feature statistics • Lyric segmentation + transcribed lyrics → training data for lyrics ASR... ROSAtalk - Dan Ellis 2001-06-21 - 22 Lab ROSA The Meeting Recorder project (with ICSI, UW, SRI, IBM) • Microphones in conventional meetings - for summarization/retrieval/behavior analysis - informal, overlapped speech • Data collection (ICSI, UW, ...): - 100 hours collected, ongoing transcription - headsets + tabletop + ‘PDA’ ROSAtalk - Dan Ellis 2001-06-21 - 23 Lab ROSA Meeting recorder: Difficult data Spkr A speaker active speaker B cedes floor Spkr B Spkr C interruptions Spkr D breath noise Spkr E crosstalk level/dB 40 Table top 20 0 120 • 125 130 135 140 145 150 155 time / secs Cross-correlation gives speaker turns, motion participant movement mr-2000-11-02-1440 Spkr C freq / chan 20 15 10 Coupling C→A delay / samples 5 20 15 10 5 Coupling C → Tabletop delay / samples 20 15 10 5 1020 ROSAtalk - Dan Ellis 1020.5 1021 1021.5 1022 1022.5 1023 1023.5 1024 1024.5 time / sec 2001-06-21 - 24 Lab ROSA Outline 1 Introducing LabROSA 2 Robust speech recognition 3 General audio analysis 4 Summary - Some future project ideas - LabROSA Summary ROSAtalk - Dan Ellis 2001-06-21 - 25 Lab ROSA Future: Audio Information Retrieval • Searching in a database of audio - speech .. use ASR - text annotations .. search them - sound effects library? • e.g. Muscle Fish “SoundFisher” browser - define multiple ‘perceptual’ feature dimensions - search by proximity in (weighted) feature space Segment feature analysis Sound segment database Feature vectors Seach/ comparison Results Segment feature analysis Query example - features are ‘global’ for each soundfile, no attempt to separate mixtures ROSAtalk - Dan Ellis 2001-06-21 - 26 Lab ROSA CASA for audio retrieval • When audio material contains mixtures, global features are insufficient • Retrieval based on element/object analysis: Generic element analysis Continuous audio archive Object formation Objects + properties Element representations Generic element analysis Seach/ comparison Results (Object formation) Query example rushing water Symbolic query Word-to-class mapping Properties alone - features are calculated over grouped subsets ROSAtalk - Dan Ellis 2001-06-21 - 27 Lab ROSA Future: ‘Machine listener’ • Goal: Unsupervised structure discovery Broadcast receiver Audio Acoustic event extraction Segment features Unsupervised clustering Class definitions Event templates Interactive refinement • What can you do with a large unlabeled training set (e.g. broadcast)? - bootstrap learning: look for common patterns - have to learn generalizations in parallel: e.g. self-organizing maps, EM HMMs - post-filtering by humans may find ‘meaning’ in clusters ROSAtalk - Dan Ellis 2001-06-21 - 28 Lab ROSA Audio-video-text content analysis (with Shih-Fu Chang, Kathleen McKeown) Audio/Video material Scene alignment and segmentation A/V object-based feature extraction Clustering and structure discovery • Question answering • Browsing • Search • Summary Multimedia Lexicon Term extraction Text sources Interactive refinement • Audio and video provide complementary info - correlate object features to define templates? • Associated text annotations provide a very small amount of labeling - .. but for a very large number of examples – sufficient to obtain purchase? - build a ‘multimedia lexicon’ for questionanswering ROSAtalk - Dan Ellis 2001-06-21 - 29 Lab ROSA Summary: Applications for sound organization What do people do with their ears? • Human-computer interface - .. includes knowing when (& why) you’ve failed • Robots - intelligence requires perceptual awareness - Sony’s AIBO: dog-hearing • Archive indexing & retrieval - pure audio archives - true multimedia content analysis • Content ‘understanding’ - intelligent classification & summarization • Autonomous monitoring • ‘Structure discovery’ algorithms ROSAtalk - Dan Ellis 2001-06-21 - 30 Lab ROSA DOMAINS LabROSA Summary • Meetings • Personal recordings • Location monitoring • Broadcast • Movies • Lectures ROSA • Object-based structure discovery & learning APPLICATIONS • Speech recognition • Speech characterization • Nonspeech recognition • • • • • ROSAtalk - Dan Ellis • Scene analysis • Audio-visual integration • Music analysis Structuring Search Summarization Awareness Understanding 2001-06-21 - 31 Lab ROSA