Recognition & Organization of Speech & Audio
Dan Ellis
Electrical Engineering, Columbia University
<dpwe@ee.columbia.edu>  http://www.ee.columbia.edu/~dpwe/

Outline
1 Introducing LabROSA
2 Speech recognition & processing
3 Auditory Scene Analysis
4 Projects & applications
5 Summary

ROSAtalk @ NEC - Dan Ellis - 2001-08-17

Sound organization
[Figure: spectrogram of a 12 s sound mixture, analyzed into components: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Central operation:
- continuous sound mixture → distinct objects & events
• Perceptual impression is very strong
- but hard to 'see' in the signal

Bregman's lake
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)
• Received waveform is a mixture
- two sensors, N signals ...
• Disentangling mixtures is the primary goal
- a perfect solution is not possible
- need knowledge-based constraints

The information in sound
[Figure: spectrograms of two footstep sequences, "Steps 1" and "Steps 2"]
• A sense of hearing is evolutionarily useful
- gives organisms 'relevant' information
• Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
- special-purpose processing reflects 'natural scene' properties
- subjective, not canonical (ambiguity)

Key themes for LabROSA
http://labrosa.ee.columbia.edu/
• Sound organization: construct a hierarchy
- at an instant (sources)
- along time (segmentation)
• Scene analysis
- find attributes according to objects
- use attributes to form objects
- ...
plus constraints of knowledge
• Exploiting large data sets (the ASR lesson)
- supervised/labeled: pattern recognition
- unsupervised: structure discovery, clustering
• Special cases:
- speech recognition
- other source-specific recognizers
• ... within a 'complete explanation'

Outline
1 Introducing LabROSA
2 Speech recognition & processing
- Connectionist and tandem recognition
- Speech and speaker detection
- Musical information extraction
3 Auditory Scene Analysis
4 Projects & applications
5 Summary

Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters; word models, e.g. "s ah t") → phone probabilities → HMM decoder (language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application...
• 'State of the art' word-error rates (WERs):
- 2% (dictation)
- 30% (telephone conversations)
• Can use multiple streams...

The connectionist-HMM hybrid
(Morgan & Bourlard, 1995)
• Conventional recognizers use an acoustic likelihood model P(Xi|Si)
- model the distribution with, e.g., Gaussian mixtures
• Can replace it with the posterior P(Si|Xi):
[Figure: feature calculation → neural network acoustic classifier → posterior probabilities for context-independent phone classes Ck → HMM decoder (word models, language model) → words]
- a neural network estimates the phone given the acoustics
- discriminative
• Simpler structure for research

Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
• Neural net estimates phone posteriors; but Gaussian mixtures model finer detail
• Combine them!
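In the hybrid scheme, the network's phone posteriors P(Si|Xi) stand in for likelihoods by dividing out the class priors (Bayes' rule, up to a term constant across classes). A minimal sketch of that conversion, not from the talk itself; the array names are illustrative:

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Convert frame-wise phone posteriors P(S|X) into scaled
    likelihoods P(X|S)/P(X) = P(S|X)/P(S) for HMM decoding."""
    posteriors = np.maximum(posteriors, floor)
    return posteriors / np.maximum(priors, floor)

# Toy example: 2 frames, 3 phone classes
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])  # e.g. training-set phone frequencies
scaled = posteriors_to_scaled_likelihoods(post, priors)
```

The P(X) factor is shared by all states at a frame, so it drops out of the Viterbi comparison and the scaled values decode like true likelihoods.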
[Figure: three block diagrams. Hybrid connectionist-HMM ASR: input sound → feature calculation → speech features → neural net classifier → phone probabilities → Noway decoder → words. Conventional ASR (HTK): input sound → feature calculation → speech features → Gaussian mixture models → subword likelihoods → HTK decoder → words. Tandem modeling: input sound → feature calculation → neural net classifier → phone probabilities → Gaussian mixture models → subword likelihoods → HTK decoder → words.]
• Train the net, then train the GMM on the net's output
- the GMM is ignorant of the 'meaning' of the net output

Tandem system results
• It works very well ('Aurora' noisy digits):
[Figure: WER (log scale) as a function of SNR (-5 dB to clean, averaged over 4 noises) for the various Aurora99 systems]

System-features      Avg. WER 20-0 dB   WER ratio to baseline
HTK GMM-mfcc         13.7%              100%
Hybrid (NN)-mfcc     9.3%               84.5%
Tandem-mfcc          7.4%               64.5%
Tandem-msg+plp (+PC) 6.4%               47.2%

Connectionist speaker recognition
(with Dominique Genoud)
• Use neural networks to model speakers rather than phones?
• Specialize a phone classifier for a particular speaker?
• Do both at once with a "twin-output MLP":
[Figure: 9 frames × 12 features = 108 inputs feed an MLP (108:2000:107) whose posterior outputs are 53 'speaker' phones, 53 'world' phones, and non-speech]

Speech/music discrimination
(with Gethin Williams)
• The neural net is very sensitive to speech:
- characteristic jumping between phones
- define statistics to distinguish speech regions,
e.g. entropy, 'dynamism' (delta-magnitude): Dynamism vs.
Entropy for 2.5-second segments of speech/music:
[Figure: scatter plot of entropy vs. dynamism for speech, music, and speech+music segments, alongside a spectrogram and the corresponding phone posteriors]
• 1.4% classification error on 2.5 s segments
- use an HMM structure for segmentation
• A good predictor of ASR success

Music analysis: Lyrics extraction
(with Adam Berenzweig)
• Vocal content is highly salient, useful for retrieval
• Can we find the singing? Use an ASR classifier:
[Figure: spectrograms and phone-class posteriograms for speech, music (no vocals), and singing excerpts]
• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics → training data for lyrics ASR...

Music analysis: Structure recovery
(with Rob Turetsky)
• Structure recovery by similarity matrices (after Foote)
- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales: notes, phrases, movements
- incorporating musical knowledge: 'theme similarity'

Outline
1 Introducing LabROSA
2 Speech recognition & processing
3 Auditory Scene Analysis
- Perception of sound mixtures
- Illusions
- Computational modeling
4 Projects & applications
5 Summary

Sound mixtures
[Figure: spectrograms showing Speech + Noise = Speech+Noise]
• The sound 'scene' is almost always a mixture
- there is always stuff going on
- sound is 'transparent'
• Need information related to our 'world model'
- i.e.
separate objects
- a wolf howling in a blizzard is the same as a wolf howling in a rainstorm
- whole-signal statistics won't do this
• 'Separateness' is similar to independence
- objects/sounds that change in isolation
- but: it depends on the situation,
e.g. a passing car vs. a mechanic's diagnosis

Human sound organization
• "Auditory Scene Analysis" [Bregman 1990]
- break the mixture into small elements (in time-frequency)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping 'rules' (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ... (from Darwin 1996)

Cues and grouping
• Common attributes and 'fate'
[Figure: time-frequency sketch of harmonics starting together]
- harmonicity, common onset → perceived as a single sound source/event
• But: cues can conflict
[Figure: a harmonic complex with one component mistuned by ∆f and starting ∆t early]
- determine how ∆t and ∆f affect
• segregation of the harmonic
• pitch of the complex

The effect of context
• Context can create an 'expectation': i.e. a bias towards a particular interpretation
• e.g. Bregman's "old-plus-new" principle:
A change in a signal will be interpreted as an added source whenever possible
[Figure: spectrogram (frequency/kHz vs. time/s) showing the same energy divided differently depending on what preceded it]

Computational ASA
• Goal: systems to 'pick out' sounds in a mixture
- ... like people do
• Implement psychoacoustic theory?
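The grouping 'rules' above can be caricatured in code: partials that share an onset time and sit near integer multiples of a common fundamental are bound into one source candidate. A toy sketch only; the function, tolerances, and data are all illustrative, not part of any real CASA system described here:

```python
def group_partials(partials, onset_tol=0.02, harm_tol=0.03):
    """Toy common-onset + harmonicity grouping.
    partials: list of (onset_sec, freq_hz) tuples.
    Returns a list of groups, each anchored on its lowest partial as f0."""
    groups = []
    for onset, freq in sorted(partials):
        for g in groups:
            ratio = freq / g["f0"]
            harmonic = round(ratio) >= 1 and abs(ratio - round(ratio)) < harm_tol
            if abs(onset - g["onset"]) < onset_tol and harmonic:
                g["members"].append((onset, freq))   # same onset & harmonic: fuse
                break
        else:
            # no existing group fits: start a new source candidate
            groups.append({"onset": onset, "f0": freq,
                           "members": [(onset, freq)]})
    return groups

# Two sources: a 200 Hz complex at t=0 and a 310 Hz complex at t=0.5
parts = [(0.0, 200.0), (0.0, 400.0), (0.0, 600.0),
         (0.5, 310.0), (0.5, 620.0)]
grps = group_partials(parts)
# → two groups: three harmonics of 200 Hz, two harmonics of 310 Hz
```

Real grouping must of course trade off conflicting cues (the ∆t/∆f experiments above); a hard if-test like this is exactly what the "just implement them?" question is skeptical about.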
[Figure: bottom-up CASA architecture: input mixture → front end (maps: frequency, onset time, period, frequency modulation) → signal features → object formation (grouping rules) → discrete objects → source groups]
- 'bottom-up', using common onset & periodicity
• Able to extract voiced speech:
[Figure: spectrograms of the mixture (brn1h.aif) and the extracted voiced speech (brn1h.fi.aif)]

Adding top-down cues
• Perception is not direct but a search for plausible hypotheses
• Data-driven (bottom-up), as above, vs. prediction-driven (top-down) (PDCASA):
[Figure: input mixture → front end → signal features → compare & reconcile → prediction errors → hypothesis management → hypotheses (noise components, periodic components) → predict & combine → predicted features, fed back to the comparison]
• Motivations
- detect non-tonal events (noise & click elements)
- support 'restoration illusions'...
→ hooks for high-level knowledge
+ 'complete explanation', multiple hypotheses, ...

PDCASA and complex scenes
[Figure: analysis of a 9 s city-street scene into wefts (Horn1-5), noise and click elements, with listener identification rates: Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Crash (10/10), Squeal (6/10), Truck (7/10)]

Outline
1 Introducing LabROSA
2 Speech recognition & processing
3 Auditory Scene Analysis
4 Projects & applications
- Missing data recognition
- Hearing prostheses
- The machine listener
5 Summary

Missing data recognition
(Cooke, Green, Barker... @ Sheffield)
• Energy overlaps in time-frequency
hide features
- some observations are effectively missing
• Use missing-feature theory...
- integrate over the missing data dimensions x_m:
p(x_g | q) = ∫ p(x_g | x_m, q) p(x_m | q) dx_m
• Effective in speech recognition
- the trick is finding the good/bad data mask
[Figure: spectrogram of "1754" clean and in noise, with a mask based on a stationary noise estimate; WER vs. SNR on Aurora 2000 Test Set A for missing-data recognition (discrete and soft SNR masks) vs. HTK clean-training and multi-condition systems]

Multi-source decoding
(Jon Barker @ Sheffield)
• Search over sound-fragment interpretations:
["1754" + noise → mask based on a stationary noise estimate → mask split into subbands → 'grouping' applied within bands (spectro-temporal proximity, common onset/offset) → multisource decoder → "1754"]
• CASA for masks/fragments
- larger fragments → quicker search
• Use with nonspeech models?

Audio information retrieval
• Searching in a database of audio
- speech .. use ASR
- text annotations .. search them
- sound effects library?
• e.g.
Muscle Fish "SoundFisher" browser
- define multiple 'perceptual' feature dimensions
- search by proximity in a (weighted) feature space
[Figure: sound segments and the query example pass through segment feature analysis; feature vectors are matched against the sound-segment database to produce results]
- features are 'global' for each soundfile; no attempt to separate mixtures

CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
[Figure: a continuous audio archive and the query example (or a symbolic query such as "rushing water", via word-to-class mapping to properties alone) pass through generic element analysis and object formation; objects + properties are compared to produce results]
- features are calculated over grouped subsets

Alarm sound detection
• Alarm sounds have a particular structure
- people 'know them when they hear them'
• Isolate alarms in sound mixtures:
[Figure: spectrograms of alarm sounds embedded in three different backgrounds]
- representation of energy in time-frequency
- formation of atomic elements
- grouping by common properties (onset &c.)
- classify by attributes...
• Key: recognize despite the background

Future prosthetic listening devices
• CASA to replace lost hearing ability
- sound mixtures are difficult for the hearing impaired
• Signal enhancement
- resynthesize a single source without the background
- (needs very good resynthesis)
• Signal understanding
- monitor for particular sounds (doorbell, knocks) & translate into an alternative mode (vibrating alarm)
- real-time textual descriptions, i.e.
"automatic subtitles for real life"
[Figure: "[thunder]  S: I THINK THE WEATHER'S CHANGING"]

The 'machine listener'
• Goal: an auditory system for machines
- use the same environmental information as people
• Aspects:
- recognize spoken commands (but not others)
- track 'acoustic channel' quality (for responses)
- categorize the environment (conversation, crowd...)
• Scenarios
- personal listener → a summary of your day
- autonomous robots: need awareness

Outline
1 Introducing LabROSA
2 Tandem modeling: neural net features
3 Meeting recorder data analysis
4 Computational Auditory Scene Analysis
5 Summary

Summary: Applications for sound organization
• What do people do with their ears?
• Human-computer interface
- .. includes knowing when (& why) you've failed
• Robots
- intelligence requires perceptual awareness
- Sony's AIBO: dog-hearing
• Archive indexing & retrieval
- pure audio archives
- true multimedia content analysis
• Content 'understanding'
- intelligent classification & summarization
• Autonomous monitoring
• 'Structure discovery' algorithms

LabROSA summary
[Figure: DOMAINS (meetings, personal recordings, location monitoring, broadcast, movies, lectures) feed ROSA (object-based structure discovery & learning; scene analysis; audio-visual integration; music analysis; speech recognition; speech characterization; nonspeech recognition), supporting APPLICATIONS (structuring, search, summarization, awareness, understanding)]

Extra slides...

Positioning sound organization
• Sound organization sits among: speech recognition, image/video analysis, audio processing, Comp. Aud.
Scene Analysis, psychoacoustics & physiology, signal processing, pattern recognition, machine learning, artificial intelligence
• Draws on many techniques
• Abuts/overlaps various areas

Tandem recognition: Relative contributions
• Approximate relative impact of each component on the baseline WER ratio:
- NN over HTK: +15%
- Tandem over HTK: +35%
- Tandem over hybrid: +25%
- KLT over direct posteriors: +8%
- Pre-nonlinearity over posteriors: +12%
- Combo over msg: +20%; over plp: +20%; over mfcc: +25%
- Combo-into-HTK over combo-into-Noway: +15%
- Tandem combo over HTK mfcc baseline: +53%
[Figure: the two-stream tandem architecture: plp and msg features each feed a neural net classifier; the pre-nonlinearity outputs are combined, PCA-orthogonalized, and passed to Gaussian mixture models and the HTK decoder]

Inside tandem systems: What's going on?
• Visualizations of the net outputs: "one eight three" (MFP_183A)
[Figure: for clean speech and 5 dB SNR in 'Hall' noise: spectrogram, cepstral-smoothed mel spectrum, phone posterior estimates, and hidden-layer linear outputs]
• The neural net normalizes away the noise

Acoustic Change Detection (ACD)
(with Javier Ferreiros, UPM)
• Find optimal segmentation points via the Bayesian Information Criterion (BIC)
• Cluster segments to find the underlying 'sources'
• Repeat the segmentation incorporating the cluster assignments
[Figure: speech → two feature extractions and a broad-class recognizer → hypothesis generator → ACD → clustering]
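The BIC segmentation step can be sketched as follows: model a window as one full-covariance Gaussian vs. two Gaussians split at a candidate frame, and accept the split with the largest positive penalized likelihood gain. This is a generic sketch of the standard BIC change criterion, not the UPM system's actual code; `penalty` plays the usual λ role:

```python
import numpy as np

def bic_change_point(X, penalty=1.0):
    """Best single change point in a feature sequence X (frames, dims),
    scored by the Bayesian Information Criterion with full-covariance
    Gaussians. Returns (frame index, delta-BIC); delta-BIC > 0 favors
    declaring a change."""
    n, d = X.shape

    def logdet_cov(Z):
        c = np.cov(Z, rowvar=False) + 1e-6 * np.eye(d)  # regularized
        return np.linalg.slogdet(c)[1]

    full = n * logdet_cov(X)
    # extra free parameters of the two-model hypothesis: one mean + one
    # symmetric covariance
    k = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    best = (None, -np.inf)
    for t in range(d + 2, n - d - 2):   # keep both sides estimable
        delta = 0.5 * (full
                       - t * logdet_cov(X[:t])
                       - (n - t) * logdet_cov(X[t:])) - penalty * k
        if delta > best[1]:
            best = (t, delta)
    return best
```

Repeating this over a sliding window, then clustering the resulting segments, gives the segment/cluster/resegment loop shown in the slide.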
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, ...):
- 100 hours collected, transcription ongoing
- headsets + tabletop + 'PDA'

Crosstalk cancellation
• Baseline speaker-activity detection is hard:
[Figure: energy traces for speakers A-E and the tabletop mic over ~35 s, annotated with an active speaker, speaker B ceding the floor, interruptions, breath noise, and crosstalk]
• Noisy crosstalk model: m = C·s + n
• Estimate the subband coupling C_Aa from A's peak energy
- ... including pure delay (10 ms frames)
- ... then linear inversion

Participant motion detection
• Cross-correlation gives the speaker-mic coupling:
[Figure: for meeting mr-2000-11-02-1440, subband cross-correlation delays for the couplings speaker C → mic A and speaker C → tabletop, showing participant movement]
• Changes in the coupling impulse response show changes in path/orientation
• Comparison between different channels → distinguish speaker and listener motion

PDA-based speaker change detection
• Goal: a small conference-tabletop device
• Speaker turns from PDA mock-up signals?
[Figure: excerpt of pda.aif with a 512-point cross-correlation (80% max threshold) and a transcript of overlapping talkers (Morgan, Adam, Dan) discussing the mock-up PDA's electret mics]
• Speaker-change detection algorithm on spectral + interaural features
- average spectral features + per-channel ITD, ∆φ

Computational ASA
[Figure: a CASA system turning an input mixture into Object 1/2/3 descriptions]
• Goal: automatic sound organization; systems to 'pick out' sounds in a mixture
- ... like people do
• E.g. a voice against a noisy background
- to improve speech recognition
• Approach:
- psychoacoustics describes the grouping 'rules'
- ... just implement them?

CASA front-end processing
• Correlogram: loosely based on known/possible physiology
[Figure: sound → cochlea filterbank → per-channel envelope follower and delay line → short-time autocorrelation → correlogram slice (frequency × lag, evolving over time)]
- linear filterbank cochlear approximation, static nonlinearity
- the zero-delay slice is like a spectrogram
- periodicity from delay-and-multiply detectors

Problems with 'bottom-up' CASA
[Figure: time-frequency elements of a mixture]
• Circumscribing time-frequency elements
- need to have 'regions', but they are hard to find
• Periodicity is the primary cue
- how to handle aperiodic energy?
• Resynthesis via masked filtering
- cannot separate within a single t-f element
• Bottom-up leaves no ambiguity or context
- how to model illusions?
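The correlogram front end amounts to a short-time autocorrelation of each filterbank channel. A minimal sketch, assuming the cochlear filterbank outputs are already available as an array (names and frame sizes are illustrative):

```python
import numpy as np

def correlogram_slice(channels, frame, frame_len=512, max_lag=256):
    """One correlogram slice: short-time autocorrelation of each
    filterbank channel starting at sample `frame`.
    channels: (n_channels, n_samples) cochlea-filterbank outputs.
    Returns (n_channels, max_lag); the lag-0 column is spectrogram-like."""
    seg = channels[:, frame:frame + frame_len]
    win = frame_len - max_lag          # fixed window so all lags fit
    out = np.empty((channels.shape[0], max_lag))
    for lag in range(max_lag):
        # delay-and-multiply: each channel against itself at this lag
        out[:, lag] = np.mean(seg[:, :win] * seg[:, lag:lag + win], axis=1)
    return out
```

A periodic channel produces ridges at lags equal to multiples of its period, which is what lets later stages group channels sharing a common fundamental.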
Generic sound elements for PDCASA
• The goal is a representational space that
- covers real-world perceptual sounds
- has a minimal parameterization (sparseness)
- puts separate attributes in separate parameters
• NoiseObject: magnitude profile H[k] × magnitude contour A[n]:
X_N[n,k] = H[k]·A[n]
• ClickObject: profile H[k] with per-channel decay times T[k] from onset n0:
X_C[n,k] = H[k]·exp(−(n − n0) / T[k])
• WeftObject: smooth time-frequency envelope H_W[n,k] (from correlogram analysis) × periodic excitation P[n,k] following the period track p[n]:
X_W[n,k] = H_W[n,k]·P[n,k]
• Object hierarchies are built on top...

PDCASA for old-plus-new
• Incremental analysis:
[Figure: the input signal analyzed at successive times t1, t2, t3]
- time t1: initial element created
- time t2: an additional element is required
- time t3: the second element is finished
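The noise and click element equations above are simple enough to write out directly. A sketch of how such elements would render into time-frequency magnitudes, under the stated parameterization (function names are mine, not from the talk):

```python
import numpy as np

def noise_element(H, A):
    """NoiseObject: spectral profile H[k] times magnitude contour A[n],
    i.e. X_N[n, k] = H[k] * A[n] as an outer product."""
    return np.outer(A, H)

def click_element(H, T, n0, n_frames):
    """ClickObject: profile H[k] decaying exponentially per channel
    with time constants T[k], starting at frame n0:
    X_C[n, k] = H[k] * exp(-(n - n0) / T[k]) for n >= n0, else 0."""
    n = np.arange(n_frames)[:, None]
    X = H[None, :] * np.exp(-(n - n0) / T[None, :])
    X[n.ravel() < n0] = 0.0          # nothing before the onset
    return X
```

A WeftObject would add a periodic-excitation factor P[n,k] derived from the period track; the sparseness goal shows up here as each element carrying only a profile, a contour or decay vector, and an onset.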