Audio Information Extraction
Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/

Outline
1 Sound organization
2 Speech, music, and other
3 General sound organization
4 Future work & summary

Sound Organization
[Figure: spectrogram (0-4000 Hz, 0-12 s) of a complex mixture, with the analysis annotated as distinct objects: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Analyzing and describing complex sounds:
  - continuous sound mixture → distinct objects & events
• Human listeners as the prototype
  - strong subjective impression when listening
  - .. but hard to 'see' in the signal

Bregman's lake
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)
• Received waveform is a mixture
  - two sensors, N signals ...
• Disentangling mixtures as the primary goal
  - perfect solution is not possible
  - need knowledge-based constraints

The information in sound
[Figure: spectrograms (0-4000 Hz, 0-4 s) of two sound examples, "Steps 1" and "Steps 2"]
• Hearing confers evolutionary advantage
  - optimized to get 'useful' information from sound
• Auditory perception is ecologically grounded
  - scene analysis is preconscious (→ illusions)
  - special-purpose processing reflects 'natural scene' properties
  - subjective, not canonical (ambiguity)

Audio Information Extraction
[Diagram: AIE situated among related fields - Automatic Speech Recognition (ASR), Text Information Retrieval (IR), Spoken Document Retrieval (SDR), Text Information Extraction (IE), Blind Source Separation (BSS), Independent Component Analysis (ICA), Music Information Retrieval (Music IR), Computational Auditory Scene Analysis (CASA), Audio Fingerprinting, Audio Content-based Retrieval (Audio CBR), Unsupervised Clustering, Multimedia Information Retrieval (MMIR)]
• Domain
  - text ... speech ... music ... general audio
• Operation
  - recognize ... index/retrieve ... organize

AIE Applications
• Multimedia access
  - sound as a complementary dimension
  - need all modalities for complete information
• Personal audio
  - continuous sound capture quite practical
  - a different kind of indexing problem
• Machine perception
  - intelligence requires awareness
  - necessary for communication
• Music retrieval
  - area of hot activity
  - specific economic factors

Outline
1 Sound organization
2 Speech, music, and other
  - Speech recognition
  - Multi-speaker processing
  - Music classification
  - Other sounds
3 General sound organization
4 Future work & summary

Automatic Speech Recognition (ASR)
• Standard speech recognition structure (a minimal sketch follows below):
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters; word models, e.g. "s ah t") → phone probabilities → HMM decoder (language model: p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application...]
• 'State of the art' word-error rates (WERs):
  - 2% (dictation)
  - 30% (telephone conversations)
• Can use multiple streams...
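To make the block diagram above concrete, here is a minimal, illustrative Python/NumPy sketch of the three stages: feature calculation, frame-level acoustic scoring, and HMM decoding. The crude filterbank features, the single diagonal Gaussian per state, and the toy transition matrix are simplifying assumptions for illustration only, not the features, mixture models, or decoders of any system described in these slides.

```python
import numpy as np

def logband_features(x, n_fft=256, hop=128, n_bands=20):
    """Crude log filterbank features: windowed FFT magnitudes pooled into
    bands (an illustrative stand-in for real MFCC/PLP features)."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1))               # (T, n_fft//2+1)
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    bands = np.stack([spec[:, a:b].mean(axis=1)
                      for a, b in zip(edges[:-1], edges[1:])], axis=1)
    return np.log(bands + 1e-6)                               # (T, n_bands)

def diag_gauss_loglik(feats, means, variances):
    """Per-frame log-likelihood under one diagonal Gaussian per state.
    feats: (T, D); means, variances: (S, D) -> (T, S)."""
    diff = feats[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff ** 2 / variances
                         + np.log(2 * np.pi * variances), axis=2)

def viterbi(loglik, log_trans):
    """Best state path given frame log-likelihoods (T, S) and log
    transition matrix (S, S) - the 'HMM decoder' box, minus words."""
    T, S = loglik.shape
    delta, back = loglik[0].copy(), np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans                  # (prev, next)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

A real recognizer would use MFCC or PLP features, many context-dependent states modeled by Gaussian mixtures, and a language model inside the decoder, as the diagram indicates.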
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
• Neural net estimates phone posteriors; but Gaussian mixtures model finer detail
• Combine them!
[Diagram: Hybrid connectionist-HMM ASR: input sound → feature calculation → speech features → neural net classifier → phone probabilities → Noway decoder → words.
 Conventional ASR (HTK): input sound → feature calculation → speech features → Gaussian mixture models → subword likelihoods → HTK decoder → words.
 Tandem modeling: input sound → feature calculation → speech features → neural net classifier → phone probabilities → Gaussian mixture models → subword likelihoods → HTK decoder → words.]
• Train the net, then train the GMM on the net's output
  - the GMM is ignorant of the 'meaning' of the net output

Tandem system results
• It works very well ('Aurora' noisy digits):
[Figure: WER (log scale) vs. SNR (-5 to 20 dB and clean, averaged over 4 noises) for the Aurora'99 systems: HTK GMM baseline, hybrid connectionist, Tandem, Tandem + PC. Average WER ratio to baseline: HTK GMM 100%, hybrid 84.6%, Tandem 64.5%, Tandem + PC 47.2%]

  System-features  | Avg. WER (20-0 dB) | Baseline WER ratio
  HTK-mfcc         | 13.7%              | 100%
  Neural net-mfcc  |  9.3%              | 84.5%
  Tandem-mfcc      |  7.4%              | 64.5%
  Tandem-msg+plp   |  6.4%              | 47.2%

Relative contributions
• Approximate relative impact on the baseline WER ratio for the different components:
[Diagram: plp and msg feature streams each feed a neural net classifier; the pre-nonlinearity outputs are combined, PCA-orthogonalized, and passed to the HTK decoder's Gaussian mixture models]
  - KLT over direct: +8%
  - Pre-nonlinearity over posteriors: +12%
  - Combo over msg: +20%
  - Combo over plp: +20%
  - Combo over mfcc: +25%
  - Combo-into-HTK over combo-into-Noway: +15%
  - NN over HTK: +15%
  - Tandem over HTK: +35%
  - Tandem over hybrid: +25%
  - Tandem combo over HTK mfcc baseline: +53%

Inside Tandem systems: what's going on?
• Visualizations of the net outputs
[Figure: "one eight three" (MFP_183A), clean vs. 5 dB SNR in 'Hall' noise; panels: spectrogram, cepstral-smoothed mel spectrum, phone posterior estimates, hidden-layer linear outputs]
• The neural net normalizes away the noise

Tandem for large-vocabulary recognition
• CI Tandem front end + CD LVCSR back end (see the feature-pipeline sketch below)
[Diagram: PLP and MSG feature calculations each feed a neural net classifier; the pre-nonlinearity outputs are summed, PCA-decorrelated, and passed as tandem features to the SPHINX recognizer (GMM classifier → subword likelihoods → HMM decoder → words, with MLLR adaptation)]
• Tandem benefits reduced:
[Figure: WER for mfc, plp, tandem1, and tandem2 front ends under context-independent, context-dependent, and context-dependent + MLLR back ends]
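A minimal sketch of the tandem feature path described above: take the net's per-frame outputs (ideally the pre-nonlinearity values), decorrelate them with PCA/KLT, and hand the result to a conventional GMM-HMM as if it were an ordinary feature vector. Function names are illustrative; a real system would estimate the PCA transform on the training set rather than per utterance, and the trained net itself is assumed to exist elsewhere.

```python
import numpy as np

def tandem_features(net_linear_outputs, n_keep=24):
    """Turn per-frame neural-net outputs into 'tandem' features:
    mean-remove, decorrelate with PCA (KLT), and keep the leading
    dimensions so they better suit a diagonal-covariance GMM."""
    X = np.asarray(net_linear_outputs, dtype=float)       # (T, n_phones)
    X = X - X.mean(axis=0)                                 # per-dimension mean removal
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)                 # ascending eigenvalues
    basis = eigvecs[:, ::-1][:, :n_keep]                   # top-variance directions
    return X @ basis                                       # (T, n_keep)

def presoftmax_from_posteriors(posteriors, eps=1e-8):
    """If only softmax posteriors are available, taking logs roughly
    recovers the pre-nonlinearity values that the slides report as
    working better than raw posteriors."""
    return np.log(np.asarray(posteriors) + eps)
```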
'Tandem-domain' processing
(with Manuel Reyes)
• Can we improve the 'tandem' features with conventional processing (deltas, normalization)?
[Diagram: neural net outputs → norm/deltas? → PCA decorrelation → rank reduction? → norm/deltas? → GMM classifier]
• Somewhat..

  Processing                            | Avg. WER (20-0 dB) | Baseline WER ratio
  Tandem PLP mismatch baseline (24 els) | 11.1%              | 70.3%
  Rank reduce @ 18 els                  | 11.8%              | 77.1%
  Delta → PCA                           |  9.7%              | 60.8%
  PCA → Norm                            |  9.0%              | 58.8%
  Delta → PCA → Norm                    |  8.3%              | 53.6%

Tandem vs. other approaches
[Figure: Aurora 2 Eurospeech 2001 evaluation - average relative improvement (%) over the multicondition baseline for Columbia, Philips, UPC Barcelona, Bell Labs, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, and Granada]
• 50% of word errors corrected over the baseline
• Beat 'bells and whistles' systems using large-vocabulary techniques

Missing data recognition
(Cooke, Green, Barker @ Sheffield)
• Energy overlaps in time-frequency hide features
  - some observations are effectively missing
• Use missing-feature theory: integrate over the missing data x_m under model M,
    p(x | M) = ∫ p(x_p | x_m, M) p(x_m | M) dx_m
• Effective in speech recognition
  - the trick is finding the good/bad-data mask
[Figure: "1754" + noise spectrogram and a mask based on a stationary noise estimate; Aurora 2000 Test Set A WER vs. SNR (-5 dB to clean) for missing-data recognition with discrete and soft SNR masks vs. HTK clean-training and multi-condition baselines]

Multi-source decoding
(Jon Barker @ Sheffield)
• Search over sound-fragment interpretations
[Figure: "1754" + noise; a mask from a stationary noise estimate is split into subbands, 'grouping' is applied within bands (spectro-temporal proximity, common onset/offset), and the fragments feed a multisource decoder]
  - also search over the data mask K:
    p(M, K | x) ∝ p(x | K, M) p(K | M) p(M)
• CASA for masks → larger fragments → faster search

The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
  - for summarization / retrieval / behavior analysis
  - informal, overlapped speech
• Data collection (ICSI, UW, ...):
  - 100 hours collected, ongoing transcription
[Figure: mr-2000-11-02 - speaker activity over a one-hour meeting for channels C, A, O, M, J, L]

Crosstalk cancellation
(with Sam Keene)
• Baseline speaker activity detection is hard:
[Figure: mr-2000-06-30-1600 - per-speaker activity (Spkr A-E) and tabletop mic level/dB over a ~35 s excerpt, annotated with backchannels (signaling a desire to regain the floor?), a floor seizure, speaker B ceding the floor, interruptions, breath noise, and crosstalk]
• Noisy crosstalk model: m = C · s + n
• Estimate the subband coupling C_Aa from A's peak-energy frames
  - ... then linear inversion (see the sketch below)
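A minimal, full-band sketch of the crosstalk model m = C·s + n and its linear inversion. The per-speaker 'solo' segments used to estimate the coupling gains are a stand-in for the per-subband, peak-energy estimation described on the slide, the noise term is ignored, and the function and argument names are illustrative.

```python
import numpy as np

def estimate_coupling(mics, solo_frames):
    """Estimate the coupling matrix C in m = C.s + n, assuming each
    speaker's own mic during that speaker's highest-energy ('solo')
    frames gives a usable reference for the source.
    mics: (n_mics, n_samples); solo_frames[k]: slice where speaker k dominates."""
    n = mics.shape[0]                       # assumes n_mics == n_speakers
    C = np.eye(n)
    for k, sl in enumerate(solo_frames):
        ref = mics[k, sl]                   # speaker k's own mic as reference
        for j in range(n):
            # least-squares gain of speaker k's signal into mic j
            C[j, k] = np.dot(mics[j, sl], ref) / (np.dot(ref, ref) + 1e-12)
    return C

def cancel_crosstalk(mics, C):
    """Recover source estimates by linear inversion, s_hat = C^-1 m."""
    return np.linalg.solve(C, mics)
```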
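Returning to the missing-data slide above: for diagonal-covariance Gaussian state models, the marginalization p(x|M) = ∫ p(x_p|x_m, M) p(x_m|M) dx_m reduces to evaluating only the 'present' feature dimensions. The sketch below assumes spectral features, a hard binary mask (e.g. from a stationary noise estimate), and one diagonal Gaussian per model; the soft-SNR masks and bounded marginalization used in the Sheffield systems are not shown.

```python
import numpy as np

def missing_data_loglik(x, mask, mean, var):
    """Log-likelihood of one spectral frame under a diagonal Gaussian,
    using only the dimensions marked present in `mask`; the missing
    dimensions are integrated out, which for a diagonal Gaussian means
    simply dropping them from the product."""
    p = np.asarray(mask, dtype=bool)
    d = x[p] - mean[p]
    return -0.5 * np.sum(d ** 2 / var[p] + np.log(2 * np.pi * var[p]))

def pick_model(x, mask, models):
    """Choose the model (e.g. a word/phone state) with the highest
    marginal likelihood over the present spectro-temporal cells.
    `models` is a list of dicts with 'mean' and 'var' arrays (illustrative)."""
    scores = [missing_data_loglik(x, mask, m['mean'], m['var'])
              for m in models]
    return int(np.argmax(scores))
```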
Speaker localization
(with Huan Wei Hee)
• Tabletop mics form an array; time differences locate speakers (see the cross-correlation sketch below)
[Figure: mr-2000-11-02-1440 - PZM cross-correlation lags (lag 3-4 vs. lag 1-2, in ms) and the inferred talker positions in the room (x = mic)]
• Ambiguity:
  - mic positions are not fixed
  - speaker motion
• Detect speaker activity, overlap

Alarm sound detection
• Alarm sounds have particular structure
  - people 'know them when they hear them'
• Isolate alarms in sound mixtures
[Figure: spectrograms (0-5 kHz, 1-2.5 s) of three alarm sounds]
  - sinusoid peaks have invariant properties
[Figure: speech + alarm spectrogram with the frame-level Pr(alarm) trace below it, 0-9 s]
  - cepstral coefficients are easy to model

Music analysis: structure recovery
(with Rob Turetsky)
• Structure recovery by similarity matrices (after Foote) - see the sketch below
  - which similarity distance measure?
  - segmentation & repetition structure
  - interpretation at different scales: notes, phrases, movements
  - incorporating musical knowledge: 'theme similarity'

Music analysis: lyrics extraction
(with Adam Berenzweig)
• Vocal content is highly salient and useful for retrieval
• Can we find the singing? Use an ASR classifier:
[Figure: spectrograms and phone-class posteriograms for speech (trnset #58), music (no vocals, #1), and singing (vocals, #17 + 10.5 s)]
• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics → training data for lyrics ASR...

Artist similarity
• Train a network to discriminate specific artists:
[Figure: per-song artist-class posterior statistics ("Mouse w60o40 stats based on LE plp12", 2001-12-28) for artists including Tori, XTC, and Beck]
• Focus on vocal segments for consistency
  - accuracy 57% → 65%
• .. then clustering for recommendation

Outline
1 Sound organization
2 Speech, music, and other
3 General sound organization
  - Computational Auditory Scene Analysis
  - Audio Information Retrieval
4 Future work & summary

Computational Auditory Scene Analysis (CASA)
[Diagram: a sound mixture enters CASA, which emits Object 1 description, Object 2 description, Object 3 description, ...]
• Goal: automatic sound organization - systems to 'pick out' sounds in a mixture
  - ... like people do
• E.g. a voice against a noisy background
  - to improve speech recognition
• Approach:
  - psychoacoustics describes grouping 'rules'
  - ... can we implement them?

CASA front-end processing
• Correlogram: loosely based on known/possible physiology (a minimal sketch follows below)
[Diagram: sound → cochlea filterbank → frequency channels → envelope follower → delay line → short-time autocorrelation; a correlogram slice has axes frequency × lag, evolving over time]
  - linear filterbank as a cochlear approximation
  - static nonlinearity
  - the zero-delay slice is like a spectrogram
  - periodicity from delay-and-multiply detectors
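Following up the speaker-localization slide: a minimal sketch of estimating the inter-microphone time difference from the peak of the cross-correlation between two tabletop channels. The plain (non-generalized) cross-correlation and the max-lag window are simplifying choices; mic names and parameters are illustrative.

```python
import numpy as np

def tdoa_lag_ms(sig_a, sig_b, sr, max_lag_ms=3.0):
    """Return the signed lag (in ms) at which the cross-correlation of
    two equal-length mic channels peaks, restricted to physically
    plausible delays (a few ms for tabletop mics about a metre apart)."""
    max_lag = int(max_lag_ms * sr / 1000)
    a = sig_a - sig_a.mean()
    b = sig_b - sig_b.mean()
    xcorr = np.correlate(a, b, mode='full')     # zero lag at index len(a)-1
    mid = len(sig_a) - 1
    window = xcorr[mid - max_lag: mid + max_lag + 1]
    best = int(np.argmax(window)) - max_lag      # signed lag in samples
    return 1000.0 * best / sr
```

Pairs of such lags from three or more mics constrain the talker position, as in the figure; with unfixed mic positions and moving speakers the solution stays ambiguous, which is why the slide uses it mainly to flag activity and overlap.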
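For the music structure-recovery slide: a minimal Foote-style sketch - a cosine self-similarity matrix over feature frames, plus a checkerboard-kernel novelty curve as a segmentation cue. The feature choice (e.g. MFCCs or chroma) and the kernel width are illustrative assumptions, not necessarily the measures explored in that work.

```python
import numpy as np

def similarity_matrix(features):
    """Self-similarity matrix: cosine similarity between every pair of
    feature frames, so repeated sections appear as off-diagonal stripes."""
    F = np.asarray(features, dtype=float)                  # (T, D)
    U = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    return U @ U.T                                         # (T, T), in [-1, 1]

def novelty_curve(S, width=16):
    """Crude segmentation cue (after Foote): correlate a checkerboard
    kernel along the main diagonal of the similarity matrix."""
    k = np.kron(np.array([[1.0, -1.0], [-1.0, 1.0]]), np.ones((width, width)))
    T = S.shape[0]
    nov = np.zeros(T)
    for t in range(width, T - width):
        nov[t] = np.sum(S[t - width:t + width, t - width:t + width] * k)
    return nov
```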
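And for the correlogram front end just above, a minimal sketch of one correlogram slice: the short-time autocorrelation of each cochlear channel's envelope, whose zero-lag column is just per-channel energy. The gammatone-like filterbank and envelope follower that would produce `subband_envs` are assumed and not shown.

```python
import numpy as np

def correlogram_frame(subband_envs, max_lag):
    """One correlogram slice from a window of subband envelopes.
    subband_envs: (n_channels, n_samples) -> (n_channels, max_lag) image
    of normalized short-time autocorrelations."""
    n_ch, n = subband_envs.shape
    cg = np.zeros((n_ch, max_lag))
    for c in range(n_ch):
        e = subband_envs[c] - subband_envs[c].mean()
        for lag in range(max_lag):
            cg[c, lag] = np.dot(e[:n - lag], e[lag:]) / (n - lag)
    return cg
```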
Adding top-down cues
• Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up) vs. prediction-driven (top-down) processing (PDCASA):
[Diagram: Data-driven: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups.
 Prediction-driven: input mixture → front end (signal features) → compare & reconcile (prediction errors) → hypothesis management (noise- and periodic-component hypotheses) → predict & combine (predicted features), in a feedback loop.]
• Motivations
  - detect non-tonal events (noise & click elements)
  - support 'restoration illusions'...
• Machine learning for sound models
  - a corpus of isolated sounds?

PDCASA and complex scenes
[Figure: analysis of a 9 s "City" scene - the mixture is explained as wefts 1-12 (matching Horn1-Horn5, detected 10/10, 5/10, 5/10, 8/10, 10/10), noise and click elements (Crash 10/10, Squeal 6/10, Truck 7/10), and residual noise]

Audio Information Retrieval
(with Manuel Reyes)
• Searching in a database of audio
  - speech .. use ASR
  - text annotations .. search them
  - a sound-effects library?
• E.g. the Muscle Fish "SoundFisher" browser
  - define multiple 'perceptual' feature dimensions
  - search by proximity in (weighted) feature space (see the sketch below)
[Diagram: a sound-segment database passes through segment feature analysis to give feature vectors; a query example is analyzed the same way; search/comparison returns the results]
  - features are 'global' for each soundfile - no attempt to separate mixtures

Audio Retrieval: results
• Musclefish corpus
  - the most commonly reported set
  - 208 examples, 16 classes
• Features
  - mfcc, brightness, bandwidth, pitch ...
  - no temporal sequence structure
• Results
  - global features: 41% correct
  - HMM models: 81% correct
[Table: confusion matrices over the broad classes Musical, Speech, Environmental, Animals, and Mechanical for the two systems]

What are the HMM states?
• No sub-units are defined for nonspeech sounds
• The final states depend on the EM initialization
  - labels
  - clusters
  - transition matrix
• We have ideas of what we'd like to get
  - investigate features/initialization to get there
[Figure: dogBarks2 - spectrogram (0-8 kHz, 1.15-2.25 s) with the per-frame HMM state labels (s2, s3, s4, s5, s7) marked along the top]

CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
[Diagram: a continuous audio archive passes through generic element analysis to element representations, then object formation to objects + properties; a query example is analyzed the same way, while a symbolic query ("rushing water") goes through word-to-class mapping to properties alone; search/comparison returns the results]
  - features are calculated over grouped subsets

Outline
1 Sound organization
2 Speech, music, and other
3 General sound organization
4 Future work & summary
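Relating back to the audio-retrieval slides above, a minimal sketch of the 'global feature' approach: collapse each soundfile's frame-level features into one vector of statistics and rank the database by weighted distance to the query, SoundFisher-style. The particular statistics, weights, and function names are illustrative assumptions rather than the Muscle Fish implementation.

```python
import numpy as np

def global_features(frame_feats):
    """Collapse a soundfile's frame-level features (e.g. loudness,
    brightness, bandwidth, pitch) into one 'global' vector of
    per-dimension means and standard deviations."""
    F = np.asarray(frame_feats, dtype=float)        # (T, D)
    return np.concatenate([F.mean(axis=0), F.std(axis=0)])

def query_by_example(query_vec, database, weights=None, top_k=5):
    """Rank database soundfiles by weighted Euclidean distance to the
    query in the global feature space."""
    D = np.asarray(database, dtype=float)            # (N, 2D)
    w = np.ones(D.shape[1]) if weights is None else np.asarray(weights)
    dists = np.sqrt(np.sum(w * (D - query_vec) ** 2, axis=1))
    return np.argsort(dists)[:top_k]
```

Per the results slide, replacing this single global vector with a temporal (HMM) model is what lifts classification from 41% to 81% correct on the Musclefish set.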
Automatic audio-video analysis
(with Prof. Shih-Fu Chang, Prof. Kathy McKeown)
• Documentary archive management
  - huge ratio of raw to finished material
  - costly manual logging
  - missed opportunities for cross-fertilization
• Problem: term ↔ signal mapping
  - training corpus of past annotations
  - interactive, semi-automatic learning
  - need object-related features
[Diagram: annotations → text processing → terms; A/V data → A/V segmentation and feature extraction → A/V features; concept discovery and unsupervised clustering of A/V features yield generic detectors and an MM Concept Network, feeding multimedia fusion (mining concepts & relationships), complex spatio-temporal classification, and question answering]

The 'Machine Listener'
• Goal: an auditory system for machines
  - use the same environmental information as people
• Signal understanding
  - monitor for particular sounds
  - real-time description
• Scenarios
  - personal listener → summary of your day
  - future prosthetic hearing device
  - autonomous robots

LabROSA summary
[Diagram:
 DOMAINS: meetings, personal recordings, location monitoring, broadcast, movies, lectures
 ROSA: object-based structure discovery & learning - speech recognition, speech characterization, nonspeech recognition, scene analysis, audio-visual integration, music analysis
 APPLICATIONS: structuring, search, summarization, awareness, understanding]

Summary: Audio Info Extraction
• Sound carries information
  - useful and detailed
  - often tangled in mixtures
• Various important general classes
  - Speech: activity, recognition
  - Music: segmentation, clustering
  - Other: detection, description
• General processing framework
  - Computational Auditory Scene Analysis
  - Audio Information Retrieval
• Future applications
  - ubiquitous intelligent indexing
  - intelligent monitoring & description