Audio Information Extraction

Dan Ellis <dpwe@ee.columbia.edu>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/
AIE @ MERL, 2002-01-07

Outline
1 Audio Information Extraction
2 Speech, music, and other
3 General sound organization
4 Future work & summary

1 Audio Information Extraction (AIE)
[Figure: spectrogram (0-4 kHz, 0-12 s) of a sound mixture, with analysis labels: Voice (evil), Voice (pleasant), Stab, Rumble, Choir, Strings]
• Central operation:
  - continuous sound mixture → distinct objects & events
• Perceptual impression is very strong
  - but hard to 'see' in the signal

Perceptual organization: Bregman's lake
"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)
• Received waveform is a mixture
  - two sensors, N signals ...
• Disentangling mixtures is the primary goal
  - a perfect solution is not possible
  - need knowledge-based constraints

The information in sound
[Figure: spectrograms (0-4 kHz, 0-4 s) of two footstep sequences, "Steps 1" and "Steps 2"]
• A sense of hearing is evolutionarily useful
  - gives organisms 'relevant' information
• Auditory perception is ecologically grounded
  - scene analysis is preconscious (→ illusions)
  - special-purpose processing reflects 'natural scene' properties
  - subjective, not canonical (ambiguity)

Positioning AIE
[Figure: map of related fields arranged by domain and operation: Automatic Speech Recognition (ASR), Text Information Retrieval (IR), Spoken Document Retrieval (SDR), Text Information Extraction (IE), Blind Source Separation (BSS), Independent Component Analysis (ICA), Music Information Retrieval (Music IR), Audio Information Extraction (AIE), Computational Auditory Scene Analysis (CASA), Audio Fingerprinting, Audio Content-based Retrieval (Audio CBR), Unsupervised Clustering, Multimedia Information Retrieval (MMIR)]
• Domain
  - text ... speech ... music ... general audio
• Operation
  - recognize ... index/retrieve ... organize

AIE Applications
• Multimedia access
  - sound as a complementary dimension
  - need all modalities for complete information
• Personal audio
  - continuous sound capture is quite practical
  - a different kind of indexing problem
• Machine perception
  - intelligence requires awareness
  - necessary for communication
• Music retrieval
  - area of hot activity
  - specific economic factors

Outline
1 Audio Information Extraction
2 Speech, music, and other
  - Speech recognition
  - Multi-speaker processing
  - Music classification
  - Other sounds
3 General sound organization
4 Future work & summary

Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
  sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters, trained from data) → phone probabilities → HMM decoder (word models such as "s ah t"; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone / word sequence → understanding / application...
• 'State of the art' word-error rates (WERs):
  - 2% (dictation)
  - 30% (telephone conversations)
• Can use multiple streams...
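The pipeline above begins with a feature-calculation stage. As a rough illustration of that stage only (a sketch, not the front end actually used in these systems, with arbitrary example parameters), the following NumPy/SciPy snippet turns a waveform into mel-cepstral feature vectors of the kind an acoustic classifier consumes:

```python
# Minimal sketch of the "feature calculation" stage: short-time analysis of a
# waveform into mel-cepstral feature vectors. Illustrative only.
import numpy as np
from scipy.fftpack import dct

def mfcc_like_features(x, sr=8000, frame_len=0.025, hop=0.010, n_mel=23, n_ceps=13):
    """Return an (n_frames, n_ceps) array of cepstral feature vectors for waveform x."""
    N = int(frame_len * sr)                 # samples per analysis frame
    H = int(hop * sr)                       # hop between successive frames
    win = np.hamming(N)
    n_fft = 1 << (N - 1).bit_length()       # next power of two >= N

    # Triangular mel filterbank from 0 Hz to the Nyquist frequency
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mel + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mel, n_fft // 2 + 1))
    for j in range(n_mel):
        lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
        if mid > lo:
            fbank[j, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        if hi > mid:
            fbank[j, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)

    feats = []
    for start in range(0, len(x) - N + 1, H):
        spec = np.abs(np.fft.rfft(win * x[start:start + N], n_fft)) ** 2
        logmel = np.log(fbank @ spec + 1e-10)            # log mel-band energies
        feats.append(dct(logmel, type=2, norm='ortho')[:n_ceps])
    return np.array(feats)

# e.g.: feats = mfcc_like_features(np.random.randn(8000))   # one second of noise at 8 kHz
```

Each row of the returned array is one feature vector; an acoustic classifier (GMM or neural net) would then map it to phone probabilities for the HMM decoder.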
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
• Neural net estimates phone posteriors, but Gaussian mixtures model finer detail
• Combine them!
[Figure: three system block diagrams.
  Hybrid connectionist-HMM ASR: input sound → feature calculation → neural net classifier (phone probabilities) → Noway decoder → words.
  Conventional ASR (HTK): input sound → feature calculation → Gaussian mixture models (subword likelihoods) → HTK decoder → words.
  Tandem modeling: input sound → feature calculation → neural net classifier (phone probabilities) → Gaussian mixture models (subword likelihoods) → HTK decoder → words]
• Train the net, then train the GMMs on the net output
  - the GMM is ignorant of the net output's 'meaning'

Tandem system results: Aurora 'noisy digits'
(with Manuel Reyes)
[Figure: Aurora 2 Eurospeech 2001 evaluation, average relative improvement (%) over the multicondition baseline by site: Columbia, Philips, UPC Barcelona, Bell Labs, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada; values range from about -10% to 60%]
• 50% of word errors corrected over the baseline
• Beat even a 'bells and whistles' system using intensive large-vocabulary techniques

Missing data recognition
(Cooke, Green, Barker... @ Sheffield)
• Energy overlaps in time-frequency hide features
  - some observations are effectively missing
• Use missing feature theory...
  - integrate over the missing data dimensions x_m:
    p(x | q) = ∫ p(x_p | x_m, q) p(x_m | q) dx_m
• Effective in speech recognition
  - the trick is finding the good/bad data mask
  - e.g. mask based on a stationary noise estimate
[Figure: spectrograms of "1754" clean and with added noise, plus the estimated mask; WER vs. SNR (-5 dB to clean) on AURORA 2000 Test Set A for missing-data recognition with discrete and soft SNR masks vs. HTK clean-training and multi-condition systems]

The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
  - for summarization / retrieval / behavior analysis
  - informal, overlapped speech
• Data collection (ICSI, UW, ...):
  - 100 hours collected, ongoing transcription
  - headsets + tabletop + 'PDA'

Crosstalk cancellation
• Baseline speaker activity detection is hard:
[Figure: speaker-activity timelines for speakers A-E and tabletop mic level (dB) for meeting mr-2000-06-30-1600, 125-155 s, annotated with backchannel (signals desire to regain floor?), floor seizure, speaker B cedes floor, interruptions, breath noise, and crosstalk]
• Noisy crosstalk model: m = C · s + n
• Estimate the subband coupling C_Aa from A's peak energy
  - ... including pure delay (10 ms frames)
  - ... then linear inversion

Speaker localization
(with Wei Hee Huan)
• Tabletop mics form an array; time differences locate speakers
[Figure: estimated source positions (metres) around the tabletop array]
• Ambiguity:
  - mic positions are not fixed
  - geometric symmetry
• Detect speaker activity and overlap
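The localization above is driven by time differences of arrival between microphone pairs. Below is a minimal sketch of that idea, assuming just two channels with a known spacing and using generalized cross-correlation with phase weighting (GCC-PHAT); it is an illustration under those assumptions, not the Meeting Recorder implementation, and the sample rate and mic spacing are made-up values.

```python
# Sketch: estimate the time-difference-of-arrival (TDOA) between two microphone
# channels with GCC-PHAT, then convert it to a bearing for a two-mic pair.
import numpy as np

def gcc_phat_tdoa(x1, x2, sr, max_delay=None):
    """Delay of x2 relative to x1, in seconds (positive if x2 arrives later)."""
    n = len(x1) + len(x2)                       # zero-pad to avoid circular wrap
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12              # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))   # centre the zero lag
    lags = np.arange(-(n // 2), n // 2 + 1)
    if max_delay is not None:                   # restrict to physically possible lags
        keep = np.abs(lags) <= int(max_delay * sr)
        cc, lags = cc[keep], lags[keep]
    return lags[np.argmax(cc)] / sr

def bearing_from_tdoa(tdoa, mic_spacing, c=343.0):
    """Source angle (degrees) from broadside of a two-microphone pair."""
    return np.degrees(np.arcsin(np.clip(c * tdoa / mic_spacing, -1.0, 1.0)))

# Toy check: a white-noise source arriving 0.5 ms later at mic 2
sr = 16000
src = np.random.randn(sr)
d = int(0.0005 * sr)
x1, x2 = src, np.concatenate((np.zeros(d), src[:-d]))
tdoa = gcc_phat_tdoa(x1, x2, sr, max_delay=0.001)
print(tdoa, bearing_from_tdoa(tdoa, mic_spacing=0.3))    # ~0.0005 s, ~35 degrees
```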
Music analysis: Structure recovery
(with Rob Turetsky)
• Structure recovery by similarity matrices (after Foote)
  - what similarity distance measure?
  - segmentation & repetition structure
  - interpretation at different scales: notes, phrases, movements
  - incorporating musical knowledge: 'theme similarity'

Music analysis: Lyrics extraction
(with Adam Berenzweig)
• Vocal content is highly salient and useful for retrieval
• Can we find the singing? Use an ASR classifier:
[Figure: spectrograms (0-4 kHz) and singing-phone-class posteriograms for music (no vocals #1), singing (vocals #17 + 10.5 s), and speech (trnset #58), ~2 s excerpts]
• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics → training data for lyrics ASR...

Artist similarity
• Train a network to discriminate specific artists:
[Figure: artist-class posteriors over time (plp12-based network) for segments by Tori, XTC, and Beck]
• Focus on vocal segments for consistency
• ... then clustering for recommendation

Alarm sound detection
• Alarm sounds have a particular structure
  - people 'know them when they hear them'
• Isolate alarms in sound mixtures
[Figure: spectrograms (0-5 kHz, 1-2.5 s) of an alarm within a mixture at successive processing stages]
  - representation of energy in time-frequency
  - formation of atomic elements
  - grouping by common properties (onset &c.)
  - classify by attributes...
• Key: recognize despite the background

Sound textures
(with Marios Athineos)
• Textures: a large class of sounds
  - no clear pitch, onsets, or shape
  - fire, rain, paper, machines, ...
  - 'bulk' subjective properties
• Abstract & synthesize by:
  - projecting into a low-dimensional parameter space
  - learning the dynamics within that space
  - generating endless versions
[Figure: spectrogram of a texture and its trajectory in the low-dimensional parameter space]

Outline
1 Audio Information Extraction
2 Speech, music, and other
3 General sound organization
  - Computational Auditory Scene Analysis
  - Audio Information Retrieval
4 Future work & summary

Computational Auditory Scene Analysis (CASA)
[Figure: CASA block diagram: sound mixture in, descriptions of Object 1, Object 2, Object 3, ... out]
• Goal: automatic sound organization;
  systems to 'pick out' sounds in a mixture
  - ... like people do
• E.g. voice against a noisy background
  - to improve speech recognition
• Approach:
  - psychoacoustics describes grouping 'rules'
  - ... just implement them?

CASA front-end processing
• Correlogram: loosely based on known/possible physiology
[Figure: sound → cochlea filterbank → frequency channels → envelope follower → delay line → short-time autocorrelation → correlogram slice (frequency × lag × time)]
  - linear filterbank as a cochlear approximation
  - static nonlinearity
  - zero-delay slice is like a spectrogram
  - periodicity from delay-and-multiply detectors
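A correlogram like the one sketched in this figure can be approximated in a few lines: split the signal into frequency channels, apply a crude static nonlinearity, and take a short-time autocorrelation in each channel. The sketch below is only an illustration of that structure; it uses ordinary Butterworth band-pass filters in place of a cochlear (e.g. gammatone) filterbank, and the channel count, window, and lag range are arbitrary choices.

```python
# Rough correlogram slice: band-pass filterbank + half-wave rectification +
# per-channel short-time autocorrelation at one analysis time.
import numpy as np
from scipy.signal import butter, lfilter

def correlogram_slice(x, sr, n_chan=32, fmin=100.0, t=0.5, win=0.025, max_lag=0.02):
    """One correlogram slice at time t: an (n_chan, n_lags) array (channel x lag)."""
    edges = np.geomspace(fmin, 0.45 * sr, n_chan + 1)    # log-spaced channel edges
    start, N, L = int(t * sr), int(win * sr), int(max_lag * sr)
    out = np.zeros((n_chan, L))
    for c in range(n_chan):
        # band-pass 'cochlear' channel, then half-wave rectification as a
        # crude static nonlinearity
        b, a = butter(2, [edges[c] / (sr / 2), edges[c + 1] / (sr / 2)], btype='band')
        y = np.maximum(lfilter(b, a, x), 0.0)
        seg = y[start:start + N + L]
        # short-time autocorrelation for lags 0 .. L-1 samples
        out[c, :] = np.correlate(seg, seg[:N], mode='valid')[:L]
    # the zero-lag column out[:, 0] is per-channel energy, like a spectrogram column
    return out

# e.g. a 440 Hz tone shows ridges at lags of multiples of 1/440 s in its channel:
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
cgram = correlogram_slice(tone, sr)
```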
Adding top-down cues
• Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up)...
[Figure: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
  ... vs. prediction-driven (top-down) (PDCASA):
[Figure: input mixture → front end (signal features) → compare & reconcile (prediction errors) → hypothesis management (hypotheses: noise components, periodic components) → predict & combine (predicted features), in a feedback loop]
• Motivations
  - detect non-tonal events (noise & click elements)
  - support 'restoration illusions'...
  → hooks for high-level knowledge
  + 'complete explanation', multiple hypotheses, ...

PDCASA and complex scenes
[Figure: PDCASA analysis of a city-street recording (0-9 s, 50-4000 Hz): the input spectrogram is decomposed into Wefts 1-4, Weft 5, Wefts 6-7, Weft 8, and Wefts 9-12, labelled Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10), Squeal (6/10), Truck (7/10), plus Noise2/Click1 (Crash, 10/10) and a background Noise1 element]

Audio Information Retrieval
(with Manuel Reyes)
• Searching in a database of audio
  - speech ... use ASR
  - text annotations ... search them
  - sound effects library?
• E.g. the Muscle Fish "SoundFisher" browser
  - define multiple 'perceptual' feature dimensions
  - search by proximity in a (weighted) feature space
[Figure: sound segment database → segment feature analysis → feature vectors → search/comparison → results, with the query example passed through the same segment feature analysis]
  - features are 'global' for each soundfile; no attempt to separate mixtures

Audio Retrieval: Results
• Muscle Fish corpus
  - the most commonly reported set
• Features
  - MFCC, brightness, bandwidth, pitch ...
  - no temporal sequence structure
• Results: 208 examples, 16 classes, 84% correct
• Confusions:

                    Instr      Spch     Env      Anim     Mech
  Musical instrs.   136 (14)
  Speech                       17 (7)
  Environ.            2          2       6 (1)     2
  Animals             2                            1 (0)
  Mechanical          1                                     15 (2)

CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
[Figure: continuous audio archive → generic element analysis (element representations) → object formation (objects + properties) → search/comparison → results; a query example follows the same generic element analysis (and object formation), while a symbolic query such as "rushing water" goes through word-to-class mapping to properties alone]
  - features are calculated over grouped subsets
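As an illustration of "search by proximity in a weighted feature space" (a toy sketch of the idea on the slides above, not the Muscle Fish SoundFisher implementation; the feature set and weighting scheme are simply assumptions for the example), each file is reduced to one global feature vector and a query example is matched to its nearest neighbours:

```python
# Toy global-feature retrieval: one feature vector per sound file,
# nearest-neighbour search by weighted Euclidean distance.
import numpy as np

def global_features(x, sr):
    """Crude per-file features: RMS level, spectral centroid ('brightness'),
    spectral bandwidth, and zero-crossing rate."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    p = spec / (spec.sum() + 1e-12)
    centroid = float((freqs * p).sum())
    bandwidth = float(np.sqrt(((freqs - centroid) ** 2 * p).sum()))
    rms = float(np.sqrt(np.mean(x ** 2)))
    zcr = float(np.mean(np.abs(np.diff(np.sign(x)))) / 2)
    return np.array([rms, centroid, bandwidth, zcr])

def search(query_vec, database, weights, k=5):
    """Return indices of the k database entries closest to the query."""
    diffs = (database - query_vec) * weights        # weight each feature dimension
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    return np.argsort(dists)[:k]

# Toy usage: 20 random 'sound files', query with the first one
sr = 8000
files = [np.random.randn(sr) * (i % 4 + 1) for i in range(20)]
db = np.vstack([global_features(x, sr) for x in files])
weights = 1.0 / (db.std(axis=0) + 1e-9)             # put dimensions on a comparable scale
print(search(db[0], db, weights))
```

Inverse-standard-deviation weighting is just one way to balance the dimensions; the point of the sketch is that the whole file is a single point in feature space, which is exactly what breaks down for mixtures and motivates the element/object analysis on the next slide.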
Outline
1 Audio Information Extraction
2 Speech, music, and other
3 General sound organization
4 Future work & summary

Automatic audio-video analysis
(with Prof. Shih-Fu Chang, Prof. Kathy McKeown)
• Documentary archive management
  - huge ratio of raw to finished material
  - costly manual logging
  - missed opportunities for cross-fertilization
• Problem: term <-> signal mapping
  - training corpus of past annotations
  - interactive, semi-automatic learning
  - need object-related features
[Figure: system diagram: A/V data and text annotations feed text processing and A/V segmentation / feature extraction; the resulting terms and A/V features drive concept discovery and unsupervised clustering of A/V features into generic detectors, building a multimedia concept network for multimedia fusion (mining concepts & relationships), complex spatio-temporal classification, and question answering]

The 'Machine listener'
• Goal: an auditory system for machines
  - use the same environmental information as people do
• Signal understanding
  - monitor for particular sounds
  - real-time description
• Scenarios
  - personal listener → summary of your day
  - future prosthetic hearing device
  - autonomous robots

LabROSA Summary
• Domains: meetings, personal recordings, location monitoring, broadcast, movies, lectures
• ROSA: object-based structure discovery & learning
  - speech recognition, speech characterization, nonspeech recognition
  - scene analysis, audio-visual integration, music analysis
• Applications: structuring, search, summarization, awareness, understanding

Summary: Audio Info Extraction
• Sound carries information
  - useful and detailed
  - often tangled in mixtures
• Various important general classes
  - Speech: activity, recognition
  - Music: segmentation, clustering
  - Other: detection, description
• General processing framework
  - Computational Auditory Scene Analysis
  - Audio Information Retrieval
• Future applications
  - ubiquitous intelligent indexing
  - intelligent monitoring & description

Audio Information Extraction: panacea or punishment?