Sound content analysis for indexing and understanding

Dan Ellis
International Computer Science Institute, Berkeley CA <dpwe@icsi.berkeley.edu>

Outline
1 Sound content analysis
2 Speech recognition
3 Auditory scene analysis
4 Audio content indexing
5 Conclusions

Sound content analysis

• Overall goal: ‘Useful’ data from sound
- which depends on the goal
[Figure: example spectrogram, 0-4000 Hz over ~12 s]
• Involving:
- continuous → discrete
- source separation
- extracting ‘semantic’ content (words, actions/events)

The space of sound analysis research

[Figure: research areas laid out on two axes, abstract vs. concrete and engineering vs. perception: speech recognition, computational auditory scene analysis, content-based retrieval, coding/compression, auditory modeling]

Outline
1 Sound content analysis
2 Speech recognition
- Classic speech recognition
- The connectionist-HMM hybrid
- Strength through combinations
3 Auditory scene analysis
4 Audio content indexing
5 Conclusions

Speech recognition: Dictation

• Observations X = {X_1..X_N} → states S = {S_1..S_N}; with the Markov assumption:

$S^* = \arg\max_S P(S \mid X) = \arg\max_S \frac{P(S,X)}{P(X)} = \arg\max_S \prod_i P(X_i \mid S_i) \cdot P(S_i \mid S_{i-1})$

i.e. a product of acoustic probabilities and transition probabilities.
• State sequences {S_i} (e.g. phones) define words
[Figure: recognizer pipeline - sound → feature calculation → feature vectors → acoustic classifier → phone likelihoods → HMM decoder → words; the decoder uses word models (e.g. “s ah t”) and a language model (e.g. p("sat"|"the","cat") vs. p("saw"|"the","cat"))]
• Training (on large datasets) is the key
- EM iteration for acoustic & transition probs.

The connectionist-HMM hybrid (Morgan & Bourlard, 1995)

• P(X_i|S_i) is an acoustic likelihood model
- model the distribution with, e.g., Gaussian mixtures
• Replace with the posterior, P(S_i|X_i):
[Figure: hybrid pipeline - sound → feature calculation → neural network acoustic classifier → posterior probabilities for context-independent phone classes (h#, pcl, bcl, tcl, dcl, ...) → HMM decoder with word models and language model → words]
- the neural network estimates the phone given the acoustics
- discriminative
• Simpler structure for research

Visualizing speech recognition

[Figure: sound → feature calculation → feature vectors → acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling]

Combination schemes

• How to use complementary features?
- Feature combination: the two feature streams are combined into a single stream for one acoustic classifier and HMM decoder (word hypotheses “^”)
- Posterior combination: each feature stream gets its own acoustic classifier, and the phone probabilities are combined before a single HMM decoder (word hypotheses “•”; sketched below)
- Hypothesis combination: each stream runs through its own classifier and decoder, and the word hypotheses are merged into final hypotheses
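A minimal sketch of the posterior-combination path (“•”) in Python/numpy: the two classifiers’ per-frame phone posteriors are merged before decoding. The log-domain average (geometric mean) used here is one common merging rule; the slides do not say which rule these systems used, so treat it as illustrative.

    import numpy as np

    def combine_posteriors(p1, p2, eps=1e-12):
        """Merge two phone-posterior streams frame by frame.

        p1, p2: (frames, phones) posterior arrays from two acoustic
        classifiers.  Uses a log-domain average (geometric mean), then
        renormalizes each frame - one common choice, assumed here.
        """
        logp = 0.5 * (np.log(p1 + eps) + np.log(p2 + eps))
        p = np.exp(logp)
        return p / p.sum(axis=1, keepdims=True)

    # Toy example: 3 frames, 4 phone classes, from two feature streams
    pA = np.array([[.7, .1, .1, .1], [.2, .6, .1, .1], [.1, .1, .7, .1]])
    pB = np.array([[.6, .2, .1, .1], [.1, .7, .1, .1], [.2, .1, .6, .1]])
    print(combine_posteriors(pA, pB))   # combined stream for the HMM decoder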
Combining feature streams

• How to allocate feature dimensions to models?
- lower-dimension models train more quickly
- higher-dimension models find more interactions
• PLP & MSG for Aurora (digits in noise):
- PLP are ‘conventional’ features
- MSG developed as a robust alternative
- evaluate by word-error rate (WER) compared to the default baseline
(^ = feature combination into one net; • = posterior combination of separate nets)

Features                    Parameters   Baseline WER ratio
plp12•dplp12                136k         97.6%
plp12^dplp12                124k         89.6%
msg3a•msg3b                 145k         101.1%
msg3a^msg3b                 133k         85.8%
plp12•dplp12•msg3a•msg3b    281k         76.5%
plp12^dplp12^msg3a^msg3b    245k         74.1%
plp12^dplp12•msg3a^msg3b    257k         63.0%

Tandem connectionist models (with Hermansky et al., OGI)

• How can we combine neural net & GM models?
[Figure: tandem pipeline - input sound → feature calculation → neural net model; the pre-nonlinearity outputs pass through PCA orthogonalization to give orthogonal features for an HTK GM model and HTK decoder (tandem system output); the net’s phone probabilities could instead feed a posterior decoder (hybrid system output)]
- the GMM system does not know the features are phones
• Result: better performance than either alone!
- the neural net has trained discriminatively
- the GMM HMMs learn context-dependent structure
→ extract complementary info from the training data

System-features   Baseline WER ratio
HTK-mfcc          100.0%
Hybrid-mfcc       84.6%
Tandem-mfcc       64.5%
Tandem-plp+msg    47.2%

Aurora “Distributed SR” evaluation

• 7 telecoms company submissions:
[Figure: Aurora DSR Evaluation 1999 results - average WER improvement over baseline (−20-0 dB) for Baseline, S1-S6, Tandem1 and Tandem2]
- Tandem systems from OGI-ICSI-Qualcomm

Outstanding issues in speech recognition

• Are we on the right path?
- useful dictation products exist
- evaluation results improve every year
- .. but appear to be asymptoting
• Is dictation enough?
- a useful focus initially
- .. but not speech understanding
- .. and has skewed research
• What should be our research priorities?
- straight ASR research is hard to fund
- need to look at harder domains
- need to connect it to applications

Outline
1 Sound content analysis
2 Speech recognition
3 Auditory scene analysis
- Psychological phenomena
- Computational modeling
- Prediction-driven analysis
- Incorporating speech
4 Audio content indexing
5 Conclusions

Auditory Scene Analysis (ASA)

“The organization of sound scenes according to their inferred sources”
[Figure: spectrogram of a city-street scene (‘city22’), 200-4000 Hz over ~9 s]
• Sounds rarely occur in isolation
- need to ‘separate’ them for useful information
• Human audition is very effective
- computational models have a lot to learn

Psychology of ASA

• Extensive experimental research
- perception of simplified stimuli (sinusoids, noise)
• “Auditory Scene Analysis” [Bregman 1990]
- first: break the mixture into small elements
- elements are grouped into sources using cues
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- common onset/offset/modulation, harmonicity, spatial location, ... (a toy onset-grouping sketch follows this slide)
- relate to intrinsic (ecological) regularities
[Figure: frequency analysis feeding onset, harmonicity and position maps, combined by a grouping mechanism into source properties (after Darwin, 1996)]
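As flagged above, a toy Python/numpy illustration of the ‘common onset’ grouping rule: channels of a log-magnitude spectrogram whose energy jumps at the same frame are grouped together. The 9 dB threshold and the whole representation are invented for this demo; real front ends use far richer onset maps.

    import numpy as np

    def onset_groups(spec_db, thresh_db=9.0):
        """Group spectrogram channels by common onset time.

        spec_db: (channels, frames) log-magnitude spectrogram.
        Returns {onset_frame: [channel indices]} for channels whose level
        rises by more than thresh_db at the same frame - a crude stand-in
        for the common-onset cue, not a perceptual model.
        """
        rise = np.diff(spec_db, axis=1)          # frame-to-frame change
        groups = {}
        for ch in range(spec_db.shape[0]):
            big = np.where(rise[ch] > thresh_db)[0]
            if len(big):
                onset = int(big[0]) + 1          # frame where energy appears
                groups.setdefault(onset, []).append(ch)
        return groups

    # Toy spectrogram: channels 0-2 switch on at frame 5, channels 3-4 at 12
    spec = np.full((5, 20), -60.0)
    spec[:3, 5:] = -20.0
    spec[3:, 12:] = -20.0
    print(onset_groups(spec))                    # {5: [0, 1, 2], 12: [3, 4]}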
Computational Auditory Scene Analysis (CASA)

• Literal model of Bregman (e.g. Brown 1992):
[Figure: input mixture → front end (signal feature maps: frequency, onset time, period, freq. modulation) → object formation (discrete objects) → grouping rules → source groups]
• Goals
- identify and segregate different sources
- resynthesize separate outputs!

Grouping model results

• Able to extract voiced speech:
[Figure: paired spectrograms of the mixture (‘brn1h.aif’) and the extracted voiced speech (‘brn1h.fi.aif’), 100-3000 Hz over ~1 s]
• Limitations
- resynthesis via filter-mask
- only periodic targets
- robustness of discrete objects

Context, expectations & predictions

Perception is not direct but a search for plausible hypotheses
• Bregman’s “old-plus-new” principle:
A change in a signal will be interpreted as an added source whenever possible
• E.g. the ‘continuity illusion’:
[Figure: spectrogram (‘ptshort’) of a tone alternating with noise bursts, ~1.4 s]
- a tone alternates with noise bursts
- the noise is strong enough to mask the tone, so the listener cannot discriminate its presence
- a continuous tone is perceived across gaps of ~100s of ms
→ Inference acts at a low, preconscious level

Modeling top-down processing: ‘Prediction-driven’ CASA (PDCASA)

• Data-driven:
[Figure: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups]
vs. prediction-driven:
[Figure: hypotheses (noise components, periodic components) → predict & combine → predicted features; the front end’s signal features are compared & reconciled against the predictions, and the prediction errors drive hypothesis management]
• PDCASA key features:
- ‘complete explanation’ of all scene energy
- vocabulary of periodic/noise/transient elements
- multiple hypotheses
- explanation hierarchy

PDCASA for the continuity illusion

• Subjects hear the tone as continuous ... if the noise is a plausible masker
[Figure: the ‘ptshort’ tone/noise spectrogram and the two analyses]
• Data-driven analysis gives just the visible portions
• Prediction-driven analysis can infer the masking

PDCASA analysis of a complex scene

[Figure: PDCASA decomposition of the city-street scene into elements - Wefts 1-4, Weft 5, Wefts 6-7, Weft 8, Wefts 9-12 (Horn1 10/10, Horn2 5/10, Horn3 5/10, Horn4 8/10, Horn5 10/10, Squeal 6/10, Truck 7/10), Noise2 & Click1 (Crash 10/10), and Noise1; parenthesized scores are listener identifications out of 10]

CASA for speech recognition

• Data-driven: CASA as preprocessor
- problems with ‘holes’ (but: Okuno)
- doesn’t exploit knowledge of speech structure
• Missing data (Cooke et al., de Cheveigné)
- CASA cues distinguish present/absent
- RESPITE project: modifications to the recognizer
• Prediction-driven: speech as a component
- same ‘reconciliation’ of speech hypotheses
- need to express ‘predictions’ in the signal domain
[Figure: the PDCASA loop with speech components alongside noise and periodic components in the hypothesis set]

Other signal-separation approaches

• HMM decomposition (RK Moore ’86)
- recover the combined source states directly
[Figure: two models’ state spaces jointly decoding one stream of observations]
• Blind source separation (Bell & Sejnowski ’94)
- find exact separation parameters by maximizing a statistic, e.g. signal independence (a toy sketch follows)
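A minimal Python/numpy sketch of independence maximization in the spirit of the Bell & Sejnowski infomax approach, assuming an instantaneous two-channel mixture of super-Gaussian sources. The toy sources, learning rate and batching are demo choices, not the published recipe; real blind source separation must also cope with convolutive mixing.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two super-Gaussian (speech-like) sources, instantaneously mixed
    n = 20000
    s = rng.laplace(size=(2, n))                 # hidden sources
    A = np.array([[1.0, 0.6], [0.5, 1.0]])       # unknown mixing matrix
    x = A @ s                                    # observed two-channel mixture

    # Natural-gradient infomax updates with a tanh nonlinearity
    W = np.eye(2)
    lr, batch = 0.01, 100
    for epoch in range(5):
        for i in range(0, n - batch + 1, batch):
            y = W @ x[:, i:i + batch]
            g = np.tanh(y)
            W += lr * (np.eye(2) - (g @ y.T) / batch) @ W

    # Each recovered row should correlate strongly with one source
    y = W @ x
    print(np.round(np.corrcoef(np.vstack([s, y]))[:2, 2:], 2))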
Outstanding issues in CASA

• What is the architecture?
- data-driven versus prediction-driven
- representations at different levels
- hypothesis search
• How to combine different cues?
- priority of different cues
- resolving conflicting cues
- bottom-up versus top-down
• How to exploit training data?
- .. the big lesson from speech recognition
• Evaluation
- .. a more subtle lesson

Outline
1 Sound content analysis
2 Speech recognition
3 Auditory scene analysis
4 Audio content indexing
- Spoken document retrieval
- Handling nonspeech audio
- Object-based analysis and retrieval
- Audio-video content organization
5 Conclusions

Audio content indexing: Spoken document retrieval (SDR)

• Idea: speech recognition transcripts as indexes
• Best broadcast news systems are not great
- 15-30% WER on real broadcasts
• Word errors vary in their impact:
THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE PRESIDENTIAL CANDIDATES OF THE ELECTION THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISODE OF TRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIVE PARTY OFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOOD BUYS GEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THIS IS STEPHEN BEARD FOR MARKETPLACE
• Good enough for information retrieval (IR)
- e.g. TREC-8 average precision: reference transcript ~0.5; 30% WER transcript ~0.4

Thematic Indexing of Spoken Language (with Sheffield, Cambridge, BBC)

• SDR for the BBC broadcast news archive
- 1000+ hr archive, automatically updated

Speech and nonspeech (with Gethin Williams)

• ASR run over entire soundtracks?
- for nonspeech, the result is nonsense
• Watch the behavior of the speech acoustic model (see the sketch below):
- average per-frame entropy
- ‘dynamism’ - the mean-squared 1st-order difference of the posteriors
[Figure: spectrogram and phone posteriors for speech, music and speech+music, and a scatter of dynamism vs. entropy for 2.5-second segments of speech/music, which separates the classes]
• 1.3% error on a 2.5-second speech-music test set
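A Python/numpy sketch of the two statistics just described, computed from a segment’s posterior matrix. The toy ‘speech’ and ‘music’ segments are fabricated to show the expected contrast; the actual classifier and thresholds behind the 1.3% result are not specified here.

    import numpy as np

    def entropy_dynamism(post, eps=1e-12):
        """Entropy and 'dynamism' of acoustic-model posteriors.

        post: (frames, phones) posteriors for one segment (e.g. 2.5 s).
        Entropy is averaged per frame; dynamism is taken as the
        mean-squared first-order difference of the posteriors (one
        plausible reading of the slides' definition).
        """
        entropy = -np.mean(np.sum(post * np.log(post + eps), axis=1))
        dynamism = np.mean(np.sum(np.diff(post, axis=0) ** 2, axis=1))
        return entropy, dynamism

    # Toy segments: 'speech' = confident, changing posteriors; 'music' = flat
    speech = np.tile(np.eye(4), (10, 1)) * 0.96 + 0.01   # cycles through phones
    music = np.full((40, 4), 0.25)                       # uniform, static
    print(entropy_dynamism(speech))   # low entropy, high dynamism
    print(entropy_dynamism(music))    # high entropy (~log 4), zero dynamism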
Element-based audio indexing

• Search for nonspeech audio databases
- e.g. Muscle Fish ‘SoundFisher’ for SFX libraries
• Segment-level features
[Figure: sound segments → segment feature analysis → feature vectors → sound segment database; a query example is analyzed the same way and matched by search/comparison to give results]
- well-performing features: spectral centroid, dynamics, tonality ...
• Each segment is an object
- not applicable to continuous recordings

Object-based audio indexing

• Using ‘generic sound elements’
- decompose sound into elements; match subsets
- how to generalize?
- how to use segment-style features?
• Form into objects for higher-order properties
- CASA-type object formation (onset, harmonicity)
[Figure: a continuous audio archive → generic element analysis → element representations → object formation → objects + properties; a query example goes through the same element analysis (and object formation), while a symbolic query such as “rushing water” maps word-to-class onto properties alone; search/comparison returns the results]

Audio-video organization & retrieval

• How it might work...
[Figure: audio feeds speech processing and nonspeech analysis, video feeds video features; boundaries & structuring organize the audio-video data; symbolic indexing terms go to an IR engine, analogic A-V matching handles example queries; summarization presents the results]

AV indexing components

• Recovering broad temporal structure
- speaker turns ; speech & music ; repetition
- characteristic of genres, e.g. news shows
- indexable attributes in themselves
• Posing queries:
- term-based
- proximity to examples
- dynamic audio-visual sketches?
• How to define index/query terms?
- different kinds of terms: literal versus thematic
- machine learning of event classes
• Summarization
- for displaying ‘hits’: impacts usability
- text / image / video / sound
- tricks e.g. to find the most salient words

Open issues in audio indexing

• Information from speech
- multiple, confidence-tagged results? (not WER)
- prosodics; emphasis; speaking style
- speaker tracking, identity, character
• Information from nonspeech
- how to define objects
- how to match symbolic search terms
• Integrating audio and video
- combining information for search elements
- forms of query
• Related applications
- ‘structured content’ encoders (e.g. MPEG4SA)
- semantic hearing aids ; robot monitors

Outline
1 Sound content analysis
2 Speech recognition
3 Auditory scene analysis
4 Audio content indexing
5 Conclusions

Conclusions: The state of sound content analysis

• Speech recognition:
- focused application, practical results
- powerful statistical pattern recognition tools
- able to exploit large training sets
• Computational Auditory Scene Analysis:
- real-world sounds are mixtures
- discover advanced ecological constraints
- results still rather preliminary
• Content-based retrieval:
- compelling problem; forgiving application
- leveraging audio-visual correlations
- fertile ground for research