General Soundtrack Analysis Dan Ellis <dpwe@ee.columbia.edu> Laboratory for Recognition and Organization of Speech and Audio (LabROSA) Electrical Engineering, Columbia University http://labrosa.ee.columbia.edu/ Outline 1 LabROSA introduction 2 Broadcast soundtrack monitoring 3 Example technologies 4 Summary Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 1 Lab ROSA LabROSA: Sound Organization 1 freq / Hz alj0-20-50 8000 6000 4000 2000 0 0 5 10 15 20 25 time / sec 30 Analysis Voice2 Voice1 Music1 Stab Music2 Music2 (quiet) • Analyzing and describing complex sounds: - continuous sound mixture → distinct objects & events • Human listeners as the prototype - strong subjective impression when listening - ..but hard to ‘see’ in signal Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 2 Lab ROSA The information in sound freq / Hz Steps 1 Steps 2 4000 3000 2000 1000 0 0 1 2 3 4 0 1 2 3 4 time / s • Hearing confers evolutionary advantage - optimized to get ‘useful’ information from sound • Enormous detail is available in familiar sounds - ‘ecological’ influence on our sense of hearing Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 3 Lab ROSA Audio Information Extraction AutomaticSpeech Recognition ASR Text Information Retrieval IR Spoken Document Retrieval SDR Text Information Extraction IE Blind Source Separation BSS Independent Component Analysis ICA Music Information Retrieval Music IR Audio Information Extraction AIE Computational Auditory Scene Analysis CASA Audio Fingerprinting Audio Content-based Retrieval Audio CBR Unsupervised Clustering • Domain - text ... speech ... music ... general audio • Operation - recognize ... index/retrieve ... organize Soundtrack @ FBIS - Dan Ellis Multimedia Information Retrieval MMIR 2002-03-12 - 4 Lab ROSA DOMAINS LabROSA Summary • Meetings • Personal recordings • Location monitoring • HCI • Broadcast • Movies • Lectures • Music ROSA • Object-based structure discovery & learning APPLICATIONS • Speech recognition • Speaker description • Nonspeech recognition • Scene Analysis • Audio-visual integration • Music analysis • Multimedia access & search • Personal media management • Machine perception & awareness • Prostheses / human augmentation • Automatic judgments/recommendation Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 5 Lab ROSA Outline 1 LabROSA introduction 2 Broadcast soundtrack monitoring - Available information - Information filtering - Intelligent analysis 3 Example technologies 4 Summary Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 6 Lab ROSA 2 Broadcast soundtrack monitoring: What information is available? • Video and soundtrack reinforce, complement - hard things in one domain can be easy in other • Information at different scales: generic dialog style sound events words specific program genre activity level speaker emotion story coverage schedule changes speaker ID seconds days timescale Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 7 Lab ROSA Information filtering • Maximizing analyst utility: information selection • Automatic support for ‘triage’: - segmentation - inferring schedules - identify repeated segments - generic categorization/labeling - task-specific flagging Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 8 Lab ROSA Automatic labeling/flagging • Computer as vigilant monitor Signal Models Monitoring computer Flagged events • Range of tasks - generic ‘unusual’ patterns - specific trigger events • Issues - false alarms vs. misses - combinations of simple detectors for higher-level classes e.g. program type Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 9 Lab ROSA Schedule summarization • Learn the patterns of a particular broadcaster: - 24/7 monitoring - automatic segmentation into programs - automatic classification of genres M T W Th F S Su 0000 News 0600 Talk Sports 1200 Music Other 1800 2400 • Support information extraction - program types, repetitions, summarization - schedule changes → activity? Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 10 Lab ROSA Outline 1 LabROSA introduction 2 Broadcast soundtrack monitoring 3 Example technologies - Sound enhancement - Locating repetitions - Labeling soundtracks 4 Summary Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 11 Lab ROSA Example technologies: Sound enhancement 3 • Time scale modification freq / Hz Original speech 8000 6000 4000 2000 Sped up by 50% 8000 - speed up for rapid skimming 6000 4000 2000 0 Slowed down to 50% 8000 - slow down for close listening 6000 4000 2000 0 Noise reduced at pk - 20dB 8000 6000 • Noise reduction 4000 2000 0 0 Soundtrack @ FBIS - Dan Ellis 1 2 3 4 5 2002-03-12 - 12 6 7 8 Lab ROSA 9 time / sec Repetition detection • Matching signals is expensive - but matching strings is fast • Represent audio as simple string - recognition into arbitrary classes - match detection becomes string matching Models perc spee cell soundtrack anim viol mach Recognizer v p a v c a detection String match target • Recognizer p a c p v a s Monitor for many ‘signatures’ Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 13 Lab ROSA Soundtrack labeling • Broad class: speech-music freq / Hz alj0-20-50 8000 6000 4000 Pr(mus) 2000 0 1 0.5 0 0 freq / Hz • 5 10 15 20 25 time / sec 30 Specific events: alarms Speech + alarm 4000 2000 Pr(alarm) 0 1 0.5 0 1 2 Soundtrack @ FBIS - Dan Ellis 3 4 5 6 7 8 9 time / sec 2002-03-12 - 14 Lab ROSA Speaking style • Recognizing information other than words • Meeting Recorder project: Locate overlaps backchannel (signals desire to regain floor?) floor seizure mr-2000-06-30-1600 Spkr A speaker active speake cedes f Spkr B Spkr C interruptions Spkr D breath noise Spkr E crosstalk level/dB 40 Table top 20 0 120 125 130 135 140 145 150 155 time / secs mr-2000-11-02 C A O M J L 0 5 10 • 15 20 25 30 35 40 45 50 55 60 Speaker emotion - depends on good baseline Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 15 Lab ROSA Outline 1 LabROSA introduction 2 Broadcast soundtrack monitoring 3 Example technologies 4 Summary Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 16 Lab ROSA 4 Summary: Soundtrack analysis • Soundtrack carries information - useful and detailed - complementary to image - rapid processing • Open source monitoring - Need to find the interesting bits - Short-time: specific detectors - Long-time: schedule, genre classification • Current techniques - Signal enhancement - Detectors for speech, music, alarms etc. - Classification of interaction style, emotion Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 17 Lab ROSA Extra slides Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 18 Lab ROSA Automatic Speech Recognition (ASR) • Standard speech recognition structure: sound Feature calculation D AT A feature vectors Acoustic model parameters Acoustic classifier Word models s ah t Language model p("sat"|"the","cat") p("saw"|"the","cat") phone probabilities HMM decoder phone / word sequence Understanding/ application... • ‘State of the art’ word-error rates (WERs): - 2% (dictation) - 30% (telephone conversations) • Can use multiple streams... Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 19 Lab ROSA The Meeting Recorder project (with ICSI, UW, SRI, IBM) • Microphones in conventional meetings - for summarization/retrieval/behavior analysis - informal, overlapped speech • Data collection (ICSI, UW, ...): - 100 hours collected, ongoing transcription - headsets + tabletop + ‘PDA’ Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 20 Lab ROSA Crosstalk cancellation • Baseline speaker activity detection is hard: backchannel (signals desire to regain floor?) floor seizure mr-2000-06-30-1600 Spkr A speaker active speaker B cedes floor Spkr B Spkr C interruptions Spkr D breath noise Spkr E crosstalk level/dB 40 Table top 120 20 0 125 130 135 140 145 150 155 time / secs • Noisy crosstalk model: m = C ⋅ s + n • Estimate subband CAa from A’s peak energy - ... including pure delay (10 ms frames) - ... then linear inversion Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 21 Lab ROSA freq / Hz Alarm sound detection • Alarm sounds have particular structure - people ‘know them when they hear them’ • Isolate alarms in sound mixtures 5000 5000 5000 4000 4000 4000 3000 3000 3000 2000 2000 2000 1000 1000 1000 0 1 1.5 • 2 2.5 0 1 1.5 2 time / sec 2.5 0 1 1 1.5 2 2.5 representation of energy in time-frequency formation of atomic elements grouping by common properties (onset &c.) classify by attributes... Key: recognize despite background Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 22 Lab ROSA Audio Information Retrieval (with Manuel Reyes) • Searching in a database of audio - speech .. use ASR - text annotations .. search them - sound effects library? • e.g. Muscle Fish “SoundFisher” browser - define multiple ‘perceptual’ feature dimensions - search by proximity in (weighted) feature space Segment feature analysis Sound segment database Feature vectors Seach/ comparison Results Segment feature analysis Query example - features are ‘global’ for each soundfile, no attempt to separate mixtures Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 23 Lab ROSA Audio Retrieval: Results • Musclefish corpus - most commonly reported set • Features - mfcc, brightness, bandwidth, pitch ... - no temporal sequence structure • Results: - 208 examples, 16 classes, 84% correct - confusions: Instr Musical instrs. Spch Env Anim Mech 136 (14) Speech 17 (7) Eviron. 2 Animals 2 Mechanical 1 Soundtrack @ FBIS - Dan Ellis 2 6 (1) 2 1 (0) 15 (2) 2002-03-12 - 24 Lab ROSA CASA for audio retrieval • When audio material contains mixtures, global features are insufficient • Retrieval based on element/object analysis: Generic element analysis Continuous audio archive Object formation Objects + properties Element representations Generic element analysis Seach/ comparison Results (Object formation) Query example rushing water Symbolic query Word-to-class mapping Properties alone - features are calculated over grouped subsets Soundtrack @ FBIS - Dan Ellis 2002-03-12 - 25 Lab ROSA