Recognition & Organization of Speech & Audio Dan Ellis http://labrosa.ee.columbia.edu/ Outline 1 Introducing LabROSA 2 Projects in speech, music & audio 3 Summary ROSA overview - Dan Ellis 2001-09-28 - 1 Lab ROSA Sound organization 1 frq/Hz 4000 2000 0 0 2 4 6 8 10 12 time/s Analysis Voice (evil) Voice (pleasant) Stab Rumble Choir Strings • Central operation: - continuous sound mixture → distinct objects & events • Perceptual impression is very strong - but hard to ‘see’ in signal ROSA overview - Dan Ellis 2001-09-28 - 2 Lab ROSA Bregman’s lake “Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman’90) • Received waveform is a mixture - two sensors, N signals ... • Disentangling mixtures as primary goal - perfect solution is not possible - need knowledge-based constraints ROSA overview - Dan Ellis 2001-09-28 - 3 Lab ROSA The information in sound freq / Hz Steps 1 Steps 2 4000 3000 2000 1000 0 0 1 2 3 4 0 1 2 3 4 time / s • A sense of hearing is evolutionarily useful - gives organisms ‘relevant’ information • Auditory perception is ecologically grounded - scene analysis is preconscious (→ illusions) - special-purpose processing reflects ‘natural scene’ properties - subjective not canonical (ambiguity) ROSA overview - Dan Ellis 2001-09-28 - 4 Lab ROSA Key themes for LabROSA http://labrosa.ee.columbia.edu/ • Sound organization: construct hierarchy - at an instant (sources) - along time (segmentation) • Scene analysis - find attributes according to objects - use attributes to form objects - ... plus constraints of knowledge • Exploiting large data sets (the ASR lesson) - supervised/labeled: pattern recognition - unsupervised: structure discovery, clustering • Special cases: - speech recognition - other source-specific recognizers • ... within a ‘complete explanation’ ROSA overview - Dan Ellis 2001-09-28 - 5 Lab ROSA Outline 1 Introducing LabROSA 2 Projects in speech, music & audio - Tandem speech recognition - ‘Meeting recorder’ speech analysis - Musical information extraction - Alarm sound detection 3 Summary ROSA overview - Dan Ellis 2001-09-28 - 6 Lab ROSA Automatic Speech Recognition (ASR) • Standard speech recognition structure: sound Feature calculation D AT A feature vectors Acoustic model parameters Acoustic classifier Word models s ah t Language model p("sat"|"the","cat") p("saw"|"the","cat") phone probabilities HMM decoder phone / word sequence Understanding/ application... • ‘State of the art’ word-error rates (WERs): - 2% (dictation) - 30% (telephone conversations) • Can use multiple streams... ROSA overview - Dan Ellis 2001-09-28 - 7 Lab ROSA Tandem speech recognition (with Manuel Reyes, ICSI, OGI, CMU) • Neural net estimates phone posteriors; but Gaussian mixtures model finer detail • Combine them! Hybrid Connectionist-HMM ASR Feature calculation Conventional ASR (HTK) Noway decoder Neural net classifier Feature calculation HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t s ah t tn+w Input sound Words Phone probabilities Speech features Input sound Subword likelihoods Speech features Tandem modeling Feature calculation Neural net classifier HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t tn+w Input sound • Speech features Phone probabilities Subword likelihoods Words Train net, then train GMM on net output - GMM is ignorant of net output ‘meaning’ ROSA overview - Dan Ellis 2001-09-28 - 8 Lab ROSA Words Tandem system results: Aurora digits WER as a function of SNR for various Aurora99 systems 100 Average WER ratio to baseline: HTK GMM: 100% Hybrid: 84.6% Tandem: 64.5% Tandem + PC: 47.2% WER / % (log scale) 50 20 10 5 HTK GMM baseline Hybrid connectionist Tandem Tandem + PC 2 1 -5 0 5 10 15 SNR / dB (averaged over 4 noises) 20 clean Aurora 2 Eurospeech 2001 Evaluation Columbia Philips Rel improvement % - Multicondition 60 UPC Barcelona Bell Labs IBM 50 40 Motorola 1 Motorola 2 Nijmegen ICSI/OGI/Qualcomm 30 ATR/Griffith AT&T Alcatel Siemens 20 10 0 Avg. rel. improvement -10 ROSA overview - Dan Ellis UCLA Microsoft Slovenia Granada 2001-09-28 - 9 Lab ROSA The Meeting Recorder project (with ICSI, UW, SRI, IBM) • Microphones in conventional meetings - for summarization/retrieval/behavior analysis - informal, overlapped speech • Data collection (ICSI, UW, ...): - 100 hours collected, ongoing transcription - headsets + tabletop + ‘PDA’ ROSA overview - Dan Ellis 2001-09-28 - 10 Lab ROSA Crosstalk cancellation • Baseline speaker activity detection is hard: Spkr A speaker active speaker B cedes floor Spkr B Spkr C interruptions Spkr D breath noise Spkr E crosstalk level/dB 40 Table top 120 20 0 125 130 135 140 145 150 155 time / secs • Noisy crosstalk model: m = C ⋅ s + n • Estimate subband CAa from A’s peak energy - ... including pure delay (10 ms frames) - ... then linear inversion ROSA overview - Dan Ellis 2001-09-28 - 11 Lab ROSA PDA-based speaker change detection • Goal: small conference-tabletop device • Speaker turns from PDA mock-up signals? pda.aif: excerpt with 512-pt xcorr, 80% max thresh 8000 6000 4000 2000 0 Morgan 31 32 what's this one here this this one has a couple Adam Dan 33 34 35 cheesy electrets oh 36 37 38 that's connected too or 39 yeah that's 40 that's connected twoyep that's the dummy dummy P.D.A. both yeah both channels yeah it is 1.5 lag/ms 1 0.5 0 -0.5 -1 -1.5 31 32 33 34 35 36 37 38 39 40 31 32 33 34 35 time/s 36 37 38 39 40 0.5 0 -0.5 • SCD algo on spectral + interaural features - average spectral + per-channel ITD, ∆φ ROSA overview - Dan Ellis 2001-09-28 - 12 Lab ROSA Music analysis: Lyrics extraction (with Adam Berenzweig) • Vocal content is highly salient, useful for retrieval • Can we find the singing? Use an ASR classifier: freq / kHz music (no vocals #1) singing (vocals #17 + 10.5s) 4 2 0 singing phone class Posteriogram Spectrogram speech (trnset #58) 0 1 2 time / sec 0 1 2 time / sec 0 1 2 time / sec • Frame error rate ~20% for segmentation based on posterior-feature statistics • Lyric segmentation + transcribed lyrics → training data for lyrics ASR... ROSA overview - Dan Ellis 2001-09-28 - 13 Lab ROSA Music analysis: Structure recovery (with Rob Turetsky) • Structure recovery by similarity matrices (after Foote) - similarity distance measure? - segmentation & repetition structure - interpretation at different scales: notes, phrases, movements - incorporating musical knowledge: ‘theme similarity’ ROSA overview - Dan Ellis 2001-09-28 - 14 Lab ROSA freq / Hz Alarm sound detection • Alarm sounds have particular structure - people ‘know them when they hear them’ • Isolate alarms in sound mixtures 5000 5000 5000 4000 4000 4000 3000 3000 3000 2000 2000 2000 1000 1000 1000 0 1 1.5 • 2 2.5 0 1 1.5 2 time / sec 2.5 0 1 1 1.5 2 2.5 representation of energy in time-frequency formation of atomic elements grouping by common properties (onset &c.) classify by attributes... Key: recognize despite background ROSA overview - Dan Ellis 2001-09-28 - 15 Lab ROSA The ‘Machine listener’ • Goal: An auditory system for machines - use same environmental information as people • Aspects: - recognize spoken commands (but not others) - track ‘acoustic channel’ quality (for responses) - categorize environment (conversation, crowd...) • Scenarios - personal listener → summary of your day - autonomous robots: need awareness ROSA overview - Dan Ellis 2001-09-28 - 16 Lab ROSA Outline 1 Introducing LabROSA 2 Projects in speech, music & audio 3 Summary ROSA overview - Dan Ellis 2001-09-28 - 17 Lab ROSA DOMAINS LabROSA Summary • Meetings • Personal recordings • Location monitoring • Broadcast • Movies • Lectures ROSA • Object-based structure discovery & learning APPLICATIONS • Speech recognition • Speech characterization • Nonspeech recognition • • • • • ROSA overview - Dan Ellis • Scene analysis • Audio-visual integration • Music analysis Structuring Search Summarization Awareness Understanding 2001-09-28 - 18 Lab ROSA