Mapping Meetings: Columbia’s plans Dan Ellis http://labrosa.ee.columbia.edu/ Outline 1 Columbia participants 2 Mapping Meetings: Perspectives 3 Techniques 4 Summary Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 1 of 24 Lab ROSA Columbia participants 1 • Dan Ellis Laboratory for Recognition and Organization of Speech and Audio (LabROSA) - signal processing, abstraction and indexing - speech recognition - computational auditory scene analysis • Kathy McKeown Nautral Language Processing (NLP) group - text analysis, information extraction, generation • “Mapping meetings” support - 2 students, 3 academic years Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 2 of 24 Lab ROSA LabROSA DOMAINS http://labrosa.ee.columbia.edu/ • Meetings • Personal recordings • Location monitoring • Broadcast • Movies • Lectures ROSA • Object-based structure discovery & learning APPLICATIONS • Speech recognition • Speech characterization • Nonspeech recognition • • • • • • Scene analysis • Audio-visual integration • Music analysis Structuring Search Summarization Awareness Understanding Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 3 of 24 Lab ROSA Natural Language Processing Group • 4 faculty & senior researchers 13 Ph.D. students + MS ... • Summarization - Sports News task: max info-per-word • Shallow parsing - phrase-based (not sentences) - statistical • Intersection of similar documents - common phrases → summary - identify contradictions etc. Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 4 of 24 Lab ROSA Outline 1 Columbia Participants 2 Mapping meetings: Perspectives - project themes - broad plans 3 Techniques 4 Summary Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 5 of 24 Lab ROSA Mapping meetings: Project themes • Discourse analysis - interaction patterns: predicted, emergent - individuals .. pairs .. groups - words, speech style, timing, ... (speaker-relative) • Summarization & abstraction - topic content - “meeting acts” - participant roles (...) • Browsing & visualization - access methods / dimensions - scale (time, detail); pov orientation - data classes & types; rendering Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 6 of 24 Lab ROSA Information flow Words ASR per-speaker Speaker ID prosody close-talking x-cancel Topic tracking/ segmentation text sources Acts/episodes between-chan Speaker activity motion Audio per-chan multi-pitch nonspeech Speaker profiles missing-data masks de-reverb background laughter tabletop mics between-chan azimuth, triangulation separation (CASA, ICA) Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 7 of 24 Lab ROSA Signal organization plans • Speaker turns from channel energy envelopes - mixing matrix inversion • Extracting sources from mixed signals - source ID from spatial cues, pitch - feature/mask extraction → missing-data recog. • Distant signal recognition - tandem modeling - channel compensation - multi-source recognition Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 8 of 24 Lab ROSA NLP plans (Kathy McKeown) • Argumentative Summaries - special case of summarization - same theme, different perspectives • Outcomes - identify points of disagreement - identify resulting consensus • Depends on - information extraction: words, discourse, prosody • Manpower - one student starting in January Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 9 of 24 Lab ROSA Outline 1 Columbia Participants 2 Mapping Meetings: Perspectives 3 Techniques - speaker turn detection - per-speaker signal extraction - speech recognition 4 Summary Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 10 of 24 Lab ROSA Crosstalk cancellation • Baseline speaker activity detection is hard: Spkr A speaker active speaker B cedes floor Spkr B Spkr C interruptions Spkr D breath noise Spkr E crosstalk level/dB 40 Table top 20 0 120 125 130 135 140 145 150 155 time / secs • Noisy crosstalk model: m = C ⋅ s + n • Estimate column CxA from A’s peak energy - ... including pure delay (10 ms frames) - ... then linear inversion Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 11 of 24 Lab ROSA Passive motion detection • Cross-correlation recovers impulse response • Coupling to each mic gives distances; can infer which are moving S xy = H xy ⋅ S xx participant movement mr-2000-11-02-1440 Spkr C freq / chan 20 15 10 Coupling C→A delay / samples 5 20 15 10 5 Coupling C → Tabletop delay / samples 20 15 10 5 1020 1020.5 1021 Mapping meetings: Kickoff - Dan Ellis 1021.5 1022 1022.5 1023 1023.5 1024 1024.5 2001-10-13 - 12 of 24 time / sec Lab ROSA PDA-based speaker change detection • Goal: small conference-tabletop device • Speaker turns from PDA mock-up signals? pda.aif: excerpt with 512-pt xcorr, 80% max thresh 8000 6000 4000 2000 0 Morgan 31 32 what's this one here this this one has a couple Adam Dan 33 34 35 cheesy electrets oh 36 37 38 that's connected too or 39 yeah that's 40 that's connected twoyep that's the dummy dummy P.D.A. both yeah both channels yeah it is 1.5 lag/ms 1 0.5 0 -0.5 -1 -1.5 31 32 33 34 35 36 37 38 39 40 31 32 33 34 35 time/s 36 37 38 39 40 0.5 0 -0.5 • SCD algo on spectral + interaural features - average spectral + per-channel ITD, ∆φ Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 13 of 24 Lab ROSA Computational Auditory Scene Analysis (CASA) • input mixture Implement psychoacoustic theory? (Brown’92) Front end signal features (maps) Element formation discrete elements Grouping rules Source groups freq onset time period frq.mod - what are the features? how are they used? • Need top-down constraints: hypotheses Noise components Hypothesis management prediction errors input mixture Front end signal features Compare & reconcile Mapping meetings: Kickoff - Dan Ellis Periodic components Predict & combine predicted features 2001-10-13 - 14 of 24 Lab ROSA Pitch-based monaural separation • frq/Hz Bottom-up brn1h.aif frq/Hz 3000 3000 2000 1500 2000 1500 1000 1000 600 600 400 300 400 300 200 150 200 150 100 brn1h.fi.aif 100 0.2 0.4 0.6 0.8 1.0 time/s 0.2 0.4 0.6 0.8 1.0 time/s - time-frequency cells tagged with fundamental - delete cells not dominated by target f0 • Model-based (wefts) f/Hz Weft1 4000 2000 1000 400 200 Voices f/Hz 1000 4000 2000 1000 400 200 -40 1000 -50 400 200 100 50 0.0 400 200 100 50 -30 f/Hz -60 dB 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Wefts2-5 4000 2000 1000 400 200 1000 400 200 100 50 0.0 0.2 0.4 0.6 0.8 1.0 - estimate multiple periodicities per cell Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 15 of 24 1.2 1.4 time/s Lab ROSA Tandem speech recognition (with Manuel Reyes) • Neural net estimates phone posteriors; but Gaussian mixtures model finer detail • Combine them! Hybrid Connectionist-HMM ASR Feature calculation Conventional ASR (HTK) Noway decoder Neural net classifier Feature calculation HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t s ah t tn+w Input sound Words Phone probabilities Speech features Input sound Subword likelihoods Speech features Tandem modeling Feature calculation Neural net classifier HTK decoder Gauss mix models h# pcl bcl tcl dcl C0 C1 C2 s Ck tn ah t tn+w Input sound • Speech features Phone probabilities Subword likelihoods Words Train net, then train GMM on net output - GMM is ignorant of net output ‘meaning’ Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 16 of 24 Lab ROSA Words Tandem system results: Aurora digits WER as a function of SNR for various Aurora99 systems 100 Average WER ratio to baseline: HTK GMM: 100% Hybrid: 84.6% Tandem: 64.5% Tandem + PC: 47.2% WER / % (log scale) 50 20 10 5 HTK GMM baseline Hybrid connectionist Tandem Tandem + PC 2 1 -5 0 5 10 15 SNR / dB (averaged over 4 noises) 20 clean Aurora 2 Eurospeech 2001 Evaluation Columbia Philips Rel improvement % - Multicondition 60 UPC Barcelona Bell Labs IBM 50 40 Motorola 1 Motorola 2 Nijmegen ICSI/OGI/Qualcomm 30 ATR/Griffith AT&T Alcatel Siemens 20 10 0 Avg. rel. improvement -10 Mapping meetings: Kickoff - Dan Ellis UCLA Microsoft Slovenia Granada 2001-10-13 - 17 of 24 Lab ROSA Missing data recognition (with Cooke, Green, Barker @ Sheffield) • Noisy training seems to miss the point - rather have single ‘clean’ models • Use missing feature theory... - integrate over missing data dimensions xm p ( q xo ) = ∫ p(q x o, x m ) p ( x m x o ) dx m - trick is finding good/bad data mask - soft classification improves "1754" + noise 90 AURORA 2000 - Test Set A 80 Missing-data Recognition WER 70 MD Discrete SNR MD Soft SNR HTK clean training HTK multi-condition 60 50 40 "1754" Mask based on stationary noise estimate 30 20 10 0 -5 Mapping meetings: Kickoff - Dan Ellis 0 5 10 2001-10-13 - 18 of 24 15 Lab ROSA 20 Clean Multi-source decoding (Jon Barker @ Sheffield) • Search of sound-fragment interpretations "1754" + noise Mask based on stationary noise estimate Mask split into subbands `Grouping' applied within bands: Spectro-Temporal Proximity Common Onset/Offset Multisource Decoder "1754" • Comparing different masks - evaluate p(M,K|O) = p(M|K,O)·p(K|O) • CASA for masks/fragments Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 19 of 24 Lab ROSA Outline 1 Columbia Participants 2 Mapping Meetings: Perspectives 3 Techniques 4 Summary Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 20 of 24 Lab ROSA Summary • Columbia: Audio Organization, Language Processing • Meetings: Raw information sources, Multiple analyses • Audio signals (close talk / tabletop): Turns Isolated speech Other information Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 21 of 24 Lab ROSA Random ideas ... Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 22 of 24 Lab ROSA Speaker turn (H)MM • Markov model for speaker changes - optimal path for ambiguous cues A AD BC AC C B AB BD CD D • Transition matrix represents... - turn durations (self-loops) - response patterns • ML choice between alternate trans. matrices - detect different meeting modes: presentation, debate, conflict... Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 23 of 24 Lab ROSA Marginalizing turn features • Each turn may have features - pitch range, duration, rate, fluency • Features depend on speaker and discussion mode - marginalize two ways - speaker-relative features indicate mode meetings A B participants C D E F mr1 mr2 mr3 ed1 ed2 au1 G H meeting profiles participant profiles Mapping meetings: Kickoff - Dan Ellis 2001-10-13 - 24 of 24 Lab ROSA