Speech interfaces: A survey and some current projects Dan Ellis & Nelson Morgan International Computer Science Institute Berkeley CA {dpwe,morgan}@icsi.berkeley.edu Outline 1 Speech recognition: the state of the art 2 Current projects at ICSI 3 Conclusions HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 1 Speech recognition 1 sound Feature calculation feature vectors Acoustic model parameters Acoustic classifier Word models s ah t Language model p("sat"|"the","cat") p("saw"|"the","cat") • phone probabilities HMM decoder phone / word sequence Understanding/ application... Elements of a recognizer: - feature design - acoustic modeling - pronunciation/language modeling HCC - Speech Interfaces - Dan Ellis } data! 2000-02-24 - 2 How good is speech recognition? • F0: F4: Standard measure is word error rate (WER): - dictation (close-mic): 2-5% - broadcast news: ~15% - telephone conversations: ~30% THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE PRESIDENTIAL CANDIDATES OF THE ELECTION • What are the problems? - acoustic variability (noise, channel) - speech variability (accent, manner) - exploiting linguistic constraints - speech understanding... HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 3 Frontiers of speech recognition • Acoustic modeling - beyond head-mounted mics - background noise (mobile phones) - speech in mixtures (broadcast) → robust feature design, better statistical models • Speaking styles - coarticulation - pronunciation variability - speaking styles → lump into acoustic model, more training data, better pron. models, context-dep. models • Linguistic constraints - ‘inferred’ words - ambiguity → higher-order n-grams (more training data), tree grammars HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 4 Applications of speech recognition • Command & control - more or less constrained • Dictation - large vocabulary - known, co-operative user • Voice response systems - dialog & speech understanding - robustness! - human factors (timing, barge-in etc.) • Information extraction & retrieval - multimedia archive retrieval - live ‘listener’ HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 5 Outline 1 Speech recognition: the state of the art 2 Current projects at ICSI - Recognizer confidence measures - Combining information sources - The meeting recorder - Audio content-based retrieval 3 Conclusions HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 6 Recognizer confidence measures (Warner Warren, Andy Hatch, Eric Fosler + SRI) • Knowing which words are wrong can help - hard to tell because recognition only just works • Average per-phone entropy + re-estimation: DET plot for word-level confidence estimation (AURORA) 40 Miss probability (in %) 20 10 5 2 1 0.5 raw posteriors fwd-bwd posteriors 0.2 0.1 0.1 0.2 • 0.5 1 2 5 10 False Alarm probability (in %) 20 40 Use for combining recognizer outputs HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 7 Combination schemes (Mike Shire, Barry Chen + Michael Jordan) Feature 1 calculation Feature combination Feature 2 calculation HMM decoder Acoustic classifier Speech features Word hypotheses Phone probabilities “^” Input sound Feature 1 calculation Acoustic classifier HMM decoder Posterior combination Feature 2 calculation Input sound Acoustic classifier Word hypotheses Phone probabilities “•” Speech features • How best to combine different feature streams? Features Avg. WER plp 8.9% msg 9.5% plp ^ msg 8.1% plp • msg 7.1% HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 8 Tandem acoustic modeling (with Hermansky et al., OGI) • ICSI pioneered ‘hybrid-connectionist’ ASR; Can it be combined with conventional models? Feature calculation Input sound Neural net model (Posterior decoder) (Phone probabilities) Speech features Pre-nonlinearity outputs • PCA orthogn'n (Hybrid system output) Subword likelihoods Othogonal features HTK GM model Tandem system output HTK decoder Result: better performance than either alone! - neural net & Gaussian mixture models extract different information from training data System-features Avg. WER HTK-mfcc 13.7% Hybrid-mfcc 9.3% Tandem-mfcc 7.4% Tandem-plp+msg 6.4% HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 9 Aurora “Distributed SR” evaluation • Organized by ETSI (European Telecoms. Standards Institute) Avg WER -20-0 Baseline reduction ETSI Aurora 1999 Evaluation 70% 60% 50% 40% 30% 20% 10% 2 Ta nd em S6 1 Ta nd em S5 S4 S3 S2 S1 Ba se lin e 0% - Tandem systems from OGI-ICSI-Qualcomm HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 10 The meeting recorder project (Adam Janin, Eric Fosler + UW, SRI, UPM, James Landay) • Idea: PDA records meetings to replace / enhance note-taking • First task: Collect a training corpus • Related to DARPA Communicator, SmartKom HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 11 Meeting recorder: Research areas • Audio recognition - recognition from noisy microphones - speaker identification & tracking - nonspeech events • Indexing application - understanding the structure of meetings - information retrieval - user interface HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 12 Audio content-based retrieval (with Sheffield, Cambridge, BBC, Avideh Zakhor) • Idea: speech recognition output as indexes for broadcast news - useful even with 15-30% WER HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 13 Audio-video organization & retrieval • Proposed project: Boundaries & structuring Speech processing audio Nonspeech analysis Audio-video data IR engine Indexing terms symbolic analogic A-V match video Query Video features Summarization results • Synergy between audio & video features • Query by terms or by examples • Recovering temporal structuret HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 14 Conclusions 3 • Speech recognition is now practical - .. but still plenty of problems • Ongoing research in speech recognition - recognition in demanding conditions - understanding / discourse a big issue • Multimodal information retrieval - forgiving & fertile research area HCC - Speech Interfaces - Dan Ellis 2000-02-24 - 15