An overview of Speech Recognition research at ICSI

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>

(ASR for TICSP - Dan Ellis - 1999sep02)

Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
3 Overview of current projects
4 Some details: Combinations, News retrieval, SpeechCorder
5 Conclusions

1 About ICSI
http://www.icsi.berkeley.edu/
• Founded 1988 as a 'portal' between the U.S. and European academic systems
  - attached to UC Berkeley (but independent)
  - about 100 people at any time
• Infrastructure funding from European governments/industry
  - .. in return for hosting visitors
  - Germany, Italy, Spain, Switzerland, Netherlands
• Research groups consist of staff, visitors and UC Berkeley graduate students
• Original vision: 'massively parallel systems'
  - has diversified since then

Groups at ICSI
• "Realization" (Real Speech Collective?)
  - Nelson Morgan, Steve Greenberg, Dan Ellis
  - Speech recognition for realistic conditions
  - Systems support for ASR applications
• ACIRI (AT&T Center for Internet Research at ICSI)
  - Network routing, traffic, security
• Theory
  - mathematical complexity, coding theory
• Applications
  - natural language understanding
  - connectionist models of cognition
• Networks
  - multimedia communications applications

ICSI Realization highlights
• Connectionist framework for ASR (Morgan & Bourlard)
• RASTA front-end processing (Hermansky & Morgan)
• The Ring Array Processor (RAP)
• SPERT/T0 - SBus boards with a custom 8-way vector µ-proc

Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
  - visualizing speech recognition
  - building a recognizer
  - some issues in ASR
3 Overview of current projects
4 Some details
5 Conclusions

2 The Hybrid Connectionist-HMM system
• Conventional ASR: symbols S, observations X

    S* = argmax_S P(S | X)
       = argmax_S P(S, X) / P(X)
       = argmax_S ∏_i P(X_i | S_i) · P(S_i | S_i-1)

• P(X_i | S_i) is the acoustic likelihood model, e.g. a GMM
• The connectionist system replaces it with a posterior, P(S_i | X_i):
  [Diagram: input sound → feature calculation → neural-network acoustic model → posterior probabilities for context-independent phone classes (h#, pcl, bcl, tcl, dcl, ...) → HMM decoder, with word pronunciation models and grammar → word sequence hypotheses]

Visualizing speech recognition
• Speech as a sequence of discrete symbols q_i
  [Diagram: sound → front end → feature vectors → phone classifier (acoustic models) → label probabilities → HMM decoder (word models, grammar) → phone & word labelling]

Building a recognizer
• Define pronunciation models
  - application vocabulary
  - standard dictionaries + phonetic rules?
• Build language model
  - P(W_i | W_i-1, ...)
  - count n-grams in example texts?
• Train acoustic model
  - choose features to suit conditions
  - train neural net on a large labeled corpus
  - relabel & retrain?

How much training data?
• The bulk of recent improvements derives from larger training sets
  [Plot: WER for PLP12N nets vs. hidden-layer size (500-4000 units) and training-set size (9.25-74 hours); WER falls from roughly 44% to 32%]
• Largest model: 2.5M parameters, 32M patterns
  = 24 days to train on TetraSPERT, 10^15 ops

Some issues in ASR
• 'Spectrogram reading' paradigm
  - short-time features, concatenative models
• Goal: classifier accuracy / Word Error Rate (WER)
  - objective measures, but quite opaque
  - normalization vs. generalization
• HMM requires search over all word sequences
  - dominates processing time in large-vocabulary systems
• The best solutions (e.g. features) depend on the task
  - RASTA plus delta-features good for small vocabularies
  - plain normalized PLP best for Broadcast News
  - modulation spectrum features best for combinations...
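In the hybrid system above, the network's posteriors P(S_i | X_i) stand in for the acoustic likelihoods: dividing each posterior by its class prior gives a scaled likelihood that differs from P(X_i | S_i) only by a per-frame constant, which cancels in Viterbi decoding. A minimal sketch of that conversion (plain Python; the toy probabilities are invented, not ICSI model outputs):

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    # log(P(q|x) / P(q)): equals log P(x|q) up to a per-frame
    # constant that cancels inside Viterbi decoding
    return [[math.log(max(p, floor)) - math.log(pr)
             for p, pr in zip(frame, priors)]
            for frame in posteriors]

# toy example: 2 frames, 3 phone classes (values invented)
post = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1]]
priors = [0.5, 0.3, 0.2]
loglik = scaled_log_likelihoods(post, priors)
```

The flooring guards against log(0) for classes the network scores at exactly zero.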
• Key challenge = Robustness
  - to: task, acoustic conditions, speaking rate, style, accent ...

Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
3 Overview of current projects
  - Front-end features
  - Pronunciation and language modeling
  - Modeling alternatives
  - Other topics
4 Some details
5 Conclusions

The modulation-filtered spectrogram (Brian Kingsbury)
• Goal: invariance to variable acoustics
  - filter out modulations irrelevant to speech
  - channel adaptation (on-line automatic gain control)
  - multiple representations
• Bark-scale power-spectral filterbank
  [Diagram: the filterbank envelopes pass through lowpass (0-16 Hz) or bandpass (2-16 Hz) modulation filtering, each followed by two stages of automatic gain control (time constants 160/320 ms for the lowpass features, 160/640 ms for the bandpass features)]
  [Figure: comparison of PLP log-spectral features and msg1a features over 3.5 s of speech, frequency in Bark]

Data-driven feature design (Mike Shire)
• 'Optimal' features for different conditions
  - subband envelope domain
  - linear-discriminant analysis (LDA) for the filter coefficients
  - separation of labeled classes is optimized
• Modulation-frequency domain responses for clean, reverberant and mixed conditions:
  [Plot: 1st and 2nd discriminant filters (dB vs. Hz) for clean, light-reverb, severe-reverb and clean+severe-reverb training conditions]

Automatic pronunciation extraction (Eric Fosler)
• Canonical pronunciations are too limited
• Phonetic rules overproduce (→ homonyms)
• Filter candidates against the acoustics:
  [Diagram: word strings (e.g. "The Dow Jones") are expanded via the dictionary, phone trees and letter-to-phone trees into a finite-state grammar (FSG); FSG decoding and alignment against the acoustics yield per-phone acoustic confidences, e.g. dh=0.24, ax=0.4, dcl=0.81, ...]
• Decision trees for the canonical→real mapping

Topic modeling (Latent Semantic Analysis) (Dan Gildea & Thomas Hofmann)
• Bayesian model:
  - p(word | doc) = Σ_t p(word | topic) p(topic | doc)
  - EM estimation of p(word | topic) & p(topic | doc) over the training set
  - p(topic | doc) estimated from context during recognition
• Use to modify language model weights
  - p(word) ∝ p_tri(word) · p_top(word) / p_uni(word)
  - Trigram language model perplexity of 109 reduced by 17%
• Use for topic segmentation?

Multiband systems (Adam Janin / Nikki Mirghafori)
• Separate recognizers look at different frequency bands
  - Fletcher/Allen model of human speech recognition
  - noise/corruption in one channel is limited
  - how to combine results?
  [Diagram: the speech signal is split into subbands; each band has its own front end and probability estimator, an MLP merger combines the per-band probabilities, and a Viterbi decoder produces the recognized words]
• Weighted average of all possible combinations
  - p(S | a,b,c,d) = Σ_B p(S | B, a,b,c,d) · p(B)
    where B ranges over the 16 possible band combinations
  - p(B) from? constant, local feature (entropy)

Buried Markov Models (Jeff Bilmes)
• Increasing the scope of the classifier input
• Add state-dependent sparse links:
  [Diagram: HMM states Q_t-1 = q_t-1 ... Q_t+2 = q_t+2 with additional sparse dependency links between the observation streams X and Y]
• How to add links?
  - maximum conditional mutual information
  - greedy algorithm
• How to model?
  - linear dependence as a first attempt

Connectionist speaker recognition (Dominique Genoud)
• Use neural networks to model speakers rather than phones?
• Specialize a phone classifier for a particular speaker?
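The topic-model decomposition earlier in this section, p(word | doc) = Σ_t p(word | topic) p(topic | doc), and the trigram rescaling p(word) ∝ p_tri · p_top / p_uni can be sketched as follows. The tiny two-topic model and every probability value here are invented for illustration; the real system estimates them by EM over a training corpus:

```python
def topic_word_prob(word, doc_topic, topic_word):
    # p(word | doc) = sum_t p(word | topic t) * p(topic t | doc)
    return sum(topic_word[t].get(word, 0.0) * p_t
               for t, p_t in doc_topic.items())

def topic_rescaled(p_tri, p_top, p_uni):
    # unnormalized topic-adapted weight: p_tri * p_top / p_uni
    return p_tri * p_top / p_uni

# tiny two-topic model over a three-word vocabulary (values invented)
topic_word = {"finance": {"stock": 0.6, "rain": 0.1, "game": 0.3},
              "weather": {"stock": 0.1, "rain": 0.7, "game": 0.2}}
doc_topic = {"finance": 0.25, "weather": 0.75}   # p(topic | doc)
p_rain = topic_word_prob("rain", doc_topic, topic_word)
```

A document that looks mostly weather-related pushes up the probability of weather words even when the trigram context alone would not.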
• Do both at once with a "twin-output MLP":
  [Diagram: 9 frames of 12 features (context t0-4 to t0+4) = 108 inputs feed an MLP (108:2000:107) whose output layer covers 53 'speaker' phones, 53 'world' phones and non-speech]

Acoustic Segment Classification (Gethin Williams)
• Features derived from the posteriors indicate the utterance type:
  - average per-frame entropy
  - 'dynamism' - mean squared 1st-order difference
  - average energy of the 'silence' label
  - covariance matrix distance to clean speech
  [Plot: dynamism vs. entropy for 2.5-second segments of speech, music and speech+music, with example spectrogram and posteriors; the three classes separate cleanly]
• 1.3% error on a 2.5-second speech/music test set
• Use for finding segment boundaries?

Perceptual experiments (Steven Greenberg et al.)
• Use spectrally-sparse speech:
  [Figure: spectrogram and waveform of the original utterance and of narrow spectral 'slits' centered between roughly 300 Hz and 6 kHz]
• Effect of temporal alignment between bands:
  [Table: intelligibility (WER%) for synchronized slits vs. slits de-synchronized by 25, 50 and 75 ms]

Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
3 Overview of current projects
4 Some details:
  - Information stream combination
  - Broadcast News spoken-document retrieval
  - SpeechCorder PDA
5 Conclusions

4.1 Information stream combinations
• Task: AURORA noisy digits
  - continuous digits
  - test: 4 noise types x 7 SNR levels
  - train: mixed clean/noisy data
• Feature design evaluation
  - intermediate representation for mobile phones
  - evaluation specifies a GMM-HMM (HTK) system
• Baseline results:

  Feature   System   WER ratio
  mfcc      HTK      100.0%
  plp       Hybrid    89.6%
  msg       Hybrid    87.1%
  msg       HTK      205.0%
  msg KG    HTK      184.5%

• Can we combine features advantageously?

Combination schemes
[Diagram: three architectures - (1) feature combination: the two feature streams are merged before a single acoustic classifier and HMM decoder; (2) posterior combination: each stream has its own acoustic classifier and the phone probabilities are merged before decoding; (3) hypothesis combination: each stream is decoded separately and the word hypotheses are merged]
• Simple probability combination works best:
  P(q_i | X1, X2) = P(q_i | X1) · P(q_i | X2) / P(q_i)
  ... if X1 ⊥ X2 | q

Posterior multiplication

  Features    System          WER ratio
  plp + msg   Feature combo    74.1%
  plp + msg   Prob. combo      63.0%
  plp + msg   HTK on probs.    51.6%

4.2 Spoken document retrieval
• Based on DARPA/NIST Broadcast News
• Training material recorded off-air
  - ABC, CNN, CSPAN, NPR
  - 200-hour training set (TREC: 550-hour archive)
  - training: word transcriptions + speaker time boundaries
• Best WER results:
  - 1996: HTK: 27%
  - 1997: HTK: 16% (but: easier; 22% on the 1996 eval)
  - 1998: LIMSI: 14% (SPRACH: 21%)
• Some clear conclusions
  - one classifier for all conditions (or male/female)
  - feature adaptation (VTLN, MLLR, SAT)
  - importance of segmentation
  - more data is useful

Applications for BN systems
• Live transcription
  - subtitles
  - transcripts
  - but: more than words?
• Video editing
  - precision word-time alignments
  - commercial systems by IBM, Virage, etc.
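The posterior-multiplication rule above, P(q_i | X1, X2) = P(q_i | X1) · P(q_i | X2) / P(q_i), holds when the two feature streams are conditionally independent given the phone class. A one-frame sketch with renormalization (the plp/msg posterior values are invented):

```python
def combine_posteriors(post1, post2, priors, floor=1e-10):
    # P(q | X1,X2) proportional to P(q|X1) * P(q|X2) / P(q),
    # renormalized; valid when the streams are independent given q
    raw = [max(p1, floor) * max(p2, floor) / pr
           for p1, p2, pr in zip(post1, post2, priors)]
    total = sum(raw)
    return [r / total for r in raw]

# one frame, 3 classes: per-stream posteriors (values invented)
plp = [0.6, 0.3, 0.1]
msg = [0.5, 0.4, 0.1]
priors = [1 / 3, 1 / 3, 1 / 3]
combined = combine_posteriors(plp, msg, priors)
```

When the two streams agree, the product sharpens the winning class beyond either stream's own posterior.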
• Information Retrieval (IR)
  - TREC/MUC 'spoken documents'
  - tolerant of word error rate, e.g.:
    THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELECTION SEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADER DANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE PRESIDENTIAL CANDIDATES OF THE ELECTION THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISODE OF TRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIVE PARTY OFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOOD BUYS GEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THIS IS STEPHEN BEARD FOR MARKETPLACE

Thematic Indexing of Spoken Language (Thisl)
• EC collaboration, BBC providing data
• 1000+ hours of archive data
• IR is the key factor
  - stop lists
  - weighting schemes
  - query expansion
  [Diagram: queries pass through segmentation control, ASR and NLP to an IR engine over the archive database; results (text, audio, video) are returned to the receiver over http]

ThislGui
• Tcl/Tk front-end to the Thisl IR engine
• Spoken query input: SPRACHdemo/AbbotDemo
• NLP integration: Prolog lattice parser

4.3 SpeechCorder
• Convergence of interesting problems:
  - ubiquitous PDAs
  - multimedia processing
  - very fast, low-power CPU design
  - resource-bound speech recognition
• ICSI / UC Berkeley / MIT collaboration
  - ICSI: speech & audio processing
  - UCB: user interface design
  - MIT: new low-power CPUs
• Current proposal
  - PDA - meeting / memo recorder
  - .. also for sociological study?
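The IR machinery listed for Thisl above (stop lists, term-weighting schemes) can be illustrated with a classic tf-idf scorer. This is a generic textbook sketch, not the Thisl engine; the stop list and the documents are made up:

```python
import math

STOP = {"the", "a", "of", "in", "for"}   # tiny illustrative stop list

def tfidf_score(query, doc, docs):
    # score one document: sum over non-stop query terms of
    # term frequency * log(N / document frequency)
    n = len(docs)
    words = doc.lower().split()
    score = 0.0
    for term in set(query.lower().split()) - STOP:
        tf = words.count(term)
        df = sum(1 for d in docs if term in d.lower().split())
        if tf and df:
            score += tf * math.log(n / df)
    return score

docs = ["the election results in nicaragua",
        "stock markets fell in london",
        "election of the party leader"]
scores = [tfidf_score("the election results", d, docs) for d in docs]
```

Because rare terms get high idf weight, a few correctly recognized content words can still rank the right document first even when much of the transcript is wrong, which is why IR tolerates substantial word error rates.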
    (new Human Centered Computing consortium)

The SpeechCorder GUI
• Live annotation of recognized speech
• Application issues:
  - correcting errorful transcriptions
  - finding places in the recording
  - annotations (speaker, notes, emphasis)

SpeechCorder: Audio
• Speech recognition
  - not close-mic'd
  - speaker ID
  - low-power / low-memory / vectorized
  - non-local processing for the 'best' transcript?
• Other audio issues
  - identifying speech versus nonspeech
  - finding / indexing nonspeech events...

Outline
1 About ICSI
2 Hybrid connectionist-HMM speech recognition
3 Overview of current projects
4 Some details
5 Conclusions
  - more criticisms of ASR
  - future work

5 Conclusions
• The downside of objective evaluation
  - the research priority has been a pragmatic goal: reduce WER
  - human speech recognition uses many constraints
  - grammatical/semantic constraints are implicit in word-sequence statistics (grammar)
  - automatic analysis of large corpora is possible & helpful
• The problems with a grammar
  - unexpected (unseen) phrases are discounted
  - highly brittle alternatives
  - masks the underlying performance
• A more scientific approach
  - first work on the underlying phoneme classifier
  - follow nonsense-syllable performance (Fletcher)

The signal model in speech recognition
• Systems & approach have been optimized for the speech-alone situation
  - minimize classifier parameters, maximize use of the 'feature space'
  - e.g. cepstra [example]
• Possibly non-lexical data is thrown away
  - pitch
  - timing/rhythm
  - speaker identification
• Dire consequences for
  - .. dealing with nonspeech sounds
  - .. distinguishing success & failure
• Popular focus of research
  - e.g. segmental models, pitch features
  - these fail to obtain robust improvements

Future work
• Continue improving robustness
  - better features
  - better pronunciations
  - better modeling
• Still looking for a good architecture
  - multiband
  - multistream
  - more adaptation
  - more contextual dependence
• Speech recognition: useful for applications
  - archive indexing, summarizing
  - personal devices, new interfaces
  - tie-in to general audio analysis...