Automatic Speech Recognition
ILVB-2006 Tutorial

The Noisy Channel Model
• Automatic speech recognition (ASR) is the process by which an acoustic speech signal is converted into a sequence of words [Rabiner et al., 1993]
• The noisy channel model [Lee et al., 1996]
  – The acoustic input is treated as a noisy version of a source sentence
  – Source sentence → Noisy Channel → Noisy sentence → Decoder → Guess at original sentence
  – Example: the source sentence "버스 정류장이 어디에 있나요?" ("Where is the bus stop?") passes through the channel, and the decoder's job is to recover exactly that sentence from its noisy acoustic realization

The Noisy Channel Model
• What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
• Treat the acoustic input O as a sequence of individual observations
  – O = o1, o2, o3, …, ot
• Define a sentence as a sequence of words
  – W = w1, w2, w3, …, wn
• The most likely sentence (code sketch below):
  Ŵ = argmax_{W ∈ L} P(W | O)
     = argmax_{W ∈ L} P(O | W) P(W) / P(O)    (Bayes' rule)
     = argmax_{W ∈ L} P(O | W) P(W)           (the "golden rule"; P(O) does not depend on W)

Speech Recognition Architecture Meets Noisy Channel
• Ŵ = argmax_{W ∈ L} P(O | W) P(W)
• Runtime pipeline: Speech Signals → Feature Extraction → O → Decoding → W (word sequence), e.g. "버스 정류장이 어디에 있나요?" ("Where is the bus stop?")
• The decoder searches a network built by Network Construction from three knowledge sources:
  – Acoustic Model: HMM estimation from a Speech DB
  – Pronunciation Model: built with G2P
  – Language Model: LM estimation from Text Corpora

Feature Extraction
• Mel-Frequency Cepstrum Coefficients (MFCCs) are a popular choice [Paliwal, 1992]
  – Pipeline: Preemphasis / Hamming window → FFT (Fast Fourier Transform) → Mel-scale filter bank → log|·| → DCT (Discrete Cosine Transform) → MFCC (12-dimensional)
  – Frame size: 25 ms / frame rate: 10 ms (overlapping frames a1, a2, a3, …)
  – 39 features per 10 ms frame (code sketch below)
  – Absolute: log frame energy (1) and MFCCs (12)
  – Delta: first-order derivatives of the 13 absolute coefficients
  – Delta-Delta: second-order derivatives of the 13 absolute coefficients

Acoustic Model
• Provides P(O | Q) = P(features | phone)
• Modeling units [Bahl et al., 1986]
  – Context-independent: phoneme
  – Context-dependent: diphone, triphone, quinphone
  – pL-p+pR: triphone for phone p with left context pL and right context pR
• Typical acoustic model [Juang et al., 1986]
  – Continuous-density hidden Markov model λ = (A, B, π)
  – Output distribution: Gaussian mixture (code sketch below)
    b_j(x_t) = Σ_{k=1}^{K} c_{jk} N(x_t; μ_{jk}, Σ_{jk})
  – HMM topology: 3-state left-to-right model for each phone, 1 state for silence or pause

Pronunciation Model
• Provides P(Q | W) = P(phone | word)
• Word lexicon [Hazen et al., 2002]
  – Maps legal phone sequences into words according to phonotactic rules
  – G2P (grapheme-to-phoneme) conversion can generate a word lexicon automatically
  – Several words may have multiple pronunciations
• Example: "tomato" as a pronunciation network (checked in code below)
  – [t] → [ow] (0.2) or [ah] (0.8) → [m] → [ey] (0.5) or [aa] (0.5) → [t] → [ow], all remaining arcs 1.0
  – P([towmeytow] | tomato) = P([towmaatow] | tomato) = 0.1
  – P([tahmeytow] | tomato) = P([tahmaatow] | tomato) = 0.4

Training
• Training process [Lee et al., 1996]
  – Speech DB → Feature Extraction → Baum-Welch re-estimation → repeat until converged → HMM
• Network for training
  – Sentence HMM: the word HMMs of each training sentence (e.g. ONE TWO THREE, TWO ONE, …) concatenated between start and end nodes
  – Word HMM: a word is a concatenation of its phone HMMs (e.g. ONE = W AH N)
  – Phone HMM: a left-to-right model over states 1, 2, 3
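To make the decision rule Ŵ = argmax_{W ∈ L} P(O | W) P(W) from the noisy-channel slides concrete, here is a minimal Python sketch. It works in the log domain and drops P(O), which does not depend on W; `acoustic_logprob`, `lm_logprob`, and the explicit candidate list are hypothetical stand-ins for the real acoustic model, language model, and search over L.

```python
import math

def decode(observations, candidates, acoustic_logprob, lm_logprob):
    """Pick the candidate sentence W maximizing log P(O|W) + log P(W).

    observations     : the acoustic feature sequence O
    candidates       : iterable of candidate word sequences (stands in for L)
    acoustic_logprob : function (O, W) -> log P(O|W), e.g. an HMM score
    lm_logprob       : function W -> log P(W), e.g. an n-gram model score
    """
    best_w, best_score = None, -math.inf
    for w in candidates:
        score = acoustic_logprob(observations, w) + lm_logprob(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```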
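The Feature Extraction slide builds a 39-dimensional vector per frame from 13 absolute coefficients plus their delta and delta-delta terms. The sketch below assumes the log energy and 12 MFCCs are already computed for every 10 ms frame and approximates the derivatives with a simple symmetric frame difference; real front ends typically use a longer regression window, so this is illustrative only.

```python
import numpy as np

def add_deltas(absolute):
    """Stack [absolute | delta | delta-delta] per frame.

    absolute : array of shape (num_frames, 13) holding log energy + 12 MFCCs.
    Returns an array of shape (num_frames, 39).
    """
    # First-order derivative approximated by a symmetric difference over
    # neighbouring frames (edge frames are padded by repetition).
    padded = np.pad(absolute, ((1, 1), (0, 0)), mode="edge")
    delta = (padded[2:] - padded[:-2]) / 2.0
    # Second-order derivative: the same difference applied to the deltas.
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = (padded_d[2:] - padded_d[:-2]) / 2.0
    return np.concatenate([absolute, delta, delta2], axis=1)
```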
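The Gaussian-mixture output distribution b_j(x_t) = Σ_k c_jk N(x_t; μ_jk, Σ_jk) on the Acoustic Model slide can be evaluated as below. The sketch assumes diagonal covariances (the common choice for MFCC features, though the slide does not say so) and works in the log domain for numerical stability.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """log b_j(x) for one HMM state j with a diagonal-covariance GMM.

    x         : feature vector, shape (D,)
    weights   : mixture weights c_jk, shape (K,), summing to 1
    means     : component means mu_jk, shape (K, D)
    variances : diagonal covariances sigma^2_jk, shape (K, D)
    """
    diff = x - means                                          # (K, D)
    # log N(x; mu, Sigma) for each component, with diagonal Sigma
    log_norm = -0.5 * (np.log(2 * np.pi * variances)
                       + diff ** 2 / variances).sum(axis=1)   # (K,)
    # log sum_k c_jk N(...) computed stably via log-sum-exp
    log_terms = np.log(weights) + log_norm
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())
```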
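The pronunciation probabilities in the "tomato" example are simply products of the branch probabilities along each path through the network; the small check below transcribes the slide's arc probabilities and verifies the quoted values.

```python
# Arc probabilities read off the "tomato" pronunciation network on the slide:
# after the initial [t], choose [ow] (0.2) or [ah] (0.8); after [m],
# choose [ey] (0.5) or [aa] (0.5); all remaining arcs have probability 1.0.
first_vowel = {"ow": 0.2, "ah": 0.8}
second_vowel = {"ey": 0.5, "aa": 0.5}

def pron_prob(v1, v2):
    """P([t v1 m v2 t ow] | tomato) = product of the two branch choices."""
    return first_vowel[v1] * second_vowel[v2]

assert abs(pron_prob("ow", "ey") - 0.1) < 1e-12   # P([towmeytow]|tomato)
assert abs(pron_prob("ah", "aa") - 0.4) < 1e-12   # P([tahmaatow]|tomato)
```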
Language Model
• Provides P(W), the probability of the sentence [Beaujard et al., 1999]
  – We saw that this is also used in the decoding process as the probability of transitioning from one word to another
  – Word sequence: W = w1, w2, w3, …, wn
  – Chain rule: P(w1 … wn) = Π_{i=1}^{n} P(wi | w1 … wi-1)
  – The problem is that we cannot reliably estimate the conditional word probabilities P(wi | w1 … wi-1) for all words and all sequence lengths in a given language
  – n-gram language model: use only the previous n-1 words to represent the history
    P(wi | w1 … wi-1) ≈ P(wi | wi-(n-1) … wi-1)
  – Bigrams are easily incorporated into a Viterbi search

Language Model
• Example
  – Finite State Network (FSN): a word network over the vocabulary below, e.g. 서울/부산/대구/대전 (Seoul/Busan/Daegu/Daejeon) → 에서 ("from") → 세시/네시 ("three o'clock"/"four o'clock") → 출발 ("departing") → 하는 → 기차/버스 ("train"/"bus"), or → 출발 → 대구/대전 → 도착 ("arriving") → 하는 → 기차/버스
  – Context-Free Grammar (CFG):
    $time = 세시 | 네시;
    $city = 서울 | 부산 | 대구 | 대전;
    $trans = 기차 | 버스;
    $sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans
    (i.e. "a $trans departing from $city at $time" or "a $trans departing $city and arriving at $city")
  – Bigram (code sketch below):
    P(에서 | 서울) = 0.2, P(세시 | 에서) = 0.5, P(출발 | 세시) = 1.0,
    P(하는 | 출발) = 0.5, P(출발 | 서울) = 0.5, P(도착 | 대구) = 0.9, …

Network Construction
• Expanding every word to the state level, we get a search network [Demuynck et al., 1997]
  – Pronunciation model for the example vocabulary: 일 ("one") = I L, 이 ("two") = I, 삼 ("three") = S A M, 사 ("four") = S A
  – The language model supplies the word-transition probabilities P(이 | x), P(일 | x), P(사 | x), P(삼 | x)
  – The resulting search network has start and end nodes, intra-word transitions between the HMM states inside each word, word transitions at which the LM score is applied, and between-word transitions that loop word ends back to word starts

Decoding
• Find Ŵ = argmax_{W ∈ L} P(W | O)
• Viterbi search: dynamic programming
  – Token Passing Algorithm [Young et al., 1989] (code sketch below)
    • Initialize all states with a token carrying a null history and the likelihood that the state is a start state
    • For each frame ak
      – For each token t in state s with probability P(t) and history H
        – For each successor state r
          – Add a new token to r with probability P(t) · P_{s,r} · P_r(ak) and history s.H

Decoding
• Pruning [Young et al., 1996] (code sketch below)
  – The entire search space for Viterbi search is much too large
  – The solution is to prune tokens on paths whose score is too low
  – Typical methods:
    – Histogram pruning: keep at most n hypotheses in total
    – Beam pruning: keep only hypotheses whose score is within a fixed fraction (beam) of the best score
• N-best hypotheses and word graphs
  – Keep multiple tokens and return the n best paths/scores
  – Can produce a packed word graph (lattice)
• Multiple-pass decoding
  – Perform multiple passes, applying successively more fine-grained language models

Large Vocabulary Continuous Speech Recognition (LVCSR)
• Decoding continuous speech over a large vocabulary
  – Computationally complex because of the huge potential search space
• Weighted Finite-State Transducers (WFST) [Mohri et al., 2002]
  – Efficient in time and space
  – Component transducers (State : HMM, HMM : Phone, Phone : Word, Word : Sentence) are combined and optimized into a single search-network WFST
• Dynamic decoding
  – On-demand network construction
  – Much lower memory requirements
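To make the bigram example on the Language Model slide concrete, the sketch below scores a word sequence under the bigram approximation. Only the probabilities shown on the slide are filled in; any pair not listed there, and the sentence-initial term, fall back to a purely hypothetical placeholder value where a real model would use counts plus smoothing.

```python
import math

# Bigram probabilities taken from the Language Model example slide,
# stored as (previous word, current word) -> P(current | previous).
BIGRAMS = {
    ("서울", "에서"): 0.2, ("에서", "세시"): 0.5, ("세시", "출발"): 1.0,
    ("출발", "하는"): 0.5, ("서울", "출발"): 0.5, ("대구", "도착"): 0.9,
}

def bigram_logprob(words, unk=1e-4):
    """log P(w1..wn) ~= sum_i log P(w_i | w_{i-1}) under the bigram model.

    The probability of the first word (sentence-start term) is omitted for
    brevity; unlisted pairs use the placeholder value `unk`.
    """
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        logp += math.log(BIGRAMS.get((prev, cur), unk))
    return logp

# "a train (기차) departing from Seoul (서울) at three o'clock (세시)":
# 0.2 * 0.5 * 1.0 * 0.5, times the placeholder for the unlisted pair (하는, 기차)
print(bigram_logprob(["서울", "에서", "세시", "출발", "하는", "기차"]))
```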
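A minimal sketch of the token passing algorithm from the Decoding slide: every state holds the best token reaching it, and tokens carry a log score and a state history. The state names, transition table, and emission function are placeholders for the search network and acoustic model, not the tutorial's actual system.

```python
import math
from collections import namedtuple

Token = namedtuple("Token", ["logp", "history"])

def token_passing(frames, states, start_states, trans, emit):
    """One-best Viterbi decoding by token passing.

    frames       : list of acoustic frames a_1..a_T
    states       : list of state ids in the search network
    start_states : dict state -> log prob of starting there
    trans        : dict (s, r) -> log P_{s,r} transition score
    emit         : function (r, frame) -> log P_r(frame) emission score
    """
    # Initialise every state with a null-history token
    # (states that cannot start a path get -inf).
    tokens = {s: Token(start_states.get(s, -math.inf), []) for s in states}
    for frame in frames:
        new_tokens = {s: Token(-math.inf, []) for s in states}
        for s, tok in tokens.items():
            if tok.logp == -math.inf:
                continue
            for (src, r), logp_sr in trans.items():
                if src != s:
                    continue
                # Pass a copy of the token along the arc s -> r.
                score = tok.logp + logp_sr + emit(r, frame)
                if score > new_tokens[r].logp:
                    new_tokens[r] = Token(score, tok.history + [s])
        tokens = new_tokens
    # The best surviving token carries the recognised state history.
    best_state = max(tokens, key=lambda s: tokens[s].logp)
    return tokens[best_state]
```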
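Beam and histogram pruning from the Pruning slide can be expressed as a filter over the live tokens after each frame; in the log domain the "fraction of the best score" becomes an additive offset. The threshold values below are arbitrary examples.

```python
def prune(tokens, beam=10.0, max_hyps=1000):
    """Keep tokens within `beam` (log domain) of the best score, then cap the count.

    tokens : dict state -> log score of the best token in that state
    """
    if not tokens:
        return tokens
    best = max(tokens.values())
    # Beam pruning: drop hypotheses whose score falls too far below the best one.
    survivors = {s: lp for s, lp in tokens.items() if lp >= best - beam}
    # Histogram pruning: keep at most max_hyps hypotheses overall.
    top = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)[:max_hyps]
    return dict(top)
```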
References (1/2)
• L. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition, ICASSP, pp. 49–52.
• C. Beaujard and M. Jardino, 1999. Language modeling based on automatic word concatenations, Proceedings of the 6th European Conference on Speech Communication and Technology, vol. 4, pp. 1563–1566.
• K. Demuynck, J. Duchateau, and D. Van Compernolle, 1997. A static lexicon network representation for cross-word context-dependent phones, Proceedings of the 5th European Conference on Speech Communication and Technology, pp. 143–146.
• T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation, Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp. 99–104.
• M. Mohri, F. Pereira, and M. Riley, 2002. Weighted finite-state transducers in speech recognition, Computer Speech and Language, vol. 16, no. 1, pp. 69–88.

References (2/2)
• B. H. Juang, S. E. Levinson, and M. M. Sondhi, 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory, vol. 32, no. 2, pp. 307–309.
• C. H. Lee, F. K. Soong, and K. K. Paliwal, 1996. Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer Academic Publishers.
• K. K. Paliwal, 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing, vol. 2, pp. 157–173.
• L. R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286.
• L. R. Rabiner and B. H. Juang, 1993. Fundamentals of Speech Recognition, Prentice-Hall.
• S. J. Young, N. H. Russell, and J. H. S. Thornton, 1989. Token passing: a simple conceptual model for connected speech recognition systems, Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
• S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, 1996. The HTK Book, Entropic Cambridge Research Lab., Cambridge, UK.