Hidden Markov Models for Automatic Speech Recognition
Dr. Mike Johnson
Marquette University, EECE Dept.

Overview
- Intro: The problem with sequential data
- Markov chains
- Hidden Markov Models
- Key HMM algorithms
  - Evaluation
  - Alignment
  - Training / parameter estimation
- Examples / applications

Big Picture View of Statistical Models
[Figure: diagram of statistical models, with labels "HMMs" and "Basic Gaussian"]

Nonstationary sequential data
[Figure: two waveform panels, "Original speed" and "Same word, different tempo"]

Historical Method: Dynamic Time Warping
- DTW is a dynamic path search of the input against a template.
- Can be solved using dynamic programming.
[Figure: warping grid, with Input on the horizontal axis and Template on the vertical axis, both running from 0 to 7]

Alternative: Sequential modeling
- Use a Markov chain (state machine).
[Figure labels: "State Machine" (S1, S2, S3), "State Distribution Models", "Data"]

Markov Chains (discrete-time & state)
- A Markov chain is a discrete-time, discrete-state Markov process.
- The likelihood of the current random variable moving to any new state is determined solely by the current state. This likelihood is called a transition probability: $a_{ij} = P(s_i \to s_j)$.
- Note: since transition probabilities are fixed, there is also a time-invariance assumption. (Also false of course, but useful.)

Graphical representation
[Figure: fully connected three-state chain S1, S2, S3 with transitions a11, a12, a13, a21, a22, a23, a31, a32, a33]
Markov chain parameters include:
- Transition probability values $a_{ij}$
- Initial state probabilities $\pi_1, \pi_2, \pi_3$

Example: Weather Patterns
Probability of Rain (R), Clouds (C), or Sunshine (S) modeled as a Markov chain (row = today's state, column = tomorrow's state):

  A =       R      C      S
       R   0.7    0.2    0.1
       C   0.4    0.4    0.2
       S   0.1    0.1    0.8

Note: A matrix of this form (square, row sum = 1) is called a stochastic matrix.

Two-step probabilities
- If it's raining today, what's the probability of it raining two days from now?
- Need two-step probabilities.
- Answer = 0.7*0.7 + 0.2*0.4 + 0.1*0.1 = 0.58
- Can also get these directly from $A^2$:

  A^2 =      R       C       S
        R   0.58    0.23    0.19
        C   0.46    0.26    0.28
        S   0.19    0.14    0.67

Steady-state
- The N-step probabilities can be obtained from $A^N$, so A is sufficient to determine the likelihoods of all possible sequences.
- What's the limiting case? Does it matter if it was raining 1000 days ago?

  A^1000 =    R      C      S
         R   0.4    0.2    0.4
         C   0.4    0.2    0.4
         S   0.4    0.2    0.4

Probability of state sequence
The probability of any state sequence is given by:
$$P(s_1 s_2 s_3 \ldots s_T) = P(s_1)\,P(s_2 \mid s_1)\,P(s_3 \mid s_2) \cdots P(s_T \mid s_{T-1}) = \pi_{s_1}\, a_{s_1 s_2}\, a_{s_2 s_3} \cdots a_{s_{T-1} s_T}$$
Training: Learn the transition probabilities by counting the state transitions observed in the training data.

Weather classification
Using a Markov chain for classification:
- Train one Markov chain model for each class.
  - Example: a weather transition matrix for each city: Milwaukee, Phoenix, and Miami.
- Given a sequence of state observations, identify the most likely city by choosing the model that gives the highest overall probability.

Hidden states & HMMs
- What if you can't directly observe states?
- But there are measures/observations that relate to the probability of different states.
- States hidden from view = Hidden Markov Model.

General Case HMM
- $s_i$: state i
- $a_{ij}$: $P(s_i \to s_j)$
- $o_t$: output at time t
- $b_j(o_t)$: $P(o_t \mid s_j)$
- Initial: $\pi_1, \pi_2, \pi_3$
[Figure: state diagram with an output distribution attached to each state, labeled b1(ot), b2(ot), b3(ot), b4(ot)]

Weather HMM
Extend the weather Markov chain to an HMM:
- Can't see if it's raining, cloudy, or sunny.
- But we can make some observations:
  - Humidity H
  - Temperature T
  - Pressure P
How do we calculate:
- Probability of an observation sequence under a model
How do we learn:
- State transition probabilities for unseen states
- Observation probabilities in each state

Observation models
How do we characterize these observations?
- Discrete/categorical observations: learn the probability mass function directly.
- Continuous observations: assume a parametric model.
- Our example: assume a Gaussian distribution.
  - Need to estimate the mean and variance of the humidity, temperature and pressure for each state (9 means and 9 variances for each city model).

HMM classification
Using an HMM for classification:
- Training: one HMM for each class
  - Transition matrix plus state means and variances (27 parameters) for each city
- Classification: given a sequence of observations:
  - Evaluate P(O | model) for each city (much harder to compute for an HMM than for a Markov chain)
  - Choose the model that gives the highest overall probability.

Using for Speech Recognition
- States represent beginning, middle, end of a phoneme
- Gaussian Mixture Model in each state
[Figure: five-state left-to-right HMM, S1 (start state) through S5 (end state), with transitions a12, a13, a22, a23, a24, a33, a34, a35, a44, a45 and output distributions b2(·), b3(·), b4(·)]

Fundamental HMM Computations
- Evaluation: Given a model $\lambda$ and an observation sequence $O = (o_1, o_2, \ldots, o_T)$, compute $P(O \mid \lambda)$.
- Alignment: Given $\lambda$ and $O$, compute the 'correct' state sequence $S = (s_1, s_2, \ldots, s_T)$, such as $S = \arg\max_S \{ P(S \mid O, \lambda) \}$.
- Training: Given a group of observation sequences, find an estimate of $\lambda$, such as $\lambda_{ML} = \arg\max_\lambda \{ P(O \mid \lambda) \}$.

Evaluation: Forward/Backward algorithm
- Define $\alpha_i(t) = P(o_1 o_2 \ldots o_t,\ s_t = i \mid \lambda)$
- Define $\beta_i(t) = P(o_{t+1} o_{t+2} \ldots o_T \mid s_t = i,\ \lambda)$
- Each of these can be implemented efficiently via dynamic programming recursions starting at t = 1 (for $\alpha$) and t = T (for $\beta$).
- By putting the forward & backward together, for any t:
$$\alpha_i(t)\,\beta_i(t) = P(o_1 \ldots o_t \ldots o_T,\ s_t = i \mid \lambda)$$
$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(t)\,\beta_i(t)$$

Forward Recursion
1. Initialization: $\alpha_i(1) = \pi_i\, b_i(o_1)$, $i \in \{1..N\}$
2. Recursion: $\alpha_j(t+1) = \left[ \sum_{i=1}^{N} \alpha_i(t)\, a_{ij} \right] b_j(o_{t+1})$, $j \in \{1..N\}$, $t \in \{1..T-1\}$
3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T)$

Backward recursion
1. Initialization: $\beta_i(T) = 1$, $i \in \{1..N\}$
2. Recursion: $\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)$, $i \in \{1..N\}$, $t \in \{(T-1)..1\}$
3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_i(1)$

Note: Computation improvement
- Direct computation: $P(O \mid \lambda)$ is the sum of the observation probabilities over all possible state sequences, of which there are $N^T$. Time complexity = $O(T N^T)$.
- F/B algorithm: for each state at each time step, do a summation over all state values from the previous time step. Time complexity = $O(T N^2)$.

From $\alpha_i(t)$ and $\beta_i(t)$:
One-state occupancy probability:
$$\gamma_t(i) = \frac{\alpha_i(t)\,\beta_i(t)}{P(O \mid \lambda)} = \frac{\alpha_i(t)\,\beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\,\beta_j(t)}$$
Two-state occupancy probability:
$$\xi_t(i,j) = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{P(O \mid \lambda)} = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{\sum_{k=1}^{N} \sum_{l=1}^{N} \alpha_k(t)\, a_{kl}\, b_l(o_{t+1})\, \beta_l(t+1)}$$

Alignment: Viterbi algorithm
To find the single most likely state sequence S, use the Viterbi dynamic programming algorithm:
1. Initialization: $\delta_i(1) = \pi_i\, b_i(o_1)$, $i \in \{1..N\}$
2. Recursion: $\delta_j(t) = \max_i \left[ \delta_i(t-1)\, a_{ij} \right] b_j(o_t)$
3. Termination: $P_{max}(S, O \mid \lambda) = \max_i \delta_i(T)$

Training
We need to learn the parameters of the model, given the training data. Possibilities include:
- Maximum a Posteriori (MAP): $\arg\max_\lambda P(\lambda \mid O)$
- Maximum Likelihood (ML): $\arg\max_\lambda P(O \mid \lambda)$
- Minimum Error Rate: $\arg\min_\lambda$ (error rate over training data)

Expectation Maximization
Expectation Maximization (EM) can be used for ML estimation of parameters in the presence of hidden variables.
Basic iterative process:
1. Compute the state sequence likelihoods given current parameters.
2. Estimate new parameter values given the state sequence likelihoods.

EM Training: Baum-Welch for Discrete Observations (e.g. VQ coded)
Basic idea: Using the current $\lambda$ and the F/B equations, compute state occupation probabilities.
Then, compute new values:
$$a_{ij}' = \frac{E\{\text{Number of transitions from } i \text{ to } j\}}{E\{\text{Number of transitions from } i\}} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
and, for each output symbol $v$:
$$b_i'(v) = \frac{E\{\text{Number of observations of } v \text{ in } i\}}{E\{\text{Number of times in } i\}} = \frac{\sum_{t=1,\ \text{s.t. } o_t = v}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$$

Update equations for Gaussian distributions:
$$\hat{\mu}_i = \frac{\sum_{t=1}^{T} P(s_i \mid o_t)\, o_t}{\sum_{t=1}^{T} P(s_i \mid o_t)}$$
$$\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} P(s_i \mid o_t)\,(o_t - \hat{\mu}_i)(o_t - \hat{\mu}_i)^{\top}}{\sum_{t=1}^{T} P(s_i \mid o_t)}$$
Here $P(s_i \mid o_t)$ is the state occupancy probability $\gamma_t(i)$.
GMMs are similar, but need to incorporate mixture likelihoods as well as state likelihoods.

Toy example: Genie and the urns
- There are N urns in a nearby room; each contains many balls of M different colors.
- A genie picks out a sequence of balls from the urns and shows you the result. Can you determine the sequence of urns they came from?
- Model as an HMM: N states, M outputs
  - The probabilities of picking from an urn are the state transitions.
  - The number of different colored balls in each urn makes up the probability mass function for each state.

Working out the Genie example
There are three baskets of colored balls:
- Basket one: 10 blue and 10 red
- Basket two: 15 green, 5 blue, and 5 red
- Basket three: 10 green and 10 red
The genie chooses from baskets at random:
- 25% chance each of picking from basket one or basket two
- 50% chance of picking from basket three

Genie Example Diagram
[Figure: diagram of the genie example]

Two Questions
Assume that the genie reports a sequence of two balls as {blue, red}. Answer two questions:
- What is the probability that a two-ball sequence will be {blue, red}?
- What is the most likely sequence of baskets to produce the sequence {blue, red}?
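For a sequence this short, both questions can be answered by brute force: enumerate every basket sequence and multiply out the pick and emission probabilities. The sketch below is an illustration only, not part of the original slides. The names `pick`, `emit`, `joint_probability`, and `enumerate_sequences` are hypothetical, and it assumes (as the basket description suggests) that the genie picks each basket independently of the previous one.

```python
from itertools import product

# Probability of the genie picking each basket. Assumption: picks are
# independent, so these act as both initial and transition probabilities.
pick = {1: 0.25, 2: 0.25, 3: 0.50}

# Emission probabilities derived from the ball counts in each basket.
emit = {
    1: {"blue": 10 / 20, "red": 10 / 20, "green": 0.0},
    2: {"blue": 5 / 25, "red": 5 / 25, "green": 15 / 25},
    3: {"blue": 0.0, "red": 10 / 20, "green": 10 / 20},
}


def joint_probability(baskets, balls):
    """P(balls, baskets): product of pick and emission probabilities."""
    p = 1.0
    for basket, ball in zip(baskets, balls):
        p *= pick[basket] * emit[basket][ball]
    return p


def enumerate_sequences(balls):
    """Map every possible basket sequence to its joint probability."""
    return {
        seq: joint_probability(seq, balls)
        for seq in product(pick, repeat=len(balls))
    }


table = enumerate_sequences(["blue", "red"])
total = sum(table.values())        # question 1: P({blue, red})
best = max(table, key=table.get)   # question 2: most likely basket sequence

print(f"P(blue, red) = {total:.6f}")
print(f"most likely baskets = {best}, P = {table[best]:.6f}")
```

Enumeration costs N^T terms, so it is only practical for toy cases like this one; the forward and Viterbi recursions give the same answers in O(T N^2).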
Probability of {blue, red} for Specific Basket Sequence
Because the genie picks each basket independently, $\pi_i = p_i$ and $a_{ij} = p_j$, where $p_i$ is the probability of picking basket i:
$$P(O, i, j \mid \lambda) = \pi_i\, b_i(\text{blue})\, a_{ij}\, b_j(\text{red}) = p_i\, b_i(\text{blue})\, p_j\, b_j(\text{red})$$

  First \ Second    Basket One    Basket Two    Basket Three
  Basket One         0.015625      0.00625       0.03125
  Basket Two         0.00625       0.00250       0.01250
  Basket Three       0.0           0.0           0.0

Probability of {blue, red}
- What is the total probability of {blue, red}? Sum(matrix values) = 0.074375
- What is the most likely sequence of baskets visited? Argmax(matrix values) = {Basket 1, Basket 3}, with corresponding max likelihood = 0.03125

Viterbi method
$$\delta_1(1) = \pi_1 b_1(o_1) = (.25)(.5) = 0.125$$
$$\delta_2(1) = \pi_2 b_2(o_1) = (.25)(.2) = 0.05$$
$$\delta_3(1) = \pi_3 b_3(o_1) = (.5)(0) = 0$$
$$\delta_1(2) = \max\{(.125)(.25)(.5),\ (.05)(.25)(.5),\ 0\} = 0.015625$$
$$\delta_2(2) = \max\{(.125)(.25)(.2),\ (.05)(.25)(.2),\ 0\} = 0.00625$$
$$\delta_3(2) = \max\{(.125)(.5)(.5),\ (.05)(.5)(.5),\ 0\} = 0.03125$$
Best path ends in state 3, coming previously from state 1.

Composite Models
- Training data is at sentence level, generally not annotated at sub-word (HMM model) level.
- Need to be able to form composite models from a sequence of word or phoneme labels.
[Figure: two copies of the five-state HMM (S1 start state through S5 end state), placed side by side to form a composite model]

Viterbi and Token Passing
[Figure: a recognition network over the labels a, b, c, d, ..., z, a word graph, and the resulting best sentence]

HMM Notation
Discrete HMM Case:

  $\lambda$   The set of all parameters for one HMM
  N           Number of states in a model
  Q           Set of possible states
  M           Number of output symbols
  V           Set of possible outputs
  T           Number of observations in the observation sequence
  O           Sequence of observations {o1 .. oT}
  B           Output matrix, NxM, with row i = output distribution for state i
  A           State transition matrix, NxN
  $\pi$       Initial probability vector, length N

Continuous HMM Case:

  $\lambda$        The set of all parameters for one HMM
  N                Number of states in a model
  Q                Set of possible states
  T                Number of observations in the observation sequence
  O                Sequence of observations {o1 .. oT}
  $b_i(\cdot)$     Output distribution for state i, = $N(\mu_i, \sigma_i)$ (diagonal covariance matrix)
  $\mu_i$          Vector of mean values for state i
  $\sigma_i$       Vector of standard deviation values for state i
  A                State transition matrix, NxN
  $\pi$            Initial probability vector, length N

Multi-mixture, multi-observation case:

  $\lambda$        The set of all parameters for one HMM
  N                Number of states in current model
  M                Number of mixtures in output distribution in current model
  R                Number of sentences in training set
  T                Number of observations in current sentence
  Q                Number of models in current (training) sentence label
  $N_q$            Number of states in model q
  $T_r$            Number of observations in sentence r
  O                The sequence of observations in current sentence
  $o_t$            The observation vector at time t
  S                The sequence of states
  $s_t$            The state at time t
  $s_t^{(q)}$      The state of model q at time t
  $a_{ij}$         The probability of a transition from state i to j in current model
  $\pi_i$          The initial probability of being in state i
  $b_j(o_t)$       The observation output probability in state j of current model
  $c_{jm}$         The mixture weight for mixture m in state j of current model
  $b_{jm}(o_t)$    The observation output probability for mixture m in state j of current model
  $\mu_{jm}$       The mean vector for mixture component m in state j
  $\Sigma_{jm}$    The covariance matrix for mixture component m in state j
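Using the discrete-case notation (initial vector π, transition matrix A, output matrix B with row i = output distribution for state i), the forward recursion and the Viterbi search can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `forward` and `viterbi` are hypothetical, states and symbols are 0-indexed, and no log-domain scaling is applied, so it is only suitable for short sequences. The parameters below are those of the genie example.

```python
def forward(pi, A, B, obs):
    """Forward recursion: returns P(O | model) in O(T * N^2) time."""
    n = len(pi)
    # Initialization: alpha_i(1) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Recursion: alpha_j(t+1) = [sum_i alpha_i(t) * a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    # Termination: sum over the final alphas
    return sum(alpha)


def viterbi(pi, A, B, obs):
    """Viterbi search: returns (best state sequence, its joint probability)."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            # Best predecessor of state j at this time step
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        backpointers.append(step)
        delta = new_delta
    # Trace back from the most likely final state
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


# Genie example: baskets one, two, three are states 0, 1, 2.
BLUE, RED, GREEN = 0, 1, 2
pi = [0.25, 0.25, 0.50]
A = [list(pi) for _ in range(3)]   # assumption: each pick is independent
B = [
    [0.5, 0.5, 0.0],               # basket one: 10 blue, 10 red
    [0.2, 0.2, 0.6],               # basket two: 5 blue, 5 red, 15 green
    [0.0, 0.5, 0.5],               # basket three: 10 red, 10 green
]
obs = [BLUE, RED]

p_obs = forward(pi, A, B, obs)
path, p_best = viterbi(pi, A, B, obs)
print(f"P(O | model) = {p_obs:.6f}")
print(f"best path = {[s + 1 for s in path]}, P = {p_best:.6f}")
```

The two functions differ only in replacing the sum over predecessor states with a max (plus backpointers), which is why both run in O(T N^2).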