Math 20 Project
Reed Harder

Hidden Markov Models

In class, we have worked with Markov chains: processes that move between states in a set probabilistically, where the probability of moving to a certain state depends only on the current state. The chains we have worked with have been observable. In other words, we can tell by watching the process which states are occurring at each instant in time. These chains have shown themselves to be immensely useful: we have employed them in modeling weather, genetic trait inheritance, certain games, and the wandering of a drunkard. But for many problems and applications that require modeling a process over time, observable Markov chains are too simple to be very helpful. For many of these problems, we can turn to hidden Markov models (HMMs), which add another layer of chance to give a more widely applicable description. This paper will develop hidden Markov models from the mathematical ideas about Markov chains we have learned in class and through example. It will discuss some basic questions we can ask with hidden Markov models and some ways we can go about answering these questions using probability theory and some helpful tools provided by MATLAB. Finally, it will look at some ways that we can apply these questions to speech recognition technology, along with some other applications.

[The active components of this paper are integrated into the passive components: examples, problems and programs are used and worked through to help develop the HMM. Some built-in MATLAB functions (of the hmm[xxx] type) are from the Statistics Toolbox.]

An Introductory Example

Consider a dysfunctional printer. Like many printers, this printer prints print jobs on a regular basis, but unlike many printers, the ink color of each successive print job seems to be unpredictable. Suppose that in a sequence of ten print jobs [T, the number of print jobs/outputs, is 10; the jobs are t1, t2, …, t10] this dysfunctional printer produced the following colors in this order:

[Red Red Red Blue Blue Red Green Green Red Red]     (1)

We could try to model the process that produced this sequence with an observable Markov chain. Thus, according to some initial probability distribution, the printer would start out in a certain state (in this case, a red ink state), and for each subsequent print job would jump from state to state with probabilities based on the current state. We could perhaps model it with the following transition probabilities:

Transition Matrix
        R       G       B
R       0.6000  0.1000  0.3000
G       0.3000  0.4000  0.3000
B       0.5000  0.5000  0.0000

[The MATLAB program crazyPrinter.m produces a sequence from an observable Markov chain with these probabilities under the output "state_sequence=(…)".]
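Since crazyPrinter.m itself is not reproduced in this paper, the following is only a minimal MATLAB sketch of how such an observable chain could be simulated, under the assumption that the printer starts in the red state; the variable names are illustrative and not taken from the program.

    A = [0.6 0.1 0.3;          % transition matrix, rows/columns ordered R, G, B
         0.3 0.4 0.3;
         0.5 0.5 0.0];
    colors = 'RGB';
    T = 10;                    % number of print jobs
    state = 1;                 % assume the printer starts in the red state
    state_sequence = repmat(' ', 1, T);
    for t = 1:T
        state_sequence(t) = colors(state);
        state = find(rand < cumsum(A(state, :)), 1);   % jump according to the current row of A
    end
    disp(state_sequence)       % a length-10 string of R/G/B; output varies from run to run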
So far, this is not unlike other Markov chain examples we have encountered. However, we could also assume that the output sequence we observe (i.e. the differently colored print jobs) does not directly tell us the underlying sequence of states, and that this sequence of states only gives rise to the observed outputs probabilistically. Thus, we need to specify the probability distribution for the initial state and the transition matrix, as in an observable Markov chain, as well as the probability of each observable output given each state. We might call the states red-ink preferring, green-ink preferring, and blue-ink preferring (rather than red, green and blue); each emits its preferred color with the highest probability. We can represent these emission probabilities with an emission matrix or an extended diagram:

Transition Matrix
        Rpref   Gpref   Bpref
Rpref   0.6000  0.1000  0.3000
Gpref   0.3000  0.4000  0.3000
Bpref   0.5000  0.5000  0.0000

Emission Matrix
        R       G       B       P
Rpref   0.8000  0.1500  0.0500  0.0000
Gpref   0.2000  0.6000  0.1000  0.1000
Bpref   0.2000  0.2500  0.5000  0.0500

[The rows of the emission matrix represent each state, red-preferring (Rpref), etc. The columns represent the possible observable outputs. Thus the red-preferring state produces a red print job with probability .8. Notice that there is a potential output not preferred by any state, P (for Purple). This output did not appear in the observed sequence (1), and given its emission probabilities this is not surprising, but under this model it may very well have turned up as an output.]

[Diagram of the model: the solid lines represent transition probabilities; the dotted lines represent emission probabilities.]

This is the essence of a hidden Markov model: the process moves between states in the same way as a visible Markov chain, but the visible outputs are probabilistic functions of the current state. The model is specified by a transition matrix, an emission matrix, and an initial state probability distribution, often notated µ = (A, B, ∏) respectively, along with the set S of N states and the set K of M possible emissions.

[The MATLAB program crazyPrinter.m produces a sequence of print colors according to this model, with the above transition and emission matrices, under the output "emission_sequence=(…)". Note that the output "state_sequence=(…)" gives the state sequence for the model, though it also represents the visible sequence were this a non-hidden Markov chain (see above). Thus, the observable Markov chain is a degenerate form of the HMM: if the emission probabilities in states Rpref, Gpref and Bpref for outputs R, G and B were all 1, this model would behave the same as a Markov chain with the given transition matrix. The run crazyPrinter(10,1) produced the initial example sequence (1). The state sequence produced was "rrgbrrrrrb." Compare this to the emission sequence (1) shown above, "RRRBBRGGRR": similar, but obviously not a perfect match. A green-ink-preferring state emits a red print job in the third print job (t3), for example.]

___________________________________

Are the frequencies of the various colors in the emission sequence close to what we would expect? Since the transition matrix is regular, we can compute the expected long-run proportion of red emissions among all emissions (assuming we start in the red-ink-preferring state) as T → ∞ from the chain's fixed vector, w ≈ (0.4945, 0.2747, 0.2308). Weighting each state's probability of emitting red by its long-run frequency gives (0.4945)(0.8) + (0.2747)(0.2) + (0.2308)(0.2) ≈ 0.4967. Thus, the expected proportion of red print jobs in the sequence is about .4967. Does the proportion of red print jobs in a sample approach this for large T? We can run the program crazyPrinter(10000000,1). This is a hefty computation, but the large input for T = num_outputs will hopefully give us a relatively precise result. Sure enough, the output "proportion_red=(…)" gives us the expected proportion of red print jobs: .4967.
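The following minimal MATLAB sketch (assuming the Statistics Toolbox is installed; variable names are illustrative, and it is not the author's crazyPrinter.m) generates emissions from this model with hmmgenerate and checks the long-run proportion of red print jobs against the fixed-vector calculation above. Note that hmmgenerate always begins from state 1 (here the red-ink-preferring state), so it handles the initial distribution differently from a general ∏ vector.

    A = [0.6 0.1 0.3;                % transition matrix: Rpref, Gpref, Bpref
         0.3 0.4 0.3;
         0.5 0.5 0.0];
    B = [0.8 0.15 0.05 0.00;         % emission matrix: columns R, G, B, P
         0.2 0.60 0.10 0.10;
         0.2 0.25 0.50 0.05];

    [emissions, states] = hmmgenerate(100000, A, B);   % emissions are indices 1..4 (R..P)

    w = null(A' - eye(3));           % fixed (stationary) vector of A, up to scaling
    w = (w / sum(w))';               % normalize so the entries sum to 1
    expected_red = w * B(:, 1)       % ~0.4967
    observed_red = mean(emissions == 1)                % should be close for large T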
Three Fundamental Hidden Markov Model Problems

The hidden Markov model's new layer of emission probabilities and observed emissions has some interesting implications, allowing us to ask some useful questions that are helpful in modeling and solving a variety of real-world problems:

1. Given a hidden Markov model µ = (A, B, ∏), what is the probability of observing a certain emission sequence O = O1, O2, …, OT, i.e. what is P(O|µ)? How can we compute this efficiently?

2. Given a model µ and an emission sequence O, how can we choose a state sequence X = X1, X2, …, XT that best explains O?

3. Given an emission sequence O, how can we adjust the parameters of model µ = (A, B, ∏) to best explain O, i.e. how do we maximize P(O|µ) from Problem 1?

So, in reverse order (as is often followed in applications), we are estimating and refining hidden Markov model parameters to fit observed data, choosing a sequence of states based on our observed data and chosen model, and computing the probability of observed data given our chosen model (this last process is known as decoding, and its result is called the forward probability).

*Answering Problem 1: To calculate the probability of a sequence O of length T given model µ, we start by considering a single fixed state sequence X = X1, X2, …, XT. The probability of emission sequence O given this state sequence is:

P(O|X,µ) = ∏_{t=1}^{T} P(Ot|Xt,µ) = b_X1(O1) · b_X2(O2) · … · b_XT(OT)

[b_Xt(Ot) is the emission probability, from emission matrix B, of emitting output Ot given state Xt.]

The probability of the given state sequence X itself is:

P(X|µ) = π_X1 · a_X1,X2 · a_X2,X3 · … · a_X(T−1),XT

[π_X1 is the entry of the initial probability vector ∏ for state X1; the a's that follow are the transition probabilities, from transition matrix A, between each pair of successive states in the given sequence.]

The probability of both the given O and the given X occurring together is the product of the previous two probabilities: P(O,X|µ) = P(O|X,µ) P(X|µ). Then the probability of sequence O given the model µ is the sum of these products over all possible state sequences X:

P(O|µ) = Σ_{all X} P(O|X,µ) P(X|µ) = Σ_{all X} π_X1 b_X1(O1) · a_X1,X2 b_X2(O2) · … · a_X(T−1),XT b_XT(OT)

This method can be used to calculate small sequence probabilities. For example, we can compute the probability of seeing the T = 2 sequence [Red, Purple] given the dysfunctional printer model above, taking the initial state probability vector over [Rpref, Gpref, Bpref] to be [.6 .4 0] (so the printer never starts in the blue-ink-preferring state) [this problem is inspired by Exercise 9.2 in Chapter 9 of Manning and Schütze's Foundations of Statistical Natural Language Processing, modified to fit the crazy printer HMM]:

Possible state sequence    π_X1    b_X1(O1)    a_X1,X2    b_X2(O2)
Rpref → Rpref              .6      .8          .6         0
Rpref → Gpref              .6      .8          .1         .1
Rpref → Bpref              .6      .8          .3         .05
Gpref → Rpref              .4      .2          .3         0
Gpref → Gpref              .4      .2          .4         .1
Gpref → Bpref              .4      .2          .3         .05
Bpref → Rpref              0       (the rest of the row is irrelevant: its product is 0)
Bpref → Gpref              0       (irrelevant)
Bpref → Bpref              0       (irrelevant)

P(O|µ) = Σ_{all X} π_X1 b_X1(O1) · a_X1,X2 b_X2(O2) = the sum over rows of the product of each row's entries
       = 0 + (.6)(.8)(.1)(.1) + (.6)(.8)(.3)(.05) + 0 + (.4)(.2)(.4)(.1) + (.4)(.2)(.3)(.05) + 0 + 0 + 0 = 0.0164

This method is quite useful for evaluating small sequences and small numbers of states, but most real-world applications require far longer sequences and far more states, and the number of calculations necessary to perform this computation quickly piles up: (2T − 1)·N^T + N^T − 1 calculations are required (there are N^T possible state sequences, with 2T − 1 multiplications per state sequence, and N^T − 1 additions to sum over all state sequences). Thus, with 3 states and a sequence length of 100, we would need around 10^50 calculations.
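As a check on this arithmetic, here is a minimal MATLAB sketch of the brute-force sum over all N^T = 9 state paths for the T = 2 example above (variable names are illustrative); it should print 0.0164.

    A   = [0.6 0.1 0.3;  0.3 0.4 0.3;  0.5 0.5 0.0];    % transition matrix: Rpref, Gpref, Bpref
    B   = [0.8 0.15 0.05 0.00;
           0.2 0.60 0.10 0.10;
           0.2 0.25 0.50 0.05];                          % emission matrix: columns R, G, B, P
    pi0 = [0.6 0.4 0.0];                                 % initial state distribution
    O   = [1 4];                                         % observed sequence: Red = 1, Purple = 4

    p = 0;
    for x1 = 1:3                                         % enumerate every possible state path
        for x2 = 1:3
            p = p + pi0(x1) * B(x1, O(1)) * A(x1, x2) * B(x2, O(2));
        end
    end
    p                                                    % displays 0.0164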
Luckily, it turns out there is a more efficient procedure, called the forward-backward procedure, that we can use to solve this problem: in general, it works by inductively solving for conditional probabilities of partial emission sequences. There is a MATLAB command in the Statistics Toolbox that can help us with this procedure, hmmdecode(O, A, B). This function, given an emission sequence, a transition matrix and an emission matrix, can provide us with the posterior probabilities (crazyPrinter.m output PSTATES), an N × T matrix of the probabilities of being in each possible state at each point in the emission sequence. It can also provide us with the logarithm of the probability of the given sequence (crazyPrinter.m output logpseq). We need only raise e to the power of logpseq to get an answer to Problem 1. Let's say we run crazyPrinter(10,1) and find the probability of the sequence that is emitted; this would be a bit unpleasant to do by hand. One run returned the emission sequence "BBGRRBRBBB," which has a probability of e^logpseq = e^(−14.7521) = 3.9 × 10^(−7). The probability of this emission sequence is quite small, but as we shall see when we discuss applications in speech recognition, being able to compare the probabilities of different sequences occurring, no matter their absolute size, is quite useful.

*Answering Problem 2: Answering Problem 2 is a little less straightforward than Problem 1. Problem 1 has an exact answer, but in Problem 2 there is no strictly "correct" state sequence that gives rise to emission sequence O. We need to optimize a state sequence, but there are different possible criteria for this optimization. For example, we could find the sequence made up of the states that are individually most likely for each observed output in the sequence. For this we could look to the matrix PSTATES found with the function hmmdecode, picking out the most likely of the N states for each Ot. This would maximize the expected number of states guessed correctly. However, this method might cause problems with certain transition matrices, like our crazy printer matrix: one of our transition probabilities is 0, so a sequence chosen from the most probable individual states might contain an adjacent pair of states (in our case, Bpref followed by Bpref) that is simply impossible. This suggests that we should look to optimization methods that take a broader look at possible sequences. Indeed, the most common criterion used to optimize state sequences is finding the single best state sequence by maximizing P(X|O,µ). Most often, this is done by means of an algorithm called the Viterbi algorithm. Again, there is a MATLAB function that can help us here: hmmviterbi(O, A, B), given an emission sequence O, a transition matrix A and an emission matrix B, finds the most probable single state path by the above method. Let's run crazyPrinter(10,1) again, and take a look at the emission sequence, the state sequence and the "optimal" state sequence chosen by the Viterbi algorithm. Note here that we are not trying to model some natural phenomenon: since our emission sequence was produced from a constructed HMM, there is a "true" underlying state sequence. The emission sequence was "RRBRBRGRGR," the underlying state sequence was "rrbrrrrgrr" and the best single state path was calculated to be "rrbrbrrrrr." Only two states wrong: not bad.
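The following is a minimal MATLAB sketch of these two steps, assuming the Statistics Toolbox; here hmmgenerate merely stands in for crazyPrinter(10,1) (it always begins from state 1, the red-ink-preferring state), and the variable names are illustrative.

    A = [0.6 0.1 0.3;  0.3 0.4 0.3;  0.5 0.5 0.0];
    B = [0.8 0.15 0.05 0.00;  0.2 0.60 0.10 0.10;  0.2 0.25 0.50 0.05];

    [emissions, true_states] = hmmgenerate(10, A, B);   % stand-in for crazyPrinter(10,1)

    [PSTATES, logpseq] = hmmdecode(emissions, A, B);    % N x T posterior state probabilities
    sequence_probability = exp(logpseq)                 % P(O|mu): the answer to Problem 1

    viterbi_states = hmmviterbi(emissions, A, B);       % most probable single state path
    num_wrong = sum(viterbi_states ~= true_states)      % how many states the path gets "wrong"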
*Answering Problem 3: Problem 3, the question of how to adjust the parameters (A, B, ∏) to maximize the probability of an emission sequence, is the most difficult of the three problems to answer. There is no known analytic method for choosing parameters to maximize P(O|µ). While we cannot necessarily find the best model, we can find a local maximum using an iterative method known as the Baum-Welch method (an expectation-maximization method). In general, the method works as follows: using a (perhaps randomly) chosen model, we work out the probability of a given sequence of emissions (this is Problem 1). Looking at this calculation, we determine which state transitions and emissions were probably used the most, and increase their probabilities in a revised model, which increases P(O|µ). This generally continues until the revised model is no longer improving significantly. This process is called training the model, and the emission sequences used are called training sequences.

Again, MATLAB is here to help us. The function hmmtrain(O, estimated A, estimated B) takes a sequence and guesses of the transition and emission matrices, and returns improved estimates of these parameters using the Baum-Welch method. Let's look at crazyPrinter(500,1) and create "estimated" transition and emission matrices. Just from eyeballing the frequencies of the different colors in the emission sequence, suppose we made the following rather rough guesses:

Transition matrix guess =
    0.8000  0.1000  0.1000
    0.1000  0.8000  0.1000
    0.1000  0.8000  0.1000

Emission matrix guess =
    0.7000  0.1000  0.1000  0.1000
    0.1000  0.7000  0.1000  0.1000
    0.1000  0.1000  0.7000  0.1000

How does the Baum-Welch method adjust these? ESTTR is the adjusted transition matrix; ESTEMIT is the adjusted emission matrix. For reference, the actual parameters are displayed as well.

ESTTR =
    0.8958  0.1042  0.0000
    0.1116  0.5571  0.3313
    0.0519  0.9481  0.0000

ESTEMIT =
    0.6806  0.0961  0.0298  0.1934
    0.4179  0.4572  0.0000  0.1249
    0.2134  0.1492  0.2003  0.4371

Transition Matrix (actual)
        Rpref   Gpref   Bpref
Rpref   0.6000  0.1000  0.3000
Gpref   0.3000  0.4000  0.3000
Bpref   0.5000  0.5000  0.0000

Emission Matrix (actual)
        R       G       B       P
Rpref   0.8000  0.1500  0.0500  0.0000
Gpref   0.2000  0.6000  0.1000  0.1000
Bpref   0.2000  0.2500  0.5000  0.0500

In general, it seems the estimates improved some of our guesses and made a few worse. This may speak partially to the bias inherent in "guessing" parameters when they are already known. It is also important to remember that with this algorithm it is easy to get "stuck" at a local maximum. It may also be that 500 time slots is simply not enough data, with this model, for the method to maximize P(O|µ) effectively.
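Here is a minimal MATLAB sketch of this training step, assuming the Statistics Toolbox; hmmgenerate again stands in for crazyPrinter(500,1), the guess matrices are the rough guesses given above, and the variable names are illustrative.

    A = [0.6 0.1 0.3;  0.3 0.4 0.3;  0.5 0.5 0.0];
    B = [0.8 0.15 0.05 0.00;  0.2 0.60 0.10 0.10;  0.2 0.25 0.50 0.05];
    emissions = hmmgenerate(500, A, B);                         % training sequence (stand-in for crazyPrinter(500,1))

    TRGUESS   = [0.8 0.1 0.1;  0.1 0.8 0.1;  0.1 0.8 0.1];      % rough transition matrix guess
    EMITGUESS = [0.7 0.1 0.1 0.1;  0.1 0.7 0.1 0.1;  0.1 0.1 0.7 0.1];   % rough emission matrix guess

    [ESTTR, ESTEMIT] = hmmtrain(emissions, TRGUESS, EMITGUESS)  % Baum-Welch re-estimates of A and B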
HMM Variants

Many applications and problems that use a hidden Markov model require more nuanced variants of the standard form. The left-right or feed-forward model differs in its transition matrix: it is not ergodic (not every state can be reached from every state in a finite number of steps). Its name comes from the fact that the states are ordered, and the model can only transition from a state to a state with the same or a higher number in that ordering. Thus the transition matrix follows a form like the following, where each * is a positive probability:

    *  *  0
    0  *  *
    0  0  *

This is useful for modeling some types of signals that evolve over time. Another variant of the standard HMM is the use of null or epsilon transitions. These designate certain transitions, regardless of their probability, as not producing an emission, even though the destination state can produce a full spectrum of emissions when entered by other transitions.

Some Applications

Hidden Markov models are perhaps best known for their applications in speech recognition technology. How does this work at a basic level, and how do the fundamental questions regarding HMMs come into play? For simplicity's sake, consider a speech recognizer designed to identify a single isolated word at a time. First, we need to build models for each of the words we want in our machine's vocabulary (say we want W words). To do this, we choose a word and look at speech signals of that word from different speakers.

[Figure: an example speech signal for a single word, from http://www.isip.piconepress.com.]

This would give us our observed emission sequence: each observation would be some measurement of the signal. We might take 40 observations over the course of a word. Again for simplicity's sake, assume that each possible emission is a small range of heights measured on the y-axis (or some similar parameter), and that we have a total of M unique possible emissions. Say we have K recorded speech signals for a given word, and thus K emission sequences. These are used as training sequences, in conjunction with the methods used in Problem 3, to estimate and refine transition and emission parameters µ = (A, B, ∏) for each word. The solution to Problem 2, using the Viterbi algorithm to link a sequence of states to the training sequences, allows for further insight into the physical meaning of these states and thus their connection to spectral properties of the speech signal (i.e. different possible emissions). This allows for further refinement of the model: the number of states and the number of discrete emissions can be modified to improve P(X|O,µ). When the models for each word are optimized and well studied, they are ready to analyze "emission sequences" from unknown words. Ultimate recognition of the unknown word is thanks to the methods of Problem 1: given an emission sequence and the various models µ for different words, we can calculate P(O|µ) for each word model µ. The µ with the highest P(O|µ) is selected as the recognized word.

Another interesting application of HMMs is in determining the routes of certain robots. HMMs play a key role in allowing a robot to figure out where it is on a map that it has stored, and thus to navigate its surroundings. These robots are equipped with range finders, which tell the robot how far it is from certain surfaces. From this observed sequence of measurements ("O") and a model µ of the environment (a map of its surroundings), the robot infers a hidden sequence of states: its changing location within its environment. This is Problem 2: essentially, the robot is trying to maximize P(X|O,µ).

Sources

Manning, C. and Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
Rabiner, L. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE, February 1989.
MATLAB (MathWorks) help files.
http://www.youtube.com/watch?v=S_Lm8aN-la0&feature=related