Hidden Markov Models

Math 20 Project
Reed Harder
In class, we have worked with Markov chains, processes that move between states in a set
probabilistically: the probability of moving to a certain state depends only on the current state. The chains
we have worked with have been observable. In other words, we can tell by watching the process which
states are occurring at each instant in time. These chains have shown themselves to be immensely useful:
we have employed them in modeling weather, genetic trait inheritance, certain games, and the wandering
of a drunkard. But for many problems and applications that require modeling a process over time,
observable Markov chains are too simple to be very helpful. For many of these problems, we can turn to
hidden Markov models (HMMs), which add another layer of chance, for a more applicable description.
This paper will develop hidden Markov models from the mathematical ideas about Markov chains we have
learned in class and through example. It will discuss some basic questions we can ask with hidden Markov
models and some ways we can go about answering these questions using probability theory and some
helpful tools provided by MATLAB. Finally, it will look at some ways that we can apply these questions to
speech recognition technology, and some other applications. 1
An Introductory Example
Consider a dysfunctional printer. Like many printers, this printer produces print jobs on a regular basis, but
unlike most printers, the ink color of each successive print job seems to be unpredictable. Suppose that in
a sequence of ten print jobs [T, the number of print jobs/outputs = 10; t1,t2,…,t10] this dysfunctional printer
produced the following colors in this order:
[Red Red Red Blue Blue Red Green Green Red Red] (1)
We could try to model the process that produced this sequence with an observable Markov chain. Thus,
according to some initial probability distribution, the printer would start out in a certain state (in this case,
a red ink state), and for each subsequent print job would jump from state to state with probabilities based
on the current state. We could perhaps model it with the following transition probabilities:
Transition Matrix
        R        G        B
R     0.6000   0.1000   0.3000
G     0.3000   0.4000   0.3000
B     0.5000   0.5000   0.0000

[The MATLAB program crazyPrinter.m produces a sequence from an observable
Markov chain with these probabilities under the output “state_sequence=(…)”]
1. The active components of this paper are integrated into the passive components: examples, problems, and
programs are used and worked through to help develop the HMM. Some built-in MATLAB functions (of the hmm[xxx]
type) are from the Statistics Toolbox.
This is not unlike other Markov chain examples we have encountered. However, we could also assume
that the output sequence we can observe (i.e. different colored print jobs) does not directly tell us the
underlying sequence of states, and that this sequence of states only gives rise to the observed outputs
probabilistically. Thus, we need to specify both the probability distribution for the initial state and the
transition matrix as in an observable Markov chain, as well as the probability of each observable output
given each state. We might call each state red-ink preferring, green-ink preferring, and blue-ink preferring
(rather than red, green and blue), which emit their preferred color with the highest probability, and we
can represent these emission probabilities with an emission matrix or an extended diagram:
Transition Matrix
        Rpref    Gpref    Bpref
Rpref   0.6000   0.1000   0.3000
Gpref   0.3000   0.4000   0.3000
Bpref   0.5000   0.5000   0.0000

Emission Matrix
        R        G        B        P
Rpref   0.8000   0.1500   0.0500   0.0000
Gpref   0.2000   0.6000   0.1000   0.1000
Bpref   0.2000   0.2500   0.5000   0.0500

[The rows of the emission matrix represent each state, red-preferring (Rpref) etc.
The columns represent the possible observable outputs. Thus the red-preferring
state produces a red print job with probability .8. Notice that there is a
potential output not preferred by a state, P (for Purple). This output did not
appear in the observed sequence (1), and given its emission probabilities, this
is not surprising, but with this model, it may very well have turned up as an
output]
[Figure: state diagram of the crazy printer HMM. The solid lines in the diagram represent
transition probabilities; the dotted lines represent emission probabilities.]
This is the essence of a hidden Markov model: the process moves between states in the same way as a
visible Markov chain, but the visible outputs are probabilistic functions of the given state. It is specified by
a transition matrix, emission matrix, and initial state probability distribution, often notated µ = (A, B, ∏)
respectively, as well as the set S of N states and the set K of M emissions. [The MATLAB program
crazyPrinter.m produces a sequence of print colors according to this model, with
the above transition and emission matrices, under the output
“emission_sequence=(…).” Note that the output “state_sequence=(…)” gives the
state sequence for the model, though it also represents the visible sequence
were this a non-hidden Markov model (see above). Thus, the observable Markov
chain is a degenerate form of the HMM: if the emission probabilities in states
Rpref, Gpref and Bpref for outputs R, G and B were all 1, this model would
behave the same as a Markov chain with the given transition matrix. The call
crazyPrinter(10,1) produced the initial example sequence (1). The state sequence
produced was “rrgbrrrrrb.” Compare this to the emission sequence (1) shown above:
“RRRBBRGGRR.” Similar, but obviously not a perfect match: a red ink-preferring
state emits a green print job at the seventh print job (t7), for example]
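To make the model concrete, here is a minimal MATLAB sketch of the kind of setup crazyPrinter.m
might use (the actual program is not reproduced in this paper, so the variable names and the call
to the Statistics Toolbox function hmmgenerate are assumptions for illustration). States are
numbered 1 = Rpref, 2 = Gpref, 3 = Bpref, and emissions 1 = R, 2 = G, 3 = B, 4 = P:

    T = 10;                         % number of print jobs to simulate
    A = [0.6 0.1 0.3;               % transition matrix (rows/cols: Rpref, Gpref, Bpref)
         0.3 0.4 0.3;
         0.5 0.5 0.0];
    B = [0.80 0.15 0.05 0.00;       % emission matrix (cols: R, G, B, P)
         0.20 0.60 0.10 0.10;
         0.20 0.25 0.50 0.05];
    [emission_sequence, state_sequence] = hmmgenerate(T, A, B);
    % Note: the toolbox hmm functions assume the chain is in state 1 just
    % before the first step, so this sketch does not reproduce an arbitrary
    % initial distribution (the vector Pi of the model described above).

Later sketches in this paper reuse the A, B, emission_sequence and state_sequence variables defined here.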
___________________________________
Are the frequencies of the various colors in the emission sequence close to what we would expect? For
example, since the transition matrix is regular, to compute the expected proportion of red emissions to
other emissions (say, starting in the red ink-preferring state) as T → ∞, we can compute its fixed
vector w, which satisfies wA = w. Solving gives w ≈ (0.4945, 0.2747, 0.2308) for (Rpref, Gpref, Bpref);
weighting each state’s probability of emitting red by its long-run proportion gives
(0.4945)(0.8) + (0.2747)(0.2) + (0.2308)(0.2) ≈ 0.4967.
Thus, the expected proportion of red print jobs in the sequence is ~.4967
Does the proportion of red print jobs in the sample approach this with large T?
We can run the program crazyPrinter(10000000,1). This is a hefty
computation, but the large input for T=num_outputs will hopefully
give us a relatively precise result. Sure enough, the output
“proportion_red=(…)” gives us the expected proportion of red print
jobs: .4967
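As a quick check of that figure, here is a small MATLAB sketch (an illustration written for this
paper, not part of crazyPrinter.m) that approximates the fixed vector by raising the regular
transition matrix to a high power and then weights each state’s probability of emitting red:

    A = [0.6 0.1 0.3;              % transition matrix (Rpref, Gpref, Bpref)
         0.3 0.4 0.3;
         0.5 0.5 0.0];
    B = [0.80 0.15 0.05 0.00;      % emission matrix (R, G, B, P)
         0.20 0.60 0.10 0.10;
         0.20 0.25 0.50 0.05];
    w = [1 0 0] * A^100;           % rows of A^n approach the fixed vector w
    expected_red = w * B(:, 1)     % w(1)*.8 + w(2)*.2 + w(3)*.2, about 0.4967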
Three Fundamental Hidden Markov Model Problems
The hidden Markov model’s new layer of emission probabilities and observed emissions has some
interesting implications, allowing us to ask some useful questions that are helpful in modeling
and solving a variety of real-world problems:
1. Given a hidden Markov model µ = (A, B, ∏), what is the probability of observing a certain emission
sequence O=O1,O2,…,OT, i.e. what is P(O|µ)? How can we compute this efficiently?
2. Given a model µ and an emission sequence O, how can we choose a state sequence X=X1,X2,…,XT
that best explains O?
3. Given emission sequence O, how can we adjust the parameters of model µ = (A, B, ∏) to best
explain O, i.e. how do we maximize P(O|µ) from problem 1?
So, in reverse order (as is often followed in applications), we are estimating and refining hidden Markov
model parameters to fit observed data, choosing a sequence of states based on our observed data and
chosen model, and computing the probability of observed data given our chosen model (this last
computation is often called the evaluation problem, and its result is called the forward probability).
*Answering Problem 1: To calculate the probability of sequence O of length T given model µ, we start by
considering a single fixed state sequence X=X1,X2,…,XT. The probability of an emission sequence O given
this state sequence is:
P(O|X,µ) = ∏ from t=1 to T of P(Ot|Xt,µ) = bX1(O1) · bX2(O2) · … · bXT(OT)   [bXt(Ot) is the emission
probability, taken from emission matrix B, of emitting output Ot given state Xt]
The probability of each given state sequence X is:
P(X|µ) = πX1 · aX1,X2 · aX2,X3 · … · aX(T-1),XT   [πX1 is the entry of the initial probability vector ∏ for
state X1; the a’s that follow are the transition probabilities from transition matrix A between successive
states in the given sequence]
The probability of both the given O and the given X occurring together is the product of the previous two
probabilities: P(O,X|µ) = P(O|X,µ) P(X|µ). Then the probability of sequence O given the model µ is the sum
of these products over all possible state sequences X:
P(O|µ) = ∑ over all X of P(O|X,µ) P(X|µ) = ∑ over all X of πX1 bX1(O1) · aX1,X2 bX2(O2) · … · aX(T-1),XT bXT(OT)
This method can be used to calculate small sequence probabilities. For example, we can compute the
probability of seeing the T=2 sequence [Red, Purple] given the dysfunctional printer model above, and
given the initial state probability vector as [Rpref, Gpref, Bpref] = [.6 .4 0] (so, it never starts in the blue
ink-preferring state)2:
Possible state sequence    πX1    bX1(O1)    aX1,X2    bX2(O2)
Rpref→Rpref                .6     .8         .6        0
Rpref→Gpref                .6     .8         .1        .1
Rpref→Bpref                .6     .8         .3        .05
Gpref→Rpref                .4     .2         .3        0
Gpref→Gpref                .4     .2         .4        .1
Gpref→Bpref                .4     .2         .3        .05
Bpref→Rpref                0      (irrelevant: the product of the row is 0)
Bpref→Gpref                0      (irrelevant: the product of the row is 0)
Bpref→Bpref                0      (irrelevant: the product of the row is 0)

P(O|µ) = ∑ over all X of πX1 bX1(O1) · aX1,X2 bX2(O2) = sum(product of all entries across each row) =
0 + .6 * .8 * .1 * .1 + .6 * .8 * .3 * .05 + 0 + .4 * .2 * .4 * .1 + .4 * .2 * .3 * .05 = 0.0164

2. This problem is inspired by Exercise 9.2 in Chapter 9 of Foundations of Statistical Natural Language
Processing by Manning and Schütze, modified to fit the crazy printer HMM.
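As a check on the arithmetic, here is a small MATLAB sketch (an illustration written for this paper,
not part of crazyPrinter.m) that brute-forces the same sum over all nine state sequences, with states
numbered 1 = Rpref, 2 = Gpref, 3 = Bpref and emissions 1 = R, 2 = G, 3 = B, 4 = P:

    A   = [0.6 0.1 0.3; 0.3 0.4 0.3; 0.5 0.5 0.0];
    B   = [0.80 0.15 0.05 0.00; 0.20 0.60 0.10 0.10; 0.20 0.25 0.50 0.05];
    ini = [0.6 0.4 0.0];           % initial state distribution
    O   = [1 4];                   % the observed sequence [Red, Purple]
    p   = 0;
    for x1 = 1:3                   % sum over every possible state sequence
        for x2 = 1:3
            p = p + ini(x1) * B(x1, O(1)) * A(x1, x2) * B(x2, O(2));
        end
    end
    p                              % prints 0.0164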
This method is quite useful for evaluating small sequences and small numbers of states, but most real-world
applications require far longer sequences and far more states, and the number of calculations necessary to
perform this computation quickly piles up: roughly (2T−1)·N^T + (N^T − 1) calculations are required (there
are N^T possible state sequences, with 2T−1 multiplications per state sequence and N^T − 1 additions in
summing over all state sequences). Thus, with 3 states and a sequence length of 100, we would need around
10^50 calculations.
Luckily, it turns out there is a more efficient procedure called the forward-backward procedure that we
can use to solve this: in general this works by solving for conditional probabilities of partial emission
sequences inductively.
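To make the induction concrete, here is a minimal sketch of the forward half of that procedure (written
for this paper as an illustration, not the toolbox implementation; forwardProb is a hypothetical name,
and states and emissions are numeric indices as in the earlier sketches). It builds
alpha(i,t) = P(O1…Ot and Xt = i | mu) one column at a time, so the work grows roughly like N^2 * T
instead of N^T:

    function p = forwardProb(O, A, B, initial)
        % O: emission indices, A: N x N transition matrix,
        % B: N x M emission matrix, initial: 1 x N initial distribution.
        N = size(A, 1);
        T = length(O);
        alpha = zeros(N, T);
        alpha(:, 1) = initial(:) .* B(:, O(1));                % initialization
        for t = 2:T
            alpha(:, t) = (A' * alpha(:, t-1)) .* B(:, O(t));  % induction step
        end
        p = sum(alpha(:, T));                                  % P(O | mu)
    end

Saved as forwardProb.m, the call forwardProb([1 4], A, B, [0.6 0.4 0]) reproduces the 0.0164 computed
by hand above, with far less work for long sequences.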
There is a MATLAB command in the statistical toolbox that can help us
with this procedure, hmmdecode(O, A, B). This function, given an
emission sequence, a transition matrix and an emission matrix, can
provide us with the posterior probabilities (crazyPrinter.m output
PSTATES), an N x T matrix of the probabilities of being at each possible
state at each point in the emission sequence. It also can provide us
with the logarithm of the probability of the given sequence
(crazyPrinter.m output logpseq). We need only raise e to the power of
logpseq to get an answer to problem 1.
Let’s say we run crazyPrinter(10,1) and find the probability of the
sequence that is emitted occurring. This would be a bit unpleasant to do
by hand. This returned us the emission sequence “BBGRRBRBBB,” which has
a probability of e^logpseq = e^(−14.7521) ≈ 3.9 × 10^(−7). The probability of this
emission sequence is quite small, but as we shall see when we discuss
applications in speech recognition, being able to compare the
probabilities of different sequences occurring, no matter their absolute
size, is quite useful.
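For reference, a minimal usage sketch of the call described above (reusing A, B and the numeric
emission_sequence from the earlier hmmgenerate sketch; the output names are just placeholders):

    [PSTATES, logpseq] = hmmdecode(emission_sequence, A, B);
    p_sequence = exp(logpseq)      % P(O|mu): the answer to Problem 1 for this sequence
    % PSTATES(i,t) is the probability of being in state i at step t,
    % given the entire observed emission sequence.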
*Answering Problem 2: Answering Problem 2 is a little less straightforward than Problem 1. Problem 1 has
an exact answer, but in Problem 2, there is no strictly “correct” state sequence that gives rise to emission
sequence O. We need to optimize a state sequence, but there are different possible criteria for this
optimization. For example, we could find the sequence made up of states that are individually most likely
for each observed output in the sequence. For this we could look to the matrix PSTATES found with the
function hmmdecode, picking out the most likely of the N states for each Ot. This would maximize the
expected number of states guessed correctly. However, this method might cause problems with certain
transition matrices, like our crazy printer matrix: one of our transition probabilities is 0, so a sequence
chosen from most probable individual states might contain an adjacent pair of states (in our case, Bpref
followed by Bpref) that is simply impossible.
This suggests that we should look to optimization methods that take a broader look at possible sequences.
Indeed, the most common criterion used to optimize state sequences is finding the single best state
sequence by maximizing P(X|O,µ). Most often, this is done by means of an algorithm called the Viterbi
Algorithm.
Again, there is a MATLAB function that can help us here. hmmviterbi(O,
A, B) given an emission sequence O, a transition matrix A and an
emission matrix B, finds the most probable single state path by the
above method.
Let’s run crazyPrinter(10,1) again, and take a look at the emission
sequence, the state sequence and the “optimal” state sequence chosen by
the Viterbi algorithm. Note here that we are not trying to model some
natural phenomenon: since our emission sequence was produced from a
constructed HMM, there is a “true” underlying state sequence. The
emission sequence was “RRBRBRGRGR,” the underlying state sequence was
“rrbrrrrgrr” and the best single state path was calculated to be
“rrbrbrrrrr.” Only two states wrong: not bad.
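For reference, a minimal usage sketch of that call (again reusing A, B, emission_sequence and
state_sequence from the earlier hmmgenerate sketch):

    likelystates = hmmviterbi(emission_sequence, A, B);   % most probable single state path
    num_correct  = sum(likelystates == state_sequence)    % hidden states recovered correctly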
*Answering Problem 3: Problem 3, the question of how to adjust parameters (A, B, ∏) to maximize the
probability of an emission sequence, is the most difficult of the 3 problems to answer. There is no known
analytic method for choosing parameters to maximize P(O| µ ). While we cannot necessarily find the best
model, we can find a local maximum using an iterative method known as the Baum-Welch method (or the
expectation-maximization method). In general, the method works as follows: using a (perhaps randomly)
chosen model, we work out the probability of a given sequence of emissions (this is Problem 1). Looking at
this calculation, we determine which state transitions and emissions were probably used the most, and
increase their probabilities in a revised model that increases P(O|µ). This generally continues until the
revised model no longer is improving significantly. This process is called training the model, and the
emission sequences used are called training sequences.
Again, MATLAB is here to help us. The function hmmtrain(O, estimated A,
estimated B) takes a sequence and guesses of the transition and emission
matrices and returns improved estimates of these parameters, using the
Baum-Welch method.
Let’s look at crazyPrinter(500,1) and create “estimated” transition and
emission matrices. Just from eyeballing the frequencies of different
colors in the emitted sequence, suppose we made the following rather
rough guesses:
Transition matrix guess =
    0.8000   0.1000   0.1000
    0.1000   0.8000   0.1000
    0.1000   0.8000   0.1000
Emission matrix guess =
    0.7000   0.1000   0.1000   0.1000
    0.1000   0.7000   0.1000   0.1000
    0.1000   0.1000   0.7000   0.1000
How does the Baum-Welch method adjust these? ESTTR is the adjusted transition
matrix, ESTEMIT is the adjusted emission matrix. For reference, the actual
parameters are displayed as well.
Actual transition matrix:
        Rpref    Gpref    Bpref
Rpref   0.6000   0.1000   0.3000
Gpref   0.3000   0.4000   0.3000
Bpref   0.5000   0.5000   0.0000

Actual emission matrix:
        R        G        B        P
Rpref   0.8000   0.1500   0.0500   0.0000
Gpref   0.2000   0.6000   0.1000   0.1000
Bpref   0.2000   0.2500   0.5000   0.0500

ESTTR =
    0.8958   0.1042   0.0000
    0.1116   0.5571   0.3313
    0.0519   0.9481   0.0000

ESTEMIT =
    0.6806   0.0961   0.0298   0.1934
    0.4179   0.4572   0.0000   0.1249
    0.2134   0.1492   0.2003   0.4371
In general, it seems the estimates improved some of our guesses and made a few worse. This may speak
partially to the bias inherent in “guessing” parameters when they are already known. It is also important
to remember that with this algorithm it is easy to get “stuck” at a local maximum. It also may be that 500
time-slots is simply not enough data with this model for the method to maximize effectively.
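For reference, here is a minimal sketch of the toolbox call behind this experiment (an assumption about
how it might be invoked, reusing the A and B matrices from the earlier hmmgenerate sketch to produce a
500-step training sequence):

    [emission_sequence, state_sequence] = hmmgenerate(500, A, B);
    TRGUESS   = [0.8 0.1 0.1;        % rough transition guess from the text
                 0.1 0.8 0.1;
                 0.1 0.8 0.1];
    EMITGUESS = [0.7 0.1 0.1 0.1;    % rough emission guess from the text
                 0.1 0.7 0.1 0.1;
                 0.1 0.1 0.7 0.1];
    [ESTTR, ESTEMIT] = hmmtrain(emission_sequence, TRGUESS, EMITGUESS);
    % Baum-Welch iterates from the guesses; results vary from run to run
    % and can converge to a local maximum, as noted above.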
HMM Variants
Many applications and problems that use a hidden Markov model require more nuanced variants on the
standard form. The left-right or feed-forward model differs in its transition matrix: it is not ergodic (not
every state can be reached from every state in a finite number of steps). Its name comes from the fact that,
given an ordered set of states, each transition is to a state with an index equal to or higher than that of
the previous state. Thus the transition matrix follows a form such as the following, where each ∗ is a
positive probability:
∗ ∗ 0
0 ∗ ∗
0 0 ∗
This is useful for modeling some types of signals that evolve over time. Another variant on the standard
HMM is the use of null or epsilon transitions. These designate certain transitions, regardless of their
probability, as producing no emission at all, even though the destination state can produce a full spectrum
of emissions when entered by other transitions.
Some Applications
Hidden Markov models are perhaps best known for their applications in speech recognition technology.
How does this work at the basic level, and how do the fundamental questions regarding HMMs come into
play? For simplicity’s sake, consider a speech recognizer designed to identify a single isolated word at a
time. First, we need to build models for each of the words we want in our machine’s vocabulary (say we
want W words). To do this, we choose a word and look at recordings of its speech signal from different speakers. A given
signal might look something like this:
[Figure: example speech signal for one spoken word, from http://www.isip.piconepress.com]
This would give us our observed emission sequence: each observation would be some measurement of
the signal. We might take 40 observations over the course of a word. Again for simplicity’s sake, assume
that each possible emission is a small range of heights measured on the y-axis (or some similar
parameter), and we have a total of M unique possible emissions. Say we have K recorded speech signals
for a given word, and thus K emission sequences. These are used as training sequences in conjunction with
the methods used in Problem 3 to estimate and refine transition and emission parameters µ = (A, B, ∏)
for each word. The solution to Problem 2, using the Viterbi algorithm to link a sequence of states to the
training sequences, allows for further insight into the physical meaning of these states and thus their
connection to spectral properties of the speech signal (i.e. different possible emissions). This allows for
further refinement of the model: the number of states and the number of discrete emissions can be
modified to improve P(X|O,µ). When the models for each word are optimized and well-studied, they are
ready to analyze “emission sequences” from unknown words. Ultimate recognition of the unknown word
is thanks to the methods of Problem 1: given an emission sequence and various models µ for different
words, we can calculate P(O|µ) for different µ (word models). The µ with the highest P(O|µ) is selected as
the recognized word.
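To make that last step concrete, here is a minimal MATLAB sketch of the recognition decision (the word
models and the quantized sequence O are hypothetical placeholders, not part of any actual recognizer):
each trained model is scored with the Problem 1 computation, and the highest-probability model names
the word.

    % wordModels(k).A and wordModels(k).B are the trained transition and
    % emission matrices for the k-th vocabulary word; O is the quantized
    % emission sequence measured from the unknown signal.
    best_word = 0;
    best_logp = -Inf;
    for k = 1:numel(wordModels)
        [~, logpseq] = hmmdecode(O, wordModels(k).A, wordModels(k).B);
        if logpseq > best_logp         % compare P(O|mu) across word models
            best_logp = logpseq;
            best_word = k;
        end
    end
    best_word                          % index of the recognized word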
Another interesting application of HMMs is in determining the routes of certain robots. HMMs play a key
role in allowing the robot to figure out where it is on a map that it has stored and thus navigate its
surroundings. These robots are equipped with range finders, which tell the robot how far it is from certain
surfaces. From this observed sequence of measurements (“O”) and a model µ of the environment (a map
of its surroundings), it infers a hidden sequence of states: its changing location within its environment.
This is Problem 2: essentially, the robot is trying to maximize P(X|O,µ).
Sources
Foundations of Statistical Natural Language Processing, Manning and Schütze, MIT Press, 1999.
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Lawrence Rabiner,
Proceedings of the IEEE, February 1989.
MATLAB MathWorks help files.
http://www.youtube.com/watch?v=S_Lm8aN-la0&feature=related