MARKOV CHAIN AND HIDDEN MARKOV MODEL
JIAN ZHANG
JIANZHAN@STAT.PURDUE.EDU
Markov chains and hidden Markov models are probably the simplest models that can be used to model
sequential data, i.e. data samples that are not independent of each other.
Markov Chain
Let $I$ be a countable set. Each $i \in I$ is called a state and $I$ is called the state-space. Without loss
of generality we assume $I = \{1, 2, \ldots\}$, and in most cases we have $I$ a finite set and use the notation
$I = \{1, 2, \ldots, k\}$ or $I = \{S_1, S_2, \ldots, S_k\}$. $\lambda$ is said to be a distribution on $I$ if $0 \le \lambda_i < \infty$ and $\sum_{i \in I} \lambda_i = 1$.
Definition 1.1. A matrix $T \in \mathbb{R}^{k \times k}$ is stochastic if each row of $T$ is a probability distribution.
One example of a stochastic matrix is
\[
T = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}
\]
with $\alpha, \beta \in [0, 1]$. Figure 1 shows another example of a transition matrix on $I = \{S_1, S_2, S_3\}$ using a finite
state machine.
Figure 1. Finite state machine for a Markov chain $X_0 \to X_1 \to X_2 \to \cdots \to X_n$ where
the random variables $X_i$'s take values from $I = \{S_1, S_2, S_3\}$. The numbers $T(i, j)$'s on the
arrows are the transition probabilities such that $T_{ij} = P(X_{t+1} = S_j \mid X_t = S_i)$.
Definition 1.2. We say that $(X_n)_{n \ge 0}$ is a Markov chain with initial distribution $\lambda$ and transition matrix
$T$ if
(i) $X_0$ has distribution $\lambda$;
(ii) for $n \ge 0$, conditional on $X_n = i$, $X_{n+1}$ has distribution $(T_{ij} : j \in I)$ and is independent of $X_0, \ldots, X_{n-1}$.
By the Markov property we have
\begin{align}
P(X_0, \ldots, X_n) &= P(X_0)\, P(X_1 \mid X_0) \cdots P(X_n \mid X_0, \ldots, X_{n-1}) \tag{1} \\
&= P(X_0) \prod_{t=1}^{n} P(X_t \mid X_{t-1}) \tag{2}
\end{align}
which greatly simplifies the joint distribution of $X_0, \ldots, X_n$. Note also that in our definition the process is
homogeneous, i.e. $P(X_t = S_j \mid X_{t-1} = S_i) = T_{ij}$ does not depend on $t$.
Assume that each $X_t$ takes values in $\mathcal{X} = \{S_1, \ldots, S_k\}$. The behavior of the process can then be described
by a transition matrix $T \in \mathbb{R}^{k \times k}$ with $T_{ij} = P(X_t = S_j \mid X_{t-1} = S_i)$. The set of parameters $\Theta$ for
a Markov chain is $\Theta = \{\lambda, T\}$.
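As a small illustration of the factorization in (2), the following Python sketch computes the probability of one state path from the parameters λ and T. The function name `path_probability`, the 0-based state indices, and the numeric values are assumptions made for this example only.

```python
def path_probability(lam, T, path):
    """Probability of observing the state path X_0, ..., X_n (equation (2)).

    lam  : initial distribution, lam[i] = P(X_0 = i)
    T    : transition matrix, T[i][j] = P(X_t = j | X_{t-1} = i)
    path : list of 0-based state indices
    """
    p = lam[path[0]]                      # P(X_0)
    for prev, cur in zip(path, path[1:]):
        p *= T[prev][cur]                 # P(X_t | X_{t-1})
    return p


# Illustrative two-state chain (values are hypothetical).
lam = [0.5, 0.5]
T = [[0.7, 0.3],
     [0.4, 0.6]]
p = path_probability(lam, T, [0, 0, 1])   # 0.5 * 0.7 * 0.3
```

The cost is linear in the path length, which is what the Markov property buys over the general chain rule in (1).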
Graphical Model for Markov Chain.
The Markov chain $X_0, \ldots, X_n$ can be represented in terms of a graphical model, where each node represents
a random variable and the edges indicate the conditional dependence structure. Graphical models are a very useful
tool for visualizing probabilistic models as well as for designing efficient inference algorithms.
Figure 2. Graphical Model for Markov Chain
Random Walk on Graphs.
The behavior of a Markov chain can also be described as a random walk on the graph shown in Figure 1.
Initially a vertex is chosen according to the initial distribution $\lambda$ and is denoted as $S_{X_0}$; at time $t$ the current
position is $S_{X_t}$ and the next vertex is chosen according to the probabilities $T_{X_t, \cdot}$, the $X_t$-th row of the
transition matrix $T$.
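The random walk described above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_chain`, the 0-based state indices, and the example parameters are assumptions, and `random.choices` from the standard library is used to draw from a discrete distribution.

```python
import random


def sample_chain(lam, T, n, rng):
    """Draw X_0, ..., X_n: X_0 from lam, then X_{t+1} from row X_t of T."""
    states = list(range(len(lam)))
    x = rng.choices(states, weights=lam)[0]       # initial vertex
    path = [x]
    for _ in range(n):
        x = rng.choices(states, weights=T[x])[0]  # next vertex from row T[x]
        path.append(x)
    return path


# Hypothetical two-state chain; the seed only makes the run repeatable.
rng = random.Random(0)
path = sample_chain([0.5, 0.5], [[0.7, 0.3], [0.4, 0.6]], 10, rng)
```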
Many properties of a Markov chain can be identified by studying $\lambda$ and $T$. For example, the distribution
of $X_0$ is determined by $\lambda$, while the distribution of $X_1$ is determined by $\lambda T$ (see footnote 1), etc.
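Repeating the same argument, the distribution of $X_n$ is $\lambda T^n$. The sketch below propagates a row vector through T one step at a time; the helper names `step` and `marginal` and the numbers are illustrative assumptions.

```python
def step(dist, T):
    """One step of the chain: the row vector dist multiplied by the matrix T."""
    k = len(T)
    return [sum(dist[i] * T[i][j] for i in range(k)) for j in range(k)]


def marginal(lam, T, n):
    """Distribution of X_n, i.e. lam multiplied by T n times."""
    dist = list(lam)
    for _ in range(n):
        dist = step(dist, T)
    return dist


# Hypothetical chain that starts in state 0 with probability one.
lam = [1.0, 0.0]
T = [[0.7, 0.3],
     [0.4, 0.6]]
d1 = marginal(lam, T, 1)   # distribution of X_1
d2 = marginal(lam, T, 2)   # distribution of X_2
```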
Hidden Markov Model
A hidden Markov model is an extension of a Markov chain which is able to capture the sequential relations
among hidden variables. Formally we have $Z_t = (X_t, Y_t)$ for $t = 0, 1, \ldots, n$ with $X_t \in I$ and $Y_t \in O =
\{O_1, \ldots, O_l\}$ such that the joint probability of $Z_0, \ldots, Z_n$ can be factorized as:
\begin{align}
P(Z_0, \ldots, Z_n) &= \left[ P(X_0)\, P(Y_0 \mid X_0) \right] \prod_{t=1}^{n} \left[ P(X_t \mid X_{t-1})\, P(Y_t \mid X_t) \right] \tag{3} \\
&= \left[ P(X_0) \prod_{t=1}^{n} P(X_t \mid X_{t-1}) \right] \left[ \prod_{t=0}^{n} P(Y_t \mid X_t) \right]. \tag{4}
\end{align}
In other words, $X_0, \ldots, X_n$ is a Markov chain and $Y_t$ is independent of all other variables given $X_t$. The
set of parameters for an HMM is $\Theta = \{\lambda, T, \Gamma\}$, where $\Gamma \in \mathbb{R}^{k \times l}$ is defined by $\Gamma_{ij} = P(Y_t = O_j \mid X_t = S_i)$. If
$P(Y_t \mid X_t)$ is assumed to be a multinomial distribution, then the total number of free parameters for an HMM is
$(k - 1) + k(k - 1) + k(l - 1)$: $k - 1$ for $\lambda$, $k(k - 1)$ for $T$ and $k(l - 1)$ for $\Gamma$, since $\lambda$ and every row of $T$ and $\Gamma$
must sum to one. Figure 3 shows the graphical model for the HMM, from which we can easily see
the conditional independence structure of all variables $(X_0, Y_0), \ldots, (X_n, Y_n)$.
Figure 3. Graphical Model for Hidden Markov Model
Footnote 1. We assume $\lambda \in \mathbb{R}^{1 \times k}$ to be a row vector.
An HMM is suitable for situations where the observed sequence $Y_0, \ldots, Y_n$ is influenced by a hidden Markov
chain $X_0, \ldots, X_n$. For example, in speech recognition, we observe the phoneme sequence $Y_0, \ldots, Y_n$, which
can be thought of as noisy observations of the underlying words $X_0, \ldots, X_n$. In this
case, we would like to infer the unknown words based on the observation sequence $Y_0, \ldots, Y_n$.
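To make the generative reading of the factorization (3) concrete, here is a hedged Python sketch that samples a hidden path with its observations and evaluates the joint probability of a given pair of sequences. The function names, the 0-based indices for states and observation symbols, and the example matrices are assumptions for illustration, not part of the model definition.

```python
import random


def sample_hmm(lam, T, G, n, rng):
    """Draw (X_0, Y_0), ..., (X_n, Y_n) from an HMM with parameters (lam, T, G)."""
    states = list(range(len(lam)))
    symbols = list(range(len(G[0])))
    xs, ys = [], []
    x = rng.choices(states, weights=lam)[0]            # X_0 ~ lam
    for t in range(n + 1):
        if t > 0:
            x = rng.choices(states, weights=T[x])[0]   # X_t ~ row X_{t-1} of T
        xs.append(x)
        ys.append(rng.choices(symbols, weights=G[x])[0])  # Y_t ~ row X_t of G
    return xs, ys


def joint_probability(lam, T, G, xs, ys):
    """P(X = xs, Y = ys) following equation (3)."""
    p = lam[xs[0]] * G[xs[0]][ys[0]]
    for t in range(1, len(xs)):
        p *= T[xs[t - 1]][xs[t]] * G[xs[t]][ys[t]]
    return p


# Hypothetical parameters with k = 2 states and l = 2 observation symbols.
lam = [0.5, 0.5]
T = [[0.7, 0.3], [0.4, 0.6]]
G = [[0.9, 0.1], [0.2, 0.8]]
xs, ys = sample_hmm(lam, T, G, 5, random.Random(0))
p = joint_probability(lam, T, G, [0, 1], [0, 1])   # 0.5 * 0.9 * 0.3 * 0.8
```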
Three Fundamental Problems in HMM
There are three basic problems of interest for the hidden Markov model:
• Problem 1: Given an observation sequence $y_0 y_1 \ldots y_n$ and the model parameters $\Theta = \{\lambda, T, \Gamma\}$, how
do we efficiently compute $P(Y = y \mid \Theta) = P(Y_0 = y_0, \ldots, Y_n = y_n \mid \Theta)$, the probability of the observation
sequence given the model?
• Problem 2: Given an observation sequence $y_0 y_1 \ldots y_n$ and the model parameters $\Theta = \{\lambda, T, \Gamma\}$, how
do we find the optimal sequence of states $x_0 x_1 \ldots x_n$ in the sense of maximizing $P(X = x \mid \Theta, Y = y) =
P(X_0 = x_0, \ldots, X_n = x_n \mid \Theta, Y_0 = y_0, \ldots, Y_n = y_n)$?
• Problem 3: How do we estimate the model parameters $\Theta = \{\lambda, T, \Gamma\}$ by maximizing $P(Y = y \mid \Theta)$?
Forward-Backward Algorithm.
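The solution of problem 1 can be computed as
\begin{align}
P(Y = y \mid \Theta) &= \sum_{x} P(X = x \mid \Theta)\, P(Y = y \mid \Theta, X = x) \notag \\
&= \sum_{x_0} \sum_{x_1} \cdots \sum_{x_n} \left[ P(X_0 = x_0) \prod_{t=1}^{n} P(X_t = x_t \mid X_{t-1} = x_{t-1}) \prod_{t=0}^{n} P(Y_t = y_t \mid X_t = x_t) \right] \tag{5}
\end{align}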
However, there are $k^{n+1}$ possible hidden sequences $x$, so direct computation is exponential in the sequence
length and very expensive. Intuitively, we want to move some of the sums inside the product to reduce the computation.
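For reference, the direct sum in (5) can be written as a brute-force enumeration over all hidden sequences. The sketch below is only practical for very short sequences and is meant as a baseline to check faster methods against; the function name and the example parameters are assumptions.

```python
from itertools import product


def likelihood_brute_force(lam, T, G, ys):
    """P(Y = ys) by summing the joint probability over every hidden path (equation (5))."""
    k = len(lam)
    total = 0.0
    for xs in product(range(k), repeat=len(ys)):   # k ** len(ys) paths
        p = lam[xs[0]] * G[xs[0]][ys[0]]
        for t in range(1, len(ys)):
            p *= T[xs[t - 1]][xs[t]] * G[xs[t]][ys[t]]
        total += p
    return total


# Hypothetical parameters with k = 2 states and l = 2 observation symbols.
lam = [0.5, 0.5]
T = [[0.7, 0.3], [0.4, 0.6]]
G = [[0.9, 0.1], [0.2, 0.8]]
likelihood = likelihood_brute_force(lam, T, G, [0, 1])
```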
The basic idea of the forward algorithm is as follows. First, the forward variable $\alpha_t(i)$ is defined by
\[
\alpha_t(i) = P(y_0, \ldots, y_t, X_t = S_i), \tag{6}
\]
which is the probability of observing the partial sequence $y_0, \ldots, y_t$ and ending up in state $S_i$. We have
\begin{align}
\alpha_{t+1}(i) &= P(y_0, \ldots, y_{t+1}, X_{t+1} = S_i) \tag{7} \\
&= P(X_{t+1} = S_i)\, P(y_0, \ldots, y_{t+1} \mid X_{t+1} = S_i) \tag{8} \\
&= P(X_{t+1} = S_i)\, P(y_{t+1} \mid X_{t+1} = S_i)\, P(y_0, \ldots, y_t \mid X_{t+1} = S_i) \tag{9} \\
&= P(y_{t+1} \mid X_{t+1} = S_i)\, P(y_0, \ldots, y_t, X_{t+1} = S_i) \tag{10} \\
&= P(y_{t+1} \mid X_{t+1} = S_i) \sum_{x_t} P(y_0, \ldots, y_t, X_t = x_t, X_{t+1} = S_i) \tag{11} \\
&= P(y_{t+1} \mid X_{t+1} = S_i) \sum_{x_t} P(X_{t+1} = S_i \mid X_t = x_t)\, P(y_0, \ldots, y_t, X_t = x_t) \tag{12} \\
&= \Gamma_{i, y_{t+1}} \sum_{j=1}^{k} T_{j,i}\, \alpha_t(j). \tag{13}
\end{align}
Initially we have $\alpha_0(i) = \lambda_i \Gamma_{i, y_0}$ and the final solution is
\[
P(Y = y \mid \Theta) = \sum_{i=1}^{k} \alpha_n(i). \tag{14}
\]
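The recursion (13) with the initialization and the final sum (14) translates almost line by line into code. The following is a minimal Python sketch; the function name `forward`, the 0-based indices and the example parameters are assumptions, and no rescaling is done, so very long sequences would underflow in floating point.

```python
def forward(lam, T, G, ys):
    """Return the table alpha, where alpha[t][i] = P(y_0, ..., y_t, X_t = i)."""
    k = len(lam)
    alpha = [[lam[i] * G[i][ys[0]] for i in range(k)]]   # alpha_0(i)
    for y in ys[1:]:
        prev = alpha[-1]
        # Equation (13): emission probability times the sum over predecessors j.
        alpha.append([G[i][y] * sum(T[j][i] * prev[j] for j in range(k))
                      for i in range(k)])
    return alpha


# Hypothetical parameters with k = 2 states and l = 2 observation symbols.
lam = [0.5, 0.5]
T = [[0.7, 0.3], [0.4, 0.6]]
G = [[0.9, 0.1], [0.2, 0.8]]
alpha = forward(lam, T, G, [0, 1])
likelihood = sum(alpha[-1])        # equation (14)
```

Each time step costs on the order of $k^2$ operations, so the whole computation grows linearly in the sequence length rather than exponentially as in (5).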
The backward algorithm can be constructed similarly by defining the backward variable
\[
\beta_t(i) = P(y_{t+1}, \ldots, y_n \mid X_t = S_i).
\]
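The text does not spell out the backward recursion. The sketch below uses the standard form that follows from the definition of the backward variable: $\beta_n(i) = 1$ and $\beta_t(i) = \sum_j T_{ij}\, \Gamma_{j, y_{t+1}}\, \beta_{t+1}(j)$, with the likelihood recovered as $\sum_i \lambda_i \Gamma_{i, y_0} \beta_0(i)$. The function name and the example parameters are assumptions.

```python
def backward(T, G, ys):
    """Return the table beta, where beta[t][i] = P(y_{t+1}, ..., y_n | X_t = i)."""
    k = len(T)
    n = len(ys) - 1
    beta = [[1.0] * k for _ in range(n + 1)]      # beta_n(i) = 1
    for t in range(n - 1, -1, -1):
        for i in range(k):
            beta[t][i] = sum(T[i][j] * G[j][ys[t + 1]] * beta[t + 1][j]
                             for j in range(k))
    return beta


# Hypothetical parameters with k = 2 states and l = 2 observation symbols.
lam = [0.5, 0.5]
T = [[0.7, 0.3], [0.4, 0.6]]
G = [[0.9, 0.1], [0.2, 0.8]]
ys = [0, 1]
beta = backward(T, G, ys)
likelihood = sum(lam[i] * G[i][ys[0]] * beta[0][i] for i in range(len(lam)))
```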
Viterbi Algorithm.
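The solution of problem 2 can be written as
\begin{align}
x^* &= \arg\max_{x} P(X = x \mid Y = y, \Theta) \tag{15} \\
&= \arg\max_{x} P(X = x, Y = y \mid \Theta), \tag{16}
\end{align}
where the second equality holds because $P(Y = y \mid \Theta)$ does not depend on $x$.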
A formal technique for finding the best state sequence $x^*$ based on dynamic programming is known as the
Viterbi algorithm. Define the quantity
\[
\delta_t(i) = \max_{x_0, \ldots, x_{t-1}} P(x_0, \ldots, x_{t-1}, X_t = S_i, y_0, \ldots, y_t \mid \Theta), \tag{17}
\]
which is the highest probability along a single path at time $t$ ending at state $S_i$. We have
\begin{align}
\delta_{t+1}(j) &= \max_{i} \left\{ \delta_t(i)\, P(X_{t+1} = S_j \mid X_t = S_i)\, P(Y_{t+1} = y_{t+1} \mid X_{t+1} = S_j) \right\} \tag{18} \\
&= \max_{i} \delta_t(i)\, T_{ij}\, \Gamma_{j, y_{t+1}}. \tag{19}
\end{align}
Initially we have $\delta_0(i) = \lambda_i \Gamma_{i, y_0}$ and the final highest probability is $P^* = \max_{S_i \in I} \delta_n(i)$. To find the optimal
sequence $x^*$ we need to define auxiliary variables $\psi_{t+1}(j)$ which store, for each state $S_j$ at time $t+1$, the best predecessor state:
\[
\psi_{t+1}(j) = \arg\max_{i} \delta_t(i)\, T_{ij}\, \Gamma_{j, y_{t+1}} = \arg\max_{i} \left\{ \delta_t(i)\, T_{ij} \right\}, \tag{20}
\]
for $t = 0, 1, \ldots, n-1$. The final optimal path can be traced back by using $x^*_n = \arg\max_i \delta_n(i)$ and $x^*_t =
\psi_{t+1}(x^*_{t+1})$ for $t = n - 1, \ldots, 0$.
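The recursion (19), the back-pointers (20) and the trace-back step can be sketched in Python as follows. The function name `viterbi`, the 0-based indices and the example parameters are assumptions; ties in the arg max are broken in favor of the lowest state index, and probabilities are multiplied directly without moving to log space.

```python
def viterbi(lam, T, G, ys):
    """Return (best state path, its joint probability with ys)."""
    k = len(lam)
    delta = [lam[i] * G[i][ys[0]] for i in range(k)]      # delta_0(i)
    psi = []                                              # back-pointers psi_1, ..., psi_n
    for y in ys[1:]:
        # Equation (20): best predecessor i for each state j.
        back = [max(range(k), key=lambda i: delta[i] * T[i][j]) for j in range(k)]
        # Equation (19): extend the best path into state j and emit y.
        delta = [delta[back[j]] * T[back[j]][j] * G[j][y] for j in range(k)]
        psi.append(back)
    best = max(range(k), key=lambda i: delta[i])          # x*_n
    path = [best]
    for back in reversed(psi):                            # x*_t = psi_{t+1}(x*_{t+1})
        path.append(back[path[-1]])
    path.reverse()
    return path, delta[best]


# Hypothetical parameters with k = 2 states and l = 2 observation symbols.
lam = [0.5, 0.5]
T = [[0.7, 0.3], [0.4, 0.6]]
G = [[0.9, 0.1], [0.2, 0.8]]
best_path, best_prob = viterbi(lam, T, G, [0, 1])
```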
Baum-Welch Algorithm.
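Let $\Theta = (\lambda, T, \Gamma)$ represent all of the parameters of the HMM. Given $m$ observation sequences
$y^1, \ldots, y^m$, where $y^l = y^l_0 \ldots y^l_{n_l}$, the parameters can be estimated by maximizing the (log-)likelihood:
\begin{align}
\hat{\Theta} &= \arg\max_{\Theta} \prod_{l=1}^{m} p(Y = y^l \mid \Theta) \tag{21} \\
&= \arg\max_{\Theta} \sum_{l=1}^{m} \log p(Y = y^l \mid \Theta) \tag{22} \\
&= \arg\max_{\Theta} \sum_{l=1}^{m} \log \left\{ \sum_{x_0} \cdots \sum_{x_{n_l}} \lambda_{x_0} \prod_{t=1}^{n_l} T_{x_{t-1}, x_t} \prod_{t=0}^{n_l} \Gamma_{x_t, y^l_t} \right\}. \tag{23}
\end{align}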
In principle, the above equation can be maximized using standard numerical optimization methods to find
Θ̂. In practice, the above estimation is often solved by the well-known Baum-Welch algorithm, which is a
special case of the Expectation Maximization (EM) algorithm. Details will be discussed after we introduce
the EM algorithm.
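The sketch below is not the Baum-Welch algorithm. It only evaluates the objective (22) for a candidate $\Theta$, using the forward recursion (13) in place of the explicit sum over hidden paths in (23), which is what any numerical optimizer of the likelihood would need. The function names and the example parameters are assumptions.

```python
from math import log


def sequence_likelihood(lam, T, G, ys):
    """P(Y = ys | Theta) via the forward recursion, keeping only the last alpha."""
    k = len(lam)
    alpha = [lam[i] * G[i][ys[0]] for i in range(k)]
    for y in ys[1:]:
        alpha = [G[i][y] * sum(T[j][i] * alpha[j] for j in range(k))
                 for i in range(k)]
    return sum(alpha)


def log_likelihood(lam, T, G, sequences):
    """Objective (22): sum of log-likelihoods over the m observation sequences."""
    return sum(log(sequence_likelihood(lam, T, G, ys)) for ys in sequences)


# Hypothetical parameters and two short observation sequences.
lam = [0.5, 0.5]
T = [[0.7, 0.3], [0.4, 0.6]]
G = [[0.9, 0.1], [0.2, 0.8]]
value = log_likelihood(lam, T, G, [[0, 1], [0]])
```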
Learning with (x, y).
There are often cases where both the state sequences and the observation sequences are known.
That is, given $m$ pairs of sequences $(x^1, y^1), \ldots, (x^m, y^m)$, we want to estimate the parameters $\lambda$, $T$ and $\Gamma$. Since
the state sequences are observed (and thus the summation over $x$ is no longer needed), the maximum
likelihood estimate $\hat{\Theta}$ can be computed easily in this case:
\begin{align}
\hat{\Theta} &= \arg\max_{\Theta} \sum_{l=1}^{m} \log p(Y = y^l, X = x^l \mid \Theta) \tag{24} \\
&= \arg\max_{\Theta} \sum_{l=1}^{m} \log \left\{ \lambda_{x^l_0} \prod_{t=1}^{n_l} T_{x^l_{t-1}, x^l_t} \prod_{t=0}^{n_l} \Gamma_{x^l_t, y^l_t} \right\} \tag{25} \\
&= \arg\max_{\Theta} \left\{ \sum_{l=1}^{m} \log \lambda_{x^l_0} + \sum_{l=1}^{m} \sum_{t=1}^{n_l} \log T_{x^l_{t-1}, x^l_t} + \sum_{l=1}^{m} \sum_{t=0}^{n_l} \log \Gamma_{x^l_t, y^l_t} \right\} \tag{26}
\end{align}
which is straightforward to solve by adding the constraints that $\lambda$ and each row of $T$ and $\Gamma$ are probability
distributions.
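Because the three sums in (26) involve $\lambda$, $T$ and $\Gamma$ separately, the constrained maximization reduces to normalized counts: how often each state starts a sequence, how often each transition occurs, and how often each state emits each symbol. The Python sketch below illustrates this; the function name, the 0-based indices, the toy data, and the uniform fallback for states that never occur are assumptions made here, not part of the text.

```python
def fit_supervised(pairs, k, l):
    """Maximum likelihood estimates (lam, T, G) from fully observed (xs, ys) pairs."""
    init = [0] * k
    trans = [[0] * k for _ in range(k)]
    emit = [[0] * l for _ in range(k)]
    for xs, ys in pairs:
        init[xs[0]] += 1                      # counts for lam
        for a, b in zip(xs, xs[1:]):
            trans[a][b] += 1                  # counts for T
        for x, y in zip(xs, ys):
            emit[x][y] += 1                   # counts for G

    def normalize(row):
        s = sum(row)
        # Assumption: a row with no observations falls back to the uniform distribution.
        return [c / s for c in row] if s else [1.0 / len(row)] * len(row)

    return normalize(init), [normalize(r) for r in trans], [normalize(r) for r in emit]


# Hypothetical training data: two labeled sequences over 2 states and 2 symbols.
pairs = [([0, 0, 1], [0, 1, 1]),
         ([0, 1, 1], [0, 1, 1])]
lam_hat, T_hat, G_hat = fit_supervised(pairs, k=2, l=2)
```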
Discussion
HMMs have been applied in many areas such as speech recognition, robotics and bioinformatics,
and they are also the simplest example of what is known as Dynamic Bayesian Networks (DBN) or directed
graphical models. More complicated models (generalizations of the HMM) include factorial HMMs, HMM
decision trees, etc. Other related models include the Conditional Random Field (CRF), which belongs to the family of
undirected graphical models.