Machine Learning - Northwestern University

Machine Learning
Hidden Markov Models
Doug Downey, adapted from Bryan Pardo, Northwestern University
The Markov Property
A stochastic process has the Markov property if
the conditional probability of future states of the
process, depends only upon the present state.
i.e. what I’m likely to do next
depends only on where I am
now, NOT on how I got here.
P(qt | qt-1,…,q1) = P(qt | qt-1)
[Diagram: a chain of states 1, 2, …, K]
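A quick way to see the property concretely: simulate a two-state chain and check that the conditional distribution of the next state is the same regardless of the earlier history. The sketch below uses illustrative transition values, not numbers from the slides.

```python
import random

# A two-state Markov chain: the next state depends only on the current one.
# Transition probabilities (illustrative values):
P = {"A": {"A": 0.7, "B": 0.3},
     "B": {"A": 0.4, "B": 0.6}}

def simulate(n, seed=0):
    rng = random.Random(seed)
    state, path = "A", ["A"]
    for _ in range(n):
        nxt = "A" if rng.random() < P[state]["A"] else "B"
        path.append(nxt)
        state = nxt
    return path

path = simulate(200_000)

# Estimate P(next = A | now = A) separately for each possible state
# two steps back; the Markov property says history should not matter.
def cond_prob(prev):
    num = den = 0
    for t in range(2, len(path)):
        if path[t - 1] == "A" and path[t - 2] == prev:
            den += 1
            num += path[t] == "A"
    return num / den

print(cond_prob("A"), cond_prob("B"))  # both close to 0.7
```

Both estimates land near the true transition probability 0.7, whichever state preceded the current one.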
Which processes have the Markov property?
Markov model for Dow Jones
[figure not reproduced]
The Dishonest Casino
A casino has two dice:
• Fair die
P(1) = P(2) =…= P(5) = P(6) = 1/6
• Loaded die
P(1) = P(2) = … = P(5) = 1/10; P(6) = 1/2
I think the casino switches back and forth
between the fair and loaded die about once
every 20 turns, on average
My dishonest casino model
This is a hidden Markov model (HMM)
Transitions:
  FAIR → FAIR: 0.95      FAIR → LOADED: 0.05
  LOADED → LOADED: 0.95  LOADED → FAIR: 0.05
Emissions:
  FAIR:   P(1|F) = P(2|F) = … = P(6|F) = 1/6
  LOADED: P(1|L) = P(2|L) = … = P(5|L) = 1/10; P(6|L) = 1/2
Elements of a Hidden Markov Model
• A finite set of states Q = {q1, …, qK}
• A set of transition probabilities between states, A
  …each aij in A is the probability of going from state i to state j
• The probability of starting in each state, P = {p1, …, pK}
  …each pk in P is the probability of starting in state k
• A set of emission probabilities, B
  …where each bi(oj) in B is the probability of observing
  output oj when in state i
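These elements can be written out concretely. The sketch below encodes the dishonest-casino model from the previous slide in Python; the 50/50 start distribution is an assumption, since the slide does not give one.

```python
import random

# The dishonest-casino HMM written out as (Q, A, B, p).
Q = ["FAIR", "LOADED"]
A = {"FAIR":   {"FAIR": 0.95, "LOADED": 0.05},
     "LOADED": {"FAIR": 0.05, "LOADED": 0.95}}
B = {"FAIR":   {o: 1 / 6 for o in range(1, 7)},
     "LOADED": {**{o: 1 / 10 for o in range(1, 6)}, 6: 1 / 2}}
p = {"FAIR": 0.5, "LOADED": 0.5}  # start probabilities (assumed 50/50)

def sample(T, seed=0):
    """Generate T rolls: the states are hidden, only the rolls are seen."""
    rng = random.Random(seed)
    states, rolls = [], []
    s = rng.choices(Q, weights=[p[q] for q in Q])[0]
    for _ in range(T):
        states.append(s)
        rolls.append(rng.choices(range(1, 7),
                                 weights=[B[s][o] for o in range(1, 7)])[0])
        s = rng.choices(Q, weights=[A[s][q] for q in Q])[0]
    return states, rolls

states, rolls = sample(20)
```

An observer sees only `rolls`; the `states` list is exactly the hidden part of the model.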
My dishonest casino model
This is a HIDDEN Markov model because the states
are not directly observable.
If the fair die were red and the unfair die were blue,
then the Markov model would NOT be hidden.
HMMs are good for…
• Speech Recognition
• Gene Sequence Matching
• Text Processing
– Part of speech tagging
– Information extraction
– Handwriting recognition
The Three Basic Problems for HMMs
• Given: observation sequence O = (o1 o2 … oT) of events from
the alphabet Σ, and HMM model λ = (A, B, p)…
• Problem 1 (Evaluation):
  What is P(O|λ), the probability of the observation sequence, given
  the model?
• Problem 2 (Decoding):
  What sequence of states Q = (q1 q2 … qT) best explains the observations?
• Problem 3 (Learning):
  How do we adjust the model parameters λ = (A, B, p) to maximize
  P(O|λ)?
The Evaluation Problem
• Given observation sequence O and HMM λ, compute P(O|λ)
• Helps us pick which model is the best one
[Diagram: two candidate casino HMMs with different transition
probabilities — which better explains O = 1,6,6,2,6,3,6,6?]
Computing P(O|λ)
• Naïve approach: try every path through the model
• Sum the probabilities of all possible paths
• This can be intractable: O(N^T)
• What we do instead:
  – The Forward Algorithm: O(N^2 T)
The Forward Algorithm
The inductive step
• Computation of αt(j): sum αt−1(i) · aij over all previous
  hidden states i, then multiply by the emission probability bj(ot)
Forward Algorithm Example
Model λ =
  FAIR:    P(1|F) = P(2|F) = … = P(6|F) = 1/6
  LOADED:  P(1|L) = … = P(5|L) = 1/10; P(6|L) = 1/2
  Transitions: P(stay) = 0.95, P(switch) = 0.05
  Start probabilities: P(fair) = 0.7, P(loaded) = 0.3

Observation sequence = 1, 6, 6, 2

                  α1(i)        α2(i)
State 1 (fair)    0.7 · 1/6    [α1(1)·0.95 + α1(2)·0.05] · 1/6
State 2 (loaded)  0.3 · 1/10   [α1(1)·0.05 + α1(2)·0.95] · 1/2

                  α3(i)                              α4(i)
State 1 (fair)    [α2(1)·0.95 + α2(2)·0.05] · 1/6    [α3(1)·0.95 + α3(2)·0.05] · 1/6
State 2 (loaded)  [α2(1)·0.05 + α2(2)·0.95] · 1/2    [α3(1)·0.05 + α3(2)·0.95] · 1/10
Markov model for Dow Jones
[figure not reproduced]
Forward trellis for Dow Jones
[figure not reproduced]
The Decoding Problem
• What sequence of states Q=(q1q2…qT) best
explains the observation sequence O=(o1o2…oT)?
• Helps us find the path through a model.
Example: tagging "The dog sat quietly" with the state sequence
ART → N → V → ADV
The Decoding Problem
What sequence of states Q=(q1q2…qT) best explains the
observation sequence O=(o1o2…oT)?
• Viterbi Decoding:
– slight modification of the forward algorithm
– the major difference is the maximization over previous states
Note: Most likely state sequence is not the same as the
sequence of most likely states
The Viterbi Algorithm
The Forward inductive step
• Computation of t(j)
 t ( j ) =  t 1 (k )aijb j (ot )
i
ot-1
ot
t-1(j)
The Viterbi inductive step
• Computation of vt(j):
  vt(j) = maxi ( vt−1(i) · aij · bj(ot) )
Keep track of which predecessor state achieved the
maximum at each step.
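A minimal Viterbi sketch for the same casino model and observation sequence used in the forward example (0 = fair, 1 = loaded), keeping backpointers as described above:

```python
# Viterbi decoding for the dishonest casino.
A = [[0.95, 0.05], [0.05, 0.95]]
B = [[1 / 6] * 6, [1 / 10] * 5 + [1 / 2]]
p = [0.7, 0.3]
O = [1, 6, 6, 2]

def viterbi(O):
    v = [p[j] * B[j][O[0] - 1] for j in range(2)]
    back = []                   # back[t][j] = best predecessor of state j
    for t in range(1, len(O)):
        scores = [[v[i] * A[i][j] * B[j][O[t] - 1] for i in range(2)]
                  for j in range(2)]
        back.append([max(range(2), key=lambda i: scores[j][i])
                     for j in range(2)])
        v = [max(scores[j]) for j in range(2)]
    # Backtrack from the best final state.
    best = max(range(2), key=lambda j: v[j])
    path = [best]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

path = viterbi(O)               # 0 = fair, 1 = loaded
```

With these numbers the decoded path comes out all loaded, driven by the sixes in the sequence.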
Viterbi for Dow Jones
[figure not reproduced]
The Learning Problem
• Given O, how do we adjust the model
parameters λ = (A, B, p) to maximize P(O|λ)?
• In other words: how do we build a hidden
Markov model that best models what we observe?
Baum-Welch Local Maximization
• 1st step: You determine
  – The number of hidden states, N
  – The emission (observation) alphabet
• 2nd step: Randomly assign values to…
  A – the transition probabilities
  B – the observation (emission) probabilities
  p – the starting state probabilities
• 3rd step: Let the machine re-estimate A, B, p
Estimation Formulae
pi = expected frequency of state i at time t = 1

aij = (expected num transitions from state i to state j)
      / (expected num of transitions from state i)

bj(k) = (expected num of observations of symbol k in state j)
        / (expected number of times in state j)
Learning transitions…
ξt(i, j) = P(qt = Si, qt+1 = Sj | O, λ)
Math…
ξt(i, j) = P(qt = Si, qt+1 = Sj | O, λ)

ξt(i, j) = αt(i) · aij · bj(Ot+1) · βt+1(j) / P(O|λ)

ξt(i, j) = αt(i) · aij · bj(Ot+1) · βt+1(j)
           / [ Σk=1..N Σl=1..N αt(k) · akl · bl(Ot+1) · βt+1(l) ]
Estimation of starting probs.
p̄i = expected frequency of state i at time 1

p̄i = γ1(i)

where

γt(i) = Σj=1..N ξt(i, j)

γt(i) is the expected number of transitions from state i at time t.
Estimation Formulae
āij = (expected num transitions from state i to state j)
      / (expected num of transitions from state i)

āij = [ Σt=1..T−1 ξt(i, j) ] / [ Σt=1..T−1 γt(i) ]
Estimation Formulae
b̄j(k) = (expected num of observations of symbol k in state j)
        / (expected number of times in state j)

b̄j(k) = [ Σt=1..T, Ot=vk γt(j) ] / [ Σt=1..T γt(j) ]

where vk is the k-th symbol of the observation alphabet.
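Putting the ξ and γ formulas together, one re-estimation step can be sketched as follows. This is plain Python without the usual rescaling, so it is only numerically safe for short observation sequences, and the update for B is omitted to keep the sketch small.

```python
# One Baum-Welch re-estimation step for the 2-state casino HMM.
N = 2
A = [[0.95, 0.05], [0.05, 0.95]]
B = [[1 / 6] * 6, [1 / 10] * 5 + [1 / 2]]
p = [0.7, 0.3]
O = [1, 6, 6, 2]
T = len(O)

def forward(O):
    a = [[p[j] * B[j][O[0] - 1] for j in range(N)]]
    for t in range(1, T):
        a.append([sum(a[-1][i] * A[i][j] for i in range(N)) * B[j][O[t] - 1]
                  for j in range(N)])
    return a

def backward(O):
    b = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        b.insert(0, [sum(A[i][j] * B[j][O[t + 1] - 1] * b[0][j]
                         for j in range(N)) for i in range(N)])
    return b

al, be = forward(O), backward(O)
PO = sum(al[-1])                     # P(O | current model)

# xi[t][i][j] = P(q_t = i, q_{t+1} = j | O, model); gamma[t][i] = sum_j xi
xi = [[[al[t][i] * A[i][j] * B[j][O[t + 1] - 1] * be[t + 1][j] / PO
        for j in range(N)] for i in range(N)] for t in range(T - 1)]
gamma = [[sum(xi[t][i]) for i in range(N)] for t in range(T - 1)]

# Re-estimates for the start and transition probabilities.
new_p = [gamma[0][i] for i in range(N)]
new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
          sum(gamma[t][i] for t in range(T - 1))
          for j in range(N)] for i in range(N)]
```

Both re-estimates remain proper distributions: `new_p` sums to 1 and every row of `new_A` sums to 1.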
What are we maximizing again?
The current model is λ = (A, B, p).
Our re-estimated model is λ̄ = (Ā, B̄, p̄).
The game is…
• EITHER the current model is at a local maximum, and…
  re-estimate = current model
• OR our re-estimate will be slightly better, and…
  re-estimate ≠ current model
• SO we feed the re-estimate back in as the current
model, over and over, until we can't improve any more.
Caveats
• This is a kind of hill-climbing technique
– Often has serious problems with local maxima
– You don’t know when you’re done
So… how else could we do this?
• Standard gradient descent techniques?
• Hill climbing?
• Beam search?
• Genetic algorithms?
Back to the fundamental question
• Which processes have the Markov property?
  – What if a hidden state variable is included
    (as in an HMM)?