Brief Introduction to Hidden Markov Models

Alpaydın slides with several modifications and additions by Christoph Eick.
Introduction
 Modeling dependencies in the input; observations are no longer iid, e.g., the
order of observations in a dataset matters:
 Temporal sequences:
In speech: phonemes in a word (dictionary), words in a sentence
(syntax, semantics of the language)
Stock market (stock values over time)
 Spatial sequences:
Base pairs in DNA sequences
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
Discrete Markov Process
 N states: S1, S2, ..., SN
 First-order Markov
State at “time” t, qt = Si
P(qt+1=Sj | qt=Si, qt-1=Sk ,...) = P(qt+1=Sj | qt=Si)
 Transition probabilities
aij ≡ P(qt+1=Sj | qt=Si)
aij ≥ 0 and Σj=1..N aij = 1
 Initial probabilities
πi ≡ P(q1=Si)
Σi=1..N πi = 1
Stochastic Automaton/Markov Chain
P(O = Q | A, Π) = P(q1) ∏t=2..T P(qt | qt-1) = πq1 aq1q2 ··· aqT-1qT
Example: Balls and Urns
 Three urns, each full of balls of one color:
S1: blue, S2: red, S3: green

Π = [0.5, 0.2, 0.3]T

    | 0.4  0.3  0.3 |
A = | 0.2  0.6  0.2 |
    | 0.1  0.1  0.8 |

O = {S1, S1, S3, S3}
P(O | A, Π) = P(S1) · P(S1 | S1) · P(S3 | S1) · P(S3 | S3)
            = π1 · a11 · a13 · a33
            = 0.5 · 0.4 · 0.3 · 0.8 = 0.048
Balls and Urns: Learning
 Given K example sequences of length T (1{·} is the indicator
function; qt^k is the state at step t in sequence k):

π̂i = #{sequences starting with Si} / #{sequences}
    = Σk 1{q1^k = Si} / K

âij = #{transitions from Si to Sj} / #{transitions from Si}
    = Σk Σt=1..T-1 1{qt^k = Si and qt+1^k = Sj} / Σk Σt=1..T-1 1{qt^k = Si}

Remark: Extract the probabilities from the observed sequences:
s1-s2-s1-s3
s2-s1-s1-s2
s2-s3-s2-s1
π1=1/3, π2=2/3, a11=1/4, a12=1/2, a13=1/4, a21=3/4, …
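The counting estimators above can be checked with a short sketch (no code in the slides; this uses the three observed state sequences from the remark, written as 0-indexed integers):

```python
# Estimate pi and A by counting starts and transitions over the sequences.
from collections import Counter

sequences = [[0, 1, 0, 2],   # s1-s2-s1-s3
             [1, 0, 0, 1],   # s2-s1-s1-s2
             [1, 2, 1, 0]]   # s2-s3-s2-s1
N = 3

starts = Counter(seq[0] for seq in sequences)
pi_hat = [starts[i] / len(sequences) for i in range(N)]

trans = Counter()
for seq in sequences:
    for prev, cur in zip(seq, seq[1:]):
        trans[(prev, cur)] += 1

from_i = [sum(trans[(i, j)] for j in range(N)) for i in range(N)]
A_hat = [[trans[(i, j)] / from_i[i] if from_i[i] else 0.0 for j in range(N)]
         for i in range(N)]

print(pi_hat)        # pi1 = 1/3, pi2 = 2/3, pi3 = 0
print(A_hat[1][0])   # a21 = 3/4
```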
http://en.wikipedia.org/wiki/Hidden_Markov_model
Hidden Markov Models
 States are not observable
 Discrete observations {v1, v2, ..., vM} are recorded; each
observation is a probabilistic function of the hidden state
 Emission probabilities
bj(m) ≡ P(Ot=vm | qt=Sj)
 Example: In each urn, there are balls of different colors,
but with different probabilities.
 For each observation sequence, there are multiple state
sequences
http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter10.html
HMM Unfolded in Time
Now a more complicated problem
[Figure: three urns 1, 2, 3; the top panel shows a Markov chain whose states we
observe directly, the bottom panel a hidden Markov model where we observe only
the drawn ball colors]
We observe a sequence of ball colors. What urn sequence created it?
1. Markov chain: 1-1-2-2 (somewhat trivial, as states are observable!)
2. Hidden Markov model: (1 or 2)-(1 or 2)-(2 or 3)-(2 or 3), and the potential
sequences have different probabilities, e.g., drawing a blue ball from urn 1 is
more likely than from urn 2!
Another Motivating Example
Elements of an HMM
 N: Number of states
 M: Number of observation symbols
 A = [aij]: N by N state transition probability matrix
 B = bj(m): N by M observation probability matrix
 Π = [πi]: N by 1 initial state probability vector
λ = (A, B, Π), parameter set of HMM
Three Basic Problems of HMMs
1. Evaluation: Given λ, and sequence O, calculate P (O | λ)
2. Most Likely State Sequence: Given λ and sequence O, find
state sequence Q* such that
P (Q* | O, λ ) = maxQ P (Q | O , λ )
3. Learning: Given a set of sequences O={O1,…,OK}, find λ* such
that λ* is the most likely explanation for the sequences in O:
P(O | λ*) = maxλ ∏k P(Ok | λ)
(Rabiner, 1989)
Evaluation
 Forward variable (the probability of observing O1 … Ot and
additionally being in state i at step t):
αt(i) ≡ P(O1 … Ot, qt=Si | λ)
Initialization:
α1(i) = πi bi(O1)
Recursion:
αt+1(j) = [Σi=1..N αt(i) aij] bj(Ot+1)
Using αT(i), the probability of the observed sequence can be
computed as follows:
P(O | λ) = Σi=1..N αT(i)
Complexity: O(N2·T)
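The forward pass can be sketched as follows. A and π are the urn-example numbers from earlier slides; the emission matrix B is an assumed, illustrative choice (urn i mostly yields color i), not from the source:

```python
# Forward algorithm: alpha[t][j] accumulates the probability of the prefix
# O_1..O_t ending in state j, using the recursion from the slide.

pi = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
B = [[0.8, 0.1, 0.1],   # B[j][m] = P(O_t = v_m | q_t = S_j)  (assumed values)
     [0.2, 0.6, 0.2],
     [0.1, 0.2, 0.7]]

def forward(O, pi, A, B):
    """Return the table of forward variables alpha[t][i]."""
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]        # initialization
    for t in range(1, len(O)):                              # recursion
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

O = [0, 0, 2, 2]                  # observed colors: blue, blue, green, green
alpha = forward(O, pi, A, B)
print(sum(alpha[-1]))             # P(O | lambda) = sum_i alpha_T(i)
```

The loop visits N states at each of T steps and sums over N predecessors, which is the O(N2·T) complexity stated above.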
 Backward variable (the probability of observing Ot+1 … OT, given
being in state i at step t):
βt(i) ≡ P(Ot+1 … OT | qt=Si, λ)
Initialization:
βT(i) = 1
Recursion:
βt(i) = Σj=1..N aij bj(Ot+1) βt+1(j)
Finding the Most Likely State Sequence
 γt(i) ≡ P(qt=Si | O, λ)
γt(i) is the probability of being in state i at step t, given the
whole observation sequence O1…Ot Ot+1…OT.
γt(i) = αt(i) βt(i) / Σj=1..N αt(j) βt(j)
Choose the state that has the highest probability for each time step:
qt* = argmaxi γt(i)
Only briefly discussed in 2014!
Viterbi’s Algorithm
δt(i) ≡ maxq1q2∙∙∙ qt-1 p(q1q2∙∙∙qt-1,qt =Si,O1∙∙∙Ot | λ)
 Initialization:
δ1(i) = πibi(O1), ψ1(i) = 0
 Recursion:
δt(j) = maxi δt-1(i)aijbj(Ot), ψt(j) = argmaxi δt-1(i)aij
 Termination:
p* = maxi δT(i), qT*= argmaxi δT (i)
 Path backtracking:
qt* = ψt+1(qt+1* ), t=T-1, T-2, ..., 1
Idea: Combines path probability computations
with backtracking over competing paths.
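The steps listed above can be sketched directly: δ holds the best path probabilities and ψ the backpointers. A and π are the urn-example numbers; the emission matrix B is an assumed, illustrative choice, not from the source:

```python
# Viterbi's algorithm: initialization, recursion, termination, backtracking.

pi = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
B = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]   # assumed values

def viterbi(O, pi, A, B):
    N, T = len(pi), len(O)
    delta = [[pi[i] * B[i][O[0]] for i in range(N)]]        # initialization
    psi = [[0] * N]
    for t in range(1, T):                                   # recursion
        row, back = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            row.append(delta[t - 1][best] * A[best][j] * B[j][O[t]])
            back.append(best)
        delta.append(row)
        psi.append(back)
    q = [max(range(N), key=lambda i: delta[T - 1][i])]      # termination
    p_star = max(delta[T - 1])
    for t in range(T - 1, 0, -1):                           # path backtracking
        q.insert(0, psi[t][q[0]])
    return q, p_star

path, p_star = viterbi([0, 0, 2, 2], pi, A, B)
print(path)
```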
Baum-Welch Algorithm
[Diagram: the Baum-Welch algorithm takes a set of observed symbol sequences
O={O1,…,OK} and learns a model λ=(A, B, Π); the hidden state sequences are
not observed.]
An EM-style algorithm is used!
E-Step:
ξt(i,j) ≡ P(qt=Si, qt+1=Sj | O, λ)
ξt(i,j) = αt(i) aij bj(Ot+1) βt+1(j) / Σk Σl αt(k) akl bl(Ot+1) βt+1(l)
ξt(i,j) is a hidden (latent) variable, measuring the probability of going from
state i to state j at step t+1, observing Ot+1, given a model λ and an observed
sequence Ok.
γt(i) = Σj=1..N ξt(i,j)
γt(i) is a hidden (latent) variable, measuring the probability of being in state i
at step t, given a model λ and an observed sequence Ok.
Baum-Welch Algorithm: M-Step
M-step:
π̂i = Σk=1..K γ1^k(i) / K

âij = Σk=1..K Σt=1..Tk-1 ξt^k(i,j) / Σk=1..K Σt=1..Tk-1 γt^k(i)
(numerator: expected number of transitions from i to j;
denominator: expected number of times in state i)

b̂j(m) = Σk=1..K Σt=1..Tk-1 γt^k(j) 1{Ot^k = vm} / Σk=1..K Σt=1..Tk-1 γt^k(j)

Remark: k iterates over the observed sequences O1,…,OK; for each individual
sequence Ok, the γ^k and ξ^k are computed in the E-step; then, the actual model λ
is computed in the M-step by averaging over the estimates of πi, aij, bj(m)
(based on γ^k and ξ^k) for each of the K observed sequences.
Baum-Welch Algorithm: Summary
Estimate an initial model λ = (A, B, Π)
REPEAT
E-step: Estimate γt(i) and ξt(i,j) based on the model λ = (A, B, Π) and O
M-step: Re-estimate λ = (A, B, Π) based on γt(i) and ξt(i,j)
UNTIL CONVERGENCE
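One full E-step plus M-step for a single observed sequence can be sketched as follows. All parameter values are assumed, illustrative starting points, not from the slides; also, the emission update here accumulates γ over all T steps (a common variant), whereas the slide's sums run to Tk-1:

```python
# One Baum-Welch iteration: forward/backward, then gamma and xi (E-step),
# then re-estimation of pi, A, B (M-step).

def forward(O, pi, A, B):
    N = len(pi)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    return alpha

def backward(O, pi, A, B):
    N = len(pi)
    beta = [[1.0] * N]
    for t in range(len(O) - 2, -1, -1):
        beta.insert(0, [sum(A[i][j] * B[j][O[t + 1]] * beta[0][j] for j in range(N))
                        for i in range(N)])
    return beta

def baum_welch_step(O, pi, A, B, M):
    N, T = len(pi), len(O)
    alpha, beta = forward(O, pi, A, B), backward(O, pi, A, B)
    # E-step: gamma_t(i) and xi_t(i,j)
    gamma, xi = [], []
    for t in range(T):
        num = [alpha[t][i] * beta[t][i] for i in range(N)]
        z = sum(num)
        gamma.append([x / z for x in num])
    for t in range(T - 1):
        raw = [[alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
                for j in range(N)] for i in range(N)]
        z = sum(map(sum, raw))
        xi.append([[x / z for x in row] for row in raw])
    # M-step: re-estimate pi, A, B
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if O[t] == m) /
              sum(gamma[t][j] for t in range(T)) for m in range(M)]
             for j in range(N)]
    return new_pi, new_A, new_B

pi0 = [0.5, 0.2, 0.3]
A0 = [[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
B0 = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
O = [0, 0, 2, 2, 1, 1]
new_pi, new_A, new_B = baum_welch_step(O, pi0, A0, B0, M=3)
print([round(x, 3) for x in new_pi])
```

As an EM step, each such iteration cannot decrease the likelihood P(O | λ).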
For more discussion see: http://www.robots.ox.ac.uk/~vgg/rg/slides/hmm.pdf
See also: http://www.digplanet.com/wiki/Baum%E2%80%93Welch_algorithm
Generalization of HMM: Continuous Observations
The observations generated at each time step are vectors
consisting of k numbers; a multivariate Gaussian with k
dimensions is associated with each state j, defining the
probability that the k-dimensional vector v is generated when
being in state j:
P(Ot | qt=Sj, λ) ~ N(μj, Σj)
[Diagram: hidden state sequence generating an observed vector
sequence; λ = (A, (μj, Σj) j=1,…,N, Π)]
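A minimal sketch of such a Gaussian emission density: for simplicity a diagonal covariance per state is assumed here (the slide's Σj may be a full matrix), and all parameter values are illustrative, not from the source:

```python
# Emission density p(O_t = o | q_t = S_j) for a state with mean vector mu and a
# diagonal covariance given by the variance vector var: a product of 1-D normals.
import math

def gaussian_emission(o, mu, var):
    p = 1.0
    for x, m, v in zip(o, mu, var):
        p *= math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
    return p

# Illustrative parameters for one state (k = 2 dimensions):
mu_j = [0.0, 1.0]
var_j = [1.0, 0.5]
print(gaussian_emission([0.0, 1.0], mu_j, var_j))   # density at the mean
```

In the forward/backward recursions, this density simply replaces the discrete emission probability bj(Ot).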
Generalization: HMM with Inputs
 Input-dependent observations:
P(Ot | qt=Sj, x^t, λ) ~ N(gj(x^t | θj), σj^2)
 Input-dependent transitions (Meila and Jordan, 1996;
Bengio and Frasconi, 1996):
P(qt+1=Sj | qt=Si, x^t)
 Time-delay input:
x^t = f(Ot-τ, ..., Ot-1)