Note to other teachers and users of these slides. Andrew would be delighted if you found
this source material useful in giving your own lectures. Feel free to use these slides
verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If
you make use of a significant portion of these slides in your own lecture, please include this
message, or the following link to the source repository of Andrew’s tutorials:
http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Hidden Markov
Models
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Copyright © Andrew W. Moore
Slide 1
A Markov System
Has N states, called s1, s2 .. sN
s2
s1
There are discrete timesteps,
t=0, t=1, …
s3
N=3
t=0
Copyright © Andrew W. Moore
Slide 2
A Markov System
Has N states, called s1, s2 .. sN
s2
Current State
s1
s3
There are discrete timesteps,
t=0, t=1, …
On the t’th timestep the system is
in exactly one of the available
states. Call it qt
Note: qt ∈ {s1, s2 .. sN }
N=3
t=0
qt=q0=s3
Copyright © Andrew W. Moore
Slide 3
A Markov System
Has N states, called s1, s2 .. sN
Current State
s2
s1
N=3
t=1
s3
There are discrete timesteps,
t=0, t=1, …
On the t’th timestep the system is
in exactly one of the available
states. Call it qt
Note: qt ∈ {s1, s2 .. sN }
Between each timestep, the next
state is chosen randomly.
qt=q1=s2
Copyright © Andrew W. Moore
Slide 4
P(qt+1=s1|qt=s2) = 1/2
P(qt+1=s2|qt=s2) = 1/2
A Markov System
P(qt+1=s3|qt=s2) = 0
Has N states, called s1, s2 .. sN
P(qt+1=s1|qt=s1) = 0
P(qt+1=s2|qt=s1) = 0
s2
P(qt+1=s3|qt=s1) = 1
s1
s3
N=3
t=1
qt=q1=s2
Copyright © Andrew W. Moore
P(qt+1=s1|qt=s3) = 1/3
P(qt+1=s2|qt=s3) = 2/3
P(qt+1=s3|qt=s3) = 0
There are discrete timesteps,
t=0, t=1, …
On the t’th timestep the system is
in exactly one of the available
states. Call it qt
Note: qt ∈ {s1, s2 .. sN }
Between each timestep, the next
state is chosen randomly.
The current state determines the
probability distribution for the
next state.
Slide 5
P(qt+1=s1|qt=s2) = 1/2
P(qt+1=s2|qt=s2) = 1/2
A Markov System
P(qt+1=s3|qt=s2) = 0
Has N states, called s1, s2 .. sN
P(qt+1=s1|qt=s1) = 0
s2
P(qt+1=s2|qt=s1) = 0
1/2
P(qt+1=s3|qt=s1) = 1
2/3
1/2
s1
N=3
t=1
qt=q1=s2
1/3
s3
On the t’th timestep the system is
in exactly one of the available
states. Call it qt
Note: qt ∈ {s1, s2 .. sN }
1
P(qt+1=s1|qt=s3) = 1/3
P(qt+1=s2|qt=s3) = 2/3
P(qt+1=s3|qt=s3) = 0
Often notated with arcs
between states
Copyright © Andrew W. Moore
There are discrete timesteps,
t=0, t=1, …
Between each timestep, the next
state is chosen randomly.
The current state determines the
probability distribution for the
next state.
Slide 6
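The arc-and-table notation above maps directly onto a transition matrix. Below is a minimal sketch (Python, my own illustration rather than anything from the slides) of this 3-state system, using the probabilities listed on the slide, with a helper that samples the next state from the current one.

```python
import random

# Transition probabilities A[i][j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}),
# taken from the slide (states s1, s2, s3 are indexed 0, 1, 2 here).
A = [
    [0.0, 0.0, 1.0],   # from s1: always go to s3
    [0.5, 0.5, 0.0],   # from s2: s1 or s2, each with prob 1/2
    [1/3, 2/3, 0.0],   # from s3: s1 with prob 1/3, s2 with prob 2/3
]

def next_state(current):
    """Sample q_{t+1} given q_t = current (0-based state index)."""
    return random.choices(range(3), weights=A[current])[0]

# Example run: start in s3 at t=0, as on the earlier slide.
q = 2
for t in range(5):
    print(f"t={t}: s{q+1}")
    q = next_state(q)
```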
P(qt+1=s1|qt=s2) = 1/2
P(qt+1=s2|qt=s2) = 1/2
qt+1 is conditionally independent
of { qt-1, qt-2, … q1, q0 } given qt.
P(qt+1=s3|qt=s2) = 0
P(qt+1=s1|qt=s1) = 0
s2
P(qt+1=s2|qt=s1) = 0
1/2
P(qt+1=s3|qt=s1) = 1
N=3
t=1
qt=q1=s2
Copyright © Andrew W. Moore
In other words:
P(qt+1 = sj |qt = si ) =
2/3
1/2
s1
Markov Property
P(qt+1 = sj |qt = si ,any earlier history)
1/3
s3
1
P(qt+1=s1|qt=s3) = 1/3
Question: what would be the best
Bayes Net structure to represent
the Joint Distribution of ( q0, q1, q2, q3, q4 )?
P(qt+1=s2|qt=s3) = 2/3
P(qt+1=s3|qt=s3) = 0
Slide 7
Answer:

   q0 → q1 → q2 → q3 → q4      (a chain)

P(qt+1=s1|qt=s2) = 1/2
P(qt+1=s2|qt=s2) = 1/2
P(qt+1=s3|qt=s2) = 0
P(qt+1=s1|qt=s1) = 0
P(qt+1=s2|qt=s1) = 0
P(qt+1=s3|qt=s1) = 1
P(qt+1=s1|qt=s3) = 1/3
P(qt+1=s2|qt=s3) = 2/3
P(qt+1=s3|qt=s3) = 0

N=3
t=1
qt=q1=s2

(State diagram: s1, s2, s3 with arcs labeled by the probabilities above.)

Markov Property
qt+1 is conditionally independent of { qt-1, qt-2, … q1, q0 } given qt.
In other words:
P(qt+1 = sj | qt = si ) = P(qt+1 = sj | qt = si , any earlier history)
Question: what would be the best Bayes Net structure to represent the
Joint Distribution of ( q0, q1, q2, q3, q4 )?
Copyright © Andrew W. Moore
Slide 8
Answer:

   q0 → q1 → q2 → q3 → q4      (a chain)

P(qt+1=s1|qt=s2) = 1/2
P(qt+1=s2|qt=s2) = 1/2
P(qt+1=s3|qt=s2) = 0
P(qt+1=s1|qt=s1) = 0
P(qt+1=s2|qt=s1) = 0
P(qt+1=s3|qt=s1) = 1
P(qt+1=s1|qt=s3) = 1/3
P(qt+1=s2|qt=s3) = 2/3
P(qt+1=s3|qt=s3) = 0

N=3
t=1
qt=q1=s2

Markov Property
qt+1 is conditionally independent of { qt-1, qt-2, … q1, q0 } given qt.
In other words:
P(qt+1 = sj | qt = si ) = P(qt+1 = sj | qt = si , any earlier history)
Question: what would be the best Bayes Net structure to represent the
Joint Distribution of ( q0, q1, q2, q3, q4 )?

Each of these probability tables is identical:

  i \ j | P(qt+1=s1|qt=si)  P(qt+1=s2|qt=si)  …  P(qt+1=sj|qt=si)  …  P(qt+1=sN|qt=si)
    1   |       a11               a12         …        a1j        …        a1N
    2   |       a21               a22         …        a2j        …        a2N
    3   |       a31               a32         …        a3j        …        a3N
    :   |        :                 :                    :                   :
    i   |       ai1               ai2         …        aij        …        aiN
    :   |        :                 :                    :                   :
    N   |       aN1               aN2         …        aNj        …        aNN

Notation:   aij ≡ P(qt+1 = sj | qt = si)

Copyright © Andrew W. Moore
Slide 9
A Blind Robot
A human and a
robot wander
around randomly
on a grid…
R
H
STATE q =
Copyright © Andrew W. Moore
Location of Robot,
Location of Human
Slide 10
Dynamics of System
q0 =
R
H
Each timestep the
human moves
randomly to an
adjacent cell. And
Robot also moves
randomly to an
adjacent cell.
Typical Questions:
• “What’s the expected time until the human is
crushed like a bug?”
• “What’s the probability that the robot will hit the
left wall before it hits the human?”
• “What’s the probability Robot crushes human
on next time step?”
Copyright © Andrew W. Moore
Slide 11
Example Question
“It’s currently time t, and human remains uncrushed. What’s the
probability of crushing occurring at time t + 1 ?”
If robot is blind:
   We can compute this in advance.   (We'll do this first.)
If robot is omniscient (i.e. if robot knows the state at time t):
   It can compute directly.   (Too easy. We won't do this.)
If robot has some sensors, but incomplete state information:
   Hidden Markov Models are applicable!   (Main body of lecture.)
Copyright © Andrew W. Moore
Slide 12
What is P(qt =s)? slow, stupid answer
Step 1: Work out how to compute P(Q) for any path Q
= q1 q2 q3 .. qt
Given we know the start state q1 (i.e. P(q1)=1)
P(q1 q2 .. qt) = P(q1 q2 .. qt-1) P(qt|q1 q2 .. qt-1)
WHY?
= P(q1 q2 .. qt-1) P(qt|qt-1)
= P(q2|q1)P(q3|q2)…P(qt|qt-1)
Step 2: Use this knowledge to get P(qt =s)
P(qt = s) = Σ P(Q), summed over Q ∈ {Paths of length t that end in s}

Copyright © Andrew W. Moore
Slide 13
What is P(qt =s) ? Clever answer
• For each state si, define
pt(i) = Prob. state is si at time t
= P(qt = si)
• Easy to do inductive definition

∀i   p0(i) = …
∀j   pt+1(j) = P(qt+1 = sj) = …
Copyright © Andrew W. Moore
Slide 14
What is P(qt =s) ? Clever answer
• For each state si, define
pt(i) = Prob. state is si at time t
= P(qt = si)
• Easy to do inductive definition
∀i   p0(i) = 1 if si is the start state, 0 otherwise
∀j   pt+1(j) = P(qt+1 = sj) = …
Copyright © Andrew W. Moore
Slide 15
What is P(qt =s) ? Clever answer
• For each state si, define
pt(i) = Prob. state is si at time t
= P(qt = si)
• Easy to do inductive definition
∀i   p0(i) = 1 if si is the start state, 0 otherwise
∀j   pt+1(j) = P(qt+1 = sj)
             = Σ_{i=1}^N P(qt+1 = sj ∧ qt = si) = …

Copyright © Andrew W. Moore
Slide 16
What is P(qt =s) ? Clever answer
• For each state si, define
pt(i) = Prob. state is si at time t
= P(qt = si)
• Easy to do inductive definition
∀i   p0(i) = 1 if si is the start state, 0 otherwise
∀j   pt+1(j) = P(qt+1 = sj)
             = Σ_{i=1}^N P(qt+1 = sj ∧ qt = si)
             = Σ_{i=1}^N P(qt+1 = sj | qt = si) P(qt = si)
             = Σ_{i=1}^N aij pt(i)

Remember,   aij ≡ P(qt+1 = sj | qt = si)

Copyright © Andrew W. Moore
Slide 17
What is P(qt =s) ? Clever answer
• For each state si, define
pt(i) = Prob. state is si at time t
= P(qt = si)
• Easy to do inductive definition
∀i   p0(i) = 1 if si is the start state, 0 otherwise
∀j   pt+1(j) = P(qt+1 = sj)
             = Σ_{i=1}^N P(qt+1 = sj ∧ qt = si)
             = Σ_{i=1}^N P(qt+1 = sj | qt = si) P(qt = si)
             = Σ_{i=1}^N aij pt(i)

• Computation is simple.
• Just fill in this table in this order:

     t      | pt(1)  pt(2)  …  pt(N)
     0      |
     1      |
     :      |
     tfinal |

Copyright © Andrew W. Moore
Slide 18
What is P(qt =s) ? Clever answer
• For each state si, define
pt(i) = Prob. state is si at time t
= P(qt = si)
• Easy to do inductive definition
∀i   p0(i) = 1 if si is the start state, 0 otherwise
∀j   pt+1(j) = P(qt+1 = sj)
             = Σ_{i=1}^N P(qt+1 = sj ∧ qt = si)
             = Σ_{i=1}^N P(qt+1 = sj | qt = si) P(qt = si)
             = Σ_{i=1}^N aij pt(i)

• Cost of computing pt(i) for all states si is now O(t N²)
• The stupid way was O(N^t)
• This was a simple example
• It was meant to warm you up to this trick, called Dynamic Programming,
  because HMMs do many tricks like this.

Copyright © Andrew W. Moore
Slide 19
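As a concrete rendering of the dynamic-programming trick just described, here is a minimal sketch (Python, my own illustration, not from the slides) that fills the pt(i) table row by row via pt+1(j) = Σi aij pt(i). It assumes the 3-state transition matrix from the earlier Markov-system slides and start state s3.

```python
# Dynamic programming for p_t(i) = P(q_t = s_i).
# A is the transition matrix of the earlier 3-state example (an assumption of
# this sketch); the start state is s3, so p_0 = [0, 0, 1].
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [1/3, 2/3, 0.0],
]
N = 3

def state_distribution(t_final, p0):
    """Fill the table p_t(i) for t = 0 .. t_final; cost O(t_final * N^2)."""
    p = list(p0)
    table = [p]
    for _ in range(t_final):
        # p_{t+1}(j) = sum_i a_ij * p_t(i)
        p = [sum(A[i][j] * p[i] for i in range(N)) for j in range(N)]
        table.append(p)
    return table

for t, row in enumerate(state_distribution(4, [0.0, 0.0, 1.0])):
    print(t, [round(x, 3) for x in row])
```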
Hidden State
“It’s currently time t, and human remains uncrushed. What’s the
probability of crushing occurring at time t + 1 ?”
If robot is blind:
   We can compute this in advance.   (We'll do this first.)
If robot is omniscient (i.e. if robot knows the state at time t):
   It can compute directly.   (Too easy. We won't do this.)
If robot has some sensors, but incomplete state information:
   Hidden Markov Models are applicable!   (Main body of lecture.)
Copyright © Andrew W. Moore
Slide 20
Hidden State
• The previous example tried to estimate P(qt = si)
unconditionally (using no observed evidence).
• Suppose we can observe something that’s affected
by the true state.
• Example: Proximity sensors. (tell us the contents of
the 8 adjacent squares)
R0
W
W
W

H
H
True state qt
Copyright © Andrew W. Moore
W
denotes
“WALL”
What the robot sees:
Observation Ot
Slide 21
Noisy Hidden State
• Example: Noisy Proximity sensors. (unreliably tell us
the contents of the 8 adjacent squares)
R0
W
W
W
W
denotes
“WALL”

H
H
True state qt
Uncorrupted Observation
W
W

H
W
H
What the robot sees:
Observation Ot
Copyright © Andrew W. Moore
Slide 22
Noisy Hidden State
• Example: Noisy Proximity sensors. (unreliably tell us
the contents of the 8 adjacent squares)
R0
2
W
W
W
W
denotes
“WALL”

H
H
True state qt
Ot is noisily determined depending on
the current state.
Assume that Ot is conditionally
independent of {qt-1, qt-2, … q1, q0 ,Ot-1,
Ot-2, … O1, O0 } given qt.
In other words:
Uncorrupted Observation
W
W

H
W
H
What the robot sees:
Observation Ot
P(Ot = X |qt = si ) =
P(Ot = X |qt = si ,any earlier history)
Copyright © Andrew W. Moore
Slide 23
Noisy Hidden State
• Example: Noisy Proximity sensors. (unreliably tell us
the contents of the 8 adjacent squares)
R0
2
W
W
W
W
denotes
“WALL”

H
H
True state qt
Ot is noisily determined depending on
the current state.
Assume that Ot is conditionally
independent of {qt-1, qt-2, … q1, q0 ,Ot-1,
Ot-2, … O1, O0 } given qt.
In other words:
Uncorrupted Observation
W
W

H
W
H
What the robot sees:
Observation Ot
Question: what’d be the best Bayes Net
structure to represent the Joint Distribution
P(Ot = X |qt = si ,any earlier history) of (q0, q1, q2,q3,q4 ,O0, O1, O2,O3,O4 )?
P(Ot = X |qt = si ) =
Copyright © Andrew W. Moore
Slide 24
Noisy Hidden State
Answer:

   A chain q0 → q1 → q2 → q3 → q4, with each observation Ot a child of qt:
   q0 → O0,  q1 → O1,  q2 → O2,  q3 → O3,  q4 → O4.

• Example: Noisy Proximity sensors. (unreliably tell us the contents of the
  8 adjacent squares)

(Figure: the true state qt, the uncorrupted observation, and what the robot
sees: observation Ot. W denotes "WALL".)

Ot is noisily determined depending on the current state.
Assume that Ot is conditionally independent of {qt-1, qt-2, … q1, q0, Ot-1,
Ot-2, … O1, O0} given qt.
In other words:
P(Ot = X | qt = si) = P(Ot = X | qt = si , any earlier history)
Question: what'd be the best Bayes Net structure to represent the Joint
Distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?
Copyright © Andrew W. Moore
Slide 25
Noisy Hidden State

Answer:

   A chain q0 → q1 → q2 → q3 → q4, with each observation Ot a child of qt.

Notation:   bi(k) ≡ P(Ot = k | qt = si)

  i \ k | P(Ot=1|qt=si)  P(Ot=2|qt=si)  …  P(Ot=k|qt=si)  …  P(Ot=M|qt=si)
    1   |    b1(1)           b1(2)      …      b1(k)      …     b1(M)
    2   |    b2(1)           b2(2)      …      b2(k)      …     b2(M)
    3   |    b3(1)           b3(2)      …      b3(k)      …     b3(M)
    :   |      :               :                 :                :
    i   |    bi(1)           bi(2)      …      bi(k)      …     bi(M)
    :   |      :               :                 :                :
    N   |    bN(1)           bN(2)      …      bN(k)      …     bN(M)

• Example: Noisy Proximity sensors. (unreliably tell us the contents of the
  8 adjacent squares)

(Figure: the true state qt, the uncorrupted observation, and what the robot
sees: observation Ot. W denotes "WALL".)

Ot is noisily determined depending on the current state.
Assume that Ot is conditionally independent of {qt-1, qt-2, … q1, q0, Ot-1,
Ot-2, … O1, O0} given qt.
In other words:
P(Ot = X | qt = si) = P(Ot = X | qt = si , any earlier history)
Question: what'd be the best Bayes Net structure to represent the Joint
Distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?
Copyright © Andrew W. Moore
Slide 26
Hidden Markov Models
Our robot with noisy sensors is a good example of an HMM
• Question 1: State Estimation
What is P(qT=Si | O1O2…OT)
It will turn out that a new cute D.P. trick will get this for us.
• Question 2: Most Probable Path
Given O1O2…OT , what is the most probable path that I took?
And what is that probability?
Yet another famous D.P. trick, the VITERBI algorithm, gets
this.
• Question 3: Learning HMMs:
Given O1O2…OT , what is the maximum likelihood HMM that
could have produced this string of observations?
Very very useful. Uses the E.M. Algorithm
Copyright © Andrew W. Moore
Slide 27
Are H.M.M.s Useful?
You bet !!
• Robot planning + sensing when there’s uncertainty
(e.g. Reid Simmons / Sebastian Thrun / Sven
Koenig)
• Speech Recognition/Understanding
Phones → Words, Signal → phones
• Human Genome Project
Complicated stuff your lecturer knows nothing
about.
• Consumer decision modeling
• Economics & Finance.
Plus at least 5 other things I haven’t thought of.
Copyright © Andrew W. Moore
Slide 28
Some Famous HMM Tasks
Question 1: State Estimation
What is P(qT=Si | O1O2…Ot)
Copyright © Andrew W. Moore
Slide 29
Some Famous HMM Tasks
Question 1: State Estimation
What is P(qT=Si | O1O2…Ot)
Copyright © Andrew W. Moore
Slide 30
Some Famous HMM Tasks
Question 1: State Estimation
What is P(qT=Si | O1O2…Ot)
Copyright © Andrew W. Moore
Slide 31
Some Famous HMM Tasks
Question 1: State Estimation
What is P(qT=Si | O1O2…Ot)
Question 2: Most Probable Path
Given O1O2…OT , what is the
most probable path that I
took?
Copyright © Andrew W. Moore
Slide 32
Some Famous HMM Tasks
Question 1: State Estimation
What is P(qT=Si | O1O2…Ot)
Question 2: Most Probable Path
Given O1O2…OT , what is the
most probable path that I
took?
Copyright © Andrew W. Moore
Slide 33
Some Famous HMM Tasks
Question 1: State Estimation
What is P(qT=Si | O1O2…Ot)
Question 2: Most Probable Path
   "Woke up at 8.35, Got on Bus at 9.46, Sat in lecture 10.05-11.22…"
Given O1O2…OT , what is the
most probable path that I
took?
Copyright © Andrew W. Moore
Slide 34
Some Famous HMM Tasks
Question 1: State Estimation
What is P(qT=Si | O1O2…Ot)
Question 2: Most Probable Path
Given O1O2…OT , what is the
most probable path that I
took?
Question 3: Learning HMMs:
Given O1O2…OT , what is the
maximum likelihood HMM
that could have produced
this string of observations?
Copyright © Andrew W. Moore
Slide 35
Some Famous HMM Tasks
Question 1: State Estimation
What is P(qT=Si | O1O2…OT)
Question 2: Most Probable Path
Given O1O2…OT , what is the
most probable path that I
took?
Question 3: Learning HMMs:
Given O1O2…OT , what is the
maximum likelihood HMM
that could have produced
this string of observations?
Copyright © Andrew W. Moore
Slide 36
Some Famous HMM Tasks

Question 1: State Estimation
What is P(qT=Si | O1O2…OT)

Question 2: Most Probable Path
Given O1O2…OT , what is the most probable path that I took?

Question 3: Learning HMMs:
Given O1O2…OT , what is the maximum likelihood HMM that could have
produced this string of observations?

(Figure: a three-state HMM over states Bus, Eat, walk, with transition
probabilities aAA, aAB, aBA, aBB, aBC, aCB, aCC and observation
probabilities bA(Ot-1), bB(Ot), bC(Ot+1) for observations Ot-1, Ot, Ot+1.)

Copyright © Andrew W. Moore
Slide 37
Basic Operations in HMMs
For an observation sequence O = O1…OT, the three basic HMM
operations are:
Problem                                           Algorithm           Complexity
Evaluation:  Calculating P(qt=Si | O1O2…Ot)       Forward-Backward    O(TN²)
Inference:   Computing Q* = argmaxQ P(Q|O)        Viterbi Decoding    O(TN²)
Learning:    Computing λ* = argmaxλ P(O|λ)        Baum-Welch (EM)     O(TN²)

T = # timesteps, N = # states
Copyright © Andrew W. Moore
Slide 38
HMM Notation
(from Rabiner’s Survey)
The states are labeled S1 S2 .. SN
*L. R. Rabiner, "A Tutorial on
Hidden Markov Models and
Selected Applications in Speech
Recognition," Proc. of the IEEE,
Vol.77, No.2, pp.257--286, 1989.
Available from
For a particular trial….
Let T be the number of observations.
T is also the number of states passed through.
O = O1 O2 .. OT is the sequence of observations
Q = q1 q2 .. qT is the notation for a path of states
http://ieeexplore.ieee.org/iel5/5/698/00018626.pdf?arnumber=18626
λ = N,M,i,,aij,bi(j)
Copyright © Andrew W. Moore
is the specification of an
HMM
Slide 39
HMM Formal Definition
An HMM, λ, is a 5-tuple consisting of
• N the number of states
• M    the number of possible observations
• {π1, π2, .. πN}   The starting state probabilities
       P(q0 = Si) = πi
       (This is new. In our previous example, the start state was deterministic.)

•   a11  a12  …  a1N
    a21  a22  …  a2N      The state transition probabilities
     :    :        :      P(qt+1=Sj | qt=Si) = aij
    aN1  aN2  …  aNN

•   b1(1)  b1(2)  …  b1(M)
    b2(1)  b2(2)  …  b2(M)    The observation probabilities
      :      :          :     P(Ot=k | qt=Si) = bi(k)
    bN(1)  bN(2)  …  bN(M)

Copyright © Andrew W. Moore
Slide 40
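To make the 5-tuple concrete, here is a minimal sketch (Python; the class and field names are my own, not from the slides or Rabiner) of a container for λ = (N, M, {πi}, {aij}, {bi(k)}), with a simple sanity check that each probability row sums to 1.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HMM:
    """lambda = (N, M, pi, A, B) as in the formal definition on this slide."""
    N: int                  # number of states
    M: int                  # number of possible observation symbols
    pi: List[float]         # pi[i]   = P(q0 = S_i)
    A: List[List[float]]    # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
    B: List[List[float]]    # B[i][k] = P(O_t = k | q_t = S_i)

    def check(self, tol=1e-9):
        """Sanity check: every probability row sums to 1."""
        assert abs(sum(self.pi) - 1) < tol
        assert all(abs(sum(row) - 1) < tol for row in self.A)
        assert all(abs(sum(row) - 1) < tol for row in self.B)
```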
Here’s an HMM
1/3
S1
S2
ZY
1/3
XY
2/3
1/3
Choose one of the output
symbols in each state at
random.
2/3
ZX
1/3
S3
N=3
M=3
π1 = 1/2
π2 = 1/2
π3 = 0
a11 = 0     a12 = 1/3   a13 = 2/3
a21 = 1/3   a22 = 0     a23 = 2/3
a31 = 1/3   a32 = 1/3   a33 = 1/3
b1 (X) = 1/2
b2 (X) = 0
b3 (X) = 1/2
b1 (Y) = 1/2
b2 (Y) = 1/2
b3 (Y) = 0
b1 (Z) = 0
b2 (Z) = 1/2
b3 (Z) = 1/2
Copyright © Andrew W. Moore
Start randomly in state 1 or 2
1/3
Slide 41
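A minimal sketch (Python, my own illustration) of the generation process the next few slides walk through, using the π, aij and bi(·) values of this example HMM; states S1, S2, S3 and symbols X, Y, Z are encoded as indices 0, 1, 2.

```python
import random

# The example HMM from this slide.
pi = [0.5, 0.5, 0.0]
A = [[0.0, 1/3, 2/3],
     [1/3, 0.0, 2/3],
     [1/3, 1/3, 1/3]]
B = [[0.5, 0.5, 0.0],    # S1 emits X or Y
     [0.0, 0.5, 0.5],    # S2 emits Y or Z
     [0.5, 0.0, 0.5]]    # S3 emits X or Z
SYMS = "XYZ"

def generate(T):
    """Sample (q_0..q_{T-1}, O_0..O_{T-1}) as described on the coming slides."""
    q = random.choices(range(3), weights=pi)[0]      # start randomly in S1 or S2
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(random.choices(range(3), weights=B[q])[0])   # emit a symbol
        q = random.choices(range(3), weights=A[q])[0]            # move to next state
    return states, obs

states, obs = generate(3)
print("q:", [f"S{s+1}" for s in states], " O:", [SYMS[o] for o in obs])
```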
Here’s an HMM
1/3
S1
S2
ZY
1/3
XY
2/3
1/3
ZX
Choose one of the output
symbols in each state at
random.
Let’s generate a sequence of
observations:
2/3
1/3
S3
N=3
M=3
π1 = ½
π2 = ½
π3 = 0
a11 = 0    a12 = ⅓    a13 = ⅔
a21 = ⅓    a22 = 0    a23 = ⅔
a31 = ⅓    a32 = ⅓    a33 = ⅓
b1 (X) = ½
b2 (X) = 0
b3 (X) = ½
b1 (Y) = ½
b2 (Y) = ½
b3 (Y) = 0
b1 (Z) = 0
b2 (Z) = ½
b3 (Z) = ½
Copyright © Andrew W. Moore
Start randomly in state 1 or 2
50-50 choice
between S1 and
S2
1/3
q0 =
__
O0=
__
q1 =
__
O1 =
__
q2 =
__
O2 =
__
Slide 42
Here’s an HMM
1/3
S1
S2
ZY
1/3
XY
2/3
1/3
ZX
Choose one of the output
symbols in each state at
random.
Let’s generate a sequence of
observations:
2/3
1/3
S3
N=3
M=3
π1 = ½
π2 = ½
π3 = 0
a11 = 0    a12 = ⅓    a13 = ⅔
a21 = ⅓    a22 = 0    a23 = ⅔
a31 = ⅓    a32 = ⅓    a33 = ⅓
b1 (X) = ½
b2 (X) = 0
b3 (X) = ½
b1 (Y) = ½
b2 (Y) = ½
b3 (Y) = 0
b1 (Z) = 0
b2 (Z) = ½
b3 (Z) = ½
Copyright © Andrew W. Moore
Start randomly in state 1 or 2
1/3
50-50 choice
between X and
Y
q0 =
S1
O0=
__
q1 =
__
O1 =
__
q2 =
__
O2 =
__
Slide 43
Here’s an HMM
1/3
S1
S2
ZY
1/3
XY
2/3
1/3
ZX
Choose one of the output
symbols in each state at
random.
Let’s generate a sequence of
observations:
2/3
1/3
S3
N=3
M=3
π1 = ½
π2 = ½
π3 = 0
a11 = 0    a12 = ⅓    a13 = ⅔
a21 = ⅓    a22 = 0    a23 = ⅔
a31 = ⅓    a32 = ⅓    a33 = ⅓
b1 (X) = ½
b2 (X) = 0
b3 (X) = ½
b1 (Y) = ½
b2 (Y) = ½
b3 (Y) = 0
b1 (Z) = 0
b2 (Z) = ½
b3 (Z) = ½
Copyright © Andrew W. Moore
Start randomly in state 1 or 2
1/3
Goto S3 with
probability 2/3 or
S2 with prob. 1/3
q0 =
S1
O0=
X
q1 =
__
O1 =
__
q2 =
__
O2 =
__
Slide 44
Here’s an HMM
1/3
S1
S2
ZY
1/3
XY
2/3
1/3
ZX
Choose one of the output
symbols in each state at
random.
Let’s generate a sequence of
observations:
2/3
1/3
S3
N=3
M=3
π1 = ½
π2 = ½
π3 = 0
a11 = 0    a12 = ⅓    a13 = ⅔
a21 = ⅓    a22 = 0    a23 = ⅔
a31 = ⅓    a32 = ⅓    a33 = ⅓
b1 (X) = ½
b2 (X) = 0
b3 (X) = ½
b1 (Y) = ½
b2 (Y) = ½
b3 (Y) = 0
b1 (Z) = 0
b2 (Z) = ½
b3 (Z) = ½
Copyright © Andrew W. Moore
Start randomly in state 1 or 2
1/3
50-50 choice
between Z and
X
q0 =
S1
O0=
X
q1 =
S3
O1 =
__
q2 =
__
O2 =
__
Slide 45
Here’s an HMM
1/3
S1
S2
ZY
1/3
XY
2/3
1/3
ZX
Choose one of the output
symbols in each state at
random.
Let’s generate a sequence of
observations:
2/3
1/3
S3
N=3
M=3
π1 = ½
π2 = ½
π3 = 0
a11 = 0    a12 = ⅓    a13 = ⅔
a21 = ⅓    a22 = 0    a23 = ⅔
a31 = ⅓    a32 = ⅓    a33 = ⅓
b1 (X) = ½
b2 (X) = 0
b3 (X) = ½
b1 (Y) = ½
b2 (Y) = ½
b3 (Y) = 0
b1 (Z) = 0
b2 (Z) = ½
b3 (Z) = ½
Copyright © Andrew W. Moore
Start randomly in state 1 or 2
Each of the
three next states
is equally likely
1/3
q0 =
S1
O0=
X
q1 =
S3
O1 =
X
q2 =
__
O2 =
__
Slide 46
Here’s an HMM
1/3
S1
S2
ZY
1/3
XY
2/3
1/3
ZX
Choose one of the output
symbols in each state at
random.
Let’s generate a sequence of
observations:
2/3
1/3
S3
N=3
M=3
π1 = ½
π2 = ½
π3 = 0
a11 = 0    a12 = ⅓    a13 = ⅔
a21 = ⅓    a22 = 0    a23 = ⅔
a31 = ⅓    a32 = ⅓    a33 = ⅓
b1 (X) = ½
b2 (X) = 0
b3 (X) = ½
b1 (Y) = ½
b2 (Y) = ½
b3 (Y) = 0
b1 (Z) = 0
b2 (Z) = ½
b3 (Z) = ½
Copyright © Andrew W. Moore
Start randomly in state 1 or 2
1/3
50-50 choice
between Z and
X
q0 =
S1
O0=
X
q1 =
S3
O1 =
X
q2 =
S3
O2 =
__
Slide 47
Here’s an HMM
1/3
S1
S2
ZY
1/3
XY
2/3
1/3
ZX
Choose one of the output
symbols in each state at
random.
Let’s generate a sequence of
observations:
2/3
1/3
S3
N=3
M=3
π1 = ½
π2 = ½
π3 = 0
a11 = 0    a12 = ⅓    a13 = ⅔
a21 = ⅓    a22 = 0    a23 = ⅔
a31 = ⅓    a32 = ⅓    a33 = ⅓
b1 (X) = ½
b2 (X) = 0
b3 (X) = ½
b1 (Y) = ½
b2 (Y) = ½
b3 (Y) = 0
b1 (Z) = 0
b2 (Z) = ½
b3 (Z) = ½
Copyright © Andrew W. Moore
Start randomly in state 1 or 2
1/3
q0 =
S1
O0=
X
q1 =
S3
O1 =
X
q2 =
S3
O2 =
Z
Slide 48
State Estimation
1/3
S1
S2
ZY
1/3
XY
2/3
1/3
ZX
Choose one of the output
symbols in each state at
random.
Let’s generate a sequence of
observations:
2/3
1/3
S3
N=3
M=3
π1 = ½
π2 = ½
π3 = 0
a11 = 0    a12 = ⅓    a13 = ⅔
a21 = ⅓    a22 = 0    a23 = ⅔
a31 = ⅓    a32 = ⅓    a33 = ⅓
b1 (X) = ½
b2 (X) = 0
b3 (X) = ½
b1 (Y) = ½
b2 (Y) = ½
b3 (Y) = 0
b1 (Z) = 0
b2 (Z) = ½
b3 (Z) = ½
Copyright © Andrew W. Moore
Start randomly in state 1 or 2
1/3
This is what the
observer has to
work with…
q0 =
?
O0=
X
q1 =
?
O1 =
X
q2 =
?
O2 =
Z
Slide 49
Prob. of a series of observations
What is P(O) = P(O1 O2 O3) = P(O1 = X ^ O2 = X ^ O3 = Z)?

Slow, stupid way:

   P(O) = Σ_{Q ∈ Paths of length 3} P(O ∧ Q)
        = Σ_{Q ∈ Paths of length 3} P(O | Q) P(Q)

(State diagram of the example HMM: S1 (X,Y), S2 (Z,Y), S3 (Z,X) with the
transition probabilities given earlier.)

How do we compute P(Q) for an arbitrary path Q?
How do we compute P(O|Q) for an arbitrary path Q?
Copyright © Andrew W. Moore
Slide 50
Prob. of a series of observations
What is P(O) = P(O1 O2 O3) = P(O1 = X ^ O2 = X ^ O3 = Z)?

Slow, stupid way:

   P(O) = Σ_{Q ∈ Paths of length 3} P(O ∧ Q)
        = Σ_{Q ∈ Paths of length 3} P(O | Q) P(Q)

How do we compute P(Q) for an arbitrary path Q?
How do we compute P(O|Q) for an arbitrary path Q?

   P(Q) = P(q1,q2,q3)
        = P(q1) P(q2,q3|q1)            (chain rule)
        = P(q1) P(q2|q1) P(q3|q2,q1)   (chain)
        = P(q1) P(q2|q1) P(q3|q2)      (why?)

   Example in the case Q = S1 S3 S3:
   P(Q) = 1/2 * 2/3 * 1/3 = 1/9
Copyright © Andrew W. Moore
Slide 51
Prob. of a series of observations
What is P(O) = P(O1 O2 O3) = P(O1 = X ^ O2 = X ^ O3 = Z)?

Slow, stupid way:

   P(O) = Σ_{Q ∈ Paths of length 3} P(O ∧ Q)
        = Σ_{Q ∈ Paths of length 3} P(O | Q) P(Q)

How do we compute P(Q) for an arbitrary path Q?
How do we compute P(O|Q) for an arbitrary path Q?

   P(O|Q) = P(O1 O2 O3 | q1 q2 q3)
          = P(O1 | q1) P(O2 | q2) P(O3 | q3)   (why?)

   Example in the case Q = S1 S3 S3:
   P(O|Q) = P(X| S1) P(X| S3) P(Z| S3) = 1/2 * 1/2 * 1/2 = 1/8
Copyright © Andrew W. Moore
Slide 52
Prob. of a series of observations
What is P(O) = P(O1 O2 O3) = P(O1 = X ^ O2 = X ^ O3 = Z)?

Slow, stupid way:

   P(O) = Σ_{Q ∈ Paths of length 3} P(O ∧ Q)
        = Σ_{Q ∈ Paths of length 3} P(O | Q) P(Q)

How do we compute P(Q) for an arbitrary path Q?
How do we compute P(O|Q) for an arbitrary path Q?

   P(O|Q) = P(O1 O2 O3 | q1 q2 q3)
          = P(O1 | q1) P(O2 | q2) P(O3 | q3)   (why?)

   Example in the case Q = S1 S3 S3:
   P(O|Q) = P(X| S1) P(X| S3) P(Z| S3) = 1/2 * 1/2 * 1/2 = 1/8

So let's be smarter…
Copyright © Andrew W. Moore
Slide 53
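The "slow, stupid way" can be written out directly. This is a minimal sketch (Python, my own illustration) that enumerates every length-3 path of the example HMM and sums P(O|Q)P(Q) for O = X X Z.

```python
from itertools import product

# Example HMM from the earlier slides (states S1,S2,S3 -> 0,1,2; symbols X,Y,Z -> 0,1,2).
pi = [0.5, 0.5, 0.0]
A = [[0.0, 1/3, 2/3],
     [1/3, 0.0, 2/3],
     [1/3, 1/3, 1/3]]
B = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]

O = [0, 0, 2]                                        # the observed sequence X X Z

total = 0.0
for Q in product(range(3), repeat=len(O)):           # all N^T paths
    p_Q = pi[Q[0]]
    for t in range(1, len(Q)):
        p_Q *= A[Q[t-1]][Q[t]]                       # P(Q) by the Markov property
    p_O_given_Q = 1.0
    for t, q in enumerate(Q):
        p_O_given_Q *= B[q][O[t]]                    # P(O|Q) = product of b_q(O_t)
    total += p_O_given_Q * p_Q

print(total)   # about 0.0278 (= 2/72), matching the alphas on the coming slides
```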
The Prob. of a given series of
observations, non-exponential-cost-style
Given observations O1 O2 … OT
Define
αt(i) = P(O1 O2 … Ot ∧ qt = Si | λ)
where 1 ≤ t ≤ T
αt(i) = Probability that, in a random trial,
• We’d have seen the first t observations
• We’d have ended up in Si as the t’th state visited.
In our example, what is α2(3) ?
Copyright © Andrew W. Moore
Slide 54
αt(i): easy to define recursively
αt(i) = P(O1 O2 … Ot ∧ qt = Si | λ)    (αt(i) can be defined stupidly by considering all paths of length “t”. How?)

   α1(i) = P(O1 ∧ q1 = Si)
         = P(q1 = Si) P(O1 | q1 = Si)
         = what?

   αt+1(j) = P(O1 O2 … Ot Ot+1 ∧ qt+1 = Sj)
           = …
Copyright © Andrew W. Moore
Slide 55
αt(i): easy to define recursively
αt(i) = P(O1 O2 … Ot ∧ qt = Si | λ)    (αt(i) can be defined stupidly by considering all paths of length “t”. How?)

   α1(i) = P(O1 ∧ q1 = Si)
         = P(q1 = Si) P(O1 | q1 = Si)
         = what?

   αt+1(j) = P(O1 O2 … Ot Ot+1 ∧ qt+1 = Sj)
           = Σ_{i=1}^N P(O1 O2 … Ot ∧ qt = Si ∧ Ot+1 ∧ qt+1 = Sj)
           = Σ_{i=1}^N P(Ot+1 , qt+1 = Sj | O1 O2 … Ot ∧ qt = Si) P(O1 O2 … Ot ∧ qt = Si)
           = Σ_{i=1}^N P(Ot+1 , qt+1 = Sj | qt = Si) αt(i)
           = Σ_{i=1}^N P(qt+1 = Sj | qt = Si) P(Ot+1 | qt+1 = Sj) αt(i)
           = Σ_{i=1}^N aij bj(Ot+1) αt(i)
Copyright © Andrew W. Moore
Slide 56
in our example

αt(i) = P(O1 O2 .. Ot ∧ qt = Si | λ)
α1(i) = bi(O1) πi
αt+1(j) = Σi aij bj(Ot+1) αt(i)

(State diagram of the example HMM, as before.)

WE SAW O1 O2 O3 = X X Z

α1(1) = 1/4    α2(1) = 0       α3(1) = 0
α1(2) = 0      α2(2) = 0       α3(2) = 1/72
α1(3) = 0      α2(3) = 1/12    α3(3) = 1/72

Copyright © Andrew W. Moore
Slide 57
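A minimal sketch (Python, my own illustration) of the α recursion just shown, reproducing the numbers on this slide for O = X X Z.

```python
# Forward (alpha) recursion for the example HMM; same encoding as before:
# states S1,S2,S3 -> 0,1,2 and symbols X,Y,Z -> 0,1,2.
pi = [0.5, 0.5, 0.0]
A = [[0.0, 1/3, 2/3],
     [1/3, 0.0, 2/3],
     [1/3, 1/3, 1/3]]
B = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
N = 3

def forward(O):
    """alpha[t][i] = P(O_1..O_{t+1} ^ q_{t+1} = S_{i+1}); t is 0-based here."""
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]           # alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, len(O)):
        prev = alpha[-1]
        # alpha_{t+1}(j) = sum_i a_ij b_j(O_{t+1}) alpha_t(i)
        alpha.append([B[j][O[t]] * sum(A[i][j] * prev[i] for i in range(N))
                      for j in range(N)])
    return alpha

for t, row in enumerate(forward([0, 0, 2]), start=1):          # O = X X Z
    print(f"alpha_{t}:", row)   # expect [1/4,0,0], [0,0,1/12], [0,1/72,1/72]
```

Summing the final row gives P(O1 O2 O3), and normalizing a row gives P(qt=Si | O1O2…Ot), exactly as the next slides note.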
Easy Question
We can cheaply compute
αt(i)=P(O1O2…Ot ∧ qt=Si)
(How) can we cheaply compute
P(O1O2…Ot) ?
(How) can we cheaply compute
P(qt=Si|O1O2…Ot)
Copyright © Andrew W. Moore
Slide 58
Easy Question
We can cheaply compute
αt(i)=P(O1O2…Ot ∧ qt=Si)

(How) can we cheaply compute
   P(O1O2…Ot) ?
   Answer:  Σ_{i=1}^N αt(i)

(How) can we cheaply compute
   P(qt=Si|O1O2…Ot) ?
   Answer:  αt(i) / Σ_{j=1}^N αt(j)

Copyright © Andrew W. Moore
Slide 59
Most probable path given observations
What is the most probable path given O1O2…OT, i.e.
what is argmax_Q P(Q | O1O2…OT) ?

Slow, stupid answer:

   argmax_Q P(Q | O1O2…OT)
   = argmax_Q  P(O1O2…OT | Q) P(Q) / P(O1O2…OT)
   = argmax_Q  P(O1O2…OT | Q) P(Q)
Copyright © Andrew W. Moore
Slide 60
Efficient MPP computation
We’re going to compute the following variables:
δt(i) = max_{q1 q2 .. qt-1} P(q1 q2 .. qt-1 ∧ qt = Si ∧ O1 .. Ot)
= The Probability of the path of Length t-1 with the
maximum chance of doing all these things:
…OCCURRING
and
…ENDING UP IN STATE Si
and
…PRODUCING OUTPUT O1…Ot
DEFINE:
mppt(i) = that path
So:
δt(i)= Prob(mppt(i))
Copyright © Andrew W. Moore
Slide 61
The Viterbi Algorithm
δt(i)   = max_{q1 q2 … qt-1} P(q1 q2 … qt-1 ∧ qt = Si ∧ O1 O2 .. Ot)
mppt(i) = argmax_{q1 q2 … qt-1} P(q1 q2 … qt-1 ∧ qt = Si ∧ O1 O2 .. Ot)

δ1(i) = (max over one choice) = P(q1 = Si ∧ O1)
      = P(q1 = Si) P(O1 | q1 = Si)
      = πi bi(O1)

Now, suppose we have all the δt(i)'s and mppt(i)'s for all i.
HOW TO GET δt+1(j) and mppt+1(j)?

(Figure: at time qt, each state S1 … SN carries its path mppt(i) with
Prob = δt(i); at time qt+1 we want the best path into Sj.)

Copyright © Andrew W. Moore
Slide 62
The Viterbi Algorithm
time t
S1
:
Si
:
time t+1
Sj
The most prob path with last two
states Si Sj
is
the most prob path to Si ,
followed by transition Si → Sj
Copyright © Andrew W. Moore
Slide 63
The Viterbi Algorithm
time t
time t+1
S1
:
Si
:
Sj
The most prob path with last two
states Si Sj
is
the most prob path to Si ,
followed by transition Si → Sj
What is the prob of that path?
   δt(i) × P(Si → Sj ∧ Ot+1 | λ)
   = δt(i) aij bj(Ot+1)

SO the most probable path to Sj has Si* as its penultimate state,
where i* = argmax_i δt(i) aij bj(Ot+1)
Copyright © Andrew W. Moore
Slide 64
The Viterbi Algorithm
time t
time t+1
S1
:
Si
:
Sj
The most prob path with last two
states Si Sj
is
the most prob path to Si ,
followed by transition Si → Sj
What is the prob of that path?
   δt(i) × P(Si → Sj ∧ Ot+1 | λ)
   = δt(i) aij bj(Ot+1)

SO the most probable path to Sj has Si* as its penultimate state,
where i* = argmax_i δt(i) aij bj(Ot+1)

Summary:
   δt+1(j) = δt(i*) aij bj(Ot+1)
   mppt+1(j) = mppt(i*) Si*        } with i* as defined above
Copyright © Andrew W. Moore
Slide 65
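A minimal sketch (Python, my own illustration) of the Viterbi recursion summarized above: the δ values and argmax back-pointers, followed by a backtrace that recovers the most probable path for O = X X Z.

```python
# Viterbi for the example HMM; same pi, A, B and encoding as in the earlier sketches.
pi = [0.5, 0.5, 0.0]
A = [[0.0, 1/3, 2/3],
     [1/3, 0.0, 2/3],
     [1/3, 1/3, 1/3]]
B = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
N = 3

def viterbi(O):
    delta = [pi[i] * B[i][O[0]] for i in range(N)]              # delta_1(i) = pi_i b_i(O_1)
    back = []                                                    # back[t][j] = best predecessor i*
    for t in range(1, len(O)):
        new_delta, ptr = [], []
        for j in range(N):
            i_star = max(range(N), key=lambda i: delta[i] * A[i][j])
            new_delta.append(delta[i_star] * A[i_star][j] * B[j][O[t]])
            ptr.append(i_star)
        delta, back = new_delta, back + [ptr]
    # Backtrace from the best final state.
    q = max(range(N), key=lambda i: delta[i])
    path = [q]
    for ptr in reversed(back):
        q = ptr[q]
        path.append(q)
    return list(reversed(path)), max(delta)

path, prob = viterbi([0, 0, 2])                                  # O = X X Z
print([f"S{i+1}" for i in path], prob)
```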
What’s Viterbi used for?
Classic Example
Speech recognition:
Signal → words
HMM: observable is the signal,
     hidden state is part of the word
What is the most probable word given this signal?
UTTERLY GROSS SIMPLIFICATION
In practice: many levels of inference; not
one big jump.
Copyright © Andrew W. Moore
Slide 66
HMMs are used and useful
But how do you design an HMM?
Occasionally, (e.g. in our robot example) it is reasonable to
deduce the HMM from first principles.
But usually, especially in Speech or Genetics, it is better to infer
it from large amounts of data. O1 O2 .. OT with a big “T”.
Observations previously
in lecture
O1 O2 .. OT
Observations in the
next bit
O1 O2 .. OT
Copyright © Andrew W. Moore
Slide 67
Inferring an HMM
Remember, we’ve been doing things like
P(O1 O2 .. OT | λ )
That “λ” is the notation for our HMM parameters.
Now
We have some observations and we want to
estimate λ from them.
AS USUAL: We could use
(i)  MAX LIKELIHOOD:   λ = argmaxλ P(O1 .. OT | λ)

(ii) BAYES:   Work out P( λ | O1 .. OT )
     and then take E[λ] or maxλ P( λ | O1 .. OT )
Copyright © Andrew W. Moore
Slide 68
Max likelihood HMM estimation
Define
γt(i) = P(qt = Si | O1O2…OT , λ )
εt(i,j) = P(qt = Si ∧ qt+1 = Sj | O1O2…OT , λ )
γt(i) and εt(i,j) can be computed efficiently ∀ i,j,t
(Details in Rabiner paper)

Σ_{t=1}^{T-1} γt(i)   = Expected number of transitions out of state i during the path

Σ_{t=1}^{T-1} εt(i,j) = Expected number of transitions from state i to state j during the path

Copyright © Andrew W. Moore
Slide 69

HMM estimation

γt(i)   = P(qt = Si | O1O2..OT , λ)
εt(i,j) = P(qt = Si ∧ qt+1 = Sj | O1O2..OT , λ)

Σ_{t=1}^{T-1} γt(i)   = expected number of transitions out of state i during path
Σ_{t=1}^{T-1} εt(i,j) = expected number of transitions out of i and into j during path

Notice
   [ Σ_{t=1}^{T-1} εt(i,j) ] / [ Σ_{t=1}^{T-1} γt(i) ]
   = (expected frequency i → j) / (expected frequency i)
   = Estimate of Prob( Next state Sj | This state Si )

We can re-estimate
   aij ← [ Σt εt(i,j) ] / [ Σt γt(i) ]

We can also re-estimate
   bj(Ok) ← …   (See Rabiner)

Copyright © Andrew W. Moore
Slide 70
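γt(i) and εt(i,j) are usually computed from the forward and backward variables; here is a minimal sketch (Python, my own illustration following the formulas in Rabiner's tutorial rather than anything printed on these slides) of the backward pass and of γ and ε for the running example.

```python
# Backward pass and the gamma/epsilon quantities defined above, computed from
# forward/backward variables as in Rabiner's tutorial (an assumption of this sketch).
pi = [0.5, 0.5, 0.0]
A = [[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]]
B = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
N = 3

def forward(O):
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, len(O)):
        alpha.append([B[j][O[t]] * sum(A[i][j] * alpha[t-1][i] for i in range(N))
                      for j in range(N)])
    return alpha

def backward(O):
    """beta[t][i] = P(O_{t+1}..O_T | q_t = S_i)."""
    T = len(O)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
                   for i in range(N)]
    return beta

def gamma_epsilon(O):
    alpha, beta = forward(O), backward(O)
    T, p_O = len(O), sum(forward(O)[-1])          # p_O = P(O | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p_O for i in range(N)] for t in range(T)]
    eps = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / p_O
             for j in range(N)] for i in range(N)] for t in range(T - 1)]
    return gamma, eps

g, e = gamma_epsilon([0, 0, 2])       # O = X X Z
print(g[2])                            # posterior over q_3 given X X Z: [0, 1/2, 1/2]
```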
We want aij^new = new estimate of P(qt+1 = sj | qt = si)
Copyright © Andrew W. Moore
Slide 71
We want aij^new = new estimate of P(qt+1 = sj | qt = si)

   = [ Expected # transitions i → j | λold, O1, O2, … OT ]
     / [ Σ_{k=1}^N  Expected # transitions i → k | λold, O1, O2, … OT ]
Copyright © Andrew W. Moore
Slide 72
We want aij^new = new estimate of P(qt+1 = sj | qt = si)

   = [ Expected # transitions i → j | λold, O1, O2, … OT ]
     / [ Σ_{k=1}^N  Expected # transitions i → k | λold, O1, O2, … OT ]

   = [ Σ_{t=1}^T P(qt+1 = sj, qt = si | λold, O1, O2, … OT) ]
     / [ Σ_{k=1}^N Σ_{t=1}^T P(qt+1 = sk, qt = si | λold, O1, O2, … OT) ]
Copyright © Andrew W. Moore
Slide 73
We want aij^new = new estimate of P(qt+1 = sj | qt = si)

   = [ Expected # transitions i → j | λold, O1, O2, … OT ]
     / [ Σ_{k=1}^N  Expected # transitions i → k | λold, O1, O2, … OT ]

   = [ Σ_{t=1}^T P(qt+1 = sj, qt = si | λold, O1, O2, … OT) ]
     / [ Σ_{k=1}^N Σ_{t=1}^T P(qt+1 = sk, qt = si | λold, O1, O2, … OT) ]

   = Sij / Σ_{k=1}^N Sik

where   Sij = Σ_{t=1}^T P(qt+1 = sj, qt = si, O1, … OT | λold)  =  What?

Copyright © Andrew W. Moore
Slide 74
We want aij^new = new estimate of P(qt+1 = sj | qt = si)

   = [ Expected # transitions i → j | λold, O1, O2, … OT ]
     / [ Σ_{k=1}^N  Expected # transitions i → k | λold, O1, O2, … OT ]

   = [ Σ_{t=1}^T P(qt+1 = sj, qt = si | λold, O1, O2, … OT) ]
     / [ Σ_{k=1}^N Σ_{t=1}^T P(qt+1 = sk, qt = si | λold, O1, O2, … OT) ]

   = Sij / Σ_{k=1}^N Sik

where   Sij = Σ_{t=1}^T P(qt+1 = sj, qt = si, O1, … OT | λold)
            = aij Σ_{t=1}^T αt(i) βt+1(j) bj(Ot+1)
Copyright © Andrew W. Moore
Slide 75
We want   aij^new = Sij / Σ_{k=1}^N Sik
where     Sij = aij Σ_{t=1}^T αt(i) βt+1(j) bj(Ot+1)

Copyright © Andrew W. Moore
Slide 76
We want   aij^new = Sij / Σ_{k=1}^N Sik
where     Sij = aij Σ_{t=1}^T αt(i) βt+1(j) bj(Ot+1)

Copyright © Andrew W. Moore
Slide 77
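Putting the last few slides together, a minimal sketch (Python, my own illustration) of the transition re-estimation: Sij = aij Σt αt(i) βt+1(j) bj(Ot+1), then aij^new = Sij / Σk Sik. The α and β tables are assumed to come from forward/backward passes such as the earlier sketches, and the sum runs over t = 1 … T−1 so that βt+1 is defined.

```python
# Re-estimation of the transition matrix from the quantities on the last few slides:
#   S_ij     = a_ij * sum_t alpha_t(i) * beta_{t+1}(j) * b_j(O_{t+1})
#   a_ij_new = S_ij / sum_k S_ik
# alpha, beta are the forward/backward tables (e.g. from the earlier sketch);
# A, B are the current ("old") transition and observation matrices.
def reestimate_A(O, A, B, alpha, beta):
    N, T = len(A), len(O)
    S = [[A[i][j] * sum(alpha[t][i] * beta[t+1][j] * B[j][O[t+1]]
                        for t in range(T - 1))
          for j in range(N)] for i in range(N)]
    # Assumes each state i is reachable, i.e. sum(S[i]) > 0.
    return [[S[i][j] / sum(S[i]) for j in range(N)] for i in range(N)]
```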
EM for HMMs
If we knew λ we could estimate EXPECTATIONS of quantities
such as
Expected number of times in state i
Expected number of transitions i → j
If we knew the quantities such as
Expected number of times in state i
Expected number of transitions i → j
We could compute the MAX LIKELIHOOD estimate of
λ = {aij}, {bi(k)}, {πi}
Roll on the EM Algorithm…
Copyright © Andrew W. Moore
Slide 78
EM 4 HMMs
1. Get your observations O1 …OT
2. Guess your first λ estimate λ(0), k=0
3. k = k+1
4. Given O1 …OT, λ(k) compute γt(i), εt(i,j) for 1 ≤ t ≤ T, 1 ≤ i ≤ N, 1 ≤ j ≤ N
5. Compute expected freq. of state i, and expected freq. i→j
6. Compute new estimates of aij, bj(k), πi accordingly. Call them λ(k+1)
7. Goto 3, unless converged.

• Also known (for the HMM case) as the BAUM-WELCH algorithm.

Copyright © Andrew W. Moore
Slide 79
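The numbered recipe above as a loop skeleton (Python, my own illustration; estimate_gamma_epsilon and reestimate stand in for the computations sketched on the earlier slides and are not library functions). λ is represented as a (pi, A, B) tuple.

```python
# Skeleton of steps 1-7 above (Baum-Welch). The helper arguments are placeholders
# for the gamma/epsilon computation and the re-estimation formulas sketched earlier.
def baum_welch(O, lam0, estimate_gamma_epsilon, reestimate, max_iters=100, tol=1e-6):
    lam = lam0                                        # step 2: first guess lambda(0)
    for _ in range(max_iters):                        # step 3: k = k + 1
        gamma, eps = estimate_gamma_epsilon(O, lam)   # step 4: gamma_t(i), eps_t(i,j)
        new_lam = reestimate(O, gamma, eps)           # steps 5-6: new pi, a_ij, b_j(k)
        if converged(lam, new_lam, tol):              # step 7: stop once converged
            return new_lam
        lam = new_lam
    return lam

def _flatten(lam):
    pi, A, B = lam
    return list(pi) + [x for row in A for x in row] + [x for row in B for x in row]

def converged(old, new, tol):
    """One possible convergence test: largest parameter change below tol."""
    return max(abs(a - b) for a, b in zip(_flatten(old), _flatten(new))) < tol
```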
Bad News
• There are lots of local minima
Good News
• The local minima are usually adequate models of the
data.
Notice
• EM does not estimate the number of states. That must
be given.
• Often, HMMs are forced to have some links with zero
probability. This is done by setting aij=0 in initial estimate
λ(0)
• Easy extension of everything seen today: HMMs with
real valued outputs
Copyright © Andrew W. Moore
Slide 80
Bad News
• There are lots of local minima

  Trade-off between too few states (inadequately modeling the structure in the
  data) and too many (fitting the noise). Thus #states is a regularization
  parameter. Blah blah blah… bias variance tradeoff… blah blah…
  cross-validation… blah blah… AIC, BIC… blah blah (same ol’ same ol’)

Good News
• The local minima are usually adequate models of the data.
Notice
• EM does not estimate the number of states. That must
be given.
• Often, HMMs are forced to have some links with zero
probability. This is done by setting aij=0 in initial estimate
λ(0)
• Easy extension of everything seen today: HMMs with
real valued outputs
Copyright © Andrew W. Moore
Slide 81
What You Should Know
• What is an HMM ?
• Computing (and defining) αt(i)
• The Viterbi algorithm
• Outline of the EM algorithm
• To be very happy with the kind of maths and analysis needed for HMMs
(DON'T PANIC: the reading starts on p. 257.)
• Fairly thorough reading of Rabiner* up to page 266*
[Up to but not including “IV. Types of HMMs”].
*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected
Applications in Speech Recognition," Proc. of the IEEE, Vol.77, No.2,
pp.257--286, 1989.
http://ieeexplore.ieee.org/iel5/5/698/00018626.pdf?arnumber=18626
Copyright © Andrew W. Moore
Slide 82