The Infinite Hidden Markov Model (Part 1)

J. Scholz (RIM@GT)
08/26/2011
Motivation: modeling sequential data

• Laser ranges or observations (localization)
• Phonemes (speech recognition)
• Visual features (tracking)
• Human actions (learning from demonstration)
The classical HMM

• Encodes the Markov assumption for state transitions
• Each state has its own distribution over possible observations ("emissions")
Parameterizing an HMM

Transition matrix (N x N)
Emission matrix (N x M)
(Both are conditional probability tables)
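Not from the slides: a minimal sketch of these two CPTs as NumPy arrays, with hypothetical sizes N = 3 and M = 2. Each row is a conditional distribution, so rows must sum to one.

```python
import numpy as np

N, M = 3, 2  # hypothetical sizes: N hidden states, M emission tokens

# Transition CPT: T[i, j] = P(S_t = j | S_{t-1} = i); each row sums to 1.
T = np.random.dirichlet(np.ones(N), size=N)   # (N x N)

# Emission CPT: E[i, m] = P(X_t = m | S_t = i); each row sums to 1.
E = np.random.dirichlet(np.ones(M), size=N)   # (N x M)

assert np.allclose(T.sum(axis=1), 1.0) and np.allclose(E.sum(axis=1), 1.0)
```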
Things people might want to do with an HMM

• Sample a hidden state trajectory
• Filtering/smoothing to find hidden state marginals (forward-backward)
• Find the most likely sequence (Viterbi)
• Learn the underlying CPTs (Baum-Welch/EM)
• Evaluate the likelihood of a hidden state trajectory
• Evaluate the likelihood of an emission sequence (dynamic programming)
Information Required?

• Sample a hidden state trajectory: CPTs, Emissions
• Filtering/smoothing to find hidden state marginals (forward-backward): CPTs, Emissions
• Find the most likely sequence (Viterbi): CPTs, Emissions
• Learn the underlying CPTs (Baum-Welch/EM): Emissions
• Evaluate the likelihood of a hidden state trajectory: CPTs, Emissions
• Evaluate the likelihood of an emission sequence (dynamic programming): CPTs
Running Example

• 10 states
• 6 emission tokens
• Emission sequence: 10 copies of "ABCDEFEDCB"

Perfect model (log-likelihood = 0)
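One way (not shown in the slides) to realize this perfect model: a deterministic 10-state cycle where state i always emits the i-th character of the pattern, so every transition and emission factor is 1 and the log-likelihood is exactly 0. A sketch:

```python
import numpy as np

tokens = "ABCDEF"                     # 6 emission tokens
pattern = "ABCDEFEDCB"                # one period of the sequence
seq = pattern * 10                    # 10 copies

N = 10                                # 10 states, one per position in the pattern
T = np.zeros((N, N)); E = np.zeros((N, len(tokens)))
for i in range(N):
    T[i, (i + 1) % N] = 1.0               # deterministic cycle i -> i+1
    E[i, tokens.index(pattern[i])] = 1.0  # state i always emits pattern[i]

# Log-likelihood of the sequence, starting in state 0:
s, ll = 0, np.log(E[0, tokens.index(seq[0])])
for x in seq[1:]:
    s = (s + 1) % N
    ll += np.log(T[(s - 1) % N, s]) + np.log(E[s, tokens.index(x)])
print(ll)   # 0.0 -- a "perfect model"
```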
Things WE might want to do with an HMM

• First figure out the number of states!
• Then do the rest of that stuff
Infinite Version
Will let us get from here:
• R.runHDP(50);
Infinite Version
To here:
• R.runHDP(50);
Learning the CPTs

• Fixed-size versions can use an EM approach
• Won't work for the DP (why??)
• Can't compute analytical marginals over an infinite number of states
• The DP is a generative model, so we'll use MCMC instead
Gibbs Sampling the state sequence

• Algorithm:
  • Iterate from t = 1:end
  • Resample S(t) given its Markov blanket *
  • Update CPTs accordingly ** (see the sketch below)

* How do we condition on the MB?
** Decrement old transition/emission counts, increment new
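Not the authors' code, just a sketch of what one sweep might look like for a finite HMM with Dirichlet-smoothed counts. The names `n_trans`, `n_emit`, `alpha`, and `beta` are mine, and the conditional below drops the row-normalizer terms of the exact collapsed Gibbs conditional for brevity.

```python
import numpy as np

def gibbs_sweep(S, X, n_trans, n_emit, alpha=1.0, beta=1.0):
    """One Gibbs pass over hidden states S given emissions X (integer arrays).
    n_trans[i, j] / n_emit[i, m] are transition / emission counts."""
    N = n_trans.shape[0]
    for t in range(1, len(S) - 1):                  # endpoints omitted for brevity
        # ** decrement the counts contributed by the old value of S[t]
        n_trans[S[t-1], S[t]] -= 1; n_trans[S[t], S[t+1]] -= 1
        n_emit[S[t], X[t]] -= 1
        # * condition on the Markov blanket: S[t-1], S[t+1], and X[t]
        p = ((n_trans[S[t-1], :] + alpha) *         # P(S_t | S_{t-1})
             (n_trans[:, S[t+1]] + alpha) *         # P(S_{t+1} | S_t), normalizers dropped
             (n_emit[:, X[t]] + beta))              # P(X_t | S_t)
        S[t] = np.random.choice(N, p=p / p.sum())
        # ** increment counts for the new value of S[t]
        n_trans[S[t-1], S[t]] += 1; n_trans[S[t], S[t+1]] += 1
        n_emit[S[t], X[t]] += 1
    return S
```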
Likelihoods in an HMM

• Forward likelihoods are rows
• Backward likelihoods are columns
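A sketch (not in the slides) of the forward-backward recursions behind that picture, assuming CPTs T and E as before, an initial distribution pi0, and integer emissions X. Unscaled, so it is for illustration only:

```python
import numpy as np

def forward_backward(T, E, pi0, X):
    """alpha[t, i] = P(x_1..x_t, S_t = i); beta[t, i] = P(x_{t+1}..x_T | S_t = i)."""
    Tlen, N = len(X), T.shape[0]
    alpha = np.zeros((Tlen, N)); beta = np.ones((Tlen, N))
    alpha[0] = pi0 * E[:, X[0]]
    for t in range(1, Tlen):                         # forward pass: fill rows
        alpha[t] = (alpha[t-1] @ T) * E[:, X[t]]
    for t in range(Tlen - 2, -1, -1):                # backward pass: fill columns
        beta[t] = T @ (E[:, X[t+1]] * beta[t+1])
    gamma = alpha * beta                             # smoothed joint, unnormalized
    return alpha, beta, gamma / gamma.sum(axis=1, keepdims=True)
```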
Adding the Dirichlet process

• Replace the fixed-size CPT with a DP version
• Each row is its own DP
• Problems?
• Yes! Separate DPs have no common support. Doesn't work for a CPT.
Quick review of DP math

• A Dirichlet Process (DP) is a distribution over probability measures.
• A DP has two parameters:
  • Base distribution H, which is like the mean of the DP.
  • Strength parameter α, which is like an inverse-variance of the DP.
• We write G ∼ DP(α, H) if for any partition (A_1, ..., A_n) of X:

(G(A_1), ..., G(A_n)) ∼ Dirichlet(αH(A_1), ..., αH(A_n))

[Figure: a partition A_1, ..., A_6 of the space X]
(Yee Whye Teh (Gatsby), "DP and HDP Tutorial", Mar 1, 2007 / CUED)

try it! >> Prelim 2.1
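A quick Monte-Carlo sanity check of this defining property (my own illustration, using the stick-breaking construction that appears later in these slides): put H on a 3-cell partition, draw many Gs, and compare the empirical mean and variance of the cell masses against the Dirichlet(αH(A_1), ..., αH(A_3)) values.

```python
import numpy as np

alpha = 2.0
H = np.array([0.5, 0.3, 0.2])        # H's mass on the 3 partition cells

def draw_G_cell_masses(K=2000):
    """Truncated stick-breaking draw of G, reported as masses on the 3 cells."""
    b = np.random.beta(1.0, alpha, size=K)
    pi = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    cells = np.random.choice(3, size=K, p=H)         # each atom theta*_k ~ H
    return np.array([pi[cells == c].sum() for c in range(3)])

G = np.array([draw_G_cell_masses() for _ in range(2000)])
print(G.mean(axis=0), H)                             # Dirichlet mean: H(A_c)
print(G.var(axis=0), H * (1 - H) / (alpha + 1))      # Dirichlet variance
```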
A closer look

• A DP has two parameters:
  • Base distribution H, which is like the mean of the DP
  • Strength parameter α, which is like an inverse-variance of the DP
• We write G ∼ DP(α, H) if for any partition (A_1, ..., A_n) of X:

(G(A_1), ..., G(A_n)) ∼ Dirichlet(αH(A_1), ..., αH(A_n))

What is the form of H?
What is the form of H?

• Can be any distribution defined over our event space (e.g. Gaussian)
• Continuous or discrete: both legal
• The only condition is that it has to return a probability for any partition A we give it
Topics

• Discrete N-D probability distributions (categorical, multinomial, Dirichlet)
• Dirichlet Process definition
• Dirichlet Process metaphors
  • Polya urn
  • Chinese restaurant process
  • Stick-breaking process
So where is the Process in a DP?

• 3 metaphors:
  • Polya urn
    • Involves drawing and replacing balls from an urn
  • Chinese restaurant
    • Involves customers sitting at tables in proportion to their popularity
  • Stick-breaking
    • Involves breaking off pieces of a stick of unit length
The Polya-urn Scheme

Pólya's urn scheme produces a sequence θ_1, θ_2, ... with the following conditionals:

eq 1:   θ_n | θ_1:n−1 ∼ ( Σ_{i=1}^{n−1} δ_{θ_i} + αH ) / (n − 1 + α)

(the sum counts the number of θ_i-colored balls)

Imagine picking balls of different colors from an urn:

• Start with no balls in the urn.
• With probability ∝ α, draw θ_n ∼ H, and add a ball of that color into the urn.
• With probability ∝ n − 1, pick a ball at random from the urn, record θ_n to be its color, return the ball into the urn, and place a second ball of the same color into the urn.

(Yee Whye Teh (Gatsby), "DP and HDP Tutorial", Mar 1, 2007 / CUED)
Polya sampling in practice

• Equation 1 is of the form p·f(Ω) + (1−p)·g(Ω)
• This implies that proportion p of the density is associated with f, so we can split the task in two:
  • First flip a Bern(p) coin. If heads, draw from f; if tails, draw from g
  • For the Polya urn, this gives us either a sample from the existing balls (f), or a new color (g)* (sketched below)

* If g is a continuous density on Ω, then the probability of sampling an existing cluster from g is zero. (why?)
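A sketch of this split (my code, not the slides'), with H taken to be a continuous base distribution (a standard normal), so a draw from g is almost surely a brand-new color:

```python
import numpy as np

def polya_urn(n, alpha=1.0, H=np.random.standard_normal):
    """Draw theta_1..theta_n from Polya's urn with base H and strength alpha."""
    balls = []
    for _ in range(n):
        p = alpha / (len(balls) + alpha)             # the Bern(p) coin from eq. 1
        if np.random.rand() < p:
            balls.append(float(H()))                 # g: a new color drawn from H
        else:
            balls.append(balls[np.random.randint(len(balls))])  # f: an existing ball
    return balls

draws = polya_urn(100)
print(len(set(draws)), "distinct colors among", len(draws), "balls")
```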
Analyzing Polya Urn

One (infinitely long) "run" of our process:

• A draw G ∼ DP(α, H) is a random probability measure.
• Treating G as a distribution, consider i.i.d. draws from G:

θ_i | G ∼ G        (one component drawn from G)

• Marginalizing out G, marginally each θ_i ∼ H, while the conditional distributions are:

θ_n | θ_1:n−1 ∼ ( Σ_{i=1}^{n−1} δ_{θ_i} + αH ) / (n − 1 + α)

• This is precisely what we did in the Polya urn scheme*

* This is why people say that the DP is the "De Finetti distribution" underlying the urn process. It's what makes the θ_i exchangeable. (Since the θ_i are i.i.d. ∼ G, their joint distribution is invariant to permutations.)

(Yee Whye Teh (Gatsby), "DP and HDP Tutorial", Mar 1, 2007 / CUED)
Problem for CPT

• Each time one of the row DPs draws a new state, it draws from H
• However, each draw from H will be unique
• Thus, each DP builds its own state representation
Solution?

• Need some way to allow each row DP to SHARE states, without limiting their number
• Answer: another DP!
• Each time a row DP tries to draw from its base distribution, we draw from a high-level DP (the "Oracle")
• This oracle is shared across all row DPs (see the sketch below)
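One way to picture the oracle (a sketch under my own naming, not the authors' implementation) is as a second urn shared by every row: when a row urn draws "new", it defers to the oracle, which hands back either an existing global state id or a genuinely new one.

```python
import numpy as np

class OracleUrn:
    """Top-level urn shared by all row DPs; hands out global state ids."""
    def __init__(self, gamma=1.0):
        self.gamma, self.counts = gamma, []          # counts[j]: oracle count of state j
    def draw(self):
        w = np.array(self.counts + [self.gamma])
        j = int(np.random.choice(len(w), p=w / w.sum()))
        if j == len(self.counts):
            self.counts.append(0)                    # instantiate a brand-new global state
        self.counts[j] += 1
        return j                                     # a shared id, not a fresh atom from H

def row_draw(row_counts, oracle, beta=1.0):
    """One draw from a row DP; 'new' draws defer to the shared oracle."""
    w = np.array(row_counts + [beta])
    k = int(np.random.choice(len(w), p=w / w.sum()))
    if k == len(row_counts):                         # the row wants something new...
        k = oracle.draw()                            # ...but gets a SHARED state id
    while k >= len(row_counts):
        row_counts.append(0)
    row_counts[k] += 1
    return k
```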
Compared to the infinite Gaussian mixture model

• Traded one form of complexity for another
• In the iGMM, atoms were associated with Gaussian components (thus atoms had an associated mean and variance)
• In the iHMM, atoms are not (necessarily) mixture components, but they need to be shared somehow
Comparing the Graphical Models

[Figure: the four graphical models side by side. GMM and iGMM: π is the distribution over components, θ the distribution over values given components, with hidden assignment Z_i and observation X_i for i = 1, ..., n. HMM and iHMM: the sequential analogues, drawn as in the HDP paper's "Figure 3: (a) A hierarchical Dirichlet process mixture model. (b) ...", with a shared top-level G_0 and per-group G_j.]
A little trick...

• The original paper used a pure Gibbs sampling approach with the full generative model
• Each step we decrement all DP data structures
• What if we don't want to mess around with the oracles?
• Can re-draw the oracle counts after each sweep, and then redraw the oracle distribution from a Dirichlet (problems?)
Augmenting a Dirichlet distribution online

• In a DP, π is a distribution over all represented states, plus the probability of sampling a new state
• The new-state probability was proportional to the number of "black balls"
• If we sample a new state, we can augment π using the "stick-breaking construction" (see the sketch below)
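A sketch of that augmentation (hypothetical names; I assume the break parameter is the oracle's concentration γ): π keeps a trailing "new state" mass, and instantiating a state splits that mass with a Beta(1, γ) draw.

```python
import numpy as np

def augment_pi(pi, gamma=1.0):
    """pi = [p_1, ..., p_K, p_new]. Split p_new into the (K+1)-th state's mass
    plus a fresh 'new state' remainder, using one stick-breaking step."""
    b = np.random.beta(1.0, gamma)
    p_new = pi[-1]
    return pi[:-1] + [b * p_new, (1.0 - b) * p_new]  # still sums to 1

pi = [0.6, 0.3, 0.1]          # two represented states + new-state mass 0.1
print(augment_pi(pi))         # e.g. [0.6, 0.3, 0.07..., 0.02...]
```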
Original iHMM (Beal et al. 2002)

State transition generative mechanism (Figure 1, left):

• Existing transition (j ≠ i):   n_ij / (Σ_j n_ij + β + α)
• Self transition (j = i):       (n_ii + α) / (Σ_j n_ij + β + α)
• Oracle:                        β / (Σ_j n_ij + β + α)

Under the oracle:

• Existing state:   n_j^o / (Σ_j n_j^o + γ)
• New state:        γ / (Σ_j n_j^o + γ)

State emission generative mechanism (Figure 2, left):

• Existing emission:   m_iq / (Σ_q m_iq + β^e)
• Oracle:              β^e / (Σ_q m_iq + β^e)
• Existing symbol:     m_q^o / (Σ_q m_q^o + γ^e)
• New symbol:          γ^e / (Σ_q m_q^o + γ^e)

Under the oracle, with probability proportional to γ an entirely new state is transitioned to. This is the only mechanism for visiting new states from the infinitely many available to us. After each transition we set n_ij ← n_ij + 1 and, if we transitioned to the state j via the oracle DP just described, then in addition we set n_j^o ← n_j^o + 1. If we transitioned to a new state then the sizes of n and n^o will increase.

Self-transitions are special because their probability defines a time scale over which the dynamics of the hidden state evolves. We assign a finite prior mass α to self-transitions for each state; this is the third hyperparameter in our model. Therefore, when first visited (via γ in the HDP), a state's self-transition count is initialised to α.

The full hidden state transition mechanism is a two-level DP hierarchy shown in decision-tree form in Figure 1. Alongside are shown typical state trajectories under the prior.

Figure 1 caption: (left) State transition generative mechanism. (right a-d) Sampled state trajectories of length T = 250 (time along horizontal axis) from the HDP, illustrating four modes of behaviour: (a) α = 0.1, β = 1000, γ = 100 explores many states with a sparse transition matrix; (b) α = 0, β = 0.1, γ = 100 retraces multiple interacting trajectory segments; (c) α = 8, β = 2, γ = 2 switches between a few different states; (d) α = 1, β = 1, γ = 10000 has strict left-to-right transition dynamics with long linger time.

Figure 2 caption: (left) State emission generative mechanism. (middle) Word occurrence for the entire Alice novel: each word is assigned a unique integer identity as it appears; word identity (vertical) is plotted against position in the text (horizontal). (right) (Exp 1) Evolution of the number of represented states.
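The mechanism in Figure 1 (left) translates directly into a sampler sketch. The code below is my reading of those conditionals, with n the transition count matrix and n_o the oracle count vector as NumPy arrays; growing the arrays when a new state appears is left to the caller.

```python
import numpy as np

def sample_transition(i, n, n_o, alpha=1.0, beta=1.0, gamma=1.0):
    """One draw of the next state from state i under the two-level mechanism.
    n[i, j]: transition counts; n_o[j]: oracle counts.
    Returns (j, via_oracle); j == len(n_o) means a brand-new state."""
    K = len(n_o)
    w = n[i].astype(float)
    w[i] += alpha                                    # finite prior mass on self-transition
    w = np.append(w, beta)                           # last slot: defer to the oracle
    j = int(np.random.choice(K + 1, p=w / w.sum()))
    if j < K:
        return j, False                              # existing transition: set n[i, j] += 1
    w = np.append(n_o.astype(float), gamma)          # oracle: existing states vs. new
    j = int(np.random.choice(K + 1, p=w / w.sum()))
    return j, True                                   # via oracle: also set n_o[j] += 1
```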
The Stick-Breaking Construction

• But what do draws G ∼ DP(α, H) look like?
• G is discrete with probability one, so:

G = Σ_{k=1}^∞ π_k δ_{θ*_k}        (π_k: mixing proportion; δ_{θ*_k}: point mass)

• The stick-breaking construction shows that G ∼ DP(α, H) if:

π_k = β_k Π_{l=1}^{k−1} (1 − β_l),    β_k ∼ Beta(1, α),    θ*_k ∼ H

• We write π ∼ GEM(α) if π = (π_1, π_2, ...) is distributed as above.

[Figure: a unit-length stick broken into pieces π(1), π(2), ..., π(6)]
(Yee Whye Teh (Gatsby), "DP and HDP Tutorial", Mar 1, 2007 / CUED)
The Stick-Breaking Construction

Why does this make sense?

• Draws from the Beta(1, α) give a distribution over the interval (0,1), which we can think of as where to break the stick
• The product scales the k-th sample according to how much has been broken off already
• In the limit we get another infinite partitioning of our interval [0,1], and therefore a (discrete) probability measure
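A small sketch (mine, matching the equations two slides back) that draws truncated GEM(α) weights and shows the leftover stick shrinking, which is the sense in which the limit yields a discrete probability measure:

```python
import numpy as np

def gem(alpha, K=50):
    """Truncated stick-breaking: pi_k = beta_k * prod_{l<k} (1 - beta_l)."""
    b = np.random.beta(1.0, alpha, size=K)
    left = np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))  # stick left before break k
    return b * left

for K in (10, 50, 200):
    pi = gem(alpha=2.0, K=K)
    print(K, pi.sum(), 1.0 - pi.sum())   # leftover mass -> 0 as K grows
```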