Markov chains/HMMs/DBNs - Indiana University Computer Science

CS B553: ALGORITHMS FOR
OPTIMIZATION AND LEARNING
Temporal sequences: Hidden Markov Models and
Dynamic Bayesian Networks
MOTIVATION

Observing a stream of data
 Monitoring (of people, computer systems, etc.)
 Surveillance, tracking
 Finance & economics
 Science

Questions:
 Modeling & forecasting
 Unobserved variables
TIME SERIES MODELING

Time occurs in steps t=0,1,2,…
 A time step can be seconds, days, years, etc.
State variable Xt, t=0,1,2,…
 For partially observed problems, we see observations Ot, t=1,2,… and do not see the X’s
 The X’s are hidden variables (aka latent variables)
MODELING TIME

Arrow of time: causes precede effects
Causality => Bayesian networks are natural models of time series
PROBABILISTIC MODELING

For now, assume fully observable case
[Figure: candidate network structures over the state variables X0, X1, X2, X3; what parents should each Xt have?]
MARKOV ASSUMPTION
Assume Xt+k is conditionally independent of X0,…,Xt-1 given Xt,…,Xt+k-1:
P(Xt+k | X0,…,Xt+k-1) = P(Xt+k | Xt,…,Xt+k-1)
 This is a k-th order Markov chain

[Figure: chain structures over X0,…,X3 for Markov chains of order 0, 1, 2, and 3]
1ST ORDER MARKOV CHAIN
MCs of order k>1 can be converted into a 1st-order MC on the composite variable Yt = {Xt,…,Xt+k-1}
 So w.l.o.g., “MC” refers to a 1st-order MC

[Figure: a higher-order chain over X0,…,X4 rewritten as a 1st-order chain over the composite variables Y0,…,Y3]
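For concreteness, here is a small sketch (not from the slides) of this conversion for a 2nd-order chain with discrete Val(X) = {0,…,n-1}, using the column-stochastic convention Tij = P(Xt=i|Xt-1=j) introduced later in the deck:

    import numpy as np

    def second_to_first_order(T2):
        """Turn a 2nd-order model T2[i, j, k] = P(X_{t+1}=k | X_{t-1}=i, X_t=j)
        into a 1st-order transition matrix over the composite state
        Y_t = (X_t, X_{t+1}), encoding the pair (a, b) as the index a*n + b."""
        n = T2.shape[0]
        T1 = np.zeros((n * n, n * n))
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    # Y = (i, j) can only move to Y' = (j, k), with probability P(k | i, j)
                    T1[j * n + k, i * n + j] = T2[i, j, k]
        return T1   # each column sums to 1, so this is a valid 1st-order MC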
INFERENCE IN MC

What independence relationships can we read from the BN?

[Figure: chain X0 → X1 → X2 → X3]

 Observing X1 makes X0 independent of X2, X3, …
 P(Xt|Xt-1) is known as the transition model
INFERENCE IN MC

Prediction: what is the probability of a future state?

P(Xt) = Σx0,…,xt-1 P(X0,…,Xt)
      = Σx0,…,xt-1 P(x0) Πi=1,…,t P(xi|xi-1)
      = Σxt-1 P(Xt|xt-1) P(xt-1)     [recursive approach]

Approach: maintain a belief state bt(X) = P(Xt) and use the equation above to advance to bt+1(X)
 Equivalent to the VE algorithm run in sequential order
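A minimal sketch of this recursive approach (assuming a discrete state space and NumPy; not part of the original slides): advancing the belief state is just a repeated matrix-vector product with the transition model.

    import numpy as np

    def advance_belief(b, T, steps=1):
        """Advance b_t(X) = P(X_t) via P(X_t) = sum_{x_{t-1}} P(X_t|x_{t-1}) P(x_{t-1}),
        where T[i, j] = P(X_t = i | X_{t-1} = j)."""
        for _ in range(steps):
            b = T @ b                          # one application of the transition model
        return b

    # Hypothetical 2-state chain; its stationary distribution is (2/3, 1/3)
    T = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
    b0 = np.array([1.0, 0.0])                  # initially certain that X_0 = 0
    print(advance_belief(b0, T, steps=50))     # close to the stationary distribution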
BELIEF STATE EVOLUTION


P(Xt) = Σxt-1 P(Xt|xt-1) P(xt-1)

The belief “blurs” over time, and (typically) approaches a stationary distribution as t grows
 Limited prediction power
 The rate of blurring is known as the mixing time
STATIONARY DISTRIBUTIONS

For discrete variables Val(X)={1,…,n}:
 Transition matrix Tij = P(Xt=i|Xt-1=j)
 Belief bt(X) is just a vector bt,i = P(Xt=i)
 Belief update equation: bt+1 = T*bt

A stationary distribution b is one in which b = Tb
 => b is an eigenvector of T with eigenvalue 1
 => b is in the null space of (T-I)
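A small illustrative sketch (NumPy assumed, matrix values hypothetical) of computing such a stationary distribution from the eigendecomposition of T:

    import numpy as np

    def stationary_distribution(T):
        """Return b with T b = b: an eigenvector of T for eigenvalue 1,
        equivalently a null-space vector of (T - I), normalized to sum to 1."""
        eigvals, eigvecs = np.linalg.eig(T)
        k = np.argmin(np.abs(eigvals - 1.0))     # eigenvalue closest to 1
        b = np.real(eigvecs[:, k])
        return b / b.sum()

    T = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
    print(stationary_distribution(T))            # approximately [0.667, 0.333]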
HISTORY DEPENDENCE
In Markov models, the state must be chosen so
that the future is independent of history given
the current state
 Often this requires adding variables that cannot
be directly observed

[Figure: examples of history dependence: possible continuations of “the bare …” (minimum? essentials? market?); are these people walking toward you or away from you? What comes next?]
PARTIAL OBSERVABILITY

Hidden Markov Model (HMM)
[Figure: HMM structure: hidden state variables X0, X1, X2, X3; observed variables O1, O2, O3]

P(Ot|Xt) is called the observation model (or sensor model)
INFERENCE IN HMMS
Filtering
 Prediction
 Smoothing, aka hindsight
 Most likely explanation

INFERENCE IN HMMS
Filtering
 Prediction
 Smoothing, aka hindsight
 Most likely explanation

[Figure: HMM with the most recent state variable marked as the query]
FILTERING

Name comes from signal processing

P(Xt|o1:t) = Σxt-1 P(xt-1|o1:t-1) P(Xt|xt-1,ot)

P(Xt|Xt-1,ot) = P(ot|Xt-1,Xt) P(Xt|Xt-1) / P(ot|Xt-1)
              = α P(ot|Xt) P(Xt|Xt-1)
FILTERING

P(Xt|o1:t) = α Σxt-1 P(xt-1|o1:t-1) P(ot|Xt) P(Xt|xt-1)

 Forward recursion
 If we keep track of the belief state bt(X) = P(Xt|o1:t) => O(|Val(X)|²) updates for each t!

PREDICT-UPDATE INTERPRETATION
Given the old belief state bt-1(X):
 Predict: first compute the MC update
   bt’(Xt) = P(Xt|o1:t-1) = Σx bt-1(x) P(Xt|Xt-1=x)
 Update: re-weight to account for the observation probability and normalize:
   bt(x) = α bt’(x) P(ot|Xt=x)
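A minimal sketch of one predict-update step (not the slides’ code), assuming a discrete state space, a transition matrix T[i,j] = P(Xt=i|Xt-1=j), and an observation matrix O[o,x] = P(Ot=o|Xt=x); all numbers below are hypothetical.

    import numpy as np

    def filter_step(b_prev, T, O, obs):
        """One HMM filtering step: b_prev[x] = P(X_{t-1}=x | o_{1:t-1}),
        obs = the observed value o_t; returns P(X_t | o_{1:t})."""
        b_pred = T @ b_prev                  # predict: P(X_t | o_{1:t-1})
        b = O[obs, :] * b_pred               # update: reweight by P(o_t | X_t)
        return b / b.sum()                   # normalize (the alpha in the slides)

    # Tiny example: 2 hidden states, 2 possible observation values
    T = np.array([[0.7, 0.3],
                  [0.3, 0.7]])
    O = np.array([[0.9, 0.2],                # P(O=0|X=0)=0.9, P(O=0|X=1)=0.2
                  [0.1, 0.8]])
    b = np.array([0.5, 0.5])
    for o in [0, 0, 1]:                      # filter a short observation sequence
        b = filter_step(b, T, O, o)
    print(b)                                 # P(X_3 | o_{1:3})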
INFERENCE IN HMMS
Filtering
 Prediction
 Smoothing, aka hindsight
 Most likely explanation

PREDICTION
P(Xt+k|o1:t)
 2 steps: compute P(Xt|o1:t), then P(Xt+k|Xt)
 Filter up to time t, then predict as with a standard MC

[Figure: prediction: the query is a state beyond the last observation]
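A sketch of the two-step computation, reusing the conventions of the filtering sketch above (hypothetical, not from the slides):

    import numpy as np

    def predict_ahead(b_filtered, T, k):
        """P(X_{t+k} | o_{1:t}): start from the filtered belief at time t and
        apply the transition model k more times, as in the plain MC case."""
        return np.linalg.matrix_power(T, k) @ b_filtered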
INFERENCE IN HMMS
Filtering
 Prediction
 Smoothing, aka hindsight
 Most likely explanation

SMOOTHING
P(Xk|o1:t) for k < t

P(Xk|o1:k,ok+1:t) = P(ok+1:t|Xk,o1:k) P(Xk|o1:k) / P(ok+1:t|o1:k)
                  = α P(ok+1:t|Xk) P(Xk|o1:k)

 The second factor, P(Xk|o1:k), comes from standard filtering up to time k
[Figure: smoothing: the query is a past state Xk, given observations through time t]
SMOOTHING

Computing P(ok+1:t|Xk):

P(ok+1:t|Xk) = Σxk+1 P(ok+1:t|Xk,xk+1) P(xk+1|Xk)
             = Σxk+1 P(ok+1:t|xk+1) P(xk+1|Xk)
             = Σxk+1 P(ok+2:t|xk+1) P(ok+1|xk+1) P(xk+1|Xk)

 Backward recursion
[Figure: given prior states, what is the probability of this observation sequence?]
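A sketch of the backward recursion above and of the final combination with the filtered estimate, under the same discrete-state conventions as the filtering sketch (illustrative only):

    import numpy as np

    def backward_messages(T, O, observations):
        """m[k][x] = P(o_{k+1:t} | X_k = x), from the backward recursion
        m_k = T^T (O[o_{k+1}, :] * m_{k+1}), with m_t = 1 (no future evidence).
        T[i, j] = P(X_t=i | X_{t-1}=j); O[o, x] = P(O_t=o | X_t=x);
        observations is the 0-indexed list [o_1, ..., o_t]."""
        n, t = T.shape[0], len(observations)
        m = [np.ones(n) for _ in range(t + 1)]
        for k in range(t - 1, -1, -1):               # observations[k] is o_{k+1}
            m[k] = T.T @ (O[observations[k], :] * m[k + 1])
        return m

    def smoothed(b_filtered_k, m_k):
        """P(X_k | o_{1:t}) is proportional to P(o_{k+1:t} | X_k) P(X_k | o_{1:k})."""
        b = m_k * b_filtered_k
        return b / b.sum()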
INTERPRETATION

Filtering/prediction:
 Equivalent to forward variable elimination / belief propagation

Smoothing:
 Equivalent to forward VE/BP up to the query variable, then backward VE/BP from the last observation back to the query variable

Running BP to completion gives the smoothed estimates for all variables (the forward-backward algorithm)
INFERENCE IN HMMS
Filtering
 Prediction
 Smoothing, aka hindsight
 Most likely explanation


Subject of next lecture
[Figure: most likely explanation: the query returns a path x0,…,x3 through state space]
APPLICATIONS OF HMMS IN NLP
Speech recognition
 Hidden: phones (e.g., ah eh ee th r)
 Observed: noisy acoustic features (produced by signal processing)
PHONE OBSERVATION MODELS
[Figure: observation model Phonet → Featurest; signal processing turns the raw audio into feature vectors, e.g. (24,13,3,59)]

The model is defined to be robust to variations in accent, speed, pitch, and noise
PHONE TRANSITION MODELS
[Figure: transition model Phonet → Phonet+1, with Featurest emitted from Phonet]

Good models will capture (among other things):
 Pronunciation of words
 Subphone structure
 Coarticulation effects

Triphone models = order 3 Markov chain
WORD SEGMENTATION
Words run together when pronounced
 Unigrams P(wi)
 Bigrams P(wi|wi-1)
 Trigrams P(wi|wi-1,wi-2)
Random 20-word samples from R&N using N-gram models:

 Unigram: Logical are as confusion a may right tries agent goal the was diesel more object then informationgathering search is
 Bigram: Planning purely diagnostic expert systems are very similar computational approach would be represented compactly using tic tac toe a predicate
 Trigram: Planning and scheduling are integrated the success of naïve bayes model is just a possible prior source by that time
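To make the N-gram definitions concrete, here is a small sketch (with made-up counts, not from the slides) of scoring a word sequence under a bigram model:

    import math
    from collections import Counter

    def bigram_logprob(words, unigram_counts, bigram_counts):
        """log P(w1,...,wn) under a bigram model estimated from raw counts:
        P(w1) * product_i P(wi | wi-1), with no smoothing (purely illustrative)."""
        total = sum(unigram_counts.values())
        logp = math.log(unigram_counts[words[0]] / total)
        for prev, cur in zip(words, words[1:]):
            logp += math.log(bigram_counts[(prev, cur)] / unigram_counts[prev])
        return logp

    # Hypothetical toy counts, just to show the shape of the computation
    unigrams = Counter({"planning": 4, "and": 6, "scheduling": 2, "are": 5})
    bigrams = Counter({("planning", "and"): 2, ("and", "scheduling"): 1,
                       ("scheduling", "are"): 1})
    print(bigram_logprob(["planning", "and", "scheduling", "are"], unigrams, bigrams))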
WHAT ABOUT MODELS WITH MANY
VARIABLES?

Say X has n binary variables and O has m binary variables
 Naively, a distribution over Xt may be intractable to represent (2^n entries)
 Transition models P(Xt|Xt-1) require 2^(2n) entries
 Observation models P(Ot|Xt) require 2^(n+m) entries

Is there a better way?
EXAMPLE: FAILURE DETECTION

Consider a battery meter sensor
 Battery = true level of battery
 BMeter = sensor reading

Transient failures: the sensor sends garbage at time t
 e.g., meter reads 5555500555…
Persistent failures: the sensor breaks and sends garbage forever
 e.g., meter reads 5555500000…
DYNAMIC BAYESIAN NETWORK
Template model relates variables at the previous time step to those at the next time step (2-TBN)
 “Unrolling” the template for all t gives the ground Bayesian network

[Figure: 2-TBN with Batteryt-1 → Batteryt and Batteryt → BMetert]

BMetert ~ N(Batteryt, σ)
DYNAMIC BAYESIAN NETWORK
[Figure: the same 2-TBN: Batteryt-1 → Batteryt → BMetert]

BMetert ~ N(Batteryt, σ)

Transient failure model:
 P(BMetert=0 | Batteryt=5) = 0.03
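One way to write this observation model down concretely (a sketch under the assumptions that battery levels and meter readings are discretized to 0..5 and that σ is chosen arbitrarily; only the 0.03 figure comes from the slide):

    import numpy as np

    LEVELS = np.arange(6)                        # battery levels / readings 0..5

    def meter_observation_matrix(sigma=0.5, p_transient=0.03):
        """O[o, x] = P(BMeter_t = o | Battery_t = x): with probability 0.03 the
        meter emits a transient 0 reading; otherwise the reading follows a
        discretized Gaussian centered on the true level."""
        O = np.zeros((len(LEVELS), len(LEVELS)))
        for x in LEVELS:
            gauss = np.exp(-0.5 * ((LEVELS - x) / sigma) ** 2)
            gauss /= gauss.sum()                 # normalize the discretized Gaussian
            O[:, x] = (1 - p_transient) * gauss
            O[0, x] += p_transient               # the transient-failure 0 reading
        return O

    # Sanity check: for a full battery, P(BMeter=0 | Battery=5) is about 0.03
    print(meter_observation_matrix()[0, 5])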
RESULTS ON TRANSIENT FAILURE
Meter reads 55555005555…

[Plot: E(Batteryt) over time, with and without the transient failure model, around the point where the transient failure occurs]
RESULTS ON PERSISTENT FAILURE
Meter reads 5555500000…

[Plot: E(Batteryt) over time under the transient failure model, around the point where the persistent failure occurs]
PERSISTENT FAILURE MODEL
[Figure: 2-TBN extended with a Broken variable: Brokent-1 → Brokent, Batteryt-1 → Batteryt, and both Brokent and Batteryt are parents of BMetert]

BMetert ~ N(Batteryt, σ)
P(BMetert=0 | Batteryt=5) = 0.03
P(BMetert=0 | Brokent) = 1
RESULTS ON PERSISTENT FAILURE
Meter reads 5555500000…

[Plot: E(Batteryt) over time, comparing the persistent failure model with the transient-only model, around the point where the persistent failure occurs]
HOW TO PERFORM INFERENCE ON A DBN?

Exact inference on the “unrolled” BN
 E.g., variable elimination
 Typical order: eliminate sequential time steps so that the full network is never actually constructed
 Unrolling is done only implicitly
[Figure: unrolled network over Br0,…,Br4, Ba0,…,Ba4, and BM1,…,BM4]
ENTANGLEMENT PROBLEM

After n time steps, all n variables in the belief state become dependent!
 Unless the 2-TBN can be partitioned into disjoint subsets (rare)
 The sparsity structure is lost
APPROXIMATE INFERENCE IN DBNS
Limited history updates
 Assumed factorization of belief state
 Particle filtering

INDEPENDENT FACTORIZATION
Idea: assume the belief state P(Xt) factors across individual attributes: P(Xt) = P(X1,t)*…*P(Xn,t)
 Filtering: only maintain the factored distributions P(X1,t|O1:t),…,P(Xn,t|O1:t)
 Filtering update: P(Xk,t|O1:t) = Σxt-1 P(Xk,t|Ot,xt-1) P(xt-1|O1:t-1), a marginal probability query over the 2-TBN

[Figure: 2-TBN with state variables X1,t-1,…,Xn,t-1 → X1,t,…,Xn,t and observations O1,t,…,Om,t]
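A sketch of the factored filtering update above (illustrative only), under the simplifying assumptions that each Xk,t has a known parent set among the time t-1 variables and its own observation Ok,t; the data layout is invented for the example:

    import numpy as np
    from itertools import product

    def factored_filter_step(marginals, parents, cpds, obs_models, obs):
        """One filtering step with a fully factored belief state.
        marginals[k][v]     approximates P(X_{k,t-1} = v | O_{1:t-1})
        parents[k]          = indices of X_{k,t}'s parents among the X_{.,t-1}
        cpds[k][pa_vals][v] = P(X_{k,t} = v | parents take values pa_vals)
        obs_models[k][o][v] = P(O_{k,t} = o | X_{k,t} = v)
        obs[k]              = the observed value of O_{k,t}."""
        new_marginals = []
        for k, pa in enumerate(parents):
            b = np.zeros(len(marginals[k]))
            # Predict: marginal query over the 2-TBN with a product-of-marginals prior
            for pa_vals in product(*[range(len(marginals[p])) for p in pa]):
                w = np.prod([marginals[p][v] for p, v in zip(pa, pa_vals)])
                b += w * np.asarray(cpds[k][pa_vals], dtype=float)
            # Update: reweight by this variable's observation, then renormalize
            b *= np.asarray(obs_models[k][obs[k]], dtype=float)
            new_marginals.append(b / b.sum())
        return new_marginals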
NEXT TIME

Viterbi algorithm
 Read K&F 13.2 for some context
Kalman and particle filtering
 Read K&F 15.3-4