Extending Expectation Propagation
for Graphical Models
Yuan (Alan) Qi
Joint work with Tom Minka
Motivation
• Graphical models are widely used in real-world applications, such as wireless communications and bioinformatics.
• Inference techniques on graphical models often sacrifice efficiency for accuracy, or sacrifice accuracy for efficiency.
• We need a method that better balances the trade-off between accuracy and efficiency.
Motivation
[Figure: error vs. computational time, contrasting current techniques with what we want]
Outline
• Background on expectation propagation (EP)
• Extending EP on Bayesian networks for dynamic systems
– Poisson tracking
– Signal detection for wireless communications
• Tree-structured EP on loopy graphs
• Conclusions and future work
Graphical Models
[Figure: example directed and undirected graphs over x1, x2, y1, y2]
Directed (Bayesian networks):
p(x, y) = \prod_i p(x_i \mid x_{pa(i)}) \prod_j p(y_j \mid x_{pa(j)})
Undirected (Markov networks):
p(x, y) = \frac{1}{Z} \prod_a \psi_a(x, y)
Inference on Graphical Models
• Bayesian inference techniques:
– Belief propagation (BP): Kalman filtering/smoothing, the forward-backward algorithm
– Monte Carlo: particle filters/smoothers, MCMC
• Loopy BP: typically efficient, but not accurate on general loopy graphs
• Monte Carlo: accurate, but often not efficient
Expectation Propagation in a Nutshell
• Approximate a probability distribution by simpler parametric terms:
p(x) = \prod_a f_a(x), \qquad q(x) = \prod_a \tilde{f}_a(x)
For directed graphs: f_a(x) = p(x_{i_a} \mid x_{j_a})
For undirected graphs: f_a(x) = \psi(x_a)
• Each approximation term \tilde{f}_a(x) lives in an exponential family (e.g., Gaussian)
EP in a Nutshell
• The approximate term \tilde{f}_a(x) minimizes the following KL divergence by moment matching:
\arg\min_{\tilde{f}_a(x)} D\left( f_a(x)\, q^{\setminus a}(x) \;\|\; \tilde{f}_a(x)\, q^{\setminus a}(x) \right)
where the leave-one-out approximation is
q^{\setminus a}(x) = \prod_{b \neq a} \tilde{f}_b(x) = \frac{q(x)}{\tilde{f}_a(x)}
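To make the moment-matching step concrete, here is a minimal sketch of a single EP site update for a 1-D Gaussian approximating family. The grid-based moment computation and the sigmoid factor are illustrative assumptions, not part of the original method.

```python
import numpy as np

def ep_site_update(f_a, cavity_mean, cavity_var):
    """One EP step: match the mean/variance of f_a(x) times the cavity,
    then divide out the cavity to get the new Gaussian site term."""
    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]
    cavity = np.exp(-0.5 * (x - cavity_mean) ** 2 / cavity_var)
    tilted = f_a(x) * cavity                          # unnormalized tilted distribution
    Z = tilted.sum() * dx
    mean = (x * tilted).sum() * dx / Z                # matched first moment
    var = ((x - mean) ** 2 * tilted).sum() * dx / Z   # matched second moment
    # Site parameters in natural form: new posterior divided by the cavity
    site_prec = 1.0 / var - 1.0 / cavity_var
    site_mean_times_prec = mean / var - cavity_mean / cavity_var
    return mean, var, site_prec, site_mean_times_prec

# Hypothetical non-Gaussian factor f_a(x) = sigmoid(3x), with cavity N(0, 1)
print(ep_site_update(lambda x: 1.0 / (1.0 + np.exp(-3.0 * x)), 0.0, 1.0))
```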
Limitations of Plain EP
• It can be difficult or expensive to analytically compute the moments needed to minimize the desired KL divergence.
• It can be expensive to compute and maintain a valid approximating distribution q(x) that is coherent under marginalization.
– Tree-structured q(x): \sum_{x_j} q(x_i, x_j) = q(x_i)
Three Extensions
1. Instead of choosing the approximate term \tilde{f}_a(x) to minimize the KL divergence
\arg\min_{\tilde{f}_a(x)} D\left( f_a(x)\, q^{\setminus a}(x) \;\|\; \tilde{f}_a(x)\, q^{\setminus a}(x) \right),
use other criteria.
2. Use numerical approximation to compute moments: quadrature or Monte Carlo (see the sketch below).
3. Allow the tree-structured q(x) to be non-coherent during the iterations. It only needs to be coherent in the end.
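As an illustration of extension 2, here is a minimal sketch of estimating the required moments by Monte Carlo: sample from a 1-D Gaussian cavity and importance-weight by the intractable factor. The heavy-tailed factor is a hypothetical example.

```python
import numpy as np

def mc_moments(f_a, cavity_mean, cavity_var, n_samples=10_000, seed=0):
    """Estimate the mean and variance of f_a(x) * cavity(x) by drawing
    samples from the Gaussian cavity and importance-weighting by f_a."""
    rng = np.random.default_rng(seed)
    x = rng.normal(cavity_mean, np.sqrt(cavity_var), n_samples)
    w = f_a(x)
    w = w / w.sum()                      # self-normalized importance weights
    mean = np.sum(w * x)
    var = np.sum(w * (x - mean) ** 2)
    return mean, var

# Hypothetical heavy-tailed factor with a N(0, 4) cavity
print(mc_moments(lambda x: 1.0 / (1.0 + x ** 2), 0.0, 4.0))
```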
Efficiency vs. Accuracy
[Figure: error vs. computational time, asking where extended EP falls between loopy BP (factorized EP) and Monte Carlo]
Outline
• Background on expectation propagation (EP)
• Extending EP on Bayesian networks for dynamic systems
– Poisson tracking
– Signal detection for wireless communications
• Tree-structured EP on loopy graphs
• Conclusions and future work
Object Tracking
Guess the position of an object given noisy observations.
[Figure: object positions x1, ..., x4 with noisy observations y1, ..., y4]
Bayesian Network
[Figure: chain x1 → x2 → ... → xT with observations y1, ..., yT]
e.g.
x_t = x_{t-1} + \nu_t  (random walk)
y_t = x_t + \text{noise}
We want the distribution of the x's given the y's.
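A minimal simulation of this generative model, with hypothetical noise levels, makes the inference target concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
T, process_std, obs_std = 100, 0.1, 0.5        # hypothetical noise levels

# Random walk x_t = x_{t-1} + nu_t, noisy observations y_t = x_t + noise
x = np.cumsum(rng.normal(0.0, process_std, T))
y = x + rng.normal(0.0, obs_std, T)

# Inference target: the posterior p(x_1, ..., x_T | y_1, ..., y_T)
```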
Approximation
p(x, y) = p(x_1)\, p(y_1 \mid x_1) \prod_{t>1} p(x_t \mid x_{t-1})\, p(y_t \mid x_t)
q(x) = p(x_1)\, \tilde{o}_1(x_1) \prod_{t>1} \tilde{p}_{t-1 \to t}(x_t)\, \tilde{p}_{t \to t-1}(x_{t-1})\, \tilde{o}_t(x_t)
Factorized and Gaussian in x
Message Interpretation
q(x_t) = \tilde{p}_{t-1 \to t}(x_t)\, \tilde{o}_t(x_t)\, \tilde{p}_{t+1 \to t}(x_t)
= (forward msg)(observation msg)(backward msg)
[Figure: node x_t receiving a forward message, a backward message, and an observation message from y_t]
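Since each message is Gaussian, combining them amounts to adding natural parameters. A minimal sketch, with message values made up for illustration:

```python
def combine_gaussian_messages(messages):
    """q(x_t) ~ (forward)(observation)(backward): multiply Gaussians by
    summing precisions and precision-weighted means."""
    precision = sum(1.0 / v for _, v in messages)
    mean_times_precision = sum(m / v for m, v in messages)
    return mean_times_precision / precision, 1.0 / precision  # (mean, variance)

# Hypothetical (mean, variance) for the forward, observation, and backward messages
print(combine_gaussian_messages([(0.2, 0.5), (1.0, 0.3), (0.4, 0.8)]))
```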
Extensions of EP
• Instead of matching moments, use any method for approximate filtering.
– Examples: statistical linearization, unscented Kalman filter (UKF), mixture of Kalman filters
– This turns any deterministic filtering method into a smoothing method!
– All of these methods can be interpreted as finding linear/Gaussian approximations to the original terms.
• Use quadrature or Monte Carlo for term approximations
Example: Poisson Tracking
• y_t is an integer-valued Poisson variate with mean \exp(x_t)
Poisson Tracking Model
p(x_1) \sim N(0, 100)
p(x_t \mid x_{t-1}) \sim N(x_{t-1}, 0.01)
p(y_t \mid x_t) = \exp(y_t x_t - e^{x_t}) / y_t!
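Because the Poisson likelihood is not conjugate to the Gaussian messages, the observation moments can be computed numerically (extension 2). A minimal Gauss-Hermite sketch, with the quadrature order chosen arbitrarily:

```python
import numpy as np
from scipy.special import gammaln

def poisson_obs_moments(y, prior_mean, prior_var, n_quad=30):
    """Mean/variance of the tilted distribution N(x; m, v) * Poisson(y | exp(x)),
    computed with Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    x = prior_mean + np.sqrt(2.0 * prior_var) * nodes     # quadrature points under N(m, v)
    log_lik = y * x - np.exp(x) - gammaln(y + 1)          # log Poisson(y | exp(x))
    w = weights * np.exp(log_lik)
    Z = w.sum()
    mean = np.sum(w * x) / Z
    var = np.sum(w * (x - mean) ** 2) / Z
    return mean, var

# One observation update under the prior p(x_1) ~ N(0, 100)
print(poisson_obs_moments(y=3, prior_mean=0.0, prior_var=100.0))
```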
Extended EP vs. Monte Carlo: Accuracy
[Figure: estimated posterior mean and variance]
Accuracy/Efficiency Tradeoff
[Figure: accuracy vs. computational time]
Bayesian Network for Wireless Signal Detection
[Figure: chain over s1, ..., sT and x1, ..., xT with observations y1, ..., yT]
s_i: transmitted signals
x_i: channel coefficients for digital wireless communications
y_i: received noisy observations
Extended-EP Joint Signal Detection and Channel Estimation
• Turn the mixture of Kalman filters into a smoothing method
• Smooth over the last L observations
• Observations before time (t − L) act as a prior for the current estimate
Computational Complexity
• Expectation propagation: O(nLd^2)
• Stochastic mixture of Kalman filters: O(LMd^2)
• Rao-Blackwellised particle smoothers: O(LMNd^2)
n: number of EP iterations (typically 4 or 5)
d: dimension of the parameter vector
L: length of the smoothing window
M: number of samples in filtering (often larger than 500)
N: number of samples in smoothing (larger than 50)
EP is about 5,000 times faster than Rao-Blackwellised particle smoothers.
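With the typical values listed above (M ≈ 500, N ≈ 50, n ≈ 5), the quoted speedup follows from the ratio of the two costs:
\frac{O(LMNd^2)}{O(nLd^2)} = \frac{MN}{n} \approx \frac{500 \times 50}{5} = 5{,}000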
Experimental Results
(Chen, Wang, Liu 2000)
[Figure: performance vs. signal-to-noise ratio]
EP outperforms particle smoothers in efficiency, with comparable accuracy.
Bayesian Networks for Adaptive Decoding
[Figure: chain over e1, ..., eT and x1, ..., xT with observations y1, ..., yT]
The information bits e_t are coded by a convolutional error-correcting encoder.
EP Outperforms Viterbi Decoding
[Figure: performance vs. signal-to-noise ratio]
Outline
• Background on expectation propagation (EP)
• Extending EP on Bayesian networks for dynamic systems
– Poisson tracking
– Signal detection for wireless communications
• Tree-structured EP on loopy graphs
• Conclusions and future work
Inference on Loopy Graphs
[Figure: 4×4 grid of nodes X1, ..., X16]
Problem: estimate the marginal distributions of the variables indexed by the nodes in a loopy graph, e.g., p(x_i), i = 1, ..., 16.
4-node Loopy Graph
[Figure: loopy graph over x1, x2, x3, x4]
The joint distribution is a product of pairwise potentials, one for each edge:
p(x) = \prod_a f_a(x)
We want to approximate p(x) by a simpler distribution.
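On a graph this small, the target marginals can also be computed exactly by brute force, which is what the approximations are compared against. A minimal sketch with hypothetical random potentials on a 4-node loop:

```python
import numpy as np
from itertools import product

def exact_marginals(potentials, n_nodes=4, n_states=2):
    """Brute-force single-node marginals of p(x) proportional to the
    product of pairwise potentials over the graph's edges."""
    marginals = np.zeros((n_nodes, n_states))
    Z = 0.0
    for x in product(range(n_states), repeat=n_nodes):
        p = 1.0
        for (i, j), psi in potentials.items():
            p *= psi[x[i], x[j]]
        Z += p
        for i in range(n_nodes):
            marginals[i, x[i]] += p
    return marginals / Z

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                  # hypothetical 4-node loop
potentials = {e: rng.uniform(0.5, 2.0, size=(2, 2)) for e in edges}
print(exact_marginals(potentials))
```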
BP vs. TreeEP
[Figure: BP uses a fully factorized approximation of the 4-node graph; TreeEP uses a tree-structured approximation]
Junction Tree Representation
[Figure: p(x) and its approximation q(x), each shown with its junction tree]
Two Kinds of Edges
• On-tree edges, e.g., (x1, x4): incorporated exactly into the junction tree
• Off-tree edges, e.g., (x1, x2): approximated by projecting them onto the tree structure
KL Minimization
• KL minimization ⇒ moment matching
• Match the single-node and pairwise marginals of the original graph and of the tree approximation
• Reduces to exact inference on single loops
– Use cutset conditioning
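For intuition, here is a minimal sketch of cutset conditioning on a single loop: fixing one node breaks the loop into a chain, which can be summed exactly for each value of that node. The pairwise potentials are hypothetical.

```python
import numpy as np

def loop_marginals_by_cutset(psi12, psi23, psi34, psi41, n_states=2):
    """Exact node marginals on the loop x1-x2-x3-x4-x1: condition on the
    cutset node x1 and sum the resulting chain exactly for each value."""
    marg = np.zeros((4, n_states))
    for v in range(n_states):
        # With x1 = v the remaining factors form a chain over (x2, x3, x4):
        # psi12[v, x2] * psi23[x2, x3] * psi34[x3, x4] * psi41[x4, v]
        joint = np.einsum('a,ab,bc,c->abc', psi12[v], psi23, psi34, psi41[:, v])
        marg[0, v] += joint.sum()
        marg[1] += joint.sum(axis=(1, 2))   # marginal over x2
        marg[2] += joint.sum(axis=(0, 2))   # marginal over x3
        marg[3] += joint.sum(axis=(0, 1))   # marginal over x4
    return marg / marg[0].sum()

rng = np.random.default_rng(1)
psi12, psi23, psi34, psi41 = [rng.uniform(0.5, 2.0, (2, 2)) for _ in range(4)]
print(loop_marginals_by_cutset(psi12, psi23, psi34, psi41))
```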
Matching Marginals on Graph
[Figure: junction tree over cliques (x1 x2), (x1 x3), (x1 x4), (x3 x5), (x3 x6), (x5 x7)]
(1) Incorporate edge (x3, x4)
(2) Incorporate edge (x6, x7)
Drawbacks of Global Propagation
• Updates all the cliques even when incorporating only one off-tree edge
– Computationally expensive
• Stores each off-tree data message as a whole tree
– Requires a large amount of memory
Solution: Local Propagation
• Allow q(x) to be non-coherent during the iterations. It only needs to be coherent in the end.
• Exploit the junction tree representation: only propagate information locally within the minimal loop (subtree) that is directly connected to the off-tree edge.
– Reduces computational complexity
– Saves memory
[Figure: local propagation on the junction tree]
(1) Incorporate edge (x3, x4)
(2) Propagate evidence
(3) Incorporate edge (x6, x7)
On this simple graph, local propagation runs roughly 2 times faster and uses roughly half the memory for storing messages, compared with plain EP.
New Interpretation of TreeEP
• Marries EP with the junction tree algorithm
• Can perform inference efficiently over hypertrees and hypernodes
4-node Graph
[Figure: accuracy and cost comparison of the methods below]
TreeEP = the proposed method
GBP = generalized belief propagation on triangles
TreeVB = variational tree
BP = loopy belief propagation = factorized EP
MF = mean field
Fully-connected Graphs
Results are averaged over 10 graphs with randomly generated potentials.
• TreeEP performs the same as or better than all other methods in both accuracy and efficiency!
8x8 Grids, 10 Trials

Method          | FLOPS      | Error
Exact           | 30,000     | 0
TreeEP          | 300,000    | 0.149
BP/double-loop  | 15,500,000 | 0.358
GBP             | 17,500,000 | 0.003
TreeEP versus BP and GBP
• TreeEP is always more accurate than BP and is often faster
• TreeEP is much more efficient than GBP and more accurate on some problems
• TreeEP converges more often than BP and GBP
Outline
• Background on expectation propagation (EP)
• Extending EP on Bayesian networks for dynamic systems
– Poisson tracking
– Signal detection for wireless communications
• Tree-structured EP on loopy graphs
• Conclusions and future work
Conclusions
• Extending EP on graphical models:
– Instead of minimizing KL divergence, use other sensible criteria to generate messages. This effectively turns any deterministic filtering method into a smoothing method.
– Use quadrature to approximate messages.
– Use local propagation to save computation and memory in tree-structured EP.
Conclusions
[Figure: error vs. computational time, comparing extended EP with state-of-the-art techniques]
• Extended EP algorithms outperform state-of-the-art inference methods on graphical models in the trade-off between accuracy and efficiency.
Future Work
• More extensions of EP:
– How to choose a sensible approximation family (e.g., which tree structure)?
– More flexible approximations: mixtures of EP?
– Error bounds?
– Bayesian conditional random fields
– EP for optimization (generalizing max-product)
• More real-world applications, e.g., classification of gene expression data
Classifying Colon Cancer Data by Predictive Automatic Relevance Determination
• The task: distinguish normal and cancer samples.
• The dataset: 22 normal and 40 cancer samples, with 2000 features per sample.
• The dataset was randomly split 100 times into 50 training and 12 testing samples.
• SVM results from Li et al. (2002).
End
Contact information:
yuanqi@media.mit.edu