Extending Expectation Propagation for Graphical Models
Yuan (Alan) Qi
Joint work with Tom Minka

Motivation
• Graphical models are widely used in real-world applications, such as wireless communications and bioinformatics.
• Inference techniques on graphical models often sacrifice efficiency for accuracy, or sacrifice accuracy for efficiency.
• We need a method that better balances the trade-off between accuracy and efficiency.

[Figure: error vs. computational time, contrasting current techniques with what we want.]

Outline
• Background on expectation propagation (EP)
• Extending EP on Bayesian networks for dynamic systems
  – Poisson tracking
  – Signal detection for wireless communications
• Tree-structured EP on loopy graphs
• Conclusions and future work

Graphical Models
• Directed (Bayesian networks):
    p(\mathbf{x}, \mathbf{y}) = \prod_i p(x_i \mid \mathbf{x}_{pa(i)}) \prod_j p(y_j \mid \mathbf{x}_{pa(j)})
• Undirected (Markov networks):
    p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_a \psi_a(\mathbf{x}, \mathbf{y})

Inference on Graphical Models
• Bayesian inference techniques:
  – Belief propagation (BP): Kalman filtering/smoothing, the forward-backward algorithm
  – Monte Carlo: particle filters/smoothers, MCMC
• Loopy BP: typically efficient, but not accurate on general loopy graphs
• Monte Carlo: accurate, but often not efficient

Expectation Propagation in a Nutshell
• Approximate a probability distribution by a product of simpler parametric terms:
    p(\mathbf{x}) = \prod_a f_a(\mathbf{x}),   q(\mathbf{x}) = \prod_a \tilde{f}_a(\mathbf{x})
  – For directed graphs: f_a(\mathbf{x}) = p(x_{i_a} \mid \mathbf{x}_{j_a})
  – For undirected graphs: f_a(\mathbf{x}) = \psi_a(\mathbf{x}_a)
• Each approximating term \tilde{f}_a(\mathbf{x}) lives in an exponential family (e.g., Gaussian).

EP in a Nutshell (continued)
• The approximating term \tilde{f}_a(\mathbf{x}) minimizes the following KL divergence by moment matching:
    \tilde{f}_a(\mathbf{x}) = \arg\min_{\tilde{f}_a(\mathbf{x})} D\big( f_a(\mathbf{x}) \, q^{\backslash a}(\mathbf{x}) \;\|\; \tilde{f}_a(\mathbf{x}) \, q^{\backslash a}(\mathbf{x}) \big)
  where the leave-one-out approximation is
    q^{\backslash a}(\mathbf{x}) = \frac{q(\mathbf{x})}{\tilde{f}_a(\mathbf{x})} = \prod_{b \neq a} \tilde{f}_b(\mathbf{x})

Limitations of Plain EP
• It can be difficult or expensive to analytically compute the moments needed to minimize the desired KL divergence.
• It can be expensive to compute and maintain a valid approximating distribution q(\mathbf{x}) that is coherent under marginalization.
  – For a tree-structured q(\mathbf{x}): \sum_{x_j} q(x_i, x_j) = q(x_i)

Three Extensions
1. Instead of choosing the approximating term \tilde{f}_a(\mathbf{x}) to minimize the KL divergence
    D\big( f_a(\mathbf{x}) \, q^{\backslash a}(\mathbf{x}) \;\|\; \tilde{f}_a(\mathbf{x}) \, q^{\backslash a}(\mathbf{x}) \big),
   use other criteria.
2. Use numerical approximation to compute the moments: quadrature or Monte Carlo.
3. Allow the tree-structured q(\mathbf{x}) to be non-coherent during the iterations; it only needs to be coherent in the end.

Efficiency vs. Accuracy
[Figure: error vs. computational time; extended EP aims to sit between loopy BP (factorized EP) and Monte Carlo.]

Object Tracking
Guess the position of an object given noisy observations.
[Figure: object trajectory x1–x4 with noisy observations y1–y4.]
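Before specializing EP to this tracking setup, a minimal sketch of the projection step described in the EP background above may help. It approximates one non-Gaussian term by a Gaussian via moment matching, with the moments computed numerically by Gauss-Hermite quadrature (extension 2); for concreteness it uses the Poisson observation term that appears later in the talk. All function and variable names here are illustrative, not from the talk.

```python
import numpy as np

def project_to_gaussian(log_f, cavity_mean, cavity_var, n_points=50):
    """EP projection step: match the mean and variance of the tilted
    distribution f_a(x) * q^{\\a}(x), where q^{\\a} is a Gaussian cavity,
    computing the moments by Gauss-Hermite quadrature."""
    # Gauss-Hermite nodes/weights for integrals against exp(-t^2)
    t, w = np.polynomial.hermite.hermgauss(n_points)
    # Change of variables: x = cavity_mean + sqrt(2 * cavity_var) * t
    x = cavity_mean + np.sqrt(2.0 * cavity_var) * t
    fw = w * np.exp(log_f(x))              # quadrature weights times f_a(x)
    Z = np.sum(fw)                         # normalizer (up to a constant)
    mean = np.sum(fw * x) / Z              # first moment of the tilted dist.
    var = np.sum(fw * x**2) / Z - mean**2  # central second moment
    return mean, var

def gaussian_divide(mean1, var1, mean2, var2):
    """Divide two Gaussians in natural parameters; this recovers the
    approximate term f~_a(x) = q_new(x) / q^{\\a}(x)."""
    prec = 1.0 / var1 - 1.0 / var2
    h = mean1 / var1 - mean2 / var2
    return h / prec, 1.0 / prec

# One EP update for a Poisson observation y with log-rate x,
# given a Gaussian cavity q^{\a}(x) = N(0, 4) (illustrative numbers).
y = 3
log_f = lambda x: y * x - np.exp(x)   # log Poisson term, dropping log(y!)
m_new, v_new = project_to_gaussian(log_f, 0.0, 4.0)
msg_mean, msg_var = gaussian_divide(m_new, v_new, 0.0, 4.0)
print(m_new, v_new, msg_mean, msg_var)
```

The same routine works for any term whose log can be evaluated pointwise, which is what makes the quadrature extension attractive when the moment integrals have no closed form.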
Bayesian Network for Tracking
[Figure: chain x1 → x2 → … → xT with observations y1, …, yT.]
For example:
    x_t = x_{t-1} + \nu_t   (random walk)
    y_t = x_t + \text{noise}
We want the distribution of the x's given the y's.

Approximation
    p(\mathbf{x}, \mathbf{y}) = p(x_1) \, p(y_1 \mid x_1) \prod_{t>1} p(x_t \mid x_{t-1}) \, p(y_t \mid x_t)
    q(\mathbf{x}) = p(x_1) \, \tilde{o}_1(x_1) \prod_{t>1} \tilde{p}_{t-1 \to t}(x_t) \, \tilde{p}_{t \to t-1}(x_{t-1}) \, \tilde{o}_t(x_t)
The approximation is factorized and Gaussian in x.

Message Interpretation
    q(x_t) = \tilde{p}_{t-1 \to t}(x_t) \, \tilde{o}_t(x_t) \, \tilde{p}_{t+1 \to t}(x_t)
           = (forward msg)(observation msg)(backward msg)
[Figure: forward, backward, and observation messages around node x_t.]

Extensions of EP
• Instead of matching moments, use any method for approximate filtering.
  – Examples: statistical linearization, the unscented Kalman filter (UKF), mixtures of Kalman filters.
  – This turns any deterministic filtering method into a smoothing method! All such methods can be interpreted as finding linear/Gaussian approximations to the original terms.
• Use quadrature or Monte Carlo for the term approximations.

Example: Poisson Tracking
• y_t is an integer-valued Poisson variate with mean \exp(x_t).

Poisson Tracking Model
    p(x_1) = N(0, 100)
    p(x_t \mid x_{t-1}) = N(x_{t-1}, 0.01)
    p(y_t \mid x_t) = \exp(y_t x_t - e^{x_t}) / y_t!

Extended EP vs. Monte Carlo: Accuracy
[Figure: estimated posterior mean and variance; extended EP is comparable to Monte Carlo.]

Accuracy/Efficiency Trade-off
[Figure: error vs. computational time for extended EP and Monte Carlo.]

Bayesian Network for Wireless Signal Detection
[Figure: chain over s1, …, sT and x1, …, xT with observations y1, …, yT.]
• s_i: transmitted signals
• x_i: channel coefficients for digital wireless communications
• y_i: received noisy observations

Extended-EP Joint Signal Detection and Channel Estimation
• Turn the mixture of Kalman filters into a smoothing method.
• Smooth over a window of the most recent observations.
• Observations before the window act as a prior for the current estimate.

Computational Complexity
• Expectation propagation: O(nLd^2)
• Stochastic mixture of Kalman filters: O(LMd^2)
• Rao-Blackwellised particle smoothers: O(LMNd^2)
Here n is the number of EP iterations (typically 4 or 5), d is the dimension of the parameter vector, L is the smoothing window length, M is the number of samples in filtering (often larger than 500), and N is the number of samples in smoothing (larger than 50). EP is about 5,000 times faster than Rao-Blackwellised particle smoothers.

Experimental Results
[Figure: bit-error rate vs. signal-to-noise ratio, on the setup of Chen, Wang, and Liu (2000).]
EP outperforms particle smoothers in efficiency, with comparable accuracy.

Bayesian Networks for Adaptive Decoding
[Figure: chain over e1, …, eT and x1, …, xT with observations y1, …, yT.]
The information bits e_t are coded by a convolutional error-correcting encoder.

EP Outperforms Viterbi Decoding
[Figure: bit-error rate vs. signal-to-noise ratio.]

Inference on Loopy Graphs
[Figure: a 4×4 grid of nodes x1, …, x16.]
Problem: estimate the marginal distributions of the variables indexed by the nodes of a loopy graph, e.g., p(x_i), i = 1, …, 16.

4-node Loopy Graph
[Figure: a single loop over x1, x2, x3, x4.]
The joint distribution is a product of pairwise potentials, one for each edge:
    p(\mathbf{x}) \propto \prod_a f_a(\mathbf{x})
We want to approximate p(\mathbf{x}) by a simpler distribution.
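Before comparing the candidate approximations, it may help to make concrete what is being approximated. The following brute-force reference computes, by exhaustive enumeration, the exact single-node marginals of a 4-node loop like the one above, assuming binary variables and randomly generated pairwise potentials; the potentials and all names are illustrative choices, not the talk's benchmark.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# 4-node loop: the edges of the cycle x1-x2-x3-x4-x1 (0-indexed here)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
# One random positive 2x2 potential per edge (illustrative choice)
potentials = {e: rng.uniform(0.5, 2.0, size=(2, 2)) for e in edges}

def joint(x):
    """Unnormalized p(x): the product of the pairwise edge potentials."""
    p = 1.0
    for (i, j) in edges:
        p *= potentials[(i, j)][x[i], x[j]]
    return p

# Enumerate all 2^4 configurations to get the exact marginals
states = list(itertools.product([0, 1], repeat=4))
weights = np.array([joint(x) for x in states])
Z = weights.sum()

marginals = np.zeros((4, 2))
for w, x in zip(weights, states):
    for i in range(4):
        marginals[i, x[i]] += w
marginals /= Z
print(marginals)  # exact p(x_i): the targets the approximations must match
```

Enumeration is exponential in the number of nodes, which is exactly why methods like BP and TreeEP are needed on larger graphs; on toy loops it supplies the ground truth against which their errors are measured.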
BP vs. TreeEP
[Figure: BP approximates the 4-node loop by a fully factorized graph; TreeEP approximates it by a spanning tree.]

Junction Tree Representation
[Figure: p(x) and q(x) each represented as a junction tree of cliques and separators.]

Two Kinds of Edges
• On-tree edges, e.g., (x1, x4): incorporated exactly into the junction tree.
• Off-tree edges, e.g., (x1, x2): approximated by projecting them onto the tree structure.

KL Minimization
• KL minimization reduces to moment matching: match the single-node and pairwise marginals of the original loopy graph and of its tree approximation.
• This reduces to exact inference on single loops.
  – Use cutset conditioning.

Matching Marginals on the Graph
[Figure: two global updates on the junction tree, (1) incorporating edge (x3, x4), then (2) incorporating edge (x6, x7); each update touches every clique.]

Drawbacks of Global Propagation
• It updates all the cliques even when incorporating a single off-tree edge, which is computationally expensive.
• It stores each off-tree data message as a whole tree, which requires a large amount of memory.

Solution: Local Propagation
• Allow q(x) to be non-coherent during the iterations; it only needs to be coherent in the end.
• Exploit the junction tree representation: propagate information only locally, within the minimal loop (subtree) that is directly connected to the off-tree edge.
  – This reduces the computational complexity.
  – It also saves memory.
[Figure: local updates, (1) incorporating edge (x3, x4), (2) propagating evidence, (3) incorporating edge (x6, x7).]
On this simple graph, local propagation runs roughly 2 times faster, and uses 2 times less memory to store messages, than plain (global-propagation) EP.

New Interpretation of TreeEP
• TreeEP marries EP with the junction tree algorithm.
• It can perform efficiently over hypertrees and hypernodes.

4-node Graph
[Figure: accuracy and efficiency on the 4-node loop.]
TreeEP = the proposed method; GBP = generalized belief propagation on triangles; TreeVB = variational tree; BP = loopy belief propagation = factorized EP; MF = mean field.

Fully Connected Graphs
[Figure: results averaged over 10 graphs with randomly generated potentials.]
TreeEP performs the same as or better than all the other methods in both accuracy and efficiency.

8×8 Grids, 10 Trials
Method            FLOPS         Error
Exact             30,000        0
TreeEP            300,000       0.149
BP/double-loop    15,500,000    0.358
GBP               17,500,000    0.003

TreeEP versus BP and GBP
• TreeEP is always more accurate than BP, and is often faster.
• TreeEP is much more efficient than GBP, and more accurate on some problems.
• TreeEP converges more often than BP and GBP.

Conclusions
• We extend EP on graphical models:
  – Instead of minimizing KL divergence, use other sensible criteria to generate messages. This effectively turns any deterministic filtering method into a smoothing method.
  – Use quadrature to approximate messages.
  – Use local propagation to save computation and memory in tree-structured EP.
[Figure: error vs. computational time, contrasting state-of-the-art techniques with extended EP.]
• Extended EP algorithms outperform state-of-the-art inference methods on graphical models in the trade-off between accuracy and efficiency.

Future Work
• More extensions of EP:
  – How to choose a sensible approximation family (e.g., which tree structure)?
  – More flexible approximations: mixtures of EP?
  – Error bounds?
  – Bayesian conditional random fields
  – EP for optimization (generalizing max-product)
• More real-world applications, e.g., classification of gene expression data.
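The final slide below reports one such gene-expression application, evaluated by repeated random train/test splits. As a sketch of that evaluation protocol only (not of the Predictive-ARD classifier itself, which the talk does not detail here), the following uses a hypothetical nearest-centroid stand-in on synthetic data of the slide's shape; every name and the classifier choice are assumptions for illustration.

```python
import numpy as np

def nearest_centroid_error(X_train, y_train, X_test, y_test):
    """Placeholder classifier: assign each test point to the class with
    the nearer (Euclidean) training-class centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    pred = (np.linalg.norm(X_test - c1, axis=1)
            < np.linalg.norm(X_test - c0, axis=1)).astype(int)
    return np.mean(pred != y_test)

def repeated_split_error(X, y, n_splits=100, n_train=50, seed=0):
    """Repeatedly split at random into n_train training samples and the
    rest for testing; report the mean test error over the splits."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        tr, te = perm[:n_train], perm[n_train:]
        errs.append(nearest_centroid_error(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(errs))

# Synthetic stand-in with the slide's shape: 62 samples, 2000 features
rng = np.random.default_rng(1)
y = np.array([0] * 22 + [1] * 40)          # 22 normal, 40 cancer
X = rng.normal(size=(62, 2000)) + 0.1 * y[:, None]
print(repeated_split_error(X, y))          # 100 splits of 50 train / 12 test
```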
Classifying Colon Cancer Data by Predictive Automatic Relevance Determination
• The task: distinguish normal from cancer samples.
• The dataset: 22 normal and 40 cancer samples, with 2000 features per sample.
• The dataset was randomly split 100 times into 50 training and 12 test samples.
• The SVM results are from Li et al. (2002).

End

Contact information: yuanqi@media.mit.edu