Inference in Gaussian and Hybrid Bayesian Networks
ICS 275B

Gaussian Distribution

$$P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Represented as N(mu, sigma^2), or as a triple (p, mu, sigma^2) where p = 1/sqrt(2*pi*sigma^2) is the normalizing constant.

[Figure: gaussian(x,0,1) vs. gaussian(x,1,1): changing the mean shifts the curve.]
[Figure: gaussian(x,0,1) vs. gaussian(x,0,2): increasing the variance widens and flattens the curve.]

Multivariate Gaussian

Definition: Let X_1, ..., X_n be a set of random variables. A multivariate Gaussian distribution over X_1, ..., X_n is parameterized by an n-dimensional mean vector mu and an n x n positive definite covariance matrix Sigma. It defines a joint density via:

$$P(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

Linear Gaussian Distribution

Definition: Let Y be a continuous node with continuous parents X_1, ..., X_k. We say that Y has a linear Gaussian model if it can be described using parameters beta_0, ..., beta_k and sigma^2 such that:

$$P(y \mid x_1, \ldots, x_k) = N(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k;\ \sigma^2)$$

Example: A ~ N(mu_a, sigma_a), with an arc A -> B and B ~ N(mu_b + w a; sigma_b).

[Figure: surface plot of the resulting joint density over (X, Y).]

Linear Gaussian Network

Definition: A linear Gaussian Bayesian network is a Bayesian network all of whose variables are continuous and all of whose CPDs are linear Gaussians.

Linear Gaussian BN <=> Multivariate Gaussian, so a linear Gaussian BN is a compact representation of a multivariate Gaussian joint.

Inference in Continuous Networks

With P(A) = N(mu_a, sigma_a) and P(B | A) = N(mu_b + w a; sigma_b):

$$P(B) = \int_A P(B \mid A)\,P(A), \qquad P(B \mid A)\,P(A) = N(c, C)$$

where, over the ordering (a, b),

$$C = (K^{-1} + M^{-1})^{-1}, \qquad K^{-1} = \begin{pmatrix} 1/\sigma_a & 0 \\ 0 & 0 \end{pmatrix}, \qquad M^{-1} = \begin{pmatrix} w^2/\sigma_b & -w/\sigma_b \\ -w/\sigma_b & 1/\sigma_b \end{pmatrix}$$

$$c = C K^{-1} \begin{pmatrix} \mu_a \\ 0 \end{pmatrix} + C M^{-1} \begin{pmatrix} 0 \\ \mu_b \end{pmatrix}$$

Marginalization

Write N(c, C) over (A, B) as

$$c = \begin{pmatrix} \mu'_a \\ \mu'_b \end{pmatrix}, \qquad C = \begin{pmatrix} \Sigma'_{aa} & \Sigma'_{ab} \\ \Sigma'_{ba} & \Sigma'_{bb} \end{pmatrix}$$

so c is a vector of two quantities and C a matrix of four quantities. Then N(mu'_b, Sigma'_bb) is the required answer P(B).

Problems: when we multiply two arbitrary Gaussians!

In the computation above, the inverses K^{-1} and M^{-1} are always well defined. However, the inverse of (K^{-1} + M^{-1}) is not!

Theoretical explanation: why is this the case?

Consider P(A | B) * P(B | X) = P(A, B | X). The inverse of an n x n matrix exists only when the matrix has rank n. Over the ordering (a, b, x),

$$K^{-1} = \begin{pmatrix} 1/\sigma_a & -w_b/\sigma_a & 0 \\ -w_b/\sigma_a & w_b^2/\sigma_a & 0 \\ 0 & 0 & 0 \end{pmatrix} \ \text{from } P(A \mid B), \qquad M^{-1} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1/\sigma_b & -w_x/\sigma_b \\ 0 & -w_x/\sigma_b & w_x^2/\sigma_b \end{pmatrix} \ \text{from } P(B \mid X).$$

If all sigmas and w's are assumed to be 1, then

$$K^{-1} + M^{-1} = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}$$

has rank 2 and so is not invertible.

Density vs. conditional

However, Theorem: if the product of the Gaussians represents a multivariate Gaussian density, then the inverse always exists. For example, for P(A|B) * P(B) = P(A,B) = N(c, C), the inverse of C always exists, because P(A,B) is a multivariate Gaussian density. But for P(A|B) * P(B|X) = P(A,B|X) = N(c, C), the inverse of C may not exist: P(A,B|X) is a conditional Gaussian.

Inference: a general algorithm

Computing the marginal of a given variable, say Z.

Step 1: Convert all conditional Gaussians to canonical form (g, h, K). For P(y | x_1, ..., x_k) = N(mu_y + w_1 x_1 + ... + w_k x_k; sigma_y^2), let w = [w_1, ..., w_k]^T; then, over the ordering (x_1, ..., x_k, y),

$$g = -\frac{\mu_y^2}{2\sigma_y^2} - \frac{1}{2}\log(2\pi\sigma_y^2), \qquad h = \frac{\mu_y}{\sigma_y^2}\begin{pmatrix} -w \\ 1 \end{pmatrix}, \qquad K = \frac{1}{\sigma_y^2}\begin{pmatrix} w w^T & -w \\ -w^T & 1 \end{pmatrix}$$

Step 2: Extend all g's, h's and K's to the same domain by adding 0's. For example, P(A | B) maps to (g, h, K) over (b, a) with

$$K = \frac{1}{\sigma_a}\begin{pmatrix} w_b^2 & -w_b \\ -w_b & 1 \end{pmatrix}$$

while P(B) maps to (g', h', K') with K' = [1/sigma_b] over (b) alone. Extending K' and K to the same domain, K remains the same and K' is changed to

$$K' = \begin{pmatrix} 1/\sigma_b & 0 \\ 0 & 0 \end{pmatrix}$$
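Steps 1 and 2 are mechanical, so they are easy to script. Below is a minimal NumPy sketch (the function names, the scope-as-ordered-list convention, and the example numbers are mine, not from the slides) that builds (g, h, K) for a linear Gaussian CPD over a fixed variable ordering, zero-padding for variables the CPD does not mention; multiplication is then just addition of the triples:

```python
import numpy as np

def to_canonical(mu, sigma2, w, scope, parents, child):
    """Step 1 + Step 2: canonical form (g, h, K) of the linear Gaussian
    P(child | parents) = N(mu + w'x; sigma2), laid out over the variable
    ordering `scope` and zero-padded for variables it does not mention."""
    n = len(scope)
    idx = {v: i for i, v in enumerate(scope)}
    c, p = idx[child], [idx[v] for v in parents]
    w = np.asarray(w, dtype=float)
    g = -mu**2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2)
    h = np.zeros(n)
    h[c] = mu / sigma2
    h[p] = -mu * w / sigma2
    K = np.zeros((n, n))
    K[c, c] = 1.0 / sigma2
    for i, pi in enumerate(p):
        K[c, pi] = K[pi, c] = -w[i] / sigma2
        for j, pj in enumerate(p):
            K[pi, pj] += w[i] * w[j] / sigma2
    return g, h, K

def multiply(f1, f2):
    """(g1,h1,K1) * (g2,h2,K2) = (g1+g2, h1+h2, K1+K2) over a common scope."""
    return f1[0] + f2[0], f1[1] + f2[1], f1[2] + f2[2]

# Example: P(a) = N(1, 2) and P(b | a) = N(0.5 + 0.3a; 1) over scope (a, b).
fa = to_canonical(1.0, 2.0, [], ['a', 'b'], [], 'a')
fb = to_canonical(0.5, 1.0, [0.3], ['a', 'b'], ['a'], 'b')
g, h, K = multiply(fa, fb)   # canonical form of the joint P(a, b)
```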
Step 3: Add all g's, all h's and all K's (this multiplies all the functions).

Step 4: Let the variables involved in the computation be X_1, X_2, ..., X_k, Z; the result represents P(X_1, ..., X_k, Z) = N(mu, Sigma).

Step 5: Extract the marginal. With

$$\mu = (\mu_1, \ldots, \mu_k, \mu_Z)^T, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \cdots & \Sigma_{1Z} \\ \vdots & & \vdots \\ \Sigma_{Z1} & \cdots & \Sigma_{ZZ} \end{pmatrix},$$

the answer is P(Z) = N(mu_Z, Sigma_ZZ).

Inference: computing the marginal of a given variable

For a continuous Gaussian Bayesian network, inference is therefore polynomial, O(N^3): the complexity of matrix inversion. So algorithms like belief propagation are not generally used when all variables are Gaussian. Can we do better than N^3? Use bucket elimination.

Bucket elimination

Algorithm elim-bel (Dechter 1996). The multiplication operator combines the functions in a bucket; the marginalization operator sums (integrates) out the bucket's variable, e.g. over b in bucket B:

bucket B: P(b|a), P(d|b,a), P(e|b,c)
bucket C: P(c|a), h^B(a,d,c,e)
bucket D: h^C(a,d,e)
bucket E: e=0, h^D(a,e)
bucket A: P(a), h^E(a)
Output: P(a|e=0). W* = 4, the "induced width" (max clique size).

Multiplication operator

Convert all functions to canonical form if necessary, extend all functions to the same variables, and then

(g1, h1, K1) * (g2, h2, K2) = (g1+g2, h1+h2, K1+K2).

Again our problem! An intermediate function such as h^B(a,d,c,e) does not represent a density, and so cannot be computed in the usual moment form N(mu, Sigma).

Solution: marginalize in canonical form

Although the intermediate functions computed in bucket elimination are conditional, we can marginalize directly in canonical form, so we can eliminate the problem of the non-existent inverse completely.

Algorithm: In each bucket, convert all functions to canonical form if necessary, multiply them, and marginalize out the bucket's variable in canonical form.

Theorem: The computed P(A) is a density and is correct.
Complexity: time and space O((w+1)^3), where w is the induced width of the ordering used.

Continuous node, discrete parents

Definition: Let X be a continuous node, and let U = {U_1, U_2, ..., U_n} be its discrete parents and Y = {Y_1, Y_2, ..., Y_k} be its continuous parents. We say that X has a conditional linear Gaussian (CLG) CPD if, for every value u in D(U), we have a set of (k+1) coefficients a_{u,0}, a_{u,1}, ..., a_{u,k} and a variance sigma_u^2 such that:

$$p(X \mid u, y) = N\left(a_{u,0} + \sum_{i=1}^{k} a_{u,i}\, y_i,\ \sigma_u^2\right)$$

CLG network

Definition: A Bayesian network is called a CLG network if every discrete node has only discrete parents, and every continuous node has a CLG CPD.

Inference in CLGs

Can we use the same algorithm? Yes, but the algorithm is unbounded if we are not careful. Reason: marginalizing discrete variables out of an arbitrary CLG function is not bounded. If we marginalize y and k out of f(x, y, i, k), where x and y are continuous and i and k are discrete binary variables, the result is a mixture of 4 Gaussians instead of 2, and the number of components keeps growing with each discrete variable summed out.

Solution: approximate the mixture of Gaussians by a single Gaussian.

Multiplication and marginalization

Multiplication: convert all functions to canonical form if necessary, extend all functions to the same variables, and add:
(g1, h1, K1) * (g2, h2, K2) = (g1+g2, h1+h2, K1+K2).

Marginalization: a strong (exact) marginal when marginalizing continuous variables; a weak (moment-matched) marginal when marginalizing discrete variables.

Problem while using this marginalization in bucket elimination: the weak marginal requires computing Sigma and mu, which is not possible when the inverse does not exist. Solution: use an ordering such that you never have to marginalize a discrete variable out of a function that has both discrete and continuous Gaussian variables.
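Approximating a mixture by the single Gaussian with the same first two moments is the moment-matching collapse behind the weak marginal. A minimal sketch in moment form (the function name and signature are mine, not from the slides):

```python
import numpy as np

def weak_marginal(weights, means, covs):
    """Collapse the mixture sum_i w_i N(mu_i, Sigma_i) to the single Gaussian
    with the same mean and covariance: the weak marginal obtained when a
    discrete variable is summed out."""
    w = np.asarray(weights, dtype=float)
    total = w.sum()               # overall weight of the summed-out entries
    w = w / total                 # normalized mixture weights
    means = [np.asarray(m, dtype=float) for m in means]
    mu = sum(wi * mi for wi, mi in zip(w, means))          # matched mean
    sigma = sum(wi * (Si + np.outer(mi - mu, mi - mu))     # matched covariance
                for wi, mi, Si in zip(w, means, covs))
    return total, mu, sigma
```

Matching the mean and covariance is exactly the sense in which Lauritzen's theorem, below, guarantees that the computed first and second moments are the true ones.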
Special case: computing the marginal at a discrete node

Homework: derive a bucket elimination algorithm for computing the marginal of a continuous variable.

Special case: a marginal on a discrete variable in a CLG is to be computed. B, C and D are continuous variables; A and E are discrete.

bucket B: P(b|a,e), P(d|b,a), P(d|b,c)
bucket C: P(c|a), h^B(a,d,c,e)
bucket D: h^C(a,d,e)
bucket E: P(e), h^D(a,e)
bucket A: P(a), h^E(a)
Output: P(a). W* = 4, the "induced width" (max clique size).

Complexity of the special case

Discrete width (w_d): the maximum number of discrete variables in a clique.
Continuous width (w_c): the maximum number of continuous variables in a clique.
Time: O(exp(w_d) + w_c^3). Space: O(exp(w_d) + w_c^3).

Algorithm for the general case: computing belief at a continuous node of a CLG

1. Convert all functions to canonical form.
2. Create a special tree-decomposition.
3. Assign functions to appropriate cliques (same as assigning functions to buckets).
4. Select a strong root.
5. Perform message passing.

Creating a special tree-decomposition

Moralize the Bayesian network, then select an ordering in which all continuous variables are eliminated before any discrete variable (this may increase the induced width). Strong elimination order (see the code sketch after this group of slides):
- First eliminate continuous variables.
- Eliminate a discrete variable only when no continuous variables remain.

Example: W and X are discrete variables (dim 2) and Y and Z are continuous; the moralized graph adds an edge between parents that share a child.

[Figure, elimination order (1)-(4): the moralized graph over w, x, y, z is triangulated by eliminating z first (1), then y (2), then the discrete variables (3)-(4).]

Cliques and the bucket tree (junction tree)

The elimination produces clique 1 = {w, y, z} and clique 2 = {w, x, y} (the root), connected through the separator {w, y}.

Assigning functions to cliques

Select a function and place it in an arbitrary clique that mentions all variables in the function.
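The strong elimination order above can be computed greedily. A small sketch (the adjacency-dict representation and the min-fill tie-breaking within each group are my choices; the slides only require continuous-before-discrete):

```python
import itertools

def strong_elimination_order(adj, discrete):
    """Greedy strong elimination on a moral graph: eliminate continuous
    variables before any discrete one, picking a min-fill vertex within the
    allowed group. `adj` maps each variable to the set of its neighbors.
    Returns the ordering and the clique created at each step."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order, cliques = [], []
    while adj:
        group = [v for v in adj if v not in discrete] or list(adj)
        def fill(v):   # number of edges that eliminating v would add
            return sum(1 for a, b in itertools.combinations(adj[v], 2)
                       if b not in adj[a])
        v = min(group, key=fill)
        ns = adj.pop(v)
        cliques.append({v} | ns)
        for a, b in itertools.combinations(ns, 2):  # connect v's neighbors
            adj[a].add(b); adj[b].add(a)
        for n in ns:
            adj[n].discard(v)
        order.append(v)
    return order, cliques

# The w, x, y, z example: moral graph of P(w) P(x) P(y|x) P(z|y,w).
adj = {'w': {'y', 'z'}, 'x': {'y'}, 'y': {'w', 'x', 'z'}, 'z': {'w', 'y'}}
order, cliques = strong_elimination_order(adj, discrete={'w', 'x'})
# order starts z, y; cliques include {w, y, z} and {w, x, y}, as in the slides.
```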
Strong root

We define a strong root as any node R in the bucket tree which satisfies the following property: for any pair (V, W) which are neighbors on the tree, with W closer to R than V,

$$V \setminus W \subseteq \Gamma \quad \text{or} \quad V \cap W \subseteq \Delta,$$

where Gamma is the set of continuous variables and Delta is the set of discrete variables.

[Figure: an example junction tree with the strong root marked.]

Message passing at a typical node

Node a contains the functions assigned to it by the tree-decomposition scheme, denoted p_j(a), and receives messages h_{x1->a}(sep(x1,a)), ..., h_{xn->a}(sep(xn,a)) from its neighbors. The message it sends to neighbor b is

$$h_{a \to b}(\mathrm{sep}(a,b)) = \sum_{a \setminus \mathrm{sep}(a,b)} \prod_j p_j(a) \prod_{i \ne b} h_{i \to a}(\mathrm{sep}(i,a)).$$

Message passing: two-pass algorithm (bucket-tree propagation)

Collect toward the root, then distribute away from the root. [Figure from P. Green: the collect and distribute passes on a junction tree.]

Let's look at the messages. With a strong root, the collect-evidence pass marginalizes out only continuous variables, i.e., strong marginals (integrals over C, D, L, Min, Mout in the figure), while the distribute-evidence pass may also sum out discrete variables, i.e., weak marginals (sums over W and B, over F, over B, over W).

Lauritzen's theorem

When message passing is performed such that collect-evidence contains only strong marginals (and distribute-evidence may contain weak marginals), the junction-tree algorithm is exact in the sense that the first moment (mean) and second moment (variance) it computes are the true moments.

Complexity

Polynomial (n^3) in the number of continuous variables in a clique; exponential in the number of discrete variables in a clique.

Possible options for approximation

- Ignore the strong-root assumption and use approximations such as MBTE, IJGP, or sampling.
- Respect the strong-root assumption and use approximations such as MBTE, IJGP, or sampling; then, if done in one pass of MBTE, inaccuracies are due only to the discrete variables.
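Checking the strong-root condition on a candidate root is a single walk over the tree. A sketch, assuming the junction tree is given as adjacency lists over clique ids (the representation and names are mine):

```python
def is_strong_root(tree, cliques, root, continuous):
    """Lauritzen's strong-root test: for every tree edge (V, W) with W closer
    to `root` than V, require V \\ W to contain only continuous variables, or
    V intersect W to contain only discrete ones. `tree` maps a clique id to
    its neighbor ids; `cliques` maps a clique id to its set of variables."""
    cont = set(continuous)
    seen, stack = {root}, [root]
    while stack:                      # walk the tree outward from the root
        w = stack.pop()
        for v in tree[w]:
            if v in seen:
                continue
            seen.add(v)
            stack.append(v)
            V, W = cliques[v], cliques[w]
            if not ((V - W) <= cont or (V & W).isdisjoint(cont)):
                return False
    return True

# The two-clique tree from the example: root {w, x, y}, leaf {w, y, z}.
cliques = {1: {'w', 'y', 'z'}, 2: {'w', 'x', 'y'}}
tree = {1: [2], 2: [1]}
assert is_strong_root(tree, cliques, root=2, continuous={'y', 'z'})
# Clique 1 would fail as root: the edge toward it drops the discrete x
# while the separator {w, y} still contains a continuous variable.
```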
Initialization (1)

The running example: w and x are discrete with two values each (dim 2), with P(w=0) = P(w=1) = 0.5 and P(x=0) = 0.4, P(x=1) = 0.6; y and z are two-dimensional continuous vectors, y a CLG child of x and z a CLG child of y and w.

[Figure: the network and its numeric CLG parameters: for each value of x, y ~ N(mu_x, Sigma_x); for each value of w, z ~ N(B_w y, Sigma_w).]

Initialization (2)

Convert every CPD to canonical form. Clique 1 (wyz) receives P(w), contributing g = log(0.5) for each value of w, and P(z|y,w), a canonical form over (y, z) for each value of w (g = -4.0629 for one value and g = -2.7854 for the other, each with a 4-dimensional h and a 4x4 K). Clique 2 (wxy, the root) receives P(x), contributing g = log(0.4) for x=0 and g = log(0.6) for x=1, and P(y|x): for x=0, h = [-0.02, 0.12]^T and K = [[0.1, 0], [0, 0.1]]; for x=1, g = -3.0310, h = [0.5, -0.5]^T and K = [[0.5, 0.5], [0.5, 0.5]].

Initialization (3)

Multiplying the functions within each clique: clique 1 holds one canonical form per value of w (g = -4.7560 and g = -3.4786, the sums of log(0.5) with the two P(z|y,w) forms); clique 2 holds one canonical form per (w, x) configuration (wx=00 and wx=10: g = -5.1308, h = [-0.02, 0.12]^T, K = [[0.1, 0], [0, 0.1]]; wx=01 and wx=11: g = -3.5418, h = [0.5, -0.5]^T, K = [[0.5, 0.5], [0.5, 0.5]]). The separator potential phi(wy) is empty.

Message passing

Collect evidence:
$$\phi^*(wy) = \int_z \phi(wyz), \qquad \phi^*(wxy) = \phi(wxy)\,\frac{\phi^*(wy)}{\phi(wy)}$$
Distribute evidence:
$$\phi^{**}(wy) = \sum_x \phi^*(wxy), \qquad \phi^{**}(wyz) = \phi(wyz)\,\frac{\phi^{**}(wy)}{\phi^*(wy)}$$

Collect evidence (1)

Strong marginals are computed directly in canonical form. Partition y = [y_1, y_2]^T, h = [h_1, h_2]^T and K into blocks K_11, K_12, K_21, K_22; then integrating y_1 out of (g, h, K) yields the canonical form (g-hat, h-hat, K-hat) over y_2 with

$$\hat{g} = g + \frac{1}{2}\left(p \log(2\pi) - \log|K_{11}| + h_1^T K_{11}^{-1} h_1\right), \qquad p = \dim(y_1),$$
$$\hat{h} = h_2 - K_{21} K_{11}^{-1} h_1, \qquad \hat{K} = K_{22} - K_{21} K_{11}^{-1} K_{12}.$$
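These integration formulas translate directly to NumPy. A sketch of the strong marginal of a canonical form (the index-list interface is my convention):

```python
import numpy as np

def strong_marginal(g, h, K, keep, drop):
    """Integrate the continuous variables indexed by `drop` out of the
    canonical form (g, h, K), returning a canonical form over `keep`
    (the strong-marginal formulas of Collect evidence (1))."""
    h = np.asarray(h, dtype=float).reshape(-1)
    K = np.asarray(K, dtype=float)
    K11 = K[np.ix_(drop, drop)]
    K12 = K[np.ix_(drop, keep)]
    K21 = K[np.ix_(keep, drop)]
    K22 = K[np.ix_(keep, keep)]
    h1, h2 = h[drop], h[keep]
    K11inv = np.linalg.inv(K11)   # exists when the form is a density in y1
    p = len(drop)
    g_hat = g + 0.5 * (p * np.log(2 * np.pi)
                       - np.log(np.linalg.det(K11))
                       + h1 @ K11inv @ h1)
    h_hat = h2 - K21 @ K11inv @ h1
    K_hat = K22 - K21 @ K11inv @ K12
    return g_hat, h_hat, K_hat
```

K_11 must be invertible, i.e., the potential must be a density in the variables being integrated out; during the collect pass toward a strong root this is exactly what the strong-root condition ensures.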
Collect evidence (2)

Marginalizing z out of clique 1 leaves, for each value of w, essentially just P(w): g = -0.6931 = log(0.5), with h and K numerically zero (entries on the order of 1e-16).

Collect evidence (3)

Multiplying the message phi*(wy) into clique 2 (the root) adds its (g, h, K) to every matching (w, x) entry: h and K are unchanged and each g drops by log 2 (e.g., for wx=01, g goes from -3.5418 to -4.2350, with h = [0.5, -0.5]^T and K = [[0.5, 0.5], [0.5, 0.5]] unchanged).

Distribute evidence (1)-(2)

Divide clique 1's potential by the stored separator message phi*(wy), i.e., subtract its (g, h, K): each g in clique 1 increases by 0.6931 (from -4.7560 back to -4.0629, and from -3.4786 back to -2.7854).

Distribute evidence (3)

Compute the new separator message by marginalizing x out of the root potential phi*(wxy). Because x is discrete, this is a weak marginal: for both w=0 and w=1,

logp = -0.6931, mu = [0.52, -0.12]^T, Sigma = [[5.5456, -0.6336], [-0.6336, 6.3616]].

Distribute evidence (4)

Convert the weak marginal back to canonical form: for each value of w, g = -4.3316, h = [0.0927, -0.0096]^T, K = [[0.1824, 0.0182], [0.0182, 0.1590]].

Distribute evidence (5)

Multiply the new separator message phi**(wy) into clique 1 to obtain phi**(wyz): for W=0, g = -8.3935; for W=1, g = -7.1170, with the correspondingly updated h vectors and K matrices.

After message passing

Clique 1 holds p(w, y, z), clique 2 (the root) holds p(w, x, y), and the separator holds p(w, y): the local marginal distributions.
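To report such local marginals in the moment form used by the example (logp, mu, Sigma), a canonical form that represents a density is inverted back via Sigma = K^{-1} and mu = Sigma h. A small sketch (the function name is mine):

```python
import numpy as np

def canonical_to_moment(g, h, K):
    """Moment form of a canonical form that represents a (possibly
    unnormalized) density: Sigma = K^-1, mu = Sigma h, and log-weight
    logp = g + 0.5 * (h'mu + n*log(2*pi) - log|K|)."""
    h = np.asarray(h, dtype=float).reshape(-1)
    K = np.asarray(K, dtype=float)
    sigma = np.linalg.inv(K)          # exists because the form is a density
    mu = sigma @ h
    n = len(h)
    logp = g + 0.5 * (h @ mu + n * np.log(2 * np.pi)
                      - np.log(np.linalg.det(K)))
    return logp, mu, sigma

# Consistent with the example: the distribute-pass separator entry
# (g = -4.3316, h = [0.0927, -0.0096], K = [[0.1824, 0.0182], [0.0182, 0.1590]])
# maps back to logp = log(0.5) = -0.6931, mu = [0.52, -0.12],
# Sigma = [[5.5456, -0.6336], [-0.6336, 6.3616]].
print(canonical_to_moment(-4.3316, [0.0927, -0.0096],
                          [[0.1824, 0.0182], [0.0182, 0.1590]]))
```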