Inference in Continuous and Hybrid Networks (slides)

Inference in Gaussian and
Hybrid Bayesian Networks
ICS 275B
Gaussian Distribution
P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

Represented as N(μ, σ) or as a triple (p, μ, σ), where p = 1/(√(2π)·σ).
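As a quick check of this representation, a minimal numpy sketch (the function names are ours, not from the slides):

import numpy as np

def gaussian_triple(mu, sigma):
    # The (p, mu, sigma) triple for N(mu, sigma).
    p = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    return (p, mu, sigma)

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma) at x: p * exp(-(x - mu)^2 / (2 sigma^2)).
    p, _, _ = gaussian_triple(mu, sigma)
    return p * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(gaussian_pdf(0.0, 0.0, 1.0))  # ~0.3989, the peak of gaussian(x,0,1)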
[Plot: gaussian(x,0,1) vs. gaussian(x,1,1). Changing the mean μ shifts the bell curve along the x-axis without changing its shape.]
[Plot: gaussian(x,0,1) vs. gaussian(x,0,2). Increasing σ widens and flattens the curve.]
Multivariate Gaussian
Definition:
Let X1,…,Xn be a set of random variables. A multivariate Gaussian distribution over X1,…,Xn is parameterized by an n-dimensional mean vector μ and an n × n positive definite covariance matrix Σ. It defines a joint density via:

P(X) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)
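A numpy sketch evaluating this density (the function name and numbers are illustrative, not from the slides):

import numpy as np

def mvn_pdf(x, mu, sigma):
    # Density of the multivariate Gaussian N(mu, sigma) at the point x.
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])  # positive definite covariance
print(mvn_pdf(np.array([0.0, 0.0]), mu, sigma))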
Linear Gaussian Distribution
Definition:
Let Y be a continuous node with continuous parents X1,…,Xk. We say that Y has a linear Gaussian model if it can be described using parameters μy, β1, …, βk and σ² such that:

P(y | x1,…,xk) = N(μy + β1·x1 + … + βk·xk ; σ²) = N([μy, β1, …, βk] ; σ²)

Example: a two-node network A → B with
A ~ N(μa, σa)
B ~ N(w·a + μb, σb)
[Surface plot: the joint density over (X, Y) for a linear Gaussian X = β0 + β1·Y1 + … + βk·Yk plus Gaussian noise.]
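A sampling sketch of a linear Gaussian CPD (all parameter values are our own illustration). It also previews the next slide: the marginal of a linear Gaussian child is again Gaussian.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CPD P(y | x) = N(beta0 + beta1 * x ; sigma^2).
beta0, beta1, sigma = 1.0, 2.0, 0.5

x = rng.normal(0.0, 1.0, size=100_000)                       # parent X ~ N(0, 1)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)

# Marginal of y: mean beta0, variance beta1^2 * 1 + sigma^2 = 4.25.
print(y.mean(), y.var())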
Linear Gaussian Network
Definition
A linear Gaussian Bayesian network is a Bayesian network all of whose variables are continuous and all of whose CPTs are linear Gaussians.

Linear Gaussian BN ⇔ Multivariate Gaussian
⇒ a linear Gaussian BN has a compact representation.
Inference in Continuous Networks
P(A) = N(μa, σa)      P(B|A) = N(μb + wa·A ; σb)      (network: A → B)

P(B) = ∫A P(B|A) · P(A)

P(B|A) · P(A) = N(c, C), computed over the pair (A, B):

C = ( K⁻¹ + M⁻¹ )⁻¹

where  K⁻¹ = [ 1/σa  0 ;  0  0 ]   and   M⁻¹ = [ wa²/σb  -wa/σb ;  -wa/σb  1/σb ]

c = C·K⁻¹·[μa ; 0] + C·[ -(μb/σb)·wa ;  μb/σb ]
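A numeric sketch of this product formula, with made-up parameter values (the slides give none here):

import numpy as np

mu_a, s_a = 0.0, 1.0             # P(A) = N(0, 1)
mu_b, w_a, s_b = 0.0, 2.0, 0.5   # P(B|A) = N(2A + 0 ; 0.5)

K_inv = np.array([[1 / s_a, 0.0],
                  [0.0,     0.0]])            # precision of P(A), extended to (A, B)
M_inv = np.array([[w_a**2 / s_b, -w_a / s_b],
                  [-w_a / s_b,    1 / s_b]])  # precision of P(B|A)

C = np.linalg.inv(K_inv + M_inv)
c = C @ (K_inv @ np.array([mu_a, 0.0])
         + np.array([-(mu_b / s_b) * w_a, mu_b / s_b]))

print(c)  # joint mean of (A, B)
print(C)  # joint covariance; C[1, 1] = w_a^2 * s_a + s_b = 4.5 is the variance of P(B)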
Marginalization
∫A N(c, C):

c is a vector containing two quantities:  c = [μ'a ; μ'b]

C is a matrix of four quantities:  C = [ σ'aa  σ'ab ;  σ'ba  σ'bb ]

N(μ'b, σ'bb) is the required answer P(B).
Problems: When we multiply two arbitrary Gaussians!

P(A) = N(μa, σa)      P(B|A) = N(μb + wa·A ; σb)

P(B) = ∫A P(B|A) · P(A),   P(B|A) · P(A) = N(c, C)

C = ( K⁻¹ + M⁻¹ )⁻¹,   c = C·K⁻¹·[μa ; 0] + C·[ -(μb/σb)·wa ; μb/σb ]

The inverses K⁻¹ and M⁻¹ are always well defined. However, the inverse ( K⁻¹ + M⁻¹ )⁻¹ is not!
Theoretical explanation: Why is this the case?

P(A|B) · P(B|X) = P(A,B|X)

The inverse of an n × n matrix exists only when the matrix has rank n. Assume all sigmas and w's are 1. Over the ordering (A, B, X):

K⁻¹ = [ 1/σa    -wb/σa   0 ;      = [ 1  -1  0 ;
        -wb/σa  wb²/σa   0 ;         -1   1  0 ;
        0       0        0 ]          0   0  0 ]

M⁻¹ = [ 0  0        0      ;      = [ 0   0   0 ;
        0  1/σb     -wx/σb ;          0   1  -1 ;
        0  -wx/σb   wx²/σb ]          0  -1   1 ]

C = ( K⁻¹ + M⁻¹ )⁻¹ = [ 1 -1 0 ; -1 2 -1 ; 0 -1 1 ]⁻¹

(K⁻¹ + M⁻¹) has rank 2 and so is not invertible.
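A short numpy check of this rank argument:

import numpy as np

# All sigmas and w's set to 1, over the ordering (A, B, X).
K_inv = np.array([[1., -1., 0.],
                  [-1., 1., 0.],
                  [0.,  0., 0.]])   # canonical precision of P(A|B)
M_inv = np.array([[0., 0.,  0.],
                  [0., 1., -1.],
                  [0., -1., 1.]])   # canonical precision of P(B|X)

print(np.linalg.matrix_rank(K_inv + M_inv))  # 2, not 3, so the sum is not invertible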
Density vs conditional
However:

Theorem: If the product of the Gaussians represents a multivariate Gaussian density, then the inverse always exists.

For example, for P(A|B) · P(B) = P(A,B) = N(c, C), the inverse of C always exists: P(A,B) is a multivariate Gaussian (density).

But for P(A|B) · P(B|X) = P(A,B|X) = N(c, C), the inverse of C may not exist: P(A,B|X) is a conditional Gaussian.
Inference: A general algorithm for computing the marginal of a given variable, say Z

Step 1: Convert all conditional Gaussians to canonical form.

P(y | x1,…,xk) = N(μy ; w1,…,wk, σy²)  →  (g, h, K)

Let w = [w1,…,wk]. Then:

g = -μy²/(2σy²) - (1/2)·log(2π·σy²)

h = (μy/σy²) · [ -w ; 1 ]

K = (1/σy²) · [ w·w'  -w ;  -w'  1 ]
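A sketch of Step 1 in numpy, assuming the variable ordering (x1, …, xk, y); the function name is ours:

import numpy as np

def to_canonical(mu_y, w, sigma2):
    # Canonical form (g, h, K) of P(y | x1..xk) = N(mu_y + w'x ; sigma2).
    w = np.atleast_1d(np.asarray(w, dtype=float))
    g = -mu_y**2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2)
    h = (mu_y / sigma2) * np.concatenate([-w, [1.0]])
    K = np.block([[np.outer(w, w), -w[:, None]],
                  [-w[None, :],    np.ones((1, 1))]]) / sigma2
    return g, h, K

# P(y | x) = N(0.5 + 2x ; 1), canonical over (x, y):
print(to_canonical(0.5, [2.0], 1.0))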
Inference: A general algorithm for computing the marginal of a given variable, say Z

Step 2: Extend all g's, h's and K's to the same domain by adding 0's.

P(A|B) → (g, h, K) with, over (A, B):

K = [ 1/σa    -wb/σa ;
      -wb/σa  wb²/σa ]

P(B) → (g', h', K') with K' = [ 1/σb ]

Extending K and K' to the same domain, K remains the same, while K' is changed to:

K' = [ 0  0 ;
       0  1/σb ]
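A sketch of Step 2: embedding a canonical form into a larger domain by zero-padding (the helper name is ours):

import numpy as np

def extend(g, h, K, idx, n):
    # Embed (g, h, K) over a subset of variables into an n-variable domain;
    # idx maps the function's variables to their global positions.
    H = np.zeros(n)
    M = np.zeros((n, n))
    H[idx] = h
    M[np.ix_(idx, idx)] = K
    return g, H, M

# P(B), over B alone with K' = [1/sigma_b] (sigma_b = 1), extended to (A, B):
print(extend(0.0, np.array([0.0]), np.array([[1.0]]), [1], 2))
# K' becomes [[0, 0], [0, 1]]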
Inference: A general algorithm for computing the marginal of a given variable, say Z

Step 3: Add all g's, all h's and all K's.

Step 4: Let the variables involved in the computation be: P(X1,X2,…,Xk,Z) = N(μ, Σ).
Inference: A general algorithm for computing the marginal of a given variable, say Z

Step 5: Extract the marginal.

P(Z) = N(μz, σz), where μz is the Z entry of

μ = [ μ1 ; … ; μk ; μz ]

and σz = Σzz is the (Z, Z) entry of

Σ = [ Σ11 … Σ1k  Σ1z ;  … ;  Σz1 … Σzk  Σzz ]
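A sketch of Steps 4 and 5: converting a canonical form that is a density (invertible K) back to moment form, from which the marginal entries are read off:

import numpy as np

def to_moment(g, h, K):
    # Moment form (logp, mu, Sigma) of the canonical form (g, h, K).
    Sigma = np.linalg.inv(K)
    mu = Sigma @ h
    n = len(h)
    logp = g + 0.5 * (n * np.log(2 * np.pi)
                      - np.log(np.linalg.det(K)) + h @ mu)
    return logp, mu, Sigma

# A normalized standard normal: logp = 0, mu = 0, Sigma = 1.
print(to_moment(-0.5 * np.log(2 * np.pi), np.zeros(1), np.eye(1)))

# P(Z) is then N(mu[z], Sigma[z, z]) for the index z of variable Z.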
Inference: Computing the marginal of a given variable

For a continuous Gaussian Bayesian network, inference is polynomial, O(N³), the complexity of matrix inversion. So algorithms like belief propagation are not generally used when all variables are Gaussian.

Can we do better than N³? Use bucket elimination.
Bucket elimination
Algorithm elim-bel (Dechter 1996)

Each bucket multiplies its functions (multiplication operator) and applies the marginalization operator to its variable (e.g. Σb in bucket B):

bucket B:  P(b|a), P(d|b,a), P(e|b,c)
bucket C:  P(c|a), h^B(a, d, c, e)
bucket D:  h^C(a, d, e)
bucket E:  e=0, h^D(a, e)
bucket A:  P(a), h^E(a)

Output: P(a|e=0).  W* = 4, the "induced width" (max clique size). Network nodes: A, B, C, D, E.
Multiplication Operator

• Convert all functions to canonical form if necessary.
• Extend all functions to the same variables.
• (g1, h1, K1) * (g2, h2, K2) = (g1+g2, h1+h2, K1+K2)
Again our problem!

h^B(a, d, c, e) does not represent a density and so cannot be represented in our usual form N(μ, Σ):

bucket B:  P(b|a), P(d|b,a), P(e|b,c)
bucket C:  P(c|a), h^B(a, d, c, e)
bucket D:  h^C(a, d, e)
bucket E:  P(e), h^D(a, e)
bucket A:  P(a), h^E(a)

Output: P(a).  W* = 4, the "induced width" (max clique size).
Solution: Marginalize in canonical form

Although the intermediate functions computed in bucket elimination are conditional, we can marginalize in canonical form, which eliminates the problem of the non-existent inverse completely.
Algorithm

In each bucket, convert all functions to canonical form if necessary, multiply them, and marginalize out the bucket's variable in canonical form, as shown on the previous slide.

Theorem: P(A) is a density and is correct.
Complexity: time and space O((w+1)³), where w is the induced width of the ordering used.
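A minimal end-to-end sketch of this algorithm on the chain A → B, with made-up numbers. Note that the only matrix inverted is the block of the variable being eliminated, so the non-invertibility of conditional potentials never arises:

import numpy as np

# P(A) = N(0, 1) and P(B|A) = N(2a + 0.5 ; 0.5), in canonical form over (A, B).
g1, h1, K1 = (-0.5 * np.log(2 * np.pi), np.zeros(2),
              np.array([[1.0, 0.0], [0.0, 0.0]]))
mu_b, w, s = 0.5, 2.0, 0.5
g2 = -mu_b**2 / (2 * s) - 0.5 * np.log(2 * np.pi * s)
h2 = (mu_b / s) * np.array([-w, 1.0])
K2 = np.array([[w * w, -w], [-w, 1.0]]) / s

# Bucket A: multiply (add g's, h's, K's) ...
g, h, K = g1 + g2, h1 + h2, K1 + K2

# ... then marginalize A (index 0) out in canonical form.
K11, K12, K21, K22 = K[0, 0], K[0, 1:], K[1:, 0], K[1:, 1:]
g_m = g + 0.5 * (np.log(2 * np.pi) - np.log(K11) + h[0]**2 / K11)
h_m = h[1:] - K21 * (h[0] / K11)
K_m = K22 - np.outer(K21, K12) / K11

print(h_m / K_m[0, 0], 1 / K_m[0, 0])  # P(B): mean 0.5, variance 4.5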
Continuous Node, Discrete Parents
Definition:
Let X be a continuous node, and let U = {U1,U2,…,Un} be its discrete parents and Y = {Y1,Y2,…,Yk} its continuous parents. We say that X has a conditional linear Gaussian (CLG) CPT if, for every value u ∈ D(U), we have a set of (k+1) coefficients a(u,0), a(u,1), …, a(u,k) and a variance σu² such that:

p(X | u, y) = N( a(u,0) + Σ_{i=1..k} a(u,i)·yi ;  σu² )
CLG Network
Definition:
A Bayesian network is called a CLG network if
every discrete node has only discrete parents,
and every continuous node has a CLG CPT.
Inference in CLGs
Can we use the same algorithm?

Yes, but the algorithm is unbounded if we are not careful.

Reason: marginalizing out discrete variables from an arbitrary function in a CLG is not bounded. For example, if we marginalize out y and k from f(x, y, i, k), where x and y are continuous and i and k are discrete binary variables, the result is a mixture of 4 Gaussians instead of 2.

Solution: Approximate the mixture of Gaussians by a single Gaussian.
Multiplication and Marginalization

Multiplication:
• Convert all functions to canonical form if necessary.
• Extend all functions to the same variables.
• (g1, h1, K1) * (g2, h2, K2) = (g1+g2, h1+h2, K1+K2)

Marginalization:
• Strong marginal when marginalizing continuous variables.
• Weak marginal when marginalizing discrete variables.
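A sketch of the weak marginal: moment matching replaces a Gaussian mixture by the single Gaussian with the same mean and covariance. The mixture below is the one from the worked example at the end of these slides, so the output can be checked against the printed numbers:

import numpy as np

def collapse(weights, mus, sigmas):
    # Weak marginal of a mixture: match the first and second moments.
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    mu = sum(pi * m for pi, m in zip(p, mus))
    second = sum(pi * (S + np.outer(m, m)) for pi, m, S in zip(p, mus, sigmas))
    return mu, second - np.outer(mu, mu)

mu, sigma = collapse(
    [0.4, 0.6],
    [np.array([-0.2, 1.2]), np.array([1.0, -1.0])],
    [10 * np.eye(2), 2 * np.eye(2)],
)
print(mu)     # [ 0.52 -0.12]
print(sigma)  # [[ 5.5456 -0.6336], [-0.6336  6.3616]]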
Problem while using this marginalization in bucket elimination

Computing the weak marginal requires computing μ and Σ, which is not possible when the inverse does not exist.

Solution: Use an ordering such that you never have to marginalize out discrete variables from a function that has both discrete and continuous Gaussian variables.

Special case: computing the marginal at a discrete node.
Homework: derive a bucket elimination algorithm for computing the marginal of a continuous variable.
Special Case: A marginal on a discrete variable in a CLG is to be computed

B, C and D are continuous variables; A and E are discrete.

bucket B:  P(b|a,e), P(d|b,a), P(d|b,c)
bucket C:  P(c|a), h^B(a, d, c, e)
bucket D:  h^C(a, d, e)
bucket E:  P(e), h^D(a, e)
bucket A:  P(a), h^E(a)

Output: P(a).  W* = 4, the "induced width" (max clique size).
Complexity of the special case
• Discrete width (wd): maximum number of discrete variables in a clique.
• Continuous width (wc): maximum number of continuous variables in a clique.
• Time: O(exp(wd) + wc³)
• Space: O(exp(wd) + wc³)
Algorithm for the general case: Computing belief at a continuous node of a CLG

• Convert all functions to canonical form.
• Create a special tree-decomposition.
• Assign functions to appropriate cliques (same as assigning functions to buckets).
• Select a strong root.
• Perform message passing.
Creating a special tree-decomposition

• Moralize the Bayesian network.
• Select an ordering such that all continuous variables are ordered before discrete variables (this may increase the induced width).

Elimination order

Strong elimination order:
• First eliminate the continuous variables.
• Eliminate a discrete variable only when no continuous variables are available.

[Diagram: a four-node network over w, x, y, z, where W and X are discrete and Y and Z are continuous; the moralized graph adds an edge between the parents.]
Elimination order (1)-(4)

[Diagrams: z is eliminated first (step 1), then y (step 2), and only then the discrete x (step 3) and w (step 4); each variable has dimension 2. The elimination produces Clique 1 = {w, y, z} and Clique 2 = {w, x, y}, with separator {w, y}.]

Bucket tree or junction tree

Clique 2 (root): {w, x, y}
    separator: {w, y}
Clique 1: {w, y, z}
Assigning Functions to cliques

• Select a function and place it in an arbitrary clique that mentions all variables in the function.
Strong Root
We define a strong root as any node R in the bucket tree which satisfies the following property: for any pair (V, W) of neighbors on the tree with W closer to R than V, we have

(V \ W) ⊆ Γ   or   (V ∩ W) ⊆ Δ

where Γ is the set of continuous variables and Δ is the set of discrete variables.

[Diagram: an example junction tree with its strong root highlighted.]
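A small sketch that checks the strong root property on a clique tree (the tree encoding and helper are hypothetical, for illustration):

def is_strong_root(tree, root, continuous, discrete):
    # tree: maps each clique (a frozenset of variables) to its neighbors.
    def ok(v, w):  # w is the neighbor closer to the root
        return (v - w) <= continuous or (v & w) <= discrete

    def walk(w, v):
        return ok(v, w) and all(walk(v, u) for u in tree[v] if u != w)

    return all(walk(root, v) for v in tree[root])

# The two-clique tree from the earlier slides: {w,x,y} as root, {w,y,z} below.
c1, c2 = frozenset("wyz"), frozenset("wxy")
tree = {c1: [c2], c2: [c1]}
print(is_strong_root(tree, c2, continuous=set("yz"), discrete=set("wx")))  # True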
Message passing at a typical node

Node a contains the functions assigned to it by the tree-decomposition scheme, denoted p_j(a), and receives the messages h_{x1→a}(sep(x1,a)), …, h_{xn→a}(sep(xn,a)) from its other neighbors. The message it sends to neighbor b is:

h_{a→b}(sep(a,b)) = ⇓_{a \ sep(a,b)}  Π_{i≠b} h_{i→a}(sep(i,a)) · Π_j p_j(a)

where ⇓ marginalizes out the variables of a that are not in the separator sep(a,b).
Message Passing
Two-pass algorithm: bucket-tree propagation.

[Diagram: collect messages flow toward the root; distribute messages flow away from it. Figure from P. Green.]
Let's look at the messages

[Diagram: in the collect-evidence pass toward the strong root, every message marginalizes continuous variables only (∫C, ∫D, ∫L, ∫Min∫D, ∫Mout), i.e. strong marginals. In the distribute-evidence pass away from the root, messages may also sum out discrete variables (∫E∑W,B, ∑F, ∫E∑B, ∑W), i.e. weak marginals.]
Lauritzen's theorem

When message passing is performed such that collect evidence contains only strong marginals and distribute evidence may contain weak marginals, the junction-tree algorithm is exact in the sense that:

• the first moments (means) and second moments (variances) computed are the true moments.
Complexity
• Polynomial in the number of continuous variables in a clique (n³).
• Exponential in the number of discrete variables in a clique.
Possible options for approximation
• Ignore the strong root assumption and use approximations like MBTE, IJGP, or sampling.
• Respect the strong root assumption and use approximations like MBTE, IJGP, or sampling.
  • Inaccuracies are then due only to the discrete variables if done in one pass of MBTE.
Initialization (1)

The network: w and x are discrete, y and z are continuous; each variable has dimension 2.

P(w=0) = 0.5,  P(w=1) = 0.5
P(x=0) = 0.4,  P(x=1) = 0.6

P(y | x):
X=0:  N( y ;  [-0.2 ; 1.2],  [10 0 ; 0 10] )
X=1:  N( y ;  [1.0 ; -1.0],  [2 0 ; 0 2] )

P(z | w, y):
W=0:  N( z ;  [-0.5 ; 0.5] + [0.9 0.3 ; -0.7 0.5]·y,  [9 0 ; 0 9] )
W=1:  N( z ;  [-0.2 ; -0.5] + [-0.3 0.2 ; 0.7 -0.5]·y,  [2 0 ; 0 3] )
Initialization (2)

Clique 2 (root), over (w, x, y): assigned P(x) and P(y|x), in canonical form:
x=0: g = log(0.4), h = [], K = []
x=1: g = log(0.6), h = [], K = []
X=0: g = -4.2145, h = [-0.02 0.12]', K = [0.1 0 ; 0 0.1]
X=1: g = -3.0310, h = [0.5 -0.5]', K = [0.5 0 ; 0 0.5]

Clique 1, over (w, y, z): assigned P(w) and P(z|w,y), in canonical form:
w=0: g = log(0.5), h = [], K = []
w=1: g = log(0.5), h = [], K = []
W=0: g = -4.0629, h = [0.0889 -0.0111 -0.0556 0.0556]'
     K = [  0.1444 -0.0089 -0.1     0.0778 ;
           -0.0089  0.0378 -0.0333 -0.0556 ;
           -0.1    -0.0333  0.1111  0      ;
            0.0778 -0.0556  0       0.1111 ]
W=1: g = -2.7854, h = [0.0867 -0.0633 -0.1000 -0.1667]'
     K = [  0.2083 -0.1467  0.15   -0.2333 ;
           -0.1467  0.1033 -0.1     0.1667 ;
            0.15   -0.1     0.5     0      ;
           -0.2333  0.1667  0       0.3333 ]
Initialization (3)

Multiplying the functions within each clique (the separator potential over (w, y) is empty):

Clique 2 (root), over (w, x, y):
wx=00: g = -5.1308, h = [-0.02 0.12]', K = [0.1 0 ; 0 0.1]
wx=10: g = -5.1308, h = [-0.02 0.12]', K = [0.1 0 ; 0 0.1]
wx=01: g = -3.5418, h = [0.5 -0.5]', K = [0.5 0 ; 0 0.5]
wx=11: g = -3.5418, h = [0.5 -0.5]', K = [0.5 0 ; 0 0.5]

Clique 1, over (w, y, z):
W=0: g = -4.7560, h = [0.0889 -0.0111 -0.0556 0.0556]', K as in Initialization (2)
W=1: g = -3.4786, h = [0.0867 -0.0633 -0.1000 -0.1667]', K as in Initialization (2)
Message Passing

Clique 1 (wyz), separator (wy), Clique 2 (root, wxy).

Collect evidence:
φ*(wy) = ∫z φ(wyz)
φ*(wxy) = φ(wxy) · φ*(wy) / φ(wy)

Distribute evidence:
φ**(wy) = ∑x φ*(wxy)
φ**(wyz) = φ(wyz) · φ**(wy) / φ*(wy)
Collect evidence (1)

Marginalizing a continuous variable in canonical form (a strong marginal): partition y = [y1 ; y2], h = [h1 ; h2], K = [K11 K12 ; K21 K22]. Then

∫ φ(y1, y2 ; g, h, K) dy1 = φ(y2 ; ĝ, ĥ, K̂)

ĝ = g + (1/2)·( p·log(2π) - log|K11| + h1' K11⁻¹ h1 ),   p = dim(y1)

ĥ = h2 - K21 K11⁻¹ h1

K̂ = K22 - K21 K11⁻¹ K12

Here y1 = z is integrated out of Clique 1, mapping (w, y, z) onto the separator (w, y).
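A numpy sketch of these ĝ, ĥ, K̂ formulas (the function name is ours):

import numpy as np

def strong_marginal(g, h, K, keep):
    # Integrate the complement of `keep` out of the canonical form (g, h, K).
    n = len(h)
    out = [i for i in range(n) if i not in keep]
    K11 = K[np.ix_(out, out)]
    K12, K21 = K[np.ix_(out, keep)], K[np.ix_(keep, out)]
    h1, h2 = h[out], h[keep]
    sol = np.linalg.solve(K11, h1)
    g_hat = g + 0.5 * (len(out) * np.log(2 * np.pi)
                       - np.log(np.linalg.det(K11)) + h1 @ sol)
    h_hat = h2 - K21 @ sol
    K_hat = K[np.ix_(keep, keep)] - K21 @ np.linalg.solve(K11, K12)
    return g_hat, h_hat, K_hat

# Example: integrate variable 0 out of a two-variable canonical form.
g, h, K = 0.0, np.array([-2.0, 1.0]), np.array([[9.0, -4.0], [-4.0, 2.0]])
print(strong_marginal(g, h, K, keep=[1]))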
Collect evidence (2)

Marginalizing z out of Clique 1 (w, y, z) gives the message on the separator (w, y):

W=0: g = -0.6931, h = [0.1388 0]' * 1.0e-16, K = [0.2776 -0.0694 ; 0.0347 0] * 1.0e-16
W=1: g = -0.6931, h = [0 0]', K = [0 0 ; 0 0]

(Up to numerical noise of order 1e-16, the message is just P(w): g = log(0.5), h = 0, K = 0. Integrating the density P(z|w,y)·P(w) over z leaves no dependence on y.)
Collect evidence (3)

Multiplying the message into Clique 2 (w, x, y): each g increases by the message's g = -0.6931, while h and K are unchanged because the message's h and K are (numerically) zero.

wx=00: g = -5.1308  becomes  g = -5.8239, h = [-0.02 0.12]', K = [0.1 0 ; 0 0.1]
wx=10: g = -5.1308  becomes  g = -5.8239, h = [-0.02 0.12]', K = [0.1 0 ; 0 0.1]
wx=01: g = -3.5418  becomes  g = -4.2350, h = [0.5 -0.5]', K = [0.5 0 ; 0 0.5]
wx=11: g = -3.5418  becomes  g = -4.2350, h = [0.5 -0.5]', K = [0.5 0 ; 0 0.5]
Distribute evidence (1)

Before Clique 1 absorbs the new separator potential, the collect message φ*(wy) is divided out of it (division subtracts the g's, h's and K's):

Clique 1: W=0: g = -4.7560; W=1: g = -3.4786 (h and K as in Initialization (2)).
Message divided out: W=0 and W=1: g = -0.6931, h ≈ 0, K ≈ 0.
Distribute evidence (2)

After the division, Clique 1 holds:

W=0: g = -4.0629, h = [0.0889 -0.0111 -0.0556 0.0556]', K as in Initialization (2)
W=1: g = -2.7854, h = [0.0867 -0.0633 -0.1000 -0.1667]', K as in Initialization (2)
Distribute evidence (3)

The new separator potential marginalizes the discrete x out of Clique 2: a weak marginal. For each value of w, the mixture 0.4·N([-0.2 1.2]', 10·I) + 0.6·N([1.0 -1.0]', 2·I) is collapsed to a single Gaussian by moment matching:

w=0: logp = -0.6931, mu = [0.52 -0.12]', Sigma = [5.5456 -0.6336 ; -0.6336 6.3616]
w=1: logp = -0.6931, mu = [0.52 -0.12]', Sigma = [5.5456 -0.6336 ; -0.6336 6.3616]
Distribute evidence (4)

The weak marginal is converted back to canonical form before being multiplied into Clique 1:

w=0: g = -4.3316, h = [0.0927 -0.0096]', K = [0.1824 0.0182 ; 0.0182 0.159]
w=1: g = -4.3316, h = [0.0927 -0.0096]', K = [0.1824 0.0182 ; 0.0182 0.159]

Clique 1 still holds the potentials from Distribute evidence (2): W=0: g = -4.0629; W=1: g = -2.7854.
Distribute evidence (5)

Multiplying the new separator potential into Clique 1 (the g's, h's and K's add; the 2×2 message K is added to the (y, y) block):

W=0: g = -8.3945
     h = [0.1816 -0.0207 -0.0556 0.0556]'
     K = [  0.3268  0.0093 -0.1     0.0778 ;
            0.0093  0.1968 -0.0333 -0.0556 ;
           -0.1    -0.0333  0.1111  0      ;
            0.0778 -0.0556  0       0.1111 ]
W=1: g = -7.1170
     h = [0.1793 -0.0730 -0.1 -0.1667]'
     K = [  0.3907 -0.1285  0.15   -0.2333 ;
           -0.1285  0.2623 -0.1     0.1667 ;
            0.15   -0.1     0.5     0      ;
           -0.2333  0.1667  0       0.3333 ]
After Message Passing

Clique 1 now holds p(w, y, z), the separator holds p(w, y), and Clique 2 (the root) holds p(w, x, y): the local marginal distributions.