# Two Approximate Algorithms for Belief Updating

- Mini-Clustering (MC): Robert Mateescu, Rina Dechter, Kalev Kask. "Tree Approximation for Belief Updating", AAAI-2002
- Iterative Join-Graph Propagation (IJGP): Rina Dechter, Kalev Kask and Robert Mateescu. "Iterative Join-Graph Propagation", UAI-2002
## What is Mini-Clustering?

- Mini-Clustering (MC) is an approximate algorithm for belief updating in Bayesian networks
- MC is an anytime version of join-tree clustering
- MC applies message passing along a cluster tree
- The complexity of MC is controlled by a user-adjustable parameter, the i-bound
- Empirical evaluation shows that MC is a very effective algorithm, in many cases superior to other approximate schemes (IBP, Gibbs sampling)
## Belief networks

A belief network is a quadruple BN = <X, D, G, P> where:
- X = {X_1, ..., X_n} is a set of random variables
- D = {D_1, ..., D_n} is the set of their domains
- G is a DAG (directed acyclic graph) over X
- P = {p_1, ..., p_n}, with p_i = P(X_i | pa_i), is the set of CPTs (conditional probability tables)

[Figure: example belief network over variables A, B, C, D, E, F, G]

The belief updating problem is the task of computing the posterior probability P(Y|e) of query nodes Y ⊆ X given evidence e. We focus on the basic case where Y is a single variable X_i.
## Tree decompositions

A tree decomposition for a belief network BN = <X, D, G, P> is a triple <T, χ, ψ>, where T = (V, E) is a tree and χ and ψ are labeling functions associating with each vertex v ∈ V two sets, χ(v) ⊆ X and ψ(v) ⊆ P, satisfying:
1. For each function p_i ∈ P there is exactly one vertex v such that p_i ∈ ψ(v) and scope(p_i) ⊆ χ(v)
2. For each variable X_i ∈ X, the set {v ∈ V | X_i ∈ χ(v)} forms a connected subtree (running intersection property)

[Figure: the example belief network over A-G and a tree decomposition for it: clusters ABC with {p(a), p(b|a), p(c|a,b)}, BCDF with {p(d|b), p(f|c,d)}, BEF with {p(e|b,f)}, and EFG with {p(g|e,f)}, connected in a chain with separators BC, BF, EF.]
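Both conditions can be checked mechanically. A minimal sketch in Python over the example decomposition; the dict encoding and the helper name `is_tree_decomposition` are ours, not from the paper:

```python
# chi maps each cluster to its variables; psi maps it to the scopes of the
# CPTs placed there (condition 1 requires each scope to fit inside chi).
chi = {1: {"A", "B", "C"}, 2: {"B", "C", "D", "F"},
       3: {"B", "E", "F"}, 4: {"E", "F", "G"}}
psi = {1: [{"A"}, {"A", "B"}, {"A", "B", "C"}],   # p(a), p(b|a), p(c|a,b)
       2: [{"B", "D"}, {"C", "D", "F"}],          # p(d|b), p(f|c,d)
       3: [{"B", "E", "F"}],                      # p(e|b,f)
       4: [{"E", "F", "G"}]}                      # p(g|e,f)
edges = [(1, 2), (2, 3), (3, 4)]

def is_tree_decomposition(chi, psi, edges):
    # Condition 1: each CPT sits at a vertex whose chi covers its scope.
    if not all(s <= chi[v] for v, scopes in psi.items() for s in scopes):
        return False
    # Condition 2 (running intersection): for every variable, the vertices
    # whose chi contains it must form a connected subtree.
    for x in set().union(*chi.values()):
        nodes = {v for v in chi if x in chi[v]}
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack += [b for a, b in edges if a == v and b in nodes]
            stack += [a for a, b in edges if b == v and a in nodes]
        if seen != nodes:
            return False
    return True
```

For the example above the check succeeds, and the separator chi(2) ∩ chi(3) comes out as {B, F}, matching the decomposition figure.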
## Cluster Tree Elimination

- Cluster Tree Elimination (CTE) is an exact algorithm
- It works by passing messages along a tree decomposition
- Basic idea:
  - Each node sends only one message to each of its neighbors
  - Node u sends a message to its neighbor v only when u has received messages from all its other neighbors
## Cluster Tree Elimination

Previous work on tree clustering:
- Lauritzen, Spiegelhalter - '88 (probabilities)
- Jensen, Lauritzen, Olesen - '90 (probabilities)
- Shenoy, Shafer - '90; Shenoy - '97 (general)
- Dechter, Pearl - '89 (constraints)
- Gottlob, Leone, Scarcello - '00 (constraints)
## Belief Propagation

Let node u have neighbors x_1, ..., x_n and v in the tree decomposition. The cluster of u collects its own functions together with the incoming messages:

cluster(u) = ψ(u) ∪ {h(x_1,u), h(x_2,u), ..., h(x_n,u), h(v,u)}

Compute the message from u to v:

h(u,v) = Σ_{elim(u,v)} Π_{f ∈ cluster(u) \ {h(v,u)}} f

where elim(u,v) = χ(u) \ sep(u,v) are the variables summed out when sending from u to v.
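A minimal sketch of this message computation, assuming functions are stored as (scope, table) pairs with tables keyed by value tuples; the representation and the name `message` are ours, not the paper's:

```python
from itertools import product

def message(functions, elim_vars, domains):
    """Multiply all functions in cluster(u), excluding h(v,u) (the caller
    simply leaves it out of `functions`), then sum out elim(u,v)."""
    all_vars = sorted({x for scope, _ in functions for x in scope})
    out_vars = tuple(x for x in all_vars if x not in elim_vars)
    out = {}
    for values in product(*(domains[x] for x in all_vars)):
        env = dict(zip(all_vars, values))
        val = 1.0
        for scope, table in functions:
            val *= table[tuple(env[x] for x in scope)]
        key = tuple(env[x] for x in out_vars)
        out[key] = out.get(key, 0.0) + val
    return out_vars, out

# h(1,2)(b,c) for the running example: sum out A from p(a)*p(b|a)*p(c|a,b).
domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
p_a = (("A",), {(0,): 0.6, (1,): 0.4})
p_b_a = (("A", "B"), {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.2})
p_c_ab = (("A", "B", "C"), {(0, 0, 0): 0.5, (0, 0, 1): 0.5,
                            (0, 1, 0): 0.1, (0, 1, 1): 0.9,
                            (1, 0, 0): 0.4, (1, 0, 1): 0.6,
                            (1, 1, 0): 0.25, (1, 1, 1): 0.75})
scope, h12 = message([p_a, p_b_a, p_c_ab], {"A"}, domains)
```

With no evidence in the cluster, h(1,2) is exactly the joint P(b,c), so its entries sum to 1 — a quick sanity check on the implementation.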
## Cluster Tree Elimination - example

For the example tree decomposition (clusters 1 = ABC, 2 = BCDF, 3 = BEF, 4 = EFG; evidence G = g_e), CTE computes the messages:

h_(1,2)(b,c) = Σ_a p(a) · p(b|a) · p(c|a,b)
h_(2,1)(b,c) = Σ_{d,f} p(d|b) · p(f|c,d) · h_(3,2)(b,f)
h_(2,3)(b,f) = Σ_{c,d} p(d|b) · p(f|c,d) · h_(1,2)(b,c)
h_(3,2)(b,f) = Σ_e p(e|b,f) · h_(4,3)(e,f)
h_(3,4)(e,f) = Σ_b p(e|b,f) · h_(2,3)(b,f)
h_(4,3)(e,f) = p(G = g_e | e,f)
## Cluster Tree Elimination - the messages

Cluster 1 (ABC) holds p(a), p(b|a), p(c|a,b) and sends over separator BC:

h_(1,2)(b,c) = Σ_a p(a) · p(b|a) · p(c|a,b)

Cluster 2 (BCDF) holds p(d|b), p(f|c,d) and the incoming h_(1,2)(b,c); with sep(2,3) = {B,F} and elim(2,3) = {C,D} it sends over separator BF:

h_(2,3)(b,f) = Σ_{c,d} p(d|b) · p(f|c,d) · h_(1,2)(b,c)

Cluster 3 (BEF) holds p(e|b,f) and h_(2,3)(b,f); cluster 4 (EFG) holds p(g|e,f); they communicate over separator EF.
## Cluster Tree Elimination - properties

- Correctness and completeness: algorithm CTE is correct, i.e. it computes the exact joint probability of a single variable and the evidence.
- Time complexity: O(deg · (n+N) · d^(w*+1))
- Space complexity: O(N · d^sep)

where:
- deg = the maximum degree of a node
- n = number of variables (= number of CPTs)
- N = number of nodes in the tree decomposition
- d = the maximum domain size of a variable
- w* = the induced width
- sep = the maximum separator size
## Mini-Clustering - motivation

- Time and space complexity of Cluster Tree Elimination depend exponentially on the induced width w* of the problem
- When the induced width w* is large, the CTE algorithm becomes infeasible
## Mini-Clustering - the basic idea

- Try to reduce the size of the cluster (the exponent): partition each cluster into mini-clusters with fewer variables
- Accuracy parameter i = maximum number of variables in a mini-cluster
- The idea was explored for variable elimination (Mini-Buckets)
## Mini-Clustering

- Suppose cluster(u) is partitioned into p mini-clusters mc(1), ..., mc(p), each containing at most i variables
- CTE computes the exact message:

  h(u,v) = Σ_{elim(u,v)} Π_{k=1..p} Π_{f ∈ mc(k)} f

- We want to process each product Π_{f ∈ mc(k)} f separately
## Mini-Clustering

h(u,v) = Σ_{elim(u,v)} Π_{k=1..p} Π_{f ∈ mc(k)} f

- Approximate each product Π_{f ∈ mc(k)} f, k = 2, ..., p, and take it outside the summation
- How to process the mini-clusters to obtain approximations or bounds:
  - Process all mini-clusters by summation - this gives an upper bound on the joint probability
  - A tighter upper bound: process one mini-cluster by summation and the others by maximization
  - Can also use the mean operator (average) - this gives an approximation of the joint probability
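The three processing choices can be illustrated on arbitrary non-negative tables. A small sketch (the tables `f` and `g` and their scopes are hypothetical, chosen so that the summing mini-cluster covers all eliminated variables, as in MC):

```python
import random
from itertools import product

random.seed(1)
dom = [0, 1]

# Message over separator {x} with elim(u,v) = {y, z}:
# mini-cluster 1 has scope (x, y, z), mini-cluster 2 has scope (y, z).
f = {k: random.random() for k in product(dom, dom, dom)}   # f(x, y, z)
g = {k: random.random() for k in product(dom, dom)}        # g(y, z)

for x in dom:
    # Exact CTE message: sum the full product over the eliminated variables.
    exact = sum(f[x, y, z] * g[y, z] for y in dom for z in dom)
    f_sum = sum(f[x, y, z] for y in dom for z in dom)
    # Process every mini-cluster by summation -> upper bound.
    ub_sum = f_sum * sum(g.values())
    # Sum one mini-cluster, maximize the others -> tighter upper bound.
    ub_max = f_sum * max(g.values())
    # Mean operator -> plain approximation, no guaranteed direction.
    approx = f_sum * (sum(g.values()) / len(g))
    assert exact <= ub_max <= ub_sum
```

The assertion holds for any non-negative tables: each term f·g is at most f·max(g), and max(g) is at most the sum of g's (non-negative) entries.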
## Idea of Mini-Clustering

Split a cluster into mini-clusters => bound complexity.

[Figure: a cluster over n variables containing functions h and g, split into two mini-clusters over r and n-r variables.]

Exponential complexity decrease: O(e^n) → O(e^r) + O(e^(n-r))
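A toy count of table entries makes the decrease concrete (we use the domain size d as the base rather than the slide's e; the sizes are arbitrary):

```python
# Splitting a cluster over n variables into mini-clusters over r and n-r
# variables replaces one table of size d**n by two much smaller ones.
d, n, r = 2, 20, 10
full_cost = d ** n                    # O(d^n) entries in the single cluster
split_cost = d ** r + d ** (n - r)    # O(d^r) + O(d^(n-r)) after the split
print(full_cost, split_cost)
```

With d = 2, n = 20, r = 10 the single table has 1,048,576 entries while the two mini-cluster tables have 2,048 together.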
## Mini-Clustering - example

With i = 3, clusters 2 is too large and is partitioned; each message H_(u,v) is now a set of functions:

H_(1,2) = { h¹_(1,2)(b,c) := Σ_a p(a) · p(b|a) · p(c|a,b) }
H_(2,1) = { h¹_(2,1)(b) := Σ_{d,f} p(d|b) · h¹_(3,2)(b,f),  h²_(2,1)(c) := max_{d,f} p(f|c,d) }
H_(2,3) = { h¹_(2,3)(b) := Σ_{c,d} p(d|b) · h¹_(1,2)(b,c),  h²_(2,3)(f) := max_{c,d} p(f|c,d) }
H_(3,2) = { h¹_(3,2)(b,f) := Σ_e p(e|b,f) · h¹_(4,3)(e,f) }
H_(3,4) = { h¹_(3,4)(e,f) := Σ_b p(e|b,f) · h¹_(2,3)(b) · h²_(2,3)(f) }
H_(4,3) = { h¹_(4,3)(e,f) := p(G = g_e | e,f) }
## Mini-Clustering - the messages, i=3

Cluster 1 (ABC) fits within i = 3 and sends over separator BC:

h¹_(1,2)(b,c) = Σ_a p(a) · p(b|a) · p(c|a,b)

Cluster 2 is split into mini-clusters BCD = {p(d|b), h¹_(1,2)(b,c)} and CDF = {p(f|c,d)}; with sep(2,3) = {B,F} and elim(2,3) = {C,D} it sends over separator BF:

h¹_(2,3)(b) = Σ_{c,d} p(d|b) · h¹_(1,2)(b,c)
h²_(2,3)(f) = max_{c,d} p(f|c,d)

Cluster 3 (BEF) holds p(e|b,f), h¹_(2,3)(b), h²_(2,3)(f); cluster 4 (EFG) holds p(g|e,f); they communicate over separator EF.
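The split of cluster 2 can be checked numerically: for any non-negative tables, the pair h¹_(2,3)(b), h²_(2,3)(f) multiplies to an upper bound on the exact CTE message h_(2,3)(b,f). A sketch with random tables (the numeric values are arbitrary, not proper CPTs):

```python
import random
from itertools import product

random.seed(2)
dom = [0, 1]

# Arbitrary non-negative tables standing in for cluster 2's functions.
p_d_b = {k: random.random() for k in product(dom, dom)}        # key (b, d)
p_f_cd = {k: random.random() for k in product(dom, dom, dom)}  # key (c, d, f)
h1_12 = {k: random.random() for k in product(dom, dom)}        # key (b, c)

for b, f in product(dom, dom):
    # Exact CTE message over separator {B, F}: sum out elim(2,3) = {C, D}.
    exact = sum(p_d_b[b, d] * p_f_cd[c, d, f] * h1_12[b, c]
                for c in dom for d in dom)
    # MC(i=3): mini-cluster {p(d|b), h1_(1,2)} processed by summation,
    # mini-cluster {p(f|c,d)} processed by maximization.
    h1_23_b = sum(p_d_b[b, d] * h1_12[b, c] for c in dom for d in dom)
    h2_23_f = max(p_f_cd[c, d, f] for c in dom for d in dom)
    assert exact <= h1_23_b * h2_23_f
```

The inequality follows term by term: each p(f|c,d) in the exact sum is replaced by its maximum over c, d.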
## Cluster Tree Elimination vs. Mini-Clustering

[Figure: side-by-side message flows on the same tree (clusters ABC, BCDF, BEF, EFG; separators BC, BF, EF). CTE sends single functions h_(1,2)(b,c), h_(2,1)(b,c), h_(2,3)(b,f), h_(3,2)(b,f), h_(3,4)(e,f), h_(4,3)(e,f); MC sends sets of functions H_(1,2) = {h¹_(1,2)(b,c)}, H_(2,1) = {h¹_(2,1)(b), h²_(2,1)(c)}, H_(2,3) = {h¹_(2,3)(b), h²_(2,3)(f)}, H_(3,2) = {h¹_(3,2)(b,f)}, H_(3,4) = {h¹_(3,4)(e,f)}, H_(4,3) = {h¹_(4,3)(e,f)}.]
## Mini-Clustering - properties

- Correctness and completeness: algorithm MC(i) computes a bound (or an approximation) on the joint probability P(X_i, e) of each variable and each of its values.
- Time and space complexity: O(n · hw* · d^i), where hw* = max_u |{f | scope(f) ∩ χ(u) ≠ ∅}|
## Normalization

- Algorithms for the belief updating problem compute, in general, the joint probability P(X_i, e), where X_i is a query node and e is the evidence
- Computing the conditional probability P(X_i | e):
  - is easy to do if exact algorithms can be applied
  - becomes an important issue for approximate algorithms
## Normalization

- MC can compute an upper bound P̄(X_i, e) on the joint probability P(X_i, e)
- Deriving a bound on the conditional P(X_i | e) is not easy when the exact P(e) is not available
- If a lower bound P̲(e) were available, we could use P̄(X_i, e) / P̲(e) as an upper bound on the posterior
- In our experiments we normalized the results and regarded them as approximations of the posterior P(X_i | e)
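The normalization step used in the experiments can be sketched as follows (the function name and the sample numbers are ours):

```python
def normalize(joint_estimates):
    """joint_estimates: value -> approximate P(X_i = value, e).
    Returns the normalized table, read as an approximation of P(X_i | e)."""
    total = sum(joint_estimates.values())
    if total == 0.0:
        raise ValueError("all estimates are zero; cannot normalize")
    return {v: p / total for v, p in joint_estimates.items()}

# Hypothetical MC outputs for a binary variable given evidence e:
post = normalize({0: 0.03, 1: 0.01})  # approximately {0: 0.75, 1: 0.25}
```

Note that normalizing upper bounds does not itself yield a bound on the posterior; the result is treated purely as an approximation.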
## Experimental results

We tested MC with max and mean operators.

Algorithms:
- Exact
- IBP
- Gibbs sampling (GS)
- MC with normalization (approximate)

Networks (all variables are binary):
- Coding networks
- CPCS 54, 360, 422
- Grid networks (MxM)
- Random noisy-OR networks
- Random networks

Measures:
- Normalized Hamming Distance (NHD)
- BER (Bit Error Rate)
- Absolute error
- Relative error
- Time
## Random networks - Absolute error

[Figure: absolute error vs. i-bound for MC, Gibbs Sampling, and IBP on random networks, N=50, P=2, k=2, w*=10, 50 instances; left panel evidence=0, right panel evidence=10.]
## Coding networks - Bit Error Rate

[Figure: Bit Error Rate vs. i-bound for MC and IBP on coding networks, N=100, P=4, w*=12, 50 instances; left panel sigma=0.22, right panel sigma=0.51.]
## Noisy-OR networks - Absolute error

[Figure: absolute error (log scale) vs. i-bound for MC, IBP, and Gibbs Sampling on noisy-OR networks, N=50, P=3, w*=16, 25 instances; left panel evidence=10, right panel evidence=20.]
## CPCS422 - Absolute error

[Figure: absolute error vs. i-bound for MC and IBP on CPCS 422, w*=23, 1 instance; left panel evidence=0, right panel evidence=10.]
## Grid 15x15 - 0 evidence

[Figure: four panels for 15x15 grid networks, evid=0, w*=22, 10 instances: NHD, absolute error, relative error, and time (seconds), each vs. i-bound, comparing MC and IBP.]
## Grid 15x15 - 10 evidence

[Figure: four panels for 15x15 grid networks, evid=10, w*=22, 10 instances: NHD, absolute error, relative error, and time (seconds), each vs. i-bound, comparing MC and IBP.]
## Grid 15x15 - 20 evidence

[Figure: four panels for 15x15 grid networks, evid=20, w*=22, 10 instances: NHD, absolute error (both log scale), relative error, and time (seconds), each vs. i-bound, comparing MC, IBP, and Gibbs Sampling.]
## Conclusion

- MC extends the partition-based approximation from mini-buckets to general tree decompositions for the problem of belief updating
- Empirical evaluation demonstrates its effectiveness and superiority (for certain types of problems, with respect to the measures considered) relative to other existing algorithms