Introduction to Bayesian networks

Jose M. Peña @ KI
2008-11-14

Outline
- Biological network = probability distribution (via measurements) = Bayesian networks
- Probability theory
- Bayesian networks: Semantics, Learning, Example
- Dynamic Bayesian networks
- Gaussian networks
- Markov networks
Probability theory

- p(X=x, Y=y) ≥ 0 for all x and y.
- Σx,y p(X=x, Y=y) = 1.
- Marginal: p(X=x) = Σy p(X=x, Y=y).
- Conditional: p(X=x | Y=y) = p(X=x, Y=y) / p(Y=y).
- Independence: p(X=x | Y=y) = p(X=x) for all x and y, or p(X=x, Y=y) = p(X=x)p(Y=y) for all x and y.
- Conditional independence: p(X|Y,Z) = p(X|Z), or p(X,Y|Z) = p(X|Z)p(Y|Z).

Example:

p(X,Y)   Y=0       Y=1       p(X)
X=0      0.3       0.3       0.6
X=1      0.2       0.2       0.4
p(Y)     0.5       0.5

p(X|Y)   Y=0       Y=1
X=0      0.3/0.5   0.3/0.5
X=1      0.2/0.5   0.2/0.5

Here p(X=x | Y=y) = p(X=x) for all x and y, so X and Y are independent.

Probability theory

Biological network over genes, proteins, … = probability distribution over genes, proteins, …
[Figure: a small network over GeneA, GeneB, GeneC]
- Deterministic relations ?
- Probabilistic relations ?
- Noise ?
"All models are wrong but some are useful."
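These table manipulations are easy to script. A minimal Python sketch (the encoding is my own; the numbers are the ones from the example above):

# Joint distribution p(X,Y) from the example above.
joint = {(0, 0): 0.3, (0, 1): 0.3,
         (1, 0): 0.2, (1, 1): 0.2}

def marginal_x(joint, x):
    # p(X=x) = Σy p(X=x, Y=y)
    return sum(p for (xi, y), p in joint.items() if xi == x)

def marginal_y(joint, y):
    # p(Y=y) = Σx p(X=x, Y=y)
    return sum(p for (x, yi), p in joint.items() if yi == y)

def conditional(joint, x, y):
    # p(X=x | Y=y) = p(X=x, Y=y) / p(Y=y)
    return joint[(x, y)] / marginal_y(joint, y)

# X and Y are independent here: p(X=x | Y=y) = p(X=x) for all x and y.
assert all(abs(conditional(joint, x, y) - marginal_x(joint, x)) < 1e-12
           for x in (0, 1) for y in (0, 1))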
Bayesian networks: Semantics

A Bayesian network (BN) over X1,…,Xn consists of
- a directed acyclic graph (DAG) over X1,…,Xn, and
- probability distributions p(Xi|Parents(Xi)) for all i,
and defines a probability distribution over X1,…,Xn as
p(X1,…,Xn) = Πi p(Xi|Parents(Xi)).

For example, by the chain rule,
p(X,Y,Z) = p(X,Y)p(Z|X,Y) = p(X)p(Y|X)p(Z|X,Y),
which corresponds to the complete DAG with edges X → Y, X → Z and Y → Z. Drop the edge from X to Y; then
p(X,Y,Z) = p(X)p(Y)p(Z|X,Y).
In general: p(X1,…,Xn) = Πi p(Xi|Parents(Xi)).
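A minimal sketch of this definition in Python: a BN as a parent map plus conditional probability tables, with the joint obtained by multiplying the local distributions. The graph matches the drop-the-edge example above; the CPT numbers are invented for illustration:

# BN over binary X, Y, Z with edges X -> Z and Y -> Z (X -> Y dropped).
parents = {'X': [], 'Y': [], 'Z': ['X', 'Y']}
# p(var = 1 | parent values); the numbers are illustrative only.
cpt1 = {
    'X': {(): 0.6},
    'Y': {(): 0.5},
    'Z': {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.9},
}

def joint(assignment):
    # p(X1,...,Xn) = prod_i p(Xi | Parents(Xi))
    prob = 1.0
    for var, pa_list in parents.items():
        pa = tuple(assignment[p] for p in pa_list)
        p1 = cpt1[var][pa]
        prob *= p1 if assignment[var] == 1 else 1.0 - p1
    return prob

print(joint({'X': 1, 'Y': 0, 'Z': 1}))  # 0.6 * 0.5 * 0.8 = 0.24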
Bayesian networks: Semantics

Which independencies are represented in a DAG ?

Consider the DAG with edges X → Z and Y → Z, but no edge between X and Y. By the chain rule and by the BN factorization,
p(X)p(Y|X)p(Z|X,Y) = p(X,Y,Z) = p(X)p(Y)p(Z|X,Y), so
p(Y|X) = p(Y), so
X is independent of Y !!!
No numeric calculation required !!!
Bayesian networks: Semantics

Now consider the chain X → Y → Z, i.e. p(X,Y,Z) = p(X)p(Y|X)p(Z|Y). Marginally, X and Z are dependent in general:
p(X,Z) = Σy p(X)p(Y|X)p(Z|Y) ≠ p(X)p(Z).
But given Y they become independent:
p(X,Z|Y) = p(X)p(Y|X)p(Z|Y)/p(Y)
= p(Y)p(X|Y)p(Z|Y)/p(Y)
= p(X|Y)p(Z|Y).
Bayesian networks: Semantics

Which independencies are represented in a DAG ? Use the d-separation criterion:

X is independent of Y given a set Z when for every undirected path between X and Y there exists a node W in the path such that
- W does not have two parents in the path and W is in Z, or
- W has two parents in the path and neither W nor any of its descendants is in Z.
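The criterion can be checked mechanically, which helps with the exercises below. A sketch using the classical equivalent formulation (restrict to the ancestors of the variables involved, moralize, then test plain graph separation); the dict-based DAG encoding is mine, not the lecture's:

from collections import deque

def ancestors(dag, nodes):
    # dag maps each node to the list of its parents.
    seen, stack = set(nodes), list(nodes)
    while stack:
        for pa in dag[stack.pop()]:
            if pa not in seen:
                seen.add(pa)
                stack.append(pa)
    return seen

def d_separated(dag, xs, ys, zs):
    # 1. Keep only the ancestors of xs, ys and zs.
    keep = ancestors(dag, set(xs) | set(ys) | set(zs))
    # 2. Moralize: undirected parent-child edges, plus "marry" the parents.
    nbrs = {v: set() for v in keep}
    for v in keep:
        pas = [p for p in dag[v] if p in keep]
        for p in pas:
            nbrs[v].add(p); nbrs[p].add(v)
        for i, p in enumerate(pas):
            for q in pas[i + 1:]:
                nbrs[p].add(q); nbrs[q].add(p)
    # 3. d-separated iff zs blocks every path from xs to ys in the moral graph.
    blocked = set(zs)
    frontier = deque(set(xs) - blocked)
    seen = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in ys:
            return False
        for w in nbrs[v]:
            if w not in seen and w not in blocked:
                seen.add(w); frontier.append(w)
    return True

# Collider X -> Z <- Y: X and Y are independent, but not given Z.
dag = {'X': [], 'Y': [], 'Z': ['X', 'Y']}
print(d_separated(dag, {'X'}, {'Y'}, set()))  # True
print(d_separated(dag, {'X'}, {'Y'}, {'Z'}))  # False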
Bayesian networks: Semantics

[Figure: example DAGs over U, V, W, X, Y and Z]
- U ┴ V ?
- U ┴ V | ZW ?
- XU ┴ YV ?
- X ┴ Y | Z ?
- X ┴ Y | UV ?
- U ┴ Z | X ?
- U ┴ Y | ZV ?

Bayesian networks: Semantics

Two DAGs represent the same independencies iff they have the same adjacencies and immoralities.
Bayesian networks: Semantics

The Markov boundary of X, MB(X), is the minimal set such that X ┴ Rest | MB(X).
In a BN, MB(X) = Parents(X) ∪ Children(X) ∪ Spouses(X).
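Reading MB(X) off a DAG is straightforward; a small sketch (same dict-based DAG encoding as before):

def markov_boundary(dag, x):
    # dag maps each node to the list of its parents.
    parents = set(dag[x])
    children = {v for v, pas in dag.items() if x in pas}
    spouses = {p for c in children for p in dag[c]} - {x}
    return parents | children | spouses

# X -> Z <- Y, Z -> W: MB(Z) = {X, Y, W}; MB(X) = {Z, Y}.
dag = {'X': [], 'Y': [], 'Z': ['X', 'Y'], 'W': ['Z']}
print(markov_boundary(dag, 'Z'))  # {'X', 'Y', 'W'}
print(markov_boundary(dag, 'X'))  # {'Z', 'Y'}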
Bayesian networks: Learning

So far, we have a framework that
- simplifies specifying probability distributions, and
- enables us to reason quantitatively and qualitatively.
Is it learnable from data ?
Learning a BN B = (Bs,BP) consists in
- learning the best DAG Bs, and
- learning the best probability distributions BP for Bs.
Bayesian networks: Learning

Take any ordering X1,…,Xn.
For each i, if Pi is the smallest subset of X1,…,Xi-1 such that Xi ┴ {X1,…,Xi-1}\Pi | Pi, then Parents(Xi) = Pi.
The so-obtained DAG
- only represents true independencies, and
- no subgraph of it has this property.

Bayesian networks: Learning

The K2 algorithm
- Access to a database D = {C1,…,Cm}.
- Take any ordering X1,…,Xn.
- For each i
  - Parents(Xi) = ∅.
  - Repeat
    - If there exists Xk in {X1,…,Xi-1}\Parents(Xi) such that Xi ¬┴ Xk | Parents(Xi), then add it to Parents(Xi).

Bayesian networks: Learning

Hill-climbing approach
- Start from the empty graph.
- Repeat
  - Perform the edge addition/removal that improves the score the most*.
* Check for cycles + store scores to avoid recomputing.
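A skeletal hill-climber in this spirit, assuming some decomposable score(dag, data) function is supplied (for instance the BIC sketch further below); the score-caching from the footnote is omitted for brevity:

import itertools

def is_acyclic(dag):
    # dag maps each node to the set of its parents; Kahn-style check.
    indeg = {v: len(pas) for v, pas in dag.items()}
    queue = [v for v, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for w, pas in dag.items():
            if v in pas:
                indeg[w] -= 1
                if indeg[w] == 0:
                    queue.append(w)
    return seen == len(dag)

def hill_climb(variables, data, score):
    dag = {v: set() for v in variables}      # start from the empty graph
    best = score(dag, data)
    while True:
        best_move, best_score = None, best
        for x, y in itertools.permutations(variables, 2):
            cand = {v: set(pas) for v, pas in dag.items()}
            if x in cand[y]:
                cand[y].discard(x)           # try removing the edge x -> y
            else:
                cand[y].add(x)               # try adding the edge x -> y
                if not is_acyclic(cand):     # additions may create cycles
                    continue
            s = score(cand, data)
            if s > best_score:               # keep the single best change
                best_move, best_score = cand, s
        if best_move is None:
            return dag                       # local optimum reached
        dag, best = best_move, best_score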
Bayesian networks: Learning

D = {C1,…,Cm} is a database of m cases. Score a DAG Bs by its posterior:
p(Bs|D) = p(Bs,D) / p(D) ∝ p(Bs,D) = p(Bs)p(D|Bs).

The K2 score:
p(Bs,D) = p(Bs) Πi Πj [(ri − 1)! / (Nij + ri − 1)!] Πk Nijk!
where Nijk is the number of cases in D in which Xi is in state k and Parents(Xi) are in configuration j, Nij = Σk Nijk, ri is the number of states of Xi, and qi is the number of configurations of Parents(Xi).
Consistent, decomposable, non-equivalent.

Bayesian networks: Learning

The BDeu score: as the K2 score, but with Dirichlet prior counts N'ijk; use an uninformative assignment. Note that BDeu = K2 if N'ijk = 1 for all i, j and k. The parameter estimates are
E[p(Xi=k|Parents(Xi)=j)] = (N'ijk+Nijk)/(N'ij+Nij).
Consistent, decomposable, equivalent.

Bayesian networks: Learning

The BIC/MDL score:
BIC(Bs,D) = log maxBP p(D|Bs,BP) − 0.5 · log m · Σi qi(ri − 1),
where maxBP p(D|Bs,BP) is reached when
p(Xi=k|Parents(Xi)=j) = Nijk/Nij
(the best parameter values for Bs !!!).
Consistent, decomposable, equivalent.
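A sketch of the BIC score for complete discrete data, plugging in the maximum-likelihood parameters Nijk/Nij as above (the data encoding is my own, not the lecture's):

import math
from collections import Counter

def bic(dag, data, states):
    # dag: node -> list of parents; states: node -> list of possible values;
    # data: list of dicts, each a complete case {variable: state}.
    m = len(data)
    score = 0.0
    for x, pas in dag.items():
        n_jk, n_j = Counter(), Counter()
        for case in data:
            j = tuple(case[p] for p in pas)   # parent configuration
            n_jk[(j, case[x])] += 1           # N_ijk
            n_j[j] += 1                       # N_ij
        # log max_BP p(D|Bs,BP), attained at p(Xi=k|pa=j) = N_ijk / N_ij.
        for (j, k), n in n_jk.items():
            score += n * math.log(n / n_j[j])
        q = 1
        for p in pas:                         # q_i = number of parent configs
            q *= len(states[p])
        score -= 0.5 * math.log(m) * q * (len(states[x]) - 1)
    return score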
Bayesian networks: Example

[Figures: the true model and the visited models during the search.]

Dynamic Bayesian networks

[Figures.]
Gaussian networks

Assume X = (X1,…,Xn) ~ N(µ,Σ), where µ = E[X] and Σ = cov(X,X) = E[(X−µ)(X−µ)'].
Then Xi ┴ Xk | Rest if and only if Σ⁻¹[i,k] = 0.

Markov networks: Semantics

A.k.a. Gaussian graphical models, covariance selection models, Markov random fields, …
Based on undirected graphs:
p(X1,…,Xn) = (ΠCi q(Ci)) / Z.
u-separation criterion: X is independent of Y given Z if all the paths between X and Y are blocked by Z.

Markov networks: Learning

Start from the complete graph.
Repeat
- If Xi ┴ Xk | Rest then remove the edge between Xi and Xk.
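For Gaussian data, the test Xi ┴ Xk | Rest reduces to checking Σ⁻¹[i,k] = 0, so the edge-removal loop above amounts to inspecting an estimated precision matrix. A numpy sketch; since estimated entries are never exactly zero, a plain threshold stands in for a proper statistical test:

import numpy as np

def gaussian_network_edges(samples, names, tol=0.1):
    # samples: (m, n) array, one row per case; names: the n variable names.
    sigma = np.cov(samples, rowvar=False)  # estimate of Sigma
    prec = np.linalg.inv(sigma)            # Sigma^{-1}
    edges = []
    n = len(names)
    for i in range(n):
        for k in range(i + 1, n):
            # Xi indep. Xk | Rest iff Sigma^{-1}[i,k] = 0; keep the rest.
            if abs(prec[i, k]) > tol:
                edges.append((names[i], names[k]))
    return edges

# Chain X -> Y -> Z: the precision matrix has a (near-)zero X,Z entry.
rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = x + rng.normal(size=10000)
z = y + rng.normal(size=10000)
print(gaussian_network_edges(np.column_stack([x, y, z]), ['X', 'Y', 'Z']))
# expected: [('X', 'Y'), ('Y', 'Z')]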
Why Bayesian networks ?

- Solidly founded on probability theory.
- Can cope with noise and probabilistic relations.
- Graphical interface.
- Learnable from data and prior knowledge.
- Offer flexible reasoning.
- Accept both causal and acausal interpretations.
- Model both linear and non-linear interactions.

But:
- Too many samples are required for accuracy.
- Scalable to thousands of genes ?
- Gaussian networks are limited to linear interactions.
- Learning Markov networks is not so well studied.
More on Bayesian networks

- Continuous random variables, e.g. Gaussian networks.
- Learning via independence tests, e.g. the PC algorithm.
- Learning in the space of equivalent BNs, e.g. GES, KES, the PC algorithm.
- Model averaging, e.g. MCMC.
- Learning from interventional data.
- Causal Bayesian networks.
Causal Bayesian networks

Outline
- Biological network = probability distribution (via measurements) = causal Bayesian networks
- Interventions
- Learning from observations
- Learning from observations and interventions
- Example 1
- Example 2
"All models are wrong but some are useful."

Causal Bayesian networks

Causal BN = BN + causal interpretation, i.e. Parents(X) are the direct causes of X.
A causal BN is a BN and, thus,
- the probability distribution factorizes accordingly,
- d-separation applies,
- it allows probabilistic inference, and
- it is learnable from data.
In addition, causal BNs enable us to predict the effect of interventions, e.g. the effect of a drug. This is not possible with acausal BNs.
Interventions

A BN can tell us how the distribution of X changes when observing Y, i.e. p(X|Y=y).
In addition to this, a causal BN can tell us how the distribution of X changes when intervening on Y, i.e. p(X|do(Y=y)).
Check:
p(Cancer=yes|WhiteTeeth=no) vs.
p(Cancer=yes|do(WhiteTeeth=no)).

Interventions

A causal BN enables us to predict the effect of an intervention on X1 as
p(X2,…,Xn|do(X1=x1)) = Πi≠1 p(Xi|Parents(Xi)),
or, equivalently: delete the edges from Parents(X1) to X1 and, then, "observe" X1=x1.
This is not possible with acausal BNs.
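The observing-versus-intervening contrast can be made concrete on a hypothetical causal BN with edges Smoking → WhiteTeeth and Smoking → Cancer (a confounder structure consistent with the teeth example above; all numbers are invented for illustration):

# Hypothetical causal BN: Smoking -> WhiteTeeth, Smoking -> Cancer.
p_s = {1: 0.3, 0: 0.7}                               # p(Smoking)
p_wt = {1: {1: 0.4, 0: 0.6}, 0: {1: 0.9, 0: 0.1}}    # p(WhiteTeeth | Smoking)
p_c = {1: {1: 0.2, 0: 0.8}, 0: {1: 0.02, 0: 0.98}}   # p(Cancer | Smoking)

def joint(s, wt, c):
    return p_s[s] * p_wt[s][wt] * p_c[s][c]

def p_cancer_given_wt(wt):
    # Observation: condition the joint on WhiteTeeth = wt.
    num = sum(joint(s, wt, 1) for s in (0, 1))
    den = sum(joint(s, wt, c) for s in (0, 1) for c in (0, 1))
    return num / den

def p_cancer_do_wt(wt):
    # Intervention: cut Smoking -> WhiteTeeth and clamp WhiteTeeth = wt.
    # Truncated factorization: p(S,C | do(WT=wt)) = p(S) p(C|S).
    # wt is irrelevant: Cancer is not a descendant of WhiteTeeth.
    return sum(p_s[s] * p_c[s][1] for s in (0, 1))

print(p_cancer_given_wt(0))  # ~0.150: observing bad teeth raises p(Cancer)
print(p_cancer_do_wt(0))     # 0.074: intervening on teeth leaves p(Cancer) unchanged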
Interventions

Which nodes get affected by an intervention on X ? Use d-separation.
Answer: only the descendants of X and, thus, p(Y|do(X=x)) = p(Y) if Y is not one of them.

Predicting the effect of an intervention that rewires the causal BN:
p(X2,…,Xn|do(Parents(X1)=NewParents(X1))) = p(X1|NewParents(X1)) Πi≠1 p(Xi|Parents(Xi)),
or, equivalently: rewire the causal BN accordingly.

Learning from observations

IC algorithm (≈ PC algorithm)
1. Start from the complete graph.
2. For each pair of nodes Xi and Xk: if Xi ┴ Xk | Sik then remove the edge between Xi and Xk (try small sets Sik first).
3. For each induced subgraph Xi − Y − Xk: if Y is not in Sik then orient Xi → Y ← Xk.
4. Orient as many lines as possible (there exist rules for fulfilling this step).
The output is not a BN but a class of equivalent BNs !!!
Alternative: learn a BN and, then, keep only the compelled edges.
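A sketch of steps 1-3, parameterized by an independence oracle indep(x, y, s) that would be a statistical test on D in practice; the orientation rules of step 4 are omitted:

from itertools import combinations

def ic_skeleton(variables, indep, max_cond=2):
    # Step 1: start from the complete undirected graph.
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    # Step 2: remove x - y if x indep y | S for some S; small sets first.
    for size in range(max_cond + 1):
        for x, y in combinations(variables, 2):
            if y not in adj[x]:
                continue
            for s in combinations(sorted(adj[x] - {y}), size):
                if indep(x, y, set(s)):
                    adj[x].discard(y)
                    adj[y].discard(x)
                    sepset[frozenset((x, y))] = set(s)
                    break
    # Step 3: orient x -> y <- z when x - y - z, x and z are non-adjacent,
    # and y is not in the set that separated x and z.
    arrows = set()
    for y in variables:
        for x, z in combinations(sorted(adj[y]), 2):
            if z not in adj[x] and y not in sepset[frozenset((x, z))]:
                arrows.add((x, y))
                arrows.add((z, y))
    return adj, arrows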
Learning from observations and interventions

As before, but now Nijk is the number of cases in D in which Xi is observed (as opposed to intervened, manipulated or perturbed !!!) in state k and its parents are in state j:
E[p(Xi=k|Parents(Xi)=j)] = (N'ijk+Nijk)/(N'ij+Nij).
Similar for BIC/MDL.

Learning from observations and interventions

Choose between the equivalent BNs X → Y and X ← Y on the basis of interventional data (differences may be due to the intervention):
p(Y|do(X=x)) = p(Y|X=x) ≠ p(Y) and
p(X|do(Y=y)) = p(X) ≠ p(X|Y=y), so X → Y.
Learning from observations and interventions

Even with interventions, ambiguity may remain.
Two causal BNs are causally equivalent wrt do(X=x) if they are acausally equivalent before and after do(X=x).
[Figure: three DAGs over X, Y and Z; e.g., the 2nd and 3rd are causally equivalent.]
In other words, two causal BNs are causally equivalent wrt do(X=x) iff they are acausally equivalent and X has the same parents in both BNs.

Example 1

Sachs, K. et al.: Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 308 (2005) 523-529.
9 perturbations × 600 samples/perturbation = 5400 samples !!! And each from a single cell !!! Ideal !!!
Example 1

[Figure: the learned signaling network]
- Importance of interventions.
- Importance of large datasets.
- Importance of single cells.
- Due to acyclicity ? Dynamic BNs ?

Example 1

[Figure: an alternative model] May fit the data well, but doesn't reveal causality: importance of interventions.
Example 2

Pournara, I. and Wernisch, L.: Reconstruction of Gene Networks Using Bayesian Learning and Manipulation Experiments. Bioinformatics 20 (2004) 2934-2942.
Divide the equivalence class into many small subclasses.

Bibliography

Books
- Pearl (1988, 2000), Castillo et al. (1997), Neapolitan (2003), Lauritzen (1999), Jensen (1996, 2000), …
Articles
- E.g. the UAI proceedings.
- Pe'er, D.: Bayesian Network Analysis of Signaling Networks: A Primer. Science STKE 281 (2005).
- 1996 lecture by Judea Pearl (www.cs.ucla.edu/~judea).
- http://genomics.princeton.edu/~florian/docs/network-bib.pdf
Software
- www.hugin.dk
- http://www.cs.ubc.ca/~murphyk/Software/bnsoft.html