2008-11-14
Introduction to Bayesian networks
Jose M. Peña

Outline
- Biological network = probability distribution
- Probability theory
- Bayesian networks: semantics, learning, example
- Dynamic Bayesian networks
- Gaussian networks
- Markov networks

Biological network = probability distribution
- [Figure: a small network over GeneA, GeneB and GeneC, next to a table of binary measurements.]
- Deterministic relations? Probabilistic relations? Noise?
- "All models are wrong but some are useful."
- Biological network over genes, proteins, … = probability distribution over genes, proteins, …

Probability theory
- p(X=x, Y=y) >= 0 for all x and y.
- Σx,y p(X=x, Y=y) = 1.
- Marginal: p(X=x) = Σy p(X=x, Y=y).
- Conditional: p(X=x | Y=y) = p(X=x, Y=y) / p(Y=y).
- Independence: p(X=x | Y=y) = p(X=x) for all x and y, or equivalently p(X=x, Y=y) = p(X=x)p(Y=y) for all x and y.
- Conditional independence: p(X|Y,Z) = p(X|Z), or equivalently p(X,Y|Z) = p(X|Z)p(Y|Z).

Example:

  p(X,Y)   Y=0   Y=1           p(X)
  X=0      0.3   0.3      X=0  0.6
  X=1      0.2   0.2      X=1  0.4
  p(Y)     0.5   0.5

  p(X|Y)   Y=0       Y=1
  X=0      0.3/0.5   0.3/0.5
  X=1      0.2/0.5   0.2/0.5

Since p(X=x | Y=y) = p(X=x) for all x and y, X and Y are independent in this example.

Bayesian networks: Semantics
- A Bayesian network (BN) over X1,…,Xn consists of a directed acyclic graph (DAG) over X1,…,Xn and probability distributions p(Xi|Parents(Xi)) for all i, and defines a probability distribution over X1,…,Xn as p(X1,…,Xn) = Πi p(Xi|Parents(Xi)).
- By the chain rule, p(X,Y,Z) = p(X,Y)p(Z|X,Y) = p(X)p(Y|X)p(Z|X,Y); this is the factorization of the complete DAG X → Y, X → Z, Y → Z.
- Drop the edge from X to Y; then p(X,Y,Z) = p(X)p(Y)p(Z|X,Y).
- In general: p(X1,…,Xn) = Πi p(Xi|Parents(Xi)).

Bayesian networks: Semantics
Which independencies are represented in a DAG?
- After dropping the edge from X to Y: p(X)p(Y|X)p(Z|X,Y) = p(X,Y,Z) = p(X)p(Y)p(Z|X,Y), so p(Y|X) = p(Y), so X is independent of Y. No numeric calculation required!
- For the chain X → Y → Z: p(X,Z) = Σy p(X)p(Y|X)p(Z|Y) ≠ p(X)p(Z) in general, but p(X,Z|Y) = p(X)p(Y|X)p(Z|Y)/p(Y) = p(Y)p(X|Y)p(Z|Y)/p(Y) = p(X|Y)p(Z|Y), so X ┴ Z | Y.
- In general, use the d-separation criterion: X is independent of Y given Z when for every undirected path between X and Y there exists a node W in the path such that either W does not have two parents in the path and W is in Z, or W has two parents in the path and neither W nor any of its descendants is in Z.
- Exercises [figures: DAGs over U, V, W, X, Y, Z]: U ┴ V? U ┴ V | ZW? XU ┴ YV? X ┴ Y | Z? X ┴ Y | UV? U ┴ Z | X? U ┴ Y | ZV?
- Two DAGs represent the same independencies iff they have the same adjacencies and the same immoralities.
- The Markov boundary of X, MB(X), is the minimal set such that X ┴ Rest | MB(X). In a BN, MB(X) = Parents(X) ∪ Children(X) ∪ Spouses(X).
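The factorization is easy to exercise numerically. Below is a minimal sketch in Python (the CPT numbers are invented for illustration, not taken from the slides) that builds the joint p(X,Y,Z) = p(X)p(Y)p(Z|X,Y) of the drop-the-edge example by brute-force enumeration and confirms that X and Y come out independent, just as the algebraic argument shows without any numbers:

    import itertools

    # Hypothetical CPTs for binary X, Y, Z in the DAG with the X -> Y edge dropped,
    # so that p(X, Y, Z) = p(X) p(Y) p(Z | X, Y).
    pX = {0: 0.6, 1: 0.4}
    pY = {0: 0.5, 1: 0.5}
    pZ_XY = {(0, 0): {0: 0.9, 1: 0.1},   # p(Z | X=0, Y=0)
             (0, 1): {0: 0.7, 1: 0.3},
             (1, 0): {0: 0.4, 1: 0.6},
             (1, 1): {0: 0.2, 1: 0.8}}

    # The joint distribution: one factor per node, as in the BN definition.
    joint = {(x, y, z): pX[x] * pY[y] * pZ_XY[(x, y)][z]
             for x, y, z in itertools.product((0, 1), repeat=3)}

    def marginal(var_idx, val):
        """Sum the joint over all configurations where variable var_idx equals val."""
        return sum(p for cfg, p in joint.items() if cfg[var_idx] == val)

    # Check X independent of Y: p(X=x, Y=y) = p(X=x) p(Y=y) for all x and y.
    for x, y in itertools.product((0, 1), repeat=2):
        pxy = sum(p for cfg, p in joint.items() if cfg[0] == x and cfg[1] == y)
        assert abs(pxy - marginal(0, x) * marginal(1, y)) < 1e-12
    print("X independent of Y, whatever p(Z|X,Y) is, as the factorization predicts.")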
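Checking d-separation by hand becomes error-prone on larger DAGs. The lecture only states the criterion itself, so the following sketch relies on a standard equivalent formulation (an assumption on my part, not something from the slides): X is d-separated from Y given Z iff X and Y are disconnected in the moral graph of the subgraph induced by the ancestors of X ∪ Y ∪ Z.

    def d_separated(dag, xs, ys, zs):
        """dag maps each node to its set of parents. Returns True iff xs _|_ ys | zs
        under d-separation, via the moral ancestral graph formulation."""
        # 1. Keep only the ancestors of xs | ys | zs (the nodes themselves included).
        relevant, stack = set(), list(xs | ys | zs)
        while stack:
            n = stack.pop()
            if n not in relevant:
                relevant.add(n)
                stack.extend(dag[n])
        # 2. Moralize: marry co-parents, then drop edge directions.
        adj = {n: set() for n in relevant}
        for n in relevant:
            parents = dag[n] & relevant
            for p in parents:
                adj[n].add(p)
                adj[p].add(n)
            for p in parents:
                adj[p] |= parents - {p}
        # 3. Separation: is there a path from xs to ys that avoids zs?
        seen = set(xs) - zs
        stack = list(seen)
        while stack:
            n = stack.pop()
            if n in ys:
                return False
            new = adj[n] - zs - seen
            seen |= new
            stack.extend(new)
        return True

    # The chain X -> Y -> Z from the slides:
    chain = {"X": set(), "Y": {"X"}, "Z": {"Y"}}
    print(d_separated(chain, {"X"}, {"Z"}, set()))     # False: X and Z dependent
    print(d_separated(chain, {"X"}, {"Z"}, {"Y"}))     # True:  X _|_ Z | Y

    # The collider X -> Z <- Y (the DAG with the X -> Y edge dropped):
    collider = {"X": set(), "Y": set(), "Z": {"X", "Y"}}
    print(d_separated(collider, {"X"}, {"Y"}, set()))  # True:  X _|_ Y
    print(d_separated(collider, {"X"}, {"Y"}, {"Z"}))  # False: conditioning on a collider couples its parents

The two toy DAGs reproduce the chain calculation and the drop-the-edge argument above, including the fact that conditioning on a common child creates dependence between its parents.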
Bayesian networks: Learning
- So far, we have a framework that simplifies specifying probability distributions and enables us to reason quantitatively and qualitatively. Is it learnable from data?
- Learning a BN B = (BS, BP) from a database D = {C1,…,Cm} of m cases consists in learning the best DAG BS, and learning the best probability distributions BP for BS.

Learning via independence tests
- Take any ordering X1,…,Xn. For each i, if Pi is the smallest subset of {X1,…,Xi-1} such that Xi ┴ {X1,…,Xi-1}\Pi | Pi, then set Parents(Xi) = Pi.
- The so-obtained DAG only represents true independencies, and no subgraph of it has this property.

The K2 algorithm
- Take any ordering X1,…,Xn and start from the empty graph, i.e. Parents(Xi) = ∅ for all i.
- For each i, repeat: if there exists an Xk in {X1,…,Xi-1}\Parents(Xi) such that Xi ┴ Xk | Parents(Xi) does not hold, then add it to Parents(Xi).

Hill-climbing approach
- Start from the empty graph.
- Repeat: perform the edge addition/removal that improves the score the most.*
  * Check for cycles + store scores to avoid recomputing them.

Scoring: p(BS|D) = p(BS,D) / p(D) ∝ p(BS,D) = p(BS)p(D|BS).

The K2 score
- Consistent, decomposable, non-equivalent (two equivalent DAGs may score differently).

The BDeu score
- Consistent, decomposable, equivalent.
- Uses an uninformative assignment of the prior counts N'ijk; note that BDeu reduces to the K2 score when all N'ijk = 1.
- The expected parameter values are E[p(Xi=k|Parents(Xi)=j)] = (N'ijk+Nijk)/(N'ij+Nij).

The BIC/MDL score
- Consistent, decomposable, equivalent.
- BIC(BS,D) = log maxBP p(D|BS,BP) - 0.5 · log m · Σi qi(ri - 1), where ri is the number of states of Xi and qi the number of configurations of Parents(Xi).
- maxBP p(D|BS,BP) is reached when p(Xi=k|Parents(Xi)=j) = Nijk/Nij, where Nijk is the number of cases in D with Xi in state k and Parents(Xi) in configuration j. These are the best parameter values for BS! (A numerical sketch of this score follows at the end of this section.)

Bayesian networks: Example
- [Figures: the true model versus the models visited during learning.]

Dynamic Bayesian networks
- [Figures.]

Gaussian networks
- Assume X = (X1,…,Xn) ~ N(µ,Σ), where µ = E[X] and Σ = cov(X,X) = E[(X-µ)(X-µ)'].

Markov networks: Semantics
- A.k.a. Gaussian graphical models, covariance selection models, Markov random fields, …
- Based on undirected graphs: p(X1,…,Xn) = (ΠCi q(ci)) / Z, a normalized product of clique potentials.
- u-separation criterion: X is independent of Y given Z if all the paths between X and Y are blocked by Z.

Markov networks: Learning
- Start from the complete graph. Repeat: if Xi ┴ Xk | Rest, then remove the edge between Xi and Xk.
- In the Gaussian case, Xi ┴ Xk | Rest if and only if Σ^-1[i,k] = 0.
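The Σ^-1 criterion is easy to see in action. Here is a minimal NumPy sketch on simulated data for an assumed Gaussian chain X0 → X1 → X2 (so X0 ┴ X2 | X1 should hold; the coefficients are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate a Gaussian chain X0 -> X1 -> X2.
    n = 100_000
    x0 = rng.normal(size=n)
    x1 = 0.8 * x0 + rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(size=n)
    X = np.column_stack([x0, x1, x2])

    # Sigma = cov(X, X); Xi _|_ Xk | Rest iff Sigma^-1[i, k] = 0.
    precision = np.linalg.inv(np.cov(X, rowvar=False))
    print(np.round(precision, 2))
    # The (0, 2) entry is ~0 (X0 _|_ X2 | X1), while the (0, 1) and (1, 2) entries
    # are clearly nonzero, so the edge-removal algorithm above keeps exactly the chain.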
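Here is the numerical sketch of the BIC/MDL score promised in the learning section above, assuming complete discrete data stored as a list of tuples; the function name and the toy database are illustrative, not from the lecture:

    import math
    from collections import Counter

    def bic_score(data, dag, card):
        """data: list of tuples (one value per variable, indexed by position);
        dag: variable index -> tuple of parent indices; card: index -> #states.
        Returns BIC(Bs, D) = log max_BP p(D|Bs,BP) - 0.5 * log(m) * sum_i q_i (r_i - 1)."""
        m = len(data)
        log_lik, dim = 0.0, 0
        for i, parents in dag.items():
            n_ijk = Counter()  # N_ijk: cases with Parents(Xi) = j and Xi = k
            n_ij = Counter()   # N_ij:  cases with Parents(Xi) = j
            for case in data:
                j = tuple(case[p] for p in parents)
                n_ijk[(j, case[i])] += 1
                n_ij[j] += 1
            # Maximum-likelihood parameters: p(Xi=k | Parents(Xi)=j) = N_ijk / N_ij.
            log_lik += sum(n * math.log(n / n_ij[j]) for (j, _), n in n_ijk.items())
            # q_i = number of parent configurations, r_i = number of states of Xi.
            q_i = 1
            for p in parents:
                q_i *= card[p]
            dim += q_i * (card[i] - 1)
        return log_lik - 0.5 * math.log(m) * dim

    # Toy use: six cases over three binary variables; compare two candidate DAGs,
    # exactly as one step of the hill-climbing search would.
    data = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 1), (0, 1, 1), (1, 0, 0)]
    card = {0: 2, 1: 2, 2: 2}
    print(bic_score(data, {0: (), 1: (0,), 2: (1,)}, card))  # chain X0 -> X1 -> X2
    print(bic_score(data, {0: (), 1: (), 2: ()}, card))      # empty graph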
Why Bayesian networks?
- Solidly founded on probability theory.
- Can cope with noise and probabilistic relations.
- Graphical interface.
- Learnable from data and prior knowledge.
- Offer flexible reasoning.
- Accept both causal and acausal interpretations.
- Model both linear and non-linear interactions.
On the other hand:
- Too many samples may be required for accuracy.
- Scalable to thousands of genes?
- Gaussian networks are limited to linear interactions.
- Learning Markov networks is not as well studied.

More on Bayesian networks
- Continuous random variables, e.g. Gaussian networks.
- Learning via independence tests, e.g. the PC algorithm.
- Learning in the space of equivalent BNs, e.g. GES, KES, the PC algorithm.
- Model averaging, e.g. MCMC.
- Learning from interventional data.
- Causal Bayesian networks.

Causal Bayesian networks
Outline
- Biological network = probability distribution ("All models are wrong but some are useful.")
- Causal Bayesian networks
- Interventions
- Learning from observations
- Learning from observations and interventions
- Example 1, Example 2

Causal Bayesian networks
- Causal BN = BN + causal interpretation, i.e. Parents(X) are the direct causes of X.
- A causal BN is a BN and, thus, the probability distribution factorizes accordingly, d-separation applies, it allows probabilistic inference, and it is learnable from data.
- In addition, a causal BN enables us to predict the effect of interventions, e.g. the effect of a drug. This is not possible with acausal BNs.

Interventions
- A BN can tell us how the distribution of X changes when observing Y, i.e. p(X|Y=y). In addition, a causal BN can tell us how the distribution of X changes when intervening on Y, i.e. p(X|do(Y=y)).
- Compare p(Cancer=yes | WhiteTeeth=no) with p(Cancer=yes | do(WhiteTeeth=no)).
- A causal BN predicts the effect of an intervention on X1 as p(X2,…,Xn|do(X1=x1)) = Πi≠1 p(Xi|Parents(Xi)), or equivalently: delete the edges from Parents(X1) to X1 and then "observe" X1=x1. This is not possible with acausal BNs. (A small numerical sketch follows at the end of this section.)

Interventions
- Which nodes get affected by an intervention on X? Use d-separation. Answer: only the descendants of X; thus p(Y|do(X=x)) = p(Y) if Y is not one of them.
- Predicting the effect of an intervention that rewires the causal BN: p(X2,…,Xn|do(Parents(X1)=NewParents(X1))) = p(X1|NewParents(X1)) Πi≠1 p(Xi|Parents(Xi)), i.e. rewire the causal BN accordingly.

Learning from observations
The IC algorithm (≈ the PC algorithm):
1. Start from the complete graph.
2. For each pair of nodes Xi and Xk: if Xi ┴ Xk | Sik, then remove the edge between Xi and Xk (try small sets Sik first).
3. For each induced subgraph Xi - Y - Xk: if Y is not in Sik, then orient it as Xi → Y ← Xk.
4. Orient as many of the remaining lines as possible (there exist rules for fulfilling this step).
- The output is not a BN but a class of equivalent BNs!
- Alternative: learn a BN and then keep only the compelled edges.

Learning from observations and interventions
- As before, but now Nijk is the number of cases in D in which Xi is observed (as opposed to intervened, manipulated or perturbed) in state k while its parents are in state j. Then E[p(Xi=k|Parents(Xi)=j)] = (N'ijk+Nijk)/(N'ij+Nij); similarly for BIC/MDL.
- Choose between the equivalent BNs X → Y and X ← Y on the basis of interventions: if p(Y|do(X=x)) = p(Y|X=x) ≠ p(Y) and p(X|do(Y=y)) = p(X) ≠ p(X|Y=y), then X → Y.

Learning from observations and interventions
- Even with interventions, ambiguity may remain: two causal BNs are causally equivalent w.r.t. do(X=x) if they are acausally equivalent before and after do(X=x).
- [Figure: three DAGs over X, Y, Z; e.g., the 2nd and the 3rd are causally equivalent.]
- In other words, two causal BNs are causally equivalent w.r.t. do(X=x) iff they are acausally equivalent and X has the same parents in both BNs.
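The observational/interventional contrast can be made concrete. Below is a minimal sketch with made-up CPTs for a confounded pair, assuming the causal DAG Z → X, Z → Y, X → Y (all names and numbers are illustrative): conditioning uses the full joint, while do(X=x) uses the truncated factorization that deletes the edge from Z into X.

    # Hypothetical CPTs for the causal DAG Z -> X, Z -> Y, X -> Y (all binary).
    pZ = {0: 0.5, 1: 0.5}
    pX_Z = {0: {0: 0.9, 1: 0.1},                                  # p(X | Z=0)
            1: {0: 0.2, 1: 0.8}}                                  # p(X | Z=1)
    pY_XZ = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.6, 1: 0.4},
             (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.1, 1: 0.9}}  # p(Y | X, Z)

    def p_y_obs_x(y, x):
        """Observational p(Y=y | X=x): condition the full joint on X = x."""
        num = sum(pZ[z] * pX_Z[z][x] * pY_XZ[(x, z)][y] for z in (0, 1))
        den = sum(pZ[z] * pX_Z[z][x] for z in (0, 1))
        return num / den

    def p_y_do_x(y, x):
        """Interventional p(Y=y | do(X=x)): delete the edge Z -> X and clamp X = x,
        i.e. the truncated factorization p(Z, Y | do(X=x)) = p(Z) p(Y | X=x, Z)."""
        return sum(pZ[z] * pY_XZ[(x, z)][y] for z in (0, 1))

    print(p_y_obs_x(1, 1))  # ~0.86: observing X=1 also tells us Z is probably 1
    print(p_y_do_x(1, 1))   # 0.70:  forcing X=1 leaves p(Z) untouched

The gap between the two numbers is exactly the confounding through Z, which an acausal BN cannot disentangle.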
Example 1
- Sachs, K. et al.: Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 308 (2005) 523-529.
- 9 perturbations × 600 samples/perturbation = 5400 samples, each from a single cell. Ideal!
- [Figure: the reconstructed signaling network.]
- Lessons: the importance of interventions, of large datasets, and of single cells.
- Missing arcs due to acyclicity? Dynamic BNs?
- A model may fit the data well and yet not reveal causality; hence, again, the importance of interventions.

Example 2
- Pournara, I. and Wernisch, L.: Reconstruction of Gene Networks Using Bayesian Learning and Manipulation Experiments. Bioinformatics 20 (2004) 2934-2942.
- Idea: divide the equivalence class into many small subclasses.

Bibliography
- Pe'er, D.: Bayesian Network Analysis of Signaling Networks: A Primer. Science STKE 281 (2005).
- 1996 lecture by Judea Pearl (www.cs.ucla.edu/~judea).
- http://genomics.princeton.edu/~florian/docs/network-bib.pdf
- Books: Pearl (1988, 2000), Castillo et al. (1997), Neapolitan (2003), Lauritzen (1999), Jensen (1996, 2000), …
- Articles, e.g. UAI.
- Software: www.hugin.dk and http://www.cs.ubc.ca/~murphyk/Software/bnsoft.html