Learning genetic networks from gene expression data
Dirk Husmeier
Biomathematics & Statistics Scotland (BioSS)
JCMB, The King's Buildings, Edinburgh EH9 3JZ, United Kingdom
http://www.bioss.ac.uk/∼dirk

Paradigm Shift in Molecular Biology

Pre-Genomic
• Reductionist: DNA or RNA or protein
• Generally qualitative, non-numeric
• Hypothesis driven

Post-Genomic
• Holistic, systems approach: DNA and RNA and protein
• Quantitative, highly numeric
• Data driven

=⇒ Need for machine learning and statistics

Inferring genetic networks from microarray gene expression data

[Figure: the central dogma (DNA → mRNA → protein, via transcription and translation) embedded in a regulatory network with feedback, including a hysteretic oscillator, a switch, cascades and external ligand binding. From http://www.csu.edu.au/faculty/health/biomed/subjects/molbol/images]

Microarrays

[Figure: microarray gene expression measurement. From http://www.nhgri.nih.gov/DIR/Microarray/NEJM Supplement/]

Reverse engineering
Learn the network structure from postgenomic data.
Problem: noise, sparse data.

Bayesian networks
A probabilistic framework for robust inference of interactions in the presence of noise.
Nir Friedman et al. (2000) Journal of Computational Biology 7: 601-620

Outline of the talk
• Recapitulation: Bayesian networks
• Reverse engineering: learning networks from data
• Estimating the accuracy of inference

Revision: Bayes' rule
G: a certain gene is over-expressed. C: a patient is suffering from cancer.
[Figure: Venn diagram of the events G and C in the sample space Ω.]
P(G|C) = P(G, C) / P(C), hence P(G, C) = P(G|C) P(C)
P(C|G) = P(G, C) / P(G), hence P(G, C) = P(C|G) P(G)
Equating the two expressions for P(G, C) gives Bayes' rule:
P(C|G) = P(G|C) P(C) / P(G)

Bayesian networks: nodes, edges and factorization
A Bayesian network consists of nodes (here A, B, C, D, E) connected by directed edges; no directed cycles are allowed. The joint distribution factorizes into local conditional distributions, one per node:
P(A, B, C, D, E) = ∏_i P(node_i | parents_i)
For the example graph with edges A → B, A → C, B → D, C → D and D → E, building the product up node by node gives
P(A, B, C, D, E) = P(A) P(B|A) P(C|A) P(D|B, C) P(E|D)
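To make the factorization concrete, here is a minimal sketch (not part of the original slides) that evaluates the joint probability of one configuration of the five binary nodes; all conditional probability tables below are made-up values, purely for illustration.

    # Minimal sketch: evaluate P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|D)
    # for binary nodes. All probability values are made up for illustration.
    P_A = {True: 0.4, False: 0.6}
    P_B_given_A = {True: {True: 0.7, False: 0.3}, False: {True: 0.1, False: 0.9}}
    P_C_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}
    P_D_given_BC = {(True, True):   {True: 0.95, False: 0.05},
                    (True, False):  {True: 0.60, False: 0.40},
                    (False, True):  {True: 0.50, False: 0.50},
                    (False, False): {True: 0.05, False: 0.95}}
    P_E_given_D = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}

    def joint(a, b, c, d, e):
        """Joint probability of one configuration, following the DAG factorization."""
        return (P_A[a]
                * P_B_given_A[a][b]
                * P_C_given_A[a][c]
                * P_D_given_BC[(b, c)][d]
                * P_E_given_D[d][e])

    print(joint(True, True, False, True, True))  # probability of one joint configuration

Summing this product over all 2^5 configurations returns 1, a quick check that the factorization defines a proper distribution.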
Conditional and marginal independence
When are A and B conditionally independent given C?  P(A, B|C) = P(A|C) P(B|C)
When are A and B marginally independent?  P(A, B) = P(A) P(B)
The answer depends on how the three nodes are connected.

Diverging connection (common cause): A ← C → B
P(A, B, C) = P(A|C) P(B|C) P(C)
P(A, B|C) = P(A, B, C) / P(C) = P(A|C) P(B|C)
But: P(A, B) ≠ P(A) P(B)
Example: Babies and Storks are both driven by a common cause, the Environment.

Serial connection (chain): A → C → B
P(A, B, C) = P(B|C) P(C|A) P(A)
P(A, B|C) = P(A, B, C) / P(C) = P(B|C) P(C|A) P(A) / P(C) = P(B|C) P(A|C)
But: P(A, B) ≠ P(A) P(B)
Example: Cloudy → Rain → Grass wet.

Converging connection (v-structure): A → C ← B
P(A, B, C) = P(C|A, B) P(A) P(B)
P(A, B) = Σ_C P(A, B, C) = P(A) P(B)
But: P(A, B|C) ≠ P(A|C) P(B|C)
Example: Battery → Engine ← Petrol: given that the engine does not start, learning that the battery is fine makes an empty petrol tank more likely ("explaining away").

d-separation
[Figure: the three connection types with unobserved and observed middle nodes, marking blocked and open paths.]
A serial or diverging node blocks a path when it is observed; a converging node blocks a path unless it, or one of its descendants, is observed.
• Two variables A, B are d-separated if every path from A to B is blocked.
• A ⊥ B iff A and B are d-separated.

Bayesian network = DAG + distribution family + parameters
P(A, B, C, D, E) = ∏_i P(node_i | parents_i)

Example: multinomial CPD
The sprinkler network: Cloudy is a parent of both Sprinkler and Rain, and Sprinkler and Rain are the parents of Wet grass. Each node carries a conditional probability table over {true, false}, e.g. P(Cloudy = true) = P(Cloudy = false) = 0.5, and P(Wet grass | Sprinkler, Rain) is specified for all four parent configurations.

Linear Gaussian CPD
P(A | P_1, ..., P_n) = N(w_0 + Σ_{i=1}^n w_i P_i, σ²)

Learning network structure from data
Find the best network structure M: M* = argmax_M P(M|D)
P(M|D) ∝ P(D|M) P(M), with marginal likelihood P(D|M) = ∫ P(D|θ, M) P(θ|M) dθ
When is the integral analytically tractable?
• Complete observation: no missing values.
• P(D|θ, M) and P(θ|M) must satisfy certain regularity conditions.
• Examples: multinomial with a Dirichlet prior, linear Gaussian with a normal-gamma prior.

Biological example: yeast cell cycle (clustering)
Spellman et al. (1998) Molecular Biology of the Cell 9(12): 3273-97
[Figure: genes × experiments expression matrix, before and after clustering. From Spellman et al., http://cellcycle-www.stanford.edu/]
Advantage of clustering: fast, computationally cheap.
Shortcoming of clustering: it is NOT reverse engineering. SLT2 clusters with the low-osmolarity response genes, but the clustering says nothing about how these genes are regulated.

Biological example: yeast cell cycle (Bayesian networks)
Nir Friedman et al. (2000) Journal of Computational Biology 7: 601-620
Known biology: the MAP kinase SLT2 acts via the transcription factors Rlm1p and Swi4/6 on the low-osmolarity response genes. The Bayesian network analysis identifies SLT2 as a regulator (parent) of the low-osmolarity response genes, rather than merely grouping it with them.
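To make the contrast between the two analyses concrete, here is a minimal sketch (not from the slides, using random placeholder data) of Spellman-style hierarchical clustering of expression profiles. Clustering of this kind groups co-expressed genes, such as SLT2 and the low-osmolarity response genes, into one cluster, but unlike the Bayesian network analysis it infers no regulatory structure.

    # Minimal sketch (illustrative only): hierarchical clustering of gene expression
    # profiles by correlation distance, as in Spellman-style cell-cycle analyses.
    # The expression matrix is random placeholder data, not real measurements.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    n_genes, n_experiments = 100, 18
    expression = rng.normal(size=(n_genes, n_experiments))    # genes x experiments

    # Correlation distance: 1 - Pearson correlation between gene profiles.
    corr = np.corrcoef(expression)
    dist = 1.0 - corr[np.triu_indices(n_genes, k=1)]          # condensed distance vector

    tree = linkage(dist, method="average")                    # agglomerative clustering
    labels = fcluster(tree, t=10, criterion="maxclust")       # cut the tree into 10 clusters

    print("cluster sizes:", np.bincount(labels)[1:])
    # Co-clustered genes are co-expressed, but the clustering says nothing about
    # which gene regulates which: that is the reverse-engineering problem.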
Can we learn causal relationships from conditional dependencies?
• Bayesian network: a node is independent of its non-descendants, given its parents.
• Causal network: a node is independent of its earlier causes, given its immediate causes.
A causal network is a valid Bayesian network, but a Bayesian network learned from observational data need not reflect the true causal directions, for two reasons.

Problem 1: Hidden variables
If the true causal graph contains a node A that is not observed, the network learned over the remaining nodes can contain edges that do not correspond to direct causal interactions.

Problem 2: Equivalence classes and PDAGs
The three structures A → C → B, A ← C ← B and A ← C → B have the factorizations
P(A, B, C) = P(B|C) P(C|A) P(A) = P(A|C) P(C|B) P(B) = P(A|C) P(B|C) P(C),
which encode the same conditional independencies and cannot be distinguished from observational data. Only the v-structure A → C ← B, with P(A, B, C) = P(C|A, B) P(A) P(B), is distinguishable.
• Two DAGs are equivalent iff they have the same skeleton (the underlying undirected graph) and the same v-structures.
• v-structure: converging directed edges into the same node without an edge between the parents.
An equivalence class of DAGs can be represented by a PDAG (partially directed acyclic graph).
=⇒ We can only learn PDAGs from the data!

Observations versus interventions
• Observation: a passive measurement of the domain of interest.
• Intervention: setting the values of some variables using forces outside the causal model, e.g. gene knockout or over-expression.
• Interventions can destroy the symmetry within an equivalence class.
Interventional data: suppose A and B are correlated, so both A → B and B → A are candidate structures. If inhibition of A leads to down-regulation of B, this supports A → B; if inhibition of A has no effect on B, this supports B → A.

Learning with interventions
Without interventions, the likelihood factorizes as
P(D|M) = ∏_{i=1}^n P(X_i = D_i | pa[X_i] = D_pa[X_i]), or more briefly P(D|M) = ∏_i P(X_i | Pa(X_i)),
and two structure-equivalent models M and M' receive the same score. With interventions, each node's term uses only the cases in which that node was not set by intervention:
P(D|M) = ∏_{i=1}^n P(X_i = D_i^(-{i}) | pa[X_i] = D_pa[X_i]^(-{i})),
where the superscript -{i} denotes the data with the cases in which X_i was intervened on removed. In the simplest setting, with Int the set of intervened nodes, the modified score is P(D|M) = ∏_{i: X_i ∉ Int} P(X_i | Pa(X_i)).
This score is no longer structure equivalent, so interventions can discriminate between members of an equivalence class. A small code sketch of this intervention-aware score follows at the end of this section.

[Figure: a sequence of candidate networks over the five example nodes A to E.]

Active learning
Based on preliminary inference, predict the intervention that maximizes the information content of the expected response. This closes the loop: experiment → measurements → modelling → inference → intervention → next experiment.
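As announced above, here is a minimal sketch of the intervention-aware score. The data format, the variable names and the plug-in multinomial estimates are illustrative assumptions; the only element carried over from the slides is that a node's likelihood contribution omits the cases in which that node was set by intervention.

    # Minimal sketch (illustrative assumptions): an intervention-aware log-likelihood
    # score for a network over binary variables. Each data case is a dict of node
    # values plus the set of nodes that were clamped by intervention in that case.
    import math
    from collections import defaultdict

    def log_score(parents, cases):
        """parents: dict node -> tuple of parent nodes (the structure M).
        cases: list of (values, intervened) pairs; values maps node -> 0/1,
        intervened is the set of nodes clamped by intervention in that case."""
        logp = 0.0
        for node, pa in parents.items():
            # Estimate P(node | parents) from the cases where node was NOT intervened on,
            # using simple pseudocounts to avoid zero probabilities.
            counts = defaultdict(lambda: [1.0, 1.0])
            for values, intervened in cases:
                if node not in intervened:
                    counts[tuple(values[p] for p in pa)][values[node]] += 1.0
            # Add this node's log-likelihood contribution, again skipping intervened cases.
            for values, intervened in cases:
                if node not in intervened:
                    c0, c1 = counts[tuple(values[p] for p in pa)]
                    prob = (c1 if values[node] == 1 else c0) / (c0 + c1)
                    logp += math.log(prob)
        return logp

    # Toy usage: structure A -> B, one observational case and one case with A knocked out.
    structure = {"A": (), "B": ("A",)}
    data = [({"A": 1, "B": 1}, set()), ({"A": 0, "B": 0}, {"A"})]
    print(log_score(structure, data))

Dropping the intervened cases from a node's term is what breaks structure equivalence: the scores of A → B and B → A differ once A has been clamped experimentally.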
Dynamic Bayesian Networks
Symmetry breaking by the direction of time: a cause precedes its effect. The network is unrolled over time slices t = 1, 2, 3, 4, ...; since edges only run forward in time, recurrent structures and feedback loops can be modelled without creating directed cycles.

Outline of the talk
• Recapitulation: Bayesian networks
• Reverse engineering: learning networks from data
• Estimating the accuracy of inference

Learning the network from data
Find the best network structure M: M* = argmax_M P(M|D)
Find the best parameters: θ* = argmax_θ P(θ|D, M*)
P(M|D) ∝ P(D|M) P(M), with marginal likelihood P(D|M) = ∫ P(D|θ, M) P(θ|M) dθ
When is the integral analytically tractable?
• Complete observation: no missing values.
• P(D|θ, M) and P(θ|M) must satisfy certain regularity conditions.
• Examples: multinomial with a Dirichlet prior, linear Gaussian with a normal-gamma prior.

BDe: multinomial with a Dirichlet prior
Heckerman, Geiger, Chickering (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning 20, 245-274.

BGe: linear Gaussian with a normal-gamma prior
Geiger and Heckerman (1994) Learning Gaussian networks. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, 235-243.

Naive approach
• Compute P(M|D) for all possible network structures M.
• Select the network structure M* that maximizes P(M|D).

Problem 1: the number of different network structures increases super-exponentially with the number of nodes.

Number of nodes:       2    4      6            8             10
Number of structures:  3    543    3.7 × 10^6   7.8 × 10^11   4.2 × 10^18

−→ The optimization problem is intractable for a large number of nodes.

Problem 2: the data are sparse −→ intrinsic uncertainty of the inference.
[Figure: the posterior P(M|D) over structures. For a large data set D the posterior is sharply peaked and the best structure M* is well defined; for a small data set D it is diffuse, reflecting intrinsic uncertainty about M*.]

Objective: sample from the posterior distribution
P(M_k|D) = P(D|M_k) P(M_k) / Σ_i P(D|M_i) P(M_i)
The direct approach is intractable because of the sum Σ_i P(D|M_i) P(M_i) over all structures.

Markov chain Monte Carlo (MCMC)
• Proposal move: given the current network M_old, propose a new network M_new with probability Q(M_new|M_old).
• Acceptance/rejection: accept the new network with probability
  min{ 1, [P(M_new|D) / P(M_old|D)] × [Q(M_old|M_new) / Q(M_new|M_old)] }
[Figure: a chain of proposed networks, some accepted and some rejected.]
Writing out the posterior, the acceptance probability becomes
  min{ 1, [P(D|M_new) / P(D|M_old)] × [P(M_new) / P(M_old)] × [Q(M_old|M_new) / Q(M_new|M_old)] },
the product of a marginal likelihood ratio, a prior ratio and a proposal ratio; the following sections consider each factor in turn.

Marginal likelihood
P(D|M) = ∫ P(D|θ, M) P(θ|M) dθ is analytically tractable under complete observation and the regularity conditions above; the BDe (multinomial, Dirichlet prior) and BGe (linear Gaussian, normal-gamma prior) scores are the two standard choices.
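To show how these pieces fit together, here is a minimal sketch (illustrative assumptions throughout) of the structure-MCMC loop itself: structures are sets of directed edges, single-edge moves define the proposal, and the placeholder log_score stands in for a proper log marginal likelihood plus log prior such as BDe or BGe.

    # Minimal sketch (illustrative): Metropolis-Hastings over network structures with
    # single-edge moves. The scoring function is a placeholder for a real BDe/BGe score.
    import math, random

    NODES = ["A", "B", "C", "D", "E"]

    def is_acyclic(edges):
        """Check that the directed graph given by the edge set has no directed cycle."""
        graph = {n: [b for (a, b) in edges if a == n] for n in NODES}
        state = {n: 0 for n in NODES}          # 0 = unvisited, 1 = on stack, 2 = done
        def visit(n):
            if state[n] == 1: return False
            if state[n] == 2: return True
            state[n] = 1
            ok = all(visit(m) for m in graph[n])
            state[n] = 2
            return ok
        return all(visit(n) for n in NODES)

    def neighbourhood(edges):
        """All structures reachable by adding, deleting or reversing one edge (DAGs only)."""
        out = []
        for a in NODES:
            for b in NODES:
                if a == b: continue
                if (a, b) in edges:
                    out.append(edges - {(a, b)})                    # delete
                    rev = (edges - {(a, b)}) | {(b, a)}             # reverse
                    if is_acyclic(rev): out.append(rev)
                elif (b, a) not in edges:
                    add = edges | {(a, b)}                          # add
                    if is_acyclic(add): out.append(add)
        return out

    def log_score(edges):
        """Placeholder for log P(D|M) + log P(M); here just an edge-count penalty."""
        return -0.5 * len(edges)

    def structure_mcmc(n_steps=1000, seed=0):
        random.seed(seed)
        current = frozenset()                                       # start from the empty graph
        samples = []
        for _ in range(n_steps):
            nbrs = neighbourhood(current)
            proposal = random.choice(nbrs)
            # Hastings ratio: score ratio times neighbourhood-size (proposal) ratio.
            log_accept = (log_score(proposal) - log_score(current)
                          + math.log(len(nbrs)) - math.log(len(neighbourhood(proposal))))
            if random.random() < math.exp(min(0.0, log_accept)):
                current = proposal
            samples.append(current)
        return samples

    samples = structure_mcmc()
    print("mean number of edges in sampled structures:",
          sum(len(m) for m in samples) / len(samples))

With a real BDe or BGe score plugged in, the accepted structures form a sample {M_i} from the posterior P(M|D), which is what the feature posteriors discussed below are computed from.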
Choice of conditional probability distribution for the marginal likelihood

Multinomial CPD (as in the sprinkler example above)
Disadvantages:
• Requires discretization −→ loss of information.
• The number of free parameters is exponential in the number of parents.
Advantage:
• Can model nonlinear interactions.

Linear Gaussian CPD
P(A | P_1, ..., P_n) = N(w_0 + Σ_{i=1}^n w_i P_i, σ²)
Advantages:
• The number of free parameters is linear in the number of parents.
• No discretization −→ no loss of information.
Disadvantage:
• Can only model linear interactions: with parents A and B, the child is always modelled as C = w_1 A + w_2 B.

Alternatives to the BDe and BGe scores
Nonlinear and continuous relations, possibly including hidden variables:
• Imoto et al. (2003): heteroscedastic regression, Laplace approximation
• Beal (2003): Bayesian networks with hidden variables, variational Bayesian EM (VBEM)
• Pournara (2005): nonlinear regression with Gaussian processes, RJMCMC

Prior probability P(M), the second factor of the acceptance ratio
The prior can encode structural constraints: the fan-out of a node is unrestricted, but the fan-in (the number of parents per node) is restricted; networks exceeding the fan-in limit are not permissible.

Proposal probability Q(M_new|M_old), the third factor of the acceptance ratio
MCMC moves: delete an edge, reverse an edge, or create a new edge. Each move in the neighbourhood of M_old (the set of structures reachable by a single edge deletion, reversal or addition) is proposed with probability 1/|neighbourhood|, e.g. 1/5 for one of the example networks on the slides and 1/6 for its neighbour. Because the neighbourhood size changes with the structure, the proposal is not symmetric and the ratio Q(M_old|M_new)/Q(M_new|M_old) must be retained in the acceptance probability.

Problem: statistical significance of the networks
• Complex models: transcript levels of hundreds of genes.
• Sparse data: typically a few dozen samples.
• The posterior probability P(M|D) is diffuse: global network inference is meaningless.

Solution: focus on features and subnetworks
Feature: an indicator variable for a property of interest, e.g. "are X and Y close neighbours in the network?":
f(M) = 1 if M satisfies the feature, 0 otherwise.
Posterior probability of a feature: P(f|D) = Σ_M f(M) P(M|D)
Approximate this sum with MCMC: P(f|D) ≈ (1/T) Σ_{i=1}^T f(M_i),
where {M_i} is a sample from the posterior obtained with MCMC. (A short code sketch of this estimate is given at the end of this section.)

[Figure: model network used to generate synthetic data, data set size N = 50, and the predicted connectivity spectrum (inferred connectivity per node).]

Outline of the talk
• Recapitulation: Bayesian networks
• Reverse engineering: learning networks from data
• Estimating the accuracy of inference
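As referenced above, here is a minimal sketch of the MCMC estimate of a feature posterior, for the family of edge features; the toy sample below is a placeholder for the output of a structure-MCMC run such as the sketch given earlier.

    # Minimal sketch (illustrative): estimating feature posteriors from an MCMC sample
    # of structures, P(f|D) ~ (1/T) * sum_{i=1..T} f(M_i). Each sampled structure is a
    # set of directed edges; the toy "samples" stand in for a real structure-MCMC run.
    from collections import Counter

    def edge_feature_posteriors(structure_samples):
        """For the edge features f_{X->Y}(M) = 1 iff X -> Y is in M, return the
        fraction of sampled structures containing each edge."""
        counts = Counter()
        for m in structure_samples:
            counts.update(m)
        T = len(structure_samples)
        return {edge: n / T for edge, n in counts.items()}

    toy_samples = [
        {("A", "B"), ("B", "C")},     # placeholder structures, e.g. from structure_mcmc()
        {("A", "B")},
        {("A", "B"), ("A", "C")},
    ]
    for (x, y), p in sorted(edge_feature_posteriors(toy_samples).items(), key=lambda kv: -kv[1]):
        print(f"P({x} -> {y} present | D) ~ {p:.2f}")

Thresholding these posterior probabilities gives a set of predicted edges whose accuracy can then be assessed.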