Reconstructing regulatory networks and signalling pathways from post-genomic data
Dirk Husmeier
Biomathematics & Statistics Scotland, Edinburgh, UK
October 2008

[Opening figure: post-genomics → high-throughput data → machine learning and statistical methods → unknown network]

Overview
• Relevance networks
• Partial correlation analysis
• Graphical Gaussian models
• Differential equation models
• Bayesian networks

Relevance networks (Butte and Kohane, 2000)
1. Choose a measure of association A(·,·)
2. Define a threshold value tA
3. For all pairs of domain variables (X,Y), compute their association A(X,Y)
4. Connect those variables (X,Y) by an undirected edge whose association A(X,Y) exceeds the predefined threshold value tA

[Figure: distribution of association scores; the threshold is set by bootstrapping or a randomization test]

[Figure: a strong correlation σ12 between nodes 1 and 2 can arise from a direct interaction, a common regulator X, or an indirect interaction via intermediate nodes]

Shortcomings
Pairwise associations do not take the context of the system into consideration.
[Figure: direct interaction, common regulator, indirect interaction and co-regulation all give the same pairwise association]

Overview
• Relevance networks
• Partial correlation analysis
• Graphical Gaussian models
• Differential equation models
• Bayesian networks

Distinguish between direct and indirect interactions
[Figure: direct interaction, common regulator, indirect interaction, co-regulation]
A and B have a low partial correlation (once the intermediate node or common regulator is conditioned on).

Identify candidate causal genes within the eQTL confidence interval around a marker by (partial) gene expression correlation analysis.
Correlation → Partial correlation

Network reconstruction, part 1
• For each gene included in the gene list of an eQTL confidence interval → compute its correlation coefficient with the gene expression profile of the gene affected by the eQTL.
• Test for significant departure from zero via a t-test with Bonferroni correction (threshold p-value: 0.05/n, n: number of genes in the eQTL confidence interval).
• If significant: identify the gene with the most significant correlation coefficient → Gene 1.

Network reconstruction, part 2
• Compute first-order partial correlation coefficients between the other genes and the gene affected by the eQTL, conditional on Gene 1.
• Test for significant departure from zero via a t-test with Bonferroni correction (threshold p-value: 0.05/(n−1), n: number of genes in the eQTL confidence interval).
• If significant: identify the gene with the most significant partial correlation coefficient → Gene 2.

Network reconstruction, part 3
• Compute second-order partial correlation coefficients between the other genes and the gene affected by the eQTL, conditional on Genes 1 & 2.
• Test for significant departure from zero via a t-test with Bonferroni correction (threshold p-value: 0.05/(n−2), n: number of genes in the eQTL confidence interval).
• If significant: identify the gene with the most significant partial correlation coefficient → Gene 3.
• And so on … (a code sketch of the full stepwise procedure follows below)
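The following Python sketch (not part of the original slides) illustrates this stepwise procedure. The names (partial_corr, stepwise_selection, expr, target) are my own, and computing partial correlations by regressing out the conditioning set and correlating the residuals is one standard choice among several.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, Z):
    """Partial correlation of x and y given the columns of Z,
    computed by regressing out Z and correlating the residuals."""
    if Z.shape[1] == 0:
        return np.corrcoef(x, y)[0, 1]     # order zero: plain correlation
    A = np.column_stack([np.ones(len(x)), Z])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

def stepwise_selection(expr, target, alpha=0.05):
    """expr: (N samples, n genes) in the eQTL confidence interval;
    target: (N,) profile of the gene affected by the eQTL.
    Returns the indices of Gene 1, Gene 2, ... in selection order."""
    N, n = expr.shape
    remaining = list(range(n))
    selected = []
    while remaining:
        k = len(selected)                   # order of the partial correlation
        Z = expr[:, selected]
        df = N - 2 - k
        best, best_p = None, 1.0
        for j in remaining:
            r = partial_corr(expr[:, j], target, Z)
            t = r * np.sqrt(df / (1.0 - r**2))   # t-test for departure from zero
            p = 2 * stats.t.sf(abs(t), df)
            if p < best_p:
                best, best_p = j, p
        if best_p < alpha / (n - k):        # Bonferroni: 0.05/n, 0.05/(n-1), ...
            selected.append(best)
            remaining.remove(best)
        else:
            break                           # no further significant gene
    return selected
```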
Shortcomings
• Iterative, heuristic piecemeal approach
• No conditioning on the whole system, but on a set of pre-selected genes

Overview
• Relevance networks
• Partial correlation analysis
• Graphical Gaussian models
• Differential equation models
• Bayesian networks

Graphical Gaussian Models
π_ij = −(Σ⁻¹)_ij / √[(Σ⁻¹)_ii · (Σ⁻¹)_jj]
Σ⁻¹: inverse of the covariance matrix
Partial correlation π12, i.e. the correlation conditional on all other domain variables: Corr(X1,X2 | X3,…,Xn)
[Figure: strong partial correlation π12 between nodes 1 and 2 ⇒ direct interaction]

Problem: #observations < #variables ⇒ the covariance matrix is singular

Shrinkage estimation and the lemma of Ledoit-Wolf
[Sequence of equation slides developing the shrinkage estimator]

Summary of the GGM algorithm, part 1
• Partial correlations, as opposed to standard correlations, capture the influence of the whole system. Mathematically, this is the correlation between two nodes conditional on the rest of the system.
• The partial correlations can be computed from the inverse of the covariance matrix.
• The true covariance matrix is usually unknown → approximated by the empirical covariance matrix, estimated from the data.
• Empirical covariance matrix → over-fitting problem; it can be ill-conditioned or rank-deficient (singular) → inversion impossible.
• Regularization: add the identity matrix, weighted by some constant, to the empirical covariance matrix → the matrix becomes non-singular. Possible problem: bias.

Summary of the GGM algorithm, part 2
• Set the trade-off parameter so as to minimize the expected difference between the (unknown) true covariance matrix and the estimated matrix.
• Statistical decision theory: closed-form expression for the optimal trade-off parameter.
• Catch: this expression depends on expectation values with respect to other data sets generated from the same processes (parallel statistical universes). It cannot be computed in practice.
• Heuristics: replace the expectation values by the actually observed values, or use bootstrapping. (A code sketch of the resulting estimator follows below.)

[Figure: correlation graph, 150 leading edges, from Arabidopsis thaliana]
[Figure: partial correlation graph (CIG), 150 leading edges, from Arabidopsis thaliana]

• How do we choose the threshold?
• Can we control the false discovery rate (FDR)?

Small-sample setting
• Covariance matrix: singular or ill-conditioned → standard partial correlation estimates cannot be used.
• Empirical finding: the sampling distributions have the same functional form as the theoretical result.
• The number of degrees of freedom κ is no longer a function of N and G, but needs to be estimated from the data.
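Illustrative only: a minimal Python sketch of the shrinkage-regularized GGM estimate summarised above, shrinking the empirical covariance towards a scaled identity. The trade-off λ is left as a user-set constant here; scikit-learn's sklearn.covariance.LedoitWolf implements the closed-form optimal value. All names are my own.

```python
import numpy as np

def shrinkage_partial_corr(X, lam=0.2):
    """X: (N samples, G genes). Shrink the empirical covariance towards a
    scaled identity so it is invertible, then read the partial correlations
    off the inverse (precision) matrix: pi_ij = -P_ij / sqrt(P_ii * P_jj)."""
    S = np.cov(X, rowvar=False)                      # empirical covariance
    target = np.mean(np.diag(S)) * np.eye(S.shape[0])
    S_star = lam * target + (1.0 - lam) * S          # regularized estimate
    P = np.linalg.inv(S_star)                        # precision matrix
    d = np.sqrt(np.diag(P))
    Pi = -P / np.outer(d, d)
    np.fill_diagonal(Pi, 1.0)
    return Pi

# Edges whose |partial correlation| exceeds a threshold form the CIG,
# which is exactly where the threshold/FDR questions above arise:
# Pi = shrinkage_partial_corr(X); edges = np.abs(Pi) > threshold
```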
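On the FDR question: the slides that follow fit a mixture distribution to the partial correlations. As a simpler stand-in (not the method on the slides), the Benjamini-Hochberg step-up procedure applied to edge p-values also controls the FDR; a sketch, with hypothetical inputs:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of edges accepted at FDR level q:
    keep the largest k with p_(k) <= q * k / m."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```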
Mixture distribution
Parameter estimation
Control of the FDR
[Equation slides: fitting a mixture distribution to the partial correlations, estimating its parameters, and controlling the FDR]

Collaboration with plant biologists at SCRI
Genome-wide microarray experiments for 9 mutant strains of Pba, measured in triplicate

Scale-free networks
GGMs with varying threshold values
[Figure: number of nodes versus degree of connectivity on a log-log plot]
Steps 1-4: standard CIG algorithm

Shortcoming
Pairwise interactions conditional on the whole system, but: no proper scoring of the whole network.
[Figure: direct interaction, common regulator, indirect interaction, co-regulation]

Example: P(A,B) = P(A)·P(B), but P(A,B|C) ≠ P(A|C)·P(B|C)
(Marginal independence does not imply conditional independence: two independent causes of a common effect become dependent once the effect is observed.)

Overview
• Relevance networks
• Partial correlation analysis
• Graphical Gaussian models
• Differential equation models
• Bayesian networks

Description of the system with ODEs

Alternative pathways
[Figures: candidate pathway structures]

Model selection under the assumption that all the kinetic rate parameters are known
• Given a time series of gene expression profiles x(1) … x(t) … x(T)
• For all time points t = 1 … T: numerically integrate the ODEs to obtain the predictor z(t) for x(t) from x(t−1).
• Find the optimal network structure that minimizes the deviation Σ [z(t)−x(t)]²

Model selection for unknown parameters: naïve approach
• Given a time series of gene expression profiles x(1) … x(t) … x(T)
• For all time points t = 1 … T: numerically integrate the ODEs to obtain the predictor z(t) for x(t) from x(t−1).
• Find the optimal network structure and the parameters that minimize the deviation Σ [z(t)−x(t)]²
→ Overfitting: this approach will always select the most complex model

Bayesian model selection
Select the model with the highest posterior probability: P(M|D) ∝ P(D|M)·P(M)
This requires an integration over the whole parameter space: P(D|M) = ∫ P(D|θ,M)·P(θ|M) dθ

Illustration: marginal likelihood and Ockham factor

Numerical integration by sampling from the prior
[Figure: prior distribution and likelihood function]
Problem: extremely poor convergence in high dimensions

Numerical integration by sampling from the posterior
[Figure: prior distribution, likelihood function and posterior distribution; sampling concentrates on the peaks of the posterior, but the main contributions to the integral come from the valleys]
Problem: poor convergence in high dimensions, and instability

Importance sampling
[Equation slide: the integral is rewritten as an expectation under an arbitrary (possibly unnormalized) distribution that is sampled from]

Illustration of annealed importance sampling
[Figure: sequence of distributions interpolating between the prior and the posterior; taken from the MSc thesis by Ben Calderhead]

Annealed importance sampling
Gradually transform the prior (β = 0) into the posterior (β = 1): p_β(θ) ∝ P(θ|M)·P(D|θ,M)^β
How to sample from p_β? Generate samples from a Markov chain with a transition kernel T that leaves p_β invariant. A sufficient condition for this to hold is the condition of detailed balance: p_β(θ)·T(θ→θ′) = p_β(θ′)·T(θ′→θ)

Annealed importance sampling
[Figure: marginal likelihoods for the alternative pathways]

Shortcomings of the method
• Each step of the AIS procedure requires a numerical solution of the ODEs.
• Computationally extremely expensive.
• Applicable only for selecting a pathway out of a small set of candidate pathways.
• Not yet applicable for learning novel network structures "from scratch".
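A self-contained toy illustration (mine, not from the slides) of annealed importance sampling for a marginal likelihood: a one-parameter Gaussian model stands in for the ODE system, so every design choice here (temperature ladder, random-walk Metropolis kernel, names) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: theta ~ N(0, 1) prior, each observation ~ N(theta, 1).
data = np.array([0.8, 1.1, 0.9])
log_prior = lambda th: -0.5 * th**2 - 0.5 * np.log(2 * np.pi)
log_lik = lambda th: np.sum(-0.5 * (data - th)**2 - 0.5 * np.log(2 * np.pi))

def ais_log_ml(n_chains=200, betas=np.linspace(0, 1, 51), step=0.5):
    """AIS estimate of log P(D): anneal the prior (beta=0) into the
    posterior (beta=1), accumulating importance weights
    log w += (beta_t - beta_{t-1}) * log_lik(theta)."""
    logw = np.zeros(n_chains)
    for c in range(n_chains):
        th = rng.normal()                  # draw from the prior
        for b_prev, b in zip(betas[:-1], betas[1:]):
            logw[c] += (b - b_prev) * log_lik(th)
            # Metropolis move leaving p_b ∝ prior * lik^b invariant
            prop = th + step * rng.normal()
            log_a = (log_prior(prop) + b * log_lik(prop)
                     - log_prior(th) - b * log_lik(th))
            if np.log(rng.uniform()) < log_a:
                th = prop
    m = logw.max()                         # log-mean-exp of the weights
    return m + np.log(np.mean(np.exp(logw - m)))

print(ais_log_ml())  # close to the exact log marginal likelihood of this toy
```

For the pathway problem on the slides, evaluating log_lik would itself require a numerical ODE solution at every step, which is exactly the computational bottleneck listed above.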
Overview
• Relevance networks
• Partial correlation analysis
• Graphical Gaussian models
• Differential equation models
• Bayesian networks

Bayesian networks
• Marriage between graph theory and probability theory.
• Directed acyclic graph (DAG) representing conditional independence relations.
[Figure: example DAG with nodes A, B, C, D, E, F and directed edges]
• It is possible to score a network in the light of the data: P(D|M), D: data, M: network structure.
• We can infer how well a particular network explains the observed data.

P(A,B,C,D,E,F) = P(A)·P(B|A)·P(C|A)·P(D|B,C)·P(E|D)·P(F|C,D)

[Table: model / parameters: Bayes net: multinomial CPD, linear Gaussian CPD; ODE model]

Learning Bayesian networks
P(M|D) = P(D|M)·P(M) / Z,  M: network structure, D: data
Madigan & York (1995), Giudici & Castelo (2003)
• Easy to compute for Bayesian nets.
• Computationally very expensive for ODE models.
• Ongoing research: improve the mixing and convergence of the MCMC simulations.

Causal structure of pathways: Bayes net versus ODE model

Bayesian networks versus causal networks
Bayesian networks represent conditional (in)dependence relations - not necessarily causal interactions.

Bayesian networks versus causal networks
• Hidden variables
• Equivalence classes
[Figure: true causal graph over A, B, C; when node A is unobserved, a different structure over B and C is inferred]
[Figure: equivalent network structures; equivalence classes]

Break the symmetry of equivalence classes: interventional data
[Figure: A and B are correlated; after inhibition of A, down-regulation of B indicates A → B, whereas no effect on B indicates B → A]

P(D|M) = ∏ᵢ₌₁ⁿ P(Xᵢ = Dᵢ | pa[Xᵢ] = D_pa[Xᵢ])  →  ∏ᵢ₌₁ⁿ P(Xᵢ = Dᵢ^(−{i}) | pa[Xᵢ] = D^(−{i})_pa[Xᵢ])
(D^(−{i}): the samples in which node i was intervened on are left out of node i's likelihood term)

How can we model feedback loops?
[Figure: cyclic graph] Not a valid DAG
Unfolding in time → dynamic Bayesian network: [Figure: unrolled graph] a valid DAG
• Homogeneous Markov chain
• We need time series data

Performance comparison
Bayesian networks versus Graphical Gaussian models; score-based versus constraint-based inference

Evaluation on the Raf signalling pathway
[Figure: receptor molecules at the cell membrane; activation and inhibition interactions between phosphorylated proteins in the signalling pathway]
From Sachs et al., Science 2005

Evaluation: Raf signalling pathway
• Cellular signalling network of 11 phosphorylated proteins and phospholipids in human immune system cells
• Deregulation → carcinogenesis
• Extensively studied in the literature → gold-standard network

Flow cytometry data
• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins
• 5400 cells have been measured under 9 different cellular conditions (cues)
• Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments

Microarray example
[Figures: genes × time expression matrices; Spellman et al. (1998), cell cycle, 73 samples; Tu et al. (2005), metabolic cycle, 36 samples]

In addition to experimental data: comparison with simulated data
[Figure: Raf pathway]
Comparison with simulated data 2: steady-state approximation

[Figures: "true" network versus reconstructed network; ranking of edges; thresholded network]

Performance evaluation: ROC curves
• We use the Area Under the Receiver Operating Characteristic Curve (AUC).
[Figure: ROC curves with AUC = 0.5, 0.5 < AUC < 1, and AUC = 1]

Alternative performance evaluation: true positive (TP) scores of BN, GGM and RN at a fixed count of 5 false positives (FP)

Directed graph evaluation - DGE
Undirected graph evaluation - UGE
(a code sketch of the DGE/UGE evaluation follows below)

[Result figures: synthetic data, observations; synthetic data, interventions; cytometry data, observations; cytometry data, interventions]

Complications with real data
• Putative negative feedback loops: can we trust our gold-standard network?
• Interventions might not be "ideal" owing to negative feedback loops.
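Purely illustrative: a sketch of the AUC evaluation with the DGE/UGE distinction, assuming a matrix of edge confidence scores and a 0/1 gold-standard adjacency matrix. The slides do not prescribe an implementation; scikit-learn's roc_auc_score is my choice here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_scores(scores, truth):
    """scores: (G, G) matrix of edge confidences; truth: (G, G) 0/1
    gold-standard adjacency matrix. Returns (DGE, UGE) AUC values."""
    G = scores.shape[0]
    off = ~np.eye(G, dtype=bool)                    # ignore self-loops
    # Directed graph evaluation (DGE): every ordered pair is one prediction.
    dge = roc_auc_score(truth[off], scores[off])
    # Undirected graph evaluation (UGE): symmetrize to the skeleton and
    # score each unordered pair once.
    iu = np.triu_indices(G, k=1)
    sym_scores = np.maximum(scores, scores.T)[iu]
    sym_truth = np.maximum(truth, truth.T)[iu]
    uge = roc_auc_score(sym_truth, sym_scores)
    return dge, uge
```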
Disputed structure of the gold-standard network
Dougherty et al., "Regulation of Raf-1 by Direct Feedback Phosphorylation", Molecular Cell, Vol. 17, 2005
Stabilisation through negative feedback loops
[Figure: inhibitory feedback edges in the Raf pathway]

Conclusions 1
• BNs and GGMs outperform RNs, most notably on Gaussian data.
• No significant difference between BNs and GGMs on observational data.
• For interventional data, BNs clearly outperform GGMs and RNs, especially when taking the edge direction (DGE score) rather than just the skeleton (UGE score) into account.

Conclusions 2
Performance on synthetic data is better than on real data:
• Real data: more complex
• Real interventions might not be ideal
• Possible errors in the gold-standard network

Deviation between the network G and the prior knowledge B ("energy"):
E(G) = Σᵢⱼ |Bᵢⱼ − Gᵢⱼ|,  graph: Gᵢⱼ ∈ {0,1},  prior knowledge: Bᵢⱼ ∈ [0,1]

Prior distribution over networks, with hyperparameter β:
P(G|β) = exp(−β·E(G)) / Z(β)

Multiple sources of prior knowledge: sample networks and hyperparameters from the posterior distribution

Bayesian networks with two sources of prior
[Diagram: Source 1 (hyperparameter β1), Source 2 (hyperparameter β2) and Data → BNs + MCMC → recovered networks and trade-off parameters]

Sample networks and hyperparameters from the posterior distribution with MCMC
[Equation slides: proposal probabilities; Metropolis-Hastings scheme; prior distribution; energy of a network; rewriting the energy; approximation of the partition function; partition function of an ideal gas]

Evaluation on the Raf regulatory network
From Sachs et al., Science 2005

Flow cytometry data
• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins
• 5400 cells have been measured under 9 different cellular conditions (cues)
• Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments

Prior knowledge from KEGG
[Figure: prior distribution; matrix of prior edge probabilities for the Raf network extracted from KEGG, with entries such as 0.87, 0.71, 0.5, 0.25, 0 and 1]

Data and prior knowledge: KEGG + random

Evaluation
• Can the method automatically evaluate how useful the different sources of prior knowledge are?
• Do we get an improvement in the regulatory network reconstruction?
• Is this improvement optimal?

Sampled values of the hyperparameters
[Diagram: random prior (β1), KEGG prior (β2) and Data → BNs + MCMC → recovered networks and trade-off parameters]

[Results figure: flow cytometry data and KEGG]

Learning the trade-off hyperparameter
[Figure: mean and standard deviation of the sampled trade-off parameter]
• Repeat the MCMC simulations for a large set of fixed hyperparameters β
• Obtain AUC scores for each value of β
• Compare with the proposed scheme in which β is automatically inferred (a code sketch of the β-sampling step follows below)
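A sketch (mine) of the energy-based prior and of a single Metropolis-Hastings move on the hyperparameter β. If the acyclicity constraint is ignored, the partition function factorises over edges and can be summed exactly; the slides use an "ideal gas" style approximation in the same spirit. Step size, range and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def energy(G, B):
    """E(G) = sum_ij |B_ij - G_ij|: deviation of the graph G (0/1 entries,
    zero diagonal) from the prior knowledge matrix B (entries in [0,1])."""
    return np.sum(np.abs(B - G))

def log_partition(beta, B):
    """log Z(beta), ignoring the acyclicity constraint so that the sum
    factorises over edges: Z = prod_ij sum_{g in {0,1}} exp(-beta*|B_ij - g|)."""
    off = ~np.eye(B.shape[0], dtype=bool)
    b = B[off]
    return np.sum(np.log(np.exp(-beta * b) + np.exp(-beta * (1 - b))))

def log_prior(G, B, beta):
    """log P(G|beta) = -beta * E(G) - log Z(beta)."""
    return -beta * energy(G, B) - log_partition(beta, B)

def mh_update_beta(beta, G, B, step=0.5, beta_max=30.0):
    """One Metropolis-Hastings move on beta with the network G held fixed;
    only the prior enters the acceptance ratio, since the data likelihood
    does not depend on beta."""
    prop = beta + step * rng.uniform(-1, 1)          # symmetric proposal
    if not (0.0 < prop < beta_max):
        return beta
    log_a = log_prior(G, B, prop) - log_prior(G, B, beta)
    return prop if np.log(rng.uniform()) < log_a else beta
```

In the full sampler this move alternates with structure moves on G (and a second β for each additional prior source), so that networks and trade-off parameters are sampled jointly from the posterior.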
Combining multiple data sets: monolithic (pool all data sets into one model) versus individual (one model per data set) → propose a compromise between the two

Compromise between the two previous ways of combining the data
[Diagram: hierarchical model; a supra-network M* is coupled, via trade-off parameters β1 … βI, to data-set-specific networks M1 … MI with data sets D1 … DI; BGe or BDe scores; ideal gas approximation; MCMC]

Result 1: successful detection of corrupted data sets
[Figure: in the hierarchical model, the trade-off parameter βi of the corrupted data set is driven towards 0]

Result 2: improved network reconstruction accuracy

Summary
• Relevance networks
• Partial correlation analysis
• Graphical Gaussian models
• Differential equation models
• Bayesian networks

Thank you! Any questions?