Beyond Co-expression: Gene Network Inference
Patrik D’haeseleer
Harvard University
http://genetics.med.harvard.edu/~patrik

Beyond Co-expression
• Clustering approaches rely on co-expression of genes under different conditions
• Assumes co-expression is caused by co-regulation
• We would like to do better than that:
  – Causal inference
  – What is regulating what?

Gene Network Inference
Overview
• Modeling Issues:
  – Level of biochemical detail
  – Boolean or continuous?
  – Deterministic or stochastic?
  – Spatial or non-spatial?
• Data Requirements
• Linear Models
• Nonlinear models
• Conclusions

Level of Biochemical Detail
• Detailed models require lots of data!
• Highly detailed biochemical models are only feasible for very small systems which are extensively studied
• Example: Arkin et al. (1998), Genetics 149(4):1633-48
  lysis-lysogeny switch in Lambda: 5 genes, 67 parameters based on 50 years of research; stochastic simulation required a supercomputer

Example: Lysis-Lysogeny
Arkin et al. (1998), Genetics 149(4):1633-48

Level of Biochemical Detail
• In-depth biochemical simulation of e.g. a whole cell is infeasible (so far)
• Less detailed network models are useful when data is scarce and/or network structure is unknown
• Once network structure has been determined, we can refine the model

Boolean or Continuous?
• Boolean Networks (Kauffman (1993), The Origins of Order) assume ON/OFF gene states.
  [Figure: example network with inputs A and B and output C = A AND B]
• Allows analysis at the network level
• Provides useful insights into network dynamics
• Algorithms exist for network inference from binary data

Boolean or Continuous?
• Boolean abstraction is a poor fit to real data
• Cannot model important concepts:
  – amplification of a signal
  – subtraction and addition of signals
  – compensating for a smoothly varying environmental parameter (e.g. temperature, nutrients)
  – varying dynamical behavior (e.g.
cell cycle period)
• Feedback control: negative feedback is used to stabilize expression ⇒ causes oscillation in a Boolean model

Deterministic or Stochastic?
• Use of concentrations assumes individual molecules can be ignored
• Known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis-lysogeny in lambda)
• Requires stochastic simulation (Arkin et al. (1998), Genetics 149(4):1633-48), or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5)
• Significantly increases model complexity

Deterministic or Stochastic?
• Eukaryotes: larger cell volume, typically longer half-lives. Few known stochastic effects.
• Yeast: 80% of the transcriptome is expressed at 0.1-2 mRNA copies/cell (Holstege et al. (1998), Cell 95:717-728)
• Human: 95% of the transcriptome is expressed at <5 copies/cell (Velculescu et al. (1997), Cell 88:243-251)

Spatial or Non-Spatial
• Spatiality introduces additional complexity:
  – intercellular interactions
  – spatial differentiation
  – cell compartments
  – cell types
• Spatial patterns also provide more data, e.g. stripe formation in Drosophila: Mjolsness et al. (1991), J. Theor. Biol. 152:429-454.
• Few (no?) large-scale spatial gene expression data sets available so far.

Overview
• Modeling Issues
• Data Requirements:
  – Lower bounds from information theory
  – Effect of limited connectivity
  – Comparison with clustering
  – Variety of data points needed
• Linear Models
• Nonlinear models
• Conclusions

Lower Bounds from Information Theory
• How many bits of information are needed just to specify the connection pattern of a network?
• $N^2$ possible connections between $N$ nodes ⇒ $N^2$ bits needed to specify which connections are present or absent
• $O(N)$ bits of information per “data point” ⇒ $O(N)$ data points needed

Effect of Limited Connectivity
• Assume only $K$ inputs per gene (on average) ⇒ $NK$ connections out of $N^2$ possible: $\binom{N^2}{NK}$ possible connection patterns
• Number of bits needed to fully specify the connection pattern:
  $\log_2 \binom{N^2}{NK} \approx NK \log_2(N/K)$
  ⇒ $O(K \log(N/K))$ data points needed

Comparison with clustering
• Use pairwise correlation comparisons as a stand-in for clustering
• As the number of genes increases, the number of false positives will increase as well ⇒ need to use a more stringent correlation test
• If we want to use the same correlation cutoff value $r$, we need to increase the number of data points as $N$ increases ⇒ $O(\log(N))$ data points needed

Summary
  Model              Data points needed
  Fully connected    N             (thousands)
  Connectivity K     K log(N/K)    (hundreds?)
  Clustering         log(N)        (tens)
• Additional constraints reduce data requirements:
  – limited connectivity
  – choice of regulatory functions
• Network inference is feasible, but does require much more data than clustering

Variety of Data Points Needed
• To unravel regulation of a gene, need to sample many different combinations of its regulatory inputs (different environmental conditions and perturbations)
• Time series data yields dynamics, but all data points are related
• Steady-state data yields attractors, can sample state space more efficiently
• Both types of data will be needed, and multiple data sets of each

Overview
• Modeling Issues
• Data Requirements
• Linear Models:
  – Formulation
  – Underdetermined problem!
  – Solution 1: reduce N
  – Solution 2
• Nonlinear models
• Conclusions

Linear Models
• Basic model: weighted sum of inputs
  $y_i(t+\Delta t) = \sum_j w_{ji}\, y_j(t) + b_i$   or   $\frac{dy_i}{dt} = \sum_j w_{ji}\, y_j + b_i$
• Simple network representation:
  [Figure: genes g1–g5 drawn as nodes joined by weighted edges, e.g. w12, w23 and the self-connection w55]
• Only a first-order approximation
• Parameters of the model: weight matrix containing N×N interaction weights
• “Fitting” the model: find the parameters $w_{ji}$, $b_i$ such that the model best fits the available data

Underdetermined problem!
• Assumes fully connected network: need at least as many data points (arrays, conditions) as variables (genes)!
• Underdetermined (underconstrained, ill-posed) model: we have many more parameters than data values to fit
• No single solution, but rather an infinite number of parameter settings that all fit the data equally well

Solution 1: reduce N
• Rather than trying to model all genes, we can reduce the dimensionality of the problem:
• Network of clusters: construct a linear model based on the cluster centroids
  – rat CNS data (4 clusters): Wahde and Hertz (2000), Biosystems 55, 1-3:129-136.
  – yeast cell cycle (15-18 clusters): Mjolsness et al. (2000), Advances in Neural Information Processing Systems 12; van Someren et al. (2000), ISMB2000, 355-366.
• Network of Principal Components: linear model between “characteristic modes” of the data. Holter et al. (2001), PNAS 98(4):1693-1698.

Solution 2:
• Take advantage of additional information:
  – replicates
  – accuracy of measurements
  – smoothness of time series
  – …
• Most likely, the network will still be poorly constrained. We need a method to identify and extract those parts of the model that are well-determined and robust

What’s next?
• Regulatory motifs: once we have identified the corresponding DNA binding proteins (transcription factors), we can start building the gene network from there
• Integration with other data:
  – transcription factors
  – functional annotation
  – known interactions in the literature
  – protein-protein interactions
  – protein expression levels
  – genetic data
  – ...

Linking Regulatory Motifs to Expression Data
Patrik D’haeseleer
Harvard University
http://genetics.med.harvard.edu/~patrik

Introduction
• Gene expression is regulated by Transcription Factors (TFs) that bind to specific regulatory motifs in the promoter region of the gene.
  Synonyms: regulatory element, regulatory sequence, promoter elements, promoter motifs, (TF) binding site, operator (in prokaryotes), …
  [Figure: TF bound to a regulatory motif on the DNA next to a gene]
• Question: Do genes with similar expression patterns share regulatory motifs?

1: Systematic Determination of Genetic Network Architecture
Tavazoie et al., Nature Genetics 22, 281–285 (1999)
[Figure: three panels of Normalized Expression vs. Time-point (1–3)]

Search for Motifs in Promoter Regions
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.
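To make the promoter search concrete, here is a minimal sketch of scanning upstream sequences for a candidate motif on both strands. It is only an illustration: the degenerate consensus `GASTCA` and the helper names `reverse_complement` and `find_sites` are assumptions made for this example, and the code is a plain pattern scan, not AlignACE (which discovers motifs rather than searching for a given one). The two sequences are the HIS7 and ARO4 fragments shown on the slide.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, consensus):
    """Return sorted (start, strand) pairs for every match of an IUPAC consensus.

    Start positions are 0-based and always refer to the forward strand.
    """
    # A lookahead makes overlapping matches visible to finditer.
    pattern = re.compile("(?=" + "".join(IUPAC[c] for c in consensus) + ")")
    width = len(consensus)
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    # Scan the reverse complement, then map each hit back to forward coordinates.
    rc = reverse_complement(seq)
    hits += [(len(seq) - m.start() - width, "-") for m in pattern.finditer(rc)]
    return sorted(hits)


# Fragments of two of the upstream sequences shown on the slide.
promoters = {
    "HIS7": "TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT",
    "ARO4": "ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG",
}
for gene, seq in promoters.items():
    print(gene, find_sites(seq, "GASTCA"))
```

Reporting the strand matters because a binding site can sit on either strand of the promoter; a palindromic consensus would simply be reported once per strand at the same position.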
Best Motif Found by AlignACE
[Same seven upstream sequences as on the previous slide: HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3]
Aligned sites:
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********

[Figure: MCB motif, N=182. Panels: expression profile (s.d. from mean) of cluster Replication & DNA synthesis (2); Number of ORFs per CLUSTER (clusters 1–30); Number of sites vs. Distance from ATG (b.p.); ORFs within MIPS Functional category (# ORFs): DNA synthesis and replication (82), Cell cycle control and mitosis (312), Recombination and DNA repair (84), Nuclear organization (720)]

Systematic Determination of Genetic Network Architecture
• Tavazoie et al., Nature Genetics 22, 281–285 (1999)
• Most motifs found are highly selective for the cluster they were found in.
• Can find many known binding sites for transcription factors.
• Also finds many novel regulatory motifs, associated with specific functional categories.
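The aligned sites listed for the best AlignACE motif can be summarized as a per-position count matrix, from which a consensus and a simple log-odds score follow. This is a hedged sketch, not AlignACE's own scoring: the function names, the pseudocount of 1 and the uniform background of 0.25 per base are assumptions chosen for illustration.

```python
import math

# The eight aligned sites listed on the slide for the best AlignACE motif.
SITES = [
    "AAAAGAGTCA", "AAATGACTCA", "AAGTGAGTCA", "AAAAGAGTCA",
    "GGATGAGTCA", "AAATGAGTCA", "GAATGAGTCA", "AAAAGAGTCA",
]
BASES = "ACGT"


def count_matrix(sites):
    """Per-position base counts for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    return [{b: sum(s[i] == b for s in sites) for b in BASES} for i in range(width)]


def consensus(counts):
    """Most frequent base per column (ties broken alphabetically)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in counts)


def log_odds_score(site, counts, pseudocount=1.0, background=0.25):
    """Sum over positions of log2(estimated base frequency / background).

    The pseudocount and the uniform background are illustrative assumptions.
    """
    total = sum(counts[0].values()) + pseudocount * len(BASES)
    return sum(
        math.log2((col[base] + pseudocount) / total / background)
        for base, col in zip(site, counts)
    )


counts = count_matrix(SITES)
print(consensus(counts))  # AAATGAGTCA
print(log_odds_score("AAATGAGTCA", counts) > log_odds_score("CCCCCCCCCC", counts))
```

A matrix like this could then be slid along each upstream region to rank candidate sites, which is one common way to ask how selective a motif is for a given cluster.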
1) cluster
2) identify regulatory motifs in clustered genes
3) identify TFs that bind to those motifs
⇒ Gene regulation network

2: Regulatory Element Detection Using Correlation with Expression
Bussemaker et al., Nature Genetics 27, 167–174 (2001)
• What is the contribution of each regulatory motif (or the TF that binds to that motif) to the expression level of the genes containing the motif?
• Given a set of known or putative regulatory motifs, identify all genes that contain the motif in their promoter region.
• For a single expression experiment (e.g. a single point in a time series), is the presence of the motif correlated with the expression level of the genes?
• Perform multiple regression of (log) expression level on the presence/absence of the motifs.
• Plot the contribution of each motif throughout the time series.

Contribution of Motifs to Expression Levels
[Figure legend:]
• the presence of motif 1 is correlated with the expression levels of the genes in which it appears
• motif 2 is not correlated with the expression levels of the genes in which it appears
• motif 3 is negatively correlated with the expression levels of the genes in which it appears

Linear Combination of Motif Contributions
• Find the most highly correlated motif.
• Determine its contribution $F_i$ to expression level by linear regression.
• Subtract its contribution from the expression levels.
• Find the next highest correlated motif.
• Repeat until there are no more significantly correlated motifs.
  $A_g = C + F_1 N_{1g} + F_2 N_{2g} + F_3 N_{3g} + \ldots$
• Repeat this entire analysis for each time point of a time series ⇒ the weights $F_i$ for the individual motifs will change throughout the time course.

Time Courses of Regulatory Signals
[Figure: two panels plotted against Time (minutes)]
  $A_g(t) = C + N_{1g} F_1(t) + N_{2g} F_2(t) + N_{3g} F_3(t) + \ldots$
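The stepwise procedure for a single time point (pick the most correlated motif, regress, subtract, repeat) can be sketched as follows. This is an illustrative sketch rather than the published implementation: the function names are hypothetical, a fixed correlation cutoff `min_abs_r` stands in for the significance test, and the data are synthetic, with motif columns deliberately constructed to be mutually uncorrelated so that the greedy fit recovers the generating weights exactly.

```python
import math


def pearson(x, y):
    """Pearson correlation; 0.0 when either input is (numerically) constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx < 1e-12 or syy < 1e-12:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def slope(x, y):
    """Least-squares slope of y regressed on x (x must not be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def greedy_motif_fit(motif_counts, expression, min_abs_r=0.3):
    """Greedy fit of A_g = C + sum_i F_i * N_ig for one experiment.

    motif_counts maps motif name -> list of N_ig over genes.
    min_abs_r is an assumed stand-in for a proper significance test.
    Returns (C, {motif: F_i}) for the motifs that were selected.
    """
    residual = list(expression)
    weights = {}
    remaining = sorted(motif_counts)
    while remaining:
        best = max(remaining, key=lambda m: abs(pearson(motif_counts[m], residual)))
        if abs(pearson(motif_counts[best], residual)) < min_abs_r:
            break
        f = slope(motif_counts[best], residual)
        weights[best] = f
        # Subtract this motif's contribution before looking for the next one.
        residual = [r - f * n for r, n in zip(residual, motif_counts[best])]
        remaining.remove(best)
    return sum(residual) / len(residual), weights


# Synthetic example: 8 genes, 3 motifs with mutually uncorrelated occurrences.
motifs = {
    "motif1": [1, 1, 1, 1, 0, 0, 0, 0],
    "motif2": [1, 1, 0, 0, 1, 1, 0, 0],
    "motif3": [1, 0, 1, 0, 1, 0, 1, 0],
}
expression = [0.5 + 1.0 * a - 0.6 * c for a, c in zip(motifs["motif1"], motifs["motif3"])]
baseline, weights = greedy_motif_fit(motifs, expression)
print(baseline, weights)
```

With real promoters, motif occurrences are correlated with each other, so a greedy fit of this kind would only approximate a joint multiple regression; running it once per time point would give the time-varying weights described on the slide.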
• We can think of the time-varying contributions $F_i$ of each motif as the Regulatory Signals of the transcription factors that bind to these motifs

Regulatory Element Detection Using Correlation with Expression
• Bussemaker et al., Nature Genetics 27, 167–174 (2001)
• Can be used with known regulatory motifs, sets of putative motifs, and even exhaustively on the set of all motifs up to a certain length (n=7).
• Known motifs generally have high statistical significance.
• Allows us to infer regulatory inputs of (possibly unknown) transcription factors.
• Accounts for only 30% of the total signal present in genome-wide expression patterns.
• Purely linear model: no synergistic effects between TFs, cooperative binding, etc.

3: Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements
Pilpel et al., Nature Genetics 29, 153–159 (2001)
• Most transcription factors are thought to work in concert with other TFs ⇒ synergistic effects
• Clustering:
  – a motif may occur in more than one cluster, because it may give rise to different expression patterns depending on its interaction partners.
  – several motifs may occur in the same cluster.
• Correlation with expression pattern:
  – by itself, a motif may not show a clear expression pattern.
  – contributions of multiple motifs may not be simply additive.

Synergy between Mcm1 and SFF in Cell Cycle Data Set
• Mcm1 and SFF were not detected in Tavazoie et al.
• Yet TFs that bind these motifs are known to interact in control of G2 genes (Nature. 2000 406:90-4.)
• Bussemaker et al. found that these motifs are antagonistic.
[Figure: Expression level vs. Time for three gene sets: SFF but not Mcm1 (EC=0.05); Mcm1 but not SFF (EC=0.05); SFF and Mcm1 (EC=0.23)]

Expression Coherence and Synergy
• Expression Coherence (EC) score indicates how tightly clustered the expression profiles of a set of genes are.
• For every combination of N = 2, 3 motifs:
  1) Calculate the expression coherence score of the genes that have the N motifs
  2) Calculate the expression coherence score of genes that have every possible subset of N-1 motifs
  3) Test (statistically) the hypothesis that the score of the ORFs with N motifs is significantly higher than that of ORFs that have any subset of N-1 motifs

The “Combinogram” (cell cycle data)
[Figure: motif combinations of MCB, MSE, URS1, SCB, MCM1' and SFF', with G1 and G2 marked; scales: Correlation (-1 to 1) and Expression Coherence (0.2 to 0.8)]
• Highly synergistic interaction between MCB and SFF
  – Previously unknown
  – Subsequently predicted via chromatin immunoprecipitation (ChIP)
Ho et al. Nature. 2002

Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements
• Pilpel et al., Nature Genetics 29, 153–159 (2001)
• Found several known and novel interactions between regulatory elements active in cell cycle, sporulation and stress response.
• Doesn’t assume a specific (e.g. linear) model of TF interactions.
• Combined with TF expression patterns, may allow us to infer a model of interaction.

Protein Networks
Patrik D’haeseleer
Harvard University
http://genetics.med.harvard.edu/~patrik

Yeast 2-Hybrid Assays
Transcription Factor (e.g.
Gal4)
[Figure: two-hybrid scheme. A transcription factor consists of a DNA-binding domain (BD) and an activation domain (AD). “Bait” fusion: BD + Prot1; “prey” fusion: AD + Prot2. When Prot1 and Prot2 interact, an active TF is reconstituted at the binding site and drives the reporter gene.]
Fields and Song, Nature 340:245-246 (1989)

Large-Scale 2-Hybrid Data Sets
• Uetz et al., Nature 403:623-627 (2000)
  – 6000 x 192 protein pairs screened using protein array
  – nearly all 6000 x 6000 pairs, using pooled prey libraries
  – total of 957 putative interactions between 1004 proteins
• Ito et al., PNAS 98:4569-4574 (2001)
  – nearly all 6000 x 6000 pairs, using bait and prey pools
  – total of 4549 putative interactions between 3278 proteins
  – core set of 841 interactions between 797 proteins
• Surprisingly little overlap between the data sets, possibly indicating a large number of missed interactions (false negatives).

Intersections between Protein Interaction Data Sets
[Figure: two Venn diagrams. Left: MIPS (1546), Ito full (4475), Uetz (947). Right: MIPS (1546), Ito core (806), Uetz (947).]

Causes of False Positives
• Bait acts as activator
• Bait interacts with endogenous activator
• Prey binds to DNA
• Prey interacts with endogenous transcription factors
• Bait interacts with Activation Domain
• Prey interacts with DNA Binding Domain
• “Sticky” proteins (nonspecific binding)
• Changes in plasmid copy number
• Various other artifacts
• ...

Yeast Protein-Protein Interaction Map
• Each node is a protein
• Each line is an interaction
• 5560 putative interactions
• 3725 different proteins
• ~3 interactions / protein
Uetz, Schwikowski, Fields and co-workers; Ito and co-workers

Membrane Proteins and Transcription Factors
[Figure legend: membrane protein; DNA-binding protein; all other yeast proteins; physical interaction between two proteins]

Problem: How to Rank Possible Pathways?
[Figure labels: Ste2/3, Ste12]
• Possible paths from Ste3: 1045 different paths to 143 transcription factors

Rank Predicted Paths by Degree of Expression Correlation from Microarray Experiments
• Known pathways often show correlated expression
• Known interacting proteins often show correlated expression

Average Pairwise Correlation Coefficient Among Pathway Members
  STE3→AKR1→STE5→STE4→FAR1→CDC24→SOH1      0.190
  STE3→AKR1→IQG1→CDC42→BEM4→RHO1→SKN7      0.059
  STE3→AKR1→STE4→FAR1→FUS3→DIG2→STE12      0.281
  STE3→AKR1→GCS1→YGL198W→SAS10→NET1       -0.106
* Microarray data downloaded from Rosetta Inpharmatics

Classical View of MAPK Pathways
[Figure adapted from C. Roberts et al., Science 287, 873 (2000)]

The Protein Network View
• Highly interconnected, not just a linear pathway!
• Some proteins are missing from the protein interaction data sets (Cdc42, Ste20).
• Includes several additional proteins (especially Akr1, Kss1).

Conclusions
• Protein interaction data and expression data are both noisy. Combining them increases the accuracy.
• Can estimate protein interaction error rates by looking at consistency between data sets ⇒ probabilistic interaction model (work in progress).
• Pathways are far more interconnected than often portrayed.
• Can integrate various other forms of data:
  – co-localization of proteins
  – homology with known interacting proteins
  – “Rosetta Stone” method

Acknowledgments
Roland Somogyi NCGR: Stefanie Fuhrman Xiling Wen Jason Stewart Pedro Mendes Harvard: Tzachi Pilpel Martin Steffen Allegra Petti John Aach George Church UNM: Stephanie Forrest Andreas Wagner David Peabody Barak Pearlmutter The Santa Fe Institute