Understanding Gene Regulation: From Networks to Mechanisms Daphne Koller Stanford University Gene Regulatory Networks Controlled by diverse mechanisms Modified by endogenous and exogenous perturbations http://en.wikipedia.org/wiki/Gene_regulatory_network Goals Infer regulatory network and mechanisms that control gene expression Identify effect of perturbations on network Understand effect of gene regulation on phenotype Outline Regulatory networks for gene expression Individual genetic variation and gene regulation Cell differentiation and gene regulation Expression changes underlying phenotype Regulatory Network I mRNA level of regulator can indicate its activity level Target expression is predicted by expression of its regulators Use expression of regulatory genes as regulators Transcription factors, signal transduction proteins, mRNA processing factors, … ECM18 UTH1 ASG7 MEC3 GPA1 MFA1 TEC1 HAP1 PHO3 SGS1 SEC59 PHO5 RIM15 PHO84 PHM6 PHO2 PHO4 SAS5 SPL2 GIT1 Segal et al., Nature Genetics 2003; Lee et al., PNAS 2006 VTC3 Regulatory Network II Co-regulated genes have similar regulation program Exploit modularity and predict expression of entire module Allows uncovering complex regulatory programs module ECM18 UTH1 ASG7 MEC3 GPA1 MFA1 TEC1 “Regulatory Program” ? Targets HAP1 PHO3 SGS1 SEC59 PHO5 RIM15 PHO84 PHM6 PHO2 PHO4 SAS5 SPL2 GIT1 Segal et al., Nature Genetics 2003; Lee et al., PNAS 2006 VTC3 Module Networks* Activator Gene1 Repressor Gene2 Gene3 Genes in module repressor expression activator false true repressor false true regulation program activator expression repressed module genes target gene expression induced Learning quickly runs out of statistical power Poor regulator selection lower in the tree Many correct regulators not selected Arbitrary choice among correlated regulators Combinatorial search Multiple local optima * Segal et al., Nature Genetics 2003 Regulation as Linear Regression minimizew (Σwixi - ETargets)2 x1 w1 w2 … xN -3 x GPA1 0.5 x MFA1 + wN ETargets = parameters x2 Module ETargets= w1 x1+…+wN xN+ε But we often have hundreds or thousands of regulators … and linear regression gives them all nonzero weight! Problem: This objective learns too many regulators Lasso* (L1) Regression minimizew (w1x1 + … wNxN - ETargets)2+ C |wi| x1 w1 parameters w2 xN L2 L1 wN ETargets Provably selects “right” features when many features are irrelevant Convex optimization problem … Induces sparsity in the solution w (many wi‘s set to zero) x2 Unique global optimum Efficient optimization But, arbitrary choice among correlated regulators * Tibshirani, 1996 Elastic Net* Regression minimizew (w1x1 + … wNxN - ETargets)2+ C |wi| + D wi2 x1 w1 x2 w2 … xN L2 L1 wN ETargets Induces sparsity But avoids arbitrary choices among relevant features Convex optimization problem Unique global optimum Efficient optimization algorithms Lee et al., PLOS Genetics 2009 * Zhou & Hastie, 2005 Learning Regulatory Network -3 x GPA1 Cluster genes into modules 0.5 x MFA1 Learn a regulatory program for each module + = Module ECM18 UTH1 ASG7 MEC3 GPA1 This is a Bayesian network MFA1 TEC1 HAP1 • But multiple genes share samePHO3 program PHO5 PHO84 • Dependency RIM15 model is linear regression PHM6 SGS1 SEC59 PHO2 PHO4 SAS5 SPL2 GIT1 VTC3 Lee et al., PLoS Genet 2009 Outline Regulatory networks for gene expression Individual genetic variation and gene regulation Effect of genotype on expression Regulatory potential Cell differentiation and gene regulation Expression changes underlying phenotype Genotype phenotype Different sequences Perturbations to regulatory network …ACTCGGTTGGCCTAAATTCGGCCCGG… …ACCCGGTAGGCCTTAATTCGGCCCGG… : …ACTCGGTAGGCCTATATTCGGCCGGG… ?? Different phenotypes Genotype Regulation Different sequences Perturbations to regulatory network …ACTCGGTTGGCCTAAATTCGGCCCGG… …ACCCGGTAGGCCTTAATTCGGCCCGG… : ?? …ACTCGGTAGGCCTATATTCGGCCGGG… Goals: Infer regulatory network that controls gene expression Identify mechanisms by which genetic variation affects gene expression eQTL Data BY [Brem et al. (2002) Science] RM 112 progeny Genotype data × Markers 0101100100…011 1011110100…001 0010110000…010 : 0000010100…101 0010000000…100 112 individuals 6000 genes 3000 markers 112 individuals : Expression data Traditional Approach: Single Marker Expression quantitative trait loci (eQTL) mapping Marker Gene 1 2gene, 3 4find 5 the … marker that is most predictive of its For each expression level [Yvert G et al. (2003) Nat Gen]. mRNA individuals markers 0101100100100…011 1011110111100…001 0010110001000…010 : Marker j 0000010110100…101 individuals genes Gene i induced 1110000110000…100 Genotype data repressed Expression data LirNet Regulatory network E-regulators: Activity (expression) of regulatory genes G-regulators: Genotype of genes Measured as values of chromosomal markers marker M1011 M321 M1 M120 ECM18 UTH1 M321 M22 ASG7 MEC3 GPA1 MFA1 TEC1 HAP1 PHO3 SGS1 SEC59 PHO5 RIM15 PHO84 PHM6 PHO2 PHO4 SAS5 SPL2 GIT1 Lee et al., PNAS 2006; Lee et al., PLoS Genetics 2009 VTC3 The Telomere Module 40/42 genes in telomeres Enriched for telomere maintenance (p < 10-11) & helicase activity (p < 10-18) Includes Rif2 – Chr XII: 1056097 • • • • control telomere length establishes telomeric silencing 6 coding & 8 promoter SNPs Binds to Rap1p C-terminus Enrichment for Rap1p targets (29/42; p < 10-15) Lee et al., PNAS 2006 Some Chromatin Modules • Locus containing Sir1 • 4 coding SNPs Chr XI: 643655 4/5 consecutive genes Known Sir1 targets • Locus containing uncharacterized Sir1 homologue • 87(!) coding SNPs Chr X: 22213 5/7 consecutive genes Lee et al., PNAS 2006 Mechanism I Chromatin as Mechanism module #genes Runs 728 16 732 7 742 24 743 6 744 5 755 14 757 3 770 35 783 23 790 122 848 15 855 42 862 20 869 48 876 51 890 114 897 15 904 56 917 31 941 110 968 218 972 7 991 4 993 5 1018 88 1025 217 1030 15 1054 6 1061 46 1078 90 1085 42 1120 8 1126 26 1133 19 1145 7 1155 75 Dom DEG Enrich Chrom Reg-E Reg-G Telo Telo Telo Telo Telo 23 modules (out of 165) with “chromosomal features” 16 have “chromatin regulators” (p < 10-7) “Evolutionary strategy” to make coordinated changes in gene expression Chromatin modification explains significant part ofof variation by modifying small number hubs in Telo TY,LTR Telo gene expression between strains Telo Telo Telo Telo Telo Lee et al., PNAS 2006 The Puf3 Module weight BY RM 112 segregants 147/153 genes (P ~ 10-130) are pulldown targets of mRNA binding protein Puf3 HAP4 TOP2 KEM1 GCN1 GCN20 DHH1PUF P-body components family: Sequence specific mRNA binding proteins (3’ UTR) Regulate degradation of mRNA and/or repress translation +1 0 -1 PUF3 expression genotype Lee et al., PLOS Genetics 2009 P-Bodies mRNAs stored in P-bodies are translationally repressed Translation mRNAs can exit P-bodies and resume translation GFP of Dhh1** RNA degradation RNA stored in P-bodies can be degraded (via decapping) Dhh1 regulates mRNA decapping in p-bodies But what regulates sequence-specific localization of P-body Puf3 mRNAs to P-bodies? ? protein * Sheth and Parker (2003) Science 300:805 ** Beliakova-Bethell , et al. (2006) RNA 12:94 cell Microscopy experiment Fluorescent microscopy: Puf3 localizes to P-body Supports hypothesis of Puf3 involvement in regulating mRNA degradation by P-bodies Puf3 protein ? P-body Dhh1 Dhh1 Puf3 cell Joint work with David Drubin and Pam Silver Lee et al., PLOS Genetics 2009 What Regulates the P-bodies? A marker that covers a large region in Chr 14. Region contains ~30 genes and ~318 SNPs. Experiments for all 30 genes not feasible! RM BY ChrXIV:449639 DHH1 BLM3 GCN20 KEM1 GCN1 Lee et al., PLOS Genetics 2009 Outline Regulatory networks for gene expression Individual genetic variation and gene regulation Effect of genotype on expression Regulatory potential Cell differentiation and gene regulation Expression changes underlying phenotype Motivation Not all SNPs are equally likely to be causal. ChrXIV: 449,639-502,316 Gene TACGTAGGAACCTGTACCA … GGAAAATATCAAATCCAACGACGTTAGCCAATGCGATCGAATGGGAACGTA SNP 1: Conserved residue in a gene involved in RNA degradation SNP 2: In nonconserved intergenic region “Regulatory features” F SNPs 1. Gene region? 2. Protein coding region? 3. Nonsynonymous? 4. Create a stop codon? 5. Strong conservation? : Idea: Prioritize SNPs that have “good” regulatory features But how do we weight different features? Lee et al., PLOS Genetics 2009 Bayesian L1-Regularization M log P( y m 1 m | x m , w m ) C | wm,k | m,k Prior variance wi Module m higher prior variance weight can more easily deviate from 0 regulator more likely to be selected Regulator k xmk wmk ym wi ~ P(w) = Laplacian(0,C) ~ P(ym|x;w) = N (k wmkxmk,ε2) Lee et al., PLOS Genetics 2009 Metaprior Model (Hierarchical Bayes) YES Module m Regulatory features Inside a gene? Protein coding region? Strong conservation? TF binds to module genes : Regulator k fmk Cmk xmk wmk Regulatory prior ß = g(ßTfmk) ~ Laplacian (0,Cmk) Module 1NO : x1 w11 ~ P(ym|x;w) = N (k wmkxmk,ε2) Lee et al., PLOS Genetics 2009 regulators … x2 w12 NO NO : xN w1N Emodule 1 “Regulatory potential” = ß1 x Inside a gene? + ß2 x Protein coding region? + ß3 x Conserved? … : Module M x1 ym YES NO Potential : wN1 x2 wN2 Emodule M … xN wMN Metaprior Method BY(lab) MVLT L ELVQ A VSDASKQLWDI RM(wild) MVLT D ELVQG VSDASKQLWDI 0 Non-synonymous Conservation AA small large 0.1 0.2 0.3 1 2 3 4 5 0 0.1 0.2 0.3 Regulatory weights ß 1 0 = 3. Compute regulatory potential of SNPs Empirical hierarchical Bayes 6 1 1 1 : = Cell cycle 1 0 0 : 0.3 0.9 Regulatory features F Use point estimate of model parameters Learn priors from data to maximize joint posterior 2. Learn ß Maximize P(E,ß,W|X) 1. Learn regulatory Regulatory potentials programs Regulatory programs Module i+2 Module i+1 Module x1 xi2 …… xN x1 x2 … xN x1 x2 xN Ei Ei Ei Lee et al., PLOS Genetics 2009 x1 … x2 Ei Maximize P(E,ß,W|X) xN Transfer Learning What do regulatory potentials do? They do not change selection of “strong” regulators – those where prediction of targets is clear They only help disambiguate between weak ones Strong regulators help teach us what to look for in other regulators Transfer of knowledge between different prediction tasks Learned regulatory weights Yeast regulatory weights Regulatory features 0 Non-synonymous coding Stop codon Synonymous coding 3' UTR 500 bp upstream 5' UTR 500 bp downstream Conservation score Cis-regulation Change of average mass (Da) Change of isoelectric point (pI) Change of pK1 Change of pK2 Change of hydro-phobicity Change of pKa Change of polarity Change of pH Change of van der Waals Transcription regulator activity Telomere organization and Protein folding Glucose metabolic process RNA modification Hexose metabolic process Cell cycle checkpoint Proteolysis same GO process same GO function ChIP-chip binding 0.1 Lee et al., PLOS Genetics 2009 0.2 0.3 0.4 Human regulatory weights 0.5 0 Non-synonymous coding Synonymous coding Location Intron Locus region Splice site UTR region Conservation score Cis-regulation Change of average mass (Da) Change of isoelectric point Change of pK1 Change of pKa AA property Change of pH changeChange of van der Waals volume Change of polarity cell communication signal transduction transcription Gene function Pairwise feature 0.1 0.2 0.3 0.4 0.5 Statistical Evaluation 500 1000 1500 2000 2500 PGV: Percent genetic variation explained by the predicted regulatory program for each gene # of genes with PGV > X % 1650 genes (Lirnet) 1450 genes (Lirnet without regulatory prior) 850 genes (Geronemo*) 250 genes (Brem & Kruglyak) 100 90 80 70 60 50 40 30 20 10 Lee et al., PLOS Genetics 2009 PGV (%) 0 * Lee et al. PNAS 2006 Biological evaluation I How many predicted interactions have support in other data? Deletion/ over-expression microarrays [Hughes et al. 2000; Chua et al. 2006] ChIP-chip binding experiments [Harbison et al. 2004] Transcription factor binding sites [Maclsaac et al. 2006] mRNA binding pull-down experiments [Gerber et al. 2004] Literature-curated signaling interactions Lirnet without regulatory features 90 % Supported interactions 80 Geronemo Flat Lirnet Lirnet 70 60 50 40 Reg 30 TF 20 10 0 %interactions Module %modules Lee et al., PLOS Genetics 2009 %interactions %modules Via cascade % supported regulatory predictions Biological Evaluation II 100 90 80 Lirnet Zhu et al (Nature Genet 2008) 70 Random 60 50 40 30 20 10 0 0 15 30 45 60 75 significance of support Significance (%) Lee et al., PLOS Genetics 2009 90 What Regulates the P-Bodies? The regulatory potential over all 318 SNPs in the region Regulatory potential ChrXIV:415,000-495,000 MKT1 0.7 High-scoring regulatory features Conservation 0.230 0.6 Cis-regulation 0.208 Same GO proc 0.204 0.5 Non-syn 0.158 Translation regulation 0.4 Mass (Da) pK1 Polar pI 0.066 0.060 0.057 0.050 0.046 ChrXIV:449639 Lee et al., PLOS Genetics 2009 Saccharomyces Genome Database (SGD) Mkt1 Mkt1 binds to mRNAs at 3’ UTR BY has SNP at conserved residue in nuclease domain XPG-N XPG-I * 35 MKT1 RM Pbp1 Interaction mkt1D in RM 30 All others P-body Module Puf3 Module BY 25 P-body module 20 15 Puf3 module 10 5 0 Lee et al., PLOS Genetics 2009 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 8 validated regulators in 7 regions 14 validated regulators in 11 regions Predicting Causal Regulators Finding causal regulators for 13 “chromosomal hotspots” Region Zhu et al [Nat Genet 08] Lirnet (top 3 are considered) SEC18 RDH54 SPT7 1 None AMN1 CNS1 TOS1 2 TBS1, TOS1, ARA1, CSH1, SUP45, CNS1, AMN1 3 None TRS20 ABD1 PRP5 4 LEU2, ILV6, NFS1, CIT2, MATALPHA1 LEU2 PGS1 ILV6 5 MATALPHA1 MATALPHA1 MATALPHA2 RBK1 6 URA3 URA3 NPP2 PAC2 7 GPA1 STP2 GPA1 NEM1 8 HAP1 HAP1 NEJ1 GSY2 9 YRF1-4, YRF1-5, YLR464W SIR3 HMG2 ECM7 10 None ARG81 TAF13 CAC2 11 SAL1, TOP2 MKT1 TOP2 MSK1 12 PHM7 PHM7 ATG19 BRX1 13 None ADE2 ORT1 CAT5 Lee et al., PLOS Genetics 2009 Learning Regulatory Priors Learns regulatory potentials that are specific to organism and even data set Can use any set of regulatory features: Sequence features Functional features for relevant gene Features of regulator/target(s) pairs Applicable to any organism, including ones where functional data may not be readily available Lee et al., PLOS Genetics 2009 Conclusion Framework for modeling gene regulation Use machine learning to identify regulatory program Hierarchical Bayesian techniques to capture regularities in effects of perturbations on network Uncovers diverse regulatory mechanisms Chromatin remodeling mRNA degradation Acknowledgements Su-In Lee University of Washington Aimee Dudley Dana Pe’er Institute for System Biology Columbia University Nevan Krogan, UCSF Pam Silver, David Drubin, Harvard Medical School National Science Foundation