Manolis Kellis: Research synopsis
• Why biology in a computer science group?
• Big biological questions (brief overview, 1 slide each):
  1. Interpreting the human genome.
  2. Revealing the logic of gene regulation.
  3. Principles of evolutionary change.
• Underlying computational techniques:
  – Comparative genomics: evolutionary signatures
  – Regulatory genomics: motifs, networks, models
  – Epigenomics: chromatin states, dynamics, disease (vignette)
  – Phylogenomics: evolution at the genome scale
• Defining characteristics of research program:
  – Genome-wide rules, exploit nature of problems, interdisciplinary collaborations, biology impact

(1) Comparative genomics: evolutionary signatures
Protein-coding signatures
• 1000s of new coding exons
• Translational readthrough
• Overlapping constraints
Non-coding RNA signatures
• Novel structural families
• Targeting, editing, stability
• Structures in coding exons
microRNA signatures
• Novel/expanded miR families
• miR/miR* arm cooperation
• Sense/anti-sense switches
Regulatory motif signatures
• Systematic motif discovery
• Regulatory motif instances
• TF/miRNA target networks
• Single binding-site resolution

(2) Regulatory genomics: circuits, predictive models
ENCODE/modENCODE
• 4-year effort, dozens of experimental labs
• Integrative analysis
• Systematic genome annotation
• Flagship NIH project
Predictive models of gene regulation
• Infer networks
• Predict function
• Predict regulators
• Predict gene expression
• Initial annotation of the non-coding genome, from 20% to 70%
• Systems biology for an animal genome possible for the first time
• Students and postdocs are co-first authors, leadership roles

(3) Phylogenomics: Bayesian gene-tree reconstruction
[Panels: New phylogenomic pipeline; Bayesian formulation; Generative model]
Two components of gene evolution
  1. Family rate Fj ~ gamma(α, β): selective pressures on gene function
  2. Species-specific rates Si ~ normal(μi, σi): population dynamics of the species
Bayesian formulation
• Unknowns: Length l, Topology T, Reconciliation R
• Given: Alignment data D, species-level parameters θ
• Sequence likelihood: HKY model (traditional)
• Branch length prior: learned Fj, Si distributions
• Topology prior: Birth-Death process

Vignette: Epigenomics
Jason Ernst, Pouya Kheradpour
Ernst and Kellis, Nature Biotech, 2010
Ernst, Kheradpour et al, Nature, 2011 (in press)

Epigenomics and ‘chromatin state’ signatures
[Figure: DNA, histone tails, and chromatin ‘marks’; state groups: Promoter states, Transcribed states, Active Intergenic, Repressed]
• Learn de novo combinations of chromatin marks
• Reveal functional elements
• Use for genome annotation
• Use for studying dynamics across many cell types

ChromHMM: learning ‘hidden’ chromatin states
[Figure: DNA divided into 200bp intervals spanning an Enhancer, a Transcription Start Site, and a Transcribed Region. Panels: "Observed chromatin marks. Called based on a Poisson distribution" (K4me1, K27ac, K4me3, K36me3); "Most likely Hidden State" (states 1 to 6); "High Probability Chromatin Marks in State" (emission probabilities between 0.7 and 0.9 for K4me1, K27ac, K4me3, K36me3 across states 1 to 6)]
• Each state: vector of emissions, vector of transitions
• All probabilities are learned de novo from chromatin data alone (Baum-Welch, a.k.a. EM)

Chromatin state dynamics across nine cell types
• State definitions are cell-type invariant
  – Same combinations consistently found
• State locations are cell-type specific
  – Can study pair-wise or multi-way changes

Multi-cell activity profiles and their correlations
[Figure panels: Gene expression; Chromatin States; Active TF motif enrichment; TF regulator expression; Dip-aligned motif biases. Cell types: HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1. Legend: ON / OFF; Active enhancer / Repressed; Motif enrichment / Motif depletion; TF On / TF Off; Motif aligned / Flat profile]
• Chromatin state & gene expression link enhancers and target genes
• TF motif enrichment & TF expression reveal activators / repressors

Coordinated activity reveals enhancer links
[Figure panels: Enhancer activity; Gene activity; Predicted regulators; Activity signatures for each TF]
• Enhancer networks: Regulator → enhancer → target gene
• Ex1: Oct4 predicted activator of embryonic stem (ES) cells
• Ex2: Ets activator of GM/HUVEC (but not either one alone)

Revisiting disease-associated variants
• Disease-associated SNPs enriched for enhancers in relevant cell types
• E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator

Contributions
We aim to further our understanding of the human genome by computational integration of large-scale functional and comparative genomics datasets.
• We use comparative genomics of multiple related species to recognize evolutionary signatures of protein-coding genes, RNA structures, microRNAs, regulatory motifs, and individual regulatory elements.
• We use combinations of epigenetic modifications to define chromatin states associated with distinct functions, including promoter, enhancer, transcribed, and repressed regions, each with distinct functional properties.
• We develop phylogenomic methods to study differences between species and to uncover evolutionary mechanisms for the emergence of new gene functions.
Our methods have led to numerous new insights on diverse regulatory mechanisms, uncovered evolutionary principles, and provided mechanistic insights for previously uncharacterized disease-associated SNPs.
[Publication venue labels beside the text, in slide order: Science, Nature, Nature, Nature, Nature Biotech, Nature, Nature, PLoS Genetics, MBE, Genome Research, Nature, Genome Research, Nature, Genome Research, PLoS Comp. Bio., Genes & Development, Genome Research, Nature, PNAS, BMC Evo. Bio., ACM TKDD, Genome Research, RECOMB, J. Comp. Bio.]
[Publication venue labels, remaining thumbnails: PNAS, Nature, Nature, Nature, Nature, Nature Gen, Genes&Dev, Nature, In review, Nature, Nature Biotech, Nature, Nature, Nature, Nature, WBpress, PNAS, Nature, G.R., BioChem, Nature, GenomRes, Nature, G.R., Science, RECOMB, RECOMB]
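
Sketch: decoding chromatin states from binarized marks
The ChromHMM slide describes marks called present or absent in 200bp intervals using a Poisson distribution, and a hidden state per interval with a vector of emissions and a vector of transitions. The Python sketch below illustrates that idea only; it is not the ChromHMM implementation. The state names, every probability, the 1e-4 calling threshold, and the function names are hypothetical and hand-set for illustration, whereas the slide says the real parameters are learned with Baum-Welch. Viterbi decoding is used here as one standard way to obtain a most likely state path; the slide does not say which decoding method is used.

```python
import math

MARKS = ["K4me3", "K4me1", "K27ac", "K36me3"]

# Hypothetical P(mark present | state). Hand-set for illustration;
# a real model would learn these from data with Baum-Welch (EM).
EMISSIONS = {
    "promoter":    {"K4me3": 0.90, "K4me1": 0.30, "K27ac": 0.50, "K36me3": 0.05},
    "enhancer":    {"K4me3": 0.10, "K4me1": 0.90, "K27ac": 0.80, "K36me3": 0.05},
    "transcribed": {"K4me3": 0.05, "K4me1": 0.10, "K27ac": 0.05, "K36me3": 0.90},
}
STATES = list(EMISSIONS)
STAY = 0.8  # hypothetical self-transition probability


def transition(prev, cur):
    """Hypothetical 'sticky' transitions: stay with STAY, else split the rest evenly."""
    if prev == cur:
        return STAY
    return (1.0 - STAY) / (len(STATES) - 1)


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)


def call_mark(count, background_rate, alpha=1e-4):
    """Call a mark present when the read count is unlikely under the background.

    alpha is an assumed threshold, not a value taken from the slides.
    """
    return poisson_tail(count, background_rate) < alpha


def emission_logp(state, present):
    """Log-probability of a set of present marks, treating marks as independent."""
    total = 0.0
    for mark in MARKS:
        p = EMISSIONS[state][mark]
        total += math.log(p if mark in present else 1.0 - p)
    return total


def viterbi(observations):
    """Most likely state path for a list of per-interval sets of present marks."""
    if not observations:
        return []
    start = math.log(1.0 / len(STATES))  # uniform initial distribution (assumption)
    score = {s: start + emission_logp(s, observations[0]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best = max(STATES, key=lambda p: score[p] + math.log(transition(p, cur)))
            pointers[cur] = best
            new_score[cur] = (score[best] + math.log(transition(best, cur))
                              + emission_logp(cur, obs))
        score = new_score
        backpointers.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Six 200bp intervals loosely mirroring the slide's enhancer / TSS / transcribed layout.
EXAMPLE_BINS = [
    {"K4me1", "K27ac"}, {"K4me3"}, {"K4me3"},
    {"K4me1", "K36me3"}, {"K36me3"}, {"K36me3"},
]

if __name__ == "__main__":
    print(viterbi(EXAMPLE_BINS))
```

With these hand-set parameters the example intervals decode to one enhancer interval, two promoter intervals, and three transcribed intervals. In practice the emission and transition tables would come from training on genome-wide data rather than being written by hand.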