ISMB/ECCB07, Vienna, Austria

Special Session: Computational epigenetics and chromatin regulation
Michael Zhang (Chair)

I’d like to start by quoting Callinan & Feinberg (2006): “One of the most exciting frontiers in both epigenetics and genome sciences is the new field of epigenomics or the study of epigenetic modification at a level much larger than a single gene. Epigenetics is the study of heritable changes other than those in the DNA sequence and encompasses two major modifications of DNA or chromatin: DNA methylation, the covalent modification of cytosine, and post-translational modification of histones including methylation, acetylation, phosphorylation and sumoylation. Functionally, epigenetic marks act to regulate gene expression, silence the activity of transposable elements and stabilize adjustments of gene dosage, as seen in X inactivation and genomic imprinting. Curiously, the term epigenetics has evolved from its original definition by Waddington, meaning essentially developmental biology, yet epigenetics in the current meaning of the word may in fact be critical to understanding developmental biology. This is because the DNA sequence is invariant across tissues with the exception of rearranged genes such as immunoglobulin family members, yet the epigenome shows tissue-specific variation.”

In this Special Session, I’ve sampled a few genome-scale studies from different aspects of this emerging field in which computational methods have played an important role in helping to understand the epigenome.

Mon. July 23, 2007

2:40 p.m. – 3:10 p.m.  Insulator (CTCF) binding site motif and its distribution in the human genome (M. Q. Zhang)
3:10 p.m. – 3:40 p.m.  A genomic code for nucleosome positioning (E. Segal)
3:40 p.m. – 4:10 p.m.  Coffee Break
4:10 p.m. – 4:40 p.m.  Mapping the structure of human chromatin (W.S. Noble)
4:40 p.m. – 5:10 p.m.  Allelic expression and genomic imprinting (A. Hartemink)
5:10 p.m. – 5:40 p.m.  Realizing the medical potential of epigenomics by tailored algorithms and software (T. Lengauer/C. Bock)

Abstracts:

Insulator (CTCF) binding site motif and its distribution in the human genome
Michael Q. Zhang, Cold Spring Harbor Laboratory, USA and Tsinghua University, China

CTCF, or CCCTC-binding factor, is a ubiquitous 11 Zn-finger transcription factor with highly versatile functions that plays important roles in both normal development and cancer progression. CTCF is highly conserved, can serve as a transcriptional activator (e.g., APBβ), repressor (e.g., c-myc), or chromatin insulator (e.g., β-globin), and may have a widespread function in imprinting, with binding sites now identified within H19/Igf2, Rasgrf1, Meg3, and the BWS locus. CTCF interactions with sin3 and YB-1 have been shown to modulate CTCF function as a transcriptional repressor. Cooperation of CTCF with nucleophosmin, Kaiso and the helicase protein CHD8 has been linked to the control of CTCF insulator function and to epigenetic regulation. More recently, a CTCF-YY1-Tsix complex has been reported to function as a key component of the X-chromosome binary switch. Since CTCF recognizes long and diverse nucleotide sequences, whether or not a well-defined binding site consensus motif exists has been controversial. Recent genome-wide ChIP-chip mapping of CTCF binding sites (Kim et al. 2007) has allowed us to characterize its motif and distribution, which are consistent with a major role in chromatin insulation. Combined with our analysis of large-scale DNA methylation data (Rollins et al. 2006, Das et al. 2006 and Fan et al., submitted) and Polycomb silencing data (Lee et al. 2006, Cuddapah et al., submitted), these results suggest that CTCF also plays important roles in the boundary regions of unmethylated CpG islands and PcG binding domains.

References
1. Kim TH, Abdullaev Z, Smith AD, Ching KA, Loukinov D, Green RD, Zhang MQ, Lobanenkov V, Ren B (2007) Analysis of the vertebrate insulator protein CTCF binding sites in the human genome. Cell, Accepted.
2. Rollins RA, Haghighi FG, Edwards JR, Das R, Zhang MQ, Ju J, Bestor TH (2006) Large-scale structure of genomic methylation patterns. Genome Res. 16:157-63.
3. Das R, Dimitrova N, Xuan Z, Rollins RA, Haghighi F, Edwards JR, Ju J, Bestor TH, Zhang MQ (2006) Computational prediction of methylation status in human genomic sequences. Proc Natl Acad Sci U S A. 103:10713-6.
4. Lee TI, Jenner RG, Boyer LA, Guenther MG, Levine SS, Kumar RM, Chevalier B, Johnstone SE, Cole MF, Isono K, Koseki H, Fuchikami T, Abe K, Murray HL, Zucker JP, Yuan B, Bell GW, Herbolsheimer E, Hannett NM, Sun K, Odom DT, Otte AP, Volkert TL, Bartel DP, Melton DA, Gifford DK, Jaenisch R, Young RA (2006) Control of developmental regulators by Polycomb in human embryonic stem cells. Cell. 125:301-13.

A genomic code for nucleosome positioning
E. Segal, Weizmann Institute, IL

Eukaryotic genomes are packaged into nucleosome particles that occlude the DNA from interacting with most DNA-binding proteins. Nucleosomes are remarkable from a physical perspective because in each nucleosome one persistence length of DNA is wrapped in nearly two complete superhelical turns around a protein core. As a consequence of this extreme DNA bending, nucleosomes have higher affinity for particular DNA sequences that are best able to bend as sharply as the nucleosome requires. We have discovered that genomes care where their nucleosomes are located on average, and that genomes manifest this care by encoding an additional layer of genetic information, superimposed on top of other kinds of regulatory and coding information that were previously recognized.
By constructing a statistical profile from nucleosome-bound sequences that we isolated in vivo, we developed a partial ability to read this nucleosome positioning code and predict the in vivo locations of nucleosomes. Our results suggest that genomes utilize the nucleosome positioning code to facilitate specific chromosome functions, including transcription factor binding and transcription initiation; they imply, further, that genomes encode even higher levels of chromosome architecture.

Mapping the structure of human chromatin
William Stafford Noble, Robert Thurman, John Stamatoyannopoulos
Department of Genome Sciences, University of Washington, Seattle, WA, USA

DNA in the nucleus is packaged in a complex molecular structure known as chromatin. At the finest scale, the DNA strand is wound around eight-protein histone complexes called nucleosomes. The positioning of nucleosomes along the DNA, as well as superhelical and other higher-level chromatin structures, directly affects the accessibility of the DNA to DNA-binding proteins and hence modulates the expression of nearby genes. Low-throughput methods for assaying local chromatin structure have been available for decades. Recently, however, a variety of moderate- to high-throughput techniques have been developed that allow us to understand chromatin on a larger scale and across a variety of cellular conditions. In this talk, I will compare and contrast some of these methods, including methods using Southern blots, PCR, microarrays and capillary sequencing machines. I will describe some of our observations derived from these assays, and I will discuss the computational tasks that naturally arise from these new types of data. For example, our data provide evidence for large-scale chromatin domains, and we show that these domains are correlated with other high-throughput data sets generated by the ENCODE consortium.

Computational Methods for Genome-wide Prediction of Imprinted Genes
A. Hartemink, Duke University, USA

Imprinted genes are epigenetically modified genes that are expressed monoallelically according to their parent of origin. They are involved in embryonic development, and imprinting dysregulation is linked to cancer, obesity, diabetes, and behavioral disorders such as autism and bipolar disorder. Experimental evidence suggests that genomic imprinting evolved ~180 million years ago in a common ancestor of viviparous mammals after divergence from the egg-laying monotremes. We adopt a machine learning approach both for identifying imprinted gene candidates and for predicting their parental expression preference. We collect a series of DNA sequence features within and flanking each locus, such as statistics on repetitive elements, transcription factor binding sites, and CpG islands. Based on these features, we train a classifier that employs a two-tier prediction strategy. Each gene in the mouse genome is first predicted to be either imprinted or nonimprinted, and then the preferentially expressed parental allele is predicted for all candidate imprinted genes. Of 23,788 annotated autosomal mouse genes, our model identifies 600 (2.5%) as potentially imprinted, 64% of which are predicted to exhibit maternal expression. These predictions allow us to identify putative candidate genes for complex conditions in which parent-of-origin effects are involved, including Alzheimer disease, autism, bipolar disorder, diabetes, obesity, and schizophrenia. We observe that the number, type, and relative orientation of repeated elements flanking a gene are particularly important in predicting whether a gene is imprinted.
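
The two-tier strategy described in the abstract above can be illustrated with a minimal Python sketch. The abstract does not name the classifier, so a simple nearest-centroid rule is used here purely as a stand-in; the two features (a flanking repeat count and a CpG island count), the toy values, and all function names are hypothetical and chosen only for illustration.

```python
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Stand-in classifier (an assumption, not the authors' model):
    predicts the label whose training centroid is closest."""

    def fit(self, X, y):
        self.centroids = {
            label: centroid([x for x, lab in zip(X, y) if lab == label])
            for label in set(y)
        }
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda lab: sq_dist(x, self.centroids[lab]))


def train_two_tier(X, imprinted, parent):
    """Tier 1 separates imprinted from nonimprinted genes.
    Tier 2 is trained on imprinted genes only and predicts the expressed allele."""
    tier1 = NearestCentroid().fit(X, imprinted)
    X_imp = [x for x, flag in zip(X, imprinted) if flag]
    y_imp = [p for p, flag in zip(parent, imprinted) if flag]
    tier2 = NearestCentroid().fit(X_imp, y_imp)
    return tier1, tier2


def predict_two_tier(tier1, tier2, x):
    """Apply tier 2 only to genes that tier 1 calls imprinted."""
    if not tier1.predict(x):
        return "not imprinted"
    return tier2.predict(x)


# Toy training data: [flanking repeat count, CpG island count] (hypothetical).
X = [[9, 1], [8, 2], [9, 5], [8, 6], [1, 1], [2, 2], [1, 5], [2, 6]]
imprinted = [True, True, True, True, False, False, False, False]
parent = ["maternal", "maternal", "paternal", "paternal", None, None, None, None]

tier1, tier2 = train_two_tier(X, imprinted, parent)
for gene in ([9, 1.5], [8, 6], [1, 3]):
    print(gene, "->", predict_two_tier(tier1, tier2, gene))
```

Training the second tier only on imprinted genes keeps the parent-of-origin decision from being swamped by the much larger nonimprinted class, which is one plausible motivation for a two-tier design.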

Realizing the medical potential of epigenomics by tailored algorithms and software
Christoph Bock (1), Jörn Walter (2), Thomas Lengauer (1)
(1) Max-Planck-Institut für Informatik, Saarbrücken, Germany
(2) Universität des Saarlandes, FR 8.3 Biowissenschaften, Genetik/Epigenetik, Saarbrücken, Germany

The demand for computational support and bioinformatic tools in the field of medical epigenetics is rapidly increasing, due to complex experimental methods, increasingly genome-wide analyses and the pressure to quickly translate scientific results into clinical practice. Our goal is to develop bioinformatic methodology for addressing these issues, and to implement a set of web services that make powerful algorithms available to typical bench scientists. This talk surveys three of our contributions to the field of computational epigenetics.

1. We developed software tools that support curation and low-level analysis of DNA methylation data derived by bisulfite sequencing. This technology is commonly regarded as the gold standard for assaying DNA methylation, but it requires careful control of data quality and tedious data preparation. Our BiQ Analyzer software automates these tasks and provides an easy step-by-step analysis workflow to facilitate reproducible data analysis (Bock et al. 2005, Bioinformatics 21, 4067-8). BiQ Analyzer has recently been selected by ABI to be part of the Applied Biosystems Software Community Program.

2. We developed a method for predicting epigenetic modifications from the DNA. Applying it initially to DNA methylation, we showed that the methylation state of CpG islands in human lymphocytes is highly correlated with DNA sequence, repeats and predicted DNA structure (Bock et al. 2006, PLoS Genet 2(3): e26). We subsequently extended the method to the prediction of other epigenetic modifications, such as histone modifications and DNA accessibility, and we applied it to refine the annotation of CpG islands for the human genome (Bock et al., submitted). Given the generality of the method and its wide range of potential applications in analyzing large-scale epigenome datasets and prioritizing candidate regions, we implemented it as a publicly available web service called EpiGRAPH, which is currently in beta testing.

3. We developed a workflow and a set of statistical methods that optimize candidates for epigenetic biomarkers with respect to their applicability in clinical settings. This approach accounts for the fact that some widely used methods for pre-clinical research (such as methylation-specific PCR and bisulfite sequencing of multiple clones) are less appropriate for clinical settings, due to lack of robustness, complicated handling and time-consuming analysis. Using a combination of experimental and statistical analysis, we were able to determine optimal pyrosequencing targets for assessing aberrant promoter methylation of the MGMT gene, which is an important biomarker for resistance to chemotherapy (Mikeska et al., submitted).
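
The kind of low-level bisulfite sequencing analysis mentioned in item 1 can be illustrated with a minimal Python sketch. It is not BiQ Analyzer's algorithm; it only encodes the basic chemistry (after bisulfite treatment and PCR, unmethylated cytosines read as T while methylated cytosines remain C). It assumes that each clone is already aligned to the same strand of the genomic reference without gaps, and that non-CpG cytosines are unmethylated, so their conversion can serve as a quality check. All function names and sequences are hypothetical.

```python
def cpg_positions(reference):
    """0-based indices of the C in every CpG dinucleotide of the reference."""
    ref = reference.upper()
    return [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]


def call_methylation(reference, clone):
    """Methylation call per CpG for one aligned bisulfite clone.
    True = methylated (C retained), False = unmethylated (read as T),
    None = any other base (uninformative)."""
    if len(reference) != len(clone):
        raise ValueError("clone must be aligned to the reference without gaps")
    calls = {}
    for pos in cpg_positions(reference):
        base = clone[pos].upper()
        if base == "C":
            calls[pos] = True
        elif base == "T":
            calls[pos] = False
        else:
            calls[pos] = None
    return calls


def conversion_rate(reference, clone):
    """Fraction of non-CpG cytosines read as T (assumed to be unmethylated).
    Returns None if the clone has no informative non-CpG cytosine."""
    if len(reference) != len(clone):
        raise ValueError("clone must be aligned to the reference without gaps")
    ref = reference.upper()
    cpg = set(cpg_positions(ref))
    converted = retained = 0
    for i, base in enumerate(ref):
        if base == "C" and i not in cpg:
            read = clone[i].upper()
            if read == "T":
                converted += 1
            elif read == "C":
                retained += 1
    total = converted + retained
    return converted / total if total else None


def methylation_levels(reference, clones):
    """Per-CpG fraction of informative clones that are methylated."""
    levels = {}
    for pos in cpg_positions(reference):
        calls = [call_methylation(reference, c)[pos] for c in clones]
        informative = [c for c in calls if c is not None]
        levels[pos] = sum(informative) / len(informative) if informative else None
    return levels


# Hypothetical reference with CpGs at positions 1, 5 and 9,
# and one non-CpG cytosine at position 8.
reference = "ACGTTCGACCGA"
clones = ["ACGTTTGATCGA", "ATGTTTGATCGA", "ACGTTCGATTGA"]

for clone in clones:
    print(clone, call_methylation(reference, clone), conversion_rate(reference, clone))
print(methylation_levels(reference, clones))
```

Clones with a low conversion rate would normally be excluded before methylation levels are computed; the threshold for exclusion is a laboratory-specific choice and is deliberately left out of this sketch.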