Simultaneous Computational Discovery of DNA Regulatory Motifs and Transcription Factor Binding Constraints at High Spatial Resolution

by Yuchun Guo

M.S. Computer Science, Northeastern University, 2000

SUBMITTED TO THE COMPUTATIONAL AND SYSTEMS BIOLOGY PROGRAM IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTATIONAL AND SYSTEMS BIOLOGY AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY

SEPTEMBER 2012

© 2012 Massachusetts Institute of Technology. All rights reserved.

Signature of Author: Yuchun Guo, Computational and Systems Biology Program, August 31, 2012

Certified by: David K. Gifford, Professor of Computer Science and Engineering, Thesis Supervisor

Accepted by: Chris Burge, Professor of Biology and Biological Engineering, Computational and Systems Biology Ph.D. Program Director

Submitted to the Computational and Systems Biology Program on August 31, 2012 in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computational and Systems Biology

Abstract

I present three novel computational methods to address the challenge of identifying protein-DNA interactions at high spatial resolution from noisy ChIP-Seq data. I first present the genome positioning system (GPS) algorithm, which predicts protein-DNA interaction events from ChIP-Seq data using a single-base resolution generative probabilistic model. Using synthetic and actual ChIP-Seq data, I show that GPS improves the effective spatial resolution and the accuracy in resolving proximal binding events compared with existing methods.
Second, I present the k-mer set motif (KSM) representation and the k-mer motif alignment and clustering (KMAC) method, which discovers DNA-binding motifs from ChIP-Seq derived sequences. I demonstrate that the KSM model is more predictive than the widely used position weight matrix model, and that KMAC outperforms other existing motif discovery programs in recovering known motifs from a large collection of human ChIP-Seq experiments. Finally, I present an integrative method, genome wide event finding and motif discovery (GEM), which models ChIP data with explanatory motifs and binding events at high spatial resolution. The GEM model links binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence. I show that GEM further improves upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and discovery of proximal binding events. GEM enables a systematic analysis of in vivo transcription factor binding to discover hundreds of spatial binding constraints between factors in human and mouse cells, including known factor pairs and novel pairs such as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4α/FOXA1. I also discovered a complex spatial binding relationship involving six key regulatory factors in mouse embryonic stem (ES) cells that is likely to be functional in ES cell gene regulation. Such computational discoveries propose testable models for regulatory factor interactions that will help elucidate genome function and the implementation of combinatorial control.

Thesis Supervisor: David K. Gifford
Title: Professor of Computer Science and Engineering

To My Teachers and My Family

Acknowledgments

Getting a Ph.D. degree is a journey. Without the inspiration, guidance, and support of the people around me, it would have been almost impossible. I am sincerely grateful for all the teaching and inspiration I received that led me to my Ph.D.
study at MIT and kept motivating and supporting me throughout this journey.

For my thesis research at MIT, I would first like to thank my thesis advisor, Prof. David Gifford. David introduced me to the fields of computational genomics and developmental biology and guided me through every step of my research. David encouraged me to explore research ideas that are really interesting and meaningful to me, helped me to get help from other lab members so that my project started quickly, and suggested new directions when I made progress. His knowledge of both computer science and biology, and his extraordinary ability to communicate and collaborate effectively with scientists in other fields, have set a great example for me.

Next I want to thank my thesis committee members, Prof. Tommi Jaakkola and Prof. Ernest Fraenkel. Tommi and Ernest not only provided their expert suggestions to improve my research, but also taught me how to do science in practice.

The members of the Gifford Lab have provided me with a wonderful and relaxed learning environment. Shaun Mahony, a research scientist in the Gifford Lab, has been a great colleague, mentor, and friend. Shaun helped me to develop many of the analyses in my project, helped to improve my writing, and was always available to chat when I got stuck. Chris Reeder, a fellow graduate student, and I shared the ups and downs of the graduate student experience. The conversations we had, ranging from machine learning and biology to life in general, helped me to keep the research work in perspective. I would also like to thank all the other Gifford lab members, including Jeanne Darling, Georg Gerber, Robin Dowell, Alan Qi, Alex Rolfe, Tim Danford, Charlie O’Donnell, Matt Edwards, Tahin Syed and Tatsu Hashimoto, who made my life at MIT much easier and richer.

I am grateful for the opportunity to study in the Computational and Systems Biology (CSB) Program at MIT. The program directors, Prof. Chris Burge and Prof. Bruce Tidor, are great teachers.
They always made themselves available when I needed their help the most. The CSB administrator Bonnie Whang, Darlene Ray, and other CSB students gave me another home at MIT. I would also like to thank many MIT students, including Pouya Kheradpour, Georgios Papachristoudis and Bob Altshuler, for their help in my research.

I also wish to acknowledge and thank my friends at MIT and in the Boston area. Their encouragement and help enabled me to overcome difficulties in the past few years.

Finally, I am grateful to my family, my parents, my wife, and my two sons. Their love and support are always with me. I feel like we share the MIT degree.

Table of contents

Chapter 1: Introduction
1.1 Gene expression and transcription regulation
1.2 Transcription factor binding and combinatorial regulation
1.3 Next-generation sequencing technologies
1.4 ChIP-Seq and computational challenges
1.5 Thesis road map
Chapter 2: Genome Positioning Systems (GPS)
2.1 Introduction
2.2 GPS algorithm
2.2.1 GPS algorithm overview
2.2.2 Empirical spatial distribution of reads
2.2.3 GPS mixture model
2.2.4 EM algorithm
2.2.5 Setting the sparseness parameter α
2.2.6 Statistical significance of predicted events
2.2.7 Artifact filtering
2.2.8 Software implementation
2.3 Results
2.3.1 GPS automatically adapts the empirical read distribution
2.3.2 GPS predictions have higher spatial resolution
2.3.3 GPS discovers more joint events
2.4 Discussion
2.5 Methods
2.5.1 Datasets used
2.5.2 ChIP-Seq analysis methods
2.5.3 Method comparison on spatial resolution
2.5.4 Evaluating performance in deconvolving joint events using synthetic data
2.5.5 Evaluating performance in deconvolving joint binding events using GABP ChIP-Seq data
Chapter 3: K-mer set motif representation and discovery
3.1 Introduction
3.1.1 DNA motif representations
3.1.2 DNA motif discovery methods
3.1.3 About this chapter
3.2 K-mer set motif (KSM) model
3.2.1 The KSM representation
3.2.2 Scoring K-mer set motif in a DNA sequence
3.3 K-mer motif alignment and clustering (KMAC)
3.3.1 Discovery of the set of enriched k-mers
3.3.2 Clustering the enriched k-mers into k-mer set motifs
3.4 Results
3.4.1 The PWM model does not capture k-mer differences between CTCF binding in mouse and human cells
3.4.2 K-mer set motif model is more predictive for in vivo binding than PWM model
3.4.3 KMAC outperforms other motif discovery methods in discovering known DNA-binding motifs
3.4.4 KMAC outperforms other ChIP-Seq oriented motif discovery methods
3.5 Discussion
3.6 Methods
3.6.1 Datasets
3.6.2 Motif-finding performance comparison
3.6.3 ROC comparison of motif representation performance in predicting in vivo binding
Chapter 4: Genome-wide event finding and motif discovery (GEM)
4.1 Introduction
4.2 GEM algorithm
4.2.1 Predicting protein DNA-binding events with a sparse prior
4.2.2 Discover the k-mer set motifs at binding event locations
4.2.3 Positional prior generation
4.2.4 Binding event prediction with a positional prior
4.2.5 Motif discovery using improved event locations
4.2.6 GEM software
4.3 Results
4.3.1 GEM improves the spatial resolution of binding event prediction
4.3.2 GEM is better at resolving closely spaced binding events
4.3.3 GEM improves the spatial resolution of ChIP-exo binding event prediction
4.4 Discussion
4.5 Methods
4.5.1 Datasets
4.5.2 Evaluating spatial resolution of ChIP-Seq event calls
4.5.3 Evaluating performance in deconvolving proximal binding events using GABP ChIP-Seq data
4.5.4 Analysis of ChIP-exo data
Chapter 5: Transcription factor spatial binding constraints
5.1 Introduction
5.2 Spatial binding constraints discovery
5.3 Results
5.3.1 GEM reveals known Sox2-Oct4 distance-constrained transcription factor binding distances
5.3.2 Enhancer grammar elements deduced from transcription factor binding sites predicted by GEM
5.3.3 Spatially constrained human factor binding in ENCODE data
5.4 Discussion
Chapter 6: Conclusions
6.1 Summary and contributions
6.1.1 Genome Positioning Systems (GPS)
6.1.2 K-mer set motif representation and discovery
6.1.3 Genome-wide event finding and motif discovery (GEM)
6.1.4 Transcription factor spatial binding constraints
6.2 Directions for future work
6.2.1 Weighting factor of motif prior in the GEM algorithm
6.2.2 K-mer based comparison of in vivo versus in vitro binding for similar TFs in a family
6.2.3 Discovery of binding constraints
6.3 Conclusions
References

Figures

Figure 2-1 Random mixture of ChIP-Seq reads from joint events
Figure 2-2 Spatial distribution of ChIP-Seq reads
Figure 2-3 GPS probabilistically models ChIP-Seq read spatial distributions
Figure 2-4 GPS automatically adapts the empirical read distribution
Figure 2-5 GPS improves the effective spatial resolution
Figure 2-6 GPS has better spatial resolution than other shape-aware methods
Figure 2-7 GPS improves accuracy in resolving joint binding events
Figure 3-1 Oct4 KSM and PWM motif representation
Figure 3-2 Search k-mer set motif in a DNA sequence
Figure 3-3 Schematic of k-mer set motif finding
Figure 3-4 The PWM model does not capture k-mer differences
Figure 3-5 The KSM model is more predictive than the PWM model
Figure 3-6 KMAC motif discovery outperforms other methods when detecting motifs in ChIP-Seq data
Figure 3-7 KMAC outperforms other ChIP-Seq oriented motif discovery methods
Figure 4-1 GEM improves spatial accuracy in binding event prediction
Figure 4-2 GEM is better at resolving closely spaced binding events
Figure 4-3 GEM improves the spatial resolution of ChIP-exo data event prediction
Figure 5-1 GEM reveals transcription factor spatial binding constraints
Figure 5-2 Spatial binding constraints detected from mouse ES cells
Figure 5-3 Spatial relationship between Klf4 and other 15 factors in mouse ES cells
Figure 5-4 Enhancer grammar elements deduced from mouse ES cell transcription factor binding sites predicted by GEM
Figure 5-5 A Klf4-Sox2 distance-constrained region interacts with Tcfcp2l1 transcriptional start site
Figure 5-6 Klf4-Sox2 distance-constrained regions are bound by p300 and marked by H3K27ac
Figure 5-7 Spatial binding constraints detected from human K562 cells
Figure 5-8 Spatial binding constraints detected from human GM12878 cells
Figure 5-9 Spatial binding constraints detected from human HepG2 cells
Figure 5-10 Spatial binding constraints detected from human HeLa-S3 cells
Figure 5-11 Spatial binding constraints detected from human H1-hESC cells
Figure 5-12 Examples of transcription factor spatial binding constraints detected from GEM analysis in ENCODE ChIP-Seq data

Chapter 1: Introduction

This thesis is about developing and applying new computational algorithms to discover precise transcription factor binding locations, the corresponding in vivo DNA regulatory motifs, and transcription factor binding spatial constraints from high-throughput biological experimental datasets. My research is within the broader research area of computational biology, with a focus on understanding the regulatory mechanisms of transcription, a fundamental biological process. To explain what transcription regulation is, why it is an important research subject, and how my thesis work fits in the frontier of this research area, I will start with a brief overview of the biological background and the related research problems.

1.1 Gene expression and transcription regulation

The human body is full of wonders. In our brain, which weighs only about 3 pounds, approximately 100 billion neurons, interconnected by electrical or chemical signals, give rise to our five senses, consciousness, memory, emotion, creativity, and so on. An army of immune cells constantly defends our bodies from countless foreign invaders (for example, viruses and bacteria) and cancer cells.
Over 2 billion heart cells beat in a highly coordinated manner roughly 3 billion times over a lifetime, supplying oxygen and other nutrients to our bodies (Sherwood, 1997). The list goes on. Perhaps more amazingly, all ~200 major types of cells in our bodies, with diverse and complex functions, originate from a single fertilized egg cell, starting with the same copy of genetic instructions encoded in DNA. Although nearly all the cells in our bodies contain the same full set of genes, only some of the genes are active, or expressed, and used to make proteins or functional RNAs in a particular cell type (Lodish, 2004).

Gene expression, the process by which information from a gene is used to synthesize a protein or another functional gene product, is used by all life forms to make the macromolecular machinery that carries out life’s functions. The first main step of gene expression is transcription, in which the DNA sequence information of a gene is copied from the DNA template to a single-stranded RNA by the enzyme RNA polymerase. In eukaryotic cells, the initial RNA copy is processed into a messenger RNA (mRNA). In the second main step of gene expression, called translation, a complex molecular machine, the ribosome, assembles proteins using the precise sequence information in the mRNA, which is originally coded in the gene.

Because gene expression influences basic cellular processes such as cell growth, maintenance, development, and differentiation, proper regulation is essential to ensure that it happens at the right time and in the right place. A classic example is lactose metabolism in E. coli: the enzyme that metabolizes lactose is expressed at high levels only when lactose is available in the environment, but when glucose (a better food source) is also available, the enzyme is not expressed even when lactose is present (Jacob and Monod, 1961).
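The regulatory logic of the lactose example can be written as a two-input truth table. The sketch below is only a Boolean simplification of the behavior described above; the function name `lac_enzyme_highly_expressed` is hypothetical, and real expression levels are graded rather than on/off.

```python
def lac_enzyme_highly_expressed(lactose_present: bool, glucose_present: bool) -> bool:
    """Boolean simplification: high expression requires lactose and no glucose."""
    return lactose_present and not glucose_present


# Enumerate all four environmental conditions.
for lactose in (False, True):
    for glucose in (False, True):
        state = lac_enzyme_highly_expressed(lactose, glucose)
        print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> high expression: {state}")
```

Only the condition with lactose present and glucose absent yields high expression, which is the combinatorial behavior the example is meant to convey.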
Regulation of gene expression can happen at various steps during the process, including initiation, elongation, and termination of transcription, splicing, mRNA transport, mRNA decay, and translation. However, the regulation of transcription initiation (the first step) is the most important mechanism for determining which genes are expressed and how much of the encoded mRNAs and, consequently, proteins are produced (Lodish, 2004).

Transcription initiation from a gene promoter is controlled by sequence-specific DNA-binding regulatory proteins called transcription factors (TFs, also called activators or repressors in bacteria). Eukaryotic TFs typically contain one or more DNA-binding domains that recognize specific DNA sequences and a transcription regulation domain that interacts with other transcriptional regulatory proteins and regulates the activity of transcription (Mitchell and Tjian, 1989; Ptashne and Gann, 1997). During transcription initiation, RNA polymerase (together with the general transcription factors, also called the transcriptional machinery) binds to the promoter region of a gene and starts the process of transcription. However, at many promoters, in the absence of regulatory proteins, RNA polymerase binds only weakly and produces a low level of constitutive expression. With a regulatory activator, which typically binds specific sites at or near the promoter, the polymerase is recruited to the promoter and produces a high level of transcription. The transcriptional activators, usually with the help of co-activators (Taatjes et al., 2004), can interact with one or more of many different components of the transcriptional machinery to recruit polymerase. Alternatively, they can interact with chromatin modifiers that open inaccessible DNA to allow binding of the transcriptional machinery to a promoter.
On the other hand, a regulatory repressor may interfere with or inhibit the transcriptional machinery or activators, or recruit repressive chromatin modifiers to suppress transcription. Thus gene promoters typically contain specific short sequence elements that can be recognized by specific TFs. In higher eukaryotes, TFs may bind regions called enhancers located tens of thousands of base pairs either upstream or downstream from the promoter. Some TFs may also regulate transcriptional elongation (Rahl et al., 2010). Therefore, a gene can be regulated by multiple TFs that work together in large numbers and various combinations. This allows the integration of multiple signal transduction pathways, particularly in multicellular organisms (Watson, 2004).

The regulation of transcription by transcription factors is critical for numerous biological phenomena, including development, signal transduction, immune response, sensory perception, etc. (Vaquerizas et al., 2009). For example, introducing only four transcription factors, Oct4, Sox2, c-Myc, and Klf4, can change the cell fate of mouse embryonic or adult fibroblasts into induced pluripotent stem cells (Takahashi and Yamanaka, 2006). Dysfunction in transcription regulation may lead to various diseases, such as developmental syndromes (Boyadjiev and Jabs, 2000) and cancers (Furney et al., 2006). For example, deregulated expression of the transcription factor c-Myc is found to cause unregulated expression of many cell proliferation genes and result in certain cancers (Dang, 2012). In a manually curated census of human TFs, 164 TFs were identified to be directly responsible for 277 diseases or syndromes (Vaquerizas et al., 2009). Mutations in transcriptional regulatory elements have also been found to be associated with numerous human diseases (Maston et al., 2006).

In summary, transcription factors are key players in regulating gene expression and in influencing a broad spectrum of biological processes.
However, most human TFs are uncharacterized (Vaquerizas et al., 2009). It is important to understand how TFs (possibly interacting with other TFs) bind to the regulatory DNA sequences and regulate the expression of their target genes.

1.2 Transcription factor binding and combinatorial regulation

Combinatorial binding of TFs plays a key role in the specificity of transcriptional regulation and is thought to contribute to the complexity and diversity of eukaryotes (Watson, 2004). An increase in both the ratio and absolute number of transcription factors in a genome seems to correlate with organismal complexity (Levine and Tjian, 2003). The complexity of the regulatory sequences follows the same trend. From bacteria to yeast, to multicellular organisms such as fruit fly and human, the regulatory sequences typically contain increasing numbers of binding sites and lie farther away from the gene promoter.

Theoretical analysis showed that bacterial TFs can recognize a specific DNA site in the genomic background, but the same is not true for eukaryotic TFs because the eukaryotic binding sites are shorter and their genomes are much larger (Wunderlich and Mirny, 2009). In addition, TFs in large families share similar DNA-binding domains and recognize very similar consensus sequences. One example is the so-called Hox paradox: Homeobox (Hox) family factors recognize similar sequences containing a TAAT core in vitro, yet display functional diversity in vivo (Hueber and Lohmann, 2008). Given the generic binding sites and generic DNA-binding domains of the TFs, the formation of complex nucleoprotein structures involving a combinatorial TF partner code and their DNA sites increases the effective length of the target DNA sequences and thus increases the specificity of gene regulation (Georges et al., 2010). Thus in multicellular organisms, enhancers usually contain clusters of sequence-specific TF binding sites (Maston et al., 2006).
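A rough back-of-the-envelope calculation illustrates the specificity argument above. The sketch below is not the analysis of Wunderlich and Mirny; it assumes a random genome with equal, independent base frequencies and exact matching on both strands, and the site lengths and genome sizes in the usage lines are illustrative assumptions rather than values taken from the cited work. The function name `expected_random_matches` is hypothetical.

```python
def expected_random_matches(site_length: int, genome_size: int,
                            both_strands: bool = True) -> float:
    """Expected number of exact chance matches to one fixed sequence.

    Assumes each base is drawn independently with probability 1/4, so a
    given position matches a site of length L with probability 4**-L.
    """
    positions = genome_size - site_length + 1
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** site_length


# Illustrative (assumed) numbers: a long site in a small genome versus
# a short site in a large genome.
print(expected_random_matches(site_length=16, genome_size=5_000_000))
print(expected_random_matches(site_length=8, genome_size=3_000_000_000))
# Two adjacent 8 bp sites read as one 16 bp target in the large genome.
print(expected_random_matches(site_length=16, genome_size=3_000_000_000))
```

Under these assumptions the long site in the small genome is expected to occur by chance far less than once, whereas the short site in the large genome is expected tens of thousands of times. Treating two adjacent sites as one longer target sharply reduces the number of chance matches, which is the sense in which combinatorial binding increases the effective length of the target sequence.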
Specific genomic regions that are extensively targeted by multiple TFs have been reported in fruit fly (Moorman et al., 2006; The modENCODE Consortium et al., 2010) and in mouse (Chen et al., 2008). An intriguing question is how these TFs work together to regulate specific gene expression patterns. The notion of grammar (Levine, 2010; Swanson et al., 2011) refers to the phenomenon that the spacing and arrangement of binding sites matter for the activity of the enhancer, just as the order of words in a sentence can affect its meaning. Such regulatory grammars have been observed in certain enhancers. An open question is the pervasiveness of grammatical features (Levine, 2010). However, most of the current binding data show overlapping binding regions but do not have enough spatial resolution to reveal the detailed grammars that govern the interactions among the TFs and between the TFs and the DNA.

The nature of the combinatorial control with respect to the arrangement (position and orientation) of the binding sites has been described in two competing models: the “enhanceosome” and “billboard” models (Arnosti and Kulkarni, 2005). The enhanceosome model proposes that the binding sites within the enhancer are precisely positioned, allowing for highly cooperative assembly of TFs. One well-studied example is the Interferon-β enhanceosome, where binding sites for ATF/c-Jun, IRF-3/IRF-7, and NF-κB are tightly clustered on a sequence only 55 base pairs long. The specific type and number of TF binding sites and their correct positioning on the surface of the DNA double helix are required for enhancer function (Thanos and Maniatis, 1995). In contrast, the billboard model suggests that the arrangement of the binding sites may not be very strict because TFs binding on sub-elements of the enhancer can be interpreted by the transcriptional machinery separately (Kulkarni and Arnosti, 2003).
These two models have each been observed in only a few detailed studies. In reality, enhancers may function somewhere between these two extreme models.

In addition, the TF binding site arrangement may be only part of the regulatory code. Sometimes protein-protein interactions may modify the binding preferences of the TFs. A recent study of in vitro binding of Hox-cofactor complexes showed that cofactor binding evoked differences in DNA binding among different Hox proteins, and this may contribute to the in vivo binding specificities of Hox proteins (Slattery et al., 2011). Furthermore, there may or may not be protein-protein interactions between the TFs in a multiprotein-DNA complex. A composite structure model of the Interferon-β enhanceosome showed the absence of major protein-protein interfaces between the TFs, suggesting that the cooperative occupancy of the enhancer comes from both binding-induced DNA conformational changes and specific interactions with co-activators (Panne et al., 2007). Thus, more detailed analyses of in vivo binding sites within enhancers will be needed to unravel the grammars of combinatorial regulation.

Computational prediction of in vivo TF binding sites suffers from high false positive rates (Wasserman and Sandelin, 2004). Although the situation is improving with new approaches that model combinatorial binding to improve predictive specificity, the improvements are limited by the availability of sufficient known sites to train the model (Wasserman and Sandelin, 2004). Therefore, a complete survey of genome-wide TF binding and further studies of the binding motifs and spatial constraints of TFs in vivo will help to elucidate the nature of combinatorial control.

1.3 Next-generation sequencing technologies

New technological advances in experimental methods, especially in sequencing technologies (Mardis, 2008; Metzker, 2010), have brought excitement to genomics research.
The next-generation sequencing (NGS) technologies introduced various innovations in areas such as template preparation, sequencing and imaging, and genome alignment and assembly methods (Metzker, 2010). The major advance offered by NGS is the ability to generate very large amounts of sequencing data at much lower cost than the automated Sanger sequencing method, and this enables various innovative approaches in basic, applied and clinical research (Metzker, 2010).

NGS has been applied in the research field of functional genomics. By sequencing the ends of the DNA/RNA molecules in the sample and mapping them to the genome, one can count the mapped reads and analyze their distribution throughout the genome. Such sequence census methods enable researchers to assay the regulatory input and output of the genome routinely and comprehensively, and vastly increase our ability to understand how the genome specifies all the different cell types and their states of behavior (Wold and Myers, 2008). For example, RNA-Seq is replacing microarrays for gene expression profiling. RNA-Seq reveals unexpected complexity in eukaryotic transcriptomes and provides a far more precise measurement of the levels of transcripts and their isoforms than other methods (Wang et al., 2009). Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) enables genome-wide profiling of protein-DNA interactions at a much higher resolution and coverage than previous methods (Park, 2009). ChIP-Seq studies of TF binding find that most TFs bind to thousands of places in the genome, often outside of the proximal promoter regions, and that combinatorial binding and recruitment of co-activators are important for a high level of transcription activity (Farnham, 2009). ChIP-Seq profiling of multiple histone marks has been used for genome annotation and for the detection of regulatory sequences and non-coding RNAs (Guttman et al., 2009; Ernst et al., 2011; Shen et al., 2012).
DNase-seq and FAIRE-seq have been applied to map nearly a million open chromatin regions that cover 9% of the human genome and to discover clusters of open regulatory elements that are suggested to control gene activity required for the maintenance of cell-type identity (Song et al., 2011). NGS-based technologies have also been applied to variant discovery by resequencing targeted regions of interest or whole genomes, de novo assemblies of bacterial and lower eukaryotic genomes, and species classification and/or gene discovery by metagenomics studies, etc. (Metzker, 2010).

The application of emerging new technologies and the large consortium efforts across multiple institutions such as ENCODE (Birney et al., 2007), modENCODE (Celniker et al., 2009), and the Roadmap Epigenomics Mapping Consortium (Bernstein et al., 2010) are starting to generate unprecedented amounts of data. Integrative analysis through detailed computational modeling on these comprehensive datasets will greatly leverage the potential of these resources and facilitate the translation of data into biological knowledge. Such combined experimental and computational efforts promise to unravel the molecular mechanisms of gene regulation and improve human health.

1.4 ChIP-Seq and computational challenges

As one of the early applications of NGS, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) has become an indispensable tool for genome-wide profiling of protein-DNA interactions (Barski et al., 2007; Johnson et al., 2007; Mikkelsen et al., 2007; Robertson et al., 2007; Park, 2009).
Compared to its predecessor, chromatin immunoprecipitation followed by microarray hybridization (ChIP-chip) (Ren et al., 2000; Iyer et al., 2001), ChIP-Seq has higher resolution, fewer artifacts, greater coverage and a larger dynamic range (Park, 2009) and therefore provides substantially improved mapping of physical interactions between proteins and DNA in the living cell. ChIP-Seq has been applied to genome-wide profiling of TF binding sites and histone modifications and has generated valuable knowledge on global gene regulation (Farnham, 2009). It is considered the most successful high-throughput experimental technique for the discovery of TF binding sites (Ladunga, 2010).

ChIP-Seq is based on chromatin immunoprecipitation (ChIP) (Solomon et al., 1988) to enrich the DNA fragments that are associated with a specific protein. The DNA-binding protein is crosslinked to the DNA by formaldehyde and the DNA is sheared by sonication into small fragments. An antibody specific to the protein of interest is used to selectively immunoprecipitate the protein-bound DNA fragments. Finally, the crosslinks in the pulled-down protein-DNA complexes are reversed and the recovered DNA is assayed by NGS to determine the sequences bound by that protein. The output is a list of reads that are sequenced from the 5’ end of the ChIP DNA fragments.

The specificity of the antibody is critical in the experimental design, but it may be difficult to find a native antibody of sufficient quality for the protein of interest. The situation is improving thanks to the wider adoption of ChIP-Seq and large consortium efforts such as ENCODE (Birney et al., 2007), modENCODE (Celniker et al., 2009), and the Roadmap Epigenomics Mapping Consortium (Bernstein et al., 2010). As an alternative to the requirement for TF-specific antibodies, ChIP-Seq with epitope-tagged human or mouse proteins has also been developed (Cao et al., 2011; Mazzoni et al., 2011). Sequencing depth is another important experimental design issue.
Early ChIP-Seq datasets typically contained several million reads. A recent evaluation suggested that the regularly adopted depth of 15-20 million reads in human experiments is insufficient (Chen et al., 2012). Typically, a control experiment is used to correct biases in DNA shearing, amplification and sequencing (Park, 2009). There are three types of commonly used controls: genomic DNA, chromatin input DNA and DNA from a nonspecific IP. Chromatin input DNA has been shown to outperform genomic DNA in predicting binding events with more enriched binding motifs (Chen et al., 2012).

Short read alignment software such as Bowtie (Langmead et al., 2009), Eland (the default aligner for the Illumina platform) or MAQ (Li et al., 2008) is then used to align the sequenced reads to the genome. A few base mismatches are typically permitted during read alignment to allow for sequencing errors. Depending on the length of the reads and the genome, ~10-20% of the reads cannot be uniquely mapped to the genome (Rozowsky et al., 2009). The non-uniquely mapped reads are typically discarded for downstream analysis (Park, 2009), although a recent study showed that they are important for studying TF binding in highly repetitive regions of genomes (Chung et al., 2011).

The next and most critical step is to infer actual binding events by identifying statistically enriched regions in the ChIP data as compared to the control data. Many computational methods (usually called “peak callers”) have been developed to detect binding events. They are reviewed in (Park, 2009; Pepke et al., 2009) and compared in (Laajala et al., 2009; Wilbanks and Facciotti, 2010; Rye et al., 2011). Some of the methods are discussed in Chapter 2 and Chapter 4.
For TF binding, a critical issue in the computational analysis of ChIP-Seq data is the spatial resolution of binding event predictions, which is defined as the difference between the predicted location of a binding event and the midpoint of its actual location. Spatial resolution is important for downstream analyses such as motif discovery and annotation of the binding sites, particularly for mapping the binding constraints among multiple TFs in the same cellular condition. Although the reads are mapped at single-base resolution, random variation in the ChIP DNA fragmentation process obscures the actual location of interaction events. In addition, ChIP-Seq reads caused by different closely spaced events (joint events or proximal events) will spatially mix with one another along the genome, presenting a challenge for precisely estimating the multiplicity and exact positions of proximal binding events of the same TFs. The typical spatial resolution of ChIP-Seq binding event detection is within 40-50 base pairs and varies with the dataset and the methods used (Wold and Myers, 2008; Wilbanks and Facciotti, 2010). To fully capitalize on the benefits of ChIP-Seq, the spatial resolution of event detection needs to be greatly improved.

The most common follow-up analysis of binding event detection, motif discovery, has also not been optimized for ChIP-Seq data. Motif discovery is one of the most widely studied problems in computational biology and many methods have been developed. They are reviewed in (MacIsaac and Fraenkel, 2006; Das and Dai, 2007; Zambelli et al., 2012) and compared in (Hu et al., 2005; Tompa et al., 2005). Some of the methods are discussed in Chapter 3. Traditional motif discovery programs such as MEME (Bailey and Elkan, 1994) and Weeder (Pavesi et al., 2001) are not suitable for large numbers of ChIP-Seq bound sequences due to computational inefficiency.
These methods are thus limited to processing only the ~500-1000 top-ranking sequences, ignoring weak binding sites. New methods have been developed to take advantage of some features of ChIP-Seq data, such as higher spatial resolution, more quantitative binding strength and higher redundancy of motif instances in the sequences (see Section 3.1.2 for more discussion). But as shown in Chapter 3, the performance of these methods has not improved as much as expected. A related issue is that there is currently no generally accepted gold standard for motif representation (Hughes, 2011). Thus it is beneficial to explore potentially more suitable motif representations and better approaches to discovering motifs for ChIP-Seq data.

Other downstream analyses of ChIP-Seq binding predictions include studying the relationships among binding calls from multiple transcription factors in the same cellular condition, and elaborating the relationship between binding calls and gene structure, gene target assignment, gene expression, condition-specific binding, etc. (Park, 2009). For more accurate downstream analysis, it is important to have binding calls with higher spatial resolution and more accurate binding strength quantification.

The interaction between a TF and its target site on the DNA is the basic unit for understanding the complex network of global gene regulation. Innovations in the computational analysis of ChIP-Seq data promise to reveal aspects of transcription factor binding at a new level of resolution, enabling further mechanistic study of combinatorial control and the gene regulatory network.

1.5 Thesis road map

In this thesis I present novel computational algorithms for ChIP-Seq binding event prediction, DNA regulatory motif discovery, and transcription factor binding constraint discovery, and the resulting biological findings.
Chapter 2: Genome Positioning Systems (GPS)

In Chapter 2, I present the Genome Positioning Systems (GPS), a principled model-based computational method to predict ChIP-Seq binding events with high spatial resolution. I first introduce the challenge that the random fragmentation of ChIP DNA and the mixing of closely spaced events present for precisely estimating the multiplicity and exact positions of proximal binding events (Section 2.1). Next I describe the GPS algorithm. GPS models the spatial distribution of reads and deconvolves proximal binding events using a probabilistic mixture model with a sparse prior (Section 2.2). I compare the GPS results with those of widely used published methods, and I find that GPS improves the spatial resolution of binding event predictions and resolves more proximal binding events (Section 2.3). Finally, I discuss the significance of improved spatial resolution and discovery of proximal binding events, compare GPS with recently published similar approaches (Section 2.4), and describe analysis methods (Section 2.5).

Chapter 3: K-mer set motif representation and discovery

In Chapter 3, I present a novel k-mer set motif representation and a new motif discovery method, k-mer motif alignment and clustering (KMAC), to learn motifs that are most enriched in ChIP-Seq bound sequences versus control sequences. I give a brief introduction to widely used motif representations and motif discovery methods and discuss their limitations, particularly the challenge of incorporating informative features in ChIP-Seq derived data (Section 3.1). Then I describe the k-mer set motif representation (Section 3.2) and the KMAC motif discovery method. KMAC discovers motifs by using a combined enumerative and alignment-based approach and weighting the motif sites with binding event strength and position information (Section 3.3).
Using a large number of diverse ChIP-Seq datasets to recover known motifs, I show that the k-mer set motif representation is more informative and predictive than the position weight matrix (PWM) model, and that KMAC also outperforms other motif discovery methods, including ChIP-Seq oriented methods (Section 3.4). Finally I discuss the significance of using the k-mer set motif representation and the KMAC motif discovery method in the context of ChIP-Seq analysis (Section 3.5), and describe analysis methods (Section 3.6).

Chapter 4: Genome-wide event finding and motif discovery (GEM)

In Chapter 4, I present genome-wide event finding and motif discovery (GEM), an integrative model to resolve the location of protein-DNA interactions and discover explanatory DNA sequence motifs. I first introduce the value of integrating motif finding and event discovery (Section 4.1). Then I describe the GEM algorithm. GEM extends the GPS model to incorporate motif information as a position-specific prior to bias binding event prediction (Section 4.2). Next I show that GEM locates binding events with exceptional spatial resolution at their corresponding motif positions, and further improves proximal event deconvolution. GEM can also be directly applied to ChIP-exo data and improves upon existing methods (Section 4.3). Finally I discuss the significance of GEM for improving the signal-to-noise ratio in motif discovery, the flexibility to incorporate other positional information, and the application to ChIP-exo data (Section 4.4), and describe analysis methods (Section 4.5).

Chapter 5: Transcription factor spatial binding constraints

In Chapter 5, I present the discovery of TF binding constraints using GEM predictions. First I introduce the value of discovering in vivo TF binding constraints and the limitation of motif-based approaches (Section 5.1).
Next I describe the method to discover statistically significant TF binding constraints using GEM binding predictions of a large number of TFs in the same cellular condition (Section 5.2). Then I present the biological findings from mouse ES cells and 5 human cell types. GEM found 37 examples of TF binding constraints in mouse ES cells, including strong distance-specific constraints between Klf4, Sox2 and other key regulatory factors. In human ENCODE data, GEM found 390 examples of spatially constrained pair-wise binding, including such novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4A/FOXA1 (Section 5.3). Finally I discuss the value of using GEM to discover TF binding constraints (Section 5.4).

Chapter 6: Conclusions

In Chapter 6, I summarize the work presented here and outline the main contributions of this thesis.

Chapter 2 Genome Positioning Systems (GPS)

The material presented in this chapter was adapted from the following publication: Yuchun Guo, Georgios Papachristoudis, Robert C. Altshuler, Georg K. Gerber, Tommi S. Jaakkola, David K. Gifford, and Shaun Mahony (2010). Discovering homotypic binding events at high spatial resolution, Bioinformatics 26(24): 3028-3034.

Collaborations: Y.G., S.M. and D.K.G. conceived the project. Y.G., S.M., G.K.G., G.P., T.S.J. and D.K.G. designed the computational model and implemented the algorithm. Y.G., S.M., G.P., and R.C.A. analyzed the data. Y.G., S.M. and D.K.G. wrote the manuscript.

2.1 Introduction

The precise physical description of where transcription factors, histones, RNA polymerase II, and other proteins interact with the genome provides an invaluable mechanistic foundation for understanding gene regulation.
ChIP-Seq (chromatin immunoprecipitation followed by high-throughput sequencing) has become an indispensable tool for genome-wide profiling of protein-DNA interactions (Barski et al., 2007; Johnson et al., 2007; Mikkelsen et al., 2007; Robertson et al., 2007; Park, 2009).

Computational methods are necessary to predict the location of protein-DNA interaction events from ChIP-Seq data because random variation in the ChIP DNA fragmentation process obscures the actual location of interaction events (Figure 2-1). In the ChIP-Seq protocol, reads are sequenced from the 5’ end of the ChIP DNA fragments that are sonicated randomly in solution. Thus while ChIP-Seq DNA sequence reads are mapped to precise bases in the genome, these reads do not manifestly indicate the location of the protein-DNA interaction events that caused them. In addition, ChIP-Seq reads caused by different closely spaced events (joint events) spatially mix with one another along the genome, presenting a challenge for precisely estimating the multiplicity and exact positions of proximal protein-DNA interaction events (Figure 2-1).

Figure 2-1 Random mixture of ChIP-Seq reads from joint events. Protein-DNA interaction events at closely spaced positions 1 and 2 on the genome result in a mixture of reads (tags) in the ChIP-Seq protocol. Green and orange ovals represent proteins bound at different positions. Solid lines and rectangular bars represent the ChIP DNA fragments and the reads at the ends of the fragments, respectively.

The difference between the computationally predicted location of a protein-DNA binding event and the midpoint of its actual location is defined as spatial resolution.
An ideal computational method for analyzing ChIP-Seq data would accurately localize protein-DNA interaction events (high spatial resolution), would include no false events (high specificity), would include all true events (high sensitivity), and would be able to resolve closely spaced DNA-protein interactions (joint event discovery). Joint event discovery is important because it can capture cooperative biological regulatory mechanisms in proximal genomic locations (Pepke et al., 2009). Homotypic clusters of transcription factor binding sites (TFBS) have been extensively studied in Drosophila (Lifanov et al., 2003). Such regulatory mechanisms may be common in mammalian genomes as 40-60% of certain ChIP-Seq defined protein-DNA interaction regions contain more than one motif within 200bp (Jothi, et al., 2008; Valouev, et al., 2008). Furthermore, homotypic clusters of TFBS occupy nearly 2% of the human genome and may act as key components of almost half of the human promoters and enhancers (Gotea, et al., 2010). Thus, homotypic event discovery is necessary to fully reveal the transcription factor regulatory interactions present in ChIP-Seq data. Existing ChIP-Seq computational methods (Park, 2009; Pepke, et al., 2009) do not simultaneously consider multiple events as the cause for observed reads in the context of a probabilistic model at mammalian genome scale. To detect binding events, PeakSeq extends the length of mapped reads to create peaks (Rozowsky, et al., 2009), MACS shifts the mapped position of reads a fixed distance towards their 3’ ends (Zhang, et al., 2008), FindPeaks aggregates overlapping reads (Fejes, et al., 2008), SISSRs identifies positive to negative strand transition points at read accumulations (Jothi, et al., 2008), cisGenome scans for the center of modes of the 5’ and 3’ peaks (Ji, et al., 2008), and QuEST (Valouev, et al., 2008) and spp (Kharchenko, et al., 2008) use kernel density estimation methods. 
All of these methods use statistical detection criteria such as overlapping read counts or read distribution strand symmetry to estimate the location of a protein-DNA interaction event. Evaluations of ChIP-Seq event calling methods showed that although the methods identified binding sites with a highly significant overlap with the corresponding sequence motif (Laajala et al., 2009), and exhibited similar sensitivity and specificity (Wilbanks and Facciotti, 2010), there are pronounced differences in their spatial resolution.

One important piece of information that is not fully exploited by the early ChIP-Seq methods is the spatial distribution of reads (also called “read position densities”, “tag density profile” or “peak shape”). A recent evaluation of five peak-finder methods demonstrated the room for improvement by showing that over 80% of false binding calls can be visually identified using the shape of read profiles without additional information from background data or replicates (Rye et al., 2011). It further called for the development of methods utilizing the shape information. A recent method named CSDeconv deconvolves proximal binding events using a computed spatial read distribution (Lun, et al., 2009), although it is at present computationally impractical on entire mammalian genomes.

In this chapter, I present the Genome Positioning System (GPS), a high-resolution computational method for genome-wide ChIP-Seq analysis that can accurately detect protein-DNA interaction events and deconvolve closely spaced events by modeling the spatial distribution of ChIP-Seq reads at single base resolution. GPS detects more joint events in synthetic and actual ChIP-Seq data and has superior spatial resolution when compared with other methods.

2.2 GPS algorithm

2.2.1 GPS algorithm overview

GPS has three phases: spatial distribution discovery, event discovery, and the determination of event significance.
In its first phase, GPS summarizes the observed genomic spatial distribution of mapped reads from protein-DNA interaction events in the input ChIP-Seq data. The farther a mapped read is located from an event, the less likely it is to be caused by the event (Figure 2-2). We assume in GPS that for a given ChIP-Seq experiment, every interaction event will produce the same characteristic distribution of reads. While this assumption will not always be true, we have found that it produces good results in practice.

In its second phase, GPS employs a probabilistic mixture model to assign an event probability to every base in the genome. Each potential event’s contribution to generating the observed reads is modeled (Figure 2-3A). A sparse prior on event probabilities provides a complexity penalty that biases events to have their probability mass at a single base position. Event probabilities are selected to maximize the penalized likelihood of observed reads using an Expectation-Maximization (EM) algorithm that segments the genome into efficiently solvable subproblems.

Figure 2-2 Spatial distribution of ChIP-Seq reads. The observed spatial read density (blue: “+” strand, red: “-” strand) from ~4,000 CTCF events aligned with respect to the CTCF motif position at each event.

Figure 2-3 GPS probabilistically models ChIP-Seq read spatial distributions. A) GPS models ChIP-Seq reads as being generated by a mixture of binding events at every genomic base, with each event producing the characteristic spatial read density. B) A sparse prior on mixture components causes GPS to assign events to as few bases as possible to explain the observed reads (green and orange reads). Positions 1 and 2 represent the estimated binding positions of the protein of interest. In GPS, a given read can be explained by more than one event (yellow reads).
GPS uses the number of reads assigned to a base by the mixture model as a measure of the relative strength of a predicted event at that base.

In its third and final phase, GPS determines significant events by comparing the number of reads at the predicted events to the corresponding normalized number of reads in the control channel. We compute the statistical significance using the binomial distribution (Rozowsky, et al., 2009) and correct for multiple hypothesis testing by applying a Benjamini-Hochberg correction (Benjamini and Hochberg, 1995).

2.2.2 Empirical spatial distribution of reads

GPS iteratively estimates the empirical spatial distribution of reads directly from ChIP-Seq data. Given a set of events, we count all the reads at each position relative to the corresponding event positions. Only the base positions within 250bp of the event are counted because typical ChIP-Seq protocols perform a size selection in the range of ~150-300bp (Park, 2009) and we have empirically found that the probability of generating reads at positions further than 250bp is not significant.

The initial set of events for estimating the empirical spatial distribution can be defined by using known motifs or by finding the center of the forward and reverse read profiles (Zhang, et al., 2008). Alternatively, GPS can use a generic empirical spatial distribution for ChIP-Seq data to make the initial event prediction and then re-estimate the empirical spatial distribution and use it for more accurate prediction (Figure 2-4). This process can be repeated until convergence.

2.2.3 GPS mixture model

GPS is based on a generative mixture model that describes the likelihood of an observed set of ChIP-Seq reads from a set of protein-DNA interaction events. Each event (mixing component) contributes a distribution of reads surrounding its genomic position to the mixture of reads. We assume that reads are independent conditioned on the locations of their underlying causal events.
GPS performs event discovery by finding the set of protein-DNA interaction events that maximizes the penalized likelihood of the observed ChIP-Seq reads. We consider $N$ ChIP-Seq reads that have been mapped to genome locations $R = \{r_1, \ldots, r_N\}$ and $M$ possible protein-DNA interaction events at genome locations $B = \{b_1, \ldots, b_M\}$ (Figure 2-3A). We represent the latent assignments of reads to the location of events that caused them as $Z = \{z_1, \ldots, z_N\}$, where $z_n = j$ when $j$ is the index of the event located at position $b_j$ that caused read $n$. The conditional probability of read $r_n$ being generated from event $j$ is

$$p(r_n \mid z_n = j) = \mathrm{emp}\big(s_n (r_n - b_j)\big)$$

where $\mathrm{emp}(d)$ is the empirical spatial distribution that models the probability of a read occurring $d$ bases away from its corresponding event position (Figure 2-2). Strand sense is handled by $s_n = 1$ or $s_n = -1$ if read $r_n$ is mapped to the forward strand or reverse strand, respectively. We assume that all the events in one ChIP-Seq experiment have the same empirical spatial distribution.

The probability of observing a read is a convex combination of possible binding events

$$p(r_n \mid \pi) = \sum_{j=1}^{M} \pi_j \, p(r_n \mid j)$$

where $M$ is the number of possible events, $\pi$ denotes the parameter vector of mixing probabilities (i.e. the probabilities of the possible events), and $\pi_j$ is the probability of event $j$, with $\sum_{j=1}^{M} \pi_j = 1$.

The overall likelihood of the observed set of reads is then

$$p(R \mid \pi) = \prod_{n=1}^{N} \sum_{j=1}^{M} \pi_j \, p(r_n \mid j)$$

Our assumption is that binding events are relatively sparse throughout the genome. To model this assumption, we place a negative Dirichlet prior distribution (Figueiredo and Jain, 2002; Bicego et al., 2007) $p(\pi)$ on $\pi$:

$$p(\pi) \propto \prod_{j=1}^{M} \frac{1}{(\pi_j)^{\alpha}}, \quad \alpha > 0$$

where $\alpha$ is a tuning parameter to adjust the degree of sparseness. If for event $j$ the value of $\pi_j$ becomes zero (see component elimination below), the model is restructured to eliminate the event.
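The quantities defined in Sections 2.2.2 and 2.2.3 can be illustrated with a minimal Python sketch. It is not the GPS implementation: the function names, the representation of a read as a `(position, strand)` pair, and the optional pseudocount used to avoid zero probabilities are all assumptions made for illustration. The sketch counts reads by strand-aware offset to estimate emp(d), and evaluates the log of the penalized likelihood, ln p(R | π) − α Σ ln π_j, up to an additive constant.

```python
import math
from collections import Counter


def estimate_empirical_distribution(reads, events, width=250, pseudocount=1.0):
    """Estimate emp(d) for offsets d in [-width, width].

    reads  : list of (position, strand) pairs, strand is +1 or -1
    events : list of event positions b_j
    The pseudocount is an assumption of this sketch, not part of the text.
    """
    counts = Counter()
    for pos, strand in reads:
        for b in events:
            d = strand * (pos - b)  # strand-aware offset s_n * (r_n - b_j)
            if -width <= d <= width:
                counts[d] += 1
    total = sum(counts.values()) + pseudocount * (2 * width + 1)
    if total == 0:
        raise ValueError("no reads within the window and no pseudocount")
    return {d: (counts[d] + pseudocount) / total
            for d in range(-width, width + 1)}


def read_prob(read, b, emp):
    """p(r_n | z_n = j) = emp(s_n * (r_n - b_j)); zero outside the window."""
    pos, strand = read
    return emp.get(strand * (pos - b), 0.0)


def log_penalized_likelihood(reads, events, pi, emp, alpha):
    """ln p(R | pi) - alpha * sum_j ln(pi_j), up to an additive constant.

    Components with pi_j == 0 are treated as eliminated and skipped
    in the prior term.
    """
    ll = 0.0
    for read in reads:
        p = sum(pi_j * read_prob(read, b, emp) for pi_j, b in zip(pi, events))
        if p == 0.0:
            return float("-inf")  # read cannot be explained by any event
        ll += math.log(p)
    penalty = alpha * sum(math.log(pi_j) for pi_j in pi if pi_j > 0.0)
    return ll - penalty
```

Because the prior term is −α Σ ln π_j, spreading probability mass over many small components is penalized less than it would be under an ordinary Dirichlet prior, which is what the component elimination step has to counteract.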
2.2.4 EM algorithm

We solve for the MAP (maximum a posteriori) solution for $\pi$ using the Expectation-Maximization (EM) algorithm (Dempster, et al., 1977). The complete-data log penalized likelihood is

$$\ln p(R, Z, \pi) = \sum_{n=1}^{N} \sum_{j=1}^{M} \mathbf{1}(z_n = j)\big(\ln \pi_j + \ln p(r_n \mid j)\big) - \alpha \sum_{j=1}^{M} \ln \pi_j$$

where $\mathbf{1}(z_n = j)$ is the indicator function.

We initialize the mixing probabilities $\pi$ with uniform probabilities, $\pi_j = 1/M$, where $j = 1, \ldots, M$.

At the E step, we use the current parameter estimates $\pi$ to evaluate the expectation of $Z$ given $R$,

$$\gamma(z_n = j) = \frac{\pi_j \, p(r_n \mid j)}{\sum_{j'=1}^{M} \pi_{j'} \, p(r_n \mid j')}$$

We can interpret $\gamma(z_n = j)$ as the fraction of read $n$ that is assigned to event $j$. This is referred to as a "soft assignment" because read $n$ can be assigned partially to multiple events.

At the M step, on iteration $i$ we find the parameter $\hat{\pi}^{(i)}$ that maximizes the expected complete-data log penalized likelihood,

$$\hat{\pi}_j^{(i)} = \arg\max_{\pi_j} \sum_{n=1}^{N} \sum_{j=1}^{M} \gamma(z_n = j)\big(\ln \pi_j + \ln p(r_n \mid j)\big) - \alpha \sum_{j=1}^{M} \ln \pi_j$$

under the constraint $\sum_{j=1}^{M} \pi_j = 1$.

We use a Lagrange multiplier $\lambda$ (Bishop, 2006) to incorporate the constraint $\sum_{j=1}^{M} \pi_j = 1$,

$$\hat{\pi}_j^{(i)} = \arg\max_{\pi_j} \sum_{n=1}^{N} \sum_{j=1}^{M} \gamma(z_n = j)\big(\ln \pi_j + \ln p(r_n \mid j)\big) - \alpha \sum_{j=1}^{M} \ln \pi_j + \lambda \Big(\sum_{j=1}^{M} \pi_j - 1\Big)$$

To maximize the right-hand side, we set its derivative with respect to $\pi_j$ to 0,

$$\sum_{n=1}^{N} \frac{\gamma(z_n = j)}{\hat{\pi}_j} - \frac{\alpha}{\hat{\pi}_j} + \lambda = 0$$

$$\hat{\pi}_j \lambda = \alpha - \sum_{n=1}^{N} \gamma(z_n = j) \qquad \text{(2-1)}$$

Summing both sides of the equation over $j$ and using $\sum_{j=1}^{M} \hat{\pi}_j = 1$, we solve for $\lambda$,

$$\sum_{j=1}^{M} \hat{\pi}_j \lambda = \sum_{j=1}^{M} \Big(\alpha - \sum_{n=1}^{N} \gamma(z_n = j)\Big)$$

$$\lambda = \sum_{j=1}^{M} \Big(\alpha - \sum_{n=1}^{N} \gamma(z_n = j)\Big) \qquad \text{(2-2)}$$

Substituting (2-2) back into (2-1), we find

$$\hat{\pi}_j^{(i)} = \frac{N_j - \alpha}{\sum_{j'=1}^{M} (N_{j'} - \alpha)}, \qquad N_j = \sum_{n=1}^{N} \gamma(z_n = j)$$

where $N_j$ is the expected number of reads assigned to event $j$.

As we iteratively estimate $\hat{\pi}$, we use a component elimination method (Figueiredo and Jain, 2002). If $N_j \le \alpha$, we set $\pi_j = 0$ to eliminate event $j$.
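The E and M steps derived above can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not the GPS implementation: the function name `gps_em`, the `warmup`, `max_iter` and `tol` parameters, the `(position, strand)` read representation, and the use of the unpenalized data log-likelihood as the convergence criterion are all hypothetical choices. The sketch also eliminates every component with N_j ≤ α in a single pass by clipping N_j − α at zero, and it runs a few initial iterations with α = 0 so that components can first gather support from the data.

```python
import math


def gps_em(reads, events, emp, alpha, max_iter=200, warmup=10, tol=1e-6):
    """MAP estimate of mixing probabilities under the sparse prior.

    reads  : list of (position, strand) pairs, strand is +1 or -1
    events : candidate event positions b_1..b_M
    emp    : dict mapping offset d to emp(d)
    Returns a list of (position, pi_j, N_j) for surviving events.
    """
    M = len(events)
    # p[n][j] = emp(s_n * (r_n - b_j)), fixed throughout the iterations
    p = [[emp.get(s * (r - b), 0.0) for b in events] for r, s in reads]
    pi = [1.0 / M] * M          # uniform initialization
    strength = [0.0] * M
    prev_ll = None
    for it in range(max_iter):
        a = 0.0 if it < warmup else alpha   # sketch assumption: alpha=0 warm-up
        # E step: soft-assign each read, accumulating N_j = sum_n gamma(z_n = j)
        strength = [0.0] * M
        ll = 0.0
        for row in p:
            denom = sum(pi_j * p_nj for pi_j, p_nj in zip(pi, row))
            if denom == 0.0:
                continue  # read not explained by any surviving event
            ll += math.log(denom)
            for j in range(M):
                strength[j] += pi[j] * row[j] / denom
        # M step with component elimination: pi_j proportional to max(0, N_j - alpha)
        clipped = [max(0.0, n_j - a) for n_j in strength]
        total = sum(clipped)
        if total == 0.0:
            return []  # every component was eliminated
        pi = [c / total for c in clipped]
        if it > warmup and prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return [(b, pi_j, n_j)
            for b, pi_j, n_j in zip(events, pi, strength) if pi_j > 0.0]
```

In this sketch a candidate event that attracts only two reads survives with α = 1 but is removed with α = 3, which matches the reading of α as the minimum read support an event needs to survive.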
Our final estimate of $\hat{\pi}^{(i)}$ is

$$\hat{\pi}_j^{(i)} = \frac{\max(0, N_j - \alpha)}{\sum_{j'=1}^{M} \max(0, N_{j'} - \alpha)}$$

The sparseness parameter $\alpha$ can be interpreted as the minimum number of reads that an event needs to survive the EM iterations. Intuitively, the effect of the sparseness prior is to penalize each event by $\alpha$ reads and to promote competition among the remaining events. The EM algorithm is deemed to have converged when the change in likelihood falls below a specified threshold.

Our implementation of component elimination includes two special cases. To avoid premature elimination of components during EM iterations, we start with $\alpha = 0$ for a number of iterations to allow nascent components to gain support from the data. We then set $\alpha$ to our desired value. This is because when the number of components $M$ is large, no component may have enough initial support to prevent $\pi$ from being immediately forced to zero. Furthermore, in a single iteration we do not eliminate all the components that meet the criterion $N_j \le \alpha$. Instead, we only eliminate the components with the lowest value of $N_j$ at each iteration. This allows the data points that supported the eliminated components to be re-distributed immediately to support the other components.

At the convergence of the EM algorithm, the GPS mixture model produces a list of non-zero-probability events ($\pi_j \ne 0$) and the "soft" read assignments to these events, $\gamma(z_n = j)$. We do not use the mixing probabilities $\pi$ in subsequent analysis because we segment the genome into regions for analysis, and $\pi$ values are dependent on the region analyzed. We define event strength as the expected number of reads associated with the event. Thus the event strength of event $j$ is calculated as

$$N_j = \sum_{n=1}^{N} \gamma(z_n = j)$$

2.2.5 Setting the sparseness parameter α

The value of the sparseness parameter α will influence the sensitivity and specificity of event detection.
It should scale with the read count of the events in the region that GPS is analyzing. From our experience in analyzing mouse CTCF and human GABP datasets, the α value is set empirically as follows to achieve better spatial accuracy: α = max( C max A , alpha min ) where C max is the maximum read count in a 500bp sliding window (i.e. roughly the length of the non-zero density region of the read distribution) across the region that GPS is evaluating, and alpha min is the minimum read count for a valid binding event. A is a constant factor that can be specified on the command line.

We set the value of alpha min using a Poisson test. The parameter of the Poisson distribution is set as the mean read count in the 500bp sliding windows across the whole genome. alpha min is then set as the value that gives a p-value of 1e-4 and that is not less than 6.

We tested setting different A values (A=1, 2, 3, 4 or 5) or using fixed α values (α=10 or 20) when analyzing the GABP data. The results show that GPS with the settings A=2, 3, 4 or 5 calls more joint events (~8-10%) and gives marginally better spatial resolution of binding calls (~0.6bp) than with the other settings. Thus we set A=3 in our analyses.

2.2.6 Statistical significance of predicted events

To evaluate the statistical significance of predicted events when we have a control dataset, we compare the number of reads of the ChIP event to the number of reads in the corresponding region in the control sample. For non-overlapping events, we count the number of control reads in the range of the empirical spatial distribution (+/- 250bp centered on the IP event). For joint events, we need to assign control reads to the corresponding events. We run the EM algorithm without the sparse prior (no component elimination, equivalent to α = 0) on the control data, initializing the events j at the same positions as predicted IP events.
The M step of the EM algorithm is modified as

$$\hat{\pi}_j^{(i)} = \frac{N_j}{\sum_{j'=1}^{M} N_{j'}} = \frac{N_j}{N}, \qquad N_j = \sum_{n=1}^{N} \gamma(z_n = j)$$

To account for differences between IP and control dataset sizes, we multiply the control reads by a scaling factor. We divide long non-specific-binding regions (defined by excluding the "enriched regions") into short segments (length 10 kb) and perform least-squares linear regression using all the read count pairs of IP and control segments that have at least one read. The slope of the regression is then the scaling factor, $F_{IP/C}$, between the read counts from the IP and control (Kharchenko, et al., 2008; Rozowsky, et al., 2009).

Using a statistical testing method proposed by Rozowsky et al. (Rozowsky, et al., 2009), we calculate the P-value from the cumulative distribution function of the binomial distribution using the corresponding IP and scaled control read counts,

$$F(k; n, P) = \sum_{l=0}^{k} \binom{n}{l} P^{l} (1 - P)^{n-l}$$

where k is the scaled control read count, n is the ceiling of the sum of IP and scaled control reads, and P = 0.5 is the probability under the null hypothesis that reads should occur with equal likelihood from the IP as from the scaled control data.

To correct for multiple hypothesis testing, we apply a Benjamini-Hochberg correction to adjust the P-value (Benjamini and Hochberg, 1995). All the predicted events that are tested for significance are ranked by P-value from most significant to least significant. For each event, the Q-value is given by

$$Q\text{-value} = P\text{-value} \times \frac{\mathit{count}}{\mathit{rank}}$$

where count is the total number of events tested. Significant events are then selected using a Q-value threshold.

If control data is not available, we apply a statistical test proposed by Zhang et al. (Zhang, et al., 2008) that uses a dynamic Poisson distribution to account for local biases.
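The control-based test described above (binomial CDF followed by the rank-based Q-value) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, rounding the scaled control count down to an integer k is an assumption (the text does not say how a fractional count is handled), and the Q-value is computed exactly as the formula is stated, without any additional monotonicity adjustment.

```python
import math

def binomial_cdf_half(k, n):
    """F(k; n, 0.5): probability of at most k successes in n fair trials."""
    # With P = 0.5 every outcome has weight 1 / 2**n, so exact integer
    # arithmetic can be used for the numerator.
    return sum(math.comb(n, l) for l in range(k + 1)) / 2 ** n

def event_p_value(ip_count, control_count, scale):
    """P-value of an event from its IP and scaled control read counts."""
    scaled = control_count * scale
    n = math.ceil(ip_count + scaled)   # ceiling of IP + scaled control reads
    k = math.floor(scaled)             # assumption: round scaled count down
    return binomial_cdf_half(k, n)

def q_values(p_values):
    """Q-value = P-value * count / rank, with rank 1 the most significant."""
    count = len(p_values)
    order = sorted(range(count), key=lambda i: p_values[i])
    q = [0.0] * count
    for rank, i in enumerate(order, start=1):
        q[i] = p_values[i] * count / rank
    return q
```

For example, an event with 10 IP reads and no control reads gives `event_p_value(10, 0, 1.0) == 1 / 1024`, because only the outcome with zero control reads among ten fair trials contributes to the sum.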
The dynamic parameter of a local Poisson model for the candidate event is defined as

$$\lambda_{local} = \max(\lambda_{BG}, \lambda_{5kb}, \lambda_{10kb})$$

where $\lambda_{BG}$, $\lambda_{5kb}$ and $\lambda_{10kb}$ are the λ values estimated from the corresponding chromosome (background) and from the 5kb and 10kb windows centered at the event location, respectively, to capture the background variability at both global and local scales. The P-value is calculated as the upper tail of the Poisson distribution,

$$P\text{-value} = 1 - \sum_{n=0}^{N_{event}-1} \mathrm{Pois}(n; \lambda_{local})$$

where $N_{event}$ is the read count of the candidate event. To correct for multiple hypothesis testing, we apply a Benjamini-Hochberg correction as above.

2.2.7 Artifact filtering

GPS filters the predicted events by computing the Kullback–Leibler divergence (Kullback and Leibler, 1951) from the empirical read distribution to the spatial read distribution of each predicted event,

$$D_{KL}(\mathit{emp} \,\|\, \mathit{event}) = \sum_{i} \mathit{emp}(i) \log \frac{\mathit{emp}(i)}{\mathit{event}(i)}$$

where event() is the spatial distribution of the non-zero read counts of the event computed from the EM algorithm, emp() is the empirical read distribution at the corresponding positions of the non-zero reads, and i is the index of the non-zero read positions. Events with a Kullback–Leibler divergence value higher than a user-defined threshold are discarded.

2.2.8 Software implementation

We have implemented GPS in Java, and our software is available for download from our website (http://cgs.csail.mit.edu/gps). For computational efficiency, GPS independently processes separable genomic regions. We identify separable regions with a conservative method that spatially segments the genome at read gaps that are larger than the width of the empirical spatial distribution (500bp) and further excludes regions that contain fewer than 6 reads. The segmented protein binding regions are typically a few thousand base pairs long. To further reduce memory requirements and run time, GPS estimates events in two stages for each region.
In the first stage, initial events are spaced at 5bp intervals to make a rough estimate of event locations. In the second stage, events are spaced at 1bp near locations predicted in the first stage. For the CTCF ChIP-Seq experiment in this study (~4.2 million IP reads and ~7.9 million control reads), GPS requires 750MBytes of main memory, and runs for 21 minutes on an AMD 64bit 2.3GHz computer.

2.3 Results

2.3.1 GPS automatically adapts the empirical read distribution

Figure 2-4 GPS automatically adapts the empirical read distribution. A generic read distribution (from CTCF data, blue) was initially used to predict GABP binding events. GPS then used the predicted positions to iteratively re-estimate the read distribution specific for GABP. The GPS learned distribution (red) is highly similar to the GABP read distribution defined by the GABP motifs (green).

We first verified that GPS is able to automatically adapt to the empirical read distribution of the analyzed ChIP-Seq data. This is important because the GPS mixture model is initialized with a pre-determined read distribution and the actual read distribution generating the observed data can be very different depending on the binding factors or the experimental protocols. A more accurately determined empirical read distribution will lead to more accurate prediction of binding events by GPS.

We tested the adaptation of the read distribution during GPS analysis on a human GABP dataset (Valouev, et al., 2008) with an initial read distribution from a mouse CTCF dataset (Chen, et al., 2008). GPS automatically adapts the empirical read distribution to the GABP data by learning a read distribution anchored on the GABP event positions predicted using the CTCF distribution.
The GPS learned GABP distribution is different from the initial CTCF distribution, but highly similar to the GABP read distribution defined by the GABP motif match positions (Figure 2-4). Therefore, even with a generic initial read distribution, GPS is able to adapt to the read distribution of the analyzed data and subsequently improve the prediction accuracy.

2.3.2 GPS predictions have higher spatial resolution

We next analyzed the spatial resolution of GPS on ChIP-Seq data profiling the insulator binding factor CTCF (CCCTC-binding factor) (Chen, et al., 2008), as the high information-content CTCF motif allows us to reliably measure spatial resolution based on event distance to the CTCF motif. We used GFP ChIP-Seq data (Chen, et al., 2008) in the third phase of GPS to control for non-specific binding.

We found that the spatial resolution of GPS on the CTCF data is superior to the spatial resolution produced by eight published ChIP-Seq analysis methods (Figure 2-5): MACS (Zhang, et al., 2008), SISSRs (Jothi, et al., 2008), cisGenome (Ji, et al., 2008), QuEST (Valouev, et al., 2008), FindPeaks (Fejes, et al., 2008), spp-wtd, spp-mtc (Kharchenko, et al., 2008) and PeakRanger (Feng et al., 2011). Because different methods predict different sets of binding events, we limit our comparison to a matched set of events. From the 22,222 top ranking predictions by each method, 4,322 events are predicted by all nine methods and correspond to the same high-scoring CTCF binding motif. Of these matched events, 84.5% of the predictions by GPS are within 20bp of the CTCF binding motif, while between 63.2% and 73.4% of predictions by other methods are within 20bp (Figure 2-5A). GPS has an average spatial resolution of 11.27±10.21bp, compared to 14.55±12.50bp for SISSRs, 16.14±12.30bp for MACS, 16.14±13.90bp for cisGenome, 17.54±13.44bp for QuEST, 14.97±11.09bp for FindPeaks, 16.52±12.82bp for spp-wtd, 16.22±14.57bp for spp-mtc and 15.03±12.03bp for PeakRanger.
Although the matched set allows direct comparison on the same binding events, it may introduce some bias, such as focusing only on the subset of more significant binding events. We thus evaluated spatial resolution while increasing the number of all the top ranking events identified by each method, and the performances of the methods are similar to those of the matched set (Figure 2-5B). The above analysis was repeated with 50bp window size and the results are similar to those with 100bp window size. SISSRs, MACS and two spp methods were shown to have better spatial resolution than seven other methods in a recent performance evaluation (Wilbanks and Facciotti, 2010), and thus our analysis of CTCF data shows that GPS may have superior spatial resolution to these seven methods.

Figure 2-5 GPS improves the effective spatial resolution. A) Fraction of predicted CTCF binding events with a motif within the given distance with event discovery by GPS, SISSRs, MACS, cisGenome, QuEST, FindPeaks, spp-wtd, spp-mtc and PeakRanger. Events shown were predicted by all nine methods and had a CTCF motif within 100bp. B) The spatial resolution of CTCF event calls is shown averaged over increasing numbers of the strongest ranked events identified by different methods.

GPS was further compared to two recently published methods, PICS (Zhang et al., 2011) and SeqSite (Wang and Zhang, 2011), on the spatial resolution of predicted event positions. These two methods also model the ChIP-Seq read distribution of the binding event and achieve better spatial resolution, but were not included in the previous comparison because they were published after GPS (Guo et al., 2010). A widely evaluated human FoxA1 binding dataset (Zhang et al., 2008) was used for the evaluation because PICS requires a read mappability profile, which is available only in the human genome.
Of 698 matched events that are predicted by all three methods and correspond to the same high-scoring FoxA1 binding motif, GPS has an average spatial resolution of 16.35±15.70bp, compared to 17.86±16.16bp for PICS, and 20.15±14.54bp for SeqSite (Figure 2-6). Therefore, GPS achieves better spatial resolution of binding event predictions than these two new “shape-aware” methods.

Figure 2-6 GPS has better spatial resolution than other shape-aware methods. Fraction of predicted FoxA1 binding events with a motif within the given distance with event discovery by GPS, PICS and SeqSite. Events shown were predicted by all three methods and had a FoxA1 motif within 100bp.

2.3.3 GPS discovers more joint events

Using synthetic data we found that GPS is able to detect more joint events than other methods. We generated synthetic joint events and single events by placing ChIP-Seq binding events from actual CTCF data at pre-defined intervals. GPS detects 99.7% of joint events that are 200bp apart, while SISSRs only detects 54.5% of joint events that are 200bp apart, and MACS and QuEST detect none of the joint events that are 200bp apart and detect joint events only when they are more than 280bp apart (Figure 2-7 A). Although SISSRs appears to be more sensitive when the joint events are less than 100bp apart, it makes more false joint event calls with the synthetic single events than all other methods (data not shown).

GPS is also able to predict more joint events than the other methods we tested on actual ChIP-Seq data. For example, GPS uniquely detects two CTCF events in mouse ES cells over proximal CTCF motifs that are 99bp apart on chromosome 8 (Figure 2-7 B). However, the CTCF dataset does not contain a sufficient number of joint events to effectively evaluate the methods on a whole genome scale.
We selected a human Growth Associated Binding Protein (GABP) ChIP-Seq dataset for our evaluation because GABP ChIP-Seq data were previously reported to contain joint events (Lun, et al., 2009; Valouev, et al., 2008). We identified 581 candidate sites of joint events that all had at least one event detected by all five methods and where each site contains two or more GABP motifs separated by less than 500bp. GPS identified joint events in 122 candidate sites, while SISSRs and QuEST detected joint events at fewer than 83 of the candidate sites, and MACS and cisGenome only identified 3 and 5 of the candidate sites as containing joint events respectively (Figure 2-7 C).

Figure 2-7 GPS improves accuracy in resolving joint binding events. A) Fraction of binary events recovered vs. the distance between the generated synthetic events for GPS, SISSRs, MACS and QuEST. B) Example of a predicted binary CTCF event that contains coordinately located CTCF motifs. C) Number of GABP events discovered by GPS, SISSRs, MACS, cisGenome, and QuEST in regions that contain clustered GABP motifs within 500bp.

2.4 Discussion

GPS is a novel computational method that predicts the most likely positions of binding events at single-base resolution. It uses a probabilistic mixture model based on the characteristic spatial distribution of reads. Instead of aggregating ChIP-Seq reads to ChIP-chip like analog “peaks”, GPS models the individual reads and thus retains the rich digital information gained from high-throughput sequencing. Our analysis with synthetic and actual ChIP-Seq data demonstrates the value of our approach in resolving closely spaced joint events and improving event spatial resolution.

The “peak shape” information is useful for distinguishing false events from true events but it is not fully exploited by the early peak finders and is considered challenging to model (Rye et al., 2011).
To the best of our knowledge, GPS is the first method to explicitly model real “peak shape” information, with an adaptable empirical read spatial distribution learned directly from the ChIP-Seq data. GPS also provides a principled method to use “peak shape” for filtering artifacts and thus improves specificity of prediction. Before GPS, the “peak shape” information was only used in a simplified manner to infer average fragment length (Zhang, et al., 2008).

After the publication of GPS (Guo et al., 2010), two subsequently published works employed a similar approach to modeling the spatial distribution of reads: PICS uses DNA fragment length information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model (Zhang et al., 2011); SeqSite models the read distribution as a truncated gamma distribution and locates TFBSs with a least-squares model fitting strategy on smoothed data (Wang and Zhang, 2011). Such efforts showed improved performance in resolving closely spaced joint events and improved spatial resolution of the prediction (Wang and Zhang, 2011; Zhang et al., 2011), underscoring the value of high resolution modeling of ChIP-Seq data. Although such parametric distributions provide a simple description of the read distribution and allow parameter estimation to account for the variation among the events, the parametric distributions do not adequately fit the real data. In particular, the asymmetry of the read distribution of real data (Figure 2-4) cannot be accurately modeled by the symmetric t-distribution, and the non-zero probability region around the binding site (Figure 2-4) cannot be captured by the gamma distribution with zero density at the binding position. In GPS, the automatically learned empirical read distribution provides a more accurate description of the reads, which is consistent with the comparison result that GPS has better spatial resolution than PICS and SeqSite.
In addition, the empirical read distribution does not cause a computational performance disadvantage for GPS because it is represented as a look-up table, allowing probabilities to be computed efficiently for the EM algorithm iterations.

GPS provides improved spatial resolution when compared with contemporary methods. A recent evaluation showed that DNA-binding motif discovery is more successful from the fixed-size regions flanking the point position of predicted binding sites (i.e. the summits of the peaks) than from the long regions returned by some of the methods evaluated (Rye et al., 2011). Thus the improved spatial resolution of GPS may enable researchers to search for DNA-binding motifs in narrower windows of sequence, effectively increasing the signal to noise ratio to improve motif discovery. The high spatial resolution of GPS can be used to produce a position-specific prior (Narlikar et al., 2006b; Qi et al., 2006; Bailey et al., 2010) that can be used by motif discovery methods to limit the motif search to tight genomic regions around events, or to exclude event locations for co-factor motif discovery.

GPS’s ability to resolve homotypic events from ChIP-Seq data will facilitate the genome-wide study of the effect of cooperative binding on gene expression under specific biological conditions. This is achieved by explicitly modeling the mixing of closely spaced homotypic events using a mixture model framework. Accurately deconvolving homotypic events allows more accurate quantification of each event in terms of binding strength as well as binding location. Such accurate quantification will facilitate further study on the cooperativity of homotypic events. Homotypic binding sites have been suggested to act as key components of invertebrate and mammalian promoters and enhancers (Gotea, et al., 2010; Lifanov, et al., 2003).
In addition, modeling based approaches have demonstrated that identifying homotypic binding is important for the faithful reproduction of biological behaviors (Segal, et al., 2008).

Furthermore, we expect that alternative empirical read distributions can be used for different kinds of events, such as histone location, as the GPS framework is inherently adaptable to other empirical read distributions.

2.5 Methods

2.5.1 Datasets used

Dataset 1: CTCF binding

To evaluate the performance of GPS, we analyzed a ChIP-Seq dataset of insulator binding factor CTCF in mouse ES cells, with a control using antibody against GFP to control for non-specific binding (Chen, et al., 2008). We chose CTCF data for our evaluation because the strong CTCF consensus motif allows us to reliably measure spatial resolution. The ChIP-Seq data comprised 4.2 million CTCF reads and 7.9 million GFP reads that uniquely map to the mm8 mouse genome.

Dataset 2: GABP binding

To evaluate the performance of joint event discovery, we analyzed a ChIP-Seq dataset of GABP in human Jurkat cells, with a control using input DNA (Valouev, et al., 2008). GABP binding has been reported to contain multiple binding motifs in a short region (Lun, et al., 2009; Valouev, et al., 2008). The data were downloaded from the QuEST website (http://mendel.stanford.edu/SidowLab/downloads/quest/). It comprised 7.9 million GABP reads and 17.4 million input DNA reads that uniquely map to the hg18 human genome.

Dataset 3: FoxA1 binding

The FoxA1 dataset (Zhang et al., 2008) is used to evaluate the performance of 3 shape-based ChIP-Seq methods, GPS, PICS and SeqSite. The FoxA1 ChIP-Seq data were downloaded from (http://liulab.dfci.harvard.edu/MACS/Sample.html). It comprised 3.9 million FoxA1 reads and 5.2 million input DNA reads that uniquely map to the hg18 human genome.

Motif

The CTCF motif is reported in (Guo et al., 2010).
The GABP motif was retrieved from the TRANSFAC database (M00341) (Matys, et al., 2003). The FoxA1 motif was retrieved from the JASPAR database (MA0148.1) (Sandelin et al., 2004).

2.5.2 ChIP-Seq analysis methods

We compared the performance of GPS against ten published ChIP-Seq analysis methods: MACS (version 1.3.7.1) (Zhang, et al., 2008), SISSRs (version 1.4) (Jothi, et al., 2008), cisGenome (version 1.2) (Ji, et al., 2008), QuEST (version 2.4) (Valouev, et al., 2008), FindPeaks (version 4) (Fejes, et al., 2008), spp-wtd and spp-mtc (version 1.8) (Kharchenko, et al., 2008), PeakRanger (version 1.12) (Feng et al., 2011), PICS (version 1.11) (Zhang et al., 2011) and SeqSite (version 1.1.2) (Wang and Zhang, 2011).

All the methods were run using default parameters except as described in the following. For MACS, we used the summit location as the predicted binding site position. The binding events are sorted by p-value. For SISSRs, the “-t” option was used to obtain binding site predictions. For cisGenome, we analyzed the data with the option of boundary refinement, and used the center of the predicted region as the binding site position. In our tests, these options gave the best result in spatial resolution. For FindPeaks, we ran with options “-dist_type 1 -duplicatefilter” to filter artifact reads. We used the max_coord position as the predicted binding site. The binding events are sorted by height. For spp-wtd and spp-mtc, the binding events are sorted by FDR and then by score.

2.5.3 Method comparison on spatial resolution

We evaluated the effective spatial resolution of GPS against other methods. We define effective spatial resolution as the absolute value of the distance between the genome coordinates of predicted binding events and the middle of the corresponding high-scoring binding motif hit. The sign of the offset was adjusted according to the strand on which the motif hit occurred.
Because the center of the motif hit may not represent the true center of the binding event, the offsets to the motif were centered by subtracting the mean offsets (Kharchenko, et al., 2008).

Because different methods predict different sets of binding events, we compare spatial resolution on the “matched” set of predictions that correspond to the same high-scoring binding motif. Only those events within 100bp of a motif match are included in the calculation. For CTCF, from the top 34,019 predictions of each method, we select the 7,653 events that were predicted by all eight methods.

We also evaluate spatial resolution while increasing the number of top ranking events identified by each method. Note that this analysis does not have a “matched” set of predictions. We simply average the spatial resolution of the top ranking events that have a motif at a distance less than 100bp. The results are similar to those of the “matched” set.

2.5.4 Evaluating performance in deconvolving joint events using synthetic data

In order to test the performance of joint event detection we constructed realistic synthetic datasets using CTCF binding data. These synthetic datasets allow us to more accurately evaluate the performance of different methods, as we know the true location of the constituent parts of joint binding events. To construct the datasets, we first collect the set of CTCF events that have the following properties: i) they are predicted by five evaluated methods (GPS, SISSRs, MACS, cisGenome, QuEST), ii) none of the five methods predicts more than two events in the region, iii) they contain a match to the CTCF motif, iv) the average distance from the motif match to the event prediction across all five methods is less than 10bp, v) the enrichment of CTCF ChIP-Seq reads under the event is significantly greater than the level of GFP reads with a P-value of less than 0.001 (as calculated by a binomial test).
A total of 3,233 CTCF binding events meet these criteria.

Synthetic ChIP-Seq data were constructed by randomly choosing one of the real CTCF events and translating the coordinates of its reads in the surrounding 1Kbp onto a fake genome. This is repeated to simulate 20,000 synthetic single events, each placed 100Kbp apart on the fake genome. We similarly create 1,000 joint (binary) binding events by randomly choosing two real CTCF events and placing them on the fake genome a fixed distance apart. Note that this method of constructing synthetic joint events assumes that the ChIP-Seq reads generated by closely neighboring events will be an independent mixture of the reads generated by each component event. A synthetic control channel is simulated by taking GFP reads in the regions around CTCF events and translating their coordinates in the same way as the matched IP reads. Further control reads are randomly spread across the fake genome until the read counts in the synthetic IP and control channels match.

Datasets are constructed for the following distances (in bp) between joint binding events: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300, 350, 400, 450, 500, 550, 600, 650, 700, and 750. The datasets can be downloaded from the GPS website.

Note that we generated the synthetic data with the number of single events and joint events on the same order as real data. However, the read counts and fake genome size are different from the real experiment, and this may throw off some methods that are tuned to make certain assumptions about the distribution of the data.

GPS, SISSRs, MACS and QuEST were run with default settings on the synthetic datasets. cisGenome, FindPeaks and spp are not evaluated because we cannot script them to run on the command line for the repeated tests on multiple synthetic datasets.
Note that MACS should not be unduly affected by joint binding events when estimating the correct (CTCF) binding distribution, as a large set of single events exist in the synthetic experiment.

An algorithm is said to have correctly recovered a joint binding event when it makes two event predictions in the relevant area and these predictions are each within 100bp of the position at which the corresponding event was simulated. The false calls of a method are also counted when synthetic single events are called as joint events.

2.5.5 Evaluating performance in deconvolving joint binding events using GABP ChIP-Seq data

To evaluate the genome-wide performance of joint event discovery in real ChIP-Seq data, we analyzed a human GABP ChIP-Seq dataset, which was reported previously to contain multiple binding motifs in a short region (Lun, et al., 2009; Valouev, et al., 2008).

For the GABP dataset, we compared GPS against 4 other methods (SISSRs, MACS, cisGenome and QuEST) genome wide. FindPeaks only reported 615 events (991 events with the –subpeaks option), far fewer than the other methods. Therefore it is not included in the subsequent joint event discovery analysis. We did not run spp on GABP data because the data format we downloaded from the QuEST website cannot be used for spp, which reads BOWTIE or ELAND format.

The numbers of events predicted by the five methods are: GPS (17,179), SISSRs (16,567), MACS (14,527), cisGenome (21,101), QuEST (6,442). The top 6,442 events from each of the methods were used in our comparison. We define a set of candidate sites that all have at least one event detected by all five methods, and that contain two or more GABP motifs separated by less than 500bp. We discovered 581 such sites. Thus nearly 9% of the GABP bound regions potentially contain joint binding events. For each of these sites, we count the number of events discovered by different methods.
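The recovery criterion for synthetic joint events in Section 2.5.4 can be expressed as a short check. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "two event predictions in the relevant area" means exactly two predictions, matched to the two simulated positions in coordinate order.

```python
def recovers_joint_event(predicted, simulated, max_dist=100):
    """True if a method recovered a synthetic joint (binary) event.

    predicted: coordinates of the event calls made in the region.
    simulated: the two coordinates at which the events were placed.
    """
    if len(predicted) != 2:
        return False                      # assumption: exactly two calls required
    pairs = zip(sorted(predicted), sorted(simulated))
    return all(abs(p - s) <= max_dist for p, s in pairs)

def recovery_rate(calls_per_region, truth_per_region, max_dist=100):
    """Fraction of synthetic joint events recovered across all regions."""
    hits = sum(
        recovers_joint_event(calls, truth, max_dist)
        for calls, truth in zip(calls_per_region, truth_per_region)
    )
    return hits / len(truth_per_region)
```

Under these assumptions a region with a single call, or with one call further than 100bp from its simulated position, counts as not recovered.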
Chapter 3 K-mer set motif representation and discovery

Part of the material presented in this chapter was adapted from the following publication: Yuchun Guo, Shaun Mahony, David K Gifford (2012). High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints, PLOS Computational Biology, 8(8): e1002638.

Collaborations: Y.G., S.M. and D.K.G. conceived the project. Y.G. designed the computational model and implemented the algorithm. Y.G. and S.M. analyzed the data. Y.G., S.M. and D.K.G. wrote the manuscript.

3.1 Introduction

DNA sequence motifs are short patterns that are presumed to have a biological function (D’haeseleer, 2006b). The identification of motifs corresponding to the sequence-specific binding sites of transcription factors (TFs) has become one of the most widely studied problems in computational biology, both for its biological significance and for its bioinformatics challenge (Zambelli et al., 2012). Computational identification of transcription factor binding sites has proved to be of great importance in deciphering complex transcriptional regulatory networks (Spellman et al., 1998; Lee et al., 2002; Kim and Park, 2011). The binding motifs of TFs describe the sequence preferences of the factors and indicate where they are likely to bind and which genes they are likely to regulate.

However, motif discovery has been a challenging problem in computational biology. The DNA sequences bound by TFs can be as short as 6-8 bases and be quite variable.
Finding such short and imperfect copies of an unknown pattern in a set of noisy sequences of hundreds or thousands of base pairs has been likened to searching for the proverbial needle in a haystack (D’haeseleer, 2006a).

3.1.1 DNA motif representations

A motif model can be viewed as a classifier in the context of supervised learning. It is learned from the training sequences that contain bound TF sites. The motif model is then used to scan new DNA sequences (testing sequences) and classify them according to a score cutoff into two classes: motif instances and background sequences.

There are many ways for a motif model to represent the binding specificity of a transcription factor. One simple and widely used motif representation is the consensus sequence. It is a short word with ambiguity codes, most commonly IUPAC codes (Nomenclature Committee of the International Union of Biochemistry, 1986), to represent binding degeneracy (binding positions that admit more than one base). Motif discovery with consensus sequences (or words, k-mers) has the advantage of being rigorous and exhaustive but is less effective for long words and several mismatches (Ladunga, 2010).

The most widely used motif model is the position weight matrix (PWM). A PWM is a matrix where each matrix element represents the probability of each base at each motif sequence offset. The PWM score for an observed sequence (motif instance) is the sum of the corresponding base-specific matrix entries for each base in that sequence (Stormo, 2000); in practice these entries are usually log-probabilities or log-odds weights rather than raw probabilities, so that the sum corresponds to a product of probabilities. A PWM can also be depicted graphically as a motif logo (Schneider and Stephens, 1990). An early analysis showed that the PWM representation is more sensitive and more precise than the consensus representation (Stormo et al., 1982). Typically PWM and consensus sequence motif representations are employed when the number of observed binding sites is limited.
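PWM scoring as described above can be illustrated with a small sketch. It assumes the common log-odds convention against a uniform background, which is a choice made for this example rather than something specified in the text; the function names and the toy matrix are hypothetical, and every base probability is assumed to be greater than zero (in practice pseudocounts are used to guarantee this).

```python
import math

BASES = "ACGT"

def to_log_odds(prob_columns, background=0.25):
    """Convert per-position base probabilities into log-odds weights."""
    return [
        {b: math.log2(col[b] / background) for b in BASES}
        for col in prob_columns
    ]

def pwm_score(weights, site):
    """Sum of the base-specific weights over the positions of a site."""
    return sum(col[base] for col, base in zip(weights, site))

def best_hit(weights, seq):
    """Highest-scoring window in seq, returned as (score, offset)."""
    width = len(weights)
    return max(
        (pwm_score(weights, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )

# Toy three-column matrix that prefers the word ACG (illustrative values).
toy = to_log_odds([
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
])
score, offset = best_hit(toy, "TTACGTT")
```

Classifying a window as a motif instance then amounts to comparing its score against a cutoff. Because each position contributes independently to the sum, this model cannot express dependencies between positions.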
Although simple and compact, such highly compressed motif representations fail to capture subtle but meaningful differences that may be caused by dependencies between neighboring bases. These models assume that each sequence position contributes independently to the binding energy, and thus ignore some of the complexity of protein-DNA interaction. Dependencies between nucleotides at different positions in protein binding sites have been observed (Man and Stormo, 2001; Bulyk et al., 2002; Berger et al., 2006; Maerkl and Quake, 2007). However, it is accepted that although the position independence assumption does not fit the data perfectly, it provides a good approximation of protein-DNA interactions (Benos et al., 2002). More complex models that take positional dependencies into account have been proposed, but they require more data to properly estimate the model’s parameters and may overfit if data are limited (MacIsaac and Fraenkel, 2006; Zambelli et al., 2012). In vitro protein binding microarray (PBM) results suggest that the complete binding specificity of a transcription factor can only be represented by a full list of weighted k-mers (Berger et al., 2006). Currently, there is no generally accepted standard for motif representation (Hughes, 2011).

3.1.2 DNA motif discovery methods

Early motifs were identified with laborious low-throughput methods such as footprinting (Galas and Schmitz, 1978). Microarrays (Spellman et al., 1998) provided new data sources for discovering motifs in promoter regions, up to a few thousand base pairs long, selected from dozens to hundreds of co-expressed genes. With the rapid advancement of ChIP-chip (Ren et al., 2000; Iyer et al., 2001) and ChIP-Seq (Barski et al., 2007; Johnson et al., 2007; Robertson et al., 2007) technologies, an unprecedented number of high-quality sequences that are bound by transcription factors in vivo has become available.
ChIP-Seq offers higher spatial resolution, higher genome coverage and less noisy binding regions than ChIP-chip, all of which improve motif discovery (Park, 2009; Zambelli et al., 2012).

As the number of bound sequences has been increasing rapidly, computational methods are needed to discover the important features in the sequences. Over 200 tools have been developed for the computational identification of DNA motifs (Ladunga, 2010). The algorithmic approaches for motif discovery are linked to the motif models that they use and can be broadly grouped into two categories: enumerative methods and alignment-based methods (MacIsaac and Fraenkel, 2006).

Enumerative methods typically count exhaustively all words up to some maximum size in the sequence set, and are thus best suited to consensus sequence motif models (MacIsaac and Fraenkel, 2006). For example, Weeder uses a suffix tree to deterministically search for short words with few mismatches (Pavesi et al., 2001). It was shown to be the most sensitive and selective tool in a comprehensive assessment of 14 tools (Tompa et al., 2005). However, it can be slow when searching for words of 10 bp or longer with up to 3 mismatches in ChIP datasets (Zambelli et al., 2012).

Alignment-based methods often construct a probabilistic model of the sequence data and optimize the parameters of the model to find motifs. Two well-known examples are MEME (Bailey and Elkan, 1994), which uses the EM algorithm to optimize a mixture model of binding sites and background sequences, and AlignACE (Hughes et al., 2000), which uses the Gibbs sampling technique.
A comprehensive assessment of motif discovery programs (Tompa et al., 2005) showed that their performance is good on yeast data but significantly worse on more complex sequence data from flies and humans; each program typically covers a small subset of the known binding sites, with relatively little overlap between the methods. Therefore, the results from multiple motif discovery tools are typically combined to improve performance (D’haeseleer, 2006a; MacIsaac and Fraenkel, 2006).

Traditional motif discovery methods treat all the bases in all the sequences equally. However, additional information can be used for more accurate motif discovery. For example, phylogenetic conservation information (reviewed in MacIsaac and Fraenkel, 2006), position-specific prior information based on ChIP-chip predicted binding locations (Qi et al., 2006), and the structural class of the protein (Narlikar et al., 2006b) have been integrated into motif discovery methods to achieve better performance.

The advent of ChIP-Seq technology enables better in vivo motif discovery than is possible with data obtained from ChIP-chip or from the promoters of co-expressed genes. ChIP-Seq provides much higher spatial resolution (typically within 40-50 bp) in localizing binding events (Wold and Myers, 2008; Wilbanks and Facciotti, 2010), in contrast to a few hundred base pairs for ChIP-chip or up to a few thousand base pairs for promoter regions (Zambelli et al., 2012). New computational methods such as GPS (Chapter 2) further improve the spatial resolution. In addition, ChIP-Seq is not limited by the coverage of array designs, is more sensitive in discovering both strong and weak binding events, and produces fewer artifacts (Park, 2009).
The higher sensitivity of ChIP-Seq offers a much higher redundancy of motif instances in the sequences, as well as the opportunity to learn weak binding sites, which have been shown to activate enhancers at higher TF concentrations (Papatsenko and Levine, 2005) and to contribute to gene expression patterns in modeling-based approaches (Segal et al., 2008). Also, ChIP-Seq enrichment has been reported to be an indicator of the binding affinity of the TF for the bound region (Jothi et al., 2008).

How best to integrate all the informative features offered by ChIP-Seq into motif discovery is still a computational challenge. Traditional motif discovery programs such as MEME (Bailey and Elkan, 1994) and Weeder (Pavesi et al., 2001) are not suitable for large numbers of ChIP-Seq bound sequences because of their computational inefficiency (Jothi et al., 2008; Zambelli et al., 2012). These methods can process only the top ~500-1000 ranking sequences, and thus ignore weak binding sites. Newer methods have been developed to exploit the improved spatial resolution of ChIP-Seq by using a positional prior or by narrowing the sequence length. HMS (Hu et al., 2010) and ChIPMunk (Kulakovskiy et al., 2010) use read coverage profiles as a positional prior for a greedy search for motifs. POSMO (Ma et al., 2012) uses positional information to score and rank k-mers and subsequently clusters k-mers into PWMs. DREME (Bailey, 2011) searches for IUPAC words up to 8 base pairs wide in short sequences from ChIP-Seq binding events in a discriminative setting. However, these methods are not yet optimized for ChIP-Seq motif discovery.

3.1.3 About this chapter

In this chapter, to fully exploit the advantages of ChIP-Seq data, I present a novel k-mer set motif (KSM) model and a k-mer motif alignment and clustering (KMAC) method that learns such motif models. I show that KMAC analysis of a large set of human ChIP-Seq experiments recovers more known factor motifs than other contemporary methods.
In addition, the KSM model is more accurate than the PWM model when predicting which new sequences will be bound in vivo.

3.2 K-mer set motif (KSM) model

In this section, I propose a novel k-mer set motif (KSM) model to describe the binding preference of a protein in the discriminative context of bound and unbound sequences, and describe a motif scoring method based on evaluating the statistical significance of candidate KSM instances in DNA sequences.

The choice of the KSM representation is motivated by the limitations of the positional independence assumption of the PWM model and by the large number of training examples found in ChIP-Seq datasets, which enable KSM learning. By analogy with the k-mers derived from in vitro protein binding microarrays (PBMs) (Berger et al., 2006), we hypothesize that a KSM model derived from ChIP-Seq bound sequences should result in more accurate bound sequence classification.

3.2.1 The KSM representation

A KSM motif model is represented as a single set of aligned enriched k-mers (overlapping ungapped words of length k), each with an offset, a positive count, and a negative count that are derived from the training sequences. A k-mer and its reverse complement are treated as one k-mer. For example, the motif of the ES cell regulator Oct4 contains a set of aligned 8-mers, which are ranked by the statistical significance of their enrichment in the ChIP-Seq bound (positive) sequences relative to a set of unbound (negative) sequences (Figure 3-1A). The top 8-mers, such as ATGCAAAT, TATGCAAA or TGCAAATG, cover the core of the motif PWM (Figure 3-1B) and some flanking bases, or a variation of the motif core (e.g. ATGCTAAT). As the k-mer rank decreases, the k-mers contain more flanking bases or mismatches, becoming more and more divergent from the top ranking k-mers.
Each 8-mer is characterized by its offset relative to the expected binding position in the training sequences, and by the number of positive or negative training sequences that contain the 8-mer (positive or negative hit count). The “offset” of a k-mer is defined as the offset of the first base of the k-mer relative to the expected binding position, which is estimated during the motif discovery process described in section 3.3.2. For the Oct4 example, the expected binding position is the position with base C.

Figure 3-1 Oct4 KSM and PWM motif representation
A) The k-mer set motif (KSM) representation of the Oct4 motif. The top ranking k-mers are shown aligned with each other, each with a consistent offset relative to the expected binding position (yellow). “Pos Hit” indicates the number of positive training sequences containing the k-mer. “Neg Hit” indicates the number of negative training sequences containing the k-mer. B) The PWM logo representation of the motif. The relative heights of the bases at each position represent the relative frequencies of the bases at that position. The total height of a position signifies the information content at that position.

The overlapping k-mers are included to capture the effect of flanking bases, because the flanking sequences overlapping the motif cores may reflect the interaction with cofactors and contain information about the in vivo binding specificity of the protein of interest (Maerkl and Quake, 2007; Slattery et al., 2011). The k-mers are aligned and associated with consistent offsets so that, when scanning a sequence for motif instances using the KSM model, multiple k-mer matches can be grouped according to the expected binding positions derived from their matched positions and their offsets.

In summary, a KSM motif representation is the set of all enriched and consistently aligned overlapping ungapped k-mers.
Under this simple k-mer set representation, a sequence is said to contain an instance of the motif if it contains any of the component k-mers. The binding position can then be computed using the offsets of the matched k-mers.

The length of each k-mer (the parameter k) influences the accuracy of the motif model. If k is too small, the k-mer collection may not be rich enough to capture the binding specificity. If k is too large, the number of observed instances of each k-mer in the data is too small to generate useful statistics. The k-mer set motif discovery method is designed to select the optimal k value based on the enrichment of the motif in the in vivo bound sequences, as described in section 3.3.

3.2.2 Scoring K-mer set motif in a DNA sequence

An important use of a KSM motif model is to scan a query DNA sequence and assign a score to the sequence that indicates the significance of its potential matches to the KSM.

To assign a score to a DNA sequence, we simultaneously search for the occurrences of all the k-mers in the KSM using the Aho-Corasick algorithm (Aho and Corasick, 1975). The Aho-Corasick algorithm is an efficient algorithm for locating all occurrences of any of a finite number of keywords in a string of text. It constructs a finite state pattern matching machine from the keywords and then uses this machine to process the text string in a single pass. If the pattern matching machine is constructed in advance, the complexity of the search is linear in the length of the searched text plus the number of output matches (Aho and Corasick, 1975).

The k-mer matches in the query sequence are grouped according to their respective expected binding locations. The expected binding location of a k-mer is computed from the matched position of the k-mer and the KSM offset of the k-mer.
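The grouping of k-mer matches by expected binding position can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not part of the method described here: a plain dictionary lookup at every position is substituted for the Aho-Corasick automaton, only the forward strand is scanned, and the function name `find_kmer_groups`, the toy query sequence and the offsets assigned to the three Oct4 8-mers (obtained by placing the binding position at the base C) are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def find_kmer_groups(sequence, ksm_offsets):
    """Group KSM k-mer matches by the binding position they imply.

    ksm_offsets maps each k-mer to the offset of its first base relative
    to the expected binding position (a toy structure, assumed here).
    Only the forward strand is scanned in this sketch.
    """
    k = len(next(iter(ksm_offsets)))
    groups = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in ksm_offsets:
            # match start = binding position + offset, so invert it
            binding_pos = start - ksm_offsets[kmer]
            groups[binding_pos].append(kmer)
    return dict(groups)


# Illustrative offsets: position of the first base relative to the C.
toy_ksm = {"ATGCAAAT": -3, "TATGCAAA": -4, "TGCAAATG": -2}
toy_groups = find_kmer_groups("GGTATGCAAATGCC", toy_ksm)
```

In this toy case all three overlapping 8-mers map to the same expected binding position (index 6, the base C), so they form a single k-mer group.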
We define the “k-mer group” as the subset of component k-mers in the KSM model that occur in the query sequence and that are mapped to the same expected binding position on the sequence. Thus each k-mer group is a KSM motif instance in the query DNA sequence.

We illustrate this with a simple example. Suppose we want to search for the motif represented by a KSM model (Figure 3-2A) in a DNA sequence (Figure 3-2B). Four component k-mers from the KSM model match the query sequence and can be grouped into two groups according to their expected binding positions.

To explain how a k-mer group is scored, we first need to introduce the definition of a motif hit count. We define the “hit count” of a motif as the number of sequences in the training sequence set that contain the motif. This is similar to the ZOOPS (Zero Or One Per Sequence) mode of MEME (Bailey and Elkan, 1994) in that we count the number of sequences, not the number of motif occurrences. This definition avoids overcounting motifs in simple repeats, such as “AAAAAAAAAAAA…”.

Figure 3-2 Search k-mer set motif in a DNA sequence
A) An example of a simple KSM model. B) The component k-mer matches are grouped into two k-mer groups by the expected binding positions (yellow).

With the above definition, it is straightforward to define the hit count for a PWM motif. However, the hit count for a k-mer group cannot be obtained by simply summing the hit counts of all the matching component k-mers, because the component k-mers overlap and a simple summation would overcount the number of sequences. Thus the “k-mer group hit count” is defined as the number of positive training sequences that contain at least one of the matched k-mers in the k-mer group.
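The difference between summing per-k-mer hit counts and counting distinct sequences can be seen in a small sketch. It assumes that each k-mer carries the set of IDs of the training sequences that contain it; the function name `group_hit_count`, the k-mers and the ID sets below are hypothetical and are not taken from Figure 3-2.

```python
def group_hit_count(matched_kmers, seq_ids_by_kmer):
    """Count distinct training sequences hit by any k-mer in the group."""
    ids = set()
    for kmer in matched_kmers:
        ids |= seq_ids_by_kmer[kmer]  # set union avoids double counting
    return len(ids)


# Hypothetical positive-sequence IDs recorded for two overlapping k-mers.
toy_pos_ids = {"AAAC": {1, 2, 3}, "AACG": {2, 3, 4}}
toy_count = group_hit_count(["AAAC", "AACG"], toy_pos_ids)
naive_sum = sum(len(toy_pos_ids[k]) for k in ["AAAC", "AACG"])
```

Here the naive sum is 6, whereas only 4 distinct sequences contain at least one of the two k-mers, which is the quantity the k-mer group hit count is meant to capture.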
To compute the k-mer group hit counts without the actual training sequences, we record with each k-mer, during the motif learning phase, the IDs of the positive and negative training sequences that contain it (Figure 3-2A). The union of the IDs from all the matching k-mers thus identifies the training sequences that contain at least one of the component k-mers of the k-mer group. In the above example (Figure 3-2B), k-mer group 1 matches 8 positive training sequences (IDs 1-7 and 9) and 1 negative training sequence (ID 1). Thus, it has a positive hit count of 8 and a negative hit count of 1.

In our discriminative motif discovery setting, given the positive and negative motif hit counts and the total numbers of positive and negative training sequences, the statistical significance of a motif instance can be evaluated in terms of its relative enrichment in positive versus negative training sequences by computing a hypergeometric p-value (HGP) (Barash et al., 2001):

\[ \mathrm{HGP} = \sum_{l=n^{+}}^{\min(N^{+},\,n)} \frac{\binom{N^{+}}{l}\binom{N-N^{+}}{n-l}}{\binom{N}{n}} \]

where $N$ is the total number of positive and negative training sequences, $N^{+}$ is the total number of positive training sequences, $n$ is the number of positive and negative training sequences containing the motif (positive plus negative hit count), and $n^{+}$ is the number of positive training sequences containing the motif (positive hit count). Note that the motif instance can be a PWM match, a consensus sequence match, or a k-mer group.

The “KSM score” of a k-mer group is then defined as the negative logarithm (base 10) of the HGP of the k-mer group. Typical KSM scores range from 0 to a few thousand.

3.3 K-mer motif alignment and clustering (KMAC)

The goal of the K-mer Motif Alignment and Clustering (KMAC) method is to discover the set of k-mers that are enriched in the DNA sequences bound by the protein of interest, and to cluster the k-mers into one or more k-mer set motifs that describe distinct binding preferences.
The k-mer sets may correspond to the primary binding motif, variations of the primary binding motif, or secondary motifs that correspond to co-factor binding.

The input data are a set of ChIP-Seq bound sequences (positive sequences) and, optionally, a set of unbound sequences (negative sequences). If negative sequences are not provided, they are generated randomly using a di-nucleotide shuffling method. Each positive sequence also carries a weight, which can be the strength of the corresponding binding event or the read count of the event. If weights are not provided, the weights of the positive sequences are set to 1. The center base of each sequence is assumed to be the ChIP-Seq predicted binding site position. Although this method is designed to analyze ChIP-Seq derived DNA sequences, it can be applied to other DNA sequences as long as the positional weight and the sequence weight can be set appropriately (see section 3.3.2).

In this study, values of k from 5 to 13 are used on each dataset, and the final k value is chosen as the one that gives the most significantly enriched primary PWM, as described below. Note that the width of the PWMs estimated by this method is not explicitly specified by the value of k. Different k values may converge to PWMs of the same or of different widths.

3.3.1 Discovery of the set of enriched k-mers

First, a set of enriched k-mers is discovered by comparing k-mer hit counts between positive sequences and negative control sequences. The numbers of positive and negative sequences that contain instances of each possible k-mer (hit counts) are counted, treating each k-mer and its reverse complement as the same sequence. A hypergeometric p-value (HGP) is computed to evaluate the significance of each k-mer in terms of its relative enrichment in positive and negative sequences.
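The counting and significance steps just described can be sketched as follows. This is an illustrative sketch and not the KMAC implementation: the function names are hypothetical, merging a k-mer with its reverse complement by keeping the lexicographically smaller of the two is an assumed convention, and the HGP is evaluated with exact integer binomial coefficients following the formula in section 3.2.2.

```python
from math import comb

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Represent a k-mer and its reverse complement by one string."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_hit_counts(sequences, k):
    """Number of sequences containing each canonical k-mer (hit count)."""
    counts = {}
    for seq in sequences:
        # A set ensures each sequence is counted at most once per k-mer.
        seen = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        for kmer in seen:
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def hgp(pos_hits, neg_hits, n_pos, n_neg):
    """Hypergeometric p-value of seeing at least pos_hits positive hits."""
    n = pos_hits + neg_hits
    numerator = sum(
        comb(n_pos, hits) * comb(n_neg, n - hits)
        for hits in range(pos_hits, min(n_pos, n) + 1)
    )
    return numerator / comb(n_pos + n_neg, n)


toy_counts = kmer_hit_counts(["ACGTT", "AACGT"], 3)
toy_p = hgp(pos_hits=2, neg_hits=0, n_pos=2, n_neg=2)
```

For very large datasets the p-value can underflow to 0.0 in floating point, so a practical implementation would more likely work with logarithms of the binomial coefficients.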
A k-mer is considered enriched if its HGP is less than 1e-3 and it has at least 3-fold enrichment in terms of positive/negative hit count. Combinations of HGP thresholds (1e-2, 1e-3 and 1e-4) and fold values (2 and 3) were tested, and the selected thresholds gave the best results in finding the correct motifs.

3.3.2 Clustering the enriched k-mers into k-mer set motifs

KMAC next clusters the enriched k-mers into k-mer set motifs (KSMs) that describe similar DNA binding preferences. A genomic sequence is said to match a KSM if it contains any of the KSM’s component k-mers. KMAC clusters enriched k-mers into KSMs by the following steps (Figure 3-3):

Step 1: A k-mer set is initialized with the most enriched k-mer (seed k-mer) and any other enriched k-mers that differ from it by a single base. Positive sequences that match the k-mer set are selected and aligned on the k-mer match positions.
Step 2: Extract the k-mer set. Any enriched k-mer that appears in a 2k+1 bp window around a KSM match is tested for addition to the k-mer set. To be added, an enriched k-mer must have the same alignment offset to the window sequences in at least a fraction f of its occurrences, 0 < f ≤ 1. In this study, f = 0.5 (see below for the choice of this value).
Step 3: Select and align the positive sequences matching the KSM.
Step 4: Construct a PWM from the aligned sequences.
Step 5: Select positive sequences containing a PWM hit, align the sequences using the PWM, and continue with step 2.

Thus, the k-mer set is further expanded by iteratively repeating steps 2-5.

We assume that a k-mer in the true motif should align consistently with the other k-mers in the motif. Thus, when extracting the k-mer set in step 2, each selected k-mer should have a consistent alignment offset in at least a fraction f of its occurrences.
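The consistency requirement of step 2 can be made concrete with a short sketch. It assumes that the alignment offsets of all occurrences of a candidate k-mer within the window sequences have already been collected into a list; the function name `consistent_offset` and the example offsets are hypothetical.

```python
from collections import Counter


def consistent_offset(offsets, f=0.5):
    """Return the dominant alignment offset of a candidate k-mer, or None.

    The k-mer qualifies only if its most frequent offset accounts for at
    least a fraction f of all its occurrences in the aligned windows.
    """
    if not offsets:
        return None
    offset, count = Counter(offsets).most_common(1)[0]
    return offset if count / len(offsets) >= f else None


# Three of four occurrences agree on offset -3, so the k-mer is accepted.
accepted = consistent_offset([-3, -3, -3, 5], f=0.5)
# Occurrences scattered over four offsets fail the f = 0.5 requirement.
rejected = consistent_offset([1, 2, 3, 4], f=0.5)
```

A k-mer that passes this check would be added to the k-mer set together with its dominant offset; a k-mer that fails is likely to be a background word that merely co-occurs near the motif.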
The value of f affects the stringency of including a k-mer in the k-mer set, and the optimal f value may vary between datasets. We tested different f values (f = 0.3, 0.5 and 0.7) when analyzing the large set of ENCODE data, and the motif discovery performance was similar. Therefore we chose f = 0.5 in this study.

Figure 3-3 Schematic of k-mer set motif finding
Step 1: Initialize k-mer set and align matched sequences. Step 2: Extract k-mer set from aligned sequences. Step 3: Align matched sequences with k-mer set. Step 4: Construct PWM from aligned sequences. Step 5: Align PWM-matched sequences with the PWM. Steps 2-5 are repeated until the hypergeometric p-value of the PWM stops improving.

The rationale for alternating between the KSM and PWM motif representations is to combine the precision of the KSM model with the generalizability of the PWM model. The k-mer set motif is more precise because it requires consistent alignment to add a k-mer and requires exact matching of existing k-mers to find a motif instance. However, it does not generalize easily to new k-mers, especially at the start, when the component k-mers are limited to 1 mismatch from the seed k-mer. The PWM motif, on the other hand, generalizes to unseen sites, but it may lead to false positives. Therefore, a combination of the two may result in a richer and more accurate representation of the motif.

At step 4, the enrichment of the PWM is evaluated by computing a hypergeometric p-value (HGP) from the PWM hit count in the positive sequences compared to the negative sequences. The PWM hit threshold is set to 60% of the maximum PWM score, which has been shown to be approximately equal to the cutoffs determined by cross-validation (MacIsaac and Fraenkel, 2006). Iteration stops when the HGP of the PWM stops improving. The PWM and the KSM representations of the motif are then recorded.
We compute the expected binding position in the alignment by averaging over the relative offsets of predicted binding event positions in all aligned sequences. For each k-mer in the k-mer set, the offset of the k-mer relative to the expected binding position is recorded. When constructing the PWM, we incorporate two sources of information from the binding events to weight the sequences in order to bias motif discovery towards the binding event positions, and to bias motif discovery towards patterns that occur in more confident binding events. The first weighting factor is a positional weight that represents the probability of having a motif at a certain position given its distance from the binding event. The distance weighting function we use was fit to characterized ChIP-Seq data, and is the logistic distribution with mean 0 and variance 13. The second weighting factor is the binding strength (read count) of the binding event. A PWM is constructed with weighted positive sequences centered on the k-mer set match and a zero order Markov model learned from negative sequences. Flanking positions are progressively trimmed to find the PWM with the most significant HGP. After finding the primary KSM, KMAC searches for other KSMs. To accomplish this, the previous seed k-mer is removed from the enriched k-mer pool and PWM motif occurrences are masked in the sequences. The process of building new KSMs is repeated until no more significantly enriched PWMs can be constructed. Rarely, a secondary motif PWM can become more significantly enriched than the primary motif. If this happens, the motif finding process is restarted using the seed k-mer of this secondary motif. In summary, KMAC selects the enriched k-mers from the positive sequences compared with the negative sequences and produces a list of both KSMs and PWMs corresponding to the primary binding motif and secondary binding motifs. 
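The positional weight used during PWM construction can be sketched as a logistic density. The sketch takes the stated value of 13 literally as the variance, so that the scale parameter is s = sqrt(3 * 13) / pi. The text states only that a positional weight and the binding strength are both used; multiplying the two factors, as `site_weight` does, is an assumption made here for illustration, and both function names are hypothetical.

```python
import math


def positional_weight(distance, variance=13.0):
    """Logistic density with mean 0, evaluated at a distance in bp."""
    scale = math.sqrt(3.0 * variance) / math.pi  # variance = s^2 * pi^2 / 3
    z = math.exp(-abs(distance) / scale)  # density is symmetric around 0
    return z / (scale * (1.0 + z) ** 2)


def site_weight(distance, read_count):
    """Assumed combination: positional weight times binding strength."""
    return positional_weight(distance) * read_count


# Weight of a candidate site 4 bp from an event supported by 120 reads.
example_weight = site_weight(distance=4, read_count=120)
```

Under this parameterization the weight peaks at the predicted binding position and falls to about a fifth of its peak value 5 bp away, which biases the PWM toward patterns close to the event centers.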
3.4 Results

3.4.1 The PWM model does not capture k-mer differences between CTCF binding in mouse and human cells

The KSM model consists of a comprehensive list of k-mers with their respective hit count information. However, this rich and quantitative information about TF binding specificity is lost when it is compressed into the PWM model. For example, the 12-mer CCAGAAGAGGGC occurs in 3,962 of the 40,000 top ranking CTCF bound sequences in mouse ES cells (Figure 3-4A), but in only 81-83 of the 40,000 top ranking CTCF bound sequences in each of 3 types of human cells (H1-hES, GM12878 and K562). Similar cases include CCACAAGAGGGC, CCAGAAGAGGGT and CCAGAAGAGGGG. In contrast to the dramatic differences in k-mers, the PWM models for these 4 cell types are highly similar (Figure 3-4B) and cannot explain the differences displayed by the k-mer results.

Figure 3-4 The PWM model does not capture k-mer differences
A) The hit counts of the top 10 non-overlapping k-mers show dramatic differences between CTCF binding in mouse ES cells and in human H1-hES, GM12878 and K562 cells. The hit counts are computed from 40,000 61bp sequences from CTCF binding sites in the respective cells. B) The highly similar PWM motif logos of mouse ES cells and human H1-hES, GM12878 and K562 cells. The motif logos are generated by STAMP (Mahony et al., 2007) from the motifs discovered by KMAC using the 40,000 top ranking binding sites in the respective datasets.

3.4.2 K-mer set motif model is more predictive for in vivo binding than PWM model

The classification performance of the k-mer set motif (KSM) model and the PWM model is further compared in terms of sensitivity and specificity. KMAC is applied to sequences derived from a mouse ES cell CTCF ChIP-Seq dataset to learn both KSM and PWM representations for the mouse CTCF motif.
Both representations of the same mouse motif were then used to predict motif instances in the same set of mouse CTCF bound sequences (n=39071) and in the human CTCF bound sequences (n=61767), with corresponding negative sequences generated by dinucleotide shuffling. This is analogous to testing on the training sample and testing on new testing samples. The performance is characterized using truncated (false positive rate <= 0.1) receiver operating characteristic (ROC) curves, because in practice only predictions with such low false positive rates are meaningful. We also tested 2 alternative scoring methods in which each sequence is scored using only the single k-mer that has the most significant p-value among the matched k-mers (top match k-mer). Either the p-value or the hit count of this top match k-mer was used as the score for that test sequence.

The results show that KSM with the k-mer group score gives the best performance (Figure 3-5 A and B). The area under the curve (AUC) for the truncated ROC (maximum value 0.1) is 0.081 for KSM and 0.074 for PWM when predicting the same mouse training sequences, and 0.058 for KSM and 0.052 for PWM when predicting the new human testing sequences. It is remarkable that even taking only the top match k-mer in the sequence outperforms the PWM model. In addition, the k-mer group score gives better performance than the scores from the single top match k-mer, underscoring the value of including the flanking sequences in the motif model.

A similar analysis with c-Myc also showed that the KSM model is more predictive than the PWM model (Figure 3-5 C and D). The area under the curve (AUC) for the truncated ROC is 0.035 for KSM and 0.031 for PWM when predicting the same mouse training sequences, and 0.023 for KSM and 0.021 for PWM when predicting the new human testing sequences. We also tested the PWM motif from a public database (JASPAR MA0147.1) derived from the same mouse c-Myc ChIP-Seq data; the results were similar (data not shown).
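The truncated AUC values reported above have a maximum of 0.1 because the ROC curve is integrated only up to a false positive rate of 0.1. A minimal sketch of such a calculation is given below. It assumes that higher scores indicate more confident predictions, processes tied scores together, and interpolates linearly at the cutoff; the function name `truncated_auc` and the toy scores are hypothetical and the evaluation code used in this study may differ.

```python
def truncated_auc(pos_scores, neg_scores, max_fpr=0.1):
    """Area under the ROC curve restricted to FPR <= max_fpr."""
    labeled = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    labeled.sort(key=lambda pair: -pair[0])  # most confident first
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    auc, tp, fp = 0.0, 0, 0
    prev_fpr, prev_tpr = 0.0, 0.0
    i = 0
    while i < len(labeled):
        score = labeled[i][0]
        # Lower the threshold past all items that share this score.
        while i < len(labeled) and labeled[i][0] == score:
            tp += labeled[i][1]
            fp += 1 - labeled[i][1]
            i += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        if fpr >= max_fpr:
            # Interpolate the curve linearly up to the FPR cutoff.
            frac = (max_fpr - prev_fpr) / (fpr - prev_fpr)
            tpr_cut = prev_tpr + (tpr - prev_tpr) * frac
            return auc + (max_fpr - prev_fpr) * (prev_tpr + tpr_cut) / 2.0
        auc += (fpr - prev_fpr) * (prev_tpr + tpr) / 2.0  # trapezoid rule
        prev_fpr, prev_tpr = fpr, tpr
    return auc


perfect = truncated_auc([0.9, 0.8], [0.2, 0.1])
uninformative = truncated_auc([1.0], [1.0])
```

A classifier that ranks every positive above every negative reaches the maximum of 0.1, whereas an uninformative one (all scores tied) gives 0.005, so values such as 0.081 and 0.074 can be read against that range.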
Figure 3-5 The KSM model is more predictive than the PWM model
A) Truncated ROC of 4 different representations/scoring methods for the mouse CTCF motif to predict binding in mouse CTCF ChIP-Seq bound sequences. B) Truncated ROC of 4 different representations/scoring methods for the mouse CTCF motif to predict binding in human CTCF ChIP-Seq bound sequences. C) Truncated ROC of 4 different representations/scoring methods for the mouse c-Myc motif to predict binding in mouse c-Myc ChIP-Seq bound sequences. D) Truncated ROC of 4 different representations/scoring methods for the mouse c-Myc motif to predict binding in human c-Myc ChIP-Seq bound sequences. In all cases, the negative sets of sequences are generated by dinucleotide shuffling of the positive sequences.

3.4.3 KMAC outperforms other motif discovery methods in discovering known DNA-binding motifs

We tested KMAC’s ability to discover biologically relevant DNA-binding motifs in data from the ENCODE project (Birney et al., 2007). We chose this large collection of experiments because we expected it to be representative of the typical range of ChIP-Seq data noise and sequencing depth. Noise can be caused by low antibody affinity and by deviations from the ideal experimental procedure. We used a set of 214 ChIP-Seq experiments and associated controls comprising 63 distinct transcription factors that were profiled in one or more cell lines by the ENCODE project and for which validated DNA-binding motifs exist in public databases.

We applied KMAC to 61bp sequences extracted from the 5000 most highly ChIP-enriched GPS peak calls from each of these 214 ChIP-Seq datasets, and the most significant KMAC-discovered motifs from each analysis were compared to the corresponding known binding preferences of the same TFs using STAMP (Mahony et al., 2007). A motif alignment with an E-value less than 1e-5 was considered a match.
For comparison, we also applied to the same data four popular traditional motif discovery tools covering a range of computational techniques, MEME (Bailey and Elkan, 1994), Weeder (Pavesi et al., 2001), MDScan (Liu et al., 2002) and AlignACE (Hughes et al., 2000), and four ChIP-Seq oriented tools, DREME (Bailey, 2011), POSMO (Ma et al., 2012), HMS (Hu et al., 2010) and ChIPMunk (Kulakovskiy et al., 2010). A set of 100bp sequences extracted from the 500 most highly ChIP-enriched GPS peak calls was examined by the motif-finders MEME, Weeder, MDScan, AlignACE, DREME and POSMO. For HMS and ChIPMunk, a set of 100bp sequences and the corresponding read coverage profiles were extracted from the 500 most highly ChIP-enriched GPS peak calls.

Figure 3-6 KMAC motif discovery outperforms other methods when detecting motifs in ChIP-Seq data.
The motif detection performance of KMAC is compared to the motif detection performance of various motif-finders on 214 ENCODE ChIP-Seq experiments.

We found that KMAC outperforms all of the compared motif discovery approaches, even when each method is allowed to make multiple motif predictions (Figure 3-6). We note that KMAC sometimes (in 8 out of 214 datasets) failed to find the known motif where one of the other algorithms succeeded.

3.4.4 KMAC outperforms other ChIP-Seq oriented motif discovery methods

For the ChIP-Seq oriented tools, KMAC, DREME, POSMO, HMS and ChIPMunk, we also tested the methods under two sets of conditions: 1) sequence sets with the length and number recommended by the authors of these methods, and 2) the same set of 500x100bp sequences. We found that for the other 4 methods, the “500x100bp” condition gave better results than using more sequences, as designed for ChIP-Seq data (Figure 3-7).
This suggests that the additional sequences from the relatively weak binding events degrade the motif discovery performance of these methods, which defeats the purpose of capturing the weak binding sites available from ChIP-Seq data. In contrast, KMAC with 5000 sequences performs better than KMAC in the “500x100bp” condition, and better than all the other methods in both conditions (Figure 3-7). The comparison of ChIP-Seq oriented tools suggests that KMAC is more suitable for learning the whole spectrum of binding preferences from a large number of both strong and weak binding sites.

Figure 3-7 KMAC outperforms other ChIP-Seq oriented motif discovery methods. The motif detection performances of various motif-finder/condition combinations on 214 ENCODE ChIP-Seq experiments are compared. KMAC, POSMO, HMS, ChIPMunk and DREME motifs are learned from the number of sequences recommended for each method (DREME: all sequences, other methods: top 5000 sequences). KMAC500, DREME500, POSMO500, HMS500 and ChIPMunk500 motifs are learned under the same condition (top 500 100bp sequences).

3.5 Discussion

In this study, we presented a novel k-mer set motif (KSM) representation to describe the binding sequence preferences of TFs and a motif discovery method to learn the KSM model from ChIP-Seq defined sequences. We have shown that the KSM model with the motif group scoring method is able to predict in vivo binding more accurately than the widely used PWM model, even when predicting binding across species. In addition, our KMAC motif discovery method outperforms several widely used programs, including several new ChIP-Seq oriented methods, in discovering known motifs from a large set of 214 ENCODE ChIP-Seq datasets.

The comprehensive list of k-mers in the KSM model gives a more complete and quantitative description of in vivo binding preferences. We have shown that the KSM model is more predictive than the PWM model. The performance gain is likely due to the more comprehensive representation of the motif and the extra information provided by the flanking k-mers learned from a large set of sequence data. In addition, the advantageous performance of the KSM model suggests that the stringent exact k-mer match criterion does not limit the ability of our model to generalize to unseen sequences, while ruling out unrealistic binding sequences that may pass the PWM threshold. In this work, the k-mers are limited to ungapped k-mers for computational simplicity and efficiency. A more flexible model that incorporates gapped k-mers would represent the binding specificity more accurately, but would need more sophisticated learning and motif scanning methods.

The KMAC motif discovery method exploits the advantages of ChIP-Seq when compared to ChIP-chip, such as better spatial resolution, more complete coverage, higher sensitivity and specificity and better quantification of binding events. It learns motifs enriched in bound sequences as compared to unbound sequences, utilizes both strong and weak binding events and biases motif discovery towards the binding event positions and more confident events. The value of integrating these factors has been demonstrated by the improved performance of KMAC for discovering biologically relevant DNA-binding motifs in the large set of ENCODE ChIP-Seq data.

Motif discovery methods typically compare their performance with existing methods by analyzing synthetic sequences or sequences bound by a small set of TFs (Zambelli et al., 2012). The ENCODE data we compiled consists of 214 ChIP-Seq experiments and 63 TFs spanning different structural classes.
These ChIP-Seq data cover a variety of cell types, antibodies and lab protocols, and provide a diverse set of ChIP-Seq binding sequences for a comprehensive assessment of motif discovery performance.

Traditional motif discovery methods such as MEME and Weeder have been found unsuitable for analyzing the whole sequence set due to computational inefficiencies (Jothi et al., 2008; Zambelli et al., 2012), and are usually applied to only a small subset of top ranking binding sequences. Several ChIP-Seq oriented methods such as DREME, POSMO, HMS and ChIPMunk are designed to exploit the high spatial resolution gained from ChIP-Seq data. However, their performance under the designed ChIP-Seq compatible condition (i.e. top 5000 or all sequences) was surprisingly worse than with the top 500 sequences. In contrast, KMAC motif discovery on the ENCODE data with the top 5000 sequences outperforms these methods in both tested conditions, and also outperforms traditional methods such as MEME, Weeder, AlignACE and MDscan. Thus, the combination of the KSM model and the KMAC motif discovery method offers promise in taking advantage of what ChIP-Seq technology has to offer for a better understanding of the in vivo sequence specificity of protein-DNA interactions.

The KSM model and the KMAC motif discovery method are designed to leverage the large number of high spatial resolution binding sites produced by ChIP-Seq technology. As better sequencing technology and more ChIP-Seq datasets become available, the advantage of using a more comprehensive k-mer based motif representation such as the KSM model will be of greater significance.
The integration of our in vivo binding model with in vitro models learned from technologies such as protein binding microarray (PBM) (Berger et al., 2006), HT-SELEX (Zhao et al., 2009), and bacterial one-hybrid (B1H) (Meng et al., 2005; Meng and Wolfe, 2006) will generate a more complete picture of binding preferences for further understanding of the mechanisms of protein-DNA interactions. Such quantitative characterization of transcription factor binding will enable more systems biology approaches to modeling and predicting the behavior of the complex regulatory network in living cells.

3.6 Methods

3.6.1 Datasets

214 ENCODE ChIP-Seq datasets that have an embargo date before Oct 28, 2011 and have known motifs in public databases were downloaded from the ENCODE project website (Birney et al., 2007). Mouse ES cell factor ChIP-Seq datasets (Chen et al., 2008) were downloaded from GEO. FastQ files of the ChIP-Seq data were then aligned to the genome (human hg19 or mouse mm9) using Bowtie (Langmead et al., 2009) version 0.12.7 with options “-q --best --strata -m 1 -p 4 --chunkmbs 1024”.

3.6.2 Motif-finding performance comparison

For the 214 ENCODE ChIP-Seq datasets, the GPS event-finder (Guo et al., 2010) was applied to call binding events. KMAC and 8 other motif finding methods, AlignACE v4.0 (Hughes et al., 2000), MDscan v2004 (Liu et al., 2002), MEME v4.7.0 (Bailey and Elkan, 1994), Weeder v1.4.2 (Pavesi et al., 2001), DREME v4.7.0 (Bailey, 2011), POSMO v1 (Ma et al., 2012), HMS v0.1 (Hu et al., 2010) and ChIPMunk v3 (Kulakovskiy et al., 2010), were applied to discover motifs independently. For KMAC, the positive set consists of 61bp sequences centered on the GPS predicted binding locations, and the negative set consists of 61bp sequences that are 300bp away in the reference genome from binding locations and that don’t overlap positive sequences.
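The construction of the KMAC positive and negative sets described above can be illustrated with a small sketch. It is not taken from the KMAC source code. It assumes 0-based half-open coordinates, represents a binding call as a hypothetical `(chrom, position)` tuple, and places each negative window 300bp downstream of the call, since the text does not say on which side the negative window lies.

```python
from typing import List, Tuple

Window = Tuple[str, int, int]  # (chrom, start, end), 0-based, end exclusive


def centered_window(chrom: str, pos: int, width: int = 61) -> Window:
    """Return a window of the given odd width centered on pos."""
    half = width // 2
    return (chrom, pos - half, pos + half + 1)


def overlaps(a: Window, b: Window) -> bool:
    """True if two windows share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def build_sets(events: List[Tuple[str, int]], width: int = 61,
               offset: int = 300) -> Tuple[List[Window], List[Window]]:
    """Build positive windows on the events and non-overlapping negative windows.

    Assumption: the negative window is centered `offset` bp downstream of the
    event; a candidate that overlaps any positive window is discarded.
    """
    positives = [centered_window(c, p, width) for c, p in events]
    negatives = []
    for chrom, pos in events:
        candidate = centered_window(chrom, pos + offset, width)
        if candidate[1] < 0:
            continue  # window would start before the chromosome begins
        if not any(overlaps(candidate, p) for p in positives):
            negatives.append(candidate)
    return positives, negatives
```

In this sketch, an event whose downstream window collides with a neighboring binding call simply contributes no negative sequence, which keeps the negative set free of bound sequence at the cost of a slightly smaller set.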
For AlignACE, MDscan, MEME and Weeder, 100bp sequences centered on the top 500 peaks were extracted from each dataset, as suggested by the MEME Suite’s documentation based on the typical resolution of ChIP-Seq peaks. For the ChIP-Seq oriented methods, DREME, POSMO, HMS and ChIPMunk, two sets of sequences were tested: 1) a set of 100bp sequences centered on the top 500 GPS peaks; 2) a set of sequences with the number and length recommended by the authors of these methods (DREME: all binding calls, 100bp; POSMO: top 5000 1000bp; HMS and ChIPMunk: top 5000 200bp sequences). For HMS and ChIPMunk, a set of read coverage profiles matching the sequences was also extracted. MEME was run with “-nmotifs 6” and Weeder was run with option “large”. POSMO was run with options “5000 11111111 sequence_file 1.6 2.5 20 200”. ChIPMunk was run with options “6 15 yes 1.0 p:read_coverage_profile 100 10 1 4 random 0.41”. HMS was run with options “-w motif_width -dna 4 -iteration 100 -chain 50 -seqprop 0.1 -strand 2 -base read_coverage_profile -dep 2”; motif_width was determined by the width of the motif discovered by MEME for the same data. All other parameters were the defaults specified by the authors.

We collected known binding preference motifs (PWMs) from the TRANSFAC (Matys et al., 2003), JASPAR (Sandelin et al., 2004), and Uniprobe (Berger et al., 2006) databases. We only included motifs of the factors of interest or motifs for the TF family, but not motifs of other factors in the same family, because factors in the same family may have very different binding motifs. Discovered motifs (PWMs) were compared to known motifs using STAMP (Mahony et al., 2007). A motif with an E-value less than 1e-5 was considered a match. For each program, we counted the number of datasets that had a motif matching at least one known motif of that transcription factor.
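The counting rule above can be written down in a few lines. The sketch below is illustrative only and does not call STAMP: it assumes the E-values have already been collected into a hypothetical dictionary that maps each dataset to a list holding, for each motif a program reported (in output order), the smallest E-value against any known motif of the profiled factor. The `top_k` argument is an added convenience for restricting the count to the first few reported motifs.

```python
from typing import Dict, List


def count_matched_datasets(evalues_by_dataset: Dict[str, List[float]],
                           top_k: int = 1,
                           cutoff: float = 1e-5) -> int:
    """Count datasets where any of the first top_k motifs matches a known motif.

    A motif counts as a match when its best E-value is strictly below cutoff.
    """
    matched = 0
    for evalues in evalues_by_dataset.values():
        if any(e < cutoff for e in evalues[:top_k]):
            matched += 1
    return matched
```

For example, a dataset whose first motif has an E-value of 0.5 and whose second motif has an E-value of 1e-9 is counted with `top_k=2` but not with `top_k=1`.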
In some cases, the correct motif was not matched by the first motif that a method output, but by the second or a later motif. Therefore we compared the motif-finding performance using the top 1, top 2… or top 6 motifs. Little improvement was observed after the 6th motif.

3.6.3 ROC comparison of motif representation performance in predicting in vivo binding

The mouse motifs (in both KSM and PWM representations) were learned by applying KMAC to the top 5000 GPS (Guo et al., 2010) binding event calls in CTCF and c-Myc ChIP-Seq data from mouse ES cells (Chen et al., 2008). The test mouse CTCF (n=39071) and c-Myc (n=7085) bound sequences were obtained by extracting 100 bp sequences centered on all the GPS binding event calls in the respective ChIP-Seq data (Chen et al., 2008). The test human CTCF and c-Myc bound sequences were obtained by extracting 100 bp sequences centered on all the GPS binding event calls in CTCF (n=61767) and c-Myc (n=18537) ChIP-Seq data produced by the Crawford Lab in the ENCODE project, respectively (Birney et al., 2007). A matched set of negative test sequences for each test sequence set was generated by dinucleotide shuffling (Jiang et al., 2008) of the ChIP-Seq bound sequences. The KSM model (k-mer group score, scores using the p-value or the hit count of the single top match k-mer) and the PWM model were used to scan the testing sequences. A list of scores on positive and negative sequences for each testing case was processed using Matlab software (The MathWorks, Inc.) to compute the truncated (false positive rate <= 0.1) ROC curves and the AUC values.

Chapter 4 Genome-wide event finding and motif discovery (GEM)

Part of the material presented in this chapter was adapted from the following publication: Yuchun Guo, Shaun Mahony, David K Gifford (2012).
High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints, PLOS Computational Biology, 8(8): e1002638.

Collaborations: Y.G., S.M. and D.K.G. conceived the project. Y.G. designed the computational model, implemented the algorithm and analyzed the data. Y.G., S.M. and D.K.G. wrote the manuscript.

4.1 Introduction

As the method of choice for genome-wide profiling of in vivo protein-DNA interactions, ChIP-Seq offers much improved spatial resolution (also called spatial accuracy, positional resolution), which is considered perhaps the greatest improvement over ChIP-chip (Park, 2009). High spatial resolution in binding event calls can greatly facilitate computational motif discovery from the bound regions by narrowing the search window (Bailey, 2011) or by providing positional prior information for computational models (Narlikar et al., 2006a; Qi et al., 2006), thus increasing the motif signal-to-noise ratio. In addition, the spatial accuracy in locating the binding events influences the quality of regulatory site annotation relative to binding sites of other transcription factors (TFs), transcription start sites, exon/intron boundaries, the 3’ end of genes and other conserved noncoding features. However, although the reads are mapped to the genome at base pair resolution, random variation in the ChIP DNA fragmentation process obscures the actual location of interaction events (see discussion in Section 2.1).

The majority of TFs are known to interact with DNA in a highly sequence-specific manner, with much higher affinity to the motif sites than neighboring sequences (Vaquerizas et al., 2009; Stormo and Zhao, 2010).
DNA binding motifs have been used in computational identification of TF binding sites, but the prediction of in vivo binding from sequence may result in false positives and has been shown to be empirically unreliable (Wasserman and Sandelin, 2004; Farnham, 2009). Thus the ChIP-Seq read coverage and DNA motif information provide complementary perspectives to enhance the accuracy of predicting sequence-specific protein-DNA interactions. By using genome sequence data, it should be possible to predict the TF binding locations at a much higher spatial resolution.

Contemporary methods to resolve binding events in ChIP-Seq data identify statistically enriched regions of ChIP-Seq read density and the peak points of enrichment within those regions (Ji et al., 2008; Jothi et al., 2008; Valouev et al., 2008; Zhang et al., 2008; Guo et al., 2010; Feng et al., 2011). The resulting binding calls can be offset from the bound site by dozens of bases (Park, 2009). Recent studies have integrated peak detection and motif discovery by including motif occurrences to score the significance of predicted binding events (Boeva et al., 2010; Wu et al., 2010), or by using ChIP-Seq read coverage as a positional prior to improve motif discovery (Hu et al., 2010; Kulakovskiy et al., 2010). However, no study has yet used the motif position information to reciprocally improve the spatial accuracy of binding event prediction.

To improve the spatial resolution of ChIP-Seq binding prediction, we have developed a new method called GEM, which simultaneously resolves the location of protein-DNA interactions and discovers explanatory DNA sequence motifs with an integrated model of ChIP-Seq reads and proximal DNA sequences. GEM reciprocally improves motif detection using binding event locations, and binding event predictions using discovered motifs.

In this chapter, I describe the GEM algorithm and review the GEM derived results.
GEM significantly improves upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and improves the discovery of closely spaced binding events for the same factor. Additional results of using GEM to discover TF spatial binding constraints are presented in Chapter 5.

4.2 GEM algorithm

The GEM algorithm consists of five phases:
1. Predict protein-DNA binding event locations with a sparse prior
2. Discover the k-mer set motifs at binding event locations
3. Generate a positional prior for event discovery with the most enriched k-mer set motif
4. Predict improved protein-DNA binding event locations with a k-mer based positional prior
5. Repeat motif discovery (Phase 2) from the Phase 4 improved event locations.

4.2.1 Predicting protein-DNA binding events with a sparse prior

Initial protein-DNA binding event locations are predicted by GPS (Guo et al., 2010), which employs a negative Dirichlet sparse prior.

4.2.2 Discover the k-mer set motifs at binding event locations

Next we apply KMAC to discover the k-mer set motifs (KSMs) that are enriched in the bound sequences as compared to the unbound sequences. The positive set consists of 61bp sequences centered on the predicted binding locations from Phase 1, and the negative set consists of 61bp sequences that are 300bp away from binding locations and that don’t overlap positive sequences.

4.2.3 Positional prior generation

Phase 3 of GEM uses the primary KSM to compute a positional prior that will be used for binding event discovery in Phase 4. As in GPS, the genome is segmented into independent separable regions (typically a few kb long) by dividing at read gaps that are larger than 500bp and further excluding regions that contain fewer than 6 reads. In each evaluated genome region, we search for the k-mer group instances of the primary k-mer set motif, as described in Subsection 3.2.2.
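The region segmentation rule in Subsection 4.2.3 (split at read gaps larger than 500bp, drop regions with fewer than 6 reads) can be sketched as follows. This is a simplified illustration rather than the GPS/GEM implementation: it assumes reads from a single chromosome, each reduced to one integer coordinate, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def segment_regions(read_positions: Iterable[int],
                    max_gap: int = 500,
                    min_reads: int = 6) -> List[Tuple[int, int]]:
    """Split sorted read positions into regions at gaps larger than max_gap.

    Returns (first_read, last_read) for each region holding at least
    min_reads reads; sparser regions are discarded.
    """
    regions: List[Tuple[int, int]] = []
    current: List[int] = []
    for pos in sorted(read_positions):
        # A gap strictly larger than max_gap closes the current region.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_reads:
                regions.append((current[0], current[-1]))
            current = []
        current.append(pos)
    if len(current) >= min_reads:
        regions.append((current[0], current[-1]))
    return regions
```

Because the regions share no reads across a gap of this size, each one can be analyzed independently in the later phases.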
The position-specific prior for a sequence base is defined as the hit count of the k-mer group whose binding offsets match that base (i.e. the number of positive set sequences that contain one or more of the matched k-mers whose binding offsets match that base). Alternatively, the PWM of the TF of interest is used to scan the sequence, and the PWM scores that pass a certain threshold are used to specify the position-specific prior information. The concept of using informative positional priors for motif discovery has been explored previously (Narlikar et al., 2006a; Qi et al., 2006).

4.2.4 Binding event prediction with a positional prior

GEM employs a generative probabilistic model that describes the likelihood of a set of ChIP-Seq reads being generated from a set of protein-DNA interaction events originating at specific DNA sequences. The model generates protein-DNA interaction events that are biased to occur at explanatory DNA sequences by a motif based positional prior. Each event then independently generates reads following an empirical read spatial distribution that describes the probability of reads given the distance from the event (see Figure 2-2 for an example).

Formally, in an evaluated region of length M, we consider N ChIP-Seq reads that have been mapped to genome locations $R = \{r_1, \dots, r_N\}$ and all M possible protein-DNA interaction events at single base locations $B = \{b_1, \dots, b_M\}$. We represent the latent assignments of reads to the events that caused them as $Z = \{z_1, \dots, z_N\}$, where the indicator function $1(z_n = m) = 1$ when read n is caused by event m.
The probability of a read n is based on a mixture of possible binding events:

\[ p(r_n \mid \pi) = \sum_{m=1}^{M} \pi_m \, p(r_n \mid m), \qquad \sum_{m=1}^{M} \pi_m = 1 \]

where M is the number of possible events; $\pi$ denotes the parameter vector of mixing probabilities, and $\pi_m$ is the probability of event m; $p(r_n \mid m)$ is the probability of read n being generated from event m and can be determined from the empirical spatial distribution of reads given the event. The overall likelihood of the observed set of reads is:

\[ p(R \mid \pi) = \prod_{n=1}^{N} \sum_{m=1}^{M} \pi_m \, p(r_n \mid m) \]

We make two prior assumptions about the binding events: 1) binding events prefer to occur at the sequence specific DNA motif positions; 2) binding events are relatively sparse throughout the genome. To incorporate these assumptions, we place a negative Dirichlet prior (Figueiredo and Jain, 2002; Guo et al., 2010) $p(\pi)$ on the binding event probabilities $\pi$:

\[ p(\pi) \propto \prod_{m=1}^{M} (\pi_m)^{-\alpha_s + \alpha_m} \]

where $\alpha_s$ is the uniform sparse prior parameter governing the degree of sparseness, $\alpha_s > 0$; $\alpha_m$ denotes the binding event specific prior parameter, and its value is proportional to $C_m$, the positional prior count underlying event m (as defined in Phase 3):

\[ \alpha_m = \alpha_s \, \mu \, \frac{C_m}{\max_{m'} C_{m'}} \]

where $\mu$ is a parameter to tune the effect of the motif based prior, $0 \le \mu < 1$. The rationale for scaling the motif-based positional prior with the genome-wide occurrences of the motif is that if the motif matched at position m has more occurrences at binding events genome wide, it is more likely to cause a binding event at that genome position. The parameter $\alpha_s$ is set based on the read coverage in the whole dataset and in the local genomic region (see more details in Subsection 2.2.5). The parameter $\alpha_m$ is scaled such that the values of all possible $\alpha_m$ will be less than $\alpha_s$.
Therefore the k-mer based prior will not force the model to predict a binding event at a motif position when the observed reads do not provide sufficient evidence of a protein-DNA interaction event. We tested different settings for $\mu$ ($\mu$ = 0.5, 0.8 or 0.95) in analyzing the GABP data. The results show that GEM with $\mu$ = 0.8 and GEM with $\mu$ = 0.95 give similar results, and both produce much better spatial resolution (~1bp) and call more joint events (~5%) than GEM with $\mu$ = 0.5. Thus we chose $\mu$ = 0.8.

Since the k-mers underlying the possible binding event positions and their counts are known, the value of the term $-\alpha_s + \alpha_m$ remains constant when we estimate the parameters in the mixture model. Therefore, we can solve the mixture model using the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The complete-data log penalized likelihood is:

\[ \ln p(R, Z, \pi) = \sum_{n=1}^{N} \sum_{m=1}^{M} 1(z_n = m) \left( \ln \pi_m + \ln p(r_n \mid m) \right) + \sum_{m=1}^{M} (-\alpha_s + \alpha_m) \ln \pi_m \]

where $1(z_n = m)$ is the indicator function. In the E step we have:

\[ \gamma(z_n = m) = \frac{\pi_m \, p(r_n \mid m)}{\sum_{m'=1}^{M} \pi_{m'} \, p(r_n \mid m')} \]

where $\gamma(z_n = m)$ can be interpreted as the fraction of read n that is assigned to event m. In the M step, on iteration i we find the parameter $\hat{\pi}^{(i)}$ that maximizes the expected complete-data log penalized likelihood:

\[ \hat{\pi}^{(i)} = \arg\max_{\pi} \sum_{n=1}^{N} \sum_{m=1}^{M} \gamma(z_n = m) \left( \ln \pi_m + \ln p(r_n \mid m) \right) + \sum_{m=1}^{M} (-\alpha_s + \alpha_m) \ln \pi_m \]

under the constraint $\sum_{m=1}^{M} \pi_m = 1$. By simplifying, we find the closed-form solution of the maximization as:

\[ \hat{\pi}_m^{(i)} = \frac{\max(0, N_m - \alpha_s + \alpha_m)}{\sum_{m'=1}^{M} \max(0, N_{m'} - \alpha_s + \alpha_{m'})}, \qquad N_m = \sum_{n=1}^{N} \gamma(z_n = m) \]

where $N_m$ is the effective number of reads assigned to event m, or the binding strength of event m. Intuitively, the effective read count of an event is decreased by a pseudo-count $\alpha_s$ for the sparseness penalty, and is increased by a pseudo-count $\alpha_m$ for the motif match at position m.
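To make the E step and the closed-form M step concrete, the following is a minimal pure-Python sketch of the penalized EM update on a toy region. It is not the GEM Java implementation. The matrix `read_probs[n][m]`, standing for $p(r_n \mid m)$, is assumed to be precomputed from the empirical read distribution. The function names, the uniform initialization of $\pi$ and the log-likelihood-based stopping rule are illustrative assumptions, and the maximum positional count is taken over the toy region only.

```python
import math
from typing import List, Sequence


def motif_prior(counts: Sequence[float], alpha_s: float,
                mu: float = 0.8) -> List[float]:
    """alpha_m = alpha_s * mu * C_m / max(C), so every alpha_m < alpha_s."""
    c_max = max(counts)
    if c_max <= 0:
        return [0.0 for _ in counts]
    return [alpha_s * mu * c / c_max for c in counts]


def em_sparse_mixture(read_probs: Sequence[Sequence[float]],
                      alpha_s: float,
                      alpha_m: Sequence[float],
                      max_iter: int = 1000,
                      tol: float = 1e-5) -> List[float]:
    """Estimate mixing probabilities pi under the negative Dirichlet prior."""
    n_events = len(read_probs[0])
    pi = [1.0 / n_events] * n_events  # assumed uniform start
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E step: accumulate N_m, the fractional read count of each event.
        n_m = [0.0] * n_events
        ll = 0.0
        for row in read_probs:
            weighted = [pi[m] * row[m] for m in range(n_events)]
            total = sum(weighted)
            if total == 0.0:
                continue  # read not explained by any surviving event
            ll += math.log(total)
            for m in range(n_events):
                n_m[m] += weighted[m] / total
        # M step: subtract alpha_s, add alpha_m, clip at zero, renormalize.
        raw = [max(0.0, n_m[m] - alpha_s + alpha_m[m])
               for m in range(n_events)]
        norm = sum(raw)
        if norm == 0.0:
            break  # every candidate event was eliminated
        pi = [x / norm for x in raw]
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi
```

In a toy region with two candidate events, six reads favoring the first and four favoring the second, and $\alpha_s = 3$, the sketch eliminates the second event when it has no motif support and returns `[1.0, 0.0]`. Giving that position the maximal positional count raises its $\alpha_m$ to 2.4, and both events then survive. An event whose $\pi_m$ reaches zero stays at zero in later iterations, since it receives no reads and $\alpha_m < \alpha_s$.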
If for event m, the value of $\pi_m$ becomes zero, the model is restructured to eliminate it (Figueiredo and Jain, 2002). The EM algorithm is deemed to have converged when the change in likelihood falls below a small value, for example 1e-5. Since the value of the term $-\alpha_s + \alpha_m$ is negative, a binding event supported by enriched k-mers may still be eliminated if it is not sufficiently supported by read data. In addition, a binding event not supported by enriched k-mers may still survive if it is sufficiently supported by the read data.

The predicted binding events are tested for significance as described previously (Guo et al., 2010). Briefly, if a control dataset is available, we compare the number of reads in the ChIP event to the number of reads in the corresponding region in the control sample using a Binomial test. If control data is not available, we apply a statistical test that uses a dynamic Poisson distribution to account for local biases. To correct for multiple hypothesis testing, a Benjamini-Hochberg correction (Benjamini and Hochberg, 1995) is applied. It is worth mentioning that we only use read counts of events to test for significance. The read spatial distribution of binding events is updated after each round of binding event prediction.

4.2.5 Motif discovery using improved event locations

Phase 5 repeats Phase 2 motif discovery using the binding events predicted from Phase 4. As described in the results section (Figure 4-1), the spatial accuracy of binding events discovered from Phase 4 (GEM) is significantly improved from Phase 1 (GPS). Thus, these events will be more accurately centered on motifs and the performance of motif discovery is correspondingly improved.

4.2.6 GEM software

GEM is stand-alone Java software that takes alignment files of ChIP-Seq reads and a genome sequence as input and reports a list of predicted binding events and the explanatory binding motifs.
It can be downloaded from our web site (http://cgs.csail.mit.edu/gem). For analysis of mammalian genome experiments, GEM requires about 5-15 GB of memory.

4.3 Results

4.3.1 GEM improves the spatial resolution of binding event prediction

We compared GEM’s spatial resolution to six well known ChIP-Seq analysis methods, including GPS (Guo et al., 2010), SISSRs (Jothi et al., 2008), MACS (Zhang et al., 2008), cisGenome (Ji et al., 2008), QuEST (Valouev et al., 2008) and PeakRanger (Feng et al., 2011). We used a human Growth Associated Binding Protein (GABP) ChIP-Seq dataset for our evaluation because GABP ChIP-Seq data were previously reported to contain homotypic events where the reads generated by multiple closely spaced binding events overlap (Valouev et al., 2008). Thus the GABP dataset offers the opportunity to test whether integrating motif information and binding event prediction improves our ability to deconvolve closely spaced binding events with greater accuracy. We also evaluated the methods using ChIP-Seq data (Chen et al., 2008) from the insulator binding factor CTCF (CCCTC-binding factor), as it binds to a more informative motif than GABP. These two factors are representative of relatively easy (CTCF) and difficult (GABP) cases for ChIP-Seq data analysis. They are also used by other studies as benchmarks, allowing for the direct evaluation of our results. GEM performance on other factors may vary.

We found that GEM has the best spatial resolution among the tested methods. Spatial resolution is defined as the average absolute difference between the computationally predicted locations of binding events and the middle of the nearest motif match. Across all observations, spatial resolution is corrected for a fixed offset by subtracting the mean difference before averaging the absolute differences.
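The spatial resolution metric defined above reduces to a mean absolute deviation of the signed offsets. The sketch below illustrates that definition and is not the evaluation script used for the figures. It assumes each predicted event has already been paired with the midpoint of its nearest motif match, and the function and argument names are hypothetical.

```python
from typing import Sequence


def spatial_resolution(predicted: Sequence[int],
                       motif_mid: Sequence[int]) -> float:
    """Mean absolute offset between calls and motif midpoints.

    The mean signed offset is subtracted first, which removes any fixed
    systematic shift between the two sets of positions.
    """
    diffs = [p - m for p, m in zip(predicted, motif_mid)]
    mean_diff = sum(diffs) / len(diffs)
    return sum(abs(d - mean_diff) for d in diffs) / len(diffs)
```

For calls at 105, 203, 298 and 410 paired with motif midpoints at 100, 200, 300 and 400, the signed offsets are 5, 3, -2 and 10 with mean 4, so the corrected spatial resolution is (1 + 1 + 6 + 6) / 4 = 3.5bp.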
To ensure a fair comparison, we used 428 shared GABP binding sites that are predicted by all seven tested methods and which contain an instance of the GABP motif within 100bp. GEM locates the event exactly at the motif position for 56.5% of these events (Figure 4-1A). For a dataset with a stronger consensus motif, ChIP-Seq data from CTCF, GEM locates the event exactly at the motif position for more than 90% of the shared events, significantly improving the spatial accuracy of predicted binding events over other methods (Figure 4-1B). Alternative evaluations with all the binding sites that have a motif at a distance of less than 100bp were also performed for both GABP and CTCF data, and the results were similar to those above (data not shown).

To show that GEM’s binding calls indeed more accurately locate the actual binding sites and do not simply map to the motif positions, we performed an independent analysis without using the motif as the gold standard for evaluation. ChIP-exo is a new experimental method for generating binding data with higher spatial resolution than ChIP-Seq (Rhee and Pugh, 2011). We used human HeLa cell CTCF ChIP-exo binding sites (without using motif information) as an approximation of actual binding sites. These ChIP-exo sites were then used as a gold standard to evaluate the GEM and GPS binding calls of human HeLa cell CTCF ChIP-Seq data. We found that overall GEM binding calls were located closer to the ChIP-exo binding sites than the GPS binding calls (Figure 4-1C), suggesting that GEM binding calls indeed more accurately locate the actual binding sites. Thus, GEM’s joint model of ChIP-Seq read coverage and genome sequence is able to predict the location of binding sites more accurately than other approaches, which do not use motif information in their binding event predictions.
Figure 4-1 GEM improves spatial accuracy in binding event prediction. A) Fraction of predicted GABP binding events with a motif within the given distance following event discovery by GEM, GPS, SISSRs, MACS, cisGenome, QuEST and PeakRanger. Events shown were predicted by all seven methods and had a GABP motif within 100bp. B) Fraction of predicted CTCF binding events with a motif within the given distance following event discovery by GEM, GPS, SISSRs, MACS, cisGenome, QuEST, FindPeaks, spp-wtd and spp-mtc. Events shown were predicted by all nine methods and had a CTCF motif within 100bp. C) Fraction of predicted CTCF binding events with a ChIP-exo site within the given distance following event discovery by GEM and GPS. Events shown were predicted by both methods and had a CTCF ChIP-exo site within 50bp.

4.3.2 GEM is better at resolving closely spaced binding events

GEM is also better at resolving closely spaced binding events (Gotea et al., 2010) in the GABP data than the other methods we tested. For example, GEM uniquely detects two GABP events over proximal GABP motifs that are 32bp apart on chromosome 2 (Figure 4-2A). To evaluate binding deconvolution on a genome-wide scale, we identified 477 candidate clusters of closely spaced binding events. Each candidate cluster was detected as bound by all seven tested methods and contained two or more proximal GABP motifs separated by less than 500bp. GEM identified two or more closely spaced events in 144 of the candidate clusters, significantly more than GPS (108), SISSRs (77), QuEST (77), PeakRanger (36), MACS (4) and cisGenome (5) (Figure 4-2B).

Figure 4-2 GEM is better at resolving closely spaced binding events. A) Example of a predicted binary GABP event that contains coordinately located GABP motifs.
B) Numbers of GABP binding events discovered by GEM, GPS, SISSRs, MACS, cisGenome, QuEST and PeakRanger in 477 regions that contain clustered GABP motifs within 500bp.

4.3.3 GEM improves the spatial resolution of ChIP-exo binding event prediction

We also tested GEM and GPS on the new experimental protocol ChIP-exo. ChIP-exo aims to improve transcription factor binding spatial resolution by extensively digesting ChIP fragments down to the DNA that is protected by the bound protein complex (Rhee and Pugh, 2011). While ChIP-exo experiments provide high-resolution binding information, typical peak-finding methodologies may fail to achieve single-base resolution binding event predictions if they do not account for the properties of the ChIP-exo experiment. An example is provided by the published CTCF ChIP-exo experiment (Rhee and Pugh, 2011), where ChIP-exo reads are bimodally distributed around binding sites on both strands because CTCF is cross-linked at two distinct sites of DNA. The published event predictions did not account for this characteristic distribution, and are thus often offset from CTCF binding motif instances. Since GPS and GEM automatically learn a model of sequence reads around binding events, they may be directly applied to ChIP-exo data without modification.

To test GEM’s ability to automatically adapt to ChIP-exo data, we initialized GEM with a ChIP-Seq empirical read distribution. Results showed that GEM is able to automatically adapt to the read distribution produced by the ChIP-exo protocol. We compared GEM’s final computed read distribution to the expected empirical distribution of ChIP-exo and found that they were consistent (Figure 4-3B). A similar test was done on the yeast Reb1 ChIP-exo data and GEM also automatically adapted to the ChIP-exo distribution (Figure 4-3D).

Figure 4-3 GEM improves the spatial resolution of ChIP-exo data event prediction.
A) Fraction of predicted CTCF binding events with a motif within the given distance following event discovery by GEM, GPS, and the peak-pair midpoint method of Rhee, et al. GEM and GPS analysis of ChIP-exo data are initialized with a ChIP-Seq read distribution. B) GEM automatically adapts to the ChIP-exo read spatial distribution. C) Fraction of predicted Reb1 binding events with a motif within the given distance with event discovery by GEM, GPS, and the peak-pair midpoint method of Rhee, et al. GEM and GPS are initialized with a ChIP-Seq read distribution. D) GEM automatically adapts to the Reb1 ChIP-exo read spatial distribution.

GEM improves upon the spatial resolution of binding event detection over other methods for ChIP-exo data analysis (Figure 4-3A). To investigate the performance of GEM on ChIP-exo data, we compared the binding event predictions of GEM and GPS on ChIP-exo CTCF binding data with those of the “middle of peak-pair” method from the original ChIP-exo study (Rhee and Pugh, 2011), as well as with the predictions of GEM and GPS on ChIP-Seq CTCF binding data from the same cell type. To ensure a fair comparison, we used 4507 shared binding sites that are predicted by all tested methods and that contain a strong CTCF motif match within 100bp of the binding positions. The original ChIP-exo study (Rhee and Pugh, 2011) had 5.4% of the binding event calls centered on the motif match position, 40.3% of the calls within 10bp, and an average spatial resolution of 15.85±15.29bp. Applying GPS to the ChIP-exo data improved the spatial resolution, with 8.8% of calls at 0bp positions, 59.7% of calls within 10bp, and an average spatial resolution of 10.38±11.26bp. Applying GEM to the ChIP-exo data located 76.5% of calls exactly at the motif match positions and 89.7% of calls within 10bp, with an average spatial resolution of 3.35±9.71bp.
In addition, GEM was applied to yeast Reb1 ChIP-exo data and located 95.5% of calls at the motif match positions. These results demonstrate that GEM can significantly improve the spatial accuracy of ChIP-exo binding event predictions.

Interestingly, we found that GEM ChIP-Seq prediction has marginally better spatial resolution than GEM ChIP-exo prediction (Figure 4-3A). This suggests that GEM’s ability to integrate motif information for binding event prediction may compensate for the relatively lower resolution of ChIP-Seq read data and still produce high-resolution binding calls. However, we cannot draw conclusions about the relative performance of the ChIP-Seq and ChIP-exo protocols, because only a single ChIP-exo dataset for a vertebrate transcription factor is publicly available (CTCF), and the CTCF ChIP-Seq experiments we analyzed are not matched controls for the ChIP-exo experiment (i.e. they were performed by different groups under different laboratory conditions). We also note that GPS and GEM may not yet be fully optimized for ChIP-exo analysis (see Discussion). It is therefore difficult to separate the inherent resolution of ChIP-Seq and ChIP-exo experimental data from the relative performance of our methods on these data types.

4.4 Discussion

GEM builds on the probabilistic mixture model framework of GPS to integrate motif information with read coverage for ChIP-Seq binding event prediction. The motif information is modeled as a position-specific prior that biases binding events towards motif positions and thus improves the spatial resolution of event predictions. In doing so, GEM offers a more principled approach than simply snapping binding event predictions to the closest instance of the motif, and indeed, GEM does not require that all binding events are associated with strong motifs.
GEM achieves exceptional spatial resolution and improves the deconvolution of closely spaced binding events, underscoring the value of integrating motif information as a positional prior into binding event prediction.

An important issue for de novo motif discovery is the quality of the input sequences, as is evident from the improvement gained by moving from promoters of co-expressed genes to ChIP-enriched regions. An algorithm has been developed to improve the signal-to-noise ratio by determining the threshold that partitions the data into target and background sets (Eden et al., 2007). GEM’s significant improvement in spatial resolution further facilitates motif discovery by locating more binding regions with motif instances, thus increasing the motif signal-to-noise ratio.

The manner of integrating motif information as position-specific counts is generalizable. Either k-mer set motif hit counts or PWM scores can be integrated as a positional prior into the model. In addition, other position-specific information, e.g. phastCons scores for cross-species sequence conservation (Siepel et al., 2005), may be integrated in a similar fashion.

In the current implementation of GEM, KMAC is used to find the k-mer set motif. Although the performance of KMAC motif discovery and the predictive power of the k-mer set motif model have been shown to be superior (Chapter 3), GEM can be easily modified to use motif priors generated by other motif discovery methods. It is important to note that GEM’s performance depends on the correct identification of the motif prior information. In any case, it is good practice to verify that the discovered motif is indeed biologically relevant.

GEM can also be applied to ChIP-exo data without modification and further improves the spatial resolution of binding prediction. GEM’s successful automatic adaptation to the ChIP-exo read distribution suggests that it is flexible enough to be applied to new types of data.
GPS/GEM uses a single read distribution model for all binding events. This may be a limitation because different types of binding events may be associated with different types of read distributions in ChIP-exo experiments (e.g. because of different patterns of exonuclease protection associated with recruitment of different cofactor complexes). Using a single read distribution to analyze ChIP-exo data may therefore hurt the relative spatial resolution performance of our methods on ChIP-exo data. Thus it would be useful to account for such potential features of ChIP-exo datasets in the next generation of GPS/GEM models.

4.5 Methods

4.5.1 Datasets

The mouse ES cell CTCF (Chen et al., 2008) and human Jurkat cell GABP (Valouev et al., 2008) ChIP-Seq binding datasets are described in subsection 2.5.1. ChIP-exo (Rhee and Pugh, 2011) data were provided by Ho Sung Rhee and B. Franklin Pugh. The human HeLa-S3 cell CTCF ChIP-Seq dataset was generated by the Crawford Lab and was downloaded from the ENCODE project website.

4.5.2 Evaluating spatial resolution of ChIP-Seq event calls

The genome-wide spatial resolution performance of ChIP-Seq event calls is evaluated as follows. We define effective spatial resolution as the average absolute value of the distance between the genome coordinates of predicted binding events and the middle of the corresponding high-scoring binding motif hit. Because the center of the motif hit may not represent the true center of a binding event, the offsets to the motif were centered by subtracting the mean offset. We compare spatial resolution on the “matched” set of predictions that are called by all the methods and correspond to the same high-scoring binding motif. Only those events within 100bp of a motif match are included in the calculation. An alternative evaluation with all the events that have a motif at a distance less than 100bp is also performed.
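The effective spatial resolution defined above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the code used in this thesis: the function name `effective_spatial_resolution`, its argument names, and the inclusive 100bp bound are hypothetical, and events are assumed to be already paired with their matched motif midpoints on the same chromosome.

```python
from statistics import mean


def effective_spatial_resolution(event_positions, motif_midpoints, max_distance=100):
    """Mean absolute, mean-centered offset between events and motif midpoints.

    event_positions and motif_midpoints are parallel lists of genome
    coordinates (hypothetical input layout, assumed for this sketch).
    """
    # Signed offset of each predicted event from its matched motif midpoint.
    offsets = [e - m for e, m in zip(event_positions, motif_midpoints)]
    # Keep only events within max_distance of the motif match
    # (treating the bound as inclusive is an assumption).
    offsets = [o for o in offsets if abs(o) <= max_distance]
    if not offsets:
        raise ValueError("no events within max_distance of a motif match")
    # Center the offsets: the motif midpoint may not be the true event center.
    center = mean(offsets)
    # Average absolute deviation from the centered position.
    return mean([abs(o - center) for o in offsets])


if __name__ == "__main__":
    events = [105, 195, 300, 1000]
    motifs = [100, 200, 298, 500]
    print(effective_spatial_resolution(events, motifs))
```

In this toy input the last event lies 500bp from its motif and is excluded, so the statistic is computed from the three remaining offsets (5, -5 and 2) after subtracting their mean.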
For Figure 4-1C, we used the GPS binding calls from human HeLa cell CTCF ChIP-exo data (without using motif information) as an approximation of the actual binding sites.

4.5.3 Evaluating performance in deconvolving proximal binding events using GABP ChIP-Seq data

The genome-wide performance of proximal event discovery in ChIP-Seq data is evaluated as follows. For the GABP dataset, we compared GEM against six other methods (GPS, SISSRs, MACS, cisGenome, QuEST and PeakRanger) genome-wide. We define a set of candidate sites that have at least one event detected by each of the seven methods, and that contain two or more GABP motifs separated by less than 500bp. We discovered 477 such sites. For each of these sites, we count the number of events discovered by the different methods. The GABP motif was retrieved from the TRANSFAC database (M00341) (Matys et al., 2003). A motif score threshold of 9.9, which is 60% of the maximum PWM score, is used in this analysis.

4.5.4 Analysis of ChIP-exo data

To test GEM’s ability to automatically adapt to ChIP-exo data, we initialized GEM with a generic ChIP-Seq empirical read distribution, and ran GEM for one extra round (phases 4 and 5) so that GEM could use more accurately positioned events to refine the read distribution and use it for the final prediction. In practice, the user can directly initialize GEM with a ChIP-exo empirical read distribution (provided with the GEM software) and apply GEM in the same way as for ChIP-Seq data.

Chapter 5 Transcription factor spatial binding constraints

Part of the material presented in this chapter was adapted from the following publication: Yuchun Guo, Shaun Mahony, David K Gifford (2012). High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints, PLOS Computational Biology, 8(8): e1002638.
Collaborations: Y.G., S.M. and D.K.G. conceived the project. Y.G. designed the computational model and implemented the algorithm. Y.G., S.M. and D.K.G. analyzed the data. Y.G., S.M. and D.K.G. wrote the manuscript.

5.1 Introduction

Genomic sequences facilitate both cooperative and competitive regulatory factor-factor interactions that implement cellular transcriptional regulatory logic. The functional syntax of DNA motifs in regulatory elements is thus an essential component of cellular regulatory control. Appropriately spaced motifs can facilitate cooperative homo-dimeric or hetero-dimeric factor binding, while overlapping motifs can implement competitive binding by steric hindrance. Cooperative and competitive binding are integral parts of complex cellular regulatory logic functions (Wolberger, 1999; Ponticos et al., 2004). The notion of grammar (Levine, 2010; Swanson et al., 2011) has been used to refer to the phenomenon that the spacing and arrangement of binding sites matter for the activity of an enhancer, just as the order of words in a sentence can affect its meaning.

The binding of regulatory proteins to the genome cannot at present be predicted from primary DNA sequence alone, as chromatin structure, co-factors, and other mechanisms make the prediction of in vivo binding from sequence empirically unreliable (Farnham, 2009). Thus it is not possible to use primary DNA sequence to determine the aspects of genome syntax that are employed in vivo.

To discover novel pair-wise factor spatial binding constraints in vivo, we developed GEM to improve the spatial resolution of binding event predictions (Chapter 4). GEM’s unbiased computational approach has enabled us to discover novel binding constraints between transcription factors from sequenced ChIP experiments.
These spatial constraints directly suggest biological regulatory mechanisms that will be useful in future studies. SpaMo studied motif spacing using ChIP-Seq events to infer transcription factor complexes, but the predicted motif spacing does not necessarily indicate in vivo binding in the specific cellular conditions (Whitington et al., 2011). Here we review our GEM-derived results, discuss these results in the context of current data production projects, and detail our methods.

5.2 Spatial binding constraints discovery

To study the in vivo binding spatial relationship between a pair of transcription factors A and B in a certain cell type and condition, we apply GEM independently to ChIP-Seq data from A and B to predict the respective binding sites. To compute the distribution of the spacing of A relative to B, we compute the offsets of A binding sites from B binding sites within a 201bp window. We choose this window size because we expect most of the observable binding constraints to be within this range. The sequence strand of the binding predictions is oriented using the B motif when a match to the motif is present, and B is placed in the middle of the window. The occurrences of A at each offset position are summed over all the B sites to produce the empirical spatial distribution.

In this study, we evaluate three different methods to call binding sites: GEM binding calls, GPS binding calls, and GPS binding calls that are snapped to a motif within 50bp if one is present. Another motif distance for snapping binding calls, 100bp, was also tested and the result was very similar to that for the 50bp distance.

To determine if a specific spacing is significant, we compute the p-value of the number of occurrences of factor A at that offset position using a Poisson test.
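The offset-counting step described above, in which occurrences of A are summed at each offset around oriented B sites, can be sketched in Python as follows. This is an illustrative sketch only, not the GEM code: the function name `spacing_histogram`, the tuple layouts for sites, and the decision to leave B sites without a motif match unflipped are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict


def spacing_histogram(a_sites, b_sites, half_window=100):
    """Count A-site offsets relative to B sites in a (2*half_window+1)bp window.

    a_sites: iterable of (chrom, position) for factor A (assumed layout).
    b_sites: iterable of (chrom, position, strand) for factor B, where strand
             is '+', '-' or None when no B motif match orients the site.
    """
    # Index A positions per chromosome so each B site is a fast range lookup.
    a_by_chrom = defaultdict(list)
    for chrom, pos in a_sites:
        a_by_chrom[chrom].append(pos)
    for positions in a_by_chrom.values():
        positions.sort()

    counts = Counter()
    for chrom, b_pos, strand in b_sites:
        positions = a_by_chrom.get(chrom, [])
        lo = bisect_left(positions, b_pos - half_window)
        hi = bisect_right(positions, b_pos + half_window)
        for a_pos in positions[lo:hi]:
            offset = a_pos - b_pos
            # Orient by the B motif strand: flip offsets for minus-strand motifs.
            if strand == "-":
                offset = -offset
            counts[offset] += 1
    return counts


if __name__ == "__main__":
    a = [("chr1", 94), ("chr1", 506), ("chr2", 300)]
    b = [("chr1", 100, "+"), ("chr1", 500, "-"), ("chr2", 10, None)]
    print(dict(spacing_histogram(a, b)))
```

With `half_window=100` the window covers offsets -100 to +100, i.e. the 201bp window used in the text, with B at position 0.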
The parameter of the Poisson distribution is set as the mean number of A site occurrences across all the positions in the [-400bp, -200bp] and [200bp, 400bp] windows, assuming there are no significant spatial binding constraints in these windows. The p-value is corrected for multiple hypothesis testing using Bonferroni correction, by multiplying the p-value by the number of positions in the window and the total number of pair-wise tests across all cell types. The significance threshold for the corrected p-value is 1e-8. Because bound sequences cannot be oriented consistently when comparing multiple factor pairs, we report the absolute distance between the most significant interacting factor pairs in the pairwise spatial constraint matrices.

5.3 Results

5.3.1 GEM reveals known Sox2-Oct4 distance-constrained transcription factor binding distances

We examined if GEM could detect pairs of transcription factors that bind to the genome with characteristic pair-wise spacing, beginning with the well-known heterodimeric pair Sox2-Oct4 (Chew et al., 2005). In general, distance-constrained transcription factor binding cannot be predicted based solely on sequence motifs, as motif presence does not guarantee binding. Such spatial binding constraints may be caused by combinatorial binding, alternative binding, binding that is orchestrated by multimeric protein complexes, or the spread of constrained enhancer syntax.

We were able to discover Sox2-Oct4 transcription factor spatial binding constraints by combining GEM binding calls from Sox2 and Oct4 ChIP-Seq data. We applied GEM independently to mouse ES cell Sox2 and Oct4 ChIP-Seq data (Chen et al., 2008) to call the respective binding sites, and then computed the distances of Oct4 sites from Sox2 sites within a 201bp window.
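To make the significance test of Section 5.2 concrete, the following Python sketch computes a Poisson upper-tail p-value for the count at each offset, using the flanking-window mean as the Poisson parameter, and applies the Bonferroni correction described there. It is a hedged illustration rather than the thesis implementation: the function names, the dictionary and list input layouts, and the use of a one-sided upper tail P(X >= k) are assumptions.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed with the standard library only."""
    if k <= 0:
        return 1.0
    if lam <= 0.0:
        return 0.0

    def pmf(i):
        return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))

    if k <= lam:
        # Tail is large here, so subtracting the lower tail is accurate enough.
        return max(0.0, 1.0 - sum(pmf(i) for i in range(k)))
    # For k > lam the terms decrease, so sum the upper tail directly
    # to keep precision for very small p-values.
    total, i, term = 0.0, k, pmf(k)
    while term > 0.0 and term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(1.0, total)


def significant_offsets(window_counts, flank_counts, n_pair_tests, threshold=1e-8):
    """Return {offset: corrected p-value} for offsets passing the threshold.

    window_counts: dict mapping offset -> number of A sites at that offset.
    flank_counts:  per-position counts from the flanking background windows.
    n_pair_tests:  total number of pair-wise tests (assumed supplied by caller).
    """
    lam = sum(flank_counts) / len(flank_counts)
    # Bonferroni factor: positions in the window times the number of pair tests.
    factor = len(window_counts) * n_pair_tests
    hits = {}
    for offset, count in sorted(window_counts.items()):
        corrected = min(1.0, poisson_upper_tail(count, lam) * factor)
        if corrected < threshold:
            hits[offset] = corrected
    return hits
```

For example, with a background mean of 2 occurrences per position, a 201-position window and 100 pair-wise tests, an offset with 40 occurrences passes the 1e-8 threshold while offsets at the background level do not.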
The sequence strand of the GEM binding predictions is oriented using the Sox2 motif when a match to the motif is present. As expected, GEM-predicted Oct4 binding sites are predominantly (630 sites out of 2525 in the 201bp window) located at the -6bp position relative to GEM-predicted Sox2 sites (Figure 5-1A). However, this spacing cannot be observed from the binding calls of GPS or other event discovery methods alone because of their more limited spatial accuracy (Figure 5-1B). An alternative approach is to snap binding calls to the nearest instance of the transcription factor’s binding motif. We tested this approach using GPS binding calls as the starting points and found that it captures fewer (277 sites out of 2753 in the 201bp window) instances of Oct4-Sox2 spatial binding constraints (Figure 5-1C), presumably because some of the bound motifs do not pass the motif scoring threshold or because some unbound motif instances are located closer to the binding calls than the true motif. We also tested using PWM motifs as the motif prior for GEM. This yielded 476 instances of Oct4-Sox2 spatial binding constraints (Figure 5-1D), fewer than the 630 sites found by GEM with the k-mer set motif (KSM) prior. This is consistent with the finding that the KSM representation is more predictive than the PWM representation (see Subsection 3.4.2). Inspection of the sequences of the 630 Oct4 and Sox2 co-bound regions shows that the Oct4 and Sox2 motifs are located right next to each other (Figure 5-1E).

Figure 5-1 GEM reveals transcription factor spatial binding constraints. A), B), C) and D) Genome-wide spatial distribution of Oct4 binding sites in a 201bp window around Sox2 binding sites, obtained by using GEM binding calls, GPS binding calls, GPS binding calls snapped to the nearest motifs within 50bp, or GEM (with PWM motif prior) binding calls, respectively.
Dashed lines represent the Sox2 binding sites at position 0. E) Color chart representation of 61bp sequences in 630 regions with the 6bp Sox2/Oct4 binding constraint. Each row represents a 61bp bound sequence. Green, blue, yellow and red indicate A, C, G and T. The motif logos are generated by STAMP (Mahony et al., 2007) from the motifs discovered using all the binding sites in the respective datasets.

5.3.2 Enhancer grammar elements deduced from transcription factor binding sites predicted by GEM

We next studied pair-wise binding relationships between 14 sequence-specific transcription factors (Oct4, Sox2, Nanog, Klf4, STAT3, Smad1, Zfx, c-Myc, n-Myc, Esrrb, Nr5a2, Tcfcp2l1, E2f1 and CTCF) and two transcriptional regulators (p300 and Suz12) in mouse ES cells by applying GEM to a large compendium of ChIP-Seq binding data (Chen et al., 2008; Heng et al., 2010). Binding prediction by GEM enables the detection of 37 pairs of statistically significant spatial binding constraints, involving Oct4, Sox2, Nanog, Klf4, Esrrb, Nr5a2, Tcfcp2l1, E2f1, c-Myc, n-Myc and Zfx (Figure 5-2). Interestingly, we found that Klf4, one of the ES cell reprogramming factors, exhibits strong distance-specific binding with many other factors, including Nanog, Sox2, Zfx, c-Myc, n-Myc, E2f1, Esrrb, Nr5a2 and Tcfcp2l1 (Figure 5-3).
[Figure 5-2, panels A and B: 16 × 16 matrices with rows and columns labeled Nanog, Sox2, Oct4, c-Myc, n-Myc, Ctcf, STAT3, Suz12, Zfx, P300, Smad1, E2f1, Klf4, Esrrb, Nr5a2 and Tcfcp2l1. Panel A color scale: 1e-300 to 1e-50; panel B color scale: 20 to 200. Cell values are not recoverable from the extracted text.]

Figure 5-2 Spatial binding constraints detected from mouse ES cells. A) Matrix representation of pairwise spatial binding constraints between factor B (column) and factor A (row) detected from 16 ChIP-Seq datasets in mouse ES cells. The colors represent the significance levels (corrected p-value) of the strongest spacings. The numbers represent the distances between the factors in the strongest spacing. B) The colors and numbers represent the number of positions exhibiting significant spatial binding constraints within the 201bp window around the binding sites of factor B (column).
[Figure 5-3: one panel per factor (Nanog, c-Myc, Ctcf, E2f1, Esrrb, Klf4, n-Myc, Nr5a2, Oct4, P300, Smad1, Sox2, STAT3, Suz12, Tcfcp2l1 and Zfx); x-axis: offset from Klf4 binding site, from -100 to 100. Plotted values are not recoverable from the extracted text.]

Figure 5-3 Spatial relationship between Klf4 and 15 other factors in mouse ES cells. Spatial distribution of 16 mouse ES cell factor binding sites in a 201bp window around Klf4 binding sites. Vertical dash-dot lines represent the Klf4 binding sites at position 0; horizontal dashed lines represent the number of occurrences at a position corresponding to a corrected p-value of 1e-8.

The discovered pair-wise spatial binding constraints reveal complex relationships among the factors. For example, Klf4 exhibits constrained binding with Sox2 but much less significantly with Oct4 (Figure 5-3). However, we did observe strong distance-specific binding between Oct4 and Sox2 (Figure 5-1). This raises the question of whether the detected Klf4-Sox2 and Oct4-Sox2 spatial binding constraints occur in the same genomic regions. We therefore studied all Sox2 bound regions that are co-bound with Klf4. Out of a total of 5609 Sox2 bound regions with a Sox2 motif instance that can be oriented, 123 regions are co-bound by Klf4 at position +25bp (Figure 5-4A). However, only four regions show co-binding of Klf4 at position +25bp and Oct4 at position -6bp. More surprisingly, the distance-constrained Sox2/Klf4 regions are co-bound by 6 ES cell factors within a 70bp window, including Sox2 (at 0bp), Nanog (at 1bp), Klf4 (at 25bp), Esrrb (at 56, 59bp), Nr5a2 (at 55, 58, 61bp) and Tcfcp2l1 (at 66, 69bp).
Inspecting the underlying sequences of these regions, we found that the binding motifs of these factors are embedded at positions consistent with the binding positions (Figure 5-4B). In addition to the consistent spatial arrangement of motifs, these sequences (spanning from -70bp to 100bp) exhibit a high degree of similarity. A subset of the sequences is shifted by 3 bases through insertions/deletions, consistent with the 3bp shift of some of the factor binding positions.

Several lines of evidence suggest that these Klf4-Sox2 distance-constrained regions may be functional in ES cell transcriptional regulation. First, 21 of these regions are shown to interact with a total of 36 other genomic regions by mouse ES cell RNA polymerase II ChIA-PET experiments (Reeder et al., unpublished results). Chromatin Interaction Analysis by Paired-End-Tag sequencing (ChIA-PET) is a recently developed technology for genome-wide investigation of chromatin interactions bound by specific protein factors (Fullwood et al., 2009). One of the Klf4-Sox2 distance-constrained regions, located in the second intron of Tcfcp2l1, shows a strong (22 paired-end reads) long-range chromatin interaction with the Tcfcp2l1 transcriptional start site over a 20kb distance (Figure 5-5) (Reeder et al., unpublished results), and is bound by p300 (Creyghton et al., 2010), a histone acetyltransferase and transcriptional coactivator that predicts tissue-specific enhancers (Visel et al., 2009), suggesting potential roles in regulating the transcription of Tcfcp2l1. Long-range chromatin interactions between some of these 21 regions and the transcriptional start sites of Nanog and other genes are also observed (data not shown). In addition, analyses with p300 and H3K27ac ChIP-Seq data from mouse ES cells (Creyghton et al., 2010) suggest that these Klf4-Sox2 distance-constrained regions may be active enhancer regions (Figure 5-6).
GPS binding calls show that almost all (119 out of 123) of these regions are bound by p300. Read coverage enrichment analysis shows that the large majority (111 out of 123) of these regions are also marked by H3K27ac, a histone modification associated with active enhancers (Creyghton et al., 2010). The enrichment of the H3K27ac mark is computed using a Binomial test (p-value<1e-4) of the read count in a 1001bp window, using a histone H3 ChIP-Seq dataset as a matched control. These results demonstrate that GEM analysis enables detection of coordinated binding of multiple factors that may be functional in ES cell transcriptional regulation.

Figure 5-4 Enhancer grammar elements deduced from mouse ES cell transcription factor binding sites predicted by GEM. A) The binding site distribution of Sox2, Klf4, Nanog, Oct4, Esrrb, Nr5a2 and Tcfcp2l1 in 123 regions that exhibit Sox2-Klf4 spatial binding constraints. The Sox2 sites are aligned at the 0bp positions, and Klf4 sites are at the 25bp positions. The rows are ordered by Esrrb offset positions. B) Color chart representation of 201bp sequences in the same regions as in A. Each row represents a 201bp bound sequence. Green, blue, yellow and red indicate A, C, G and T. The motif logos are generated by STAMP (Mahony et al., 2007) from the motifs discovered using all the binding sites in the respective datasets.

Figure 5-5 A Klf4-Sox2 distance-constrained region interacts with the Tcfcp2l1 transcriptional start site.
The tracks are: chromosome coordinates of the region overlapping the Tcfcp2l1 gene; mouse ES cell RNA polymerase II ChIA-PET interactions (the arcs connect the two ends of the paired-end reads); clusters of the pol2 ChIA-PET interactions (the left and right ends indicate the mean positions of paired-end reads in the cluster, and the number indicates the number of paired-end reads in the cluster); p300 binding ChIP-Seq read profile; RefSeq gene annotation (arrow indicates transcriptional start site); a Klf4-Sox2 distance-constrained region (red rectangle). The ChIA-PET interaction clusters are generated by applying hierarchical clustering to the paired-end reads in the region, using the Chebyshev distance metric with a 4kb distance cutoff. Clusters with fewer than 3 reads are not shown.

Of the 123 regions where Sox2, Klf4, and other sites display constrained spacing, 109 (89%) are annotated instances of the RLTR9 ERVK family of long terminal repeat elements. It is interesting to note that while Bourque et al. found an association between Oct4/Sox2 co-binding sites and other members of the ERVK repeat class (Bourque et al., 2008), we found a set of repetitive elements that encode the binding of Sox2 and other factors without Oct4 in ES cells. Kunarso et al. suggested that transposable elements have rewired the core regulatory network of ES cells (Kunarso et al., 2010). Our analysis found that the repetitive sequences constrain the in vivo binding of a number of key transcription factors in ES cells.

[Figure 5-6: two panels, p300 and H3K27ac, each spanning 1kb on either side of the Sox2 site.]

Figure 5-6 Klf4-Sox2 distance-constrained regions are bound by p300 and marked by H3K27ac. Read profiles and heatmaps of p300 and H3K27ac histone mark read coverage in 123 Klf4-Sox2 distance-constrained regions. Top: read profile; bottom: heatmap of read coverage. The regions span 2kbp around the Sox2 binding sites. The regions are in the same order as in Figure 5-4.
Color shading corresponds to the ChIP-Seq read count in the region.

5.3.3 Spatially constrained human factor binding in ENCODE data

We computed statistically significant pair-wise spatially constrained binding events between 46 transcription factors characterized in 184 ENCODE ChIP-Seq data sets in five different cell lines. Each transcription factor ChIP was processed independently by GEM so that we could assess any differences in observed binding between cell lines and biological replicates. We found that 390 pairs of transcription factors have significant binding distance constraints within 100bp of each other (Figures 5-7 to 5-11). The number of pairs found in each cell line differed, as did the number of transcription factors assayed: K562 (152 pairs/37 TFs), GM12878 (148 pairs/29 TFs), HepG2 (107 pairs/29 TFs), HeLa-S3

[Figure 5-7, panels A and B: K562 matrices with rows and columns labeled by ChIP-Seq dataset (c-Myc, Max, USF1, Egr-1, CTCF, YY1, ELF1, STAT1, GABP, PU.1, NRSF, SRF, GATA1, GATA2, c-Jun, FOS, FOSL1, JunD, JunB and NF-E2, with lab and treatment suffixes). Panel A color scale: 1e-300 to 1e-8; panel B color scale: 20 to 200. Cell values are not recoverable from the extracted text.]

Figure 5-7 Spatial binding constraints detected from human K562 cells. Matrix representation of pairwise spatial binding constraints between factor B (column) and factor A (row) detected from 37 ChIP-Seq datasets in human K562 cells. A) The colors represent the significance levels (corrected p-value) of the strongest spacings. The numbers represent the distances between the factors in the strongest spacings. B) The colors and numbers represent the number of positions exhibiting significant spatial binding constraints within the 201bp window around the binding sites of factor B (column).

[Figure residue: GM12878 matrices, panels A and B, with rows and columns labeled by ChIP-Seq dataset; cell values are not recoverable from the extracted text.]
4 c-My c:C 4 4 3 1 7 PAX5:M-C20 40 9 5 5 1 3 1 37 1 2 1 2 1 2 2 1 46 1 50 119 61 89 56 82 80 114 89 61 59 JunD:S 8 48 1 96 37 44 28 11 15 73 30 6 12 MEF2A:M 164120 96 1 98 161 115 MEF2C:M 73 66 40 99 1 95 70 166 91 45 159 97 1 NRSF:M 2 6 34 101 64 3 EBF1:M 23 3 EBF:M 35 4 NFKB:S 3 NFKB:S-TNFa ELF1:M 1 1 2 GABP:M 115 69 93 70 164 81 153 65 157 71 147104 79 153 98 149 66 60 94 30 112 23 84 3 91 1 81 120 86 85 9 72 76 26 PU.1:M 3 5 1 169 77 11 152 59 155 134 84 70 49 92 22 34 90 6 35 56 67 69 68 45 1 3 157 60 5 157 34 165 80 1 1 1 45 SRF:M-v 1 1 70 139 77 121 17 164 94 101 163 16 101 51 71 1 144 99 121 22 164 102106 161 29 116 61 141 144 1 59 4 73 99 60 1 119 121 7 1 1 4 138 151 2 1 47 79 2 1 54 60 77 110107 135 70 104 86 22 23 10 1 60 1 13 8 48 62 7 1 5 1 92 169 160133 51 74 9 7 70 2 23 1 6 15 30 1 3 CTCF:B 106 69 102 25 96 101 108 1 1 132 99 17 135 2 CTCF:ST 105 69 97 30 99 102 102 7 131 1 93 9 114 3 TCF12:M 35 6 163 58 11 158 79 151 2 ZEB1:M 82 6 62 CTCF:C 104 64 97 2 1 1 87 82 9 44 88 2 7 152 162 159146 76 133 66 1 154 100 95 1 26 110 68 7 14 34 34 108 108 58 6 51 61 2 63 1 104 19 6 25 1 42 4 18 132117 109 36 1 2 1 1 1 85 24 5 36 1 2 70 4 100 120 140 160 180 200 c- M yc : M S U ax: SF S 1 PA c -M : M X5 y c: :M C c- C2 Ju 0 n Ju :S M nD EF : S M 2A EF : M PO 2C U :M 2 N F2: RS M F Y :M SR Y1 F :S SR : M F: -2x Eg M r-1 -v 1 :M EB -2x F1 EB :M F N FK N F :M B: KB S- : S TN EL Fa F ET 1:M S G 1:M AB P PU : M C .1: M TC C F: TC B T C F: S F1 T Z E 2:M B C 1: Eg TC M r-1 F: C :M -v 1 Egr-1:M-v 1 1 60 1 1 1 Egr-1:M-2x 6 82 164 105101 153 62 99 33 156 155150 81 123 69 1 SRF:M-2x ETS1:M 57 71 96 32 40 158 72 63 160 5 64 87 150 157155 108 81 88 94 Y Y 1:S 20 101 2 166 160149 56 134 92 27 157 110101 168 84 101 89 1 51 3 163 68 166 2 c-Jun:S 1 POU2F2:M 52 13 1 Figure 5-8 Spatial binding constraints detected from human GM12878 cells. 
A) Matrix representation of pairwise spatial binding constraints between factor B (column) and factor A (row) detected from 29 ChIP-Seq dataset in human GM12878 cells. The colors represent the significance levels (corrected p-value) of the strongest spacings. The numbers represent the distances between the factors in the strongest spacings. B) The colors and numbers represent the number of positions exhibiting significant spatial binding constraints within the 201bp window around the binding sites of factor B (column). 108 Chapter 5: Transcription factor spatial binding constraints A HepG2 CTCF:C 0 4 1 1 21 0 23 24 7 41 40 5 4 2 32 18 36 4 CTCF:M 4 0 5 3 17 4 19 13 11 62 18 11 7 3 11 14 11 2 CTCF:B 1 5 0 2 7 1 30 25 6 24 23 23 12 12 6 19 6 3 CTCF:ST 1 3 2 0 8 1 22 17 8 26 25 3 1 19 26 13 35 15 0 0 1 20 10 12 BHLHE40:M USF1:M 22 0 c-My c:C ELF1:M 14 8 24 44 9 0 0 0 11 14 4 14 12 13 12 6 6 8 26 7 5 4 1 1 1 0 0 19 15 20 0 5 6 6 5 6 7 7 7 5 19 13 29 12 9 0 2 6 22 15 4 4 5 7 0 0 0 3 14 15 2 0 12 6 2 19 5 30 18 7 3 5 8 0 3 9 9 6 GABP:M 1 5 0 1 HSF1:S-f orskolin 4 18 12 7 1 2 2 1e-250 1e-200 6 0 NRSF:M 0 SREBP1:S 0 12 6 0 11 0 14 6 8 12 5 15 7 13 6 7 14 7 12 6 7 2 2 1 20 6 5 5 19 2 3 19 5 6 6 5 14 7 11 6 41 8 7 0 30 3 9 14 26 13 26 7 0 5 9 11 37 7 7 0 8 8 17 5 5 3 9 13 8 10 0 7 11 18 8 FOXA1:M-SC-101058 41 62 24 28 FOXA1:M-SC-6553 67 36 23 80 FOXA2:M 15 11 23 2 2 SRF:M 0 6 3 2 2 10 3 4 9 0 7 6 0 3 2 2 11 3 3 3 13 26 18 1 3 3 0 1 1 4 4 5 5 5 3 2 1 0 0 3 3 4 4 4 2 2 1 0 0 3 3 4 4 4 2 JunD:M 13 0 11 4 3 3 0 0 1 20 6 9 3 3 4 3 3 0 0 6 10 5 15 HNF4A:M 36 9 HNF4G:M 36 3 13 5 4 4 13 11 0 0 0 3 4 3 5 4 4 20 10 0 0 0 3 6 9 13 5 4 4 6 5 0 0 0 3 15 4 5 0 26 3 2 2 8 2 3 3 3 0 2 TC F: C TC C TC F: M 6 RXRA:M C HNF4A:S-f orskolin 2 2 C F: B T BH C F LH : ST E4 0 U :M SF 1: M cM yc :C EL F1 H SF G :M 1: AB SP: fo rs M ko N lin R SR SF : EB M P1 ER :S R S C A: S R F EB -fo : M F O PB rs XA :S- koli 1: f or n F O M -S sk o XA C- lin 1: 101 M -S 058 C6 F O 553 XA F O 2:M SL 2 J u :M nD H 
NF : H 4A NF M 4 :S -f o A:M rs k H olin NF 4G R :M XR A: M 5 FOSL2:M 5 2 5 ERRA:S-f orskolin CEBPB:S-f orskolin C 4 1e-300 109 1e-150 1e-100 1e-50 1e-8 Chapter 5: Transcription factor spatial binding constraints B HepG2 CTCF:C 12 148 162 125 17 105 128 36 85 88 82 138 81 79 54 72 8 8 CTCF:M 144 26 168 130 22 102 131 40 86 94 87 127 83 81 79 77 8 20 CTCF:B 166 168 14 105 124 24 85 91 78 139 82 70 39 68 2 6 21 77 63 58 123 72 66 30 50 1 1 2 5 9 1 87 93 110 70 50 56 1 CTCF:ST 125 130 150 149 1 9 97 111 1 4 3 10 4 3 48 65 c-Myc:C 112 103 106 97 3 46 1 120 14 BHLHE40:M USF1:M 21 17 10 ELF1:M 128 127 125 110 GABP:M 5 61 122 1 64 6 66 1 10 HSF1:S-forskolin 1 12 2 2 80 63 66 3 38 112 130 120 130 120 118 146 132 121 123 77 130 152 140 151 139 140 162 123 132 124 30 5 2 1 NRSF:M 67 49 3 1 54 71 60 77 1 1 1 52 33 5 7 5 8 7 3 70 77 87 67 65 42 1 29 37 9 2 20 40 60 1 80 1 SREBP1:S 1 SRF:M 39 36 22 18 12 79 26 CEBPB:S-forskolin 87 86 83 78 67 106 128 1 FOXA1:M-SC-101058 90 95 89 52 1 82 134 152 FOXA1:M-SC-6553 82 88 78 62 1 67 127 133 FOXA2:M 143 137 138 120 1 69 127 154 FOSL2:M 83 81 84 77 5 90 120 141 59 6 69 121 178 181 174 JunD:M 78 79 68 65 2 90 118 140 36 5 78 127 184 179 167 110 HNF4A:M 60 78 31 27 107 143 163 8 83 28 166 199 201 199 160 163 HNF4A:S-forskolin 75 44 70 131 129 10 54 37 154 191 185 182 139 140 125 49 120 132 4 64 16 149 194 194 192 137 136 129 108 43 2 ERRA:S-forskolin 39 1 1 8 2 RXRA:M 7 16 3 C TC 2 2 1 57 1 1 4 70 1 178 59 5 6 80 48 128 116 179 179 184 124 130 167 155 154 143 1 179 123 1 122 118 178 182 3 183 116 108 190 193 196 111 180 175 1 184 192 197 175 166 198 184 189 194 1 120 140 109 163 137 138 161 1 162 144 136 160 1 160 126 132 187 1 107 173 1 148 198 193 191 158 161 186 170 171 180 169 1 200 TC F: C 63 4 1 52 100 F C :M TC C F:B T BH CF: LH S T E 40 U :M SF 1 c - :M M yc : EL C F1 H SF GA :M BP 1: S -fo :M rs ko N l in R SF SR : E M BP 1: ER S R A SR : C F: E S FO B P -for M X B: s ko A1 S l :M -fo in FO -S rs k o C l X A1 -10 in 10 :M 58 -S 
C -6 FO 55 X 3 A2 FO :M S L2 : Ju M n H D N : F 4 HN M A: F4 S- A: M fo rs k H ol in N F4 G :M R XR A :M 79 HNF4G:M C 1 Figure 5-9 Spatial binding constraints detected from human HepG2 cells A) Matrix representation of pairwise spatial binding constraints between factor B (column) and factor A (row) detected from 29 ChIP-Seq dataset in human HepG2 cells. The colors represent the significance levels (corrected p-value) of the strongest spacings. The numbers represent the distances between the factors in the strongest spacings. B) The colors and numbers represent the number of positions exhibiting significant spatial binding constraints within the 201bp window around the binding sites of factor B (column). 110 Chapter 5: Transcription factor spatial binding constraints A HeLa-S3 CTCF:B 0 1 2 3 4 CTCF:C 1 0 1 2 28 CTCF:ST 2 1 0 1 1 27 1 0 0 4 13 3 1 0 0 4 0 4 4 0 4 0 0 4 0 AP-2alpha:S 3 AP-2gamma:S 2 GABP:M 2 4 STAT1:S-IFNg30 28 24 NRSF:M 1 4 c-Fos:S 1 0 8 13 2 7 1 1 8 1 11 4 9 16 0 2 3 3 6 8 3 3 3 20 13 5 5 5 15 12 11 9 10 0 16 12 12 8 1 4 0 c-Jun:S JunD:S 11 7 19 3 3 9 23 2 0 0 0 6 15 3 3 3 9 9 12 0 0 0 6 15 3 3 3 5 10 12 0 0 0 6 6 6 3 3 3 0 1 3 c-My c:C 1e-250 1e-200 1e-150 1e-100 1 0 24 20 12 0 23 5 5 6 1 0 1 Max:S 7 11 2 10 13 12 6 1 3 3 6 3 1 0 0 0 5 0 0 G AB AT P: 1: M SIF Ng 30 N RS F: M cFo s: S cJu n: S Ju nD :S cM yc :C cM yc :S ST m a: S ha :S AP -2 ga m F: ST AP -2 al p F: C TC C TC C C TC F: B Nrf 1:S 111 N 2 M c-My c:S rf1 :S 0 ax :S 6 1e-300 1e-50 1e-8 Chapter 5: Transcription factor spatial binding constraints B HeLa-S3 CTCF:B 1 153 141 1 56 CTCF:C 151 11 126 2 73 CTCF:ST 144 123 1 2 2 64 2 1 61 1 25 86 2 59 1 1 49 1 1 1 2 23 53 2 1 AP-2alpha:S AP-2gamma:S 1 5 GABP:M STAT1:S-IFNg30 1 49 71 66 NRSF:M 2 2 c-Fos:S 2 120 102 93 109 2 1 117 110 119 3 1 97 95 108 38 91 113 107 104 59 111 127 128 15 12 50 77 73 133 78 138 142 136 7 16 48 38 3 1 c-Jun:S JunD:S 103 1 100 106 20 129 3 1 71 96 69 170 187 45 53 19 79 9 72 1 71 31 157 165 89 113 55 133 15 95 75 1 
68 178 187 75 27 73 1 97 39 1 20 40 60 80 100 93 c-My c:C c-My c:S 1 1 120 140 2 97 117 128 78 139 49 174 154 176 97 1 151 Max:S 114 118 103 114 130 78 136 39 184 165 184 38 147 1 3 2 1 2 1 rf1 N M AB AT P: 1: M SIF Ng 30 N RS F: M cFo s: S cJu n: S Ju nD :S cM yc :C cM yc :S 200 G m a: S 180 ST AP -2 ga m T ha :S F: S al p AP -2 F: C TC C TC C C TC F: B Nrf 1:S 160 :S 110 ax :S 96 Figure 5-10 Spatial binding constraints detected from human HeLa-S3 cells. A) Matrix representation of pairwise spatial binding constraints between factor B (column) and factor A (row) detected from 15 ChIP-Seq dataset in human HeLa-S3 cells. The colors represent the significance levels (corrected p-value) of the strongest spacings. The numbers represent the distances between the factors in the strongest spacings. B) The colors and numbers represent the number of positions exhibiting significant spatial binding constraints within the 201bp window around the binding sites of factor B (column). 112 Chapter 5: Transcription factor spatial binding constraints A H1 NRSF:M 0 0 7 8 3 4 NRSF:M-v2 0 0 7 8 3 4 JunD:Myers 7 7 0 20 1 10 POU5F1:M 4 0 SRF:M 5 RXRA:M 7 USF1:M 4 5 7 4 2 10 1 47 7 0 5 2 3 12 5 0 4 7 3 4 CTCF:B 24 24 1 10 2 27 0 5 5 8 CTCF:M 3 10 1 3 7 5 0 0 4 3 5 0 0 5 4 8 8 5 0 3 Egr-1:M 25 YY1:M 5 6 12 5 7 1e-200 1e-150 1e-100 1e-50 1e-8 N R N R SF : SF M : M Ju nD -v2 :M PO ye r U s 5F 1: M SR F: M R XR A : U M SF 1: C M TC F: B C TC F: Eg M r1: M YY 1: M 5 1e-250 6 12 0 1e-300 B H1 NRSF:M 1 1 51 39 38 55 NRSF:M-v2 1 1 51 39 38 55 JunD:Myers 53 53 1 37 POU5F1:M 36 1 SRF:M 15 RXRA:M 25 USF1:M 42 15 21 45 103 102 55 118 1 2 7 11 3 1 3 11 8 1 4 3 82 68 3 57 39 102 2 13 87 2 134 3 182 CTCF:M 38 38 102 12 10 70 134 1 7 178 3 4 7 1 22 21 1 60 61 120 28 8 7 53 183 179 160 180 200 N N R R SF : SF M : M Ju nD -v2 :M PO ye r U s 5F 1: M SR F: M R XR A : U M SF 1: C M TC F: B C TC F: Eg M r1: M YY 1: M YY1:M 61 140 30 CTCF:B 39 Egr-1:M 120 Figure 5-11 Spatial binding constraints detected from 
human H1-hESC cells. A) Matrix representation of pairwise spatial binding constraints between factor B (column) and factor A (row) detected from 11 ChIP-Seq datasets in human H1-hESC cells. The colors represent the significance levels (corrected p-value) of the strongest spacings. The numbers represent the distances between the factors in the strongest spacings. B) The colors and numbers represent the number of positions exhibiting significant spatial binding constraints within the 201bp window around the binding sites of factor B (column).

(48 pairs/15 TFs), and H1-hESC (23 pairs/11 TFs). Certain factor pairs exhibited a single highly significant binding spacing offset within 100bp, such as the 4bp distance between Egr1 and CTCF in K562 cells (Figure 5-7). Other factor pairs exhibited a large number of significant offsets, such as the 167 significant spacings between JunD and Max, with the most significant at 4bp (Figure 5-7). Our analysis confirmed the known interaction pairs MYC-MAX (Blackwood and Eisenman, 1991), the FOS-JUN heterodimer (Glover and Harrison, 1995), and CTCF-YY1 (Donohoe et al., 2007) (Figure 5-7).

Novel genome-wide spatial binding constraints that we observed include c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4α/FOXA1. We find that USF1 often binds 4bp from c-Fos:c-Jun (Figure 5-12A). Inspection of the sequences of the co-bound regions shows a partial overlap of the two motifs (Figure 5-12D). This binding is consistent with Fra1’s facilitation of a complex between USF1 and c-Fos:c-Jun (Pognonec et al., 1997). We find a significant number of cases where CTCF co-binds 4bp from Egr1 (Figure 5-12B), with the Egr1 motif overlapping significantly with half of the CTCF motif (Figure 5-12E). Egr1 promotes terminal myeloid differentiation in the presence of deregulated c-Myc expression, and Egr1 has been implicated in down-regulating c-Myc in conjunction with CTCF (Hoffman et al., 2002).
In addition, the co-binding of CTCF and Egr1 at the EPO regulatory region has been suggested (Yamaguchi et al., 1994). FOXA1 binds at a large number of significant positions close to HNF4α (4215 regions in total with a spacing within 30bp, Figure 5-12C and Figure 5-12F), and there are also significant binding constraints between each of HNF4α and HNF4γ and each of FOXA1 and FOXA2 in HepG2 cells (Figure 5-9). While co-binding of HNF4α/FOXA2 has been reported (Wallerman et al., 2009), co-binding of HNF4α/FOXA1, HNF4γ/FOXA1 and HNF4γ/FOXA2 is not known. We note that HNF4α together with any one of FOXA1, FOXA2, or FOXA3 is sufficient to reprogram cells towards a hepatocytic fate (Sekiya and Suzuki, 2011).

Figure 5-12 Examples of transcription factor spatial binding constraints detected from GEM analysis in ENCODE ChIP-Seq data. A) Genome-wide spatial distribution of USF1 binding sites in a 201bp window around c-Jun binding sites. B) Egr1 binding sites around CTCF binding sites. C) FOXA1 binding sites around HNF4α binding sites. For panels A-C, vertical dashed lines represent the centered factor binding sites at position 0; horizontal dashed lines represent the number of occurrences at a position corresponding to a corrected p-value of 1e-8. D) Color chart representation of 61bp sequences in 259 regions with the 4bp c-Fos:c-Jun/USF1 binding constraint. E) Color chart representation of 100bp sequences in 315 regions with the 4bp CTCF/Egr1 binding constraint. F) Color chart representation of 71bp sequences in 4215 regions with a wide range of HNF4α/FOXA1 binding constraints within 30bp of HNF4α binding sites. For panels D-F, each row represents a bound sequence. Green, blue, yellow and red indicate A, C, G and T. For panel F, the rows are ordered by the FOXA1 offset positions. The motif logos are generated by STAMP (Mahony et al., 2007) from the motifs discovered using all the binding sites in the respective datasets.
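The spacing analysis summarized in Figure 5-12 (counting how often factor A binds at each offset within the 201bp window around factor B sites, then asking which offsets are over-represented) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from this section: positions are treated as lying on a single chromosome with strand ignored, the null model is a uniform Poisson rate across the window, the correction is Bonferroni over the 201 positions, and the function names `offset_histogram`, `poisson_upper_tail` and `significant_offsets` are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right
from collections import Counter


def offset_histogram(sites_a, sites_b, half_window=100):
    """Count offsets (A position minus B position) for A sites near each B site."""
    sorted_a = sorted(sites_a)
    counts = Counter()
    for b in sites_b:
        lo = bisect_left(sorted_a, b - half_window)
        hi = bisect_right(sorted_a, b + half_window)
        for a in sorted_a[lo:hi]:
            counts[a - b] += 1
    return counts


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term from k upward."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > total * 1e-16:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def significant_offsets(counts, half_window=100, alpha=1e-8):
    """Return {offset: (count, corrected p-value)} for offsets passing alpha."""
    n_positions = 2 * half_window + 1
    # Assumed null: occurrences spread uniformly over the window.
    lam = sum(counts.values()) / n_positions
    hits = {}
    for offset, k in counts.items():
        # Assumed Bonferroni correction over all positions in the window.
        p = min(1.0, poisson_upper_tail(k, lam) * n_positions)
        if p <= alpha:
            hits[offset] = (k, p)
    return hits


# Toy data: 200 A sites lie 4bp from a B site, 10 lie at -37bp.
toy_b = [1000 * i for i in range(1, 301)]
toy_a = [b + 4 for b in toy_b[:200]] + [b - 37 for b in toy_b[200:210]]
toy_hits = significant_offsets(offset_histogram(toy_a, toy_b))
```

In this toy example only the 4bp offset passes the threshold; a real analysis would additionally need per-chromosome handling and strand orientation relative to the motif of factor B.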
5.4 Discussion

Collectively, our results demonstrate that it is now possible to reveal aspects of functional genome syntax by surveying in vivo binding relationships between transcription factors at high spatial resolution. We show that TF binding constraints can be strict (e.g. Oct4/Sox2, c-Fos:c-Jun/USF1, CTCF/Egr1, etc.) or constrained but flexible (e.g. HNF4α/FOXA1). We also discovered 123 examples of enhancer grammar elements that capture the complex binding relationship among 6 ES cell TFs in a 70bp window.

Our analysis has been made possible by sequenced ChIP data and a new computational method, GEM, which provides exceptional spatial resolution. GEM makes binding predictions and observes spatial constraints by discovering significant events utilizing both motifs and read coverage information.

Prior work has documented specific genomic regions extensively targeted by multiple transcription factors (TFs) (Chen et al., 2008). However, we have shown that the functional syntax of DNA motifs in regulatory elements cannot be fully elaborated with the imprecise ChIP-Seq event calls provided by previous methods. Motif analysis approaches such as SpaMo discover enriched motif spacings by scanning a list of known motifs in sequences anchored by ChIP-Seq data of a single factor (Whitington et al., 2011). Since the existence of motif instances does not guarantee condition-specific in vivo binding, SpaMo cannot confidently determine the spacing between binding events and the factors involved, especially for motifs that are shared by a family of TFs. Furthermore, SpaMo excludes repetitive sequences (Whitington et al., 2011). In contrast, GEM predicts binding based on uniquely mapped reads and is able to detect spatial binding constraints in transposable elements. Such elements have been implicated in rewiring the core regulatory network of human and mouse ES cells (Kunarso et al., 2010).
We expect that the genome grammatical rules suggested here will be examined in further studies to elucidate mechanisms of transcriptional control and potential protein-protein interactions that have regulatory consequences. These spatial binding constraints provide starting points to test the enhanceosome model and the billboard model of binding site arrangement in enhancers (Thanos and Maniatis, 1995; Arnosti and Kulkarni, 2005). Exploration of other genome grammatical constructs can be accomplished with further ChIP experiments and GEM.

Chapter 6: Conclusions

6.1 Summary and contributions

The focus of this thesis research has been characterizing the interactions between transcription factors and their binding sites in regulatory DNA sequences at high spatial resolution, and using this characterization to reveal genomic grammars that may underlie the combinatorial control of gene regulation. In this thesis, I developed three computational methods to learn genome-wide transcription factor binding events, in vivo binding preferences and binding constraints from ChIP-Seq profiling of protein-DNA interactions. I will summarize the results presented in previous chapters and outline the main contributions of this thesis.

6.1.1 Genome Positioning System (GPS)

The Genome Positioning System (GPS) algorithm is a model-based computational method to predict ChIP-Seq binding events with high spatial resolution. In order to address the challenges of ChIP-Seq analysis, GPS explicitly models random fragmentation of ChIP DNA and the mixing of closely spaced events using a novel probabilistic mixture model. Compared to other published ChIP-Seq analysis methods, GPS improves the spatial resolution of binding event predictions and resolves more proximal binding events.
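The idea of a mixture of binding events that share one read profile, fitted under a sparse prior, can be sketched as follows. This is a simplified illustration under stated assumptions, not the GPS implementation: the triangular profile is a hypothetical stand-in for the empirical read distribution, strand information is ignored, and the thresholded M-step (subtracting `alpha` reads from each component and clipping at zero) is one way to realize a sparse prior whose parameter is expressed in read counts. The names `triangular_profile` and `em_sparse_mixture` are made up for this example.

```python
import numpy as np


def triangular_profile(half_width=100):
    """Hypothetical stand-in for the empirical read distribution around an event."""
    weights = half_width + 1 - np.abs(np.arange(-half_width, half_width + 1))
    return weights / weights.sum()


def em_sparse_mixture(read_pos, candidates, profile, alpha=3.0, n_iter=100):
    """Estimate mixing proportions of candidate events from read positions."""
    read_pos = np.asarray(read_pos, dtype=int)
    candidates = np.asarray(candidates, dtype=int)
    half_width = (len(profile) - 1) // 2
    offsets = read_pos[:, None] - candidates[None, :]   # reads x events
    inside = np.abs(offsets) <= half_width
    index = np.clip(offsets + half_width, 0, 2 * half_width)
    lik = np.where(inside, profile[index], 0.0)          # P(read | event)
    pi = np.full(len(candidates), 1.0 / len(candidates))
    for _ in range(n_iter):
        # E-step: responsibility of each event for each read.
        joint = lik * pi
        norm = joint.sum(axis=1, keepdims=True)
        resp = np.divide(joint, norm, out=np.zeros_like(joint), where=norm > 0)
        expected_reads = resp.sum(axis=0)
        # M-step with the assumed sparse prior: events supported by fewer
        # than alpha expected reads are driven to zero weight.
        shrunk = np.maximum(expected_reads - alpha, 0.0)
        if shrunk.sum() == 0:
            return np.zeros_like(pi)
        pi = shrunk / shrunk.sum()
    return pi


# Toy data: 21 reads centered on position 500, two distant candidate events.
toy_reads = 500 + np.arange(-50, 51, 5)
toy_pi = em_sparse_mixture(toy_reads, [500, 2000], triangular_profile(100))
```

In the toy run all weight ends up on the candidate at position 500, because the candidate at 2000 explains no reads. Replacing the triangular profile with a distribution estimated from the data is what would let such a model adapt to differently shaped read distributions.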
The main contributions of this work are:

• A novel generative probabilistic model for the spatial distribution of ChIP-Seq data. To our knowledge, GPS is the first published method to directly model the spatial distribution of ChIP-Seq reads at single base-pair resolution. Previous methods typically used sliding-window or density-smoothing approaches to aggregate the reads and thus are not able to predict binding events with high resolution (Pepke et al., 2009). This modeling framework allows more accurate representation of the data and easy incorporation of position-specific information (as in GEM). Similar approaches may be applied to other data types such as DNase-Seq and RNA-Seq, which also consist of distributions of reads along the genome.

• A novel mixture model with a sparse prior to explicitly model closely spaced binding events. The explicit modeling of closely spaced events allows more accurate quantification of each event in terms of binding strength and binding location. Such accurate quantification may facilitate further study of cooperative proximal events. The use of a sparse prior instead of trying a range of component (event) numbers allows intuitive interpretation of the sparse prior parameter (in terms of read counts) and easy incorporation of position-specific information (as in GEM).

• Explicit modeling of the “peak shape” information. GPS models the “peak shape” with an empirical spatial distribution of reads. This is more accurate than the parametric distributions used by other methods and allows automatic adaptation to new data types that have a very different distribution, such as ChIP-exo data. It is the main contributor to the improved spatial resolution. The use of “peak shape” information also allows principled filtering of false positive predictions that result from artifacts in the ChIP-Seq data.
• A novel method for ChIP-Seq event prediction with high spatial accuracy, specificity, and more accurate quantification of binding strength. This facilitates downstream analysis such as motif discovery or gene expression prediction using binding information.

6.1.2 K-mer set motif representation and discovery

The k-mer set motif (KSM) representation and the k-mer motif alignment and clustering (KMAC) motif discovery method are designed, respectively, to represent and to learn motifs that are enriched in ChIP-Seq bound sequences versus control sequences. Our results showed that the KSM model is more informative and predictive than the PWM model. KMAC discovers motifs using a combined enumerative and alignment-based approach, weighting the motif sites with binding event strength and binding positional information. KMAC outperforms other motif discovery methods, including several ChIP-Seq oriented methods, in recovering known motifs using a large number of diverse ChIP-Seq datasets.

The main contributions of this work are:

• A novel motif representation. The KSM representation overcomes the position-independence assumption of the position weight matrix (PWM) and consensus sequence representations and allows richer and more accurate representation of the motif. The value of this motif representation is also demonstrated in the advantageous performance of KMAC motif discovery. Such a k-mer based representation also facilitates the direct comparison of the in vivo binding specificity with the in vitro binding specificity of the same factor, which is commonly represented as a list of k-mer binding affinities.

• A novel motif discovery method for ChIP-Seq data. The KMAC motif discovery method exploits the advantages of the large number of training examples in ChIP-Seq data. It is able to process large numbers of sequences and utilizes both strong and weak binding events.
It also takes advantage of the higher spatial resolution and more accurate quantification of binding event strength by biasing motif discovery towards the binding positions and more confident events. The value of incorporating these informative features is demonstrated in the improved performance of KMAC motif discovery compared with traditional and ChIP-Seq oriented motif discovery methods.

6.1.3 Genome-wide event finding and motif discovery (GEM)

Genome-wide event finding and motif discovery (GEM) is an integrative model to resolve the location of protein-DNA interactions and discover explanatory DNA sequence motifs. GEM extends the GPS model to incorporate motif information as a position-specific prior to bias binding event prediction. GEM achieves exceptional spatial resolution in locating most binding events exactly on the motif positions, and further improves proximal event deconvolution. GEM can also be directly applied to ChIP-exo data and improves upon existing methods.

The main contributions of this work are:

• A novel integrated model to predict binding events and binding motifs. The motif information is modeled as a position-specific prior to bias the binding event predictions towards motif positions and thus improve the spatial resolution of event predictions. The motif information also helps to more accurately deconvolve closely spaced events. GEM offers a more principled approach than simply snapping binding event predictions to the closest instance of the motif.

• A flexible approach to incorporate position-specific information. GEM integrates position-specific count information as the prior count. It can flexibly incorporate other position-specific information into binding event prediction, for example phastCons scores for cross-species sequence conservation (Siepel et al., 2005).

• Exceptional spatial resolution of binding event prediction.
The exceptional spatial resolution enables the discovery of in vivo TF binding constraints and further improves motif discovery.

• A novel computational method for ChIP-exo data. GEM is able to automatically adapt to the ChIP-exo read distribution and improves upon existing methods for ChIP-exo analysis.

6.1.4 Transcription factor spatial binding constraints

Exploiting GEM’s exceptional spatial resolution, we discovered statistically significant TF binding constraints using GEM binding predictions for a large number of TFs in the same cellular condition. We confirmed that it is possible to discover binding constraints between the Oct4-Sox2 dimers that are impossible to observe with the noisy binding calls from the existing peak callers. This approach also discovered more binding constraints than using motif positions that are closest to GPS binding calls. We found 37 examples of TF binding constraints in mouse ES cells, including strong distance-specific constraints between Klf4, Sox2 and other key regulatory factors. In human ENCODE data, we found 390 examples of spatially constrained pairwise binding, including such novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4α/FOXA1.

The main contributions of this work are:

• A novel approach to discover in vivo TF binding constraints. Our results demonstrate that it is now possible to reveal aspects of functional genome syntax by surveying in vivo binding relationships between transcription factors at high spatial resolution. The binding constraints are found based on confident binding calls. They are therefore more reliable than findings from SpaMo’s motif analysis based on ChIP-Seq data of a single factor (Whitington et al., 2011).

• A large number of TF binding constraints were discovered. We found 37 examples in mouse ES cells and 390 examples in 5 human cell types.
The results confirm the known interaction pairs MYC-MAX (Blackwood and Eisenman, 1991), FOS-JUN (Glover and Harrison, 1995), and CTCF-YY1 (Donohoe et al., 2007). We also discovered novel pairs such as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4α/FOXA1. We expect that these will be examined in further studies to elucidate mechanisms of transcriptional control and potential protein-protein interactions that have regulatory consequences.

• Enhancer grammar elements in mouse ES cells. We found strong binding constraints between Klf4, Sox2 and other ES cell key regulatory factors. A set of 123 distance-constrained regions are co-bound by 6 ES cell factors within a 70bp window. Some of these regions are shown to interact with the TSS of Tcfcp2l1, Nanog, PPARα and other genes in RNA polymerase II ChIA-PET experiments; most of the regions are bound by p300, and the majority of them are marked by H3K27ac, suggesting that they may be active enhancer regions. Most of the 123 distance-constrained regions are annotated as transposable elements. Kunarso et al. suggested that transposable elements have rewired the core regulatory network of ES cells (Kunarso et al., 2010). Our analysis found that the repetitive sequences constrain the in vivo binding of a number of key transcription factors in ES cells.

6.2 Directions for future work

6.2.1 Weighting factor of motif prior in the GEM algorithm

In this thesis, we set μ, the weighting factor of the motif prior, to 0.8 so that the k-mer based prior will not force the model to predict a binding event at a motif position without sufficient read coverage (see Subsection 4.2.4). The current setting produces good results for the ENCODE and mouse ES cell datasets, but may not be optimal for future datasets with higher read coverage or datasets with different read coverage characteristics such as ChIP-exo data. In addition, other types of position-specific prior information, e.g.
sequence conservation scores, may be incorporated. The weighting factor ideally should be adjusted automatically to allow a balanced contribution of different sources of information. One strategy is to run GEM on multiple random subsets of the reads and to select the setting that gives the most consistent results. Another strategy is to evaluate the spatial read distribution of the predicted events to assess the effect of the motif prior. The mappability of reads should be taken into account when evaluating the read distribution of predicted events. An overly strong motif prior weighting may result in predictions of false events that do not have the expected read distribution.

6.2.2 K-mer based comparison of in vivo versus in vitro binding for similar TFs in a family

TFs in large families such as Forkhead box (Fox) or Homeobox (Hox) proteins tend to bind similar sites in vitro yet display diverse functions in vivo, suggesting that specificities are gained from co-factor interactions. A k-mer based comparison between binding motifs learned from in vivo ChIP-Seq data and those from in vitro data such as protein binding microarray (PBM) (Berger et al., 2006), HT-SELEX (Zhao et al., 2009), and bacterial one-hybrid (B1H) (Meng et al., 2005; Meng and Wolfe, 2006), especially for multiple TFs in the same family, will allow interesting sequence specificity features to be tested. For example, k-mers that are differentially bound in vivo versus in vitro, or k-mers that are bound differently by related TFs in vivo or in vitro, can be investigated comprehensively to discover the sequence features that may explain in vivo specificity. ChIP-Seq for similar TFs may be performed with antibodies against epitope-tagged proteins (Cao et al., 2011; Mazzoni et al., 2011). The counts of an in vivo k-mer may need to be properly normalized by the copy number of that k-mer in the genome.
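The normalization mentioned in the last sentence can be illustrated with a small sketch. It is only one possible scheme and rests on assumptions not made in the text: k-mers are counted on the forward strand only, the score is a simple ratio of bound-sequence counts to genomic copy number with a pseudocount, and the function names `count_kmers` and `normalized_kmer_scores` are hypothetical.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count forward-strand k-mers, skipping any that contain non-ACGT letters."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def normalized_kmer_scores(bound_seqs, genome_seqs, k, pseudocount=1.0):
    """Divide each k-mer's count in bound sequences by its genomic copy number."""
    bound = count_kmers(bound_seqs, k)
    genome = count_kmers(genome_seqs, k)
    # The pseudocount is an assumed safeguard for k-mers that are rare in the genome.
    return {kmer: n / (genome[kmer] + pseudocount) for kmer, n in bound.items()}


# Toy example: ACGT occurs twice in the "genome", CGTA and GTAC once each.
toy_scores = normalized_kmer_scores(["ACGTAC"], ["ACGTACGT", "TTTT"], k=4)
```

In the toy example the k-mer ACGT, which has two genomic copies, receives a lower score (1/3) than CGTA or GTAC (1/2 each), even though all three occur once in the bound sequence. A real comparison would likely also merge reverse complements and account for mappability.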
6.2.3 Discovery of binding constraints

In this thesis, I have shown mainly pairwise binding constraints. One case of a more complex pattern of Sox2/Klf4/Nanog/Esrrb/Nr5a2/Tcfcp2l1 co-binding has been found by targeted search. Automatic search strategies will allow a more systematic search for complex patterns that may involve an arbitrary number of TFs. One search strategy might consist of building a binding constraint network using significant pairwise constraints in the same cellular condition, then finding the cliques in the network and performing a targeted search to verify whether the pairwise constraints in a clique occur in the same set of genomic regions.

Another direction related to binding constraints is to build an online database to store ChIP-Seq binding calls from GEM across multiple TFs in multiple conditions. Binding constraints can then be searched and visualized in the desired set of experiments. Existing public datasets, including data from ENCODE (Birney et al., 2007) and modENCODE (Celniker et al., 2009), will provide a large number of experiments to start with. This will become more useful as more ChIP-Seq data are produced.

6.3 Conclusions

In conclusion, I presented three novel computational methods from my thesis research. These methods improved spatial resolution and joint event deconvolution in ChIP-Seq binding event prediction, and improved accuracy in motif representation and discovery. The improved results from these methods allow discovery of in vivo transcription factor spatial binding constraints in both human and mouse cells, as well as improvement in downstream analysis in other research areas. From these results, I showed that a high resolution model for inherently noisy genome-wide high-throughput biological data is feasible.
This has been made possible by modeling every ChIP-Seq read, using a more accurate k-mer based motif representation, and incorporating constraints from the biological experiments. Collectively, the results from my thesis research show that it is possible to reveal aspects of functional genome syntax using a high resolution computational model of ChIP-Seq data.

References

Aho, A.V., and Corasick, M.J. (1975). Efficient string matching: an aid to bibliographic search. Communications of the ACM 18, 333–340.
Arnosti, D.N., and Kulkarni, M.M. (2005). Transcriptional enhancers: Intelligent enhanceosomes or flexible billboards? J. Cell. Biochem. 94, 890–898.
Bailey, T.L. (2011). DREME: Motif discovery in transcription factor ChIP-seq data. Bioinformatics.
Bailey, T.L., Boden, M., Whitington, T., and Machanick, P. (2010). The value of position-specific priors in motif discovery using MEME. BMC Bioinformatics 11, 179.
Bailey, T.L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28–36.
Barash, Y., Bejerano, G., and Friedman, N. (2001). A Simple Hyper-Geometric Approach for Discovering Putative Transcription Factor Binding Sites. In Proceedings of the First International Workshop on Algorithms in Bioinformatics, (London, UK: Springer-Verlag), pp. 278–293.
Barski, A., Cuddapah, S., Cui, K., Roh, T.-Y., Schones, D.E., Wang, Z., Wei, G., Chepelev, I., and Zhao, K. (2007). High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B (Methodological) 57, 289–300.
Benos, P.V., Bulyk, M.L., and Stormo, G.D. (2002). Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 30, 4442–4451.
Berger, M.F., Philippakis, A.A., Qureshi, A.M., He, F.S., Estep, P.W., 3rd, and Bulyk, M.L. (2006). Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24, 1429–1435.
Bernstein, B.E., Stamatoyannopoulos, J.A., Costello, J.F., Ren, B., Milosavljevic, A., Meissner, A., Kellis, M., Marra, M.A., Beaudet, A.L., Ecker, J.R., et al. (2010). The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. 28, 1045–1048.
Bicego, M., Cristani, M., and Murino, V. (2007). Sparseness Achievement in Hidden Markov Models. In Proceedings of the 14th International Conference on Image Analysis and Processing (ICIAP07) (Modena: IEEE Computer Society).
Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigó, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E., et al. (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816.
Bishop, C.M. (2006). Pattern recognition and machine learning (New York: Springer).
Blackwood, E.M., and Eisenman, R.N. (1991). Max: a helix-loop-helix zipper protein that forms a sequence-specific DNA-binding complex with Myc. Science 251, 1211–1217.
Boeva, V., Surdez, D., Guillon, N., Tirode, F., Fejes, A.P., Delattre, O., and Barillot, E. (2010). De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis. Nucleic Acids Res 38, e126.
Bourque, G., Leong, B., Vega, V.B., Chen, X., Lee, Y.L., Srinivasan, K.G., Chew, J.-L., Ruan, Y., Wei, C.-L., Ng, H.H., et al. (2008). Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Res. 18, 1752–1762.
Boyadjiev, S.A., and Jabs, E.W. (2000). Online Mendelian Inheritance in Man (OMIM) as a knowledgebase for human developmental disorders. Clin. Genet. 57, 253–266.
Bulyk, M.L., Johnson, P.L.F., and Church, G.M. (2002). Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 30, 1255–1261.
Cao, A.R., Rabinovich, R., Xu, M., Xu, X., Jin, V.X., and Farnham, P.J. (2011). Genome-wide analysis of transcription factor E2F1 mutant proteins reveals that N- and C-terminal protein interaction domains do not participate in targeting E2F1 to the human genome. J. Biol. Chem. 286, 11985–11996.
Celniker, S.E., Dillon, L.A.L., Gerstein, M.B., Gunsalus, K.C., Henikoff, S., Karpen, G.H., Kellis, M., Lai, E.C., Lieb, J.D., MacAlpine, D.M., et al. (2009). Unlocking the secrets of the genome. Nature 459, 927–930.
Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V.B., Wong, E., Orlov, Y.L., Zhang, W., Jiang, J., et al. (2008). Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106–1117.
Chen, Y., Negre, N., Li, Q., Mieczkowska, J.O., Slattery, M., Liu, T., Zhang, Y., Kim, T.-K., He, H.H., Zieba, J., et al. (2012). Systematic evaluation of factors influencing ChIP-seq fidelity. Nature Methods.
Chew, J.-L., Loh, Y.-H., Zhang, W., Chen, X., Tam, W.-L., Yeap, L.-S., Li, P., Ang, Y.S., Lim, B., Robson, P., et al. (2005). Reciprocal transcriptional regulation of Pou5f1 and Sox2 via the Oct4/Sox2 complex in embryonic stem cells. Mol. Cell. Biol. 25, 6031–6046.
Chung, D., Kuan, P.F., Li, B., Sanalkumar, R., Liang, K., Bresnick, E.H., Dewey, C., and Keleş, S. (2011). Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data. PLoS Comput. Biol. 7, e1002111.
Creyghton, M.P., Cheng, A.W., Welstead, G.G., Kooistra, T., Carey, B.W., Steine, E.J., Hanna, J., Lodato, M.A., Frampton, G.M., Sharp, P.A., et al. (2010). Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc. Natl. Acad. Sci. U.S.A. 107, 21931–21936.
D’haeseleer, P. (2006a). How does DNA sequence motif discovery work? Nat. Biotechnol. 24, 959–961.
D’haeseleer, P. (2006b). What are DNA sequence motifs? Nat. Biotechnol. 24, 423–425.
Dang, C.V. (2012). MYC on the path to cancer. Cell 149, 22–35.
Das, M.K., and Dai, H.-K. (2007). A survey of DNA motif finding algorithms. BMC Bioinformatics 8 Suppl 7, S21.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society 39.
Donohoe, M.E., Zhang, L.-F., Xu, N., Shi, Y., and Lee, J.T. (2007). Identification of a Ctcf cofactor, Yy1, for the X chromosome binary switch. Mol. Cell 25, 43–56.
Eden, E., Lipson, D., Yogev, S., and Yakhini, Z. (2007). Discovering motifs in ranked lists of DNA sequences. PLoS Comput. Biol. 3, e39.
Ernst, J., Kheradpour, P., Mikkelsen, T.S., Shoresh, N., Ward, L.D., Epstein, C.B., Zhang, X., Wang, L., Issner, R., Coyne, M., et al. (2011). Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49.
Farnham, P.J. (2009). Insights from genomic profiling of transcription factors. Nat. Rev. Genet. 10, 605–616.
Feng, X., Grossman, R., and Stein, L. (2011). PeakRanger: a cloud-enabled peak caller for ChIP-seq data. BMC Bioinformatics 12, 139.
Figueiredo, M.A.T., and Jain, A.K. (2002). Unsupervised Learning of Finite Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 4, 381–396.
Fullwood, M.J., Liu, M.H., Pan, Y.F., Liu, J., Xu, H., Mohamed, Y.B., Orlov, Y.L., Velkov, S., Ho, A., Mei, P.H., et al. (2009). An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 462, 58–64.
Furney, S.J., Higgins, D.G., Ouzounis, C.A., and López-Bigas, N. (2006). Structural and functional properties of genes involved in human cancer. BMC Genomics 7, 3.
Galas, D.J., and Schmitz, A. (1978). DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 5, 3157–3170.
Georges, A.B., Benayoun, B.A., Caburet, S., and Veitia, R.A. (2010). Generic binding sites, generic DNA-binding domains: where does specific promoter recognition come from? FASEB J. 24, 346–356.
Glover, J.N., and Harrison, S.C. (1995). Crystal structure of the heterodimeric bZIP transcription factor c-Fos-c-Jun bound to DNA. Nature 373, 257–261.
Gotea, V., Visel, A., Westlund, J.M., Nobrega, M.A., Pennacchio, L.A., and Ovcharenko, I. (2010). Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res 20, 565–577.
Guo, Y., Papachristoudis, G., Altshuler, R.C., Gerber, G.K., Jaakkola, T.S., Gifford, D.K., and Mahony, S. (2010). Discovering homotypic binding events at high spatial resolution. Bioinformatics 26, 3028–3034.
Guttman, M., Amit, I., Garber, M., French, C., Lin, M.F., Feldser, D., Huarte, M., Zuk, O., Carey, B.W., Cassady, J.P., et al. (2009). Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223–227.
Heng, J.-C.D., Feng, B., Han, J., Jiang, J., Kraus, P., Ng, J.-H., Orlov, Y.L., Huss, M., Yang, L., Lufkin, T., et al. (2010). The nuclear receptor Nr5a2 can replace Oct4 in the reprogramming of murine somatic cells to pluripotent cells. Cell Stem Cell 6, 167–174.
Hoffman, B., Amanullah, A., Shafarenko, M., and Liebermann, D.A. (2002). The proto-oncogene c-myc in hematopoietic development and leukemogenesis. Oncogene 21, 3414–3421.
Hu, J., Li, B., and Kihara, D. (2005). Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 33, 4899–4913.
Hu, M., Yu, J., Taylor, J.M.G., Chinnaiyan, A.M., and Qin, Z.S. (2010). On the detection and refinement of transcription factor binding sites using ChIP-Seq data. Nucleic Acids Res 38, 2154–2167.
Hueber, S.D., and Lohmann, I. (2008). Shaping segments: Hox gene function in the genomic age. Bioessays 30, 965–979.
Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214.
Hughes, T.R. (2011). Introduction to “a handbook of transcription factors.” Subcell. Biochem. 52, 1–6.
Ise, W., Kohyama, M., Schraml, B.U., Zhang, T., Schwer, B., Basu, U., Alt, F.W., Tang, J., Oltz, E.M., Murphy, T.L., et al. (2011). The transcription factor BATF controls the global regulators of class-switch recombination in both B cells and T cells. Nat. Immunol. 12, 536–543.
Iyer, V.R., Horak, C.E., Scafe, C.S., Botstein, D., Snyder, M., and Brown, P.O. (2001). Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409, 533–538.
Jacob, F., and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 318–356.
Ji, H., Jiang, H., Ma, W., Johnson, D.S., Myers, R.M., and Wong, W.H. (2008). An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol. 26, 1293–1300.
Jiang, M., Anderson, J., Gillespie, J., and Mayne, M. (2008). uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics 9, 192.
Johnson, D.S., Mortazavi, A., Myers, R.M., and Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502.
Jothi, R., Cuddapah, S., Barski, A., Cui, K., and Zhao, K. (2008). Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 36, 5221–5231.
Kim, T.-M., and Park, P.J. (2011). Advances in analysis of transcriptional regulatory networks. Wiley Interdiscip Rev Syst Biol Med 3, 21–35.
Kulakovskiy, I.V., Boeva, V.A., Favorov, A.V., and Makeev, V.J. (2010). Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics 26, 2622–2623.
Kulkarni, M.M., and Arnosti, D.N. (2003). Information display by transcriptional enhancers. Development 130, 6569–6575.
Kunarso, G., Chia, N.-Y., Jeyakani, J., Hwang, C., Lu, X., Chan, Y.-S., Ng, H.-H., and Bourque, G. (2010). Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nat Genet 42, 631–634.
Laajala, T.D., Raghav, S., Tuomela, S., Lahesmaa, R., Aittokallio, T., and Elo, L.L. (2009). A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments. BMC Genomics 10, 618.
Ladunga, I. (2010). An overview of the computational analyses and discovery of transcription factor binding sites. Methods Mol. Biol. 674, 1–22.
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799–804.
Levine, M. (2010). Transcriptional enhancers in animal development and evolution. Curr. Biol. 20, R754–763.
Levine, M., and Tjian, R. (2003). Transcription regulation and animal diversity. Nature 424, 147–151.
Li, H., Ruan, J., and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858.
Lifanov, A.P., Makeev, V.J., Nazina, A.G., and Papatsenko, D.A. (2003). Homotypic regulatory clusters in Drosophila. Genome Res 13, 579–588.
Liu, X.S., Brutlag, D.L., and Liu, J.S. (2002). An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 20, 835–839.
Lodish, H.F. (2004). Molecular cell biology (New York: W.H. Freeman and Co.).
Ma, X., Kulkarni, A., Zhang, Z., Xuan, Z., Serfling, R., and Zhang, M.Q. (2012). A highly efficient and effective motif discovery method for ChIP-seq/ChIP-chip data using positional information. Nucleic Acids Research.
MacIsaac, K.D., and Fraenkel, E. (2006). Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput. Biol. 2, e36.
Maerkl, S.J., and Quake, S.R. (2007). A systems approach to measuring the binding energy landscapes of transcription factors. Science 315, 233–237.
Mahony, S., Auron, P.E., and Benos, P.V. (2007). DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput. Biol. 3, e61.
Man, T.K., and Stormo, G.D. (2001). Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res. 29, 2471–2478.
Mardis, E.R. (2008). Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9, 387–402.
Maston, G.A., Evans, S.K., and Green, M.R. (2006). Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet 7, 29–59.
Matys, V., Fricke, E., Geffers, R., Gössling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., et al. (2003). TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31, 374–378.
Mazzoni, E.O., Mahony, S., Iacovino, M., Morrison, C.A., Mountoufaris, G., Closser, M., Whyte, W.A., Young, R.A., Kyba, M., Gifford, D.K., et al. (2011). Embryonic stem cell-based mapping of developmental transcriptional programs. Nat. Methods 8, 1056–1058.
Meng, X., Brodsky, M.H., and Wolfe, S.A. (2005). A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nat. Biotechnol. 23, 988–994.
Meng, X., and Wolfe, S.A. (2006). Identifying DNA sequences recognized by a transcription factor using a bacterial one-hybrid system. Nat Protoc 1, 30–45.
Metzker, M.L. (2010). Sequencing technologies - the next generation. Nat. Rev. Genet. 11, 31–46.
Mikkelsen, T.S., Ku, M., Jaffe, D.B., Issac, B., Lieberman, E., Giannoukos, G., Alvarez, P., Brockman, W., Kim, T.-K., Koche, R.P., et al. (2007). Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448, 553–560.
Mitchell, P.J., and Tjian, R. (1989). Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science 245, 371–378.
Moorman, C., Sun, L.V., Wang, J., de Wit, E., Talhout, W., Ward, L.D., Greil, F., Lu, X.J., White, K.P., Bussemaker, H.J., et al. (2006). Hotspots of transcription factor colocalization in the genome of Drosophila melanogaster. Proceedings of the National Academy of Sciences 103, 12027–12032.
Narlikar, L., Gordan, R., Ohler, U., and Hartemink, A.J. (2006a). Informative priors based on transcription factor structural class improve de novo motif discovery. Bioinformatics 22, e384–92.
Narlikar, L., Gordân, R., Ohler, U., and Hartemink, A.J. (2006b). Informative priors based on transcription factor structural class improve de novo motif discovery. Bioinformatics 22, e384–392.
Nomenclature Committee of the International Union of Biochemistry (1986). Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984. Nomenclature Committee of the International Union of Biochemistry (NC-IUB). Proc. Natl. Acad. Sci. U.S.A. 83, 4–8.
Panne, D., Maniatis, T., and Harrison, S.C. (2007). An atomic model of the interferon-beta enhanceosome. Cell 129, 1111–1123.
Papatsenko, D., and Levine, M. (2005). Quantitative analysis of binding motifs mediating diverse spatial readouts of the Dorsal gradient in the Drosophila embryo. Proc. Natl. Acad. Sci. U.S.A. 102, 4966–4971.
Park, P.J. (2009). ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680.
Pavesi, G., Mauri, G., and Pesole, G. (2001). An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 17 Suppl 1, S207–214.
Pepke, S., Wold, B., and Mortazavi, A. (2009). Computation for ChIP-seq and RNA-seq studies. Nat. Methods 6, S22–32.
Pognonec, P., Boulukos, K.E., Aperlo, C., Fujimoto, M., Ariga, H., Nomoto, A., and Kato, H. (1997). Cross-family interaction between the bHLHZip USF and bZip Fra1 proteins results in down-regulation of AP1 activity. Oncogene 14, 2091–2098.
Ponticos, M., Partridge, T., Black, C.M., Abraham, D.J., and Bou-Gharios, G. (2004). Regulation of collagen type I in vascular smooth muscle cells by competition between Nkx2.5 and deltaEF1/ZEB1. Mol. Cell. Biol. 24, 6151–6161.
Ptashne, M., and Gann, A. (1997). Transcriptional activation by recruitment. Nature 386, 569–577.
Qi, Y., Rolfe, A., MacIsaac, K.D., Gerber, G.K., Pokholok, D., Zeitlinger, J., Danford, T., Dowell, R.D., Fraenkel, E., Jaakkola, T.S., et al. (2006). High-resolution computational models of genome binding events. Nat. Biotechnol. 24, 963–970.
Rahl, P.B., Lin, C.Y., Seila, A.C., Flynn, R.A., McCuine, S., Burge, C.B., Sharp, P.A., and Young, R.A. (2010). c-Myc regulates transcriptional pause release. Cell 141, 432–445.
Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., et al. (2000). Genome-wide location and function of DNA binding proteins. Science 290, 2306–2309.
Rhee, H.S., and Pugh, B.F. (2011). Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147, 1408–1419.
Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Euskirchen, G., Bernier, B., Varhol, R., Delaney, A., et al. (2007). Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods 4, 651–657.
Rozowsky, J., Euskirchen, G., Auerbach, R.K., Zhang, Z.D., Gibson, T., Bjornson, R., Carriero, N., Snyder, M., and Gerstein, M.B. (2009). PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 27, 66–75.
Rye, M.B., Sætrom, P., and Drabløs, F. (2011). A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs. Nucleic Acids Res. 39, e25.
Sandelin, A., Alkema, W., Engström, P., Wasserman, W.W., and Lenhard, B. (2004). JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32, D91–94.
Schneider, T.D., and Stephens, R.M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097–6100.
Sekiya, S., and Suzuki, A. (2011). Direct conversion of mouse fibroblasts to hepatocyte-like cells by defined factors. Nature 475, 390–393.
Shen, Y., Yue, F., McCleary, D.F., Ye, Z., Edsall, L., Kuan, S., Wagner, U., Dixon, J., Lee, L., Lobanenkov, V.V., et al. (2012). A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116–120.
Sherwood, L. (1997). Human physiology: from cells to systems (Belmont, CA: Wadsworth Pub. Co.).
Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A.S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., et al. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050.
Slattery, M., Riley, T., Liu, P., Abe, N., Gomez-Alcala, P., Dror, I., Zhou, T., Rohs, R., Honig, B., Bussemaker, H.J., et al. (2011). Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 147, 1270–1282.
Solomon, M.J., Larsen, P.L., and Varshavsky, A. (1988). Mapping protein-DNA interactions in vivo with formaldehyde: evidence that histone H4 is retained on a highly transcribed gene. Cell 53, 937–947.
Song, L., Zhang, Z., Grasfeder, L.L., Boyle, A.P., Giresi, P.G., Lee, B.-K., Sheffield, N.C., Gräf, S., Huss, M., Keefe, D., et al. (2011). Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 21, 1757–1767.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9, 3273–3297.
Stormo, G.D. (2000). DNA binding sites: representation and discovery. Bioinformatics 16, 16–23.
Stormo, G.D., Schneider, T.D., Gold, L., and Ehrenfeucht, A. (1982). Use of the “Perceptron” algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res. 10, 2997–3011.
Stormo, G.D., and Zhao, Y. (2010). Determining the specificity of protein–DNA interactions. Nature Reviews Genetics 11, 751–760.
Swanson, C.I., Schwimmer, D.B., and Barolo, S. (2011). Rapid evolutionary rewiring of a structurally constrained eye enhancer. Curr. Biol. 21, 1186–1196.
Taatjes, D.J., Marr, M.T., and Tjian, R. (2004). Regulatory diversity among metazoan coactivator complexes. Nat. Rev. Mol. Cell Biol. 5, 403–410.
Takahashi, K., and Yamanaka, S. (2006). Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126, 663–676.
Thanos, D., and Maniatis, T. (1995). Virus induction of human IFN beta gene expression requires the assembly of an enhanceosome. Cell 83, 1091–1100.
The modENCODE Consortium, Roy, S., Ernst, J., Kharchenko, P.V., Kheradpour, P., Negre, N., Eaton, M.L., Landolin, J.M., Bristow, C.A., Ma, L., et al. (2010). Identification of Functional Elements and Regulatory Circuits by Drosophila modENCODE. Science 330, 1787–1797.
Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., et al. (2005). Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144.
Valouev, A., Johnson, D.S., Sundquist, A., Medina, C., Anton, E., Batzoglou, S., Myers, R.M., and Sidow, A. (2008). Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat. Methods 5, 829–834.
Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A., and Luscombe, N.M. (2009). A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet. 10, 252–263.
Visel, A., Blow, M.J., Li, Z., Zhang, T., Akiyama, J.A., Holt, A., Plajzer-Frick, I., Shoukry, M., Wright, C., Chen, F., et al. (2009). ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457, 854–858.
Wallerman, O., Motallebipour, M., Enroth, S., Patra, K., Bysani, M.S.R., Komorowski, J., and Wadelius, C. (2009). Molecular interactions between HNF4a, FOXA2 and GABP identified at regulatory DNA elements through ChIP-sequencing. Nucleic Acids Res. 37, 7498–7508.
Wang, X., and Zhang, X. (2011). Pinpointing transcription factor binding sites from ChIP-seq data with SeqSite. BMC Systems Biology 5, S3.
Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63.
Wasserman, W.W., and Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 5, 276–287.
Watson, J.D. (2004). Molecular biology of the gene (San Francisco: Pearson/Benjamin Cummings).
Whitington, T., Frith, M.C., Johnson, J., and Bailey, T.L. (2011). Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res.
Wilbanks, E.G., and Facciotti, M.T. (2010). Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One 5, e11471.
Wolberger, C. (1999). Multiprotein-DNA complexes in transcriptional regulation. Annu Rev Biophys Biomol Struct 28, 29–56.
Wold, B., and Myers, R.M. (2008). Sequence census methods for functional genomics. Nat. Methods 5, 19–21.
Wu, S., Wang, J., Zhao, W., Pounds, S., and Cheng, C. (2010). ChIP-PaM: an algorithm to identify protein-DNA interaction using ChIP-Seq data. Theor Biol Med Model 7, 18.
Wunderlich, Z., and Mirny, L.A. (2009). Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet. 25, 434–440.
Yamaguchi, Y., Zhang, D.E., Sun, Z., Albee, E.A., Nagata, S., Tenen, D.G., and Ackerman, S.J. (1994). Functional characterization of the promoter for the gene encoding human eosinophil peroxidase. J. Biol. Chem. 269, 19410–19419.
Zambelli, F., Pesole, G., and Pavesi, G. (2012). Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Briefings in Bioinformatics.
Zhang, X., Robertson, G., Krzywinski, M., Ning, K., Droit, A., Jones, S., and Gottardo, R. (2011). PICS: probabilistic inference for ChIP-seq. Biometrics 67, 151–163.
Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown, M., Li, W., et al. (2008). Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137.
Zhao, Y., Granas, D., and Stormo, G.D. (2009). Inferring binding energies from selected binding sites. PLoS Comput. Biol. 5, e1000590.