Simultaneous Computational Discovery of DNA Regulatory Motifs and
Transcription Factor Binding Constraints at High Spatial Resolution
by
Yuchun Guo
M.S. Computer Science
Northeastern University, 2000
SUBMITTED TO THE COMPUTATIONAL AND SYSTEMS BIOLOGY PROGRAM
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY IN COMPUTATIONAL AND SYSTEMS BIOLOGY
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
SEPTEMBER 2012
© 2012 Massachusetts Institute of Technology
All rights reserved
Signature of Author: Yuchun Guo
Computational and Systems Biology Program
August 31, 2012
Certified by: David K. Gifford
Professor of Computer Science and Engineering
Thesis Supervisor
Accepted by: Chris Burge
Professor of Biology and Biological Engineering
Computational and Systems Biology Ph.D. Program Director
Submitted to the Computational and Systems Biology Program
on August 31, 2012 in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computational and Systems Biology
Abstract
I present three novel computational methods to address the challenge of identifying
protein-DNA interactions at high spatial resolution from noisy ChIP-Seq data. I first
present the genome positioning system (GPS) algorithm, which predicts protein-DNA
interaction events from ChIP-Seq data using a single-base resolution generative
probabilistic model. Using synthetic and actual ChIP-Seq data, I show that GPS
improves the effective spatial resolution and accuracy in resolving proximal binding
events compared with existing methods. Second, I present the k-mer set motif
(KSM) representation and the k-mer motif alignment and clustering (KMAC) method,
which discovers DNA-binding motifs from ChIP-Seq derived sequences. I demonstrate
that the KSM model is more predictive than the widely used position weight matrix
model, and that KMAC outperforms other existing motif discovery programs in
recovering known motifs from a large collection of human ChIP-Seq experiments.
Finally, I present an integrative method, genome wide event finding and motif discovery
(GEM), which models ChIP data with explanatory motifs and binding events at high
spatial resolution. The GEM model links binding event discovery and motif discovery
with positional priors in the context of a generative probabilistic model of ChIP data and
genome sequence. I show that GEM further improves upon previous methods for
processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and
discovery of proximal binding events. GEM enables a systematic analysis of in vivo
transcription factor binding to discover hundreds of spatial binding constraints between
factors in human and mouse cells, including known factor pairs and novel pairs such as
c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4α/FOXA1. I also discovered a complex spatial
binding relationship involving 6 key regulatory factors in mouse embryonic stem (ES) cells
that is likely to be functional in ES cell gene regulation. Such computational discoveries
propose testable models for regulatory factor interactions that will help elucidate genome
function and the implementation of combinatorial control.
Thesis Supervisor: David K. Gifford
Title: Professor of Computer Science and Engineering
To My Teachers
and My Family
Acknowledgments
Getting a Ph.D. degree is a journey. Without the inspiration, guidance, and support
from the people around me, it would have been almost impossible. I would like to
sincerely thank everyone whose teaching and inspiration led me to my Ph.D. study at
MIT and kept motivating and supporting me throughout this journey.
For my thesis research at MIT, I first would like to thank my thesis advisor, Prof.
David Gifford. David introduced me to the fields of computational genomics and
developmental biology and guided me through every step of my research. David
encouraged me to explore research ideas that are really interesting and meaningful to
me, helped me to get help from other lab members to get my project started quickly, and
suggested new directions when I made progress. His knowledge in both computer
science and biology, and his extraordinary ability to effectively communicate and
collaborate with scientists in other fields have set a great example for me. Next I want to
thank my thesis committee members, Prof. Tommi Jaakkola and Prof. Ernest Fraenkel.
Tommi and Ernest not only provided their expert suggestions to improve my research,
but also taught me how to do science in practice.
The members of the Gifford Lab have provided me a wonderful and relaxed learning
environment. Shaun Mahony, a research scientist in the Gifford Lab, has been a great
colleague, mentor, and friend. Shaun helped me to develop many of the analyses in my
project, helped to improve my writing, and was always available to chat when I got stuck.
Chris Reeder, a fellow graduate student, and I shared the ups and downs of the
graduate student experience. The conversations we had, ranging from machine learning
to biology to life in general, helped me to keep the research work in perspective. I would
also like to thank all the other Gifford lab members, including Jeanne Darling, Georg
Gerber, Robin Dowell, Alan Qi, Alex Rolfe, Tim Danford, Charlie O’Donnell, Matt
Edwards, Tahin Syed and Tatsu Hashimoto, who made my life at MIT much easier and
richer.
I am grateful for the opportunity to study in the Computational and Systems
Biology (CSB) Program at MIT. The program directors, Prof. Chris Burge and Prof.
Bruce Tidor, are great teachers. They always made themselves available when I
needed their help the most. The CSB administrator Bonnie Whang, Darlene Ray, and
other CSB students gave me another home at MIT. I would also like to thank many MIT
students, including Pouya Kheradpour, Georgios Papachristoudis and Bob Altshuler for
their help in my research.
I also wish to acknowledge and thank my friends at MIT and in the Boston area.
Their encouragement and help enabled me to overcome difficulties in the past few
years.
Finally, I am grateful to my family, my parents, my wife, and my two sons. Their
love and support are always with me. I feel like we share the MIT degree.
Table of contents
Chapter 1: Introduction
1.1 Gene expression and transcription regulation
1.2 Transcription factor binding and combinatorial regulation
1.3 Next-generation sequencing technologies
1.4 ChIP-Seq and computational challenges
1.5 Thesis road map
Chapter 2: Genome Positioning Systems (GPS)
2.1 Introduction
2.2 GPS algorithm
2.2.1 GPS algorithm overview
2.2.2 Empirical spatial distribution of reads
2.2.3 GPS mixture model
2.2.4 EM algorithm
2.2.5 Setting the sparseness parameter α
2.2.6 Statistical significance of predicted events
2.2.7 Artifact filtering
2.2.8 Software implementation
2.3 Results
2.3.1 GPS automatically adapts the empirical read distribution
2.3.2 GPS predictions have higher spatial resolution
2.3.3 GPS discovers more joint events
2.4 Discussion
2.5 Methods
2.5.1 Datasets used
2.5.2 ChIP-Seq analysis methods
2.5.3 Method comparison on spatial resolution
2.5.4 Evaluating performance in deconvolving joint events using synthetic data
2.5.5 Evaluating performance in deconvolving joint binding events using GABP ChIP-Seq data
Chapter 3: K-mer set motif representation and discovery
3.1 Introduction
3.1.1 DNA motif representations
3.1.2 DNA motif discovery methods
3.1.3 About this chapter
3.2 K-mer set motif (KSM) model
3.2.1 The KSM representation
3.2.2 Scoring K-mer set motif in a DNA sequence
3.3 K-mer motif alignment and clustering (KMAC)
3.3.1 Discovery of the set of enriched k-mers
3.3.2 Clustering the enriched k-mers into k-mer set motifs
3.4 Results
3.4.1 The PWM model does not capture k-mer differences between CTCF binding in mouse and human cells
3.4.2 K-mer set motif model is more predictive for in vivo binding than PWM model
3.4.3 KMAC outperforms other motif discovery methods in discovering known DNA-binding motifs
3.4.4 KMAC outperforms other ChIP-Seq oriented motif discovery methods
3.5 Discussion
3.6 Methods
3.6.1 Datasets
3.6.2 Motif-finding performance comparison
3.6.3 ROC comparison of motif representation performance in predicting in vivo binding
Chapter 4: Genome-wide event finding and motif discovery (GEM)
4.1 Introduction
4.2 GEM algorithm
4.2.1 Predicting protein DNA-binding events with a sparse prior
4.2.2 Discover the k-mer set motifs at binding event locations
4.2.3 Positional prior generation
4.2.4 Binding event prediction with a positional prior
4.2.5 Motif discovery using improved event locations
4.2.6 GEM software
4.3 Results
4.3.1 GEM improves the spatial resolution of binding event prediction
4.3.2 GEM is better at resolving closely spaced binding events
4.3.3 GEM improves the spatial resolution of ChIP-exo binding event prediction
4.4 Discussion
4.5 Methods
4.5.1 Datasets
4.5.2 Evaluating spatial resolution of ChIP-Seq event calls
4.5.3 Evaluating performance in deconvolving proximal binding events using GABP ChIP-Seq data
4.5.4 Analysis of ChIP-exo data
Chapter 5: Transcription factor spatial binding constraints
5.1 Introduction
5.2 Spatial binding constraints discovery
5.3 Results
5.3.1 GEM reveals known Sox2-Oct4 distance-constrained transcription factor binding distances
5.3.2 Enhancer grammar elements deduced from transcription factor binding sites predicted by GEM
5.3.3 Spatially constrained human factor binding in ENCODE data
5.4 Discussion
Chapter 6: Conclusions
6.1 Summary and contributions
6.1.1 Genome Positioning Systems (GPS)
6.1.2 K-mer set motif representation and discovery
6.1.3 Genome-wide event finding and motif discovery (GEM)
6.1.4 Transcription factor spatial binding constraints
6.2 Directions for future work
6.2.1 Weighting factor of motif prior in the GEM algorithm
6.2.2 K-mer based comparison of in vivo versus in vitro binding for similar TFs in a family
6.2.3 Discovery of binding constraints
6.3 Conclusions
References
Figures
Figure 2-1 Random mixture of ChIP-Seq reads from joint events
Figure 2-2 Spatial distribution of ChIP-Seq reads
Figure 2-3 GPS probabilistically models ChIP-Seq read spatial distributions
Figure 2-4 GPS automatically adapts the empirical read distribution
Figure 2-5 GPS improves the effective spatial resolution
Figure 2-6 GPS has better spatial resolution than other shape-aware methods
Figure 2-7 GPS improves accuracy in resolving joint binding events
Figure 3-1 Oct4 KSM and PWM motif representation
Figure 3-2 Search k-mer set motif in a DNA sequence
Figure 3-3 Schematic of k-mer set motif finding
Figure 3-4 The PWM model does not capture k-mer differences
Figure 3-5 The KSM model is more predictive than the PWM model
Figure 3-6 KMAC motif discovery outperforms other methods when detecting motifs in ChIP-Seq data
Figure 3-7 KMAC outperforms other ChIP-Seq oriented motif discovery methods
Figure 4-1 GEM improves spatial accuracy in binding event prediction
Figure 4-2 GEM is better at resolving closely spaced binding events
Figure 4-3 GEM improves the spatial resolution of ChIP-exo data event prediction
Figure 5-1 GEM reveals transcription factor spatial binding constraints
Figure 5-2 Spatial binding constraints detected from mouse ES cells
Figure 5-3 Spatial relationship between Klf4 and other 15 factors in mouse ES cells
Figure 5-4 Enhancer grammar elements deduced from mouse ES cell transcription factor binding sites predicted by GEM
Figure 5-5 A Klf4-Sox2 distance-constrained region interacts with Tcfcp2l1 transcriptional start site
Figure 5-6 Klf4-Sox2 distance-constrained regions are bound by p300 and marked by H3K27ac
Figure 5-7 Spatial binding constraints detected from human K562 cells
Figure 5-8 Spatial binding constraints detected from human GM12878 cells
Figure 5-9 Spatial binding constraints detected from human HepG2 cells
Figure 5-10 Spatial binding constraints detected from human HeLa-S3 cells
Figure 5-11 Spatial binding constraints detected from human H1-hESC cells
Figure 5-12 Examples of transcription factor spatial binding constraints detected from GEM analysis in ENCODE ChIP-Seq data
Chapter 1: Introduction
This thesis is about developing and applying new computational algorithms to
discover precise transcription factor binding locations, corresponding in vivo DNA
regulatory motifs, and transcription factor binding spatial constraints from high-throughput biological experimental datasets. My research is within the broader research
area of computational biology, with a focus on understanding the regulatory mechanisms
of transcription, a fundamental biological process. To explain what transcription
regulation is, why it is an important research subject, and how my thesis work fits in the
frontier of this research area, I will start with a brief overview of the biological
background and the related research problems.
1.1 Gene expression and transcription regulation
The human body is full of wonders. In our brain, which weighs only about 3 pounds,
approximately 100 billion neurons, interconnected by electrical or
chemical signals, give rise to our five senses, consciousness, memory, emotion,
creativity, and so on. An army of immune cells constantly defends our bodies from
countless foreign invaders (for example, viruses and bacteria) and cancer cells. Over 2
billion heart cells beat in a highly coordinated manner roughly 3 billion times
throughout our lives, supplying oxygen and other nutrients to our bodies (Sherwood,
1997). The list goes on. Perhaps more amazingly, all ~200 major types of cells in our
bodies, with diverse and complex functions, originate from a single fertilized egg cell,
starting with the same copy of genetic instructions encoded in DNA. Although nearly all
the cells in our bodies contain the same full set of genes, only some of the genes are
active, or expressed, and used to make proteins or functional RNAs in a particular cell
type (Lodish, 2004).
Gene expression, the process by which information from the gene is used to
synthesize a protein or another functional gene product, is used by all life forms to make
the macromolecular machinery that carries out life’s functions. The first main step of
gene expression is transcription, in which the DNA sequence information of a gene is
copied from the DNA template to a single-stranded RNA by the enzyme RNA
polymerase. In eukaryotic cells, the initial RNA copy is processed into a messenger
RNA (mRNA). In the second main step of gene expression, called translation, a complex
molecular machine, the ribosome, assembles proteins using the precise sequence
information in the mRNA, which was originally encoded in the gene.
Because gene expression underlies basic cellular processes
such as cell growth, maintenance, development, and differentiation, proper regulation is
essential to ensure that genes are expressed at the right time and in the right place. A classic
example is lactose metabolism in E. coli: the enzyme that metabolizes lactose is
expressed at high levels only when lactose is available in the environment, but when
glucose (a better food source) is also available, the enzyme is not expressed even when
lactose is present (Jacob and Monod, 1961). Regulation of gene expression can
happen at various steps during the process, including initiation, elongation, and
termination of transcription, splicing, mRNA transport, mRNA decay, and translation.
However, the regulation of transcription initiation, the first step, is the most important
mechanism for determining which genes are expressed and how much of the encoded
mRNAs and, consequently, proteins are produced (Lodish, 2004).
Transcription initiation from a gene promoter is controlled by sequence-specific
DNA-binding regulatory proteins called transcription factors (TFs, also called activators
or repressors in bacteria). Eukaryotic TFs typically contain one or more DNA-binding
domains that recognize specific DNA sequences and a transcription regulation domain
that interacts with other transcriptional regulatory proteins and regulates the activity of
transcription (Mitchell and Tjian, 1989; Ptashne and Gann, 1997). During transcription
initiation, RNA polymerase (together with the general transcription factors, also called
the transcriptional machinery) binds to the promoter region of a gene and starts the
process of transcription. However, at many promoters, in the absence of regulatory
proteins, RNA polymerase binds only weakly and produces a low level of constitutive
expression. With a regulatory activator, which typically binds specific sites at or near the
promoter, the polymerase is recruited to the promoter and produces a high level of
transcription. The transcriptional activators, usually with the help of co-activators
(Taatjes et al., 2004), can interact with one or more of many different components of the
transcriptional machinery to recruit polymerase. Alternatively they can interact with
chromatin modifiers that open inaccessible DNA to allow binding of transcriptional
machinery to a promoter. On the other hand, a regulatory repressor may interfere with
or inhibit the transcriptional machinery or activators, or recruit repressive chromatin
modifiers to suppress transcription. Thus gene promoters typically contain specific short
sequence elements that can be recognized by specific TFs. In higher eukaryotes,
TFs may bind regions called enhancers located tens of thousands of base pairs either
upstream or downstream from the promoter. Some TFs may also regulate
transcriptional elongation (Rahl et al., 2010). Therefore, a gene can be regulated by
multiple TFs that work together in large numbers and various combinations. This allows
the integration of multiple signal transduction pathways, particularly in multicellular
organisms (Watson, 2004).
The regulation of transcription by transcription factors is critical for numerous
biological phenomena, including development, signal transduction, immune response,
sensory perception, etc. (Vaquerizas et al., 2009). For example, introducing only four
transcription factors, Oct4, Sox2, c-Myc, and Klf4, can reprogram mouse
embryonic or adult fibroblasts into induced pluripotent stem cells (Takahashi and
Yamanaka, 2006). Dysfunction in transcription regulation may lead to various diseases,
such as developmental syndromes (Boyadjiev and Jabs, 2000) and cancers (Furney et
al., 2006). For example, deregulated expression of transcription factor c-Myc is found to
cause unregulated expression of many cell proliferation genes and result in certain
cancers (Dang, 2012). In a manually curated census of human TFs, 164 TFs were
identified to be directly responsible for 277 diseases or syndromes (Vaquerizas et al.,
2009). Mutations in transcriptional regulatory elements have also been found to be associated
with numerous human diseases (Maston et al., 2006).
In summary, transcription factors are key players in regulating gene expression and
in influencing a broad spectrum of biological processes. However, most human TFs are
uncharacterized (Vaquerizas et al., 2009). It is important to understand how TFs
(possibly interacting with other TFs) bind to the regulatory DNA sequences and regulate
the expression of their target genes.
1.2 Transcription factor binding and combinatorial regulation
Combinatorial binding of TFs plays a key role in the specificity of transcriptional
regulation and is thought to contribute to the complexity and diversity of eukaryotes
(Watson, 2004). An increase in both the ratio and absolute number of transcription
factors in a genome seems to correlate with organismal complexity (Levine and Tjian,
2003). The complexity of the regulatory sequences follows the same trend. From
bacteria to yeast, to multicellular organisms such as fruit fly and human, the regulatory
sequences typically contain increasing numbers of binding sites and are further away
from the gene promoter.
Theoretical analysis showed that bacterial TFs can recognize a specific DNA site in
the genomic background, but the same is not true for eukaryotic TFs because the
eukaryotic binding sites are shorter and their genomes are much larger (Wunderlich and
Mirny, 2009). In addition, TFs in large families share similar DNA-binding domains and
recognize very similar consensus sequences. One example is the so-called Hox
paradox: Homeobox (Hox) family factors recognize similar sequences containing a
TAAT core in vitro, yet display functional diversities in vivo (Hueber and Lohmann,
2008). Given the generic binding sites and generic DNA-binding domains of the TFs,
the formation of complex nucleoprotein structures involving a combinatorial TF partner
code and their DNA sites increases the effective length of the target DNA sequences
and thus increases the specificity of gene regulation (Georges et al., 2010).
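As a rough illustration of the specificity argument above, the sketch below estimates how often one fixed site of a given length would be expected to occur by chance in a random genome. It assumes equal base frequencies and exact matching on both strands, which is a simplification. The site lengths and genome sizes are illustrative assumptions rather than values from the cited analysis, and the function name is hypothetical.

```python
def expected_chance_matches(site_length, genome_size, both_strands=True):
    """Expected number of exact chance matches to one fixed site.

    Assumes a random genome with equal base frequencies (0.25 each),
    which is a simplification; real genomes are not uniform.
    """
    positions = genome_size - site_length + 1  # possible start positions
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** site_length


# Illustrative (assumed) numbers: a long site in a small genome versus
# a short site in a large genome.
bacterial = expected_chance_matches(site_length=16, genome_size=4_600_000)
eukaryotic = expected_chance_matches(site_length=8, genome_size=3_100_000_000)

print(f"16 bp site, 4.6 Mb genome: {bacterial:.4f} expected chance matches")
print(f"8 bp site, 3.1 Gb genome: {eukaryotic:.0f} expected chance matches")
```

Under these assumptions the long site is expected less than once by chance, whereas the short site is expected tens of thousands of times. This is the intuition behind why combinations of sites, which lengthen the effective target sequence, can restore specificity.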
Thus in multicellular organisms, enhancers usually contain clusters of sequence-specific TF binding sites (Maston et al., 2006). Specific genomic regions that are
extensively targeted by multiple TFs have been reported in fruit fly (Moorman et al.,
2006; The modENCODE Consortium et al., 2010) and in mouse (Chen et al., 2008). An
intriguing question is how these TFs work together to regulate specific gene expression
patterns. The notion of grammar (Levine, 2010; Swanson et al., 2011) refers
to the phenomenon that the spacing and arrangement of binding sites matter for the
activity of the enhancer, just as the order of words in a sentence can affect its meaning.
Such regulatory grammars have been observed in certain enhancers. An open question
is the pervasiveness of grammatical features (Levine, 2010). However, most of the
current binding data show overlapping binding regions but do not have enough spatial
resolution to reveal the detailed grammars that govern the interactions among the TFs
and between the TFs and the DNA.
The nature of the combinatorial control with respect to the arrangement (position
and orientation) of the binding sites has been described in two competing models: the
“enhanceosome” and “billboard” models (Arnosti and Kulkarni, 2005). The
enhanceosome model proposes that the binding sites within the enhancer are precisely
positioned, allowing for highly cooperative assembly of TFs. One well-studied example
is the Interferon-β enhanceosome, where binding sites for ATF/c-Jun, IRF-3/IRF-7, and
NF-κB are tightly clustered on a sequence only 55 base pairs long. The specific type and
number of TF binding sites and their correct positioning on the surface of the DNA
double helix are required for enhancer function (Thanos and Maniatis, 1995). In contrast,
the billboard model suggests that the arrangement of the binding sites may not be very
strict because TFs binding on sub-elements of the enhancer can be interpreted by
transcriptional machinery separately (Kulkarni and Arnosti, 2003). These two models
have been observed in only a few detailed studies. In reality, enhancers may function
somewhere between these two extreme models. In addition, the TF binding site
arrangement may be only part of the regulatory code. Sometimes protein-protein
interactions may modify the binding preferences of the TFs. A recent study of in vitro
binding of Hox-cofactor complexes showed that cofactor binding evoked differences in
DNA binding among different Hox proteins and this may contribute to the in vivo binding
specificities of Hox proteins (Slattery et al., 2011). Furthermore, there may or may not
be protein-protein interactions between the TFs in a multiprotein-DNA complex. A
composite structure model of the Interferon-β enhanceosome showed the absence of major
protein-protein interfaces between the TFs, suggesting the cooperative occupancy of the
enhancer comes from both binding-induced DNA conformational changes and specific
interactions with co-activators (Panne et al., 2007). Thus, more cases of detailed
analysis of in vivo binding sites within the enhancers will be needed to unravel the
grammars of combinatorial regulation.
Computational prediction of in vivo TF binding sites suffers from high false positive
rates (Wasserman and Sandelin, 2004). Although the situation is improving with the
new approaches that model combinatorial binding to improve predictive specificity, the
improvements are limited by the availability of sufficient known sites to train the model
(Wasserman and Sandelin, 2004). Therefore, a complete survey of genome wide TF
binding and further studies of the binding motifs and spatial constraints of TFs in vivo will
be helpful to elucidate the nature of combinatorial control.
1.3 Next-generation sequencing technologies
New technological advances in experimental methods, especially in sequencing
technologies (Mardis, 2008; Metzker, 2010), have generated excitement in genomics
research. Next-generation sequencing (NGS) technologies introduced
innovations in areas such as template preparation, sequencing and imaging, and
genome alignment and assembly methods (Metzker, 2010). The major advance offered
by NGS is the ability to generate very large amounts of sequencing data at a much
lower cost than the automated Sanger sequencing method, which enables various
innovative approaches in basic, applied and clinical research (Metzker, 2010).
NGS has been applied in the research field of functional genomics. By sequencing
the ends of the DNA/RNA molecules in the sample and mapping them to the genome,
one can count the mapped reads and analyze their distribution throughout the genome.
Such sequence census methods enable researchers to assay the regulatory input and
output of the genome routinely and comprehensively, and vastly increase our ability to
understand how the genome specifies all the different cell types and their states of
behavior (Wold and Myers, 2008). For example, RNA-Seq is replacing microarrays for
gene expression profiling. RNA-Seq reveals unexpected complexity in eukaryotic
transcriptomes and provides a far more precise measurement of levels of transcripts and
their isoforms than other methods (Wang et al., 2009). Chromatin immunoprecipitation
followed by high-throughput sequencing (ChIP-Seq) enables genome-wide profiling of
protein-DNA interactions at a much higher resolution and coverage than previous
methods (Park, 2009). ChIP-Seq studies of TF binding find that most TFs bind to
thousands of places in the genome, often outside of the proximal promoter regions, and
that combinatorial binding and recruitment of co-activators are important for high levels of
transcriptional activity (Farnham, 2009). ChIP-Seq profiling of multiple histone marks has
been used for genome annotation and detection of regulatory sequences and non-coding
RNAs (Guttman et al., 2009; Ernst et al., 2011; Shen et al., 2012). DNase-seq and
FAIRE-seq have been applied to map nearly a million open chromatin regions that cover
9% of the human genome and to discover clusters of open regulatory elements that are
suggested to control gene activity required for the maintenance of cell-type identity
(Song et al., 2011).
NGS-based technologies have also been applied to variant discovery by
resequencing targeted regions of interest or whole genomes, de novo assemblies of
bacterial and lower eukaryotic genomes, and species classification and/or gene
discovery by metagenomics studies, etc. (Metzker, 2010).
The application of emerging new technologies and the large consortium efforts
across multiple institutions such as ENCODE (Birney et al., 2007), modENCODE
(Celniker et al., 2009), and the Roadmap Epigenomics Mapping Consortium (Bernstein
et al., 2010) are starting to generate unprecedented amounts of data. Integrative
analysis through detailed computational modeling on these comprehensive datasets will
greatly leverage the potential of these resources and facilitate the translation of data into
biological knowledge. Such combined experimental and computational efforts promise
to unravel the molecular mechanisms of gene regulation and improve human health.
1.4 ChIP-Seq and computational challenges
As one of the early applications of NGS, chromatin immunoprecipitation followed by
high-throughput sequencing (ChIP-Seq) has become an indispensable tool for genome-wide
profiling of protein-DNA interactions (Barski et al., 2007; Johnson et al., 2007;
Mikkelsen et al., 2007; Robertson et al., 2007; Park, 2009). Compared to its
predecessor, chromatin immunoprecipitation followed by microarray hybridization
(ChIP-chip) (Ren et al., 2000; Iyer et al., 2001), ChIP-Seq has higher resolution, fewer artifacts,
greater coverage and a larger dynamic range (Park, 2009) and therefore provides
substantially improved mapping of physical interactions between proteins and DNA in
the living cell. ChIP-Seq has been applied to genome-wide profiling of TF binding sites
and histone modifications and has generated valuable knowledge on global gene
regulation (Farnham, 2009). It is considered the most successful high-throughput
experimental technique for discovery of TF binding sites (Ladunga, 2010).
ChIP-Seq is based on chromatin immunoprecipitation (ChIP) (Solomon et al., 1988),
which enriches the DNA fragments that are associated with a specific protein. The
DNA-binding protein is crosslinked to the DNA by formaldehyde and the DNA is sheared by
sonication into small fragments. An antibody specific to the protein of interest is used to
selectively immunoprecipitate the protein-bound DNA fragments. Finally, the crosslinks in
the pulled-down protein-DNA complexes are reversed and the recovered DNA is assayed by NGS to
determine the sequences bound by that protein. The output is a list of reads that are
sequenced from the 5’ ends of the ChIP DNA fragments.
The specificity of the antibody is critical in the experimental design but it may be
difficult to find a native antibody with sufficient quality for the protein of interest. The
situation is improving thanks to the wider adoption of ChIP-Seq and the large consortium
efforts such as ENCODE (Birney et al., 2007), modENCODE (Celniker et al., 2009), and
Roadmap Epigenomics Mapping Consortium (Bernstein et al., 2010). As an alternative
to the requirement for TF specific antibodies, ChIP-Seq with epitope-tagged human or
mouse proteins has also been developed (Cao et al., 2011; Mazzoni et al., 2011).
Sequencing depth is another important experimental design issue. Early ChIP-Seq
datasets typically contained several million reads. A recent evaluation suggested that
the regularly adopted depth of 15-20 million reads in human experiments is insufficient
(Chen et al., 2012). Typically, a control experiment is used to correct biases in DNA
shearing, amplification and sequencing (Park, 2009). There are three types of
commonly used control: genomic DNA, chromatin input DNA and DNA from a
nonspecific IP. Chromatin input DNA has been shown to outperform genomic DNA in
predicting binding events with more enriched binding motifs (Chen et al., 2012).
Short read alignment software such as Bowtie (Langmead et al., 2009), Eland (the
default aligner for the Illumina platform) or MAQ (Li et al., 2008) is then used to align
the sequenced reads to the genome. A few base mismatches are typically permitted
during read alignment to allow for sequencing errors. Depending on the length of the
reads and the genome, ~10-20% of the reads cannot be uniquely mapped to the
genome (Rozowsky et al., 2009). The non-uniquely mapped reads are typically
discarded for downstream analysis (Park, 2009), although a recent study showed that
they are important for studying TF binding in highly repetitive regions of genomes
(Chung et al., 2011).
The next step, the most critical step, is to infer actual binding events by identifying
statistically enriched regions in the ChIP data as compared to the control data. Many
computational methods (usually called “peak callers”) have been developed to detect
binding events. They are reviewed in (Park, 2009; Pepke et al., 2009) and compared in
(Laajala et al., 2009; Wilbanks and Facciotti, 2010; Rye et al., 2011). Some of the
methods are discussed in Chapter 2 and Chapter 4.
For TF binding, a critical issue in computational analysis for ChIP-Seq is the spatial
resolution of binding event predictions, which is defined as the difference between the
predicted location of a binding event and the midpoint of its actual location. Spatial
resolution is important for downstream analysis such as motif discovery and annotation
of the binding sites, particularly for mapping the binding constraints among multiple TFs
in the same cellular condition. Although the reads are mapped at single-base resolution,
random variation in the ChIP DNA fragmentation process obscures the actual location of
interaction events. In addition, ChIP-Seq reads caused by different closely spaced
events (joint events or proximal events) will spatially mix with one another along the
genome, presenting a challenge for precisely estimating the multiplicity and exact
positions of proximal binding events of the same TFs. The typical spatial resolution of
ChIP-Seq binding event detection is within 40-50 base pairs and varies with the dataset
and the methods used (Wold and Myers, 2008; Wilbanks and Facciotti, 2010). To fully
capitalize on the benefits of ChIP-Seq, the spatial resolution of event detection needs to
be greatly improved.
The most common follow-up analysis of binding event detection, motif discovery,
has also not been optimized for ChIP-Seq data. Motif discovery is one of the most
widely studied problems in computational biology and many methods have been
developed. They are reviewed in (MacIsaac and Fraenkel, 2006; Das and Dai, 2007;
Zambelli et al., 2012) and compared in (Hu et al., 2005; Tompa et al., 2005). Some of
the methods are discussed in Chapter 3. Traditional motif discovery programs such as
MEME (Bailey and Elkan, 1994) and Weeder (Pavesi et al., 2001) are not suitable for
large numbers of ChIP-Seq bound sequences due to computational inefficiency. These
methods are thus limited to processing only the ~500-1000 top-ranking sequences, ignoring
weak binding sites. New methods have been developed to take advantage of some
features of ChIP-Seq data, such as higher spatial resolution, more quantitative binding
strength and higher redundancy of motif instances in the sequences (see more
discussion in 3.1.2). But as shown in Chapter 3, the performance of these methods is
not improved as expected. A related issue is that there is currently no generally
accepted gold standard for motif representation (Hughes, 2011). Thus it is beneficial to
explore potentially more suitable motif representations and better approaches to
discovering motifs for ChIP-Seq data.
Other down-stream analyses of ChIP-Seq binding predictions include studying the
relationships among binding calls from multiple transcription factors in the same cellular
condition, and elaborating the relationship between binding calls and gene structure,
gene target assignment, gene expression, condition specific binding, etc. (Park, 2009).
For more accurate down-stream analysis, it is important to have binding calls with higher
spatial resolution and more accurate binding strength quantification.
The interaction between a TF and its target site on the DNA is the basic unit for
understanding the complex network of global gene regulation. Innovations in the
computational analysis of ChIP-Seq data promise to reveal aspects of transcription
factor binding at a new level of resolution, enabling further mechanistic study of
combinatorial control and the gene regulatory network.
1.5 Thesis road map
In this thesis I present novel computational algorithms for ChIP-Seq binding event
prediction, DNA regulatory motif discovery, and transcription factor binding constraint
discovery, and the resulting biological findings.
Chapter 2: Genome Positioning Systems (GPS)
In Chapter 2, I present the Genome Positioning Systems (GPS), a principled
model-based computational method to predict ChIP-Seq binding events with high spatial
resolution. I first introduce the challenge presented by the random fragmentation of ChIP
DNA and the mixing of closely spaced events for precisely estimating the multiplicity and
exact positions of proximal binding events (Section 2.1). Next I describe the GPS
algorithm. GPS models the spatial distribution of reads and deconvolves proximal
binding events using a probabilistic mixture model with a sparse prior (Section 2.2). I
compare GPS with widely used published methods and find that GPS improves
the spatial resolution of binding event predictions and resolves more proximal binding
events (Section 2.3). Finally, I discuss the significance of improved spatial resolution
and discovery of proximal binding events, compare GPS with recently published similar
approaches (Section 2.4), and describe analysis methods (Section 2.5).
Chapter 3: K-mer set motif representation and discovery
In Chapter 3, I present a novel k-mer set motif representation and a new motif
discovery method, k-mer motif alignment and clustering (KMAC), to learn motifs that are
most enriched in ChIP-Seq bound sequences versus control sequences. I give a brief
introduction on widely used motif representations and motif discovery methods and
discuss their limitations, particularly the challenge of incorporating informative features in
ChIP-Seq derived data (Section 3.1). Then I describe the k-mer set motif representation
(Section 3.2) and the KMAC motif discovery method. KMAC discovers motifs by using a
combined enumerative and alignment-based approach and weighting the motif sites with
binding event strength and position information (Section 3.3). When KMAC is used to
recover known motifs using a large number of diverse ChIP-Seq datasets I show that it
is more informative and predictive than the position weight matrix (PWM) model, and
that it also outperforms other motif discovery methods, including ChIP-Seq oriented
methods (Section 3.4). Finally I discuss the significance of using k-mer set motif
representation and KMAC motif discovery method in the context of ChIP-Seq analysis
(Section 3.5), and describe analysis methods (Section 3.6).
Chapter 4: Genome-wide event finding and motif discovery (GEM)
In Chapter 4, I present genome-wide event finding and motif discovery (GEM), an
integrative model to resolve the location of protein-DNA interactions and discover
explanatory DNA sequence motifs. I first introduce the value of integrating motif finding
and event discovery (Section 4.1). Then I describe the GEM algorithm. GEM extends
the GPS model to incorporate motif information as a position-specific prior to bias
binding event prediction (Section 4.2). Next I show that GEM locates binding
events with exceptional spatial resolution on their corresponding motif positions, and
further improves proximal event deconvolution. GEM can also be directly applied to
ChIP-exo data and improves upon existing methods (Section 4.3). Finally I discuss the
significance of GEM for improving signal-to-noise ratio in motif discovery, the flexibility to
incorporate other positional information, and the application to ChIP-exo data (Section
4.4), and describe analysis methods (Section 4.5).
Chapter 5: Transcription factor spatial binding constraints
In Chapter 5, I present the discovery of TF binding constraints using GEM
predictions. First I introduce the value of discovering in vivo TF binding constraints and
the limitation of motif-based approaches (Section 5.1). Next I describe the method to
discover statistically significant TF binding constraints using GEM binding predictions of
a large number of TFs in the same cellular condition (Section 5.2). Then I present the
biological findings from mouse ES cells and 5 human cell types. GEM found 37
examples of TF binding constraints in mouse ES cells, including strong distance-specific
constraints between Klf4, Sox2 and other key regulatory factors. In human ENCODE
data, GEM found 390 examples of spatially constrained pair-wise binding, including such
novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4A/FOXA1 (Section 5.3). Finally
I discuss the value of using GEM to discover TF binding constraints (Section 5.4).
Chapter 6: Conclusions
In Chapter 6, I summarize the work presented here and outline the main
contributions of this thesis.
Chapter 2: Genome Positioning Systems (GPS)
Chapter 2
Genome Positioning Systems (GPS)
The material presented in this chapter was adapted from the following publication:
Yuchun Guo, Georgios Papachristoudis, Robert C. Altshuler, Georg K. Gerber, Tommi
S. Jaakkola, David K. Gifford, and Shaun Mahony (2010). Discovering homotypic
binding events at high spatial resolution, Bioinformatics 26(24): 3028-3034.
Collaborations:
Y.G., S.M. and D.K.G. conceived the project. Y.G., S.M., G.K.G., G.P., T.S.J. and
D.K.G. designed the computational model and implemented the algorithm. Y.G., S.M.,
G.P., and R.C.A. analyzed the data. Y.G., S.M. and D.K.G. wrote the manuscript.
Chapter 2: Genome Positioning Systems (GPS)
2.1 Introduction
The precise physical description of where transcription factors, histones, RNA
polymerase II, and other proteins interact with the genome provides an invaluable
mechanistic foundation for understanding gene regulation. ChIP-Seq (Chromatin
immunoprecipitation followed by high-throughput sequencing) has become an
indispensable tool for genome-wide profiling of protein-DNA interactions (Barski et al.,
2007; Johnson et al., 2007; Mikkelsen et al., 2007; Robertson et al., 2007; Park, 2009).
Computational methods are necessary to predict the location of protein-DNA
interaction events from ChIP-Seq data because random variation in the ChIP DNA
fragmentation process obscures the actual location of interaction events (Figure 2-1). In
the ChIP-Seq protocol, reads are sequenced from the 5’ end of the ChIP DNA fragments
that are sonicated randomly in solution. Thus while ChIP-Seq DNA sequence reads are
mapped to precise bases in the genome, these reads do not manifestly indicate the
location of the protein-DNA interaction events that caused them. In addition, ChIP-Seq
reads caused by different closely spaced events (joint events) spatially mix with one
another along the genome, presenting a challenge for precisely estimating the
multiplicity and exact positions of proximal protein-DNA interaction events (Figure 2-1).
Figure 2-1 Random mixture of ChIP-Seq reads from joint events
Protein-DNA interaction events at closely spaced positions 1 and 2 on the genome result in a
mixture of reads (tags) in the ChIP-Seq protocol. Green and orange ovals represent proteins bound at
different positions. Solid lines and rectangle bars represent the ChIP DNA fragments and the
reads at the end of the fragments, respectively.
The difference between the computationally predicted location of a protein-DNA
binding event and the midpoint of its actual location is defined as spatial resolution. An
ideal computational method for analyzing ChIP-Seq data would accurately localize
protein-DNA interaction events (high spatial resolution), would include no false events
(high specificity), would include all true events (high sensitivity), and would be able to
resolve closely spaced DNA-protein interactions (joint event discovery).
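As a minimal sketch of how this definition might be applied to a set of predictions, the snippet below measures the offset of each predicted event from the nearest actual binding position and averages the offsets. The function names and the nearest-site matching rule are illustrative assumptions and are not part of GPS.

```python
def event_offsets(predicted, actual):
    """Distance (bp) from each predicted event to the nearest actual position.

    Matching each prediction to its nearest actual site is an assumption made
    for this sketch; both arguments are lists of genomic coordinates on the
    same chromosome.
    """
    return [min(abs(p - a) for a in actual) for p in predicted]


def mean_spatial_resolution(predicted, actual):
    """Average offset in bp; smaller values mean higher spatial resolution."""
    offsets = event_offsets(predicted, actual)
    return sum(offsets) / len(offsets)


# Hypothetical coordinates, for illustration only.
print(mean_spatial_resolution([100, 205], [103, 200, 1000]))
```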
Joint event discovery is important because it can capture cooperative biological
regulatory mechanisms in proximal genomic locations (Pepke et al., 2009). Homotypic
clusters of transcription factor binding sites (TFBS) have been extensively studied in
Drosophila (Lifanov et al., 2003). Such regulatory mechanisms may be common in
mammalian genomes as 40-60% of certain ChIP-Seq defined protein-DNA interaction
regions contain more than one motif within 200bp (Jothi, et al., 2008; Valouev, et al.,
2008). Furthermore, homotypic clusters of TFBS occupy nearly 2% of the human
genome and may act as key components of almost half of the human promoters and
enhancers (Gotea, et al., 2010). Thus, homotypic event discovery is necessary to fully
reveal the transcription factor regulatory interactions present in ChIP-Seq data.
Existing ChIP-Seq computational methods (Park, 2009; Pepke, et al., 2009) do not
simultaneously consider multiple events as the cause for observed reads in the context
of a probabilistic model at mammalian genome scale. To detect binding events,
PeakSeq extends the length of mapped reads to create peaks (Rozowsky, et al., 2009),
MACS shifts the mapped position of reads a fixed distance towards their 3’ ends (Zhang,
et al., 2008), FindPeaks aggregates overlapping reads (Fejes, et al., 2008), SISSRs
identifies positive to negative strand transition points at read accumulations (Jothi, et al.,
2008), cisGenome scans for the center of modes of the 5’ and 3’ peaks (Ji, et al., 2008),
and QuEST (Valouev, et al., 2008) and spp (Kharchenko, et al., 2008) use kernel density
estimation methods. All of these methods use statistical detection criteria such as
overlapping read counts or read distribution strand symmetry to estimate the location of
a protein-DNA interaction event.
Evaluations of ChIP-Seq event calling methods showed that although the methods identified
binding sites with a highly significant overlap with the corresponding sequence motif
(Laajala et al., 2009), and exhibited similar sensitivity and specificity (Wilbanks and
Facciotti, 2010), there are pronounced differences in their spatial resolution.
One important piece of information that is not fully exploited by the early ChIP-Seq
methods is the spatial distribution of reads (also called “read position densities”, “tag
density profile” or “peak shape”). A recent evaluation of five peak-finder methods
demonstrated the room for improvement by showing that over 80% of false binding calls
can be visually identified using the shape of read profiles without additional information
from background data or replicates (Rye et al., 2011). It further called for development of
methods utilizing the shape information. A recent method named CSDeconv
deconvolves proximal binding events using a computed spatial read distribution (Lun, et
al., 2009), although it is at present computationally impractical on entire mammalian
genomes.
In this chapter, I present the Genome Positioning System (GPS), a high-resolution
computational method for genome-wide ChIP-Seq analysis that can accurately detect
protein-DNA interaction events and deconvolve closely spaced events by modeling the
spatial distribution of ChIP-Seq reads at single base resolution. GPS detects more joint
events in synthetic and actual ChIP-Seq data and has superior spatial resolution when
compared with other methods.
2.2 GPS algorithm
2.2.1 GPS algorithm overview
GPS has three phases: spatial distribution discovery, event discovery, and the
determination of event significance.
In its first phase, GPS summarizes the observed genomic spatial distribution of
mapped reads from protein-DNA interaction events in the input ChIP-Seq data. The
farther a mapped read is located from an event, the less likely it is to be caused by the
event (Figure 2-2). We assume in GPS that for a given ChIP-Seq experiment, every
interaction event will produce the same characteristic distribution of reads. While this
assumption will not always be true, we have found that it produces good results in
practice.
In its second phase, GPS employs a probabilistic mixture model to assign an event
probability to every base in the genome. Each potential event’s contribution to
generating the observed reads is modeled (Figure 2-3A). A sparse prior on event
probabilities provides a complexity penalty that biases events to have their probability
mass at a single base position. Event probabilities are selected to maximize the
penalized likelihood of observed reads using an Expectation-Maximization (EM)
algorithm that segments the genome into efficiently solvable subproblems. GPS uses
the number of reads assigned to a base by the mixture model as a measure of the
relative strength of a predicted event at that base.
Figure 2-2 Spatial distribution of ChIP-Seq reads
The observed spatial read density (blue: “+” strand, red: “-” strand) from ~4,000 CTCF events
aligned with respect to the CTCF motif position at each event.
Figure 2-3 GPS probabilistically models ChIP-Seq read spatial distributions
A) GPS models ChIP-Seq reads as being generated by a mixture of binding events at every
genomic base, with each event producing the characteristic spatial read density. B) A sparse prior
on mixture components causes GPS to assign events to as few bases as possible to explain the
observed reads (green and orange reads). Positions 1 and 2 represent the estimated binding
positions of the protein of interest. In GPS, a given read can be explained by more than one event
(yellow reads).
In its third and final phase, GPS determines significant events by comparing the
number of reads at the predicted events to the corresponding normalized number of
reads in the control channel. We compute the statistical significance using the binomial
distribution (Rozowsky, et al., 2009) and correct for multiple hypothesis testing by
applying a Benjamini-Hochberg correction (Benjamini and Hochberg, 1995).
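The significance test can be sketched as follows. This is a hedged illustration rather than the GPS implementation: it assumes that, under the null hypothesis, each read at an event is equally likely to come from the ChIP channel or the scaled control channel (a binomial with p = 0.5). The function names and the `scale` parameter are hypothetical, and the read counts are made up.

```python
from math import comb


def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def event_pvalue(chip_count, control_count, scale=1.0):
    """One-sided p-value that the ChIP count exceeds the scaled control count.

    Assumption of this sketch: under the null, a read falls in the ChIP
    channel with probability 0.5 once the control has been scaled.
    """
    control_scaled = round(control_count * scale)
    return binomial_upper_tail(chip_count, chip_count + control_scaled)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical (ChIP, control) read counts for three predicted events.
counts = [(40, 5), (12, 10), (25, 3)]
pvals = [event_pvalue(chip, ctrl) for chip, ctrl in counts]
qvals = benjamini_hochberg(pvals)
significant = [q < 0.01 for q in qvals]
```

The exact summation is adequate for the modest read counts of a single event; a production implementation would more likely work with a log-space or library-provided binomial tail.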
2.2.2 Empirical spatial distribution of reads
GPS iteratively estimates the empirical spatial distribution of reads directly from
ChIP-Seq data. Given a set of events, we count all the reads at each position relative to
the corresponding event positions. Only the base positions within 250bp of the event are
counted because typical ChIP-Seq protocols perform a size selection in the range of
~150-300bp (Park, 2009) and we have empirically found that the probability of
generating reads at positions further than 250bp is not significant. The initial set of
events for estimating the empirical spatial distribution can be defined by using known
motifs or by finding the center of the forward and reverse read profiles (Zhang, et al.,
2008). Alternatively, GPS can use a generic empirical spatial distribution for ChIP-Seq
data to make the initial event prediction and then re-estimate the empirical spatial
distribution and use it for more accurate prediction (Figure 2-4). This process can be
repeated until convergence.
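One round of this estimation might look like the sketch below. It counts reads at strand-corrected offsets within 250bp of each event and normalizes the counts into a probability distribution. The function name, the data layout and the pseudocount are assumptions made for illustration; GPS may smooth or normalize the profile differently.

```python
from collections import Counter


def empirical_distribution(reads, events, max_dist=250, pseudocount=1.0):
    """Estimate emp(d) for d in [-max_dist, max_dist].

    reads  : list of (position, strand) tuples, strand is +1 or -1
    events : list of event positions
    The offset is strand-corrected, d = strand * (position - event), so reads
    on both strands share one profile. The pseudocount is an assumption of
    this sketch that keeps every offset at non-zero probability.
    """
    counts = Counter()
    for position, strand in reads:
        for event in events:
            d = strand * (position - event)
            if -max_dist <= d <= max_dist:
                counts[d] += 1
    n_offsets = 2 * max_dist + 1
    total = sum(counts.values()) + pseudocount * n_offsets
    return {d: (counts[d] + pseudocount) / total
            for d in range(-max_dist, max_dist + 1)}


# Toy data: three reads around a single hypothetical event at position 1000.
reads = [(950, 1), (1050, -1), (960, 1)]
emp = empirical_distribution(reads, events=[1000])
```

Re-running event prediction with the returned distribution and then re-estimating it from the new events gives the iterative scheme described above.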
2.2.3 GPS mixture model
GPS is based on a generative mixture model that describes the likelihood of an
observed set of ChIP-Seq reads from a set of protein-DNA interaction events. Each
event (mixing component) contributes a distribution of reads surrounding its genomic
position to the mixture of reads. We assume that reads are independent conditioned on
the locations of their underlying causal events.
GPS performs event discovery by finding the set of protein-DNA interaction events
that maximizes the penalized likelihood of the observed ChIP-Seq reads. We consider N
ChIP-Seq reads that have been mapped to genome locations $R = \{r_1, \ldots, r_N\}$ and M
possible protein-DNA interaction events at genome locations $B = \{b_1, \ldots, b_M\}$ (Figure
2-3A). We represent the latent assignments of reads to the location of events that
caused them as $Z = \{z_1, \ldots, z_N\}$, where $z_n = j$ when j is the index of the event located at
position $b_j$ that caused read n.
The conditional probability of read $r_n$ being generated from event j is
$$p(r_n \mid z_n = j) = \mathrm{emp}\big(s_n (r_n - b_j)\big)$$
where emp(d) is the empirical spatial distribution that models the probability of a read
occurring d bases away from its corresponding event position (Figure 2-2). Strand
sense is handled by $s_n = 1$ or $s_n = -1$ if read $r_n$ is mapped to the forward strand or reverse
strand, respectively. We assume that all the events in one ChIP-Seq experiment have
the same empirical spatial distribution.
The probability of observing a read is a convex combination of possible binding
events
$$p(r_n \mid \pi) = \sum_{j=1}^{M} \pi_j \, p(r_n \mid j)$$
where M is the number of possible events, π denotes the parameter vector of mixing
probabilities (i.e. the probabilities of the possible events), and $\pi_j$ is the probability of
event j, with $\sum_{j=1}^{M} \pi_j = 1$.
The overall likelihood of the observed set of reads is then
$$p(R \mid \pi) = \prod_{n=1}^{N} \sum_{j=1}^{M} \pi_j \, p(r_n \mid j)$$
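To make the mixture likelihood concrete, the following sketch evaluates its logarithm for a toy example. The empirical distribution is represented as a dictionary from strand-corrected offset to probability; this representation, the function names and the toy values are assumptions for illustration only.

```python
import math


def read_probability(read, event_position, emp):
    """p(r_n | j): look up emp at the strand-corrected offset."""
    position, strand = read
    return emp.get(strand * (position - event_position), 0.0)


def log_likelihood(reads, event_positions, pi, emp):
    """ln p(R | pi) = sum_n ln sum_j pi_j * p(r_n | j)."""
    total = 0.0
    for read in reads:
        mixture = sum(pi_j * read_probability(read, b_j, emp)
                      for pi_j, b_j in zip(pi, event_positions))
        if mixture == 0.0:
            return float("-inf")  # a read that no event can explain
        total += math.log(mixture)
    return total


# Toy distribution over offsets -1, 0, +1 (hypothetical values).
toy_emp = {-1: 0.25, 0: 0.5, 1: 0.25}
toy_reads = [(10, 1), (11, -1)]
ll = log_likelihood(toy_reads, event_positions=[10, 11], pi=[0.5, 0.5], emp=toy_emp)
```

Working in log space turns the product over reads into a sum, which avoids numerical underflow when N is large.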
Our assumption is that binding events are relatively sparse throughout the genome.
To model this assumption, we place a negative Dirichlet prior distribution (Figueiredo
and Jain, 2002; Bicego et al., 2007) $p(\pi)$ on π:
$$p(\pi) \propto \prod_{j=1}^{M} \frac{1}{(\pi_j)^{\alpha}}, \quad \alpha > 0$$
where α is a tuning parameter to adjust the degree of sparseness. If for event j, the value
of $\pi_j$ becomes zero (see component elimination below), the model is restructured to
eliminate the event.
2.2.4 EM algorithm
We solve for the MAP (maximum a posteriori) solution for π using the
Expectation-Maximization (EM) algorithm (Dempster, et al., 1977). The complete-data log penalized
likelihood is
$$\ln p(R, Z, \pi) = \sum_{n=1}^{N} \left[ \sum_{j=1}^{M} \mathbf{1}(z_n = j)\big(\ln \pi_j + \ln p(r_n \mid j)\big) \right] - \alpha \sum_{j=1}^{M} \ln \pi_j$$
where $\mathbf{1}(z_n = j)$ is the indicator function.
We initialize the mixing probabilities π with uniform probabilities, $\pi_j = 1/M$, where
j = 1, …, M.
At the E step, we use the current parameter estimates π to evaluate the expectation
of Z given R,
$$\gamma(z_n = j) = \frac{\pi_j \, p(r_n \mid j)}{\sum_{j'=1}^{M} \pi_{j'} \, p(r_n \mid j')}$$
We can interpret $\gamma(z_n = j)$ as the fraction of read n that is assigned to event j. This is
referred to as a "soft assignment" because read n can be assigned partially to multiple
events.
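The E step can be illustrated with a short sketch that computes the soft assignments for a toy example. The dictionary representation of the empirical distribution, the function name and the handling of reads that no event explains are assumptions made for this illustration.

```python
def e_step(reads, event_positions, pi, emp):
    """Return gamma[n][j], the fraction of read n assigned to event j."""
    gamma = []
    for position, strand in reads:
        weights = [pi_j * emp.get(strand * (position - b_j), 0.0)
                   for pi_j, b_j in zip(pi, event_positions)]
        norm = sum(weights)
        # A read that no event explains gets zero responsibility everywhere
        # (a choice made for this sketch).
        gamma.append([w / norm if norm > 0 else 0.0 for w in weights])
    return gamma


# Toy distribution over offsets -1, 0, +1 (hypothetical values).
toy_emp = {-1: 0.25, 0: 0.5, 1: 0.25}
gamma = e_step([(10, 1), (11, -1)], event_positions=[10, 11],
               pi=[0.5, 0.5], emp=toy_emp)
```

In this toy case each read lies exactly on one event and one base away from the other, so two thirds of each read is assigned to the closer event and one third to the other.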
At the M step, on iteration i we find parameter $\hat{\pi}^{(i)}$ to maximize the expected
complete-data log penalized likelihood,
$$\hat{\pi}_j^{(i)} = \arg\max_{\pi_j} \left\{ \sum_{n=1}^{N} \left[ \sum_{j=1}^{M} \gamma(z_n = j)\big(\ln \pi_j + \ln p(r_n \mid j)\big) \right] - \alpha \sum_{j=1}^{M} \ln \pi_j \right\}$$
under the constraint $\sum_{j=1}^{M} \pi_j = 1$.
Use a Lagrange multiplier λ (Bishop, 2006) to incorporate the constraint $\sum_{j=1}^{M} \pi_j = 1$,
$$\hat{\pi}_j^{(i)} = \arg\max_{\pi_j} \left\{ \sum_{n=1}^{N} \left[ \sum_{j=1}^{M} \gamma(z_n = j)\big(\ln \pi_j + \ln p(r_n \mid j)\big) \right] - \alpha \sum_{j=1}^{M} \ln \pi_j + \lambda \Big( \sum_{j=1}^{M} \pi_j - 1 \Big) \right\}$$
To maximize the right hand side term, set its derivative with respect to $\pi_j$ to 0,
$$\sum_{n=1}^{N} \frac{\gamma(z_n = j)}{\hat{\pi}_j} - \frac{\alpha}{\hat{\pi}_j} + \lambda = 0$$
$$\hat{\pi}_j \lambda = \alpha - \sum_{n=1}^{N} \gamma(z_n = j) \qquad \text{(2-1)}$$
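Sum both sides of the equation over j to solve for λ,
$$\sum_{j=1}^{M} \hat{\pi}_j \lambda = \sum_{j=1}^{M} \Big( \alpha - \sum_{n=1}^{N} \gamma(z_n = j) \Big)$$
$$\lambda = \sum_{j=1}^{M} \Big( \alpha - \sum_{n=1}^{N} \gamma(z_n = j) \Big) \qquad \text{(2-2)}$$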
Substituting (2-2) back into (2-1), we find
$$\hat{\pi}_j^{(i)} = \frac{N_j - \alpha}{\sum_{j'=1}^{M} (N_{j'} - \alpha)}, \qquad N_j = \sum_{n=1}^{N} \gamma(z_n = j)$$
where $N_j$ is the expected number of reads assigned to event j.
As we iteratively estimate $\hat{\pi}$, we use a component elimination method (Figueiredo
and Jain, 2002). If $N_j \le \alpha$, we set $\pi_j = 0$ to eliminate event j. Our final estimate of $\hat{\pi}^{(i)}$ is
$$\hat{\pi}_j^{(i)} = \frac{\max(0, N_j - \alpha)}{\sum_{j'=1}^{M} \max(0, N_{j'} - \alpha)}$$
The sparseness parameter α can be interpreted as the minimum number of reads
that an event needs to survive the EM iterations. Intuitively, the effect of the sparseness
prior is to penalize each event by α reads and promote competition among the
remaining events. The EM algorithm is deemed to have converged when the change in
likelihood falls below a specified threshold.
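As a small worked example with hypothetical numbers, suppose three candidate events have expected read counts $N_1 = 10$, $N_2 = 3$ and $N_3 = 1$, and α = 2. The update then gives
$$\big(\max(0, N_j - \alpha)\big)_{j=1}^{3} = (8, 1, 0), \qquad \hat{\pi} = \left(\tfrac{8}{9}, \tfrac{1}{9}, 0\right)$$
The third event is supported by fewer than α reads and is eliminated. At the next E step its mixing probability is zero, so the reads it previously explained are reassigned to the surviving events.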
Our implementation of component elimination includes two special cases. To avoid
premature elimination of components during EM iterations, we start with α = 0 for a
number of iterations to allow nascent components to gain support from the data. This
is because when the number of components M is large, no component may have enough
initial support to prevent its $\pi_j$ from being immediately forced to zero. We
then set α to our desired value. Furthermore, in a single iteration we do not eliminate all the
components that meet the criterion $N_j \le \alpha$. Instead, we only eliminate the components
with the lowest value of $N_j$ at each iteration. This allows the data points that supported
the eliminated components to be re-distributed immediately to support the other
components.
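The two special cases can be expressed as a small selection rule. The sketch below is a hedged reading of the description above: the function name, the `warmup_iters` parameter and the dictionary layout are hypothetical, and ties at the lowest value are all eliminated together, which is an assumption.

```python
def components_to_eliminate(N, alpha, iteration, warmup_iters):
    """Pick the events to drop in this EM iteration (illustrative sketch).

    N maps the index of each still-active event to its expected read count.
    During the first `warmup_iters` iterations alpha is treated as 0,
    so nothing is eliminated.
    """
    if iteration < warmup_iters:
        return []
    # Candidates are the events that fail the N_j > alpha requirement
    below = {j: n for j, n in N.items() if n <= alpha}
    if not below:
        return []
    # Only the weakest candidates are removed now; the rest get another
    # chance after the freed reads have been redistributed
    lowest = min(below.values())
    return sorted(j for j, n in below.items() if n == lowest)


counts = {0: 5.0, 1: 0.5, 2: 1.5, 3: 0.5}
dropped = components_to_eliminate(counts, alpha=2.0, iteration=10, warmup_iters=3)
```

Here events 1 and 3 share the lowest expected count and are dropped, while event 2 (also below α) survives to the next iteration.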
At the convergence of the EM algorithm, the GPS mixture model produces a list of
non-zero-probability events ($\pi_j \neq 0$), and the "soft" read assignments to these
events, $\gamma(z_n = j)$. We do not use the mixing probabilities π in subsequent analysis
because we segment the genome into regions for analysis, and π values are dependent
on the region analyzed.
We define event strength as the expected number of reads associated with the
event. Thus the event strength of event j is calculated as
$$N_j = \sum_{n=1}^{N} \gamma(z_n = j)$$
2.2.5 Setting the sparseness parameter α
The value of the sparseness parameter α will influence the sensitivity and specificity
of event detection. It should scale with the read count of the events in the region that
GPS is analyzing. From our experience in analyzing mouse CTCF and human GABP
datasets, the α value is set empirically as follows to achieve better spatial accuracy:
α = max( C max A , alpha min )
where C max is the maximum read count in a 500bp sliding window (i.e. roughly the
length of the non-zero density region of the read distribution) across the region that GPS
is evaluating, alpha min is the minimum read count for a valid binding event, and A is
a constant factor that can be specified on the command line.
We set the value of alpha min using a Poisson test. The parameter of the Poisson
distribution is set to the mean read count in 500bp sliding windows across the whole
genome. alpha min is then set to the read count that gives a P-value of 1e-4, subject to
a minimum of 6.
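A minimal sketch of this Poisson test follows. It assumes that "gives a P-value of 1e-4" means the smallest read count whose upper-tail probability P(X >= k) is at most 1e-4; the function names and that tail convention are assumptions, not taken from the GPS source code.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for n in range(1, k):
        term *= lam / n    # P(X = n) from P(X = n - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def alpha_min(mean_window_count, p_cutoff=1e-4, floor=6):
    """Smallest read count with upper-tail P-value <= p_cutoff, never below floor."""
    k = 0
    while poisson_upper_tail(k, mean_window_count) > p_cutoff:
        k += 1
    return max(k, floor)


# Hypothetical genome-wide mean of 2 reads per 500bp window
threshold = alpha_min(2.0)
```

For sparse backgrounds the Poisson threshold falls below 6, and the floor then determines the value.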
We tested setting different A values (A=1,2,3,4 or 5) or using fixed α values (α=10 or
20) when analyzing the GABP data. The results show that GPS with the settings
A=2,3,4 or 5 calls more joint events (~8-10%) and gives marginally better spatial resolution
of binding calls (~0.6bp) than with other settings. Thus we set A=3 in our analyses.
2.2.6 Statistical significance of predicted events
To evaluate the statistical significance of predicted events when we have a control
dataset, we compare the number of reads of the ChIP event to the number of reads in
the corresponding region in the control sample.
For non-overlapping events, we count the number of control reads in the range of
the empirical spatial distribution (+/- 250bp centered on the IP event). For joint events,
we need to assign control reads to the corresponding events. We run the EM algorithm
without the sparse prior (no component elimination, equivalent to α = 0) on the control
data, initializing the events j at the same positions as the predicted IP events. The M step
of the EM algorithm is modified as
$$\hat{\pi}_j^{(i)} = \frac{N_j}{\sum_{j'=1}^{M} N_{j'}} = \frac{N_j}{N}$$
where $N_j = \sum_{n=1}^{N} \gamma(z_n = j)$.
To account for differences between IP and control dataset sizes, we multiply the
control reads by a scaling factor. We divide long non-specific-binding regions (defined by
excluding the "enriched regions") into short segments (length 10 kb) and perform
least-squares linear regression using all the read count pairs of IP and control segments that
have at least one read. The slope of the regression is then the scaling factor, $F_{IP/C}$,
between the read counts from the IP and control (Kharchenko, et al., 2008; Rozowsky,
et al., 2009).
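The scaling-factor estimate can be illustrated as follows. Two details are not specified in the text and are assumptions of this sketch: the regression is ordinary least squares with an intercept (IP counts regressed on control counts), and a segment is kept when it has at least one read in either channel. The function name is hypothetical.

```python
def scaling_factor(ip_counts, control_counts):
    """Slope of an ordinary least-squares fit of IP counts on control counts.

    Each position in the two lists is one 10 kb non-specific segment.
    Segments with no reads in either channel are skipped (assumption).
    """
    pairs = [(c, ip) for ip, c in zip(ip_counts, control_counts) if ip + c > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two non-empty segments")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0.0:
        raise ValueError("control counts are constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sxx


# Made-up segment counts in which the IP has twice the reads of the control
factor = scaling_factor([2, 4, 6, 0], [1, 2, 3, 0])
```

A control read count is then multiplied by this factor before it is compared with the IP read count of the same event.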
Using a statistical testing method proposed by Rozowsky et al. (Rozowsky, et al.,
2009), we calculate the P-value from the cumulative distribution function for the binomial
distribution using the corresponding IP and scaled control read counts,
$$F(k, n, P) = \sum_{l=0}^{k} \binom{n}{l} P^{l} (1 - P)^{n-l}$$
where k is the scaled control read count, n is the ceiling of the sum of IP and scaled
control reads, P = 0.5, which is the probability under the null hypothesis that reads
should occur with equal likelihood from the IP as from the scaled control data.
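The binomial test can be written directly from the formula. In this sketch the scaled control count is rounded down to an integer before it is used as k; that rounding choice and the function names are assumptions, since the text only specifies the ceiling for n.

```python
import math


def binomial_cdf(k, n, p=0.5):
    """F(k, n, P): probability of at most k successes in n trials."""
    return sum(math.comb(n, l) * p ** l * (1 - p) ** (n - l) for l in range(k + 1))


def event_p_value(ip_count, control_count, scale):
    """P-value of an event from its IP count and scaled control count."""
    scaled_control = control_count * scale
    n = math.ceil(ip_count + scaled_control)  # ceiling of the combined read count
    k = math.floor(scaled_control)            # assumption: round the control count down
    return binomial_cdf(k, n, 0.5)


# Hypothetical event: 9 IP reads, 2 control reads, scaling factor 0.5
p = event_p_value(9, 2, 0.5)
```

With these made-up counts k = 1 and n = 10, so the P-value is (1 + 10) / 2^10, about 0.011.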
To correct for multiple hypothesis testing, we apply a Benjamini-Hochberg correction
to adjust the P-value (Benjamini and Hochberg, 1995). All the predicted events that are
tested for significance are ranked by P-value from most significant to least significant.
For each event, the Q-value is given by
$$\text{Q-value} = \text{P-value} \times \frac{\text{count}}{\text{rank}}$$
where count is the total number of events tested. Significant events are then selected
using a Q-value threshold.
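The Q-value calculation as stated can be sketched as below. The sketch follows the formula literally; common Benjamini-Hochberg implementations additionally cap the adjusted values at 1 and enforce monotonicity across ranks, which is not done here. The function name is hypothetical.

```python
def q_values(p_values):
    """Q-value = P-value * count / rank, with rank 1 for the smallest P-value."""
    count = len(p_values)
    # Indices of the events ordered from most to least significant
    order = sorted(range(count), key=lambda i: p_values[i])
    q = [0.0] * count
    for rank, i in enumerate(order, start=1):
        q[i] = p_values[i] * count / rank
    return q


# Three made-up P-values, returned in the original event order
q = q_values([0.01, 0.04, 0.03])
significant = [i for i, value in enumerate(q) if value <= 0.05]
```

In this example the Q-values are approximately 0.03, 0.04 and 0.045, so all three events pass a threshold of 0.05.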
If control data is not available, we apply a statistical test proposed by Zhang et al.
(Zhang, et al., 2008) that uses a dynamic Poisson distribution to account for local biases.
The dynamic parameter of a local Poisson model for the candidate event is defined as
$$\lambda_{local} = \max(\lambda_{BG}, \lambda_{5kb}, \lambda_{10kb})$$
where $\lambda_{BG}$, $\lambda_{5kb}$ and $\lambda_{10kb}$ are the λ values estimated from the corresponding
chromosome (background) and from the 5kb and 10kb windows centered at the event
location, to capture the background variability at both global and local scales. The
P-value is calculated as the upper tail of the Poisson distribution,
$$\text{P-value} = 1 - \sum_{n=0}^{N_{event}-1} \text{Pois}(n; \lambda_{local})$$
where $N_{event}$ is the read count of the candidate event. To correct for multiple hypothesis
testing, we apply a Benjamini-Hochberg correction as above.
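The control-free test can be sketched as follows. The three λ arguments are assumed to be already expressed as expected read counts for a window the size of one event; how each is rescaled from its chromosome, 5kb or 10kb estimate is not covered here, and the function name is hypothetical.

```python
import math


def local_poisson_p_value(n_event, lam_bg, lam_5kb, lam_10kb):
    """Upper-tail Poisson P-value, P(X >= n_event), under lambda_local."""
    lam_local = max(lam_bg, lam_5kb, lam_10kb)
    term = math.exp(-lam_local)  # Pois(0; lambda_local)
    cdf = 0.0
    for n in range(n_event):
        cdf += term                   # add Pois(n; lambda_local)
        term *= lam_local / (n + 1)   # advance to Pois(n + 1; lambda_local)
    return max(0.0, 1.0 - cdf)


# Hypothetical event with 12 reads where the 5kb window is the most enriched
p = local_poisson_p_value(12, lam_bg=1.0, lam_5kb=3.0, lam_10kb=2.0)
```

Taking the maximum of the three estimates makes the test conservative in regions where the local background is elevated.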
2.2.7 Artifact filtering
GPS filters the predicted events by computing the Kullback–Leibler divergence
(Kullback and Leibler, 1951) from the empirical read distribution to the spatial read
distribution of each predicted event,
$$D_{KL}(emp \,\|\, event) = \sum_{i} emp(i) \log \frac{emp(i)}{event(i)}$$
where event() is the spatial distribution of the non-zero read counts of the event computed
from the EM algorithm, emp() is the empirical read distribution at the corresponding
positions of the non-zero reads, and i is the index of the non-zero read positions.
Events with a Kullback–Leibler divergence value higher than a user-defined
threshold are discarded.
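A small sketch of this filter is given below. It assumes that both distributions are supplied as positive values at the same non-zero read positions, that each is renormalized to sum to 1 over those positions, and that the natural logarithm is used; none of these details is stated in the text, and the function names are hypothetical.

```python
import math


def kl_divergence(emp, event):
    """D_KL(emp || event) over matching positions (all entries must be > 0)."""
    if len(emp) != len(event):
        raise ValueError("distributions must cover the same positions")
    emp_total = sum(emp)
    event_total = sum(event)
    divergence = 0.0
    for e, v in zip(emp, event):
        p = e / emp_total    # empirical probability at this position
        q = v / event_total  # observed event probability at this position
        divergence += p * math.log(p / q)
    return divergence


def is_artifact(emp, event, threshold):
    """Flag an event whose read profile departs too far from the expected shape."""
    return kl_divergence(emp, event) > threshold


# Expected shape is flat, but 90% of the event's reads sit at one position
flagged = is_artifact([0.5, 0.5], [0.9, 0.1], threshold=0.3)
```

An event whose reads follow the empirical distribution exactly has a divergence of 0, while pile-ups of reads at a single position produce large values.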
2.2.8 Software implementation
We have implemented GPS in Java, and our software is available for download from
our website (http://cgs.csail.mit.edu/gps).
For computational efficiency, GPS independently processes separable genomic
regions. We identify separable regions with a conservative method that spatially
segments the genome at read gaps that are larger than the width of empirical spatial
distribution (500bp) and further excludes regions that contain fewer than 6 reads. The
segmented protein binding regions are typically a few thousand base pairs long.
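The segmentation rule can be sketched for a single chromosome as below. The function name and the tuple layout are hypothetical, and strand information is ignored for simplicity.

```python
def separable_regions(read_positions, max_gap=500, min_reads=6):
    """Split reads into regions at gaps larger than max_gap.

    Returns (start, end, read_count) tuples and drops regions that
    contain fewer than min_reads reads.
    """
    regions = []
    current = []
    for pos in sorted(read_positions):
        if current and pos - current[-1] > max_gap:
            # The gap is wider than the read distribution: close the region
            if len(current) >= min_reads:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_reads:
        regions.append((current[0], current[-1], len(current)))
    return regions


reads = [100, 150, 200, 250, 300, 350, 2000, 2100,
         5000, 5010, 5020, 5030, 5040, 5050, 5060]
regions = separable_regions(reads)
```

In this toy input the two isolated reads near position 2000 form a region of their own and are discarded because it contains fewer than 6 reads.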
To further reduce memory requirements and run time, GPS estimates events in two
stages for each region. In the first stage, initial events are spaced at 5bp intervals to
make a rough estimate of event locations. In the second stage, events are spaced at
1bp near locations predicted in the first stage.
For the CTCF ChIP-Seq experiment in this study (~4.2 million IP reads and ~7.9
million control reads), GPS requires 750MBytes of main memory, and runs for 21
minutes on an AMD 64bit 2.3GHz computer.
2.3 Results
2.3.1 GPS automatically adapts the empirical read distribution
Figure 2-4 GPS automatically adapts the empirical read distribution
A generic read distribution (from CTCF data, blue) was initially used to predict GABP binding
events. GPS then used the predicted positions to iteratively re-estimate the read distribution
specific for GABP. The GPS learned distribution (red) is highly similar to the GABP read
distribution defined by the GABP motifs (green).
We first verified that GPS is able to automatically adapt to the empirical read
distribution of the analyzed ChIP-Seq data. This is important because the GPS mixture
model is initialized with a pre-determined read distribution and the actual read
distribution generating the observed data can be very different depending on the binding
factors or the experimental protocols. A more accurately determined empirical read
distribution will lead to more accurate prediction of binding events by GPS. We tested
the adaptation of read distribution during GPS analysis on a human GABP dataset
(Valouev, et al., 2008) with an initial read distribution from a mouse CTCF dataset
(Chen, et al., 2008). GPS automatically adapts the empirical read distribution to the
GABP data by learning a read distribution anchoring on the GABP event positions
predicted using the CTCF distribution. The GPS learned GABP distribution is different
from the initial CTCF distribution, but highly similar to the GABP read distribution defined
by the GABP motif match positions (Figure 2-4). Therefore, even with a generic initial
read distribution, GPS is able to adapt to the read distribution of the analyzed data and
subsequently improve the prediction accuracy.
2.3.2 GPS predictions have higher spatial resolution
We next analyzed the spatial resolution of GPS on ChIP-Seq data profiling the
insulator binding factor CTCF (CCCTC-binding factor) (Chen, et al., 2008), as the high
information-content CTCF motif allows us to reliably measure spatial resolution based on
event distance to the CTCF motif. We used GFP ChIP-Seq data (Chen, et al., 2008) in
the third phase of GPS to control for non-specific binding.
We found that the spatial resolution of GPS on the CTCF data is superior to the
spatial resolution produced by eight published ChIP-Seq analysis methods (Figure 2-5):
MACS (Zhang, et al., 2008), SISSRs (Jothi, et al., 2008), cisGenome (Ji, et al., 2008),
QuEST (Valouev, et al., 2008), FindPeaks (Fejes, et al., 2008), spp-wtd, spp-mtc
(Kharchenko, et al., 2008) and PeakRanger (Feng et al., 2011). Because different
methods predict different sets of binding events, we limit our comparison to a matched
set of events. From the 22,222 top ranking predictions by each method, 4,322 events
are predicted by all nine methods and correspond to the same high-scoring CTCF
binding motif. Of these matched events, 84.5% of the predictions by GPS are within
20bp of the CTCF binding motif, while between 63.2% and 73.4% of predictions by other
methods are within 20bp (Figure 2-5A). GPS has an average spatial resolution of
11.27±10.21bp, compared to 14.55±12.50bp for SISSRs, 16.14±12.30bp for MACS,
16.14±13.90bp for cisGenome, 17.54±13.44bp for QuEST, 14.97 ±11.09 for FindPeaks,
16.52 ±12.82 for spp-wtd, 16.22 ±14.57 for spp-mtc and 15.03 ±12.03 for PeakRanger.
Although the matched set allows direct comparison on the same binding events, it is
possible to introduce some bias such as focusing only on the subset of more significant
binding events. We thus evaluated spatial resolution while increasing the number of all
the top ranking events identified by each method and the performances of the methods
are similar to those of the matched set (Figure 2-5B). The above analysis was repeated
with 50bp window size and the results are similar to those with 100bp window size.
SISSRs, MACS and two spp methods were shown to have better spatial resolution
than seven other methods in a recent performance evaluation (Wilbanks and Facciotti,
2010), and thus our analysis of CTCF data shows that GPS may have superior spatial
resolution to these seven methods.
Figure 2-5 GPS improves the effective spatial resolution
A) Fraction of predicted CTCF binding events with a motif within the given distance with event
discovery by GPS, SISSRs, MACS, cisGenome, QuEST, FindPeaks, spp-wtd, spp-mtc and
PeakRanger. Events shown were predicted by all nine methods and had a CTCF motif within
100bp. B) The spatial resolution of CTCF event calls is shown averaged over increasing
numbers of the strongest ranked events identified by different methods.
GPS was further compared to two recently published methods, PICS (Zhang et al.,
2011) and SeqSite (Wang and Zhang, 2011), on the spatial resolution of predicted event
positions. These two methods also model the ChIP-Seq read distribution of the binding
event and achieve better spatial resolution but were not included in the previous
comparison because they were published after GPS (Guo et al., 2010). A widely
evaluated human FoxA1 binding dataset (Zhang et al., 2008) was used for the
evaluation because PICS requires a read mappability profile, which is available only for
the human genome. Of 698 matched events that are predicted by all three methods and
correspond to the same high-scoring FoxA1 binding motif, GPS has an average spatial
resolution of 16.35±15.70bp, compared to 17.86±16.16bp for PICS, and 20.15±14.54bp
for SeqSite (Figure 2-6). Therefore, GPS achieves better spatial resolution of binding
event predictions than these two new “shape-aware” methods.
2.3.3 GPS discovers more joint events
Using synthetic data we found that GPS is able to detect more joint events than
other methods. We generated synthetic joint events and single events by placing
ChIP-Seq binding events from actual CTCF data at pre-defined intervals. GPS detects 99.7%
of joint events that are 200bp apart, while SISSRs detects only 54.5% of joint events that
are 200bp apart. MACS and QuEST detect none of the joint events that are 200bp apart
and detect joint events only when they are more than 280bp apart (Figure 2-7 A).
Although SISSRs appears to be more sensitive when the joint events are less than
100bp apart, it makes more false joint event calls with the synthetic single events than all
other methods (data not shown).
Figure 2-6 GPS has better spatial resolution than other shape-aware methods
Fraction of predicted FoxA1 binding events with a motif within the given distance with event
discovery by GPS, PICS and SeqSite. Events shown were predicted by all three methods and
had a FoxA1 motif within 100bp.
GPS is also able to predict more joint events than the other methods we tested on
actual ChIP-Seq data. For example, GPS uniquely detects two CTCF events in mouse
ES cells over proximal CTCF motifs that are 99bp apart on chromosome 8 (Figure 2-7
B). However, the CTCF dataset does not contain a sufficient number of joint events to
effectively evaluate the methods on a whole genome scale. We selected a human
Growth Associated Binding Protein (GABP) ChIP-Seq dataset for our evaluation
because GABP ChIP-Seq data were previously reported to contain joint events (Lun, et
al., 2009; Valouev, et al., 2008). We identified 581 candidate sites of joint events that all
had at least one event detected by all five methods and where each site contains two or
more GABP motifs separated by less than 500bp. GPS identified joint events in 122
candidate sites, while SISSRs and QuEST detected joint events at fewer than 83 of the
candidate sites, and MACS and cisGenome only identified 3 and 5 of the candidate sites
as containing joint events respectively (Figure 2-7 C).
Figure 2-7 GPS improves accuracy in resolving joint binding events
A) Fraction of binary events recovered vs. the distance between the generated synthetic events for
GPS, SISSRs, MACS and QuEST. B) Example of a predicted binary CTCF event that contains
coordinately located CTCF motifs. C) Number of GABP events discovered by GPS, SISSRs,
MACS, cisGenome, and QuEST in regions that contain clustered GABP motifs within 500bp.
2.4 Discussion
GPS is a novel computational method that predicts the most likely positions of
binding events at single-base resolution. It uses a probabilistic mixture model based on
the characteristic spatial distribution of reads. Instead of aggregating ChIP-Seq reads to
ChIP-chip like analog “peaks”, GPS models the individual reads and thus retains the rich
digital information gained from high-throughput sequencing. Our analysis with synthetic
and actual ChIP-Seq data demonstrates the value of our approach in resolving closely
spaced joint events and improving event spatial resolution.
The “peak shape” information is useful for distinguishing false events from true
events but it is not fully exploited by the early peak finders and is considered challenging
to model (Rye et al., 2011). To the best of our knowledge, GPS is the first method to
explicitly model real “peak shape” information, with an adaptable empirical read spatial
distribution learned directly from the ChIP-Seq data. GPS also provides a principled
method to use “peak shape” for filtering artifacts and thus improves specificity of
prediction. Before GPS, the “peak shape” information was only used in a simplified
manner to infer average fragment length (Zhang, et al., 2008). After the publication of
GPS (Guo et al., 2010), two subsequently published works employed a similar approach
to modeling the spatial distribution of reads: PICS uses DNA fragment length information
to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture
model (Zhang et al., 2011); SeqSite models the read distribution as a truncated gamma
distribution and locates TFBSs with a least-squares model fitting strategy on smoothed
data (Wang and Zhang, 2011). Such efforts showed improved performance in resolving
closely spaced joint events and improved spatial resolution of the prediction (Wang and
Zhang, 2011; Zhang et al., 2011), underscoring the value of high resolution modeling of
ChIP-Seq data. Although such parametric distributions provide a simple description of
the read distribution and allow parameter estimation to account for the variation among
the events, the parametric distributions do not adequately fit the real data. In particular,
the asymmetry of the read distribution of real data (Figure 2-4) cannot be accurately
modeled by the symmetric t-distribution and the non-zero probability region around
binding site (Figure 2-4) cannot be captured by the gamma-distribution with zero density
at the binding position. In GPS, the automatically learned empirical read distribution
provides a more accurate description of the reads, which is consistent with the
comparison result that GPS has better spatial resolution than PICS and SeqSite. In
addition, the empirical read distribution does not cause a computation performance
disadvantage for GPS because it is represented as a look-up table, allowing probabilities
to be computed efficiently for the EM algorithm iterations.
GPS provides improved spatial resolution when compared with contemporary
methods. A recent evaluation showed that DNA-binding motif discovery is more
successful from the fixed-size regions flanking the point position of predicted binding
sites (i.e. the summits of the peaks) than from the long regions returned by some of the
methods evaluated (Rye et al., 2011). Thus the improved spatial resolution of GPS may
enable researchers to search for DNA-binding motifs in narrower windows of sequence,
effectively increasing the signal to noise ratio to improve motif discovery. The high
spatial resolution of GPS can be used to produce a position-specific prior (Narlikar et al.,
2006b; Qi et al., 2006; Bailey et al., 2010) that can be used by motif discovery methods
to limit the motif search to tight genomic regions around events, or to exclude event
locations for co-factor motif discovery.
GPS’s ability to resolve homotypic events from ChIP-Seq data will facilitate the
genome-wide study of the effects of cooperative binding on gene expression under specific
biological conditions. This is achieved by explicitly modeling the mixing of closely spaced
homotypic events using a mixture model framework. Accurately deconvolving homotypic
events allows more accurate quantification of each event in terms of binding strength as
well as binding location. Such accurate quantification will facilitate further study on the
cooperativity of homotypic events. Homotypic binding sites have been suggested to act
as key components of invertebrate and mammalian promoters and enhancers (Gotea, et
al., 2010; Lifanov, et al., 2003). In addition, modeling based approaches have
demonstrated that identifying homotypic binding is important for the faithful reproduction
of biological behaviors (Segal, et al., 2008).
Furthermore, we expect that alternative empirical read distributions can be used for
different kinds of events, such as histone location, as the GPS framework is inherently
adaptable to other empirical read distributions.
2.5 Methods
2.5.1 Datasets used
Dataset 1: CTCF binding
To evaluate the performance of GPS, we analyzed a ChIP-Seq dataset of insulator
binding factor CTCF in mouse ES cells, with a control using antibody against GFP to
control for non-specific binding (Chen, et al., 2008). We chose CTCF data for our
evaluation because the strong CTCF consensus motif allows us to reliably measure
spatial resolution. The ChIP-Seq data comprised 4.2 million CTCF reads and 7.9 million
GFP reads that uniquely map to the mm8 mouse genome.
Dataset 2: GABP binding
To evaluate the performance of joint event discovery, we analyzed a ChIP-Seq
dataset of GABP in human Jurkat cells, with a control using input DNA (Valouev, et al.,
2008). GABP binding has been reported to contain multiple binding motifs in a short
region (Lun, et al., 2009; Valouev, et al., 2008). The data were downloaded from the QuEST
website (http://mendel.stanford.edu/SidowLab/downloads/quest/). It comprised 7.9
million GABP reads and 17.4 million input DNA reads that uniquely map to the hg18
human genome.
Dataset 3: FoxA1 binding
The FoxA1 dataset (Zhang et al., 2008) is used to evaluate the performance of three
shape-based ChIP-Seq methods: GPS, PICS and SeqSite. The FoxA1 ChIP-Seq data
were downloaded from http://liulab.dfci.harvard.edu/MACS/Sample.html. They comprised
3.9 million FoxA1 reads and 5.2 million input DNA reads that uniquely map to the hg18
human genome.
Motif
The CTCF motif is reported in (Guo et al., 2010). The GABP motif was retrieved from the
TRANSFAC database (M00341) (Matys, et al., 2003). The FoxA1 motif was retrieved from the
JASPAR database (MA0148.1) (Sandelin et al., 2004).
2.5.2 ChIP-Seq analysis methods
We compared the performance of GPS against ten published ChIP-Seq analysis
methods: MACS (version 1.3.7.1)(Zhang, et al., 2008), SISSRs (version 1.4)(Jothi, et al.,
2008), cisGenome (version 1.2)(Ji, et al., 2008), QuEST (version 2.4)(Valouev, et al.,
2008), FindPeaks (version 4)(Fejes, et al., 2008), spp-wtd and spp-mtc (version 1.8)
(Kharchenko, et al., 2008), PeakRanger (version 1.12)(Feng et al., 2011), PICS (version
1.11)(Zhang et al., 2011) and SeqSite (version 1.1.2)(Wang and Zhang, 2011). All the
methods were run using default parameters except as described in the following.
For MACS, we used the summit location as the predicted binding site position. The
binding events are sorted by p-value.
For SISSRs, “-t” option was used to obtain binding site predictions.
For cisGenome, we analyzed the data with the option of boundary refinement, and
used the center of the predicted region as the binding site position. In our tests, these
options gave the best result in spatial resolution.
For FindPeaks, we ran with options “-dist_type 1 -duplicatefilter” to filter artifact
reads. We used the max_coord position as the predicted binding site. The binding
events are sorted by height.
For spp-wtd and spp-mtc, the binding events are sorted by FDR and then by score.
2.5.3 Method comparison on spatial resolution
We evaluated the effective spatial resolution of GPS against other methods. We
define effective spatial resolution as the absolute value of the distance between the genome
coordinates of a predicted binding event and the middle of the corresponding high-scoring
binding motif hit. The sign of the offset was adjusted according to the strand on which
the motif hit occurred. Because the center of the motif hit may not represent a true
center of binding event, the offsets to the motif were centered by subtracting the mean
offsets (Kharchenko, et al., 2008). Because different methods predict different sets of
binding events, we compare spatial resolution on the “matched” set of predictions that
correspond to the same high-scoring binding motif. Only those events within 100bp of a
motif match are included in the calculation. For CTCF, from the top 34,019 predictions of
each method, we select the 7,653 events that were predicted by all eight methods.
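The resolution measure defined above can be sketched as follows. The input layout (one tuple of predicted position, motif center and motif strand per event) and the function name are assumptions made for illustration, as is treating "within 100bp" as an inclusive bound.

```python
def mean_spatial_resolution(events, max_distance=100):
    """Average effective spatial resolution over events near a motif.

    events: list of (predicted_pos, motif_center, strand), strand '+' or '-'.
    """
    offsets = []
    for predicted_pos, motif_center, strand in events:
        offset = predicted_pos - motif_center
        if strand == '-':
            offset = -offset  # orient the offset relative to the motif strand
        if abs(offset) <= max_distance:
            offsets.append(offset)
    if not offsets:
        raise ValueError("no event lies within max_distance of a motif")
    # Center the offsets, since the motif middle may not be the binding center
    mean_offset = sum(offsets) / len(offsets)
    resolutions = [abs(o - mean_offset) for o in offsets]
    return sum(resolutions) / len(resolutions)


events = [(105, 100, '+'), (97, 100, '-'), (1000, 100, '+'), (101, 100, '+')]
resolution = mean_spatial_resolution(events)
```

In this toy example the third event is 900bp from its motif and is excluded; the remaining strand-adjusted offsets (5, 3, 1) are centered on their mean of 3 before the absolute values are averaged.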
We also evaluate spatial resolution while increasing the number of top ranking
events identified by each method. Note that this analysis does not have a “matched” set
of predictions. We simply average the spatial resolution of the top ranking events that
have a motif at a distance less than 100bp. The results are similar to those of the
“matched” set.
2.5.4 Evaluating performance in deconvolving joint events using synthetic
data
In order to test the performance of joint event detection we constructed realistic
synthetic datasets using CTCF binding data. These synthetic datasets allow us to more
accurately evaluate the performance of different methods, as we know the true location
of the constituent parts of joint binding events. To construct the datasets, we first collect
the set of CTCF events that have the following properties: i) they are predicted by five
evaluated methods (GPS, SISSRs, MACS, cisGenome, QuEST), ii) none of the five
methods predicts more than two events in the region, iii) they contain a match to the
CTCF motif, iv) the average distance from the motif match to the event prediction across
all five methods is less than 10bp, v) the enrichment of CTCF ChIP-Seq reads under the
event is significantly greater than the level of GFP reads with a P-value of less than
0.001 (as calculated by a binomial test). A total of 3,233 CTCF binding events meet
these criteria.
Synthetic ChIP-Seq data were constructed by randomly choosing one of the real
CTCF events and translating the coordinates of its reads in the surrounding 1Kbp onto a
fake genome. This is repeated to simulate 20,000 synthetic single events, each placed
100Kbp apart on the fake genome. We similarly create 1,000 joint (binary) binding
events by randomly choosing two real CTCF events and placing them on the fake
genome a fixed distance apart. Note that this method of constructing synthetic joint
events assumes that the ChIP-Seq reads generated by closely neighboring events will
be an independent mixture of the reads generated by each component event. A
synthetic control channel is simulated by taking GFP reads in the regions around CTCF
events and translating their coordinates in the same way as the matched IP reads.
Further control reads are randomly spread across the fake genome until the read counts
in the synthetic IP and control channels match. Datasets are constructed for the
following distances between joint binding events: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
110, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300, 350, 400, 450, 500, 550, 600,
650, 700, and 750. The datasets can be downloaded from the GPS website.
Note that we generated the synthetic data with the number of single events and joint
events on the same order as real data. However, the read counts and fake genome size
are different from the real experiment, and this may throw off some methods that are
tuned to make certain assumptions about the distribution of the data.
GPS, SISSRs, MACS and QuEST were run with default settings on the synthetic
datasets. cisGenome, FindPeaks and spp are not evaluated because we cannot script
them to run on the command line for the repeated tests on multiple synthetic datasets.
Note that MACS should not be unduly affected by joint binding events when estimating
the correct (CTCF) binding distribution, as a large set of single events exist in the
synthetic experiment.
An algorithm is said to have correctly recovered joint binding events when it makes
two event predictions in the relevant area and these predictions are each within 100bp of
the position at which the event was simulated. The false calls of a method are also
counted when synthetic single events are called as joint events.
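The recovery criterion can be written as a small predicate. The sketch assumes that "two event predictions" means exactly two predictions in the region and that predictions are matched to simulated positions in sorted order; the function name is hypothetical.

```python
def recovered_joint_event(predictions, true_positions, tolerance=100):
    """True if a simulated joint event is counted as correctly recovered.

    predictions: predicted event positions within the relevant area.
    true_positions: the two positions at which the events were simulated.
    """
    if len(predictions) != 2 or len(true_positions) != 2:
        return False
    # Pair the left prediction with the left simulated event, right with right
    paired = zip(sorted(predictions), sorted(true_positions))
    return all(abs(p - t) <= tolerance for p, t in paired)


ok = recovered_joint_event([1010, 1190], (1000, 1200))
merged = recovered_joint_event([1100], (1000, 1200))
```

A method that reports a single merged event between the two simulated positions therefore does not count as having recovered the joint event.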
2.5.5 Evaluating performance in deconvolving joint binding events using
GABP ChIP-Seq data
To evaluate the genome-wide performance of joint event discovery in real ChIP-Seq
data, we analyzed a human GABP ChIP-Seq dataset, which was reported previously to
contain multiple binding motifs in a short region (Lun, et al., 2009; Valouev, et al., 2008).
For the GABP dataset, we compared GPS against 4 other methods (SISSRs,
MACS, cisGenome and QuEST) genome wide. FindPeaks only reported 615 events (991
events with the –subpeaks option), far fewer than the other methods. Therefore it is
not included in the subsequent joint event discovery analysis. We did not run spp on
GABP data because the data format we downloaded from the QuEST website cannot be
used for spp, which reads BOWTIE or ELAND format.
The numbers of events predicted by the five methods are: GPS (17,179), SISSRs
(16,567), MACS (14,527), cisGenome (21,101), QuEST (6,442). The top 6,442 events
from each of the methods were used in our comparison.
We define a set of candidate sites that all have at least one event detected by all
five methods, and that contain two or more GABP motifs separated by less than 500bp.
We discovered 581 such sites. Thus nearly 9% of the GABP bound regions potentially
contain joint binding events. For each of these sites, we count the number of events
discovered by different methods.
Chapter 3: K-mer set motif representation and discovery
Chapter 3
K-mer set motif representation and
discovery
Part of the material presented in this chapter was adapted from the following publication:
Yuchun Guo, Shaun Mahony, David K Gifford (2012). High resolution genome wide
binding event finding and motif discovery reveals transcription factor spatial binding
constraints, PLOS Computational Biology, 8(8): e1002638.
Collaborations:
Y.G., S.M. and D.K.G. conceived the project. Y.G. designed the computational model
and implemented the algorithm. Y.G. and S.M. analyzed the data. Y.G., S.M. and D.K.G.
wrote the manuscript.
Chapter 3: K-mer set motif representation and discovery
3.1 Introduction
DNA sequence motifs are short patterns that are presumed to have a biological
function (D’haeseleer, 2006b). The identification of motifs corresponding to the
sequence-specific binding sites of transcription factors (TFs) has become one of the
most widely studied problems in computational biology, both for its biological
significance and for its bioinformatics challenge (Zambelli et al., 2012).
Computational identification of transcription factor binding sites has proven to
be of great importance in deciphering complex transcriptional regulatory networks
(Spellman et al., 1998; Lee et al., 2002; Kim and Park, 2011). The binding motifs of TFs
describe the sequence preferences of the factors and indicate where they are likely to bind
and which genes they are likely to regulate.
However, motif discovery has been a challenging problem in computational biology.
The DNA sequences bound by TFs can be as short as 6-8 bases and be quite variable.
Finding such short and imperfect copies of an unknown pattern in a set of noisy
sequences of hundreds or thousands of base pairs has been likened to searching for the
proverbial needle in a haystack (D’haeseleer, 2006a).
3.1.1 DNA motif representations
A motif model can be viewed as a classifier in the context of supervised learning. It
is learned from the training sequences that contain bound TF sites. The motif model is
then used to scan new DNA sequences (testing sequences) and classify them according
to a score cutoff into two classes: motif instances and background sequences.
There are many ways for a motif model to represent the binding specificity of a
transcription factor. One simple and widely used motif representation is the consensus
sequence. It is a short word with ambiguity codes, most commonly IUPAC codes
(Nomenclature Committee of the International Union of Biochemistry, 1986), to represent
binding degeneracy (binding positions that admit more than one base). Motif discovery
with consensus sequences (or words, k-mers) has the advantage of being rigorous and
exhaustive but is less effective for long words and for words with several mismatches (Ladunga, 2010).
The most widely used motif model is the position weight matrix (PWM). A PWM is a
matrix in which each element gives the weight of a particular base at a particular motif
position; the weights are typically derived from the probability of observing each base at
that position. The PWM score for an observed sequence (motif instance) is the sum
of the corresponding base-specific weights for each base in that sequence (Stormo,
2000). A PWM can also be depicted graphically as a motif logo (Schneider and
Stephens, 1990). An early analysis showed that the PWM representation is more
sensitive and more precise than the consensus representation (Stormo et al., 1982).
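The PWM scoring described above can be illustrated with a minimal sketch. It assumes log-odds weights against a uniform background and a pseudocount of 1; these choices, the function names, and the toy sites are illustrative assumptions rather than the settings of any particular tool.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per motif position) from aligned, equal-length sites."""
    width = len(sites[0])
    pwm = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        # Weight = log2 of (estimated base probability / background probability).
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def score(pwm, seq):
    """Sum the position-specific weights of the bases in seq (len(seq) == len(pwm))."""
    return sum(column[base] for column, base in zip(pwm, seq))

def best_hit(pwm, seq):
    """Return (score, offset) of the highest-scoring window of seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i) for i in range(len(seq) - width + 1))

# Hypothetical aligned sites resembling the Oct4 core discussed later in the chapter.
sites = ["ATGCAAAT", "ATGCAAAT", "ATGCTAAT", "ATGCAAAG"]
pwm = build_pwm(sites)
top_score, top_offset = best_hit(pwm, "GGGATGCAAATCC")
```

Because each position contributes an independent term to the sum, this sketch also makes the position independence assumption explicit.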
Typically PWM and consensus sequence motif representations are employed when
the number of observed binding sites is limited. Although simple and compact, such
highly compressed motif representations fail to capture subtle but meaningful differences
that may be caused by dependencies in neighboring bases. These models assume that
each sequence position independently contributes to the binding energy and ignore
some of the complexity of protein-DNA interaction. Dependencies between nucleotides
at different positions in protein binding sites have been observed (Man and Stormo,
2001; Bulyk et al., 2002; Berger et al., 2006; Maerkl and Quake, 2007). However, it is
accepted that although the position independence assumption does not fit the data
perfectly, it provides a good approximation of protein-DNA interactions (Benos et al.,
2002). Although more complex models have been proposed that take into account the
positional dependencies, they require more data to properly estimate the model’s
parameters and may overfit the data if data are limited (MacIsaac and Fraenkel, 2006;
Zambelli et al., 2012). In vitro protein binding microarray (PBM) results suggest that the
complete binding specificity of a transcription factor can only be represented by a full list
of weighted k-mers (Berger et al., 2006). Currently, there is no generally-accepted
standard for motif representation (Hughes, 2011).
3.1.2 DNA motif discovery methods
Early motifs were identified with laborious low throughput methods such as
footprinting methods (Galas and Schmitz, 1978). Microarrays (Spellman et al., 1998)
provided new data sources to discover motifs from up to a few thousand base pair long
promoter regions selected from dozens to hundreds of co-expressed genes. With the rapid
advancement in ChIP-chip (Ren et al., 2000; Iyer et al., 2001) and ChIP-Seq (Barski et
al., 2007; Johnson et al., 2007; Robertson et al., 2007) technologies, an unprecedented
number of high-quality sequences that are bound by transcription factors in vivo is
available. ChIP-Seq offers higher spatial resolution, higher coverage of the genome and
less noisy binding regions than ChIP-chip for improved motif discovery (Park, 2009;
Zambelli et al., 2012).
As the number of bound sequences has been increasing rapidly, computational
methods are needed to discover the important features in the sequences. Over 200
tools have been developed for the computational identification of DNA motifs (Ladunga,
2010). The algorithmic approaches for motif discovery are linked to the motif models
that they use and can be broadly grouped into two categories: enumerative methods
and alignment-based methods (MacIsaac and Fraenkel, 2006).
Enumerative methods typically exhaustively count words up to some maximum size
in the sequence set, and are thus best suited to consensus sequence motif models
(MacIsaac and Fraenkel, 2006). For example, Weeder implemented a suffix tree to
deterministically search for short words with few mismatches (Pavesi et al., 2001). It
was shown to be the most sensitive and selective tool in a comprehensive
assessment of 14 tools (Tompa et al., 2005). However, it can be slow when searching for
words of 10bp or longer with up to 3 mismatches in ChIP datasets (Zambelli et al., 2012).
Alignment-based methods often construct a probabilistic model of the sequence data
and optimize the parameters of the model to find motifs. Two well-known examples are
MEME (Bailey and Elkan, 1994), which uses the EM algorithm to optimize a mixture
model of binding sites and background sequences, and AlignACE (Hughes et al., 2000),
which uses the Gibbs sampling technique.
A comprehensive assessment of motif discovery programs (Tompa et al., 2005)
showed that their performance is good on yeast data but significantly worse when
applied to more complex sequence data in flies and human; each program typically
covers a small subset of the known binding sites, with relatively little overlap between
the methods. Therefore, the results from multiple motif discovery tools are typically
combined to improve performance (D’haeseleer, 2006a; MacIsaac and Fraenkel, 2006).
Traditional motif discovery methods treat all the bases in all the sequences equally.
However, additional information can be used for more accurate motif discovery. For
example, phylogenetic conservation information (reviewed in MacIsaac and Fraenkel,
2006), or position-specific prior information based on ChIP-chip predicted binding
locations (Qi et al., 2006) or the structural class of the protein (Narlikar et al., 2006b),
have been integrated into motif discovery methods to achieve better performance.
The advent of ChIP-Seq technology enables better in vivo motif discovery,
compared with data obtained from ChIP-chip or promoters of co-expressed genes.
ChIP-Seq provides much higher spatial resolution (typically within 40-50bp) in localizing
the binding events (Wold and Myers, 2008; Wilbanks and Facciotti, 2010), in contrast to
a few hundred base pairs for ChIP-chip or up to a few thousand base pairs of promoter
regions (Zambelli et al., 2012). New computational methods such as GPS (Chapter 2)
further improve the spatial resolution. In addition, ChIP-Seq is not limited by the
coverage of array designs, is more sensitive in discovering both strong and weak binding
events, and produces fewer artifacts (Park, 2009). The higher sensitivity of ChIP-Seq
offers a much higher redundancy of motif instances in the sequences as well as the
opportunity to learn weak binding sites, which have been shown to activate enhancers at
higher TF concentrations (Papatsenko and Levine, 2005) and to contribute to gene
expression patterns in modeling-based approaches (Segal et al., 2008). Also, ChIP-Seq
enrichment has been reported to be an indicator of binding affinity of the TF for the
bound region (Jothi et al., 2008).
How best to integrate all the informative features offered by ChIP-Seq into motif
discovery remains a computational challenge. Traditional motif discovery programs such as
MEME (Bailey and Elkan, 1994) and Weeder (Pavesi et al., 2001) are not suitable for
large numbers of ChIP-Seq bound sequences due to their computational inefficiency
(Jothi et al., 2008; Zambelli et al., 2012). These methods can process only the top
~500-1000 ranking sequences, and thus ignore weak binding sites. Newer methods
have been developed to exploit the improved spatial resolution of ChIP-Seq by using a
positional prior or by narrowing the sequence length. HMS (Hu et al., 2010) and
ChIPMunk (Kulakovskiy et al., 2010) use read coverage profiles as a positional prior for
a greedy search for motifs. POSMO (Ma et al., 2012) uses positional information to
score and rank k-mers and subsequently cluster k-mers into PWMs. DREME (Bailey,
2011) searches for IUPAC words up to 8 base pairs wide in short sequences from
ChIP-Seq binding events in a discriminative setting. However, these methods are not yet
optimized for ChIP-Seq motif discovery.
3.1.3 About this chapter
In this chapter, to fully exploit the advantages of ChIP-Seq data, I present a novel
k-mer set motif (KSM) model and a k-mer motif alignment and clustering (KMAC) method
that learns such motif models. I show that KMAC analysis of a large set of human
ChIP-Seq experiments recovers more known factor motifs than other contemporary methods.
In addition, the KSM model is more accurate than the PWM model when predicting
which new sequences will be bound in vivo.
3.2 K-mer set motif (KSM) model
In this section, I propose a novel k-mer set motif (KSM) model to describe the
binding preference of a protein in the discriminative context of bound and unbound
sequences, and describe a motif scoring method based on evaluating the statistical
significance of candidate KSM instances in DNA sequences.
The choice of the KSM representation is motivated by the limitations of the
positional independence assumption of the PWM model and the large number of training
examples found in ChIP-Seq datasets that enable KSM learning. Similar to the in vitro
protein binding microarray (PBM) derived k-mers (Berger et al., 2006), we hypothesize
that the KSM model derived from ChIP-Seq binding sequences should result in more
accurate bound sequence classification.
3.2.1 The KSM representation
A KSM motif model is represented as a single set of aligned enriched k-mers
(overlapping ungapped words of length k), each k-mer with an offset, a positive count,
and a negative count that are derived from the training sequences. A k-mer and its
reverse complement k-mer are treated as one k-mer. For example, the motif of ES cell
regulator Oct4 contains a set of aligned 8-mers, which are ranked by the statistical
significance of enrichment in the ChIP-Seq bound (positive) sequences relative to a set
of unbound (negative) sequences (Figure 3-1A). The top 8-mers such as ATGCAAAT,
TATGCAAA or TGCAAATG cover the core of the motif PWM (Figure 3-1B) and some
flanking bases, or a variation of the motif core (e.g. ATGCTAAT). As the k-mer rank
decreases, the k-mers contain more flanking bases or mismatches, becoming more and
more divergent from the top ranking k-mers. Each 8-mer is characterized by its offset
relative to the expected binding position in the training sequences, and the number of
positive or negative training sequences that contain the 8-mer (positive or negative hit
count). The “offset” of a k-mer is defined as the offset of the first base of the k-mer
relative to the expected binding position, which is estimated from the motif discovery
process described in section 3.3.2. For the Oct4 example, the expected binding position
is the position with base C.
Figure 3-1 Oct4 KSM and PWM motif representation
A) The k-mer set motif (KSM) representation of the Oct4 motif. The top ranking k-mers
are shown to be aligned with each other, each with a consistent offset relative to the
expected binding position (yellow). “Pos Hit” indicates the number of positive training
sequences containing the k-mer. “Neg Hit” indicates the number of negative training
sequences containing the k-mer. B) The PWM logo representation of the motif. The
relative heights of the bases at each position represent the relative frequencies of the
bases at that position. The total height of a position signifies the information content at
that position.
The overlapping k-mers are included to capture the effect of flanking bases because
the flanking sequences overlapping the motif cores may reflect the interaction with co-factors and contain information about the in vivo binding specificity of the protein of
interest (Maerkl and Quake, 2007; Slattery et al., 2011). The k-mers are aligned and
associated with consistent offsets such that when scanning motif instances on a
sequence using the KSM model, the multiple k-mer matches can be grouped based on
the expected binding positions derived from their matched positions and their offsets.
In summary, a KSM motif representation is the set of all enriched and consistently
aligned overlapping ungapped k-mers. Under this simple k-mer set representation, a
sequence is said to contain an instance of the motif if it contains any of the component
k-mers. The binding position can then be computed using the offset of the matched k-mers.
The length of each k-mer (the parameter k) will influence the accuracy of the motif
model. If k is too small, the k-mer collection may not be rich enough to capture the
binding specificity. If k is too large, the number of observed instances of each k-mer in
the data is too small to generate useful statistics. The k-mer set motif discovery method is
designed to select the optimal k value based on the enrichment of the motif in the in vivo
bound sequences, as described in section 3.3.
3.2.2 Scoring K-mer set motif in a DNA sequence
An important use of a KSM motif model is to scan a query DNA sequence and
assign a score to the sequence to indicate the significance of its potential matches to the
KSM. To assign a score to a DNA sequence, we simultaneously search for the
occurrences of all the k-mers in the KSM using the Aho-Corasick algorithm (Aho and
Corasick, 1975). The Aho-Corasick algorithm is an efficient algorithm to locate all
occurrences of any of a finite number of keywords in a string of text. It constructs a finite
state pattern matching machine from the keywords and then uses the pattern matching
machine to process the text string in a single pass. The complexity of the algorithm is
linear in the length of the searched text plus the number of output matches if the
construction of the pattern matching machine is pre-processed (Aho and Corasick,
1975).
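To make the multi-keyword search concrete, the following is a compact sketch of an Aho-Corasick pattern matching machine. It is a textbook-style illustration written for this section, not the implementation used in this work; the function names and the toy keywords are assumptions.

```python
from collections import deque

def build_automaton(keywords):
    """Build the goto, failure and output tables for a set of keywords."""
    goto = [{}]      # goto[state][char] -> next state (state 0 is the root)
    output = [[]]    # output[state] -> keywords that end at this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:                             # breadth-first over the trie
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure state (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output

def find_all(text, automaton):
    """Return (start, keyword) for every keyword occurrence, scanning text once."""
    goto, fail, output = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits

automaton = build_automaton(["ATGCAAAT", "TGCAAATG", "GCAAA"])
hits = find_all("CCATGCAAATGA", automaton)
```

The automaton is built once per KSM and can then be reused for every query sequence, which is what makes the preprocessing cost negligible when many sequences are scanned.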
The k-mer matches in the query sequences are grouped based on their respective
expected binding locations. The expected binding event location of a k-mer is computed
from the matched position of the k-mer and the KSM offset of the k-mer. We define the
“k-mer group” as the subset of component k-mers in the KSM model that occur in the
query sequence and that are mapped to the same expected binding position on the
sequence. Thus each k-mer group is a KSM motif instance in the query DNA sequence.
We illustrate this with a simple example as follows. We want to search for the motif
represented by a KSM model (Figure 3-2A) in a DNA sequence (Figure 3-2B). Four
component k-mers from the KSM model match the query sequence and can be grouped
into two groups based on the expected binding positions.
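The grouping of k-mer matches by expected binding position can be sketched as follows. The toy KSM, its offsets (taken relative to the base C of the Oct4 core) and the query sequence are hypothetical. For brevity the sketch scans only the forward strand and uses a sliding-window dictionary lookup, which works because all k-mers have the same length; it stands in for the Aho-Corasick search and is not the method's actual code.

```python
from collections import defaultdict

# Hypothetical toy KSM: k-mer -> offset of its first base relative to the
# expected binding position (here, the C in the ATGCAAAT core).
KSM_OFFSETS = {"ATGCAAAT": -3, "TATGCAAA": -4, "TGCAAATG": -2, "ATGCTAAT": -3}

def kmer_groups(seq, offsets):
    """Group the k-mer matches in seq by the binding position each match implies."""
    k = len(next(iter(offsets)))
    groups = defaultdict(list)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in offsets:
            # Expected binding position = matched position minus the k-mer offset.
            binding_pos = start - offsets[kmer]
            groups[binding_pos].append(kmer)
    return dict(groups)

groups = kmer_groups("GGTATGCAAATGCCCCCATGCTAATGG", KSM_OFFSETS)
# Three overlapping k-mers agree on position 6; one k-mer maps to position 20.
```

Each key of the returned dictionary is one candidate motif instance, and its list of k-mers is the corresponding k-mer group.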
To explain how a k-mer group is scored, we first need to introduce the definition of a
motif hit count. We define the “hit count” of a motif as the number of sequences
containing the motif in the training sequence set. This is similar to the ZOOPS (Zero Or
One Per Sequence) mode of MEME (Bailey and Elkan, 1994) in that we count the
number of the sequences, but not the number of motif occurrences. This definition is
useful to avoid overcounting motifs in simple repeats, such as “AAAAAAAAAAAA…”.
Figure 3-2 Search k-mer set motif in a DNA sequence
A) An example of a simple KSM model. B) The component k-mer matches are grouped into two
k-mer groups by the expected binding positions (yellow).
With the above definitions, it is straightforward to define hit count for the PWM motif.
However, the hit count for a k-mer group cannot be obtained by simply summing the hit
count of all the matching component k-mers because the component k-mers are
overlapping and a simple summation will overcount the number of sequences. Thus the
“k-mer group hit count” is defined as the number of all the positive training sequences
that contain at least one of the matched k-mers in the k-mer group. To compute the
k-mer group hit counts without the actual training sequences, we record with each k-mer
the IDs of the positive/negative training sequences that contain the k-mer during the
motif learning phase (Figure 3-2A). The union of all the IDs from all the matching k-mers
can thus identify the training sequences that contain at least one of the component
k-mers of the k-mer group. In the above example (Figure 3-2B), k-mer group 1 matches 8
positive training sequences (ID=1~7,9) and 1 negative training sequence (ID=1). Thus,
it has a positive hit count of 8 and a negative hit count of 1.
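A minimal sketch of the k-mer group hit count follows. The per-k-mer ID sets are hypothetical; they are chosen only so that their union reproduces the counts quoted above (positive IDs 1 to 7 and 9, negative ID 1) and are not copied from Figure 3-2.

```python
def group_hit_counts(group_kmers, pos_ids, neg_ids):
    """Count training sequences that contain at least one k-mer of the group."""
    pos = set().union(*(pos_ids[k] for k in group_kmers))
    neg = set().union(*(neg_ids[k] for k in group_kmers))
    return len(pos), len(neg)

# Hypothetical training-sequence IDs recorded with each k-mer during learning.
pos_ids = {"ATGCAAAT": {1, 2, 3, 4, 5}, "TGCAAATG": {2, 3, 6, 7, 9}}
neg_ids = {"ATGCAAAT": {1}, "TGCAAATG": set()}

group = ["ATGCAAAT", "TGCAAATG"]
pos_hits, neg_hits = group_hit_counts(group, pos_ids, neg_ids)   # (8, 1)

# Summing the per-k-mer hit counts would count sequences 2 and 3 twice.
naive_sum = sum(len(pos_ids[k]) for k in group)                  # 10
```

Taking the union rather than the sum is what keeps overlapping k-mers from inflating the hit count.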
In our discriminative motif discovery setting, given the positive/negative motif hit
count and the total number of positive and negative training sequences, the statistical
significance of a motif instance can be evaluated in terms of its relative enrichment in
positive and negative training sequences by computing a hypergeometric p-value (HGP)
(Barash et al., 2001):
$$
\mathrm{HGP} = \sum_{l=n^{+}}^{\min(N^{+},\, n)} \frac{\binom{N^{+}}{l}\binom{N-N^{+}}{n-l}}{\binom{N}{n}}
$$
where $N$ is the total number of positive and negative training sequences, $N^{+}$ is the total
number of positive training sequences, $n$ is the number of positive and negative training
sequences containing the motif (positive and negative hit count), and $n^{+}$ is the number of
positive training sequences containing the motif (positive hit count). Note that the motif
instance can be a PWM match, consensus sequence match, or a k-mer group.
The “KSM score” of a k-mer group is then defined as the negative logarithm (base
10) of the HGP of the k-mer group. Typical KSM scores range from 0 to a few thousand.
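The HGP and the KSM score can be computed directly from the four counts, as in the sketch below. The function names are illustrative, and the worked example assumes a hypothetical training set of 10 positive and 10 negative sequences together with the hit counts (8 positive, 1 negative) of the k-mer group example in section 3.2.2.

```python
from math import comb, log10

def _tail(N, N_pos, n, n_pos):
    """Integer numerator of the HGP: sum over l = n_pos .. min(N_pos, n)."""
    return sum(comb(N_pos, l) * comb(N - N_pos, n - l)
               for l in range(n_pos, min(N_pos, n) + 1))

def hgp(N, N_pos, n, n_pos):
    """Hypergeometric p-value of seeing at least n_pos positive hits among n hits."""
    return _tail(N, N_pos, n, n_pos) / comb(N, n)

def ksm_score(N, N_pos, n, n_pos):
    """Negative log10 of the HGP, computed on exact integers to avoid underflow."""
    return log10(comb(N, n)) - log10(_tail(N, N_pos, n, n_pos))

# 8 positive and 1 negative hit (n = 9) out of 10 + 10 training sequences (N = 20).
p_value = hgp(20, 10, 9, 8)       # 460 / 167960, about 0.0027
score = ksm_score(20, 10, 9, 8)   # about 2.56
```

Working with the integer numerator and denominator keeps the score finite even when the p-value itself is too small to represent as a floating-point number, which matters for scores in the hundreds or thousands.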
3.3 K-mer motif alignment and clustering (KMAC)
The goal of the K-mer Motif Alignment and Clustering (KMAC) method is to discover
the set of k-mers that are enriched in the DNA sequences bound by the protein of
interest, and to cluster the k-mers into one or more k-mer set motifs that describe distinct
binding preferences. The k-mer sets may correspond to the primary binding motif,
variations of the primary binding motif, or secondary motifs that correspond to co-factor
binding.
The input data are a set of ChIP-Seq bound sequences (positive sequences) and
optionally a set of unbound sequences (negative sequences). The negative sequences
will be randomly generated by using a di-nucleotide shuffling method if they are not
provided. Each positive sequence also carries a weight, which can be the strength of
the corresponding binding event, or the read count of the event. The weights of the
positive sequences are set to 1 if they are not provided. The center base of each
sequence is assumed to be the ChIP-Seq predicted binding site position. Although this
method is designed to analyze ChIP-Seq derived DNA sequences, it can be applied to
other DNA sequences as long as the positional weight and the sequence weight can be
set appropriately (see section 3.3.2).
In this study, values of k from 5 to 13 are used on each dataset, and the final k value
is chosen as the one that gives the most significantly enriched primary PWM as
described below. Note that the width of the PWMs estimated from this method is not
explicitly specified by the value of k. Different k values may converge to the same or
different PWM widths.
3.3.1 Discovery of the set of enriched k-mers
First, a set of enriched k-mers is discovered by comparing k-mer hit counts between
positive sequences and negative control sequences. The numbers of positive and
negative sequences that contain instances of each possible k-mer (hit counts) are
counted, treating each k-mer and its reverse complement as the same sequence. A
hypergeometric p-value (HGP) is computed to evaluate the significance of each k-mer in
terms of its relative enrichment in positive and negative sequences. A k-mer is
considered enriched if its HGP is less than 1e-3 and it has at least 3-fold enrichment in
terms of positive/negative hit count. Combinations of HGP thresholds (1e-2, 1e-3 and
1e-4) and fold values (2 and 3) were tested and the selected thresholds gave the best
results in finding the correct motifs.
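The enriched k-mer selection can be sketched as below. The thresholds follow the text (HGP below 1e-3 and at least 3-fold enrichment); how the fold criterion treats k-mers with no negative hits, and the use of raw hit counts for sets of equal size, are assumptions of this sketch, as are the function names and toy sequences.

```python
from math import comb

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Represent a k-mer and its reverse complement by a single key."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])

def hit_counts(seqs, k):
    """Number of sequences containing each canonical k-mer (each sequence counts once)."""
    counts = {}
    for seq in seqs:
        for kmer in {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}:
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def hgp(N, N_pos, n, n_pos):
    """Hypergeometric p-value of at least n_pos positive hits among n hits."""
    tail = sum(comb(N_pos, l) * comb(N - N_pos, n - l)
               for l in range(n_pos, min(N_pos, n) + 1))
    return tail / comb(N, n)

def enriched_kmers(pos_seqs, neg_seqs, k, max_hgp=1e-3, min_fold=3.0):
    """Return {k-mer: HGP} for k-mers passing both the HGP and the fold threshold."""
    pos, neg = hit_counts(pos_seqs, k), hit_counts(neg_seqs, k)
    N_pos, N = len(pos_seqs), len(pos_seqs) + len(neg_seqs)
    enriched = {}
    for kmer, p in pos.items():
        q = neg.get(kmer, 0)
        p_value = hgp(N, N_pos, p + q, p)
        # Assumption: a k-mer with no negative hits passes the fold test.
        if p_value < max_hgp and p >= min_fold * q:
            enriched[kmer] = p_value
    return enriched

# Toy data: the second positive sequence is the reverse complement of the first.
pos_seqs = ["CCATGCAAATGG"] * 6 + ["CCATTTGCATGG"] * 6
neg_seqs = ["ACACACACACAC"] * 12
result = enriched_kmers(pos_seqs, neg_seqs, k=8)
```

Counting each sequence at most once per k-mer corresponds to the hit count definition in section 3.2.2.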
3.3.2 Clustering the enriched k-mers into k-mer set motifs
KMAC next clusters the enriched k-mers into k-mer set motifs (KSMs) that describe
similar DNA binding preferences. A genomic sequence is said to match a KSM if the
genomic sequence contains any of its component k-mers. KMAC clusters enriched k-mers into KSMs by the following steps (Figure 3-3):
Step 1: A k-mer set is initialized with the most enriched k-mer (seed k-mer) and any
other enriched k-mers that differ by a single base from the most enriched k-mer. Positive
sequences that match the k-mer set are selected and aligned on the k-mer match
positions.
Step 2: Extract k-mer set. Any enriched k-mer that appears in a 2k+1 bp window
around a KSM match is tested for addition to the k-mer set. An enriched k-mer must
have the same alignment offset to the window sequences in at least a fraction f of its
occurrences to be added to the k-mer set, where 0 < f ≤ 1. In this study, f = 0.5 (see below for
the choice of this value).
Step 3: Select and align the positive sequences matching the KSM.
Step 4: Construct a PWM from the aligned sequences.
Step 5: Select positive sequences containing a PWM hit and align the sequences
using the PWM, then continue with step 2.
Thus, the k-mer set is further expanded by iteratively repeating steps 2-5.
We assume that a k-mer in the true motif should align consistently with the other
k-mers in the motif. Thus, when extracting the k-mer set in step 2, each selected k-mer
should have a consistent alignment offset in at least a fraction f of the k-mer occurrences.
The value of f affects the stringency of including a k-mer in the k-mer set, and the
optimal f value may vary between datasets. We tested different f values (f = 0.3, 0.5
and 0.7) when analyzing the large set of ENCODE data, and the motif discovery
performance was similar. Therefore we chose f = 0.5 in this study.
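Two of the decisions in steps 1 and 2 are simple enough to sketch in isolation: collecting the enriched k-mers within one mismatch of the seed, and testing whether a candidate k-mer has a consistent alignment offset. The helper names and the toy inputs are hypothetical, reverse-complement handling is omitted, and the sketch is only one plausible reading of the consistency test, not the KMAC code.

```python
from collections import Counter

def one_mismatch_neighbors(seed, enriched):
    """Step 1: enriched k-mers that differ from the seed at exactly one position."""
    return [kmer for kmer in enriched
            if len(kmer) == len(seed)
            and sum(a != b for a, b in zip(kmer, seed)) == 1]

def consistent_offset(observed_offsets, f=0.5):
    """Step 2: return the dominant alignment offset if it accounts for at least
    a fraction f of the k-mer's occurrences, otherwise None."""
    if not observed_offsets:
        return None
    offset, count = Counter(observed_offsets).most_common(1)[0]
    return offset if count / len(observed_offsets) >= f else None

seed = "ATGCAAAT"
enriched = ["ATGCTAAT", "TATGCAAA", "ATGCAAAG", "ATGCAAAT"]
initial_set = [seed] + one_mismatch_neighbors(seed, enriched)

# Offsets of one candidate k-mer observed in the aligned window sequences.
accepted = consistent_offset([-4, -4, -4, 2, 7])      # 3 of 5 occurrences agree
rejected = consistent_offset([-4, 1, 2, 3, 5, -4])    # only 2 of 6 agree
```

With f = 0.5, a k-mer that lands at scattered positions relative to the alignment is left out of the k-mer set even if it is enriched overall.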
Figure 3-3 Schematic of k-mer set motif finding
Step 1: Initialize k-mer set and aligned matched sequences. Step 2: Extract k-mer set from
aligned sequences. Step 3: Align matched sequences with k-mer set. Step 4: Construct PWM
from aligned sequences. Step 5: Align PWM-matched sequences with the PWM. Steps 2-5 are
repeated until the hypergeometric p-value of the PWM stops improving.
The rationale for alternating between the KSM and PWM motif representations is to combine the
precision of the KSM model and the generalizability of the PWM model. While the k-mer
set motif is more precise because it requires consistent alignment to add a k-mer and
requires exact matching of existing k-mers to find a motif instance, it is not easy to
generalize to new k-mers, especially when initially the component k-mers are limited to 1
mismatch to the seed k-mer. On the other hand, the PWM motif allows generalization to
unseen sites, but it may lead to false positives. Therefore, a combination of the two
may result in a richer and more accurate representation of the motif.
At step 4, the enrichment of the PWM is evaluated by computing a hypergeometric
p-value (HGP) from the PWM hit count in the positive sequences compared to the
negative sequences. The PWM hit threshold is set to be 60% of the maximum PWM
score, which has been shown to be approximately equal to the cutoffs determined by a
cross-validation method (MacIsaac and Fraenkel, 2006). Iteration stops when the HGP of the
PWM stops improving. The PWM and the KSM representations of the motif are then
recorded. We compute the expected binding position in the alignment by averaging over
the relative offsets of predicted binding event positions in all aligned sequences. For
each k-mer in the k-mer set, the offset of the k-mer relative to the expected binding
position is recorded.
When constructing the PWM, we incorporate two sources of information from the
binding events to weight the sequences in order to bias motif discovery towards the
binding event positions, and to bias motif discovery towards patterns that occur in more
confident binding events. The first weighting factor is a positional weight that represents
the probability of having a motif at a certain position given its distance from the binding
event. The distance weighting function we use was fit to characterized ChIP-Seq data,
and is the logistic distribution with mean 0 and variance 13. The second weighting factor
is the binding strength (read count) of the binding event. A PWM is constructed with
weighted positive sequences centered on the k-mer set match and a zero order Markov
model learned from negative sequences. Flanking positions are progressively trimmed
to find the PWM with the most significant HGP.
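The two weighting factors can be illustrated with the sketch below. It assumes that the positional weight is the density of a logistic distribution with mean 0 and variance 13 (so the scale is s = sqrt(3 × 13) / π), and that the two factors are combined by multiplication when accumulating base counts. Both assumptions, the function names, and the toy sites are illustrative; the text does not specify how the factors are combined.

```python
import math

def positional_weight(distance, mean=0.0, variance=13.0):
    """Logistic density at a given distance (bp) from the predicted binding event."""
    s = math.sqrt(3.0 * variance) / math.pi     # variance of a logistic = s^2 * pi^2 / 3
    z = math.exp(-abs(distance - mean) / s)     # density is symmetric; abs avoids overflow
    return z / (s * (1.0 + z) ** 2)

def weighted_counts(sites):
    """Accumulate weighted base counts from (site, distance, strength) tuples.

    Assumption: each site contributes positional_weight(distance) * strength.
    """
    width = len(sites[0][0])
    counts = [{b: 0.0 for b in "ACGT"} for _ in range(width)]
    for site, distance, strength in sites:
        w = positional_weight(distance) * strength
        for i, base in enumerate(site):
            counts[i][base] += w
    return counts

# Two hypothetical sites with equal binding strength: one at the event, one 30 bp away.
counts = weighted_counts([("ATG", 0, 10.0), ("CTG", 30, 10.0)])
```

Under these assumptions a site 30 bp from the predicted event contributes almost nothing to the matrix, which is the intended bias towards the binding event positions.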
After finding the primary KSM, KMAC searches for other KSMs. To accomplish this,
the previous seed k-mer is removed from the enriched k-mer pool and PWM motif
occurrences are masked in the sequences. The process of building new KSMs is
repeated until no more significantly enriched PWMs can be constructed. Rarely, a
secondary motif PWM can become more significantly enriched than the primary motif. If
this happens, the motif finding process is restarted using the seed k-mer of this
secondary motif.
In summary, KMAC selects the enriched k-mers from the positive sequences
compared with the negative sequences and produces a list of both KSMs and PWMs
corresponding to the primary binding motif and secondary binding motifs.
3.4 Results
3.4.1 The PWM model does not capture k-mer differences between CTCF
binding in mouse and human cells
The KSM model consists of a comprehensive list of k-mers with their respective hit
count information. However, this rich and quantitative information about TF binding
specificity is lost after being compressed into the PWM model. For example, a 12-mer
CCAGAAGAGGGC occurs in 3,962 of the 40,000 top ranking CTCF bound sequences
in mouse ES cells (Figure 3-4A), but occurs only in 81~83 of the 40,000 top ranking
CTCF bound sequences in 3 types of human (H1-hES, GM12878 and K562) cells.
Similar cases include CCACAAGAGGGC, CCAGAAGAGGGT and CCAGAAGAGGGG.
In contrast to the dramatic differences in k-mers, the PWM models for these 4 cell types
are highly similar (Figure 3-4B) and cannot explain the differences displayed by the
k-mer results.
Figure 3-4 The PWM model does not capture k-mer differences
A) The hit counts of the top 10 non-overlapping k-mers show dramatic differences in CTCF binding
between mouse ES cells and human H1-hES, GM12878 and K562 cells. The hit counts
are computed from 40,000 61bp sequences from CTCF binding sites in the respective cells. B)
The highly similar PWM motif logos of mouse ES cells, human H1-hES, GM12878 and K562
cells. The motif logos are generated by STAMP (Mahony et al., 2007) from the motifs
discovered by KMAC using the 40,000 top ranking binding sites in the respective datasets.
3.4.2 K-mer set motif model is more predictive for in vivo binding than
PWM model
The classification performances of the k-mer set motif (KSM) model and the PWM
model are further compared in terms of their sensitivity and specificity. KMAC is applied
to sequences derived from a mouse ES cell CTCF ChIP-Seq dataset to learn both KSM
and PWM representations for the mouse CTCF motif. Both representations of the same
mouse motif were then used to predict motif instances in the same set of mouse CTCF
bound sequences (n=39071) and in the human CTCF bound sequences (n=61767), with
corresponding negative sequences generated by dinucleotide shuffling. This is
analogous to testing on the training sample and testing on new testing samples. The
performances are characterized using truncated (false positive rate <= 0.1) receiver
operating characteristic (ROC) curves because in practice only the predictions with such
low false positive rates are meaningful. We also tested 2 alternative scoring methods:
for each sequence, we used only the single k-mer with the most significant p-value
among the matched k-mers (top match k-mer) to score the sequence. The p-value
or the hit count of this single top match k-mer was used as the score for that test
sequence. The results show that KSM with k-mer group score gives the best
performance (Figure 3-5 A and B). The area under curve (AUC) for the truncated ROC
(maximum score is 0.1) is 0.081 for KSM and 0.074 for PWM when predicting the same
mouse training sequences, and 0.058 for KSM and 0.052 for PWM when predicting the
new human testing sequences. It is remarkable that even taking the top match k-mer in
the sequence outperforms the PWM model. In addition, the k-mer group score gives
better performance than scores from single top match k-mer, underscoring the value of
including the flanking sequences into the motif model. A similar analysis with c-Myc also
showed that the KSM model is more predictive than the PWM model (Figure 3-5 C and
D). The area under curve (AUC) for the truncated ROC is 0.035 for KSM and 0.031 for
PWM when predicting the same mouse training sequences, and 0.023 for KSM and
0.021 for PWM when predicting the new human testing sequences. We also tested the
PWM motif from a public database (JASPAR MA0147.1) derived from the same mouse
c-Myc ChIP-Seq data; the results were similar (data not shown).
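The truncated ROC measure used above can be sketched as follows. The sketch integrates the ROC curve with the trapezoidal rule up to a false positive rate of 0.1 and interpolates linearly at the cutoff; these numerical choices and the function name are assumptions, since the text does not state how the area was computed.

```python
def truncated_auc(pos_scores, neg_scores, max_fpr=0.1):
    """Area under the ROC curve restricted to FPR <= max_fpr (maximum value max_fpr)."""
    labeled = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
                     key=lambda t: -t[0])
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    area = 0.0
    i = 0
    while i < len(labeled):
        # Lower the score threshold one distinct value at a time (ties move together).
        j = i
        while j < len(labeled) and labeled[j][0] == labeled[i][0]:
            tp += labeled[j][1]
            fp += 1 - labeled[j][1]
            j += 1
        fpr, tpr = fp / n_neg, tp / n_pos
        if fpr > max_fpr:
            # Interpolate the curve at the FPR cutoff and stop.
            frac = (max_fpr - prev_fpr) / (fpr - prev_fpr)
            tpr_cut = prev_tpr + frac * (tpr - prev_tpr)
            return area + (max_fpr - prev_fpr) * (prev_tpr + tpr_cut) / 2.0
        area += (fpr - prev_fpr) * (prev_tpr + tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return area

# Perfectly separated toy scores reach the maximum truncated AUC of 0.1.
perfect = truncated_auc([0.9, 0.8, 0.7],
                        [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03, 0.02])
# Uninformative scores (all tied) give the area under the diagonal, 0.005.
uninformative = truncated_auc([1.0] * 4, [1.0] * 4)
```

On this scale, the reported values of 0.081 for KSM and 0.074 for PWM on the mouse CTCF sequences can be read against a ceiling of 0.1 and a chance level of 0.005.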
Figure 3-5 The KSM model is more predictive than the PWM model
A) Truncated ROC of 4 different representations/scoring methods for the mouse CTCF motif to
predict binding in mouse CTCF ChIP-Seq bound sequences. B) Truncated ROC of 4 different
representations/scoring methods for the mouse CTCF motif to predict binding in human CTCF
ChIP-Seq bound sequences. C) Truncated ROC of 4 different representations/scoring methods
for the mouse c-Myc motif to predict binding in mouse c-Myc ChIP-Seq bound sequences. D)
Truncated ROC of 4 different representations/scoring methods for the mouse c-Myc motif to
predict binding in human c-Myc ChIP-Seq bound sequences. For all cases, the negative sets of
sequences are generated by dinucleotide shuffling using the positive sequences.
3.4.3 KMAC outperforms other motif discovery methods in discovering
known DNA-binding motifs
We tested KMAC’s ability to discover biologically relevant DNA-binding motifs in
data from the ENCODE project (Birney et al., 2007). We chose this large collection of
experiments because we expected they would be representative of the typical range of
ChIP-Seq data noise and sequencing depth. Noise can be caused by low antibody
affinity and deviations from ideal experimental procedure. We used a set of 214
ChIP-Seq experiments and associated controls comprising 63 distinct transcription factors that
were profiled in one or more cell lines by the ENCODE project and for which validated
DNA-binding motifs exist in public databases.
We applied KMAC to 61bp sequences extracted from the 5000 most highly
ChIP-enriched GPS peak calls from these 214 ChIP-Seq datasets, and the most significant
KMAC-discovered motifs from each analysis were compared to corresponding known
binding preferences of the same TFs using STAMP (Mahony et al., 2007). A motif
alignment with E-value less than 1e-5 was considered a match. For comparison, we
also used four popular traditional motif discovery tools covering a range of computational
techniques, including MEME (Bailey and Elkan, 1994), Weeder (Pavesi et al., 2001),
MDScan (Liu et al., 2002), and AlignACE (Hughes et al., 2000), and four ChIP-Seq
oriented tools, DREME (Bailey, 2011), POSMO (Ma et al., 2012), HMS (Hu et al., 2010)
and ChIPMunk (Kulakovskiy et al., 2010) on the same data. A set of 100bp sequences
extracted from the 500 most highly ChIP-enriched GPS peak calls was examined by the
motif-finders MEME, Weeder, MDScan, AlignACE, DREME, and POSMO. For HMS and
ChIPMunk, a set of 100bp sequences and the corresponding read coverage profiles were
extracted from the 500 most highly ChIP-enriched GPS peak calls.
Figure 3-6 KMAC motif discovery outperforms other methods when detecting
motifs in ChIP-Seq data.
The motif detection performance of KMAC is compared to the motif detection performance of
various motif-finders on 214 ENCODE ChIP-Seq experiments.
We found that KMAC outperforms all of the compared motif discovery approaches, even
when each method is allowed to make multiple motif predictions (Figure 3-6). We note
that in 8 of the 214 datasets KMAC failed to find the known motif where one
of the other algorithms succeeded.
3.4.4 KMAC outperforms other ChIP-Seq oriented motif discovery methods
For the ChIP-Seq oriented tools, KMAC, DREME, POSMO, HMS and ChIPMunk,
we also tested the methods under two sets of conditions: 1) sequence sets with the length
and number recommended by the authors of these methods, and 2) the same set of 500x100bp
sequences. We found that for the other 4 methods, the “500x100bp” condition gave
better results than using more sequences as designed for ChIP-Seq data (Figure 3-7).
This suggests that the additional sequences from the relatively weak binding events
degrade the motif discovery performance of these methods, which defeats the purpose
of capturing weak binding sites available from ChIP-Seq data. In contrast, KMAC with
5000 sequences performs better than the “500x100bp” condition, and all the other
methods in both conditions (Figure 3-7). The comparison of ChIP-Seq oriented tools
suggested that KMAC is more suitable for learning the whole spectrum of binding
preferences from a large number of both strong and weak binding sites.
3.5 Discussion
In this study, we presented a novel k-mer set motif (KSM) representation to describe
the binding sequence preferences of TFs and a motif discovery method to learn the KSM
model from ChIP-Seq defined sequences. We have shown that the KSM model with the
motif group scoring method is able to predict in vivo binding more accurately than the
widely used PWM model, even when predicting binding across species. In addition, our
KMAC motif discovery method outperforms several widely used programs, including
several new ChIP-Seq oriented methods, in discovering known motifs from a large set of
214 ENCODE ChIP-Seq datasets.
The comprehensive list of k-mers in the KSM model gives a more complete and
quantitative description of in vivo binding preferences. We have shown that the KSM
model is more predictive than the PWM model. The performance gain is likely due to
the more comprehensive representation of the motif and the extra information provided
by the flanking k-mers learned from a large set of sequence data. In addition, the
advantageous performance of the KSM model suggests that the stringent exact k-mer
match criterion does not prevent our model from generalizing to unseen sequences, while ruling
out unrealistic binding sequences that may pass the PWM threshold. In this work, the k-mers are limited to ungapped k-mers for computational simplicity and
efficiency. A more flexible model that incorporates gapped k-mers will represent the
binding specificity more accurately, but will need more sophisticated learning methods
and motif scanning methods.
Figure 3-7 KMAC outperforms other ChIP-Seq oriented motif discovery methods
The motif detection performances of various motif-finder/condition combinations on 214
ENCODE ChIP-Seq experiments are compared. KMAC, POSMO, HMS, ChIPMunk and
DREME motifs are learned from the number of sequences recommended for each method
(DREME: all sequences, other methods: top 5000 sequences). KMAC500, DREME500,
POSMO500, HMS500 and ChIPMunk500 motifs are learned under the same condition (top 500
100bp sequences).
The KMAC motif discovery method exploits the advantages of ChIP-Seq when
compared to ChIP-chip, such as better spatial resolution, more complete coverage,
higher sensitivity and specificity and better quantification of binding events. It learns
motifs enriched in bound sequences as compared to unbound sequences, utilizes both
strong and weak binding events and biases motif discovery towards the binding event
positions and more confident events. The value of integrating these factors has been
demonstrated by the improved performance of KMAC for discovering biologically
relevant DNA-binding motifs in the large set of ENCODE ChIP-Seq data. Motif
discovery methods typically compare their performance with existing methods by
analyzing synthetic sequences or sequences bound by a small set of TFs (Zambelli et
al., 2012). The ENCODE data we compiled consists of 214 ChIP-Seq experiments and
63 TFs spanning different structural classes. These ChIP-Seq data cover a variety of
cell types, antibodies and lab protocols, and provide a diverse set of ChIP-Seq binding
sequences for a comprehensive assessment of motif discovery performance. Traditional
motif discovery methods such as MEME and Weeder are found to be not suitable for
analyzing the whole sequence set due to computational inefficiencies (Jothi et al., 2008;
Zambelli et al., 2012), and are usually applied to only a small subset of top ranking
binding sequences. Several ChIP-Seq oriented methods such as DREME, POSMO,
HMS and ChIPMunk are designed to exploit the high spatial resolution gained from
ChIP-Seq data. However, their performance with the designed ChIP-Seq compatible
condition (i.e. top 5000 or all sequences) was surprisingly worse than using the top 500
sequences. In contrast, KMAC motif discovery on the ENCODE data with 5000 top
sequences outperforms these methods in both tested conditions, and also outperforms
traditional methods such as MEME, Weeder, AlignACE and MDScan. Thus, the
combination of the KSM model and the KMAC motif discovery method offers promise in
taking advantage of what ChIP-Seq technology has to offer for a better understanding of
in vivo sequence specificity of protein-DNA interactions.
The KSM model and the KMAC motif discovery method are designed to leverage
the large amount of high spatial resolution binding sites produced by ChIP-Seq
technology. As better sequencing technology and more ChIP-Seq datasets become
available, the advantage of using a more comprehensive k-mer based motif
representation such as the KSM model will be of greater significance. The integration of
our in vivo binding model with an in vitro model learned from technologies such as
protein binding microarray (PBM) (Berger et al., 2006), HT-SELEX (Zhao et al., 2009),
and Bacterial one-hybrid (B1H) (Meng et al., 2005; Meng and Wolfe, 2006) will generate
a more complete picture of binding preferences for further understanding of mechanisms
of protein-DNA interactions. Such quantitative characterization of transcription factor
binding will allow more systems biology approaches to modeling and predicting the
behavior of the complex regulatory network in living cells.
3.6 Methods
3.6.1 Datasets
214 ENCODE ChIP-Seq datasets that have an embargo date before Oct 28, 2011
and have known motifs in public databases were downloaded from the ENCODE project
website (Birney et al., 2007). Mouse ES cell factor ChIP-Seq datasets (Chen et al.,
2008) were downloaded from GEO. FastQ files of the ChIP-Seq data were then aligned
to the genome (human hg19 or mouse mm9) using Bowtie (Langmead et al., 2009)
version 0.12.7 with options “-q --best --strata -m 1 -p 4 --chunkmbs 1024”.
3.6.2 Motif-finding performance comparison
For the 214 ENCODE ChIP-Seq datasets, the GPS event-finder (Guo et al., 2010) was
applied to call binding events. KMAC and 8 other motif finding methods, AlignACE v4.0
(Hughes et al., 2000), MDscan v2004 (Liu et al., 2002), MEME v4.7.0 (Bailey and Elkan,
1994), Weeder v1.4.2 (Pavesi et al., 2001), DREME v4.7.0 (Bailey, 2011), POSMO v1
(Ma et al., 2012), HMS v0.1 (Hu et al., 2010) and ChIPMunk v3 (Kulakovskiy et al.,
2010), were applied to discover motifs independently. For KMAC, the positive set
consists of 61bp sequences centered on the GPS predicted binding locations, and a
negative set consists of 61bp sequences that are 300bp away in the reference genome
from binding locations and that don’t overlap positive sequences. For AlignACE,
MDscan, MEME and Weeder, 100bp sequences centered on the top 500 peaks were
extracted from each dataset, as suggested by the MEME Suite’s documentation based
on the typical resolution of ChIP-Seq peaks. For ChIP-Seq oriented methods, DREME,
POSMO, HMS and ChIPMunk, two sets of sequences were tested: 1) a set of 100bp
sequences centered on the top 500 GPS peaks; 2) a set of sequences with number and
length recommended by the authors of these methods (DREME: all binding calls, 100bp;
POSMO: top 5000 1000bp; HMS and ChIPMunk: top 5000 200bp sequences). For
HMS and ChIPMunk, a set of read coverage profiles matching the sequences were also
extracted.
MEME was run with “-nmotifs 6” and Weeder was run with option “large”. POSMO
was run with options “5000 11111111 sequence_file 1.6 2.5 20 200”. ChIPMunk was
run with options “6 15 yes 1.0 p:read_coverage_profile 100 10 1 4 random 0.41”. HMS
was run with options “-w motif_width -dna 4 -iteration 100 -chain 50 -seqprop 0.1 -strand
2 -base read_coverage_profile -dep 2”; motif_width was determined by width of motif
discovered by MEME for the same data. All other parameters were the defaults
specified by the authors.
We collected known binding preference motifs (PWMs) from the TRANSFAC (Matys
et al., 2003), JASPAR (Sandelin et al., 2004), and Uniprobe (Berger et al., 2006)
databases. We included only motifs of the factor of interest or motifs for its TF family
as a whole, but not motifs of other individual factors in the same family, because factors in the same family may have
very different binding motifs. Discovered motifs (PWMs) were compared to known
motifs using STAMP (Mahony et al., 2007). A motif with E-value less than 1e-5 was
considered a match. For each program, we counted the number of datasets that had a
motif matching at least one known motif of that transcription factor. In some cases, the
correct motifs were not matched by the first motif that a method outputs, but by the
second or later motifs. Therefore we compared the motif-finding performance using the
top 1, top 2… or top 6 motifs. Little improvement was observed after the 6th motif.
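The top-k counting described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in this study: the function name `datasets_matched_at_k`, the input layout, and the example E-values are hypothetical, and the STAMP comparison that would produce the E-values is assumed to have been run separately.

```python
from typing import Dict, List

E_VALUE_CUTOFF = 1e-5  # match threshold stated in the text


def datasets_matched_at_k(best_evalues: Dict[str, List[float]], k: int) -> int:
    """Count datasets whose top-k discovered motifs include a match.

    best_evalues maps a dataset name to a ranked list; entry i holds the
    smallest E-value between the i-th discovered motif and any known motif
    of the profiled factor (assumed to come from a separate STAMP run).
    """
    return sum(
        1
        for evalues in best_evalues.values()
        if any(e < E_VALUE_CUTOFF for e in evalues[:k])
    )


# Hypothetical E-values for three datasets (illustrative, not real results).
example = {
    "dataset_A": [1e-9, 0.2],       # matched by the first motif
    "dataset_B": [0.3, 1e-7, 0.5],  # matched only by the second motif
    "dataset_C": [0.4, 0.1],        # never matched
}
counts = [datasets_matched_at_k(example, k) for k in (1, 2, 3)]
```

Evaluating the count for k = 1 to 6 gives the kind of cumulative curve compared across programs.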
3.6.3 ROC comparison of motif representation performance in predicting in
vivo binding
The mouse motifs (in both KSM and PWM representations) were learned by
applying KMAC to the top 5000 GPS (Guo et al., 2010) binding event calls in CTCF and c-Myc ChIP-Seq of mouse ES cells (Chen et al., 2008). The test mouse CTCF (n=39071)
and c-Myc (n=7085) bound sequences are obtained by extracting 100 bp sequences
centered on all the GPS binding event calls in the respective ChIP-Seq data (Chen et al.,
2008). The test human CTCF and c-Myc bound sequences are obtained by extracting
100 bp sequences centered on all the GPS binding event calls in CTCF (n=61767) and
c-Myc (n=18537) ChIP-Seq data by Crawford Lab in the ENCODE project, respectively
(Birney et al., 2007). A matched set of negative test sequences for each test sequence
set is generated by dinucleotide shuffling (Jiang et al., 2008) of the ChIP-Seq bound
sequences. The KSM model (k-mer group score, scores using the p-value or the hit
count of the single top match k-mer) and the PWM model are used to scan the testing
sequences. A list of scores on positive and negative sequences for each testing case is
processed using Matlab software (The MathWorks, Inc.) to compute the truncated (false
positive rate <= 0.1) ROC curves and the AUC values.
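The truncated ROC computation can be sketched as below. The curves in this study were computed in Matlab; this plain-Python version is only an illustration of the same idea, and the function name `truncated_roc` and the linear interpolation at the cutoff are assumptions. With the false positive rate capped at 0.1, the largest attainable area is 0.1.

```python
from typing import List, Sequence, Tuple


def truncated_roc(pos: Sequence[float], neg: Sequence[float],
                  max_fpr: float = 0.1) -> Tuple[List[Tuple[float, float]], float]:
    """Return ROC points with FPR <= max_fpr and the area under that segment."""
    thresholds = sorted(set(pos) | set(neg), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(s >= t for s in neg) / len(neg)
        tpr = sum(s >= t for s in pos) / len(pos)
        points.append((fpr, tpr))

    # Clip the curve at max_fpr, interpolating linearly on the crossing segment.
    clipped = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 <= max_fpr:
            clipped.append((x1, y1))
            continue
        if x0 < max_fpr:
            frac = (max_fpr - x0) / (x1 - x0)
            clipped.append((max_fpr, y0 + frac * (y1 - y0)))
        break

    # Trapezoidal area under the clipped curve.
    auc = sum((x1 - x0) * (y0 + y1) / 2.0
              for (x0, y0), (x1, y1) in zip(clipped, clipped[1:]))
    return clipped, auc


# Toy scores: perfectly separated positives and negatives.
_, auc_perfect = truncated_roc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])
# Toy scores: identical score distributions (chance-level ranking).
_, auc_chance = truncated_roc([1, 2, 3, 4], [1, 2, 3, 4])
```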
Chapter 4: Genome-wide event finding and motif discovery (GEM)
Chapter 4
Genome-wide event finding and motif
discovery (GEM)
Part of the material presented in this chapter was adapted from the following publication:
Yuchun Guo, Shaun Mahony, David K Gifford (2012). High resolution genome wide
binding event finding and motif discovery reveals transcription factor spatial binding
constraints, PLOS Computational Biology, 8(8): e1002638.
Collaborations:
Y.G., S.M. and D.K.G. conceived the project. Y.G. designed the computational model,
implemented the algorithm and analyzed the data. Y.G., S.M. and D.K.G. wrote the
manuscript.
Chapter 4: Genome-wide event finding and motif
discovery (GEM)
4.1 Introduction
As the method of choice for genome-wide profiling of in vivo protein-DNA
interactions, ChIP-Seq offers much improved spatial resolution (also called spatial
accuracy, positional resolution), which is considered perhaps the greatest improvement
over ChIP-chip (Park, 2009). High spatial resolution in binding event calls can greatly
facilitate computational motif discovery from the bound regions by narrowing the search
window (Bailey, 2011) or by providing positional prior information for computational
models (Narlikar et al., 2006a; Qi et al., 2006), thus increasing the motif signal-to-noise
ratio. In addition, the spatial accuracy in locating the binding events influences the
quality of regulatory site annotation relative to binding sites of other transcription factors
(TFs), transcription start sites, exon/intron boundaries, the 3’ end of genes and other
conserved noncoding features.
However, although the reads are mapped to the genome at base pair resolution,
random variation in the ChIP DNA fragmentation process obscures the actual location of
interaction events (see discussion in Section 2.1). The majority of TFs are known to
interact with DNA in a highly sequence-specific manner, with much higher affinity to the
motif sites than neighboring sequences (Vaquerizas et al., 2009; Stormo and Zhao,
2010). DNA binding motifs have been used in computational identification of TF binding
sites, but the prediction of in vivo binding from sequence may result in false positives
and has been shown to be empirically unreliable (Wasserman and Sandelin, 2004;
Farnham, 2009). Thus the ChIP-Seq read coverage and DNA motif information provide
complementary perspectives to enhance the accuracy for predicting sequence-specific
protein-DNA interactions. By using genome sequence data, it should be possible to
predict the TF binding locations at a much higher spatial resolution.
Contemporary methods to resolve binding events in ChIP-Seq data identify
statistically enriched regions of ChIP-Seq read density and the peak points of
enrichment within those regions (Ji et al., 2008; Jothi et al., 2008; Valouev et al., 2008;
Zhang et al., 2008; Guo et al., 2010; Feng et al., 2011). The resulting binding calls can
be offset from the bound site by dozens of bases (Park, 2009). Recent studies have
integrated peak detection and motif discovery by including motif occurrences to score
the significance of predicted binding events (Boeva et al., 2010; Wu et al., 2010), or by
using ChIP-Seq read coverage as a positional prior to improve motif discovery (Hu et al.,
2010; Kulakovskiy et al., 2010). However, no study has yet used the motif position
information to reciprocally improve the spatial accuracy of binding event prediction.
To improve the spatial resolution of ChIP-Seq binding prediction, we have
developed a new method called GEM, which simultaneously resolves the location of
protein-DNA interactions and discovers explanatory DNA sequence motifs with an
integrated model of ChIP-Seq reads and proximal DNA sequences. GEM reciprocally
improves motif detection using binding event locations, and binding event predictions
using discovered motifs.
In this chapter, I describe the GEM algorithm and review the GEM derived results.
GEM significantly improves upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and improves the discovery of closely
spaced binding events for the same factor. Additional results of using GEM to discover
TF spatial binding constraints are presented in Chapter 5.
4.2 GEM algorithm
The GEM algorithm consists of five phases:
1. Predict protein-DNA binding event locations with a sparse prior
2. Discover the k-mer set motifs at binding event locations
3. Generate a positional prior for event discovery with the most enriched k-mer set
motif
4. Predict improved protein-DNA binding event locations with a k-mer based
positional prior
5. Repeat motif discovery (Phase 2) from the Phase 4 improved event locations.
4.2.1 Predicting protein DNA-binding events with a sparse prior
Initial protein-DNA binding event locations are predicted by GPS (Guo et al., 2010),
which employs a negative Dirichlet sparse prior.
4.2.2 Discover the k-mer set motifs at binding event locations
Next we apply KMAC to discover the k-mer set motifs (KSMs) that are enriched in
the bound sequences as compared to the unbound sequences. The positive set
consists of 61bp sequences centered on the predicted binding locations from Phase 1,
and a negative set consists of 61bp sequences that are 300bp away from binding
locations and that don’t overlap positive sequences.
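The construction of the positive and negative sequence sets can be sketched as follows. This is an illustrative sketch only: it works on coordinates of a single chromosome, and placing each negative window 300bp downstream of its event is an assumption, since the text states only that negatives lie 300bp away. The names `centered_window` and `build_sets` are hypothetical.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) on one chromosome


def centered_window(center: int, width: int = 61) -> Window:
    """Window of the given odd width centered on a coordinate."""
    half = width // 2
    return (center - half, center + half + 1)


def overlaps(a: Window, b: Window) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def build_sets(events: List[int], width: int = 61,
               offset: int = 300) -> Tuple[List[Window], List[Window]]:
    """Positive windows at events; negative windows offset bp away (assumed downstream)."""
    positives = [centered_window(e, width) for e in events]
    negatives = []
    for e in events:
        candidate = centered_window(e + offset, width)
        # Discard candidates that overlap any positive window.
        if not any(overlaps(candidate, p) for p in positives):
            negatives.append(candidate)
    return positives, negatives


# Toy event coordinates: the negative for the event at 1000 collides with
# the positive window of the event at 1300 and is dropped.
positives, negatives = build_sets([1000, 1300, 5000])
```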
4.2.3 Positional prior generation
Phase 3 of GEM uses the primary KSM to compute a positional prior that will be
used for binding event discovery in Phase 4. As in GPS, the genome is segmented into
independent separable regions (typically a few kb long) by dividing at read gaps that are
larger than 500bp and further excluding regions that contain fewer than 6 reads. At each
evaluated genome region, we search for the k-mer group instances of the primary k-mer set
motif, as described in Subsection 3.2.2. The position-specific prior for a sequence base
is defined as the hit count of the k-mer group whose binding offsets match that base (i.e.
the number of positive set sequences that contain one or more of the matched k-mers
whose binding offsets match that base). Alternatively, the PWM of the TF of interest is
used to scan the sequence and the PWM scores that pass a certain threshold are used to
specify the position-specific prior information. The concept of using informative
positional priors for motif discovery has been explored previously (Narlikar et al., 2006a;
Qi et al., 2006).
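The region segmentation step described above can be sketched as below. This is a minimal illustration that assumes read positions are given as integer coordinates on one chromosome; the function name `segment_regions` and the returned tuple layout are hypothetical, and only the 500bp gap and 6-read thresholds come from the text.

```python
from typing import List, Tuple


def segment_regions(read_positions: List[int], max_gap: int = 500,
                    min_reads: int = 6) -> List[Tuple[int, int, int]]:
    """Split reads into regions at gaps larger than max_gap.

    Returns (start, end, read_count) for each region that contains at
    least min_reads reads; sparser regions are excluded.
    """
    regions = []
    current: List[int] = []
    for pos in sorted(read_positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_reads:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_reads:
        regions.append((current[0], current[-1], len(current)))
    return regions


# Toy read positions: two dense clusters and one sparse pair of reads.
toy_reads = [100, 150, 200, 250, 300, 350,
             1000, 1100,
             5000, 5010, 5020, 5030, 5040, 5050, 5060]
regions = segment_regions(toy_reads)
```

Each retained region can then be analyzed independently, which keeps the number of candidate event positions per mixture model small.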
4.2.4 Binding event prediction with a positional prior
GEM employs a generative probabilistic model that describes the likelihood of a set
of ChIP-Seq reads being generated from a set of protein-DNA interaction events
originating at specific DNA sequences. The model generates protein-DNA interaction
events that are biased to occur at explanatory DNA sequences by a motif based
positional prior. Each event then independently generates reads following an empirical
read spatial distribution that describes the probability of reads given the distance from
the event (see Figure 2-2 for an example).
Formally, in an evaluated region of length M, we consider N ChIP-Seq reads that
have been mapped to genome locations $R = \{r_1, \ldots, r_N\}$ and all M possible protein-DNA
interaction events at single-base locations $B = \{b_1, \ldots, b_M\}$. We represent the latent
assignments of reads to the events that caused them as $Z = \{z_1, \ldots, z_N\}$, where the indicator
function $1(z_n = m) = 1$ when read n is caused by event m.
The probability of a read n is based on a mixture of possible binding events:
$$p(r_n \mid \pi) = \sum_{m=1}^{M} \pi_m \, p(r_n \mid m), \qquad \sum_{m=1}^{M} \pi_m = 1$$
where M is the number of possible events; $\pi$ denotes the parameter vector of
mixing probabilities, and $\pi_m$ is the probability of event m; $p(r_n \mid m)$ is the probability of
read n being generated from event m and can be determined from the empirical spatial
distribution of reads given the event.
The overall likelihood of the observed set of reads is:
$$p(R \mid \pi) = \prod_{n=1}^{N} \sum_{m=1}^{M} \pi_m \, p(r_n \mid m)$$
We make two prior assumptions about the binding events: 1) binding events prefer
to occur at the sequence specific DNA motif positions; 2) binding events are relatively
sparse throughout the genome. To incorporate these assumptions, we place a negative
Dirichlet prior (Figueiredo and Jain, 2002; Guo et al., 2010) p(π) on binding event
probabilities π:
$$p(\pi) \propto \prod_{m=1}^{M} (\pi_m)^{-\alpha_s + \alpha_m}$$
where $\alpha_s$ is the uniform sparse prior parameter governing the degree of sparseness, $\alpha_s > 0$;
$\alpha_m$ denotes the binding event specific prior parameter and its value is proportional to
$C_m$, the positional prior count underlying event m (as defined in Phase 3):
$$\alpha_m = \alpha_s \, \mu \, \frac{C_m}{\max_{m'} C_{m'}}$$
where µ is a parameter to tune the effect of the motif based prior, 0≤µ<1.
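As a small worked illustration of this scaling, the sketch below computes the event-specific prior parameters from a list of positional prior counts. The function name `motif_prior`, the example counts, and the value chosen for the sparse prior parameter are hypothetical; only the formula itself comes from the text.

```python
from typing import List


def motif_prior(alpha_s: float, mu: float, counts: List[float]) -> List[float]:
    """alpha_m = alpha_s * mu * C_m / max(C) for each candidate position."""
    max_count = max(counts)
    if max_count == 0:
        # No motif support anywhere in the region: no positional bias.
        return [0.0] * len(counts)
    return [alpha_s * mu * c / max_count for c in counts]


# Hypothetical values: alpha_s = 3.0, mu = 0.8, and three candidate positions
# whose k-mer groups have hit counts 0, 50 and 200.
alpha_m = motif_prior(3.0, 0.8, [0, 50, 200])
```

Because µ is below 1, the largest value of the event-specific parameter in this example (2.4) stays below the sparse prior parameter (3.0), so the combined exponent remains negative at every position.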
The rationale for scaling the motif-based positional prior by the genome-wide
occurrences of the motif is that if the motif matched at position m has more occurrences at
binding events genome-wide, it is more likely to cause a binding event at that genome
position. The parameter $\alpha_s$ is set based on the read coverage in the whole dataset and
in the local genomic region (see more details in Subsection 2.2.5). The parameter $\alpha_m$ is
scaled such that the values of all possible $\alpha_m$ will be less than $\alpha_s$. Therefore the k-mer
based prior will not force the model to predict a binding event at a motif position when
the observed reads do not provide sufficient evidence of a protein-DNA interaction
event. We tested different settings for µ (µ = 0.5, 0.8 or 0.95) in analyzing the
GABP data. Results show that GEM with µ = 0.8 or 0.95 gives similar results, and both settings produce
much better spatial resolution (~1bp) and call more joint events (~5%) than GEM with µ
= 0.5. Thus we chose µ = 0.8.
Since the k-mers underlying the possible binding event positions and their counts
are known, the value of the term $-\alpha_s + \alpha_m$ remains constant when we estimate the parameters
in the mixture model. Therefore, we can solve the mixture model using the Expectation-Maximization (EM) algorithm (Dempster et al., 1977).
The complete-data log penalized likelihood is:
$$\ln p(R, Z, \pi) = \sum_{n=1}^{N} \sum_{m=1}^{M} 1(z_n = m) \left( \ln \pi_m + \ln p(r_n \mid m) \right) + \sum_{m=1}^{M} (-\alpha_s + \alpha_m) \ln \pi_m$$
where $1(z_n = m)$ is the indicator function.
In the E Step we have:
$$\gamma(z_n = m) = \frac{\pi_m \, p(r_n \mid m)}{\sum_{m'=1}^{M} \pi_{m'} \, p(r_n \mid m')}$$
where $\gamma(z_n = m)$ can be interpreted as the fraction of read n that is assigned to event m.
In the M step, on iteration i we find the parameter $\hat{\pi}^{(i)}$ that maximizes the expected
complete-data log penalized likelihood:
$$\hat{\pi}^{(i)} = \arg\max_{\pi} \left[ \sum_{n=1}^{N} \sum_{m=1}^{M} \gamma(z_n = m) \left( \ln \pi_m + \ln p(r_n \mid m) \right) + \sum_{m=1}^{M} (-\alpha_s + \alpha_m) \ln \pi_m \right]$$
under the constraint $\sum_{m=1}^{M} \pi_m = 1$. By simplifying, we find the closed-form solution of the
maximization as:
$$\hat{\pi}_m^{(i)} = \frac{\max(0,\, N_m - \alpha_s + \alpha_m)}{\sum_{m'=1}^{M} \max(0,\, N_{m'} - \alpha_s + \alpha_{m'})}, \qquad N_m = \sum_{n=1}^{N} \gamma(z_n = m)$$
where $N_m$ is the effective number of reads assigned to event m, or the binding strength
of event m. Intuitively, the effective read count of an event is decreased by a pseudo-count $\alpha_s$ for the sparseness penalty, and is increased by a pseudo-count $\alpha_m$ for the motif
match at position m. If for event m, the value of $\pi_m$ becomes zero, the model is
restructured to eliminate it (Figueiredo and Jain, 2002).
The EM algorithm is deemed to have converged when the change in likelihood falls
below a small value, for example 1e-5.
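The E and M steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the GEM implementation (which is written in Java): the function name `gem_em`, the triangular read distribution, the toy reads, and the use of the data log-likelihood as the convergence criterion are all illustrative choices.

```python
import math
from typing import Callable, Dict, List


def gem_em(reads: List[int], positions: List[int],
           read_prob: Callable[[int, int], float],
           alpha_s: float, alpha_m: List[float],
           tol: float = 1e-5, max_iter: int = 1000) -> Dict[int, float]:
    """Return {event index: pi_m} for the events that survive the sparse prior."""
    pi = {m: 1.0 / len(positions) for m in range(len(positions))}
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E step: fractional read assignments, summed into effective counts N_m.
        n_eff = {m: 0.0 for m in pi}
        ll = 0.0
        for r in reads:
            w = {m: p * read_prob(r, positions[m]) for m, p in pi.items()}
            total = sum(w.values())
            if total == 0.0:
                continue  # read not explained by any remaining event
            ll += math.log(total)
            for m, v in w.items():
                n_eff[m] += v / total
        # M step: closed-form update; events whose value hits zero are eliminated.
        numer = {m: max(0.0, n_eff[m] - alpha_s + alpha_m[m]) for m in pi}
        denom = sum(numer.values())
        if denom == 0.0:
            return {}
        pi = {m: v / denom for m, v in numer.items() if v > 0.0}
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi


def triangular(read: int, event: int, half_width: int = 100) -> float:
    """Hypothetical stand-in for the empirical read distribution p(r | m)."""
    return max(0.0, 1.0 - abs(read - event) / half_width) / half_width


# Toy region: five candidate positions, reads clustered near position 100,
# and a motif-based prior (alpha_m > 0) only at position 100.
positions = [0, 50, 100, 150, 200]
reads = [90, 95, 100, 100, 100, 100, 105, 110]
pi = gem_em(reads, positions, triangular,
            alpha_s=1.0, alpha_m=[0.0, 0.0, 0.8, 0.0, 0.0])
```

In this toy example the flanking candidates at 50 and 150 initially receive a share of the reads, but their effective counts fall below the sparseness pseudo-count within two iterations and they are eliminated, leaving a single event at position 100.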
Since the value of the term $-\alpha_s + \alpha_m$ is negative, a binding event supported by enriched k-mers
may still be eliminated if it is not sufficiently supported by read data. In addition, a
binding event not supported by enriched k-mers may still survive if it is sufficiently
supported by the read data.
The predicted binding events are tested for significance as described previously
(Guo et al., 2010). Briefly, if a control dataset is available, we compare the number of
reads in the ChIP event to the number of reads in the corresponding region in the control
sample using a Binomial test. If control data is not available, we apply a statistical test
that uses a dynamic Poisson distribution to account for local biases. To correct for
multiple hypothesis testing, a Benjamini-Hochberg correction (Benjamini and Hochberg,
1995) is applied. It is worth mentioning that we only use read counts of events to test for
significance.
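The multiple hypothesis correction step can be sketched as follows. This is a generic Benjamini-Hochberg adjustment written in plain Python for illustration; the function name `benjamini_hochberg` and the example p-values are hypothetical, and the per-event p-values are assumed to come from the Binomial or Poisson test described above.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-event p-values (illustrative only).
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```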
The read spatial distribution of binding events is updated after each round of binding
event prediction.
4.2.5 Motif discovery using improved event locations
Phase 5 repeats Phase 2 motif discovery using the binding events predicted from
Phase 4. As described in the results section (Figure 4-1), the spatial accuracy of binding
events discovered from Phase 4 (GEM) is significantly improved from Phase 1 (GPS).
Thus, these events will be more accurately centered on motifs and the performance of
motif discovery is correspondingly improved.
4.2.6 GEM software
GEM is a stand-alone Java program that takes alignment files of ChIP-Seq reads
and a genome sequence as input and reports a list of predicted binding events and the
explanatory binding motifs. It can be downloaded from our web site
(http://cgs.csail.mit.edu/gem). For analysis of mammalian genome experiments, GEM
requires about 5-15 GB of memory.
4.3 Results
4.3.1 GEM improves the spatial resolution of binding event prediction
We compared GEM’s spatial resolution to six well known ChIP-Seq analysis
methods, including GPS (Guo et al., 2010), SISSRs (Jothi et al., 2008), MACS (Zhang et
al., 2008), cisGenome (Ji et al., 2008), QuEST (Valouev et al., 2008) and PeakRanger
(Feng et al., 2011). We used a human Growth Associated Binding Protein (GABP)
ChIP-Seq dataset for our evaluation because GABP ChIP-Seq data were previously
reported to contain homotypic events where the reads generated by multiple closely
spaced binding events overlap (Valouev et al., 2008). Thus the GABP dataset offers the
opportunity to test if integrating motif information and binding event prediction improves
our ability to deconvolve closely spaced binding events with greater accuracy. We also
evaluated the methods using ChIP-Seq data (Chen et al., 2008) from the insulator
binding factor CTCF (CCCTC-binding factor), as it binds to a more informative motif than
GABP. These two factors are representative of relatively easy (CTCF) and difficult
(GABP) cases for ChIP-Seq data analysis. They are also used by other studies as
benchmarks allowing for the direct evaluation of our results. GEM performance on other
factors may vary.
We found that GEM has the best spatial resolution among tested methods. Spatial
resolution is defined as the average absolute difference between the
computationally predicted locations of binding events and the middle of the nearest motif
match. Across all observations, spatial resolution is corrected for a fixed offset by
subtracting the mean difference before averaging the absolute differences. To
ensure a fair comparison, we used 428 shared GABP binding sites that are predicted by
all seven tested methods and which contain an instance of the GABP motif within 100bp.
GEM exactly locates the events at the motif position in 56.5% of these events (Figure
4-1A). For a dataset with a stronger consensus motif, ChIP-Seq data from CTCF, GEM
exactly locates the events at the motif position in more than 90% of the shared events,
significantly improving the spatial accuracy of predicted binding events over other
methods (Figure 4-1B). Alternative evaluations with all the binding sites that have a
motif at a distance less than 100bp were also performed for both GABP and CTCF data,
and the results were similar to those above (data not shown).
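The spatial resolution metric defined above can be sketched as follows. This is an illustrative sketch: the function names and the example offsets are hypothetical, and each offset is assumed to be the signed distance in bp from a predicted event to the middle of its nearest motif match.

```python
from typing import List


def spatial_resolution(offsets: List[float]) -> float:
    """Mean absolute offset after subtracting the mean (fixed) offset."""
    mean_offset = sum(offsets) / len(offsets)
    return sum(abs(d - mean_offset) for d in offsets) / len(offsets)


def fraction_exact(offsets: List[float]) -> float:
    """Fraction of events located exactly at the motif position."""
    return sum(1 for d in offsets if d == 0) / len(offsets)


# Hypothetical signed offsets: a caller with a constant +5bp bias and a small
# spread, and a caller that is mostly exact.
biased = spatial_resolution([3, 5, 7, 5])
exact = fraction_exact([0, 0, 1, 0, -2])
```

Subtracting the mean first means that a method with a consistent bias (for example, always 5bp downstream of the motif middle) is penalized only for its spread, not for the bias itself.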
To show that GEM’s binding calls indeed more accurately locate the actual binding
sites and do not simply map to the motif positions, we performed an independent
analysis without using motif as the gold standard for evaluation. ChIP-exo is a new
experimental method for generating binding data with higher spatial resolution than
ChIP-Seq (Rhee and Pugh, 2011). We used human HeLa cell CTCF ChIP-exo binding
sites (without using motif information) as an approximation of actual binding sites.
These ChIP-exo sites were then used as a gold standard to evaluate the GEM and GPS
binding calls of human HeLa cell CTCF ChIP-Seq data. We found that overall GEM
binding calls were located closer to the ChIP-exo binding sites than the GPS binding calls
(Figure 4-1C), suggesting that GEM binding calls indeed more accurately locate the
actual binding sites.
Thus, GEM’s joint model of ChIP-Seq read coverage and genome sequence is able
to more accurately predict the location of binding sites than other approaches, which do
not use motif information in their binding event predictions.
Figure 4-1 GEM improves spatial accuracy in binding event prediction
A) Fraction of predicted GABP binding events with a motif within the given distance following
event discovery by GEM, GPS, SISSRs, MACS, cisGenome, QuEST and PeakRanger. Events
shown were predicted by all seven methods and had a GABP motif within 100bp. B) Fraction of
predicted CTCF binding events with a motif within the given distance following event discovery
by GEM, GPS, SISSRs, MACS, cisGenome, QuEST, FindPeaks, spp-wtd and spp-mtc. Events
shown were predicted by all nine methods and had a CTCF motif within 100bp. C) Fraction of
predicted CTCF binding events with ChIP-exo site within the given distance following event
discovery by GEM and GPS. Events shown were predicted by both methods and had a CTCF
ChIP-exo site within 50bp.
4.3.2 GEM is better at resolving closely spaced binding events
GEM is also better at resolving closely spaced binding events (Gotea et al., 2010) in
the GABP data than the other methods we tested. For example, GEM uniquely detects
two GABP events over proximal GABP motifs that are 32bp apart on chromosome 2
(Figure 4-2A). To evaluate binding deconvolution on a genome-wide scale, we identified
477 candidate clusters of closely spaced binding events. Each candidate cluster was
detected as bound by all seven tested methods and contained two or more proximal
GABP motifs separated by less than 500bp. GEM identified two or more closely spaced
events in 144 of the candidate clusters, significantly more than GPS (108), SISSRs (77),
QuEST (77), PeakRanger (36), MACS (4) and cisGenome (5) (Figure 4-2B).
Figure 4-2 GEM is better at resolving closely spaced binding events.
A) Example of a predicted binary GABP event that contains coordinately located GABP
motifs. B) Numbers of GABP binding events discovered by GEM, GPS, SISSRs,
MACS, cisGenome, QuEST and PeakRanger in 477 regions that contain clustered GABP
motifs within 500bp.
4.3.3 GEM improves the spatial resolution of ChIP-exo binding event
prediction
We also tested GEM and GPS on the new experimental protocol ChIP-exo. ChIP-exo aims to improve transcription factor binding spatial resolution by extensively
digesting ChIP fragments down to the DNA that is protected by the bound protein
complex (Rhee and Pugh, 2011). While ChIP-exo experiments provide high-resolution
binding information, typical peak-finding methodologies may fail to achieve single-base
resolution binding event predictions if they do not account for the properties of the ChIP-exo experiment. An example is provided by the published CTCF ChIP-exo experiment
(Rhee and Pugh, 2011), where ChIP-exo reads are bimodally distributed around binding
sites on both strands because CTCF is cross-linked at two distinct sites of DNA. The
published event predictions did not account for this characteristic distribution, and are
thus often offset from CTCF binding motif instances. Since GPS and GEM automatically
learn a model of sequence reads around binding events, they may be directly applied to
ChIP-exo data without modification. To test GEM’s ability to automatically adapt to
ChIP-exo data, we initialized GEM with a ChIP-Seq empirical read distribution. Results
showed that GEM is able to automatically adapt to the read distribution produced by the
ChIP-exo protocol. We compared GEM’s final computed read distribution to the
expected empirical distribution of ChIP-exo and found that they were consistent (Figure
4-3B). A similar test was done on the yeast Reb1 ChIP-exo data and GEM also
automatically adapted to the ChIP-exo distribution (Figure 4-3D).
Figure 4-3 GEM improves the spatial resolution of ChIP-exo data event prediction.
A) Fraction of predicted CTCF binding events with a motif within the given distance following
event discovery by GEM, GPS, and the peak-pair midpoint method of Rhee, et al. GEM and GPS
analysis of ChIP-exo data are initialized with a ChIP-Seq read distribution. B) GEM
automatically adapts to the ChIP-exo read spatial distribution. C) Fraction of predicted Reb1
binding events with a motif within the given distance with event discovery by GEM, GPS, and
the peak-pair midpoint method of Rhee, et al. GEM and GPS are initialized with a ChIP-Seq read
distribution. D) GEM automatically adapts to the Reb1 ChIP-exo read spatial distribution.
GEM improves upon the spatial resolution of binding event detection over other
methods for ChIP-exo data analysis (Figure 4-3A). To investigate the performance of
GEM on ChIP-exo data, we compared the binding event predictions of GEM and GPS
on ChIP-exo CTCF binding data and the “middle of peak-pair” method from the original ChIP-exo study (Rhee and Pugh, 2011), as well as predictions of GEM and GPS on ChIP-Seq
CTCF binding data from the same cell type. To ensure a fair comparison, we used 4507
shared binding sites that are predicted by all tested methods and that contain a strong
CTCF motif match within 100bp of the binding positions. The original ChIP-exo study
(Rhee and Pugh, 2011) had 5.4% of the binding event calls centered on the motif match
position, 40.3% of the calls within 10bp, and an average spatial resolution of
15.85±15.29bp. Applying GPS to the ChIP-exo data improved the spatial resolution,
with 8.8% of calls at the 0bp position, 59.7% of calls within 10bp, and an average spatial
resolution of 10.38±11.26bp. Applying GEM to the ChIP-exo data placed 76.5% of calls
exactly at the motif match positions and 89.7% of calls within 10bp, with an average spatial
resolution of 3.35±9.71bp. In addition, GEM was also applied to yeast Reb1 ChIP-exo
data and located 95.5% of calls at the motif match positions. These results demonstrate
that GEM can significantly improve the spatial accuracy of ChIP-exo binding event
predictions.
Interestingly, we found that GEM ChIP-Seq prediction has a marginally better
performance in spatial resolution than GEM ChIP-exo prediction (Figure 4-3A). This
suggests that GEM’s ability to integrate motif information for binding event prediction
may compensate for the relatively lower resolution of ChIP-Seq read data and still produce
high resolution binding calls. However, we cannot draw conclusions about the relative
performance of the ChIP-Seq and ChIP-exo protocols, because only a single ChIP-exo
dataset for a vertebrate transcription factor is publicly available (CTCF), and the CTCF
ChIP-Seq experiments we analyzed are not matched controls for the ChIP-exo
experiment (i.e. they were performed by different groups under different laboratory
conditions). We also note that GPS and GEM may not yet be fully optimized for ChIP-exo analysis (see Discussion). It is therefore difficult to separate the inherent resolution
of ChIP-Seq and ChIP-exo experimental data from the relative performance of our
methods on these data types.
4.4 Discussion
GEM builds on the probabilistic mixture model framework of GPS to integrate motif
information with read coverage for ChIP-Seq binding event prediction. The motif
information is modeled as a position-specific prior to bias the binding events towards
motif positions and thus improve the spatial resolution of event predictions. In doing so,
GEM offers a more principled approach than simply snapping binding event predictions
to the closest instance of the motif, and indeed, GEM does not require that all binding
events are associated with strong motifs. GEM achieves exceptional spatial resolution
and improves the deconvolution of closely spaced binding events, underscoring the
value of integrating motif information as a positional prior into binding event prediction.
An important issue for de novo motif discovery is the quality of the input sequences,
as is evident from the improvement obtained by moving from promoters of co-expressed genes to
ChIP-enriched regions. An algorithm has been developed to improve the signal-to-noise
ratio by determining the threshold to partition the data into target and background sets
(Eden et al., 2007). GEM’s significant improvement in spatial resolution further
facilitates motif discovery by locating more binding regions with motif instances, thus
increasing the motif signal-to-noise ratio.
The manner of integrating motif information as position-specific counts is
generalizable. Either k-mer set motif hit counts or PWM scores can be integrated as
a positional prior into the model. In addition, other position-specific information, e.g.
phastCons scores for cross-species sequence conservation (Siepel et al., 2005), may be
integrated in a similar fashion. In the current implementation of GEM, KMAC is used to
find the k-mer set motif. Although the performance of KMAC motif discovery and the
predictive power of the k-mer set motif model have been shown to be superior (Chapter
3), GEM can be easily modified to use motif priors generated by other motif discovery
methods. It is important to note that GEM’s performance is dependent on the correct
identification of the motif prior information. In any case, it is good practice to verify that
the discovered motif is indeed biologically relevant.
GEM can also be applied to ChIP-exo data without modification and further
improves the spatial resolution of binding prediction. GEM’s successful automatic
adaptation to the ChIP-exo read distribution suggests that it is flexible enough to be
applied to new types of data. GPS/GEM uses a single read distribution model for all
binding events. This may be a limitation because different types of binding events may
be associated with different types of read distributions in ChIP-exo experiments (e.g.
because of different patterns of exonuclease protection associated with recruitment of
different cofactor complexes). Using a single read distribution to analyze ChIP-exo data
may therefore hurt the relative spatial resolution performance of our methods on ChIP-exo data. Thus it would be useful to account for such potential features of ChIP-exo
datasets in the next generation of GPS/GEM models.
4.5 Methods
4.5.1 Datasets
The mouse ES cell CTCF (Chen, et al., 2008) and human Jurkat cell GABP
(Valouev, et al., 2008) ChIP-Seq binding datasets are described in subsection 2.5.1.
ChIP-exo (Rhee and Pugh, 2011) data were provided by Ho Sung Rhee and B.
Franklin Pugh.
The human HeLa-S3 cell CTCF ChIP-Seq dataset was generated by the Crawford Lab and
was downloaded from the ENCODE project website.
4.5.2 Evaluating spatial resolution of ChIP-Seq event calls
The genome-wide spatial resolution performance in ChIP-Seq event calls is
evaluated as follows. We define effective spatial resolution as the average absolute
value of the distance between genome coordinates of predicted binding events and the
middle of the corresponding high-scoring binding motif hit. Because the center of the
motif hit may not represent the true center of a binding event, the offsets to the motif
were centered by subtracting the mean offsets. We compare spatial resolution on the
“matched” set of predictions that are called by all the methods and correspond to the
same high-scoring binding motif. Only those events within 100bp of a motif match are
included in the calculation. An alternative evaluation with all the events that have a motif
at a distance less than 100bp is also performed.
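The definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name `effective_spatial_resolution` and its arguments are hypothetical, events and motif hits are assumed to be already paired and on the same chromosome, strand is ignored, and "within 100bp" is read as an absolute offset of at most 100.

```python
from statistics import mean


def effective_spatial_resolution(event_positions, motif_centers, max_distance=100):
    """Mean absolute centered offset (bp) between event calls and motif midpoints.

    event_positions[i] and motif_centers[i] are assumed to be a matched pair
    (hypothetical input layout; the thesis does not specify one).
    """
    # Signed offsets, keeping only events within max_distance of their motif.
    offsets = [e - m for e, m in zip(event_positions, motif_centers)
               if abs(e - m) <= max_distance]
    if not offsets:
        raise ValueError("no event lies within max_distance of its motif match")
    # The motif midpoint may not be the true binding center, so the offsets
    # are centered by subtracting their mean before averaging magnitudes.
    mean_offset = mean(offsets)
    return mean([abs(o - mean_offset) for o in offsets])
```

For example, offsets of 5, -2 and 3bp have a mean of 2bp, so the centered offsets are 3, -4 and 1bp and the effective spatial resolution is 8/3bp.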
For Figure 4-1C, we used the GPS binding calls from human HeLa cell CTCF
ChIP-exo data (without using motif information) as an approximate version of actual
binding sites.
4.5.3 Evaluating performance in deconvolving proximal binding events
using GABP ChIP-Seq data
The genome-wide performance of proximal event discovery in ChIP-Seq data is
evaluated as follows. For the GABP dataset, we compared GEM against six other methods
(GPS, SISSRs, MACS, cisGenome, QuEST and PeakRanger) genome-wide. We define
a set of candidate sites that all have at least one event detected by all seven methods,
and that contain two or more GABP motifs separated by less than 500bp. We
discovered 477 such sites. For each of these sites, we count the number of events
discovered by each method. The GABP motif was retrieved from the TRANSFAC
database (M00341) (Matys et al., 2003). A motif score threshold of 9.9, which is 60% of
the maximum PWM score, is used in this analysis.
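The candidate-site counting described in this subsection might be sketched as follows. The names `motif_clusters` and `count_resolved_regions`, and the `flank` parameter, are hypothetical and introduced only for illustration. The sketch assumes positions on a single chromosome and omits the requirement that each site be detected by all seven methods.

```python
from bisect import bisect_left, bisect_right


def motif_clusters(motif_positions, max_gap=500):
    """Group motif positions into (start, end) clusters of two or more motifs
    whose consecutive spacing is less than max_gap bp."""
    clusters, current = [], []
    for pos in sorted(motif_positions):
        if current and pos - current[-1] >= max_gap:
            if len(current) >= 2:
                clusters.append((current[0], current[-1]))
            current = []
        current.append(pos)
    if len(current) >= 2:
        clusters.append((current[0], current[-1]))
    return clusters


def count_resolved_regions(regions, event_positions, min_events=2, flank=50):
    """Count regions in which a method calls at least min_events events.

    flank (an assumed value) extends each region so that events called
    slightly outside the outermost motifs are still counted.
    """
    events = sorted(event_positions)
    resolved = 0
    for start, end in regions:
        n = bisect_right(events, end + flank) - bisect_left(events, start - flank)
        if n >= min_events:
            resolved += 1
    return resolved
```

Running `count_resolved_regions` once per method on the same candidate regions gives per-method counts of the kind compared in Figure 4-2B.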
4.5.4 Analysis of ChIP-exo data
To test GEM’s ability to automatically adapt to ChIP-exo data, we initialized GEM
with a generic ChIP-Seq empirical read distribution, and ran GEM with one extra run
(phase 4 and 5) so that GEM could use more accurately positioned events to refine the
read distribution and use it for final prediction. In practice, the user can directly initialize
GEM with a ChIP-exo empirical read distribution (provided with the GEM software) and
apply GEM in the same way as for ChIP-Seq data.
Chapter 5 Transcription factor spatial binding
constraints
Part of the material presented in this chapter was adapted from the following publication:
Yuchun Guo, Shaun Mahony, David K Gifford (2012). High resolution genome wide
binding event finding and motif discovery reveals transcription factor spatial binding
constraints, PLOS Computational Biology, 8(8): e1002638.
Collaborations:
Y.G., S.M. and D.K.G. conceived the project. Y.G. designed the computational model
and implemented the algorithm. Y.G., S.M. and D.K.G. analyzed the data. Y.G., S.M.
and D.K.G. wrote the manuscript.
5.1 Introduction
Genomic sequences facilitate both cooperative and competitive regulatory factor-factor interactions that implement cellular transcriptional regulatory logic. The functional
syntax of DNA motifs in regulatory elements is thus an essential component of cellular
regulatory control. Appropriately spaced motifs can facilitate cooperative homo-dimeric
or hetero-dimeric factor binding, while overlapping motifs can implement competitive
binding by steric hindrance. Cooperative and competitive binding are an integral part of
complex cellular regulatory logic functions (Wolberger, 1999; Ponticos et al., 2004). The
notion of grammar (Levine, 2010; Swanson et al., 2011) refers to the
phenomenon that the spacing and arrangement of binding sites matter for the activity of an
enhancer, just as the order of words in a sentence can affect its meaning. The binding
of regulatory proteins to the genome cannot at present be predicted from primary DNA
sequence alone as chromatin structure, co-factors, and other mechanisms make the
prediction of in vivo binding from sequence empirically unreliable (Farnham, 2009).
Thus it is not possible to use primary DNA sequence to determine the aspects of
genome syntax that are employed in vivo.
To discover novel pair-wise factor spatial binding constraints in vivo, we developed
GEM to improve the spatial resolution of binding event predictions (Chapter 4). GEM’s
unbiased computational approach has enabled us to discover novel binding constraints
between transcription factors from sequenced ChIP experiments. These spatial
constraints directly suggest biological regulatory mechanisms that will be useful in future
studies. SpaMo studied motif spacing using ChIP-Seq events to infer transcription factor
complexes but the predicted motif spacing does not necessarily indicate in vivo binding
in the specific cellular conditions (Whitington et al., 2011).
Here we review our GEM derived results, discuss these results in the context of
current data production projects, and detail our methods.
5.2 Spatial binding constraints discovery
To study the in vivo binding spatial relationship between a pair of transcription
factors A and B in a certain cell type and condition, we apply GEM independently to
ChIP-Seq data from A and B to predict the respective binding sites. To compute the
distribution of the spacing of A relative to B, we compute the offsets of A binding sites
from B binding sites within a 201bp window. We choose this window size because we
expect most of the observable binding constraints to be within this range. The sequence
strand of the binding predictions is oriented using the B motif when a match to the motif
is present, and B is placed in the middle of the window. The occurrences of A at each
offset position are summed over all the B sites to produce the empirical spatial
distribution. In this study, we evaluate three different methods to call binding sites: GEM
binding calls, GPS binding calls, and GPS binding calls that are snapped to a motif
within 50bp if one is present. Another motif distance for snapping binding calls, 100bp,
was also tested and the result was very similar to the 50bp distance.
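The construction of the empirical spatial distribution can be illustrated with a short sketch. The function name `spacing_distribution` and the `(position, strand)` input layout are assumptions made for this example. All sites are taken to lie on one chromosome, and B sites without a motif match are simply passed in with a `'+'` strand.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def spacing_distribution(a_sites, b_sites, half_window=100):
    """Count occurrences of A sites at each offset from B sites.

    a_sites: iterable of integer positions.
    b_sites: iterable of (position, strand) pairs, strand being '+' or '-'
             as given by the orientation of the B motif match.
    half_window=100 corresponds to a 201bp window centered on B.
    """
    a_sorted = sorted(a_sites)
    counts = Counter()
    for b_pos, b_strand in b_sites:
        lo = bisect_left(a_sorted, b_pos - half_window)
        hi = bisect_right(a_sorted, b_pos + half_window)
        for a_pos in a_sorted[lo:hi]:
            offset = a_pos - b_pos
            # Flip the offset when the B motif is on the reverse strand so
            # that all windows share the orientation of the B motif.
            if b_strand == '-':
                offset = -offset
            counts[offset] += 1
    return counts
```

Summing over all B sites in this way yields one count per offset position, which is the quantity tested for significance in the next paragraph.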
To determine if a specific spacing is significant, we compute the p-value of the
number of occurrences of factor A at that offset position using a Poisson test. The
parameter of the Poisson distribution is set as the mean number of A site occurrences
across all the positions in the [-400bp, -200bp] and [200bp, 400bp] windows, assuming
there are no significant spatial binding constraints in these windows. The p-value is
corrected for multiple hypothesis testing using Bonferroni correction by multiplying the
p-value by the number of positions in the window and the total number of pair-wise tests
across all cell types. The significance threshold for the corrected p-value is 1e-8.
Because bound sequences cannot be oriented consistently
when comparing multiple factor pairs, we report the absolute distance between the most
significant interacting factor pairs in the pairwise spatial constraint matrices.
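The significance test described above can be sketched with the standard library alone. The helper names (`poisson_upper_tail`, `significant_offsets`) and the treatment of the threshold as a strict inequality are assumptions for illustration. The upper-tail sum is a simple direct evaluation rather than the routine actually used in the thesis, and extremely small tails underflow to 0.0.

```python
import math


def poisson_log_pmf(i, lam):
    """log P(X = i) for X ~ Poisson(lam), lam > 0."""
    return -lam + i * math.log(lam) - math.lgamma(i + 1)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    if k <= lam:
        # Below the mean, use the complement of the lower tail.
        lower = sum(math.exp(poisson_log_pmf(i, lam)) for i in range(k))
        return max(0.0, 1.0 - lower)
    # Above the mean the terms shrink geometrically, so sum until negligible.
    total, term, i = 0.0, math.exp(poisson_log_pmf(k, lam)), k
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return total


def significant_offsets(counts, background_counts, n_positions=201,
                        n_pair_tests=1, threshold=1e-8):
    """Return {offset: corrected p-value} for offsets passing the threshold.

    background_counts: per-position counts from the flanking windows,
    whose mean is used as the Poisson parameter.
    """
    lam = sum(background_counts) / len(background_counts)
    bonferroni = n_positions * n_pair_tests
    hits = {}
    for offset, k in counts.items():
        p = min(1.0, poisson_upper_tail(k, lam) * bonferroni)
        if p < threshold:
            hits[offset] = p
    return hits
```

Here `n_pair_tests` stands for the total number of pair-wise tests across all cell types, so that the Bonferroni factor matches the description in the text.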
5.3 Results
5.3.1 GEM reveals known Sox2-Oct4 distance-constrained transcription
factor binding distances
We examined if GEM could detect pairs of transcription factors that bind to the
genome with characteristic pair-wise spacing, beginning with the well-known heterodimeric pair Sox2-Oct4 (Chew et al., 2005). In general, distance-constrained
transcription factor binding cannot be predicted based solely on sequence motifs, as
motif presence does not guarantee binding. Such spatial binding constraints may be
caused by combinatorial binding, alternative binding, binding that is orchestrated by
multimeric protein complexes, or the spread of constrained enhancer syntax.
We were able to discover Sox2-Oct4 transcription factor spatial binding constraints
by combining GEM binding calls from Sox2 and Oct4 ChIP-Seq data. We applied GEM
independently to mouse ES cell Sox2 and Oct4 ChIP-Seq data (Chen, et al., 2008) to
call the respective binding sites, and then computed the offsets of Oct4 sites
from Sox2 sites within a 201bp window. The sequence strand of the GEM binding
predictions is oriented using the Sox2 motif when a match to the motif is present. As
expected, GEM predicted Oct4 binding sites are predominantly (630 sites out of 2525 in
the 201bp window) located at -6bp position relative to GEM predicted Sox2 sites (Figure
5-1A). However, this spacing cannot be observed from the binding calls of GPS or other
event discovery methods alone because of their more limited spatial accuracy (Figure
5-1B). An alternative approach is to snap binding calls to the nearest instance of the
transcription factor’s binding motif. We tested this approach using GPS binding calls as
the starting points and found that the alternate approach captures fewer (277 sites out of
2753 in the 201bp window) instances of Oct4-Sox2 spatial binding constraints (Figure
5-1C), presumably because some of the bound motifs do not pass the motif scoring
threshold or because some unbound motif instances are located closer to the binding
calls than the true motif. We also tested using the PWM motifs as the motif prior for
GEM. With this prior, 476 instances of Oct4-Sox2 spatial binding constraints were discovered (Figure
5-1D), fewer than the 630 sites found by GEM with the k-mer set motif (KSM) prior. This is consistent
with the finding that the KSM representation is more predictive than the PWM
representation (see Subsection 3.4.2). Inspection of the sequences of the 630 Oct4 and
Sox2 co-bound regions shows that the Oct4 and Sox2 motifs are located right next to each other
(Figure 5-1E).
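For comparison, the alternative "snapping" approach evaluated above can be sketched as follows. The function name `snap_to_motif` is hypothetical. The sketch assumes motif instances have already been filtered by a score threshold and that all positions are on one chromosome.

```python
from bisect import bisect_left


def snap_to_motif(call_positions, motif_positions, max_distance=50):
    """Move each binding call to the nearest motif instance within
    max_distance bp, leaving the call unchanged if none is present."""
    motifs = sorted(motif_positions)
    snapped = []
    for pos in call_positions:
        i = bisect_left(motifs, pos)
        # Only the motifs immediately left and right of the call can be nearest.
        candidates = motifs[max(0, i - 1):i + 1]
        nearest = min(candidates, key=lambda m: abs(m - pos), default=None)
        if nearest is not None and abs(nearest - pos) <= max_distance:
            snapped.append(nearest)
        else:
            snapped.append(pos)
    return snapped
```

The sketch makes the two failure modes named in the text concrete: a bound motif that was filtered out is never a candidate, and an unbound motif that happens to be closer wins the `min`.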
Figure 5-1 GEM reveals transcription factor spatial binding constraints.
A), B), C) and D) Genome wide spatial distribution of Oct4 binding sites in a 201bp window
around Sox2 binding sites, obtained by using GEM binding calls, GPS binding calls, GPS binding
calls snapped to the nearest motif within 50bp, and GEM (with PWM motif prior) binding calls,
respectively. Dashed lines represent the Sox2 binding sites at position 0. E) Color chart
representation of 61bp sequences in 630 regions with 6bp Sox2/Oct4 binding constraint. Each
row represents a 61bp bound sequence. Green, blue, yellow and red indicate A, C, G and T. The
motif logos are generated by STAMP (Mahony et al., 2007) from the motifs discovered using all
the binding sites in the respective datasets.
5.3.2 Enhancer grammar elements deduced from transcription factor
binding sites predicted by GEM
We next studied pair-wise binding relationships between 14 sequence-specific
transcription factors (Oct4, Sox2, Nanog, Klf4, STAT3, Smad1, Zfx, c-Myc, n-Myc, Esrrb,
Nr5a2, Tcfcp2l1, E2f1 and CTCF) and two transcriptional regulators (p300 and Suz12) in
mouse ES cells by applying GEM to a large compendium of ChIP-Seq binding data
(Chen et al., 2008; Heng et al., 2010). Binding prediction by GEM enables the detection
of 37 pairs of statistically significant spatial binding constraints, involving Oct4, Sox2,
Nanog, Klf4, Esrrb, Nr5a2, Tcfcp2l1, E2f1, c-Myc, n-Myc and Zfx (Figure 5-2).
Interestingly, we found that Klf4, one of the ES cell reprogramming factors, exhibits
strong distance-specific binding with many other factors, including Nanog, Sox2, Zfx, c-Myc, n-Myc, E2f1, Esrrb, Nr5a2 and Tcfcp2l1 (Figure 5-3).
[Figure 5-2, panels A and B: 16 x 16 matrices with rows and columns labeled Nanog, Sox2, Oct4, c-Myc, n-Myc, Ctcf, STAT3, Suz12, Zfx, P300, Smad1, E2f1, Klf4, Esrrb, Nr5a2 and Tcfcp2l1. Panel A color scale: 1e-50 to 1e-300. Panel B color scale: 20 to 200. The matrix cell values are not recoverable from the extracted text.]
Figure 5-2 Spatial binding constraints detected from mouse ES cells.
A) Matrix representation of pairwise spatial binding constraints between factor B (column) and
factor A (row) detected from 16 ChIP-Seq datasets in mouse ES cells. The colors represent the
significance levels (corrected p-value) of the strongest spacings. The numbers represent the
distances between the factors in the strongest spacing. B) The colors and numbers represent the
number of positions exhibiting significant spatial binding constraints within the 201bp window
around the binding sites of factor B (column).
[Figure 5-3: one panel per factor (Esrrb, Nr5a2, n-Myc, Suz12, STAT3, Smad1, P300, Oct4, Sox2, Tcfcp2l1, Klf4, Zfx, E2f1, Ctcf, Nanog and c-Myc). x-axis: offset from Klf4 binding site, -100 to 100.]
Figure 5-3 Spatial relationship between Klf4 and 15 other factors in mouse ES cells
Spatial distribution of 16 mouse ES cell factor binding sites in a 201bp window around Klf4
binding sites. Vertical dash-dot lines represent the Klf4 binding sites at position 0; horizontal
dashed lines represent the number of occurrences at a position corresponding to corrected p-value
of 1e-8.
The discovered pair-wise spatial binding constraints reveal complex relationships
among the factors. For example, Klf4 exhibits constrained binding with Sox2 but much
less significantly with Oct4 (Figure 5-3). However, we did observe strong distance-specific binding between Oct4 and Sox2 (Figure 5-1). This raises the question of whether
the detected Klf4-Sox2 and Oct4-Sox2 spatial binding constraints are on the same
genomic regions. We therefore studied all Sox2 bound regions that are co-bound with
Klf4. Out of a total of 5609 Sox2 bound regions with a Sox2 motif instance that can be
oriented, 123 regions are co-bound by Klf4 at position +25bp (Figure 5-4A). However,
only four regions show co-binding of Klf4 at position +25bp and Oct4 at position -6bp.
More surprisingly, the distance-constrained Sox2/Klf4 regions are co-bound by 6 ES cell
factors within a 70bp window, including Sox2 (at 0bp), Nanog (at 1bp), Klf4 (at 25bp),
Esrrb (at 56, 59bp), Nr5a2 (at 55, 58, 61bp) and Tcfcp2l1 (at 66, 69bp). Inspecting the
underlying sequences of these regions, we found that the binding motifs of these factors
are embedded at the positions consistent with the binding positions (Figure 5-4B). In
addition to the consistent spatial arrangement of motifs, these sequences (spanning
from -70bp to 100bp) exhibit a high degree of similarity. A subset of the sequences is
shifted by 3 bases through insertions/deletions, consistent with the 3bp shift of some of the
factor binding positions.
Several lines of evidence suggest that these Klf4-Sox2 distance-constrained regions
may be functional in ES cell transcriptional regulation. First, 21 of these regions are
shown to interact with a total of 36 other genomic regions by mouse ES cell RNA polymerase
II ChIA-PET experiments (Reeder et al., unpublished results). Chromatin Interaction
Analysis by Paired-End-Tag sequencing (ChIA-PET) is a recently developed technology
for genome-wide investigation of chromatin interactions bound by specific protein factors
(Fullwood et al., 2009). One of the Klf4-Sox2 distance-constrained regions located at
the second intron of Tcfcp2l1 shows a strong (22 paired-end reads) long-range
chromatin interaction with the Tcfcp2l1 transcriptional start site over a 20kb distance
(Figure 5-5) (Reeder et al., unpublished results), and is bound by p300 (Creyghton et al.,
2010), a histone acetyltransferase and transcriptional coactivator that predicts tissue-specific enhancers (Visel et al., 2009), suggesting potential roles in regulating the
transcription of Tcfcp2l1. Long-range chromatin interactions between some of these 21
regions and the transcriptional start sites of Nanog and other genes are also observed
(data not shown). In addition, analyses with p300 and H3K27ac ChIP-Seq data from
mouse ES cells (Creyghton et al., 2010) suggest that these Klf4-Sox2 distance-constrained regions may be active enhancer regions (Figure 5-6). GPS binding calls
show that almost all (119 out of 123) of these regions are bound by p300. Read
coverage enrichment analysis shows that the large majority (111 out of 123) of these
regions are also marked by H3K27ac, a histone modification associated with active
enhancers (Creyghton et al., 2010). The enrichment of the H3K27ac mark is computed
using a Binomial test (p-value<1e-4) of read count in a 1001bp window, using a histone H3
ChIP-Seq dataset as a matched control. These results demonstrate that GEM analysis
enables detection of coordinated binding of multiple factors that may be functional in
ES cell transcriptional regulation.
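A Binomial read-count enrichment test of the kind mentioned above could look like the following sketch. The text does not state how the null probability is set, so everything here is an assumption: the names `binomial_upper_tail` and `is_enriched` are hypothetical, and the null proportion is taken to be the IP library's share of the total reads in the IP and control libraries combined.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)

    def log_pmf(i):
        log_comb = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1))
        return log_comb + i * log_p + (n - i) * log_q

    return min(1.0, sum(math.exp(log_pmf(i)) for i in range(k, n + 1)))


def is_enriched(ip_count, ctrl_count, ip_total, ctrl_total, threshold=1e-4):
    """Test whether the IP read count in a window exceeds expectation.

    ip_count, ctrl_count: reads in the window (e.g. 1001bp) for IP and control.
    ip_total, ctrl_total: library sizes, used here (as an assumption) to set
    the null probability that a window read comes from the IP library.
    """
    n = ip_count + ctrl_count
    p_null = ip_total / (ip_total + ctrl_total)
    return binomial_upper_tail(ip_count, n, p_null) < threshold
```

With equal library sizes, a window holding 40 IP reads and 2 control reads passes the 1e-4 threshold, while a window with 5 reads from each does not.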
Figure 5-4 Enhancer grammar elements deduced from mouse ES cell transcription
factor binding sites predicted by GEM.
A) The binding site distribution of Sox2, Klf4, Nanog, Oct4, Esrrb, Nr5a2 and Tcfcp2l1 in 123
regions that exhibit Sox2-Klf4 spatial binding constraints. The Sox2 sites are aligned at the 0bp
positions, and Klf4 sites are at the 25bp positions. The rows are ordered by Esrrb offset positions.
B) Color chart representation of 201bp sequences in the same regions as in A. Each row
represents a 201bp bound sequence. Green, blue, yellow and red indicate A, C, G and T. The
motif logos are generated by STAMP (Mahony et al., 2007) from the motifs discovered using all
the binding sites in the respective datasets.
Figure 5-5 A Klf4-Sox2 distance-constrained region interacts with Tcfcp2l1
transcriptional start site.
The tracks are: chromosome coordinates of the region overlapping the Tcfcp2l1 gene; mouse ES
cell RNA polymerase II ChIA-PET interactions (the arcs connect the two ends of the paired-end
reads); clusters of the pol2 ChIA-PET interactions (the left and right ends indicate the mean
positions of paired-end reads in the cluster, number indicates the number of paired-end reads in
the cluster); p300 binding ChIP-Seq read profile; Refseq gene annotation (arrow indicates
transcriptional start site); a Klf4-Sox2 distance-constrained region (red rectangle). The ChIA-PET
interaction clusters are generated by applying hierarchical clustering to the paired-end reads in the
region, using the Chebyshev distance metric with a 4kb distance cutoff. Clusters with fewer
than 3 reads are not shown.
Of the 123 regions where Sox2, Klf4, and other sites display constrained spacing,
109 (89%) are annotated instances of the RLTR9 ERVK family of long terminal repeat
elements. It is interesting to note that while Bourque, et al. found an association
between Oct4/Sox2 co-binding sites and other members of the ERVK repeat class
(Bourque et al., 2008), we found a set of repetitive elements that encode the binding of
Sox2 and other factors without Oct4 in ES cells. Kunarso, et al. suggested that
transposable elements have rewired the core regulatory network of ES cells (Kunarso et
al., 2010). Our analysis found that the repetitive sequences constrain the in vivo binding
of a number of key transcription factors in ES cells.
[Figure 5-6 panels: p300 and H3K27ac. x-axis: position relative to the Sox2 site, -1kb to 1kb.]
Figure 5-6 Klf4-Sox2 distance-constrained regions are bound by p300 and marked
by H3K27ac
Read profiles and heatmaps of p300 and H3K27ac histone mark read coverage in 123 Klf4-Sox2
distance-constrained regions. Top: read profile, bottom: heatmap of read coverage. The regions
span 2kb around the Sox2 binding sites. The regions are in the same order as in Figure 5-4. Color
shading corresponds to the ChIP-Seq read count in the region.
5.3.3 Spatially constrained human factor binding in ENCODE data
We computed statistically significant pair-wise spatially constrained binding events
between 46 transcription factors characterized in 184 ENCODE ChIP-Seq data sets in
five different cell lines. Each transcription factor ChIP was processed independently by
GEM so that we could assess any differences in observed binding between cell lines
and biological replicates.
We found that 390 pairs of transcription factors have significant binding distance
constraints within 100bp of each other (Figures 5-7 to 5-11). The number of pairs
found in each cell line differed as did the number of transcription factors assayed: K562
(152 pairs/37 TFs), GM12878 (148 pairs/29 TFs), HepG2 (107 pairs/29 TFs), HeLa-S3
[Figure residue, panels A and B, K562: pairwise matrices with rows and columns labeled c-Myc:S-IFNa30, Max:S, c-Myc:S-IFNa6h, c-Myc:S, c-Myc:S-IFNg6h, c-Myc:C, USF1:M, Max:M, Egr-1:M, CTCF:B, CTCF:C, CTCF:ST, YY1:S, YY1:M-v1, YY1:M-v2, ELF1:M, STAT1:S-IFNa30, STAT1:S-IFNg30, STAT1:S-IFNg6h, GABP:M, PU.1:M, NRSF:M, SRF:M, STAT1:S-IFNa6h, GATA1:S, GATA2:M, GATA2:S, GATA2:W, c-Jun:S, FOS:W, FOSL1:M, JunD:W, c-Jun:S-IFNa6h, c-Jun:S-IFNg30, JunB:W, c-Jun:S-IFNg6h and NF-E2:S. Panel A color scale: 1e-8 to 1e-300. The matrix cell values are not recoverable from the extracted text.]
51
38
109
54
8 61 4 74
65
115
78
7 83 26 96 6 136 32 145 107 1
2 95 93 101
6 45 2 62
52
107
68
1
1 66 49
2
2
1
7
1
1
1
12 4 60 45 117 25 51 127 4 25 80 10 7
13 97 63 99 28 136 74 148 4 107 1 98 100 106
3
1
8
2 126 49 125 8
3 22 97 74 85 91 131 1 28 74 29
1
33
85 90 149 134 127 56 80 141 22 96 128 55 9
6
YY1:S 1
67
14
103 4 69 14
Egr-1:M
CTCF:ST
6 17
26 119 8 118 5
4 30 73 46 74 75 1 134 134 137 134 7 118 121 146
1
5
62 62 97
4
CTCF:B
19
2
Max:M
CTCF:C
13
60
2 53 102 95 103 66 117 111 97 103 99 67 1 97 115
110
YY1:M-v2 4
1 26 99 93 107 62 125 80 87 99 93 50 97 1 136
97 5 93 12
5
1 35 24 110 21 44 136
6
5 73 38 137 14 41 150 2 22 62 2
2
37 103 100 115 102 145 121 74 104 101
STAT1:S-IFNg30
3
3
5 111 71 114 49
3
1
2
3
3
2
1 26
4
3 29 1
1
STAT1:S-IFNg6h
30
GABP:M 10
3
4
3
70 109 113 116 102 123 114 15 61 42
PU.1:M
15 6 55 5
NRSF:M 13
SRF:M
1
4
112 102 112
4
10 66
62 104 92 119 67 129 103 49 72 62
97 94 112
2
1
13 6
1
99 18
121 131 1
STAT1:S-IFNa30
16 11 52 1
9
3 84 2 35 108
3
4
2
1
80
1
4
1
1 64 107 47
2 106 39 1
2
3 48 1
1
2
1
7
49
100
3
98 74 128 114 119 40 80 129 6 62 99 9
65 1 35 3
4
54
5 86 1
1
STAT1:S-IFNa6h
1
8 52 25 41 3
1 133 1
7 10 2
1
93 101 146 119 119 81 97 143 11 105 133 51
44
4 47
3
3 88
24 110
60 78
120
1
GATA1:S
19 58 40 84 18 56 25
7
98 1 92
1
2
GATA2:M
6 33 14 91 8 55 6
4
69 10 95
2
1 13 2 102 18 25 157 8 83 122 26
GATA2:S 11
62 111 93 145 55 139 98
4 40 78
3 130 51 145
6
4
GATA2:W 5
48 91 76 129 32 128 51
5 34 44
2 112 24 122
2
2 29 1 111 27 61 139 21 105 127 41
c-Jun:S 12 11 78 111 116 127 111 139 126 36 66 52
85 108 133
5 115 35 117 47
82 108 141 112 1 57 59 101 49 93 104 73 2
FOS:W
12 15 58 20 101 38
7 23 16
42 2 79
2 13 73 21 62 1 32 49 4 28 54 6
FOSL1:M
36 32 71 50 108 86
36 46 38
78 2 99 8
15 13 113 61 59 34 1 46 5 29 46 10 2
106 138 150
49 129 136 140 48
108 156 156 142 101 49 44 1 22 68 104 43 4
JunD:W 65 34 101 130 119 145 127 166 138 106 114 107
c-Jun:S-IFNa6h
6
25 5 15
c-Jun:S-IFNg30
4 26 30 92 26 94 52
JunB:W 4
48 101 93 121 84 149 75 52 78 66
c-Jun:S-IFNg6h
5
2
57 13 28 8
1
7
7
7
2
4 20
58 78 63
5
9
1 12
6
140
8 22 145 77 105 164 45 134 151 89
5 52
48 5
2
4 21 1 45 22 43 4
65 6 105
41 85 137 108 92 30 28 68 45 1 73 56 2
4 101 7 132 4
81 129 156 128 105 52 47 104 14 74 1 50 4
15
2
48
1
28 92 48 75 6
2
2
8 42 43 57 49 1
4
2
1
5
4
2
4
3
160
180
200
M
yc
c-
c-
M
yc
:S
-IF
N
a
M 30
: S ax
-IF :S
N
c- a6
M h
cM c- yc:
yc M C
: S yc
-IF : S
N
U g6h
SF
1
M :M
a
Eg x:M
rC 1 :M
TC
C F: B
T
C CF
TC : C
F:
S
Y T
YY Y 1
1: :S
YY M1: v1
ST
M
A
E ST T1 : L F v2
A S-I 1:M
ST T1 : FN
A S-I a3
T1 F 0
:S Ng
-IF 3 0
N
G g6
A h
B
PU P:M
N .1 :
M
R
ST
SF
A
T1 SR :M
:S F:
-IF M
G Na
A 6
T h
G A1
A :S
TA
G 2
A :M
T
G A2
A :S
TA
2
c- :W
Ju
FO n :S
FO S:
SL W
cJu Ju 1: M
c- n :S n D
Ju -I :W
n : FN
S- a
IF 6 h
N
cJu J g3
n : un 0
S- B
IF :W
N
N g6
F- h
E
2:
S
NF-E2:S
40
1
YY1:M-v1 3
ELF1:M
20
Figure 5-7 Spatial binding constraints detected from human K562 cells.
Matrix representation of pairwise spatial binding constraints between factor B (column) and
factor A (row) detected from 37 ChIP-Seq datasets in human K562 cells. A) The colors represent
the significance levels (corrected p-value) of the strongest spacings. The numbers represent the
distances between the factors in the strongest spacings. B) The colors and numbers represent the
number of positions exhibiting significant spatial binding constraints within the 201bp window
around the binding sites of factor B (column).
[Figure 5-8, panels A and B: factor-by-factor matrices for GM12878 cells. The color scale in panel A runs from a corrected p-value of 1e-8 to 1e-300.]
Figure 5-8 Spatial binding constraints detected from human GM12878 cells.
A) Matrix representation of pairwise spatial binding constraints between factor B (column) and
factor A (row) detected from 29 ChIP-Seq datasets in human GM12878 cells. The colors represent
the significance levels (corrected p-value) of the strongest spacings. The numbers represent the
distances between the factors in the strongest spacings. B) The colors and numbers represent the
number of positions exhibiting significant spatial binding constraints within the 201bp window
around the binding sites of factor B (column).
[Figure 5-9, panels A and B: factor-by-factor matrices for HepG2 cells. The color scale in panel A runs from a corrected p-value of 1e-8 to 1e-300.]
Figure 5-9 Spatial binding constraints detected from human HepG2 cells.
A) Matrix representation of pairwise spatial binding constraints between factor B (column) and
factor A (row) detected from 29 ChIP-Seq datasets in human HepG2 cells. The colors represent
the significance levels (corrected p-value) of the strongest spacings. The numbers represent the
distances between the factors in the strongest spacings. B) The colors and numbers represent the
number of positions exhibiting significant spatial binding constraints within the 201bp window
around the binding sites of factor B (column).
[Figure 5-10, panels A and B: factor-by-factor matrices for HeLa-S3 cells. The color scale in panel A runs from a corrected p-value of 1e-8 to 1e-300.]
Figure 5-10 Spatial binding constraints detected from human HeLa-S3 cells.
A) Matrix representation of pairwise spatial binding constraints between factor B (column) and
factor A (row) detected from 15 ChIP-Seq datasets in human HeLa-S3 cells. The colors represent
the significance levels (corrected p-value) of the strongest spacings. The numbers represent the
distances between the factors in the strongest spacings. B) The colors and numbers represent the
number of positions exhibiting significant spatial binding constraints within the 201bp window
around the binding sites of factor B (column).
[Figure 5-11, panels A and B: factor-by-factor matrices for H1-hESC cells. The color scale in panel A runs from a corrected p-value of 1e-8 to 1e-300.]
Figure 5-11 Spatial binding constraints detected from human H1-hESC cells.
A) Matrix representation of pairwise spatial binding constraints between factor B (column) and
factor A (row) detected from 11 ChIP-Seq datasets in human H1-hESC cells. The colors represent
the significance levels (corrected p-value) of the strongest spacings. The numbers represent the
distances between the factors in the strongest spacings. B) The colors and numbers represent the
number of positions exhibiting significant spatial binding constraints within the 201bp window
around the binding sites of factor B (column).
(48 pairs/15 TFs), and H1-hESC (23 pairs/11 TFs). Certain factor pairs exhibited a
single highly significant binding spacing within 100bp, such as the 4bp distance
between Egr1 and CTCF in K562 cells (Figure 5-7). Other factor pairs exhibited a large
number of significant offsets, such as the 167 significant spacings between JunD and
Max with the most significant being at 4bp (Figure 5-7). Our analysis confirmed the
known interaction pairs MYC-MAX (Blackwood and Eisenman, 1991), the FOS-JUN
heterodimer (Glover and Harrison, 1995), and CTCF-YY1 (Donohoe et al., 2007) (Figure
5-7).
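The offset counts behind these matrices can be illustrated with a short sketch. It is not the thesis implementation: the function name `offset_histogram`, the single-chromosome coordinate lists, and the toy data are assumptions made for illustration, and strand orientation is ignored. For each binding site of factor B, it collects the signed distance to every factor A site within 100bp (a 201bp window) and tallies how often each distance occurs.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

def offset_histogram(sites_a, sites_b, half_window=100):
    """Count signed offsets (a - b) for every A site within half_window bp of a B site.

    sites_a, sites_b: lists of integer coordinates on one chromosome.
    Returns a Counter mapping offset -> number of site pairs.
    """
    sorted_a = sorted(sites_a)
    counts = Counter()
    for b in sites_b:
        # Binary search for the slice of A sites inside [b - w, b + w].
        lo = bisect_left(sorted_a, b - half_window)
        hi = bisect_right(sorted_a, b + half_window)
        for a in sorted_a[lo:hi]:
            counts[a - b] += 1
    return counts

# Toy data (hypothetical): factor A tends to bind 4bp downstream of factor B.
sites_b = [1000, 5000, 9000, 20000]
sites_a = [1004, 5004, 9004, 20050, 30000]
hist = offset_histogram(sites_a, sites_b)
best_offset, best_count = hist.most_common(1)[0]
print(best_offset, best_count)  # prints: 4 3
```

A spacing that recurs far more often than the others, such as the 4bp offset in the toy data, is the kind of signal that would then be tested for significance.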
Novel genome-wide spatial binding constraints that we observed include
c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4α/FOXA1. We find that USF1 often binds 4bp from c-Fos:c-Jun
(Figure 5-12A). Inspection of the sequences of the co-bound regions shows a partial
overlap of the two motifs (Figure 5-12D). This binding is consistent with Fra1’s
facilitation of a complex between USF1 and c-Fos:c-Jun (Pognonec et al., 1997). We
find a significant number of cases where CTCF co-binds 4bp from Egr1 (Figure 5-12B),
with the Egr1 motif overlapping significantly with half of the CTCF motif (Figure 5-12E). Egr1
promotes terminal myeloid differentiation in the presence of deregulated c-Myc
expression, and Egr1 has been implicated in down regulating c-Myc in conjunction with
CTCF (Hoffman et al., 2002). In addition, the co-binding of CTCF and Egr1 at the EPO
regulatory region has been suggested (Yamaguchi et al., 1994). FOXA1 binds at a large
number of significant positions close to HNF4α (total 4215 regions with a spacing within
30bp, Figure 5-12C and Figure 5-12F), and there are also significant binding constraints
between HNF4α or HNF4γ and FOXA1 or FOXA2 in HepG2 cells (Figure 5-9). While
co-binding of HNF4α/FOXA2 has been reported (Wallerman et al., 2009), co-binding of
HNF4α/FOXA1, HNF4γ/FOXA1 and HNF4γ/FOXA2 was not previously known. We note that
HNF4α together with any one of FOXA1, FOXA2, or FOXA3 is sufficient to reprogram
cells towards a hepatocytic fate (Sekiya and Suzuki, 2011).
Figure 5-12 Examples of transcription factor spatial binding constraints detected
from GEM analysis in ENCODE ChIP-Seq data.
A) Genome-wide spatial distribution of USF1 binding sites in a 201bp window around c-Jun
binding sites. B) Egr1 binding sites around CTCF binding sites. C) FOXA1 binding sites around
HNF4α binding sites. For panels A-C, vertical dashed lines represent the centered factor binding
sites at position 0; horizontal dashed lines represent the number of occurrences at a position
corresponding to a corrected p-value of 1e-8. D) Color chart representation of 61bp sequences in
259 regions with a 4bp c-Fos:c-Jun/USF1 binding constraint. E) Color chart representation of
100bp sequences in 315 regions with a 4bp CTCF/Egr1 binding constraint. F) Color chart
representation of 71bp sequences in 4215 regions with a wide range of HNF4α/FOXA1 binding
constraints within 30bp of HNF4α binding sites. For panels D-F, each row represents a bound
sequence. Green, blue, yellow and red indicate A, C, G and T, respectively. For panel F, the rows are ordered
by the FOXA1 offset positions. The motif logos are generated by STAMP (Mahony et al., 2007)
from the motifs discovered using all the binding sites in the respective datasets.
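One way to obtain a count threshold like the horizontal dashed lines in panels A-C is sketched below. The null model is an assumption made for illustration and is not necessarily the test used in this thesis: offsets are taken to be spread uniformly over the 201 positions, the count at a position is approximated as Poisson, and the p-value is Bonferroni-corrected over the positions. The names `poisson_upper_tail` and `min_significant_count` are hypothetical.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term from k upward."""
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Stop once further terms no longer change the sum at double precision.
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return total

def min_significant_count(n_pairs, n_positions=201, corrected_p=1e-8):
    """Smallest per-position count whose Bonferroni-corrected p-value is <= corrected_p.

    Assumes n_pairs > 0 offsets spread uniformly over n_positions under the null.
    """
    lam = n_pairs / n_positions           # expected count at any one position
    c = int(math.floor(lam)) + 1          # start just above the expectation
    while n_positions * poisson_upper_tail(c, lam) > corrected_p:
        c += 1
    return c

# Example (hypothetical numbers): 2010 site pairs fall inside the window.
threshold = min_significant_count(2010)
print(threshold)
```

Any offset whose observed count reaches the returned threshold would be called significant under these assumptions; a different null model would move the line.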
5.4 Discussion
Collectively, our results demonstrate that it is now possible to reveal aspects of
functional genome syntax by surveying in vivo binding relationships between
transcription factors at high spatial resolution. We show that TF binding constraints can
be strict (e.g. Oct4/Sox2, c-Fos:c-Jun/USF1, CTCF/Egr1, etc.) or constrained but flexible
(e.g. HNF4α/FOXA1). We also discovered 123 examples of enhancer grammar
elements that capture the complex binding relationship among 6 ES cell TFs in a 70bp
window. Our analysis has been made possible by sequenced ChIP data and a new
computational method, GEM, which provides exceptional spatial resolution.
GEM makes binding predictions and observes spatial constraints by discovering
significant events utilizing both motifs and read coverage information. Prior work has
documented specific genomic regions extensively targeted by multiple transcription
factors (TFs) (Chen et al., 2008). However, we have shown that the functional syntax of
DNA motifs in regulatory elements cannot be fully elaborated with the imprecise ChIP-Seq event calls provided by previous methods. Motif analysis approaches such as
SpaMo discover enriched motif spacings by scanning a list of known motifs in
sequences anchored by ChIP-Seq data of a single factor (Whitington et al., 2011).
Since the existence of motif instances does not guarantee condition-specific in vivo
binding, SpaMo cannot confidently determine the spacing between binding events and
the factors involved, especially for motifs that are shared by a family of TFs.
Furthermore, SpaMo excludes repetitive sequences (Whitington et al., 2011). In
contrast, GEM predicts binding based on uniquely-mapped reads and is able to detect
spatial binding constraints in transposable elements. Such elements have been
implicated in rewiring the core regulatory network of human and mouse ES cells
(Kunarso et al., 2010).
We expect that the genome grammatical rules that are suggested here will be
examined in further studies to elucidate mechanisms of transcriptional control, and
potential protein-protein interactions that have regulatory consequences. These spatial
binding constraints provide starting points to test the enhanceosome model and billboard
model of binding site arrangement in the enhancers (Thanos and Maniatis, 1995; Arnosti
and Kulkarni, 2005). Exploration of other genome grammatical constructs can be
accomplished with the use of further ChIP experiments and GEM.
Chapter 6: Conclusions
6.1 Summary and contributions
The focus of this thesis research has been characterizing the interactions between
transcription factors and their binding sites in regulatory DNA sequences at high spatial
resolution, and using this characterization to reveal genomic grammars that may
underlie the combinatorial control of gene regulation. In this thesis, I developed three
computational methods to learn genome-wide transcription factor binding events, in vivo
binding preferences and binding constraints from ChIP-Seq profiling of protein-DNA
interactions. I will summarize the results presented in previous chapters and outline the
main contributions of this thesis.
6.1.1 Genome Positioning System (GPS)
The Genome Positioning System (GPS) algorithm is a model-based computational
method to predict ChIP-Seq binding events with high spatial resolution. In order to
address the challenges of ChIP-Seq analysis, GPS explicitly models random
fragmentation of ChIP DNA and the mixing of closely spaced events using a novel
probabilistic mixture model. Compared to other published ChIP-Seq analysis methods,
GPS improves the spatial resolution of binding event predictions and resolves more
proximal binding events.
The main contributions of this work are:
• A novel generative probabilistic model for spatial distribution of ChIP-Seq
data. To our knowledge, GPS is the first published method to directly model ChIP-Seq read spatial distribution at single base-pair resolution. Previous methods
typically used sliding window or density smoothing approaches to aggregate the
reads and thus are not able to predict with high resolution (Pepke et al., 2009). This
modeling framework allows more accurate representation of the data and easy
incorporation of position-specific information (as in GEM). Similar approaches may
be applied to other data types such as DNase-Seq and RNA-Seq, which also
consist of distributions of reads along the genome.
• A novel mixture model with a sparse prior to explicitly model closely spaced
binding events. The explicit modeling of closely spaced events allows more
accurate quantification of each event in terms of binding strength and binding
location. Such accurate quantification may facilitate further study of cooperative
proximal events. The use of a sparse prior instead of trying a range of component
(events) numbers allows intuitive interpretation of the sparse prior parameter (in
terms of read counts) and easy incorporation of position-specific information (as in
GEM).
• Explicit modeling of the “peak shape” information. GPS models the “peak
shape” with an empirical spatial distribution of reads. This is more accurate than
the parametric distributions used by other methods and allows automatic adaptation
to new data types that have a very different distribution, such as ChIP-exo data.
Modeling the peak shape is the main contributor to the improved spatial resolution. The use of “peak shape”
information also allows principled filtering of false positive predictions that result
from artifacts in the ChIP-Seq data.
• A novel method for ChIP-Seq event prediction with high spatial accuracy,
specificity, and more accurate quantification of binding strength. This
facilitates downstream analysis such as motif discovery or gene expression
prediction using binding information.
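The interplay of the mixture model and the sparse prior can be sketched as a small EM loop. This is a minimal illustration under stated assumptions, not the GPS implementation: the M-step is assumed to subtract a fixed read-count penalty `alpha` from each event's expected read count before renormalizing, a triangular `triangular_shape` function stands in for the empirical read distribution, and strands are ignored. All names and the toy data are hypothetical.

```python
def triangular_shape(d, width=50):
    """Toy stand-in for the empirical read distribution around an event."""
    return max(0.0, 1.0 - abs(d) / width) / width

def em_sparse_mixture(reads, candidates, shape=triangular_shape, alpha=2.0, n_iter=50):
    """Estimate mixing weights for candidate event positions.

    reads: read positions; candidates: candidate event positions.
    Returns one weight per candidate; a zero weight means "no event here".
    """
    m = len(candidates)
    pi = [1.0 / m] * m
    for _ in range(n_iter):
        # E-step: fractional assignment of each read to each surviving event.
        counts = [0.0] * m
        for r in reads:
            resp = [pi[j] * shape(r - candidates[j]) for j in range(m)]
            z = sum(resp)
            if z == 0.0:
                continue  # read is not explained by any surviving event
            for j in range(m):
                counts[j] += resp[j] / z
        # M-step with the sparse prior: an event must claim more than alpha reads.
        kept = [max(0.0, c - alpha) for c in counts]
        total = sum(kept)
        if total == 0.0:
            return [0.0] * m
        pi = [k / total for k in kept]
    return pi

# Six reads support the candidate at 100; a single read supports the one at 1000.
reads = [90, 95, 100, 100, 105, 110, 1000]
candidates = [100, 1000]
pi = em_sparse_mixture(reads, candidates)
print(pi)  # prints: [1.0, 0.0]
```

Because the penalty is expressed in reads, `alpha` can be read directly as the minimum support an event needs to survive, which is the interpretability advantage noted in the list above.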
6.1.2 K-mer set motif representation and discovery
The k-mer set motif (KSM) representation and the k-mer motif alignment and
clustering (KMAC) motif discovery method are designed to respectively represent and
learn motifs that are enriched in ChIP-Seq bound sequences versus control sequences.
Our results showed that the KSM model is more informative and predictive than the
PWM model. KMAC discovers motifs by using a combined enumerative and alignment-based approach and weighting the motif sites with binding event strength and binding
positional information. KMAC outperforms other motif discovery methods, including
several ChIP-Seq oriented methods, in recovering known motifs using a large number of
diverse ChIP-Seq datasets.
The main contributions of this work are:
• A novel motif representation. The KSM representation overcomes the position-independence assumption of the position weight matrix (PWM) and consensus
sequence representation and allows richer and more accurate representation of the
motif. The value of this motif representation is also demonstrated in the
advantageous performance of KMAC motif discovery. Such k-mer based
representation also facilitates the direct comparison of the in vivo binding specificity
with the in vitro binding specificity of the same factor, which is commonly
represented as a list of k-mer binding affinities.
• A novel motif discovery method for ChIP-Seq data. The KMAC motif discovery
method exploits the advantages of the large number of training examples in ChIP-Seq data. It is able to process large numbers of sequences and utilizes both strong
and weak binding events. It also takes advantage of the higher spatial resolution
and more accurate quantification of binding event strength by biasing motif
discovery towards the binding positions and more confident events. The value of
incorporating these informative features is demonstrated in the improved
performance of KMAC motif discovery compared with traditional and ChIP-Seq
oriented motif discovery methods.
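The enumerative half of this idea, finding k-mers that occur in more bound sequences than control sequences, can be sketched as follows. This is an illustration only: the hypergeometric test, the cutoff, and every function name are assumptions, and the sketch omits the alignment, clustering, and event-strength and positional weighting that KMAC performs.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Collapse a k-mer with its reverse complement so both strands count once."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def kmer_hits(seqs, k):
    """Map each canonical k-mer to the number of sequences that contain it."""
    hits = {}
    for seq in seqs:
        seen = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        for kmer in seen:
            hits[kmer] = hits.get(kmer, 0) + 1
    return hits

def hypergeom_sf(pos_hit, n_pos, neg_hit, n_neg):
    """P(at least pos_hit of the hit sequences fall in the positive set)."""
    total, drawn = n_pos + n_neg, pos_hit + neg_hit
    upper = min(n_pos, drawn)
    tail = sum(math.comb(n_pos, x) * math.comb(n_neg, drawn - x)
               for x in range(pos_hit, upper + 1))
    return tail / math.comb(total, drawn)

def enriched_kmers(pos_seqs, neg_seqs, k, p_cutoff=0.05):
    """Return (p-value, k-mer, positive hits, negative hits), most enriched first."""
    pos, neg = kmer_hits(pos_seqs, k), kmer_hits(neg_seqs, k)
    out = []
    for kmer, ph in pos.items():
        nh = neg.get(kmer, 0)
        p = hypergeom_sf(ph, len(pos_seqs), nh, len(neg_seqs))
        if p <= p_cutoff:
            out.append((p, kmer, ph, nh))
    return sorted(out)

# Toy sequences (hypothetical): every bound sequence carries the E-box CACGTG.
bound = ["TTCACGTGAA", "GGCACGTGCC", "ATCACGTGTA", "CCCACGTGAT"]
control = ["TTTTTTTTTT", "GGGGGGGGGG", "ATATATATAT", "CCGGCCGGCC"]
result = enriched_kmers(bound, control, k=6)
print(result[0][1])  # prints: CACGTG
```

A set of such enriched k-mers, rather than a single matrix of independent position frequencies, is what lets a k-mer based representation retain dependencies between positions.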
6.1.3 Genome-wide event finding and motif discovery (GEM)
Genome-wide event finding and motif discovery (GEM) is an integrative model to
resolve the location of protein-DNA interactions and discover explanatory DNA
sequence motifs. GEM extends the GPS model to incorporate motif information as a
position-specific prior to bias binding event prediction. GEM achieves exceptional
spatial resolution in locating most binding events exactly on the motif positions, and
further improves proximal event deconvolution. GEM can also be directly applied to
ChIP-exo data and improves upon existing methods.
The main contributions of this work are:
• A novel integrated model to predict binding events and binding motifs. The
motif information is modeled as a position-specific prior to bias the binding event
predictions towards motif positions and thus improve the spatial resolution of event
predictions. The motif information also helps to more accurately deconvolve closely
spaced events. GEM offers a more principled approach than simply snapping
binding event predictions to the closest instance of the motif.
• A flexible approach to incorporate position-specific information. GEM
integrates position-specific count information as the prior count. The same mechanism
can incorporate other position-specific information into binding event prediction, for
example phastCons scores for cross-species sequence conservation (Siepel et al.,
2005).
• Exceptional spatial resolution of binding event prediction. The exceptional
spatial resolution enables the discovery of in vivo TF binding constraints and further
improves motif discovery.
• A novel computational method for ChIP-exo data. GEM is able to automatically
adapt to the ChIP-exo read distribution and improves upon existing methods for ChIP-exo analysis.
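How a position-specific prior count can bias event prediction is sketched below for a single weight update. The update rule is an assumption made for illustration (the prior count is simply added to the expected read count before the sparse penalty is applied), and the function name, parameter names, and numbers are hypothetical; the model chapter defines the actual GEM update.

```python
def m_step_with_position_prior(read_counts, prior_counts, alpha=2.0):
    """One weight update: expected reads, minus sparse penalty, plus prior count.

    read_counts: expected number of reads assigned to each candidate position.
    prior_counts: position-specific pseudo-counts, for example from motif matches.
    """
    kept = [max(0.0, n - alpha + p) for n, p in zip(read_counts, prior_counts)]
    total = sum(kept)
    if total == 0.0:
        return [0.0] * len(kept)
    return [k / total for k in kept]

# Two neighbouring candidates share the reads equally; only the second has a motif.
no_prior = m_step_with_position_prior([2.5, 2.5], [0.0, 0.0])
with_prior = m_step_with_position_prior([2.5, 2.5], [0.0, 1.0])
print(no_prior)    # prints: [0.5, 0.5]
print(with_prior)  # prints: [0.25, 0.75]
```

Unlike snapping a finished prediction to the nearest motif instance, the prior only shifts weight where the reads already lend support, so a motif match with no read evidence still receives zero weight whenever its prior count is below `alpha`.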
6.1.4 Transcription factor spatial binding constraints
Exploiting GEM’s exceptional spatial resolution, statistically significant TF binding
constraints were discovered using GEM binding predictions of a large number of TFs in
the same cellular condition. We confirmed that it is possible to discover binding
constraints between the Oct4-Sox2 dimers that are impossible to observe with the noisy
binding calls from the existing peak callers. This approach also discovered more binding
constraints than using motif positions that are closest to GPS binding calls. We found 37
examples of TF binding constraints in mouse ES cells, including strong distance-specific
constraints between Klf4, Sox2 and other key regulatory factors. In human ENCODE
data, we found 390 examples of spatially constrained pair-wise binding, including such
novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4α/FOXA1.
The main contributions of this work are:
• A novel approach to discover in vivo TF binding constraints. Our results
demonstrate that it is now possible to reveal aspects of functional genome syntax
by surveying in vivo binding relationships between transcription factors at high
spatial resolution. Because the binding constraints are derived from confident binding
calls, they are more reliable than the findings of SpaMo’s motif analysis based
on ChIP-Seq data of a single factor (Whitington et al., 2011).
• A large number of TF binding constraints were discovered. We found 37
examples in mouse ES cells and 390 examples in 5 human cell types. The results
confirm the known interaction pairs MYC-MAX (Blackwood and Eisenman, 1991),
FOS-JUN (Glover and Harrison, 1995), and CTCF-YY1 (Donohoe et al., 2007). We
also discovered novel pairs such as c-Fos:c-Jun/USF1, CTCF/Egr1, and
HNF4α/FOXA1. We expect that these will be examined in further studies to elucidate
mechanisms of transcriptional control, and potential protein-protein interactions that
have regulatory consequences.
• Enhancer grammar elements in mouse ES cells. We found strong binding
constraints between Klf4, Sox2 and other ES cell key regulatory factors. A set of
123 distance-constrained regions are co-bound by 6 ES cell factors within a 70bp
window. Some of these regions are shown to interact with the TSSs of Tcfcp2l1, Nanog,
PPARα and other genes in RNA polymerase II ChIA-PET experiments. Most of the
regions are bound by p300 and the majority of them are marked by H3K27ac,
suggesting that they may be active enhancer regions. Most of the 123
distance-constrained regions are annotated as transposable elements. Kunarso et al.
suggested that transposable elements have rewired the core regulatory network of
ES cells (Kunarso et al., 2010). Our analysis found that the repetitive sequences
constrain the in vivo binding of a number of key transcription factors in ES cells.
6.2 Directions for future work
6.2.1 Weighting factor of motif prior in the GEM algorithm
In this thesis, we set μ, the weighting factor of motif prior, to be 0.8 so that the k-mer
based prior will not force the model to predict a binding event at a motif position without
sufficient read coverage (see Subsection 4.2.4). The current setting produces good
results for the ENCODE and mouse ES cell datasets, but may not be optimal for future
datasets with higher read coverage or datasets with different read coverage
characteristics such as ChIP-exo data. In addition, other types of position-specific prior
information, e.g. sequence conservation scores, may be incorporated. The weighting
factor ideally should be adjusted automatically to allow a balanced contribution of
different sources of information.
One strategy is to run GEM on multiple random subsets of the reads and to select
the setting that gives the most consistent results. Another strategy is to evaluate the spatial
read distribution of the predicted events to assess the effect of the motif prior. The
mappability of reads should be taken into account when evaluating the read distribution
of predicted events. An overly strong motif prior weighting may result in predictions of false
events that do not have the expected read distribution.
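The first strategy can be sketched as follows. This is a minimal Python illustration under stated assumptions: `call_events` is a hypothetical stand-in for a binding event caller (it is not GEM's actual interface), and the 10 bp matching tolerance, the subset fraction, and the overlap-based consistency score are all illustrative choices.

```python
import random
from statistics import mean


def overlap_score(events_a, events_b, tol=10):
    """Fraction of events in either list that have a match within `tol` bp."""
    if not events_a and not events_b:
        return 1.0
    matched_a = sum(any(abs(a - b) <= tol for b in events_b) for a in events_a)
    matched_b = sum(any(abs(a - b) <= tol for a in events_a) for b in events_b)
    return (matched_a + matched_b) / (len(events_a) + len(events_b))


def consistency(reads, call_events, mu, n_subsets=5, fraction=0.5, seed=0):
    """Mean pairwise overlap of event calls made on random read subsets."""
    rng = random.Random(seed)
    k = int(len(reads) * fraction)
    runs = [call_events(rng.sample(reads, k), mu) for _ in range(n_subsets)]
    scores = [overlap_score(runs[i], runs[j])
              for i in range(n_subsets) for j in range(i + 1, n_subsets)]
    return mean(scores)


def select_mu(reads, call_events, candidates, **kwargs):
    """Return the candidate weighting factor with the highest consistency."""
    return max(candidates,
               key=lambda mu: consistency(reads, call_events, mu, **kwargs))


def toy_caller(reads, mu):
    """Hypothetical caller: a strong prior adds a sample-dependent event."""
    events = [500]
    if mu > 0.9:
        events.append(min(reads))  # spurious event that varies with the subset
    return events


reads = list(range(0, 1000, 37))  # toy read positions
best_mu = select_mu(reads, toy_caller, [0.8, 1.0])
```

In practice the consistency score would need to be weighed against sensitivity, since a setting that predicts very few events can also appear highly consistent.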
6.2.2 K-mer based comparison of in vivo versus in vitro binding for similar
TFs in a family
TFs in large families such as Forkhead box (Fox) or Homeo box (Hox) proteins tend
to bind similar sites in vitro yet display diverse functions in vivo, suggesting specificities
are gained from co-factor interactions. A k-mer based comparison between binding
motifs learned from in vivo ChIP-Seq data and those from in vitro data such as protein
binding microarray (PBM) (Berger et al., 2006), HT-SELEX (Zhao et al., 2009), and
Bacterial one-hybrid (B1H) (Meng et al., 2005; Meng and Wolfe, 2006), especially for
multiple TFs in the same family, will allow interesting sequence specificity features to be
tested. For example, k-mers that are differentially bound in vivo versus in vitro, or
k-mers that are bound differently by related TFs in vivo or in vitro, can be investigated
comprehensively to discover the sequence features that may explain in vivo specificity.
ChIP-Seq for similar TFs may be performed with antibodies against epitope-tagged
proteins (Cao et al., 2011; Mazzoni et al., 2011). The counts of an in vivo k-mer may
need to be properly normalized with the copy number of that k-mer in the genome.
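One simple way to carry out such a normalization is sketched below in Python. The sketch assumes that k-mers are counted together with their reverse complements and that a pseudocount is used to avoid division by zero; both choices, and all function names, are illustrative assumptions rather than the normalization used in this thesis.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(sequences, k):
    """Count canonical k-mers, skipping windows with non-ACGT characters."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def normalized_counts(bound_counts, genome_counts, pseudocount=1.0):
    """Divide each in vivo k-mer count by its (pseudocounted) genomic copy number."""
    return {kmer: n / (genome_counts.get(kmer, 0) + pseudocount)
            for kmer, n in bound_counts.items()}


# Toy example with hypothetical sequences and genome counts.
bound = count_kmers(["ACGTAC"], 3)
genome = {"ACG": 3, "GTA": 1}
enrichment = normalized_counts(bound, genome)
```

For a real genome, the genomic counts would be computed once per k and reused across experiments.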
6.2.3 Discovery of binding constraints
In this thesis, I have shown mainly pair-wise binding constraints. One case of a more
complex pattern of Sox2/Klf4/Nanog/Esrrb/Nr5a2/Tcfcp2l1 co-binding has been found by
targeted search. Automatic search strategies will allow a more systematic search of
complex patterns that may involve an arbitrary number of TFs. One search strategy might
consist of building a binding constraint network using significant pair-wise constraints in
the same cellular condition, then finding the cliques in the network and performing a
targeted search to verify whether those pair-wise constraints in a clique occur in the
same set of genomic regions.
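The clique-based search strategy described above could be sketched as follows. This minimal Python example builds an undirected network from significant pair-wise constraints, enumerates maximal cliques with the basic Bron-Kerbosch recursion, and then intersects the region sets of the pairs in a clique. The factor labels, region identifiers, and data layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_network(pairs):
    """Build an undirected adjacency map from significant TF pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)


def regions_supporting_clique(clique, regions_by_pair):
    """Return regions in which every pair-wise constraint of the clique occurs."""
    common = None
    for i, a in enumerate(clique):
        for b in clique[i + 1:]:
            regions = regions_by_pair.get(frozenset((a, b)), set())
            common = set(regions) if common is None else common & regions
    return common or set()


# Toy input: hypothetical factors A-D and region identifiers.
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
regions_by_pair = {
    frozenset(("A", "B")): {"r1", "r2"},
    frozenset(("B", "C")): {"r1", "r3"},
    frozenset(("A", "C")): {"r1", "r2"},
}
cliques = maximal_cliques(build_network(pairs))
shared = regions_supporting_clique(["A", "B", "C"], regions_by_pair)
```

The final intersection step corresponds to the targeted verification: a clique in the network only suggests a higher-order pattern, and the pattern is supported only if the pair-wise constraints co-occur in the same regions.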
Another direction related to binding constraints is to build an online database to
store ChIP-Seq binding calls from GEM across multiple TFs in multiple conditions.
Binding constraints can then be searched and visualized in the desired set of
experiments. Existing public datasets, including data from ENCODE (Birney et al.,
2007) and modENCODE (Celniker et al., 2009), will provide a large number of experiments to
start with. This will become more useful as more ChIP-Seq data are produced.
6.3 Conclusions
In conclusion, I presented three novel computational methods from my thesis
research. These methods improved spatial resolution and joint event deconvolution in
ChIP-Seq binding event prediction, and improved accuracy in motif representation and
discovery. The improved results from these methods allow discovery of in vivo
transcription factor spatial binding constraints in both human and mouse cells, as well as
improvement in downstream analysis in other research areas.
From these results, I showed that a high resolution model for inherently noisy
genome-wide high-throughput biological data is feasible. This has been made possible
by modeling every ChIP-Seq read, using a more accurate k-mer based motif
representation, and incorporating constraints from the biological experiments.
Collectively, the results from my thesis research show that it is possible to reveal
aspects of functional genome syntax using a high resolution computational model of
ChIP-Seq data.
References
Aho, A.V., and Corasick, M.J. (1975). Efficient string matching: an aid to bibliographic
search. Communications of the ACM 18, 333–340.
Arnosti, D.N., and Kulkarni, M.M. (2005). Transcriptional enhancers: Intelligent
enhanceosomes or flexible billboards? J. Cell. Biochem. 94, 890–898.
Bailey, T.L. (2011). DREME: Motif discovery in transcription factor ChIP-seq data.
Bioinformatics.
Bailey, T.L., Boden, M., Whitington, T., and Machanick, P. (2010). The value of
position-specific priors in motif discovery using MEME. BMC Bioinformatics 11, 179.
Bailey, T.L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization
to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28–36.
Barash, Y., Bejerano, G., and Friedman, N. (2001). A Simple Hyper-Geometric
Approach for Discovering Putative Transcription Factor Binding Sites. In Proceedings of
the First International Workshop on Algorithms in Bioinformatics, (London, UK:
Springer-Verlag), pp. 278–293.
Barski, A., Cuddapah, S., Cui, K., Roh, T.-Y., Schones, D.E., Wang, Z., Wei, G.,
Chepelev, I., and Zhao, K. (2007). High-resolution profiling of histone methylations in
the human genome. Cell 129, 823–837.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical
and powerful approach to multiple testing. J Roy Statist Soc Ser B (Methodological) 57,
289–300.
Benos, P.V., Bulyk, M.L., and Stormo, G.D. (2002). Additivity in protein-DNA
interactions: how good an approximation is it? Nucleic Acids Res. 30, 4442–4451.
Berger, M.F., Philippakis, A.A., Qureshi, A.M., He, F.S., Estep, P.W., 3rd, and Bulyk,
M.L. (2006). Compact, universal DNA microarrays to comprehensively determine
transcription-factor binding site specificities. Nat. Biotechnol. 24, 1429–1435.
Bernstein, B.E., Stamatoyannopoulos, J.A., Costello, J.F., Ren, B., Milosavljevic, A.,
Meissner, A., Kellis, M., Marra, M.A., Beaudet, A.L., Ecker, J.R., et al. (2010). The NIH
Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. 28, 1045–1048.
Bicego, M., Cristani, M., and Murino, V. (2007). Sparseness Achievement in Hidden
Markov Models. In Proceedings of the 14th International Conference on Image Analysis
and Processing (ICIAP07) (Modena: IEEE Computer Society).
Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigó, R., Gingeras, T.R., Margulies,
E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E., et al. (2007).
Identification and analysis of functional elements in 1% of the human genome by the
ENCODE pilot project. Nature 447, 799–816.
Bishop, C.M. (2006). Pattern recognition and machine learning (New York: Springer).
Blackwood, E.M., and Eisenman, R.N. (1991). Max: a helix-loop-helix zipper protein
that forms a sequence-specific DNA-binding complex with Myc. Science 251, 1211–
1217.
Boeva, V., Surdez, D., Guillon, N., Tirode, F., Fejes, A.P., Delattre, O., and Barillot, E.
(2010). De novo motif identification improves the accuracy of predicting transcription
factor binding sites in ChIP-Seq data analysis. Nucleic Acids Res 38, e126.
Bourque, G., Leong, B., Vega, V.B., Chen, X., Lee, Y.L., Srinivasan, K.G., Chew, J.-L.,
Ruan, Y., Wei, C.-L., Ng, H.H., et al. (2008). Evolution of the mammalian transcription
factor binding repertoire via transposable elements. Genome Res. 18, 1752–1762.
Boyadjiev, S.A., and Jabs, E.W. (2000). Online Mendelian Inheritance in Man (OMIM)
as a knowledgebase for human developmental disorders. Clin. Genet. 57, 253–266.
Bulyk, M.L., Johnson, P.L.F., and Church, G.M. (2002). Nucleotides of transcription
factor binding sites exert interdependent effects on the binding affinities of transcription
factors. Nucleic Acids Res. 30, 1255–1261.
Cao, A.R., Rabinovich, R., Xu, M., Xu, X., Jin, V.X., and Farnham, P.J. (2011).
Genome-wide analysis of transcription factor E2F1 mutant proteins reveals that N- and
C-terminal protein interaction domains do not participate in targeting E2F1 to the human
genome. J. Biol. Chem. 286, 11985–11996.
Celniker, S.E., Dillon, L.A.L., Gerstein, M.B., Gunsalus, K.C., Henikoff, S., Karpen,
G.H., Kellis, M., Lai, E.C., Lieb, J.D., MacAlpine, D.M., et al. (2009). Unlocking the
secrets of the genome. Nature 459, 927–930.
Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V.B., Wong, E., Orlov, Y.L.,
Zhang, W., Jiang, J., et al. (2008). Integration of external signaling pathways with the
core transcriptional network in embryonic stem cells. Cell 133, 1106–1117.
Chen, Y., Negre, N., Li, Q., Mieczkowska, J.O., Slattery, M., Liu, T., Zhang, Y., Kim,
T.-K., He, H.H., Zieba, J., et al. (2012). Systematic evaluation of factors influencing
ChIP-seq fidelity. Nature Methods.
Chew, J.-L., Loh, Y.-H., Zhang, W., Chen, X., Tam, W.-L., Yeap, L.-S., Li, P., Ang,
Y.-S., Lim, B., Robson, P., et al. (2005). Reciprocal transcriptional regulation of Pou5f1 and
Sox2 via the Oct4/Sox2 complex in embryonic stem cells. Mol. Cell. Biol. 25, 6031–
6046.
Chung, D., Kuan, P.F., Li, B., Sanalkumar, R., Liang, K., Bresnick, E.H., Dewey, C., and
Keleş, S. (2011). Discovering transcription factor binding sites in highly repetitive
regions of genomes with multi-read analysis of ChIP-Seq data. PLoS Comput. Biol. 7,
e1002111.
Creyghton, M.P., Cheng, A.W., Welstead, G.G., Kooistra, T., Carey, B.W., Steine, E.J.,
Hanna, J., Lodato, M.A., Frampton, G.M., Sharp, P.A., et al. (2010). Histone H3K27ac
separates active from poised enhancers and predicts developmental state. Proc. Natl.
Acad. Sci. U.S.A. 107, 21931–21936.
D’haeseleer, P. (2006a). How does DNA sequence motif discovery work? Nat.
Biotechnol. 24, 959–961.
D’haeseleer, P. (2006b). What are DNA sequence motifs? Nat. Biotechnol. 24, 423–425.
Dang, C.V. (2012). MYC on the path to cancer. Cell 149, 22–35.
Das, M.K., and Dai, H.-K. (2007). A survey of DNA motif finding algorithms. BMC
Bioinformatics 8 Suppl 7, S21.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum Likelihood from
Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39, 1–38.
Donohoe, M.E., Zhang, L.-F., Xu, N., Shi, Y., and Lee, J.T. (2007). Identification of a
Ctcf cofactor, Yy1, for the X chromosome binary switch. Mol. Cell 25, 43–56.
Eden, E., Lipson, D., Yogev, S., and Yakhini, Z. (2007). Discovering motifs in ranked
lists of DNA sequences. PLoS Comput. Biol 3, e39.
Ernst, J., Kheradpour, P., Mikkelsen, T.S., Shoresh, N., Ward, L.D., Epstein, C.B.,
Zhang, X., Wang, L., Issner, R., Coyne, M., et al. (2011). Mapping and analysis of
chromatin state dynamics in nine human cell types. Nature 473, 43–49.
Farnham, P.J. (2009). Insights from genomic profiling of transcription factors. Nat. Rev.
Genet. 10, 605–616.
Feng, X., Grossman, R., and Stein, L. (2011). PeakRanger: a cloud-enabled peak caller
for ChIP-seq data. BMC Bioinformatics 12, 139.
Figueiredo, M.A.T., and Jain, A.K. (2002). Unsupervised Learning of Finite Mixture
Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 381–396.
Fullwood, M.J., Liu, M.H., Pan, Y.F., Liu, J., Xu, H., Mohamed, Y.B., Orlov, Y.L.,
Velkov, S., Ho, A., Mei, P.H., et al. (2009). An oestrogen-receptor-alpha-bound human
chromatin interactome. Nature 462, 58–64.
Furney, S.J., Higgins, D.G., Ouzounis, C.A., and López-Bigas, N. (2006). Structural and
functional properties of genes involved in human cancer. BMC Genomics 7, 3.
Galas, D.J., and Schmitz, A. (1978). DNAse footprinting: a simple method for the
detection of protein-DNA binding specificity. Nucleic Acids Res. 5, 3157–3170.
Georges, A.B., Benayoun, B.A., Caburet, S., and Veitia, R.A. (2010). Generic binding
sites, generic DNA-binding domains: where does specific promoter recognition come
from? FASEB J. 24, 346–356.
Glover, J.N., and Harrison, S.C. (1995). Crystal structure of the heterodimeric bZIP
transcription factor c-Fos-c-Jun bound to DNA. Nature 373, 257–261.
Gotea, V., Visel, A., Westlund, J.M., Nobrega, M.A., Pennacchio, L.A., and Ovcharenko,
I. (2010). Homotypic clusters of transcription factor binding sites are a key component of
human promoters and enhancers. Genome Res 20, 565–577.
Guo, Y., Papachristoudis, G., Altshuler, R.C., Gerber, G.K., Jaakkola, T.S., Gifford,
D.K., and Mahony, S. (2010). Discovering homotypic binding events at high spatial
resolution. Bioinformatics 26, 3028–3034.
Guttman, M., Amit, I., Garber, M., French, C., Lin, M.F., Feldser, D., Huarte, M., Zuk,
O., Carey, B.W., Cassady, J.P., et al. (2009). Chromatin signature reveals over a thousand
highly conserved large non-coding RNAs in mammals. Nature 458, 223–227.
Heng, J.-C.D., Feng, B., Han, J., Jiang, J., Kraus, P., Ng, J.-H., Orlov, Y.L., Huss, M.,
Yang, L., Lufkin, T., et al. (2010). The nuclear receptor Nr5a2 can replace Oct4 in the
reprogramming of murine somatic cells to pluripotent cells. Cell Stem Cell 6, 167–174.
Hoffman, B., Amanullah, A., Shafarenko, M., and Liebermann, D.A. (2002). The
proto-oncogene c-myc in hematopoietic development and leukemogenesis. Oncogene 21,
3414–3421.
Hu, J., Li, B., and Kihara, D. (2005). Limitations and potentials of current motif
discovery algorithms. Nucleic Acids Res. 33, 4899–4913.
Hu, M., Yu, J., Taylor, J.M.G., Chinnaiyan, A.M., and Qin, Z.S. (2010). On the detection
and refinement of transcription factor binding sites using ChIP-Seq data. Nucleic Acids
Res 38, 2154–2167.
Hueber, S.D., and Lohmann, I. (2008). Shaping segments: Hox gene function in the
genomic age. Bioessays 30, 965–979.
Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. (2000). Computational
identification of cis-regulatory elements associated with groups of functionally related
genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214.
Hughes, T.R. (2011). Introduction to “a handbook of transcription factors.” Subcell.
Biochem. 52, 1–6.
Ise, W., Kohyama, M., Schraml, B.U., Zhang, T., Schwer, B., Basu, U., Alt, F.W., Tang,
J., Oltz, E.M., Murphy, T.L., et al. (2011). The transcription factor BATF controls the
global regulators of class-switch recombination in both B cells and T cells. Nat.
Immunol. 12, 536–543.
Iyer, V.R., Horak, C.E., Scafe, C.S., Botstein, D., Snyder, M., and Brown, P.O. (2001).
Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature
409, 533–538.
Jacob, F., and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of
proteins. J. Mol. Biol. 3, 318–356.
Ji, H., Jiang, H., Ma, W., Johnson, D.S., Myers, R.M., and Wong, W.H. (2008). An
integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol.
26, 1293–1300.
Jiang, M., Anderson, J., Gillespie, J., and Mayne, M. (2008). uShuffle: a useful tool for
shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics 9,
192.
Johnson, D.S., Mortazavi, A., Myers, R.M., and Wold, B. (2007). Genome-wide mapping
of in vivo protein-DNA interactions. Science 316, 1497–1502.
Jothi, R., Cuddapah, S., Barski, A., Cui, K., and Zhao, K. (2008). Genome-wide
identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids
Res. 36, 5221–5231.
Kim, T.-M., and Park, P.J. (2011). Advances in analysis of transcriptional regulatory
networks. Wiley Interdiscip Rev Syst Biol Med 3, 21–35.
Kulakovskiy, I.V., Boeva, V.A., Favorov, A.V., and Makeev, V.J. (2010). Deep and wide
digging for binding motifs in ChIP-Seq data. Bioinformatics 26, 2622–2623.
Kulkarni, M.M., and Arnosti, D.N. (2003). Information display by transcriptional
enhancers. Development 130, 6569–6575.
Kunarso, G., Chia, N.-Y., Jeyakani, J., Hwang, C., Lu, X., Chan, Y.-S., Ng, H.-H., and
Bourque, G. (2010). Transposable elements have rewired the core regulatory network of
human embryonic stem cells. Nat Genet 42, 631–634.
Laajala, T.D., Raghav, S., Tuomela, S., Lahesmaa, R., Aittokallio, T., and Elo, L.L.
(2009). A practical comparison of methods for detecting transcription factor binding sites
in ChIP-seq experiments. BMC Genomics 10, 618.
Ladunga, I. (2010). An overview of the computational analyses and discovery of
transcription factor binding sites. Methods Mol. Biol. 674, 1–22.
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and
memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10,
R25.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett,
N.M., Harbison, C.T., Thompson, C.M., Simon, I., et al. (2002). Transcriptional
regulatory networks in Saccharomyces cerevisiae. Science 298, 799–804.
Levine, M. (2010). Transcriptional enhancers in animal development and evolution. Curr.
Biol. 20, R754–763.
Levine, M., and Tjian, R. (2003). Transcription regulation and animal diversity. Nature
424, 147–151.
Li, H., Ruan, J., and Durbin, R. (2008). Mapping short DNA sequencing reads and
calling variants using mapping quality scores. Genome Res. 18, 1851–1858.
Lifanov, A.P., Makeev, V.J., Nazina, A.G., and Papatsenko, D.A. (2003). Homotypic
regulatory clusters in Drosophila. Genome Res 13, 579–588.
Liu, X.S., Brutlag, D.L., and Liu, J.S. (2002). An algorithm for finding protein-DNA
binding sites with applications to chromatin-immunoprecipitation microarray
experiments. Nat. Biotechnol. 20, 835–839.
Lodish, H.F. (2004). Molecular cell biology (New York: W.H. Freeman and Co.).
Ma, X., Kulkarni, A., Zhang, Z., Xuan, Z., Serfling, R., and Zhang, M.Q. (2012). A
highly efficient and effective motif discovery method for ChIP-seq/ChIP-chip data using
positional information. Nucleic Acids Research.
MacIsaac, K.D., and Fraenkel, E. (2006). Practical strategies for discovering regulatory
DNA sequence motifs. PLoS Comput. Biol. 2, e36.
Maerkl, S.J., and Quake, S.R. (2007). A systems approach to measuring the binding
energy landscapes of transcription factors. Science 315, 233–237.
Mahony, S., Auron, P.E., and Benos, P.V. (2007). DNA familial binding profiles made
easy: comparison of various motif alignment and clustering strategies. PLoS Comput.
Biol. 3, e61.
Man, T.K., and Stormo, G.D. (2001). Non-independence of Mnt repressor-operator
interaction determined by a new quantitative multiple fluorescence relative affinity
(QuMFRA) assay. Nucleic Acids Res. 29, 2471–2478.
Mardis, E.R. (2008). Next-generation DNA sequencing methods. Annu Rev Genomics
Hum Genet 9, 387–402.
Maston, G.A., Evans, S.K., and Green, M.R. (2006). Transcriptional regulatory elements
in the human genome. Annu Rev Genomics Hum Genet 7, 29–59.
Matys, V., Fricke, E., Geffers, R., Gössling, E., Haubrock, M., Hehl, R., Hornischer, K.,
Karas, D., Kel, A.E., Kel-Margoulis, O.V., et al. (2003). TRANSFAC: transcriptional
regulation, from patterns to profiles. Nucleic Acids Res. 31, 374–378.
Mazzoni, E.O., Mahony, S., Iacovino, M., Morrison, C.A., Mountoufaris, G., Closser,
M., Whyte, W.A., Young, R.A., Kyba, M., Gifford, D.K., et al. (2011). Embryonic stem
cell-based mapping of developmental transcriptional programs. Nat. Methods 8, 1056–
1058.
Meng, X., Brodsky, M.H., and Wolfe, S.A. (2005). A bacterial one-hybrid system for
determining the DNA-binding specificity of transcription factors. Nat. Biotechnol. 23,
988–994.
Meng, X., and Wolfe, S.A. (2006). Identifying DNA sequences recognized by a
transcription factor using a bacterial one-hybrid system. Nat Protoc 1, 30–45.
Metzker, M.L. (2010). Sequencing technologies - the next generation. Nat. Rev. Genet.
11, 31–46.
Mikkelsen, T.S., Ku, M., Jaffe, D.B., Issac, B., Lieberman, E., Giannoukos, G., Alvarez,
P., Brockman, W., Kim, T.-K., Koche, R.P., et al. (2007). Genome-wide maps of
chromatin state in pluripotent and lineage-committed cells. Nature 448, 553–560.
Mitchell, P.J., and Tjian, R. (1989). Transcriptional regulation in mammalian cells by
sequence-specific DNA binding proteins. Science 245, 371–378.
Moorman, C., Sun, L.V., Wang, J., de Wit, E., Talhout, W., Ward, L.D., Greil, F., Lu,
X.-J., White, K.P., Bussemaker, H.J., et al. (2006). Hotspots of transcription factor
colocalization in the genome of Drosophila melanogaster. Proceedings of the National
Academy of Sciences 103, 12027–12032.
Narlikar, L., Gordân, R., Ohler, U., and Hartemink, A.J. (2006a). Informative priors
based on transcription factor structural class improve de novo motif discovery.
Bioinformatics 22, e384–392.
Narlikar, L., Gordân, R., Ohler, U., and Hartemink, A.J. (2006b). Informative priors
based on transcription factor structural class improve de novo motif discovery.
Bioinformatics 22, e384–392.
Nomenclature Committee of the International Union of Biochemistry (1986).
Nomenclature for incompletely specified bases in nucleic acid sequences.
Recommendations 1984. Nomenclature Committee of the International Union of
Biochemistry (NC-IUB). Proc. Natl. Acad. Sci. U.S.A. 83, 4–8.
Panne, D., Maniatis, T., and Harrison, S.C. (2007). An atomic model of the
interferon-beta enhanceosome. Cell 129, 1111–1123.
Papatsenko, D., and Levine, M. (2005). Quantitative analysis of binding motifs mediating
diverse spatial readouts of the Dorsal gradient in the Drosophila embryo. Proc. Natl.
Acad. Sci. U.S.A. 102, 4966–4971.
Park, P.J. (2009). ChIP-seq: advantages and challenges of a maturing technology. Nat.
Rev. Genet. 10, 669–680.
Pavesi, G., Mauri, G., and Pesole, G. (2001). An algorithm for finding signals of
unknown length in DNA sequences. Bioinformatics 17 Suppl 1, S207–214.
Pepke, S., Wold, B., and Mortazavi, A. (2009). Computation for ChIP-seq and RNA-seq
studies. Nat. Methods 6, S22–32.
Pognonec, P., Boulukos, K.E., Aperlo, C., Fujimoto, M., Ariga, H., Nomoto, A., and
Kato, H. (1997). Cross-family interaction between the bHLHZip USF and bZip Fra1
proteins results in down-regulation of AP1 activity. Oncogene 14, 2091–2098.
Ponticos, M., Partridge, T., Black, C.M., Abraham, D.J., and Bou-Gharios, G. (2004).
Regulation of collagen type I in vascular smooth muscle cells by competition between
Nkx2.5 and deltaEF1/ZEB1. Mol. Cell. Biol. 24, 6151–6161.
Ptashne, M., and Gann, A. (1997). Transcriptional activation by recruitment. Nature 386,
569–577.
Qi, Y., Rolfe, A., MacIsaac, K.D., Gerber, G.K., Pokholok, D., Zeitlinger, J., Danford, T.,
Dowell, R.D., Fraenkel, E., Jaakkola, T.S., et al. (2006). High-resolution computational
models of genome binding events. Nat. Biotechnol. 24, 963–970.
Rahl, P.B., Lin, C.Y., Seila, A.C., Flynn, R.A., McCuine, S., Burge, C.B., Sharp, P.A.,
and Young, R.A. (2010). c-Myc regulates transcriptional pause release. Cell 141, 432–
445.
Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J.,
Schreiber, J., Hannett, N., Kanin, E., et al. (2000). Genome-wide location and function of
DNA binding proteins. Science 290, 2306–2309.
Rhee, H.S., and Pugh, B.F. (2011). Comprehensive genome-wide protein-DNA
interactions detected at single-nucleotide resolution. Cell 147, 1408–1419.
Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Euskirchen,
G., Bernier, B., Varhol, R., Delaney, A., et al. (2007). Genome-wide profiles of STAT1
DNA association using chromatin immunoprecipitation and massively parallel
sequencing. Nat. Methods 4, 651–657.
Rozowsky, J., Euskirchen, G., Auerbach, R.K., Zhang, Z.D., Gibson, T., Bjornson, R.,
Carriero, N., Snyder, M., and Gerstein, M.B. (2009). PeakSeq enables systematic scoring
of ChIP-seq experiments relative to controls. Nat Biotechnol 27, 66–75.
Rye, M.B., Sætrom, P., and Drabløs, F. (2011). A manually curated ChIP-seq benchmark
demonstrates room for improvement in current peak-finder programs. Nucleic Acids Res.
39, e25.
Sandelin, A., Alkema, W., Engström, P., Wasserman, W.W., and Lenhard, B. (2004).
JASPAR: an open-access database for eukaryotic transcription factor binding profiles.
Nucleic Acids Res. 32, D91–94.
Schneider, T.D., and Stephens, R.M. (1990). Sequence logos: a new way to display
consensus sequences. Nucleic Acids Res. 18, 6097–6100.
Sekiya, S., and Suzuki, A. (2011). Direct conversion of mouse fibroblasts to
hepatocyte-like cells by defined factors. Nature 475, 390–393.
Shen, Y., Yue, F., McCleary, D.F., Ye, Z., Edsall, L., Kuan, S., Wagner, U., Dixon, J.,
Lee, L., Lobanenkov, V.V., et al. (2012). A map of the cis-regulatory sequences in the
mouse genome. Nature 488, 116–120.
Sherwood, L. (1997). Human physiology: from cells to systems (Belmont, CA:
Wadsworth Pub. Co.).
Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A.S., Hou, M., Rosenbloom, K.,
Clawson, H., Spieth, J., Hillier, L.W., Richards, S., et al. (2005). Evolutionarily
conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15,
1034–1050.
Slattery, M., Riley, T., Liu, P., Abe, N., Gomez-Alcala, P., Dror, I., Zhou, T., Rohs, R.,
Honig, B., Bussemaker, H.J., et al. (2011). Cofactor binding evokes latent differences in
DNA binding specificity between Hox proteins. Cell 147, 1270–1282.
Solomon, M.J., Larsen, P.L., and Varshavsky, A. (1988). Mapping protein-DNA
interactions in vivo with formaldehyde: evidence that histone H4 is retained on a highly
transcribed gene. Cell 53, 937–947.
Song, L., Zhang, Z., Grasfeder, L.L., Boyle, A.P., Giresi, P.G., Lee, B.-K., Sheffield,
N.C., Gräf, S., Huss, M., Keefe, D., et al. (2011). Open chromatin defined by DNaseI and
FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 21,
1757–1767.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown,
P.O., Botstein, D., and Futcher, B. (1998). Comprehensive identification of cell
cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray
hybridization. Mol. Biol. Cell 9, 3273–3297.
Stormo, G.D. (2000). DNA binding sites: representation and discovery. Bioinformatics
16, 16–23.
Stormo, G.D., Schneider, T.D., Gold, L., and Ehrenfeucht, A. (1982). Use of the
“Perceptron” algorithm to distinguish translational initiation sites in E. coli. Nucleic
Acids Res. 10, 2997–3011.
Stormo, G.D., and Zhao, Y. (2010). Determining the specificity of protein–DNA
interactions. Nature Reviews Genetics 11, 751–760.
Swanson, C.I., Schwimmer, D.B., and Barolo, S. (2011). Rapid evolutionary rewiring of
a structurally constrained eye enhancer. Curr. Biol. 21, 1186–1196.
Taatjes, D.J., Marr, M.T., and Tjian, R. (2004). Regulatory diversity among metazoan
co-activator complexes. Nat. Rev. Mol. Cell Biol. 5, 403–410.
Takahashi, K., and Yamanaka, S. (2006). Induction of pluripotent stem cells from mouse
embryonic and adult fibroblast cultures by defined factors. Cell 126, 663–676.
Thanos, D., and Maniatis, T. (1995). Virus induction of human IFN beta gene expression
requires the assembly of an enhanceosome. Cell 83, 1091–1100.
The modENCODE Consortium, Roy, S., Ernst, J., Kharchenko, P.V., Kheradpour, P.,
Negre, N., Eaton, M.L., Landolin, J.M., Bristow, C.A., Ma, L., et al. (2010).
Identification of Functional Elements and Regulatory Circuits by Drosophila
modENCODE. Science 330, 1787–1797.
Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V.,
Frith, M.C., Fu, Y., Kent, W.J., et al. (2005). Assessing computational tools for the
discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144.
Valouev, A., Johnson, D.S., Sundquist, A., Medina, C., Anton, E., Batzoglou, S., Myers,
R.M., and Sidow, A. (2008). Genome-wide analysis of transcription factor binding sites
based on ChIP-Seq data. Nat. Methods 5, 829–834.
Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A., and Luscombe, N.M. (2009). A
census of human transcription factors: function, expression and evolution. Nat. Rev.
Genet. 10, 252–263.
Visel, A., Blow, M.J., Li, Z., Zhang, T., Akiyama, J.A., Holt, A., Plajzer-Frick, I.,
Shoukry, M., Wright, C., Chen, F., et al. (2009). ChIP-seq accurately predicts
tissue-specific activity of enhancers. Nature 457, 854–858.
Wallerman, O., Motallebipour, M., Enroth, S., Patra, K., Bysani, M.S.R., Komorowski,
J., and Wadelius, C. (2009). Molecular interactions between HNF4a, FOXA2 and GABP
identified at regulatory DNA elements through ChIP-sequencing. Nucleic Acids Res. 37,
7498–7508.
Wang, X., and Zhang, X. (2011). Pinpointing transcription factor binding sites from
ChIP-seq data with SeqSite. BMC Systems Biology 5, S3.
Wang, Z., Gerstein, M., and Snyder, M. (2009). RNA-Seq: a revolutionary tool for
transcriptomics. Nat. Rev. Genet. 10, 57–63.
Wasserman, W.W., and Sandelin, A. (2004). Applied bioinformatics for the identification
of regulatory elements. Nat. Rev. Genet. 5, 276–287.
Watson, J.D. (2004). Molecular biology of the gene (San Francisco: Pearson/Benjamin
Cummings).
Whitington, T., Frith, M.C., Johnson, J., and Bailey, T.L. (2011). Inferring transcription
factor complexes from ChIP-seq data. Nucleic Acids Res.
Wilbanks, E.G., and Facciotti, M.T. (2010). Evaluation of algorithm performance in
ChIP-seq peak detection. PLoS One 5, e11471.
Wolberger, C. (1999). Multiprotein-DNA complexes in transcriptional regulation. Annu
Rev Biophys Biomol Struct 28, 29–56.
Wold, B., and Myers, R.M. (2008). Sequence census methods for functional genomics.
Nat. Methods 5, 19–21.
Wu, S., Wang, J., Zhao, W., Pounds, S., and Cheng, C. (2010). ChIP-PaM: an algorithm
to identify protein-DNA interaction using ChIP-Seq data. Theor Biol Med Model 7, 18.
Wunderlich, Z., and Mirny, L.A. (2009). Different gene regulation strategies revealed by
analysis of binding motifs. Trends Genet. 25, 434–440.
Yamaguchi, Y., Zhang, D.E., Sun, Z., Albee, E.A., Nagata, S., Tenen, D.G., and
Ackerman, S.J. (1994). Functional characterization of the promoter for the gene encoding
human eosinophil peroxidase. J. Biol. Chem. 269, 19410–19419.
Zambelli, F., Pesole, G., and Pavesi, G. (2012). Motif discovery and transcription factor
binding sites before and after the next-generation sequencing era. Briefings in
Bioinformatics.
Zhang, X., Robertson, G., Krzywinski, M., Ning, K., Droit, A., Jones, S., and Gottardo,
R. (2011). PICS: probabilistic inference for ChIP-seq. Biometrics 67, 151–163.
Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum,
C., Myers, R.M., Brown, M., Li, W., et al. (2008). Model-based analysis of ChIP-Seq
(MACS). Genome Biol. 9, R137.
Zhao, Y., Granas, D., and Stormo, G.D. (2009). Inferring binding energies from selected
binding sites. PLoS Comput. Biol 5, e1000590.