Multi-Syllabic DNA Motif Discovery
by
Rasika S. Kumar
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2005
© Rasika S. Kumar, MMV. All rights reserved.
The author hereby grants to MIT permission to reproduce and
distribute publicly paper and electronic copies of this thesis document
in whole or in part.
Author
Department of Electrical Engineering and Computer Science
August 8, 2005
Certified by
David Gifford
Professor
Thesis Supervisor
Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Students
Multi-Syllabic DNA Motif Discovery
by
Rasika S. Kumar
Submitted to the Department of Electrical Engineering and Computer Science
on August 8, 2005, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
This paper describes a method for finding multi-syllabic motifs in a genome. It
expands on the algorithm developed by Takusagawa, et al[1, 2] that uses data from
Chromatin Immuno-Precipitation (ChIP) experiments to isolate regions that have a
given motif. The Takusagawa method uses an enumeration method to search for
motifs in both positive and negative intergenic regions in order to determine the
statistical significance of the results. Our algorithm also uses enumeration to find
motifs that have gaps between the different sub-motifs, or syllables. This thesis also
describes a method to calculate the significance of each motif and tests this method
via Monte Carlo simulations on random test sets. The significant motifs found using
this algorithm are verified against consensus motifs found in the literature.
Thesis Supervisor: David Gifford
Title: Professor
Acknowledgments
I would first like to thank Dave Gifford for giving me the opportunity to work in his
group. I would also like to thank Ken Takusagawa, who has been my mentor and
immediate advisor for this thesis and supporting research. It is through his direct
supervision that this thesis has become what it is. Last, but not least, I would like to
thank Dave Gifford, Ken Takusagawa, Kenzie MacIsaac, and Dr. B. Kumar for their
comments and input during the revision process of this thesis.
Contents
1 Introduction
    1.1 Overview of Motif Discovery Methods
        1.1.1 Determining Bound Regions
        1.1.2 Enumeration Methods
        1.1.3 Probabilistic Methods
        1.1.4 Determining Statistical Significance
    1.2 Overview of this Thesis
    1.3 Goals of this Thesis
    1.4 Contributions of this Thesis
2 Multi-Syllabic Expansion and Enumeration (MSEE)
    2.1 Terms and Definitions
    2.2 MSEE: An Overview
        2.2.1 Examining a Word
    2.3 Object Oriented Implementation
        2.3.1 Expandable Interface
        2.3.2 Fixed Gap class
    2.4 Data Structures
        2.4.1 Hash-Map
    2.5 Optimizing the Algorithm
        2.5.1 The Z modification
    2.6 Alternate Implementation
    2.7 Developing a Test Suite
3 Probabilistic Model
    3.1 Significance of a Motif
        3.1.1 Hypergeometric Model
        3.1.2 Binomial Model
        3.1.3 Normal Approximation to the Binomial Model
        3.1.4 Analysis
    3.2 Monte Carlo Simulation Strategy
        3.2.1 Sequence Selection Strategy
        3.2.2 Testing Strategy
4 Results
    4.1 Consensus Motifs from Literature
    4.2 Monte Carlo Simulations
        4.2.1 Simulation Setup
        4.2.2 Monte Carlo Motif Scores
        4.2.3 Important Features of Monte Carlo Results
    4.3 Testing the Two Methods
        4.3.1 Validation of MSEE
        4.3.2 Validation of Alternate Method
        4.3.3 Comparison of MSEE and Alternate Method
    4.4 Further Work
        4.4.1 Motif Alignment of Different Gapped Results
        4.4.2 Motif as Seed to Probabilistic Methods
        4.4.3 Extracting Motifs with Complex Structures
        4.4.4 Motif Significance
    4.5 Conclusions
A All Motifs found with MSEE
B Optimization Comparison Tests
List of Figures
1-1 GAL4 transcription factor binding to DNA[3]. The thin bands show the basic structure of the GAL4 protein. The darker edges are the subunits of GAL4 that directly bind to the genomic DNA. This figure shows two molecules of GAL4 binding to DNA.
2-1 Syllabic Structure of GAL4
2-2 Flow Chart of Motif Discovery by MSEE
2-3 Expansion of small word with 1 wildcard
2-4 Log Number of expansions of a given word as a function of gap width and the number of wildcards
2-5 Object Diagram of Inheritance
2-6 Introducing the Z Placeholder
2-7 Flow Chart of Alternate Implementation
3-1 Normal Distribution Approximation of Number of Successes. We calculate the area of the shaded box which is equivalent to the probability of m or more successes. μ is the mean number of successes.
3-2 Distribution of Set Sizes
3-3 Monte Carlo Simulation Strategy. For each sequence set size chosen, we generate 30 random sets of that size and run MSEE on these sets.
4-1 Scores of the Best Motifs from Monte Carlo Simulations. Both mean and standard deviation from the mean decrease as the number of sequences in the set increases.
4-2 Number of bases in Monte Carlo sets vs. ChIP sets. The significant outliers are circled in red.
4-3 Histogram of Scores for Positive Sequence Set Sizes of 4 and 180. Distributions are noticeably non-Gaussian
4-4 Motif Scores from Monte Carlo Sets versus ChIP sets
B-1 Comparing hash function performance between optimized and non-optimized versions of algorithm
List of Tables
1.1 Consensus motifs that have gaps. These motifs follow the IUPAC abbreviations for nucleotide subsets. Please refer to Table 2.1 in Section 2.1.
2.1 IUPAC Map. This table lists IUPAC notation for the corresponding subset of nucleotides and the integer assigned to each subset
2.2 Set of matching wildcards for each nucleotide. There are four wildcard possibilities for each base.
4.1 Consensus motifs containing gaps from Harbison et al [4] listed by the transcription factor (TF), the consensus motif, and the enrichment score. The motifs for which the enrichment score is 0.0 were obtained from the literature and were not found experimentally in Harbison, et al.
4.2 Consensus motifs containing gaps from Kellis, et al[5]. The motif conservation score (MCS) is listed alongside the motif.
4.3 Number of Positive Intergenic Sequences and Total Number of bases in these sequences for Transcription Factors with gapped consensus motifs. Sorted by Number of Positive Sequences.
4.4 Scores from Monte Carlo Simulations. The last column "Conf" refers to the confidence level achieved if we set a threshold of one standard deviation away from the mean score.
4.5 Motifs found using MSEE for GCN4, ABF1, UGA3 in Rapamycin, RGT1, HSF1 in Heat Shock, GAL4 in Galactose, PUT3, STB4, STP1, SUT1, and SOK2 after 14 hour Butanol treatment. The default condition is YPD.
4.6 Comparing MSEE motifs with Consensus Motifs. The "Match?" column displays a y - pattern and structure match; p - partial pattern match but structure mismatch; n - mismatch
4.7 Motifs found using Alternate Method for GCN4, RGT1, PUT3, ABF1, STB4, GAL4 in Raffinose, HSF1, UGA3 in Rapamycin, STP1, SUT1, and SOK2 after 14 hour Butanol treatment.
4.8 Comparing motifs from alternate method with Consensus Motifs. The "Match?" column displays a y - pattern and structure match; p - partial pattern match but structure mismatch; n - mismatch
4.9 Top motifs found for ABF1
Chapter 1
Introduction
An important area of computational biology deals with analyzing function and regulation encoded in an organism's genome.
In particular, this thesis examines the
specific interaction between genomic DNA and regulatory proteins such as transcription factors. Transcription factors are proteins that are known to regulate when and
how DNA is transcribed. They work by binding to regions near a gene's transcription
start site, called promoter regions, thus inducing or repressing transcription. Transcription factors bind to specific DNA sequences, or motifs, that are dependent on
the protein structure of the transcription factor and can vary in length and number. Motif discovery refers to the search for and identification of these specific DNA
sequences.
1.1 Overview of Motif Discovery Methods
There are essentially two steps to motif discovery[6].
The first is identifying where
in the genome these motifs exist. In other words, this involves finding regions in the
DNA to which transcription factors bind. Once these regions have been identified, the
second step consists of developing a consensus pattern for the motif. There are many
existing motif discovery methods that vary greatly in their approach and assumptions.
Some popular examples are MEME[7], MDScan[8], BioProspector[9], and Gibbs Motif
Sampler[10].
1.1.1 Determining Bound Regions
A variety of methods have been used to isolate sets of sequences which are expected
to bind to a transcription factor.
Sequence Conservation
Some methods use sequence conservation across different species (for instance across
various Saccharomyces species) to identify highly conserved regions, arguing that these
regions are more likely to contain areas vital to the transcription of key genes [11].
These regions may contain the motifs for any number of transcription factors. For
example, Kellis et al[5] examine sequence conservation across four different
yeast species to determine which areas are most likely to be protein coding regions or
regulatory elements. This method has proved especially useful for species that have
a large variability. However, it has limited utility if sequences between species cannot
be aligned well or if conservation data does not exist for one or more of the species.
Clustering Methods
Other methods use clustered expression profiles to approximate where a transcription
factor binds[12]. The expression profiles are obtained from DNA microarray experiments which are performed under various conditions. This method hypothesizes that
genes that are co-expressed are controlled by the same transcription factors. Various
clustering methods are used including K-Means clustering and Bayesian clustering[13]
to group together genes that are co-expressed. The promoter sequences of these genes
are then isolated for motif extraction.
Chromatin ImmunoPrecipitation Experiments
Another method of pruning our search space involves an experimental technique called
Chromatin ImmunoPrecipitation (ChIP), where antibodies are used to isolate regions
of DNA bound by a certain transcription factor. ChIP pinpoints the regions containing the motif for that transcription factor (i.e., bound intergenic regions). It does this
by fixing the transcription factors to the DNA regions to which they bind and then
isolating these bound regions. Many motif discovery methods, including MDScan[8], use the bound regions
discovered from ChIP experiments as input.
Once we have successfully isolated sequences where there is a high probability of
a motif being present, we then apply a motif extraction method. These methods fall
into two categories: enumeration and probabilistic selection.
1.1.2 Enumeration Methods
Enumeration methods, or word counting methods, systematically scan through a set
of DNA sequences and look for overrepresented words. Many enumeration methods
take into account variability in a motif by including wildcards in their subsequence
matching. This allows the methods to better characterize different motifs since a
given motif may differ slightly in its various manifestations. The main disadvantage
of this general method is the amount of memory required for the algorithm. However,
optimizing the algorithm in this respect can improve its performance and make it an
efficient method of motif extraction.
1.1.3 Probabilistic Methods
There are many methods which take a statistical approach to motif discovery. The following discusses the Position Specific Score Matrix(PSSM) Model[11] and the Markov
Chain Model of a motif.
Position Specific Score Matrices
The PSSM model of a motif consists of independent multinomial distributions at
each position in the motif. Various probabilistic methods use a seed PSSM in order
to approximate the most likely PSSM for a motif on a given sequence. One such
method is the Expectation Maximization (EM) method, which is a two-step iterative
process resulting in a motif model that has maximum likelihood compared to the
background model[6, 14]. The EM method is used in particular by MEME[7], which
takes as an input a training set of sequences and uses this set as the seed for the
Expectation Maximization. Another similar method is Gibbs sampling, which uses a
set of parameters to define a motif model. This method then estimates values for this
set of parameters such that the model best characterizes the data[15].
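To make the PSSM idea concrete, the following is a minimal Python sketch that builds one independent distribution per motif position from a few aligned sites and scores a candidate site against a uniform background. The function names, the pseudocount, and the uniform background of 0.25 are illustrative assumptions, not part of any of the cited tools.

```python
from math import log2

def build_pssm(sites, pseudocount=1.0):
    """Estimate one multinomial distribution over A, C, G, T per position.

    `sites` are aligned motif instances of equal width; the pseudocount
    (an assumption of this sketch) keeps every probability non-zero.
    """
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pssm.append({b: (column.count(b) + pseudocount) / total for b in "ACGT"})
    return pssm

def log_odds(pssm, site, background=0.25):
    """Sum of per-position log-odds; positions are treated as independent."""
    return sum(log2(col[base] / background) for col, base in zip(pssm, site))

example_pssm = build_pssm(["ACG", "ACG", "ACT"])
example_score = log_odds(example_pssm, "ACG")
```

Because each position contributes its own term to the sum, the sketch makes the independence assumption of the PSSM model explicit.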
Markov Chain Model
Other methods use Markov chains to model dependencies between positions in a motif. The PSSM model of consensus motifs assumes that
each position in the motif is independent of the others[16]; Markov chain
models instead incorporate the idea of dependence. One example of this technique is the
motif discovery method LOGOS[17], which uses Hidden Markov Dirichlet-Multinomial
Models and prior biological knowledge of known motifs to model positional dependencies within the motifs. Other examples include Gibbs sampling methods which
estimate maximum likelihood motif models using these Markov chains[14].
Other motif extraction methods exist which use a variety of statistical methods
and which perform efficiently in differing circumstances. One such method finds
motifs using random projections [18]. This has proven to work very efficiently in
finding motifs of large width and sufficient variability.
1.1.4 Determining Statistical Significance
Once the motif extraction algorithm has identified a potential motif model, the next
step is to estimate the significance of these candidate motifs using statistical methods.
One popular method is the Random Selection Null Hypothesis[11] which states that
motifs must be sequences that are significantly overrepresented in the positive intergenic regions. This method produces a significance score for each candidate motif
which is derived from the hypergeometric distribution. This thesis uses an approximation to this scoring method which will be discussed later in Section 3.1.
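As an illustration of a score derived from the hypergeometric distribution, the following minimal Python sketch computes the probability of seeing at least a given number of positive regions among the regions that contain a word. The function and argument names are hypothetical, and this is the textbook hypergeometric tail, not necessarily the exact score or approximation developed in Section 3.1.

```python
from math import comb

def hypergeom_tail(total, positives, hits, positive_hits):
    """P(X >= positive_hits) under random selection.

    total         -- number of intergenic regions
    positives     -- number of those regions that are positive
    hits          -- number of regions containing the word
    positive_hits -- number of positive regions containing the word
    """
    upper = min(hits, positives)
    favourable = sum(
        comb(positives, k) * comb(total - positives, hits - k)
        for k in range(positive_hits, upper + 1)
    )
    return favourable / comb(total, hits)

# A word found in 3 regions, all of them among the 4 positives out of 10.
p_value = hypergeom_tail(total=10, positives=4, hits=3, positive_hits=3)
```

A small tail probability indicates that the word is overrepresented in the positive regions relative to what random selection would produce.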
Another method uses Monte Carlo simulations to determine the expected score of
a motif given a random input set of sequences. The score of the candidate motif must
be higher than this expected score in order for it to be considered significant. This
method can also be used as a measure of the effectiveness of the given motif scoring
method. This is also discussed later in Section 3.2. Other methods use a variety of
scoring methods by which they rank their motifs[19].
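The Monte Carlo idea can be sketched in Python as follows: draw random sets of sequences, score each set, and use the mean and standard deviation of those scores as a baseline. The scoring function here is a toy stand-in (an assumption), and the function names are hypothetical; the defaults of 30 random sets and a threshold of one standard deviation above the mean follow the captions of Figure 3-3 and Table 4.4.

```python
import random
import statistics
from collections import Counter

def best_word_score(sequences, width=3):
    """Toy score (an assumption): the largest number of sequences sharing one word."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + width] for i in range(len(seq) - width + 1)})
    return max(counts.values())

def null_score_threshold(pool, set_size, score_fn, trials=30, seed=0):
    """Score `trials` random sets drawn from `pool`; return mean, stdev, threshold."""
    rng = random.Random(seed)
    scores = [score_fn(rng.sample(pool, set_size)) for _ in range(trials)]
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)
    return mean, spread, mean + spread

# Hypothetical pool of random 50-base sequences.
pool_rng = random.Random(1)
pool = ["".join(pool_rng.choice("ACGT") for _ in range(50)) for _ in range(40)]
mean, spread, threshold = null_score_threshold(pool, 8, best_word_score)
```

A candidate motif from a real set of the same size would then be compared against `threshold`.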
While many of these methods have found new motifs in a variety of genomes, very
few use a priori knowledge about the motif structure to perform a search. A recent
method developed by Sandelin and Wasserman[20] incorporates structural knowledge
of a set of transcription factors by creating families of factors with similar structures
and incorporating these families into motif discovery methods. This thesis explores
another method that uses structural information to find a more complex set of motifs
efficiently.
1.2 Overview of this Thesis
The method implemented in this thesis uses data from Chromatin ImmunoPrecipitation (ChIP) experiments to identify regions that are expected to contain a motif and
enumerates all potential motifs, both contiguous and multi-syllabic, in those regions.
Using a priori knowledge about the structure of the motif, we efficiently extract motifs
that are associated with the transcription factors of S. cerevisiae.
This is an extension of the method described in Takusagawa, et al[1, 2]. The
Takusagawa method searches in both positive regions, areas that are believed to have
a motif, and negative regions, areas that are believed to not have a motif. The
statistical significance of a motif found in the positive sequences is then determined
by ensuring that it is not found in any of the negative sequences as defined by the
Random Selection Null Hypothesis[11].
However, the Takusagawa method was able to extract only contiguous motifs, not
multi-syllabic motifs. More specifically, the motifs extracted were at most seven bases
wide with a maximum of two wildcard positions, or positions of variability. Many
transcription factors are known to be dimers [11] consisting of two proteins joining to
form a complex. Each protein subunit binds to a different DNA subsequence, with
subsequences being separated by some number of bases. Arguably, this is due to the
helical nature of DNA; a binding protein binds on one side of the helix thus touching
two parts of the sequence and omitting a section in the middle where the helix turns
as shown in Figure 1-1.
Figure 1-1: GAL4 transcription factor binding to DNA[3]. The thin bands show the
basic structure of the GAL4 protein. The darker edges are the subunits of GAL4 that
directly bind to the genomic DNA. This figure shows two molecules of GAL4 binding
to DNA.
Table 1.1 shows a subset of consensus motifs in Harbison, et al[4] that have substantial gaps. Here, the wildcard n is used to indicate the set of bases {A,C,G,T}, or
the instance of a gap.
Transcription Factor    Consensus Motif
ABF1                    TCAYTnnnnACG
GAL4                    CGGnnnnnnnnnnnCCG
HSF1                    TTCYAnnnnnTTC
RGT1                    CGGAnnA
SOK2                    TGCAGnnA
STB4                    TCGGnnCGA
STP1                    RCGGCnnnRCGGC
Table 1.1: Consensus motifs that have gaps. These motifs follow the IUPAC abbreviations for nucleotide subsets. Please refer to Table 2.1 in Section 2.1.
1.3 Goals of this Thesis
The goals of this thesis are:
* find motifs which have a fixed length gap (dimer) in a Saccharomyces species
* find motifs containing a gap of undetermined length and find the best fixed
length for that gap between syllables
* determine which motifs are statistically significant using an approximation to
the Random Selection Null Hypothesis
* further examine the significance of our scoring method using Monte Carlo
simulations on random sets of sequences
In order to reduce the overall computational time, we ran the algorithm on subsections of the data in parallel with each other.
1.4 Contributions of this Thesis
This algorithm accurately predicts a much larger range of transcription factor binding
sites including both contiguous and multi-syllabic motifs with gaps of fixed lengths
than any other existing enumeration method. Since it employs a priori knowledge
about motif structure, the predicted motifs can be used as seed motifs to other likelihood maximization methods including Gibbs sampling and Expectation Maximization.
Predicting these sites would greatly aid further genomics research on DNA
regulatory networks. Finally, we hope to produce software that is accessible through
the Internet for public use.
Chapter 2
Multi-Syllabic Expansion and
Enumeration (MSEE)
This section describes the implementation of our motif extraction method entitled
Multi-Syllabic Expansion and Enumeration (MSEE).
2.1 Terms and Definitions
Here, we define the terms most commonly used in this thesis. Many of these definitions
are common to several motif discovery methods. We introduce new terminology to
help describe this particular motif discovery method, MSEE.
Sequence - A sequence is defined to be any subset of consecutive bases of DNA.
In general, when we refer to a sequence, we mean a substantial number of consecutive
bases ranging from 20 to approximately 1000 bases.
Motif - A motif is a subsequence in the DNA to which a transcription factor
binds. We define the length of such a subsequence to be at least 3 base pairs wide
and at most as wide as the sequence in which it occurs. Motifs may have bases where
there is some variability across instances of the motif; a transcription factor may still
bind to the motif even if a base differs from one instance of the motif to another.
Intergenic region - Intergenic regions are the subset of sequences that are in
between the coding regions, or open reading frames (ORFs), in the DNA (i.e., they
are not transcribed into mRNA). In a genome, these regions vary greatly in length.
These intergenic sequences are areas of interest since they contain the motifs to which
transcription factors are bound. We further define a region as a positive intergenic
region if we know that a given transcription factor binds to it. Data from ChIP
experiments approximate these positive regions. A negative intergenic region refers to
a region to which a given transcription factor does not bind. Positive and negative
intergenic regions are determined by the ChIP experiments, the data for which are obtained from
Ren et al[21].
Word - A word is defined as a candidate motif in this thesis. A candidate motif is
any set of contiguous bases in a sequence that potentially satisfies the structural considerations for a motif as described above. This algorithm, in short, iterates through
all words and counts their occurrences to determine the best motif candidate. The
terms word and candidate motif will be used interchangeably.
Wildcard - A wildcard refers to a variable base. We use this term to describe a
base which can be any one of two, three, or four bases and not affect the binding of
the transcription factor to the given motif. A list of the eleven different wildcards
we can have in a given word and the IUPAC codes for each set is given in Table 2.1.
In order to reduce the complexity of our algorithm, we limit our method to the use
of only seven of the eleven wildcards listed. The four bases A, C, G, and T and the
seven wildcards are each assigned a specific number.
Syllable - In certain cases, a word may have multiple subunits. Such words may
be motifs that contain small regions of bases that do not affect the binding of the
given transcription factor. For example, the sections of the GAL4 motif are shown in
Figure 2-1.
In this figure, the word is said to be divided into syllables which are separated by
a gap. Each syllable must be at least 2 base pairs wide.
IUPAC abbrev    Nucleotides    Map to Int
A               {A}            0
C               {C}            1
G               {G}            2
T               {T}            3
M               {A,C}          4
R               {A,G}          5
W               {A,T}          6
S               {C,G}          7
Y               {C,T}          8
K               {G,T}          9
V               {A,C,G}        -
H               {A,C,T}        -
D               {A,G,T}        -
B               {C,G,T}        -
N               {A,C,G,T}      10
Table 2.1: IUPAC Map. This table lists IUPAC notation for the corresponding subset
of nucleotides and the integer assigned to each subset
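The mapping in Table 2.1 can be written down directly as lookup tables. The following Python sketch is illustrative only; the names `IUPAC`, `CODE_TO_INT`, and `matches` are hypothetical and do not come from the MSEE implementation. It treats the lowercase gap symbol n the same as N when matching.

```python
# Nucleotide subset for each IUPAC code (Table 2.1).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}

# Integers 0-9 for the bases and two-base wildcards, 10 for N.
# The three-base codes V, H, D, B are not assigned an integer.
CODE_TO_INT = {code: i for i, code in enumerate("ACGTMRWSYK")}
CODE_TO_INT["N"] = 10

def matches(motif, site):
    """True if every base of `site` lies in the subset named by `motif`."""
    return len(motif) == len(site) and all(
        base in IUPAC[code.upper()] for code, base in zip(motif, site)
    )

rgt1_hit = matches("CGGAnnA", "CGGATCA")
```

Here `rgt1_hit` checks a made-up site against the RGT1 consensus motif listed in Table 1.1.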
                    Gap 1
| C G G | n n n n n n n n n n n | C C G |
Syllable 1                    Syllable 2
Figure 2-1: Syllabic Structure of GAL4
Gaps are implemented as regions of contiguous wildcards; a gap that has a width
of one is equivalent to a single variable base. Gaps can be as wide as is necessary
with the upper bound being the width of the entire motif. A motif that has multiple
syllables is thus referred to as a multi-syllabic motif. In the rest of this thesis, bases
that are part of a gap will be represented by n while bases that are actual wildcards
within the motif will be represented by N. MSEE can be used to find motifs with one
or more gaps. However, this thesis only reports the results for motifs with a single
gap (see Section 4.3).
Expansion - This refers to the enumeration of all words that match the candidate
motif being examined. This is discussed in greater detail later in Section 2.2.1.
2.2 MSEE: An Overview
In the first step of MSEE, we isolate the intergenic regions to which a given transcription factor binds. We do this by analyzing data from ChIP experiments and
compiling a list of probe sequences to which the transcription factor binds. We label
these as the positive intergenic regions.
For each intergenic sequence, MSEE then exhaustively enumerates the candidate
motifs of a given length and structure using the following procedure. It examines
a window of bases of the size of the expected motif and stores the candidate motif.
It then slides the window over a single base and records the next such word. This
continues for the length of the intergenic sequence. When examining a candidate,
MSEE takes into account:
1. the number of wildcards allowed, and
2. the motif structure, i.e., the width and placement of syllables and gaps.
In order to account for wildcards when examining a word, MSEE expands the
word to all its motif possibilities and stores these expansions. Taking into account
the gaps we expect in the word, we can reduce the complexity of both the expansion
and the storing.
This procedure is repeated for a given structure on all the intergenic regions.
These results are later used to determine which candidate motifs are significant based
on their occurrences in the positive intergenic regions against their occurrences in all
intergenic regions. The tests for significance are discussed in Section 3.1. The flow
chart in Figure 2-2 summarizes the main function of this algorithm. This emulates
the structure of the algorithm found in Takusagawa [2].
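The sliding-window step of this overview can be sketched in Python as below. This is a simplified illustration under stated assumptions: it ignores wildcards, gaps, and the opposite strand, it counts each word at most once per region, and the function name `count_words` and the example sequences are hypothetical.

```python
from collections import Counter

def count_words(sequences, width):
    """For each word of the given width, count the sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        # Slide a window of `width` bases over the sequence one base at a time.
        for start in range(len(seq) - width + 1):
            seen.add(seq[start:start + width])
        counts.update(seen)
    return counts

positive_regions = ["ACGTAC", "TTACGA"]
all_regions = positive_regions + ["GGGGGG", "CCACGC"]

positive_counts = count_words(positive_regions, 3)
all_counts = count_words(all_regions, 3)
```

The two counters correspond to the word counts for the positive regions and for all regions that are later compared in the test for significance.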
2.2.1 Examining a Word
As mentioned previously, there are two parameters by which MSEE examines a given
word: the number of wildcards and the size and placement of gaps. Full enumeration
of a word includes expanding both the word itself and its complement on the opposite
strand.
Figure 2-2: Flow Chart of Motif Discovery by MSEE. (Flow chart boxes: Positive Intergenic Sequences, All Intergenic Sequences, Number of Wildcards, and Motif Structure feed Word Expansion and Enumeration, which produces Word Counts (Positive Regions) and Word Counts (All Regions); these pass through Test for Significance to yield Motifs.)
Expanding a Word
First, MSEE expands each subsequence to all its motif possibilities given the number
of wildcards allowed.
It then checks each expanded manifestation of the word to
determine whether or not it has been seen previously and tallies its occurrence. Using
the set of wildcards as defined previously in Table 2.1, we define a set of expanding
rules as shown in Table 2.2.
When expanding a word, we must allow a wildcard in each possible position of
the word. Figure 2-3 shows an example of expanding a word that is three base pairs
long with 1 wildcard. As we increase the number of wildcards, the number of ways
we can place w wildcards in a word of length m is $\binom{m}{w}$.
Nucleotide    Expands to
A             {A,C}, {A,G}, {A,T}, N
C             {A,C}, {C,G}, {C,T}, N
G             {A,G}, {C,G}, {G,T}, N
T             {A,T}, {C,T}, {G,T}, N
Table 2.2: Set of matching wildcards for each nucleotide. There are four wildcard
possibilities for each base.
A C G expands to:
{A,C} C G    {A,G} C G    {A,T} C G    N C G
A {A,C} G    A {C,G} G    A {C,T} G    A N G
A C {A,G}    A C {C,G}    A C {G,T}    A C N
Figure 2-3: Expansion of small word with 1 wildcard
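The expansion illustrated in Figure 2-3 can be reproduced with a short Python sketch. It writes the rules of Table 2.2 using the IUPAC codes of Table 2.1 (for example, {A,C} is M) and places exactly the requested number of wildcards; whether MSEE enumerates exactly w or up to w wildcards in one pass is an assumption here, and the names `EXPANDS_TO` and `expand` are hypothetical.

```python
from itertools import combinations, product

# Table 2.2 in IUPAC notation: each base expands to three two-base
# wildcards that contain it, plus N.
EXPANDS_TO = {"A": "MRWN", "C": "MSYN", "G": "RSKN", "T": "WYKN"}

def expand(word, wildcards):
    """List every expansion of `word` with exactly `wildcards` wildcard positions."""
    expansions = []
    for positions in combinations(range(len(word)), wildcards):
        choices = [EXPANDS_TO[word[p]] for p in positions]
        for substitution in product(*choices):
            chars = list(word)
            for p, code in zip(positions, substitution):
                chars[p] = code
            expansions.append("".join(chars))
    return expansions

acg_expansions = expand("ACG", 1)
```

For the three-base word ACG with one wildcard this yields the twelve words of Figure 2-3, from MCG through ACN.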
The more wildcards we add, the more space our algorithm takes in order to account
for all expansions for each motif. Equation 2.1 gives the number of expansions
done on a sequence s of length n, with motif width m and w wildcards.
$$\mathrm{num\_expansions}(s) = (n - m) \binom{m}{w} 4^w \qquad (2.1)$$
This method iterates through all positions of the word to place wildcards and
expands each instance accordingly. As we see, the number of expansions increases
exponentially with each additional wildcard which limits our ability to search for a
large set of motifs. Therefore, we need further prior information about the motif in
order to reduce the added complexity.
Processing Gaps in a Word
Ignoring the bases in the gapped areas during expansion reduces the number of expansions per word and thereby reduces the space required by MSEE. MSEE therefore
uses the motif structure input to identify the location of the syllables in the candidate
motif. During expansion, MSEE only expands those bases in the syllables and avoids
the unnecessary expansion of the gaps.
Expansion through Recursion
MSEE stores the syllabic structure by defining each syllable's beginning and ending position offsets
within the word. Using this, the algorithm creates clear boundaries around the areas
it needs to expand, and skips from one area to the other. This greatly reduces the number
of expansions for a given sequence s, as we see in Equation 2.2, where n is
the length of the current sequence, m is the width of the word including the gapped
areas, g is the total width of all gaps in this word, and w is the number of wildcards
used in this instance.
$$\mathrm{num\_expansions\_excluding\_gaps}(s) = (n - m) \binom{m - g}{w} 4^w \qquad (2.2)$$
Figure 2-4 shows the number of expansions that are done with an increasing
number of wildcards as well as an increasing gap width. We observe in Figure 2-4
that as the number of wildcards increases, the number of expansions per word increases
exponentially, but that this is effectively counteracted by the gaps we
introduce into the system. This shows very clearly that including knowledge about
gaps can make enumeration much faster.
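To see the effect numerically, the following Python sketch evaluates the expansion count (n - m) * C(m - g, w) * 4^w, which is Equation 2.2 and reduces to Equation 2.1 when the total gap width g is zero. The function name and the example values (a 500-base sequence, a word of width 20, two wildcards) are illustrative assumptions.

```python
from math import comb

def num_expansions(n, m, w, g=0):
    """Expansions on a sequence of length n for words of width m with
    w wildcards, skipping g gap positions (g = 0 gives Equation 2.1)."""
    return (n - m) * comb(m - g, w) * 4 ** w

without_gap = num_expansions(n=500, m=20, w=2)
with_gap = num_expansions(n=500, m=20, w=2, g=10)
```

With these values, a gap of width 10 shrinks the number of wildcard placements from C(20, 2) = 190 to C(10, 2) = 45 per window.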
Figure 2-4: Log Number of expansions of a given word as a function of gap width
and the number of wildcards. (Plot titled "Number of Expansions for word of length 20"; axes: Number of Wildcards and Gap Width, with the number of expansions on a log scale.)
MSEE expands a word through mutual recursion. On one level, MSEE recurses along the bases
in a syllable to insert wildcards. On a higher level, MSEE recurses along syllables in
order to avoid gaps. The following pseudocode demonstrates the recursion through
which we iterate over the syllables and add holes accordingly to create the expanded
word.
syllable-iterator (word)
    if no-more-syllables OR no-more-wildcards
        update-counts (word)    % see Section 2.4
    else
        get next syllable and expand-syllable (word, 0)

expand-syllable (word, base-offset)
    if (base-offset == end-of-syllable) OR no-more-wildcards
        syllable-iterator (word)
    else
        expand-syllable (word, base-offset + 1)
        insert-wildcard at base-offset
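The mutual recursion above can be sketched in runnable form as follows. This is a minimal illustration, not the MSEE source: the function names, the representation of syllables as half-open (start, end) index pairs, and the choice to mask gap bases with a lowercase n are all assumptions made for the example.

```python
def expand_word(word, syllables, max_wildcards):
    """Return every variant of `word` with up to `max_wildcards` syllable
    bases replaced by the wildcard 'N'. Gap bases are masked as 'n' and are
    never expanded. `syllables` is a list of half-open (start, end) pairs
    (a hypothetical encoding of the motif structure)."""
    in_syllable = set()
    for start, end in syllables:
        in_syllable.update(range(start, end))
    chars = [c if i in in_syllable else "n" for i, c in enumerate(word)]
    results = []

    def syllable_iterator(syl_idx, remaining):
        # Stop when every syllable is done or the wildcard budget is spent.
        if syl_idx == len(syllables) or remaining == 0:
            results.append("".join(chars))
            return
        expand_syllable(syl_idx, syllables[syl_idx][0], remaining)

    def expand_syllable(syl_idx, offset, remaining):
        end = syllables[syl_idx][1]
        if offset == end or remaining == 0:
            syllable_iterator(syl_idx + 1, remaining)
            return
        # Branch 1: leave the base at `offset` unchanged.
        expand_syllable(syl_idx, offset + 1, remaining)
        # Branch 2: place a wildcard at `offset`, recurse, then undo it.
        original = chars[offset]
        chars[offset] = "N"
        expand_syllable(syl_idx, offset + 1, remaining - 1)
        chars[offset] = original

    syllable_iterator(0, max_wildcards)
    return results


# Hypothetical word with two 3-base syllables separated by a 2-base gap.
variants = expand_word("CGGAACCG", [(0, 3), (5, 8)], max_wildcards=1)
print(variants)
```

Because the recursion skips directly from the end of one syllable to the start of the next, the gap positions never contribute a branch.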
Reversing and Complementing
MSEE examines both the word in the sequence and the reverse complement of the
word. This is done under the assumption that transcription factors do not distinguish
between strands. Therefore, a motif on one strand may be found on its opposing
strand and both are potential sites for transcription factor binding. There are two
ways of enumerating the complement word. The default method is to produce the
reverse complement of the word and repeat the same procedure above on the new
word. This is implemented in the following way.
expand (word)
    syllable-iterator (word)
    rc-word = reverse and complement of word
    syllable-iterator (rc-word)
As a result, MSEE outputs both the motif and its reverse complement. The
exception is a palindromic motif, whose reverse complement is identical to the
original, as we see in GAL4.
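A short sketch of the reverse-complement step and the palindrome check follows. The helper names are hypothetical, and the translation table covers only the four bases plus the n and N symbols; other degenerate codes are left out of this illustration.

```python
# Map each base to its complement; gap (n) and wildcard (N) symbols map to themselves.
_COMPLEMENT = str.maketrans("ACGTNn", "TGCANn")


def reverse_complement(word):
    """Complement every base, then reverse the word."""
    return word.translate(_COMPLEMENT)[::-1]


def is_palindromic(word):
    """A motif is palindromic when it equals its own reverse complement."""
    return word == reverse_complement(word)


gal4 = "CGG" + "n" * 11 + "CCG"      # GAL4 consensus structure
print(is_palindromic(gal4))          # palindromic, so reported once
print(reverse_complement("TGACTCA"))
```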
2.3 Object Oriented Implementation
Since the overall goal of MSEE is to perform motif extraction for a range of motif
structures, we decided to define a system of objects each of which implements a
different type of motif discovery: motifs with no gaps, motifs with gaps of a fixed
width, and motifs with gaps of variable width. Each of these objects has the common
interface Expandable. Figure 2-5 is an object diagram of the system.
[Figure 2-5: Object Diagram of Inheritance. The Expandable interface is implemented by
the Fixed Length Gaps, Variable Length Gaps, and Complex Motif Structures objects.]
2.3.1 Expandable Interface
The common interface Expandable dictates the functions shared by all motif discovery
algorithms. Its sole method, go, initiates the recursion by which each word is
expanded. The sole parameter of go is the composite width of the motif including all
possible gaps. This interface is implemented by the class Fixed Gap, which is
responsible for enumerating motifs with gaps of a fixed width.
2.3.2 Fixed Gap class
The Fixed Gap class implements the Expandable interface and is responsible for
running the word counting algorithm for motifs with gaps of fixed length. In the Fixed
Gap instance, the method go is equivalent to syllable-iterator, described above under
Expansion through Recursion.
Additionally, this class stores the motif structure as a list of Syllable objects. Each
Syllable object contains the beginning and ending index of the given syllable as well
as the length of the syllable. The gaps are inferred as the space in between syllables.
This class also contains the data structure which stores the word counts.
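The way gaps can be inferred from a list of syllables is sketched below. The class layout, field names, and the half-open index convention are assumptions for illustration; the original implementation was written in C++ and its exact interface is not shown in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    begin: int  # index of the first base of the syllable (inclusive)
    end: int    # index one past the last base (exclusive); an assumed convention

    @property
    def length(self):
        return self.end - self.begin


def infer_gaps(syllables, width):
    """Return the (begin, end) ranges of a motif of total `width` that are
    not covered by any syllable, i.e. the gaps."""
    gaps = []
    cursor = 0
    for syl in sorted(syllables, key=lambda s: s.begin):
        if syl.begin > cursor:
            gaps.append((cursor, syl.begin))
        cursor = max(cursor, syl.end)
    if cursor < width:
        gaps.append((cursor, width))
    return gaps


# GAL4-like structure: two 3-base syllables around an 11-base gap.
structure = [Syllable(0, 3), Syllable(14, 17)]
print(infer_gaps(structure, width=17))
```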
2.4 Data Structures
2.4.1 HashMap
The data structure that MSEE uses to store all previously examined words is a
HashMap as implemented in the C++ standard library[22]. For each candidate
motif, a hash key is computed using the hash function described below. The hash
function was designed to minimize collisions.
The HashMap class provides value lookups in expected constant time and dynamically
allocates and deallocates space as needed for insertions and deletions. When
processing a word, MSEE first checks the hash map to see if the word has been
processed previously. The hash key for the word is computed from the hash function
below, and the corresponding value is retrieved; this value is the number of
occurrences of that word.
If the word does exist in the map, MSEE increments the corresponding value.
Otherwise, MSEE inserts this new word into the map and sets its initial count value
to 1. Given the requirements of MSEE, the hash map is the best available data
structure. Ordered alternatives such as sorted arrays and balanced trees perform
lookups in logarithmic time at best, which would be too slow for our application[22].
motif-counts = HashMap

update-counts (word)
    if word has been seen previously
        motif-counts[hash-function (word)]++;
    else
        motif-counts[hash-function (word)] = 1;

hash-function (word)
    sum = 0;
    multiplier = 11;
    length = size(word);
    for i = (length - 1); (0 <= i); --i
        sum *= multiplier;
        sum += word[i];    % base at 'ith' offset in word
    return sum;
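The pseudocode above translates almost line for line into the following sketch. Two assumptions are made here: each base contributes its character code (the text does not say how a base is converted to a number), and the counts are kept in a Python dict rather than the C++ hash map. Python integers do not overflow, so very long words yield large keys instead of wrapping as a fixed-width C++ integer would.

```python
def hash_function(word, multiplier=11):
    """Polynomial hash over the bases, walking from the last base to the first."""
    total = 0
    for base in reversed(word):
        total = total * multiplier + ord(base)  # assumed: character code of the base
    return total


def update_counts(motif_counts, word):
    """Increment the count stored under the word's hash key, starting at 1."""
    key = hash_function(word)
    motif_counts[key] = motif_counts.get(key, 0) + 1


counts = {}
for w in ["CGGZCCG", "TGASTCA", "CGGZCCG"]:
    update_counts(counts, w)
print(counts[hash_function("CGGZCCG")])
```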
2.5 Optimizing the Algorithm
Introducing gaps and syllables in our model of a motif increased the complexity of
motif extraction, and we introduced certain optimizations to counter this increased
complexity. The greatest time sink in the algorithm occurs during the evaluation of
the hash function; this function takes up the most time for each sequence. When the
size of the word is increased in order to include syllables and gaps, the size of the
word to be hashed increases as well. In the original Takusagawa algorithm, this was
limited to 7 or 8 base pairs; now, there is a potentially unlimited number of base pairs
since there is no limit on how large the gaps between syllables can be. As the length
of the gaps gets bigger, we want to ensure that our run time does not increase at
the same rate. Therefore, we needed a technique to eliminate the information about
the gaps when hashing in a way that we could still easily reconstruct the final motif
at termination of the algorithm. We accomplished this by introducing the "Z" base
which acted as a gap place holder. Figure 2-6 shows the different options for this
optimization: we can either keep Z as a placeholder or remove it entirely. This figure
shows this example on the GAL4 motif.
[Figure 2-6: Introducing the Z Placeholder. The original word CGGnnnnnnnnnnnCCG is
stored in the HashMap as CGGCCG when no placeholder is used, or as CGGZCCG when the
placeholder Z is added.]
2.5.1 The Z modification
After expanding a word with the given wildcards, the word contains Ns in all the
areas where there are gaps, in addition to any wildcards within a
syllable (these are not defined as gaps). We replace all the Ns that are part of a gap
with a placeholder Z. This condenses each gap to a single letter, which serves as a
placeholder for efficient re-expansion at the termination of the algorithm. We
re-expand the word using the motif structure defined at the beginning of the
algorithm, with the placeholder Z indicating the original position of each gap. In
addition, we implemented the option of removing the placeholder Z.
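The condensation and re-expansion steps can be illustrated with the sketch below. The function names and the (begin, end) gap ranges are hypothetical, and gap bases are written as a lowercase n to follow the notation used for gapped bases elsewhere in the thesis.

```python
def condense_gaps(word, gaps, keep_placeholder=True):
    """Collapse each gap range in `word` to a single 'Z', or drop it entirely.

    `gaps` is a list of half-open (begin, end) ranges sorted by position."""
    pieces = []
    cursor = 0
    for begin, end in gaps:
        pieces.append(word[cursor:begin])
        if keep_placeholder:
            pieces.append("Z")
        cursor = end
    pieces.append(word[cursor:])
    return "".join(pieces)


def reexpand_gaps(condensed, gap_widths):
    """Restore each 'Z' to a run of n characters of the recorded width."""
    parts = condensed.split("Z")
    if len(parts) != len(gap_widths) + 1:
        raise ValueError("number of placeholders does not match the motif structure")
    word = parts[0]
    for width, part in zip(gap_widths, parts[1:]):
        word += "n" * width + part
    return word


gal4 = "CGG" + "n" * 11 + "CCG"
print(condense_gaps(gal4, [(3, 14)]))
print(condense_gaps(gal4, [(3, 14)], keep_placeholder=False))
print(reexpand_gaps("CGGZCCG", [11]) == gal4)
```

The hashed word no longer grows with the gap width, so the cost of hashing depends only on the syllable bases.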
2.6 Alternate Implementation
An alternate implementation was proposed to speed up the tests. This
method is a combination of the method used by MSEE and an update to the method
described in Takusagawa[2]. While MSEE performs motif discovery on the sets of
sequences themselves, this alternate implementation enumerates motifs from sets of
dictionaries that have been compiled previously from the sequences. A dictionary is
defined as a list of words with a given structure but with no wildcards introduced.
This hybrid approach uses the method described above to generate dictionaries
of the positive intergenic sequences. These dictionaries and dictionaries compiled
from all intergenic sequences are used as input into the updated method from
Takusagawa, which runs motif discovery using enumeration. The flow chart in
Figure 2-7 describes this alternate method. Creating dictionaries prior to expansion
removes a substantial amount of computational work, thus enabling the overall method
to run faster.
In addition to the dictionary modification, the method implemented in Takusagawa[2]
uses a shortcut that avoids reversing and complementing the entire word. Instead, it
chooses one instance (either the original or the reverse complement) and only stores
that instance. Specifically, it chooses the instance that is canonically higher. When
it comes across a word, it determines, from the first base pair of the original and its
reverse complement, which version is canonically higher and expands only that version. This eliminates the need to reverse and complement the entire subsequence and
perform expansion on both instances. This optimization reduces the space required
by the algorithm by halving the number of candidates we must keep track of.
However, some problems may arise when using this method. Palindromic motifs,
such as GAL4, are counted only once though they occur on both the forward and
reverse strands. While MSEE takes up nearly twice as much space, the redundancy
allows for accuracy in the enumeration. We discuss the difference between these two
methods in the Results section (Section 4.3).
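The difference between the two counting schemes can be sketched as follows. The text does not define "canonically higher" precisely, so this example assumes a plain lexicographic comparison of the word and its reverse complement; all function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    return word.translate(_COMPLEMENT)[::-1]


def canonical(word):
    """Assumed rule: keep whichever of the word and its reverse complement
    sorts higher lexicographically."""
    return max(word, reverse_complement(word))


def count_both_strands(words):
    """MSEE-style counting: record the word and its reverse complement."""
    counts = Counter()
    for w in words:
        counts[w] += 1
        counts[reverse_complement(w)] += 1
    return counts


def count_canonical(words):
    """Shortcut counting: record only the canonical representative."""
    return Counter(canonical(w) for w in words)


words = ["TGACTCA", "CGGCCG"]          # the second word is palindromic
print(count_both_strands(words))
print(count_canonical(words))
```

In this example the palindromic word receives a count of 2 under the two-strand scheme and a count of 1 under the canonical scheme, which is the discrepancy described above.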
2.7 Developing a Test Suite
We performed a series of validation tests of MSEE to ensure that it worked correctly.
We isolated from the literature a list of motifs in S. cerevisiae that had one or more
gaps and used a range of motif structures as a primer to run the method. Each motif
structure consisted of 2 syllables with 1 gap between them. Each syllable had a width
of 4 bases and each gap ranged from 0 to 20 bases long.
A similar set of validation tests was performed for the second method. Again, we
used a range of motif structures as described above as a primer to run the method.
This was run on all the transcription factors that were found to bind to yeast
intergenic regions.
For both methods, a set of tests was derived to determine which words were
significant based on their occurrences in the positive intergenic regions and in all
intergenic regions. These tests are described in Section 3.1.
We then ran a series of Monte Carlo simulations to evaluate the effectiveness of
our scoring method. These simulations were run to ensure that the motifs found using
the above methods were much more significant than those motifs found at random.
This is described in detail in Section 3.2.
[Figure 2-7: Flow Chart of Alternate Implementation. Dictionaries are created from all
intergenic sequences and from the positive intergenic sequences using the motif
structure. Word expansion and enumeration from the dictionaries, given the number of
wildcards, produces word counts for the positive regions and for all regions, which
are tested for significance to yield the motifs.]

Chapter 3
Significance of a Motif
3.1 Probabilistic Model
After obtaining the number of occurrences for each possible motif in both the positive
intergenic sequences and all the intergenic sequences, the next step is to determine
whether or not a given motif is significant. We use the Random Selection Null Hypothesis method as defined in Barash, et al[11]. We define significant to mean that a
given motif is more frequent in the positive intergenic regions than in all intergenic
regions. We iterate through all the candidate motifs found in the positive regions and
determine their significance. Using statistical methods, we can estimate the probability that a candidate motif occurs randomly in a positive intergenic sequence. The
smaller this probability, the more significant the motif. The most accurate way this
can be done is by using the hypergeometric distribution to calculate the significance.
3.1.1 Hypergeometric Model
The hypergeometric model estimates the probability that a candidate motif occurs
randomly at least a number of times in a small window. We set this window to be the
combined length of the positive intergenic sequences. This is equivalent to calculating
the tail of the hypergeometric distribution with the following parameters:
n -- the total number of places that the motif can occur in the set of positive
intergenic regions. This is equivalent to the cumulative sum of all the bases in these
regions minus an edge correction factor which accounts for the structure of the motif.
N -- the total number of places that the motif can occur in all intergenic regions.
Again, this is equivalent to the number of bases in all intergenic regions minus the
same edge correction applied to n.
m -- the total number of occurrences of the candidate motif in the positive regions.
M -- the total number of occurrences of the candidate motif in all intergenic
regions.
The probability that the given motif is found m number of times randomly in the
n possible positions is calculated in Equation 3.1 below[23, 24]:
P_{hyper}(m \mid n, M, N) = \frac{\binom{M}{m} \binom{N - M}{n - m}}{\binom{N}{n}}    (3.1)
We would like to calculate the probability that the motif occurs at least m
times. This value, which we will refer to as the p-value, is calculated by summing
the tail of the hypergeometric distribution described above as shown in Equation 3.2:
p\text{-value} = \sum_{i=m}^{\min(M, n)} P_{hyper}(i \mid n, M, N)    (3.2)
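For this distribution, the expected value of the number of successes[23, 25]
is the value
\mu = \frac{nM}{N}
while the standard deviation is
\sigma = \sqrt{n \frac{M}{N} \left(1 - \frac{M}{N}\right) \frac{N - n}{N - 1}}
where the last factor is close to 1 when n is much smaller than N.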
This sum is computationally expensive since there are a large number of elements
in the summation, and therefore it appeared to be infeasible for our application. We
therefore looked to different models which approximate the hypergeometric model.
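For reference, Equations 3.1 and 3.2 can be evaluated directly on small inputs as in the sketch below. The function names are hypothetical and the example numbers are toy values, not experimental counts. The exact integer arithmetic also makes the cost visible: every term requires three binomial coefficients over numbers as large as N.

```python
from math import comb


def hypergeom_pmf(m, n, M, N):
    """Equation 3.1: probability of exactly m occurrences among n positions."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)


def hypergeom_p_value(m, n, M, N):
    """Equation 3.2: probability of at least m occurrences (upper tail)."""
    return sum(hypergeom_pmf(i, n, M, N) for i in range(m, min(M, n) + 1))


# Toy example: 10 possible positions overall, 4 of them occupied by the motif,
# and a positive set covering 3 positions.
print(hypergeom_p_value(m=2, n=3, M=4, N=10))
```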
3.1.2 Binomial Model
In order to reduce the computational complexity, we used the binomial model to
approximate the hypergeometric model[25]. The approximation rests on the
following argument. The hypergeometric model
can be written as a sum of binary random variables X_i, where each variable
takes a value of 1 to indicate a success. These random variables are dependent on
each other since the probability that the next event is a success depends on how
many previous events were successes. However, if the sample size that we draw is
small relative to the total number of objects, then the probability that the ith object
is a success varies only slightly with the outcomes of the i - 1 previous objects.
Here, the probability
that the ith object drawn is a success is close to being independent from all other
draws. We can therefore approximate the hypergeometric distribution as a sum of
independent Bernoulli variables, which is a binomial distribution. Specifically, if the
number of potential motif placements in the positive regions, n, is much smaller than
the number of potential motif placements in all regions, N, then sampling without
replacement (hypergeometric distribution) is approximately equivalent to sampling
with replacement (binomial distribution).
This model defines the parameters of the binomial distribution by assigning a
probability to the event that a given candidate motif occurs randomly in all the
intergenic sequences (i.e., the probability of a "success")[26]. We then calculate
the probability that such a candidate motif occurs randomly in a given positive
sequence. The parameters for this model are similar to those for the hypergeometric
model.
n, m -- defined above.
p -- the probability that the candidate motif occurs randomly in any intergenic
sequence
  = (occurrences in all intergenic regions) / (possible occurrences in all intergenic regions)
  = M/N
The binomial distribution in Equation 3.3 below with the above parameters
describes the probability that we will find m instances of the candidate motif randomly
in a positive intergenic sequence[26].
P_{binomial}(m \mid n, p) = \binom{n}{m} p^{m} (1 - p)^{n - m}    (3.3)
The probability of a motif occurring randomly m or more times in the positive
sequences, or its p-value, is the sum of the tail of the binomial distribution with the
above parameters as shown in Equation 3.4 below.
p\text{-value} = \sum_{i=m}^{\min(M, n)} P_{binomial}(i \mid n, p)    (3.4)
For this distribution, the expected value of the number of successes[26] is
the value
\mu = np
while the standard deviation is
\sigma = \sqrt{np(1 - p)}
With p = M/N, the expected value equals that of the hypergeometric distribution, and
the standard deviation agrees with the hypergeometric standard deviation when n is
much smaller than N.
Using these probabilities, we rank how significant each candidate motif is in the
same way we did using the probabilities calculated from the hypergeometric distribution.
For this model also, the summation contains many terms, and we would like to
calculate this probability even more efficiently. We observe
that as n (i.e., the number of potential motif placements in the positive
regions) becomes larger, the binomial distribution approaches a normal distribution.
We therefore use the normal approximation to the binomial distribution with the same
mean and standard deviation.
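The sketch below evaluates the binomial tail of Equation 3.4 and compares it with the hypergeometric tail of Equation 3.2 for a case where n is much smaller than N. The function names and the numbers are hypothetical and chosen only to show that the two tails nearly coincide in that regime.

```python
from math import comb


def binomial_p_value(m, n, M, N):
    """Equation 3.4 with p = M / N: probability of at least m occurrences."""
    p = M / N
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(m, min(M, n) + 1)
    )


def hypergeom_p_value(m, n, M, N):
    """Equation 3.2, repeated here so the comparison is self-contained."""
    return sum(
        comb(M, i) * comb(N - M, n - i) / comb(N, n)
        for i in range(m, min(M, n) + 1)
    )


# Hypothetical counts with n much smaller than N.
args = dict(m=2, n=50, M=500, N=100_000)
print(binomial_p_value(**args), hypergeom_p_value(**args))
```

For realistic values of n the binomial coefficient becomes too large to multiply by a float directly, which is one practical reason to move on to the normal approximation.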
3.1.3 Normal Approximation to the Binomial Model
By the central limit theorem, we can approximate the binomial distribution by using
a normal (Gaussian) distribution[26, 27]. The transformation is described here.
Let X_i be a Bernoulli random variable indicating whether or not the ith motif was
randomly found in a positive intergenic sequence.
Let S_n be the sum of Bernoulli random variables X_1 + ... + X_n, thus forming
a binomial distribution with n and p defined previously. In other words, S_n is the
number of successes out of n motifs and is modeled as a normal distribution. The
p-value is equivalent to the probability that S_n is at least as large as m, the number
of times the motif was found in the positive regions only, as shown in Equation 3.5:
p\text{-value} = P(m \le S_n)    (3.5)
Figure 3-1 displays the distribution of S_n. We are calculating the area of the
shaded box in the bottom right corner.

[Figure 3-1: Normal Distribution Approximation of Number of Successes. We calculate
the area of the shaded box, P(m ≤ S_n), which is equivalent to the probability of m or
more successes. μ is the mean number of successes. Horizontal axis: Number of
Successes.]
The distribution above is transformed to a standard normal distribution (Z_n) in
Equation 3.6 by subtracting the mean and scaling by the standard deviation.
Z_n = \frac{S_n - \mu - 0.5}{\sigma} = \frac{S_n - np - 0.5}{\sqrt{np(1 - p)}}    (3.6)
We implement the 1/2 correction to account for the conversion from discrete to
continuous values. We can then write the expression for the p-value in terms of \Phi,
the CDF of the standard normal distribution[27], as shown in Equation 3.7.
P(m \le S_n) = P\left(\frac{m - np - 0.5}{\sqrt{np(1 - p)}} \le Z_n\right) = 1 - \Phi\left(\frac{m - np - 0.5}{\sqrt{np(1 - p)}}\right)    (3.7)
3.1.4 Analysis
For each of the candidate motifs found in the positive intergenic regions, we determine
the probability that it occurs randomly that many times. We then set a threshold t
for what we consider significant. For example, if t = .5, this means that half the time
this candidate could be an actual motif but half the time it occurs randomly. We
prefer to use a relatively strict threshold thus eliminating the motifs that are just as
likely to occur randomly. A sample threshold might be .05 thus indicating that the
motif should only occur randomly about 5 percent of the time. We run Monte Carlo
simulations on shuffled data to choose an appropriate threshold of significance.
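The normal-approximation p-value of Section 3.1.3 and the threshold test described here can be sketched as follows. The upper tail is taken as 1 - Phi(z), computed with the complementary error function; the function names and the example counts are hypothetical.

```python
from math import erfc, sqrt


def normal_p_value(m, n, p):
    """Upper-tail probability of at least m successes, using the normal
    approximation with the 1/2 continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    z = (m - mu - 0.5) / sigma
    return 0.5 * erfc(z / sqrt(2))  # equals 1 - Phi(z)


def is_significant(p_value, threshold=0.05):
    """A candidate motif is kept when its p-value falls below the threshold t."""
    return p_value < threshold


# Hypothetical candidate: 60 occurrences where 50 are expected by chance.
p_val = normal_p_value(m=60, n=100, p=0.5)
print(p_val, is_significant(p_val))
```

Unlike the hypergeometric and binomial tails, this computation takes constant time per candidate motif regardless of n.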
3.2 Monte Carlo Simulation Strategy
We perform another set of tests, similar to the Random Sequence Null Hypothesis
described in Barash, et al[11], to determine how well our algorithm performs. We have
just devised a way to determine how significant our motif is. However, we would like
to evaluate how well our algorithm finds significant motifs. Our algorithm should
find motifs that are more significant than if we searched on a random set of intergenic
sequences. In order to determine this, we perform a series of Monte Carlo simulations.
The basic procedure for each Monte Carlo simulation requires us to pick a random
set of all the intergenic sequences and set this to be the set of positive regions for
a given motif. We then run our algorithm using this set of positive sequences and
obtain a set of significant motifs. Our hope is that the motifs found from this random
selection of intergenic sequences should be less significant than those found using the
positive sequences as determined by the ChIP data.
We can use the results from the Monte Carlo simulations to further prune the set
of motifs found using the ChIP sequences. If a motif found in the ChIP intergenic
sequences is not much more significant than the motifs found at random during the
Monte Carlo simulations, then it is unlikely to be a biological motif. By using this
metric to prune our original results, we can set the threshold for significance that we
mentioned previously and thus better characterize our motifs.
3.2.1 Sequence Selection Strategy
We begin by randomly picking sequences from all intergenic regions. The random
selection of the positive sequences is modified to reflect the fact that probes that
bind many transcription factors are more important and that probes that bind few or
no factors are less desirable. Therefore,
we weight the selection of a probe sequence by the number of transcription factors
that it binds to.
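One way to realize such a weighted selection is sketched below. The text does not say whether probes are drawn with or without replacement; this example assumes sampling without replacement, so that a random positive set contains distinct probes. The function name and the toy weights are hypothetical.

```python
import random


def weighted_probe_sample(probes, tf_counts, k, rng=None):
    """Draw k distinct probes, each draw weighted by the number of
    transcription factors the probe binds (assumed: without replacement)."""
    rng = rng or random.Random()
    pool = list(probes)
    weights = list(tf_counts)
    if k > sum(1 for w in weights if w > 0):
        raise ValueError("not enough probes with a nonzero weight")
    chosen = []
    for _ in range(k):
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
        weights.pop(idx)
    return chosen


# Toy example: probe "a" binds no factors and is therefore never selected.
picked = weighted_probe_sample(["a", "b", "c", "d"], [0, 3, 1, 5], k=3,
                               rng=random.Random(0))
print(picked)
```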
The transcription factors are then sorted by the number of intergenic sequences
each factor binds to. Figure 3-2 shows the distribution of set sizes for all the transcription factors. We choose ten representative sizes for the number of sequences in
the positive set. This selection strategy is described in conjunction with the results
in Section 4.2.
[Figure 3-2: Distribution of Set Sizes. Histogram of the number of positive sequences
per transcription factor. Horizontal axis: Number of Positive Intergenic Sequences in
a Set.]

3.2.2 Testing Strategy
For each size, we generate 30 random sets of sequences of that size. MSEE is run on
these 30 sets and the significant motifs are extracted. The flowchart in Figure 3-3
summarizes this testing strategy.
After running this simulation a number of times on sets of sequences with both
the numbers of sequences and the total number of bases in the sequences being varied,
we obtain a distribution of the significance for the motifs found. We hope to find that
the log of the significance is indeed a normal distribution. Calculating the mean and
standard deviation gives us a way of identifying the noise in the system as produced
by the ChIP data, and a way to eliminate it.
[Figure 3-3: Monte Carlo Simulation Strategy. For each sequence set size chosen, we
generate 30 random sets of that size and run MSEE on these sets.]

Chapter 4
Results
4.1 Consensus Motifs from Literature
In order to determine whether or not our methods correctly extracted gapped motifs,
we first compiled a set of consensus motifs from the literature that were found to have
gaps. Specifically, we compiled a list of motifs in Table 4.1 as published in Harbison,
et al[4], which includes both motifs found experimentally and motifs found in
the literature. The motifs are given here with their enrichment scores. Six different
motif discovery methods were used to compile their list of motifs, including MEME,
MDScan, and AlignACE. Additionally, motifs published in Kellis, et al[5] are listed
in Table 4.2. We looked only at motifs that had a gap of 2 or more base
pairs, as well as the motif for GCN4, which is biologically very significant. Again, we
use n to indicate a gapped base and N to indicate a wildcard within a syllable.
Section 4.3 shows the best motifs found using MSEE and the alternate method
and their corresponding scores for all of the above transcription factors except TEA1
and HAP1, which were not included in the ChIP experimental tests.
TF      Consensus Motif        Enrichment Score
ABF1    rTCAytnnnnAgc          137.742
GAL4    CGGnnnnnnnnnnncCg      13.424
GCN4    TGAsTCa                64.620
HSF1    TTCYAnnnnnTTC          32.956
PUT3    NNCGGnnnnnnnnnnCCG     0.000
RGT1    CGGAnnA                0.000
SOK2    TGCAGnnA               12.280
STB4    TCGGnnCGA              3.693
STP1    RCGGCnnnRCGGC          0.000
SUT1    GCSGSGnnSG             21.013
UGA3    CCGnnnnCGG             0.000
Table 4.1: Consensus motifs containing gaps from Harbison et al[4] listed by the
transcription factor (TF), the consensus motif, and the enrichment score. The motifs
for which the enrichment score is 0.0 were obtained from the literature and were not
found experimentally in Harbison, et al.

TF      Known Motif            MCS
ABF1    rTCRYnnnnnACG          50.0
GAL4    CGGnnnnnnnnnnnCCG      8.0
TEA1    CGGnCGG                6.8
PUT3    CGGnnnnnnnnnnCCG       6.2
HAP1    CGGnnnTAnCGG           2.5
Table 4.2: Consensus motifs containing gaps from Kellis, et al[5]. The motif
conservation score (MCS) is listed alongside the motif.

4.2 Monte Carlo Simulations
The Monte Carlo simulations were performed using MSEE as described in Section 3.2
on a range of sequence set sizes. Again, the purpose of these simulations is to pick
random sets of intergenic sequences (with a comparable number of sequences and total
number of bases) and find motifs from these sets. If our scoring method is effective,
then these motifs will score lower than motifs found from ChIP positive intergenic
sequences.
4.2.1 Simulation Setup
We chose set sizes that were close to the positive set sizes for the transcription factors
in Table 4.1 above. The number of sequences found to be positive intergenic regions
is displayed in Table 4.3 along with the total number of bases for each set.

TF      Num Sequences   Total Num Bases
RGT1    4               1909
PUT3    14              7868
STP1    21              11882
STB4    28              13561
UGA3    33              15000
HSF1    42              22977
GCN4    59              33358
SOK2    68              50984
SUT1    74              50323
GAL4    90              49540
ABF1    178             83304
Table 4.3: Number of Positive Intergenic Sequences and Total Number of bases in
these sequences for Transcription Factors with gapped consensus motifs. Sorted by
Number of Positive Sequences.

Based on the number of sequences in Table 4.3, we chose 10 representative sizes:
4, 5, 15, 20, 30, 40, 60, 70, 90, and 180. We generated 50 data sets for the sizes 4, 5,
and 15, generated 30 data sets for all other set sizes, and ran MSEE using these sets
as positive intergenic sequences. We also calculated the mean and standard deviation
of the total number of bases for these generated sets.
4.2.2 Monte Carlo Motif Scores
Table 4.4 displays the average and standard deviation of the scores for the best motifs
found for each set size as well as the mean and standard deviation of the total number
of bases.

Size   Mean Score   StdDev     Mean Number Bases   StdDev(Bases)   Conf
4      268.218      183.323    1989                652             .92
5      215.498      114.318    2585                647             .86
15     141.777      123.931    7694                1470            .90
20     110.181      53.0920    10701               1141            .93
30     88.086       26.5553    16531               1698            .87
40     97.774       57.520     21845               2109            .90
60     78.539       36.9511    32299               2711            .90
70     72.529       22.7744    38588               2657            .87
90     68.988       24.8676    49004               3063            .87
180    54.252       16.113     97462               4239            .83
Table 4.4: Scores from Monte Carlo Simulations. The last column "Conf" refers to
the confidence level achieved if we set a threshold of one standard deviation away
from the mean score.
Figure 4-1 graphically displays the motif scores listed in Table 4.4, thus allowing
us to view distinct trends in both the mean and standard deviation of the scores
as the sequence set size increases. We observe two key points. First, as the number
of sequences in the set increases, the mean score decreases, thus indicating that the
motifs found are less and less significant. Second, as the number of sequences in the
set increases, the standard deviation from the mean score also tends to decrease. The
fact that there is less variance in the scores when the positive set is larger is
interesting.
[Figure 4-1: Scores of the Best Motifs from Monte Carlo Simulations. Both mean and
standard deviation from the mean decrease as the number of sequences in the set
increases. Horizontal axis: Number of Sequences in Test Set.]

4.2.3 Important Features of Monte Carlo Results
We first examine how well these randomly generated sets compare to the ChIP intergenic sets. We plot the total number of bases for each in Figure 4-2.
[Figure 4-2: Number of bases in Monte Carlo sets vs. ChIP sets. The significant
outliers are circled in red. Horizontal axis: Number of Sequences in Positive Set.]
Figure 4-2 shows that the Monte Carlo simulations generated positive sets with a
comparable number of bases for all but three transcription factors. Both SOK2 and
SUT1 have a larger number of bases than was generated randomly, while ABF1 has
fewer bases than were generated randomly. The effect of this difference was not
investigated in this thesis but might produce interesting results.
Second, we look at the distribution of scores for a given set size. Many methods
assume that this distribution is Gaussian. Using the calculated mean and standard
deviation, they then determine a score threshold which gives a good confidence
interval. In the Gaussian case, this threshold is set to be three standard deviations
above the mean.
We observe the score distribution for set sizes of 4 and 180 to get an idea of the
general shape. These distributions are displayed in Figure 4-3.
[Figure 4-3: Histogram of Scores for Positive Sequence Set Sizes of 4 and 180.
Distributions are noticeably non-Gaussian. Horizontal axis of both panels: Negative
Log Motif Score.]
The score distributions do not have Gaussian properties; they are heavy-tailed to
the right of the mean score. We would expect this to decrease the confidence level
at a given standard deviation from the mean as compared to a Gaussian distribution.
The confidence of a one standard deviation threshold is therefore expected to be less
than 84.13%, which is the confidence for a one standard deviation threshold of
the Gaussian distribution. However, the confidence levels shown in
Table 4.4 appear to be as good as, if not better than, the
confidence for a Gaussian distribution. This is a mere coincidence and can be
attributed to the small sample sizes of the random runs. Harbison et al parameterized
the observed scores from similar random runs by a normal distribution[4]
(Supplementary Methods).
However, we have shown that the assumption of normality is,
in fact, incorrect. This should be kept in mind when we compare the Monte Carlo
motifs to the motifs extracted using MSEE.
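The comparison between an empirical one-standard-deviation confidence and the Gaussian reference value can be sketched as follows. The reading of "Conf" as the fraction of random-run scores at or below the mean plus one standard deviation is an assumption, as are the function name, the use of the sample standard deviation, and the toy scores, which are not experimental data.

```python
from math import erf, sqrt
from statistics import mean, stdev


def one_sigma_confidence(scores):
    """Fraction of scores at or below mean + one sample standard deviation
    (assumed definition of the "Conf" column)."""
    threshold = mean(scores) + stdev(scores)
    return sum(s <= threshold for s in scores) / len(scores)


# Reference value for a Gaussian: Phi(1), approximately 0.8413.
gaussian_reference = 0.5 * (1 + erf(1 / sqrt(2)))

# Toy heavy-tailed sample: one large score dominates the spread.
toy_scores = [1, 2, 3, 4, 100]
print(one_sigma_confidence(toy_scores), gaussian_reference)
```

With only a handful of runs per set size the empirical fraction can move by several percentage points when a single score changes, which is consistent with attributing the agreement in Table 4.4 to small sample sizes.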
4.3 Testing the Two Methods
We tested the two methods described in Sections 2.2 and 2.6 on a subset of
transcription factors and evaluated the results.
4.3.1 Validation of MSEE
Parameters and Setup
We ran the first method with the following parameters:
* Number of Syllables - 2
* Number of Wildcards - 2
* Motif Structure [Type (Width)] - Syllable (4): Gap (variable): Syllable (4)
* Range of Gap Widths tested - [0,20]
* Transcription Factors tested - all 367 from ChIP data[21, 4]
* Hash Function Optimized? - yes
First, we ran MSEE on all intergenic sequences and then on individual sets of
positive intergenic sequences for each transcription factor. Both sets of counts were
then used to calculate the significance of each candidate motif and output a list of
significant motifs found for the given motif structure.
Results of MSEE
Table 4.5 shows the highest scoring motif for each transcription factor (TF) and the
corresponding negative log normal approximation score. The highest scoring motifs
for each condition of a particular transcription factor were examined, and the most
significant was chosen for Table 4.5. The corresponding conditions are given in the
caption. The columns following the motif column are "Posit Count", which is the
number of occurrences of this motif in the positive intergenic regions, and "All Count",
which is the number of occurrences of this motif in all intergenic regions.
TF     Score     Top Motif (MSEE)               Posit Count   All Count
GCN4   736.365   TGASTCAY                       59            172
ABF1   534.897   RTCAnnnnnnACGN                 182           759
UGA3   513.878   GKGTnnnnGTGK                   54            448
RGT1   359.364   CCGGnnnnnnnnnnnnnnnnnnnCCRG    3             12
HSF1   320.6     TTCTAGAA                       36            206
PUT3   294.289   CGGGnnnnnnnnnCCGA              6             17
STB4   231.693   TCGRnYCGA                      16            96
GAL4   150.733   CGGRnnnnnnnnnnCCGR             18            50
STP1   130.452   CGGCnnnnnnnnnCGGC              5             17
SUT1   93.1803   CCNGCGGS                       21            101
SOK2   69.7339   CCCCTRGC                       11            37
Table 4.5: Motifs found using MSEE for GCN4, ABF1, UGA3 in Rapamycin, RGT1,
HSF1 in Heat Shock, GAL4 in Galactose, PUT3, STB4, STP1, SUT1, and SOK2
after 14 hour Butanol treatment. The default condition is YPD.
These results were obtained by extracting the best motif for all gap widths tested
and then choosing the highest scoring motif and corresponding gap width from these
20 motifs. Non-palindromic motifs were reported twice since the reversed and complemented version had the same score. One of these two representations was arbitrarily
chosen for Table 4.5 above.
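To make the two representations concrete, the following minimal sketch computes the reverse complement of a motif written with IUPAC codes and lowercase n for gap positions. The helper names are chosen for this example only. With it, one can check that the Appendix A entries RTGASTCA (GCN4) and NCGTnnnnnnTGAY (ABF1) are the reversed and complemented forms of the Table 4.5 entries TGASTCAY and RTCAnnnnnnACGN.

```python
# Standard IUPAC nucleotide complements; lowercase "n" marks a gap position.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "K": "M", "M": "K",
    "S": "S", "W": "W", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N", "n": "n",
}

def reverse_complement(motif):
    """Return the motif as read on the opposite strand."""
    return "".join(COMPLEMENT[symbol] for symbol in reversed(motif))

def is_palindromic(motif):
    """A motif is palindromic when it equals its own reverse complement."""
    return motif == reverse_complement(motif)
```

A palindromic motif such as TTCTAGAA has only one representation, which is why only non-palindromic motifs appear twice in the raw output.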
Comparison with Consensus Motifs
We list the motifs found by MSEE again in Table 4.6 alongside the motifs found in
Harbison, et al[4] and Kellis, et al[5] sorted by their MSEE significance score.
TF     Top Motif (MSEE)   Harbison Motif    Kellis Motif     Match?
GCN4   TGASTCAY           TGAsTCa           -                y
ABF1   RTCAnnnnnnACGN     rTCAytnnnnAgc     rTCRYnnnnnACG    y
UGA3   GKGTnnnnGTGK       CCGnnnnCGG        -                n
RGT1   CCGG[n]19CCRG      CGGAnnA           -                p
HSF1   TTCTAGAA           TTCYAnnnnnTTC     -                p
PUT3   CGGG[n]10CCGA      nnCGG[n]10CCG     CGG[n]10CCG      y
STB4   TCGRnYCGA          TCGGnnCGA         -                y
GAL4   CGGR[n]10CCGR      CGG[n]11cCg       CGG[n]11CCG      y
STP1   CGGC[n]9CGGC       RCGGCnnnRCGGC     -                p
SUT1   CCNGCGGS           GCSGSGnnSG        -                n
SOK2   CCCCTRGC           TGCAGnnA          -                n
Table 4.6: Comparing MSEE motifs with Consensus Motifs. The "Match?" column
displays a y - pattern and structure match; p - partial pattern match but structure
mismatch; n - mismatch
From Table 4.6, we observe that the MSEE motifs and motifs from literature
agree well for GCN4, ABF1, PUT3, STB4, and GAL4. In some cases, the motifs
agree within a syllable though the entire structure is not the same. The three sets of
motifs that agree partially are the motifs for RGT1, HSF1, and STP1. The motifs
do not agree for UGA3, SUT1, and SOK2.
MSEE correctly extracts all motifs that had an enrichment score (see Table 4.1)
of 30 or higher. It also extracted PUT3 which was not found in Harbison et al. We
also observe that of the mismatches and partial matches, MSEE seems to pick up "C"
and "G" signals more frequently than "A" and "T"; this is particularly noticeable for
RGT1, STP1, SUT1 and SOK2. This may be due to the fact that the background
frequencies of "C" and "G" are higher than those for "A" and "T" in particular sets
of positive sequences.
For the motifs that were partially extracted, one explanation is that certain syllables are more significant than others. In other words, independent syllables (in their
context) are more significant than the entire multi-syllabic word itself. The motif
found for RGT1 has two syllables: CCGG and CCRG which are both very similar
to the beginning of the actual RGT1 motif. Instead of picking up the correct motif,
MSEE reported separate instances of the first syllable CGG(A) as the most significant
motif.
It was surprising that the motif for UGA3 was a mismatch given that its significance
score was so high. One possible explanation is that UGA3 has a transcription partner
which binds in the same probe sequences. The motif found may be the motif to which
this partner binds. Incidentally, the correct UGA3 motif has a score of 15 as calculated
by MSEE, which is very insignificant compared to all other top scoring motifs.
Comparison with Monte Carlo Motifs
Based on the Monte Carlo simulations, we would like to prune our results from MSEE
and compare the significance of the motifs to the motifs found from the Monte Carlo
simulations as described in Section 4.2. We compare the actual result to the Monte
Carlo score for the sequence set size that is closest to the size of the actual positive
intergenic sequence set. Figure 4-4 shows the plot of the scores compared to the
Monte Carlo scores.
[Plot: "Monte Carlo Scores versus ChIP Scores"; x-axis: Number of Sequences in Positive Set; labeled points: GCN4, ABF1, UGA3, RGT1, HSF1, PUT3, STB4, GAL4, STP1, SUT1, SOK2]
Figure 4-4: Motif Scores from Monte Carlo Sets versus ChIP sets
To determine whether we should discard a motif, we compare its score to the
Monte Carlo score corresponding to the closest positive set size. If the score of the
motif is at least one standard deviation away from the corresponding Monte Carlo
score, we retain it; otherwise, we discard it. Upon close examination of Figure 4-4,
we discard the motifs for the following transcription factors: RGT1, STP1, SOK2,
and SUT1. We retain the motifs for PUT3, GAL4, GCN4, UGA3, HSF1, ABF1, and
STB4. We have chosen to include PUT3 although it is barely one standard deviation
away from the mean score of the corresponding Monte Carlo motif. We recall that
the threshold of one standard deviation above the mean is a more lenient threshold
that gives a lower confidence than that of a Gaussian.
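The pruning rule described above can be sketched as follows. This is a minimal illustration that assumes the Monte Carlo results are summarized as a mean and standard deviation of the top score for each simulated positive set size. The function names and the `mc_stats` dictionary layout are hypothetical, and the numbers in the usage note are placeholders rather than values from our simulations.

```python
def closest_set_size(sizes, n_positive):
    """Pick the simulated set size nearest to the real positive set size."""
    return min(sizes, key=lambda size: abs(size - n_positive))

def retain_motif(score, n_positive, mc_stats):
    """Keep a motif only if its score is at least one standard deviation
    above the mean Monte Carlo score for the closest set size.

    mc_stats maps a simulated set size to a (mean, std) pair."""
    size = closest_set_size(mc_stats.keys(), n_positive)
    mean, std = mc_stats[size]
    return score >= mean + std
```

For example, with placeholder statistics `{10: (100.0, 20.0), 50: (80.0, 10.0)}`, a motif scoring 150 from a positive set of 12 sequences is compared against the size-10 entry and retained, while a motif scoring 85 from a set of 45 sequences is compared against the size-50 entry and discarded.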
The 5 correct motifs (GCN4, ABF1, PUT3, STB4, and GAL4) were retained after
we pruned our results using the Monte Carlo simulations. Therefore, we can conclude
that these motifs are correct with the specific confidence levels reported in Table 4.4.
This demonstrates the correctness of our scoring method.
However, we have noticed that this scoring method is not effective in some cases.
As we have noted with UGA3 and other motifs, the correct motif may have a very low
score in comparison to the best motif found. This suggests that our assumption that
a motif must be significantly overrepresented in positive intergenic regions may not
hold. We must explore other biological factors that limit transcription factor binding
to certain regions and not others.
4.3.2 Validation of Alternate Method
Parameters and Setup
We ran the second method with the following parameters:
* Number of Syllables - 2
* Number of Wildcards - 2
* Motif Structure [Type (Width)] - Syllable (4): Gap (variable): Syllable (4)
* Range of Gap Widths tested - [0,20]
* Transcription Factors tested - all 367 from ChIP data
* Hash Function Optimized? - yes
Here, we used the dictionaries created previously on all intergenic sequences and
generated dictionaries of the positive sets of sequences using the separate algorithm.
We then ran the algorithm using the updated method described in Section 2.6, which
has been optimized to run much faster.
Results of Alternate Method
Table 4.7 lists the motifs found using the alternate method. Again, preceding the
motif is its negative log normal approximation score and following the motif is the
number of times it occurred in the positive intergenic sequences and the number of
times it occurred in all intergenic sequences.
TF     Score     Top Motif                      Posit Count   All Count
GCN4   368.002   TGASTCAY                       59            214
RGT1   350.465   CCGGnnnnnnnnnnnnnnnnnnnCCGG    2             3
PUT3   257.885   CGGGnnnnnnnnCCCG               6             13
ABF1   236.666   RTCAnnnnnnACGR                 142           588
STB4   188.337   TCGGnCCGA                      8             19
GAL4   171.357   YCGGnnnnnnnnnnnCCGR            14            40
HSF1   156.231   GCATnnATGC                     4             32
UGA3   152.577   ACMCnnnnnnnCMCA                42            539
STP1   113.162   GCCGnnnnnnnnnnnnnnnnnnnCGGC    4             8
SUT1   70.8559   CGGGnCCCG                      4             3
SOK2   69.8423   CGGGnCCCG                      4             3
Table 4.7: Motifs found using Alternate Method for GCN4, RGT1, PUT3, ABF1,
STB4, GAL4 in Raffinose, HSF1, UGA3 in Rapamycin, STP1, SUT1, and SOK2
after 14 hour Butanol treatment.
These results were similarly obtained by extracting the best motif for all gap
widths tested and then choosing the highest scoring motif and corresponding gap
width from these 20 motifs. However, in these results, non-palindromic motifs were
counted only once given the addition of the canonical enumeration. As a result, the
motif counts for the positive intergenic and all intergenic sequences differ from those
found using MSEE and thus lead to a different score.
Immediately, we notice something wrong with this method if we observe the word
counts for SUT1 and SOK2. This method reports that it finds the motif 4 times in the
positive regions yet only 3 times in all regions. However, since the set of all regions
includes the positive regions, it must occur at least 4 times in all regions. Upon closer
inspection, we see that both these motifs are palindromic. This inconsistency is the
result of the use of the canonical form of the word when counting as described in
Section 2.6. Palindromic words are counted only once for each instance on a sequence
though they occur on both forward and reverse strands. For this reason, we must
ignore all palindromic results from this method.
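The effect of canonical counting on palindromes can be illustrated with a small sketch. It compares counting a word on both strands with counting canonical forms in a single forward scan, where the canonical form of a window is taken to be the lexicographically smaller of the window and its reverse complement. That choice of canonical form, the function names, and the toy sequence are assumptions made for this example; the actual procedure is the one described in Section 2.6.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a plain A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def count_forward(seq, word):
    """Occurrences of word on the given strand (overlaps allowed)."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def count_both_strands(seq, word):
    """Occurrences of word on the forward strand plus the reverse strand."""
    return count_forward(seq, word) + count_forward(revcomp(seq), word)

def count_canonical(seq, word):
    """Single forward scan; each window is mapped to its canonical form
    before being compared with the canonical form of word."""
    k = len(word)
    key = min(word, revcomp(word))
    return sum(
        1 for i in range(len(seq) - k + 1)
        if min(seq[i:i + k], revcomp(seq[i:i + k])) == key
    )
```

On the toy sequence TTACGTTTGACCTTGGTCAA, the non-palindromic word GACC is counted twice by both schemes, whereas the palindromic word ACGT is counted twice on both strands but only once in canonical form. If the two dictionaries being compared do not follow the same convention, a palindromic motif can therefore show a larger count in the positive set than in the full set.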
Comparison with Consensus Motifs
We list the motifs found by this alternate method again in Table 4.8 alongside the
motifs found in Harbison et al and Kellis et al and sorted by their significance score.
TF     Top Motif (Alternate)   Harbison Motif    Kellis Motif     Match?
GCN4   TGASTCAY                TGAsTCa           -                y
RGT1   CCGG[n]19CCGG           CGGAnnA           -                p
PUT3   CGGG[n]8CCGA            nnCGG[n]10CCG     CGG[n]10CCG      p
ABF1   RTCAnnnnnnACGR          rTCAytnnnnAgc     rTCRYnnnnnACG    y
STB4   TCGGnCCGA               TCGGnnCGA         -                y
GAL4   YCGG[n]11CCGR           CGG[n]11cCg       CGG[n]11CCG      y
HSF1   GCATnnATGC              TTCYAnnnnnTTC     -                n
UGA3   ACMCnnnnnnnCMCA         CCGnnnnCGG        -                n
STP1   GCCG[n]19CGGC           RCGGCnnnRCGGC     -                p
SUT1   CGGGnCCCG               GCSGSGnnSG        -                n
SOK2   CGGGnCCCG               TGCAGnnA          -                n
Table 4.8: Comparing motifs from alternate method with Consensus Motifs. The
"Match?" column displays a y - pattern and structure match; p - partial pattern
match but structure mismatch; n - mismatch
From Table 4.8, we observe that the motifs agree reasonably well for GCN4,
ABF1, STB4, and GAL4. They agree partially, or within individual syllables, for
RGT1, PUT3, and STP1. They do not agree for HSF1, UGA3, SUT1, and SOK2.
The motifs that produced palindromic results were RGT1, PUT3, STB4, GAL4,
HSF1, STP1, SUT1, and SOK2. We observe that the correct significant motifs were
found for GAL4 and STB4 even though they are palindromic. Their scores, however, are incorrect
due to the inconsistency of the word counts. For the factors besides GAL4 and STB4,
the palindromic motifs found were thus assigned a higher significance than the correct
value and appeared to be more significant than the correct answer.
4.3.3 Comparison of MSEE and Alternate Method
From our results for 11 transcription factors, we observe that MSEE correctly extracted 5 motifs, partially extracted 3 motifs, and completely missed 3 motifs. The
alternate method correctly extracted 4 motifs, partially extracted 3 motifs, and completely missed 4 motifs. Both MSEE and the alternate method correctly found the
motifs for GCN4, ABF1, STB4, and GAL4 while both completely missed the motifs
for UGA3, SUT1, and SOK2.
Both methods essentially perform motif extraction in the same way with few
differences in run time. It is expected that their results be similar if not exactly
the same. The differences in the results arise from the inconsistent word counting in
the alternate method which favors palindromes over non-palindromic motifs. While
MSEE runs much slower, it is the more accurate method of the two.
4.4 Further Work
The most obvious next step is to refine the motifs produced by MSEE. Using these
motifs, we can perform more specific motif extraction to develop better and more
accurate models.
4.4.1 Motif Alignment of Different Gapped Results
When MSEE is run, it outputs the best motif for each gap width used. Of these, the
best is reported as the most significant motif in Table 4.5 above. However, we can use
the information from the results of different gap widths to obtain more information
about the context and relative strength of the base pairs in the motif. We use ABF1
as an example of this method.
We list the top motifs for each gap width for ABF1 in Table 4.9 where the reported
motif is given first. Aligning these various results gives us information about the motif
context as shown directly below Table 4.9.
Gap Width   Score     Motif
Gap 6       534.897   RTCAnnnnnnACGN
Gap 5       533.059   RTCAnnnnnNACG
Gap 4       528.477   TCAYnnnnNACG
Gap 3       382.929   CAYTnnnNACG
Gap 7       355.84    NRTCnnnnnnnACGA
Gap 8       137.96    YATCnnnnnnnnCGAN
Gap 2       89.4057   ACTWnnNACG
Gap 0       38.4222   ATCACTAW
Gap 9       34.1574   TATCnnnnnnnnnGANN
Gap 1       33.496    CYATnNACG
Table 4.9: Top motifs found for ABF1
Gap 6    RTCAnnnnnnACGN
Gap 5    RTCAnnnnnNACG
Gap 4     TCAYnnnnNACG
Gap 3      CAYTnnnNACG
Gap 7   NRTCnnnnnnnACGA
Gap 8   YATCnnnnnnnnCGAN
Gap 2       ACTWnnNACG
Gap 0    ATCACTAW
Gap 9   TATCnnnnnnnnnGANN
Gap 1        CYATnNACG
Upon inspection, the consensus motif resulting from this alignment is approximately
r T C A y t w n n n A C G a
This consensus motif is very similar to those found in Harbison, et al [4] and Kellis,
et al [5]. This alignment can also be used on the results from HSF1 and GCN4.
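The alignment above was done by inspection. As one possible way to automate it, the sketch below slides each motif along a reference motif and keeps the offset with the best agreement, scoring +1 where two specific IUPAC symbols are compatible, -1 where they are incompatible, and 0 where either symbol is a wildcard (n or N). This scoring scheme and the function names are assumptions made for illustration and are not part of MSEE.

```python
# Bases allowed by each specific IUPAC symbol; n and N are treated as wildcards.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

def column_score(a, b):
    """+1 for compatible specific symbols, -1 for incompatible ones,
    0 when either symbol is a wildcard."""
    if a not in IUPAC or b not in IUPAC:
        return 0
    return 1 if set(IUPAC[a]) & set(IUPAC[b]) else -1

def best_offset(reference, motif):
    """Offset of motif relative to the start of reference that maximizes
    the summed column score over the overlapping positions. Ties are
    resolved in favor of the smallest offset."""
    best = None
    for offset in range(-len(motif) + 1, len(reference)):
        score = sum(
            column_score(reference[i + offset], symbol)
            for i, symbol in enumerate(motif)
            if 0 <= i + offset < len(reference)
        )
        if best is None or score > best[0]:
            best = (score, offset)
    return best[1]
```

Using the reported ABF1 motif RTCAnnnnnnACGN as the reference, this places TCAYnnnnNACG one position to the right, ATCACTAW at the same starting position, and NRTCnnnnnnnACGA one position to the left, in agreement with the manual alignment.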
4.4.2 Motif as Seed to Probabilistic Methods
While alignment is quick and easy, a more accurate way of refining the motif models
is to use the motifs reported by MSEE as a seed to either EM or Gibbs sampling.
One way to do this is to create a PSSM from the most significant motif where each
base has a weight of 1.0. Another way would be to create a PSSM from the alignment
shown above with relative weights for each position.
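A minimal sketch of the first option is given below. It converts a motif written with IUPAC codes into a position-specific matrix in which every base permitted at a position receives a weight of 1.0, and then normalizes each row to sum to one so that the matrix can be read as a probability model. The normalization step, the uniform treatment of gap positions, and the function name are assumptions made for this example.

```python
# Bases permitted by each IUPAC symbol; gap positions (n) allow any base.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT", "n": "ACGT",
}

def motif_to_pssm(motif):
    """Return one row per motif position with columns ordered A, C, G, T."""
    rows = []
    for symbol in motif:
        allowed = IUPAC_BASES[symbol]
        raw = [1.0 if base in allowed else 0.0 for base in "ACGT"]  # weight 1.0 per allowed base
        total = sum(raw)
        rows.append([weight / total for weight in raw])              # normalize the row
    return rows
```

In practice a small pseudocount would likely be added to every entry before seeding EM or Gibbs sampling, since entries that are exactly zero cannot be updated by methods that multiply probabilities.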
4.4.3 Extracting Motifs with Complex Structures
MSEE has only been tested for motif structures that have 2 syllables with a gap
separating them. However, it has been implemented to search for any number of
syllables with any number of gaps separating them. The only upper bound is that
the total composite length of the syllables must be less than 8 or 9 base pairs. In
future versions, we hope to relax this upper bound thereby allowing for a much larger
range of motif structures.
4.4.4 Motif Significance
This thesis did not analyze how well the normal distribution approximates the hypergeometric distribution. This requires further work in order to ensure that the
approximation holds. We also mentioned in Section 4.3.1 that statistically enriching
motifs by their overrepresentation in positive intergenic regions may not be sufficient
to completely identify all motifs. For example, chromatin-folding is known to have
an effect on limiting when and where transcription factors bind to the DNA. Further
work can be done to factor in such biological phenomena.
4.5 Conclusions
MSEE is a useful tool for finding motifs and in particular, for finding multi-syllabic
motifs. In comparison to the alternate method described, MSEE takes up more space
and is computationally more expensive. However, its algorithm is more accurate and
reports more correct motifs. The best use of this tool would be to generate motifs
that can be used as a seed to other methods which produce more refined models of
the motifs.
Appendix A
All Motifs found with MSEE
This appendix lists a subset of the top motifs found using MSEE for all the transcription factors tested. In the following table, the motifs for 179 of the 367 transcription factors are
listed. Each row contains the transcription factor (TF), the score of the motif, and
the motif itself. They are sorted by their significance score.
TF               Score      Top Motif (MSEE)                Posit   All
AFT2_H2O2        -1888.79   ACACACAC                        114     343
GAL4             -1786.54   MCACnCMCA                       84      571
YNR063W          -922.758   GCCCnnnnnnnnnnnnnnnnnnnnGGGC    2       16
YAP8             -882.386   ACACACAC                        58      343
RDS1             -865.11    TCGGCCGA                        6       22
GCN4             -736.365   RTGASTCA                        59      172
GCN4_DTT         -725.591   CMCAnnnMCAC                     59      462
RGM1             -655.035   CCACnnnnnnnnnnnnnnnTCAC         9       46
YPR196W          -643.426   GKAGGGTA                        18      139
RDR1             -637.361   CCGCnnnnnCCGC                   3       15
CUP9-Cu2         -634.949   CMCAnnnMCAC                     72      462
RCS1_H2O2        -626.613   ACACACAC                        65      343
NCB2             -621.08    TGAGnnnnnnnnnnnnnCTCA           4       36
CBF1             -544.745   CACGTGAS                        16      126
ABF1             -534.897   NCGTnnnnnnTGAY                  182     759
CAD1             -527.036   GKGTnnnnGTGK                    54      448
UGA3-Rapamycin   -513.878   GKGTnnnnGTGK                    54      448
YAP5             -504.645   GTGKnnGTGK                      86      567
YAP3             -447.91    ACACACAC                        41      343
PDR1             -437.083   MCACnCMCA                       97      571
TF
SFP1_H22
UBP12
RPN4-H202
MCM1
YFL052w
RPH1-H202
WCEvWCE
RGTl
CHA4
RGT1-Galactose
HSF1_Heat
GAL4-Raffinose
RIM101-low_1202
YKR064W
SNT2
PUT3
CAD1-H202
ZAPI
RTG1-SM
YBR239c
SUT2
YAP81H202
TBS1
OPIl
YAP7
SRD1
YJL206CJow-H202
RTG3AowH202
UBP1O
STP4
RDS1IH202
SFPI
PUT3-SM
STB4
TYE7
YDR266c
REXI
AZF1
ZAP1-Zn
PH02-low-H202
RCS 1..ow-H202
YML081W
SFPiSM
ZMS 1
Score
-430.584
-430.308
-402.896
-394.729
-393.293
-389.743
-360.64
-359.364
-330.096
-322.299
-320.6
-311.364
-310.88
-299.985
-296.585
-295.017
-294.289
-292.366
-289.585
-282.133
-276.983
-273.456
-268.878
-268.684
-265.848
-263.24
-262.421
-258.803
-251
-247.948
-246.623
-245.143
-238.514
-233.921
-231.693
-227.253
-225.459
-219.318
-216.074
-214.057
-208.223
-208.093
-208.064
-204.295
-202.281
Top Motif (MSEE)
ACACACAS
CGGWnnnnnnCCGR
MCACnnnnnnnnnnnnnnCMCA
TTCCnnnnnnGGAA
ACAGCTGT
GKGCnnnnnGCMC
GTCTnnnnAGAC
CCGGnnnnnnnnnnnnnnnnnnnCCRG
GTGTGTGY
KGGGnnnnnnnnnnnnnnCCCM
TTCTAGAA
CGGRnnnnnnnnnnCCGR
GCGAnnnnnnTCGC
GGCTnnnAGCC
CCCSnnnnnnnnnnnnnnnnnnnSGGG
GCGCTAYC
CGGGnnnnnnnnnCCGA
GGTGnnnnnnnnnnnCACC
ACCYnnnRGGT
ACACACAC
CTCGnnnnnnnnnnnnnnnnnnnnCGAG
GCGTACGC
CACACACR
ACCCnnnnnnnACCC
CGGWnnnnnnCCGR
AGKAnnnnnnnnnnnnnnnnTCGG
YCCCnnnnnnnnGGGR
TCCCnnnnnnnnGGGA
CCGAnnTCGS
CGGWnnnnnnCCGR
GKAGGGTA
KCGGCCGW
CGGTnnnnnnnnnnnnnACCG
CGGGnnnnnnnnnCCGR
TCGRnYCGA
YCACGTGR
CGGTnnnnnnnnnnnnACCG
GCCATGGC
GGGAnTCCC
CCACnnnnnnnnnnnnnnnTCAC
CCTGnnnnnCAGG
NGGGTGCA
CCGGnnnnnnnCCGG
AYCCnTACA
CYACnnnnnnnnnnnnnnnCCCC
Posit
26
14
29
42
2
10
2
3
16
4
36
17
2
2
4
17
6
2
16
19
2
2
19
6
14
2
6
4
6
14
20
20
2
10
16
42
2
6
6
13
2
31
6
31
3
All
400
57
235
118
40
34
10
12
408
26
206
50
18
8
30
55
17
18
108
343
14
18
420
42
57
47
42
8
19
57
139
62
6
34
96
222
10
16
12
46
14
125
10
195
31
TF
AFT2
UBP11
Score
-198.727
RGTI-Low-Glucose
-193.285
-193.249
-191.067
-188.426
TECI
RAPI-SM
UBP9
GAL4-Cu2
HSF1_low-.H202
CIN5-H202
YDR026c
STB1
SIP4_SM
HSF1
BASI
HAP4
ASHl1_4hrButanol
YAP1-low-H202
RTG3-H202
UGA3
GAL4-Galactose
YJL206C-1202
RPH1
TEC1_Alpha
ARG81-SM
MIGi
IN02
YJL206C
PIP2
STP1
CST6
SKN7
SIP4
ADR1-SM
STB5
ARG80-SM
PHO4
UBP8
Z1256
AR080
ROXi-H202
GCR1
SMP1
RTG1
MATAl
BUR6
-196.196
-186.813
-184.532
-182.568
-181.387
-179.017
-174.573
-169.329
-167.399
-167.184
-163.583
-161.605
-154.482
-150.796
-150.733
-149.211
-147.329
-145.058
-143.891
-137.571
-136.384
-131.815
-131.715
-130.452
-128.781
-128.423
-127.886
-126.772
-125.954
-124.576
-122.37
-122.24
-120.823
-118.324
-116.558
-115.952
-114.667
-111.166
-108.164
-107.457
Top Motif (MSEE)
CACAnACAC
I Posit I All
24
330
SGGTnnnnACCS
10
80
TCCGnnnnnnnnnnnnnnnCSGA
10
47
CACMnnnnnnnnnnnnnnnnnnnnMCAC 27
204
CCCGnnnnnnnnnnnnnnnnCGTA
2
7
ACYCnnnnnnnACYC
12
136
CCACnnnnnnnnnnnnnnnTCAC
14
46
TTCTAGAA
18
206
GGGGnnnnnnnnCCCC
4
6
CCGGnTAAA
10
86
CGCRnnnnnnnnnnGCGW
9
47
CCATnnnnCCGR
10
39
4
GCATnnATGC
44
17
MSGAGTCA
107
ACMCnnnnnnnnnnnnnnnnnMCAC
32
225
12
CCCSnnnnnnnnnnnnnnnnnnnSGGG
30
121
GCTKnCTAA
CCGAnnnnCGSM
67
42
CCATnnnnnnnnnnnnnnnATGG
CGGRnnnnnnnnnnCCGR
50
10
CGTGnnnnnnnnnnnnnCCGC
CCTGnnnnnnnnnGSCC
21
210
GKGTnnnnnnnnnnnnnnnnnGTGK
50
CAACnnnnnnnnnnnnnnnnnnnGTTG
12
GGGTnnnnACCC
GCATGTGA
54
8
TCCCnnnnnnnnGGGA
10
CCCCnnnnnnnnnnGGGG
17
CGGCnnnnnnnnnCGGC
GKAGGGTA
139
56
GGSCnnnGSCC
28
CTGAnnTCAG
10
GTCTnnnnAGAC
14
GCGGnnnnnnnCCGC
ATKTnnnnnnnnnnnnnCGGG
58
682
TSATnnnnATSA
CSGGnnnnnnnCCSG
48
CCACnnnnnnnnnnnnnnnnnnCCCC
15
CCGMnnnnnRCCG
68
CACCCACA
111
CCcccCsC
79
AYACnCACA
548
AATSnnnTTAS
457
CCAAnnnnnnnnnnnTTGG
52
TGGGnnnnnnnnCCCR
14
67
TF
PDR3
HAP2
YFL044C
PHO2-PiSTP2
PH02-H202
SKOl
RTG3-Rapamycin
HAP5-SM
MIGIGalactose
HAP5
CUP9
NDT80
CAD1-SM
CHA4-SM
YER051w
BAS1_SM
WARI
RTG3-SM
RPH1-low-H202
ARO80-SM
YAP5_H202
RLR1
ARG81
THI2
RME1
THI2-ThiSIR4
SUM1
PPR1
UPC2
TECi Butanol
UGA3-SM
SIG1-H202
RPN4Jow-H202
RIM10 1
HAP3
OAF1
PHO2-SM
UMEl-H202
ADRI
HAP4_SM
SPT2
RTG3
STE12
Score
ITop Motif (MSEE)
I Posit
-106.327 GATCnnnnnnnnnnnnnGATC
-106.077 TCCCnnnnnnnnnnnnnnnnnnnGGGA
-102.944 TRTAnnTAYA
-100.173 SCACGTGS
-99.537
GCGAnnnnnnnnnnnnnnnnnnCCGA
-98.0243 ACGGCCGT
-97.8135 GGGCnnnnnnGCCC
-97.6281 GCGTnACGC
-97.2144 GACCnnnnnGGTC
-97.2052 GSTCnnnnnnnnnnnnnnnGASC
-95.6302 CGCGnnnnnnnnnnnnnnnnnnCGCG
-95.2845 CGCCnnnnnnnnnnnnGYCG
-94.8907 AGTTnnnnnnnnnnnnnnnTCCC
-94.8462 MTTAnnAATS
-94.6534 CTCTnnnnnnnnnAGAG
-91.891
GTCTnnnnAGAC
-91.552
MSGAGTCA
-91.3099 GGAGnnnnnnnnnnnnnnnnnnAYCC
-90.7169 CGCCnnnnGGCG
-90.2731 GTMGnnnnnnnnnnnnnnnnnCKAC
-87.3551 GGGGnnnnnnnnnnnnnCCCC
-85.1899 GGTAnnnnTACC
-84.3003 CCCCnnnnnGGGG
-81.8173 GGRTnnnnnnnnnnnnAYCC
-80.2348 ARGCnnnnnnGCYT
-80.1286 ASCTnnnnnnnnnnnnnnnnGCGG
-80.0774 TRGGnnnnCCYA
-79.0962 TCGAnTCGA
-76.8761 GWCRCAAA
-76.6462 CGACnnnnnnnnnnnnGCCA
-75.7968 ACCGGTTA
-75.7886 TCGTnACGA
-74.9091 CTAGnnnnnnnnnnnnnCTAG
-73.8237 GCTGnnnnnnnnnnnCAGC
-72.8842 AGACnnnnnnnnnnnnGACY
-71.8964 CTGCnnnnnnCTGA
-71.4041 GCGAnnnnnnnnnnnCGCA
-71.2012 CCCCnnnnnnnnnnSRGG
-70.9359 ACCYnnnRGGT
-70.7771 CGGCCGAR
-69.5593 CGATnnnnnnnnnnnnnGTCC
-69.3184 GATGnnnnnnnnnnnnnACCM
-68.7773 GCGMGCRC
-63.7732 CTAAnCTCA
NTGAAACA
-63.438
All
20
10
5642
90
7
16
12
6
6
54
4
31
42
756
24
10
107
31
8
70
8
34
10
100
168
34
64
18
342
11
13
22
24
18
41
22
24
43
108
45
13
65
95
67
690
TF
STPISM
YHP1
RLMl-14hr-Butanol
RCS1-SM
YRR1
ACE2-Cu2
YAPI
UMEl
ARG80
HAP2-Rapamycin
PHO2
RCS1
ASK10
ROX
SIR3
STB2
RPH1-SM
RIM101.H202
RPN4
RLM1
SFPi1ow-H202
YOX1
SPT23
RTG 1Japamycin
Score
-63.0324
-62.4818
-62.2335
-61.855
-59.8441
-59.1808
-59.1462
-59.1042
-58.2859
-57.7562
-54.6925
-53.0868
-52.9335
-52.5781
-51.3149
-51.0293
-48.1694
-47.5489
-46.2583
-45.7255
-43.1684
-41.8195
-39.444
-37.7277
Top Motif (MSEE)
TCGGCCGW
Posit
6
18
30
4
3
6
GCTKnCTAA
17
GKAGGGTA
12
8
4
4
4
11
8
9
8
3
2
5
6
GCCGnnnnnnCGGC
NGCTnnnnnnnnnnnnnnnnnnnAGCN
CYAAnnnTAGM
CCKGnnnnnnnnGGAC
CGACnnnnnnnnnnnnGCCA
GGATnnnGTAY
CGTGnnnnnnnnnnnnnnnnnnnGGCC
GGCCnnnnnnGGCC
GACCnnnnGGTC
AGATnnnnnnnSGAT
CGTSnnnnnnnnnnnnnnnnnGGMC
CGCTnnnnnnnnnnnnnTMSG
SCAGnnnnnnnnnnnnnnCTGS
GGCCnnnGGRC
CCGCnnnnnnnnnnnnnnnGCGG
GCTCnnnCTGA
CCTAGCAC
RATCnnnnnnnnnnnnnnnGATY
GGGCnnnnnnnnnnnnnnnnnnnGCCC
ACCSnnnnnnTACM
GCRGnnnnnnnnnnnnnnCTAC
12
2
12
6
All
14
612
291
15
11
40
121
139
100
9
14
16
120
32
83
90
17
4
37
17
196
2
148
44
Appendix B
Optimization Comparison Tests
This appendix describes a series of tests of hash function performance. The first test
compared the performance of the hash function on words with and without the "Z"
modification. The prediction was that the hash function would perform better when
we eliminated the string of n characters that corresponded to gaps for a given word.
The test was set up with the following parameters:
* Number of Syllables - 2
* Number of Wildcards - 2
* Motif Structure [Type (Width)] - Syllable (3): Gap (variable): Syllable (3)
* Number of sequences in Positive Intergenic Set - 10
* Range of Gap Widths tested - (0,100)
We then recorded the time taken to expand and enumerate the words for the
positive set and then calculated the average time per sequence. The average time
to process a sequence was then compared between the optimized and non-optimized
algorithm. Figure B-1 shows the results of these tests as a function of gap size used.
[Plot: "Hash Function Performance in Optimized and Non-optimized Versions of Algorithm"; x-axis: Gap Width; legend: Optimized, Optimized Data Points, Non-Optimized, Non-Optimized Data Points]
Figure B-1: Comparing hash function performance between optimized and non-optimized versions of algorithm
Linear regression methods were used to interpolate the trend in hash function
performance from the data points gathered. Here, we observe that for small and medium
gap widths, the overhead of the "Z" modification causes it to run slower
than the unmodified version. However, we also observe that as the gap width increases, the hash time for the unmodified version also increases and, at some point,
becomes larger than the hash time for the modified version. The hash function
performance for the modified version remains constant as the gap width changes
since it is independent of the gap width.
Based on these results, we decided to offer two options in running the algorithm:
with and without the modification of removing the gaps when hashing. This allows us
to perform searches for motifs with small gaps relatively quickly while still allowing
us to find motifs with large gaps at a constant time.
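The reason the modified version is independent of gap width can be seen from the length of the string being hashed. The sketch below is only an illustration: it assumes that the modification replaces each run of gap characters with a single "Z" marker, which is unambiguous as long as the gap width is fixed within a run of the algorithm. The function names are hypothetical, and the actual key layout may differ.

```python
def full_key(syllables, gap_width):
    """Key that spells out every gap position as 'n'.
    Its length, and so the hashing work, grows with gap_width."""
    return ("n" * gap_width).join(syllables)

def collapsed_key(syllables):
    """Key with each gap collapsed to a single 'Z' marker (assumed form of
    the modification). Its length does not depend on the gap width."""
    return "Z".join(syllables)
```

For two syllables of width 3, the collapsed key always has 7 characters, while the full key has 6 characters plus the gap width. Building the collapsed key has a fixed cost of its own, which is consistent with the unmodified version being faster for small gaps.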
Bibliography
[1] Ken Takusagawa and David Gifford. Negative information for motif discovery. Pacific Symposium on Biological Computing, 2003.
[2] Ken Takusagawa. Negative information for motif discovery. Master's project, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2003.
[3] Akimori Sarai. Bioinfo bank. Available at http://gibk26.bse.kyutech.ac.jp/jouhou/image/dna-protein/all/small-N1d66.gif.
[4] C. Harbison, D. Gordon, T. Lee, N. Rinaldi, K. Macisaac, T. Danford, N. Hannett, J. Tagne, D. Reynolds, J. Yoo, E. Jennings, J. Zeitlinger, M. Kellis, P. A. Rolfe, K. Takusagawa, E. Lander, D. Gifford, E. Fraenkel, and R. Young. Transcriptional regulatory code of the eukaryotic genome. Nature, 431:99-104, September 2004.
[5] M. Kellis, N. Patterson, M. Endrizzi, B. Birren, and E. Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423:241-254, May 2003.
[6] Marie-France Sagot. On motifs in biological sequences. Available at http://citeseer-ist.psu.edu/473028.html.
[7] Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Menlo Park, CA, 1994. AAAI Press.
[8] XS Liu, DL Brutlag, and JS Liu. An algorithm for finding protein-dna binding sites with applications to chromatin immunoprecipitation microarray experiments. Nature Biotechnology, 20(8):835-839, 2002.
[9] X. Liu, D. Brutlag, and J. Liu. Bioprospector: discovering conserved dna motifs in upstream regulatory regions of coexpressed genes. 2001.
[10] W. Thompson, E. C. Rouchka, and C. E. Lawrence. Gibbs recursive sampler: finding transcription factor binding sites. Nucleic Acids Research, 31(13):3580-3585, 2003.
[11] Y. Barash, G. Bejerano, and N. Friedman. A simple hyper-geometric approach for discovering putative transcription factor binding sites. 2001. Proceedings First International Workshop.
[12] Yoseph Barash and Nir Friedman. Context-specific bayesian clustering for gene expression data. In RECOMB, pages 12-21, 2001.
[13] E.P. Xing, M.I. Jordan, R.M. Karp, and S. Russell. A hierarchical bayesian markovian model for motifs in biopolymer sequences. Advances in Neural Information Processing Systems, 2002.
[14] Jiang Liu. A combinatorial approach for motif discovery in unaligned dna sequences. Master's thesis, University of Waterloo, 2004.
[15] Eric Rouchka. A brief overview of gibbs sampling. Available at http://iteseer.ist.psu.edu/85660.html.
[16] Y. Barash, G. Elidan, N. Friedman, and T. Kaplan. Modeling dependencies in protein-dna binding sites. 2003. In Proceedings of the 7th International Conference on Research in Computational Molecular Biology.
[17] E. Xing, W. Wu, M. Jordan, and R. Karp. Logos: A modular bayesian model for de novo motif detection. 2003. In Proceedings IEEE Computer Society Bioinformatics Conference.
[18] Jeremy Buhler and Martin Tompa. Finding motifs using random projections. In RECOMB, pages 69-76, 2001.
[19] S. Sinha and M. Tompa. Performance comparison of algorithms for finding transcription factor binding sites, 2003. Available at http://citeseer.ist.psu.edu/sinha03performance.html.
[20] Albin Sandelin and Wyeth W. Wasserman. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. Journal of Molecular Biology, 338(2):207-215, 2004.
[21] Bing Ren, François Robert, John J. Wyrick, Oscar Aparicio, Ezra G. Jennings, Itamar Simon, Julia Zeitlinger, Jorg Schreiber, Nancy Hannett, Elenita Kanin, Thomas L. Volkert, Christopher J. Wilson, Stephen P. Bell, and Richard A. Young. Genome-wide location and function of dna binding proteins. Science, 290:2306-2309, 2000.
[22] Microsoft MSDN Library. Standard C++ Library Reference: Map-Class. Microsoft Corporation, 2005. Available at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcstdlib/html/vclrfmap.class.asp.
[23] Eric W. Weisstein. Hypergeometric distribution. MathWorld - A Wolfram Web Resource. Available at http://mathworld.wolfram.com/HypergeometricDistribution.html.
[24] Kyle Siegrist. The hypergeometric distribution. Virtual Laboratories in Probability and Statistics, 1997-2001. Available at http://www.ds.unifi.it/VL/VL-EN/urn/urn2.html.
[25] Suranthe De Silva and Peter D'Andreti. Approximations to probability distributions: Binomial approximation to the hypergeometric distribution. ThinkQuest: Seeing is Believing. Available at http://ibrary.thinkquest.org/10030/6atpdvah.htm.
[26] Eric W. Weisstein. Binomial distribution. MathWorld - A Wolfram Web Resource. Available at http://mathworld.wolfram.com/BinomialDistribution.html.
[27] Dimitri P. Bertsekas and John N. Tsitsiklis. Introduction to Probability. Athena Scientific, 2002.
[28] L. Marsan and M.-F. Sagot. Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification. Journal of Computational Biology, 7:354-360, 2000.
[29] B. Brejova, C. DiMarco, T. Vinar, S. Hidalgo, G. Holguin, and C. Patten. Finding patterns in biological sequences. Unpublished project report for CS798G, University of Waterloo, September 2000.