Multi-Syllabic DNA Motif Discovery

by

Rasika S. Kumar

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2005

© Rasika S. Kumar, MMV. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author: Department of Electrical Engineering and Computer Science, August 8, 2005
Certified by: David Gifford, Professor, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

This thesis describes a method for finding multi-syllabic motifs in a genome. It expands on the algorithm developed by Takusagawa et al.[1, 2] that uses data from Chromatin Immuno-Precipitation (ChIP) experiments to isolate regions that have a given motif. The Takusagawa method uses an enumeration method to search for motifs in both positive and negative intergenic regions in order to determine the statistical significance of the results. Our algorithm also uses enumeration to find motifs that have gaps between the different sub-motifs, or syllables. This thesis also describes a method to calculate the significance of each motif and tests this method via Monte Carlo simulations on random test sets.
The significant motifs found using this algorithm are verified against consensus motifs found in the literature.

Thesis Supervisor: David Gifford
Title: Professor

Acknowledgments

I would first like to thank Dave Gifford for giving me the opportunity to work in his group. I would also like to thank Ken Takusagawa, who has been my mentor and immediate advisor for this thesis and its supporting research. It is through his direct supervision that this thesis has become what it is. Last, but not least, I would like to thank Dave Gifford, Ken Takusagawa, Kenzie MacIsaac, and Dr. B. Kumar for their comments and input during the revision process of this thesis.

Contents

1 Introduction
  1.1 Overview of Motif Discovery Methods
    1.1.1 Determining Bound Regions
    1.1.2 Enumeration Methods
    1.1.3 Probabilistic Methods
    1.1.4 Determining Statistical Significance
  1.2 Overview of this Thesis
  1.3 Goals of this Thesis
  1.4 Contributions of this Thesis
2 Multi-Syllabic Expansion and Enumeration (MSEE)
  2.1 Terms and Definitions
  2.2 MSEE: An Overview
    2.2.1 Examining a Word
  2.3 Object Oriented Implementation
    2.3.1 Expandable Interface
    2.3.2 Fixed Gap class
  2.4 Data Structures
    2.4.1 Hash-Map
  2.5 Optimizing the Algorithm
    2.5.1 The Z modification
  2.6 Alternate Implementation
  2.7 Developing a Test Suite
3 Probabilistic Model
  3.1 Significance of a Motif
    3.1.1 Hypergeometric Model
    3.1.2 Binomial Model
    3.1.3 Normal Approximation to the Binomial Model
    3.1.4 Analysis
  3.2 Monte Carlo Simulation Strategy
    3.2.1 Sequence Selection Strategy
    3.2.2 Testing Strategy
4 Results
  4.1 Consensus Motifs from Literature
  4.2 Monte Carlo Simulations
    4.2.1 Simulation Setup
    4.2.2 Monte Carlo Motif Scores
    4.2.3 Important Features of Monte Carlo Results
  4.3 Testing the Two Methods
    4.3.1 Validation of MSEE
    4.3.2 Validation of Alternate Method
    4.3.3 Comparison of MSEE and Alternate Method
  4.4 Further Work
    4.4.1 Motif Alignment of Different Gapped Results
    4.4.2 Motif as Seed to Probabilistic Methods
    4.4.3 Extracting Motifs with Complex Structures
    4.4.4 Motif Significance
  4.5 Conclusions
A All Motifs found with MSEE
B Optimization Comparison Tests

List of Figures

1-1 GAL4 transcription factor binding to DNA[3]. The thin bands show the basic structure of the GAL4 protein. The darker edges are the subunits of GAL4 that directly bind to the genomic DNA. This figure shows two molecules of GAL4 binding to DNA.
2-1 Syllabic Structure of GAL4
2-2 Flow Chart of Motif Discovery by MSEE
2-3 Expansion of small word with 1 wildcard
2-4 Log Number of expansions of a given word as a function of gap width and the number of wildcards
2-5 Object Diagram of Inheritance
2-6 Introducing the Z Placeholder
2-7 Flow Chart of Alternate Implementation
3-1 Normal Distribution Approximation of Number of Successes. We calculate the area of the shaded box which is equivalent to the probability of m or more successes. p is the mean number of successes.
3-2 Distribution of Set Sizes
3-3 Monte Carlo Simulation Strategy. For each sequence set size chosen, we generate 30 random sets of that size and run MSEE on these sets.
4-1 Scores of the Best Motifs from Monte Carlo Simulations. Both mean and standard deviation from the mean decrease as the number of sequences in the set increases.
4-2 Number of bases in Monte Carlo sets vs. ChIP sets. The significant outliers are circled in red.
4-3 Histogram of Scores for Positive Sequence Set Sizes of 4 and 180. Distributions are noticeably non-Gaussian.
4-4 Motif Scores from Monte Carlo Sets versus ChIP sets
B-1 Comparing hash function performance between optimized and non-optimized versions of algorithm

List of Tables

1.1 Consensus motifs that have gaps. These motifs follow the IUPAC abbreviations for nucleotide subsets. Please refer to Table 2.1 in Section 2.1.
2.1 IUPAC Map. This table lists IUPAC notation for the corresponding subset of nucleotides and the integer assigned to each subset.
2.2 Set of matching wildcards for each nucleotide. There are four wildcard possibilities for each base.
4.1 Consensus motifs containing gaps from Harbison et al.[4] listed by the transcription factor (TF), the consensus motif, and the enrichment score. The motifs for which the enrichment score is 0.0 were obtained from the literature and were not found experimentally in Harbison et al.
4.2 Consensus motifs containing gaps from Kellis et al.[5]. The motif conservation score (MCS) is listed alongside the motif.
4.3 Number of Positive Intergenic Sequences and Total Number of bases in these sequences for Transcription Factors with gapped consensus motifs. Sorted by Number of Positive Sequences.
4.4 Scores from Monte Carlo Simulations. The last column "Conf" refers to the confidence level achieved if we set a threshold of one standard deviation away from the mean score.
4.5 Motifs found using MSEE for GCN4, ABF1, UGA3 in Rapamycin, RGT1, HSF1 in Heat Shock, GAL4 in Galactose, PUT3, STB4, STP1, SUT1, and SOK2 after 14 hour Butanol treatment. The default condition is YPD.
4.6 Comparing MSEE motifs with Consensus Motifs. The "Match?" column displays a y - pattern and structure match; p - partial pattern match but structure mismatch; n - mismatch.
4.7 Motifs found using Alternate Method for GCN4, RGT1, PUT3, ABF1, STB4, GAL4 in Raffinose, HSF1, UGA3 in Rapamycin, STP1, SUT1, and SOK2 after 14 hour Butanol treatment.
4.8 Comparing motifs from alternate method with Consensus Motifs. The "Match?" column displays a y - pattern and structure match; p - partial pattern match but structure mismatch; n - mismatch.
4.9 Top motifs found for ABF1

Chapter 1

Introduction

An important area of computational biology deals with analyzing function and regulation encoded in an organism's genome. In particular, this thesis examines the specific interaction between genomic DNA and regulatory proteins such as transcription factors. Transcription factors are proteins that are known to regulate when and how DNA is transcribed. They work by binding to regions near a gene's transcription start site, called promoter regions, thus inducing or repressing transcription. Transcription factors bind to specific DNA sequences, or motifs, that depend on the protein structure of the transcription factor and can vary in length and number. Motif discovery refers to the search for and identification of these specific DNA sequences.

1.1 Overview of Motif Discovery Methods

There are essentially two steps to motif discovery[6]. The first is identifying where in the genome these motifs exist. In other words, this involves finding regions in the DNA to which transcription factors bind. Once these regions have been identified, the second step consists of developing a consensus pattern for the motif. There are many existing motif discovery methods that vary greatly in their approach and assumptions. Some popular examples are MEME[7], MDScan[8], BioProspector[9], and Gibbs Motif Sampler[10].

1.1.1 Determining Bound Regions

A variety of methods have been used to isolate sets of sequences which are expected to bind to a transcription factor.

Sequence Conservation

Some methods use sequence conservation across different species (for instance, across various Saccharomyces species) to identify highly conserved regions, arguing that these regions are more likely to contain areas vital to the transcription of key genes[11]. These regions may contain the motifs for any number of transcription factors. For example, Kellis et al.[5] examine sequence conservation across four different yeast species to determine which areas are most likely to be protein coding regions or regulatory elements. This method has proved especially useful for species that show large variability. However, it has limited utility if sequences between species cannot be aligned well or if conservation data for either or both species does not exist.

Clustering Methods

Other methods use clustered expression profiles to approximate where a transcription factor binds[12]. The expression profiles are obtained from DNA microarray experiments which are performed under various conditions. This method hypothesizes that genes that are co-expressed are controlled by the same transcription factors. Various clustering methods, including K-Means clustering and Bayesian clustering[13], are used to group together genes that are co-expressed. The promoter sequences of these genes are then isolated for motif extraction.

Chromatin ImmunoPrecipitation Experiments

Another method of pruning our search space involves an experimental technique called Chromatin ImmunoPrecipitation (ChIP), where antibodies are used to isolate regions of DNA bound by a certain transcription factor. ChIP pinpoints the regions containing the motif for that transcription factor (i.e., bound intergenic regions). It does this
It does this 14 by fixing the transcription factors to the DNA regions to which they bind and then isolating these bound regions. Many motif discovery methods use the bound regions discovered from ChIP experiments as input including MDScan[8]. Once we have successfully isolated sequences where there is a high probability of a motif being present, we then apply a motif extraction method. These methods fall into two categories: enumeration and probabilistic selection. 1.1.2 Enumeration Methods Enumeration methods, or word counting methods, systematically scan through a set of DNA sequences and look for overrepresented words. Many enumeration methods take into account variability in a motif by including wildcards in their subsequence matching. This allows the methods to better characterize different motifs since a given motif may differ slightly in its various manifestations. The main disadvantage of this general method is the amount of memory required for the algorithm. However, optimizing the algorithm in this respect can improve its performance and make it an efficient method of motif extraction. 1.1.3 Probabilistic Methods There are many methods which take a statistical approach to motif discovery. The following discusses the Position Specific Score Matrix(PSSM) Model[11] and the Markov Chain Model of a motif. Position Specific Score Matrices The PSSM model of a motif consists of independent multinomial distributions at each position in the motif. Various probabilistic methods use a seed PSSM in order to approximate the most likely PSSM for a motif on a given sequence. One such method is the Expectation Maximization(EM) method which is a two-step iterative process resulting in a motif model that has maximum likelihood compared to the background nodel[6, 14]. The EM method is used in particular by MEME[7] which 15 takes as an input a training set of sequences and uses this set as the seed for the Expectation Maximization. 
Another similar method is Gibbs sampling, which uses a set of parameters to define a motif model. This method then estimates values for this set of parameters such that the model best characterizes the data[15].

Markov Chain Model

Other methods use Markov chains to model dependencies between positions in a motif. Because the PSSM model of consensus motifs assumes that each position in the motif is independent of the others[16], many methods instead use Markov chain models, which incorporate the idea of dependence. One example of this technique is the motif discovery method LOGOS[17], which uses Hidden Markov Dirichlet-Multinomial Models and prior biological knowledge of known motifs to model positional dependencies within the motifs. Other examples include Gibbs sampling methods which estimate maximum likelihood motif models using these Markov chains[14].

Other motif extraction methods exist which use a variety of statistical methods and which perform efficiently in differing circumstances. One such method finds motifs using random projections[18]. This has proven to work very efficiently in finding motifs of large width and sufficient variability.

1.1.4 Determining Statistical Significance

Once the motif extraction algorithm has identified potential motif models, the next step is to estimate the significance of these candidate motifs using statistical methods. One popular method is the Random Selection Null Hypothesis[11], which states that motifs must be sequences that are significantly overrepresented in the positive intergenic regions. This method produces a significance score for each candidate motif which is derived from the hypergeometric distribution. This thesis uses an approximation to this scoring method, which is discussed in Section 3.1.

Another method uses Monte Carlo simulations to determine the expected score of a motif given a random input set of sequences.
The score of the candidate motif must be higher than this expected score in order for it to be considered significant. This method can also be used as a measure of the effectiveness of the given motif scoring method. This is discussed in Section 3.2. Other methods use a variety of scoring methods by which they rank their motifs[19].

While many of these methods have found new motifs in a variety of genomes, very few use a priori knowledge about the motif structure to perform a search. A recent method developed by Sandelin and Wasserman[20] incorporates structural knowledge of a set of transcription factors by creating families of factors with similar structures and incorporating these families into motif discovery methods. This thesis explores another method that uses structural information to find a more complex set of motifs efficiently.

1.2 Overview of this Thesis

The method implemented in this thesis uses data from Chromatin ImmunoPrecipitation (ChIP) experiments to identify regions that are expected to contain a motif and enumerates all potential motifs, both contiguous and multi-syllabic, in those regions. Using a priori knowledge about the structure of the motif, we efficiently extract motifs that are associated with the transcription factors of S. cerevisiae.

This is an extension of the method described in Takusagawa et al.[1, 2]. The Takusagawa method searches in both positive regions, areas that are believed to have a motif, and negative regions, areas that are believed not to have a motif. The statistical significance of a motif found in the positive sequences is then determined by ensuring that it is not found in any of the negative sequences, as defined by the Random Selection Null Hypothesis[11]. However, the Takusagawa method was able to extract only contiguous motifs, not multi-syllabic motifs. More specifically, the motifs extracted were at most seven bases wide with a maximum of two wildcard positions, or positions of variability.

Many transcription factors are known to be dimers[11], consisting of two proteins joining to form a complex. Each protein subunit binds to a different DNA subsequence, with the subsequences being separated by some number of bases. Arguably, this is due to the helical nature of DNA; a binding protein binds on one side of the helix, thus touching two parts of the sequence and omitting a section in the middle where the helix turns, as shown in Figure 1-1.

[Figure 1-1: GAL4 transcription factor binding to DNA[3]. The thin bands show the basic structure of the GAL4 protein. The darker edges are the subunits of GAL4 that directly bind to the genomic DNA. This figure shows two molecules of GAL4 binding to DNA.]

Table 1.1 shows a subset of consensus motifs in Harbison et al.[4] that have substantial gaps. Here, the wildcard n is used to indicate the set of bases {A,C,G,T}, or the instance of a gap.

    Transcription Factor    Consensus Motif
    ABF1                    TCAYTnnnnACG
    GAL4                    CGGnnnnnnnnnnnCCG
    HSF1                    TTCYAnnnnnTTC
    RGT1                    CGGAnnA
    SOK2                    TGCAGnnA
    STB4                    TCGGnnCGA
    STP1                    RCGGCnnnRCGGC

Table 1.1: Consensus motifs that have gaps. These motifs follow the IUPAC abbreviations for nucleotide subsets. Please refer to Table 2.1 in Section 2.1.

1.3 Goals of this Thesis

The goals of this thesis are:

* find motifs which have a fixed length gap (dimer) in a Saccharomyces species
* find motifs containing a gap of undetermined length and find the best fixed length for that gap between syllables
* determine which motifs are statistically significant using an approximation to the Random Selection Null Hypothesis
* further examine the significance of our scoring method using Monte Carlo simulations on random sets of sequences

In order to reduce the overall computational time, we ran the algorithm on subsections of the data in parallel.

1.4 Contributions of this Thesis

This algorithm accurately predicts a much larger range of transcription factor binding sites, including both contiguous motifs and multi-syllabic motifs with gaps of fixed length, than any other existing enumeration method. Since it employs a priori knowledge about motif structure, the predicted motifs can be used as seed motifs for other likelihood maximization methods, including Gibbs sampling and Expectation Maximization. Predicting these sites would greatly aid further genomics research on DNA regulatory networks. Finally, we hope to produce software that is accessible through the Internet for public use.

Chapter 2

Multi-Syllabic Expansion and Enumeration (MSEE)

This chapter describes the implementation of our motif extraction method, entitled Multi-Syllabic Expansion and Enumeration (MSEE).

2.1 Terms and Definitions

Here, we define the terms most commonly used in this thesis. Many of these definitions are common to several motif discovery methods. We introduce new terminology to help describe this particular motif discovery method, MSEE.

Sequence - A sequence is defined to be any run of consecutive bases of DNA. In general, when we refer to a sequence, we mean a substantial number of consecutive bases, ranging from 20 to approximately 1000 bases.

Motif - A motif is a subsequence in the DNA to which a transcription factor binds. We require such a subsequence to be at least 3 base pairs wide and at most as wide as the sequence in which it occurs. Motifs may have bases where there is some variability across instances of the motif; a transcription factor may still bind to the motif even if a base differs from one instance of the motif to another.

Intergenic region - Intergenic regions are the subset of sequences that lie between the coding regions, or open reading frames (ORFs), in the DNA (i.e., they are not transcribed into mRNA). In a genome, these regions vary greatly in length.
These intergenic sequences are areas of interest since they contain the motifs to which transcription factors are bound. We further define a region as a positive intergenic region if we know that a given transcription factor binds to it. Data from ChIP experiments approximate these positive regions. A negative intergenic region is a region to which a given transcription factor does not bind. Positive and negative intergenic regions are determined from ChIP experiment data obtained from Ren et al.[21].

Word - A word is defined as a candidate motif in this thesis. A candidate motif is any set of contiguous bases in a sequence that potentially satisfies the structural considerations for a motif as described above. This algorithm, in short, iterates through all words and counts their occurrences to determine the best motif candidate. The terms word and candidate motif will be used interchangeably.

Wildcard - A wildcard refers to a variable base. We use this term to describe a base which can be any one of two, three, or four bases without affecting the binding of the transcription factor to the given motif. A list of the eleven different wildcards we can have in a given word, and the IUPAC code for each set, is given in Table 2.1. In order to reduce the complexity of our algorithm, we limit our method to the use of only seven of the eleven wildcards listed. The four bases A, C, G, and T and the seven wildcards are each assigned a specific number.

Syllable - In certain cases, a word may have multiple subunits. Such words may be motifs that contain small regions of bases that do not affect the binding of the given transcription factor. For example, the sections of the GAL4 motif are shown in Figure 2-1. In this figure, the word is said to be divided into syllables which are separated by a gap. Each syllable must be at least 2 base pairs wide.
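
The wildcard and syllable definitions above can be made concrete with a short sketch. The following Python fragment is illustrative only and is not taken from MSEE: the names `IUPAC`, `matches`, and `syllable_spans` are hypothetical, and the code assumes the convention used in this thesis that a lowercase n marks a gap position while uppercase letters are bases or wildcards within a syllable.

```python
# Illustrative sketch only: all names here are hypothetical, not MSEE's own code.

# The four bases and the seven wildcards used by MSEE (see Table 2.1),
# each mapped to the set of nucleotides it stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
    "N": "ACGT",
}


def matches(motif, word):
    """Return True if every base of `word` is allowed at that motif position.

    A lowercase 'n' (gap position) is treated like the wildcard 'N'.
    """
    return len(motif) == len(word) and all(
        base in IUPAC[symbol.upper()] for symbol, base in zip(motif, word)
    )


def syllable_spans(motif):
    """Return (begin, end) offsets, end exclusive, of each syllable in `motif`.

    A syllable is a maximal run of positions that are not gap positions ('n').
    """
    spans, begin = [], None
    for i, symbol in enumerate(motif):
        if symbol != "n" and begin is None:
            begin = i                      # a new syllable starts here
        elif symbol == "n" and begin is not None:
            spans.append((begin, i))       # the gap closes the current syllable
            begin = None
    if begin is not None:
        spans.append((begin, len(motif)))
    return spans


GAL4 = "CGG" + "n" * 11 + "CCG"
print(syllable_spans(GAL4))                   # [(0, 3), (14, 17)]
print(matches("TCAYTnnnnACG", "TCACTGGGGACG"))  # True: Y allows C
```

Storing only the syllable offsets, rather than the gap itself, mirrors the idea that the gap carries no information about binding.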

    IUPAC abbrev    Nucleotides    Map to Int
    A               {A}            0
    C               {C}            1
    G               {G}            2
    T               {T}            3
    M               {A,C}          4
    R               {A,G}          5
    W               {A,T}          6
    S               {C,G}          7
    Y               {C,T}          8
    K               {G,T}          9
    V               {A,C,G}        -
    H               {A,C,T}        -
    D               {A,G,T}        -
    B               {C,G,T}        -
    N               {A,C,G,T}      10

Table 2.1: IUPAC Map. This table lists IUPAC notation for the corresponding subset of nucleotides and the integer assigned to each subset.

    Syllable 1              Gap 1              Syllable 2
    | C G G | n n n n n n n n n n n | C C G |

Figure 2-1: Syllabic Structure of GAL4

Gaps are implemented as regions of contiguous wildcards; a gap that has a width of one is equivalent to a single variable base. Gaps can be as wide as is necessary, with the upper bound being the width of the entire motif. A motif that has multiple syllables is thus referred to as a multi-syllabic motif. In the rest of this thesis, bases that are part of a gap will be represented by n, while bases that are actual wildcards within the motif will be represented by N. MSEE can be used to find motifs with one or more gaps. However, this thesis only reports the results for motifs with a single gap (see Section 4.3).

Expansion - This refers to the enumeration of all words that match the candidate motif being examined. This is discussed in greater detail in Section 2.2.1.

2.2 MSEE: An Overview

In the first step of MSEE, we isolate the intergenic regions to which a given transcription factor binds. We do this by analyzing data from ChIP experiments and compiling a list of probe sequences to which the transcription factor binds. We label these as the positive intergenic regions.

For each intergenic sequence, MSEE then exhaustively enumerates the candidate motifs of a given length and structure using the following procedure. It examines a window of bases of the size of the expected motif and stores the candidate motif. It then slides the window by a single base and records the next such word. This continues for the length of the intergenic sequence. When examining a candidate, MSEE takes into account: 1. the number of wildcards allowed, and 2.
the motif structure, i.e., the width and placement of syllables and gaps.

In order to account for wildcards when examining a word, MSEE expands the word to all its motif possibilities and stores these expansions. Taking into account the gaps we expect in the word, we can reduce the complexity of both the expansion and the storing.

This procedure is repeated for a given structure on all the intergenic regions. These results are later used to determine which candidate motifs are significant, based on their occurrences in the positive intergenic regions against their occurrences in all intergenic regions. The tests for significance are discussed in Section 3.1.

The flow chart in Figure 2-2 summarizes the main function of this algorithm. This emulates the structure of the algorithm found in Takusagawa[2].

[Figure 2-2: Flow Chart of Motif Discovery by MSEE. Boxes in the chart: Positive Intergenic Sequences; All Intergenic Sequences; Number of Wildcards; Motif Structure; Word Expansion and Enumeration; Word Counts (Positive Regions); Word Counts (All Regions); Test for Significance; Motifs.]

2.2.1 Examining a Word

As mentioned previously, there are two parameters by which MSEE examines a given word: the number of wildcards and the size and placement of gaps. Full enumeration of a word includes expanding both the word itself and its complement on the opposite strand.

Expanding a Word

First, MSEE expands each subsequence to all its motif possibilities given the number of wildcards allowed. It then checks each expanded manifestation of the word to determine whether or not it has been seen previously and tallies its occurrence. Using the set of wildcards as defined previously in Table 2.1, we define a set of expanding rules as shown in Table 2.2. When expanding a word, we must allow a wildcard in each possible position of the word.

    Nucleotide    Expands to
    A             {A,C}, {A,G}, {A,T}, N
    C             {A,C}, {C,G}, {C,T}, N
    G             {A,G}, {C,G}, {G,T}, N
    T             {A,T}, {C,T}, {G,T}, N

Table 2.2: Set of matching wildcards for each nucleotide. There are four wildcard possibilities for each base.

Figure 2-3 shows an example of expanding a word that is three base pairs long with 1 wildcard. As we increase the number of wildcards, the number of ways we can place w wildcards in a word of length m is $\binom{m}{w}$.

    A C G  expands to:

    {A,C} C G      A {A,C} G      A C {A,G}
    {A,G} C G      A {C,G} G      A C {C,G}
    {A,T} C G      A {C,T} G      A C {G,T}
    N C G          A N G          A C N

Figure 2-3: Expansion of small word with 1 wildcard

The more wildcards we add, the more space our algorithm takes in order to account for all expansions for each motif. Equation 2.1 gives the number of expansions done on a sequence s of length n, with motif width m and w wildcards.

\[ \mathrm{num\_expansions}(s) = (n - m)\binom{m}{w}4^{w} \qquad (2.1) \]

This method iterates through all positions of the word to place wildcards and expands each instance accordingly. As we see, the number of expansions increases exponentially with each additional wildcard, which limits our ability to search for a large set of motifs. Therefore, we need further prior information about the motif in order to reduce the added complexity.

Processing Gaps in a Word

Ignoring the bases in the gapped areas during expansion reduces the number of expansions per word and thereby reduces the space required by MSEE. MSEE therefore uses the motif structure input to identify the location of the syllables in the candidate motif. During expansion, MSEE only expands those bases in the syllables and avoids the unnecessary expansion of the gaps.

Expansion through Recursion

MSEE stores the syllabic structure by defining the beginning and ending position offset of each syllable within the word. Using this, the algorithm creates clear boundaries around the areas it needs to expand, and skips from one area to the other.
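
The per-word expansion count discussed above (the number of ways to choose w wildcard positions, times four wildcard choices per chosen position) can be checked by brute force on small words. The sketch below is an assumption-laden illustration, not MSEE's implementation: it uses `itertools` rather than recursion, the names `WILDCARDS` and `expansions` are hypothetical, and it enumerates words with exactly w wildcards so that the count can be compared directly against the binomial expression.

```python
from itertools import combinations, product
from math import comb

# Table 2.2 written with IUPAC codes: M={A,C}, R={A,G}, W={A,T},
# S={C,G}, Y={C,T}, K={G,T}, N={A,C,G,T}.
WILDCARDS = {"A": "MRWN", "C": "MSYN", "G": "RSKN", "T": "WYKN"}


def expansions(word, positions, w):
    """List every expansion of `word` with exactly `w` wildcards.

    `positions` are the offsets that may receive a wildcard; passing only the
    syllable offsets skips the gap, which is the saving described in the text.
    """
    results = []
    for chosen in combinations(positions, w):
        # Each chosen position can take any of its four matching wildcards.
        for codes in product(*(WILDCARDS[word[i]] for i in chosen)):
            chars = list(word)
            for i, code in zip(chosen, codes):
                chars[i] = code
            results.append("".join(chars))
    return results


# The three-base word of Figure 2-3 with one wildcard: C(3,1) * 4 expansions.
print(len(expansions("ACG", range(3), 1)) == comb(3, 1) * 4)

# An eight-base word whose two middle bases form a gap: only the six
# syllable positions are expanded, giving C(6,2) * 4**2 expansions.
print(len(expansions("CGGATCCG", [0, 1, 2, 5, 6, 7], 2)) == comb(6, 2) * 4 ** 2)
```

Both checks print True, which agrees with the count of twelve expanded words shown in Figure 2-3.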
Skipping the gaps greatly reduces the number of expansions for a given sequence s, as we see in Equation 2.2, where n is the length of the current sequence, m is the width of the word including the gapped areas, g is the total width of all gaps in this word, and w is the number of wildcards used in this instance.

\[ \mathrm{num\_expansions\_excluding\_gaps}(s) = (n - m)\binom{m - g}{w}4^{w} \qquad (2.2) \]

For example, for a word of width m = 20 with w = 2 wildcards, excluding gaps of total width g = 11 reduces the number of expansions per word from $\binom{20}{2}4^{2} = 3040$ to $\binom{9}{2}4^{2} = 576$.

Figure 2-4 shows the number of expansions that are done with an increasing number of wildcards as well as an increasing gap width. We observe in Figure 2-4 that as the number of wildcards increases, the number of expansions per word increases exponentially, but that this is effectively counteracted by the gaps we introduce into the system. This shows very clearly that including knowledge about gaps can make enumeration much faster.

[Figure 2-4: Log Number of expansions of a given word as a function of gap width and the number of wildcards. Plot title: Number of Expansions for word of length 20. Axes: Number of Wildcards, Gap Width, and number of expansions on a logarithmic scale.]

The two procedures are mutually recursive. On one level, MSEE recurses along the bases in a syllable to insert wildcards. On a higher level, MSEE recurses along syllables in order to avoid gaps. The following pseudocode demonstrates the recursion through which we iterate over the syllables and insert wildcards accordingly to create the expanded word.

    syllable-iterator (word)
        if no-more-syllables OR no-more-wildcards
            update-counts (word)    % see Section 2.4
        else
            get next syllable and expand-syllable (word, 0)

    expand-syllable (word, base-offset)
        if (base-offset == end-of-syllable) OR no-more-wildcards
            syllable-iterator (word)
        else
            expand-syllable (word, base-offset + 1)
            insert-wildcard at base-offset

Reversing and Complementing

MSEE examines both the word in the sequence and the reverse complement of the word.
This is done under the assumption that transcription factors do not distinguish between strands. Therefore, a motif on one strand may be found on its opposing strand, and both are potential sites for transcription factor binding. There are two ways of enumerating the complement word. The default method is to produce the reverse complement of the word and repeat the same procedure above on the new word. This is implemented in the following way.

    expand (word)
        syllable-iterator (word)
        rc-word = reverse and complement of word
        syllable-iterator (rc-word)

As a result, MSEE outputs both the motif and its reverse complement. The exception to this is a palindromic motif, meaning one whose reverse complement is identical to the original, as we see in GAL4.

2.3 Object Oriented Implementation

Since the overall goal of MSEE is to perform motif extraction for a range of motif structures, we decided to define a system of objects, each of which implements a different type of motif discovery: motifs with no gaps, motifs with gaps of a fixed width, and motifs with gaps of variable width. Each of these objects has the common interface Expandable. Figure 2-5 is an object diagram of the system.

Figure 2-5: Object Diagram of Inheritance (Fixed Length Gaps, Variable Length Gaps, and Complex Motif Structures each implement Expandable)

2.3.1 Expandable Interface

The common interface Expandable dictates the common functions of all motif discovery algorithms. Its sole method, go, initiates the recursion by which each word is expanded, since this method is common to all of its implementing objects. The sole parameter of go is the composite width of the motif including all possible gaps. This interface is implemented by the class Fixed Gap, which is responsible for enumerating motifs with gaps of a fixed width.

2.3.2 Fixed Gap class

The Fixed Gap class implements the Expandable interface and is responsible for running the word counting algorithm for motifs with gaps of fixed length.
In the Fixed Gap instance, the method go is equivalent to syllable-iterator described in Section 2.2 (Expansion through Recursion). Additionally, this class stores the motif structure as a list of Syllable objects. Each Syllable object contains the beginning and ending index of the given syllable as well as the length of the syllable. The gaps are inferred as the space in between syllables. This class also contains the data structure which stores the word counts.

2.4 Data Structures

2.4.1 HashMap

The data structure that MSEE uses to store all previously examined words is a HashMap as implemented in the C++ standard library[22]. For each candidate motif, a hash key is computed using the hash function described below. The hash function was designed to minimize collisions. The HashMap class provides value lookups in constant time on average and dynamically allocates and deallocates space as needed for insertions and deletions.

When processing a word, MSEE first checks the hash map to see if it has been processed previously. The hash key for the word is computed from the hash function below, and the corresponding value is retrieved; this value is the number of occurrences of that word. If the word does exist in the map, MSEE then increments the corresponding value. Otherwise, MSEE inserts this new word into the map and sets its initial count value to 1.

Given the requirements of MSEE, the hash map is the best available data structure. Alternative structures such as sorted arrays or tree-based maps perform lookups in logarithmic time at best, which would be too slow for our application[22].
    motif-counts = HashMap

    update-counts (word)
        if word has been seen previously
            motif-counts[hash-function (word)]++;
        else
            motif-counts[hash-function (word)] = 1;

    hash-function (word)
        sum = 0;
        multiplier = 11;
        length = size(word);
        for i = (length - 1); (0 <= i); --i
            sum *= multiplier
            sum += word[i];    % base at 'ith' offset in word
        return sum;

2.5 Optimizing the Algorithm

Introducing gaps and syllables in our model of a motif increased the complexity of motif extraction, and we introduced certain optimizations to counter this increased complexity. The greatest time sink in the algorithm occurs during the evaluation of the hash function; this function takes up the most time for each sequence. When the size of the word is increased in order to include syllables and gaps, the size of the word to be hashed increases as well. In the original Takusagawa algorithm, this was limited to 7 or 8 base pairs; now, there is a potentially unlimited number of base pairs since there is no limit on how large the gaps between syllables can be. As the gaps get longer, we want to ensure that our run time does not increase at the same rate. Therefore, we needed a technique to eliminate the information about the gaps when hashing in a way that we could still easily reconstruct the final motif at termination of the algorithm. We accomplished this by introducing the "Z" base, which acts as a gap placeholder. Figure 2-6 shows the different options for this optimization: we can either keep Z as a placeholder or remove it entirely. The figure illustrates both options on the GAL4 motif.

Figure 2-6: Introducing the Z Placeholder. The original word CGGnnnnnnnnnnnCCG is stored in the HashMap as CGGCCG (no placeholder) or as CGGZCCG (placeholder Z added).

2.5.1 The Z modification

After expanding a word with the given wildcards, the word contains Ns in all the gapped areas, in addition to any Ns that are wildcards within a given syllable (the latter are not defined as gaps).
We replace all the Ns that are part of gaps with a placeholder Z. This condenses each gap to a single letter, which serves as a placeholder for efficient re-expansion at the termination of the algorithm. We can re-expand the word using the motif structure we defined at the beginning of the algorithm, with the placeholder Z indicating the original positions of the gaps. In addition, we implemented the option of removing the placeholder Z.

2.6 Alternate Implementation

An alternate implementation was proposed to speed up the tests. This method is a combination of the method used by MSEE and an update to the method described in Takusagawa[2]. While MSEE performs motif discovery on the sets of sequences themselves, this alternate implementation enumerates motifs from sets of dictionaries that have been compiled previously from the sequences. A dictionary is defined as a list of words that have a given structure but no wildcards introduced. This hybrid approach uses the method described above to generate dictionaries of the positive intergenic sequences. These dictionaries and dictionaries compiled from all intergenic sequences are used as input to the updated method from Takusagawa, which runs motif discovery using enumeration. The flow chart in Figure 2-7 describes this alternate method. Creating dictionaries prior to expansion removes a substantial amount of computational work, thus enabling the overall method to run faster.

In addition to the dictionary modification, the method implemented in Takusagawa[2] uses a shortcut that avoids reversing and complementing the entire word. Instead, it chooses one instance (either the original or the reverse complement) and only stores that instance. Specifically, it chooses the instance that is canonically higher. When it comes across a word, it determines, from the first base pair of the original and its reverse complement, which version is canonically higher and expands only that version.
This eliminates the need to reverse and complement the entire subsequence and perform expansion on both instances. This optimization reduces the space required by the algorithm by halving the number of candidates we must keep track of. However, some problems may arise when using this method. Palindromic motifs, such as GAL4, are counted only once though they occur twice, once each for the forward and reverse strands. While MSEE takes up nearly twice as much space, the redundancy allows for accuracy in the enumeration. We discuss the difference between these two methods in the Results section (Section 4.3).

2.7 Developing a Test Suite

We performed a series of validation tests of MSEE to ensure that it worked correctly. We isolated from the literature a list of motifs in S. cerevisiae that had one or more gaps and used a range of motif structures as a primer to run the method. Each motif structure consisted of 2 syllables with 1 gap between them. Each syllable had a width of 4 bases and each gap ranged from 0 to 20 bases long.

A similar set of validation tests was performed for the second method. Again, we used a range of motif structures as described above as a primer to run the method. This was run on all the transcription factors that were found to bind to yeast intergenic regions.

For both methods, a set of tests was derived to determine which words were significant based on their occurrences in the positive intergenic regions and in all intergenic regions. These tests are described in Section 3.1.

We then ran a series of Monte Carlo simulations to evaluate the effectiveness of our scoring method. These simulations were run to ensure that the motifs found using the above methods were much more significant than those motifs found at random. This is described in detail in Section 3.2.
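The motif structures used as primers in this section, and the Z placeholder of Section 2.5.1, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the thesis implementation (which the text describes as C++): the names `Syllable`, `two_syllable_structure`, `composite_width`, and `condense` are hypothetical, the ending index is assumed to be exclusive, and a gap of width 0 is assumed to produce no placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    """Hypothetical stand-in for the Syllable object of Section 2.3.2."""
    begin: int  # offset of the first base of the syllable within the word
    end: int    # offset one past the last base (exclusive; an assumption)

    @property
    def length(self) -> int:
        return self.end - self.begin


def two_syllable_structure(syllable_width: int, gap_width: int) -> list[Syllable]:
    """Syllable(w) : Gap(g) : Syllable(w); gaps are implied by the offsets."""
    second_begin = syllable_width + gap_width
    return [
        Syllable(0, syllable_width),
        Syllable(second_begin, second_begin + syllable_width),
    ]


def composite_width(structure: list[Syllable]) -> int:
    """Width of the whole word, including the gaps between syllables."""
    return structure[-1].end


def condense(word: str, structure: list[Syllable], placeholder: str = "Z") -> str:
    """Keep only the syllables; each non-empty gap becomes one placeholder."""
    pieces = [word[structure[0].begin:structure[0].end]]
    for previous, current in zip(structure, structure[1:]):
        if current.begin > previous.end:  # a gap lies between the two syllables
            pieces.append(placeholder)
        pieces.append(word[current.begin:current.end])
    return "".join(pieces)


# The structures used as primers: syllable width 4, gap widths 0 through 20.
structures = [two_syllable_structure(4, gap) for gap in range(0, 21)]
for gap in (0, 10, 20):
    offsets = [(s.begin, s.end) for s in structures[gap]]
    print(gap, offsets, composite_width(structures[gap]))

# GAL4-like word from Figure 2-6: CGG, eleven gapped bases, CCG.
gal4_word = "CGG" + "n" * 11 + "CCG"
gal4_structure = two_syllable_structure(3, 11)
print(condense(gal4_word, gal4_structure))                  # placeholder kept
print(condense(gal4_word, gal4_structure, placeholder=""))  # placeholder removed
```

Storing only syllable offsets keeps the hashed key short no matter how wide the gap is, which is the motivation given for the Z modification.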
Figure 2-7: Flow Chart of Alternate Implementation. Dictionaries are created from all intergenic sequences and from the positive intergenic sequences using the motif structure; word expansion (given the number of wildcards) and enumeration from the dictionaries yield word counts for the positive regions and for all regions, which are then tested for significance to produce motifs.

Chapter 3

Significance of a Motif

3.1 Probabilistic Model

After obtaining the number of occurrences for each possible motif in both the positive intergenic sequences and all the intergenic sequences, the next step is to determine whether or not a given motif is significant. We use the Random Selection Null Hypothesis method as defined in Barash, et al[11]. We define significant to mean that a given motif is more frequent in the positive intergenic regions than in all intergenic regions. We iterate through all the candidate motifs found in the positive regions and determine their significance.

Using statistical methods, we can estimate the probability that a candidate motif occurs randomly in a positive intergenic sequence. The smaller this probability, the more significant the motif. The most accurate way to do this is to use the hypergeometric distribution to calculate the significance.

3.1.1 Hypergeometric Model

The hypergeometric model estimates the probability that a candidate motif occurs randomly at least a given number of times in a small window. We set this window to be the combined length of the positive intergenic sequences. This is equivalent to calculating the tail of the hypergeometric distribution with the following parameters:

n -- the total number of places that the motif can occur in the set of positive intergenic regions. This is equivalent to the cumulative sum of all the bases in these regions minus an edge correction factor which accounts for the structure of the motif.
N -- the total number of places that the motif can occur in all intergenic regions.
Again, this is equivalent to the number of bases in all intergenic regions minus the same edge correction applied to n.
m -- the total number of occurrences of the candidate motif in the positive regions.
M -- the total number of occurrences of the candidate motif in all intergenic regions.

The probability that the given motif is found m times randomly in the n possible positions is calculated in Equation 3.1 below[23, 24]:

$$P_{hyper}(m \mid n, M, N) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}} \qquad (3.1)$$

We would like to calculate the probability that the motif occurs at least m times. This value, which we will refer to as the p-value, is calculated by summing the tail of the hypergeometric distribution described above, as shown in Equation 3.2:

$$p\text{-value} = \sum_{i=m}^{\min(M,n)} P_{hyper}(i \mid n, M, N) \qquad (3.2)$$

For this distribution, the expected value of the number of successes[23, 25] is $\mu = \frac{nM}{N}$, while the standard deviation is $\sigma = \sqrt{n\,\frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N-n}{N-1}}$.

This sum is computationally expensive since there are a large number of elements in the summation, and therefore it appeared to be infeasible for our application. We therefore looked to different models which approximate the hypergeometric model.

3.1.2 Binomial Model

In order to reduce the computational complexity, we used the binomial model to approximate the hypergeometric model[25]. This approximation is done using the following transformation under the following assumptions. The hypergeometric model can also be modelled as a sum of binary random variables $X_i$, where each variable takes a value of 1 to indicate a success. These random variables are dependent on each other since the probability that the next event is a success depends on how many previous events were successes. However, if the sample that we draw is small relative to the total number of objects, then the probability that the ith object is a success varies only slightly with the outcomes of the i - 1 previous objects.
Here, the probability that the ith object drawn is a success is close to being independent of all other draws. We can therefore approximate the hypergeometric distribution as a sum of independent Bernoulli variables, which is a binomial distribution. Specifically, if the number of potential placements in the positive regions, n, is much smaller than the number of potential motif placements in all regions, N, then sampling without replacement (hypergeometric distribution) is approximately equivalent to sampling with replacement (binomial distribution).

This model defines the parameters of the binomial distribution by assigning a probability to the event that a given candidate motif occurs randomly in all the intergenic sequences (i.e., the probability of a "success")[26]. We then calculate the probability that such a candidate motif occurs randomly in a given positive sequence. The parameters for this model are similar to those for the hypergeometric model.

n, m -- defined above.
p -- the probability that the candidate motif occurs randomly in any intergenic sequence = occurrences in all intergenic regions / possible occurrences in all intergenic regions = M/N

The binomial distribution in Equation 3.3 below with the above parameters describes the probability that we will find m instances of the candidate motif randomly in a positive intergenic sequence[26].

$$P_{binomial}(m \mid n, p) = \binom{n}{m} p^{m} (1 - p)^{n-m} \qquad (3.3)$$

The probability of a motif occurring randomly m or more times in the positive sequences, or its p-value, is the sum of the tail of the binomial distribution with the above parameters, as shown in Equation 3.4 below.

$$p\text{-value} = \sum_{i=m}^{\min(M,n)} P_{binomial}(i \mid n, p) \qquad (3.4)$$

For this distribution, the expected value of the number of successes[26] is $\mu = np$, while the standard deviation is $\sigma = \sqrt{np(1 - p)}$. With p = M/N, the expected value equals that of the hypergeometric distribution, and the standard deviation is approximately equal to it when n is much smaller than N.
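To make the relationship between Equations 3.2 and 3.4 concrete, the following Python sketch evaluates both tails for one set of made-up counts. It is a minimal illustration, not the thesis code: the function names are hypothetical, the counts are toy values chosen so that n is much smaller than N, and the sums are evaluated in log space (via `math.lgamma`) as an implementation choice to avoid overflow in the binomial coefficients.

```python
from math import exp, lgamma, log


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), for 0 <= k <= n."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_tail(m: int, n: int, M: int, N: int) -> float:
    """P(at least m occurrences) under the hypergeometric model (Equation 3.2).

    Assumes n - i <= N - M for every term, which holds when n is small
    relative to N.
    """
    log_denominator = log_comb(N, n)
    return sum(
        exp(log_comb(M, i) + log_comb(N - M, n - i) - log_denominator)
        for i in range(m, min(M, n) + 1)
    )


def binomial_tail(m: int, n: int, M: int, N: int) -> float:
    """P(at least m occurrences) under the binomial model (Equation 3.4)."""
    p = M / N  # probability of a "success" at any one placement
    return sum(
        exp(log_comb(n, i) + i * log(p) + (n - i) * log(1.0 - p))
        for i in range(m, min(M, n) + 1)
    )


# Toy counts (hypothetical, not taken from the ChIP data).
N = 1_000_000  # possible placements in all intergenic regions
M = 500        # occurrences in all intergenic regions
n = 2_000      # possible placements in the positive regions
m = 5          # occurrences in the positive regions

exact = hypergeom_tail(m, n, M, N)
approx = binomial_tail(m, n, M, N)
print(f"hypergeometric tail: {exact:.6g}")
print(f"binomial tail:       {approx:.6g}")
print(f"relative difference: {abs(exact - approx) / exact:.3%}")
```

With these toy counts the sampled fraction n/N is 0.2%, so the two tails should agree closely; the gap is expected to widen as n grows relative to N.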
Using these probabilities, we rank how significant each candidate motif is in the same way we did using the probabilities calculated from the hypergeometric distribution. For this model also, we observe that there are quite a few elements in the summation. We would like to calculate this probability even more efficiently. We observe that as n (i.e., the number of potential motif placements in the positive regions) becomes larger, the binomial distribution approaches a normal distribution. We therefore use the normal approximation to the binomial distribution with the same mean and standard deviation.

3.1.3 Normal Approximation to the Binomial Model

By the central limit theorem, we can approximate the binomial distribution by using a normal (Gaussian) distribution[26, 27]. The transformation is described here.

Let $X_i$ be a Bernoulli random variable indicating whether or not the ith motif was randomly found in a positive intergenic sequence.

Let $S_n$ be the sum of Bernoulli random variables $X_1 + \dots + X_n$, thus forming a binomial distribution with n and p defined previously. In other words, $S_n$ is the number of successes out of n motifs and is modeled as a normal distribution. The p-value is equivalent to the probability that $S_n$ is at least as large as the number of times the motif was found in the positive regions only, as shown in Equation 3.5.

$$p\text{-value} = P(m \le S_n) \qquad (3.5)$$

Figure 3-1 displays the distribution of $S_n$. We are calculating the area of the shaded box in the bottom right corner.

Figure 3-1: Normal Distribution Approximation of Number of Successes. We calculate the area of the shaded box, which is equivalent to the probability of m or more successes. $\mu$ is the mean number of successes.

The distribution above is transformed to a standard normal distribution ($Z_n$) in Equation 3.6 by subtracting the mean and scaling by the standard deviation.
$$Z_n = \frac{S_n - \mu - 0.5}{\sigma} = \frac{S_n - np - 0.5}{\sqrt{np(1 - p)}} \qquad (3.6)$$

We implement the 1/2 correction to account for the conversion from discrete to continuous values. We can then write the expression for the p-value in terms of $\Phi$, the CDF of the standard normal distribution[27], as shown in Equation 3.7.

$$P(m \le S_n) = P\left(\frac{m - np - 0.5}{\sqrt{np(1 - p)}} \le Z_n\right) = \Phi\left(-\frac{m - np - 0.5}{\sqrt{np(1 - p)}}\right) \qquad (3.7)$$

3.1.4 Analysis

For each of the candidate motifs found in the positive intergenic regions, we determine the probability that it occurs randomly that many times. We then set a threshold t for what we consider significant. For example, if t = .5, this means that half the time this candidate could be an actual motif but half the time it occurs randomly. We prefer to use a relatively strict threshold, thus eliminating the motifs that are just as likely to occur randomly. A sample threshold might be .05, indicating that the motif should only occur randomly about 5 percent of the time. We run Monte Carlo simulations on shuffled data to choose an appropriate threshold of significance.

3.2 Monte Carlo Simulation Strategy

We perform another set of tests to determine how well our algorithm performs, similar to the Random Sequence Null Hypothesis as described in Barash, et al[11]. We have just devised a way to determine how significant our motif is. However, we would like to evaluate how well our algorithm finds significant motifs. Our algorithm should find motifs that are more significant than those it would find if we searched on a random set of intergenic sequences. In order to determine this, we perform a series of Monte Carlo simulations.

The basic procedure for each Monte Carlo simulation requires us to pick a random set of all the intergenic sequences and set this to be the set of positive regions for a given motif. We then run our algorithm using this set of positive sequences and obtain a set of significant motifs.
Our hope is that the motifs found from this random selection of intergenic sequences should be less significant than those found using the positive sequences as determined by the ChIP data.

We can use the results from the Monte Carlo simulations to further prune the set of motifs found using the ChIP sequences. If a motif found in the ChIP intergenic sequences is not much more significant than the motifs found at random during the Monte Carlo simulations, then it is not a likely biological motif. By using this metric to prune our original results, we can set the threshold for significance that we mentioned previously and thus better characterize our motifs.

3.2.1 Sequence Selection Strategy

We begin by randomly picking sequences from all intergenic regions. The random selection of the sequences to be set as the positive sequences is modified to reflect the fact that probes that bind many transcription factors are more important and that probes that bind few or no factors are less desirable. Therefore, we weight the selection of a probe sequence by the number of transcription factors that it binds to.

The transcription factors are then sorted by the number of intergenic sequences each factor binds to. Figure 3-2 shows the distribution of set sizes for all the transcription factors. We choose ten representative sizes for the number of sequences in the positive set. This selection strategy is described in conjunction with the results in Section 4.2.

Figure 3-2: Distribution of Set Sizes (distribution of the number of positive sequences per transcription factor, plotted against the number of positive intergenic sequences in a set)

3.2.2 Testing Strategy

For each size, we generate 30 random sets of sequences of that size. MSEE is run on these 30 sets and the significant motifs are extracted. The flowchart in Figure 3-3 summarizes this testing strategy.
After running this simulation a number of times on sets of sequences, varying both the number of sequences and the total number of bases in the sequences, we obtain a distribution of the significance for the motifs found. We hope to find that the log of the significance is indeed normally distributed. Calculating the mean and standard deviation gives us a way of identifying the noise in the system as produced by the ChIP data, and a way to eliminate it.

Figure 3-3: Monte Carlo Simulation Strategy. For each sequence set size chosen, we generate 30 random sets of that size and run MSEE on these sets.

Chapter 4

Results

4.1 Consensus Motifs from Literature

In order to determine whether or not our methods correctly extracted gapped motifs, we first compiled a set of consensus motifs from the literature that were found to have gaps. Specifically, we compiled a list of motifs in Table 4.1 as published in Harbison, et al[4], which includes both motifs found experimentally and motifs found in the literature. The motifs are given here with their enrichment scores. Six different motif discovery methods were used to compile their list of motifs, including MEME, MDScan, and AlignACE. Additionally, motifs published in Kellis, et al[5] are listed in Table 4.2. We specifically looked only at motifs that had a gap of 2 or more base pairs, as well as the motif for GCN4, which is biologically very significant. Again, we use n to indicate a gapped base and N to indicate a wildcard within a syllable.

Section 4.3 shows the best motifs found using MSEE and the alternate method and their corresponding scores for all of the above transcription factors except TEA1 and HAP1, which were not included in the ChIP experimental tests.
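The consensus notation used in this section (n for a gapped base, N for a wildcard within a syllable, and degenerate letters such as R, Y, S, and K) can be made concrete by translating a consensus motif into a regular expression. The Python sketch below is only an illustration and is not part of MSEE: the helper `motif_to_regex` is hypothetical, the letter table is the standard IUPAC nucleotide code, and treating lower-case letters the same as upper-case ones is an assumption made for matching purposes only.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif: str) -> re.Pattern:
    """Translate a consensus motif into a compiled regular expression.

    Assumption: case is ignored, so a gapped base 'n' and a wildcard 'N'
    both match any single base.
    """
    return re.compile("".join(IUPAC[letter] for letter in motif.upper()))


# GAL4-like motif: CGG, eleven gapped bases, CCG.
gal4 = motif_to_regex("CGG" + "n" * 11 + "CCG")
sequence = "TT" + "CGG" + "A" * 11 + "CCG" + "AA"  # toy sequence
hit = gal4.search(sequence)
print(hit.start(), hit.group())

# A degenerate ungapped motif: S matches C or G, Y matches C or T.
gcn4_like = motif_to_regex("TGASTCAY")
print(bool(gcn4_like.fullmatch("TGACTCAT")), bool(gcn4_like.fullmatch("TGAATCAT")))
```

A full scan would also search the reverse complement of each sequence, in line with the strand assumption of Section 2.2; that step is omitted here for brevity.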
TF      Consensus Motif        Enrichment Score
ABF1    rTCAytnnnnAgc          137.742
GAL4    CGGnnnnnnnnnnncCg      13.424
GCN4    TGAsTCa                64.620
HSF1    TTCYAnnnnnTTC          32.956
PUT3    NNCGGnnnnnnnnnnCCG     0.000
RGT1    CGGAnnA                0.000
SOK2    TGCAGnnA               12.280
STB4    TCGGnnCGA              3.693
STP1    RCGGCnnnRCGGC          0.000
SUT1    GCSGSGnnSG             21.013
UGA3    CCGnnnnCGG             0.000

Table 4.1: Consensus motifs containing gaps from Harbison et al [4] listed by the transcription factor (TF), the consensus motif, and the enrichment score. The motifs for which the enrichment score is 0.0 were obtained from the literature and were not found experimentally in Harbison, et al.

TF      Known Motif            MCS
ABF1    rTCRYnnnnnACG          50.0
GAL4    CGGnnnnnnnnnnnCCG      8.0
TEA1    CGGnCGG                6.8
PUT3    CGGnnnnnnnnnnCCG       6.2
HAP1    CGGnnnTAnCGG           2.5

Table 4.2: Consensus motifs containing gaps from Kellis, et al[5]. The motif conservation score (MCS) is listed alongside the motif.

4.2 Monte Carlo Simulations

The Monte Carlo simulations were performed using MSEE as described in Section 3.2 on a range of sequence set sizes. Again, the purpose of these simulations is to pick random sets of intergenic sequences (with a comparable number of sequences and total number of bases) and find motifs from these sets. If our scoring method is effective, then these motifs will score lower than motifs found from ChIP positive intergenic sequences.

4.2.1 Simulation Setup

We chose set sizes that were close to the positive set sizes for the transcription factors in Table 4.1 above. The numbers of sequences found to be positive intergenic regions are displayed in Table 4.3 along with the total number of bases for each set. Based on the number of sequences in Table 4.3, we chose 10 representative sizes: 4, 5, 15, 20, 30, 40, 60, 70, 90, and 180.
We generated 50 data sets for the sizes 4, 5, and 15, generated 30 data sets for all other set sizes, and ran MSEE using these sets as positive intergenic sequences. We also calculated the mean and standard deviation of the total number of bases for these generated sets.

TF      Num Sequences   Total Num Bases
RGT1    4               1909
PUT3    14              7868
STP1    21              11882
STB4    28              13561
UGA3    33              15000
HSF1    42              22977
GCN4    59              33358
SOK2    68              50984
SUT1    74              50323
GAL4    90              49540
ABF1    178             83304

Table 4.3: Number of Positive Intergenic Sequences and Total Number of bases in these sequences for Transcription Factors with gapped consensus motifs. Sorted by Number of Positive Sequences.

4.2.2 Monte Carlo Motif Scores

Table 4.4 displays the average and standard deviation of the scores for the best motifs found for each set size as well as the mean and standard deviation of the total number of bases.

Size   Mean Score   StdDev     Mean Number Bases   StdDev(Bases)   Conf
4      268.218      183.323    1989                652             .92
5      215.498      114.318    2585                647             .86
15     141.777      123.931    7694                1470            .90
20     110.181      53.0920    10701               1141            .93
30     88.086       26.5553    16531               1698            .87
40     97.774       57.520     21845               2109            .90
60     78.539       36.9511    32299               2711            .90
70     72.529       22.7744    38588               2657            .87
90     68.988       24.8676    49004               3063            .87
180    54.252       16.113     97462               4239            .83

Table 4.4: Scores from Monte Carlo Simulations. The last column "Conf" refers to the confidence level achieved if we set a threshold of one standard deviation away from the mean score.

Figure 4-1 graphically displays the motif scores listed in Table 4.4, allowing us to view distinct trends in both the mean and standard deviation of the scores as the sequence set size increases. We observe two key points. First, as the number of sequences in the set increases, the mean score generally decreases, indicating that the motifs found are less and less significant. Second, as the number of sequences in the set increases, the standard deviation from the mean score also generally decreases.
The fact that there is less variance in the scores when the positive set is larger is interesting.

Figure 4-1: Scores of the Best Motifs from Monte Carlo Simulations (mean and standard deviation of the Monte Carlo scores against the number of sequences in the test set). Both mean and standard deviation from the mean decrease as the number of sequences in the set increases.

4.2.3 Important Features of Monte Carlo Results

We first examine how well these randomly generated sets compare to the ChIP intergenic sets. We plot the total number of bases for each in Figure 4-2.

Figure 4-2: Number of bases in Monte Carlo sets vs. ChIP sets (number of bases in the positive set against the number of sequences in the positive set). The significant outliers are circled in red.

Figure 4-2 shows that the Monte Carlo simulations generated positive sets with comparable numbers of bases for all but three transcription factors. Both SOK2 and SUT1 have a larger number of bases than was generated randomly, while ABF1 has fewer bases than were generated randomly. The effect of this difference was not investigated in this thesis but might produce interesting results.

Second, we look at the distribution of scores for a given set size. Many methods assume that this distribution is Gaussian. Using the calculated mean and standard deviation, they then determine a score threshold which gives a good confidence interval. In the Gaussian case, this threshold is set to be three standard deviations above the mean. We observe the score distribution for set sizes of 4 and 180 to get an idea of the general shape. These distributions are displayed in Figure 4-3.
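The link between a threshold of k standard deviations above the mean and a confidence level can be sketched in Python as follows. This is an illustrative sketch, not the procedure used for the thesis figures: the function names are hypothetical, the score list is a made-up toy sample, and the use of the sample standard deviation (`statistics.stdev`) is an assumption.

```python
from math import erf, sqrt
from statistics import mean, stdev


def gaussian_confidence(k: float) -> float:
    """Fraction of a Gaussian distribution below mean + k standard deviations."""
    return 0.5 * (1.0 + erf(k / sqrt(2.0)))


def empirical_confidence(scores: list[float], k: float = 1.0) -> float:
    """Fraction of observed scores at or below mean + k sample standard deviations."""
    threshold = mean(scores) + k * stdev(scores)
    return sum(score <= threshold for score in scores) / len(scores)


print(f"Gaussian, k = 1: {gaussian_confidence(1.0):.4f}")
print(f"Gaussian, k = 3: {gaussian_confidence(3.0):.5f}")

# Toy, heavy-tailed sample (hypothetical values, not Monte Carlo results).
toy_scores = [1.0, 2.0, 3.0, 4.0, 100.0]
print(f"Empirical, k = 1: {empirical_confidence(toy_scores):.2f}")
```

Comparing the empirical fraction with the Gaussian value for the same k is one simple way to see how far a set of scores departs from the Gaussian assumption, although with few samples the empirical fraction is itself noisy.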
Figure 4-3: Histogram of Scores (negative log motif score) for Positive Sequence Set Sizes of 4 and 180. Distributions are noticeably non-Gaussian.

The score distributions do not have Gaussian properties; they are heavy-tailed to the right of the mean score. We would expect this to decrease the confidence level at a given standard deviation from the mean as compared to a Gaussian distribution. The confidence of a one standard deviation threshold is therefore expected to be less than 84.13%, which is the confidence for a one standard deviation threshold of the Gaussian distribution. However, if we observe the confidence levels shown in Table 4.4, it seems that our confidence levels are as good as, if not better than, the confidence for a Gaussian distribution. This is a mere coincidence and can be attributed to the small sample sizes of the random runs. Harbison et al parameterized the observed scores from similar random runs by a normal distribution[4] (Supplementary Methods). However, we have shown that the assumption of normality is, in fact, incorrect. This should be kept in mind when we compare the Monte Carlo motifs to the motifs extracted using MSEE.

4.3 Testing the Two Methods

We tested the two methods described in Sections 2.2 and 2.6 on a subset of transcription factors and evaluated the results.

4.3.1 Validation of MSEE

Parameters and Setup

We ran the first method with the following parameters:

* Number of Syllables - 2
* Number of Wildcards - 2
* Motif Structure [Type (Width)] - Syllable (4): Gap (variable): Syllable (4)
* Range of Gap Widths tested - [0,20]
* Transcription Factors tested - all 367 from ChIP data[21, 4]
* Hash Function Optimized? - yes

First, we ran MSEE on all intergenic sequences and then on individual sets of positive intergenic sequences for each transcription factor. Both sets of counts were then used to calculate the significance of each candidate motif and output a list of significant motifs found for the given motif structure.

Results of MSEE

Table 4.5 shows the highest scoring motif for each transcription factor (TF) and the corresponding negative log normal approximation score. The highest scoring motifs for each condition of a particular transcription factor were examined, and the most significant was chosen for Table 4.5. The corresponding conditions are given in the caption. The columns following the motif column are "Posit Count", which is the number of occurrences of this motif in the positive intergenic regions, and "All Count", which is the number of occurrences of this motif in all intergenic regions.

TF      Score      Top Motif (MSEE)               Posit Count   All Count
GCN4    736.365    TGASTCAY                       59            172
ABF1    534.897    RTCAnnnnnnACGN                 182           759
UGA3    513.878    GKGTnnnnGTGK                   54            448
RGT1    359.364    CCGGnnnnnnnnnnnnnnnnnnnCCRG    3             12
HSF1    320.6      TTCTAGAA                       36            206
PUT3    294.289    CGGGnnnnnnnnnCCGA              6             17
STB4    231.693    TCGRnYCGA                      16            96
GAL4    150.733    CGGRnnnnnnnnnnCCGR             18            50
STP1    130.452    CGGCnnnnnnnnnCGGC              5             17
SUT1    93.1803    CCNGCGGS                       21            101
SOK2    69.7339    CCCCTRGC                       11            37

Table 4.5: Motifs found using MSEE for GCN4, ABF1, UGA3 in Rapamycin, RGT1, HSF1 in Heat Shock, GAL4 in Galactose, PUT3, STB4, STP1, SUT1, and SOK2 after 14 hour Butanol treatment. The default condition is YPD.

These results were obtained by extracting the best motif for all gap widths tested and then choosing the highest scoring motif and corresponding gap width from these 20 motifs. Non-palindromic motifs were reported twice since the reversed and complemented version had the same score. One of these two representations was arbitrarily chosen for Table 4.5 above.
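Because the scores in Table 4.5 are negative logs of the normal-approximation p-value of Section 3.1.3, the corresponding p-values can be far smaller than a double-precision number can represent. The Python sketch below shows one way such a score could be computed without underflow. It is an assumption-laden illustration rather than the MSEE implementation: the function names are hypothetical, the natural logarithm is assumed, the switch-over point z = 30 is arbitrary, and the counts in the example are toy values.

```python
from math import erfc, log, pi, sqrt


def neg_log_upper_tail(z: float) -> float:
    """-ln P(Z >= z) for a standard normal Z, usable for very large z."""
    if z < 30.0:
        # P(Z >= z) = 0.5 * erfc(z / sqrt(2)), still representable for z < 30.
        return -log(0.5 * erfc(z / sqrt(2.0)))
    # Leading terms of the standard asymptotic expansion for large z:
    # P(Z >= z) ~ exp(-z^2 / 2) / (z * sqrt(2 * pi)) * (1 - 1 / z^2)
    return 0.5 * z * z + log(z) + 0.5 * log(2.0 * pi) - log(1.0 - 1.0 / (z * z))


def neg_log_normal_score(m: int, n: int, M: int, N: int) -> float:
    """Negative log of the p-value in Equation 3.7, with p = M / N."""
    p = M / N
    z = (m - n * p - 0.5) / sqrt(n * p * (1.0 - p))
    return neg_log_upper_tail(z)


# Toy counts (hypothetical): 12 hits in 5,000 positive placements,
# 60 hits in 1,000,000 placements overall.
print(neg_log_normal_score(m=12, n=5_000, M=60, N=1_000_000))

# A z-value whose tail probability underflows in double precision.
print(neg_log_upper_tail(40.0))
```

Working with the negative log directly keeps motifs with extremely small p-values rankable relative to one another, which a plain p-value in floating point would not allow once it underflows to zero.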
Comparison with Consensus Motifs

We list the motifs found by MSEE again in Table 4.6 alongside the motifs found in Harbison, et al[4] and Kellis, et al[5], sorted by their MSEE significance score.

TF     Top Motif (MSEE)   Harbison Motif    Kellis Motif     Match?
GCN4   TGASTCAY           TGAsTCa           -                y
ABF1   RTCAnnnnnnACGN     rTCAytnnnnAgc     rTCRYnnnnnACG    y
UGA3   GKGTnnnnGTGK       CCGnnnnCGG        -                n
RGT1   CCGG[n]19CCRG      CGGAnnA           -                p
HSF1   TTCTAGAA           TTCYAnnnnnTTC     -                p
PUT3   CGGG[n]9CCGA       nnCGG[n]10CCG     CGG[n]10CCG      y
STB4   TCGRnYCGA          TCGGnnCGA         -                y
GAL4   CGGR[n]10CCGR      CGG[n]11cCg       CGG[n]11CCG      y
STP1   CGGC[n]9CGGC       RCGGCnnnRCGGC     -                p
SUT1   CCNGCGGS           GCSGSGnnSG        -                n
SOK2   CCCCTRGC           TGCAGnnA          -                n

Table 4.6: Comparing MSEE motifs with Consensus Motifs. The "Match?" column displays y - pattern and structure match; p - partial pattern match but structure mismatch; n - mismatch.

From Table 4.6, we observe that the MSEE motifs and motifs from literature agree well for GCN4, ABF1, PUT3, STB4, and GAL4. In some cases, the motifs agree within a syllable though the entire structure is not the same. The three sets of motifs that agree partially are the motifs for RGT1, HSF1, and STP1. The motifs do not agree for UGA3, SUT1, and SOK2. MSEE correctly extracts all motifs that had an enrichment score (see Table 4.1) of 30 or higher. It also extracted PUT3, which was not found in Harbison et al.

We also observe that among the mismatches and partial matches, MSEE seems to pick up "C" and "G" signals more frequently than "A" and "T"; this is particularly noticeable for RGT1, STP1, SUT1, and SOK2. This may be because the background frequencies of "C" and "G" are higher than those of "A" and "T" in particular sets of positive sequences.

For the motifs that were partially extracted, one explanation is that certain syllables are more significant than others. In other words, independent syllables (in their context) are more significant than the entire multi-syllabic word itself.
The motif found for RGT1 has two syllables, CCGG and CCRG, which are both very similar to the beginning of the actual RGT1 motif. Instead of picking up the correct motif, MSEE reported separate instances of the first syllable CGG(A) as the most significant motif.

It was surprising that the motif for UGA3 was a mismatch given that its significance score was so high. One possible explanation is that UGA3 has a transcription partner which binds in the same probe sequences. The motif found may be the motif to which this partner binds. Incidentally, the correct UGA3 motif has a score of 15 as calculated by MSEE, which is very insignificant compared to all other top scoring motifs.

Comparison with Monte Carlo Motifs

Based on the Monte Carlo simulations, we would like to prune our results from MSEE and compare the significance of the motifs to the motifs found from the Monte Carlo simulations as described in Section 4.2. We compare the actual result to the Monte Carlo score for the sequence set size that is closest to the size of the actual positive intergenic sequence set. Figure 4-4 shows the plot of the scores compared to the Monte Carlo scores.

Figure 4-4: Motif Scores from Monte Carlo Sets versus ChIP sets. (Plot title: "Monte Carlo Scores versus ChIP Scores"; horizontal axis: Number of Sequences in Positive Set; labeled points: GCN4, ABF1, UGA3, RGT1, HSF1, PUT3, STB4, GAL4, STP1, SUT1, SOK2.)

To determine whether we should discard a motif, we compare its score to the Monte Carlo score corresponding to the closest positive set size. If the score of the motif is at least one standard deviation away from the corresponding Monte Carlo score, we retain it; otherwise, we discard it. Upon close examination of Figure 4-4, we discard the motifs for the following transcription factors: RGT1, STP1, SOK2, and SUT1.
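The pruning rule described above can be sketched as follows. This is a minimal illustration, assuming the Monte Carlo results are summarized as a mean and standard deviation of the random-run scores per positive set size; the function names and all numbers in the usage example are hypothetical and are not values from Table 4.4 or Figure 4-4.

```python
def closest_set_size(n_positive, simulated_sizes):
    """Pick the simulated positive-set size nearest to the real one."""
    return min(simulated_sizes, key=lambda size: abs(size - n_positive))


def retain_motif(score, n_positive, monte_carlo):
    """Keep a motif only if its score clears mean + 1 standard deviation.

    monte_carlo maps a simulated set size to (mean, std) of the scores of
    the best motifs found in random positive sets of that size
    (a hypothetical summary format chosen for this sketch).
    """
    size = closest_set_size(n_positive, monte_carlo.keys())
    mean, std = monte_carlo[size]
    return score >= mean + std


if __name__ == "__main__":
    # Hypothetical summary statistics, for illustration only.
    monte_carlo = {4: (300.0, 250.0), 180: (60.0, 10.0)}
    print(retain_motif(736.4, 170, monte_carlo))  # compared against size 180
    print(retain_motif(359.4, 3, monte_carlo))    # compared against size 4
```

Because the score distributions are heavy-tailed, a fixed one standard deviation cutoff corresponds to a different confidence level than it would under a Gaussian, so a percentile of the simulated scores could be used in place of mean + std if the raw random-run scores are kept.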
We retain the motifs for PUT3, GAL4, GCN4, UGA3, HSF1, ABF1, and STB4. We have chosen to include PUT3 although it is barely one standard deviation away from the mean score of the corresponding Monte Carlo motif. Recall that, for these score distributions, a threshold of one standard deviation above the mean is more lenient and gives a lower confidence than it would for a Gaussian.

The 5 correct motifs (GCN4, ABF1, PUT3, STB4, and GAL4) were retained after we pruned our results using the Monte Carlo simulations. Therefore, we can conclude that these motifs are correct with the specific confidence levels reported in Table 4.4. This demonstrates the correctness of our scoring method.

However, we have noticed that this scoring method is not effective in some cases. As we have noted with UGA3 and other motifs, the correct motif may have a very low score in comparison to the best motif found. This suggests that our assumption that a motif must be significantly overrepresented in positive intergenic regions may not hold. We must explore other biological factors that limit transcription factor binding to certain regions and not others.

4.3.2 Validation of Alternate Method

Parameters and Setup

We ran the second method with the following parameters:

* Number of Syllables - 2
* Number of Wildcards - 2
* Motif Structure [Type (Width)] - Syllable (4): Gap (variable): Syllable (4)
* Range of Gap Widths tested - [0,20]
* Transcription Factors tested - all 367 from ChIP data
* Hash Function Optimized? - yes

Here, we used the dictionaries created previously on all intergenic sequences and generated dictionaries of the positive sets of sequences using the separate algorithm. We then ran the algorithm using the updated method described in Section 2.6, which has been optimized to run much faster.

Results of Alternate Method

Table 4.7 lists the motifs found using the alternate method.
Again, preceding the motif is its negative log normal approximation score and following the motif is the number of times it occurred in the positive intergenic sequences and the number of times it occurred in all intergenic sequences.

TF     Score     Top Motif                      Posit Count   All Count
GCN4   368.002   TGASTCAY                       59            214
RGT1   350.465   CCGGnnnnnnnnnnnnnnnnnnnCCGG    2             3
PUT3   257.885   CGGGnnnnnnnnCCCG               6             13
ABF1   236.666   RTCAnnnnnnACGR                 142           588
STB4   188.337   TCGGnCCGA                      8             19
GAL4   171.357   YCGGnnnnnnnnnnnCCGR            14            40
HSF1   156.231   GCATnnATGC                     4             32
UGA3   152.577   ACMCnnnnnnnCMCA                42            539
STP1   113.162   GCCGnnnnnnnnnnnnnnnnnnnCGGC    4             8
SUT1   70.8559   CGGGnCCCG                      4             3
SOK2   69.8423   CGGGnCCCG                      4             3

Table 4.7: Motifs found using Alternate Method for GCN4, RGT1, PUT3, ABF1, STB4, GAL4 in Raffinose, HSF1, UGA3 in Rapamycin, STP1, SUT1, and SOK2 after 14 hour Butanol treatment.

These results were similarly obtained by extracting the best motif for all gap widths tested and then choosing the highest scoring motif and corresponding gap width from these 20 motifs. However, in these results, non-palindromic motifs were counted only once given the addition of the canonical enumeration. As a result, the motif counts for the positive intergenic and all intergenic sequences differ from those found using MSEE and thus lead to a different score.

Immediately, we notice something wrong with this method if we observe the word counts for SUT1 and SOK2. This method reports that it finds the motif 4 times in the positive regions yet only 3 times in all regions. However, since the set of all regions includes the positive regions, it must occur at least 4 times in all regions. Upon closer inspection, we see that both these motifs are palindromic. This inconsistency is the result of the use of the canonical form of the word when counting as described in Section 2.6. Palindromic words are counted only once for each instance on a sequence though they occur on both forward and reverse strands.
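The difference between the two counting conventions can be seen in a small sketch. It uses ungapped k-mers over A, C, G, T for brevity and takes the canonical form of a word to be the lexicographically smaller of the word and its reverse complement; both choices are assumptions made for illustration and do not reproduce the bookkeeping of Section 2.6.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(word: str) -> str:
    """Reverse complement of a plain A/C/G/T word."""
    return word.translate(_COMPLEMENT)[::-1]


def count_both_strands(seq: str, k: int) -> Counter:
    """Count each site once per strand: the word and its reverse complement."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] += 1            # as read on the forward strand
        counts[revcomp(word)] += 1   # same site read on the reverse strand
    return counts


def count_canonical(seq: str, k: int) -> Counter:
    """Count each site once, under the canonical spelling of its word."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[min(word, revcomp(word))] += 1
    return counts


if __name__ == "__main__":
    for seq in ("GATC", "AACC"):  # a palindromic and a non-palindromic site
        print(seq, count_both_strands(seq, 4)[seq], count_canonical(seq, 4)[seq])
```

In this sketch a non-palindromic site contributes the same count under either convention, while a palindromic site contributes 2 under both-strand counting and 1 under canonical counting. Counts of palindromic words produced under the two conventions are therefore not directly comparable.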
For this reason, we must ignore all palindromic results from this method.

Comparison with Consensus Motifs

We list the motifs found by this alternate method again in Table 4.8 alongside the motifs found in Harbison et al and Kellis et al, sorted by their significance score.

TF     Top Motif (Alternate)   Harbison Motif    Kellis Motif     Match?
GCN4   TGASTCAY                TGAsTCa           -                y
RGT1   CCGG[n]19CCGG           CGGAnnA           -                p
PUT3   CGGG[n]8CCCG            nnCGG[n]10CCG     CGG[n]10CCG      p
ABF1   RTCAnnnnnnACGR          rTCAytnnnnAgc     rTCRYnnnnnACG    y
STB4   TCGGnCCGA               TCGGnnCGA         -                y
GAL4   YCGG[n]11CCGR           CGG[n]11cCg       CGG[n]11CCG      y
HSF1   GCATnnATGC              TTCYAnnnnnTTC     -                n
UGA3   ACMCnnnnnnnCMCA         CCGnnnnCGG        -                n
STP1   GCCG[n]19CGGC           RCGGCnnnRCGGC     -                p
SUT1   CGGGnCCCG               GCSGSGnnSG        -                n
SOK2   CGGGnCCCG               TGCAGnnA          -                n

Table 4.8: Comparing motifs from alternate method with Consensus Motifs. The "Match?" column displays y - pattern and structure match; p - partial pattern match but structure mismatch; n - mismatch.

From Table 4.8, we observe that the motifs agree reasonably well for GCN4, ABF1, STB4, and GAL4. They agree partially, or within individual syllables, for RGT1, PUT3, and STP1. They do not agree for HSF1, UGA3, SUT1, and SOK2.

The motifs that produced palindromic results were RGT1, PUT3, STB4, GAL4, HSF1, STP1, SUT1, and SOK2. We observe that GAL4 and STB4 found the correct significant motifs even though they are palindromic. Their scores, however, are incorrect due to the inconsistency of the word counts. For the factors besides GAL4 and STB4, the palindromic motifs found were thus assigned a higher significance than the correct value and appeared to be more significant than the correct answer.

4.3.3 Comparison of MSEE and Alternate Method

From our results for 11 transcription factors, we observe that MSEE correctly extracted 5 motifs, partially extracted 3 motifs, and completely missed 3 motifs. The alternate method correctly extracted 4 motifs, partially extracted 3 motifs, and completely missed 4 motifs.
Both MSEE and the alternate method correctly found the motifs for GCN4, ABF1, STB4, and GAL4, while both completely missed the motifs for UGA3, SUT1, and SOK2. Both methods perform motif extraction in essentially the same way, with a few differences that affect run time. Their results are therefore expected to be similar, if not exactly the same. The differences in the results arise from the inconsistent word counting in the alternate method, which favors palindromes over non-palindromic motifs. While MSEE runs much slower, it is the more accurate method of the two.

4.4 Further Work

The most obvious next step is to refine the motifs produced by MSEE. Using these motifs, we can perform more specific motif extraction to develop better and more accurate models.

4.4.1 Motif Alignment of Different Gapped Results

When MSEE is run, it outputs the best motif for each gap width used. Of these, the best is reported as the most significant motif in Table 4.5 above. However, we can use the results for the different gap widths to obtain more information about the context and relative strength of the base pairs in the motif. We use ABF1 as an example of this method. We list the top motifs for each gap width for ABF1 in Table 4.9, where the reported motif is given first. Aligning these various results gives us information about the motif context as shown directly below Table 4.9.
Gap Width   Score     Motif
Gap 6       534.897   RTCAnnnnnnACGN
Gap 5       533.059   RTCAnnnnnNACG
Gap 4       528.477   TCAYnnnnNACG
Gap 3       382.929   CAYTnnnNACG
Gap 7       355.84    NRTCnnnnnnnACGA
Gap 8       137.96    YATCnnnnnnnnCGAN
Gap 2       89.4057   ACTWnnNACG
Gap 0       38.4222   ATCACTAW
Gap 9       34.1574   TATCnnnnnnnnnGANN
Gap 1       33.496    CYATnNACG

Table 4.9: Top motifs found for ABF1

  R T C A n n n n n n A C G N
  R T C A n n n n n N A C G
    T C A Y n n n n N A C G
      C A Y T n n n N A C G
N R T C n n n n n n n A C G A
Y A T C n n n n n n n n C G A N
        A C T W n n N A C G
  A T C A C T A W
T A T C n n n n n n n n n G A N N
          C Y A T n N A C G

Upon inspection, the consensus motif resulting from this alignment is approximately

  r T C A y t w n n n A C G a

This consensus motif is very similar to those found in Harbison, et al[4] and Kellis, et al[5]. This alignment can also be used on the results from HSF1 and GCN4.

4.4.2 Motif as Seed to Probabilistic Methods

While alignment is quick and easy, a more accurate way of refining the motif models is to use the motifs reported by MSEE as a seed to either EM or Gibbs sampling. One way to do this is to create a PSSM from the most significant motif where each base has a weight of 1.0. Another way would be to create a PSSM from the alignment shown above with relative weights for each position.

4.4.3 Extracting Motifs with Complex Structures

MSEE has only been tested for motif structures that have 2 syllables with a gap separating them. However, it has been implemented to search for any number of syllables with any number of gaps separating them. The only upper bound is that the total composite length of the syllables must be less than 8 or 9 base pairs. In future versions, we hope to relax this upper bound, thereby allowing for a much larger range of motif structures.

4.4.4 Motif Significance

This thesis did not analyze how well the normal distribution approximates the hypergeometric distribution. This requires further work in order to ensure that the approximation holds.
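One way such a check could be set up is sketched below: the exact hypergeometric upper tail is compared with the tail of a normal distribution that has the same mean and variance. The parameterization (N items in total, K of them marked, n drawn, at least k marked among the draws) and the continuity correction are assumptions made for this sketch; they are not necessarily the quantities used in the MSEE score, and the numbers in the usage example are arbitrary.

```python
import math


def hypergeom_upper_tail(k, N, K, n):
    """Exact P(X >= k) when n items are drawn without replacement
    from N items of which K are marked."""
    lo = max(k, 0, n - (N - K))
    hi = min(n, K)
    total = math.comb(N, n)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(lo, hi + 1))
    return hits / total


def normal_upper_tail(k, N, K, n):
    """Normal approximation to the same tail, using the hypergeometric
    mean and variance and a continuity correction of 0.5."""
    p = K / N
    mean = n * p
    var = n * p * (1.0 - p) * (N - n) / (N - 1)
    z = (k - 0.5 - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    N, K, n = 1000, 500, 100  # arbitrary illustrative values
    for k in (50, 60, 70, 80):
        exact = hypergeom_upper_tail(k, N, K, n)
        approx = normal_upper_tail(k, N, K, n)
        print(k, -math.log(exact), -math.log(approx))
```

Printing the negative logarithms side by side shows how the two tails track each other near the mean and how they compare further out, which is the regime that matters for highly significant motifs.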
We also mentioned in Section 4.3.1 that ranking motifs by their statistical overrepresentation in positive intergenic regions may not be sufficient to identify all motifs. For example, chromatin folding is known to limit when and where transcription factors bind to the DNA. Further work can be done to factor in such biological phenomena.

4.5 Conclusions

MSEE is a useful tool for finding motifs and, in particular, for finding multi-syllabic motifs. In comparison to the alternate method described, MSEE takes up more space and is computationally more expensive. However, its algorithm is more accurate and reports more correct motifs. The best use of this tool would be to generate motifs that can be used as a seed to other methods which produce more refined models of the motifs.

Appendix A

All Motifs found with MSEE

This appendix lists a subset of the top motifs found using MSEE for all the transcription factors tested. In the following table, the top motifs for 179 of the 367 transcription factors are listed. Each row contains the transcription factor (TF), the score of the motif, and the motif itself. They are sorted by their significance score.
TF               Score      Top Motif (MSEE)               Posit   All
AFT2H202         -1888.79   ACACACAC                       114     343
GAL4             -1786.54   MCACnCMCA                      84      571
YNR063W          -922.758   GCCCnnnnnnnnnnnnnnnnnnnnGGGC   2       16
YAP8             -882.386   ACACACAC                       58      343
RDS1             -865.11    TCGGCCGA                       6       22
GCN4             -736.365   RTGASTCA                       59      172
GCN4_DTT         -725.591   CMCAnnnMCAC                    59      462
RGM1             -655.035   CCACnnnnnnnnnnnnnnnTCAC        9       46
YPR196W          -643.426   GKAGGGTA                       18      139
RDR1             -637.361   CCGCnnnnnCCGC                  3       15
CUP9-Cu2         -634.949   CMCAnnnMCAC                    72      462
RCS1_H202        -626.613   ACACACAC                       65      343
NCB2             -621.08    TGAGnnnnnnnnnnnnnCTCA          4       36
CBF1             -544.745   CACGTGAS                       16      126
ABF1             -534.897   NCGTnnnnnnTGAY                 182     759
CAD1             -527.036   GKGTnnnnGTGK                   54      448
UGA3-Rapamycin   -513.878   GKGTnnnnGTGK                   54      448
YAP5             -504.645   GTGKnnGTGK                     86      567
YAP3             -447.91    ACACACAC                       41      343
PDR1             -437.083   MCACnCMCA                      97      571

TF SFP1_H22 UBP12 RPN4-H202 MCM1 YFL052w RPH1-H202 WCEvWCE RGTl CHA4 RGT1-Galactose HSF1_Heat GAL4-Raffinose RIM101-low_1202 YKR064W SNT2 PUT3 CAD1-H202 ZAPI RTG1-SM YBR239c SUT2 YAP81H202 TBS1 OPIl YAP7 SRD1 YJL206CJow-H202 RTG3AowH202 UBP1O STP4 RDS1IH202 SFPI PUT3-SM STB4 TYE7 YDR266c REXI AZF1 ZAP1-Zn PH02-low-H202 RCS 1..ow-H202 YML081W SFPiSM ZMS 1
Score -430.584 -430.308 -402.896 -394.729 -393.293 -389.743 -360.64 -359.364 -330.096 -322.299 -320.6 -311.364 -310.88 -299.985 -296.585 -295.017 -294.289 -292.366 -289.585 -282.133 -276.983 -273.456 -268.878 -268.684 -265.848 -263.24 -262.421 -258.803 -251 -247.948 -246.623 -245.143 -238.514 -233.921 -231.693 -227.253 -225.459 -219.318 -216.074 -214.057 -208.223 -208.093 -208.064 -204.295 -202.281
Top Motif (MSEE) ACACACAS CGGWnnnnnnCCGR MCACnnnnnnnnnnnnnnCMCA TTCCnnnnnnGGAA ACAGCTGT GKGCnnnnnGCMC GTCTnnnnAGAC CCGGnnnnnnnnnnnnnnnnnnnCCRG GTGTGTGY KGGGnnnnnnnnnnnnnnCCCM TTCTAGAA CGGRnnnnnnnnnnCCGR GCGAnnnnnnTCGC GGCTnnnAGCC CCCSnnnnnnnnnnnnnnnnnnnSGGG GCGCTAYC CGGGnnnnnnnnnCCGA GGTGnnnnnnnnnnnCACC ACCYnnnRGGT ACACACAC CTCGnnnnnnnnnnnnnnnnnnnnCGAG GCGTACGC CACACACR ACCCnnnnnnnACCC CGGWnnnnnnCCGR AGKAnnnnnnnnnnnnnnnnTCGG YCCCnnnnnnnnGGGR TCCCnnnnnnnnGGGA CCGAnnTCGS CGGWnnnnnnCCGR GKAGGGTA
KCGGCCGW CGGTnnnnnnnnnnnnnACCG CGGGnnnnnnnnnCCGR TCGRnYCGA YCACGTGR CGGTnnnnnnnnnnnnACCG GCCATGGC GGGAnTCCC CCACnnnnnnnnnnnnnnnTCAC CCTGnnnnnCAGG NGGGTGCA CCGGnnnnnnnCCGG AYCCnTACA CYACnnnnnnnnnnnnnnnCCCC 66 Posit 26 14 29 42 2 10 2 3 16 4 36 17 2 2 4 17 6 2 16 19 2 2 19 6 14 2 6 4 6 14 20 20 2 10 16 42 2 6 6 13 2 31 6 31 3 All 400 57 235 118 40 34 10 12 408 26 206 50 18 8 30 55 17 18 108 343 14 18 420 42 57 47 42 8 19 57 139 62 6 34 96 222 10 16 12 46 14 125 10 195 31 TF AFT2 UBP11 Score -198.727 RGTI-Low-Glucose -193.285 -193.249 -191.067 -188.426 TECI RAPI-SM UBP9 GAL4-Cu2 HSF1_low-.H202 CIN5-H202 YDR026c STB1 SIP4_SM HSF1 BASI HAP4 ASHl1_4hrButanol YAP1-low-H202 RTG3-H202 UGA3 GAL4-Galactose YJL206C-1202 RPH1 TEC1_Alpha ARG81-SM MIGi IN02 YJL206C PIP2 STP1 CST6 SKN7 SIP4 ADR1-SM STB5 ARG80-SM PHO4 UBP8 Z1256 AR080 ROXi-H202 GCR1 SMP1 RTG1 MATAl BUR6 -196.196 -186.813 -184.532 -182.568 -181.387 -179.017 -174.573 -169.329 -167.399 -167.184 -163.583 -161.605 -154.482 -150.796 -150.733 -149.211 -147.329 -145.058 -143.891 -137.571 -136.384 -131.815 -131.715 -130.452 -128.781 -128.423 -127.886 -126.772 -125.954 -124.576 -122.37 -122.24 -120.823 -118.324 -116.558 -115.952 -114.667 -111.166 -108.164 -107.457 Top Motif (MSEE) CACAnACAC I Posit I All 24 330 SGGTnnnnACCS 10 80 TCCGnnnnnnnnnnnnnnnCSGA 10 47 CACMnnnnnnnnnnnnnnnnnnnnMCAC 27 204 CCCGnnnnnnnnnnnnnnnnCGTA 2 7 ACYCnnnnnnnACYC 12 136 CCACnnnnnnnnnnnnnnnTCAC 14 46 TTCTAGAA 18 206 GGGGnnnnnnnnCCCC 4 6 CCGGnTAAA 10 86 CGCRnnnnnnnnnnGCGW 9 47 CCATnnnnCCGR 10 39 4 GCATnnATGC 44 17 MSGAGTCA 107 ACMCnnnnnnnnnnnnnnnnnMCAC 32 225 12 CCCSnnnnnnnnnnnnnnnnnnnSGGG 30 121 GCTKnCTAA CCGAnnnnCGSM 67 42 CCATnnnnnnnnnnnnnnnATGG CGGRnnnnnnnnnnCCGR 50 10 CGTGnnnnnnnnnnnnnCCGC CCTGnnnnnnnnnGSCC 21 210 GKGTnnnnnnnnnnnnnnnnnGTGK 50 CAACnnnnnnnnnnnnnnnnnnnGTTG 12 GGGTnnnnACCC GCATGTGA 54 8 TCCCnnnnnnnnGGGA 10 CCCCnnnnnnnnnnGGGG 17 CGGCnnnnnnnnnCGGC GKAGGGTA 139 56 GGSCnnnGSCC 28 CTGAnnTCAG 10 GTCTnnnnAGAC 14 GCGGnnnnnnnCCGC 
ATKTnnnnnnnnnnnnnCGGG 58 682 TSATnnnnATSA CSGGnnnnnnnCCSG 48 CCACnnnnnnnnnnnnnnnnnnCCCC 15 CCGMnnnnnRCCG 68 CACCCACA 111 CCcccCsC 79 AYACnCACA 548 AATSnnnTTAS 457 CCAAnnnnnnnnnnnTTGG 52 TGGGnnnnnnnnCCCR 14 67 TF PDR3 HAP2 YFL044C PHO2-PiSTP2 PH02-H202 SKOl RTG3-Rapamycin HAP5-SM MIGIGalactose HAP5 CUP9 NDT80 CAD1-SM CHA4-SM YER051w BAS1_SM WARI RTG3-SM RPH1-low-H202 ARO80-SM YAP5_H202 RLR1 ARG81 THI2 RME1 THI2-ThiSIR4 SUM1 PPR1 UPC2 TECi Butanol UGA3-SM SIG1-H202 RPN4Jow-H202 RIM10 1 HAP3 OAF1 PHO2-SM UMEl-H202 ADRI HAP4_SM SPT2 RTG3 STE12 Score ITop Motif (MSEE) I Posit -106.327 GATCnnnnnnnnnnnnnGATC -106.077 TCCCnnnnnnnnnnnnnnnnnnnGGGA -102.944 TRTAnnTAYA -100.173 SCACGTGS -99.537 GCGAnnnnnnnnnnnnnnnnnnCCGA -98.0243 ACGGCCGT -97.8135 GGGCnnnnnnGCCC -97.6281 GCGTnACGC -97.2144 GACCnnnnnGGTC -97.2052 GSTCnnnnnnnnnnnnnnnGASC -95.6302 CGCGnnnnnnnnnnnnnnnnnnCGCG -95.2845 CGCCnnnnnnnnnnnnGYCG -94.8907 AGTTnnnnnnnnnnnnnnnTCCC -94.8462 MTTAnnAATS -94.6534 CTCTnnnnnnnnnAGAG -91.891 GTCTnnnnAGAC -91.552 MSGAGTCA -91.3099 GGAGnnnnnnnnnnnnnnnnnnAYCC -90.7169 CGCCnnnnGGCG -90.2731 GTMGnnnnnnnnnnnnnnnnnCKAC -87.3551 GGGGnnnnnnnnnnnnnCCCC -85.1899 GGTAnnnnTACC -84.3003 CCCCnnnnnGGGG -81.8173 GGRTnnnnnnnnnnnnAYCC -80.2348 ARGCnnnnnnGCYT -80.1286 ASCTnnnnnnnnnnnnnnnnGCGG -80.0774 TRGGnnnnCCYA -79.0962 TCGAnTCGA -76.8761 GWCRCAAA -76.6462 CGACnnnnnnnnnnnnGCCA -75.7968 ACCGGTTA -75.7886 TCGTnACGA -74.9091 CTAGnnnnnnnnnnnnnCTAG -73.8237 GCTGnnnnnnnnnnnCAGC -72.8842 AGACnnnnnnnnnnnnGACY -71.8964 CTGCnnnnnnCTGA -71.4041 GCGAnnnnnnnnnnnCGCA -71.2012 CCCCnnnnnnnnnnSRGG -70.9359 ACCYnnnRGGT -70.7771 CGGCCGAR -69.5593 CGATnnnnnnnnnnnnnGTCC -69.3184 GATGnnnnnnnnnnnnnACCM -68.7773 GCGMGCRC -63.7732 CTAAnCTCA NTGAAACA -63.438 68 All 20 10 5642 90 7 16 12 6 6 54 4 31 42 756 24 10 107 31 8 70 8 34 10 100 168 34 64 18 342 11 13 22 24 18 41 22 24 43 108 45 13 65 95 67 690 TF STPISM YHP1 RLMl-14hr-Butanol RCS1-SM YRR1 ACE2-Cu2 YAPI UMEl ARG80 HAP2-Rapamycin PHO2 RCS1 ASK10 ROX SIR3 STB2 RPH1-SM 
RIM101.H202 RPN4 RLM1 SFPi1ow-H202 YOX1 SPT23 RTG 1Japamycin
Score -63.0324 -62.4818 -62.2335 -61.855 -59.8441 -59.1808 -59.1462 -59.1042 -58.2859 -57.7562 -54.6925 -53.0868 -52.9335 -52.5781 -51.3149 -51.0293 -48.1694 -47.5489 -46.2583 -45.7255 -43.1684 -41.8195 -39.444 -37.7277
Top Motif (MSEE) TCGGCCGW Posit 6 18 30 4 3 6 GCTKnCTAA 17 GKAGGGTA 12 8 4 4 4 11 8 9 8 3 2 5 6 GCCGnnnnnnCGGC NGCTnnnnnnnnnnnnnnnnnnnAGCN CYAAnnnTAGM CCKGnnnnnnnnGGAC CGACnnnnnnnnnnnnGCCA GGATnnnGTAY CGTGnnnnnnnnnnnnnnnnnnnGGCC GGCCnnnnnnGGCC GACCnnnnGGTC AGATnnnnnnnSGAT CGTSnnnnnnnnnnnnnnnnnGGMC CGCTnnnnnnnnnnnnnTMSG SCAGnnnnnnnnnnnnnnCTGS GGCCnnnGGRC CCGCnnnnnnnnnnnnnnnGCGG GCTCnnnCTGA CCTAGCAC RATCnnnnnnnnnnnnnnnGATY GGGCnnnnnnnnnnnnnnnnnnnGCCC ACCSnnnnnnTACM GCRGnnnnnnnnnnnnnnCTAC 12 2 12 6
All 14 612 291 15 11 40 121 139 100 9 14 16 120 32 83 90 17 4 37 17 196 2 148 44

Appendix B

Optimization Comparison Tests

This appendix describes a series of tests of hash function performance. The first test compared the performance of the hash function on words with and without the "Z" modification. The prediction was that the hash function would perform better when we eliminated the string of n characters that corresponded to gaps for a given word. The test was set up with the following parameters:

* Number of Syllables - 2
* Number of Wildcards - 2
* Motif Structure [Type (Width)] - Syllable (3): Gap (variable): Syllable (3)
* Number of sequences in Positive Intergenic Set - 10
* Range of Gap Widths tested - (0,100)

We then recorded the time taken to expand and enumerate the words for the positive set and calculated the average time per sequence. The average time to process a sequence was then compared between the optimized and non-optimized algorithm. Figure B-1 shows the results of these tests as a function of gap size used.
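The idea behind the comparison can be sketched as follows. The sketch assumes that the "Z" modification replaces the run of gap characters with a single placeholder before the word is used as a hash key, which is safe only while a dictionary holds words of one gap width; the details of the actual modification are not reproduced here, and both function names are hypothetical.

```python
def full_key(first: str, gap_width: int, second: str) -> str:
    """Unmodified key: the gap is spelled out, so the key grows with the gap."""
    return first + "n" * gap_width + second


def collapsed_key(first: str, second: str) -> str:
    """Assumed form of the "Z" modification: one placeholder stands in for
    the whole gap, so the key length no longer depends on the gap width."""
    return first + "Z" + second


if __name__ == "__main__":
    counts = {}
    for gap_width in (0, 10, 100):
        long_key = full_key("TCG", gap_width, "CGA")
        short_key = collapsed_key("TCG", "CGA")
        counts[short_key] = counts.get(short_key, 0) + 1
        print(gap_width, len(long_key), len(short_key))
```

Since hashing a string takes time proportional to its length, the unmodified key becomes more expensive as the gap widens while the collapsed key does not. Building the collapsed key adds a fixed cost per word, which is consistent with the optimized version being slower for small gaps.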
Figure B-1: Comparing hash function performance between optimized and non-optimized versions of algorithm. (Plot title: "Hash Function Performance in Optimized and Non-optimized Versions of Algorithm"; horizontal axis: Gap Width; legend: Optimized, Optimized Data Points, Non-Optimized, Non-Optimized Data Points.)

Linear regression methods were used to interpolate the trend in hash function performance from the data points gathered. Here, we observe that for small and medium size gaps, the overhead of implementing the "Z" modification causes it to run slower than the unmodified version. However, we also observe that as the gap width increases, the hash time for the unmodified version also increases, and at some point this becomes larger than the hash time for the modified version. The hash function performance for the modified version remains constant with changing gap width since it is independent of the gap width.

Based on these results, we decided to offer two options in running the algorithm: with and without the modification of removing the gaps when hashing. This allows us to perform searches for motifs with small gaps relatively quickly while still allowing us to find motifs with large gaps in constant time.

Bibliography

[1] Ken Takusagawa and David Gifford. Negative information for motif discovery. Pacific Symposium on Biological Computing, 2003.

[2] Ken Takusagawa. Negative information for motif discovery. Master's project, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2003.

[3] Akimori Sarai. Bioinfo at bank. Available at http://gibk26.bse.kyutech.ac.jp/jouhou/image/dna-protein/all/small-N1d66.gif.

[4] C. Harbison, D. Gordon, T. Lee, N. Rinaldi, K. Macisaac, T. Danford, N. Hannett, J. Tagne, D. Reynolds, J. Yoo, E. Jennings, J. Zeitlinger, M. Kellis, P. A. Rolfe, K. Takusagawa, E.
Lander, D. Gifford, E. Fraenkel, and R. Young. Transcriptional regulatory code of the eukaryotic genome. Nature, 431:99-104, September 2004.

[5] M. Kellis, N. Patterson, M. Endrizzi, B. Birren, and E. Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423:241-254, May 2003.

[6] Marie-France Sagot. On motifs in biological sequences. Available at http://citeseer.ist.psu.edu/473028.html.

[7] Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Menlo Park, CA, 1994. AAAI Press.

[8] X. S. Liu, D. L. Brutlag, and J. S. Liu. An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nature Biotechnology, 20(8):835-839, 2002.

[9] X. Liu, D. Brutlag, and J. Liu. Bioprospector: discovering conserved DNA motifs in upstream regulatory regions of coexpressed genes. 2001.

[10] W. Thompson, E. C. Rouchka, and C. E. Lawrence. Gibbs recursive sampler: finding transcription factor binding sites. Nucleic Acids Research, 31(13):3580-3585, 2003.

[11] Y. Barash, G. Bejerano, and N. Friedman. A simple hyper-geometric approach for discovering putative transcription factor binding sites. 2001. Proceedings First International Workshop.

[12] Yoseph Barash and Nir Friedman. Context-specific Bayesian clustering for gene expression data. In RECOMB, pages 12-21, 2001.

[13] E. P. Xing, M. I. Jordan, R. M. Karp, and S. Russell. A hierarchical Bayesian Markovian model for motifs in biopolymer sequences. Advances in Neural Information Processing Systems, 2002.

[14] Jiang Liu. A combinatorial approach for motif discovery in unaligned DNA sequences. Master's thesis, University of Waterloo, 2004.

[15] Eric Rouchka. A brief overview of Gibbs sampling. Available at http://citeseer.ist.psu.edu/85660.html.

[16] Y. Barash, G. Elidan, N. Friedman, and T. Kaplan. Modeling dependencies in protein-DNA binding sites. In Proceedings of the 7th International Conference on Research in Computational Molecular Biology, 2003.

[17] E. Xing, W. Wu, M. Jordan, and R. Karp. Logos: A modular Bayesian model for de novo motif detection. In Proceedings IEEE Computer Society Bioinformatics Conference, 2003.

[18] Jeremy Buhler and Martin Tompa. Finding motifs using random projections. In RECOMB, pages 69-76, 2001.

[19] S. Sinha and M. Tompa. Performance comparison of algorithms for finding transcription factor binding sites, 2003. Available at http://citeseer.ist.psu.edu/sinha03performance.html.

[20] Albin Sandelin and Wyeth W. Wasserman. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. Journal of Molecular Biology, 338(2):207-215, 2004.

[21] Bing Ren, François Robert, John J. Wyrick, Oscar Aparicio, Ezra G. Jennings, Itamar Simon, Julia Zeitlinger, Jorg Schreiber, Nancy Hannett, Elenita Kanin, Thomas L. Volkert, Christopher J. Wilson, Stephen P. Bell, and Richard A. Young. Genome-wide location and function of DNA binding proteins. Science, 290:2306-2309, 2000.

[22] Microsoft MSDN Library. Standard C++ Library Reference: Map Class. Microsoft Corporation, 2005. Available at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcstdlib/html/vclrfmap.class.asp.

[23] Eric W. Weisstein. Hypergeometric distribution. MathWorld-A Wolfram Web Resource. Available at http://mathworld.wolfram.com/HypergeometricDistribution.html.

[24] Kyle Siegrist. The hypergeometric distribution. Virtual Laboratories in Probability and Statistics, 1997-2001. Available at http://www.ds.unifi.it/VL/VL-EN/urn/urn2.html.

[25] Suranthe De Silva and Peter D'Andreti. Approximations to probability distributions: Binomial approximation to the hypergeometric distribution. ThinkQuest: Seeing is Believing. Available at http://library.thinkquest.org/10030/6atpdvah.htm.

[26] Eric W. Weisstein. Binomial distribution. MathWorld-A Wolfram Web Resource. Available at http://mathworld.wolfram.com/BinomialDistribution.html.

[27] Dimitri P. Bertsekas and John N. Tsitsiklis. Introduction to Probability. Athena Scientific, 2002.

[28] L. Marsan and M.-F. Sagot. Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification. Journal of Computational Biology, 7:354-360, 2000.

[29] B. Brejova, C. DiMarco, T. Vinar, S. Hidalgo, G. Holguin, and C. Patten. Finding patterns in biological sequences. Unpublished project report for CS798G, University of Waterloo, September 2000.