Expectation Maximization for DNA motif discovery Alastair M. Kilpatrick Centre for Intelligent Systems and their Applications, University of Edinburgh http://www.cisa.inf.ed.ac.uk Sequence motifs are short, reasonably well conserved patterns of DNA known to have biological functions, serving as transcription factor binding sites, for example. Discovery of these motifs is an important task in the wider challenge of understanding the mechanisms of gene expression. Considerable effort has been invested into computational methods for motif discovery; these methods have been applied successfully in this area. The Expectation-Maximization (EM) algorithm is a classic general optimisation technique used in a number of motif discovery algorithms, including the popular MEME algorithm. Given a number of input DNA sequences, two models are created, one representing a recurring pattern within the sequences (the 'motif') and the other representing the remainder of the sequences (the 'background'). An initial estimation for the motif model parameters is made, then two iterative steps are carried out repeatedly to simultaneously optimise both the motif and the background model. We have implemented the EM algorithm for DNA motif discovery in alphaproteobacteria (Magnetospirillum magneticum sp.strain AMB-1), with promising results. The EM algorithm and MEME Experimental Work The basic EM algorithm for motif discovery (Lawrence and Reilly, 1990) assumes that each subsequence within the input dataset arises from either the motif or the background model; at the outset, it is not known which one (this knowledge is known as the ‘missing data’). Given an initial estimate of the motif model parameters, EM allows us to discover which subsequences arise from which model and hence any motifs within the (observed) dataset. In the expectation step (E-step), the current parameter values are used to evaluate the probability of the missing data given the observed data. This probability is then used in the maximization step (M-step) to re-estimate the model parameters. These steps are carried out iteratively until convergence is reached. The MEME algorithm was implemented and tested on both synthetic data and selected promoter sequences of genes from the M. magneticum sp. strain AMB-1 genome (a member of the alphaproteobacteria, a class of bacteria currently researched in the Ward lab, School of Biological Sciences at The University of Edinburgh). The results confirmed previously proposed regulatory motifs and proposed several new motifs potentially important in gene expression (Kilpatrick, 2009). (a) (b) Results of tests on AMB-1 confirm previously proposed iron regulatory motifs. Results are visualised to show percentage weights (a) and rescaled to show motif information content (b). Using the EM algorithm to iteratively estimate the two components in the ‘Old Faithful’ dataset. Figures show EM iterations 1, 2 & 30 (convergence). Note that although the initial estimates are poor, EM still converges to a good estimation. The original EM algorithm for motif discovery forms the basis of the popular MEME algorithm (Bailey and Elkan, 1995), which incorporates a number of novel features, including a method for discovering multiple motifs within a dataset and a method for automatically discovering the length of motifs. MEME can also successfully model alternative motif forms, including (quasi-)palindromic motifs and spaced dyad (‘gapped’) motifs. Wet-lab work has confirmed that MEME works well in unsupervised motif discovery (i.e. no prior knowledge is required). However, we have shown that if prior information about motifs is available, it can be exploited and incorporated within MEME in order to improve results (Kilpatrick, 2009). Two novel extensions to the MEME algorithm have been implemented (Kilpatrick, 2009). A method to score motifs based on information content has been successful in increasing the significance of returned motifs, particularly when searching for multiple motifs in a single dataset. A method to incorporate prior knowledge of motif positions based on the concept of the energy of a motif has also been successfully implemented, further improving results. Future work will consider stochastic methods and other forms of prior biological knowledge in order to improve the results of motif discovery algorithms. References Bailey, T.L. & Elkan, C. (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning. Lawrence, C.E. & Reilly, A.A. (1990) An EM algorithm for the identification and characterisation of common sites in unaligned biopolymer sequences. Proteins. Kilpatrick, A.M. (2009) Regulatory motif discovery in magnetic bacteria. MSc Thesis The University of Edinburgh AMK is supported by an EPSRC Doctoral Studentship.