Gibbs sampling for motif finding in biological sequences
Christopher Sheldahl

Biological Sequences
• Proteins: “strings” of amino acids. 20 common types; |alphabet| = 20
• Nucleic acids: “strings” of nucleotides. 4 common types for RNA; 4 for DNA; |alphabet| = 4

Motif Finding
• Multiple sequences: How to detect related segments (or “motifs”)?
1: …ATACGTTGAGA…
2: …CGACGTTGCAA…
3: …CCACGCTGGAC…

Sequence Databases are Large
• GenBank: nucleic acids: 108 million sequences
• SwissProt: proteins: ~500,000 sequences

Growth of SwissProt
http://www.expasy.org/sprot/relnotes/relstat.html

The basic problem
• We have N biological sequences.
• Assume that there is one evolutionarily related segment (substring) of known width W in each sequence.
• How do we find this segment in each sequence? Remember that there might be mutations.

Protein case: We’d like to know the following:
• Motif description (matrix): probability of each of the 20 amino acids (aa) at each position of the motif of width W. $q_{i,R}$: probability of aa type R at position i of the motif. The matrix is W x 20.
• Background probability (vector) of each of the 20 aa at sites not part of the motif: $p_1, \ldots, p_{20}$
• $A_k$: starting index of the motif in sequence k. Together, $A_1, \ldots, A_N$ form the alignment vector for the N sequences.

Scoring a given string
• If we knew the motif description matrix and the background vector, we could score any string of width W:
$P = \prod_{i=1}^{W} p_{R_i}$
$Q = \prod_{i=1}^{W} q_{i,R_i}$
$L = Q / P$
L: score of the segment with respect to the motif.
$R_i$: the aa type at position i of the segment.

Scoring an alignment of N sequences
We can score an entire alignment:
$\mathrm{AlignScore} = \prod_{x=1}^{N} L_x$, where $L_x = Q_x / P_x$

What if we just had one new sequence?
• What if we knew the motif description matrix and the background vector for a large number of sequences, and we wanted to quickly align a new (presumably related) sequence?
• Score the possible locations of the motif in the new sequence and pick one with a high (or maximal) value of L: $L = Q / P$
• Note that the better our motif description and background model are, the better the new alignment will be.

Motif Finding Dilemma
• If we knew the alignment, we could calculate the motif description matrix and the background vector.
• If we knew the motif description and the background, we could calculate the alignment (by choosing high-L positions).
• We don’t know either of these. We’ve seen one way to approach this problem before...

Expectation Maximization or The Road Not Taken in this Lecture
Start with a random motif description and background.
Repeat until the motif description converges:
1) E-step: Calculate alignment points ($A_k$) from the motif description and background.
2) M-step: Calculate the motif description and background from the alignments.
Lawrence and Reilly, 1990. “An Expectation Maximization Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences”. Proteins.

Local optima
• After initialization, the simple EM algorithm is a deterministic process.
• If the initialization is bad, it can get trapped in a local optimum.
• Heuristic: Try many different starting points.

Random Sampling
• Note the heuristic of trying many random initial positions. Maybe we can escape local optima if we employ randomness more thoroughly.
• Can we draw random sample alignments from the distribution p(Align | S, W, Motif Matrix, Background), where Align means $A_1, A_2, \ldots, A_N$?

Markov Chain Monte Carlo
• Markov Chain Monte Carlo (MCMC) methods generate a Markov chain of points whose distribution converges to a distribution of interest.
• “Monte Carlo”: The methods employ randomness.

Metropolis-Hastings
• Metropolis-Hastings is an MCMC method that can sample from any distribution P, using a proposal distribution Q(x’; x).
• Initialize with random x.
• Generate a new proposed position x’ according to Q(x’; x).
• Compute α = min( P(x’) / P(x), 1 ) and accept the change with probability α. This form assumes a symmetric proposal, Q(x’; x) = Q(x; x’); in general, α = min( P(x’) Q(x; x’) / ( P(x) Q(x’; x) ), 1 ).
Figure: Wikipedia

Gibbs Sampling
• Gibbs sampling is a variety of Metropolis-Hastings sampling in which the proposed move is always accepted.
• For multivariate distributions, Gibbs sampling changes only one variable at a time, drawing it from its conditional distribution given the current values of all the others.
• This makes Gibbs sampling particularly useful for multivariate distributions.

Motif Finding with Gibbs: Site Sampler
• Site sampler: Sample starting points for the motif in each sequence.
• Start with random alignments, then use random sampling for one sequence at a time to gradually improve the alignments.
Lawrence et al., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment”, Science. 1993. 262(5131), 208-214

Gibbs sampling: Changing One Thing at a Time
• N = 3
• We want to sample the alignment point $A_1$ for sequence 1, given the pattern description M, background model B, the sequences S and width W, and the alignments $A_2$ (for sequence 2) and $A_3$ (for sequence 3):
$p(A_1 \mid S, M, B, W, A_2, A_3)$

Site Sampler Algorithm
Initialize with random alignment points.
While not converged:
Do steps 1 and 2 for each of the N sequences:
1) Predictive update step: Calculate the motif matrix and background using all sequences but the currently selected one.
2) Sampling step: Calculate $L_x = Q_x / P_x$ for all starting points x in this sequence. Choose one with probability proportional to $L_x$.
• The inner loop iterates over the sequences.
• In step 2 you sample a new value for $A_k$, i.e. the alignment position for sequence k.
• Convergence: In theory you sample until the alignment no longer changes.

Example: Random Alignments
Rouchka, 1997, “A Brief Overview of Gibbs Sampling”
DNA: Alphabet = {A, C, G, T}

Example: Initial Counts

New Counts
Remove 1st sequence (ATTTAT)
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
Rouchka, 1997.

Sequence 1: TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
Rouchka, 1997.
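The site sampler algorithm above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Lawrence et al.: it uses a DNA alphabet, adds a pseudocount of 1 to every count so that no probability is zero, and runs a fixed number of sweeps instead of testing for convergence. The names `motif_and_background`, `likelihood_ratio`, `site_sampler`, `pseudo` and `sweeps` are hypothetical.

```python
import random

ALPHABET = "ACGT"  # assumption: DNA sequences over these four letters only


def motif_and_background(seqs, starts, width, skip, pseudo=1.0):
    """Predictive update: estimate q and p from every sequence except `skip`."""
    counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
    bg = {a: pseudo for a in ALPHABET}
    for k, (seq, start) in enumerate(zip(seqs, starts)):
        if k == skip:
            continue
        for j, ch in enumerate(seq):
            if start <= j < start + width:
                counts[j - start][ch] += 1  # residue inside the motif window
            else:
                bg[ch] += 1                 # residue in the background
    q = [{a: col[a] / sum(col.values()) for a in ALPHABET} for col in counts]
    total = sum(bg.values())
    p = {a: bg[a] / total for a in ALPHABET}
    return q, p


def likelihood_ratio(segment, q, p):
    """L = Q / P for one segment whose length equals the motif width."""
    ratio = 1.0
    for i, ch in enumerate(segment):
        ratio *= q[i][ch] / p[ch]
    return ratio


def site_sampler(seqs, width, sweeps=200, seed=0):
    """Return one sampled start index per sequence after `sweeps` passes."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]  # random init
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            # Step 1: model built without sequence k.
            q, p = motif_and_background(seqs, starts, width, skip=k)
            # Step 2: sample a start with probability proportional to L_x.
            weights = [likelihood_ratio(seq[x:x + width], q, p)
                       for x in range(len(seq) - width + 1)]
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Calling `site_sampler(["ATACGTTGAGA", "CGACGTTGCAA", "CCACGCTGGAC"], width=6)` returns a list of three start indices. Because each start is sampled and not chosen as the maximum, different seeds can give different alignments, which is what lets the sampler move away from a poor initialization.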
Final Alignment

Assumptions for Site Sampler
• Assumption of one motif.
• Assumption of one copy of the motif in each sequence.
• Assumption that we know the motif width.

Motif Sampler
• Sample different motifs for each part of the sequence. Store motif descriptions and a background vector for each motif.
• Allows multiple motifs, and a varying number of copies of each motif.
• Still subject to the assumption of known motif width.
Neuwald et al., “Gibbs Motif Sampling: Detection of bacterial membrane protein repeats”. 1995. Protein Science. 4:1618-1632.

Null model
• The null model has pattern description probabilities identical to the background probabilities for its constituent amino acids.

Widths and Copy Numbers
• The motif sampler requires the widths and the expected number of copies of each motif per sequence as input parameters. The widths are fixed during the algorithm, but the number of copies can change.
• Construction of subfamilies (more specific models) is favored by larger widths and smaller expected numbers of copies.
• Construction of superfamilies (less specific models) is favored by smaller widths and larger expected numbers of copies.

Motif Sampler Algorithm
Initialize with $E_i$ non-overlapping random segments for each motif i in each of the sequences.
While not converged:
Do steps 1 and 2 for successive segments in the biological sequences:
1) Predictive update step: If this segment is in an alignment, remove it and recalculate the motif matrix and background data excluding this segment.
2) Sampling step: Sample one of the motifs with a probability proportional to the score that the motif description assigns to this segment.
• In the inner loop the motif sampler iterates through segments (substrings).
• The motif sampler samples among different motif models for each segment.

Modifications to the Samplers
• There are methods to allow for gapped motifs.
• High-scoring alignments can be stored for later investigation.
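The sampling step of the motif sampler, choosing among motif models for one segment, can be sketched as follows. This is a simplified illustration, not the procedure of Neuwald et al.: it assumes all motifs share the segment's width, and it ignores the expected copy numbers that the slides list as input parameters. The null model gets weight 1 because its pattern probabilities equal the background, so its ratio Q / P is 1. The names `segment_ratio` and `sample_model` are hypothetical.

```python
import random


def segment_ratio(segment, q, p):
    """Q / P for one segment under motif matrix q and background p."""
    ratio = 1.0
    for i, residue in enumerate(segment):
        ratio *= q[i][residue] / p[residue]
    return ratio


def sample_model(segment, motifs, background, rng):
    """Return the index of the sampled motif, or None for the null model."""
    # Null model: q equals p at every position, so its ratio is exactly 1.
    weights = [1.0]
    # One weight per motif model: the score that model assigns the segment.
    weights += [segment_ratio(segment, q, background) for q in motifs]
    choice = rng.choices(range(len(weights)), weights=weights)[0]
    return None if choice == 0 else choice - 1
```

A segment that no motif explains well yields ratios far below 1, so it is almost always assigned to the null model, i.e. left out of every alignment. This is how the number of copies of each motif can change during sampling.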
Neuwald et al., 1995

Porin structures

Porin alignment
Neuwald et al., 1995

Motif descriptions

Motivation for Gibbs sampling
• We have a joint probability density $f(x, y_1, y_2, \ldots, y_p)$.
• We want a marginal density, $f(x)$.
• The integrations are difficult.
Casella and George, “Explaining the Gibbs Sampler”

Near-optimum Sampler
• A particular motif may not be present in the best alignment found, even though it is found in many other near-optimal alignments.
• Sample among near-optimal alignments to capture all sites present.

Column sampler
• The column sampler allows for a motif to shift: GCACCTG --> GCACCTG
• It also allows for the development of motifs with gaps: GCACCTG --> GCACCTG
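To make the "Motivation for Gibbs sampling" slide concrete, here is a toy sketch of recovering a marginal from a joint density without integrating. The example is an assumption chosen for illustration and is not taken from the slides: a standard bivariate normal with correlation `rho`, for which each conditional is itself normal, x | y ~ N(rho·y, 1 − rho²), and the true marginal of x is N(0, 1). The function name `gibbs_bivariate_normal` and its parameters are hypothetical.

```python
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Draw x-samples from a standard bivariate normal via Gibbs sampling."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5  # conditional standard deviation
    x, y = 0.0, 0.0
    xs = []
    for t in range(burn_in + n_samples):
        # Change one variable at a time, each from its conditional.
        x = rng.gauss(rho * y, sd)  # x | y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, sd)  # y | x ~ N(rho * x, 1 - rho^2)
        if t >= burn_in:
            xs.append(x)  # retained x values approximate the marginal f(x)
    return xs
```

The retained `xs` should have mean close to 0 and variance close to 1, matching the known marginal, even though the sampler only ever used the two conditional distributions.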