CS5263 Bioinformatics Probabilistic modeling approaches for motif finding Motif representation • Collection of exact words – {ACGTTAC, ACGCTAC, AGGTGAC, …} • Consensus sequence (with wild cards) – {AcGTgTtAC} – {ASGTKTKAC} S=C/G, K=G/T (IUPAC code) • Position specific weight matrices (PWMs) Sequence Logo 1 2 3 4 5 6 7 8 9 A C .97 .10 .02 .03 .10 .01 .05 .85 .03 .01 .40 .01 .04 .05 .01 .05 .05 .03 G T I .01 .40 .95 .03 .40 .01 .3 .05 .03 .01 .10 .02 .90 .45 .97 .6 .05 .91 1.76 0.28 1.64 1.37 0.40 1.76 0.60 1.15 1.42 Finding Motifs Classification of approaches • Combinatorial search – Based on enumeration of words and computing word similarities – Analogy to DP for sequence alignment • Probabilistic modeling – Construct models to distinguish motifs vs nonmotifs – Analogy to HMM for sequence alignment Combinatorial motif finding Given a set of sequences S = {x1, …, xn} • A motif W is a consensus string w1…wK • Find motif W* with “best” match to x1, …, xn Definition of “best”: d(W, xi) = min hamming dist. between W and a word in xi d(W, S) = i d(W, xi) W* = argmin( d(W, S) ) Exhaustive searches 1. Pattern-driven algorithm: For W = AA…A to TT…T (4K possibilities) Find d( W, S ) Report W* = argmin( d(W, S) ) Running time: O( K N 4K ) (where N = i |xi|) Guaranteed to find the optimal solution. Exhaustive searches 2. Sample-driven algorithm: For W = a K-long word in some xi Find d( W, S ) Report W* = argmin( d( W, S ) ) OR Report a local improvement of W* Running time: O( K N2 ) WEEDER: algorithm sketch Current pattern P, |P| < K # mismatches (e, B) Seq occ A C G T T • A list containing all eligible nodes: with at most α mismatches to P • For each node, remember #mismatches accumulated (e), and bit vector (B) for seq occ, e.g. [011100010] • Bit OR all B’s to get seq occurrence for P • Suppose #occ >= m – Pattern still valid • Now add a letter WEEDER: algorithm sketch Current pattern P (e, B) A C G T T A • Simple extension: no branches. – No change to B – e may increase by 1 or no change – Drop node if e > α • Branches: replace a node with its child nodes – Drop if e > α – B may change • Re-do Bit OR using all B’s • Try a different char if #occ < m • Report P when |P| = K Probabilistic modeling approaches for motif finding Probabilistic modeling approaches • A motif model – Usually a PWM – M = (Pij), i = 1..4, j = 1..k, k: motif length • A background model – Usually the distribution of base frequencies in the genome (or other selected subsets of sequences) – B = (bi), i = 1..4 • A word can be generated by M or B Expectation-Maximization • For any word W, P(W | M) = PW[1] 1 PW[2] 2…PW[K] K P(W | B) = bW[1] bW[2] …bW[K] • Let = P(M), i.e., the probability for any word to be generated by M. • Then P(B) = 1 - • Can compute the posterior probability P(M|W) and P(B|W) P(M|W) ~ P(W|M) * P(B|W) ~ P(W|B) * (1-) Expectation-Maximization Initialize: Randomly assign each word to M or B • Let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise • Estimate parameters M, , B Iterate until converge: • E-step: Zxy = P(M | X[y..y+k-1]) for all x and y • M-step: re-estimate M, given Z (B usually fixed) Expectation-Maximization position 5 1 Initialize E-step probability 1 5 9 9 M-step • E-step: Zxy = P(M | X[y..y+k-1]) for all x and y • M-step: re-estimate M, given Z MEME • • • • • Multiple EM for Motif Elicitation Bailey and Elkan, UCSD http://meme.sdsc.edu/ Multiple starting points Multiple modes: ZOOPS, OOPS, TCM Gibbs Sampling • Another very useful technique for estimating missing parameters • EM is deterministic – Often trapped by local optima • Gibbs sampling: stochastic behavior to avoid local optima Gibbs sampling Initialize: Randomly assign each word to M or B • Let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise • Estimate parameters M, B, Iterate: • • • • • Randomly remove a sequence X* from S Recalculate model parameters using S \ X* Compute Zx*y for X* Sample a y* from Zx*y. Let Zx*y = 1 for y = y* and 0 otherwise Gibbs Sampling probability position 0.2 probability 0.15 0.1 0.05 0 0 2 4 6 8 10 position 12 14 16 18 Sampling • Gibbs sampling: sample one position according to probability • • – Update prediction of one training sequence at a time Viterbi: always take the highest Simultaneously update EM: take weighted average predictions of all sequences 20 Better background model • Repeat DNA can be confused as motif – Especially low-complexity CACACA… AAAAA, etc. • Solution: more elaborate background model – Higher-order Markov model 0th order: B = { pA, pC, pG, pT } 1st order: B = { P(A|A), P(A|C), …, P(T|T) } … Kth order: B = { P(X | b1…bK); X, bi{A,C,G,T} } Has been applied to EM and Gibbs (up to 3rd order) Gibbs sampling motif finders • Gibbs Sampler – First appeared as: Larence et.al. Science 262(5131):208-214. – Continually developed and updated. webpage – The newest version: Thompson et. al. Nucleic Acids Res. 35 (s2):W232W237 • AlignACE – Hughes et al., J. of Mol Bio, 2000 10;296(5):1205-14. – Allow don’t care positions – Additional tools to scan motifs on new seqs, and to compare and group motifs • BioProspector, X. Liu et. al. PSB 2001 , an improvement of AlignACE – Liu, Brutlag and Liu. Pac Symp Biocomput. 2001;:127-38. – Allow two-block motifs – Consider higher-order markov models Limits of Motif Finders 0 ??? gene • Given upstream regions of coregulated genes: – Increasing length makes motif finding harder – random motifs clutter the true ones – Decreasing length makes motif finding harder – true motif missing in some sequences Challenging problem d mutations n = 20 k L = 600 • (k, d)-motif challenge problem • Many algorithms fail at (15, 4)-motif for n = 20 and L = 600 • Combinatorial algorithms usually work better on challenge problem – However, they are usually designed to find (k, d)-motifs – Performance in real data varies (15, 4)-motif • Information content: 11.7 bits • ~ 6mers. Expected occurrence 1 per 3k bp Actual Results by MEME llr = 163 E-value = 3.2e+005 llr = 177 E-value = 1.5e+006 llr = 88 E-value = 2.5e+005 Motif finding in practice • Now we’ve found some good looking motifs – This is probably the easiest step • What to do next? – Are they real? – How do we find more instances in the rest of the genome? – What are their functional meaning? • Motifs => regulatory networks How to make sense of the motifs? • Each program usually reports a number of motifs (tens to hundreds) – Many motifs are variations of each other – Each program also report some different ones • Each program has its own way of scoring motifs – – – – Best scored motifs often not interesting AAAAAAAA ACACACAC TATATATAT How to make sense of the motifs? • Combine results from different algorithms usually helpful – Ones that appeared multiple times are probably more interesting • Except simple repeats like AAAAA or ATATATATA – Cluster motifs into groups. • Compare with known motifs in database – TRANSFAC – JASPAR – YPD (yeast promoter database) Strategies to improve results • How to tell real motifs (functional) from noises? Statistical test of significance. – Enrichment in target sequences vs background sequences Target set T Assumed to contain a common motif, P Background set B Assumed to not contain P, or with very low frequency Ideal case: every sequence in T has P, no sequence in B has P Statistical test for significance P Target set T Background set + target set B+T N P appeared in n sequences P appeared in m sequences • If n / N >> m / M – P is enriched (over-represented) in T – Statistical significance? • If we randomly draw N sequences from (B+T), how likely we will see at least n sequences having P? M Hypergeometric distribution • A box with M balls (seqs), of which m are red (with motifs), and the rest are blue (without motifs). – Red ball: sequences with motifs – Blue ball: sequences without motifs • We randomly draw N balls (seqs) from the box • What’s the probability we’ll see n red balls? m M m n N n hypegeom(n; M , N , m) M N # of choices to have n red balls Total # of choices to draw N balls Cumulative hypergeometric test for motif significance • We are interested in: if we randomly pick m balls, how likely that we’ll see at least n red balls? cHypegeom (n; M , N , m) min( m , N ) hypogeom(i; M , N , m) i n n 1 1 hypogeom(i; M , N , m) Null hypothesis: our selection is random. Alternative hypothesis: our selection favored red balls. When prob is small, we reject the null hypothesis. Equivalent: we accept the alternative hypothesis (The number of red balls is larger than expected). i 0 m M m n 1 i N i 1 M i 0 N Example • • • • • • • Yeast genome has 6000 genes Select 50 genes believed to be co-regulated by a common TF Found a motif from the promoter seqs of these 50 genes The motif appears in 20 of these 50 genes In the rest of the genome, 100 genes have this motif M = 6000, N = 50, m = 100+20 = 120, n = 20 Intuitively: – m/M = 120/6000=1/50. (1 out 50 genes has the motif) – N = 50, would expect only 1 gene in the target set to have the motif – 20-fold enrichment • P-value = cHyperGeom(20; 6000, 50, 120) = 6 x 10-22 • This motif is significantly enriched in the set of genes ROC curve for motif significance • Motif is usually a PWM • Any word will have a score – – – – Typical scoring function: Log (P(W | M) / P(W | B)) W: a word. M: a PWM. B: background model • To determine whether motif M occurred in a sequence, a cutoff has to be decided – – – – Different cutoffs give different # of occurrences Stringent cutoff: low occurrence in both + and - sequences Loose cutoff: high occurrence in both + and - sequences It may be better to look at a range of cutoffs ROC curve for motif significance P Target set T N Given a score cutoff Appeared in n sequences • • • • • • Background set + target set B+T Appeared in m sequences With different score cutoff, will have different m and n Assume you want to use P to classify T and B Sensitivity: n / N Specificity: (M-N-m+n) / (M-N) False Positive Rate = 1 – specificity: (m – n) / (M-N) With decreasing cutoff, sensitivity , FPR M ROC curve for motif significance A good cutoff Lowest cutoff. Every sequence has the motif. Sensitivity = 1. specificity = 0. sensitivity 1 ROC-AUC: area under curve. 1: the best. 0.5: random. Motif 1 Motif 2 Random 0 0 1-specificity Motif 1 is more enriched in motif 2. 1 Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1. Other strategies • Cross-validation – Randomly divide sequences into 10 sets, hold 1 set for test. – Do motif finding on 9 sets. Does the motif also appear in the testing set? • Phylogenetic conservation information – Does a motif also appears in the homologous genes of another species? – Strongest evidence – However, will not be able to find species-specific ones Other strategies • Finding motif modules – Will two motifs always appear in the same gene? • Location preference – Some motifs appear to be in certain location • E.g., within 50-150bp upstream to transcription start – If a detect motif has strong positional bias, may be a sign of its function • Evidence from other types of data sources – Do the genes having the motif always have similar activities (gene expression levels) across different conditions? – Interact with the same set of proteins? – Similar functions? – etc. To search for new instances • Usually many false positives • Score cutoff is critical • Can estimate a score cutoff from the “true” binding sites Motif finding Scoring function Log (P(W | M) / P(W | B)) A set of scores for the “true” sites. Take mean - std as a cutoff. (or a cutoff such that the majority of “true” sites can be predicted). To search for new instances • Use other information, such as positional biases of motifs to restrict the regions that a motif may appear • Use gene expression data to help: the genes having the true motif should have similar activities – Risk of circular reasoning: most likely this is how you get the initial sequences to do motif finding • Phylogenetic conservation is the key References • D’haeseleer P (2006) What are DNA sequence motifs? NATURE BIOTECHNOLOGY, 24 (4):423-425 • D’haeseleer P (2006) How does DNA sequence motif discovery work? NATURE BIOTECHNOLOGY, 24 (8):959-961 • MacIsaac KD, Fraenkel E (2006) Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2(4): e36 • Lawrence CE et. al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment, Science, 262(5131):208-214