Techniques & Applications of Sequence Comparison
Limsoon Wong
Institute for Infocomm Research
27 August 2003
Copyright 2003 limsoon wong

Lecture plan
• Basic sequence comparison methods
  – Pairwise alignment
  – Multiple alignment
• Applications
  – Active sites
  – Homologs
  – Key mutation sites
• P-value
• More advanced sequence comparison methods

Basic Sequence Comparison Methods
A brief refresher

Sequence Comparison: Motivations
• DNA is the blueprint for living organisms
  ⇒ Evolution is related to changes in DNA
  ⇒ By comparing DNA sequences we can infer evolutionary relationships between the sequences w/o knowledge of the evolutionary events themselves
• Foundation for inferring function, active site, and key mutations

Alignment

Alignment: An Example
[Figure: alignment of Sequence U and Sequence V, with a match, a mismatch, and an indel labeled]

Alignment: Simple-minded Probability & Score
• Define score S(A) by simple log likelihood as S(A) = log(prob(A)) - [n log(s) + r log(s)], with log(p/s) = 1
• Then S(A) = #matches - #mismatches - #indels

Global Pairwise Alignment: Problem Definition
• Given sequences U and V of lengths n and m, the number of possible alignments is given by
  – f(n, m) = f(n-1, m) + f(n-1, m-1) + f(n, m-1)
  – f(n, n) ~ (1 + √2)^(2n+1) n^(-1/2)
• The problem of finding a global pairwise alignment is to find an alignment A so that S(A) is maximal among an exponential number of possible alternatives

Global Pairwise Alignment: Dynamic Programming Solution
• Define an indel-similarity matrix s(.,.); e.g.,
  – s(x,x) = 1
  – s(x,y) = -μ, if x ≠ y (μ: mismatch penalty)
• Then: [equation on slide]

Global Pairwise Alignment: More realistic handling of indels
• In Nature, indels of several adjacent letters are not the sum of single indels, but the result of one event
• So reformulate as follows: [equation on slide]

Variations of Pairwise Alignment
• Fitting
a “short” seq to a “long” seq.
• Indels at beginning and end are not penalized
• Find “local” alignment
• find i, j, k, l, so that
  – S(A) is maximized,
  – A is alignment of u_i…u_j and v_k…v_l
[Figures: schematic of sequences U and V for each variation]

Multiple Alignment

Multiple Alignment: Naïve Approach
• Let S(A) be the score of a multiple alignment A. The optimal multiple alignment A of sequences U1, …, Ur can be extracted from the following dynamic programming computation of S_{m1,…,mr}: [equation on slide]
• This requires O(2^r) steps
• Exercise: Propose a practical approximation

Applications of Sequence Comparison

Emerging Patterns
• An emerging pattern is a pattern that occurs significantly more frequently in one class of data compared to other classes of data
• A lot of biological sequence analysis problems can be thought of as extracting emerging patterns from sequence comparison results

A protein is a ...
• A protein is a large complex molecule made up of one or more chains of amino acids
• Proteins perform a wide variety of activities in the cell

Function Assignment to Protein Sequence
SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNR
YVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWE
QNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGD
VTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTG
TFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELE
VT
• How do we attempt to assign a function to a new protein sequence?
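
Answering this question starts with comparing the new sequence against known ones, which rests on the pairwise alignment scores from the earlier slides. The following is a minimal sketch of that dynamic programming idea, not the lecture's own formulation. It assumes the simple-minded score S(A) = #matches - #mismatches - #indels, with each indel letter penalized separately (the "more realistic" multi-letter indel handling is not modeled). The function name `align_score` and its parameters are illustrative only.

```python
def align_score(u, v, match=1, mismatch=-1, indel=-1, local=False):
    """Best alignment score of u and v by dynamic programming.

    Assumed scoring (an illustration, not the slides' exact formula):
    +1 per match, -1 per mismatch, -1 per indel letter.
    local=False: global alignment of all of u with all of v.
    local=True:  best score over all substring pairs of u and v.
    """
    m = len(v)
    # prev[j] = best score for the prefixes u[:i-1] and v[:j]
    prev = [0 if local else j * indel for j in range(m + 1)]
    best = 0  # only used for local alignment
    for i in range(1, len(u) + 1):
        curr = [0 if local else i * indel]
        for j in range(1, m + 1):
            s = match if u[i - 1] == v[j - 1] else mismatch
            cell = max(
                prev[j - 1] + s,      # u[i-1] aligned with v[j-1]
                prev[j] + indel,      # u[i-1] aligned with a gap
                curr[j - 1] + indel,  # v[j-1] aligned with a gap
            )
            if local:
                cell = max(cell, 0)   # allow the alignment to start here
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[m]


if __name__ == "__main__":
    print(align_score("ACGT", "AGT"))                       # global score
    print(align_score("XXACGTXX", "YYACGTYY", local=True))  # local score
```

Only two rows of the table are kept, so memory is proportional to the shorter dimension; recovering the alignment itself (not just its score) would need the full table or a traceback structure.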

Function Assignment: Guilty-by-Association
• Compare the target sequence T with sequences S1, …, Sn of known function in a database
• Determine which ones amongst S1, …, Sn are the most likely homologs of T
• Then assign to T the same function as these homologs
• Finally, confirm with suitable wet experiments

Guilty-by-Association: Homologs obtained by BLAST
• Thus our example sequence could be a protein tyrosine phosphatase (PTP)

Guilty-by-Association: Caveats
• Ensure that the effect of database size has been accounted for
• Ensure that the function of the homolog is not derived via invalid “transitive assignment”
• Ensure that the target sequence has all the key features associated with the function, e.g., active site and/or domain

Effect of database size: Interpretation of P-value
• Seq. comparison progs, e.g. BLAST, often associate a P-value to each hit
• P-value is interpreted as prob. that a random seq. has an equally good alignment
• Suppose the P-value of an alignment is 10^-6
• If database has 10^7 seqs, then you expect 10^7 * 10^-6 = 10 seqs in it that give an equally good alignment
⇒ Need to correct for database size if your seq. comparison prog does not do that!

Examples of Invalid Function Assignment: The IMP dehydrogenases (IMPDH)
[Table: a partial list of IMP dehydrogenase misnomers in complete genomes remaining in some public databases]

IMPDH: Domain Structure
[Figure: domain structures; panels "IMPDH Misnomer in Methanococcus jannaschii" and "IMPDH Misnomers in Archaeoglobus fulgidus"]
• Typical IMPDHs have 2 IMPDH domains that form the catalytic core and 2 CBS domains.
• A less common but functional IMPDH (E70218) lacks the CBS domains.
• Misnomers show similarity to the CBS domains

IMPDH: Invalid Transitive Assignment
[Figure: sequences A, B, C; labels "Root of invalid transitive assignment", "Mis-assignment of function", "No IMPDH domain"]

IMPDH: Emerging Pattern
[Figure: domain structures; panels "Typical IMPDH", "Functional IMPDH w/o CBS", "IMPDH Misnomer in Methanococcus jannaschii", "IMPDH Misnomers in Archaeoglobus fulgidus"]
• Most IMPDHs have 2 IMPDH and 2 CBS domains.
• Some IMPDHs (e.g., E70218) lack CBS domains.
⇒ IMPDH domain is the emerging pattern

Discover Active Site and/or Domain
• How to discover the active site and/or domain of a function in the first place?
  – Multiple alignment of homologous seqs
  – Determine conserved positions
  ⇒ Emerging patterns relative to background
  ⇒ Candidate active sites and/or domains
• Easier if sequences of distant homologs are used

Discover Active Site: Multiple Alignment of PTPs
• Notice the PTPs agree with each other on some positions more than other positions
• These positions are more impt wrt PTPs
• Else they wouldn’t be conserved by evolution
⇒ They are candidate active sites

Identifying Key Mutation Sites
[Figure: sequence from a typical PTP domain D2]
• Some PTPs have 2 PTP domains
• PTP domain D1 has much more activity than PTP domain D2
• Why? And how do you figure that out?

Key Mutation Site: Emerging Patterns of PTP D1 vs D2
• Collect example PTP D1 sequences
• Collect example PTP D2 sequences
• Make multiple alignment A1 of PTP D1
• Make multiple alignment A2 of PTP D2
• Are there positions conserved in A1 that are violated in A2?
• These are candidate mutations that cause PTP activity to weaken
• Confirm by wet experiments

Key Mutation Site: PTP D1 vs D2
[Figure: aligned PTP D1 and D2 sequences]
• Positions marked by “!” and “?” are likely places responsible for reduced PTP activity
  – All PTP D1 agree on them
  – All PTP D2 disagree on them

Key Mutation Site: PTP D1 vs D2
[Figure: aligned PTP D1 and D2 sequences]
• Positions marked by “!” are even more likely as 3D modeling predicts they induce large distortion to structure

Key Mutation Sites: Confirmation by Mutagenesis Expt
• What wet experiments are needed to confirm the prediction?
  – Mutate E → D in D2 and see if there is gain in PTP activity
  – Mutate D → E in D1 and see if there is loss in PTP activity

Understanding P-value

What is P-value?
• What does E-value mean?
  – Statistical notion of P-value
  – Prob that a random seq gives an equally good alignment
• How do we calculate it?

Hypothesis Testing
• Null hypothesis H0
  – A claim (about a probability distribution) that we are interested in rejecting or refuting
• Alternative hypothesis H1
  – The contrasting hypothesis that must be true if H0 is rejected
• Type I error
  – H0 is wrongly rejected
• Type II error
  – H1 is wrongly rejected
• Level of significance
  – Probability of getting a type I error
[Figure: distribution with an acceptance region flanked by two rejection regions]

Description Level, aka P-value
• Instead of fixing the significance level at a value α, we may be interested in computing the probability of getting a result as extreme as, or more extreme than, the observed result under H0
• The description level of a test H0 is the smallest level of significance α at which the observed test result would be declared significant, that is, would be declared indicative of rejection of H0

P-value: Key Questions
• Recall S(A) scores an alignment A
• Let H(U,V) = max{ S(A) |
A is an alignment of U and V}
• Suppose the letters in U and V are iid. Can we calculate h = E(H(U,V))?
• Furthermore, can we calculate P(H(U,V) > h + c)?

Alignment: Statistical Understanding
• Ignoring indels for now, we can think of a good alignment as one that has a long contiguous stretch of matches
• The matches are essentially a long run of “heads” in a series of coin tosses
• So we think in terms of “a headrun of length t begins at position i”

E(H(U,V)): Erdős-Rényi Thm for Exact Match

E(H(U,V)): Arratia-Waterman Thm for Local Alignment

Alignment: Statistical Understanding
• Recall we think in terms of “a headrun of length t begins at position i”
• Caution:
  – headruns occur in “clumps”
  – if there is a headrun of length t at position i, then with high prob there is also a headrun of length t at position i+1
• So, we count only the 1st headrun in a clump; i.e., a headrun preceded by a tail

Arratia-Gordon Thm on Large Deviations for binomials

E(H(U,V)): Exact Match, Accounting for Clumps

E(H(U,V)): Local Alignment, Accounting for Clumps

P(H(U,V) > E(H(U,V))): Approximate P-Value
• Our E(Y_n) from previous slides are in the right form :-)

More Advanced Sequence Comparison Methods
• PHI-BLAST
• Iterated BLAST

PHI-BLAST: Pattern-Hit Initiated BLAST
• Input
  – protein sequence and
  – pattern of interest that it contains
• Output
  – protein sequences containing the pattern and having good alignment surrounding the pattern
• Impact
  – able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional one-pass methods

PHI-BLAST: How it works
• find sequences with good flanking alignment
• find from database all seq
containing given pattern

PHI-BLAST: IMPACT

ISS: Intermediate Sequence Search
• Two homologous seqs, which have diverged beyond the point where their homology can be recognized by a simple direct comparison, can be related through a third sequence that is suitably intermediate between the two
• High score betw A & C, and betw B & C, imply A & B are related even though their own match score is low

ISS: Search Procedure
• Input seq A
• BLAST against db (p-value @ 0.081) → matched seqs M1, M2, ...
• Keep regions in M1, M2, … that match A; discard rest of M1, M2, ... → matched regions R1, R2, ...
• BLAST against db (p-value @ 0.0006) → results H1, H2, ...

ISS: IMPACT
• No obvious match between Amicyanin and Ascorbate Oxidase

ISS: IMPACT
• Convincing homology via Plastocyanin
• Previously only this part was matched

PSI-BLAST: Position-Specific Iterated BLAST
• given a query seq, initial set of homologs is collected from db using GAP-BLAST
• weighted multiple alignment is made from query seq and homologs scoring better than threshold
• position-specific score matrix is constructed from this alignment
• matrix is used to search db for new homologs
• new homologs with good score are used to construct new position-specific score matrix
• iterate the search until no new homologs found, or until specified limit is reached

SAM-T98 HMM Method
• similar to PSI-BLAST
• but uses HMM instead of position-specific score matrix

Comparisons
• Iterated seq. comparisons vs pairwise seq. comparison

Suggested Readings

Function Assignment
• S.E. Brenner. “Errors in genome annotation”, TIG, 15:132--133, 1999
• T.F. Smith & X. Zhang.
“The challenges of genome sequence annotation or ‘The devil is in the details’”, Nature Biotech, 15:1222--1223, 1997
• D. Devos & A. Valencia. “Intrinsic errors in genome annotation”, TIG, 17:429--431, 2001
• K.L. Lim et al. “Interconversion of kinetic identities of the tandem catalytic domains of receptor-like protein tyrosine phosphatase PTP-alpha by two point mutations is synergistic and substrate dependent”, JBC, 273:28986--28993, 1998

Alignment Applications
• J. Park et al. “Sequence comparisons using multiple sequences detect three times as many remote homologs as pairwise methods”, JMB, 284(4):1201--1210, 1998
• J. Park et al. “Intermediate sequences increase the detection of homology between sequences”, JMB, 273:349--354, 1997
• Z. Zhang et al. “Protein sequence similarity searches using patterns as seeds”, NAR, 26(17):3986--3990, 1998
• M.S. Gelfand et al. “Gene recognition via spliced sequence alignment”, PNAS, 93:9061--9066, 1996
• S.F. Altschul et al. “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs”, NAR, 25(17):3389--3402, 1997

Alignment Statistics
• P. Erdős & A. Rényi. “On a new law of large numbers”, J. Anal. Math., 22:103--111, 1970
• R. Arratia & M.S. Waterman. “Critical phenomena in sequence matching”, Ann. Prob., 13:1236--1249, 1985
• R. Arratia, P. Morris, & M.S. Waterman. “Stochastic scrabble: Large deviations for sequences with scores”, J. Appl. Prob., 25:106--119, 1988
• R. Arratia & L. Gordon. “Tutorial on large deviations for the binomial distribution”, Bull. Math. Biol., 51:125--131, 1989

Lecture is on Wednesday (27/8/2003) at 6.30pm at LT4. If you have time, we could meet a bit earlier to have a chat. What about 5.30pm in my office at N4-2c-79?