Sequence analysis of nucleic acids and proteins: part 1
Similarity search

Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000

Search and learning problems in sequence analysis

Problems in Biological Science
• Similarity search
  – Pairwise sequence alignment
  – Database search for similar sequences
  – Multiple sequence alignment
  – Phylogenetic tree reconstruction
  – Protein 3D structure alignment
• Structure/function prediction
  – ab initio prediction: RNA secondary structure prediction; RNA 3D structure prediction; Protein 3D structure prediction
  – Knowledge based prediction: Motif extraction; Functional site prediction; Cellular localization prediction; Coding region prediction; Transmembrane domain prediction; Protein secondary structure prediction; Protein 3D structure prediction
• Molecular classification
  – Superfamily classification
  – Ortholog/paralog grouping of genes
  – 3D fold classification

Math/Stat/CompSci method
• Optimization algorithms
  – Dynamic programming (DP)
  – Simulated annealing (SA)
  – Genetic algorithms (GA)
  – Markov Chain Monte Carlo (MCMC: Metropolis and Gibbs samplers)
  – Hopfield neural network
• Pattern recognition and learning algorithms
  – Discriminant analysis
  – Neural networks
  – Support vector machines
  – Hidden Markov models (HMM)
  – Formal grammar
  – CART
• Clustering algorithms
  – Hierarchical, k-means, etc
  – PCA, MDS, etc
  – Self-organizing maps, etc

A comparison of the homology search and the motif search for functional interpretation of sequence information.
[Figure: two flow diagrams. Homology Search: New sequence, Retrieval, Sequence database (Primary data), Similar sequence, Expert knowledge, Sequence interpretation. Motif Search: Knowledge acquisition, Expert knowledge, New sequence, Motif library (Empirical rules), Inference, Sequence interpretation.]

Pairwise sequence alignment by the dynamic programming algorithm. The algorithm involves finding the optimal path in the path matrix (a), which is equivalent to searching for the optimal solution in the search tree (b).
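A minimal Python sketch of the path-matrix computation, using the sequences AIMS and AMOS from the path-matrix figure. The scoring scheme (match = +1, mismatch = -1, each gap position = -1), the score-maximizing formulation and the name global_align are assumptions made for illustration; the slides do not give numerical weights.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming (score-maximizing form).

    The scoring values are illustrative assumptions, not taken from the slides.
    Returns (optimal score, aligned s, aligned t).
    """
    n, m = len(s), len(t)

    def w(i, j):
        # Substitution weight for s[i-1] against t[j-1] (1-based i, j).
        return match if s[i - 1] == t[j - 1] else mismatch

    # Path matrix: D[i][j] is the best score for s[:i] aligned with t[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(
                D[i - 1][j - 1] + w(i, j),  # diagonal step: align s[i-1] with t[j-1]
                D[i - 1][j] + gap,          # vertical step: gap in t
                D[i][j - 1] + gap,          # horizontal step: gap in s
            )

    # Traceback: follow one optimal path from the corner back to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + w(i, j):
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = global_align("AIMS", "AMOS")
    print(score)
    print(top)
    print(bottom)
```

Under these assumed weights the traceback recovers the alignment AIM-S / A-MOS shown in the figure; other weights may give a different optimal path.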
[Figure: (a) Path Matrix for the sequences AIMS and AMOS, giving the alignment
    AIM-S
    A-MOS
(b) Search Tree, with pruning by an optimization function.]

Methods for computing the optimal score in the dynamic programming algorithm: (a) the gap penalty is a constant; (b) the gap penalty is a linear function of the gap length.
[Figure: (a) cells $D_{i-1,j-1}$, $D_{i-1,j}$ and $D_{i,j-1}$ leading to $D_{i,j}$, labelled with $w_{s(i),t(j)}$ and $d$; (b) cells $D_{i-1,j-1}$, $D_{i-1,j}$ and $D_{i,j-1}$ leading to $D_{i,j}^{(1)}$, $D_{i,j}^{(2)}$ and $D_{i,j}^{(3)}$, labelled with $w_{s(i),t(j)}$ and $b$.]

Concepts of global and local optimality in the pairwise sequence alignment. The distinction is made as to how the initial values are assigned to the path matrix.
[Figure: path matrices for (a) Global vs. Global, (b) Local vs. Global and (c) Local vs. Local, with 0 entries marking initial values.]

The order of computing matrix elements in the path matrix, which is suitable for (a) sequential processing and (b) parallel processing.
[Figure: neighbouring matrix elements such as (i-1, j-1), (i-1, j), (i, j-1) and (i, j), shown for cases (a) and (b).]

The dynamic programming algorithm can be applied to limited areas, rather than to the entire matrix, after rapidly searching the diagonals that contain candidate markers.
[Figure: path matrix with axis labels i (1 to n), j (1 to m) and diagonal index l (1 to n+m-1).]

The hashing technique for rapid sequence comparison. In this case the horizontal sequence is converted to a hash table, which contains the locations of the four nucleotides.

Horizontal sequence: ATCACACGGC

Key   Address
A     1 4 6
C     3 5 7 10
G     8 9
T     2

[Figure: dot matrix (asterisks) comparing this sequence with TATCGCAGTCAATTC..; labels: Query Sequence, Hash Table.]
Used in FASTA

An example of the finite state automaton for pattern matching
[Figure: states Q0, Q1, Q2, Q3 and Q4 connected by transitions labelled A, B and C. Bold arrows lead to outputs indicating patterns have been found.]
Used in BLAST

The tree-based progressive method for multiple sequence alignment, which utilizes: (a) a dendrogram obtained by cluster analysis and (b) group alignment for pairwise comparison of groups of sequences.
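Step (a) of the progressive method can be sketched in Python as below: pairwise distances are clustered to give a dendrogram that fixes the order in which sequences and groups would later be aligned. The toy sequences, the use of the fraction of differing positions as the distance, average linkage as the clustering rule, and the function names are all assumptions for illustration; the slides do not say which distance or clustering method is used, and step (b), the group alignment itself, is not shown.

```python
from itertools import combinations

# Hypothetical pre-aligned toy sequences of equal length (not from the slides).
TOY_SEQS = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TTGTACCA",
    "s4": "TTGTAGCT",
}


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Average-linkage clustering of named sequences.

    Returns the dendrogram as nested tuples of sequence names.
    """
    # Each cluster is keyed by its (sub)tree and stores its member names.
    clusters = {name: [name] for name in seqs}
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def average_distance(c1, c2):
        # Mean distance over all pairs with one member in each cluster.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters into a new internal node.
        a, b = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        members = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = members
    return next(iter(clusters))


if __name__ == "__main__":
    print(guide_tree(TOY_SEQS))
```

In a progressive aligner the innermost pairs of the returned tree would be aligned first, and each internal node would then be handled as a group-to-group alignment.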
[Figure: (a) dendrogram of the sequences DEPGG3, DEBYG3, DEZYG3, DEBSG and DEHUG3; (b) group alignment of the corresponding gapped sequence segments.]

Possible tree topologies in the phylogenetic analysis of: (a) three sequences or (b) four sequences. Filled circles represent extant sequences, while open circles represent common ancestors.
[Figure: (a) one topology for sequences A, B, C; (b) three topologies for sequences A, B, C, D.]

Simulated annealing and Metropolis Monte Carlo methods are based on the concept of thermal fluctuations in the energy functions.

$\Delta E = E(x'_n) - E(x_n)$
$p = 1$ when $\Delta E \le 0$
$p = \exp(-\Delta E / T_n)$ when $\Delta E > 0$

[Figure: energy E plotted against x.]

Dynamic programming to find edit distances
- Edit operation: M, R, I, D (match, replace, insert, delete)
- Edit transcript: A string over the alphabet M, R, I, D that describes a transformation of one string into another.
  Example:
    R D I M D M
    M A - T H S
    A - R T - S
- Edit (Levens(h)tein) distance: The minimum number of edit operations necessary to transform one string into another. (Note: matches are not counted.)
  Example:
    R   D   I   M   D   M
    1 + 1 + 1 + 0 + 1 + 0 = 4

The recurrence
- Stage: position in the edit transcript;
- State: I, D, M, or R;
- Optimal value function: D(i, j), where D(i, j) = edit distance of Seq1[1...i] and Seq2[1...j]
- Recurrence relation:

$$D(i,j) = \min \begin{cases} 1 + D(i-1,\, j) \\ 1 + D(i,\, j-1) \\ t(i,j) + D(i-1,\, j-1) \end{cases}$$

  where

$$t(i,j) = \begin{cases} 1, & \mathrm{Seq1}(i) \ne \mathrm{Seq2}(j) \\ 0, & \mathrm{Seq1}(i) = \mathrm{Seq2}(j) \end{cases}$$

The tabulation, D(i, j)
The table is filled in step by step: row 0 and column 0 first, then the remaining cells row by row. The completed table:

              Seq2(j):       A   R   T   S
              j:         0   1   2   3   4
Seq1(i)   i
          0              0   1   2   3   4
   M      1              1   1   2   3   4
   A      2              2   1   2   3   4
   T      3              3   2   2   2   3
   H      4              4   3   3   3   3
   S      5              5   4   4   4   3

The traceback
Tracing back through the completed table from D(5, 4) = 3 gives three optimal solutions.

The solutions - #1
    1   0   1   1   0       = 3
    D   M   R   R   M
    M   A   T   H   S
    -   A   R   T   S

The solutions - #2
    1   0   1   0   1   0   = 3
    D   M   I   M   D   M
    M   A   -   T   H   S
    -   A   R   T   -   S

The solutions - #3
    1   1   0   1   0       = 3
    R   R   M   D   M
    M   A   T   H   S
    A   R   T   -   S

“Life must be lived forwards and understood backwards.” - Søren Kierkegaard

BLOSUM62 SCORING MATRIX

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
    |     |||         |       |        ||||||    |    || ||
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

Example entries: D:D = +6, D:R = -2

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S       4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T           5  -1   0  -2   0  -1  -1  -1  -2  -1  -1  -1  -1  -1   0  -2  -2  -2
P               7  -1  -2  -2  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A                   4   0  -2  -2  -1  -1  -2  -1  -1  -1  -1  -1   0  -2  -2  -3
G                       6   0  -1  -2  -2  -2  -2  -2  -3  -4  -4  -3  -3  -3  -2
N                           6   1   0   0   1   0   0  -2  -3  -3  -3  -3  -2  -4
D                               6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E                                   5   2   0   0   1  -2  -3  -3  -2  -3  -2  -3
Q                                       5   0   1   1   0  -3  -2  -2  -3  -1  -2
H                                           8   0  -1  -2  -3  -3  -3  -1   2  -2
R                                               5   2  -1  -3  -2  -3  -3  -2  -3
K                                                   5  -1  -3  -2  -2  -3  -2  -3
M                                                       5   1   2   1   0  -1  -1
I                                                           4   2   3   0  -1  -3
L                                                               4   1   0  -1  -2
V                                                                   4  -1  -1  -3
F                                                                       6   3   1
Y                                                                           7   2
W                                                                              11

From Henikoff 1996

Scoring Matrices
• Physical/Chemical similarities - comparing two sequences according to the properties of their residues may highlight
regions of structural similarity
• Identity matrices - by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features

Scoring Matrices (ctd)
• As the direct source of residue by residue comparison scores, the scoring matrix you choose will have a major impact on the alignment calculated
• The most commonly used will be one of the mutation matrices PAM, BLOSUM
• The matrix that performs best will be the matrix that reflects the evolutionary separation of the sequences being aligned

Probability and Likelihood
Some probabilities of observations depend on unknown parameters.
E.g. if O = SFFSFFF then under independence $\mathrm{pr}(O) = p^2(1-p)^5$.
We can calculate this for any observation O, so in a sense we have a 2-variable function pr(O, p) or pr(O | p) depending on O and p (0 < p < 1).
Likelihood: holds O fixed, varies p.
Maximum Likelihood estimate: the p which maximizes pr(O, p), O fixed, denoted $\hat{p}$. E.g. above, $\hat{p} = 2/7$.

Statistical motivation for alignment scores

Alignment:
    AGCTGATCA...
    AACCGGTTA...

Hypotheses:
    H = homologous (indep. sites, Jukes-Cantor)
    R = random (indep. sites, equal freq.)

$$\mathrm{pr}(\mathrm{data} \mid H) = \mathrm{pr}\big({A \atop A} \,\big|\, H\big) \times \mathrm{pr}\big({G \atop A} \,\big|\, H\big) \times \cdots = (1-p)^a\, p^d$$

where d = # disagreements, a = # agreements, and $p = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right)$.

$$\mathrm{pr}(\mathrm{data} \mid R) = \mathrm{pr}\big({A \atop A} \,\big|\, R\big) \times \mathrm{pr}\big({G \atop A} \,\big|\, R\big) \times \cdots = \left(\tfrac{1}{4}\right)^a \left(\tfrac{3}{4}\right)^d$$

$$\log\left\{\frac{\mathrm{pr}(\mathrm{data} \mid H)}{\mathrm{pr}(\mathrm{data} \mid R)}\right\} = a \times \log\frac{1-p}{1/4} + d \times \log\frac{p}{3/4}$$

$$\mathrm{score} = a \times s + d \times (-m), \qquad s = \log\frac{1-p}{1/4}, \quad -m = \log\frac{p}{3/4}$$

Since $p < \tfrac{3}{4}$, $\log\frac{p}{3/4} < 0$ and $\log\frac{1-p}{1/4} > 0$: $s > 0$ is the match score, $-m < 0$ the mismatch penalty.

Note that if $\alpha t \approx 0$, then $p \approx 6\alpha t$ and $1-p \approx 1$, so $s \approx \log 4$, while $-m \approx \log(8\alpha t)$ is large and negative: a big difference in the two scores.
Conversely, if $\alpha t$ is large, write $\varepsilon = e^{-8\alpha t}$. Then $p = \tfrac{3}{4}(1-\varepsilon)$, $\frac{p}{3/4} = 1-\varepsilon$, and $-m = \log(1-\varepsilon) \approx -\varepsilon$, while $1-p = \tfrac{1}{4}(1+3\varepsilon)$, $\frac{1-p}{1/4} = 1+3\varepsilon$, and so $s = \log(1+3\varepsilon) \approx 3\varepsilon$. Thus the scores are about 3:1.

We can do the same with any other Markov substitution matrix for molecular evolution. E.g. with a PAM or BLOSUM matrix of probabilities, let

$$\mathrm{data} = \begin{matrix} a_1 \ldots a_m \\ b_1 \ldots b_m \end{matrix}$$

be a gap free alignment of two a.a. sequence fragments. Then

$$\mathrm{pr}(\mathrm{data} \mid H) = \prod_{i=1}^{m} \pi_{a_i}\, p_{a_i b_i}(2t)$$

$$\mathrm{pr}(\mathrm{data} \mid R) = \prod_{i=1}^{m} \pi_{a_i}\, \pi_{b_i}$$

$$\log\left\{\frac{\mathrm{pr}(\mathrm{data} \mid H)}{\mathrm{pr}(\mathrm{data} \mid R)}\right\} = \sum_{i=1}^{m} \log\left\{\frac{p_{a_i b_i}(2t)}{\pi_{b_i}}\right\}$$

The elements of a log-odds score matrix are typically > 0 on the diagonal and < 0 off the diagonal, but not always.
Also the relative sizes of match and mismatch penalties increase as #PAMs ($\alpha t$) decreases. Thus PAM(120) is more stringent than PAM(250), while PAM(360) is less stringent than it. PAM(0) = the identity matrix is the toughest.
There are plenty of score matrices based on other principles.

Local alignment aligns only the most similar regions of two sequences
Why? Often distantly related proteins have only isolated regions (e.g. active sites) of similarity. The modular nature of proteins.
How? The dynamic programming algorithm we have seen needs only a minor modification to yield the best local alignment between two sequences. It is called the Smith-Waterman algorithm, and is named bestfit in GCG.

Similar Amino Acid Sequences: Chance or Common Ancestry?
Title of paper by Russell F. Doolittle, Science 214 (1981)1
The question arises every time an alignment is done without prior knowledge of homology.
The usual caveats:
• the scientific goal is not necessarily the same as the mathematical/statistical goal
• significance may not mean homology
• non-significance may not mean non-homology

Early use of statistics
• Generate random permutations of the sequence(s)
• Obtain the average (av) and standard deviation (SD) of the random similarity scores
• Compute z = (observed score - av)/SD
• Think normal (e.g. 4 is a very large z)
This approach is still used for global alignments, but is no longer seen as appropriate for local alignments, since the score is optimized, and random optimal scores do not follow the normal law.

More recent statistical developments:
Theory developed by Karlin and collaborators in 1990-4 and, independently, by Waterman and collaborators in 1988-94.
Incorporates the fact that the score has been optimized.
Immediately implemented in BLAST. Later appears in a similar form in FASTA and elsewhere.

The theory applies to the ensemble of random
• pairs of sequences, with fixed
• possibly different lengths,
• possibly different residue distributions
• and ungapped alignments (extensions to gapped alignments coming now)

The theoretical distribution of random similarity scores
• is universal in form (see diagram)
• with scale parameter depending on the two residue distributions, and the substitution scores used
• and location parameter depending on the above, plus the lengths of the two sequences

For m, n large, the optimal random score S has the extreme-value distribution with cdf

$$\exp\{-\exp\{-\lambda (s - u)\}\}$$

where $\lambda$ is the unique positive solution (in t) of

$$\sum_{i,j} p_i\, q_j \exp(s_{ij}\, t) = 1,$$

and

$$u = \frac{1}{\lambda} \log(Kmn),$$

and K is given by a series depending on the compositions $(p_i)$ and $(q_j)$ and the scoring matrix $(s_{ij})$.

Database searches: why do them?
• To find exact matches to sequences
• To find homologous sequences
• To infer structure and/or function of new protein sequences
• To locate genes in ESTs or genomic sequences
• To discover gene structure in DNA sequence
• And much more...

Database searching
Compares a query sequence to each sequence in a database (also called a library).
Because of the large size of sequence databases, comparisons are generally carried out using faster heuristic approximations to the exact Smith-Waterman local alignment algorithm, rather than the exact algorithm itself.
The two most common of these are FASTA and BLAST, where each of these names corresponds to a family of algorithms used in different contexts.

BLAST variants for different searches (a) (after S. Brenner, Trends Guide to Bioinformatics, 1998)

Program   Query     Database   Comparison      Common use
blastn    DNA       DNA        DNA level       Seek identical DNA sequences and splicing patterns
blastp    Protein   Protein    Protein level   Find homologous proteins
blastx    DNA       Protein    Protein level   Analyze new DNA to find genes and seek homologous proteins
tblastn   Protein   DNA        Protein level   Search for genes in unannotated DNA
tblastx   DNA       DNA        Protein level   Discover gene structure

(a) Similar variant programs are available for FASTA. Protein-level searches of DNA sequences are performed by comparing translations of all six reading frames.

cDNA, ORFs and ESTs
• Complementary DNA (cDNA)
  – Single stranded DNA complementary to an RNA, from which it is synthesized by reverse transcription.
• Open reading frames (ORFs)
  – Contain a series of triplets coding for amino acids without any termination codons (potentially translatable into proteins)
  – Many derived from sequencing of cDNAs
• Expressed sequence tags (ESTs)
  – Short (300-500 bp) single reads from mRNA (cDNA) sequencing survey projects.
  – A snapshot of what is expressed in a given tissue at a given developmental stage.
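The ORF definition and the six reading frames mentioned above can be illustrated with a short Python sketch that reports, for each frame, the longest run of triplets containing no termination codon. It assumes the standard stop codons TAA, TAG and TGA, an uppercase A/C/G/T alphabet, and, following the definition on the slide, does not require a start codon. The function names and the frame labels (+1 to +3, -1 to -3) are illustrative choices, not part of any BLAST or FASTA program.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def longest_open_stretch(dna, frame):
    """Longest run of codons without a stop codon in one frame (0, 1 or 2)."""
    best, current = [], []
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            # A termination codon closes the current open stretch.
            if len(current) > len(best):
                best = current
            current = []
        else:
            current.append(codon)
    if len(current) > len(best):
        best = current
    return "".join(best)


def six_frame_orfs(dna):
    """Longest stop-free stretch in each of the six reading frames."""
    dna = dna.upper()
    result = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            result[f"{strand}{frame + 1}"] = longest_open_stretch(seq, frame)
    return result


if __name__ == "__main__":
    # Hypothetical toy sequence, chosen only to contain a stop codon in frame +1.
    for label, orf in six_frame_orfs("ATGAAATAGCCC").items():
        print(label, orf)
```

A protein-level search such as blastx or tblastn would translate each frame into amino acids before comparison; translation is left out here to keep the sketch short.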