Dayhoff Model: Accepted Point Mutation (PAM)
Arthur W. Chou
Fall 2005
Tunghai University

Dr. Margaret Oakley Dayhoff (1925-1983)

The Nobel Prize in Physiology or Medicine 1962
"for their discoveries concerning the molecular structure of nucleic acids and its significance for information transfer in living material"
Francis Harry Compton Crick, James Dewey Watson, Maurice Hugh Frederick Wilkins
Rosalind Elsie Franklin (1920-1958)

Dayhoff's 34 protein superfamilies

Protein            PAMs per 100 million years
Ig kappa chain     37
Kappa casein       33
Lactalbumin        27
Hemoglobin α       12
Myoglobin          8.9
Insulin            4.4
Histone H4         0.10
Ubiquitin          0.00

Dayhoff's numbers of "accepted point mutations": what amino acid substitutions occur in proteins?

         A     R     N     D     C     Q     E     G
       Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly
R       30
N      109    17
D      154     0   532
C       33    10     0     0
Q       93   120    50    76     0
E      266     0    94   831     0   422
G      579    10   156   162    10    30   112
H       21   103   226    43    10   243    23    10

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases

fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

The relative mutability of amino acids

Asn  134      His  66
Ser  120      Arg  65
Asp  106      Lys  56
Glu  102      Pro  56
Ala  100      Gly  49
Thr   97      Tyr  41
Ile   96      Phe  41
Met   94      Leu  40
Gln   93      Cys  20
Val   74      Trp  18

Normalized frequencies of amino acids

Gly  8.9%     Arg  4.1%
Ala  8.7%     Asn  4.0%
Leu  8.5%     Phe  4.0%
Lys  8.1%     Gln  3.8%
Ser  7.0%     Ile  3.7%
Val  6.5%     His  3.4%
Thr  5.8%     Cys  3.3%
Pro  5.1%     Tyr  3.0%
Glu  5.0%     Met  1.5%
Asp  4.7%     Trp  1.0%

Color key on the slide: blue = 6 codons (Leu, Ser, Arg); red = 1 codon (Met, Trp).

Dayhoff's PAM1 mutation probability matrix (entries multiplied by 10,000)

          A     R     N     D     C     Q     E     G     H     I
        Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly   His   Ile
A      9867     2     9    10     3     8    17    21     2     6
R         1  9913     1     0     1    10     0     0    10     3
N         4     1  9822    36     0     4     6     6    21     3
D         6     0    42  9859     0     6    53     6     4     1
C         1     1     0     0  9973     0     0     0     1     1
Q         3     9     4     5     0  9876    27     1    23     1
E        10     0     7    56     0    35  9865     4     2     3
G        21     1    12    11     1     3     7  9935     1     0
H         1     8    18     3     1    20     1     0  9912     0
I         2     2     3     1     2     1     2     0     0  9872

Estimating p(·,·) for proteins
Generate a large, diverse collection of accepted mutations.
An accepted mutation is a substitution inferred from an alignment of closely related protein sequences, for example the hemoglobin alpha chain in humans and other organisms (homologous proteins).
Let $p_a = n_a / n$, where $n_a$ is the number of occurrences of letter a and n is the total number of letters in the collection, so $n = \sum_a n_a$.

Mutation counts
Let $f_{ab} = f_{ba}$ be the number of mutations $a \leftrightarrow b$,
$f_a = \sum_{b \neq a} f_{ab}$ be the total number of mutations that involve a,
$f = \sum_a f_a$ be the total number of amino acids involved in a mutation.
Note that f is twice the number of mutations.

PAM-1 matrices
Define $M_{ab}$ to be the probability that a is replaced by b. The counts $f_{ab}$ are symmetric; $M_{ab}$ itself is symmetric only when all $p_a$ are equal, but $p_a M_{ab}$ is always symmetric.
We set $M_{aa} = 1 - m_a$, so that $m_a$ is the probability that a is involved in a change.
$$M_{ab} = \Pr(a \to b) = \Pr(a \to b \mid a \text{ changed}) \, \Pr(a \text{ changed}) = \frac{f_{ab}}{f_a} \, m_a$$
We define $M_{ab}$ such that only 1% of amino acids change according to this matrix, or 99% don't. Hence the name, 1-Percent Accepted Mutation (PAM). In other words,
$$\sum_a p_a M_{aa} = \sum_a p_a (1 - m_a) = 1 - \sum_a p_a m_a = 0.99$$

PAM-1 matrices
We wish that $m_a$ will be proportional to the relative mutability of letter a compared to other letters:
$$m_a = \frac{f_a}{K \, p_a \, f}$$
where K is a proportionality constant. We select K to satisfy the PAM-1 definition:
$$\sum_a p_a m_a = \sum_a p_a \frac{f_a}{K \, p_a \, f} = \frac{1}{K f} \sum_a f_a = \frac{1}{K} = 0.01$$
So K = 100 for PAM-1 matrices. Note that K = 50 yields 2% change, etc.
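The PAM-1 construction above can be checked numerically. The following is a minimal sketch, not Dayhoff's actual procedure or data: the three-letter alphabet, the frequencies `p`, the `counts` table and the helper names `pam_matrix` and `unchanged_fraction` are all hypothetical, chosen only to show that K = 100 leaves 99% of letters unchanged and K = 50 leaves 98%.

```python
def pam_matrix(letters, p, counts, K=100.0):
    """Return M[a][b] = f_ab / (K * p_a * f) off the diagonal and M[a][a] = 1 - m_a."""
    def f_ab(a, b):
        # Mutation counts are symmetric: f_ab = f_ba.
        return counts.get((a, b), counts.get((b, a), 0))

    # f_a: number of mutations that involve a; f: their sum (twice the number of mutations).
    f_a = {a: sum(f_ab(a, b) for b in letters if b != a) for a in letters}
    f = sum(f_a.values())

    M = {}
    for a in letters:
        m_a = f_a[a] / (K * p[a] * f)  # probability that a is involved in a change
        M[a] = {b: (1.0 - m_a) if a == b else f_ab(a, b) / (K * p[a] * f)
                for b in letters}
    return M


def unchanged_fraction(letters, p, M):
    """sum_a p_a * M_aa, the expected fraction of letters that do not change."""
    return sum(p[a] * M[a][a] for a in letters)


# Hypothetical toy data (not from Dayhoff's tables).
letters = ["A", "B", "C"]
p = {"A": 0.5, "B": 0.3, "C": 0.2}
counts = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 20}

M1 = pam_matrix(letters, p, counts, K=100.0)
M2 = pam_matrix(letters, p, counts, K=50.0)
print(round(unchanged_fraction(letters, p, M1), 6))  # 0.99
print(round(unchanged_fraction(letters, p, M2), 6))  # 0.98
```

Each row of the resulting matrix sums to 1, and $p_a M_{ab} = p_b M_{ba}$ holds because the counts are symmetric.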
Evolutionary distance
The choice that 1% of amino acids change (and that K = 100) is quite arbitrary. It could fit a specific set of proteins whose evolutionary distance is such that indeed 1% of the letters have mutated.
This is a unit of evolutionary change, not of time, because evolution acts differently on distinct sequence types.
What is the substitution matrix for k units of evolutionary time?

Model of Evolution
We make some assumptions:
1. Each position changes independently of the rest.
2. The probability of mutations is the same in each position.
3. Evolution does not "remember".
[Figure: the letter at one sequence position at times t, t+Δ, t+2Δ, t+3Δ, t+4Δ]

Model of Evolution
How do we model such a process? This process is called a Markov chain.
A chain is defined by the transition probability $P(X_{t+\Delta} = b \mid X_t = a)$, the probability that the next state is b given that the current state is a.
We often describe these probabilities by a matrix:
$$M[\Delta]_{ab} = P(X_{t+\Delta} = b \mid X_t = a)$$

Multi-Step Changes
Based on $M_{ab}$, we can compute the probabilities of changes over two time periods:
$$P(X_{t+2\Delta} = b \mid X_t = a) = \sum_c P(X_{t+2\Delta} = b \mid X_{t+\Delta} = c, X_t = a) \, P(X_{t+\Delta} = c \mid X_t = a)$$
Using conditional independence (no memory):
$$P(X_{t+2\Delta} = b \mid X_t = a) = \sum_c P(X_{t+2\Delta} = b \mid X_{t+\Delta} = c) \, P(X_{t+\Delta} = c \mid X_t = a) = \sum_c M[\Delta]_{ac} \, M[\Delta]_{cb}$$
Thus $M[2\Delta] = M[\Delta] \, M[\Delta]$.
By induction: $M[n\Delta] = M[\Delta]^n$.

A Markov Model (chain)
[Diagram: X_1 → X_2 → ... → X_{n-1} → X_n]
• Every variable $X_i$ has a domain. For example, suppose the domain is the set of letters {a, c, t, g}.
• Every variable is associated with a local probability table $P(X_i = x_i \mid X_{i-1} = x_{i-1})$ and $P(X_1 = x_1)$.
• The joint distribution is given by
$$p(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \, P(X_2 = x_2 \mid X_1 = x_1) \cdots P(X_n = x_n \mid X_{n-1} = x_{n-1}) = \prod_{i=1}^{n} p(X_i = x_i \mid Pa_i = pa_i)$$
where $Pa_i$ are the parents of variable/node $X_i$, namely, none or $X_{i-1}$.
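The two-step derivation and the chain's joint distribution can be tied together in a few lines. This is a minimal sketch under stated assumptions: the two-state chain, its `prior` and its transition matrix `M` are hypothetical numbers, and `mat_mul`, `mat_pow` and `chain_probability` are illustrative helper names. It sums the joint probability over the intermediate state c and compares the result with the corresponding entry of the squared matrix.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """M raised to the n-th power (n >= 0) by repeated multiplication."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


def chain_probability(prior, M, states):
    """p(x_1, ..., x_n) = P(x_1) * product of M[x_{i-1}][x_i] along the chain."""
    prob = prior[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= M[prev][cur]
    return prob


# Hypothetical two-state chain (states 0 and 1), chosen only for illustration.
prior = [0.6, 0.4]
M = [[0.9, 0.1],
     [0.2, 0.8]]

a, b = 0, 1
# Sum over the intermediate state c, then condition on the start state a.
two_step = sum(chain_probability(prior, M, [a, c, b]) for c in range(2)) / prior[a]
M2 = mat_pow(M, 2)
print(round(two_step, 6), round(M2[a][b], 6))  # 0.17 0.17
```

The same `mat_pow` call with a larger exponent gives the n-step transition table, under the no-memory assumption stated above.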
In short, we write:
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid pa_i)$$

Markov Model of Evolution Revisited
[Diagram: X_1 -M-> X_2 -M-> ... -> X_{n-1} -> X_n]
In the evolution model we studied earlier we had $P(x_1) = (p_a, p_c, p_g, p_t)$, which sum to 1 and are called the prior probabilities, and $P(x_i \mid x_{i-1}) = M[\Delta]$, which is a stationary transition probability table, not depending on the index i.
The quantity we computed earlier from this model was the joint probability table
$$p(x_1, x_n) = p(x_1) \, M[\Delta]^n_{x_1 x_n}$$

Longer Term Changes
Estimate $M[\Delta] = M$ (PAM-1 matrices).
Use $M[n\Delta] = M^n$ (PAM-n matrices).
Define $p^n(a, b) = p_a M^n_{ab}$.
Use this quantity to define the score for your application of interest.

PAM250 mutation probability matrix (entries in percent)
Top: original amino acid. Side: replacement amino acid.

       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A     13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
R      3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
N      4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
D      5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
C      2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
Q      3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
E      5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
G     12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
H      2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
I      3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
L      6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
K      6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
M      1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
F      2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
P      7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
S      9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
T      8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
W      0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
Y      1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
V      7   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5   7   2   4  17

PAM250 log odds scoring matrix

A      2
R     -2   6
N      0   0   2
D      0  -1   2   4
C     -2  -4  -4  -5  12
Q      0   1   1   2  -5   4
E      0  -1   1   3  -5   2   4
G      1  -3   0   1  -3  -1   0   5
H     -1   2   2   1  -3   3   1  -2   6
I     -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L     -2  -3  -3  -4  -6  -2  -3  -4  -2  -2   6
K     -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M     -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F     -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P      1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6
S      1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T      1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W     -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y     -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V      0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Why do we go from a mutation probability matrix to a log odds matrix?
• We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues.
• Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

How do we go from a mutation probability matrix to a log odds matrix?
• The cells in a log odds matrix consist of an "odds ratio":
(the probability that an alignment is authentic) / (the probability that the alignment was random)
The score S for an alignment of residues a, b is given by:
$$S(a, b) = 10 \log_{10} \left( M_{ab} / p_b \right)$$
where, for the PAM-1 matrix,
$$M_{ab} = \frac{f_{ab}}{f_a} \, m_a = \frac{f_{ab}}{f_a} \cdot \frac{f_a}{K \, p_a \, f} = \frac{f_{ab}}{100 \, p_a \, f}$$
As an example, for tryptophan, S(W, W) = 10 log10(0.55 / 0.01) = 17.4

What do the numbers mean in a log odds matrix?
S(W, W) = 10 log10(0.55 / 0.010) = 17.4
A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two tryptophan residues.
S(W, W) = 17
Probability of replacement: $M_{ab} / p_b = x$
Then
17 = 10 log10 x
1.7 = log10 x
$10^{1.7} = x \approx 50$

What do the numbers mean in a log odds matrix?
A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance.
A score of 0 is neutral.
A score of -10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.
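The conversions between probabilities, odds ratios and scores can be reproduced directly. The sketch below is illustrative only: the helper names are made up, and the two-letter matrix `M` with frequencies `p` is a hypothetical PAM-1-like example, not real amino acid data. It recomputes the tryptophan example from the text and shows how a score matrix for PAM-n could be derived from $M^n$.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """M multiplied by itself so that the result is M^n (n >= 1)."""
    result = M
    for _ in range(n - 1):
        result = mat_mul(result, M)
    return result


def log_odds(m_ab, p_b):
    """S(a, b) = 10 * log10(M_ab / p_b)."""
    return 10.0 * math.log10(m_ab / p_b)


def odds_from_score(score):
    """Invert the score: odds ratio x such that score = 10 * log10(x)."""
    return 10.0 ** (score / 10.0)


def score_matrix(M, p, n):
    """Rounded log-odds scores computed from the n-th power of M."""
    Mn = mat_pow(M, n)
    size = len(p)
    return [[round(log_odds(Mn[a][b], p[b])) for b in range(size)]
            for a in range(size)]


# Worked numbers from the text.
w_score = log_odds(0.55, 0.010)
print(round(w_score, 1))               # 17.4
print(round(odds_from_score(17)))      # 50
print(round(odds_from_score(2), 1))    # 1.6
print(round(odds_from_score(-10), 3))  # 0.1

# Hypothetical two-letter example in which 1% of letters change per step.
p = [2.0 / 3.0, 1.0 / 3.0]
M = [[0.9925, 0.0075],
     [0.0150, 0.9850]]
S1 = score_matrix(M, p, 1)
S250 = score_matrix(M, p, 250)
print(S1)
print(S250)
```

For this toy chain the scores at n = 1 punish mismatches heavily, while at n = 250 the matrix is close to the background frequencies and the scores shrink toward 0.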
PAM10 log odds scoring matrix

A      7
R    -10   9
N     -7  -9   9
D     -6 -17  -1   8
C    -10 -11 -17 -21  10
Q     -7  -4  -7  -6 -20   9
E     -5 -15  -5   0 -20  -1   8
G     -4 -13  -6  -6 -13 -10  -7   7
H    -11  -4  -2  -7 -10  -2  -9 -13  10
I     -8  -8  -8 -11  -9 -11  -8 -17 -13   9
L     -9 -12 -10 -19 -21  -8 -13 -14  -9  -4   7
K    -10  -2  -4  -8 -20  -6  -7 -10 -10  -9 -11   7
M     -8  -7 -15 -17 -20  -7 -10 -12 -17  -3  -2  -4  12
F    -12 -12 -12 -21 -19 -19 -20 -12  -9  -5  -5 -20  -7   9
P     -4  -7  -9 -12 -11  -6  -9 -10  -7 -12 -10 -10 -11 -13   8
S     -3  -6  -2  -7  -6  -8  -7  -4  -9 -10 -12  -7  -8  -9  -4   7
T     -3 -10  -5  -8 -11  -9  -9 -10 -11  -5 -10  -6  -7 -12  -7  -2   8
W    -20  -5 -11 -21 -22 -19 -23 -21 -10 -20  -9 -18 -19  -7 -20  -8 -19  13
Y    -11 -14  -7 -17  -7 -18 -11 -20  -6  -9 -10 -12 -17  -1 -20 -10  -9  -8  10
V     -5 -11 -12 -11  -9 -10 -10  -9  -9  -1  -5 -13  -4 -12  -9 -10  -6 -22 -10   8
       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.

hsrbp,  136  CRLLNLDGTC
btlact,   3  CLLLALALTC
             * ** *  **

A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%

hsrbp,  26  RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact, 21  QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                  *     ****  *        * *   *   *               **  *

hsrbp,  86  --CADMVGTFTDTEDPAKFKM
btlact, 80  GECAQKKIIAEKTKIPAVFKI
              **        *  ** **

Comments regarding PAM
Historically, researchers used PAM-250 (the only one published in the original paper).
The original PAM matrices were based on a small number of proteins (circa 1978). Later versions use many more examples.
PAM used to be the most popular scoring rule, but there are some problems with PAM matrices.
Degrees of freedom in PAM definition
With K = 100 the 1-PAM matrix is given by
$$M_{ab} = \frac{f_{ab}}{f_a} \, m_a = \frac{f_{ab}}{f_a} \cdot \frac{f_a}{K \, p_a \, f} = \frac{f_{ab}}{100 \, p_a \, f}$$
With K = 50 the basic matrix is different, namely:
$$M'_{ab} = \frac{f_{ab}}{50 \, p_a \, f}$$
Thus we have two different ways to estimate the matrix $M[4\Delta]$:
Use the 1-PAM matrix to the fourth power: $M[4\Delta] = M[\Delta]^4$
Or
Use the K = 50 matrix to the second power: $M[4\Delta] = M[2\Delta]^2$

Problems in building distance matrices
How do we find pairs of aligned sequences?
How far is the ancestor?
earlier divergence → low sequence similarity
later divergence → high sequence similarity
E.g., $M[250]$ is known not to reflect long-period changes well.
Does one letter mutate to the other, or are they both mutations of a third letter?

BLOSUM Outline
• Idea: use aligned ungapped regions of protein families. These are assumed to have a common ancestor. Similar ideas but better statistics and modeling. It uses 2000 conserved blocks from 500 families.
• Procedure:
– Cluster together sequences in a family whenever more than L% identical residues are shared, for BLOSUM-L.
– Count the number of substitutions across different clusters (in the same family).
– Estimate frequencies using the counts.
• Practice: BLOSUM50 and BLOSUM62 are widely used. Considered the state of the art nowadays.

BLOSUM Matrices
BLOSUM matrices are based on local alignments.
BLOSUM stands for blocks substitution matrix.
BLOSUM62 is a matrix calculated from comparisons of sequences with less than 62% identical sites.

BLOSUM Matrices
All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
The BLOCKS database contains thousands of groups of multiple sequence alignments.
BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.
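The BLOSUM procedure outlined above (cluster, count across clusters, estimate frequencies) might be sketched as follows. This is a simplified illustration rather than the published BLOSUM pipeline: the single-linkage clustering, the 1/(cluster size) weighting of each sequence, the toy block `block` and all function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def percent_identity(s, t):
    """Percentage of identical positions in two equal-length ungapped sequences."""
    return 100.0 * sum(x == y for x, y in zip(s, t)) / len(s)


def cluster(seqs, L):
    """Single-linkage clusters of sequence indices sharing more than L% identity."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if percent_identity(seqs[i], seqs[j]) > L:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(seqs)):
        groups[find(i)].append(i)
    return list(groups.values())


def pair_frequencies(seqs, L):
    """Relative frequencies of aligned residue pairs, counted across clusters only."""
    clusters = cluster(seqs, L)
    label = {i: k for k, members in enumerate(clusters) for i in members}
    # Assumption: each sequence is down-weighted by the size of its cluster.
    weight = {i: 1.0 / len(members) for members in clusters for i in members}

    counts = defaultdict(float)
    for i, j in combinations(range(len(seqs)), 2):
        if label[i] == label[j]:
            continue  # pairs inside one cluster are not counted
        for x, y in zip(seqs[i], seqs[j]):
            counts[tuple(sorted((x, y)))] += weight[i] * weight[j]

    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()} if total else {}


# Hypothetical ungapped block of three sequences.
block = ["AAAA", "AAAC", "CCCC"]
clusters = cluster(block, 62)
freqs = pair_frequencies(block, 62)
print(clusters)  # [[0, 1], [2]]
print(freqs)     # {('A', 'C'): 0.875, ('C', 'C'): 0.125}
```

Raising L keeps more sequences in separate clusters, so more closely related pairs contribute to the counts; lowering L merges them and leaves only the more divergent comparisons.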
BLOSUM62 scoring matrix

A      4
R     -1   5
N     -2   0   6
D     -2  -2   1   6
C      0  -3  -3  -3   9
Q     -1   1   0   0  -3   5
E     -1   0   0   2  -4   2   5
G      0  -2   0  -1  -3  -2  -2   6
H     -2   0   1  -1  -3   0   0  -2   8
I     -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L     -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K     -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
M     -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F     -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P     -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S      1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T      0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W     -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y     -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V      0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

BLOSUM Matrices
[Figure: percent amino acid identity axis (100, 62, 30) for BLOSUM80, BLOSUM62 and BLOSUM30; examples marked: rat versus mouse RBP, rat versus bacterial lipocalin]