Lecture_4 PAM and BLOSUM

Dayhoff Model:
Accepted Point Mutation (PAM)
Arthur W. Chou
Fall 2005
Tunghai University
Dr. Margaret Oakley Dayhoff (1925-1983)
The Nobel Prize in Physiology or Medicine 1962:
"for their discoveries concerning the molecular structure of nucleic acids and its significance for information transfer in living material"
Francis Harry Compton Crick, James Dewey Watson, Maurice Hugh Frederick Wilkins
Rosalind Elsie Franklin (1920 – 1958)
Dayhoff’s 34 protein superfamilies

Protein           PAMs per 100 million years
Ig kappa chain    37
Kappa casein      33
Lactalbumin       27
Hemoglobin α      12
Myoglobin         8.9
Insulin           4.4
Histone H4        0.10
Ubiquitin         0.00
Dayhoff’s numbers of “accepted point mutations”:
what amino acid substitutions occur in proteins?

       A    R    N    D    C    Q    E    G
      Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly
A
R      30
N     109   17
D     154    0  532
C      33   10    0    0
Q      93  120   50   76    0
E     266    0   94  831    0  422
G     579   10  156  162   10   30  112
H      21  103  226   43   10  243   23   10
Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases

fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
The relative mutability of amino acids

Asn   134        His   66
Ser   120        Arg   65
Asp   106        Lys   56
Glu   102        Pro   56
Ala   100        Gly   49
Thr    97        Tyr   41
Ile    96        Phe   41
Met    94        Leu   40
Gln    93        Cys   20
Val    74        Trp   18
Normalized frequencies of amino acids

Gly   8.9%       Arg   4.1%
Ala   8.7%       Asn   4.0%
Leu   8.5%       Phe   4.0%
Lys   8.1%       Gln   3.8%
Ser   7.0%       Ile   3.7%
Val   6.5%       His   3.4%
Thr   5.8%       Cys   3.3%
Pro   5.1%       Tyr   3.0%
Glu   5.0%       Met   1.5%
Asp   4.7%       Trp   1.0%

Color key on the original slide: blue = 6 codons (Leu, Ser, Arg); red = 1 codon (Met, Trp).
Dayhoff’s PAM1 mutation probability matrix
(first ten amino acids only; entries appear to be scaled by 10,000)

            A     R     N     D     C     Q     E     G     H     I
           Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly   His   Ile
A Ala     9867     2     9    10     3     8    17    21     2     6
R Arg        1  9913     1     0     1    10     0     0    10     3
N Asn        4     1  9822    36     0     4     6     6    21     3
D Asp        6     0    42  9859     0     6    53     6     4     1
C Cys        1     1     0     0  9973     0     0     0     1     1
Q Gln        3     9     4     5     0  9876    27     1    23     1
E Glu       10     0     7    56     0    35  9865     4     2     3
G Gly       21     1    12    11     1     3     7  9935     1     0
H His        1     8    18     3     1    20     1     0  9912     0
I Ile        2     2     3     1     2     1     2     0     0  9872
Estimating p(·,·) for proteins
Generate a large, diverse collection of accepted mutations. An accepted mutation is a substitution observed in an alignment of closely related protein sequences, for example the hemoglobin alpha chain in humans and other organisms (homologous proteins).
Let $p_a = n_a / n$, where $n_a$ is the number of occurrences of letter $a$ and $n$ is the total number of letters in the collection, so $n = \sum_a n_a$.
Mutation counts
Let
$f_{ab} = f_{ba}$ be the number of mutations $a \leftrightarrow b$,
$f_a = \sum_{b \neq a} f_{ab}$ be the total number of mutations that involve $a$,
$f = \sum_a f_a$ be the total number of amino acids involved in a mutation.
Note that $f$ is twice the number of mutations.
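The count definitions above can be sketched on a toy example. The three-letter alphabet and every number below are made up for illustration; they are not Dayhoff's data.

```python
# Toy illustration of p_a, f_ab, f_a and f. The alphabet and all counts
# are hypothetical and chosen only to keep the arithmetic easy to follow.
letters = ["A", "B", "C"]

# n_a: number of occurrences of each letter in the collection.
n = {"A": 500, "B": 300, "C": 200}
total_letters = sum(n.values())
p = {a: n[a] / total_letters for a in letters}   # p_a = n_a / n

# f_ab = f_ba: symmetric counts of observed a <-> b substitutions.
pair_counts = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 20}

def f_ab(a, b):
    """Look up the substitution count regardless of letter order."""
    return pair_counts.get((a, b), pair_counts.get((b, a), 0))

# f_a: total number of mutations that involve letter a.
f_a = {a: sum(f_ab(a, b) for b in letters if b != a) for a in letters}

# f: total number of letters involved in a mutation.
f = sum(f_a.values())

print(f_a)  # {'A': 40, 'B': 50, 'C': 30}
print(f)    # 120, twice the 60 substitutions counted
```

Each substitution touches two letters, which is why f comes out as twice the number of substitutions.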
PAM-1 matrices
Define $M_{ab}$ to be the probability matrix for switching between $a$ and $b$, built from the symmetric counts $f_{ab}$. We set $M_{aa} = 1 - m_a$, so that $m_a$ is the probability that $a$ is involved in a change. For $a \neq b$:
\[ M_{ab} = \Pr(a \to b) = \Pr(a \to b \mid a \text{ changed}) \, \Pr(a \text{ changed}) = \frac{f_{ab}}{f_a} \, m_a \]
We define $M_{ab}$ such that only 1% of amino acids change according to this matrix, i.e. 99% do not. Hence the name 1-Percent Accepted Mutation (PAM). In other words,
\[ \sum_a p_a M_{aa} = \sum_a p_a (1 - m_a) = 1 - \sum_a p_a m_a = 0.99 \]
PAM-1 matrices
We want $m_a$ to be proportional to the relative mutability of letter $a$ compared to other letters:
\[ m_a = \frac{f_a}{K \, p_a f} \]
where $K$ is a proportionality constant.
We select $K$ to satisfy the PAM-1 definition:
\[ \sum_a p_a m_a = \sum_a p_a \frac{f_a}{K \, p_a f} = \frac{1}{K f} \sum_a f_a = \frac{1}{K} = 0.01 \]
So $K = 100$ for PAM-1 matrices. Note that $K = 50$ yields 2% change, etc.
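As a minimal sketch of this construction, the function below builds a matrix from counts using the formulas on these slides and checks that the expected fraction of unchanged letters is 1 - 1/K. The name `build_pam`, the three-letter alphabet and the counts are hypothetical.

```python
# Sketch: build M from counts with M_ab = f_ab / (K p_a f), M_aa = 1 - m_a.
# The alphabet, frequencies and counts are made up for illustration.
letters = ["A", "B", "C"]
p = {"A": 0.5, "B": 0.3, "C": 0.2}
pair_counts = {("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 20}

def build_pam(p, pair_counts, letters, K=100):
    def f_ab(a, b):
        return pair_counts.get((a, b), pair_counts.get((b, a), 0))

    f_a = {a: sum(f_ab(a, b) for b in letters if b != a) for a in letters}
    f = sum(f_a.values())

    M = {}
    for a in letters:
        m_a = f_a[a] / (K * p[a] * f)          # probability that a changes
        for b in letters:
            if a == b:
                M[a, b] = 1.0 - m_a
            else:
                M[a, b] = f_ab(a, b) / (K * p[a] * f)
    return M

M = build_pam(p, pair_counts, letters, K=100)

# Expected fraction of letters left unchanged: sum_a p_a M_aa.
unchanged = sum(p[a] * M[a, a] for a in letters)
print(round(unchanged, 6))  # 0.99
```

Calling `build_pam(p, pair_counts, letters, K=50)` gives an unchanged fraction of 0.98, matching the remark that K = 50 yields 2% change.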
Evolutionary distance
The choice that 1% of amino acids change (and that K = 100) is quite arbitrary. It could fit a specific set of proteins whose evolutionary distance is such that indeed 1% of the letters have mutated.
This is a unit of evolutionary change, not of time, because evolution acts differently on distinct sequence types.
What is the substitution matrix for k units of evolutionary change?
Model of Evolution
We make some assumptions:
1. Each position changes independently of the rest
2. The probability of mutations is the same in each
position
3. Evolution does not “remember”
[Figure: a single sequence position changing state (T, A, C, G) at successive time points t, t+Δ, t+2Δ, t+3Δ, t+4Δ along a time axis.]
Model of Evolution
• How do we model such a process?
• This process is called a Markov chain.
A chain is defined by the transition probability
• $P(X_{t+\Delta} = b \mid X_t = a)$: the probability that the next state is $b$ given that the current state is $a$.
• We often describe these probabilities by a matrix:
$M[\Delta]_{ab} = P(X_{t+\Delta} = b \mid X_t = a)$
Multi-Step Changes
• Based on $M_{ab}$, we can compute the probabilities of changes over two time periods:
\[ P(X_{t+2\Delta} = b \mid X_t = a) = \sum_c P(X_{t+2\Delta} = b \mid X_{t+\Delta} = c, X_t = a) \, P(X_{t+\Delta} = c \mid X_t = a) \]
Using conditional independence (no memory):
\[ = \sum_c P(X_{t+2\Delta} = b \mid X_{t+\Delta} = c) \, P(X_{t+\Delta} = c \mid X_t = a) = \sum_c M_{ac} M_{cb} \]
• Thus $M[2\Delta] = M[\Delta] \, M[\Delta]$
• By induction: $M[n\Delta] = M[\Delta]^n$
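The two-step sum and the induction step can be checked numerically. This is a small sketch in plain Python; the 2×2 transition matrix is a made-up example, not a PAM matrix.

```python
# Sketch: the two-step probability sum_c M_ac M_cb equals (M·M)_ab,
# and n steps are given by the n-th matrix power.
def mat_mul(X, Y):
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(M, n):
    size = len(M)
    result = [[float(i == j) for j in range(size)] for i in range(size)]  # identity
    for _ in range(n):
        result = mat_mul(result, M)
    return result

# Hypothetical one-step transition matrix; each row sums to 1.
M = [[0.99, 0.01],
     [0.02, 0.98]]

# Explicit two-step probability of going from state 0 to state 1.
two_step = sum(M[0][c] * M[c][1] for c in range(2))

M2 = mat_pow(M, 2)
print(round(two_step, 6), round(M2[0][1], 6))  # 0.0197 0.0197
```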
A Markov Model (chain)
X1 → X2 → ... → Xn-1 → Xn
• Every variable $X_i$ has a domain. For example, suppose the domain is the set of letters {a, c, t, g}.
• Every variable is associated with a local probability table $P(X_i = x_i \mid X_{i-1} = x_{i-1})$, and $P(X_1 = x_1)$ for the first variable.
• The joint distribution is given by
\[ p(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \, P(X_2 = x_2 \mid X_1 = x_1) \cdots P(X_n = x_n \mid X_{n-1} = x_{n-1}) = \prod_{i=1}^{n} p(X_i = x_i \mid Pa_i = pa_i) \]
where $Pa_i$ are the parents of variable/node $X_i$, namely none or $X_{i-1}$.
In short, we write: $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid pa_i)$
Markov Model of Evolution Revisited
X1 →(M) X2 →(M) ... → Xn-1 →(M) Xn
In the evolution model we studied earlier we had
$P(x_1) = (p_a, p_c, p_g, p_t)$
which sum to 1 and are called the prior probabilities, and
$P(x_i \mid x_{i-1}) = M[\Delta]$
which is a stationary transition probability table, not depending on the index $i$.
The quantity we computed earlier from this model was the joint probability table of the first and last variables. The chain $X_1, \ldots, X_n$ makes $n-1$ transitions, so
\[ p(x_1, x_n) = p(x_1) \left( M[\Delta]^{\,n-1} \right)_{x_1 x_n} \]
Longer Term Changes
• Estimate $M[\Delta] = M$ (PAM-1 matrices)
• Use $M[n\Delta] = M^n$ (PAM-n matrices)
• Define $p(a, b) = p_a \, (M^n)_{ab}$
• Use this quantity to define the score for your application of interest.
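The quantity $p_a (M^n)_{ab}$ can be sketched with the same kind of toy matrix. The 2×2 matrix and the frequencies are hypothetical; the frequencies were chosen to be the stationary distribution of this toy matrix, which is an assumption of the sketch rather than something stated on the slide.

```python
# Sketch: joint probability p(a, b) = p_a * (M^n)_ab after n steps.
def mat_mul(X, Y):
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(M, n):
    size = len(M)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

# Hypothetical one-step matrix and letter frequencies (stationary for this M).
M = [[0.99, 0.01],
     [0.02, 0.98]]
p = [2 / 3, 1 / 3]

n = 250
Mn = mat_pow(M, n)
joint = [[p[a] * Mn[a][b] for b in range(2)] for a in range(2)]

# The joint table sums to 1, and for this toy chain it is symmetric:
# aligning a with b is as likely as aligning b with a.
print(sum(sum(row) for row in joint))
print(joint[0][1], joint[1][0])
```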
PAM250 mutation probability matrix
Top: original amino acid
Side: replacement amino acid
(values are percentages)

      A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
R     3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
N     4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
D     5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
C     2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
Q     3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
E     5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
G    12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
H     2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
I     3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
L     6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
K     6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
M     1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
F     2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
P     7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
S     9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
T     8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
W     0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
Y     1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
V     7   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5   7   2   4  17
PAM250 log odds scoring matrix

A    2
R   -2   6
N    0   0   2
D    0  -1   2   4
C   -2  -4  -4  -5  12
Q    0   1   1   2  -5   4
E    0  -1   1   3  -5   2   4
G    1  -3   0   1  -3  -1   0   5
H   -1   2   2   1  -3   3   1  -2   6
I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L   -2  -3  -3  -4  -6  -2  -3  -4  -2  -2   6
K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F   -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P    1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6
S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Why do we go from a mutation probability
matrix to a log odds matrix?
• We want a scoring matrix so that when we do a pairwise
alignment (or a BLAST search) we know what score to
assign to two aligned amino acid residues.
• Logarithms are easier to use for a scoring system. They
allow us to sum the scores of aligned residues (rather
than having to multiply them).
How do we go from a mutation probability
matrix to a log odds matrix?
• The cells in a log odds matrix consist of an “odds ratio”:
(the probability that an alignment is authentic) / (the probability that the alignment was random)
The score S for an alignment of residues a, b is given by:
\[ S(a, b) = 10 \log_{10} \left( M_{ab} / p_b \right) \]
where, for the PAM-1 matrix,
\[ M_{ab} = \frac{f_{ab}}{f_a} \, m_a = \frac{f_{ab}}{f_a} \cdot \frac{f_a}{K \, p_a f} = \frac{f_{ab}}{100 \, p_a f} \]
As an example, for tryptophan,
S( W, W ) = 10 log10 ( 0.55 / 0.01 ) = 17.4
What do the numbers mean
in a log odds matrix?
S( W, W ) = 10 log10 ( 0.55 / 0.010 ) = 17.4
A score of +17 for tryptophan means that this alignment
is 50 times more likely than a chance alignment of two
tryptophan residues.
S(W, W) = 17
Probability of replacement ( Mab / pb ) = x
Then
17 = 10 log10 x
1.7 = log10 x
10^1.7 = x ≈ 50
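The tryptophan example can be reproduced in a few lines. The function name `log_odds_score` is a hypothetical helper; the inputs 0.55 and 0.010 are the values quoted on the slide.

```python
# Sketch: log odds score S(a, b) = 10 * log10(M_ab / p_b).
import math

def log_odds_score(m_ab, p_b):
    """Score for aligning a with b, given replacement probability and background frequency."""
    return 10 * math.log10(m_ab / p_b)

# Values from the slide: M_WW = 0.55 in PAM250, normalized frequency of Trp = 0.010.
s_ww = log_odds_score(0.55, 0.010)
print(round(s_ww, 1))   # 17.4
print(round(s_ww))      # 17, the integer stored in the scoring matrix
```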
What do the numbers mean
in a log odds matrix?
A score of +2 indicates that the amino acid replacement
occurs 1.6 times as frequently as expected by chance.
A score of 0 is neutral.
A score of –10 indicates that the correspondence of two
amino acids in an alignment that accurately represents
homology (evolutionary descent) is one tenth as frequent
as the chance alignment of these amino acids.
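Going the other way, a score can be turned back into an odds ratio by inverting the formula. This is a minimal sketch; `score_to_odds` is a hypothetical helper name.

```python
# Sketch: invert S = 10 * log10(x) to recover the odds ratio x = 10 ** (S / 10).
def score_to_odds(score):
    return 10 ** (score / 10)

for score in (17, 2, 0, -10):
    print(score, round(score_to_odds(score), 2))
# 17  -> about 50 times more likely than chance
# 2   -> about 1.6 times
# 0   -> 1.0 (neutral)
# -10 -> 0.1 (one tenth as frequent as chance)
```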
PAM10 log odds scoring matrix

A     7
R   -10   9
N    -7  -9   9
D    -6 -17  -1   8
C   -10 -11 -17 -21  10
Q    -7  -4  -7  -6 -20   9
E    -5 -15  -5   0 -20  -1   8
G    -4 -13  -6  -6 -13 -10  -7   7
H   -11  -4  -2  -7 -10  -2  -9 -13  10
I    -8  -8  -8 -11  -9 -11  -8 -17 -13   9
L    -9 -12 -10 -19 -21  -8 -13 -14  -9  -4   7
K   -10  -2  -4  -8 -20  -6  -7 -10 -10  -9 -11   7
M    -8  -7 -15 -17 -20  -7 -10 -12 -17  -3  -2  -4  12
F   -12 -12 -12 -21 -19 -19 -20 -12  -9  -5  -5 -20  -7   9
P    -4  -7  -9 -12 -11  -6  -9 -10  -7 -12 -10 -10 -11 -13   8
S    -3  -6  -2  -7  -6  -8  -7  -4  -9 -10 -12  -7  -8  -9  -4   7
T    -3 -10  -5  -8 -11  -9  -9 -10 -11  -5 -10  -6  -7 -12  -7  -2   8
W   -20  -5 -11 -21 -22 -19 -23 -21 -10 -20  -9 -18 -19  -7 -20  -8 -19  13
Y   -11 -14  -7 -17  -7 -18 -11 -20  -6  -9 -10 -12 -17  -1 -20 -10  -9  -8  10
V    -5 -11 -12 -11  -9 -10 -10  -9  -9  -1  -5 -13  -4 -12  -9 -10  -6 -22 -10   8
      A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches and penalizes them severely. Using this matrix you can find almost no match.

hsrbp,  136 CRLLNLDGTC
btlact,   3 CLLLALALTC
            * ** *  **

A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%

hsrbp,   26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact,  21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                  *     ****  *        * *   *   *               **  *

hsrbp,   86 --CADMVGTFTDTEDPAKFKM
btlact,  80 GECAQKKIIAEKTKIPAVFKI
              **        *  ** **
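The identity and gap figures reported for the PAM250 alignment can be recomputed from the aligned strings. This sketch only counts identical columns and gap columns; it does not recompute the alignment score of 77.0, which would need the full matrix and gap penalties.

```python
# Sketch: percent identity and gap frequency of the hsrbp / btlact alignment.
# The two strings are the aligned rows from the slide, with both blocks joined.
hsrbp = ("RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV"
         "--CADMVGTFTDTEDPAKFKM")
btlact = ("QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN"
          "GECAQKKIIAEKTKIPAVFKI")

columns = list(zip(hsrbp, btlact))
overlap = len(columns)

# An identity is a column where both rows hold the same residue (not a gap).
identities = sum(1 for x, y in columns if x == y and x != "-")
# A gap column is one where either row holds the gap character.
gaps = sum(1 for x, y in columns if x == "-" or y == "-")

print(overlap)                                 # 81
print(round(100 * identities / overlap, 1))    # 24.7
print(round(100 * gaps / overlap, 1))          # 3.7
```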
Comments regarding PAM
• Historically, researchers have used PAM-250 (the only one published in the original paper).
• The original PAM matrices were based on a small number of proteins (circa 1978). Later versions use many more examples.
• PAM used to be the most popular scoring rule, but there are some problems with PAM matrices.
Degrees of freedom in PAM definition
With K = 100 the 1-PAM matrix is given by
\[ M_{ab} = \frac{f_{ab}}{f_a} \, m_a = \frac{f_{ab}}{f_a} \cdot \frac{f_a}{K \, p_a f} = \frac{f_{ab}}{100 \, p_a f} \]
With K = 50 the basic matrix is different, namely:
\[ M'_{ab} = \frac{f_{ab}}{50 \, p_a f} \]
Thus we have two different ways to estimate the matrix M[4]:
Use the 1-PAM matrix to the fourth power: $M[4] = M[1]^4$
Or
Use the K = 50 matrix to the second power: $M[4] = M[2]^2$
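A small numerical sketch shows that the two routes give close but not identical estimates. The helper names, the three-letter alphabet and the counts are hypothetical; only the formula $M_{ab} = f_{ab} / (K p_a f)$ comes from the slide.

```python
# Sketch: compare (K=100 matrix)^4 with (K=50 matrix)^2 on toy counts.
def mat_mul(X, Y):
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(M, n):
    size = len(M)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

def build_pam(p, f, K):
    """f is a symmetric count matrix with a zero diagonal; p holds letter frequencies."""
    size = len(p)
    total = sum(sum(row) for row in f)          # f = sum_a f_a
    M = [[f[a][b] / (K * p[a] * total) for b in range(size)] for a in range(size)]
    for a in range(size):
        M[a][a] = 1.0 - sum(M[a][b] for b in range(size) if b != a)
    return M

# Made-up frequencies and substitution counts.
p = [0.5, 0.3, 0.2]
f = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]

via_k100 = mat_pow(build_pam(p, f, 100), 4)
via_k50 = mat_pow(build_pam(p, f, 50), 2)

max_diff = max(abs(via_k100[a][b] - via_k50[a][b])
               for a in range(3) for b in range(3))
print(max_diff)  # small but nonzero: the two estimates of M[4] disagree
```

The disagreement arises because the K = 50 matrix treats two units of change as a single step, so it ignores the possibility of two successive substitutions at the same position within that step.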
Problems in building distance matrices
• How do we find pairs of aligned sequences?
• How far is the ancestor?
  – earlier divergence → low sequence similarity
  – later divergence → high sequence similarity
  E.g., M[250] is known not to reflect long-period changes well.
• Does one letter mutate to the other, or are they both mutations of a third letter?
BLOSUM Outline
• Idea: use aligned ungapped regions of protein families. These are assumed to have a common ancestor. Similar ideas but better statistics and modeling. It uses 2000 conserved blocks from 500 families.
• Procedure:
– Cluster together sequences in a family whenever more than L% identical residues are shared, for BLOSUM-L.
– Count number of substitutions across different clusters (in the same family).
– Estimate frequencies using the counts.
• Practice: BLOSUM50 and BLOSUM62 are widely used.
Considered the state of the art nowadays.
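The counting and estimation steps can be sketched on a tiny ungapped block. This sketch skips the clustering step entirely and treats every sequence as its own cluster. The log odds formula in half-bit units, 2·log2(q_ij / e_ij), is the one usually attributed to the BLOSUM construction and is an assumption here, since the slide does not state it. The two-letter block is made up.

```python
# Sketch: pair counting in an ungapped block, then log odds from the counts.
import math
from collections import Counter
from itertools import combinations

# Hypothetical block: four aligned sequences, three columns, no gaps.
block = ["AAC", "AAC", "ACC", "AAC"]

# Count every unordered pair of residues within each column.
pair_counts = Counter()
for column in zip(*block):
    for x, y in combinations(column, 2):
        pair_counts[tuple(sorted((x, y)))] += 1

total_pairs = sum(pair_counts.values())
q = {pair: count / total_pairs for pair, count in pair_counts.items()}

# Background frequency of each letter implied by the pair frequencies.
letters = sorted({x for pair in q for x in pair})
p = {a: sum(freq if pair == (a, a) else freq / 2
            for pair, freq in q.items() if a in pair)
     for a in letters}

def score(a, b):
    """Log odds of the observed pair frequency against chance, in half-bits."""
    pair = tuple(sorted((a, b)))
    expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    return 2 * math.log2(q[pair] / expected)

print(total_pairs)        # 18 pairs: 6 per column, 3 columns
print(score("A", "A"))    # positive: identities are over-represented
print(score("A", "C"))    # negative: this substitution is rarer than chance
```

Published matrices round such scores to integers; the unrounded values are kept here so the sign of each score is easy to see.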
BLOSUM Matrices
BLOSUM matrices are based on local alignments.
BLOSUM stands for blocks substitution matrix.
BLOSUM62 is a matrix calculated from comparisons of
sequences with less than 62% identical sites.
BLOSUM Matrices
All BLOSUM matrices are based on observed alignments;
they are not extrapolated from comparisons of
closely related proteins.
The BLOCKS database contains thousands of groups of
multiple sequence alignments.
BLOSUM62 is the default matrix in BLAST 2.0.
Though it is tailored for comparisons of moderately distant
proteins, it performs well in detecting closer relationships.
A search for distant relatives may be more sensitive
with a different matrix.
BLOSUM62 scoring matrix

A    4
R   -1   5
N   -2   0   6
D   -2  -2   1   6
C    0  -3  -3  -3   9
Q   -1   1   0   0  -3   5
E   -1   0   0   2  -4   2   5
G    0  -2   0  -1  -3  -2  -2   6
H   -2   0   1  -1  -3   0   0  -2   8
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K   -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
M   -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
BLOSUM Matrices
[Figure: bar diagrams of percent amino acid identity, with axis marks at 100, 62 and 30. First panel: BLOSUM62. Second panel: BLOSUM80, BLOSUM62 and BLOSUM30 side by side, annotated "Rat versus mouse RBP" and "Rat versus bacterial lipocalin".]