Lectures4&5-SequenceAlignment-Oheads

advertisement
Two biologically related proteins with similar sequences:
FlgA1 EAGNVKLKRGRLDTLPPRTVLDINQLVDAISLRDLSPDQPIQLTQFRQAWRVKAGQRVNVIASGD
++K+K+GRLDTLPP +L+ N
A+SLR ++ QP+
R+ W +KAGQ V V+AG+
FlgA2 TLQDIKMKQGRLDTLPPGALLEPNFAQGAVSLRQINAGQPLTRNMLRRLWIIKAGQDVQVLALGE
Also biologically related (& fold up into the same 3D protein structure):
FlgA1 EAGNVKLKRGRLDTLPPRTVLDINQLVDAISLRDLSPDQPIQLTQFRQAWRVKAGQRVNVIASGD
A
+
P
+L
I+ R L P + I
R+AW V+ G V V
FlgA3 LAALKQVTLIAGKHKPDAMATHAEELQGKIAKRTLLPGRYIPTAAIREAWLVEQGAAVQVFFIAG
But these are biologically unrelated (& fold up into unrelated structures):
FlgA1 AGNVKLKRGRLDTLPPRTVLDINQLVDAISLRDLSPDQPIQLTQFRQA-WRVKAGQRVNVIASGD
AG+V K G + + PRT ++
I+
P PI
+++A WRV A + V V+ GD
HvcPP AGHV--KNGTMRIVGPRTCSNVWNGTFPINATTTGPSIPIPAPNYKKALWRVSATEYVEVVRVGD
1
We can:
(1) establish some objective criteria for scoring the quality of sequence alignments,
(2) establish a random model for which scores are expected just by chance, then
(3) decide if our observed alignments are significantly better than we would expect from
our random model.
To align two sequences, we need:
1. Some way to decide which alignments are better than others. For this, we’ll invent a
way to give the alignments a “score” indicating their quality.
2. Some way to align the proteins so that they get the best possible score.
3. Then finally, some way to decide when a score is “good enough” for us to believe the
alignment is biologically significant.
2
Making an objective score for alignments
(1) Assume that mutations in these sequences are independent from position to position.
This is an assumption! It works for DNA & proteins, but is lousy for RNA.
(2) Assume that sequence changes come in 3 flavors: substitutions, insertions and
deletions.
Insertions and deletions = gaps.
Consider 2 possible models:
Random model (R): amino acids occur independently at some given frequencies
Probability of observing an alignment between x and y
= product of frequencies (q) for finding each amino acid:
Px, y | R    q xi  q y j
i
j
Match model (M): amino acids in the alignment arise from common ancestor with a
probability given by the joint probability pab.
Probability of the whole alignment =
= product of probabilities of seeing the individual amino acids aligned:
Px, y | M    p xi yi
i
3
For your alignment, which model is better, the random or the match?
Take the ratio of probabilities:
P  x, y | M 

P  x, y | R 
p
q q
xi yi

i
xi
i
yj
i
p xi yi
q xi q yi
j
= odds ratio
If > 1, model M is more probable
If < 1, random model R is more probable
Make an additive score
Take the logarithm of the odds ratio (= the log odds ratio):
S   s ( xi , y i )
i
Here, s(xi,yi) is the score for aligning one amino acid with another amino acid:
 p 
sa, b   log  ab 
 p a pb 
the inherent preference for 2 amino acids
(a and b) to be aligned.
4
Used 2 tricks to our additive score:
(1) the probability of a set of independent events occurring is equal to the product of their
individual probabilities
(2) it’s often easier to calculate the product of a set of numbers by instead calculating the
sum of the log of those numbers
Trick #3:
Take set of correct protein sequence alignments and measure the s(a,b)’s:
= an amino acid substitution matrix
For example, BLOSUM50:
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
A
5
-2
-1
-2
-1
-1
-1
0
-2
-1
-2
-1
-1
-3
-1
1
0
-3
-2
0
R
-2
7
-1
-2
-4
1
0
-3
0
-4
-3
3
-2
-3
-3
-1
-1
-3
-1
-3
N
-1
-1
7
2
-2
0
0
0
1
-3
-4
0
-2
-4
-2
1
0
-4
-2
-3
D
-2
-2
2
8
-4
0
2
-1
-1
-4
-4
-1
-4
-5
-1
0
-1
-5
-3
-4
C
-1
-4
-2
-4
13
-3
-3
-3
-3
-2
-2
-3
-2
-2
-4
-1
-1
-5
-3
-1
Q
-1
1
0
0
-3
7
2
-2
1
-3
-2
2
0
-4
-1
0
-1
-1
-1
-3
E
-1
0
0
2
-3
2
6
-3
0
-4
-3
1
-2
-3
-1
-1
-1
-3
-2
-3
G
0
-3
0
-1
-3
-2
-3
8
-2
-4
-4
-2
-3
-4
-2
0
-2
-3
-3
-4
H
-2
0
1
-1
-3
1
0
-2
10
-4
-3
0
-1
-1
-2
-1
-2
-3
2
-4
I
-1
-4
-3
-4
-2
-3
-4
-4
-4
5
2
-3
2
0
-3
-3
-1
-3
-1
4
L
-2
-3
-4
-4
-2
-2
-3
-4
-3
2
5
-3
3
1
-4
-3
-1
-2
-1
1
K
-1
3
0
-1
-3
2
1
-2
0
-3
-3
6
-2
-4
-1
0
-1
-3
-2
-3
5
M
-1
-2
-2
-4
-2
0
-2
-3
-1
2
3
-2
7
0
-3
-2
-1
-1
0
1
F
-3
-3
-4
-5
-2
-4
-3
-4
-1
0
1
-4
0
8
-4
-3
-2
1
4
-1
P
-1
-3
-2
-1
-4
-1
-1
-2
-2
-3
-4
-1
-3
-4
10
-1
-1
-4
-3
-3
S
1
-1
1
0
-1
0
-1
0
-1
-3
-3
0
-2
-3
-1
5
2
-4
-2
-2
T
0
-1
0
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-2
-1
2
5
-3
-2
0
W
-3
-3
-4
-5
-5
-1
-3
-3
-3
-3
-2
-3
-1
1
-4
-4
-3
15
2
-3
Y
-2
-1
-2
-3
-3
-1
-2
-3
2
-1
-1
-2
0
4
-3
-2
-2
2
8
-1
V
0
–3
-3
-4
-1
-3
-3
-4
–4
4
1
-3
1
-1
-3
-2
0
-3
-1
5
Can now score any alignment = sum of scores of individual pairs of amino acids.
From our earlier example:
Two biologically related proteins with similar sequences:
FlgA1 EAGNVKLKRGRLDTLPPRTVLDINQLVDAISLRDLSPDQPIQLTQFRQAWRVKAGQRVNVIASGD
++K+K+GRLDTLPP +L+ N
A+SLR ++ QP+
R+ W +KAGQ V V+AG+
FlgA2 TLQDIKMKQGRLDTLPPGALLEPNFAQGAVSLRQINAGQPLTRNMLRRLWIIKAGQDVQVLALGE
S(FlgA1,FlgA2) = – 1 – 2 – 2 + 2 + 4 + 6 + ... = 186
What about gaps?
ALKSADLKASDLKAS----------------DLKDRGWPPIWEMRVCP
ALKSADLKASDLKASDAGDSDFKELWDIRIIDLKDRGWPPIWEMRVCP
The simple solution:
gap “cost” = proportional to gap length (g):
(g) = -gd,
d = cost of making gap of 1 amino acid residue.
The slightly more sophisticated solution (the affine gap penalty):
gap cost = cost of initial gap d + lower cost e proportional to length:
(g) = -d – (g-1)e
On to an alignment algorithm...
6
How to align 2 protein sequences?
The brute force way:
Make all possible alignments, score them all, take the best
For two 100 amino acid proteins => 1059 possible alignments => very long Ph.D.
The elegant way:
Use algorithm called dynamic programming
Refers to algorithms that work recursively = define problems in terms of subproblems,
subproblems in terms of smaller subproblems,
& so on until you can solve the smallest subproblems.
For sequence alignments,
best alignment of sequence of length n
= best alignment of sequence of length n-1 plus alignment of nth symbol
best alignment of sequence of length n-1
= best alignment of sequence of length n-2 plus alignment of n-1th symbol
etc.
2 types :
global and local
Global alignments: force every amino acid of one protein to align with every amino acid
(or gap) of other protein, e.g.
ACGTTATGCATGACGTA
-C---ATGCAT----TLocal alignments = best matching subsequences (including gaps), e.g.
ATGCAT
ATGCAT
Let’s look at local alignments (usually make more sense for proteins)
7
General idea:
Construct a “path” matrix = indicates scores of different alignments between 2 sequences
Build this one aligned residue at a time
Each entry in matrix = score F(i,j) of best alignment so far between
sequence x, up to residue i, and y, up to residue j
When the matrix is done, find the highest score (= best alignment possible), then read it
out from the path matrix.
In this way, test all possible alignments between the 2 protein sequences!
Let’s see it work:
(1) Write out sequence of one protein across top, other across side
(2) Initialize path matrix:
set F(0,0) = 0
i=0
H
x
E
0
P <-- j=0
A
W
y H
E
A
E <-- j=m
A
G
A
W
G
H
E
i=n
E
The path matrix will be
filled from top left to bottom right
Here’s where the recursion comes in:
(3) Find each score (matrix entry) F(i,j) from the ones calculated so far, plus the best
move chosen from aligning the 2 amino acids, or inserting a gap in one or the other, e.g.:
Given F(i-1,j-1), F(i-1,j) and F(i,j-1), calculate F(i,j) as:
xi is aligned to yj
xi is aligned to a gap
yj is aligned to a gap
F(i-1,j-1) + s(xi,yj)
F(i-1,j) – d (e.g., linear gap cost d = 8)
F(i,j-1) – d
(4) Choose the possibility that gives the best score.
(5) If negative, set score = 0.
(6) Enter score in matrix.
(7) Record w/ an arrow which event (substitution, gap on x, gap on y) generated best score.
(8) Repeat from step (3).
8
One minor detail:
Need special rule for the edges, since lack scores outside of those.
So,
positions on top edge F(i,0) = 0
positions on leftmost boundary F(0,j) = 0
First iteration:
No decent alignments yet! Keep plugging....
Second iteration:
Terrible! Keep plugging (ok, let’s skip forward a few iterations)...
9
Some iterations later:
Finally! A positive score! A with A
We write 5 in the A,A cell, and draw arrow from the cell it came from,
which is the P,E cell (b/c we got this score from a substitution, not a gap)
Next few iterations:
At bottom left, we finally see an insertion
Keep going till the whole matrix is filled up...
10
The whole enchilada:
Now, find the best score = 28, and find the corresponding alignment
called “traceback”
Choose top scoring cell, then follow arrows until get to zero score. Arrows tell you what
“events” to include in alignment (substitution, gap on x, gap on y):
11
How to decode the arrows:
Vertical arrow = add a gap in x (sequence on top) and symbol in y (sequence on left side)
Horizontal arrow = gap in y and symbol in x
Diagonal arrow = symbol from x aligned to one of y.
Think of this as: The arrow points to the sequence that gets the gap.
The alignment “grows” from right to left:
From this example, final alignment =
AWGHE
AW-HE
Score = 28
Also many high-scoring “suboptimal” alignments besides this, e.g.
HEA
HEA
which scores 21.
Suboptimal alignments often = repeating sequences
12
When a score is “good enough”?
= statistically significant, not just random
Remember: algorithm always gives an alignment, even if it’s not biologically significant!
One way to tell =
Compare score to lots of random alignments
“Scramble” one sequence, find best alignment to other sequence
Repeat 1,000 times.
If your original score is better than all of these, it’s at least better than 1 in a 1000 random
trials.
Repeat 1,000,000 times.
If still better, might start to believe it.
In practice, this is slow, but can do a few times and mathematically model the distribution
of random scores, then calculate statistical significance of the real score.
Histogram of scores has predictable shape = extreme value distribution
(also called Gumbel distribution):
fit by following equation:
p(max score  X )  e kNe
  X  
,
N = # randomized trials
 = mean best score of randomized alignments
k and  = parameters that characterize shape of the particular distribution from
aligning x to y
In practice, fit k and  from scores of a few random alignments, then calculate probability
of real alignment score from equation. If very low, alignment is real, not random.
13
Download