BNFO 136 Sequence alignment

Usman Roshan
Pairwise alignment
• X: ACA, Y: GACAT
• Match = 8, mismatch = 2, gap = -5
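• Four possible alignments of X (top row) and Y (bottom row), scored column by column:

ACA--
GACAT
Score = 2+2+2-5-5 = -4

-ACA-
GACAT
Score = -5+8+8+8-5 = 14

--ACA
GACAT
Score = -5-5+2+2+2 = -4

ACA----
G--ACAT
Score = 2-5-5-5-5-5-5 = -28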
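The column-by-column scoring used on this slide can be sketched in a few lines of Python. The function name score_alignment and its keyword arguments are illustrative choices, not part of the slides; the default values are the match, mismatch, and gap scores given above.

```python
def score_alignment(row_x, row_y, match=8, mismatch=2, gap=-5):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap        # one of the two rows has a gap here
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


for row_x, row_y in [("-ACA-", "GACAT"), ("--ACA", "GACAT"), ("ACA----", "G--ACAT")]:
    print(row_x, row_y, score_alignment(row_x, row_y))
```

Run as written, this prints 14, -4, and -28 for the three alignments.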
Traceback
• We can compute an alignment of DNA
(or protein or RNA) sequences X and Y
with a traceback matrix T.
• Sequence X is aligned along the rows
and Y along the columns.
• Each entry of the matrix T contains D, L,
or U, specifying a diagonal, left, or upward move.
Traceback
• X: ACA, Y: TACAG
         T  A  C  A  G
         L  L  L  L  L
A        U  D  U  U  L
C        U  U  D  U  D
A        U  L  L  D  L
Traceback code
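# seq1 runs along the columns of T, seq2 along the rows
aligned_seq1 = ""
aligned_seq2 = ""
i = len(seq2)
j = len(seq1)
while i != 0 or j != 0:
    if T[i][j] == "L":
        # left: take a character of seq1, put a gap in seq2
        aligned_seq2 = "-" + aligned_seq2
        aligned_seq1 = seq1[j-1] + aligned_seq1
        j = j - 1
    elif T[i][j] == "U":
        # up: take a character of seq2, put a gap in seq1
        aligned_seq1 = "-" + aligned_seq1
        aligned_seq2 = seq2[i-1] + aligned_seq2
        i = i - 1
    else:
        # diagonal: take one character from each sequence
        aligned_seq1 = seq1[j-1] + aligned_seq1
        aligned_seq2 = seq2[i-1] + aligned_seq2
        i = i - 1
        j = j - 1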
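As a usage sketch, the loop can be wrapped in a function and run on the traceback matrix for X: ACA, Y: TACAG. The function name traceback is an illustrative choice. The matrix literal assumes that the five entries listed per row on the slide fill columns 1 to 5, and that column 0 (not listed) holds U; cell (0, 0) is never read.

```python
def traceback(T, seq1, seq2):
    """Follow T from the bottom-right cell back to (0, 0).

    seq1 runs along the columns and seq2 along the rows.
    Returns the two aligned rows as strings.
    """
    aligned_seq1 = ""
    aligned_seq2 = ""
    i, j = len(seq2), len(seq1)
    while i != 0 or j != 0:
        move = T[i][j]
        if move == "L":      # gap in seq2
            aligned_seq1 = seq1[j - 1] + aligned_seq1
            aligned_seq2 = "-" + aligned_seq2
            j -= 1
        elif move == "U":    # gap in seq1
            aligned_seq1 = "-" + aligned_seq1
            aligned_seq2 = seq2[i - 1] + aligned_seq2
            i -= 1
        elif move == "D":    # align one character from each sequence
            aligned_seq1 = seq1[j - 1] + aligned_seq1
            aligned_seq2 = seq2[i - 1] + aligned_seq2
            i -= 1
            j -= 1
        else:
            raise ValueError(f"unexpected entry {move!r} at ({i}, {j})")
    return aligned_seq1, aligned_seq2


# Rows: empty prefix, A, C, A. Columns: empty prefix, T, A, C, A, G.
# Column 0 is an assumption (all "U"); the other entries come from the slide.
T = [
    " LLLLL",
    "UUDUUL",
    "UUUDUD",
    "UULLDL",
]
aligned_y, aligned_x = traceback(T, seq1="TACAG", seq2="ACA")
print(aligned_x)
print(aligned_y)
```

Under these assumptions the path is L, D, D, D, L read from the bottom-right corner, which prints -ACA- over TACAG.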
Optimal alignment
• An alignment can be specified by the
traceback matrix.
• How do we determine the traceback for the
highest scoring alignment?
• Needleman-Wunsch algorithm for global
alignment
– First proposed in 1970
– Widely used in genomics/bioinformatics
– Dynamic programming algorithm
Needleman-Wunsch (NW)
• Input:
– X = x1x2…xn, Y=y1y2…ym
– (X is seq2 and Y is seq1)
• Notation:
– X1..i = x1x2…xi
– Score(X1..i,Y1..j) = Optimal alignment score of
sequences X1..i and Y1..j.
• Suppose we know the optimal alignment
scores of
– X1..i-1 and Y1..j-1
– X1..i and Y1..j-1
– X1..i-1 and Y1..j
Needleman-Wunsch (NW)
• Then the optimal alignment score of
X1..i and Y1..j is the maximum of
– Score(X1..i-1,Y1..j-1) + match/mismatch score for xi and yj
– Score(X1..i,Y1..j-1) + gap (yj is aligned to a gap)
– Score(X1..i-1,Y1..j) + gap (xi is aligned to a gap)
• We build on this observation to compute
Score(X1..n,Y1..m)
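The recurrence can be written almost literally as a memoized recursive function. This is a minimal sketch, not the slides' own code: the name optimal_score is illustrative, and the base cases (a prefix aligned against an empty prefix costs one gap per character) are an assumption that this slide does not state.

```python
from functools import lru_cache


def optimal_score(x, y, match, mismatch, gap):
    """Score(X1..n, Y1..m) computed directly from the recurrence."""

    @lru_cache(maxsize=None)
    def score(i, j):
        # Assumed base cases: align a prefix against nothing using gaps only.
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        s = match if x[i - 1] == y[j - 1] else mismatch
        return max(
            score(i - 1, j - 1) + s,  # xi aligned with yj
            score(i, j - 1) + gap,    # yj aligned to a gap
            score(i - 1, j) + gap,    # xi aligned to a gap
        )

    return score(len(x), len(y))


print(optimal_score("ACA", "GACAT", match=8, mismatch=2, gap=-5))
```

With the scores from the pairwise alignment slide this prints 14. Memoization keeps the number of distinct subproblems at (n+1)(m+1), the same cells a table-based version would fill.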
Needleman-Wunsch
• Define V to be a two dimensional matrix
with len(X)+1 rows and len(Y)+1
columns
• Let V[i][j] be the score of the optimal
alignment of X1..i and Y1..j.
• Let m be the match score, mm the
mismatch score, and g the gap score
(this m is unrelated to the length m of Y).
NW pseudocode
Initialization:
V[0][0] = 0
for i = 1 to length of seq2 { V[i][0] = i*g; T[i][0] = 'U'; }
for j = 1 to length of seq1 { V[0][j] = j*g; T[0][j] = 'L'; }
Recurrence:
for i = 1 to length of seq2 {
  for j = 1 to length of seq1 {
    s = m if seq2[i-1] equals seq1[j-1], otherwise mm
    V[i][j] = max {
      V[i-1][j-1] + s,
      V[i-1][j] + g,
      V[i][j-1] + g
    }
    if (maximum is V[i-1][j-1] + s) then T[i][j] = 'D'
    else if (maximum is V[i-1][j] + g) then T[i][j] = 'U'
    else T[i][j] = 'L'
  }
}
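The pseudocode translates to Python roughly as follows. This is a sketch under stated assumptions: the function name needleman_wunsch_fill is illustrative, V[0][0] is taken to be 0, and the border cells of T are set to U (first column) and L (first row) so that a traceback can always reach the top-left corner. Ties are broken in the order D, U, L.

```python
def needleman_wunsch_fill(seq1, seq2, m, mm, g):
    """Fill the score matrix V and traceback matrix T.

    seq2 runs along the rows and seq1 along the columns.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    V = [[0] * cols for _ in range(rows)]
    T = [[""] * cols for _ in range(rows)]

    # Initialization: a prefix aligned against nothing is all gaps.
    for i in range(1, rows):
        V[i][0] = i * g
        T[i][0] = "U"
    for j in range(1, cols):
        V[0][j] = j * g
        T[0][j] = "L"

    # Recurrence: each cell depends on its diagonal, upper, and left neighbor.
    for i in range(1, rows):
        for j in range(1, cols):
            s = m if seq2[i - 1] == seq1[j - 1] else mm
            diag = V[i - 1][j - 1] + s
            up = V[i - 1][j] + g
            left = V[i][j - 1] + g
            V[i][j] = max(diag, up, left)
            if V[i][j] == diag:
                T[i][j] = "D"
            elif V[i][j] == up:
                T[i][j] = "U"
            else:
                T[i][j] = "L"
    return V, T


V, T = needleman_wunsch_fill("GACAT", "ACA", m=5, mm=-4, g=-20)
for row in V:
    print(row)
```

The bottom-right entry V[len(seq2)][len(seq1)] is the optimal global alignment score; for these inputs it is -25.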
Example
Input:
seq2: ACA
seq1: GACAT
m = 5
mm = -4
gap = -20
V
               G     A     C     A     T
         0   -20   -40   -60   -80  -100
A      -20    -4   -15   -35   -55   -75
C      -40   -24    -8   -10   -30   -50
A      -60   -44   -19   -12    -5   -25
T
seq2 is along the rows
and seq1 is along the columns
         G  A  C  A  T
         L  L  L  L  L
A     U  D  D  L  L  L
C     U  U  D  D  L  L
A     U  U  D  D  D  L
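The two matrices can be cross-checked against the recurrence. The sketch below is not part of the slides; the helper name optimal_moves is illustrative. It recomputes the three candidate values for every interior cell, confirms that V holds their maximum and that T names a move attaining it, and lists the cells where two moves tie. In tied cells either move is valid, so a program that checks the diagonal first may store D where this T shows U or L.

```python
seq2, seq1 = "ACA", "GACAT"   # rows, columns
m, mm, g = 5, -4, -20

# Values copied from the example's V and T matrices.
V = [
    [0, -20, -40, -60, -80, -100],
    [-20, -4, -15, -35, -55, -75],
    [-40, -24, -8, -10, -30, -50],
    [-60, -44, -19, -12, -5, -25],
]
T = [" LLLLL", "UDDLLL", "UUDDLL", "UUDDDL"]


def optimal_moves(i, j):
    """Return the best candidate value for cell (i, j) and every move reaching it."""
    s = m if seq2[i - 1] == seq1[j - 1] else mm
    candidates = {
        "D": V[i - 1][j - 1] + s,
        "U": V[i - 1][j] + g,
        "L": V[i][j - 1] + g,
    }
    best = max(candidates.values())
    return best, [move for move, value in candidates.items() if value == best]


ties = []
for i in range(1, len(seq2) + 1):
    for j in range(1, len(seq1) + 1):
        best, moves = optimal_moves(i, j)
        assert best == V[i][j], (i, j)      # V agrees with the recurrence
        assert T[i][j] in moves, (i, j)     # T names an optimal move
        if len(moves) > 1:
            ties.append((i, j, moves))

print(ties)
```

For this example the tied cells are (1, 4), (2, 1), and (3, 1). None of them lies on the traceback path from the bottom-right corner, so the resulting alignment is the same under either tie-breaking rule.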