CS 5263 Bioinformatics CS 4593 AT:Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment

advertisement
CS 5263 Bioinformatics
CS 4593 AT:Bioinformatics
Lectures 3-6: Pair-wise Sequence
Alignment
Outline
• Part I: Algorithms
–
–
–
–
–
Biological problem
Intro to dynamic programming
Global sequence alignment
Local sequence alignment
More efficient algorithms
• Part II: Biological issues
– Model gaps more accurately
– Alignment statistics
• Part III: BLAST
Evolution at the DNA level
C
…ACGGTGCAGTCACCA…
…ACGTTGC-GTCCACCA…
DNA evolutionary events (sequence edits):
Mutation, deletion, insertion
Sequence conservation implies function
next generation
OK
OK
OK
X
X
Still OK?
Why sequence alignment?
• Conserved regions are more likely to be
functional
– Can be used for finding genes, regulatory elements,
etc.
• Similar sequences often have similar origin and
function
– Can be used to predict functions for new genes /
proteins
• Sequence alignment is one of the most widely
used computational tools in biology
Global Sequence Alignment
S
T
S’
T’
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
An alignment of two strings S, T is a pair of strings S’,
T’ (with spaces) s.t.
(1) |S’| = |T’|, and (|S| = “length of S”)
(2) removing all spaces in S’, T’ leaves S, T
What is a good alignment?
Alignment:
The “best” way to match the letters of one
sequence with those of the other
How do we define “best”?
S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• The score of aligning (characters or
spaces) x & y is σ (x,y).
• Score of an alignment:
• An optimal alignment: one with max score
Scoring Function
• Sequence edits:
– Mutations
– Insertions
– Deletions
Scoring Function:
Match:
+m
Mismatch:
-s
Gap (indel):
-d
AGGCCTC
AGGACTC
AGGGCCTC
AGG-CTC
~~~AAC~~~
~~~A-A~~~
• Match = 2, mismatch = -1, gap = -1
• Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
More complex scoring function
• Substitution matrix
– Similarity score of matching two letters a, b
should reflect the probability of a, b derived
from the same ancestor
– It is usually defined by log likelihood ratio
– Active research area. Especially for proteins.
– Commonly used: PAM, BLOSUM
An example substitution matrix
A
C
G
T
A
C
G
T
3
-2
-1
-2
3
-2
-1
3
-2
3
How to find an optimal alignment?
• A naïve algorithm:
for all subseqs A of S, B of T s.t. |A| = |B| do
align A[i] with B[i], 1 ≤i ≤|A|
align all other chars to spaces
compute its value
S = abcd A = cd
T = wxyz B = xz
retain the max
-abc-d a-bc-d
end
w--xyz -w-xyz
output the retained alignment
Analysis
• Assume |S| = |T| = n
• Cost of evaluating one alignment: ≥n
• How many alignments are there:
– pick n chars of S,T together
– say k of them are in S
– match these k to the k unpicked chars of T
• Total time:
• E.g., for n = 20, time is > 240 >1012 operations
Intro to Dynamic Programming
Dynamic programming
• What is dynamic programming?
– A method for solving problems exhibiting the
properties of overlapping subproblems and optimal
substructure
– Key idea: tabulating sub-problem solutions rather
than re-computing them repeatedly
• Two simple examples:
– Computing Fibonacci numbers
– Find the special shortest path in a grid
Example 1: Fibonacci numbers
• 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …
F(0) = 1;
F(1) = 1;
F(n) = F(n-1) + F(n-2)
• How to compute F(n)?
A recursive algorithm
function fib(n)
if (n == 0 or n == 1) return 1;
else return fib(n-1) + fib(n-2);
F(9)
F(8)
F(7)
F(6)
F(7)
F(6)
F(6)
F(5)
F(5) F(5) F(4) F(5) F(4) F(4) F(3)
F(9)
F(8)
F(7)
F(7)
F(6)
F(6)
F(5)
n
F(6)
F(5) F(5) F(4) F(5) F(4) F(4) F(3)
• Time complexity:
– Between 2n/2 and 2n
– O(1.62n), i.e. exponential
• Why recursive Fib algorithm is inefficient?
– Overlapping subproblems
n/2
An iterative algorithm
function fib(n)
F[0] = 1;
F[1] = 1;
for i = 2 to n
F[i] = F[i-1] + F[i-2];
Return F[n];
1
1
Time complexity:
Time: O(n), space: O(n)
2
3
5
8
13 21 34 55
Example 2: shortest path in a grid
S
m
n
G
Each edge has a length (cost). We need to get to G from S. Can only
move right or down. Aim: find a path with the minimum total length
Optimal substructures
• Naïve algorithm: enumerate all possible
paths and compare costs
– Exponential number of paths
• Key observation:
– If a path P(S, G) is the shortest from S to G,
any of its sub-path P(S,x), where x is on
P(S,G), is the shortest from S to x
Proof
• Proof by contradiction
– If the path between P(S,x) is
not the shortest, i.e., P’(S,x) <
P(S,x)
– Construct a new path P’(S,G)
= P’(S,x) + P(x, G)
– P’(S,G) < P(S,G) => P(S,G) is
not the shortest
– Contradiction
– Therefore, P(S, x) is the
shortest
S
x
G
Recursive solution
(0,0)
• Index each intersection by
two indices, (i, j)
• Let F(i, j) be the total
length of the shortest path
from (0, 0) to (i, j).
Therefore, F(m, n) is the
shortest path we wanted.
• To compute F(m, n), we
need to compute both
F(m-1, n) and F(m, n-1)
m
n
(m, n)
F(m-1, n) + length((m-1, n), (m, n))
F(m, n) = min
F(m, n-1) + length((m, n-1), (m, n))
Recursive solution
F(i-1, j) + length((i-1, j), (i, j))
F(i, j) = min
F(i, j-1) + length((i, j-1), (i, j))
(0,0)
(i-1, j)
(i, j-1)
(i, j)
m
n
• But: if we use recursive
call, many subpaths will
be recomputed for many
times
• Strategy: pre-compute F
values starting from the
upper-left corner. Fill in
row by row (what other
order will also do?)
(m, n)
Dynamic programming illustration
S
0
3
5
3
9
3
5
3
2
6
2
6
13
2
4
17
9
4
5
11
17
11
6
14
2
17
13
13
3
17
2
18
15
3
3
16
4
3
1
3
15
3
7
3
2
2
9
3
6
1
8
13
3
3
2
3
1
3
3
7
12
20
3
2
20
G
F(i-1, j) + length(i-1, j, i, j)
F(i, j) = min
F(i, j-1) + length(i, j-1, i, j)
Trackback
0
3
5
3
9
3
5
3
2
6
2
6
13
2
4
17
9
4
5
11
17
11
6
14
2
17
13
13
3
17
2
18
15
3
3
16
4
3
1
3
15
3
7
3
2
2
9
3
6
1
8
13
3
3
2
3
1
3
3
7
12
20
3
2
20
Elements of dynamic programming
• Optimal sub-structures
– Optimal solutions to the original problem
contains optimal solutions to sub-problems
• Overlapping sub-problems
– Some sub-problems appear in many solutions
• Memorization and reuse
– Carefully choose the order that sub-problems
are solved
Dynamic Programming for
sequence alignment
Suppose we wish to align
x1……xM
y1……yN
Let F(i,j) = optimal score of aligning
x1……xi
y1……yj
Scoring Function:
Match:
+m
Mismatch:
-s
Gap (indel):
-d
Elements of dynamic programming
• Optimal sub-structures
– Optimal solutions to the original problem
contains optimal solutions to sub-problems
• Overlapping sub-problems
– Some sub-problems appear in many solutions
• Memorization and reuse
– Carefully choose the order that sub-problems
are solved
Optimal substructure
1
2
i
M
...
x:
1
y:
2
j
N
...
• If x[i] is aligned to y[j] in the optimal alignment
between x[1..M] and y[1..N], then
• The alignment between x[1..i] and y[1..j] is also
optimal
• Easy to prove by contradiction
Recursive solution
Notice three possible cases:
1.
~~~~~~~ xM
~~~~~~~ yN
2.
F(M,N) = F(M-1, N-1) +
-s, if not
xM aligns to a gap
~~~~~~~ xM
~~~~~~~ 
3.
m, if xM = yN
xM aligns to yN
max
F(M,N) = F(M-1, N) - d
yN aligns to a gap
~~~~~~~ 
~~~~~~~ yN
F(M,N) = F(M, N-1) - d
Recursive solution
• Generalize:
F(i,j) = max
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
(Xi,Yj) = m if Xi = Yj, and –s otherwise
• Boundary conditions:
– F(0, 0) = 0.
-jd: y[1..j] aligned to gaps.
– F(0, j) = ?
-id: x[1..i] aligned to gaps.
– F(i, 0) = ?
What order to fill?
F(0,0)
F(i-1, j-1)1
F(i-1, j)
1
i
F(i, j-1)
3
2
F(i, j)
F(M,N)
j
What order to fill?
F(0,0)
F(M,N)
Example
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
1
A
2
T
3
A
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
A
G
3
T
4
A
m= 1
s =1
d =1
Example
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
0
0
j=0
1
A
-1
2
T
-2
3
A
-3
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
4
A
G
T
A
-1
-2
-3
-4
m= 1
s =1
d =1
Example
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
A
G
T
A
0
-1
-2
-3
-4
1
0
-1
-2
1
A
-1
2
T
-2
3
A
-3
3
4
m= 1
s =1
d =1
Example
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
m= 1
s =1
d =1
Example
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
m= 1
s =1
d =1
4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Optimal Alignment:
F(4,3) = 2
Example
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
1
A
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
-1
1
0
-1
-2
m= 1
s =1
d =1
4
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Optimal Alignment:
F(4,3) = 2
This only tells us the
best score
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
m= 1
s =1
d =1
4
1
A
-1
1
0
-1
-2
A
2
T
-2
0
0
1
0
A
3
A
-3
-1
-1
0
2
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
m= 1
s =1
d =1
4
1
A
-1
1
0
-1
-2
T A
2
T
-2
0
0
1
0
T A
3
A
-3
-1
-1
0
2
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
m= 1
s =1
d =1
4
1
A
-1
1
0
-1
-2
G T A
2
T
-2
0
0
1
0
-
3
A
-3
-1
-1
0
2
T A
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
m= 1
s =1
d =1
4
1
A
-1
1
0
-1
-2
A G T A
2
T
-2
0
0
1
0
A -
3
A
-3
-1
-1
0
2
T A
Trace-back
x = AGTA
y = ATA
F(i,j)
F(i,j) = max
i =
j=0
1
A
0
F(i-1, j-1) + (Xi,Yj)
F(i-1, j) – d
F(i, j-1) – d
1
2
3
A
G
T
A
0
-1
-2
-3
-4
-1
1
0
-1
-2
m= 1
s =1
d =1
4
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Optimal Alignment:
F(4,3) = 2
AGTA
ATA
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
0
0
j=0
1
A
-1
2
T
-2
3
A
-3
1
2
3
4
A
G
T
A
-1
-2
-3
-4
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
A
G
T
A
0
-1
-2
-3
-4
1
0
-1
-2
1
A
-1
2
T
-2
3
A
-3
3
4
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
1
A
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Using trace-back pointers
x = AGTA
y = ATA
F(i,j)
m= 1
s =1
d =1
i =
j=0
1
A
0
1
2
3
4
A
G
T
A
0
-1
-2
-3
-4
-1
1
0
-1
-2
2
T
-2
0
0
1
0
3
A
-3
-1
-1
0
2
Optimal Alignment:
F(4,3) = 2
AGTA
ATA
The Needleman-Wunsch
Algorithm
1.
Initialization.
a.
b.
c.
2.
F(0, 0)
F(0, j)
F(i, 0)
= 0
=-jd
=-id
Main Iteration. Filling in scores
a.
For each i = 1……M
For each j = 1……N
F(i, j)
Ptr(i,j)
3.
= max
=
F(i-1,j) – d
F(i, j-1) – d
F(i-1, j-1) + σ(xi, yj)
UP,
LEFT
DIAG
[case 1]
[case 2]
[case 3]
if [case 1]
if [case 2]
if [case 3]
Termination. F(M, N) is the optimal score, and from Ptr(M, N) can
trace back optimal alignment
Complexity
• Time:
O(NM)
• Space:
O(NM)
• Linear-space algorithms do exist (with the
same time complexity)
Equivalent graph problem
S1 =
A
G
T
A
(0,0)
S2 =
A
1
1
: a gap in the 2nd sequence
: a gap in the 1st sequence
1
T
1
A
: match / mismatch
1
Value on vertical/horizontal line: -d
Value on diagonal: m or -s
(3,4)
• Number of steps: length of the alignment
• Path length: alignment score
• Optimal alignment: find the longest path from (0, 0) to (3, 4)
• General longest path problem cannot be found with DP. Longest path on this
graph can be found by DP since no cycle is possible.
Question
• If we change the scoring scheme, will the
optimal alignment be changed?
– Old: Match = 1, mismatch = gap = -1
– New: match = 2, mismatch = gap = 0
– New: Match = 2, mismatch = gap = -2?
Question
• What kind of alignment is represented by
these paths?
A
A
A
A
A
B
B
B
B
B
C
C
C
C
C
A-
A--
--A
-A-
-A
BC
-BC
BC-
B-C
BC
Alternating gaps are impossible if –s > -2d
A variant of the basic algorithm
Scoring scheme: m = s = d: 1
Seq1: CAGCA-CTTGGATTCTCGG
|| |:|||
Seq2: ---CAGCGTGG--------
Seq1: CAGCACTTGGATTCTCGG
||||
| |
||
Seq2: CAGC-----G-T----GG
Score = -7
Score = -2
The first alignment may be biologically more realistic in
some cases (e.g. if we know s2 is a subsequence of s1)
A variant of the basic algorithm
• Maybe it is OK to have an unlimited # of
gaps in the beginning and end:
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
• Then, we don’t want to penalize gaps in
the ends
The Overlap Detection variant
yN ……………………………… y1
x1 ……………………………… xM
Changes:
1.
Initialization
For all i, j,
F(i, 0) = 0
F(0, j) = 0
2.
Termination
maxi F(i, N)
FOPT = max
maxj F(M, j)
Different types of overlaps
x
x
y
y
The local alignment problem
Given two strings
X = x1……xM,
Y = y1……yN
Find substrings x’, y’ whose similarity (optimal
global alignment value) is maximum
e.g. X = abcxdex
Y = xxxcde
x
y
X’ = cxde
Y’ = c-de
Why local alignment
• Conserved regions may be a small part of the
whole
– Global alignment might miss them if flanking “junk”
outweighs similar regions
• Genes are shuffled between genomes
A
B
C
B
D
D
A
C
Naïve algorithm
for all substrings X’ of X and Y’ of Y
Align X’ & Y’ via dynamic programming
Retain pair with max value
end ;
Output the retained pair
• Time: O(n2) choices for A, O(m2) for B,
O(nm) for DP, so O(n3m3 ) total.
Reminder
• The overlap detection algorithm
– We do not give penalty to gaps at either end
Free gap
Free gap
The local alignment idea
• Do not penalize the unaligned regions (gaps or
mismatches)
• The alignment can start anywhere and ends anywhere
• Strategy: whenever we get to some low similarity region
(negative score), we restart a new alignment
– By resetting alignment score to zero
The Smith-Waterman algorithm
Initialization: F(0, j) = F(i, 0) = 0
0
F(i – 1, j) – d
Iteration: F(i, j) = max
F(i, j – 1) – d
F(i – 1, j – 1) + (xi, yj)
The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment…
FOPT = maxi,j F(i, j)
2. If we want all local alignments scoring > t
For all i, j find F(i, j) > t, and trace back
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
x
0
x
0
c
0
d
0
e
0
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
x
0
0
0
x
0
0
0
c
0
0
0
d
0
0
0
e
0
0
0
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
x
0
0
0
0
x
0
0
0
0
c
0
0
0
2
d
0
0
0
1
e
0
0
0
0
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
x
0
0
0
0
2
x
0
0
0
0
2
c
0
0
0
2
1
d
0
0
0
1
0
e
0
0
0
0
0
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
x
0
0
0
0
2
1
x
0
0
0
0
2
1
c
0
0
0
2
1
1
d
0
0
0
1
0
3
e
0
0
0
0
0
2
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
0
x
0
0
0
0
2
1
0
x
0
0
0
0
2
1
0
c
0
0
0
2
1
1
0
d
0
0
0
1
0
3
2
e
0
0
0
0
0
2
5
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
c
0
0
0
2
1
1
0
1
d
0
0
0
1
1
3
2
1
e
0
0
0
0
0
2
5
4
Trace back
Match: 2
Mismatch: -1
Gap: -1
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
c
0
0
0
2
1
1
0
1
d
0
0
0
1
1
3
2
1
e
0
0
0
0
0
2
5
4
Trace back
Match: 2
Mismatch: -1
Gap: -1
cxde
| ||
c-de
x-de
| ||
xcde
a
b
c
x
d
e
x
0
0
0
0
0
0
0
0
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
x
0
0
0
0
2
1
0
2
c
0
0
0
2
1
1
0
1
d
0
0
0
1
1
3
2
1
e
0
0
0
0
0
2
5
4
• No negative values in local alignment DP
array
• Optimal local alignment will never have a
gap on either end
• Local alignment: “Smith-Waterman”
• Global alignment: “Needleman-Wunsch”
Analysis
• Time:
– O(MN) for finding the best alignment
– Time to report all alignments depends on the
number of sub-opt alignments
• Memory:
– O(MN)
– O(M+N) possible
More efficient alignment algorithms
• Given two sequences of length M, N
• Time: O(MN)
– Ok, but still slow for long sequences
• Space: O(MN)
– bad
– 1Mb seq x 1Mb seq = 1TB memory
• Can we do better?
Bounded alignment
Good alignment should
appear near the
diagonal
Bounded Dynamic
Programming
If we know that x and y are very similar
Assumption:
Then,
xi
|
yj
# gaps(x, y) < k
implies
|i–j|<k
yN ………………………… y1
Bounded Dynamic
Programming
x1 ………………………… xM Initialization:
F(i,0), F(0,j) undefined for i, j > k
Iteration:
For i = 1…M
For j = max(1, i – k)…min(N, i+k)
F(i – 1, j – 1)+ (xi, yj)
F(i, j) = max F(i, j – 1) – d, if j > i – k
F(i – 1, j) – d, if j < i + k
k
Termination:
same
Analysis
• Time: O(kM) << O(MN)
• Space: O(kM) with some tricks
=>
M
2k
M
2k
• Given two sequences of length M, N
• Time: O(MN)
– ok
• Space: O(MN)
– bad
– 1mb seq x 1mb seq = 1TB memory
• Can we do better?
Linear space algorithm
• If all we need is the alignment score but
not the alignment, easy!
We only need to
keep two rows
(You only need one row,
with a little trick)
But how do we get
the alignment?
Linear space algorithm
• When we finish, we know how we have
aligned the ends of the sequences
XM
YN
Naïve idea: Repeat on
the smaller subproblem
F(M-1, N-1)
Time complexity:
O((M+N)(MN))
(0, 0)
M/2
(M, N)
Key observation: optimal alignment (longest path) must use
an intermediate point on the M/2-th row. Call it (M/2, k),
where k is unknown.
(0,0)
(3,0)
(3,2)
(3,4)
(3,6)
(6,6)
• Longest path from (0, 0) to (6, 6) is
max_k (LP(0,0,3,k) + LP(6,6,3,k))
Hirschberg’s idea
• Divide and conquer!
Y
X
Forward algorithm
Align x1x2…xM/2 with Y
M/2
F(M/2, k) represents the
best alignment between
x1x2…xM/2 and y1y2…yk
Backward Algorithm
Y
X
M/2
Backward algorithm
Align reverse(xM/2+1…xM)
with reverse(Y)
B(M/2, k) represents the best
alignment between
reverse(xM/2+1…xM) and
reverse(ykyk+1…yN )
Linear-space alignment
Using 2 (4) rows of space, we can compute
for k = 1…N, F(M/2, k), B(M/2, k)
M/2
Linear-space alignment
Now, we can find k* maximizing F(M/2, k) + B(M/2, k)
Also, we can trace the path exiting column M/2 from k*
Conclusion: In O(NM) time, O(N) space, we found
optimal alignment path at row M/2
Linear-space alignment
k*
M/2
M/2
• Iterate this procedure to the two sub-problems!
N-k*
Analysis
• Memory: O(N) for computation, O(N+M) to
store the optimal alignment
• Time:
– MN for first iteration
– k M/2 + (N-k) M/2 = MN/2 for second
–…
k
M/2
M/2
N-k
MN
MN/2
MN/4
MN + MN/2 + MN/4 + MN/8 + …
= MN (1 + ½ + ¼ + 1/8 + 1/16 + …)
= 2MN = O(MN)
MN/8
Outline
• Part I: Algorithms
–
–
–
–
–
Biological problem
Intro to dynamic programming
Global sequence alignment
Local sequence alignment
More efficient algorithms
• Part II: Biological issues
– Model gaps more accurately
– Alignment statistics
• Part III: BLAST
How to model gaps more accurately
What’s a better alignment?
GACGCCGAACG
|||||
|||
GACGC---ACG
GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score = 8 x m – 3 x d
Score = 8 x m – 3 x d
However, gaps usually occur in bunches.
- During evolution, chunks of DNA may be lost entirely
- Aligning genomic sequences vs. cDNAs (reverse
complimentary to mRNAs)
Model gaps more accurately
• Current model:
– Gap of length n incurs penalty nd

n
• General:
– Convex function
– E.g. (n) = c * sqrt (n)

n
General gap dynamic
programming
Initialization:
same
Iteration:
F(i-1, j-1) + s(xi, yj)
F(i, j) = max maxk=0…i-1F(k,j) – (i-k)
maxk=0…j-1F(i,k) – (j-k)
Termination:
same
Running Time: O((M+N)MN)
(cubic)
Space: O(NM) (linear-space algorithm not applicable)
Compromise: affine gaps
(n) = d + (n – 1)e
|
|
gap
gap
open
extension
Match: 2
Gap open: -5
Gap extension: -1
(n)
d
GACGCCGAACG
|||||
|||
GACGC---ACG
GACGCCGAACG
|||| | | ||
GACG-C-A-CG
8x2-5-2 = 9
8x2-3x5 = 1
We want to find the optimal alignment with affine gap penalty in
• O(MN) time
• O(MN) or better O(M+N) memory
e
Allowing affine gap penalties
• Still three cases
– xi aligned with yj
– xi aligned to a gap
• Are we continuing a gap in x? (if no, start is more expensive)
– yj aligned to a gap
• Are we continuing a gap in y? (if no, start is more expensive)
• We can use a finite state machine to represent the three
cases as three states
– The machine has two heads, reading the chars on the two
strings separately
– At every step, each head reads 0 or 1 char from each sequence
– Depending on what it reads, goes to a different state, and
produces different scores
Finite State Machine
Input
Output
?/?
?/?
Ix
?/?
?/?
F
State
?/?
Iy
?/?
?/?
F: have just read 1 char from each seq (xi aligned to yj )
Ix: have read 0 char from x. (yj aligned to a gap)
Iy: have read 0 char from y (xi aligned to a gap)
Input
Output
(xi,yj) / 
(xi,yj) / 
Ix
(-, yj) / d
F
Start state
(-, yj) / e
(xi,-) / d
Iy
(xi,-) / e
(xi,yj) / 
Current state
Input
Output
Next state
F
(xi,yj)

F
F
(-,yj)
(xi,-)
(-,yj)
…
d
d
e
…
Ix
F
Ix
…
Iy
Ix
…
(xi,yj) / 
(xi,yj) /

start
state
(-, yj) / e
Ix
(-, yj) / d
F
(xi,-) / d
(xi,yj) / 
F-F-F-F
Iy
(xi,-) / e
F-Iy-F-F-Ix
AAC
AAC
AAC-
ACT
|||
||
ACT
-ACT
F-F-Iy-F-Ix
AAC-
| |
A-CT
Given a pair of sequences, an alignment (not necessarily optimal)
corresponds to a state path in the FSM.
Optimal alignment: find a state path to read the two sequences such that
the total output score is the highest
Dynamic programming
• We encode this information in three different
matrices
• For each element (i,j) we use three variables
– F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns
to yj
– Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap
– Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap
xi
xi
yj
F(i, j)
xi
yj
Ix(i, j)
yj
Iy(i, j)
(-, yj)/e
(xi,yj) /
(xi,yj) /
Ix
(-, yj) /d
F
(xi,-) /d
Iy
(xi,yj) /
xi
yj
(xi,-)/e
F(i-1, j-1) + (xi, yj)
F(i, j) = max Ix(i-1, j-1) + (xi, yj)
Iy(i-1, j-1) + (xi, yj)
(-, yj)/e
(xi,yj) /
(xi,yj) /
Ix
(-, yj) /d
F
(xi,-) /d
Iy
(xi,yj) /
(xi,-)/e
F(i, j-1) + d
xi
Ix(i, j) = max
yj
Ix(i, j)
Ix(i, j-1) + e
(-, yj)/e
(xi,yj) /
(xi,yj) /
Ix
(-, yj) /d
F
(xi,-) /d
Iy
(xi,yj) /
(xi,-)/e
F(i-1, j) + d
xi
Iy(i, j) = max
yj
Iy(i, j)
Iy(i-1, j) + e
F(i – 1, j – 1)
F(i, j) = (xi, yj) + max Ix(i – 1, j – 1)
Iy(i – 1, j – 1)
Ix(i, j) = max
Iy(i, j) = max
Continuing alignment
Closing gaps in x
Closing gaps in y
F(i, j – 1) + d
Opening a gap in x
Ix(i, j – 1) + e
Gap extension in x
F(i – 1, j) + d
Opening a gap in y
Iy(i – 1, j) + e
Gap extension in y
Data dependency
F
i
Ix
Iy
j
i-1
i-1
j-1
j-1
Data dependency
F
i
Ix
Iy
j
i
i
j
j
Data dependency
• If we stack all three matrices
– No cyclic dependency
– Therefore, we can fill in all three matrices in order
Algorithm
• for i = 1:m
– for j = 1:n
• Fill in F(i, j), Ix(i, j), Iy(i, j)
– end
end
• F(M, N) = max (F(M, N), Ix(M, N), Iy(M, N))
• Time: O(MN)
• Space: O(MN) or O(N) when combined with the
linear-space algorithm
Exercise
•
•
•
•
•
•
x = GCAC
y = GCC
m=2
s = -2
d = -5
e = -1
y= G
x= 0
-
C
-
C
-
x=
y= G
C
C
-
-
-
G -
G -5
C -
C -6
A -
A -7
C -
C -8
F: aligned on both
y= G
x = -
-5
C
-6
A
C
m=2
s = -2
d = -5
e = -1
Iy: Insertion on y
C
-7
F(i-1, j-1)
-
Iy(i-1, j-1)
(xi, yj)
G -
C
-
Iy(i-1,j)
F(i-1,j)
e
Ix(i-1, j-1)
d
F(i, j)
-
F(i,j-1)
-
Ix(i,j)
Ix(i,j-1)
Ix: Insertion on x
Iy(i,j)
d
e
y= G
x= 0
-
G -
C
-
C
-
2
x=
y= G
C
C
-
-
-
-
G -5
C -
C -6
A -
A -7
C -
C -8
F
y= G
x = -
-5
C
A
C
Iy
C
-6
C
-7
F(i-1, j-1)
G -
m=2
s = -2
d = -5
e = -1
Iy(i-1, j-1)
(xi, yj) = 2
-
Ix(i-1, j-1)
-
F(i, j)
-
Ix
y= G
x= 0
G -
C
-
-
2
-7
C
-
x=
y= G
C
C
-
-
-
-
G -5
C -
C -6
A -
A -7
C -
C -8
F
y= G
x = -
-5
C
A
C
Iy
C
-6
C
-7
F(i-1, j-1)
G -
m=2
s = -2
d = -5
e = -1
Iy(i-1, j-1)
(xi, yj) = -2
-
Ix(i-1, j-1)
-
F(i, j)
-
Ix
y= G
x= 0
G -
C
C
-
-
-
2
-7
-8
x=
y= G
C
C
-
-
-
-
G -5
C -
C -6
A -
A -7
C -
C -8
F
y= G
x = -
-5
C
A
C
Iy
C
-6
C
-7
F(i-1, j-1)
G -
m=2
s = -2
d = -5
e = -1
Iy(i-1, j-1)
(xi, yj) = -2
-
Ix(i-1, j-1)
-
F(i, j)
-
Ix
y= G
x= 0
G -
C
C
-
-
-
2
-7
-8
y= G
C
C
-
-
-
x=
G -5
C -
C -6
A -
A -7
C -
C -8
F
y= G
x=
G -
Iy
C
-5
-6
-
-3
C
-7
F(i,j-1)
C -
d = -5
Ix(i,j)
A -
Ix(i,j-1)
C -
Ix
e = -1
m=2
s = -2
d = -5
e = -1
y= G
x= 0
G -
C
C
-
-
-
2
-7
-8
y= G
C
C
-
-
-
x=
G -5
C -
C -6
A -
A -7
C -
C -8
F
y= G
x=
G -
Iy
C
C
-5
-6
-7
-
-3
-4
F(i,j-1)
C -
d = -5
Ix(i,j)
A -
Ix(i,j-1)
C -
Ix
e = -1
m=2
s = -2
d = -5
e = -1
y= G
x= 0
G -
C
C
-
-
-
2
-7
-8
y= G
C
C
-
-
-
-
-
-
x=
G -5
C -
C -6
A -
A -7
C -
C -8
F
y= G
x=
G -
-5
-
Iy
C
-6
-3
C
-7
-4
Iy(i-1,j)
F(i-1,j)
e=-1
C -
d=-5
A -
Iy(i,j)
C -
Ix
m=2
s = -2
d = -5
e = -1
y= G
x= 0
C
C
-
-
-
G -
2
-7
-8
C -
-7
y= G
C
C
-
-
-
-
-
-
x=
G -5
C -6
A -
A -7
C -
C -8
F
y= G
x=
G -
-5
-
m=2
s = -2
d = -5
e = -1
Iy
C
-6
-3
C
-7
-4
F(i-1, j-1)
Iy(i-1, j-1)
(xi, yj) = -2
C -
Ix(i-1, j-1)
A -
F(i, j)
C -
Ix
y= G
x= 0
C
C
-
-
-
G -
2
-7
-8
C -
-7
4
y= G
C
C
-
-
-
-
-
-
x=
G -5
C -6
A -
A -7
C -
C -8
F
y= G
x=
G -
-5
-
m=2
s = -2
d = -5
e = -1
Iy
C
-6
-3
C
-7
-4
F(i-1, j-1)
Iy(i-1, j-1)
(xi, yj) = 2
C -
Ix(i-1, j-1)
A -
F(i, j)
C -
Ix
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
C -
-7
4
-1
C -6
x=
A -
A -7
C -
C -8
F
y= G
x=
G -
-5
Iy
C
-6
-
m=2
s = -2
d = -5
e = -1
-3
C
-7
-4
F(i-1, j-1)
Iy(i-1, j-1)
(xi, yj) = 2
C -
Ix(i-1, j-1)
A -
F(i, j)
C -
Ix
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
C -
-7
4
-1
C -6
x=
A -
A -7
C -
C -8
F
y= G
x=
Iy
C
C
-5
-6
-7
G -
-
-3
-4
C -
-
-12 -1
F(i,j-1)
d = -5
Ix(i,j)
A -
Ix(i,j-1)
C -
Ix
e = -1
m=2
s = -2
d = -5
e = -1
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
-
C -
-7
4
-1
C -6
-3
x=
A -
A -7
C -
C -8
F
y= G
x=
-5
Iy
C
-6
C
-7
G -
-
-3
C -
-
-12 -1
-4
Iy(i-1,j)
F(i-1,j)
e=-1
d=-5
A -
Iy(i,j)
C -
Ix
m=2
s = -2
d = -5
e = -1
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
-
-
C -
-7
4
-1
C -6
-3
-12 -13
x=
A -
A -7
C -
C -8
F
y= G
x=
G -
Iy
C
C
-5
-6
-7
-
-3
-4
F(i-1, j-1)
Iy(i-1, j-1)
(xi, yj)
-
Iy(i-1,j)
F(i-1,j)
e
Ix(i-1, j-1)
C -
m=2
s = -2
d = -5
e = -1
d
-12 -1
F(i, j)
F(i,j-1)
A -
Iy(i,j)
d
Ix(i,j)
C -
Ix(i,j-1)
Ix
e
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
-
-
C -
-7
4
-5
C -6
-3
-12 -13
A -
-8
-5
2
A -7
C -
x=
C -8
F
y= G
x=
G -
m=2
s = -2
d = -5
e = -1
Iy
C
C
-5
-6
-7
-
-3
-4
F(i-1, j-1)
Iy(i-1, j-1)
(xi, yj)
F(i-1,j)
e
Ix(i-1, j-1)
C -
-
A -
-
Iy(i-1,j)
d
-12 -1
F(i, j)
-13 -10
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
C -
Ix(i,j-1)
Ix
e
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
-
-
C -
-7
4
-1
C -6
-3
-12 -13
A -
-8
-5
2
A -7
-8
-1
C -
x=
C -8
F
y= G
x=
-5
Iy
C
-6
C
-7
G -
-
-3
C -
-
-12 -1
A -
-
-4
Iy(i-1,j)
F(i-1,j)
e=-1
d=-5
-13 -10
Iy(i,j)
C -
Ix
m=2
s = -2
d = -5
e = -1
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
-
-
C -
-7
4
-1
C -6
-3
-12 -13
A -
-8
-5
2
A -7
-8
-1
C -
x=
C -8
F
y= G
x=
-6
-5
Iy
C
-6
C
-7
G -
-
-3
C -
-
-12 -1
A -
-
-4
Iy(i-1,j)
F(i-1,j)
e=-1
d=-5
-13 -10
Iy(i,j)
C -
Ix
m=2
s = -2
d = -5
e = -1
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
-
-
C -
-7
4
-1
C -6
-3
-12 -13
A -
-8
-5
2
A -7
-8
-1
-6
C -
-9
-6
1
C -8
-13 -2
-3
x=
F
y= G
x=
G -
m=2
s = -2
d = -5
e = -1
Iy
C
C
-5
-6
-7
-
-3
-4
F(i-1, j-1)
Iy(i-1, j-1)
(xi, yj)
F(i-1,j)
e
Ix(i-1, j-1)
C -
-
A -
-
-13 -10
C -
-
-14 -11
Iy(i-1,j)
d
-12 -1
F(i, j)
F(i,j-1)
Ix(i,j)
Ix(i,j-1)
Ix
Iy(i,j)
d
e
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
-
-
C -
-7
4
-1
C -6
-3
-12 -13
A -
-8
-5
2
A -7
-8
-1
-6
C -
-9
-6
1
C -8
-13 -2
-3
x=
F
y= G
x=
G -
m=2
s = -2
d = -5
e = -1
Iy
C
C
-5
-6
-7
-
-3
-4
F(i-1, j-1)
Iy(i-1, j-1)
(xi, yj)
F(i-1,j)
e
Ix(i-1, j-1)
C -
-
A -
-
-13 -10
C -
-
-14 -11
Iy(i-1,j)
d
-12 -1
F(i, j)
F(i,j-1)
Ix(i,j)
Ix(i,j-1)
Ix
Iy(i,j)
d
e
y= G
x= 0
C
C
y= G
C
C
-
-
-
-
-
-
-
G -
2
-7
-8
G -5
-
-
C -
-7
4
-1
C -6
-3
-12 -13
A -
-8
-5
2
x GCAC
A -7
-8
-1
-6
C -
-9
-6
1
|| |
C -8
-13 -2
-3
y GC-C
F
y= G
x=
x=
C
Iy
C
y= G
-5
-6
-7
G -
-
-3
-4
G
C -
-
-12 -1
C
A -
-
-13 -10
C -
-
-14 -11
Ix
m=2
s = -2
d = -5
e = -1
x=
A
C
C
C
Statistics of alignment
Where does (xi, yj) come from?
Are two aligned sequences actually related?
Probabilistic model of alignments
• We’ll first focus on protein alignments without
gaps
• Given an alignment, we can consider two
possible models
– R: the sequences are related by evolution
– U: the sequences are unrelated
• How can we distinguish these two models?
• How is this view related to amino-acid
substitution matrix?
Model for unrelated sequences
• Assume each position of the alignment is independently
sampled from some distribution of amino acids
• ps: probability of amino acid s in the sequences
• Probability of seeing an amino acid s aligned to an
amino acid t by chance is
– Pr(s, t | U) = ps * pt
• Probability of seeing an ungapped alignment between
x = x1…xn and y = y1…yn randomly is
i
Model for related sequences
• Assume each pair of aligned amino acids
evolved from a common ancestor
• Let qst be the probability that amino acid s in one
sequence is related to t in another sequence
• The probability of an alignment of x and y is give
by
Probabilistic model of Alignments
• How can we decide which model (U or R) is
more likely?
• One principled way is to consider the relative
likelihood of the two models (the odds ratio)
– A higher ratio means that R is more likely than U
Log odds ratio
• Taking logarithm, we get
• Recall that the score of an alignment is
given by
• Therefore, if we define
• We are actually defining the alignment
score as the log odds ratio between the
two models R and U
How to get the probabilities?
• ps can be counted from the available
protein sequences
• But how do we get qst? (the probability that
s and t have a common ancestor)
• Counted from trusted alignments of related
sequences
Protein Substitution Matrices
• Two popular sets of matrices for protein
sequences
– PAM matrices [Dayhoff et al, 1978]
• Better for aligning closely related sequences
– BLOSUM matrices [Henikoff & Henikoff, 1992]
• For both closely or remotely related sequences
BLOSUM-N matrices
• Constructed from a database called BLOCKS
• Contain many closely related sequences
– Conserved amino acids may be over-counted
• N = 62: the probabilities qst were computed
using trusted alignments with no more than 62%
identity
– identity: % of matched columns
• Using this matrix, the Smith-Waterman algorithm
is most effective in detecting real alignments
with a similar identity level (i.e. ~62%)
: Scaling factor to convert score to integer.
Important: when you are told that a
scoring matrix is in half-bits =>  = ½ ln2
Positive for chemically
similar substitution
Common amino acids
get low weights
Rare amino acids
get high weights
BLOSUM-N matrices
• If you want to detect homologous genes with
high identity, you may want a BLOSUM matrix
with higher N. say BLOSUM75
• On the other hand, if you want to detect remote
homology, you may want to use lower N, say
BLOSUM50
• BLOSUM-62: good for most purposes
45
Weak homology
62
90
Strong homology
For DNAs
• No database of trusted alignments to start
with
• Specify the percentage identity you would
like to detect
• You can then get the substitution matrix by
some calculation
For example
• Suppose pA = pC = pT = pG = 0.25
• We want 88% identity
• qAA = qCC = qTT = qGG = 0.22, the rest =
0.12/12 = 0.01
• (A, A) = (C, C) = (G, G) = (T, T)
= log (0.22 / (0.25*0.25)) = 1.26
• (s, t) = log (0.01 / (0.25*0.25)) = -1.83 for
s ≠ t.
Substitution matrix
A
C
G
T
A
1.26 -1.83 -1.83 -1.83
C
-1.83 1.26 -1.83 -1.83
G
-1.83 -1.83 1.26 -1.83
T
-1.83 -1.83 -1.83 1.26
A
C
G
T
A
5
-7
-7
-7
C
-7
5
-7
-7
G
-7
-7
5
-7
T
-7
-7
-7
5
• Scale won’t change the alignment
• Multiply by 4 and then round off to get integers
Arbitrary substitution matrix
• Say you have a substitution matrix
provided by someone
• It’s important to know what you are
actually looking for when you use the
matrix
NCBI-BLAST
G
WU-BLAST
A
C
T
A
C
A
1
-2 -2 -2
C
-2
1
G
T
G
T
A
5
-4 -4 -4
-2 -2
C
-4
5
-2 -2
1
-2
G
-4 -4
5
-4
-2 -2
-2
1
T
-4 -4
-4
5
-4 -4
• What’s the difference?
• Which one should I use for my sequences?
• We had
• Scale it, so that
• Reorganize:
• Since all probabilities must sum to 1,
• We have
• Suppose again ps = 0.25 for any s
• We know (s, t) from the substitution
matrix
• We can solve the equation for λ
• Plug λ into
to get qst
NCBI-BLAST
WU-BLAST
A
C
G
T
A
C
G
A
1
-2
-2 -2
C
-2
1
G
T
T
A
5
-4
-4 -4
-2 -2
C
-4
5
-4 -4
-2 -2
1
-2
G
-4 -4
5
-4
-2 -2
-2
1
T
-4 -4
-4
5
 = 1.33
 = 0.19
qst = 0.24 for s = t, and 0.004 for s ≠ t
qst = 0.16 for s = t, and 0.03 for s ≠ t
Translate: 95% identity
Translate: 65% identity
Details for solving 
Known: (s,t) = 1 for s=t, and (s,t) = -2 for s t.
Since
A
C
G
T
A
1
-2
-2
-2
C
-2
1
-2
-2
G
-2
-2
1
-2
T
-2
-2
-2
1
and s,t qst = 1, we have
12 * ¼ * ¼ * e-2 + 4 * ¼ * ¼ * e = 1
Let e = x, we have
¾ x-2 + ¼ x = 1. Hence,
x3 – 4x2 + 3 = 0;
• X has three solutions: 3.8, 1, -0.8
•
Only the first solution leads to a positive 
•  = ln (3.8) = 1.33
Statistics of alignment
Where does (xi, yj) come from?
Are two aligned sequences actually related?
Statistics of Alignment Scores
• Q: How do we assess whether an
alignment provides good evidence for
homology (i.e., the two sequences are
evolutionarily related)?
– Is a score 82 good? What about 180?
• A: determine how likely it is that such an
alignment score would result from chance
P-value of alignment
• p-value
– The probability that the alignment score can
be obtained from aligning random sequences
– Small p-value means the score is unlikely to
happen by chance
• The most common thresholds are 0.01
and 0.05
– Also depend on purpose of comparison and
cost of misclaim
Statistics of global seq alignment
• Theory only applies to local alignment
• For global alignment, your best bet is to do Monte-Carlo
simulation
– What’s the chance you can get a score as high as the real
alignment by aligning two random sequences?
• Procedure
– Given sequence X, Y
– Compute a global alignment (score = S)
– Randomly shuffle sequence X (or Y) N times, obtain
X1, X2, …, XN
– Align each Xi with Y, (score = Ri)
– P-value: the fraction of Ri >= S
Human HEXA
Fly HEXO1
Score = -74
45
40
Number of Sequences
35
30
25
20
15
-74
10
5
0
-95
-90
-85
-80
-75
-70
Alignment Score
-65
-60
-55
-50
Distribution of the alignment scores between fly HEXO1 and 200
randomly shuffled human HEXA sequences
There are 88 random sequences with alignment score >= -74.
So: p-value = 88 / 200 = 0.44 => alignment is not significant
Mouse HEXA
Human HEXA
Score = 732
……………………………………………………
45
45
40
40
35
30
Number of Sequences
35
Number of Sequences
30
Distribution of the
alignment scores
between mouse HEXA
and 200 randomly
shuffled human HEXA
sequences
25
20
15
10
25
5
0
-230
20
-220
-210
-200
-190
-180
Alignment Score
-170
-160
-150
15
732
10
5
0
-200
-100
0
100
200
300
400
Alignment Score
500
600
700
800
• No random sequences with alignment score >= 732
– So: the P-value is less than 1 / 200 = 0.05
• To get smaller p-value, have to align more random sequences
– Very slow
• Unless we can fit a distribution (e.g. normal distribution)
– Such distribution may not be generalizable
– No theory exists for global alignment score distribution
Statistics for local alignment
• Elegant theory exists
• Score for ungapped local alignment follows extreme value
distribution (Gumbel distribution)
Normal
distribution
Extreme value
distribution
An example extreme value distribution:
• Randomly sample 100 numbers from a normal distribution, and compute max
• Repeat 100 times.
• The max values will follow extreme value distribution
Statistics for local alignment
• Given two unrelated sequences of lengths M, N
• Expected number of ungapped local alignments
with score at least S can be calculated by
–
–
–
–
E(S) = KMN exp[-S]
Known as E-value
: scaling factor as computed in last lecture
K: empirical parameter ~ 0.1
• Depend on sequence composition and substitution matrix
P-value for local alignment score
• P-value for a local alignment with score S

P  x  S   1  exp  E ( S )   1  exp  KMNeS
 E ( S ) when P is small.

Example
• You are aligning two sequences, each has
1000 bases
• m = 1, s = -1, d = -inf (ungapped alignment)
• You obtain a score 20
• Is this score significant?
 = ln3 = 1.1 (computed as discussed on slide #41)
E(S) = K MN exp{- S}
E(20) = 0.1 * 1000 * 1000 * 3-20 = 3 x 10-5
P-value = 3 x 10-5 << 0.05
The alignment is significant
400
350
300
Number of Sequences
•
•
•
•
•
250
Distribution of 1000
random sequence pairs
200
150
100
20
50
0
9
10
11
12
13
14
15
Alignment Score
16
17
18
Multiple-testing problem
• Searching a 1000-base sequence against a database of
106 sequences (each of length 1000)
• How significant is a score 20 now?
• You are essentially comparing 1000 bases with 1000x106
= 109 bases (ignore edge effect)
• E(20) = 0.1 * 1000 * 109 * 3-20 = 30
• By chance we would expect to see 30 matches
– The P-value (probability of seeing at least one match with score
>= 30) is 1 – e-30 = 0.9999999999
– The alignment is not significant
– Caution: it does NOT mean that the two sequences are unrelated.
Rather, it simply means that you have NO confidence to say
whether the two sequences are related.
Score threshold to determine
significance
• You want a p-value that is very small (even after
taking into consideration multiple-testing)
• What S will guarantee you a significant p-value?
E(S)  P(S) << 1
=> KMN exp[-S] << 1
=> log(KMN) -S < 0
=> S > T + log(MN) / 
(T = log(K) / , usually small)
Score threshold to determine
significance
• In the previous example
– m = 1, s = -1, d = -inf =>  = 1.1
• Aligning 1000bp vs 1000bp
S > log(106) / 1.1 = 13.
So 20 is significant.
• Searching 1000bp against 106 x 1000bp
S > log(1012) / 1.1 = 25.
so 20 is not significant.
Statistics for gapped local alignment
• Theory not well developed
• Extreme value distribution works well
empirically
• Need to estimate K and  empirically
– Given the database and substitution matrix,
generate some random sequence pairs
– Do local alignment
– Fit an extreme value distribution to obtain K
and 
Alignment statistics summary
• How to obtain a substitution matrix?
– Obtain qst and ps from established alignments (for DNA: from
your knowledge)
– Computing score:
• How to understand arbitrary substitution matrix?
– Solve function to obtain  and target qst
– Which tells you what percent identity you are expecting
• How to understand alignment score?
– probability that a score can be expected from chance.
– Global alignment: Monte-Carlo simulation
– Local alignment: Extreme Value Distribution
• Estimate p-value from a score
• Determine a score threshold without computing a p-value
Part III:
Heuristic Local Sequence
Alignment: BLAST
State of biological databases
Sequenced Genomes:
Human
Mouse
Neurospora
Tetraodon
Drosophila
Rice
sea squirts
3109
2.7109
4107
3108
1.2108
1.0109
1.6108
Yeast
Rat
Fugu fish 3.3108
Mosquito 2.8108
Worm
Arabidopsis
Current rate of sequencing (before new-generation sequencing):
4 big labs  3 109 bp /year/lab
10s small labs
Private sectors
With new-generation sequencing:
Easily generating billions of reads daily
1.2107
2.6109
1.0108
1.2108
Some useful applications of
alignments
Given a newly discovered gene,
- Does it occur in other species?
Assume we try Smith-Waterman:
Our
new
gene
104
The entire genomic database
1010 - 1011
May take several weeks!
Some useful applications of
alignments
Given a newly sequenced organism,
- Which subregions align with other organisms?
- Potential genes
- Other functional units
Assume we try Smith-Waterman:
Our newly
sequenced
mammal
3109
The entire genomic database
1010 - 1011
> 1000 years ???
BLAST
• Basic Local Alignment Search Tool
– Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990
– The most widely used bioinformatics tool
• Which is better: long mediocre match or a few nearby,
short, strong matches with the same total score?
– Score-wise, exactly equivalent
– Biologically, later may be more interesting, & is common
– At least, if must miss some, rather miss the former
• BLAST is a heuristic algorithm emphasizing the later
– speed/sensitivity tradeoff: BLAST may miss former, but gains
greatly in speed
BLAST
• Available at NCBI (National Center for
Biotechnology Information) for download and
online use. http://blast.ncbi.nlm.nih.gov/
• Along with many sequence databases
query
Main idea:
1.Construct a dictionary of all the words in
the query
2.Initiate a local alignment for each word
match between query and DB
Running Time: O(MN)
However, orders of magnitude faster
than Smith-Waterman
DB
BLAST  Original Version
……
Dictionary:
All words of length k (~11 for DNA, 3
for proteins)
Alignment initiated between words of
alignment score  T (typically T = k)
query
……
Alignment:
Ungapped extensions until score
below statistical threshold
scan
DB
Output:
All local alignments with score
> statistical threshold
query
BLAST  Original Version
A C G A A G T A A G G T C C A G T
k = 4, T = 4
The matching word GGTC
initiates an alignment
Extension to the left and
right with no gaps until
alignment falls < 50%
Output:
GTAAGGTCC
GTTAGGTCC
C C C T T C C T G G A T T G C G A
Example:
Gapped BLAST
Added features:
•
•
Pairs of words can
initiate alignment
Extensions with
gaps in a band
around anchor
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
C T G A T C C T G G A T T G C G A
A C G A A G T A A G G T C C A G T
Example
Query: gattacaccccgattacaccccgattaca (29 letters)
[2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS
sequences) 1,726,556 sequences; 8,074,398,388 total letters
>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence,
complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21
(95%) Strand = Plus / Plus
Query:
Sbjct:
4
tacaccccgattacaccccga 24
||||||| |||||||||||||
125138 tacacccagattacaccccga 125158
Score = 34.2 bits (17),
Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus
Query:
Sbjct:
4 tacaccccgattacaccccga 24
||||||| |||||||||||||
125104 tacacccagattacaccccga 125124
>gi|28173089|gb|AC104321.7|
Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence,
complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21
(95%) Strand = Plus / Plus
Query:
Sbjct:
4 tacaccccgattacaccccga 24
||||||| |||||||||||||
3891 tacacccagattacaccccga 3911
Example
Query: Human atoh enhancer, 179 letters
[1.5 min]
Result: 57 blast hits
1.
2.
3.
4.
5.
6.
7.
8.
gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc...
gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch...
gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc...
gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene...
gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo...
gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O...
gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ...
gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres...
355 1e-95
264 4e-68
256 9e-66
78 5e-12
54 7e-05
44 0.068
42 0.27
42 0.27
gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517
Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%),
Gaps = 2/177 (1%) Strand = Plus / Plus
Query:
3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203
Query:
63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
|||||||||||||||||||||||||| ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262
Query:
123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179
||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318
BLAST Score: bit score vs raw score
• Bit score is converted from raw score by
taking into account K and :
S’ = ( S – log K) / log 2
• To compute E-value from bit score:
E = KM’N’ e-S = M’N’ 2-S’
• Critical score is now:
S* = log2(M’N’)
If S’ >> S*: significant
If S’ << S*: not significant
(M’ ~ M, N’ ~ N)
Different types of BLAST
• blastn: search nucleic acid databases
• blastp: search protein databases
• blastx: you give a nucleic acid sequence,
search protein databases
• tblastn: you give a protein sequence,
search nucleic acid databases
• tblastx: you give a nucleic sequence,
search nucleic acid database, implicitly
translate both into protein sequences
BLAST cons and pros
• Advantages
– Fast!!!!
– A few minutes to search a database of 1011 bases
• Disadvantages
– Sensitivity may be low
– Often misses weak homologies
• New improvement
– Make it even faster
• Mainly for aligning very similar sequences or really long sequences
– E.g. whole genome vs whole genome
– Make it more sensitive
• PSI-BLAST: iteratively add more homologous sequences
• PatternHunter: discontinuous seeds
Variants of BLAST
NCBI-BLAST: most widely used version
WU-BLAST: (Washington University BLAST): another popular version
Optimized, added features
MEGABLAST:
Optimized to align very similar sequences. Linear gap penalty
BLAT: Blast-Like Alignment Tool
BlastZ:
Optimized for aligning two genomes
PSI-BLAST:
BLAST produces many hits
Those are aligned, and a pattern is extracted
Pattern is used for next search; above steps iterated
Sensitive for weak homologies
Slower
Summary
• Part I: Algorithms
– Global sequence alignment: Needleman-Wunsch
– Local sequence alignment: Smith-Waterman
– Improvement on space and time
• Part II: Biological issues
– Model gaps more accurately: affine gap penalty
– Alignment statistics
• Part III: Heuristic algorithms – BLAST family
Download