Needleman Wunsch Algorithm for Sequence Alignment in

Needleman Wunsch Sequence Alignment
•The Needleman–Wunsch algorithm performs a global alignment on two
sequences (called A and B here).
•It is commonly used in bioinformatics to align protein or nucleotide
sequences.
•The algorithm was proposed in 1970 by Saul Needleman and Christian
Wunsch in their paper "A general method applicable to the search for
similarities in the amino acid sequence of two proteins", J Mol Biol. 48(3):443–53.
•The Needleman–Wunsch algorithm is an example of dynamic programming,
and was the first application of dynamic programming to biological sequence
comparison.
Scores for aligned characters are specified by a similarity matrix. Here, S(i,j)
is the similarity of characters i and j. The algorithm uses a linear gap penalty, called d.
For example, if the similarity matrix was

      A    G    C    T
 A   10   -1   -3   -4
 G   -1    7   -5   -3
 C   -3   -5    9    0
 T   -4   -3    0    8

Then the alignment:
AGACTAGTTAC
CGA---GACGT
with a gap penalty of d = -5, would have the following score:
S(A,C) + S(G,G) + S(A,A) + 3*d + S(G,G) + S(T,A) + S(T,C) + S(A,G) + S(C,T)
= -3 + 7 + 10 - 3*5 + 7 - 4 + 0 - 1 + 0 = 1
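The arithmetic above can be checked with a short Python sketch. The dictionary layout and the function name `alignment_score` are illustrative choices, not part of the original slides; the sketch assumes the two rows are already aligned and that "-" marks a gap.

```python
# Similarity matrix from the example, stored as a nested dict (layout is an assumption).
S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}


def alignment_score(row_a, row_b, d=-5):
    """Score two equal-length aligned rows; '-' marks a gap costing d."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += d          # linear gap penalty
        else:
            total += S[x][y]    # similarity of the aligned pair
    return total


print(alignment_score("AGACTAGTTAC", "CGA---GACGT"))  # 1
```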
•To find the alignment with the highest score, a two-dimensional array (or
matrix) is allocated. This matrix is often called the F matrix, and its (i,j)th
entry is often denoted F_{i,j} (j along the horizontal axis and i along the vertical axis).
•There is one column for each character in sequence A and one row for
each character in sequence B, plus a row 0 and a column 0 that represent an empty prefix.
•Thus, if we are aligning sequences of sizes n and m, the running time of the
algorithm is O(nm) and the amount of memory used is O(nm).
•As the algorithm progresses, F_{i,j} is assigned the optimal score
for the alignment of the first j characters in A and the first i characters in B.
The principle of optimality is then applied as follows.
Basis: F_{0,j} = d * j, F_{i,0} = d * i
Recursion, based on the principle of optimality:
F_{i,j} = max(F_{i-1,j-1} + S(B_i, A_j), F_{i,j-1} + d, F_{i-1,j} + d)
The pseudo-code for the algorithm to compute the F matrix therefore looks like this
(the F matrix is indexed from 0; the sequences A and B are indexed from 1):
for i = 0 to length(B)
    F(i,0) <- d*i
for j = 0 to length(A)
    F(0,j) <- d*j
for i = 1 to length(B) {
    for j = 1 to length(A) {
        Choice1 <- F(i-1,j-1) + S(B(i), A(j))
        Choice2 <- F(i-1,j) + d
        Choice3 <- F(i,j-1) + d
        F(i,j) <- max(Choice1, Choice2, Choice3)
    }
}
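A runnable version of this fill step might look like the following Python sketch. It assumes the example similarity matrix and gap penalty d = -5 from earlier; the name `fill_matrix` is hypothetical. Because Python strings are indexed from 0, character i of B is `b[i - 1]`.

```python
# Example similarity matrix from the slides (dict layout is an assumption).
S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}


def fill_matrix(a, b, d=-5):
    """Return the F matrix: len(b)+1 rows (index i) by len(a)+1 columns (index j)."""
    rows, cols = len(b) + 1, len(a) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        F[i][0] = d * i          # first i characters of B against gaps only
    for j in range(cols):
        F[0][j] = d * j          # first j characters of A against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + S[b[i - 1]][a[j - 1]],  # Choice1: diagonal
                F[i - 1][j] + d,                          # Choice2: up
                F[i][j - 1] + d,                          # Choice3: left
            )
    return F


F = fill_matrix("AGT", "AT")
print(F[-1][-1])  # 13, the score of AGT aligned with A-T
```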
•Once the F matrix is computed, the bottom right-hand corner of the matrix holds the maximum
score for any alignment.
•To compute which alignment actually gives this score, start from the bottom right cell
and compare its value with the three possible sources (Choice1, Choice2, and Choice3 above)
to see which it came from.
If Choice1, then A(j) and B(i) are aligned,
If Choice2, then B(i) is aligned with a gap, and
If Choice3, then A(j) is aligned with a gap.
AlignmentA <- ""; AlignmentB <- ""
i <- length(B); j <- length(A)
while (i > 0 AND j > 0) {
    Score <- F(i,j); ScoreDiag <- F(i-1,j-1)
    ScoreLeft <- F(i,j-1); ScoreUp <- F(i-1,j)
    if (Score == ScoreDiag + S(B(i), A(j))) {
        AlignmentA <- A(j) + AlignmentA; AlignmentB <- B(i) + AlignmentB
        i <- i - 1; j <- j - 1
    }
    else if (Score == ScoreLeft + d) {
        AlignmentA <- A(j) + AlignmentA; AlignmentB <- "-" + AlignmentB
        j <- j - 1
    }
    else if (Score == ScoreUp + d) {
        AlignmentA <- "-" + AlignmentA; AlignmentB <- B(i) + AlignmentB
        i <- i - 1
    }
}
while (j > 0) { AlignmentA <- A(j) + AlignmentA; AlignmentB <- "-" + AlignmentB; j <- j - 1 }
while (i > 0) { AlignmentA <- "-" + AlignmentA; AlignmentB <- B(i) + AlignmentB; i <- i - 1 }
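The traceback can be sketched in Python as below. This is a minimal illustration under the same assumptions as the slides' example (the sample similarity matrix, d = -5); the function name `align` is hypothetical, and the matrix fill is repeated so the block runs on its own. Ties are resolved in the same order as the pseudo-code: diagonal, then left, then up.

```python
# Example similarity matrix from the slides (dict layout is an assumption).
S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}


def align(a, b, d=-5):
    """Return (aligned_a, aligned_b, score) for a global alignment of a and b."""
    # Fill F with rows indexed by B (i) and columns indexed by A (j).
    F = [[0] * (len(a) + 1) for _ in range(len(b) + 1)]
    for i in range(len(b) + 1):
        F[i][0] = d * i
    for j in range(len(a) + 1):
        F[0][j] = d * j
    for i in range(1, len(b) + 1):
        for j in range(1, len(a) + 1):
            F[i][j] = max(F[i - 1][j - 1] + S[b[i - 1]][a[j - 1]],
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)

    # Walk back from the bottom right cell, collecting columns in reverse.
    out_a, out_b = [], []
    i, j = len(b), len(a)
    while i > 0 and j > 0:
        if F[i][j] == F[i - 1][j - 1] + S[b[i - 1]][a[j - 1]]:
            out_a.append(a[j - 1]); out_b.append(b[i - 1])   # diagonal
            i -= 1; j -= 1
        elif F[i][j] == F[i][j - 1] + d:
            out_a.append(a[j - 1]); out_b.append("-")        # left: gap in B
            j -= 1
        else:
            out_a.append("-"); out_b.append(b[i - 1])        # up: gap in A
            i -= 1
    while j > 0:                                             # leftover prefix of A
        out_a.append(a[j - 1]); out_b.append("-"); j -= 1
    while i > 0:                                             # leftover prefix of B
        out_a.append("-"); out_b.append(b[i - 1]); i -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), F[len(b)][len(a)]


print(align("AGT", "AT"))  # ('AGT', 'A-T', 13)
```

Appending to lists and reversing once at the end avoids the repeated string prepends of the pseudo-code; the resulting alignment is the same.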
Project Deliverables:
Given the computation flow of the NWSA algorithm, architect a pipelined VHDL
implementation such that a single pipeline stage contains a single processing
element (PE).
1. Find the number and width of data elements that move between PEs.
2. Also assume that the testbench code includes the read/write memory.
a. Assume a fixed length of the A string – A does not change.
b. B strings are sent from the memory to the PEs as inputs. Once a B string is
consumed, the next B string is fed into the system from the memory.
c. The final score values are sent back to memory as outputs. Each score corresponds to
a single B string.
d. Explicit instantiations of memory elements are not required – supply input values
from testbench, and read output values into the testbench.
e. Each PE also stores the compass value (cv) to remember where it got its score from
(0 = diagonal, 1 = up, 2 = left).
3. Describe your pipelined design implementation in your report.
4. Include printouts of the VHDL code, including the testbench, in the report.
5. Attach the waveform printouts to the report.
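When checking testbench outputs, a software reference model can supply the expected score for each B string. The Python sketch below is one possible model, not the pipelined design itself: it assumes one stored value per character of A (standing in for a PE register), the example similarity matrix, and d = -5. The name `reference_scores` and the tie order for the compass value (diagonal first, then up, then left) are assumptions.

```python
# Example similarity matrix from the slides (dict layout is an assumption).
S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}


def reference_scores(a, b_strings, d=-5):
    """Return one (final_score, compass_rows) pair per B string; A stays fixed."""
    results = []
    for b in b_strings:
        # up[j] stands in for the register of PE j and holds F(i-1, j).
        up = [d * j for j in range(len(a) + 1)]
        compass_rows = []
        for i, b_ch in enumerate(b, start=1):
            diag = up[0]      # F(i-1, 0)
            left = d * i      # F(i, 0)
            up[0] = left
            row_cv = []
            for j, a_ch in enumerate(a, start=1):
                choices = (diag + S[b_ch][a_ch],  # 0 = diagonal
                           up[j] + d,             # 1 = up
                           left + d)              # 2 = left
                best = max(choices)
                row_cv.append(choices.index(best))  # first maximum wins on ties
                diag = up[j]   # old F(i-1, j) becomes the next PE's diagonal
                up[j] = best   # register now holds F(i, j)
                left = best    # value passed to the next PE
            compass_rows.append(row_cv)
        results.append((up[len(a)], compass_rows))
    return results


for score, cv in reference_scores("AGT", ["AT", "AGT"]):
    print(score, cv)
```

Only the final score per B string is returned to the testbench in the project description; the compass rows are kept here purely for inspection.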