BNFO 136
Sequence alignment
Usman Roshan

Pairwise alignment
• X: ACA, Y: GACAT
• Match = 8, mismatch = 2, gap = -5
• Four possible alignments and their scores:

    X: ACA--        2+2+2-5-5 = -4
    Y: GACAT

    X: -ACA-        -5+8+8+8-5 = 14
    Y: GACAT

    X: --ACA        -5-5+2+2+2 = -4
    Y: GACAT

    X: ACA----      2-5-5-5-5-5-5 = -28
    Y: G--ACAT

Traceback
• We can compute an alignment of DNA (or protein or RNA) sequences X and Y with a traceback matrix T.
• Sequence X is aligned along the rows and Y along the columns.
• Each entry of the matrix T contains D, L, or U, specifying a diagonal, left, or upward move.

Traceback
• X: ACA, Y: TACAG

           T  A  C  A  G
           L  L  L  L  L
        A  U  D  U  U  L
        C  U  U  D  U  D
        A  U  L  L  D  L

Traceback code

    aligned_seq1 = ""
    aligned_seq2 = ""
    i = len(seq2)   # seq2 runs along the rows
    j = len(seq1)   # seq1 runs along the columns
    while i != 0 or j != 0:
        if T[i][j] == "L":
            # Left: take a character of seq1, put a gap in seq2
            aligned_seq2 = "-" + aligned_seq2
            aligned_seq1 = seq1[j-1] + aligned_seq1
            j = j - 1
        elif T[i][j] == "U":
            # Up: take a character of seq2, put a gap in seq1
            aligned_seq1 = "-" + aligned_seq1
            aligned_seq2 = seq2[i-1] + aligned_seq2
            i = i - 1
        else:
            # Diagonal: align a character of seq1 with a character of seq2
            aligned_seq1 = seq1[j-1] + aligned_seq1
            aligned_seq2 = seq2[i-1] + aligned_seq2
            i = i - 1
            j = j - 1

Optimal alignment
• An alignment can be specified by the traceback matrix.
• How do we determine the traceback for the highest scoring alignment?
• Needleman-Wunsch algorithm for global alignment
  – First proposed in 1970
  – Widely used in genomics/bioinformatics
  – Dynamic programming algorithm

Needleman-Wunsch (NW)
• Input:
  – X = x_1 x_2 … x_n, Y = y_1 y_2 … y_m
  – (X is seq2 and Y is seq1)
• Notation:
  – X_{1..i} = x_1 x_2 … x_i
  – Score(X_{1..i}, Y_{1..j}) = optimal alignment score of sequences X_{1..i} and Y_{1..j}
• Suppose we know the optimal alignment scores of
  – X_{1..i-1} and Y_{1..j-1}
  – X_{1..i} and Y_{1..j-1}
  – X_{1..i-1} and Y_{1..j}

Needleman-Wunsch (NW)
• Then the optimal alignment score of X_{1..i} and Y_{1..j} is the maximum of
  – Score(X_{1..i-1}, Y_{1..j-1}) + match/mismatch
  – Score(X_{1..i}, Y_{1..j-1}) + gap
  – Score(X_{1..i-1}, Y_{1..j}) + gap
• We build on this observation to compute Score(X_{1..n}, Y_{1..m}), the score for the full sequences.

Needleman-Wunsch
• Define V to be a two-dimensional matrix with len(X)+1 rows and len(Y)+1 columns.
• Let V[i][j] be the score of the optimal alignment of X_{1..i} and Y_{1..j}.
• Let m be the match score, mm the mismatch score, and g the gap score.

NW pseudocode

    Initialization:
    V[0][0] = 0
    for i = 1 to len(seq2) { V[i][0] = i*g; T[i][0] = 'U'; }
    for j = 1 to len(seq1) { V[0][j] = j*g; T[0][j] = 'L'; }

    Recurrence:
    for i = 1 to len(seq2) {
        for j = 1 to len(seq1) {
            V[i][j] = max {
                V[i-1][j-1] + m (or mm),
                V[i-1][j] + g,
                V[i][j-1] + g
            }
            if (maximum is V[i-1][j-1] + m (or mm)) then T[i][j] = 'D'
            else if (maximum is V[i-1][j] + g) then T[i][j] = 'U'
            else T[i][j] = 'L'
        }
    }

Example
Input:
    seq2: ACA
    seq1: GACAT
    m = 5, mm = -4, gap = -20

V (seq2 along the rows, seq1 along the columns)

                 G     A     C     A     T
            0   -20   -40   -60   -80  -100
      A   -20    -4   -15   -35   -55   -75
      C   -40   -24    -8   -10   -30   -50
      A   -60   -44   -19   -12    -5   -25

T (seq2 is lined along the rows and seq1 along the columns)

              G  A  C  A  T
              L  L  L  L  L
      A    U  D  D  L  L  L
      C    U  U  D  D  L  L
      A    U  U  D  D  D  L
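
The initialization, recurrence, and traceback described above can be combined into one short Python sketch. This is only an illustration under a few assumptions: the function name `needleman_wunsch` and its parameter order are made up for this example, T[i][0] and T[0][j] are filled with "U" and "L" so the traceback can walk back to the origin, and ties are resolved in the order diagonal, up, left. A different tie order can change individual cells of T, but it does not change the scores in V.

```python
def needleman_wunsch(seq1, seq2, m, mm, g):
    """Global alignment sketch: seq2 along the rows, seq1 along the columns.

    m = match score, mm = mismatch score, g = gap score.
    Returns (V, T, aligned_seq1, aligned_seq2).
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    V = [[0] * cols for _ in range(rows)]
    T = [[""] * cols for _ in range(rows)]

    # Initialization: first column is reached only by "U" moves,
    # first row only by "L" moves.
    for i in range(1, rows):
        V[i][0] = i * g
        T[i][0] = "U"
    for j in range(1, cols):
        V[0][j] = j * g
        T[0][j] = "L"

    # Recurrence: best of diagonal (match/mismatch), up (gap), left (gap).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = V[i - 1][j - 1] + (m if seq2[i - 1] == seq1[j - 1] else mm)
            up = V[i - 1][j] + g
            left = V[i][j - 1] + g
            best = max(diag, up, left)
            V[i][j] = best
            if best == diag:
                T[i][j] = "D"
            elif best == up:
                T[i][j] = "U"
            else:
                T[i][j] = "L"

    # Traceback from the bottom-right corner back to T[0][0].
    aligned_seq1, aligned_seq2 = "", ""
    i, j = len(seq2), len(seq1)
    while i != 0 or j != 0:
        if T[i][j] == "L":
            aligned_seq1 = seq1[j - 1] + aligned_seq1
            aligned_seq2 = "-" + aligned_seq2
            j -= 1
        elif T[i][j] == "U":
            aligned_seq1 = "-" + aligned_seq1
            aligned_seq2 = seq2[i - 1] + aligned_seq2
            i -= 1
        else:
            aligned_seq1 = seq1[j - 1] + aligned_seq1
            aligned_seq2 = seq2[i - 1] + aligned_seq2
            i -= 1
            j -= 1
    return V, T, aligned_seq1, aligned_seq2


# Inputs from the example: seq2 = ACA, seq1 = GACAT, m = 5, mm = -4, gap = -20
V, T, a1, a2 = needleman_wunsch("GACAT", "ACA", m=5, mm=-4, g=-20)
print(V[-1][-1])  # -25
print(a1)         # GACAT
print(a2)         # -ACA-
```

The bottom-right entry of V is the optimal global alignment score, and the two printed strings are the aligned sequences read off the traceback.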