BCB 444/544 Lecture 7
Still more: Dynamic Programming
Global vs Local Alignment
Scoring Matrices & Alignment Statistics
BLAST (nope)
#7_Sept5

BCB 444/544 F07 ISU Dobbs #7 - Still more DP, Scoring Matrices 9/5/07

Required Reading (before lecture)

√ Last week: for Lectures 4-7
Pairwise Sequence Alignment, Dynamic Programming, Global vs Local Alignment, Scoring Matrices, Statistics
• Xiong: Chp 3
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
  http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html

Wed Sept 5 - for Lecture 7 & Lab 3
Database Similarity Searching: BLAST
• Chp 4 - pp 51-62

Fri Sept - for Lecture 8
BLAST variations; BLAST vs FASTA
• Chp 4 - pp 51-62

Assignments & Announcements

√ Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM
  Send via email to Pete Zaback petez@iastate.edu
  (For now, no late penalty - just send ASAP)
√ Wed Sept 5 - Notes for Lecture 5 posted online
  - HW#2 posted online & sent via email & handed out in class
Fri Sept 14 - HW#2 Due by 5 PM
Fri Sept 21 - Exam #1

Chp 3 - Sequence Alignment

SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Methods

• √ Global and Local Alignment
• √ Alignment Algorithms
• √ Dot Matrix Method
• Dynamic Programming Method - cont
  • Gap penalties
  • DP for Global Alignment
  • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Global vs Local Alignment

Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length

Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length

Global vs Local Alignment - example

1 = CTGTCGCTGCACG
2 = TGCCGTG

Global alignment
CTGTCGCTGCACG
-TG-C-C-G--TG

Local alignment
CTGTCGCTGCACG
-TGCCG-TG----

CTGTCGCTGCACG
-TGCCG-T----G

Which is better?

Global vs Local Alignment
Which should be used when? It is critical to choose correct method!

Global Alignment vs Local Alignment? Shout out the answers!!
Which should we use for?
1. Searching for conserved motifs in DNA or protein sequences?
2. Aligning two closely related sequences with similar lengths?
3. Aligning highly divergent sequences?
4. Generating an extended alignment of closely related sequences?
5. Generating an extended alignment of closely related sequences with very different lengths? Hmmm - we'll work on that

Excellent!

Global vs Local Alignment
Which should be used when?
1. Searching for conserved motifs in DNA or protein sequences? Local
2. Aligning two closely related sequences with similar lengths? Global
3. Aligning highly divergent sequences? Local (at least initially)
4. Generating an extended alignment of closely related sequences? Global
5. Generating an extended alignment of closely related sequences with very different lengths? Hmmm - we'll work on that

Alignment Algorithms

3 major methods for pairwise sequence alignment:
1. Dot matrix analysis √ - practice in HW2
2. Dynamic programming - more today & in HW2
3. Word or k-tuple methods (later, in Chp 4)

Dynamic Programming
For Pairwise sequence alignment

Idea: Display one sequence above another with spaces inserted in both to reveal similarity

CAT-TCA-C
| |  || |
C-TCGCAGC

Global Alignment: Scoring

CTGTCGCTGCACG
-TGC-CG-TG---

Reward for matches: α
Mismatch penalty: β
Space/gap penalty: γ

$\text{Score} = \alpha w - \beta x - \gamma y$
w = #matches, x = #mismatches, y = #spaces

Note: I changed symbols & colors on this slide!

Global Alignment: Scoring

Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5

  C    T    G    T    C    G    -    C    T    G    C
  -    T    G    C    -    C    G    -    T    G    -
 -5   10   10   -2   -5   -2   -5   -5   10   10   -5

Total = 11

Note: I changed symbols & colors on this slide!
We could have done better!!
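
The column-by-column scoring on the slide above can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and its parameter names are made up here, and the default values are the +10 / -2 / -5 scores from the slide, passed as signed numbers.

```python
def score_alignment(top, bottom, match=10, mismatch=-2, gap=-5):
    """Sum the column scores of a gapped pairwise alignment ('-' marks a space)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # one residue lined up with a space
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# The alignment from the slide: 4 matches, 2 mismatches, 5 spaces.
print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))  # 11
```

Scoring a given alignment is the easy part; the dynamic programming method is about finding the alignment that maximizes this sum.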

Alignment Algorithms

• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
  • Gap penalty functions
  • Scoring matrices

Dynamic Programming - Key Idea:

The score of the best possible alignment that ends at a given pair of positions (i, j) is equal to:
the score of best alignment ending just previous to those two positions (i.e., ending at i-1, j-1)
PLUS
the score for aligning xi and yj

Global Alignment: DP Problem Formulation & Notations

Given two sequences (strings)
• X = x1x2…xN of length N (example: x = AGC, N = 3)
• Y = y1y2…yM of length M (example: y = AAAC, M = 4)

Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = Score of best alignment of x[1..i] = x1x2…xi with y[1..j] = y1y2…yj

Which means: S(i,j) = Score of best alignment of a prefix of X and a prefix of Y

Example: S(2,3) = score of best alignment of AG (x1x2) to AAA (y1y2y3)

[Figure: DP matrix for the example, with x1 x2 x3 along one axis and y1 y2 y3 y4 along the other]

Dynamic Programming - 4 Steps:

1. Define score of optimal alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
3. Calculate score of optimal alignment(s)
4. Trace back through matrix to recover optimal alignment(s) that generated optimal score

1- Define Score of Optimal Alignment using Recursion

Define:
x1..i = Prefix of length i of x
y1..j = Prefix of length j of y
S(i, j) = Score of optimal alignment of x1..i and y1..j

α = Match Reward
β = Mismatch Penalty
γ = Gap penalty

Initial conditions:
$S(i,0) = -i\gamma$
$S(0,j) = -j\gamma$

Recursive definition: for $1 \le i \le N$, $1 \le j \le M$:

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i,y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

where $\sigma(x_i,y_j) = \alpha$ (match) or $-\beta$ (mismatch)

2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems

• Construct sequence vs sequence matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment ending at residues at [i,j]

[Figure: (N+1) x (M+1) matrix indexed 0..N and 0..M, with S(0,0) = 0 in the upper left corner, S(i,j) in the interior, and S(N,M) in the lower right corner]

How do we calculate S(i,j)?
i.e., Score for alignment of x[1..i] to y[1..j]?

1 of 3 cases gives the optimal score for this subproblem:

xi aligns to yj:
  x1 x2 . . . xi-1 xi
  y1 y2 . . . yj-1 yj
  Score: S(i-1,j-1) + σ(xi,yj)

xi aligns to a gap:
  x1 x2 . . . xi-1 xi
  y1 y2 . . . yj   -
  Score: S(i-1,j) - γ

yj aligns to a gap:
  x1 x2 . . . xi   -
  y1 y2 . . . yj-1 yj
  Score: S(i,j-1) - γ

Specific Example:
Note: I changed sequences on this slide (to match the rest of DP example)

Case 1: Line up xi with yj. Scoring consequence? Mismatch Penalty
Case 2: Line up xi with space. Scoring consequence? Space Penalty
Case 3: Line up yj with space. Scoring consequence? Space Penalty

[Figure: partial alignments of x and y illustrating each case, with positions i-1, i and j-1, j marked]

Ready? Fill in DP Matrix
Keep track of dependencies of scores (in a pointer matrix)

α = Match Reward
β = Mismatch Penalty
γ = Gap penalty

Initialization:
$S(0,0) = 0$, $S(i,0) = -i\gamma$, $S(0,j) = -j\gamma$

Recursion:

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i,y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

where $\sigma(x_i,y_j) = \alpha$ or $-\beta$

[Figure: (N+1) x (M+1) matrix from S(0,0) = 0 to S(N,M); cell S(i,j) depends on S(i-1,j-1) (diagonal, + σ(xi,yj)), S(i-1,j) (above, - γ) and S(i,j-1) (left, - γ)]

Fill in the DP matrix !!
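
The initialization and recursion above can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the function name `nw_fill` is made up, the scores are the +10 / -2 / -5 values used in this lecture's example (passed as signed numbers, so the gap term is added rather than subtracted), and the sequences CATTCAC and CTCGCAGC are the ones from the alignment shown earlier.

```python
def nw_fill(x, y, match=10, mismatch=-2, gap=-5):
    """Return the (N+1) x (M+1) global-alignment score matrix S for x and y."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: a prefix aligned against nothing costs one gap per residue.
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    # Recursion: fill row by row, taking the best of the three cases.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sub,  # xi aligns to yj
                S[i - 1][j] + gap,      # xi aligns to a gap
                S[i][j - 1] + gap,      # yj aligns to a gap
            )
    return S


S = nw_fill("CATTCAC", "CTCGCAGC")
for row in S:
    print(" ".join(f"{v:4d}" for v in row))
print("optimal global score S(N,M):", S[-1][-1])
```

Each cell is computed once from three neighbours, which is where the O(mn) running time comes from.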

         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5
    A  -10
    T  -15
    T  -20
    C  -25
    A  -30
    C  -35

+10 for match, -2 for mismatch, -5 for space

3- Calculate Score S(N,M) of Optimal Alignment - for Global Alignment

         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15   10    5    0   -5   -2   -7
    T  -20   -5   10   13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33

+10 for match, -2 for mismatch, -5 for space

4- Trace back through matrix to recover optimal alignment(s) that generated the optimal score

How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix

Result? Optimal alignment(s) of sequences

Traceback - for Global Alignment

Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence

Traceback to Recover Alignment

[Figure: the filled DP matrix from the previous slide, with traceback pointers (d = diagonal, v = vertical, h = horizontal) marking the paths from S(7,8) = 33 in the lower right corner back to S(0,0) = 0]

Can have >1 optimal alignment; this example has 2

What are the 2 Global Alignments with Optimal Score = 33?
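
One way to check a traceback by hand is to let a short program follow every move that reproduces the score of a cell. The sketch below is an illustration only: the function name `nw_all_alignments` is made up, the scores are the +10 / -2 / -5 values from the matrix above (as signed numbers), and instead of storing a pointer matrix it simply re-tests the three cases at each cell, which is an assumption of this sketch rather than the method on the slides.

```python
def nw_all_alignments(x, y, match=10, mismatch=-2, gap=-5):
    """Return (optimal score, list of all optimal global alignments) for x and y."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)

    alignments = []

    def walk(i, j, left, top):
        # left = aligned form of x (left sequence), top = aligned form of y (top sequence)
        if i == 0 and j == 0:
            alignments.append((left, top))
            return
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:      # diagonal move
                walk(i - 1, j - 1, x[i - 1] + left, y[j - 1] + top)
        if i > 0 and S[i][j] == S[i - 1][j] + gap:    # vertical move: gap in top sequence
            walk(i - 1, j, x[i - 1] + left, "-" + top)
        if j > 0 and S[i][j] == S[i][j - 1] + gap:    # horizontal move: gap in left sequence
            walk(i, j - 1, "-" + left, y[j - 1] + top)

    walk(n, m, "", "")
    return S[n][m], alignments


score, found = nw_all_alignments("CATTCAC", "CTCGCAGC")
print("score:", score)
for left, top in found:
    print(left)
    print(top)
    print()
```

Enumerating every co-optimal path is fine for short sequences like these; for long or repetitive sequences the number of paths can grow very quickly, so most tools report just one.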

1: CAT-TCA-C
   C-TCGCAGC

2: CATT-CA-C
   C-TCGCAGC

Local Alignment: Motivation

• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"

Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes
vs
Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example

G G T C T G A G
A A A C G A

Match: +2
Mismatch or space: -1

Best local alignment:
G G T C T G A G
A A A C - G A -
(aligned region: C T G A over C - G A, scoring 2 - 1 + 2 + 2)
Score = 5

Local Alignment: Algorithm

• S[i, j] = Score for optimally aligning a suffix of x[1..i] with a suffix of y[1..j]
• Initialize top row & leftmost column of matrix with "0"
• No score drops below 0: a cell that would be negative is set to 0, which starts a new alignment there

Recall: for Global Alignment,
• S[i, j] = Score for optimally aligning a prefix of X with a prefix of Y
• Initialize top row & leftmost column of matrix with gap penalties

Traceback - for Local Alignment

         λ    C    T    C    G    C    A    G    C
    λ    0    0    0    0    0    0    0    0    0
    C    0    1    0    1    0    1    0    0    1
    A    0    0    0    0    0    0    2    0    0
    T    0    0    1    0    0    0    0    1    0
    T    0    0    1    0    0    0    0    0    0
    C    0    1    0    2    0    1    0    0    1
    A    0    0    0    0    1    0    2    0    0
    C    0    1    0    1    0    2    0    1    1

+1 for a match, -1 for a mismatch, -5 for a space

What are the 4 Local Alignments with Optimal Score = 2?

x = CATTCAC (left sequence), y = CTCGCAGC (top sequence)
1: CA (x1-x2) aligned with CA (y5-y6)
2: TC (x4-x5) aligned with TC (y2-y3)
3: CA (x5-x6) aligned with CA (y5-y6)
4: TCAC (x4-x7) aligned with TCGC (y2-y5)

Some Results re: Alignment Algorithms
(for ComS, CprE & Math types!)

• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

Affine Gap Penalty Functions

Affine Gap Penalties = Differential Gap Penalties
used to reflect cost differences between opening a gap and extending an existing gap

Total Gap Penalty is a linear function of gap length:

$W = \gamma + \delta \times (k - 1)$

where
γ = gap opening penalty
δ = gap extension penalty
k = length of gap

Can also be solved in O(nm) time using DP

Sometimes, a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty

Methods

• √ Global and Local Alignment
• √ Alignment Algorithms
• √ Dot Matrix Method
• √ Dynamic Programming Method - cont
  • Gap penalties
  • DP for Global Alignment
  • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

"Scoring" or "Substitution" Matrices

2 Major types for Amino Acids: PAM & BLOSUM

PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in alignments of closely related proteins

BLOSUM = BLOck SUbstitution Matrix
based on % amino acid (aa) substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix

PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in closely related proteins

• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
  • PAM1 - for less divergent sequences (shorter time)
  • PAM250 - for more divergent sequences (longer time)

BLOSUM Matrix

BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: average % aa identity in the multiple sequence alignment (MSA) from which the matrix was generated
  • BLOSUM45 - for more divergent sequences
  • BLOSUM62 - for less divergent sequences

BLOSUM62 Substitution Matrix

s(a,b) corresponds to score of aligning character a with character b

Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)
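
To make s(a,b) concrete, here is a small Python sketch of a substitution-matrix lookup. It is an illustration under stated assumptions: the names `BLOSUM62_EXCERPT`, `s` and `score_ungapped` are made up, and the dictionary holds only a five-residue excerpt (A, I, L, V, W) quoted from the standard BLOSUM62 table, so it should be checked against a full copy of the matrix before being used for anything real.

```python
# Excerpt of BLOSUM62 for residues A, I, L, V, W only.
# The matrix is symmetric, so each unordered pair is stored once.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4, ("W", "W"): 11,
    ("A", "I"): -1, ("A", "L"): -1, ("A", "V"): 0, ("A", "W"): -3,
    ("I", "L"): 2, ("I", "V"): 3, ("I", "W"): -3,
    ("L", "V"): 1, ("L", "W"): -2,
    ("V", "W"): -3,
}


def s(a, b, matrix=BLOSUM62_EXCERPT):
    """Score for aligning residue a with residue b (order does not matter)."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError for residues outside the excerpt


def score_ungapped(seq1, seq2):
    """Sum s(a,b) over the columns of an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(s(a, b) for a, b in zip(seq1, seq2))


print(s("I", "V"))                     # 3: a conservative substitution scores positive
print(s("W", "W"))                     # 11: matching a rare residue scores high
print(score_ungapped("AILW", "AVLW"))  # 4 + 3 + 4 + 11 = 22
```

In a dynamic programming alignment of proteins, a lookup like s(a,b) takes the place of the single match reward and mismatch penalty used for the DNA examples.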