#6 - Scoring Matrices & Alignment Statistics 8/31/07
BCB 444/544 F07 ISU Dobbs #6 - Scoring Matrices & Alignment Stats 8/31/07

BCB 444/544
Lecture 6
Finish Dynamic Programming
Scoring Matrices
Alignment Statistics
#6_Aug31

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4 Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Wed Aug 29 - for Lecture #5 Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
  http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
Thurs Aug 30 - Lab #2: Databases, ISU Resources & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6 Scoring Matrices & Alignment Statistics
• Chp 3 - pp 41-49

Announcements
Fri Aug 31 - Revised notes for Lecture 5 posted online
  Changes? mainly re-ordering, symbols, color "coding"
Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! - Enjoy!!
Tues Sept 4 - Lab #2 Exercise Writeup Due by 5 PM (or sooner!)
  Send via email to Pete Zaback petez@iastate.edu
  (HW#2 assignment will be posted online)
Fri Sept 14 - HW#2 Due by 5 PM (or sooner!)
Fri Sept 21 - Exam #1

Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
• √Evolutionary Basis
• √Sequence Homology versus Sequence Similarity
• √Sequence Similarity versus Sequence Identity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Methods
• √Global and Local Alignment
• √Alignment Algorithms
  • √Dot Matrix Method
  • Dynamic Programming Method - cont
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA: 4 letter alphabet (+ gap)
    TTGACAC
    TTTACAC
• Proteins: 20 letter alphabet (+ gap)
    RKVA-GMA
    RKIAVAMA

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
    s--e-----qu---en--ce
    sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that rewards matches (+) and penalizes mismatches (-) and gaps (-)
Scoring Function (S):
    S = α(#matches) - β(#mismatches) - γ(#gaps)
    e.g., Match: α = 1, Mismatch: β = 1, Gap: γ = 0
Note: I changed symbols & colors on this slide!

Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others (physicochemical properties are similar)
  e.g., Ser & Thr are more similar than Trp & Ala
• Substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)

Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length

Global vs Local Alignment - example
1 = CTGTCGCTGCACG
2 = TGCCGTG
Global alignment
    CTGTCGCTGCACG
    -TG-C-C-G--TG
Local alignment
    CTGTCGCTGCACG
    -TGCCG-TG----
  vs
    CTGTCGCTGCACG
    -TGCCG-T----G
Which is better?

Global vs Local Alignment
Which should be used when?
It is critical to choose correct method!
Global Alignment vs Local Alignment? Shout out the answers!!
Which should we use for?
1. Searching for conserved motifs in DNA or protein sequences?
2. Aligning two closely related sequences with similar lengths?
3. Aligning highly divergent sequences?
4. Generating an extended alignment of closely related sequences?
5. Generating an extended alignment of closely related sequences with very different lengths? Hmmm - we'll work on that
Excellent!

Alignment Algorithms
3 major methods for pairwise sequence alignment:
1. Dot matrix analysis
2. Dynamic programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Contiguous diagonal lines reveal alignment; "breaks" = gaps (indels)
[Figure: dot plot; axis residues A C G C G and A C A C G]

Interpretation of Dot Plots
When comparing 2 sequences:
• Diagonal lines of dots indicate regions of similarity between 2 sequences
• Reverse diagonals (perpendicular to diagonal) indicate inversions
• What do such patterns mean when comparing a sequence with itself (or its reverse complement)?
  • e.g.: Reverse diagonals crossing diagonals (X's) indicate palindromes
Exploring Dot Plots

Dynamic Programming
For Pairwise sequence alignment
Idea: Display one sequence above another with spaces inserted in both to reveal similarity
    C A T - T C A - C
    |   |     | |   |
    C - T C G C A G C

Global Alignment: Scoring
    CTGTCG-CTGCACG
    -TGC-CG-TG---
Reward for matches: α
Mismatch penalty: β
Space/gap penalty: γ
Score = αw - βx - γy
    w = #matches
    x = #mismatches
    y = #spaces
Note: I changed symbols & colors on this slide!

Global Alignment: Scoring
Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5
     C   T   G   T   C   G   -   C   T   G   C
     -   T   G   C   -   C   G   -   T   G   -
    -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total = 11
We could have done better!!
Note: I changed symbols & colors on this slide!
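
A minimal Python sketch of the scoring rule Score = αw - βx - γy, applied to the 11-column alignment scored above. The function name `score_alignment` and the use of "-" as the gap character are assumptions made for illustration; the mismatch and gap penalties are passed as positive numbers and subtracted, matching the formula.

```python
def score_alignment(top, bottom, alpha, beta, gamma):
    """Score a finished alignment column by column.

    alpha = reward per match, beta = penalty per mismatch,
    gamma = penalty per space (gap character "-").
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = spaces = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            spaces += 1          # a residue aligned to a space
        elif a == b:
            matches += 1         # identical residues
        else:
            mismatches += 1      # two different residues
    return alpha * matches - beta * mismatches - gamma * spaces


# The alignment from the scoring slide: 4 matches, 2 mismatches, 5 spaces.
print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-", alpha=10, beta=2, gamma=5))  # 11
```

With the earlier example values α = 1, β = 1, γ = 0, the gap-riddled "s--e-----qu---en--ce" alignment scores 8, because its 12 spaces cost nothing. This is why a nonzero gap penalty is needed to discourage nonsense alignments.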

Alignment Algorithms
• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
  • Gap penalty functions
  • Scoring matrices

Dynamic Programming - Key Idea:
The score of the best possible alignment that ends at a given pair of positions (i, j) is equal to:
  the score of best alignment ending just previous to those two positions (i.e., ending at i-1, j-1)
  PLUS
  the score for aligning xi and yj

Global Alignment: DP Problem Formulation & Notations
Given two sequences (strings)
• X = x1x2…xN of length N      x = AGC, N = 3
• Y = y1y2…yM of length M      y = AAAC, M = 4
Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = Score of best alignment of x[1..i] = x1x2…xi with y[1..j] = y1y2…yj
Which means: Score of best alignment of a prefix of X and a prefix of Y
[Figure: matrix with x1 x2 x3 along one axis and y1 y2 y3 y4 along the other]
S(2,3) = score of best alignment of AG (x1x2) to AAA (y1y2y3)

Dynamic Programming - 4 Steps:
1. Define score of optimum alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
3. Calculate score of optimum alignment(s)
4. Trace back through matrix to recover optimum alignment(s) that generated optimal score

1- Define Score of Optimum Alignment using Recursion
Define:
  x1..i = Prefix of length i of x
  y1..j = Prefix of length j of y
  S(i, j) = Score of optimum alignment of x1..i and y1..j
Initial conditions:
$$S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma$$
Recursive definition: for $1 \le i \le N$, $1 \le j \le M$:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix:
[Figure: matrix indexed 0..N by 0..M with S(0,0) = 0 in one corner and S(N,M) in the opposite corner; cell S(i,j) is computed from S(i-1,j-1), S(i-1,j) and S(i,j-1)]
Initialization and recursion as defined in step 1.

2- cont: Fill in DP Matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment including residues at [i,j]
• Keep track of dependencies of scores (in a pointer matrix).

3- Calculate Score S(N,M) of Optimum Alignment - for Global Alignment
What happens in last step in alignment of x[1..i] to y[1..j]?
1 of 3 cases applies:
  xi aligns to yj        xi aligns to a gap     yj aligns to a gap
  x1 x2 . . . xi-1 xi    x1 x2 . . . xi-1 xi    x1 x2 . . . xi   -
  y1 y2 . . . yj-1 yj    y1 y2 . . . yj   -     y1 y2 . . . yj-1 yj
  S(i-1,j-1) + σ(xi,yj)  S(i-1,j) - γ           S(i,j-1) - γ

Example
Case 1: Line up xi with yj
Case 2: Line up xi with space
Case 3: Line up yj with space
[Figure: each case shown as the final column appended to the partial alignment x: C A T T C A / y: C - T T C A]

Fill in the matrix
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5
   A  -10
   T  -15
   T  -20
   C  -25
   A  -30
   C  -35
+10 for match, -2 for mismatch, -5 for space

Calculate score of optimum alignment
        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33
+10 for match, -2 for mismatch, -5 for space

4- Trace back through matrix to recover optimum alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order, starting from position with highest score and following path, position by position, back through matrix
Result? Optimal alignment(s) of sequences

Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of sequence alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence

Traceback to Recover Alignment
[Figure: the filled matrix shown under "Calculate score of optimum alignment", with the traceback path marked; * marks the cell with score 10 in the second T row]
* Can have >1 optimum alignment; this example has 2

Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
  vs
Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example
    g g t c t g a g
    a a a c g a
Match: +2
Mismatch or space: -1
Best local alignment:
    g g t c t g a g
    a a a c - g a -
  (aligned region: c t g a over c - g a)
Score = 5

Local Alignment: Algorithm
• S[i, j] = Score for optimally aligning a suffix of x[1..i] with a suffix of y[1..j]
• Initialize top row & leftmost column of matrix with "0"
Recall: for Global Alignment,
• S[i, j] = Score for optimally aligning a prefix of X with a prefix of Y
• Initialize top row & leftmost column of matrix with gap penalty

Traceback - for Local Alignment
        λ  C  T  C  G  C  A  G  C
   λ    0  0  0  0  0  0  0  0  0
   C    0  1  0  1  0  1  0  0  1
   A    0  0  0  0  0  0  2  0  0
   T    0  0  1  0  0  0  0  1  0
   T    0  0  1  0  0  0  0  0  0
   C    0  1  0  2  0  1  0  0  1
   A    0  0  0  0  1  0  2  0  0
   C    0  1  0  1  0  2  0  1  1
+1 for a match, -1 for a mismatch, -5 for a space

Some Results re: Alignment Algorithms
(for ComS, CprE & Math types!)
• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix
PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in closely related proteins
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
  • PAM1 - for less divergent sequences (shorter time)
  • PAM250 - for more divergent sequences (longer time)

BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
  • BLOSUM45 - for more divergent sequences
  • BLOSUM62 - for less divergent sequences

Statistical Significance of Sequence Alignment

Affine Gap Penalty Functions
Gap penalty = h + gk
  where
  k = length of gap
  h = gap opening penalty
  g = gap extension penalty
Can also be solved in O(nm) time using dynamic programming
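
The recursion, traceback, and local-alignment variant from this lecture, together with the affine penalty formula, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the course's reference implementation: the names `align` and `affine_gap_penalty` are hypothetical, mismatch and gap penalties are passed as positive numbers, traceback ties are broken in favor of the diagonal move (so only one of possibly several optimal alignments is returned), and the DP uses the linear gap penalty γ from the worked examples rather than the affine penalty h + gk.

```python
def affine_gap_penalty(k, h, g):
    """Penalty for one gap of length k: opening cost h plus g per position."""
    return h + g * k


def align(x, y, match, mismatch, gap, local=False):
    """Return (score, aligned_x, aligned_y) using a linear gap penalty.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global: first row and column hold accumulated gap penalties.
        for i in range(1, n + 1):
            S[i][0] = -i * gap
        for j in range(1, m + 1):
            S[0][j] = -j * gap

    def sigma(a, b):
        return match if a == b else -mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = max(S[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]),
                    S[i - 1][j] - gap,
                    S[i][j - 1] - gap)
            if local:
                s = max(s, 0)            # local scores never drop below 0
            S[i][j] = s
            if s > best:
                best, best_pos = s, (i, j)

    # Global traceback starts at S(N,M); local starts at the highest score.
    i, j = best_pos if local else (n, m)
    score = S[i][j]
    ax, ay = [], []
    while (i > 0 or j > 0) and not (local and S[i][j] == 0):
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1   # x residue vs space
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1   # y residue vs space
    return score, "".join(reversed(ax)), "".join(reversed(ay))


print(align("CATTCAC", "CTCGCAGC", match=10, mismatch=2, gap=5))
print(align("ggtctgag", "aaacga", match=2, mismatch=1, gap=1, local=True))
print(affine_gap_penalty(3, h=10, g=1))  # h and g values chosen only for illustration
```

On the lecture's global example (+10 match, -2 mismatch, -5 space) this gives score 33 with the alignment CAT-TCA-C over C-TCGCAGC. On the local example (+2 match, -1 mismatch or space) it gives score 5 for ctga over c-ga. Supporting affine gaps inside the DP would require tracking whether a gap is being opened or extended, which this sketch does not attempt.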