#6 - Scoring Matrices & Alignment Statistics 8/31/07

BCB 444/544
Lecture 6
Finish Dynamic Programming
Scoring Matrices
Alignment Statistics
#6_Aug31
BCB 444/544 F07 ISU Dobbs #6 - Scoring Matrices & Alignment Stats 8/31/07

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Wed Aug 29 - for Lecture #5: Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
  http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
Thurs Aug 30 - Lab #2: Databases, ISU Resources & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6: Scoring Matrices & Alignment Statistics
• Chp 3 - pp 41-49

Announcements
Fri Aug 31 - Revised notes for Lecture 5 posted online
  Changes? Mainly re-ordering, symbols, color "coding"
Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! - Enjoy!!
Tues Sept 4 - Lab #2 Exercise Writeup Due by 5 PM (or sooner!)
  Send via email to Pete Zaback petez@iastate.edu
  (HW#2 assignment will be posted online)
Fri Sept 14 - HW#2 Due by 5 PM (or sooner!)
Fri Sept 21 - Exam #1

Chp 3 - Sequence Alignment
SECTION II: SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
• √Evolutionary Basis
• √Sequence Similarity versus Sequence Identity
• √Sequence Homology versus Sequence Similarity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages

Methods
• √Global and Local Alignment
• √Alignment Algorithms
  • √Dot Matrix Method
  • Dynamic Programming Method - cont
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
    • PAM
    • BLOSUM
    • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA: 4 letter alphabet (+ gap)
    TTGACAC
    TTTACAC
• Proteins: 20 letter alphabet (+ gap)
    RKVA-GMA
    RKIAVAMA

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
    s--e-----qu---en--ce
    sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that rewards matches (+) and penalizes mismatches (-) and gaps (-)
Scoring Function (S):
    Match: α (e.g., 1)
    Mismatch: β (e.g., 1)
    Gap: γ (e.g., 0)
    S = α(#matches) - β(#mismatches) - γ(#gaps)
Note: I changed symbols & colors on this slide!

Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others (physicochemical properties are similar)
  e.g., Ser & Thr are more similar than Trp & Ala
• Substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)

Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length

Global vs Local Alignment - example
1 = CTGTCGCTGCACG
2 = TGCCGTG
Global alignment
    CTGTCGCTGCACG
    -TG-C-C-G--TG
Local alignment
    CTGTCGCTGCACG
    -TGCCG-TG----
  vs
    CTGTCGCTGCACG
    -TGCCG-T----G
Which is better?

Global vs Local Alignment: Which should be used when?
It is critical to choose correct method!
Global Alignment vs Local Alignment?
Shout out the answers!! Which should we use for?
1. Searching for conserved motifs in DNA or protein sequences?
2. Aligning two closely related sequences with similar lengths?
3. Aligning highly divergent sequences?
4. Generating an extended alignment of closely related sequences?
5. Generating an extended alignment of closely related sequences with very different lengths? Hmmm - we'll work on that
Excellent!

Alignment Algorithms
3 major methods for pairwise sequence alignment:
1. Dot matrix analysis
2. Dynamic programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Contiguous diagonal lines reveal alignment; "breaks" = gaps (indels)
[Figure: example dot plot with axis labels A C G C G A C A C G]

Interpretation of Dot Plots
When comparing 2 sequences:
• Diagonal lines of dots indicate regions of similarity between 2 sequences
• Reverse diagonals (perpendicular to diagonal) indicate inversions
• What do such patterns mean when comparing a sequence with itself (or its reverse complement)?
  • e.g.: Reverse diagonals crossing diagonals (X's) indicate palindromes
Exploring Dot Plots

Dynamic Programming
For Pairwise sequence alignment
Idea: Display one sequence above another with spaces inserted in both to reveal similarity
    C A T - T C A - C
    |   |     | |   |
    C - T C G C A G C

Global Alignment: Scoring
    CTGTCG-CTGCACG
    -TGC-CG-TG----
Reward for matches: α
Mismatch penalty: β
Space/gap penalty: γ
Score = αw - βx - γy
    w = #matches
    x = #mismatches
    y = #spaces
Note: I changed symbols & colors on this slide!

Global Alignment: Scoring
Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5
     C   T   G   T   C   G   -   C   T   G   C
     -   T   G   C   -   C   G   -   T   G   -
    -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total = 11
We could have done better!!
Note: I changed symbols & colors on this slide!
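
The column-by-column scoring above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: Python is an assumption, the name score_alignment and the use of "-" for a space are hypothetical choices, and the defaults are the signed example values from the scoring slide (+10, -2, -5) rather than the α, β, γ magnitudes.

```python
def score_alignment(top, bottom, match=10, mismatch=-2, gap=-5):
    """Add up per-column scores for two already-aligned, equal-length strings.

    '-' marks a space; match/mismatch/gap are signed per-column scores
    (illustrative defaults taken from the slide example).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # a space in either sequence
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


# The 11 columns scored on the slide:
print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))  # 11
# The alignment shown on the "Dynamic Programming" idea slide:
print(score_alignment("CAT-TCA-C", "C-TCGCAGC"))      # 33
```

A function like this only evaluates a given alignment; finding the best one is what the dynamic programming method is for.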
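
The dot matrix method described earlier (one sequence along the top, one down the side, a dot wherever the characters match) can be sketched as text output. This is only an illustration under assumptions: the function name dot_plot is hypothetical, and the two example sequences ACGCG and ACACG are one guessed reading of the axis labels on the dot plot slide.

```python
def dot_plot(row_seq, col_seq):
    """Return the plot as a list of text lines.

    row_seq runs along the top, col_seq down the left side;
    '*' marks an identical-character match, '.' marks no match.
    """
    lines = ["  " + " ".join(row_seq)]
    for c in col_seq:
        cells = ["*" if c == r else "." for r in row_seq]
        lines.append(c + " " + " ".join(cells))
    return lines


for line in dot_plot("ACGCG", "ACACG"):
    print(line)
```

Runs of '*' along a diagonal correspond to the "areas of match" in the bullets; for proteins a similarity score would normally replace the identity test.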
Alignment Algorithms
• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
  • Gap penalty functions
  • Scoring matrices

Dynamic Programming - Key Idea:
The score of the best possible alignment that ends at a given pair of positions (i, j) is equal to:
  the score of best alignment ending just previous to those two positions (i.e., ending at i-1, j-1)
  PLUS
  the score for aligning xi and yj

Global Alignment: DP Problem Formulation & Notations
Given two sequences (strings)
• X = x1x2…xN of length N (e.g., x = AGC, N = 3)
• Y = y1y2…yM of length M (e.g., y = AAAC, M = 4)
Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = Score of best alignment of x[1..i] = x1x2…xi with y[1..j] = y1y2…yj
Which means: Score of best alignment of a prefix of X and a prefix of Y
e.g., S(2,3) = score of best alignment of AG (x1x2) to AAA (y1y2y3)
[Figure: matrix with x1 x2 x3 along one axis and y1 y2 y3 y4 along the other]

Dynamic Programming - 4 Steps:
1. Define score of optimum alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
3. Calculate score of optimum alignment(s)
4. Trace back through matrix to recover optimum alignment(s) that generated optimal score

1- Define Score of Optimum Alignment using Recursion
Define:
    x1..i = Prefix of length i of x
    y1..j = Prefix of length j of y
    S(i, j) = Score of optimum alignment of x1..i and y1..j
Initial conditions:
\[ S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma \]
Recursive definition, for 1 ≤ i ≤ N, 1 ≤ j ≤ M:
\[ S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases} \]

2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix:
[Figure: (N+1) x (M+1) matrix indexed 0..N and 0..M, with S(0,0) = 0 in the first cell, S(N,M) in the last cell, and cell S(i,j) shown next to its neighbors S(i-1,j-1), S(i-1,j) and S(i,j-1)]

2- cont: Fill in DP Matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment including residues at [i,j]
• Keep track of dependencies of scores (in a pointer matrix)
• Initialization and recursion: as defined in step 1

3- Calculate Score S(N,M) of Optimum Alignment - for Global Alignment
What happens in last step in alignment of x[1..i] to y[1..j]?
1 of 3 cases applies:
• xi aligns to yj: x1 x2 ... xi-1 xi over y1 y2 ... yj-1 yj; score S(i-1,j-1) + σ(xi,yj)
• xi aligns to a gap: x1 x2 ... xi-1 xi over y1 y2 ... yj -; score S(i-1,j) - γ
• yj aligns to a gap: x1 x2 ... xi - over y1 y2 ... yj-1 yj; score S(i,j-1) - γ

Example
Case 1: Line up xi with yj
Case 2: Line up xi with space
Case 3: Line up yj with space
[Figure: each case shown on a partial alignment of x and y, with positions i-1, i and j-1, j marked]

Fill in the matrix
          λ    C    T    C    G    C    A    G    C
     λ    0   -5  -10  -15  -20  -25  -30  -35  -40
     C   -5   10    5
     A  -10
     T  -15
     T  -20
     C  -25
     A  -30
     C  -35
+10 for match, -2 for mismatch, -5 for space

Calculate score of optimum alignment
          λ    C    T    C    G    C    A    G    C
     λ    0   -5  -10  -15  -20  -25  -30  -35  -40
     C   -5   10    5    0   -5  -10  -15  -20  -25
     A  -10    5    8    3   -2   -7    0   -5  -10
     T  -15    0   15   10    5    0   -5   -2   -7
     T  -20   -5   10   13    8    3   -2   -7   -4
     C  -25  -10    5   20   15   18   13    8    3
     A  -30  -15    0   15   18   13   28   23   18
     C  -35  -20   -5   10   13   28   23   26   33
+10 for match, -2 for mismatch, -5 for space

4- Trace back through matrix to recover optimum alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order, starting from position with highest score and following path, position by position, back through matrix
Result? Optimal alignment(s) of sequences

Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of sequence alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence

Traceback to Recover Alignment
[Figure: the filled matrix above with the traceback arrows drawn; an asterisk (*) is placed in the second T row, between the cells holding 10 and 13]
* Can have >1 optimum alignment; this example has 2

Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
  vs
Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example
    g g t c t g a g
    a a a c g a
Match: +2
Mismatch or space: -1
Best local alignment:
    g g t c t g a g
    a a a c - g a -
Score = 5

Local Alignment: Algorithm
• S[i, j] = Score for optimally aligning a suffix of x[1..i] with a suffix of y[1..j]
• Initialize top row & leftmost column of matrix with "0"
Recall: for Global Alignment,
• S[i, j] = Score for optimally aligning a prefix of X with a prefix of Y
• Initialize top row & leftmost column of matrix with gap penalty

Traceback - for Local Alignment
          λ   C   T   C   G   C   A   G   C
     λ    0   0   0   0   0   0   0   0   0
     C    0   1   0   1   0   1   0   0   1
     A    0   0   0   0   0   0   2   0   0
     T    0   0   1   0   0   0   0   1   0
     T    0   0   1   0   0   0   0   0   0
     C    0   1   0   2   0   1   0   0   1
     A    0   0   0   0   1   0   2   0   0
     C    0   1   0   1   0   2   0   1   1
+1 for a match, -1 for a mismatch, -5 for a space

Some Results re: Alignment Algorithms (for ComS, CprE & Math types!)
• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in alignments of closely related proteins
BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)

BLOSUM Matrix
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences

Statistical Significance of Sequence Alignment

Affine Gap Penalty Functions
Gap penalty = h + gk, where
    k = length of gap
    h = gap opening penalty
    g = gap extension penalty
Can also be solved in O(nm) time using dynamic programming
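
The four DP steps for global alignment (define the recursion, initialize and fill the matrix, read off S(N,M), trace back) can be sketched as follows. This is a minimal illustration under assumptions rather than the course's own implementation: Python and the name needleman_wunsch are illustrative choices, gap is passed as a positive magnitude γ, defaults use the example scores (+10, -2, γ = 5), and ties in the traceback are broken in favor of the diagonal, so only one of possibly several optimal alignments is returned.

```python
def needleman_wunsch(x, y, match=10, mismatch=-2, gap=5):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_x, aligned_y); '-' marks a space.
    """
    n, m = len(x), len(y)
    # S[i][j] = best score for aligning prefix x[:i] with prefix y[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                          S[i - 1][j] - gap,        # x_i aligned to a space
                          S[i][j - 1] - gap)        # y_j aligned to a space
    # Traceback from S[n][m] to S[0][0], re-checking which case produced each cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] - gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("CATTCAC", "CTCGCAGC")
print(score)  # 33, the lower-right value of the example matrix
print(top)
print(bottom)
```

Re-deriving the move from the scores, as done here, avoids storing a separate pointer matrix; keeping explicit pointers is the alternative the slides mention and makes it easier to enumerate all optimal alignments.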
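
For local alignment, a hedged sketch of the matrix fill is given below. Beyond initializing the top row and leftmost column with 0, the standard Smith-Waterman formulation also floors every cell at 0, a detail the slides do not state explicitly; the function name smith_waterman_score and the Python setting are assumptions, and gap is again a positive magnitude.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, gap=5):
    """Fill the local-alignment matrix; return (best_score, matrix).

    S[i][j] = best score of aligning a suffix of x[:i] with a suffix of y[:j],
    never allowed to drop below 0.
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # top row and left column stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + sigma,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap)
            best = max(best, S[i][j])
    return best, S


# Example matrix sequences with +1 / -1 / -5 scoring:
best, S = smith_waterman_score("CATTCAC", "CTCGCAGC")
print(best)   # 2
# "Local Alignment: Example" sequences with match +2, mismatch or space -1:
print(smith_waterman_score("ggtctgag", "aaacga", match=2, mismatch=-1, gap=1)[0])  # 5
```

Traceback for the local case would start from a cell holding the best score and stop at the first 0, instead of running from the lower right corner to the upper left.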
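
The affine gap penalty h + gk can be illustrated with a small calculation. The values h = 10 and g = 1 below are hypothetical, chosen only to show the effect; the function name affine_gap_penalty is likewise an assumption.

```python
def affine_gap_penalty(k, h=10, g=1):
    """Penalty for one gap of length k: opening cost h plus g per gap position."""
    if k <= 0:
        return 0
    return h + g * k


# One gap of length 4 versus four separate gaps of length 1:
print(affine_gap_penalty(4))       # 14
print(4 * affine_gap_penalty(1))   # 44
```

With these example values a single long gap is much cheaper than several short ones, which is the behavior an opening penalty larger than the extension penalty is meant to produce.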