BCB 444/544 Lecture 5: Dynamic Programming (#5_Aug29)
BCB 444/544 F07 ISU Dobbs #5 - Dynamic Programming 8/29/07

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Wed Aug 29 - for Lecture #5: Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
Thurs Aug 30 - Lab #2: Databases, ISU Resources & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6: Scoring Matrices and Alignment Statistics
• Chp 3 - pp 41-49

Review: Chp 2 - Biological Databases
Xiong: Chp 2 Introduction to Biological Databases
• What is a Database?
• Types of Databases
• Biological Databases
• Pitfalls of Biological Databases
• Information Retrieval from Biological Databases

Types of Databases
3 major types of electronic databases:
1. Flat files - simple text files
  • no organization to facilitate retrieval
2. Relational - data organized as tables ("relations")
  • shared features among tables allow rapid search
3. Object-oriented - data organized as "objects"
  • objects associated hierarchically

Examples of Biological Databases
1- Primary
• DNA sequences
  • GenBank - USA
  • European Molecular Biology Lab - EMBL
  • DNA Data Bank of Japan - DDBJ
• Structures (Protein, DNA, RNA)
  • PDB - Protein Data Bank
  • NDB - Nucleic Acid Database

Examples of Biological Databases (continued)
2- Secondary
• Protein sequences
  • Swiss-Prot, TrEMBL, PIR
  • these were recently combined into UniProt
3- Specialized
• Species-specific (or "taxonomic" specific)
  • FlyBase, WormBase, AceDB, PlantDB
• Molecule-specific, disease-specific
See: http://www.oxfordjournals.org/nar/database/c/

SUMMARY: #2- Biological Databases
BEWARE!
Who was that Icelandic fellow?

Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
• Evolutionary Basis
• Sequence Homology versus Sequence Similarity
• Sequence Similarity versus Sequence Identity
• Methods
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Motivation for Sequence Alignment
"Sequence comparison lies at the heart of bioinformatics analysis." - Jin Xiong
Sequence comparison is important for drawing functional & evolutionary inferences about new genes/proteins
Pairwise sequence alignment is fundamental; it is used to:
• Search for common patterns of characters
• Establish pairwise correspondence between related sequences
Pairwise sequence alignment is the basis for:
• Database searching (e.g., BLAST)
• Multiple sequence alignment (MSA)

Homology
Homology has a very specific meaning in evolutionary & computational biology - & the term is often used incorrectly
For us: Homology = similarity due to descent from a common evolutionary ancestor
But, HOMOLOGY ≠ SIMILARITY
When 2 sequences share a sufficiently high degree of sequence similarity (or identity), we may infer that they are homologous
We can infer homology from similarity (can't prove it!)

Orthologs vs Paralogs
2 types of homologous sequences:
• Orthologs - "same genes" in different species
  • result of common ancestry
  • corresponding proteins have "same" functions (e.g., human α-globin & mouse α-globin)
• Paralogs - "similar genes" within a species
  • result of gene duplication events
  • proteins may (or may not) have similar functions (e.g., human α-globin & human β-globin)
[Figure: gene tree. A is the parent gene; speciation leads to B & C; duplication leads to C'.]
• B and C are orthologous
• C and C' are paralogous

Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages

Sequence Similarity vs Identity
For nucleotide sequences (DNA & RNA), sequence similarity and identity have the "same" meaning:
• Two DNA sequences can share a high degree of sequence identity (or similarity) - the two terms mean the same thing
• Drena's opinion: Always use "identity" when making quantitative comparisons re: DNA or RNA sequences (to avoid confusion!)
For protein sequences, sequence similarity and identity have different meanings:
• Identity = % of exact matches between two aligned sequences
• Similarity = % of aligned residues that share similar characteristics (e.g., physicochemical characteristics, structural propensities, evolutionary profiles)
• Drena's opinion: Always use "identity" when making quantitative comparisons re: protein sequences (to avoid confusion!)

What is Sequence Alignment?
Given 2 sequences of letters, and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of the other sequence.
Align:
  1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  2: THIS IS A SHORT SENTENCE.

  1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  2: THIS IS A ######SHORT###SENTENCE##############.
OR
  1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  2: THIS IS A ##SHORT###SENT#EN###CE##############.
Is one of these alignments "optimal"? Which is better?
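
The identity definition above ("% of exact matches between two aligned sequences") can be sketched in a few lines. This is only an illustration: the function name `percent_identity` is hypothetical, and dividing by the number of alignment columns is an assumption, since the slides do not say which denominator to use (other conventions divide by the shorter sequence or by the non-gap columns).

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent of alignment columns that hold the same residue in both rows.

    Assumption: the denominator is the total number of alignment columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both rows hold the same non-gap character.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * matches / len(row_a)


if __name__ == "__main__":
    # Made-up aligned rows, for illustration only.
    print(percent_identity("ACGT-ACGT", "ACGTTACCT"))
```

Whatever denominator is chosen, it should be reported alongside the percentage, because the same alignment can yield different identity values under different conventions.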

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA: 4-letter alphabet (+ gap)
    TTGACAC
    TTTACAC
• Proteins: 20-letter alphabet (+ gap)
    RKVA-GMA
    RKIAVAMA

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Types of Sequence Variation
• Sequences can diverge from a common ancestor through various types of mutations:
  • Substitutions: ACGA → AGGA
  • Insertions: ACGA → ACCGA
  • Deletions: ACGA → AGA
• Insertions or deletions ("indels") result in gaps in alignments
• Substitutions result in mismatches
• No change? Match

Gaps
Indels of various sizes can occur in one sequence relative to the other, e.g., corresponding to a shortening of the polypeptide chain in a protein

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
    s--e-----qu---en--ce
    sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that accounts for mismatches and gaps
Scoring Function (F):
                 score    e.g.
    Match:        + w      +1
    Mismatch:     - x       0
    Gap:          - y      -1
F = w(#matches) - x(#mismatches) - y(#gaps)
(in the example, w = 1, x = 0, y = 1)

Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others (physicochemical properties are similar), e.g., Ser & Thr are more similar than Trp & Ala
• A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to the score of aligning character a with character b
Match scores are often calculated based on the frequency of mutations in very similar sequences (more details later)

Methods
• Global and Local Alignment
• Alignment Algorithms
• Dot Matrix Method
• Dynamic Programming Method
  • Gap penalties
  • DP for Global Alignment
  • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length

Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG

    Global alignment        Local alignment
    CTGTCGCTGCACG           CTGTCGCTGCACG
    -TGCCG-TG----           -TG-C-C-G--TG

    CTGTCGCTGCACG
    -TGCCG-T----G

Which is better?
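
To make the scoring function F from the "Avoiding Random Alignments" slide concrete, here is a minimal sketch that scores an alignment column by column. The function name `alignment_score` and its parameter names are hypothetical. The defaults are the slide's example values, passed as signed scores (+1 per match, 0 per mismatch, -1 per gap).

```python
def alignment_score(row_a: str, row_b: str, match: int = 1, mismatch: int = 0,
                    gap: int = -1, gap_char: str = "-") -> int:
    """Sum signed per-column scores over two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == gap_char or b == gap_char:
            total += gap        # a residue paired with a space
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


if __name__ == "__main__":
    # The gap-heavy "nonsense" alignment from the slide: 8 matches, 12 gaps.
    print(alignment_score("s--e-----qu---en--ce", "sometimesquipsentice"))
```

With these example values the gap-heavy alignment has 8 matches and 12 gap columns, so it scores 8 - 12 = -4. A penalized gap term is what keeps such alignments from looking good.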

Global vs Local Alignment
Which should be used when?
Both are important, but it is critical to use the right method for a given task!
Global alignment:
• Good for: aligning closely related sequences of similar length
• Not good for: divergent sequences or sequences with different lengths
Local alignment:
• Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences
• Not good for: generating an alignment of closely related sequences
Global and local alignments are fundamentally similar; they differ only in the optimization strategy used to align similar residues

Alignment Algorithms
3 major methods for pairwise sequence alignment:
1. Dot matrix analysis
2. Dynamic programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
• Place 1 sequence along the top row of the matrix
• Place the 2nd sequence along the left column of the matrix
• Plot a dot each time there is a match between an element of the row sequence and an element of the column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Contiguous diagonal lines reveal alignment; "breaks" = gaps (indels)
[Figure: example dot plot; axis labels A C G C G and A C A C G]

Interpretation of Dot Plots
When comparing 2 sequences:
• Diagonal lines of dots indicate regions of similarity between 2 sequences
• Reverse diagonals (perpendicular to diagonal) indicate inversions
• What do similar patterns mean when comparing a sequence with itself (reverse complement)?
  • e.g.: Reverse diagonals crossing diagonals (X's) indicate palindromes
Exploring Dot Plots

Dot Matrix Variations
Compare 2 sequences
• Identify matching regions
  • Identities for DNA seqs
  • Similarities for protein seqs
Compare sequence with itself
• Identify repeated regions
• Identify inverted repeats
• Identify palindromes
For long sequences?
• Too many dots! Noisy!
• Instead of one dot per "residue," plot one dot per "window" of n matching residues to reduce noise

Strengths & Weaknesses of Dot Plots
Strengths:
• Fast and easy
• Allows direct visual identification of regions of similarity
• Repeats, inversions, etc. are readily apparent
• Displays all possible matches
Weaknesses:
• Doesn't generate full alignment - user must "connect the diagonals"
• No statistical assessment of quality of alignment (score)
• Impractical and noisy for long sequences
• Difficult to scale up to multiple alignment

Dynamic Programming
For pairwise sequence alignment
Idea: Display one sequence above another with spaces inserted in both to reveal similarity
    A: C A T - T C A - C
       |   |     | |   |
    B: C - T C G C A G C

Global alignment: Scoring
    CTGTCGCTGCACG
    -TGC-CG-TG---
Score = (reward for matches × w) - (mismatch penalty × x) - (space/gap penalty × y)
where w = #matches, x = #mismatches, y = #spaces

Global alignment: Scoring (example)
Reward for matches: 10
Mismatch penalty: 2
Space/gap penalty: 5
     C    T    G    T    C    G    -    C    T    G    C
     -    T    G    C    -    C    G    -    T    G    -
    -5   10   10   -2   -5   -2   -5   -5   10   10   -5
Total = 11

Optimum Alignment
• Score of an alignment is a measure of its quality
• Optimum alignment problem: Given a pair of sequences X and Y, find an alignment (global or local) with maximum score

Alignment algorithms
• Global: Needleman-Wunsch
• Local: Smith-Waterman
• Both NW and SW use dynamic programming
• Variations:
  • Gap penalty functions
  • Scoring matrices

Dynamic Programming (DP)
• As a computer science concept - formalized in the early 1950's by Bellman at RAND Corporation
"Frequently, however, there are only a polynomial number of subproblems… If we keep track of the solution to each subproblem solved, and simply look up the answer when needed, we obtain a polynomial-time algorithm."
- Aho, Hopcroft, Ullman
• Reported to biologists for sequence alignment problems by Needleman & Wunsch, 1970

Key Idea
The score of the best possible alignment that ends at a given pair of positions (i,j) in two sequences is the score of the best alignment previous to those two positions PLUS the score for aligning those two positions
Next best alignment = previous best + local best

Problem Formulation and Notations
Given two sequences (strings)
• X = x1x2…xN of length N (e.g., x = AGC, N = 3)
• Y = y1y2…yM of length M (e.g., y = AAAC, M = 4)
Construct a matrix with (N+1) x (M+1) elements, where
S(i,j) = score of best alignment of x[1..i] = x1x2…xi with y[1..j] = y1y2…yj
Example: S(2,3) = score of best alignment of AG (x1x2) to AAA (y1y2y3)
[Figure: matrix with x1 x2 x3 along one axis and y1 y2 y3 y4 along the other]

Dynamic Programming
4 Components:
1. Recursive definition for optimal score
2. Matrix for storing optimal scores of subproblems
3. Bottom-up approach for filling the matrix, by solving smallest subproblems first
4. Traceback of path through matrix to recover the optimal alignment(s) that gave the optimal score

Global Alignment: Algorithm
• x[1..i] = prefix of length i of x
• y[1..j] = prefix of length j of y
• S(i,j) = score of optimal alignment of x[1..i] and y[1..j]

Calculating Score of Optimum Alignment
S(i,j) satisfies the following relationships, where $\sigma(x_i, y_j)$ is the score for aligning $x_i$ with $y_j$ (match reward or mismatch penalty) and $\gamma > 0$ is the penalty per space:
Initial conditions:
$$S(0,0) = 0, \qquad S(i,0) = -i\gamma, \qquad S(0,j) = -j\gamma$$
Recursive definition: for $1 \le i \le N$, $1 \le j \le M$:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

Computing the best current score
[Figure: (N+1) x (M+1) matrix with S(0,0) = 0 in one corner and S(N,M) in the opposite corner; each S(i,j) is computed from its three neighbors S(i-1,j-1), S(i-1,j), and S(i,j-1), using the initialization and recursion given above.]

What happens at the last step in the alignment of x[1..i] to y[1..j]?
1 of 3 cases:
1. $x_i$ aligns to $y_j$: score $S(i-1,j-1) + \sigma(x_i, y_j)$
    x1 x2 . . . xi-1   xi
    y1 y2 . . . yj-1   yj
2. $x_i$ aligns to a gap: score $S(i-1,j) - \gamma$
    x1 x2 . . . xi-1   xi
    y1 y2 . . . yj     -
3. $y_j$ aligns to a gap: score $S(i,j-1) - \gamma$
    x1 x2 . . . xi     -
    y1 y2 . . . yj-1   yj

DP Implementation - 3 steps:
1. Construct a sequence vs sequence matrix and fill in, from [0,0] to [N,M], the best possible scores for alignments including the residues at [i,j]. Also, keep track of dependencies of scores (in a pointer matrix).
2. For a global alignment of the sequences, find the score S(N,M).
3. Trace back through the pointer matrix to get the optimal alignment. Do this position by position to retrieve the alignment of all residues of the sequences, including gaps (i.e., repeat the alignment calculations in reverse order, following the path back through the matrix, starting from the position that holds the final score, which is S(N,M) for a global alignment).
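
The three implementation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `needleman_wunsch` is hypothetical, the defaults reuse the 10 / 2 / 5 scoring example (written as signed scores +10, -2, -5), and instead of storing a separate pointer matrix the traceback re-checks which neighbor produced each score. When several moves tie, this sketch prefers the diagonal, so it returns only one of the optimal alignments.

```python
def needleman_wunsch(x: str, y: str, match: int = 10, mismatch: int = -2,
                     gap: int = -5):
    """Global alignment with a linear gap score; returns (score, row_x, row_y)."""
    n, m = len(x), len(y)
    # Step 1: S[i][j] = best score for aligning x[:i] with y[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # x_i aligned with y_j
                          S[i - 1][j] + gap,       # x_i aligned with a space
                          S[i][j - 1] + gap)       # y_j aligned with a space
    # Steps 2 and 3: the global score is S[n][m]; trace back from there.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                row_x.append(x[i - 1])
                row_y.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("CATTCAC", "CTCGCAGC")
    print(score)
    print(top)
    print(bottom)
```

Filling the matrix takes time and memory proportional to N x M. Recording a pointer (or a set of pointers) per cell, as the slide suggests, would make it possible to enumerate every optimal alignment rather than just one.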

Example
Case 1: Line up xi with yj (x[1..i-1] is already aligned with y[1..j-1])
Case 2: Line up xi with a space (x[1..i-1] is already aligned with y[1..j])
Case 3: Line up yj with a space (x[1..i] is already aligned with y[1..j-1])
[Figure: the three cases illustrated on partial alignments of x and y, with positions i-1, i and j-1, j marked]

Example: initializing the matrix
+10 for match, -2 for mismatch, -5 for space
         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5
    A  -10
    T  -15
    T  -20
    C  -25
    A  -30
    C  -35

Example: completed matrix
         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15  *10    5    0   -5   -2   -7
    T  -20   -5   10  *13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33
Traceback can yield both optimal alignments: the cell holding 8 to the right of *13 is reached with the same score diagonally from *10 (10 - 2) and horizontally from *13 (13 - 5), so the traceback path from the final score of 33 branches there.

Affine Gap Penalty Functions
Gap penalty = h + gk
where
  k = length of gap
  h = gap opening penalty
  g = gap continuation penalty
Can also be solved in O(NM) time using dynamic programming
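
The slides do not show how the affine penalty h + gk fits into the recurrence. One standard formulation, commonly attributed to Gotoh, keeps three score matrices instead of one, so that opening a gap and extending a gap can be charged differently while the work stays proportional to N x M. The sketch below is an assumption-laden illustration of that idea, not the course's implementation. The function name `affine_global_score` and the matrix names `M`, `X`, `Y` are hypothetical, the match and mismatch defaults reuse the +10 / -2 example, and only the optimal score is returned (no traceback).

```python
NEG_INF = float("-inf")


def affine_global_score(x: str, y: str, match: int = 10, mismatch: int = -2,
                        h: int = 0, g: int = 5):
    """Optimal global alignment score when a gap of length k costs h + g*k."""
    n, m = len(x), len(y)
    # M: alignments ending with x_i paired to y_j
    # X: alignments ending with x_i paired to a space
    # Y: alignments ending with y_j paired to a space
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(h + g * i)   # one leading gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -(h + g * j)   # one leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = sub + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - h - g,   # open a new gap
                          X[i - 1][j] - g,       # extend the current gap
                          Y[i - 1][j] - h - g)   # switch gap direction
            Y[i][j] = max(M[i][j - 1] - h - g,
                          Y[i][j - 1] - g,
                          X[i][j - 1] - h - g)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # With h = 0 the affine penalty reduces to a linear penalty of g per space.
    print(affine_global_score("CATTCAC", "CTCGCAGC", h=0, g=5))
```

Setting h = 0 is a convenient sanity check, because the result should then agree with the single-matrix global alignment score for the same sequences. A positive h makes several short gaps more expensive than one long gap of the same total length.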