Introduction to Computational Biology A. Shehu – CS444 (001) Pairwise Sequence Alignment bioalgorithms.info, cnx.org, and instructors across the country February 3, 2011 Introduction to Computational Biology A. Shehu – CS444 (001) Challenges in Computational Biology Some more challenges involve characterizing structure, dynamics, and function in RNA and proteins Goal for today: 2. Sequence Alignment Global vs. Local © Manolis Kellis A. Shehu – CS444 (001) Introduction to Computational Biology Genomic Scales Demand Efficiency Organism Phi-X 174 Epstein-Barr virus Base pairs Nr. Genes Description 5,386 172,282 10 80 Treponema pallidium 1,138,011 1,039 Syphilis-causing bacterium Heliobacter pylori 1,667,970 1,589 Main cause of ulcers Haemophilus influenza 1,830,138 1,738 Causes middle ear infections E. Coli 4,639,221 4,377 Best-studied bacterium 12,495,682 5,770 Budding yeast Caenorhabditis elegans 100,258,171 19,000 Well-studied worm Arabidopsis thaliana 115,409,949 25,498 Well-studied flowering plant Drosophila melanogaster 122,653,977 13,379 Fruit fly Homo sapiens 3,000,000,000 22,000 Us, completed in 2004 Rice 4,300,000,000 60,000 Repetitive, GC gradient Saccharomyces cerevisiae Virus of E. Coli Causes mononucleosis Importance of algorithm efficiency: comparing mouse to human in blocks of 1,000 bps 3,000,000 x 3,000,000 comparisons → 104 years assuming 1 trillion ops/sec Introduction to Computational Biology A. Shehu – CS444 (001) Sequence Alignment ▆ Global Alignment • LCS • Needleman-Wunsch ▆ Local Alignment • Smith-Waterman ▆ Scoring Matrices • Alignment with Affine Gap Penalties ▆ Alignment in BLAST ▆ Alignment of Multiple Sequences Introduction to Computational Biology A. Shehu – CS444 (001) The Basis for Sequence Alignment ▆ All living organisms are related through evolution ▆ This means that there is a common ancestor ▆ Moreover, there are similarities, of varying degrees, between organisms that descend from a common ancestor ▆ Evolution involves: • Inheritance: passing characteristics from parent to offspring • Variation: differentiation between parent and offspring • Selection: favoring of some organisms over others Introduction to Computational Biology A. Shehu – CS444 (001) Evolutionary Basis for Sequence Variations ▆ Mutations and selection over billions of years can result in considerable divergence between related organisms ▆ The base pair composition of related sequences can change due to: • Mutations (substitutions) • Insertions/deletions (which affect sequence lengths) Introduction to Computational Biology A. Shehu – CS444 (001) What is Alignment? ▆ ▆ ▆ We align two sequences to find whether there are common substrings in them The degree to which the substrings “match” determines the similarity of the original sequences Optimal alignment: alignment that maximizes similarity Introduction to Computational Biology A. Shehu – CS444 (001) Sequence Similarity vs. Homology: Why do we Align? ▆ 1. Alignment can reveal whether two sequences are homologous or not ▆ Homologous sequences are those that descend from a common ancestor • The sequences share a common ancestral sequence ▆ Note: • Similarity is a descriptive term for degree of matching between two sequences • Similarity does not automatically imply homology • Sequences can be highly similar but not homologous (convergent evolution) ▆ 2. Alignment can be used to denote a common protein function ▆ Note: • Sequence similarity does not always imply functional similarity • Conserved function does not always imply sequence similarity Introduction to Computational Biology A. Shehu – CS444 (001) Why Align Protein over DNA Sequences? ▆ Easier to detect homology when comparing amino-acid sequences rather than nucleic acid sequences • Why? • Hint 1: There are only four bases in DNA/RNA (A, T/U, C, G) • Hint 2: There are 20 (classic) amino acids in proteins • The probability of a match by chance in DNA is ... than in a protein chain • Another reason: genetic code is redundant • What does this mean? ▆ A strong reason: • Structure and function of a protein is determined by its amino-acid sequence • Function conservation restricts changes that can happen to amino-acid sequence ▆ Our next goal: • Global Alignment (Needleman-Wunsch) vs. Local Alignment (Smith-Waterman) Introduction to Computational Biology A. Shehu – CS444 (001) Global Alignment: Needleman-Wunsch ▆ The problem of global alignment when (1) matching characters exactly and (2) not penalizing for gaps (insertions or deletions) is a classic problem in computer science known as • Longest Common Subsequence (LCS) • Often presented to illustrate Dynamic Programming • Short Detour: LCS lecture (Shehu_LCS.pdf) ▆ Reason to allow “soft” matching rather than exact matching: • Amino acids with similar chemical properties and size can often replace one another with little effect to both structure and function • Substitution matrices compiled over functionally-related proteins dictate penalty for soft matches or substitutions ▆ We also want to penalize gaps (indels): • Needleman-Wunsch algorithm is the general version of LCS that takes into account both substitutions and gap penalties Introduction to Computational Biology A. Shehu – CS444 (001) Global Alignment: Needleman-Wunsch x = THISLINE y = ISALIGNED Si,j stores the score of the optimal alignment of all characters/residues up to xi of x will all residues up to yj of y Let g be the gap penalty (penalty of one insertion or deletion) Optimal alignment: THIS_ LI_ NE _ __ ISALIGN ED Then: Si,0 = ig S0,j = jg (what does this mean?) (what if we do not want to penalize gaps at beginning or end?) A. Shehu – CS444 (001) Introduction to Computational Biology Only Difference from LCS in S matrix ▆ How is s(xi, yj) determined? ▆ What about the gap penalty? ▆ Answer: through a scoring matrix • PAM, BLOSUM are examples of scoring matrices • It is very important to have the “right” scoring matrix A. Shehu – CS444 (001) Introduction to Computational Biology Global Alignment: Needleman-Wunsch Scoring matrix used matches the gap penalty (-4) to the most severe mismatch (-4). ▆ Can there be multiple global alignments? • How does the algorithm break a three-way tie? • The decision is left to the programmer • There may be reasons to prefer a less arbitrary decision • For instance: discouraging gaps in one of the sequences Introduction to Computational Biology A. Shehu – CS444 (001) Global Alignment: Needleman-Wunsch Is (B) optimal? Scoring matrix used is BLOSUM-62 It seems that the gap penalty is so high (-8) that there is no incentive to add gaps rather than allow mismatches (the most severe of which has a penalty of -4) The “fault” is with the scoring matrix used the alignment is optimal within the scoring matrix used Introduction to Computational Biology A. Shehu – CS444 (001) From Global to Local Alignment ▆ Sometimes local alignment is more appropriate than global alignment • E.g., we want to determine whether two proteins share a common domain • The Smith-Waterman algorithm addresses local alignment ▆ Main differences over Needleman-Wunsch: • Whenever the score of the optimal sub-alignment is less than zero, it is rejected (the matrix element is set to 0) • Traceback starts from the highest-scoring element 0 What does the rejection of a negative optimal sub-alignment mean? Hint: many mini global alignments not worth to continue at some point Introduction to Computational Biology A. Shehu – CS444 (001) Smith-Waterman for Local Alignment Gap penalty = -8 Gap penalty = -4 A. Shehu – CS444 (001) Introduction to Computational Biology Affine Gap Penalties ▆ It is observed that gaps come in bursts: • k indels are a single event More likely: ATA _ _ G C ATA TTG C ▆ ▆ The idea is to give reduced penalties to horizontal or vertical runs ▆ Score now is: Less likely: ATA G _ G C AT_ GTG C Filling each element now may require a horizontal or vertical look-up • If |x| = n and |y| = m and n > m, the running time becomes O(n2 x m ) • An O(n x m) algorithm exists Introduction to Computational Biology A. Shehu – CS444 (001) Global Alignment (Needleman-Wunsch) x = THISLINE y = ISALIGNED Si,j stores the score of the optimal alignment of all prefixes of x through xi, y through yj Let g be the gap penalty (penalty of one indel) Initialization: Si,0 = ig S0,j = jg Optimal alignment: THIS_ LI_ N E _ __ I SALI GN E D Introduction to Computational Biology A. Shehu – CS444 (001) Scoring Scheme ▆ Consists of two elements: 1. Substitution matrix Assigning a value to a pair of aligned residues 2. Gap model Dealing with the presence of indels in one sequence relative to the other Introduction to Computational Biology A. Shehu – CS444 (001) Substitution Matrices PAM 120 ▆ ▆ ▆ BLOSUM 62 Assigns a value to a pair of aligned residues Constructed by analyzing substitution frequencies in correct alignments of known families of nucleotides/residues Degree to which A and B can occur aligned against one another in correct alignments reflects relationships occurring in nature Introduction to Computational Biology A. Shehu – CS444 (001) Substitution Matrices – Example Overall alignment score = ? BLOSUM 62 Introduction to Computational Biology A. Shehu – CS444 (001) Substitution Matrices – Example Overall alignment score: 52 BLOSUM 62 Introduction to Computational Biology A. Shehu – CS444 (001) Scoring Matrices: How are they Constructed? ▆ Scoring matrices are created based on biological evidence ▆ Statistics over protein sequences are gathered ▆ ▆ Some mutations have little effect on the protein’s function ▆ Reason why some penalties δ(vi , wj), are less harsh than others The statistics is a bit more involved ▆ The alignment score attempts to measure the likelihood of a common evolutionary ancestor ▆ To achieve this mathematically, the alignment of two residues from two sequences is considered under two competing models/hypotheses: ▆ The random model (R) ▆ The match (non-random, evolutionary) model, M Introduction to Computational Biology A. Shehu – CS444 (001) The Random vs. the Match Model The Random Model: ▆ Sequences are assumed to be random selections from a given pool of residues ▆ Every position in the sequence is independent of every other ▆ If the proportion of amino acid type a in the pool is pa, this fraction will be reproduced in a sequence ▆ As a result, the probability of amino acid a being aligned with amino acid b is pa x pb The Match Model: ▆ Sequences are assumed to be related due to an evolutionary process, with a high correlation between aligned residues ▆ The probability of occurrence of particular residues depends not on the pool of available residues, but on the residue at the equivalent position in the sequence of the common ancestor ▆ As a result, the probability of a being aligned with b is qa,b, where the actual values of qa,b depend on the properties of the evolutionary process Introduction to Computational Biology A. Shehu – CS444 (001) The Random vs. the Match Model: the Odds Ratio ▆ P(a,b|R) = pa x pb and P(a,b|M) = qa,b ▆ Comparing the odds: qa,b/papb ▆ If the ratio is greater than 1, we say that the match model is more likely to explain the alignment of a to b ▆ The odds ratio over two aligned sequences is the product of the above ratio for the different positions ▆ The product turns into a summation under the logarithm ▆ That is why scores are summed up when computing the alignment either in Needleman-Wunsch or Smith-Waterman ▆ A positive score means that the match model explains the alignment at that position ▆ What does a negative score mean? ▆ Two scoring matrices: PAM and BLOSUM Introduction to Computational Biology A. Shehu – CS444 (001) PAM (Point Accepted Mutation) Scoring Matrix ▆ Scoring matrices are best developed based on experimental data because in this way they reflect the actual relationships that occur in nature ▆ PAM was one of the first scoring matrices – developed by Dayhoff et al. ▆ Substitution probabilities were estimated by using known mutational histories ▆ 34 protein superfamilies were used, divided into 71 groups of near--homologous sequences (>85% identity to remove superimposed mutations) ▆ A phylogenetic tree was constructed for each group (including inference of most likely ancestral sequences at each internal node) ▆ We have not yet covered phylogeny, but it is used to define an evolutionary model that explains the evolution ▆ Caviat: even under the evolutionary model, the substitution in a position is independent of what happens in neighboring positions ▆ This is not always a valid assumption Introduction to Computational Biology A. Shehu – CS444 (001) PAM (Point Accepted Mutation) Scoring Matrix PAM 120 Henikoff and Henikoff (1992) resulted in BLOSUM matrices BLOSUM-62 is the standard for ungapped alignments BLOSUM-50 is the standard for gapped alignments Question to explore: how are BLOSUM matrices different from PAM matrices? Are there any advantages to using BLOSUM over PAM? Introduction to Computational Biology A. Shehu – CS444 (001) PAM vs. BLOSUM BLOSUM-62 ▆ BLOSUM matrices were built only from the most conserved domains of blocks of a database of conserved proteins ▆ BLOSUM matrices are also more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches ▆ PAM matrices are more tolerant of substitutions to or from hydrophilic amino acids ▆ What do the numbers (PAM40, PAM120) mean? For details: http://www.ebi.ac.uk/help/matrix.html Introduction to Computational Biology A. Shehu – CS444 (001) Which Scoring Matrix to Use? ▆ PAM* vs. BLOSUM*: This decision is also more difficult when the evolutionary distance between the sequences is not known ▆ Different studies have concluded that for the PAM matrices it is generally best to try PAM40, PAM120, and PAM250 ▆ BLOSUM matrices are best for local alignments ▆ If using PAM for local alignments: ▆ ▆ Lower PAM matrices find short local alignments ▆ Higher PAM matrices find longer but weaker local alignments Several different matrices should be used, and the alignment that is judged to be evolutionarily the most accurate is the one chosen ▆ Question: how can one judge which one is the most accurate? ▆ Judgment on a control set where the evolutionary relationship is known Introduction to Computational Biology A. Shehu – CS444 (001) Gaps ▆ Gap: ▆ A maximal consecutive run of spaces in a single string of a given alignment ▆ Homologous sequences are often of different lengths as a result of indels that have occurred as they diverged from an ancestral sequence ▆ Their alignment is dealt with by inserting gaps in the sequences to achieve as correct a match as possible Introduction to Computational Biology A. Shehu – CS444 (001) Gap Models ▆ To place limits on the introduction of gaps, alignment algorithms use a gap model • Explicitly models the number and length of gaps • Commonly used gap models: • Constant How are gap penalties determined? • Linear Not too large, not too small • Convex Relative to range of entries in substitution • Arbitrary (scoring) matrix Details: http://www.ebi.ac.uk/help/gaps.html Introduction to Computational Biology A. Shehu – CS444 (001) BLAST: How Do They Do It? ▆ What is the alignment algorithm used by pBLAST? ▆ Main idea is to precompile the database of deposited sequences ▆ ▆ For every W-mer (string of length W), record the database location of the sequence to which it belongs Overview of the BLAST algorithm: ▆ Receive query ▆ Split query into overlapping words of length W ▆ Find neighborhood words for each word until threshold T ▆ Look into the table where these neighbor words occur: seeds ▆ Extend seeds until score drops off under X ▆ Evaluate statistical significance of score ▆ Report scores and alignments Introduction to Computational Biology A. Shehu – CS444 (001) BLAST: Illustration Seeds are extended until the cumulative score drops Karlin-Altschul statistics P-value: Probability that the HSP was generated by chance Score: -log(probability) E: expected number of such alignments given the database Question: What is a two-hit blast?