GLOBAL PAIRWISE ALIGNMENT

GLOBAL ALIGNMENT OF: 2 NUCLEOTIDE SEQUENCES OR 2 AMINO-ACID SEQUENCES

Assumptions:
Life is monophyletic.
Biological entities (sequences, taxa) share common ancestry.

[Figure: an ancestor and its two descendants, descendant 1 and descendant 2.]
Any two organisms share a common ancestor in their past.

[Figures: common ancestors dated ~5 MYA, ~120 MYA, and ~1,500 MYA.]

Homologous sequences
(1) Speciation events
(2) Gene duplication
(3) Duplicative transposition

Homology: A term coined by Richard Owen in 1843.
Definition: Similarity resulting from common ancestry.

Homology
There are three main types of molecular homology: orthology, paralogy (including ohnology), and xenology.

Homology: General Definition
• Homology designates a qualitative relationship of common descent between entities.
• Two genes are either homologous or they are not!
– It doesn’t make sense to say “two genes are 43% homologous.”
– It doesn’t make sense to say “Linda is 43% pregnant.”

Orthology & Paralogy
• Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes.
• Two genes are paralogs if they are related by gene duplication.
• Two genes are ohnologs if they are related by gene duplication due to genome duplication.

[Figure; legend: marked symbol = gene death.]

Xenology is due to horizontal (lateral) gene transfer (HGT or LGT).
XA and XB are xenologs.
Distinguishing orthologs from xenologs is impossible in pairwise genomic comparisons, but possible when multiple genomes are compared.

Orthology, Paralogy, Xenology (Fitch, Trends in Genetics, 2000. 16(5):227-231)

Homology
By comparing homologous characters, we can reconstruct the evolutionary events that have led to the formation of the extant sequences from the common ancestor.

Homology
When comparing sequences, we are interested in POSITIONAL HOMOLOGY.
We identify POSITIONAL HOMOLOGY through SEQUENCE ALIGNMENT.
Alignment: A hypothesis concerning positional homology among residues from two or more sequences.

Positional homology = In pairwise alignment, a pair of nucleotides from two homologous sequences that have descended from one nucleotide in the ancestor of the two sequences.

Sequence alignment involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.

[Figure labels: "Unknown sequence"; "Unknown events & unknown sequence of events" (shown twice).]
The true alignment is unknown.

There are two modes of alignment.
Global alignment: each residue of sequence A is compared with each residue in sequence B. Global alignment algorithms are used in comparative and evolutionary studies.
Local alignment: determining if sub-segments of one sequence are present in another. Local alignment methods have their greatest utility in database searching and retrieval (e.g., BLAST).

For reasons of computational complexity, sequence alignment is divided into two categories:
Pairwise alignment (i.e., the alignment of two sequences).
Multiple-sequence alignment (i.e., the alignment of three or more sequences).
Pairwise alignment problems have exact solutions. Multiple-sequence alignment problems only have approximate (heuristic) solutions.

A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:
(1) matches = the same nucleotide appears in both sequences.
(2) mismatches = different nucleotides are found in the two sequences.
(3) gaps = a base in one sequence and a null base in the other.

GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG

- Two DNA sequences: A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.

There are internal and terminal gaps.
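The pair counts defined above (x, y, z) can be tallied mechanically. The following is a minimal sketch, not part of the original slides; the function name `tally_pairs` is a hypothetical helper, and "-" is assumed to be the gap (null base) character.

```python
def tally_pairs(row_a, row_b, gap="-"):
    """Count matches (x), mismatches (y) and gapped positions (z).

    row_a and row_b are the two rows of a pairwise alignment and must
    have the same length. The name and signature are illustrative only.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    x = y = z = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair two null bases")
        if a == gap or b == gap:
            z += 1          # a base opposite a null base
        elif a == b:
            x += 1          # same nucleotide in both sequences
        else:
            y += 1          # different nucleotides
    return x, y, z


# The example alignment shown on the slide above.
row_a = "GCGGCCCATCAGGTAGTTGGTG-G"
row_b = "GCGTTCCATC--CTGGTTGGTGTG"
x, y, z = tally_pairs(row_a, row_b)
m = len(row_a.replace("-", ""))   # length of sequence A
n = len(row_b.replace("-", ""))   # length of sequence B
print(x, y, z, m, n)              # 17 4 3 23 22
```

Every matched or mismatched pair uses one base from each sequence and every gapped position uses one base in total, so the counts satisfy m + n = 2(x + y) + z.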
GCGG-CCATCAGGTAGTTGGTG-GCGTTCCATC--CTGGTTGGTGTG

A terminal gap may indicate missing data.

An internal gap indicates that a deletion or an insertion has occurred in one of the two lineages.

When sequences are compared through alignment, it is impossible to tell whether a deletion has occurred in one sequence or an insertion has occurred in the other. Thus, deletions and insertions are collectively referred to as indels (short for insertion or deletion).

The alignment is the first step in many functional and evolutionary studies.
Errors in alignment tend to amplify in later stages of the study.

Motivation for sequence alignment
Function – Similarity may be indicative of similar function.
Evolution – Similarity may be indicative of common ancestry.

Some definitions

Methods of alignment:
1. Manual
2. Dot matrix
3. Distance matrix
4. Combined (distance + manual)

Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.

GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG

Advantages of manual alignment:
(1) use of a powerful and trainable tool (the brain, well… some brains).
(2) ability to integrate additional data, e.g., domain structure, biological function.
Protein alignment may be guided by secondary and tertiary structures.
[Figure: Escherichia coli DjlA protein; Homo sapiens DjlA protein.]

Disadvantages of manual alignment:
subjectivity (the algorithm is unspecified)
irreproducibility (the results cannot be independently reproduced)
unscalability (inapplicable to long sequences)
incommensurability (the results cannot be compared to those obtained by other methods)

The dot-matrix method (Gibbs and McIntyre, 1970): The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.

The alignment is defined by a path from the upper-left element to the lower-right element.

There are 4 possible steps in the path:
(1) a diagonal step through a dot = match.
(2) a diagonal step through an empty element of the matrix = mismatch.
(3) a horizontal step = a gap in the sequence on the left of the matrix.
(4) a vertical step = a gap in the sequence on the top of the matrix.

A dot matrix may become cluttered. With DNA sequences, ~25% of the elements will be occupied by dots by chance alone.

The number of spurious matches is determined by: window size (how many residues are compared), stringency (the minimum number of matches for a hit), & alphabet size (number of character states). Window size must be an odd number.

[Figure panels: window size = 1, stringency = 1, alphabet size = 4; window size = 3, stringency = 2, alphabet size = 4; window size = 1, stringency = 1, alphabet size = 20.]

Dot-matrix methods:
Advantages: By being a visual representation, and humans being visual animals, the method may unravel information on the evolution of sequences that cannot easily be gleaned from a line alignment.
Disadvantages: May not identify the best possible alignment.
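The dot-matrix construction with a window size and a stringency, as described above, can be sketched in a few lines. This is an illustrative sketch only: the names `dot_matrix` and `render` are hypothetical, the window is assumed to be centred on the cell, and window positions that fall off either end of a sequence are simply skipped (an assumption; the slides do not specify edge handling).

```python
def dot_matrix(top, left, window=1, stringency=1):
    """Return a list of rows of booleans (rows follow `left`, columns follow `top`).

    A dot is placed at (i, j) when at least `stringency` of the residue
    pairs inside the window centred on (i, j) are identical.
    """
    if window % 2 == 0:
        raise ValueError("window size must be an odd number")
    half = window // 2
    rows = []
    for i in range(len(left)):
        row = []
        for j in range(len(top)):
            hits = 0
            for k in range(-half, half + 1):
                # Skip window positions that run off either sequence.
                if 0 <= i + k < len(left) and 0 <= j + k < len(top):
                    if left[i + k] == top[j + k]:
                        hits += 1
            row.append(hits >= stringency)
        rows.append(row)
    return rows


def render(top, left, dots):
    """Draw the matrix as text, '*' for a dot and '.' for an empty element."""
    lines = ["  " + top]
    for residue, row in zip(left, dots):
        lines.append(residue + " " + "".join("*" if d else "." for d in row))
    return "\n".join(lines)


top, left = "GAATTCAGT", "GGATCGA"
plain = dot_matrix(top, left)                               # window 1, stringency 1
filtered = dot_matrix(top, left, window=3, stringency=2)    # stricter filter
print(render(top, left, plain))
print(render(top, left, filtered))
```

With window size 1 and stringency 1 the result reduces to the basic method: a dot wherever the two residues are identical.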
Advantages: Highlighting Information
(Window size = 60 amino acids; stringency = 24 matches)
The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.
The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.

Disadvantages: Not possible to identify the best alignment.

Scoring Matrices & Gap Penalties

The true alignment between two sequences is the one that reflects accurately the evolutionary relationships between the sequences.
Since the true alignment is unknown, in practice we look for the optimal alignment, which is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.
Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa.

α = matches
β = mismatches
γ = nucleotides in gaps
δ = gaps

The scoring scheme comprises a gap penalty and a scoring matrix, M(a,b), that specifies the score for each type of match (a = b) or mismatch (a ≠ b).
The units in a scoring matrix may be the nucleotides in the DNA or RNA sequences, the codons in protein-coding regions, or the amino acids in protein sequences.

DNA scoring matrices are usually simple. In the simplest scheme all mismatches are given the same penalty. M(a,b) is positive if a = b and negative otherwise:

$$M(a,b)\;\begin{cases} > 0 & \text{if } a = b \\ < 0 & \text{if } a \neq b \end{cases}$$

In more complicated matrices a distinction may be made between transition and transversion mismatches, or each type of mismatch may be penalized differently.

Further complications: Distinguishing among different matches and mismatches.
For example, a mismatched pair consisting of Leu & Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg & Glu, which are very dissimilar from each other.

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
[Figure: the BLOSUM62 matrix.]
B = asx (asp or asn); Z = glx (glu or gln); X = unknown; * = termination codon.
The matrix is symmetrical.
Positive numbers on the diagonal.
Mismatches are usually penalized.
Some mismatches are not penalized.
A few mismatches are even rewarded.

Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are mathematically manipulated to make the gaps equivalent in value to the mismatches.
The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.

The gap penalty has two components: a gap-opening penalty and a gap-extension penalty.

Three main gap-penalty systems:
(1) Fixed gap-penalty system = 0 gap-extension costs.
(2) Linear gap-penalty system = the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1.
(3) Logarithmic gap-penalty system = the gap-extension penalty increases with the logarithm of the gap length, i.e., more slowly.

Alignment algorithms

Aim: Given a predetermined set of criteria, find the alignment associated with the best score from among all possible alignments.
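A scoring scheme of this kind (a match/mismatch score plus one of the three gap-penalty systems) can be sketched as follows. This is a rough illustration, not the slides' own code: the names `gap_cost` and `score_alignment` are hypothetical, the extension constant of 1 is an arbitrary choice, and the logarithmic form (opening + extension × ln(length)) is an assumed reading of "increases with the logarithm of the gap length".

```python
import math
from itertools import groupby


def gap_cost(length, system, opening=4.0, extension=1.0):
    """Cost (a positive number, to be subtracted) of one gap of `length` positions."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    if system == "fixed":          # no gap-extension cost
        return opening
    if system == "linear":         # extension * (gap length minus 1)
        return opening + extension * (length - 1)
    if system == "logarithmic":    # assumed form: grows with ln(length)
        return opening + extension * math.log(length)
    raise ValueError("unknown gap-penalty system: " + system)


def column_kind(a, b, gap="-"):
    """Classify one alignment column."""
    if a == gap:
        return "gap_in_a"
    if b == gap:
        return "gap_in_b"
    return "match" if a == b else "mismatch"


def score_alignment(row_a, row_b, system, match=5, mismatch=-3,
                    opening=4.0, extension=1.0):
    """Score an alignment; consecutive gapped columns in one row form one gap."""
    kinds = [column_kind(a, b) for a, b in zip(row_a, row_b)]
    score = 0.0
    for kind, run in groupby(kinds):
        length = len(list(run))
        if kind == "match":
            score += match * length
        elif kind == "mismatch":
            score += mismatch * length
        else:
            score -= gap_cost(length, system, opening, extension)
    return score


row_a = "GCGGCCCATCAGGTAGTTGGTG-G"
row_b = "GCGTTCCATC--CTGGTTGGTGTG"
for system in ("fixed", "linear", "logarithmic"):
    print(system, score_alignment(row_a, row_b, system))
```

The example alignment has 17 matches, 4 mismatches, one gap of length 2 and one of length 1, so the fixed system gives 85 – 12 – 8 = 65 and the linear system (extension 1) gives 64; the logarithmic system charges the two-position gap slightly less than the linear one.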
The alignment with the best score is the OPTIMAL ALIGNMENT.

The number of possible alignments may be astronomical.

$$\binom{n+m}{\min(n,m)} = \frac{(n+m)!}{n!\,m!} \approx \frac{(n+m)^{n+m}}{n^{n}\,m^{m}}\sqrt{\frac{n+m}{2\pi nm}}$$

where n and m are the lengths of the two sequences to be aligned.

For example, when two DNA sequences 200 residues long each are compared, there are more than 10^153 possible alignments.
In comparison, the number of protons in the universe is only ~10^80.

FORTUNATELY: There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities.

The Needleman-Wunsch (1970) algorithm uses Dynamic Programming.

Dynamic programming = a computational technique. It is applicable when large searches can be divided into a succession of small stages, such that
(1) the solution of the initial search stage is trivial,
(2) each partial solution in a later stage can be calculated by reference to only a small number of solutions in an earlier stage, and
(3) the last stage contains the overall solution.
Dynamic programming can be applied to problems of alignment because ALIGNMENT SCORES obey the following rules:

$$S_{1 \to x,\; 1 \to y} + S_{x+1,\; y+1} = S_{1 \to x+1,\; 1 \to y+1}$$

That is, the score of the alignment up to positions x + 1 and y + 1 is the score of the alignment up to positions x and y plus the score for the pair at positions x + 1 and y + 1.

Path Graph for aligning two sequences
[Figures: steps that are allowed; steps that are not allowed.]

Scoring scheme
match = +5
mismatch = –3
gap-opening penalty = –4
gap-extension penalty = 0

Matrix initialization
0 + match = 5
0 + gap = –4
0 + gap = –4

Matrix fill
0 + match = 5
5 + gap = 1
0 + gap = –4
… and so on and so forth

Complete matrix fill
[Figure: the completed matrix.]

Trace back
The alignment is produced by starting either at the highest score in the rightmost column or the bottom row (if terminal gaps are allowed) or at the bottom rightmost cell (if they are not), and proceeding from right to left by following the best pointers. This stage is called the traceback. The graph of pointers in the traceback is also referred to as the path graph because it defines the paths through the matrix that correspond to the optimal alignment or alignments.
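The initialization, fill and traceback stages described above can be put together in a short sketch of Needleman-Wunsch global alignment. It is not the slides' own implementation and makes several labeled assumptions: every gapped position is charged –4 (which coincides with the slide's scheme when all gaps have length 1), the first row and column are initialized with cumulative gap penalties, terminal gaps are not allowed (traceback starts at the bottom rightmost cell), and ties are broken in favour of the diagonal step. The sequences are the two used in the worked traceback of this lecture.

```python
def needleman_wunsch(seq_a, seq_b, match=5, mismatch=-3, gap=-4):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # Matrix initialization: the first row and column hold cumulative gaps.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Matrix fill: each cell depends only on three neighbouring cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # diagonal step
                              score[i - 1][j] + gap,        # gap in seq_b
                              score[i][j - 1] + gap)        # gap in seq_a

    # Traceback from the bottom rightmost cell (no terminal gaps allowed).
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, aligned_a, aligned_b = needleman_wunsch("GAATTCAGT", "GGATCGA")
print(best)        # 11
print(aligned_a)
print(aligned_b)
```

Because the traceback follows a single pointer at each cell, this sketch reports only one of the equally good alignments; collecting all of them would require following every pointer that attains the maximum.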
Scoring scheme throughout: match = +5, mismatch = –3, gap-opening penalty = –4, gap-extension penalty = 0.

Trace back (if we DO allow terminal gaps)
[Figure.]

Trace back (if we DO NOT allow terminal gaps)
Cell 11: 10 + gap ≠ 11; 10 + gap ≠ 11; 14 + mismatch = 11
Cell 14: 10 + gap ≠ 14; 5 + gap ≠ 14; 9 + match = 14
Cell 9: 4 + mismatch ≠ 9; 0 + gap ≠ 9; 13 + gap = 9
Cell 13: 8 + match = 13; 9 + gap ≠ 13; 4 + gap ≠ 13
Cell 8: 12 + gap = 8; 3 + match = 8; –1 + gap ≠ 8
Cell 8 can be reached from two cells, so the traceback branches here:
Cell 12: 7 + gap ≠ 12; 3 + gap ≠ 12; 7 + match = 12
Cell 3: 7 + gap = 3; –2 + mismatch ≠ 3; –6 + gap ≠ 3
…
[Figure label: high road / low road / middle road]

Trace back (complete)
Two possible alignments:

GAATTCAGT
GGA-TC-GA
* * ** *

GAATTCAGT
GGAT-C-GA
* ** * *

Scoring Matrices

Mismatch and gap penalties should be inversely proportional to the frequencies with which changes occur.

Transitions (68%) occur more frequently than transversions (32%).
Mismatch penalties for transitions should be smaller than those for transversions.
From \ To      A                         T                         C                         G                         Row totals
From A         -                         3.4 ± 0.7 (3.6 ± 0.7)     4.5 ± 0.8 (4.8 ± 0.9)     12.5 ± 1.1 (13.3 ± 1.1)   20.3 (21.6)
From T         3.3 ± 0.6 (3.5 ± 0.6)     -                         13.8 ± 1.9 (14.7 ± 2.0)   3.3 ± 0.6 (3.5 ± 0.6)     20.4 (21.7)
From C         4.2 ± 0.5 (4.2 ± 0.5)     20.7 ± 1.3 (16.4 ± 1.3)   -                         4.6 ± 0.6 (4.4 ± 0.6)     29.5 (25.1)
From G         20.4 ± 1.4 (21.9 ± 1.5)   4.4 ± 0.6 (4.6 ± 0.6)     4.9 ± 0.7 (5.2 ± 0.8)     -                         29.7 (31.6)
Column totals  27.9 (29.5)               28.5 (24.6)               23.2 (23.2)               20.5 (21.3)

Empirical substitution matrices
PAM (Percent/Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)

PAM
• Developed by Margaret Dayhoff in 1978.
• Based on comparisons of very similar protein sequences.

Log-odds ratios
• A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment.
• The values in a scoring matrix are log ratios of two probabilities. One is the random probability. The other is the probability of an empirical pair occurrence.
• Because the scores are logarithms of probability ratios, they can be added to give a meaningful score for the entire alignment. The more positive the score, the better the alignment!

The PAM matrices (Percent accepted mutations)
• Align sequences that are at least 85% identical.
– Minimizes ambiguity in alignments and the number of coincident mutations.
• Reconstruct phylogenetic trees and infer ancestral sequences.
• Tally replacements "accepted" by natural selection, in all pairwise comparisons.
– Meaning, the number of times j was replaced by i in all comparisons.
• Compute amino acid mutability (i.e., the propensity of a given amino acid, j, to be replaced).

The PAM matrices
• Combine data to produce a Mutation Probability Matrix for one PAM of evolutionary distance, which is used to calculate the Log Odds Matrix for similarity scoring.
• Thus, depending on the protein family used, various PAM matrices result, some of which are “good” at locating evolutionarily distant conserved mutations and some of which are good at locating evolutionarily close conserved mutations.

More on log-odds ratios
In PAM, log-odds scores are multiplied by 10 to avoid decimals. Therefore, a PAM score of 2 actually corresponds to a log-odds ratio of 0.2.
0.2 = log-odds for substitution i to j = log10 { (observed i-to-j mutation rate) / (expected rate) }
The value 0.2 is log10 of the relative expectation value of the mutation. Therefore, the expectation value is 10^0.2 ≈ 1.6.
So, a PAM score of 2 indicates that (in related sequences) the mutation would be expected to occur 1.6 times more frequently than random.

PAM250
– Calculated for families of related proteins (>85% identity).
– 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues.
– A positive score signifies a common replacement whereas a negative score signifies an unlikely replacement.
– The PAM250 matrix assumes/is optimized for sequences separated by 250 PAM, i.e., 250 substitutions in 100 amino acids (longer evolutionary time).

PAM250
Sequence alignment matrix that allows 250 accepted point mutations per 100 amino acids.
PAM250 is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences.

Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local similarities.
• High PAM numbers: long sequences, weak similarities.
– PAM60 for close relations (60% identity)
– PAM120 recommended for general use (40% identity)
– PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices.
– PAM40, PAM120, PAM250 recommended.

BLOSUM
• Blocks Substitution Matrix
– Steven Henikoff and Jorja G. Henikoff (1992).
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function.
– Highly conserved protein domains.
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment.
– Counts amino acids observed in same column.
– Symmetrical model of substitution.

BLOSUM62
• BLOSUM matrices are based on local alignments (“blocks” or conserved amino acid patterns).
• BLOSUM 62 is a matrix calculated from comparisons of sequences sharing no more than 62% identity.
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
• BLOSUM 62 is the default matrix in BLAST 2.0.

BLOSUM Matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS.
• BLOSUMn is based on sequences that are at most n percent identical.

BLOSUM62
The procedure for calculating a BLOSUM matrix is based on a likelihood method estimating the occurrence of each possible pairwise substitution.
Only aligned blocks are used to calculate the BLOSUMs.
The higher the score, the more closely related the sequences.

Why is BLOSUM62 called BLOSUM62?
Because all blocks whose members shared at least 62% identity with ANY other member of that block were averaged and represented as 1 sequence.

Selecting a BLOSUM Matrix
• For BLOSUMn, higher n is suitable for sequences that are more similar.
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations

Equivalent PAM and Blosum matrices
The following matrices are roughly equivalent (listed from less divergent to more divergent):
• PAM100 ==> Blosum90
• PAM120 ==> Blosum80
• PAM160 ==> Blosum60
• PAM200 ==> Blosum52
• PAM250 ==> Blosum45
Generally speaking...
• The Blosum matrices are best for detecting local alignments.
• The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
• The Blosum45 matrix is the best for detecting long and weak alignments.
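The log-odds arithmetic described in the PAM slides above (log10 of an odds ratio, multiplied by 10 and rounded) can be checked numerically. This is a small sketch under that stated PAM convention only; the function names are hypothetical, and other matrix families may use a different logarithm base and scale factor.

```python
import math


def log_odds_score(observed, expected, scale=10):
    """Scaled, rounded log10 odds score (PAM convention: scale = 10)."""
    return round(scale * math.log10(observed / expected))


def odds_from_score(score, scale=10):
    """Invert the scaling: how many times more frequent than random."""
    return 10 ** (score / scale)


# A PAM score of 2 corresponds to a log-odds ratio of 0.2.
print(round(odds_from_score(2), 2))      # 1.58, i.e. roughly 1.6

# Scores add where the underlying odds ratios multiply, which is why
# per-column scores can be summed over a whole alignment.
combined = odds_from_score(2) * odds_from_score(3)
print(math.isclose(combined, odds_from_score(5)))   # True
```

A replacement seen less often than expected by chance gets a negative score; for example, an odds ratio of 0.5 gives round(10 × log10(0.5)) = –3.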
Comparison of PAM250 and BLOSUM62
The relationship between BLOSUM and PAM substitution matrices:
BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences.
BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.
If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.

Scoring matrices commonly used
• PAM250
– Shown to be appropriate for searching for sequences of 17-27% identity.
• BLOSUM62
– Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships.
• BLOSUM50
– Shown to be better for FASTA searches.

Effect of gap penalties on amino-acid alignment
Human pancreatic hormone precursor versus chicken pancreatic hormone
(a) Penalty for gaps is 0.
(b) Penalty for a gap of size k nucleotides is w_k = 1 + 0.1k.
(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids.

Alignments: things to keep in mind
“Optimal alignment” means “having the highest possible score, given a substitution matrix and a set of gap penalties.”
This is NOT necessarily the most meaningful alignment.
The assumptions of the algorithm are often wrong:
- substitutions are not equally frequent at all positions,
- it is very difficult to realistically model insertions and deletions.
Pairwise alignment programs ALWAYS produce an alignment (even when it does not make sense to align sequences).