Homology, Similarity & Identity

Homologous = evolved from a common ancestor
* qualitative (yes/no), not expressed as a % or proportion
* homologous proteins almost always have similar 3D structure; they may or may not have high protein or nt sequence identity

Identity & similarity are quantitative
* Identity = % of aa/nt that are identical
* Similarity = % of aa that are identical or similar (in terms of biochemical properties)

Globin family example: all members are homologous and have similar function, but sequence identity is low because they diverged so long ago
* human beta globin and neuroglobin: 22% aa identity
* human alpha globin and myoglobin: 26% aa identity, but same shape

FIGURE 3.1 Three-dimensional structures of: (a) myoglobin (accession 3RGK); (b) the tetrameric hemoglobin protein (2H35); (c) the beta globin subunit of hemoglobin; and (d) myoglobin and beta globin superimposed.

2 types of homologous sequences:
1. Orthologs = derived from a common ancestor by speciation
2. Paralogs = derived from a common ancestor by gene duplication

FIGURE 3.2 A group of myoglobin orthologs
FIGURE 3.3 Paralogous human globins

Relatedness is studied by determining sequence similarity
1st must align the sequences

Protein vs DNA Alignment

Often more information from protein alignment. Why?
* many changes in DNA seq do not change the aa (ex. 3rd codon position)
  ex. GTT, GTC, GTA, GTG all = Valine
* many aas have similar physicochemical properties, and this can be accounted for with a scoring system
* more changes at DNA level = less observable homology
  ex. GTA mutates to GTG = still Val; GTA mutates to GTT = still Val

So, we often translate a nt seq and use that for alignment

Alignment is performed with a computer algorithm (= procedure)

Alignment of beta globin and myoglobin (labels on figure): identity value; similarity value; identical and similar aas (+ means similar)

FIGURE 3.5 Pairwise alignment of human beta globin (the “query”) and myoglobin (the “subject”).
(a) The alignment; (b) Illustration of how raw scores are calculated.

Scoring of Alignments

algorithm chooses the best alignment based on its score (= numerical value)
different scoring algorithms use different rules
Example: Dayhoff model considers:
1. types of mutations that are accepted by natural selection
2. aa frequency
3. aa mutability
4. probability of each aa mutation
scoring matrix = PAM matrix = accepted point mutation matrix

PAM1 matrix: based on aligning closely related proteins, with a 1% chance of change at a given aa
ex. If there is Ala in the original seq, what is the probability it is still Ala in the 2nd seq? = 98.7%
If Ala changes, what is it most likely to become? = S = Serine (change in 1st codon position)
  Ala: GCU, GCA, GCC, GCG
  Ser: UCU, UCA, UCC, UCG

FIGURE 3.9 The PAM1 mutation probability matrix. The original amino acid j is arranged in columns (across the top), while the replacement amino acid i is arranged in rows.

More distantly related proteins need different matrices: PAM100 & PAM250
PAM250: used when aa identity is ~20%

FIGURE 3.13 The PAM250 mutation probability matrix. At this evolutionary distance, only one in five amino acid residues remains unchanged from an original amino acid sequence (columns) to a replacement amino acid (rows). Note that the scale has changed relative to Figure 3.11, and the columns sum to 100.

Other scoring matrices

BLOSUM = blocks substitution matrix, based on conserved regions (blocks) from >500 groups of related proteins
BLOSUM62: seqs with at least 62% identity are clustered and counted as one seq = default for BLAST

FIGURE 3.17 The BLOSUM62 scoring matrix of Henikoff and Henikoff (1992). This matrix merges all proteins in an alignment that have 62% amino acid identity or greater into one sequence.

Guide for which matrix to use:
FIGURE 3.18 Summary of PAM and BLOSUM matrices.

Danger! If seqs are too diverged, the correct alignment/homology can’t be found
twilight zone = <20% identity

FIGURE 3.19 Two randomly diverging protein sequences change in a negatively exponential fashion.
This plot shows the observed number of amino acid identities per 100 residues of two sequences (y axis) versus the number of changes that must have occurred (the evolutionary distance in PAM units). The twilight zone (Doolittle, 1987) refers to the evolutionary distance corresponding to about 20% identity between two proteins. Proteins with this degree of amino acid sequence identity may be homologous, but such homology is difficult to detect.

Global and Local Alignment

1. Global: entire seq of each protein/DNA is used (ex. Needleman & Wunsch)
2. Local: only aligns regions with most similarity (ex. Smith & Waterman)

FIGURE 3.23 (a) Global pairwise alignment of bacterial proteins containing globin domains from Streptomyces avermitilis MA-4680 (NP_824492) and Mycobacterium tuberculosis CDC1551 (NP_337032). (b) Local alignment.

Global: Needleman & Wunsch

gives the optimal alignment without checking every possible alignment
checking all consumes too much time & computing power
example of dynamic programming: does a residue-by-residue search for the optimal alignment

Step 1: Set up matrix: 1st seq across top, 2nd seq down; draw path to show alignment
* diagonal line = match or mismatch
* vertical = gap (deletion) in seq1
* horizontal = gap (deletion) in seq2

FIGURE 3.20 Pairwise alignment of two amino acid sequences using a dynamic programming algorithm of Needleman and Wunsch (1970) for global alignment. (a) Two sequences can be assigned a diagonal path through the matrix and, when necessary, the path can deviate horizontally or vertically, reflecting gaps that are introduced into the alignment. (b) Two identical sequences form a path on the matrix that fits a diagonal line. (c) If there is a mismatch (or multiple mismatches), the path still follows a diagonal, although a scoring system may penalize the presence of mismatches. If the alignment includes a gap in (d) the first sequence or (e) the second sequence, the path includes a vertical or horizontal line.
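The path idea in Step 1 can be sketched in a few lines of Python. This is only an illustration: the function names `alignment_to_path` and `path_coordinates` and the toy sequences are assumptions, not taken from the lecture or the textbook. The sketch takes an alignment that already exists (two rows of equal length, with `-` for a gap) and reports which move each column makes in the matrix, with seq1 across the top and seq2 down the side.

```python
def alignment_to_path(aligned1, aligned2):
    """Convert two gapped alignment rows into a list of matrix moves.

    aligned1 is the sequence written across the top of the matrix,
    aligned2 is the sequence written down the side; '-' marks a gap.
    (Hypothetical helper for illustration.)
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned rows must have the same length")
    moves = []
    for a, b in zip(aligned1, aligned2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if a == "-":
            moves.append("vertical")    # gap in seq1: step down only
        elif b == "-":
            moves.append("horizontal")  # gap in seq2: step right only
        else:
            moves.append("diagonal")    # match or mismatch: step right and down
    return moves


def path_coordinates(moves):
    """Turn moves into (column, row) points, starting at the upper left corner."""
    col, row = 0, 0
    points = [(col, row)]
    for move in moves:
        if move in ("diagonal", "horizontal"):
            col += 1  # consumed one residue of seq1
        if move in ("diagonal", "vertical"):
            row += 1  # consumed one residue of seq2
        points.append((col, row))
    return points


if __name__ == "__main__":
    moves = alignment_to_path("AC-GT", "ACTG-")
    print(moves)
    print(path_coordinates(moves))
```

A path always ends at (length of seq1, length of seq2), which is why a global alignment has to run from the upper left corner to the lower right corner of the matrix.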
Step 2: Make scoring matrix
* gap penalties added below/to right of each seq (-2)
* matching aas filled in gray
* enter scores for matches & mismatches according to rules for moving through the matrix

FIGURE 3.21 Pairwise alignment of two amino acid sequences using the dynamic programming algorithm of Needleman and Wunsch (1970) for global alignment.

Step 3: Identify optimal alignment
* start in lower right corner and trace back, following the path with the best (highest) scores

Optimal alignment with best score
FIGURE 3.22 Global pairwise alignment of two amino acid sequences using a dynamic programming algorithm: scoring the matrix and using the trace-back procedure to obtain the alignments.

Local Alignment

useful for database searches
most rigorous = Smith & Waterman
* has matrix like global, but no gap penalties at beginning or end, and a slightly different scoring system
* gives the optimal alignment
* relatively slow
faster alternatives = FASTA & BLAST: first look for likely matches in db, then align
both are heuristic algorithms = don’t consider all possibilities, not exhaustive

Dotplots = graphical way to compare 2 seqs
matrix similar to alignment; dots placed wherever aa/nt is the same

FIGURE 3.25 Dot matrix plots in the output of the NCBI BLASTP program permit visualization of matching domains in pairwise protein alignments.

TABLE 3.4 Global pairwise alignment algorithms
TABLE 3.5 Local pairwise alignment algorithms

Alignment problems:
Results are greatly affected by optional parameters (scoring matrix, etc.)
* if chosen incorrectly: homologous seqs may fail to align, or non-homologous seqs may align
Always need biological evidence: structure & function!

Note! 2 aligned proteins of 100 aa with 50% identity show 50 observed differences, but are calculated to have undergone ~80 aa changes. Why? Multiple substitutions at the same site (topic of MBG325 Molecular Evolution)
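The three Needleman & Wunsch steps described earlier (set up the matrix, fill in scores, trace back from the lower right corner) can be sketched as a short Python function. This is a minimal sketch under stated assumptions, not the textbook's implementation: only the gap penalty of -2 comes from the notes, while the match score (+1) and mismatch score (-2) are assumed values, and a real protein alignment would take pair scores from a PAM or BLOSUM matrix instead. The function name `needleman_wunsch` is also just a label chosen here.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-2, gap=-2):
    """Global alignment of seq1 (across the top) and seq2 (down the side).

    match and mismatch are assumed toy scores; gap=-2 follows the notes.
    Returns (score, aligned_seq1, aligned_seq2).
    """
    n, m = len(seq1), len(seq2)

    # Steps 1 and 2: set up the matrix. The first row and first column
    # hold cumulative gap penalties (0, -2, -4, ...).
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        score[i][0] = i * gap

    # Fill each cell with the best of the three possible moves into it.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # diagonal: match or mismatch
                score[i - 1][j] + gap,       # vertical: gap in seq1
                score[i][j - 1] + gap,       # horizontal: gap in seq2
            )

    # Step 3: trace back from the lower right corner to the upper left,
    # at each cell taking a move that explains the cell's score.
    aligned1, aligned2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = None
        if i > 0 and j > 0:
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
        if pair is not None and score[i][j] == score[i - 1][j - 1] + pair:
            aligned1.append(seq1[j - 1])
            aligned2.append(seq2[i - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append("-")
            aligned2.append(seq2[i - 1])
            i -= 1
        else:
            aligned1.append(seq1[j - 1])
            aligned2.append("-")
            j -= 1

    return score[m][n], "".join(reversed(aligned1)), "".join(reversed(aligned2))


if __name__ == "__main__":
    total, row1, row2 = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(row1)
    print(row2)
```

For the toy pair ACGT and AGT this gives a score of 1 (three matches, one gap) with the rows ACGT and A-GT. When several paths tie for the best score, this sketch returns only one of them, preferring the diagonal move.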