Uploaded by Ayberk Caner

MBG305 pairwise alignment

Homology, Similarity & Identity
Homologous = evolved from a common ancestor, qualitative (yes/no), not expressed as a % or proportion
homologous proteins almost always have similar 3D structure, but may or may not have high aa or nt sequence identity
Identity & Similarity are quantitative
Identity = % of aa/nt that are identical
Similarity = % of aa that are identical or similar (in terms of biochemical properties)
Globin family example: all members are homologous, similar function, but
low sequence identity because diverged so long ago
human beta globin and neuroglobin: 22% aa identity
human alpha globin and myoglobin have 26% identity
but same shape
FIGURE 3.1 Three-dimensional structures of: (a) myoglobin (accession 3RGK); (b)
the tetrameric hemoglobin protein (2H35); (c) the beta globin subunit of
hemoglobin; and (d) myoglobin and beta globin superimposed.
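The identity and similarity definitions above can be sketched in a few lines of Python. This is only an illustration: the property groups in SIMILAR_GROUPS are a hypothetical simplification (real tools take similarity from a scoring matrix), and using the full alignment length as the denominator is one convention among several.

```python
# Hypothetical grouping of amino acids by broad biochemical property.
# Real alignment tools decide "similar" from a scoring matrix instead.
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def percent_identity_similarity(a, b):
    """Return (identity %, similarity %) for two aligned, equal-length strings.

    Gaps are written as '-'. Assumption: the denominator is the alignment length.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # a gap column is neither identical nor similar
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    n = len(a)
    return 100 * identical / n, 100 * similar / n


# Made-up aligned fragments, for illustration only.
identity, similarity = percent_identity_similarity("VLSPADKT", "ILSAAD-T")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```

Similarity is always at least as high as identity, because every identical position is also counted as similar.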
2 types of homologous sequences:
1. Orthologs = derived from a common ancestor by speciation
2. Paralogs = derived from a common ancestor by gene duplication
FIGURE 3.2 A group of myoglobin orthologs
FIGURE 3.3 Paralogous human globins
Relatedness is studied by determining sequence similarity
1st must align the sequences
Protein vs DNA Alignment
Often more information from protein alignment
why?
*many changes in DNA seq do not change the aa (ex. 3rd codon position)
ex. GTT, GTC, GTA, GTG (coding strand) all → Valine
*many aas have similar physicochemical properties, and this can be accounted for with a scoring system
*more changes at DNA level → less observable homology
ex. GTT mutates to GTC = Val; GTT mutates to GTA = Val
So, we often translate a nt seq and use that for alignment
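A small sketch of why translating first helps: two made-up DNA sequences that differ only at 3rd codon positions look diverged at the nt level but are identical as protein. The codon table here is deliberately partial (four amino acids whose codons tolerate any 3rd base), and the sequences are hypothetical.

```python
# Partial codon table (coding strand, DNA alphabet). Only four amino acids
# whose first two bases fix the residue: Ala (GCN), Ser (TCN), Gly (GGN), Pro (CCN).
CODON_TABLE = {
    prefix + third: aa
    for prefix, aa in {"GC": "A", "TC": "S", "GG": "G", "CC": "P"}.items()
    for third in "ACGT"
}


def translate(dna):
    """Translate a coding-strand DNA string codon by codon (partial table only)."""
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percent of positions that match in two equal-length, ungapped strings."""
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences that differ only at every 3rd codon position.
dna1 = "GCTTCTGGTCCT"
dna2 = "GCATCCGGGCCA"

print(f"nt identity: {percent_identity(dna1, dna2):.1f}%")
print(f"aa identity: {percent_identity(translate(dna1), translate(dna2)):.1f}%")
```

The nt comparison reports differences that are invisible at the protein level, which is the reason for aligning the translated sequences.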
Alignment is performed with a computer algorithm (= procedure)
Alignment of beta globin and myoglobin:
Output reports an identity value and a similarity value
middle line of the alignment marks identical and similar aas: letter = identical, + = similar
FIGURE 3.5 Pairwise alignment of human beta globin (the “query”) and myoglobin (the “subject”). (a) The alignment; (b) Illustration of how
raw scores are calculated.
Scoring of Alignments
algorithm chooses best alignment based on its score = numerical value
different scoring algorithms use different rules
Example: Dayhoff model considers:
1. types of mutations that are accepted by natural selection
2. aa frequency
3. aa mutability
4. probability of each aa mutation
→ scoring matrix = PAM matrix = accepted point mutation matrix
PAM1 matrix
based on aligning closely related proteins: 1% chance of change at a given aa position
ex. If there is Ala in original seq, what is probability it is still Ala in 2nd seq? = 98.7%
If Ala changes what is it most likely to become? = S = Serine (change in 1st codon position)
GCU, GCA, GCC, GCG → UCU, UCA, UCC, UCG
FIGURE 3.9 The PAM1 mutation probability matrix. The original amino acid j is arranged in columns (across the top),
while the replacement amino acid i is arranged in rows.
More distantly related proteins need different matrices → PAM100 & PAM250
PAM250: used when aa identity is ~20%
FIGURE 3.13 The PAM250 mutation probability matrix. At this evolutionary distance, only one in five amino acid residues
remains unchanged from an original amino acid sequence (columns) to a replacement amino acid (rows). Note that the scale
has changed relative to Figure 3.11, and the columns sum to 100.
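In Dayhoff's model the higher-numbered matrices come from multiplying PAM1 by itself: PAM250 is PAM1 raised to the 250th power. The sketch below shows only that extrapolation step, on a hypothetical 2-letter alphabet rather than the real 20 x 20 matrix, so the numbers are not those of Figure 3.13.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 2-letter alphabet: each letter has a 1% chance of changing
# per PAM unit. Columns = original letter, rows = replacement letter,
# so each column sums to 1 (the real matrices are scaled to 100).
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
print(f"P(unchanged after 250 PAM units) = {toy_pam250[0][0]:.3f}")
```

With only two letters the diagonal decays toward 0.5; with 20 amino acids and the real PAM1 values it decays much further, which is why so few residues remain unchanged at PAM250.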
Other scoring matrices
BLOSUM = blocks substitution matrix, based on >500 conserved protein regions
BLOSUM62: built from blocks in which seqs with at least 62% identity are merged and counted as one = default for BLAST
FIGURE 3.17 The BLOSUM62 scoring matrix of Henikoff and Henikoff (1992). This matrix merges all proteins in an
alignment that have 62% amino acid identity or greater into one sequence.
Guide for which matrix to use:
FIGURE 3.18 Summary of PAM and BLOSUM matrices.
Danger!
if seqs are too diverged, correct alignment/homology can’t be found
twilight zone = <20% identity
FIGURE 3.19 Two randomly diverging protein sequences change in a negatively exponential fashion. This plot shows the observed number of amino
acid identities per 100 residues of two sequences (y axis) versus the number of changes that must have occurred (the evolutionary distance in PAM
units). The twilight zone (Doolittle, 1987) refers to the evolutionary distance corresponding to about 20% identity between two proteins. Proteins with
this degree of amino acid sequence identity may be homologous, but such homology is difficult to detect.
Global and Local Alignment
1. Global: entire seq of each protein/DNA is used
ex. Needleman & Wunsch
2. Local: only aligns regions with most similarity
ex. Smith & Waterman
FIGURE 3.23 (a) Global pairwise alignment of bacterial proteins
containing globin domains from Streptomyces avermitilis MA-4680
(NP_824492) and Mycobacterium tuberculosis CDC1551
(NP_337032). (b) Local alignment.
Global: Needleman & Wunsch
gives optimal alignment without checking every possible alignment
checking all consumes too much time & computing power
example of dynamic programming: does a residue-by-residue search for optimal alignment
Step 1:
Set up matrix: 1st seq across top, 2nd seq down the side; draw path to show alignment
diagonal line = match or mismatch
vertical = deletion (gap) in seq1
horizontal = deletion (gap) in seq2
FIGURE 3.20 Pairwise alignment of two amino acid sequences using a dynamic programming algorithm of Needleman and Wunsch (1970) for global
alignment. (a) Two sequences can be assigned a diagonal path through the matrix and, when necessary, the path can deviate horizontally or vertically,
reflecting gaps that are introduced into the alignment. (b) Two identical sequences form a path on the matrix that fits a diagonal line. (c) If there is a
mismatch (or multiple mismatches), the path still follows a diagonal, although a scoring system may penalize the presence of mismatches. If the
alignment includes a gap in (d) the first sequence or (e) the second sequence, the path includes a vertical or horizontal line.
Step 2:
Make scoring matrix: gap penalties (-2 per position) added below/to right of each seq
matching aas filled in gray
enter scores for matches & mismatches according to rules for moving thru matrix
FIGURE 3.21 Pairwise alignment of two
amino acid sequences using the dynamic
programming algorithm of Needleman and
Wunsch (1970) for global alignment.
Step 3:
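Identify optimal alignment: start in lower right corner and trace back toward the upper left, at each step moving to the neighboring cell that produced the current cell's score
→ optimal alignment with best (highest) score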
FIGURE 3.22 Global pairwise alignment of two amino acid sequences using a
dynamic programming algorithm: scoring the matrix and using the trace-back
procedure to obtain the alignments.
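The three steps above (set up the matrix, fill in scores, trace back) can be sketched as follows. This is a minimal illustration, not the exact procedure of the figures: the scores (match +1, mismatch -2, gap -2) are assumptions, a real protein alignment would take match/mismatch values from a PAM or BLOSUM matrix, and the example sequences are made up.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-2, gap=-2):
    """Global alignment; returns (score, aligned_seq1, aligned_seq2).

    seq1 runs across the top (columns), seq2 down the side (rows).
    """
    rows, cols = len(seq2), len(seq1)

    # Steps 1 and 2: set up the matrix; first row and column hold gap penalties.
    F = [[0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(1, cols + 1):
        F[0][j] = j * gap
    for i in range(1, rows + 1):
        F[i][0] = i * gap

    def pair_score(i, j):
        return match if seq1[j - 1] == seq2[i - 1] else mismatch

    # Each cell takes the best of: diagonal (match/mismatch),
    # vertical (gap in seq1), horizontal (gap in seq2).
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            F[i][j] = max(F[i - 1][j - 1] + pair_score(i, j),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Step 3: trace back from the lower right corner to the upper left,
    # moving to a neighboring cell that produced the current score.
    out1, out2 = [], []
    i, j = rows, cols
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair_score(i, j):
            out1.append(seq1[j - 1])
            out2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out1.append("-")          # vertical move: gap in seq1
            out2.append(seq2[i - 1])
            i -= 1
        else:
            out1.append(seq1[j - 1])  # horizontal move: gap in seq2
            out2.append("-")
            j -= 1
    return F[rows][cols], "".join(reversed(out1)), "".join(reversed(out2))


score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(top)
print(bottom)
```

Filling the matrix takes time proportional to the product of the two sequence lengths, which is how the algorithm finds an optimal alignment without enumerating every possible one. When several paths tie, this sketch reports just one of them.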
Local Alignment
useful for database searches
most rigorous = Smith & Waterman: has matrix like global but no gap penalties at beginning or end, and a slightly different scoring system (cell scores never drop below 0; trace-back starts at the highest-scoring cell) → optimal local alignment
relatively slow
faster alternatives = FASTA & BLAST: first look for likely matches in db then align
both are heuristic algorithms = don’t consider all possibilities, not exhaustive
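For comparison with the heuristic tools, here is a minimal sketch of the rigorous Smith & Waterman approach. The scoring values (match +1, mismatch -2, gap -2) and the example sequences are assumptions for illustration; the two changes relative to a global alignment are that cell scores are floored at 0 and that the trace-back starts at the highest-scoring cell and stops at a 0.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-2, gap=-2):
    """Local alignment; returns (score, aligned_part_of_seq1, aligned_part_of_seq2)."""
    rows, cols = len(seq2), len(seq1)
    # First row and column stay 0: no penalty for unaligned ends.
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    best, best_pos = 0, (0, 0)

    def pair_score(i, j):
        return match if seq1[j - 1] == seq2[i - 1] else mismatch

    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            H[i][j] = max(0,  # a local alignment may restart anywhere
                          H[i - 1][j - 1] + pair_score(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    out1, out2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + pair_score(i, j):
            out1.append(seq1[j - 1])
            out2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out1.append("-")          # gap in seq1
            out2.append(seq2[i - 1])
            i -= 1
        else:
            out1.append(seq1[j - 1])  # gap in seq2
            out2.append("-")
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


# Made-up sequences sharing one region flanked by unrelated residues.
print(smith_waterman("CCCCGATTACACCCC", "TTGATTACATT"))
```

Every cell of the matrix is still computed, which is why this is slow for database searches and why FASTA and BLAST first narrow down the candidates.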
Dotplots = graphical way to compare 2 seqs
matrix similar to alignment, dots placed wherever aa/nt is the same
FIGURE 3.25 Dot matrix plots in the output of the NCBI BLASTP program permit visualization of
matching domains in pairwise protein alignments.
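A dotplot is simple enough to sketch as text. This minimal version places a mark wherever the two residues are identical; the function name and the made-up sequences are illustrative only, and real dotplot programs usually add a sliding window and threshold to suppress noise.

```python
def dotplot(seq1, seq2):
    """Return a text dotplot as a list of lines.

    seq1 runs across the top, seq2 down the side; '*' marks identical
    residues and '.' marks everything else.
    """
    lines = ["  " + seq1]
    for b in seq2:
        lines.append(b + " " + "".join("*" if a == b else "." for a in seq1))
    return lines


for line in dotplot("GATTACA", "GATACA"):
    print(line)
```

Regions of similarity show up as diagonal runs of marks, and a shift from one diagonal to another indicates an insertion or deletion.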
TABLE 3.4 Global pairwise alignment algorithms
TABLE 3.5 Local pairwise alignment algorithms
Alignment problems:
Results are greatly affected by optional parameters (scoring matrix, etc.) → if chosen incorrectly: no alignment of homologous seqs, or alignment of non-homologous seqs
Always need biological evidence: structure & function!
Note!
2 aligned proteins of 100 aa with 50% identity show 50 observed differences, but ~80 aa changes are estimated to have actually occurred
why? Multiple substitutions at the same site: topic of MBG325 Molecular Evolution