pairwise alignment of two nucleotide or amino

GLOBAL
PAIRWISE ALIGNMENT
GLOBAL ALIGNMENT OF:
2 NUCLEOTIDE SEQUENCES
OR
2 AMINO-ACID SEQUENCES
1
Assumptions:
Life is monophyletic
Biological entities (sequences,
taxa) share common ancestry
2
ancestor
descendant 1
Any two organisms
share a common
ancestor in their past
descendant 2
3
ancestor
(~5 MYA)
4
ancestor
(~120 MYA)
5
ancestor
(~1,500 MYA)
6
(1) Speciation events
(2) Gene duplication
(3) Duplicative transposition
Homologous
sequences
7
Homology:
A term coined by
Richard Owen in
1843.
Definition:
Similarity
resulting from
common ancestry.
8
Homology
There are three main types of molecular
homology: orthology, paralogy (including
ohnology) and xenology.
9
Homology: General Definition
• Homology designates a qualitative
relationship of common descent between
entities
• Two genes are either homologous or they
are not!
– it doesn’t make sense to say “two genes
are 43% homologous.”
– it doesn’t make sense to say “Linda is
43% pregnant.”
10
Orthology & Paralogy
• Two genes are orthologs if they
originated from a single ancestral gene
in the most recent common ancestor of
their respective genomes
• Two genes are paralogs if they are
related by gene duplication. Two genes
are ohnologs if they are related by
gene duplication due to genome
duplication
11
12
= Gene death
13
Xenology is due to horizontal (lateral)
gene transfer (HGT or LGT)
XA and XB are xenologs
Distinguishing orthologs from xenologs is
impossible in pairwise genomic
comparisons, but possible when multiple
genomes are compared
14
Orthology, Paralogy, Xenology
(Fitch, Trends in Genetics, 2000. 16(5):227-231)
15
Homology
By comparing homologous characters,
we can reconstruct the evolutionary
events that have led to the formation of
the extant sequences from the common
ancestor.
16
Homology
When comparing sequences, we are
interested in POSITIONAL HOMOLOGY.
We identify POSITIONAL HOMOLOGY
through SEQUENCE ALIGNMENT.
17
Alignment: A hypothesis concerning
positional homology among residues
from two or more sequences.
Positional homology = In
pairwise alignment, a pair of
nucleotides from two homologous
sequences that have descended
from one nucleotide in the
ancestor of the two sequences.
Sequence alignment involves the
identification of the correct location
of deletions and insertions that have
occurred in either of the two lineages
since their divergence from a
common ancestor.
19
20
Unknown sequence
Unknown events &
unknown sequence of
events
Unknown events &
unknown sequence of
events
The true alignment is
unknown.
21
There are two modes of alignment.
Global alignment: each residue of sequence A is
compared with each residue in sequence B. Global
alignment algorithms are used in comparative and
evolutionary studies.
Local alignment: Determining if sub-segments of
one sequence are present in another. Local
alignment methods have their greatest utility in
database searching and retrieval (e.g., BLAST).
For reasons of computational complexity, sequence
alignment is divided into two categories:
Pairwise alignment (i.e., the alignment of two
sequences).
Multiple-sequence alignment (i.e., the alignment of
three or more sequences).
Pairwise alignment problems have exact solutions.
Multiple-sequence alignment problems only have
approximate (heuristic) solutions.
A pairwise alignment consists of a series of
paired bases, one base from each sequence.
There are three types of pairs:
(1) matches = the same nucleotide appears in both
sequences.
(2) mismatches = different nucleotides are found in
the two sequences.
(3) gaps = a base in one sequence and a null base in
the other.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
24
-Two DNA sequences: A and B.
-Lengths are m and n, respectively.
-The number of matched pairs is x.
-The number of mismatched pairs is y.
- Total number of bases in gaps is z.
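As a minimal sketch of how x, y and z can be tallied, the Python function below walks through the two rows of an alignment column by column. The function name `count_pairs` and the use of "-" for the null base are illustrative assumptions; the example rows are the alignment shown earlier in this section.

```python
def count_pairs(row_a, row_b, gap="-"):
    """Return (matches, mismatches, gap positions) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gaps += 1          # a base paired with a null base
        elif a == b:
            matches += 1       # the same nucleotide in both sequences
        else:
            mismatches += 1    # different nucleotides
    return matches, mismatches, gaps


x, y, z = count_pairs("GCGGCCCATCAGGTAGTTGGTG-G",
                      "GCGTTCCATC--CTGGTTGGTGTG")
print(x, y, z)
```

Every matched or mismatched pair uses one base from each sequence and every gap position uses one base in total, so the counts satisfy m + n = 2(x + y) + z.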
25
There are internal and terminal
gaps.
GCGG-CCATCAGGTAGTTGGTG-GCGTTCCATC--CTGGTTGGTGTG
26
A terminal gap may indicate
missing data.
GCGG-CCATCAGGTAGTTGGTG-GCGTTCCATC--CTGGTTGGTGTG
27
An internal gap indicates that a
deletion or an insertion has
occurred in one of the two
lineages.
GCGG-CCATCAGGTAGTTGGTG-GCGTTCCATC--CTGGTTGGTGTG
28
When sequences are compared through
alignment, it is impossible to tell whether a
deletion has occurred in one sequence or an
insertion has occurred in the other. Thus,
deletions and insertions are collectively
referred to as indels (short for insertion
or deletion).
GCGG-CCATCAGGTAGTTGGTG-GCGTTCCATC--CTGGTTGGTGTG
29
The alignment is the first step in
many functional and evolutionary
studies.
Errors in alignment tend to
amplify in later stages of the
study.
30
Motivation for sequence alignment
Function
– Similarity may be indicative of similar
function.
Evolution
– Similarity may be indicative of common
ancestry.
31
Some definitions
32
Methods of alignment:
1. Manual
2. Dot matrix
3. Distance Matrix
4. Combined (Distance + Manual)
34
Manual alignment. When there are
few gaps and the two sequences
are not too different from each
other, a reasonable alignment
can be obtained by visual
inspection.
GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG
35
Advantages of manual alignment:
(1) use of a powerful and trainable tool (the
brain, well… some brains).
(2) ability to integrate additional data, e.g.,
domain structure, biological function.
36
37
Protein Alignment may be guided by
Secondary and Tertiary Structures
Escherichia coli
DjlA protein
Homo sapiens
DjlA protein
38
Disadvantages of manual alignment:
subjectivity (the algorithm is unspecified)
irreproducibility (the results cannot be
independently reproduced)
unscalability (inapplicable to long sequences)
incommensurability (the results cannot be
compared to those obtained by other
methods)
39
The dot-matrix
method (Gibbs and
McIntyre, 1970): The
two sequences are written
out as column and row
headings of a two-dimensional matrix. A dot
is put in the dot-matrix
plot at a position where
the nucleotides in the two
sequences are identical.
40
The alignment
is defined by a
path from the
upper-left
element to the
lower-right
element.
41
There are 4 possible steps in the path:
(1) a diagonal step through
a dot = match.
(2) a diagonal step through
an empty element of the
matrix = mismatch.
(3) a horizontal step = a
gap in the sequence on
the left of the matrix.
(4) a vertical step = a gap
in the sequence on the
top of the matrix.
42
A dot matrix may become cluttered.
With DNA sequences, ~25% of the
elements will be occupied by dots by
chance alone.
43
window size =1
stringency = 1
alphabet size = 4
The number of spurious matches is determined by:
window size (how many residues are compared),
stringency (the minimum number of matches for a
hit), & alphabet size (number of character
states). Window size must be an odd number.
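The effect of window size and stringency can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `dot_matrix` and `render` are hypothetical, a dot is placed at the centre of each window, and cells too close to the edge for a full window are simply left empty.

```python
def dot_matrix(top_seq, left_seq, window=1, stringency=1):
    """Return rows of booleans; True marks a dot at (left index, top index)."""
    if window % 2 == 0:
        raise ValueError("window size must be an odd number")
    half = window // 2
    dots = [[False] * len(top_seq) for _ in left_seq]
    for i in range(half, len(left_seq) - half):
        for j in range(half, len(top_seq) - half):
            # Count identical residues in the window centred on (i, j).
            hits = sum(left_seq[i + k] == top_seq[j + k]
                       for k in range(-half, half + 1))
            dots[i][j] = hits >= stringency
    return dots


def render(dots):
    """Draw the matrix with '*' for a dot and '.' for an empty element."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)


plain = dot_matrix("GATTACA", "GATTACA")                           # window 1, stringency 1
filtered = dot_matrix("GATTACA", "GATTACA", window=3, stringency=3)
print(render(plain))
print()
print(render(filtered))
```

With window size 1 the self-comparison shows off-diagonal dots wherever the same nucleotide recurs; with a window of 3 and a stringency of 3 only the main diagonal survives.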
window size =1
stringency = 1
alphabet size = 4
window size = 3
stringency = 2
alphabet size = 4
45
window size = 1
stringency = 1
alphabet size = 20
46
Dot-matrix methods:
Advantages: By being a visual
representation, and humans being
visual animals, the method may
unravel information on the evolution
of sequences that cannot easily be
gleaned from a line alignment.
Disadvantages: May not identify
the best possible alignment.
47
Window size = 60 amino acids; Stringency = 24 matches
Advantages:
Highlighting Information
The vertical gap indicates
that a coding region
corresponding to ~75
amino acids has either
been deleted from the
human gene or inserted
into the bacterial gene.
48
Window size = 60 amino acids; Stringency = 24 matches
Advantages:
Highlighting Information
The two pairs of
diagonally oriented
parallel lines most
probably indicate that two
small internal duplications
occurred in the bacterial
gene.
49
Disadvantages:
Not possible to
identify the
best alignment.
50
Scoring Matrices & Gap Penalties
51
The true alignment between two sequences is
the one that reflects accurately the evolutionary
relationships between the sequences.
Since the true alignment is unknown, in practice
we look for the optimal alignment, which is the
one in which the numbers of mismatches and
gaps are minimized according to certain
criteria.
Unfortunately, reducing the
number of mismatches results
in an increase in the number of
gaps, and vice versa.
53
a = matches
b = mismatches
g = nucleotides in gaps
d = gaps
54
The scoring scheme comprises a gap
penalty and a scoring matrix, M(a,b), that
specifies the score for each type of match (a = b)
or mismatch (a ≠ b).
The units in a scoring matrix may be the
nucleotides in the DNA or RNA sequences, the
codons in protein-coding regions, or the amino
acids in protein sequences.
55
DNA scoring matrices are usually simple. In the
simplest scheme all mismatches are given the
same penalty.
M(a,b) is positive if a = b and negative otherwise:
$$M(a,b) \begin{cases} > 0 & \text{if } a = b \\ < 0 & \text{if } a \neq b \end{cases}$$
In more complicated matrices a distinction may be
made between transition and transversion
mismatches or each type of mismatch may be
penalized differently.
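A DNA scoring function that separates transitions from transversions might look like the sketch below. The function name `dna_score` and the numerical values (+5, -1, -3) are illustrative assumptions, not values given in these slides; the only requirement taken from the text is that matches score positive and mismatches negative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b, match=5, transition=-1, transversion=-3):
    """Score one nucleotide pair; the default values are hypothetical."""
    if a == b:
        return match
    # A transition stays within one chemical class (purine <-> purine or
    # pyrimidine <-> pyrimidine); a transversion crosses classes.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


print(dna_score("A", "A"), dna_score("A", "G"), dna_score("A", "C"))
```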
56
Further complications:
Distinguishing among different
matches and mismatches.
For example, a mismatched pair consisting
of Leu & Ile, which are very similar
biochemically to each other, may be given a
lesser penalty than a mismatched pair
consisting of Arg & Glu, which are very
dissimilar from each other.
57
Lesser penalty than
58
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
B = asx (asp or asn)
Z = glx (glu or gln)
X = unknown
* = termination codon
The matrix is symmetrical
Positive numbers on the diagonal
Mismatches are usually penalized
Some mismatches are not penalized
A few mismatches are even rewarded
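The properties listed for the matrix can be illustrated with a handful of entries. The sketch below hard-codes four scores from the standard BLOSUM62 table (A/A = 4, L/I = 2, A/T = 0, L/D = -4); the dictionary, the function names and the choice of pairs are illustrative assumptions, and a real analysis would load the complete matrix.

```python
# A four-entry excerpt of BLOSUM62, keyed by residue pair (one-letter codes).
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,    # positive number on the diagonal
    ("L", "I"): 2,    # a mismatch that is rewarded
    ("A", "T"): 0,    # a mismatch that is not penalized
    ("L", "D"): -4,   # a mismatch that is penalized
}


def excerpt_score(a, b):
    """Look up a pair in either order, because the matrix is symmetrical."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def describe(a, b):
    """Classify a pair using the wording of the slides."""
    score = excerpt_score(a, b)
    if a == b:
        return "match"
    if score < 0:
        return "penalized mismatch"
    if score == 0:
        return "mismatch not penalized"
    return "rewarded mismatch"


for pair in [("A", "A"), ("I", "L"), ("T", "A"), ("D", "L")]:
    print(pair, excerpt_score(*pair), describe(*pair))
```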
65
Gap penalty (or cost) is a factor (or a set
of factors) by which the gap values
(numbers and lengths of gaps) are
mathematically manipulated to make the
gaps equivalent in value to the mismatches.
The gap penalties are based on our
assessment of how frequently different
types of insertions and deletions occur in
evolution in comparison with the frequency
of occurrence of point substitutions.
66
Mismatches
Gaps
The gap penalty has two
components: a gap-opening
penalty and a gap-extension
penalty.
68
Three main gap-penalty systems:
(1) Fixed gap-penalty system = 0 gap-extension costs.
69
Three main gap-penalty systems:
(2) Linear gap-penalty system = the gap-extension cost is calculated
by multiplying the gap length minus 1 by a constant representing the
gap-extension penalty for increasing the gap by 1.
70
Three main gap-penalty systems:
(3) Logarithmic gap-penalty system = the gap-extension
penalty increases with the logarithm of the gap length,
i.e., more slowly than in the linear system.
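The three systems can be compared side by side with a small function. This is a sketch under assumptions: the name `gap_penalty`, the default costs (opening 4, extension 1) and the exact logarithmic form (opening + extension × ln of the gap length) are illustrative choices; the slides only state that the extension cost grows with the logarithm of the gap length. Costs are returned as positive magnitudes to be subtracted from an alignment score.

```python
import math


def gap_penalty(length, system, opening=4.0, extension=1.0):
    """Cost of a single gap of the given length under one of three systems."""
    if length < 1:
        raise ValueError("a gap must contain at least one position")
    if system == "fixed":
        # No gap-extension cost: every gap costs the same, whatever its length.
        return opening
    if system == "linear":
        # The extension cost is (length - 1) times a constant.
        return opening + extension * (length - 1)
    if system == "logarithmic":
        # Assumed form: the extension cost grows with ln(length).
        return opening + extension * math.log(length)
    raise ValueError("unknown gap-penalty system: " + system)


for k in (1, 2, 5, 20):
    print(k, [round(gap_penalty(k, s), 2)
              for s in ("fixed", "linear", "logarithmic")])
```

All three systems agree for a gap of length 1 and diverge as the gap grows, with the logarithmic cost rising more slowly than the linear one.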
71
Alignment algorithms
72
Aim: Given a predetermined
set of criteria, find the
alignment associated with the
best score from among all
possible alignments.
The OPTIMAL ALIGNMENT
73
The number of possible alignments may
be astronomical.
$$\binom{n+m}{\min(n,m)} = \frac{(n+m)!}{n!\,m!} \approx \frac{(n+m)^{n+m}}{n^{n}\,m^{m}} \sqrt{\frac{n+m}{2\pi nm}}$$
where n and m are the lengths of the
two sequences to be aligned.
74
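How quickly such a count grows can be checked numerically. The sketch below assumes that the count is read as the binomial coefficient (n+m)!/(n!m!) together with its Stirling-type approximation; that reading, and the function names `alignment_count` and `stirling_estimate`, are assumptions made for illustration.

```python
import math


def alignment_count(n, m):
    """Exact binomial count (n + m)! / (n! m!)."""
    return math.comb(n + m, min(n, m))


def stirling_estimate(n, m):
    """Stirling-type approximation of the same binomial coefficient."""
    ratio = (n + m) ** (n + m) / (n ** n * m ** m)
    return ratio * math.sqrt((n + m) / (2 * math.pi * n * m))


for length in (5, 10, 20, 40):
    exact = alignment_count(length, length)
    approx = stirling_estimate(length, length)
    print(length, exact, f"{approx:.3e}")
```

Even for two sequences of 40 residues each the count already exceeds 10^23, which is why exhaustive enumeration is not an option.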
The number of possible alignments may
be astronomical.
For example, when two DNA sequences
200 residues long each are compared,
there are more than $10^{153}$ possible
alignments.
In comparison, the number of protons in
the universe is only ~$10^{80}$.
75
FORTUNATELY:
There are computer algorithms for
finding the optimal alignment
between two sequences that do not
require an exhaustive search of all
the possibilities.
76
The
Needleman-Wunsch (1970) algorithm
uses
Dynamic Programming
77
Dynamic programming = a computational
technique. It is applicable when large
searches can be divided into a succession of
small stages, such that (1) the solution of
the initial search stage is trivial, (2) each
partial solution in a later stage can be
calculated by reference to only a small
number of solutions in an earlier stage, and
(3) the last stage contains the overall
solution.
78
Dynamic programming can be
applied to problems of alignment
because ALIGNMENT SCORES
obey the following rules:
$$S_{1 \to x,\; 1 \to y} = S_{x,\,y} + S_{1 \to x-1,\; 1 \to y-1}$$
79
Path Graph for aligning two sequences
80
allowed
81
not allowed
82
Scoring scheme
match = +5
mismatch = –3
gap-opening penalty = –4
gap-extension penalty = 0
84
Matrix initialization
0 + match = 5
0 + gap = –4
0 + gap = –4
Matrix fill
0 + match = 5
5 + gap = 1
0 + gap = –4
… and so on and so forth
Complete matrix fill
Trace back
The alignment is produced by following the best
pointers backward, from right to left, from a
starting cell. If terminal gaps are allowed, the
starting cell is the highest score in either the
rightmost column or the bottom row; if they are
not allowed, it is the bottom rightmost cell.
This stage is called the traceback. The
graph of pointers in the traceback is also
referred to as the path graph because it
defines the paths through the matrix that
correspond to the optimal alignment or
alignments.
95
Trace back (if we DO allow terminal gaps)
Trace back (if we DO NOT allow terminal gaps)
10 + gap ≠ 11; 10 + gap ≠ 11; 14 + mismatch = 11
10 + gap ≠ 14; 5 + gap ≠ 14; 9 + match = 14
4 + mismatch ≠ 9; 0 + gap ≠ 9; 13 + gap = 9
8 + match = 13; 9 + gap ≠ 13; 4 + gap ≠ 13
12 + gap = 8; 3 + match = 8; –1 + gap ≠ 8
7 + match = 12; 7 + gap ≠ 12; 3 + gap ≠ 12
7 + gap = 3; –2 + mismatch ≠ 3; –6 + gap ≠ 3
…
high road/low road/middle road
Trace back (complete)
Two possible alignments:
GAATTCAGT
GGA-TC-GA
* * ** *
GAATTCAGT
GGAT-C-GA
* ** * *
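The fill and traceback stages can be reproduced with a short Python sketch of Needleman-Wunsch-style dynamic programming. It uses the scoring scheme of the worked example (match +5, mismatch -3) and, as a simplifying assumption, charges -4 for every gap position, including terminal ones; this coincides with a gap-opening penalty of -4 and a gap-extension penalty of 0 only when all gaps are one residue long. The function name and the tie-breaking order in the traceback (diagonal first) are illustrative choices, so only one of the equally good alignments is returned.

```python
def needleman_wunsch(seq_a, seq_b, match=5, mismatch=-3, gap=-4):
    """Global alignment; returns (score, aligned seq_a, aligned seq_b)."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    score = [[0] * cols for _ in range(rows)]
    # Initialization: terminal gaps are penalized like any other gap.
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
    # Fill: each cell depends only on its three neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[j - 1] == seq_b[i - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # diagonal step
                              score[i - 1][j] + gap,        # vertical step
                              score[i][j - 1] + gap)        # horizontal step
    # Traceback from the bottom rightmost cell (no free terminal gaps).
    i, j = rows - 1, cols - 1
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_a[j - 1] == seq_b[i - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(seq_a[j - 1])
                out_b.append(seq_b[i - 1])
                i, j = i - 1, j - 1
                continue
        if j > 0 and score[i][j] == score[i][j - 1] + gap:
            out_a.append(seq_a[j - 1])   # gap in seq_b
            out_b.append("-")
            j -= 1
        else:
            out_a.append("-")            # gap in seq_a
            out_b.append(seq_b[i - 1])
            i -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("GAATTCAGT", "GGATCGA")
print(best)
print(top)
print(bottom)
```

Under these assumptions the optimal score for the two example sequences is 11, and the returned alignment is one of the two listed above.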
Scoring Matrices
Mismatch and gap penalties
should be inversely proportional to
the frequencies with which
changes occur.
107
Transitions (68%) occur more frequently than transversions (32%).
Mismatch penalties for transitions should be smaller than those
for transversions.
             To A                      To T                      To C                      To G                      Row totals
From A       -                         3.4 ± 0.7 (3.6 ± 0.7)     4.5 ± 0.8 (4.8 ± 0.9)     12.5 ± 1.1 (13.3 ± 1.1)   20.3 (21.6)
From T       3.3 ± 0.6 (3.5 ± 0.6)     -                         13.8 ± 1.9 (14.7 ± 2.0)   3.3 ± 0.6 (3.5 ± 0.6)     20.4 (21.7)
From C       4.2 ± 0.5 (4.2 ± 0.5)     20.7 ± 1.3 (16.4 ± 1.3)   -                         4.6 ± 0.6 (4.4 ± 0.6)     29.5 (25.1)
From G       20.4 ± 1.4 (21.9 ± 1.5)   4.4 ± 0.6 (4.6 ± 0.6)     4.9 ± 0.7 (5.2 ± 0.8)     -                         29.7 (31.6)
Column       27.9 (29.5)               28.5 (24.6)               23.2 (23.2)               20.5 (21.3)
totals
108
Empirical substitution matrices
PAM (Percent/Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)
109
PAM
• Developed by Margaret Dayhoff in 1978.
• Based on comparisons of very similar
protein sequences.
110
Log-odds ratios
• A scoring matrix is a table of values that describe the probability of
a residue (amino acid or base) pair occurring in an alignment.
• The values in a scoring matrix are log ratios of two probabilities.
One is the probability of the pair occurring by chance (the random
probability). The other is the empirically observed probability of the
pair occurring.
• Because the scores are logarithms of probability ratios, they can be
added to give a meaningful score for the entire alignment. The more
positive the score, the better the alignment!
111
The PAM matrices
(Percent accepted mutations)
• Align sequences that are at least 85% identical.
– Minimizes ambiguity in alignments and the number of coincident mutations.
• Reconstruct phylogenetic trees and infer ancestral sequences.
• Tally replacements "accepted" by natural selection, in all pairwise
comparisons.
– Meaning, the number of times j was replaced by i in all comparisons.
• Compute amino acid mutability (i.e., the propensity of a given amino
acid, j, to be replaced).
112
The PAM matrices
• Combine data to produce a Mutation Probability Matrix
for one PAM of evolutionary distance, which is used to
calculate the Log Odds Matrix for similarity scoring.
• Thus, depending on the protein family used, various PAM
matrices result, some of which are “good” at locating
evolutionarily distant conserved mutations and some that
are good at locating evolutionarily close conserved
mutations.
113
More on log-odds ratios
In PAM, log-odds scores are multiplied by 10 to avoid decimals. Therefore, a PAM
score of 2 actually corresponds to a log-odds ratio of 0.2 for the substitution of i by j:
$$0.2 = \log_{10} \frac{\text{observed } i \to j \text{ mutation rate}}{\text{expected rate}}$$
The value 0.2 is log10 of the relative expectation value of the mutation. Therefore,
the expectation value is $10^{0.2} \approx 1.6$.
So, a PAM score of 2 indicates that (in related sequences) the mutation would
be expected to occur 1.6 times more frequently than random.
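The conversion between a scaled score and an odds ratio is simple enough to check in code. The sketch below assumes the scaling factor of 10 stated for PAM; the function names are hypothetical.

```python
import math


def pam_score_to_odds(score, scale=10):
    """Convert a scaled log-odds score back to an odds ratio."""
    return 10 ** (score / scale)


def odds_to_pam_score(odds, scale=10):
    """Convert an odds ratio to a scaled, rounded log-odds score."""
    return round(scale * math.log10(odds))


print(round(pam_score_to_odds(2), 1))   # odds ratio for a PAM score of 2
print(odds_to_pam_score(1.6))           # and back again

# Adding scores corresponds to multiplying odds ratios, which is why
# per-position scores can be summed over a whole alignment.
combined = pam_score_to_odds(2 + 3)
product = pam_score_to_odds(2) * pam_score_to_odds(3)
print(math.isclose(combined, product))
```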
114
PAM250
– Calculated for families of related proteins (>85%
identity)
– 1 PAM is the amount of evolutionary change that
yields, on average, one substitution in 100 amino
acid residues
– A positive score signifies a common replacement
whereas a negative score signifies an unlikely
replacement
– PAM250 matrix assumes/is optimized for
sequences separated by 250 PAM, i.e. 250
substitutions in 100 amino acids (longer
evolutionary time)
115
PAM250
Sequence alignment matrix that allows 250 accepted point
mutations per 100 amino acids. PAM250 is suitable for
comparing distantly related sequences, while a lower PAM is
suitable for comparing more closely related sequences.
116
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local
similarities.
• High PAM numbers: long sequences, weak similarities.
– PAM60 for close relations (60% identity)
– PAM120 recommended for general use (40% identity)
– PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
– PAM40, PAM120, PAM250 recommended.
117
BLOSUM
• Blocks Substitution Matrix
– Steven Henikoff and Jorja G. Henikoff (1992).
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function.
– Highly conserved protein domains.
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment.
– Counts amino acids observed in same column.
– Symmetrical model of substitution.
118
BLOSUM62
• BLOSUM matrices are based on local alignments (“blocks” or
conserved amino acid patterns).
• BLOSUM 62 is a matrix calculated from comparisons of
sequences with no more than 62% identity.
• All BLOSUM matrices are based on observed alignments; they
are not extrapolated from comparisons of closely related
proteins.
• BLOSUM 62 is the default matrix in BLAST 2.0.
119
BLOSUM Matrices
• Different BLOSUMn matrices are
calculated independently from BLOCKS
• BLOSUMn is based on sequences that
are at most n percent identical.
120
BLOSUM62
The procedure for calculating a BLOSUM matrix is based on a
likelihood method estimating the occurrence of each possible
pairwise substitution. Only aligned blocks are used to calculate the
BLOSUMs.
The higher the score, the more closely
related the sequences.
121
Why is BLOSUM62 called
BLOSUM62?
Because all blocks whose members shared at least 62%
identity with ANY other member of that block were
averaged and represented as 1 sequence.
122
Selecting a BLOSUM Matrix
• For BLOSUMn, higher n suitable for
sequences which are more similar
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations
123
Equivalent PAM and Blosum
matrices
The following matrices are roughly equivalent...
•PAM100 ==> Blosum90
•PAM120 ==> Blosum80
•PAM160 ==> Blosum60
•PAM200 ==> Blosum52
•PAM250 ==> Blosum45
Less
divergent
More
divergent
Generally speaking...
•The Blosum matrices are best for detecting local alignments.
•The Blosum62 matrix is the best for detecting the majority of
weak protein similarities.
•The Blosum45 matrix is the best for detecting long and weak
alignments.
Comparison of PAM250 and
BLOSUM62
The relationship between BLOSUM and PAM substitution
matrices:
BLOSUM matrices with higher numbers and PAM matrices with
low numbers are both designed for comparisons of closely related
sequences.
BLOSUM matrices with low numbers and PAM matrices with high
numbers are designed for comparisons of distantly related
proteins.
If distant relatives of the query sequence are specifically being
sought, the matrix can be tailored to that type of search.
125
Scoring matrices commonly used
• PAM250
– Shown to be appropriate for searching for sequences of
17-27% identity.
• BLOSUM62
– Though it is tailored for comparisons of moderately
distant proteins, it performs well in detecting closer
relationships.
• BLOSUM50
– Shown to be better for FASTA searches.
126
Effect of gap penalties on amino-acid alignment
Human pancreatic hormone precursor versus chicken
pancreatic hormone
(a) Penalty for gaps is 0
(b) Penalty for a gap of size k nucleotides is wk = 1 + 0.1k
(c) The same alignment as in (b), only the similarity between
the two sequences is further enhanced by showing pairs of
biochemically similar amino acids
Alignments: things to keep in
mind
“Optimal alignment” means “having the highest possible
score, given a substitution matrix and a set of gap
penalties”
This is NOT necessarily the most meaningful alignment
The assumptions of the algorithm are often wrong:
- substitutions are not equally frequent at all positions,
- it is very difficult to realistically model insertions and
deletions.
Pairwise alignment programs ALWAYS produce an alignment
(even when it does not make sense to align sequences)