Sequence Alignment

Bioinformatics
Sequence Analysis I
Ulf Schmitz
ulf.schmitz@informatik.uni-rostock.de
Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de
Ulf Schmitz, Sequence Analysis I
Outline
1. Introduction to sequence alignment
2. Pairwise sequence alignment
a) The Dot Matrix
b) Scoring Matrices
c) Gap Penalties
d) Dynamic Programming
Introduction to sequence alignment
Sequence Alignment is the identification of
residue-residue correspondences.
• It is the basic tool of bioinformatics.
Introduction to sequence alignment
Evolution:
Ancestral sequence: ABCD
– mutation: ACCD (B → C)
– deletion: ABD (C → ø)
Pairwise Alignment
ACCD
AB-D
(true alignment)
or
ACCD
A-BD
Introduction to sequence alignment
A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    * ****
(asterisks mark columns in which both sequences carry the same residue)
Introduction to sequence alignment
Why do sequence alignment?
• for discovering functional, structural and
evolutionary information in biological sequences
• eases further tasks like:
– annotation of new sequences
– modelling of protein structures
– design and analysis of gene expression
experiments
The diversity of species
• the biochemistry of a newly created species is not completely
reconfigured
• and new functionality isn’t created by sudden
appearance of whole new genes
• incremental modifications give rise to genetic diversity
and novel function
An ancestor sequence gives rise to Sequence A after x steps and to Sequence B after y steps:
x + y = number of mismatches
Sequence Alignment
The concept
• An alignment is a mutual arrangement of two
sequences.
• It exhibits where the two sequences are similar,
and where they differ.
• An 'optimal' alignment is one that exhibits the most
correspondences and the fewest differences.
• sequences that are similar probably have the same
function
Sequence Alignment
Terms of sequence comparison
Sequence identity
• exactly the same Amino Acid or Nucleotide in the same position
Sequence similarity
• substitutions with similar chemical properties
Sequence homology
• general term that indicates evolutionary relatedness among sequences
• sequences are homologous if they are derived from a common ancestral
sequence
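• one often speaks of a percentage of sequence homology, although strictly
homology is either present or absent; percentages apply to identity and similarity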
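As a small illustration of the identity term above, here is a sketch that computes percent identity from two already aligned rows. The function name `percent_identity` and the choice of denominator (all alignment columns, including gapped ones) are assumptions made for this example; other conventions exist, for instance dividing by the length of the shorter sequence.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percentage of alignment columns that hold the same residue in both rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # Skip columns that are gaps in both rows; they carry no information.
    columns = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


# ACCD aligned with AB-D: columns A/A and D/D are identical, 2 of 4 columns.
print(percent_identity("ACCD", "AB-D"))  # 50.0
```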
Sequence Alignment
Things to consider:
• to find the best alignment one needs to examine all
possible alignments
• to reflect the quality of the possible alignments one
needs to score them
• there can be different alignments with the same highest
score
• variations in the scoring scheme may change the ranking
of alignments
Sequence Alignment
A good pairwise alignment with end gaps, indels, and substitutions
Methods of Sequence Alignment
(1) Dot Matrix analysis
(2) Dynamic Programming (or DP)
(3) Local Sequence Alignment
(4) Multiple Sequence Alignment
The Dot Matrix
• established in 1970 by A.J. Gibbs and G.A. McIntyre
• method for comparing two amino acid or nucleotide sequences
Example grid: the sequences AGCTAGGA and GACTAGGC, one along each axis
• each sequence forms one axis of the grid
• a dot is placed at every intersection where the same letter appears in both
sequences
• scan the graph for a diagonal series of dots
– reveals similarity
– or a string of identical characters
• longer sequences can also be compared on a single page by
using smaller dots
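The construction described above can be sketched in a few lines. This is a minimal text-mode version, not the output of any of the tools named in these slides; the function names and the choice of which sequence goes on which axis are assumptions.

```python
def dot_matrix(seq_x, seq_y):
    """Return one text row per character of seq_y; '*' marks identical letters."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


def print_dot_matrix(seq_x, seq_y):
    """Print the grid with seq_x along the top and seq_y down the side."""
    print("  " + seq_x)
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        print(y + " " + row)


print_dot_matrix("AGCTAGGA", "GACTAGGC")
```

Runs of `*` along a diagonal correspond to stretches that the two sequences share.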
The Dot Matrix
The very stringent, self-dotplot:
The non-stringent self-dotplot:
The Dot Matrix
• to filter out random matches, one uses
sliding windows
• a dot is printed only if a minimal number of
matches occur
• rule of thumb:
– larger windows for DNA (only 4 bases, so more
random matches)
– a typical window size is 15 with a stringency of 10
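The sliding-window filter can be sketched as follows. The function name `windowed_dotplot` and the convention of reporting a dot at the window start position are assumptions for illustration; real tools may place the dot at the window centre instead.

```python
def windowed_dotplot(seq_x, seq_y, window=15, stringency=10):
    """Return (i, j) window start positions with at least `stringency` matches.

    A window of length `window` is laid over seq_x at i and seq_y at j, and the
    positions are compared pairwise along the diagonal.
    """
    dots = []
    for j in range(len(seq_y) - window + 1):
        for i in range(len(seq_x) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i, j))
    return dots


# Tiny example with a window of 3 and a stringency of 2.
print(windowed_dotplot("ACGTA", "ACCTA", window=3, stringency=2))
# [(0, 0), (1, 1), (2, 2)]
```

Raising the stringency towards the window size removes more random matches but also hides weakly conserved regions.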
The Dot Matrix
Two similar, but not identical, sequences
An indel (insertion or deletion):
The Dot Matrix
A tandem duplication:
Self-dotplot of a tandem duplication:
The Dot Matrix
An inversion:
Joining sequences:
The Dot Matrix
Self dotplot with repeats:
The Dot Matrix
• the dot matrix method reveals the presence of
insertions or deletions
• comparing a single sequence to itself can reveal
the presence of a repeat of a subsequence
• self comparison can reveal several features:
– similarity between chromosomes
– tandem genes
– repeated domains in a protein sequence
– regions of low sequence complexity (same characters are often repeated)
Tools generating Dot Matrices
• Dotlet (Java based web-application)
http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• compare & dotplot programmes in GCG Wisconsin
Package (Genetics Computer Group [commercial])
• GeneAssist package of ABI/Perkin Elmer
• DOTTER (available on dapsas, UNIX X-Windows)
• DNA Strider (Macintosh only)
The Dot Matrix
• When to use the Dot Matrix method?
– as a first step, unless the sequences are already known to be very much alike
• limits of the Dot Matrix
– doesn’t readily resolve similarity that is interrupted by
insertions or deletions
– difficult to find the best possible alignment (optimal
alignment)
– most computer programs don’t show an actual alignment
Next step
We must define quantitative measures of sequence
similarity and difference!
• Hamming distance:
– # of positions with mismatching characters
AGCT
CGTA
Hamming distance = 3
• Levenshtein distance:
– # of operations required to change one string into the
other (deletion, insertion, substitution)
AG-TCC
CGCTCA
Levenshtein distance = 3
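Both distances can be checked with a short sketch. The function names are chosen for this example; the Levenshtein routine uses the usual row-by-row dynamic programming formulation with unit cost for each operation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of deletions, insertions and substitutions turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(hamming("AGCT", "CGTA"))          # 3
print(levenshtein("AGTCC", "CGCTCA"))   # 3
```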
Scoring
• +1 for a match, -1 for a mismatch?
• should gaps be allowed?
– if yes how should they be scored?
• what is the best algorithm for finding the
optimal alignment of two sequences?
• is the produced alignment significant?
Needleman-Wunsch
• Simplified Needleman-Wunsch alignment
• all matches are given a score of 1
• mismatches score 0 (not shown)
• the diagonal 1s are added sequentially
• best alignment is found starting with the
sequence characters that correspond to
the highest number and tracing back
through the positions that contributed to
this highest score
(Matrix figure: cumulative match scores for two example sequences; the highest score shown is 5.)
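For comparison, here is a sketch of the standard dynamic programming formulation of global alignment. It is not a transcription of the simplified matrix on this slide: it fills the matrix from the top-left corner and traces back from the bottom-right corner. The defaults (match 1, mismatch 0, gap 0) mirror the scores listed above; the function name and the linear `gap` parameter are assumptions for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    `gap` is the score added per gap position (zero or negative).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back through the cells that produced the final score.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("ACCD", "ABD", match=1, mismatch=-1, gap=-1)
print(result[0])
print(result[1])
print(result[2])
```

Several alignments can share the best score; this traceback simply returns the first one it finds.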
Nucleic Acid Scoring Scheme
• Transition mutation (more common)
– purine ↔ purine (A ↔ G)
– pyrimidine ↔ pyrimidine (T ↔ C)
• Transversion mutation
– purine ↔ pyrimidine (A, G ↔ T, C)

     A    G    T    C
A   20   10    5    5
G   10   20    5    5
T    5    5   20   10
C    5    5   10   20
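The scoring scheme above follows directly from the purine/pyrimidine classes, which a short sketch makes explicit. The values 20, 10 and 5 are taken from the table; the function names are assumptions for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"T", "C"}

# Values from the table above: identity 20, transition 10, transversion 5.
SCORES = {"identity": 20, "transition": 10, "transversion": 5}


def substitution_type(x, y):
    """Classify a nucleotide pair as identity, transition or transversion."""
    if x == y:
        return "identity"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def nucleotide_score(x, y):
    """Look up the score of a nucleotide pair in the scheme above."""
    return SCORES[substitution_type(x, y)]


for x in "AGTC":
    print(x, [nucleotide_score(x, y) for y in "AGTC"])
```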
Amino acid exchange matrices
Amino acids are not equal:
1. Some are easily substituted because they have
similar:
• physico-chemical properties
• structure
2. Some mutations between amino acids occur more
often due to similar codons
These two observations give us ways to define
substitution matrices
Properties of Amino Acids
Sequence similarity
• substitutions with similar chemical properties
Scoring Matrices
• table of values that describe the probability of a residue pair occurring in
an alignment
• the values are logarithms of ratios of two probabilities:
1. probability of random occurrence of an amino acid (diagonal)
2. probability of meaningful occurrence of a pair of residues
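A log-odds entry of this kind can be sketched as below. The form used here, the logarithm of the observed pair frequency divided by the product of the two background frequencies, is the common textbook formulation and is an assumption about what the slide intends; the frequencies in the example are made up for illustration and come from no real matrix.

```python
import math


def log_odds(pair_freq, freq_a, freq_b, base=2.0):
    """Log of (observed frequency of the pair / frequency expected by chance)."""
    expected = freq_a * freq_b
    return math.log(pair_freq / expected, base)


# Hypothetical frequencies: the pair is seen twice as often as chance predicts,
# so the score is positive.
print(log_odds(0.125, 0.25, 0.25))
# Seen half as often as chance predicts, so the score is negative.
print(log_odds(0.03125, 0.25, 0.25))
```

Positive entries mark substitutions that occur more often than expected by chance, negative entries those that occur less often.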
Scoring Matrices
Widely used matrices
• PAM (Percent Accepted Mutation) / MDM (Mutation Data Matrix) / Dayhoff
– Derived from global alignments of closely related sequences.
– Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
– The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater
numbers are greater distances.
– PAM-1 corresponds to about 1 million years of evolution.
– For distant (global) alignments, use BLOSUM50, Gonnet, or (still) PAM250.
• BLOSUM (Blocks Substitution Matrix)
– Derived from local, ungapped alignments of distantly related sequences.
– All matrices are directly calculated; no extrapolations are used.
– The number after the matrix (BLOSUM62) refers to the minimum percent identity of the
blocks used to construct the matrix; greater numbers are lesser distances.
– The BLOSUM series of matrices generally performs better than PAM matrices for local
similarity searches.
– For local alignment, BLOSUM62 is often superior.
• Structure-based matrices
• Specialized matrices
Scoring Matrices
The relationship between BLOSUM and PAM substitution matrices
• BLOSUM matrices with higher numbers and PAM matrices with low numbers
are designed for comparisons of closely related sequences.
• BLOSUM matrices with low numbers and PAM matrices with high numbers
are designed for comparisons of distantly related proteins.
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html
Gap Penalties
• the cost of opening a gap is higher than the gap extension penalty
• this reflects the tendency for insertions and deletions to occur over
several residues at a time
• BLOSUM 62: -11 for gap opening; -1 for gap extension
• BLOSUM 50: -12/-1 penalty
• Gap penalties are always a problem:
– There is no formal way to calculate their values.
– you can follow recommended settings, but these are based on trial and error
and not on a formal framework.
gap initiation:
A A A G A A A
A A A - A A A
gap extension:
A A A G G G G A A A
A A A - - - - A A A
Gap penalties
1. Linear: gp(k)=ak
2. Affine: gp(k)=b+ak
3. Concave, e.g.: gp(k)=log(k)
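The three penalty shapes can be compared side by side with a short sketch. The function names are assumptions; the default values b = 11 and a = 1 are the BLOSUM 62 magnitudes quoted earlier, written here as positive costs. Under gp(k) = b + ak as given above, a gap of length 1 costs b + a; some programs instead charge only b for the first gap position.

```python
import math


def linear_gap(k, a=1.0):
    """Linear penalty: every gap position costs the same."""
    return a * k


def affine_gap(k, b=11.0, a=1.0):
    """Affine penalty: a fixed opening cost b plus a per gap position."""
    return b + a * k


def concave_gap(k):
    """Concave penalty: each additional gap position costs less than the last."""
    return math.log(k)


for k in range(1, 5):
    print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 3))
```

With the affine form, one gap of length 4 (cost 15) is much cheaper than four separate gaps of length 1 (cost 48), which is the behaviour motivated on the previous slide.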
Scoring alignments
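$$S_{a,b} = \sum_{l} s(a_i, b_j) + \sum_{k} N_k \cdot gp(k)$$
with affine gap penalties
$$gp(k) = \mathrm{gapinit} + k \cdot \mathrm{gapextension}$$
Here s(a_i, b_j) is taken from the amino acid exchange matrix, the first sum runs over the
aligned residue pairs, and N_k is the number of gaps of length k.
Inputs: amino acid exchange matrix; affine gap penalties (open, extension)
Example:
T D W V T A L K
T D W L - - I K
Score: s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px + s(L,I) + s(K,K)
where Po is the gap opening penalty and Px the gap extension penalty, both written as
positive magnitudes (gapinit = -Po, gapextension = -Px)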
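The scoring of the example alignment can be reproduced with a sketch like the following. The function `score_alignment` and the stand-in substitution function `toy_score` (2 for identical residues, -1 otherwise) are hypothetical and used only so the example runs; a real exchange matrix such as BLOSUM 62 would replace them. Penalties are passed as positive magnitudes, matching Po and Px above.

```python
import re


def score_alignment(row_a, row_b, s, gap_open, gap_ext):
    """Sum of pair scores minus (gap_open + k * gap_ext) per gap run of length k."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    # Substitution part: columns where both rows hold a residue.
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            total += s(x, y)
    # Gap part: each maximal run of '-' in either row counts as one gap.
    for row in (row_a, row_b):
        for run in re.findall(r"-+", row):
            total -= gap_open + len(run) * gap_ext
    return total


def toy_score(x, y):
    """Hypothetical stand-in for an exchange matrix lookup."""
    return 2 if x == y else -1


# Four identities (8), two mismatches (-2), one gap of length 2 (3 + 2*1 = 5).
print(score_alignment("TDWVTALK", "TDWL--IK", toy_score, gap_open=3, gap_ext=1))  # 1
```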
Sequence Analysis
to be continued ...