2 pairwise sequence alignments

Pairwise Sequence Alignment
Presented by Liu Qi
Why align sequences?
Functional predictions are based on identifying homologues.
Assumes: conservation of sequence implies conservation of function.
BUT: function is carried out at the level of proteins (3-D structure), whereas sequence conservation operates at the level of DNA (1-D sequence).

Some Definitions
An alignment is a mutual arrangement of
two sequences, which exhibits where the
two sequences are similar, and where they
differ.
 An optimal alignment is one that exhibits
the most correspondences and the least
differences. It is the alignment with the
highest score. May or may not be
biologically meaningful.

Methods



Dot matrix
Dynamic Programming
Word, k-tuple (heuristic based)
Brief intro of methods
• dot matrix - all possible matches between sequence residues are found; used to compare two sequences to look for regions where they may align; very useful for finding indels and repeats in sequences; can be used as a first pass to see if there is any similarity between sequences
• dynamic programming - mathematically guaranteed to find an optimal alignment (global or local) between pairs of sequences under the chosen scoring scheme; computationally expensive - the number of steps grows with the product of the two sequence lengths (quadratically for sequences of similar length)
• k-tuple (word) methods - used by FASTA and BLAST (previously described); much faster than dynamic programming and ideal for database searches; use heuristics that do not guarantee an optimal alignment but are nevertheless very reliable
Dot matrix
1 - one sequence listed along top of page
and second sequence listed along the side
2 - move across row and put dot in any
column where the character is the same
3 - continue for each row until all possible
character matches between the
sequences are represented by dots
4 - diagonal rows of dots reveal sequence
similarity (can also find repeats and inverted
repeats off the main diagonal)
5 - isolated dots represent random similarity unrelated to the alignment
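The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dot_matrix` and `render_dot_matrix` and the test sequences are assumptions, not part of the original slides.

```python
def dot_matrix(top, side):
    """Return grid[r][c] == True wherever side[r] and top[c] are the same character."""
    return [[s == t for t in top] for s in side]


def render_dot_matrix(top, side):
    """Draw the matrix as text: one sequence along the top, the other down the side."""
    lines = ["  " + top]
    for ch, row in zip(side, dot_matrix(top, side)):
        lines.append(ch + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical toy sequences; diagonal runs of '*' mark similar regions.
    print(render_dot_matrix("GATTACA", "GATCACA"))
```

Diagonal runs of `*` correspond to the diagonal rows of dots in step 4; scattered single `*` characters are the random matches of step 5.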
Dot matrix with noise reduction
Dot matrix
To improve visualisation of similar regions between sequences we use sliding windows.
Instead of writing down a dot for every character that is common to both sequences, we compare a number of positions at a time (the window size) and write down a dot whenever at least a minimum number (the stringency) of characters in the window are identical.
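The window/stringency filter described above might look like the following sketch. The function name `windowed_dot_matrix` is hypothetical, and placing the dot at the start position of each window is an assumption (some programs place it at the window centre instead).

```python
def windowed_dot_matrix(top, side, window, stringency):
    """Dot at (r, c) when at least `stringency` of the `window` positions
    starting at side[r] and top[c] hold identical characters."""
    rows = max(len(side) - window + 1, 0)
    cols = max(len(top) - window + 1, 0)
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Count identical characters along the diagonal inside the window.
            matches = sum(side[r + k] == top[c + k] for k in range(window))
            row.append(matches >= stringency)
        grid.append(row)
    return grid


if __name__ == "__main__":
    # Hypothetical toy input: a direct repeat shows up as a parallel diagonal.
    for row in windowed_dot_matrix("ACGTACGT", "ACGTACGT", window=4, stringency=4):
        print("".join("*" if hit else "." for hit in row))
```

Raising the stringency towards the window size removes more isolated dots, at the cost of also hiding weakly conserved regions.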

Dot matrix
Caution is necessary regarding the
window size and the stringency value.
Generally, they assume different values
for different problems. The optimal values
will accent the regions of similarity of the
two sequences

For DNA sequences, usually: sliding window = 15, stringency = 10
For protein sequences: sliding window = 2 or 3, stringency = 2
Things to be considered



Scoring matrix for distance correction.
Window size
Threshold
Uses of the dot plot
Regions of similarity: diagonals
Insertions/deletions: gaps
  Can determine intron/exon structure
Repeats: parallel diagonals
Inverted repeats: perpendicular diagonals
  Can be used to determine regions of base pairing of RNA molecules

Intra-sequence comparison
Repeats
Inverted repeats
Low complexity
Examples

ABRACADABRACAD
palindrome
Sequence: ATOYOTA
Repeats
Drosophila melanogaster SLIT protein against itself
Low complexity
Inter sequence comparison
Conserved domains
 Insertion and deletion

Insertion and deletion
Seq1:DOROTHYCROWFOOTHODGKIN
 Seq2:DOROTHYHODGKIN

Conserved domains
Translated DNA and protein comparison: exons and introns
Even more can be done with RNA

RNA comparisons of the reverse complement of a sequence to itself can often be very informative.
• Consider the following set of examples from the phenylalanine transfer RNA (tRNA-Phe) molecule from baker's yeast.
• The sequence and structure of this molecule are both known; the illustration will show how simple dot-matrix procedures can quickly lead to functional and structural insights (even without complex folding algorithms).
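As a rough illustration of the idea, the sketch below compares a sequence with its own reverse complement and reports shared words, which mark candidate base-paired stems. The toy hairpin sequence and the names `reverse_complement` and `shared_words` are assumptions for illustration (this is not the tRNA-Phe sequence), and only Watson-Crick pairs are considered, so G-U wobble pairs are ignored.

```python
# Watson-Crick complements for RNA (assumption: G-U wobble pairs are not modelled).
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse the sequence and complement every base."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(seq))


def shared_words(seq, k):
    """Return (i, j) pairs where the k-mer at seq[i:] equals the k-mer at
    position j of the reverse complement; these are the dots of the plot."""
    rc = reverse_complement(seq)
    hits = []
    for i in range(len(seq) - k + 1):
        for j in range(len(rc) - k + 1):
            if seq[i:i + k] == rc[j:j + k]:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    hairpin = "GGGAAAUCCC"  # hypothetical toy hairpin: GGG can pair with CCC
    print(reverse_complement(hairpin))
    print(shared_words(hairpin, 3))
```

A hit at position j of the reverse complement corresponds to position len(seq) - 1 - j of the original strand, so runs of hits point at two stretches that could fold back and pair with each other.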
Structures of tRNA-Phe
RNA comparisons of the reverse complement of a sequence to itself
Programs for Dot Matrix

Dotlet
  http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
SIGNAL
  http://innovation.swmed.edu/research/informatics/res_inf_sig.html
Dotter
  http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
COMPARE, DOTPLOT in GCG

Conclusion

Advantages:
Readily reveals the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by the other, more automated methods.
Lets your eyes and brain do the work: very efficient!

Disadvantages:
Most dot matrix computer programs do not show an
actual alignment. Does not return a score to indicate
how ‘optimal’ a given alignment is.
Reference



Gibbs, A. J. & McIntyre, G. A. (1970). The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1-11.
Maizel, J. V., Jr. & Lenk, R. P. (1981). Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proc. Natl. Acad. Sci. 78, 7665-7669.
Staden, R. (1982). An interactive graphics program for comparing and aligning nucleic acid and amino acid sequences. Nucl. Acids Res. 10 (9), 2951-2961.
Dynamic Programming
Question to answer: what is the optimal alignment (the one with the best score) of two sequences?
How many different alignments are there?
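One way to get a feel for the second question is to count alignments directly. The sketch below assumes one particular counting convention: every column is either a residue pair, a gap in X, or a gap in Y, and different orderings of adjacent gaps count as different alignments. The function name `count_alignments` is hypothetical.

```python
def count_alignments(n, m):
    """Number of global alignments of sequences of length n and m under the
    convention stated above: the last column is a pair, a gap in X, or a gap in Y."""
    counts = [[1] * (m + 1) for _ in range(n + 1)]  # one way when either prefix is empty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = counts[i - 1][j - 1] + counts[i - 1][j] + counts[i][j - 1]
    return counts[n][m]


if __name__ == "__main__":
    for length in (3, 10, 20):
        print(length, count_alignments(length, length))
```

The count grows very quickly with sequence length, which is why enumerating every alignment is impractical and dynamic programming is used instead.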

Alignment methods with DP
Global alignment - Needleman-Wunsch
(1970) maximizes the number of matches
between the sequences along the entire
length of the sequences.
 Local alignment - Smith-Waterman
(1981) is a modification of the dynamic
programming algorithm giving the highest
scoring local match between two
sequences

Dynamic Programming

A simple example
[Figure: weighted graph with nodes A, B, C, D, E and F and numeric edge weights]
Exercise
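Conditions for applying dynamic programming
Optimal substructure: every sub-policy of an optimal policy is itself optimal.
No aftereffect: the states of earlier stages cannot directly affect future decisions.
Trading space for time (overlapping subproblems)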
Dynamic Programming
DP Algorithm for Global
Alignment


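Two sequences X = x1...xn and Y = y1...ym.
Let F(i, j) be the optimal alignment score of x1...xi and y1...yj (0 ≤ i ≤ n, 0 ≤ j ≤ m).
\[
F(0,0) = 0, \qquad
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d
\end{cases}
\]
Terms with a negative index are omitted, so the first row and column are F(i, 0) = i·d and F(0, j) = j·d. The gap penalty d is a negative number that is added to the score.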
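The recurrence above can be turned into a matrix fill as sketched below. The names `substitution_score` and `global_score_matrix` are hypothetical; the scores (match 2, A/G and C/T mismatches -5, other mismatches -7) and d = -5 are taken from the AAG/AGC example used in these slides.

```python
def substitution_score(a, b):
    """Toy DNA scores from the slides' example: match 2, A/G or C/T -5, otherwise -7."""
    if a == b:
        return 2
    transitions = {frozenset("AG"), frozenset("CT")}
    return -5 if frozenset((a, b)) in transitions else -7


def global_score_matrix(x, y, s=substitution_score, d=-5):
    """Fill F for global alignment; x runs down the rows, y across the columns."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: i gaps
        F[i][0] = i * d
    for j in range(1, m + 1):          # first row: j gaps
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] + d,                          # x_i against a gap
                F[i][j - 1] + d,                          # y_j against a gap
            )
    return F


if __name__ == "__main__":
    for row in global_score_matrix("AGC", "AAG"):
        print(row)
```

The bottom-right entry F[n][m] is the optimal global alignment score; the fill takes a number of steps proportional to n times m.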

DP in equation form
[Figure: cell F(i, j) is reached from F(i-1, j-1) by adding s(xi, yj), and from F(i-1, j) or F(i, j-1) by adding d]
A simple example
Substitution matrix s:
      A    C    G    T
A     2   -7   -5   -7
C    -7    2   -7   -5
G    -5   -7    2   -7
T    -7   -5   -7    2
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
DP matrix layout: AAG along the top, AGC down the left side. Each cell F(i, j) is the maximum of F(i-1, j-1) + s(xi, yj), F(i-1, j) + d and F(i, j-1) + d.
A simple example
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
Start with F(0, 0) = 0, then fill the first row and first column with multiples of d:
          A     A     G
     0   -5   -10   -15
A   -5
G  -10
C  -15
A simple example
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
Completed DP matrix:
          A     A     G
     0   -5   -10   -15
A   -5    2    -3    -8
G  -10   -3    -3    -1
C  -15   -8    -8    -6
Traceback





Start from the lower right corner and trace back
to the upper left.
Each arrow introduces one character at the end
of each aligned sequence.
A horizontal move puts a gap in the left
sequence.
A vertical move puts a gap in the top sequence.
A diagonal move uses one character from each
sequence.
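The traceback rules above can be sketched as follows. The function names are hypothetical, the scores come from the slides' AAG/AGC example, and the tie-breaking order (diagonal first, then vertical, then horizontal) is an assumption; it returns just one of possibly several optimal alignments.

```python
def score(a, b):
    """Toy DNA scores from the slides' example: match 2, A/G or C/T -5, otherwise -7."""
    if a == b:
        return 2
    return -5 if {a, b} in ({"A", "G"}, {"C", "T"}) else -7


def align_global(top, left, d=-5):
    """Return (aligned_top, aligned_left, score) for one optimal global alignment."""
    n, m = len(left), len(top)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(left[i - 1], top[j - 1]),
                          F[i - 1][j] + d, F[i][j - 1] + d)
    # Trace back from the lower right corner to the upper left.
    i, j = n, m
    top_aln, left_aln = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(left[i - 1], top[j - 1]):
            top_aln.append(top[j - 1])      # diagonal: one character from each
            left_aln.append(left[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            top_aln.append("-")             # vertical: gap in the top sequence
            left_aln.append(left[i - 1])
            i -= 1
        else:
            top_aln.append(top[j - 1])      # horizontal: gap in the left sequence
            left_aln.append("-")
            j -= 1
    # Characters were collected from the end, so reverse them.
    return "".join(reversed(top_aln)), "".join(reversed(left_aln)), F[n][m]


if __name__ == "__main__":
    print(align_global("AAG", "AGC"))
```

Because each step appends one column at the end of the alignment, the collected characters are reversed before being returned.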
A simple example
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
Tracing back from the lower right corner (score -6), the path passes through the cells -1 and -3 and then reaches the upper left corner either through -5 or through 2. The two optimal alignments are:
AAG-        AAG-
-AGC        A-GC
Exercise

Find the global alignment of
X = catgt
Y = acgctg
Scores: gap d = -1, mismatch = -1, match = 2

Answer
Local alignment


A single-domain protein may be homologous to
a region within a multi-domain protein.
Usually, an alignment that spans the complete
length of both sequences is not required.
Local alignment DP
Align sequence x and y.
 F is the DP matrix; s is the substitution
matrix; d is the linear gap penalty.
\[
F(0,0) = 0, \qquad
F(i,j) = \max
\begin{cases}
0 \\
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d
\end{cases}
\]
Local DP in equation form
[Figure: cell F(i, j) is the maximum of 0, F(i-1, j-1) + s(xi, yj), F(i-1, j) + d and F(i, j-1) + d]
Local alignment

Two differences with respect to global alignment:
No score is negative.
Traceback begins at the highest score in the matrix and continues until you reach 0.

Global alignment algorithm: Needleman-Wunsch.
Local alignment algorithm: Smith-Waterman.
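Both differences show up directly in code. The sketch below is a minimal local alignment under assumed names (`score`, `align_local`), using the toy scores from the slides' examples; ties in the traceback are broken diagonal first, which is an assumption, and only the first cell holding the best score is used as the start.

```python
def score(a, b):
    """Toy DNA scores from the slides' example: match 2, A/G or C/T -5, otherwise -7."""
    if a == b:
        return 2
    return -5 if {a, b} in ({"A", "G"}, {"C", "T"}) else -7


def align_local(top, left, d=-5):
    """Return (aligned_top, aligned_left, score) for one best local alignment."""
    n, m = len(left), len(top)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,                                  # difference 1: never negative
                          F[i - 1][j - 1] + score(left[i - 1], top[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Difference 2: start at the highest score and stop on reaching 0.
    i, j = best_pos
    top_aln, left_aln = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(left[i - 1], top[j - 1]):
            top_aln.append(top[j - 1])
            left_aln.append(left[i - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + d:
            top_aln.append("-")
            left_aln.append(left[i - 1])
            i -= 1
        else:
            top_aln.append(top[j - 1])
            left_aln.append("-")
            j -= 1
    return "".join(reversed(top_aln)), "".join(reversed(left_aln)), best


if __name__ == "__main__":
    print(align_local("AAG", "AGC"))
    print(align_local("GAAGGC", "AAG"))
```

Only the clamp at 0 and the traceback start and stop conditions differ from the global version; the three recurrence terms are unchanged.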

A simple example
Substitution matrix s:
      A    C    G    T
A     2   -7   -5   -7
C    -7    2   -7   -5
G    -5   -7    2   -7
T    -7   -5   -7    2
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
DP matrix layout: AAG along the top, AGC down the left side. Each cell F(i, j) is the maximum of 0, F(i-1, j-1) + s(xi, yj), F(i-1, j) + d and F(i, j-1) + d.
A simple example
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
The first row and first column are all 0:
         A    A    G
    0    0    0    0
A   0
G   0
C   0
A simple example
Find the optimal local alignment of AAG and AGC.
Use a gap penalty of d=-5.
Completed DP matrix:
         A    A    G
    0    0    0    0
A   0    2    2    0
G   0    0    0    4
C   0    0    0    0
Optimal local alignment (score 4, traced back from the cell holding 4):
AG
AG
Local alignment
Substitution matrix s:
      A    C    G    T
A     2   -7   -5   -7
C    -7    2   -7   -5
G    -5   -7    2   -7
T    -7   -5   -7    2
Find the optimal local alignment of AAG and GAAGGC.
Use a gap penalty of d=-5.
The first row and first column are all 0:
         G    A    A    G    G    C
    0    0    0    0    0    0    0
A   0
A   0
G   0
Local alignment
Find the optimal local alignment of AAG and GAAGGC.
Use a gap penalty of d=-5.
Completed DP matrix:
         G    A    A    G    G    C
    0    0    0    0    0    0    0
A   0    0    2    2    0    0    0
A   0    0    2    4    0    0    0
G   0    2    0    0    6    2    0
End-Space Free Alignment
Any number of indel operations at the beginning or at the end of the alignment contributes zero weight.
X= - - c a c - t g t a c
Y= g a c a c t t g - - -
End-Space Free Alignment
Base conditions: F(i, 0) = 0 and F(0, j) = 0 for all i, j
Recurrence relation:
\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d
\end{cases}
\]
Search for i* such that F(i*, m) = max over 1 ≤ i ≤ n of F(i, m)
Search for j* such that F(n, j*) = max over 1 ≤ j ≤ m of F(n, j)
Define the alignment score as max{F(n, j*), F(i*, m)}
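A minimal sketch of this scheme follows. The function name `end_space_free_score` is hypothetical, and the default scores (match 1, mismatch -1, gap -1) and the short test sequences are assumptions chosen for illustration.

```python
def end_space_free_score(x, y, match=1, mismatch=-1, d=-1):
    """Score of the best alignment in which leading and trailing gaps are free."""
    n, m = len(x), len(y)
    # Base conditions: first row and first column are 0 (free leading gaps).
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + d, F[i][j - 1] + d)
    # Free trailing gaps: take the best cell in the last column or the last row.
    best_last_column = max(F[i][m] for i in range(1, n + 1))
    best_last_row = max(F[n][j] for j in range(1, m + 1))
    return max(best_last_column, best_last_row)


if __name__ == "__main__":
    # Hypothetical inputs: the short sequence sits inside the long one.
    print(end_space_free_score("ACGT", "CG"))
```

The function assumes both sequences are non-empty, matching the index ranges 1 ≤ i ≤ n and 1 ≤ j ≤ m given above.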

Exercise

Align two sequences (match=1, mismatch=-1, gap=-1)
X = cactgtac
Y = gacacttg
Questions to think about
Does a local alignment program always
produce a local alignment and a global
alignment program always produce a
global alignment?
 Develop an algorithm to find the longest
common subsequence (LCS) of two given
sequences.

Affine gap penalty
LETVGY
W----L
Penalties for the four gap positions: -5 (gap opening), then -1, -1, -1 (gap extension).
Separate penalties for gap opening and
gap extension.
 This requires modifying the DP algorithm

Affine gap penalty

A gap of length k is more probable than k gaps of length 1
– a gap may be due to a single mutational event that inserted/deleted a stretch of characters
– separated gaps are probably due to distinct mutational events
A linear gap penalty function treats these cases the same
It is more common to use gap penalty functions involving two terms
– a penalty h associated with opening a gap
– a smaller penalty g for extending the gap
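The difference between the two penalty functions can be shown with a small sketch. The function names are hypothetical, and the convention that the opening penalty h covers the first gap position (so a gap of length k costs h + (k - 1)·g) is an assumption read off the LETVGY example; some texts use h + k·g instead.

```python
def linear_gap_cost(k, d=-5):
    """Linear penalty: every gap position costs d, however the gaps are grouped."""
    return k * d


def affine_gap_cost(k, h=-5, g=-1):
    """Affine penalty: h for the first gap position, g for each further position."""
    if k == 0:
        return 0
    return h + (k - 1) * g


if __name__ == "__main__":
    one_long_gap = affine_gap_cost(4)          # a single gap of length 4
    four_short_gaps = 4 * affine_gap_cost(1)   # four separate gaps of length 1
    print(one_long_gap, four_short_gaps)
    print(linear_gap_cost(4), 4 * linear_gap_cost(1))
```

Under the affine function one long gap is penalised less than the same number of gap positions scattered over separate gaps, while the linear function cannot tell the two cases apart.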
Gap penalty functions
Dynamic Programming for the
Affine Gap Penalty Case

need 3 matrices instead of 1
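One common way to arrange the three matrices is sketched below; it is not necessarily the formulation shown on the original slides. The names `M`, `Ix`, `Iy` and `affine_global_score` are assumptions, as is the convention that a gap of length k costs h + (k - 1)·g. The sketch returns only the score of the best global alignment, without traceback.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, h=-5, g=-1):
    """Best global alignment score with gap opening h and gap extension g."""
    n, m = len(x), len(y)
    # M: x_i aligned with y_j; Ix: x_i against a gap; Iy: y_j against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = h + (i - 1) * g      # x prefix entirely against gaps
    for j in range(1, m + 1):
        Iy[0][j] = h + (j - 1) * g      # y prefix entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(M[i - 1][j] + h,     # open a new gap
                           Ix[i - 1][j] + g,    # extend the current gap
                           Iy[i - 1][j] + h)    # switch gap side: counts as opening
            Iy[i][j] = max(M[i][j - 1] + h,
                           Iy[i][j - 1] + g,
                           Ix[i][j - 1] + h)
    return max(M[n][m], Ix[n][m], Iy[n][m])


if __name__ == "__main__":
    print(affine_global_score("LETVGY", "WL"))
```

Keeping a separate matrix per "last column type" lets the recurrence tell whether a gap is being opened or extended, which a single matrix F cannot record.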
match=1, mismatch=-1
Exercise

Write the formula for “Local Alignment DP
for the Affine Gap Penalty Case”
Word, k-tuple

FASTA

BLAST