An Introduction To Multiple Sequence Alignment (MSA)

Pairwise Sequence Alignments
Topics to be Covered
• Comparison methods
• Global alignment
• Local alignment
Introduction to Alignment
Analyze the similarities and differences at the
individual base level or amino acid level
Aim is to infer structural, functional and evolutionary
relationships among sequences
Sequence Alignment
982 TGTTTGCTAAAGCTTCAGCTATCCACAACCCAATTGACCTCTAC 1022
    | ||||||||  |  |  | ||| ||||||||||     |||||
961 TCTTTGCTAAGACCGCCTCCATCTACAACCCAATCA---TCTAC 1001
Two sequences written out , one on top of the other
Identical or similar characters placed in same column
Nonidentical characters either placed in same column
as mismatch or opposite gap in the other sequence
Overall quality of the alignment is then evaluated
based on a formula that counts the number of identical
(or similar) pairs minus the number of mismatches and
gaps
Pairwise Sequence Alignments
• Why compare sequences?
• Similarity search is necessary for:
   • Family assignment
   • Sequence annotation
   • Construction of phylogenetic trees
   • Learning about evolutionary relationships
   • Classifying sequences
   • Identifying functions
   • Homology modeling
Essential Elements of an Alignment Algorithm
• Defining the problem (Global, local alignment)
• Scoring scheme (Gap penalties)
• Distance Matrix (PAM, BLOSUM series)
Global and Local Alignments
Global – attempt is made to align the entire sequence
using as many characters as possible, up to both ends
of the sequences
Local – stretches of sequence with the highest density
of matches are aligned
LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA
Global Alignment
-------TGKG--------
        |||
-------AGKG--------
Local Alignment
Local vs. Global Alignment (cont’d)
• Global Alignment
--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC
| || | || | | | |||
|| | | | | ||||
|
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C
• Local Alignment: a better alignment for finding a conserved segment
                  TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC
                     ||||||||||||
AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC
Global and Local Alignments
• Global - When two sequences are of approximately equal length. Here, the
goal is to obtain maximum score by completely aligning them
• Local - When one sequence is a sub-string of the other or the goal is to get
maximum local score
• Protein motif searches in a database
Dynamic programming algorithm
• Dynamic programming: build up the optimal alignment from previously
  computed optimal alignments of subsequences
Aligning Sequences without Insertions and
Deletions: Hamming Distance
Given two DNA sequences v and w:
v: ATATATAT
w: TATATATA
• The Hamming distance dH(v, w) = 8 is large, but the sequences are very similar
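The position-by-position comparison behind the Hamming distance can be sketched in a few lines of Python. This is only an illustration; the function name hamming_distance is a hypothetical choice and does not come from the slides.

```python
def hamming_distance(v: str, w: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    # The i-th letter of v is always compared with the i-th letter of w.
    return sum(a != b for a, b in zip(v, w))


print(hamming_distance("ATATATAT", "TATATATA"))  # 8
```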
Aligning Sequences with
Insertions and Deletions
By shifting one sequence over one position:
v: ATATATAT-
w: -TATATATA
• The edit distance: d(v, w) = 2.
• Hamming distance neglects insertions and deletions in DNA
Edit Distance
Levenshtein (1966) introduced edit distance
between two strings as the minimum number
of elementary operations (insertions, deletions,
and substitutions) to transform one string into
the other
d(v, w) = minimum number of elementary operations needed to transform v → w
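The definition above can be turned into a small dynamic programming routine. The sketch below assumes unit cost for each insertion, deletion, and substitution; the name edit_distance and the table layout are illustrative choices rather than anything prescribed by the slides.

```python
def edit_distance(v: str, w: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    n, m = len(v), len(w)
    # d[i][j] = edit distance between the prefixes v[:i] and w[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all i letters of v[:i]
    for j in range(1, m + 1):
        d[0][j] = j  # insert all j letters of w[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # match or substitution
            )
    return d[n][m]


print(edit_distance("ATATATAT", "TATATATA"))  # 2
```

Unlike the Hamming distance, the table lets the i-th letter of v be compared with any j-th letter of w, at a cost of O(nm) time.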
Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8
Computing Hamming distance is a trivial task.
Just one shift makes it all line up.
Edit distance may compare the i-th letter of v with the j-th letter of w:
V = -ATATATAT
W = TATATATA-
Edit distance: d(v, w) = 2
Computing edit distance is a non-trivial task.
Edit Distance vs Hamming Distance
The edit distance of 2 comes from one insertion and one deletion.
How do we find which j goes with which i?
Edit Distance: Example
• TGCATAT → ATCCGAT in 5 steps
TGCATAT → (delete last T)
TGCATA → (delete last A)
TGCAT → (insert A at front)
ATGCAT → (substitute C for the G at position 3)
ATCCAT → (insert G before last A)
ATCCGAT (Done)
What is the edit distance? 5?
Edit Distance: Example (cont’d)
TGCATAT → ATCCGAT in 4 steps
TGCATAT → (insert A at front)
ATGCATAT → (delete the T at position 6)
ATGCAAT → (substitute G for the A at position 5)
ATGCGAT → (substitute C for the G at position 3)
ATCCGAT (Done)
Can it be done in 3 steps?
The Alignment Grid
– Every alignment
path is from source
to sink
Alignment as a Path in the Edit Graph
[Figure: edit graph for v = ATGTTAT and w = ATCGTAC, with nodes numbered 0 to 7 along each sequence]
     0 1 2 2 3 4 5 6 7 7
v =   A T _ G T T A T _
w =   A T C G T _ A _ C
     0 1 2 3 4 5 5 6 6 7
Corresponding path:
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an alignment.
Alignment as a Path in the Edit Graph
Old Alignment
     0 1 2 2 3 4 5 6 7 7
v =   A T _ G T T A T _
w =   A T C G T _ A _ C
     0 1 2 3 4 5 5 6 6 7
New Alignment
     0 1 2 2 3 4 5 6 7 7
v =   A T _ G T T A T _
w =   A T C G _ T A _ C
     0 1 2 3 4 4 5 6 6 7
From LCS to Alignment: Change up the Scoring
• The Longest Common Subsequence (LCS)
problem—the simplest form of sequence alignment
– allows only insertions and deletions (no
mismatches).
• In the LCS Problem, we scored 1 for matches and 0
for indels
• Consider penalizing indels and mismatches with
negative scores
• Simplest scoring schema:
   • +1 : match premium
   • -μ : mismatch penalty
   • -σ : indel penalty
Simple Scoring
• When mismatches are penalized by -μ, indels are penalized by -σ, and
  matches are rewarded with +1, the resulting score is:
  #matches - μ(#mismatches) - σ(#indels)
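As a small illustration of this scoring formula, the following Python sketch scores an alignment that is already given as two gapped rows. The function name alignment_score and the use of "-" as the gap character are assumptions made for the example.

```python
def alignment_score(row_v: str, row_w: str, mu: float, sigma: float, gap: str = "-") -> float:
    """Score = #matches - mu * #mismatches - sigma * #indels for a given alignment."""
    if len(row_v) != len(row_w):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0.0
    for a, b in zip(row_v, row_w):
        if a == gap or b == gap:
            score -= sigma   # indel column
        elif a == b:
            score += 1       # match premium
        else:
            score -= mu      # mismatch penalty
    return score


print(alignment_score("GAT-TCA", "G-TCTGA", mu=1, sigma=1))
```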
Dynamic programming algorithm
• Define a matrix F(i, j):
   • F(i, j) is the score of the optimal alignment of the prefixes A1...i and B1...j
• Iterative build-up: start from F(0, 0) = 0
• Compute each element (i, j) from its three neighbours:
   • (i-1, j): Ai aligned to a gap in sequence B
   • (i, j-1): Bj aligned to a gap in sequence A
   • (i-1, j-1): alignment of Ai to Bj
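The matrix fill and traceback described here can be sketched as a basic global alignment routine. This is a minimal sketch under stated assumptions: a fixed match/mismatch score stands in for a substitution matrix, the gap penalty is linear, and the names global_align, match, mismatch and gap are illustrative.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # a[i-1] aligned to b[j-1]
                          F[i - 1][j] + gap,    # a[i-1] aligned to a gap
                          F[i][j - 1] + gap)    # b[j-1] aligned to a gap
    # Traceback from the bottom-right corner to the origin.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_align("GTCTGA", "GTCAGC")
print(score)
print(top)
print(bottom)
```

When several optimal alignments exist, the traceback returns just one of them; which one depends on the order in which the three cases are tested.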
Dynamic programming
Sequence Comparison Scoring Matrices
• The choice of a scoring matrix can strongly influence the outcome of
  sequence analysis
• Scoring matrices implicitly represent a particular theory of evolution
• Elements of the matrices specify the similarity or the distance of
  replacing one residue (base) with another
Protein Scoring Matrices
• The two most popular matrices are the PAM and the BLOSUM matrices
Scoring Insertions and Deletions
A T G T A A T G C A
T A T G T G G A A T G A
A T G T - - A A T G C A
T A T G T G G A A T G A
insertion / deletion
The creation of a gap is penalized with a negative score value.
Why Gap Penalties?
• The optimal alignment of two similar sequences is usually that which
• maximizes the number of matches and
• minimizes the number of gaps.
• Permitting the insertion of arbitrarily many gaps can lead to high
scoring alignments of non-homologous sequences.
• Penalizing gaps forces alignments to have relatively few gaps.
Why Gap Penalties?
Gaps not permitted
Score:
1 GTGATAGACACAGACCGGTGGCATTGTGG 29
|||
| | |||
|
|| || |
1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29
Gaps allowed but not penalized
0
Match = 5
Mismatch = -4
Score: 88
1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29
||| || | | | ||| || | | || || |
1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29
Gap Penalties
Linear gap penalty score:
γ(g) = -g·d
Affine gap penalty score:
γ(g) = -d - (g - 1)·e
γ(g) = gap penalty score of a gap of length g
d = gap opening penalty
e = gap extension penalty
g = gap length
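The two gap penalty formulas can be compared directly in code. The function names linear_gap and affine_gap below are hypothetical; the parameters d, e and g follow the definitions above.

```python
def linear_gap(g: int, d: float) -> float:
    """Linear penalty: every gap position costs d."""
    return -g * d


def affine_gap(g: int, d: float, e: float) -> float:
    """Affine penalty: d for opening the gap, e for each further position."""
    return -d - (g - 1) * e


# A gap of length 3 with opening penalty 3 and extension penalty 0.1
print(affine_gap(3, d=3, e=0.1))
# The same gap under a linear penalty with d = 3
print(linear_gap(3, d=3))
```

With an affine penalty a long gap costs only slightly more than a short one, which is the behaviour the later slides argue for.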
Scoring Indels: Naive Approach
• A fixed penalty σ is given to every indel:
– -σ for 1 indel,
– -2σ for 2 consecutive indels
– -3σ for 3 consecutive indels, etc.
• Can be too severe penalty for a series of
100 consecutive indels
Affine Gap Penalties
• In nature, a series of k indels often comes as a single event rather than
  a series of k single-nucleotide events:
ATA__GC
ATATTGC    This is more likely.
ATAG_GC
AT_GTGC    This is less likely.
Normal scoring would give the same score for both alignments.
Accounting for Gaps
• Gap: a contiguous sequence of spaces in one of the rows
• Score for a gap of length x is -(ρ + σx)
   • ρ > 0 is the gap opening penalty, charged for introducing a gap
   • σ is the gap extension penalty
   • ρ will be large relative to σ, because you do not want to add too much
     of a penalty for extending the gap
Affine Gap Penalty Recurrences
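s↓(i,j) = max { s↓(i-1,j) - σ           Continue gap in w (deletion)
              { s(i-1,j) - (ρ + σ)      Start gap in w (deletion): from middle

s→(i,j) = max { s→(i,j-1) - σ           Continue gap in v (insertion)
              { s(i,j-1) - (ρ + σ)      Start gap in v (insertion): from middle

s(i,j)  = max { s(i-1,j-1) + δ(vi, wj)  Match or mismatch
              { s↓(i,j)                 End deletion: from top
              { s→(i,j)                 End insertion: from bottom

Here s↓ and s→ are the best scores of alignments ending in a gap in w and in a gap in v, respectively, and s is the middle level.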
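A three-level recurrence of this kind can be sketched in Python as follows. This is a sketch under assumptions: a gap of length x is taken to cost ρ + σx, a plain match/mismatch score stands in for δ, and the names affine_score, lower, upper and middle are illustrative labels for the three levels.

```python
NEG_INF = float("-inf")


def affine_score(v: str, w: str, rho: float, sigma: float,
                 match: float = 1.0, mismatch: float = -1.0) -> float:
    """Best global alignment score when a gap of length x costs rho + sigma * x."""
    n, m = len(v), len(w)
    lower = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # ends with a gap in w (deletion)
    upper = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # ends with a gap in v (insertion)
    middle = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # best score over all three cases
    middle[0][0] = 0.0
    for i in range(1, n + 1):
        lower[i][0] = -(rho + sigma * i)
        middle[i][0] = lower[i][0]
    for j in range(1, m + 1):
        upper[0][j] = -(rho + sigma * j)
        middle[0][j] = upper[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            lower[i][j] = max(lower[i - 1][j] - sigma,            # continue deletion
                              middle[i - 1][j] - (rho + sigma))   # start deletion
            upper[i][j] = max(upper[i][j - 1] - sigma,            # continue insertion
                              middle[i][j - 1] - (rho + sigma))   # start insertion
            delta = match if v[i - 1] == w[j - 1] else mismatch
            middle[i][j] = max(middle[i - 1][j - 1] + delta,      # match or mismatch
                               lower[i][j],                       # end deletion
                               upper[i][j])                       # end insertion
    return middle[n][m]


print(affine_score("ATAGC", "ATATTGC", rho=2, sigma=1))
```

The extra two tables keep the running time at O(nm) while letting a long gap be charged the opening penalty only once.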
Scoring Insertions and Deletions
match = 1, mismatch = 0

  A T G T T A T A C
T A T G T G C G T A T A
Total Score: 4

Gap parameters:
d = 3 (gap opening)
e = 0.1 (gap extension)
g = 3 (gap length)
γ(g) = -3 - (3 - 1)·0.1 = -3.2

  A T G T - - - T A T A C
T A T G T G C G T A T A
insertion / deletion
Total Score: 8 - 3.2 = 4.8
Modification of Gap Penalties
Score Matrix: BLOSUM62
gap opening penalty = 3, gap extension penalty = 0.1, score = 6.3
1 ...VLSPADKFLTNV 12
      ||||
1 VFTELSPAKTV.... 11
gap opening penalty = 0, gap extension penalty = 0.1, score = 11.3
1 V...LSPADKFLTNV 12
  |   |||| |  | |
1 VFTELSPA.K..T.V 11
Pairwise Sequence Alignment
Local Alignment
Semi-Global Alignment
Local Alignment
• A local alignment between sequence s and sequence t is an alignment with
  maximum similarity between a substring of s and a substring of t.
T. F. Smith & M. S. Waterman, “Identification of Common Molecular Subsequences”, J. Mol. Biol., 147:195-
Why choose a local alignment
algorithm?
• More meaningful – point out conserved
regions between two sequences
• Aligns two sequences of different lengths to
be matched
• Aligns two partially overlapping sequences
• Aligns two sequences where one is a
subsequence of another
Dynamic Programming
Local Alignment
• S(i,j) = MAXIMUM of
   – S(i-1,j-1) + s(ai,bj)  (match/mismatch in the diagonal)
   – S(i,j-1) + w  (gap in sequence #1)
   – S(i-1,j) + w  (gap in sequence #2)
   – 0
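The local alignment recurrence, together with its initialization, matrix fill and traceback steps, can be sketched as below. The sketch assumes a simple match/mismatch score for s(ai,bj) and a negative linear gap score w; the name local_align and the default values are illustrative assumptions.

```python
def local_align(a: str, b: str, match: int = 1, mismatch: int = -1, w: int = -1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    # Initialization: first row and first column stay at 0.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    # Matrix fill: scores are never allowed to drop below 0.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s,  # match/mismatch on the diagonal
                          S[i][j - 1] + w,      # gap in sequence #1
                          S[i - 1][j] + w)      # gap in sequence #2
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Traceback: start at the highest-scoring cell, stop at the first 0.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + w:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = local_align("TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC",
                                 "AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC")
print(score)
print(top)
print(bottom)
```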
Initialization Step
Matrix Fill Step
Traceback Step
An Introduction To Multiple Sequence Alignment (MSA)
Topics To Be Discussed
• Motivation for MSA
• What is MSA
• Extension of Dynamic Programming
• The STAR Method
• Progressive Alignment
• Scoring Multiple Alignments
Multiple Alignment versus Pairwise Alignment
• Up until now we have only tried to align two sequences.
• What about more than two? And what for?
• A faint similarity between two sequences becomes significant if it is
  present in many
• Multiple alignments can reveal subtle similarities that pairwise
  alignments do not reveal
Motivation For MSA
• A natural extension of pairwise sequence alignment
• MSA gives biologists the ability to extract biologically important but
  perhaps widely dispersed sequence similarities that can give hints about
  the evolutionary history of certain sequences.
• In pairwise alignment, when two sequences align, it is concluded that
  there is probably a functional relationship between the two sequences.
  For MSA, by contrast, if it is known that there is a functional similarity
  among a number of sequences, we can use MSA to find out where the
  similarity comes from.
What is MSA
• MSA is the simultaneous alignment of N sequences (protein or nucleotide),
  where N > 2.
• Let Si denote a sequence. Then the global multiple sequence alignment of
  N > 2 sequences S = { S1, …, SN } is obtained by inserting gaps, denoted
  by “-”, at any position, possibly at the beginning or end.
• The new set of N sequences, denoted by S’ = { S1’, …, SN’ }, will all
  have the same length L.
Ovar STCVLSAYWKD-LNNYH
Bota STCVLSAYWKD-LNNYH
Susc STCVLSAYWRNELNNFH
Hosa STCMLGTY-QD-FNKFH
Rano STCMLGTY-QD-LNKFH
Sasa STCVLGKLSQE-LHKLQ
Interpretation of positions
• Generally there are two interpretations of a
position in a multiple sequence alignment:
• Evolutionary/historical
• Functional/structural
• In many cases these are the same, but they
may not be.
Multiple sequence alignment
algorithm
• Ideal approach to multiple sequence alignment
is to extend dynamic programming.
• Instead of aligning two sequences (two
dimensional grid) we align k sequences (k
dimensional grid)
• Extension is relatively straightforward
Dynamic programming for sequence alignment
• Recurrence relation
• Tabular computation
• Traceback
Pairwise recurrence relation
S(i,j) = max[S(i-1, j-1) + m(i,j), S(i-1, j) + g, S(i, j-1) + g]
• m(i,j) = similarity matrix, e.g. BLOSUM
• g = gap penalty
Aligning Three Sequences
• Same strategy as aligning two sequences
• Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align
• For global alignments, go from source to sink
2-D versus 3-D Alignment Cell
In 2-D, 3 edges in each unit square
In 3-D, 7 edges in each unit cube
Architecture of 3-D Alignment Cell
[Figure: unit cube with vertices (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i-1,j,k), (i,j-1,k-1), (i,j-1,k), (i,j,k-1), (i,j,k)]
Multiple Alignment: Dynamic Programming
• s(i,j,k) = max of
   s(i-1,j-1,k-1) + δ(vi, wj, uk)    cube diagonal: no indels
   s(i-1,j-1,k) + δ(vi, wj, _)       face diagonal: one indel
   s(i-1,j,k-1) + δ(vi, _, uk)
   s(i,j-1,k-1) + δ(_, wj, uk)
   s(i-1,j,k) + δ(vi, _, _)          edge diagonal: two indels
   s(i,j-1,k) + δ(_, wj, _)
   s(i,j,k-1) + δ(_, _, uk)
• δ(x, y, z) is an entry in the 3-D scoring matrix
Extending dynamic programming
• Based on the extrapolation from two to
three sequences, we can define the
recurrence relation for any number of
sequences in the same way
• The other steps - tabular computation and
traceback - are done in the same way as
for pairwise alignment
There are seven cases when aligning three sequences
Case:  1  2  3  4  5  6  7
       I  I  I  -  I  -  -
       J  J  -  J  -  J  -
       K  -  K  K  -  -  K
2^3 - 1 = 7 cases, from which to choose the maximum similarity
Three sequence recurrence
relation
• S(i,j,k) = max[ S(i-1, j-1, k-1) + m(i,j) + m(i,k) + m(j,k),
                  S(i-1, j-1, k) + m(i,j) + g,
                  S(i-1, j, k-1) + m(i,k) + g,
                  S(i, j-1, k-1) + m(j,k) + g,
                  S(i-1, j, k) + g + g,
                  S(i, j-1, k) + g + g,
                  S(i, j, k-1) + g + g ]
• m(i,j) = similarity matrix, e.g. BLOSUM
• g = gap penalty
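The three-sequence recurrence can be written compactly by looping over the seven possible moves. The sketch below follows the recurrence as stated, adding m for every pair of residues in a column and g once per gap; the match/mismatch values standing in for the similarity matrix, the default g = -2, and the name align3_score are assumptions for illustration.

```python
from itertools import combinations, product


def align3_score(a: str, b: str, c: str, match: int = 1, mismatch: int = -1, g: int = -2):
    """Optimal global score for three sequences using the seven-case recurrence."""
    seqs = (a, b, c)
    # The 2^3 - 1 = 7 moves: each sequence either contributes a residue (1) or a gap (0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    S = {(0, 0, 0): 0}
    # product() visits cells in lexicographic order, so predecessors come first.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == (0, 0, 0):
            continue
        best = float("-inf")
        for mv in moves:
            prev = tuple(x - d for x, d in zip(cell, mv))
            if min(prev) < 0:
                continue
            residues = [s[x - 1] for s, x, d in zip(seqs, cell, mv) if d]
            # m(.,.) for each residue pair in the column, plus g for each gap
            column = sum(match if p == q else mismatch
                         for p, q in combinations(residues, 2))
            column += g * (3 - len(residues))
            best = max(best, S[prev] + column)
        S[cell] = best
    return S[tuple(len(s) for s in seqs)]


print(align3_score("ACG", "ACG", "ACG"))  # 9
```

The table has one cell per triple (i, j, k), which is why the cost grows as the product of the sequence lengths.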
Dynamic programming time
increases exponentially
• Time taken for alignment by dynamic
programming is O(n * m) for two
sequences n, m characters long.
• Time taken for alignment by dynamic
programming is O(n * m * p) for three
sequences n, m, p characters long.
• Clearly, for N sequences, with sequence i being Li characters long, the
  time required will be O(L1 × L2 × … × LN), the product of Li for i = 1 to N
• This is exponential in the number of sequences: O(L^N) when every sequence
  has length L
• We need to fill out each ‘box’ in the grid
Pairwise Dynamic Programming
Comparing Similar Sequences
• Faster algorithm for aligning similar sequences.
• If two sequences are similar, the best alignments
have their paths near the main diagonal of the
dynamic programming matrix.
• To compute the optimal score and alignment, it is
not necessary to fill in the entire matrix.
• A narrow band around the main diagonal should
suffice
Global Alignment: Comparing Similar
Sequences
Match = 5, Mismatch = -4, Gap w= -7, K=2
Heuristic multiple sequence
alignment
• Currently, most practical methods are hierarchical methods
• For example: compute pairwise alignments, define a hierarchy from them,
  then progressively add sequences to the alignment
Multiple Alignment Induces
Pairwise Alignments
• Every multiple alignment induces pairwise alignments
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
• Induces:
x: ACGCGG-C;   x: AC-GCGG-C;   y: AC-GCGAG
y: ACGC-GAG;   z: GCCGC-GAG;   z: GCCGCGAG
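Extracting an induced pairwise alignment amounts to keeping two rows of the multiple alignment and dropping the columns in which both rows hold a gap. A minimal sketch, with the hypothetical function name induced_pairwise:

```python
def induced_pairwise(row_a: str, row_b: str, gap: str = "-"):
    """Project a multiple alignment onto two of its rows."""
    # Keep every column except those where both rows contain a gap.
    columns = [(p, q) for p, q in zip(row_a, row_b) if not (p == gap and q == gap)]
    return "".join(p for p, _ in columns), "".join(q for _, q in columns)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))
print(induced_pairwise(x, z))
print(induced_pairwise(y, z))
```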
Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments
• Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C;   x: AC-GCTGG-C;   y: AC-GC-GAG
y: ACGC--GAC;   z: GCCGCA-GAG;   z: GCCGCAGAG
• Can we construct a multiple alignment that induces them?
NOT ALWAYS
Inferring Multiple Alignment
from Pairwise Alignments
• From an optimal multiple alignment, we can
infer pairwise alignments between all pairs of
sequences, but they are not necessarily optimal
• It is difficult to infer a “good” multiple alignment from optimal
  pairwise alignments between all sequences
Combining Optimal Pairwise
Alignments into Multiple Alignment
Can combine pairwise
alignments into multiple
alignment
Can not combine
pairwise alignments
into multiple alignment
The STAR Alignment Method
• Using a pairwise alignment method (DP, etc.), find the sequence that is
  most similar to all the other sequences.
• Using this “best” sequence as the center (of a star, hence the name),
  align the other sequences to it, following the “once a gap, always a gap” rule.
• For example, consider the following set of sequences:
   S1  ATTGCCATT
   S2  ATGGCCATT
   S3  ATCCAATTTT
   S4  ATCTTCTT
   S5  ACTGACC
STAR Alignment - 2
• Now consider the following similarity matrix for the pairwise comparison
  of the sequences, where SUM is the sum of sim(Si, Sj) over all j ≠ i.
sim(Si, Sj)   S1   S2   S3   S4   S5   SUM
S1             -    7   -2    0   -3     2
S2             7    -   -2    0   -4     1
S3            -2   -2    -    0   -7   -11
S4             0    0    0    -   -3    -3
S5            -3   -4   -7   -3    -   -17
For this example S1 is the center of the STAR
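Choosing the center amounts to summing each sequence's similarities to all the others and taking the largest total. The sketch below uses the pairwise similarity values from the matrix in this example; the dictionary layout and the name star_center are illustrative assumptions.

```python
def star_center(names, sim):
    """Return (center, totals), where totals[n] sums n's similarity to all others."""
    totals = {name: 0 for name in names}
    for (a, b), score in sim.items():
        totals[a] += score
        totals[b] += score
    center = max(names, key=lambda name: totals[name])
    return center, totals


names = ["S1", "S2", "S3", "S4", "S5"]
# Each unordered pair is listed once (the matrix is symmetric).
sim = {
    ("S1", "S2"): 7, ("S1", "S3"): -2, ("S1", "S4"): 0, ("S1", "S5"): -3,
    ("S2", "S3"): -2, ("S2", "S4"): 0, ("S2", "S5"): -4,
    ("S3", "S4"): 0, ("S3", "S5"): -7,
    ("S4", "S5"): -3,
}
center, totals = star_center(names, sim)
print(center, totals)
```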
STAR Alignment - 3
• Next we get the best alignment between S1 and each of the other sequences:
S1 | ATTGCCATT
S2 | ATGGCCATT

S1 | ATTGCCATT--
S3 | ATC-CAATTTT

S1 | ATTGCCATT
S4 | ATCTTC-TT

S1 | ATTGCCATT
S5 | ACTGACC--
STAR Alignment 4
• Next, to build the MSA, we start with S1 and S2:
   ATTGCCATT
   ATGGCCATT
• Adding S3 using “once a gap, always a gap”:
   ATTGCCATT--
   ATGGCCATT--
   ATC-CAATTTT
• Continuing in this fashion, we obtain the MSA of all the sequences.
Star Alignment 5
   ATTGCCATT--
   ATGGCCATT--
   ATC-CAATTTT
   ATCTTC-TT--
   ACTGACC----
• Clearly, using the STAR method, the time complexity is dominated by
  computing the pairwise alignments, of which there are O(N^2) for N
  sequences. We consider each pairwise alignment to take L^2 time, where
  again L is the length of each sequence.
STAR Alignment - 6
• Thus the time complexity for computing all pairwise alignments will be
  O[(NL)^2]
• We still have to consider the time it takes to merge the sequences into an
  MSA. If Lmax is the upper bound of the alignment length, then it will take
  N^2·Lmax time to merge the sequences into an MSA.
• Thus the time complexity for STAR is O(N^2 L^2 + N^2 Lmax)
• Clearly, for large N and L, this is less than the time complexity for SP,
  which is O[(2L)^N · N^2]
• Recall that SP is optimal whereas STAR is not; thus there is a tradeoff
  between optimality and practicality.
Profile Representation of Multiple Alignment
[Figure: a multiple alignment of five sequences and its profile, which gives the frequency of A, C, G, T and the gap symbol in each column]
• In the past we were aligning a sequence
against a sequence
• Can we align a sequence against a profile?
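A profile of this kind can be computed by counting letters column by column. The sketch below is illustrative only: the function name profile, the alphabet string, and the small four-row input alignment are assumptions and are not taken from the figure.

```python
from collections import Counter


def profile(rows, alphabet="ACGT-"):
    """Return one dict per alignment column mapping each symbol to its frequency."""
    n = len(rows)
    columns = []
    for col in zip(*rows):
        counts = Counter(col)
        columns.append({x: counts[x] / n for x in alphabet})
    return columns


for column in profile(["AAA", "AAA", "AAT", "ATC"]):
    print(column)
```

Aligning a sequence against a profile then means scoring each letter against a column of frequencies rather than against a single letter.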
Aligning alignments
• Given two alignments, can we align them?
x GGGCACTGCAT
y GGTTACGTC--    Alignment 1
z GGGAACTGCAG

w GGACGTACC--    Alignment 2
v GGACCT-----
Aligning alignments
• Given two alignments, can we align them?
• Hint: use alignment of corresponding profiles
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG    Combined Alignment
w GGACGTACC--
v GGACCT-----
Multiple Alignment: Greedy Approach
• Choose the most similar pair of strings and combine them into a profile,
  thereby reducing the alignment of k sequences to an alignment of k-1
  sequences/profiles. Repeat
• This is a heuristic greedy method
k sequences:                  k-1 sequences/profiles:
u1 = ACGTACGTACGT…            u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…            u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…            …
…                             uk = CCGGCCGGCCGG…
uk = CCGGCCGGCCGG…
Greedy Approach: Example
• Consider these 4 sequences
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC
Greedy Approach: Example (cont’d)
• There are C(4, 2) = 6 possible alignments
s2 GTCTGA
s4 GTCAGC (score = 2)

s1 GAT-TCA
s2 G-TCTGA (score = 1)

s1 GAT-TCA
s3 GATAT-T (score = 1)

s1 GATTCA--
s4 G-T-CAGC (score = 0)

s2 G-TCTGA
s3 GATAT-T (score = -1)

s3 GAT-ATT
s4 G-TCAGC (score = -1)
Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:
s2    GTCTGA
s4    GTCAGC
s2,4  GTCt/aGa/c (profile)
new set of 3 sequences:
s1    GATTCA
s3    GATATT
s2,4  GTCt/aGa/c
Progressive Alignment
• Progressive alignment is a variation of greedy
algorithm with a somewhat more intelligent
strategy for choosing the order of alignments.
• Progressive alignment works well for close
sequences, but deteriorates for distant
sequences
– Gaps in consensus string are permanent
– Use profiles to compare sequences
ClustalW
• Popular multiple alignment tool today
• ‘W’ stands for ‘weighted’ (different parts of
alignment are weighted differently).
• Three-step process
– 1.) Construct pairwise alignments
– 2.) Build Guide Tree
– 3.) Progressive Alignment guided by the tree
The CLUSTALW Algorithm
• Step 1 : Determine all pairwise alignment between sequences and
determine degrees of similarity between each pair.
• Step 2 : Construct a similarity tree * .
• Step 3 : Combine the alignments starting from the most closely
related groups to the most distantly related groups, as in STAR
we use the once a gap always a gap rule .
•
* The PILEUP program is similar to CLUSTALW but uses a
different method for producing the similarity tree .
Heuristic Multiple Alignment Methods
Clustal W progressive multiple
alignment
• Align two sequences to each other
• Align a sequence to an existing alignment
• Align two alignments to each other
Multiple Alignments: Scoring
• As in the pairwise case, not all MSA’s are
equally good.
• We need a method of scoring for
determining when one MSA is better than
another one.
• Number of matches (multiple longest
common subsequence score)
• Entropy score
Multiple LCS Score
• A column is a “match” if all the letters in the
column are the same
AAA
AAA
AAT
ATC
• Only good for very similar sequences
Entropy
• Define frequencies for the occurrence of each
letter in each column of multiple alignment
– pA = 1, pT=pG=pC=0 (1st column)
– pA = 0.75, pT = 0.25, pG=pC=0 (2nd column)
– pA = 0.50, pT = 0.25, pC=0.25 pG=0 (3rd column)
• Compute entropy of each column
AAA
AAA
AAT
ATC
Entropy: Example
Best case
Worst case
Multiple Alignment: Entropy Score
Entropy for a multiple alignment is the sum of entropies of its columns:
Σ over all columns ( -Σ X=A,T,G,C pX log pX )
Entropy of an Alignment: Example
Alignment:
A A A
A C C
A C G
A C T
column entropy: -(pA log pA + pC log pC + pG log pG + pT log pT), with log base 2 and 0·log 0 taken as 0
• Column 1 = -[1·log(1) + 0·log 0 + 0·log 0 + 0·log 0] = 0
• Column 2 = -[(1/4)·log(1/4) + (3/4)·log(3/4) + 0·log 0 + 0·log 0]
           = -[(1/4)·(-2) + (3/4)·(-0.415)] = +0.811
• Column 3 = -[(1/4)·log(1/4) + (1/4)·log(1/4) + (1/4)·log(1/4) + (1/4)·log(1/4)]
           = 4 · -[(1/4)·(-2)] = +2.0
• Alignment Entropy = 0 + 0.811 + 2.0 = +2.811
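The entropy calculation above can be reproduced with a short Python sketch. It assumes base-2 logarithms and skips letters that do not occur in a column, which is equivalent to taking 0·log 0 as 0; the function names are illustrative.

```python
from collections import Counter
from math import log2


def column_entropy(column) -> float:
    """Entropy of one column: -sum of p * log2(p) over the letters present."""
    n = len(column)
    return -sum((count / n) * log2(count / n) for count in Counter(column).values())


def alignment_entropy(rows) -> float:
    """Entropy of a multiple alignment: the sum of its column entropies."""
    return sum(column_entropy(col) for col in zip(*rows))


rows = ["AAA", "ACC", "ACG", "ACT"]
print([round(column_entropy(col), 3) for col in zip(*rows)])
print(round(alignment_entropy(rows), 3))
```

Note that this sketch would treat a gap character as just another letter; how gaps should enter the entropy is a design choice the slides do not settle.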
Sum of Pairs Score(SP-Score)
• Consider the pairwise alignment of sequences ai and aj imposed by a
  multiple alignment of k sequences
• Denote the score of this pairwise alignment, which is not necessarily
  optimal, as s*(ai, aj)
• Sum up the pairwise scores for a multiple alignment:
  s(a1, …, ak) = Σ over all pairs i < j of s*(ai, aj)
Computing SP-Score
Aligning 4 sequences: 6 pairwise alignments
Given a1, a2, a3, a4:
s(a1…a4) = Σ s*(ai,aj) = s*(a1,a2) + s*(a1,a3) + s*(a1,a4)
                        + s*(a2,a3) + s*(a2,a4) + s*(a3,a4)
SP-Score: Example
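a1 ATG-C-AAT
.  A-G-CATAT
ak ATCCCATTT
To calculate each column, sum the scores of all pairs of sequences (match = 1, mismatch = -μ):
• Column 1 (A, A, A): the pairs A-A, A-A, A-A each score 1, so Score = 3
• Column 3 (G, G, C): the pair G-G scores 1 and the two G-C pairs each score -μ, so Score = 1 - 2μ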
SP-Score: Example
• Consider aligning the following 4 protein sequences
   S1 = AQPILLLV
   S2 = ALRLL
   S3 = AKILLL
   S4 = CPPVLILV
• Next consider the following MSA matrix M
   A Q P I L L L V
   A L R - L L - -
   A K - I L L L -
   C P P V L I L V
SP-Score: Example
• Assume s(match) = 1, s(mismatch) = -1, and s(gap) = -2; also assume
  s(-, -) = 0 to prevent the double counting of gaps.
• Then the SP score for the 4th column of M would be
   SP(m4) = SP(I, -, I, V)
          = s(I,-) + s(I,I) + s(I,V) + s(-,I) + s(-,V) + s(I,V)
          = -2 + 1 + (-1) + (-2) + (-2) + (-1)
          = -7
• To find SP(M) we would find the score of each column mi and then sum all
  the SP(mi) scores to get the score of M.
• To find the optimal score using this method we need to consider all
  possible MSA matrices. We say more about this later.
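The column-by-column SP calculation can be sketched as follows, using the scores assumed in this example (match = 1, mismatch = -1, gap = -2, and 0 for a gap paired with a gap). The function names pair_score, sp_column and sp_score are illustrative.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0          # s(-, -) = 0 avoids double counting of gaps
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column) -> int:
    """Sum-of-pairs score of a single column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows) -> int:
    """Sum-of-pairs score of a whole multiple alignment."""
    return sum(sp_column(col) for col in zip(*rows))


M = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
print(sp_column("I-IV"))  # -7
print(sp_score(M))
```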
Some Problems with the SP Score
• Consider column 1 of our example, i.e. A, A, A, C. For this column we get
   SP(m1) = SP(A, A, A, C)
          = 1 + 1 + (-1) + 1 + (-1) + (-1)
          = 0
  whereas if we had A, A, A, A we would get a score of
   SP(A, A, A, A) = 1 + 1 + 1 + 1 + 1 + 1 = 6,
  a difference of 6 for what could be explained by a single mutation.
• The SP method tends to overweight the influence of mutations
• The major problem with the SP method is that finding the optimal MSA is
  very time consuming.