Multiple sequence alignment - Department of Computer Science • NJIT

BNFO 601
Multiple sequence alignment
Usman Roshan
Optimal pairwise alignment
• Sum of pairs (SP) optimization: find the alignment of two
sequences that maximizes the similarity score given an arbitrary
cost matrix. We can find the optimal alignment in O(mn) time
and space using the Needleman-Wunsch algorithm.
• Recursion:
M (i  1, j  1)  s( x i , y j )
M (i, j )  
M (i, j  1)  g
M (i  1, j )  g
where M(i,j) is the score of the optimal
alignment of x_1..x_i and y_1..y_j, s(x_i,y_j)
is the substitution score for aligning x_i with y_j, and
g is the gap penalty
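
The recursion can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the match/mismatch scores, and the boundary conditions M(i,0) = i·g and M(0,j) = j·g are assumptions that the slide does not state (they are the usual choice for global alignment with a linear gap penalty). Only the score is computed; recovering the alignment itself would need a traceback.

```python
def needleman_wunsch_score(x, y, s, g):
    """Return the optimal global alignment score of sequences x and y.

    s(a, b) is a substitution score function and g is the (negative)
    penalty added for each gap character.
    """
    m, n = len(x), len(y)
    # M[i][j] holds the best score for aligning x[:i] with y[:j].
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary conditions: a prefix aligned against nothing
    # costs one gap penalty per character.
    for i in range(1, m + 1):
        M[i][0] = i * g
    for j in range(1, n + 1):
        M[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                M[i][j - 1] + g,                          # gap in x
                M[i - 1][j] + g,                          # gap in y
            )
    return M[m][n]


def simple_score(a, b):
    # Illustrative scores only: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1
```

The two nested loops fill an (m+1) by (n+1) table, which is where the O(mn) time and space bound comes from.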
Multiple sequence alignment
• “Two sequences whisper, multiple
sequences shout out loud”---Arthur Lesk
• Computationally very hard---NP-hard
Multiple sequence alignment
Unaligned sequences
Aligned sequences
_G_ _ GCTT_
A_ _C_ CTT_
Conserved regions help us
to identify functionality
Sum of pairs score
• What is the sum of
pairs score of this
alignment?
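
As a rough sketch of how a sum-of-pairs score can be computed for a multiple alignment: sum the pairwise score of every pair of rows, column by column. The function names are hypothetical, and the treatment of gaps is an assumption (a residue against a gap costs g, and a gap against a gap scores zero); other conventions exist.

```python
from itertools import combinations


def column_pair_score(a, b, s, g):
    """Score one aligned pair of characters."""
    if a == "-" and b == "-":
        return 0  # assumption: gap-gap pairs contribute nothing
    if a == "-" or b == "-":
        return g  # one residue aligned against a gap
    return s(a, b)


def sum_of_pairs(alignment, s, g):
    """Return the sum-of-pairs score of a list of equal-length gapped rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        for a, b in zip(row_a, row_b):
            total += column_pair_score(a, b, s, g)
    return total
```

For k rows of length L this takes O(k²L) time, since every pair of rows is compared at every column.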
1. Since computing the alignment with the
optimal SP score is NP-hard, we resort to
heuristics.
2. Plenty of work has been done in this area. Many
standard heuristic approaches in computer
science have been applied.
3. Popular programs are based on profiles.
• A profile can be described by a set of
vectors of nucleotide/residue frequencies
• For each position i of the alignment, we
compute the normalized frequency
of nucleotides A, C, G, and T
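
A profile of this kind can be sketched as one frequency vector per alignment column. The function name is hypothetical, and ignoring gap characters when normalizing is an assumption; the slides do not say how gaps are counted.

```python
from collections import Counter


def profile(alignment, alphabet="ACGT"):
    """Return one normalized frequency vector (a dict) per alignment column."""
    vectors = []
    for column in zip(*alignment):
        # Assumption: gaps and other non-alphabet symbols are ignored,
        # so frequencies are normalized over the residues actually present.
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values())
        vectors.append(
            {c: (counts[c] / total if total else 0.0) for c in alphabet}
        )
    return vectors
```

For example, `profile(["ACG", "ATG", "A-G"])` gives a second column with frequency 0.5 for C and 0.5 for T, since the gap in the third row is not counted.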
Aligning a profile vector to a nucleotide
• ClustalW/MUSCLE
– Let f be the profile vector
– $\mathrm{Score}(f,j) = \sum_{i \in \{A,C,G,T\}} f_i \, S(i,j)$
– where S(i,j) is the substitution scoring matrix
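
The score of a profile vector against a single nucleotide is a frequency-weighted average of substitution scores. A minimal sketch, assuming f is a dict of frequencies and S is a nested dict indexed as S[i][j] (both representations are illustrative choices, not those of ClustalW or MUSCLE):

```python
def profile_score(f, j, S, alphabet="ACGT"):
    """Return Score(f, j): the sum over i of f[i] * S[i][j]."""
    return sum(f[i] * S[i][j] for i in alphabet)


# Illustrative substitution matrix: +1 for a match, -1 for a mismatch.
EXAMPLE_S = {a: {b: (1 if a == b else -1) for b in "ACGT"} for a in "ACGT"}
```

With a column that is half A and half C, scoring against A gives 0.5·1 + 0.5·(−1) = 0, while scoring against G gives −1, so the profile prefers nucleotides it has already seen at that position.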
Iterative alignment
(heuristic for sum-of-pairs)
• Pick a random sequence from input set S
• Do (n-1) pairwise alignments and align to
closest one t in S
• Remove t from S and compute profile of
the resulting alignment
• While sequences remaining in S
– Do |S| pairwise alignments and align to closest
one t
– Remove t from S
Iterative alignment
• Once an alignment is computed, randomly
divide it into two parts
• Compute the profile of each sub-alignment
and realign the profiles
• If the sum-of-pairs score of the new alignment is
better than the previous one, keep it;
otherwise continue with a different
division until the specified iteration limit
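
The refinement loop described above can be sketched as follows. This shows only the control flow: `score` and `realign` are hypothetical callables supplied by the caller (for instance a sum-of-pairs scorer and a profile-profile aligner), and the names, the seeded random generator, and the default iteration limit are all assumptions.

```python
import random


def refine(alignment, score, realign, iterations=100, seed=0):
    """Randomly split, realign, and keep the result only if the score improves.

    alignment: list of at least two equal-length gapped rows.
    score(alignment) -> number, higher is better.
    realign(part_a, part_b) -> new alignment containing all rows.
    """
    rng = random.Random(seed)
    best = list(alignment)
    best_score = score(best)
    for _ in range(iterations):
        # Randomly divide the rows into two non-empty groups.
        indices = list(range(len(best)))
        rng.shuffle(indices)
        cut = rng.randint(1, len(indices) - 1)
        part_a = [best[i] for i in indices[:cut]]
        part_b = [best[i] for i in indices[cut:]]
        candidate = realign(part_a, part_b)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

Because a candidate is accepted only when it strictly improves the score, the score never decreases, but the procedure can still get stuck in a local optimum. Note that the row order of the result depends on how `realign` orders its output.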
Progressive alignment
• Idea: perform profile alignments in the
order dictated by a tree
• Given a guide tree, do a post-order
traversal and align sequences in that
order
• Widely used heuristic
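
To illustrate the order in which a guide tree dictates profile alignments, the sketch below does a post-order traversal of a binary tree and lists the merges it implies. The tree representation (nested pairs with sequence names at the leaves) and the function name are assumptions made for the example; the alignment step itself is omitted.

```python
def merge_order(tree):
    """Return the alignment steps implied by a binary guide tree.

    tree is either a leaf (a sequence name, str) or a pair (left, right).
    Each step is a pair (left_leaves, right_leaves): the two groups of
    sequences whose profiles would be aligned at that internal node.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left, right = node
        # Post-order: handle both subtrees before the node itself.
        left_leaves = visit(left)
        right_leaves = visit(right)
        steps.append((left_leaves, right_leaves))
        return left_leaves + right_leaves

    visit(tree)
    return steps
```

For the tree `(("a", "b"), ("c", ("d", "e")))`, the steps are: align a with b, align d with e, align c with the d/e profile, and finally align the a/b profile with the c/d/e profile. Each internal node is processed only after both of its children.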
Popular alignment programs
• ClustalW: most popular, progressive alignment
• MUSCLE: progressive and iterative combination;
uses the log expectation score
• T-COFFEE: consistency-based alignment; aligns
sequences in the multiple alignment to be close to the
optimal pairwise alignments
• PROBCONS: expected accuracy; probabilistic
consistency-based progressive scheme
• MAFFT: alignment based on Fast Fourier Transform
Evaluation of multiple sequence alignments
• Compare to benchmark “true” alignments
• Use simulation
• Measure conservation of an alignment
• Measure accuracy of phylogenetic trees
• How well does it align motifs?
• More…
Benchmarking alignment