Multiple sequence alignment - Department of Computer Science • NJIT

BNFO 601
Multiple sequence alignment
Usman Roshan
Optimal pairwise alignment
• Sum of pairs (SP) optimization: find the alignment of two
sequences that maximizes the similarity score given an arbitrary
cost matrix. We can find the optimal alignment in O(mn) time
and space using the Needleman-Wunsch algorithm.
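• Recursion:
\[
M(i,j) = \max
\begin{cases}
M(i-1,\, j-1) + s(x_i, y_j) \\
M(i,\, j-1) + g \\
M(i-1,\, j) + g
\end{cases}
\]
where $M(i,j)$ is the score of the optimal
alignment of $x_1 \ldots x_i$ and $y_1 \ldots y_j$, $s(x_i, y_j)$
is the score from the substitution scoring matrix, and
$g$ is the gap penalty
• Traceback: starting from $M(m,n)$, follow the maximizing
choices back to $M(0,0)$ to recover the alignment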
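The recursion and traceback can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch` and the default scores (match +1, mismatch -1, gap -2) are assumptions chosen for the example, and a simple match/mismatch rule stands in for a full substitution matrix.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y).

    The scoring values are illustrative assumptions; gap is added, so it is negative.
    """
    def s(a, b):
        # Stand-in for the substitution scoring matrix s(x_i, y_j).
        return match if a == b else mismatch

    m, n = len(x), len(y)
    # M[i][j] = score of the optimal alignment of x[:i] and y[:j].
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                M[i][j - 1] + gap,                        # gap in x
                M[i - 1][j] + gap,                        # gap in y
            )

    # Traceback from M[m][n] to M[0][0], re-deriving which case gave the max.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return M[m][n], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("ACTT", "ACACTTC")
print(score)
print(ax)
print(ay)
```

The table has (m+1)(n+1) entries and each is filled in constant time, which is where the O(mn) time and space bound comes from.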
Multiple sequence alignment
• “Two sequences whisper, multiple
sequences shout out loud” (Arthur Lesk)
• Computationally very hard: finding the
optimal alignment is NP-hard
Multiple sequence alignment
Unaligned sequences
GGCTT
TAGGCCTT
TAGCCCTTA
ACACTTC
ACTT
Aligned sequences
_G__GCTT_
TAGGCCTT_
TAGCCCTTA
A__CACTTC
A__C_CTT_
Conserved regions help us
to identify functionality
Sum of pairs score
• What is the sum of
pairs score of this
alignment?
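As a rough sketch of how such a score is computed: the sum of pairs score adds up the pairwise score of every pair of rows, column by column. The names `pair_score` and `sum_of_pairs` and the scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0) are assumptions for illustration; the slides do not fix a particular scoring scheme.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (illustrative values, '_' is a gap)."""
    if a == "_" and b == "_":
        return 0          # two gaps: assumed to contribute nothing
    if a == "_" or b == "_":
        return gap        # residue aligned to a gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, **scores):
    """Sum pair_score over all pairs of rows and all columns."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(r1[c], r2[c], **scores)
        for r1, r2 in combinations(alignment, 2)
        for c in range(len(r1))
    )


# The aligned example from the earlier slide, under the assumed scoring scheme.
aligned = ["_G__GCTT_", "TAGGCCTT_", "TAGCCCTTA", "A__CACTTC", "A__C_CTT_"]
print(sum_of_pairs(aligned))
```

With k sequences of aligned length L this takes O(k²L) time, so scoring a given alignment is cheap; the hard part is searching for the alignment that maximizes the score.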
1. Since computing the alignment with the
optimal SP score is NP-hard we resort to
heuristics.
2. Plenty of work done in this area. Many
standard heuristic approaches in computer
science have been applied.
3. Popular programs are based on profiles.
Profile
• A profile can be described by a set of
vectors of nucleotide/residue
frequencies.
• For each position i of the alignment, we
compute the normalized frequency
of nucleotides A, C, G, and T
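A minimal sketch of building such a profile from an alignment is shown below. The function name `profile` is hypothetical, and the handling of gaps is an assumption: here gap characters are ignored and frequencies are normalized over the nucleotides present in each column, whereas real programs may treat gaps differently.

```python
def profile(alignment, alphabet="ACGT"):
    """Return one frequency vector (a dict) per alignment column.

    Assumption: gaps ('_') are skipped, so each column's frequencies
    sum to 1 over the nucleotides that are present.
    """
    columns = []
    for c in range(len(alignment[0])):
        column = [row[c] for row in alignment]
        counts = {a: column.count(a) for a in alphabet}
        total = sum(counts.values())
        columns.append(
            {a: (counts[a] / total if total else 0.0) for a in alphabet}
        )
    return columns


p = profile(["AC", "AG", "A_"])
print(p[0])  # every sequence has A in the first column
print(p[1])  # C and G split the second column; the gap is ignored
```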
Aligning a profile vector to a
nucleotide
• ClustalW/MUSCLE
– Let f be the profile vector
– $\mathrm{Score}(f, j) = \sum_{i \in \{A, C, G, T\}} f_i \, S(i, j)$
– where S(i,j) is the substitution scoring matrix
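The score above is a frequency-weighted average of substitution scores, which can be sketched as follows. The matrix `S` here is a toy assumption (match +1, mismatch -1) and `score_profile_vector` is a hypothetical name; this is not the scoring actually used inside ClustalW or MUSCLE.

```python
NUCLEOTIDES = "ACGT"

# Toy substitution matrix (assumed): +1 on the diagonal, -1 elsewhere.
S = {i: {j: (1 if i == j else -1) for j in NUCLEOTIDES} for i in NUCLEOTIDES}


def score_profile_vector(f, j, S=S):
    """Score(f, j) = sum over i in {A, C, G, T} of f[i] * S[i][j]."""
    return sum(f[i] * S[i][j] for i in NUCLEOTIDES)


f = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
print(score_profile_vector(f, "A"))  # half matches, half mismatches
print(score_profile_vector(f, "G"))  # no nucleotide in the column matches
```

When the profile vector puts all its weight on one nucleotide, the score reduces to the ordinary pairwise substitution score S(i,j).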

Iterative alignment
(heuristic for sum-of-pairs)
• Pick a random sequence from the input set S
of n sequences
• Do (n-1) pairwise alignments against the other
sequences and align it to the closest one, t, in S
• Remove t from S and compute the profile of
the alignment
• While sequences remain in S
– Do |S| pairwise alignments and align to the closest
one, t
– Remove t from S
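The greedy selection in the steps above can be sketched as below. This only illustrates the order in which sequences join the growing alignment; it does not perform the alignments. Two simplifications are assumed and should be read as such: `identity` is a toy similarity that compares positions directly instead of running a pairwise alignment, and closeness to the current profile is approximated by the average similarity to the sequences already chosen. Both `identity` and `greedy_join_order` are hypothetical names.

```python
import random


def identity(a, b):
    """Toy similarity (assumed): fraction of matching positions, no alignment."""
    n = min(len(a), len(b))
    return sum(a[k] == b[k] for k in range(n)) / n if n else 0.0


def greedy_join_order(seqs, similarity=identity, seed=0):
    """Return the order in which sequences would be added to the alignment."""
    rng = random.Random(seed)
    S = list(seqs)
    # Pick a random starting sequence and take it out of S.
    group = [S.pop(rng.randrange(len(S)))]
    while S:
        # Closest remaining sequence to the group built so far
        # (average similarity stands in for a profile alignment score).
        t = max(S, key=lambda s: sum(similarity(s, g) for g in group) / len(group))
        S.remove(t)
        group.append(t)
    return group


print(greedy_join_order(["AAAA", "AAAT", "CCCC", "CCCG"]))
```

In a fuller version, each step would replace `similarity` with an actual profile-to-sequence alignment and update the profile after every merge.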
Iterative alignment
• Once the alignment is computed, randomly
divide it into two parts
• Compute the profile of each sub-alignment
and realign the two profiles
• If the sum-of-pairs score of the new alignment is
better than the previous one, keep it;
otherwise continue with a different
division, until a specified iteration limit is reached
Progressive alignment
• Idea: perform profile alignments in the
order dictated by a tree
• Given a guide tree, do a post-order
traversal and align sequences (or profiles) in that
order
• Widely used heuristic
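To make the ordering concrete, here is a small sketch of the post-order traversal of a guide tree. It records which two groups would be aligned at each internal node and leaves the profile alignment itself out. The tree encoding (a leaf is a sequence name, an internal node is a 2-tuple) and the name `merge_order` are assumptions for the example.

```python
def merge_order(tree):
    """List the (left, right) merges in the order a post-order traversal visits them.

    A leaf is a string label; an internal node is a (left, right) tuple.
    """
    order = []

    def visit(node):
        if isinstance(node, str):
            return node                     # leaf: a single sequence
        left = visit(node[0])               # finish the left subtree first
        right = visit(node[1])              # then the right subtree
        order.append((left, right))         # only then merge the two results
        return f"({left},{right})"

    visit(tree)
    return order


guide_tree = (("a", "b"), ("c", "d"))
for left, right in merge_order(guide_tree):
    print("align", left, "with", right)
```

Post-order guarantees that both children of a node are fully aligned before they are merged, so the most closely related sequences (deepest in the tree) are aligned first.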
Popular alignment programs
• ClustalW: most popular, progressive alignment
• MUSCLE: progressive and iterative combination;
uses the log expectation score
• T-COFFEE: consistency-based alignment; aligns
sequences so that the multiple alignment stays close to the
optimal pairwise alignments
• PROBCONS: maximizes expected accuracy; progressive
scheme based on probabilistic consistency
• MAFFT: alignment based on Fast Fourier Transform
Evaluation of multiple sequence
alignments
• Compare to benchmark “true”
alignments
• Use simulation
• Measure conservation of an alignment
• Measure accuracy of phylogenetic trees
• How well does it align motifs?
• More…
Benchmarking alignment
programs
• http://nar.oxfordjournals.org/content/38/15/4917.abstract