Lecture 8 CS566 1
• “Two swallows do not make a summer”
• Discover conserved regions
– Predict important regions of the protein
– Discover domains
• Search for additional members of a protein family (profile-based searching)
• Build phylogenetic trees
• Scoring schemes
– Pairwise
– N-way
• Optimal
– Multidimensional dynamic programming
• Heuristic algorithms
– Progressive
– Iterative
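• Alignment score $S = \sum_l C_l$ (sum over alignment columns $l$)
• Column score $C_l$
– Ideally
• $C_l$ based on n-way joint probability (n-generalized AAS)
– Sum of pairs
• $C_l = \sum_{i<j} s_{ij}$, where $s_{ij}$ scores the residues of sequences $i$ and $j$ in column $l$
• Based on amino acid substitution matrices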
• Gap-gap = 0; Gap-char = -g
• Commonest scheme used
• Fallacious:
– Assumes only 2-way and not n-way joint probabilities
– Score not proportional to number of sequences in alignment
– N-way sums
• Need to know central point of reference (ancestral sequence)
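The sum-of-pairs scheme above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the toy `substitution_score` (match = 2, mismatch = -1) is an assumption standing in for a real amino acid substitution matrix, and `gap_penalty` plays the role of g.

```python
from itertools import combinations

def substitution_score(a, b, match=2, mismatch=-1):
    # Assumed toy scores; a real scheme would look up a substitution matrix.
    return match if a == b else mismatch

def column_score(column, gap_penalty=2):
    """Sum-of-pairs score of one alignment column ('-' marks a gap)."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                      # gap-gap = 0
        if a == "-" or b == "-":
            total -= gap_penalty          # gap-char = -g
        else:
            total += substitution_score(a, b)
    return total

def alignment_score(rows, gap_penalty=2):
    """Alignment score = sum of the column scores."""
    return sum(column_score(col, gap_penalty) for col in zip(*rows))

rows = ["AC-T",
        "ACGT",
        "A-GT"]
print(alignment_score(rows))  # prints 8
```

Each column contributes n(n-1)/2 pair terms, which is why the raw score is not proportional to the number of sequences.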
• Line up n sequences in a grid having n dimensions
• Score each cell as the maximum of
– Lining up all corresponding characters AND
– All possible combinations of gaps and characters
• Note choice made
• Reconstruct alignment by traceback
• Global or Local dynamic programming?
• Space complexity?
• Time complexity?
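A sketch of the n-dimensional grid, assuming global alignment, sum-of-pairs column scores with toy match/mismatch values, and a linear gap penalty (all assumptions for illustration). For n sequences of length L the grid holds (L+1)^n cells and every cell examines 2^n - 1 predecessor cells, so this is only usable for very small inputs.

```python
from itertools import combinations, product

def sp_column(column, match=2, mismatch=-1, gap=2):
    """Sum-of-pairs score of one column (toy scores, assumed)."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        total += -gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def align_global(seqs):
    n = len(seqs)
    # A move says which sequences contribute a character (1) or a gap (0).
    moves = [m for m in product((0, 1), repeat=n) if any(m)]
    origin = (0,) * n
    score = {origin: 0}
    choice = {}
    # product() visits cells in lexicographic order, so every predecessor
    # of a cell has already been scored when the cell is reached.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == origin:
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue
            column = [s[p] if m else "-" for s, p, m in zip(seqs, prev, move)]
            cand = score[prev] + sp_column(column)
            if best is None or cand > best:
                best, choice[cell] = cand, move   # note the choice made
        score[cell] = best
    # Traceback from the far corner of the grid to the origin.
    cell = tuple(len(s) for s in seqs)
    total = score[cell]
    rows = [[] for _ in seqs]
    while cell != origin:
        move = choice[cell]
        cell = tuple(c - m for c, m in zip(cell, move))
        for row, s, p, m in zip(rows, seqs, cell, move):
            row.append(s[p] if m else "-")
    return total, ["".join(reversed(r)) for r in rows]

total, rows = align_global(["ACGT", "ACT", "AGT"])
print(total)
print("\n".join(rows))
```

Several alignments can tie for the best score; which one the traceback returns depends on the order in which moves are tried.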
• Carrillo-Lipman MSA algorithm
– Uses pair-wise dynamic programming to identify sub-matrix regions of near-optimality
– n-dimensional dynamic programming carried out within space of intersection of near-optimal regions
– Still limited to only a few sequences
– Is this an optimal algorithm or not?
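The pairwise step can be illustrated as follows. This is a simplified sketch, not the published bound: for one pair of sequences it marks every grid cell through which some alignment scores within `slack` of the pairwise optimum. The `slack` parameter and the toy scores are assumptions; in the actual method the allowance for each pair is derived from a bound on the multiple-alignment score.

```python
def nw_matrix(a, b, match=2, mismatch=-1, gap=2):
    """Needleman-Wunsch table: F[i][j] = best score aligning a[:i] with b[:j]."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        F[i][0] = -gap * i
    for j in range(1, len(b) + 1):
        F[0][j] = -gap * j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)
    return F

def near_optimal_cells(a, b, slack):
    """Cells (i, j) lying on some alignment scoring >= optimum - slack."""
    forward = nw_matrix(a, b)
    # Aligning the reversed sequences gives the best score of each suffix pair.
    backward = nw_matrix(a[::-1], b[::-1])
    best = forward[len(a)][len(b)]
    cells = set()
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            through = forward[i][j] + backward[len(a) - i][len(b) - j]
            if through >= best - slack:
                cells.add((i, j))
    return cells

print(sorted(near_optimal_cells("ACGT", "ACGT", slack=0)))
```

The n-dimensional search would then be restricted to grid points whose projection onto every pair of sequences falls inside that pair's region.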
• New concepts
– Consider aligning alignments to alignments/sequences en bloc
– Hierarchical/Sequential order of alignment
(“Once a cobbler, always a cobbler”)
• Heuristic
• Fast
• Compute all pairwise alignments
• Convert alignment scores into distances
• Build guide tree (phylogenetic tree)
• Align sequences in order suggested by ‘guide tree’
• Position specific scoring system used
– Gap costs depend on position
• Composition based scoring system used
– Percentage similarity dictates choice of scoring matrix
– Weighting based on composition bias
• Only ‘cross-terms’ (profile-profile) used in scoring
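The first three steps (pairwise alignments, scores to distances, guide tree) might look like the sketch below. Everything specific here is an assumption made for illustration: the toy scoring, the score-to-distance conversion in `distance`, and the use of UPGMA-style average linkage to build the tree. The profile alignment step itself is not shown.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=1):
    """Global pairwise alignment score (toy scoring, assumed)."""
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-gap * i]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - gap, cur[j - 1] - gap))
        prev = cur
    return prev[-1]

def distance(a, b):
    # Assumed conversion: 0 for identical sequences, 1 for score <= 0.
    return 1 - max(nw_score(a, b), 0) / min(len(a), len(b))

def guide_tree(seqs):
    """Average-linkage clustering; leaves are sequence indices."""
    clusters = {i: (i, 1) for i in range(len(seqs))}   # id -> (subtree, size)
    dist = {frozenset(p): distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(range(len(seqs)), 2)}
    next_id = len(seqs)
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda p: dist[frozenset(p)])
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:   # size-weighted average distance to the merged cluster
            dist[frozenset((next_id, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGCC"]
print(guide_tree(seqs))  # prints ((0, 1), (2, 3))
```

Reading the nested tuple from the innermost pairs outward gives the order in which sequences and profiles would be aligned.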
• ClustalV (Now history!)
• ClustalW (Takes weighting into account for composition bias)
• ClustalX (Graphical interface)
• “Once a cobbler, now a king!”
• Iterative algorithm:
– Compute all pairwise similarities
– Start with best pair
– Add ‘most-similar’ sequence to profile successively till none left
– Remove and re-align each sequence till convergence
• Genetic algorithm-based MSA
– Create initial population of random alignments
– Score each alignment
– Retain better-scoring half of the population
– Mutate remaining half with ideas from genetic recombination
• Random gap insertion
• En bloc shifts
• Probabilistic order of alignment
– Score resulting alignment
– Iterate till convergence
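The loop above can be sketched as a toy evolutionary search. The assumptions are many and deliberate: a population of candidate alignments, sum-of-pairs scoring with toy values, random gap insertion as the only mutation operator (no en bloc shifts or recombination), the worse half replaced by mutated copies of the survivors, and a fixed number of generations in place of a convergence test. All names are hypothetical.

```python
import random
from itertools import combinations

def sp_score(rows, match=2, mismatch=-1, gap=2):
    """Sum-of-pairs score of a whole alignment (toy scores, assumed)."""
    total = 0
    for col in zip(*rows):
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue
            total += -gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def random_alignment(seqs, rng):
    """Pad every sequence to a common width with randomly placed gaps."""
    width = max(len(s) for s in seqs) + 2
    rows = []
    for s in seqs:
        chars = list(s)
        while len(chars) < width:
            chars.insert(rng.randrange(len(chars) + 1), "-")
        rows.append("".join(chars))
    return rows

def mutate(rows, rng):
    """Random gap insertion in one row; other rows are padded at the end."""
    target = rng.randrange(len(rows))
    out = []
    for i, row in enumerate(rows):
        if i == target:
            pos = rng.randrange(len(row) + 1)
            out.append(row[:pos] + "-" + row[pos:])
        else:
            out.append(row + "-")
    cols = [c for c in zip(*out) if set(c) != {"-"}]   # drop all-gap columns
    return ["".join(r) for r in zip(*cols)]

def evolve(seqs, pop_size=20, generations=200, seed=0):
    rng = random.Random(seed)
    population = [random_alignment(seqs, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=sp_score, reverse=True)
        survivors = population[: pop_size // 2]        # better-scoring half
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=sp_score)

best = evolve(["ACGT", "ACT", "AGT"])
print(sp_score(best))
print("\n".join(best))
```

Because the survivors are carried over unchanged, the best score in the population never decreases from one generation to the next, but nothing guarantees that the search reaches the optimal alignment.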