Or, What is a correspondence set anyway?!
Chapter 16, Gu and Bourne, “Structural Bioinformatics”
Topic 12

Alignment vs. superposition
• Structural alignment attempts to establish homology between two or more polymer structures based on their shape and 3D conformation.
• Structural alignment requires no a priori knowledge of equivalent positions.
• Structural alignment is a valuable tool for comparing proteins with low sequence similarity, where evolutionary relationships cannot be easily detected by standard sequence alignment techniques.
• Conversely, simple structural superposition uses knowledge of at least some equivalent residues to guide a rigid-body superposition.
• The most basic possible comparison between protein structures makes no attempt to align the input structures.
• It instead requires a precalculated alignment as input to determine which residues are to be considered in the RMSD calculation.

Structure alignment
• Structure alignments are based on structure similarity, from which sequence alignments can be trivially extracted.
• Due to computational complexity, most structural alignments are pairwise, but multiple alignment methods do exist.
[Figure: two panels labeled “First step” and “Second step”]

Dynamic programming and sequence alignment
To really understand structure alignment, you need to understand sequence alignment...
• Dynamic programming (DP) is an algorithm originally developed by Richard Bellman in the early 1950s for “multistage decision processes.”
• DP methods solve optimization problems and are very useful in bioinformatics applications such as sequence alignment.
• There are a large number of possible solutions, but only one (or a few) best solution(s).
Foundation: Any partial sub-path ending at a point along the true optimal path must itself be an optimal path leading up to that point.
So the optimal path can be found by incremental extensions of optimal sub-paths, leading to a recursive algorithm that is (typically) guaranteed to produce the best answer.
There are two major types of optimal DP sequence alignments: global (Needleman-Wunsch) and local (Smith-Waterman) alignments.
Both are based on the assumption of independence, where the score of a residue (mis)match is unaffected by other pairs, so the alignment score is a joint probability! For example…

ASCTVL
ATCAVI

$$S_{ali} = p(AA) \times p(ST) \times p(CC) \times p(TA) \times p(VV) \times p(LI)$$

Based on the magic of logarithms, the product becomes a sum:

$$S'_{ali} = S(AA) + S(ST) + S(CC) + S(TA) + S(VV) + S(LI)$$

Substitution (scoring) matrix

$$S(ij) \cong \log \frac{p(ij)}{p(i)\,p(j)}$$

Substitution matrices are composed of log-ratios that compare observed pairs to background expectation. Values of S(ij) > 0 indicate ‘preferred’ matches. For example, the BLOSUM-62 matrix…

Dynamic Programming (DP)
Match: +5
Mismatch: -2
Insertion/deletion: -6
Sean Eddy, 2004, Nature Biotechnology

Back to structure alignment
Independence is not a valid assumption in structure because… Similarly, in RNA…
That is, the probability of mutating the above lysine to X, p(KX), is NOT independent of the aspartate. This is, of course, the reality in sequence alignment too, but we ignore this fact because we are treating the protein as a 1D sequence that doesn’t reveal those details.

Rigid body treatment ≠ independence of positions
Structure alignment treats proteins as rigid bodies, leading to an even more serious violation of independence.
[Figure: rotation of purple by 90° also rotates the green]
That is, adjusting the position of the purple residue, for example, to maximize overlap with its target will also alter the position of the green residue because they are rigidly related.
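The global (Needleman-Wunsch) recursion described above can be sketched in a few lines. This is a minimal illustration, not a production aligner: it assumes the flat scoring scheme quoted on the DP slide (match +5, mismatch -2, insertion/deletion -6) rather than a substitution matrix such as BLOSUM-62, and the function name `needleman_wunsch` is hypothetical.

```python
def needleman_wunsch(x, y, match=5, mismatch=-2, gap=-6):
    """Global alignment by dynamic programming (flat scoring, linear gap cost)."""
    n, m = len(x), len(y)
    # F[i][j] holds the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Fill: each cell extends an optimal sub-path by one column.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # pair x[i-1] with y[j-1]
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x

    # Traceback from the bottom-right corner, preferring the diagonal move.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("ASCTVL", "ATCAVI")
print(score)   # 9: three matches (+15) and three mismatches (-6), no gaps
print(top)
print(bottom)
```

Because each column is scored independently of its neighbours, the table can be filled one cell at a time. That independence is exactly what fails for rigid 3D structures.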
Formalizing the structure alignment problem
Given two sets of points $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_m)$ in Cartesian space, find the optimal subsets A(P) and B(Q) with |A(P)| = |B(Q)|, and find the optimal rigid body transformation G between the two subsets that minimizes a given distance metric D over all possible rigid body transformations G, i.e.

$$\min_{G} \{ D(A(P) - G(B(Q))) \}$$

The two subsets A(P) and B(Q) define a “correspondence”, and p = |A(P)| = |B(Q)| is called the correspondence length. Naturally, the correspondence length is maximal when A(P) and B(Q) are similar.
Therefore there are essentially two problems in structure alignment:
(i.) Find the correspondence set (which is NP-hard), and
(ii.) Find the alignment transform (which is O(n)).

Just to clarify…
In the structure alignment literature, you will frequently encounter coordinate root mean squared deviation, which is just like RMSD except that B describes a coordinate transformation of b:

$$RMSD = \sqrt{\frac{\sum_{i=1}^{n} (a_i - B_i)^2}{n}}$$

Common structure alignment methods
DALI: Uses 2D distance matrices between CA atoms to represent each structure. Conceptually, the alignment problem is then straightforward: you must simply maximally overlay the matrices (as described in an earlier cartoon).
Holm and Sander. Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233:123-138.
CE (Combinatorial Extension): Uses characteristics of local geometry to seed structural alignments and then joins these regions of local similarity (called aligned fragment pairs, AFPs) into an “optimal” path for the full alignment. Bottom-up approach.
Shindyalov and Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Prot Eng 1998, 11:739-747.
SSAP (Sequential Structure Alignment Program): Uses a “double-dynamic programming” algorithm with high-level and low-level matrices. Used in CATH classification.
Taylor WR, Orengo CA. 1989b. Protein structure alignment. J Mol Biol 208:1-22.
VAST (Vector Alignment Search Tool), TM-align and many more…

Dali: The Persistence of Time

Overview of the Dali Algorithm
Starting with a contact map…
Dali attempts to maximize the overlap of the contact maps; however, doing so globally is NP-hard, so the method focuses on local comparisons.
Images from Amy Keating at MIT and Mark Maciejewski at UConn

The DALI (Distance matrix alignment) algorithm is based on the matrix comparison methods that we have already introduced.
Similarity score:

$$S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i, j)$$

-- i and j are equivalent residues in A and B
-- L is the number of such pairs, or the size of the substructure
-- $\phi$ is the similarity measure based on the CA distances $d^A_{ij}$ and $d^B_{ij}$
[Figure: Structure A with residues $i_A$ and $j_A$ separated by $d^A_{ij}$; Structure B with residues $i_B$ and $j_B$ separated by $d^B_{ij}$]
Images and content modified from Mark Maciejewski at UConn

The Dali Algorithm (step by step)
1. Compute distance matrices for both proteins A and B.
2. Extract a full set of overlapping hexapeptide (6x6) sub-matrices (also called contact patterns) from each matrix.
3. Each 6x6 distance matrix from protein A is compared with each 6x6 distance matrix in protein B. (Really?)
[Figure: 6x6 CA distance matrices $d^A_{ij}$ and $d^B_{ij}$ and their difference $d^A_{ij} - d^B_{ij}$. For example: 6.2 - 12.7 = -6.5]

Step 1: For each hexapeptide, a distance matrix compares it to every other hexapeptide within its structure.
Step 2: Every distance matrix created in step 1 for one protein is compared with every such matrix from the other protein.
“Houston, … we have a problem!”
Consider protein A with 100 residues, meaning we have 100 - 5 = 95 hexapeptides: (95^2)/2 ≈ 4,512 contact pattern matrices.
Consider protein B with 150 residues, meaning 150 - 5 = 145 hexapeptides: (145^2)/2 ≈ 10,512 contact pattern matrices.
Even for these two relatively small proteins, there would be 4,512 x 10,512 = 47,430,144 comparisons between A and B.
4.
Each contact pattern in protein A is paired with its most similar pattern in protein B, a process that generates a pair list.
5. The list is sorted based on the strength of pair similarity of contact patterns:

$$S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i, j)$$

Note that unmatched residues do not contribute to the overall similarity score S.
Image from Amy Keating at MIT

A note about the similarity measure $\phi$: We want to maximize the number of equivalent residues while minimizing structural variations, so it is a tradeoff. That is, if the criteria are so tough that minor structural deviations are not allowed, then the number of matching contact patterns is likely to be very small.

Q: How do you calculate the $\phi(i, j)$ term of S?
Method 1: Rigid residue-pair similarity score:
-- 1.5 Å is the zero level of similarity.

$$\phi^R(i, j) = 1.5 - |d^A_{ij} - d^B_{ij}|$$

-- The only thing that matters is the absolute difference, meaning that the same difference is penalized equally at long and short distances.
Method 2: Elastic similarity score (default):
-- Larger differences are tolerated for longer-range contact pairs.

$$\phi^E(i, j) = \left( 0.2 - \frac{|d^A_{ij} - d^B_{ij}|}{d^{AB}_{ij}} \right) \exp\left( -\left( \frac{d^{AB}_{ij}}{20} \right)^2 \right); \quad \text{where } d^{AB}_{ij} = \frac{d^A_{ij} + d^B_{ij}}{2}$$

6. Merge contact patterns to form chains and reduce complexity. The search space is reduced because only the central contact pattern is retained (actually, the one that gives the smallest average intra-pattern distance).
7. After removing the overlapping patterns, we are still left with way too many contact patterns to exhaustively compare all possible pairs. Start comparing pairs at random:
-- Keep a list of positive scores (discard negative scores).
-- Keep comparing until your list has 80,000 positive scores.
Sort the list and keep the best 40,000 contact pattern matches.
8.
End game: We need to find the optimal alignment of the 40,000 contact patterns such that the alignment covers as wide a range of the structural pair as possible. Using Markov Chain Monte Carlo (MCMC), start with a random contact pattern from the list of 40,000, and then “walk” to another overlapping pattern (which must extend the contact pattern by 4 residues) using the standard Metropolis criterion.

Metropolis Monte Carlo Optimization

$$P(x) = \frac{e^{-E(x)/k_B T}}{\sum_{states} e^{-E(x_i)/k_B T}} \quad \therefore \quad P(x \to y) = e^{-\Delta E / k_B T}$$

In Dali…
The net result is that scores that improve are always kept, whereas scores that get worse are accepted with some probability:

$$P(acceptance) = e^{\beta (S' - S)}, \quad \text{where } \beta = 1 / k_B T$$

The Dali Algorithm (the reality)

Statistical significance of Dali alignments
Dali uses a Z-score to show the significance of the alignment:

$$Z = \frac{S - \bar{S}}{\sigma}$$

-- $S$: raw score
-- $\bar{S}$: average score
-- $\sigma$: standard deviation
A common and practical approach to the problem of assessing alignment significance is to determine whether the alignment score is better than one could expect by chance. Dali compares each alignment score against an all-to-all protein structure comparison (normalized by length), which defines the Z-score.
-- Dali Z-scores > 2 are thought to be meaningful.

Combinatorial Extension (a cursory look)
Similar to Dali in that it also breaks the structure down into a series of small fragments, which it then attempts to reassemble into a complete alignment.
For a pair of proteins A and B, an aligned fragment pair (AFP) is defined as a continuous segment of A aligned against a continuous segment of B of the same size (without gaps).
If $n_1$ and $n_2$ are the lengths of A and B, and the AFP length is set to m, then there are a total of $(n_1 - m)(n_2 - m)$ possible AFPs.
Only AFPs that meet given criteria for local similarity are included in the matrix, as a means of restricting the search space.
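The AFP filtering idea above can be sketched as follows. This is only an illustration under stated assumptions: the local-similarity test used here (mean absolute difference between the two intra-fragment CA distance matrices, compared with a `cutoff`) is a hypothetical stand-in and not CE's published distance definitions, and the names `distance_matrix` and `candidate_afps`, the default `m=8`, and the default `cutoff=1.0` are likewise assumptions. The loop visits every possible start position, n - m + 1 per chain, which differs from the count quoted above only by the boundary term.

```python
import numpy as np


def distance_matrix(coords):
    """All-against-all Euclidean distances for an (n, 3) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def candidate_afps(coords_a, coords_b, m=8, cutoff=1.0):
    """Return (i, j) start indices of gap-free fragment pairs of length m.

    A pair is kept when the mean absolute difference between the two
    intra-fragment distance matrices is at most `cutoff` (in the same
    units as the coordinates). This criterion is an assumed placeholder.
    """
    da = distance_matrix(np.asarray(coords_a, dtype=float))
    db = distance_matrix(np.asarray(coords_b, dtype=float))
    afps = []
    for i in range(da.shape[0] - m + 1):
        block_a = da[i:i + m, i:i + m]
        for j in range(db.shape[0] - m + 1):
            block_b = db[j:j + m, j:j + m]
            if np.abs(block_a - block_b).mean() <= cutoff:
                afps.append((i, j))
    return afps


# Toy usage: a fragment always matches a rigidly moved copy of itself,
# because intra-fragment distances are invariant to rotation and translation.
rng = np.random.default_rng(0)
chain = rng.normal(size=(12, 3)) * 5.0
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = chain @ rot_z.T + np.array([1.0, 2.0, 3.0])
print(candidate_afps(chain, moved, m=8, cutoff=1e-6))
```

Comparing internal distances rather than raw coordinates means no superposition is needed at this stage, which is what makes the fragment filter cheap.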
An alignment path is calculated as the optimal path through the similarity matrix by linearly progressing through the sequences and extending the alignment with the next possible high-scoring AFP.

Goal: Find a “good” local alignment for structures of proteins A and B.
1. Select some initial AFP.
2. Build an alignment path by incrementally adding AFPs in a way that satisfies the conditions (i.e., stitch AFPs together).
3. Repeat step (2) until the length of each protein is traversed, or until no “good” AFPs remain.
4. Optimize the alignment via dynamic programming.
5. Measure statistical significance.
Questions: How do we choose the starting AFP? What are the criteria for adding AFPs to our alignment path? What does the distance function look like? When do we stop? That is, at what point do we know that there are no “good” AFPs left?

To assess how good the alignment produced by CE is, we can compare it to the alignment of a random pair of structures, and compute the Z-score based on the RMSD and the number of gaps in the final alignment.
Since CE does not penalize gaps, we can perform additional optimization after CE is completed in order to remove excess gaps using dynamic programming.
The CE method is highly configurable, which is at once its strength and its weakness. Adjusting multiple parameters, such as the AFP length m, the cutoff distances D0 and D1, and the definitions of AFP distances, can result in varying alignments and execution speeds.
In general, CE does not outperform previously existing structural alignment methods, such as Dali and VAST: it does better for some pairs of structures, and worse for others.

VAST (a cursory look)
VAST = Vector Alignment Search Tool
1.) Parse protein structures into SSEs (helices and strands).
2.) Fit vectors to SSEs.
3.) To compare a pair of proteins, attempt to superpose as many vectors as possible, subject to constraints.
4.)
Evaluate the vector alignment for statistical significance (compute an E-value).
5.) If the vector alignment is significant, then proceed to a more detailed residue-to-residue alignment (“refined alignment”).
Modified from Tom Madej at GWU

VAST in pictures…

Double Dynamic Programming (a cursory look)
Use two levels of dynamic programming: a high-level scoring matrix and a low-level matrix for each high-level matrix element.
Each element $F_{ij}$ of the high-level scoring matrix indicates how likely it is that the residue pair (i, j) lies on an optimal alignment.
For each $F_{ij}$, the likelihood is found by a (low-level) optimal alignment with the constraint that $F_{ij}$ is part of the alignment.
The scores along the low-level alignments are accumulated in the high-level scoring matrix.

DDP cont.
Begin by constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein. A series of matrices is then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed. Dynamic programming applied to each resulting matrix determines a series of optimal local alignments, which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural similarity.

Stated in a slightly different manner
First level:
• Represent each residue by a neighborhood vector for C
• Compare n versus m neighborhood vectors
• Generate an optimal alignment based on vector differences and dynamic programming
Second level:
• Add matrix scores if paths cross in a cumulative matrix
• Generate an optimal alignment based on the cumulative matrix

SSAP = Sequential Structure Alignment Program
Generally, SSAP scores above 80 are associated with highly similar structures. Scores between 70 and 80 indicate a similar fold with minor variations.
Structures yielding a score between 60 and 70 do not generally contain the same fold, but usually belong to the same protein class with common structural motifs.
SSAP originally produced only pairwise alignments but has since been extended to multiple alignments as well. It has been applied in an all-to-all fashion to produce the CATH classification.

Multiple Structure Alignment
Most multiple structure alignments are based on a pile-up combination of pairwise results; however, a few algorithms do perform an all-to-all optimization.
One example of a multiple alignment method is Combinatorial Extension Monte Carlo (CE-MC), which is based on a progressive CE multiple alignment strategy, followed by an iterative Metropolis MC refinement.
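The Metropolis step behind such a refinement can be sketched generically. The acceptance rule below is the one given earlier for Dali, P(acceptance) = exp(β(S′ − S)) with β = 1/k_BT, applied to a score that is being maximized. Everything else is an assumption for illustration: the function names `metropolis_accept` and `metropolis_maximize`, the `propose` and `score` callables, and the toy one-dimensional problem stand in for real alignment moves and alignment scores.

```python
import math
import random


def metropolis_accept(score_old, score_new, beta, rng=random):
    """Always accept improvements; accept a worse score with probability
    exp(beta * (score_new - score_old))."""
    delta = score_new - score_old
    if delta >= 0:
        return True
    return rng.random() < math.exp(beta * delta)


def metropolis_maximize(initial, propose, score, beta=1.0, steps=1000, seed=0):
    """Generic Metropolis walk that remembers the best state visited."""
    rng = random.Random(seed)
    current, s_current = initial, score(initial)
    best, s_best = current, s_current
    for _ in range(steps):
        candidate = propose(current, rng)   # hypothetical move generator
        s_candidate = score(candidate)
        if metropolis_accept(s_current, s_candidate, beta, rng):
            current, s_current = candidate, s_candidate
            if s_current > s_best:
                best, s_best = current, s_current
    return best, s_best


# Toy usage: maximize -(x - 3)^2 over the integers with +/-1 moves.
best_x, best_score = metropolis_maximize(
    initial=0,
    propose=lambda x, rng: x + rng.choice((-1, 1)),
    score=lambda x: -(x - 3) ** 2,
)
print(best_x, best_score)
```

A small β (high temperature) accepts most downhill moves and explores widely, while a large β behaves almost like greedy hill climbing. That is the tradeoff any such refinement has to tune.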