101 Optimal PDB Structure Alignments: A Branch-and-Cut Algorithm for the Maximum Contact Map Overlap Problem

Giuseppe Lancia, Robert Carr, Brian Walenz, Sorin Istrail

Contact Maps
• Unfolded protein
• Folded protein (marked residue pairs = contacts)
• Contact map = graph (a node for each residue, an edge for each contact)
• OBJECTIVE: align 3D folds of proteins = align contact maps

Contact Map of a Self-Avoiding Walk
[Figure: a self-avoiding walk on residues 1 to 5 and its contact map.]
      1 2 3 4 5
  1   0 0 0 1 0
  2   0 0 0 1 0
  3   0 0 0 1 1
  4   1 1 1 0 0
  5   0 0 1 0 0

Contact Map Alignments

Non-crossing Alignments
[Figure: Protein 1 and Protein 2 joined by a non-crossing map of residues in protein 1 and protein 2.]

The value of an alignment
• The value of an alignment is the number of contacts it shares between the two maps (in the pictured example, Value = 3).
• We want to maximize the value.

Integer Programming Formulation
The use of Integer Linear Programming:
• Exact solution
• Heuristic + guarantee (LP upper bound)
0-1 VARIABLES: $y_{ef}$ for e and f contacts
CONSTRAINTS: $y_{ef} + y_{e'f'} \le 1$ for each pair of incompatible sharings (e, f) and (e', f')
OBJECTIVE: $\max \sum_{e} \sum_{f} y_{ef}$

Independent Set Problem
It’s just a huge max independent set problem in Gy:
• a node for each sharing
• an edge for each pair of incompatible sharings
$|G_y| = |E_1| \cdot |E_2|$ (approximately 5000 for two proteins with 50 residues and 75 contacts each).
The best exact algorithm for independent set can solve instances with at most a few hundred nodes.

Node to Node Variables
New variables x provide an easy check for the non-crossing conditions.
NEW VARIABLES: $x_{ij}$ for i and j residues
NEW CONSTRAINTS:
• $x_{ij} + x_{i'j'} \le 1$ for each pair of crossing lines [i, j] and [i', j']
• $y_{(ip)(jq)} \le x_{ij}$ and $y_{(ip)(jq)} \le x_{pq}$

Clique Constraints
Variables x define a graph Gx:
• A node for each line
• An edge between each pair of crossing lines
• Gx is much smaller than Gy
• Gx has nice properties (it’s a perfect graph)
• It’s easier to find large independent sets in Gx
Non-crossing constraints can be extended to CLIQUE CONSTRAINTS:
$\sum_{[i,j] \in M} x_{ij} \le 1$ for all sets M of mutually incompatible (i.e. crossing) lines
All clique constraints satisfied (and Gx perfect) imply a strong bound!

Structure of Maximal Cliques in Gx
1. Pick two subsets of same size
2. Connect them in a zig-zag fashion
3. Throw in all lines included in a zig or a zag
The result is a maximal clique in Gx.

Separation of Clique Inequalities
PROBLEM: There exist exponentially many such cliques ($O(2^{2n})$ inequalities). How do we add them?
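
As a concrete aside on the objects used so far (lines [i, j], crossing lines, the value of an alignment, and clique inequalities), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and two lines that share an endpoint are treated as incompatible, on the assumption that each residue is aligned to at most one residue.

```python
from itertools import combinations

def compatible(a, b):
    """Lines a=(i, j) and b=(p, q) can coexist iff they are strictly ordered
    the same way on both proteins (assumption: shared endpoints conflict)."""
    return (a[0] - b[0]) * (a[1] - b[1]) > 0

def is_noncrossing(lines):
    """True if every pair of lines is compatible (an independent set in Gx)."""
    return all(compatible(a, b) for a, b in combinations(lines, 2))

def is_clique(lines):
    """True if every pair of lines is incompatible (a clique in Gx)."""
    return all(not compatible(a, b) for a, b in combinations(lines, 2))

def alignment_value(contacts1, contacts2, lines):
    """Number of contacts of map 1 sent onto contacts of map 2 by the alignment."""
    if not is_noncrossing(lines):
        raise ValueError("alignment has crossing lines")
    image = dict(lines)  # residue of protein 1 -> residue of protein 2
    edges2 = {frozenset(c) for c in contacts2}
    return sum(
        1
        for i, p in contacts1
        if i in image and p in image and frozenset((image[i], image[p])) in edges2
    )

def clique_violation(x_star, clique):
    """Amount by which an LP solution x* violates the clique inequality
    sum of x*_ij over the clique <= 1 (positive means violated)."""
    return sum(x_star.get(line, 0.0) for line in clique) - 1.0

# Contact map of the self-avoiding walk shown earlier, aligned with itself.
walk = [(1, 4), (2, 4), (3, 4), (3, 5)]
identity = [(k, k) for k in range(1, 6)]
print(alignment_value(walk, walk, identity))  # all 4 contacts are shared
```

The pairwise test keeps the sketch short; it is quadratic in the number of lines, which is fine for illustration but says nothing about how the branch-and-cut code stores or checks alignments.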
SOLUTION: We don’t add them to the original LP, but only when needed at run time. Not all of them will be needed, so we are fine as long as…
SEPARATION: …we can generate in polynomial time a clique inequality when needed, i.e., when it is violated by the current LP solution $x^*$:
$\sum_{[i,j] \in M} x^*_{ij} > 1$
THEOREM: We can find the most violated clique inequality in time $O(n^2)$.

Separation of Clique Inequalities: PROOF (sketch)
1) Clique = zigzag path
2) Flip one graph: zigzag → left-right
   [Figure: nodes numbered 1 to 8 on one graph and 8 to 1 on the flipped graph.]
3) Define lengths for arcs so that length(P) = $x^*$(clique(P))
4) Use dynamic programming to find P of max length in time $O(n^2)$

Genetic Algorithm Heuristic

Genetic Algorithm Overview
• A population of candidate solutions that evolve (improve) over time
• Population at time t → recombination operators and evaluation function → population at time t+1
• Recombination creates new candidate solutions via crossover and mutation

Genetic Algorithm Heuristic: Crossover
Crossover selects pieces from both parents and creates two offspring solutions.
[Figure: Blue Parent, Red Parent, Offspring.]
1. Select a set of edges in one parent to copy to the child.
2. Copy as many edges as possible from the other parent. Edges that conflict with existing edges are not copied.
3. Add random edges to fill any remaining space.

Genetic Algorithm Heuristic: Mutation
Mutation introduces small changes to existing solutions by shifting edge endpoints.
• Select a set of endpoints to shift:
  • Top or bottom?
  • All edges to the left or right of a selected edge?
  • Shift to the left or the right?
• An edge that “falls off” the end of the contact map is removed.
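
The crossover and mutation operators described in these slides can be sketched in Python as follows. This is a hedged illustration only: the slides do not say how the set of edges or endpoints is chosen, so the random subset, the one-position shift, the `fill_tries` parameter, the 0-based residue indices, and all function names are assumptions made for the example.

```python
import random

def compatible(a, b):
    """Edges a=(i, j), b=(p, q) do not conflict iff strictly ordered alike."""
    return (a[0] - b[0]) * (a[1] - b[1]) > 0

def is_noncrossing(edges):
    return all(compatible(a, b) for k, a in enumerate(edges) for b in edges[k + 1:])

def try_add(child, edge):
    """Append edge only if it conflicts with no edge already in child."""
    if all(compatible(edge, e) for e in child):
        child.append(edge)

def random_fill(child, n1, n2, rng, tries):
    """Add random edges wherever they fit (assumed meaning of 'fill space')."""
    for _ in range(tries):
        try_add(child, (rng.randrange(n1), rng.randrange(n2)))

def crossover(parent_a, parent_b, n1, n2, rng, fill_tries=20):
    """Build one child; swap the parents to obtain the second offspring."""
    # 1. Copy a set of edges from one parent (here: a random subset).
    child = [e for e in parent_a if rng.random() < 0.5]
    # 2. Copy as many edges as possible from the other parent.
    for e in parent_b:
        try_add(child, e)  # conflicting edges are skipped
    # 3. Add random edges to fill any remaining space.
    random_fill(child, n1, n2, rng, fill_tries)
    return sorted(child)

def mutate(parent, n1, n2, rng, fill_tries=5):
    """Shift the top or bottom endpoints of the edges on one side of a pivot."""
    edges = sorted(parent)
    child = []
    if edges:
        side = rng.randrange(2)            # 0 = top protein, 1 = bottom protein
        pivot = rng.randrange(len(edges))  # the selected edge
        to_right = rng.random() < 0.5      # shift edges right or left of pivot
        step = rng.choice((-1, 1))         # shift direction
        limit = (n1, n2)[side]
        for k, e in enumerate(edges):
            if (k >= pivot) if to_right else (k <= pivot):
                moved = list(e)
                moved[side] += step
                if not 0 <= moved[side] < limit:
                    continue               # edge fell off the end: removed
                e = tuple(moved)
            try_add(child, e)              # drop edges made conflicting by the shift
    random_fill(child, n1, n2, rng, fill_tries)
    return sorted(child)

rng = random.Random(0)
blue = [(0, 0), (2, 1), (4, 3)]
red = [(1, 0), (2, 2), (3, 4)]
child = crossover(blue, red, 6, 5, rng)
print(child, mutate(child, 6, 5, rng))
```

Routing every insertion through `try_add` means each operator returns a valid non-crossing alignment by construction, at the cost of a linear scan per added edge.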
As the last step of mutation, new edges are added at random.

Computational Results

Branch-and-Cut Results
• 269 proteins, 64 to 72 residues, 80 to 140 contacts
• Selected 597 pairs of proteins out of 36046 possible, with roughly as many similar pairs as dissimilar pairs

Optimality gap           0        1        2        3        4        5        >5
Number of instances      42       48       72       71       76       95       193
Avg/max num. residues    66.4/69  66.8/72  66.7/71  67.0/72  67.0/71  66.8/72  66.8/72
Avg/max num. contacts    61.1/92  56.3/89  57.3/93  59.7/95  61.5/88  64.7/89  71.4/133
Num. GA best             38       44       63       61       64       74       155
Num. LS1 best            25       20       35       31       33       35       82
Num. LS2 best            5        0        0        1        5        12       53

Skolnick Clustering Test

Skolnick Test Results: Four Families

Family  Name                                  Style       Structures  Residues (up to)  Seq. sim.  RMSD
1       Flavodoxin-like fold, Che-Y related   alpha-beta  8           124               15-30%     < 3Å
2       Plastocyanin                          beta        8           99                35-90%     < 2Å
3       TIM Barrel                            alpha-beta  11          250               30-90%     < 2Å
4       Ferritin                              alpha       6           170               7-70%      < 4Å

Proteins:
1. 1b00, 1dbw, 1nat, 1ntr, 1qmp, 1rnl, 3cah, 4tmy
2. 1baw, 1byo, 1kdi, 1nin, 1pla, 3b3i, 2pcy, 2plt
3. 1amk, 1aw2, 1b9b, 1btm, 1hti, 1tmh, 1tre, 1tri, 1ydv, 3ypi, 8tim
4. 1b71, 1bcf, 1dps, 1fha, 1ier, 1rcd

Clustering
• Define $0 \le \mathrm{score}(P_1, P_2) = \dfrac{\#\,\text{shared contacts}}{\min(\#\,\text{contacts of } P_1,\ \#\,\text{contacts of } P_2)} \le 1$
• Put P1, P2 in the same family if score(P1, P2) >= threshold
• If P1, P2 are too big, use the G.A. and local search to compute the score. The LP then gives bounds: heuristic score <= OPT score <= LP bound, and we know how far off OPT we are.

Clustering validation
We got some known families from biologists and the PDB.
Experiment: take a family F of proteins and align them against each other and against the remaining proteins.
TYPICAL BEHAVIOUR:
score                                   proteins were…
0.05, 0.1, 0.15, 0.2, 0.25, 0.3         MISMATCH
0.35, …, 1.0                            MATCH

Skolnick Results: Performance
• 528 alignments
• 1.3% false negative
• 0.0% false positive

Clustering ...
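
The scoring and family-assignment rule from the Clustering slides can be sketched in Python as below. This is a minimal illustration under assumptions: the slides only say that two proteins go in the same family when their score reaches a threshold, so grouping by connected components (transitive closure of that rule), the function names, the toy protein labels, and the threshold value used in the example are all hypothetical.

```python
def overlap_score(shared_contacts, num_contacts_1, num_contacts_2):
    """score(P1, P2) = shared contacts / min(contacts of P1, contacts of P2)."""
    smaller = min(num_contacts_1, num_contacts_2)
    if smaller == 0:
        raise ValueError("a contact map with no contacts has no defined score")
    return shared_contacts / smaller

def cluster(names, pair_scores, threshold):
    """Group proteins so that any pair scoring >= threshold shares a family.

    Assumption: families are the connected components of the graph whose
    edges are the pairs at or above the threshold (union-find below).
    """
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for name in names:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())

# Toy data with made-up labels and scores, for illustration only.
scores = {
    ("A", "B"): 0.8, ("B", "C"): 0.4, ("A", "C"): 0.2,
    ("A", "D"): 0.05, ("B", "D"): 0.0, ("C", "D"): 0.1,
}
print(cluster(["A", "B", "C", "D"], scores, threshold=0.35))
```

When the score comes from the heuristics rather than from an optimal alignment, the same functions apply to the heuristic value; the LP bound then tells how much higher the true score could be.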
STRUCTURAL GENOMICS

Structural Genomics
• Structure Similarity
• Structure Alignment
• Fold Recognition
• Fold Assignment

Structure Alignment
• “Structural twins”: structurally similar, but sequentially unrelated proteins
• Agreement that two proteins (or a group) are similar, but no agreement on how they are similar
• Variety of similarity criteria proposed, e.g., global similarity based on RMS on alpha carbons, or local similarity based on packing and interaction patterns

Standard of Truth
Structure alignment is regarded as a “standard of truth” for sequence alignment:
• Best scoring matrices
• Alignment strategies
• Published structure alignments differ in almost all positions

Quantifying Similarity
• RMS optimal superposition of alpha carbon atoms
• Difference between distance maps
• Contact map overlap

Scoring Functions
• Non-locality of scoring functions
• Insertions and deletions
• Capturing geometric pattern

Contact Map

Homology Modeling: The Differential Protein Folding Problem
• The prediction of change in structure from change in sequence
• The hypothesis: small changes in sequence often produce relatively small changes in protein structure

Fold Recognition: The Sequence-Structure Alignment Problem
• Identify a folding pattern of a protein as similar to one or more known structures in a library of protein folds
• Measure of sequence-structure compatibility

Ab initio Protein Folding Problem
• Full or partial set of three-dimensional coordinates
• Assignment of secondary structures
• Sets of contacts between residues
• Sets of contacts between helices and strands

Detection of Structural Similarity
• Similarity of two sets of atoms with known correspondences: sets have the same size; correspondence is order preserving
• Similarity of two sets of atoms with unknown correspondence: sets have different sizes; insertions/deletions; correspondence is order preserving
• Similarity of two sets of atoms with unknown correspondence: no restriction on the correspondence

The Assessment of Structure Prediction Problem
• Less general than the problem of finding maximal common structures of proteins, because the alignment is fixed
• Finding common substructures is not well formulated, because substructures of different sizes will fit to different accuracies