Rotamer Packing Problem: The algorithms Hugo Willy 26 May 2010 Outline • • • • • • Preliminaries Problem Formulation Dead-End Elimination SCWRL TreePack Relevance to my current work Rotamers • Protein side chain may have many different conformation • They are mostly defined by the dihedral angles (bond length and bond angle is relatively fixed) • The figure shows the dihedral angles of a glutamic acid’s side chain. Rotamers (2) • The range of the dihedrals is a continuous 0360°. However, there are certain angles that is preferred because of the energetics. They are the called gauche+ (+60°), gauche- (-60°) and trans (180°). • Those numbers above are approximate values. Different amino acid would have different average angle for gauche+ (g+), g- and trans (t). • These averaged values form a finite number of possible dihedral angles. They are called rotamers. • Hence, rotamers, in a sense, are discretization of the dihedral angle space of amino acid residues Rotamers (3) • Rotamer libraries are collected by selecting unrelated PDB structures with high resolutions. • With more data available, rotamer libraries can be built conditional upon the backbone conformation (phi and psi angle in space of 10°). They are called the backbonedependent rotamer library. Side chain interaction • Each side chain conformation entails a set of interaction between the residue in question with its surrounding neighborhood. • They can be favorable or not-favorable based on their distance and charge. • These interactions is used to score to the chosen conformation. The energy function • Eglobal is the total energy of the system • Etemplate is the energy of the template backbone • E(ir) is a function that defines the interaction energy of residue i with the fixed backbone if it takes the rotamer r • E(ir,js) defines the interaction energy between residues i and j if they adopt the rotamers r and s resp. Rotamer Packing Problem • Given a fixed backbone conformation of a protein sequence S[1..N] and an interaction energy scoring function E • Find the optimal rotamer set {r1,r2,...rN} for S[1..N] such that the sum of all self and pairwise residue interactions is minimized w.r.t E. Rotamer Packing Problem • The brute force rotamer search method is exponential in the number of rotamers per residue. O(nrotN) • Assuming three conformations per dihedral, residue with 1 dihedral (1) would have 3 rotamers, those with two would have 9, 3 yields 27 and, ultimately, 4 gives 81 possible rotamers. • The ones with 4 dihedrals are arginine and lysine—the only two amino acids with positive charge (Histidine have a weaker positive charge but it depends on its environmental pH condition). • Which says that they are pretty common-esp in TFs and DNA interacting proteins (DNA carry a net negative charge). • We need optimization. Dead End Elimination a.k.a DEE (Nature 1992) • If for some rotamer r of residue i, its sum of interaction energies with the best rotamers of other residues w.r.t r is still larger than the interaction energy of rotamer t of i with the worst rotamer possible of other residues w.r.t t • Then r is certainly not in the best rotamer configuration (r is called dead-ending). Dead End Elimination (2) • Extending to rotamer pair, let • If we have • Then the rotamer pair r and s is a dead-end Dead End Elimination (3) • The DEE is applied in iterative fashion 1. DEE is applied for single rotamers 2. DEE is applied for rotamer pairs and they are marked. These pairs are then removed from the possible pairs considered in the single rotamer case in the next iteration. • A rotamer r of residue i whose pairing with all other rotamer of a residue j are marked is also dead-ending and hence removed. • In a case study using insulin structure of 76 residues, the initial number of rotamer configurations is 2.7E+76 • After 9 iterations, only 7200 are left. SCWRL (Protein Sci. 2003) • The problem with DEE would be when there are still a lot of remaining residues with more than 1 possible rotamer. • SCWRL models the remaining residues as a graph where the residues forms the nodes and an edge is established whenever two residues have at least a pair of rotamer configuration whose interaction energy is non-zero SCWRL (2) • Previously, SCWRL will try to find a “keystone” node whose removal would divide the connectivity graph to two. • Then, the energy of the two parts can be computed separately. • Complexity is reduced from nrot11 to nrot7+nrot5 SCWRL (3) • In the most recent improvement, SCWRL splits the graph into biconnected components. • A biconnected component is a subgraph which can not be made disconnected by the removal of only one node. • They are cycles or nested cycles. They can be found by standard DFS based algorithm (Tarjan 1972) • This way, SCWRL manage to have the complexity to be bound by the size of the largest cycle in the residue connectivity graph. SCWRL (4) SCWRL (5) • For the biconnected components, they use a branch and bound algorithm. • First, since their energy function only has positive terms, one can do DFS and bound the search based on the energy of the best path from root to any leaf. • One can also bound the energy contribution of a residue using the sum of minimum self and pairwise energies between it and its descendants. SCWRL (6) • The energy functions used in SCWRL is a linear combination of a rotamer probability term and linear repulsive energy term (van-der-waals repulsive) ri = 1 is the probability of the best rotamer of a given phi and psi. K is a fitting parameter set to 3. r is interatomic distance between two residues i and j, Rij is the sum of van der waals radii of i and j. Tree Pack (J. ACM 2006) • This technique is based on the tree decomposition technique by Robert and Seymour 1986. • Basically given a graph G (V,E), a tree decomposition (T,X) of G is consist of a tree T (I, F) and a vertex mapping X which maps the node in I to a certain subset of V. For each node i I, the subset is denoted by Xi. • Every edge in E must be contained in some Xi. • For all i, j and k in I, if j is on the path from i to k then (Xi Xk) Xj. Tree Pack (2) • The width of a tree decomposition is the maximum of |Xi| -1 • The tree width of G is the minimum width of all possible tree decomposition over G Tree Pack (3) • The computation of the energy based on a tree decomposition. Xr,j Basically, the computation is done in two steps. The first computes the best energies bottom-up. Then the optimal rotamer configuration is computed top-down. Tree Pack (4) • So the complexity is O(Nnrottw+1). • Each residue interaction need a minimum distance of Dl and maximum distance of Du. Residues are defined in a 3D geometric graph. • Definition: k-ply neighborhood system in R3 is a set of closed balls in R3 such that no point is strictly inside more than k balls. • Sphere separator theorem (Miller, 1997): For every k-ply neighborhood system, there is a sphere separator S s.t. – – – – |NE| <= 4/5 N (NE are the balls outside S). |NI| <= 4/5 N (NI are the balls within S) and, |No| = O(k1/3n2/3) where No contains the balls that intersect S. S can be computed in a linear time randomized algorithm. Tree Pack (5) • Given Du and Dl, there is no point inside more than (1+Du/Dl)3 balls. • Then given No, we can have an intersection whose size is at most O(V2/3) References • Tarjan, R. 1972. Depth first search and linear graph algorithms. SIAM J. Comput. 1: 146-160. • Desmet, J. et. al. 1992. The dead end elimination theorem and its use in protein side chain positioning. Nature 356:539-542. • Canutescu. A. et. al. 2003. A graph theory algorithm for rapid protein side chain prediction. Protein Sci. 12:20012014 • Xu. J and Berger. B. 2006. Fast and accurate algorithms for protein side chain packing. J. of the ACM 53:533-557.