A Comparative Study of Techniques for Protein Side-chain Placement

by

Eun-Jong Hong

B.S., Electrical Engineering, Seoul National University (1998)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2003

© Massachusetts Institute of Technology 2003. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, September 2, 2003

Certified by: Tomas Lozano-Perez, Professor, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

A Comparative Study of Techniques for Protein Side-chain Placement

by Eun-Jong Hong

Submitted to the Department of Electrical Engineering and Computer Science on September 2, 2003, in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science

Abstract

The prediction of energetically favorable side-chain conformations is a fundamental element in homology modeling of proteins and the design of novel protein sequences. The space of side-chain conformations can be approximated by a discrete space of probabilistically representative side-chain orientations (called rotamers). The problem is, then, to find a rotamer selection for each amino acid that minimizes a potential function. This problem is an NP-hard optimization problem. Dead-end elimination (DEE) combined with the A* algorithm has been successfully applied to the problem. However, DEE fails to converge for some classes of complex problems.

In this research, we explore three different approaches to find alternatives to DEE/A* in solving the GMEC problem. We first use an integer programming formulation and the branch-and-price algorithm to obtain exact solutions. There are known ILP formulations that can be directly applied to the GMEC problem. We review these formulations and compare their effectiveness using CPLEX optimizers. At the same time, we do preliminary work on applying branch-and-price to the GMEC problem. We suggest possible decomposition schemes and assess their feasibility by a computational experiment. As the second approach, we use nonlinear programming techniques. This work mainly relies on the continuation approach by Ng [31]. Ng's continuation approach finds a suboptimal solution to a discrete optimization problem by solving a sequence of related nonlinear programming problems. We implement the algorithm and do a computational experiment to examine the performance of the method on the GMEC problem. We finally use probabilistic inference methods to infer the GMEC using the energy terms translated into probability distributions. We implement probabilistic relaxation labeling, the max-product belief propagation (BP) algorithm, and the MIME double loop algorithm to test on side-chain placement examples and some sequence design cases. The performance of the three methods is also compared with the ILP method and DEE/A*. The preliminary results suggest that the probabilistic inference methods, especially the max-product BP algorithm, are the most effective and fastest among all the tested methods.
Though the max-product BP algorithm is an approximate method, its speed and accuracy are comparable to those of DEE/A* in side-chain placement, and overall superior in sequence design. Probabilistic relaxation labeling shows slightly weaker performance than the max-product BP, but the method works well up to medium-sized cases. On the other hand, the ILP approach, the nonlinear programming approach, and the MIME double loop algorithm turn out not to be competitive. Though these three methods have different appearances, they are all based on mathematical formulations and optimization techniques. We find that such traditional approaches require a good understanding of the methods and considerable experimental effort and expertise. However, we also present the results from these methods to provide a reference for future research.

Thesis Supervisor: Tomas Lozano-Perez
Title: Professor

Acknowledgments

I would like to thank my advisor, Prof. Tomas Lozano-Perez, for his kind guidance and continuous encouragement. Working with him was a great learning experience and taught me the way to do research. He introduced me to the topic of side-chain placement as well as computational methods such as integer linear programming and probabilistic inference. He has been very much involved in this work himself, and most of the ideas and implementations in Chapter 4 are his contributions. Without his hands-on advice and considerate mentorship, this work could not have been finished.

I also would like to thank Prof. Bruce Tidor and the members of his group. Prof. Tidor allowed me to access the group's machines and to use protein energy data and the DEE/A* implementation. Alessandro Senes, formerly a postdoc in the Tidor group, helped set up the environment for my computational experiments. I appreciate his friendly responses to my numerous requests and his answers to elementary chemistry questions. I thank Michael Altman and Shaun Lippow for providing sequence design examples and help in using their programs, and Bambang Adiwijaya for the useful discussion on delayed column generation. I thank Prof. Piotr Indyk and Prof. Andreas Schultz for their helpful correspondence and suggestions on the problem, and Prof. James Orlin and Prof. Dimitri Bertsekas for their time and opinions. Junghoon Lee was a helpful source on numerical methods, and Yongil Shin became a valuable partner in discussing statistical physics and probability. Prof. Ted Ralphs of Lehigh University and Prof. Kien-Ming Ng of NUS gave helpful comments on their software and algorithms.

Last but not least, I express my deep gratitude to my parents and sister for their endless support and love, which sustained me throughout all the hard work.

Contents

1 Introduction
  1.1 Global minimum energy conformation
    1.1.1 NP-hardness of the GMEC problem
    1.1.2 Purpose and scope
  1.2 Related work
    1.2.1 Integer linear programming (ILP)
    1.2.2 Probabilistic methods
  1.3 Our approaches

2 Integer linear programming approach
  2.1 ILP formulations
    2.1.1 The first classical formulation by Eriksson, Zhou, and Elofsson
    2.1.2 The second classical formulation
    2.1.3 The third classical formulation
    2.1.4 Computational experience
  2.2 Branch-and-price
    2.2.1 Formulation
    2.2.2 Branching and the subproblem
    2.2.3 Implementation
    2.2.4 Computational results and discussions
  2.3 Summary

3 Nonlinear programming approach
  3.1 Ng's continuation approach
    3.1.1 Continuation method
    3.1.2 Smoothing algorithm
    3.1.3 Solving the transformed problem
  3.2 Algorithm implementation
    3.2.1 The conjugate gradient method
    3.2.2 The adaptive linesearch algorithm
  3.3 Computational results
    3.3.1 Energy clipping and preconditioning the reduced-Hessian system
    3.3.2 Parameter control and variable elimination
    3.3.3 Results
  3.4 Summary

4 Probabilistic inference approach
  4.1 Methods
    4.1.1 Probabilistic relaxation labeling
    4.1.2 BP algorithm
    4.1.3 Max-product BP algorithm
    4.1.4 Double loop algorithm
    4.1.5 Implementation
  4.2 Results and discussions
    4.2.1 Side-chain placement
    4.2.2 Sequence design
  4.3 Summary

5 Conclusions and future work

List of Figures

2-1 Problem setting
2-2 A feasible solution to the GMEC problem
2-3 A set of minimum weight edges
2-4 Path starting from rotamer 2 of residue 1 and arriving at the same rotamer
2-5 An elongated graph to calculate shortest paths
2-6 Fragmentation of the maximum clique
4-1 An example factor graph with three residues
4-2 The change in the estimated entropy distribution from MP for lamm-80
4-3 The change in the estimated entropy distribution from MP for 256b-80
4-4 Histogram of estimated entropy for lamm-80 and 256b-80 at convergence of MP
4-5 Execution time for MP in seconds vs log total conformations

List of Tables

2.1 Comparison of three classical formulations (protein: PDB code, #res: number of modeled residues, LP solution: I - integral, F - fractional, T_LP: time taken for CPLEX LP Optimizer, T_IP: time taken for CPLEX MIP Optimizer, symbol -: skipped, symbol *: failed)
2.2 Test results for BRP1 (|S*|: total number of feasible columns, #nodes: number of explored branch-and-bound nodes, #LP: number of solved LPs until convergence, #cols: number of added columns until convergence, (frac): #cols to |S*| ratio, #LP_op: number of solved LPs until reaching the optimal value, #cols_op: number of added columns until reaching the optimal value, T_LP: time taken for solving LPs in seconds, T_sub: time taken for solving subproblems in seconds, symbol -: |S*| calculation overflow, symbol *: stopped while running)
2.3 Test results for BRP2
2.4 Test result for BRP3
2.5 Test results with random energy examples (#nodes: number of explored branch-and-bound nodes, LB: lower bound from LP-relaxation, T: time taken to solve the instance exactly, symbol *: stopped while running)
3.1 Logarithmic barrier smoothing algorithm
3.2 The preconditioned CG method for possibly indefinite systems
3.3 The adaptive linesearch algorithm
3.4 The parameter settings used in the implementation
3.5 Results for SM2 (protein: PDB code, #res: number of residues, #var: number of variables, optimal: optimal objective value, smoothing: objective value from the smoothing algorithm, #SM: number of smoothing iterations, #CG: number of CG calls, time: execution time in seconds, #NC: number of negative curvature directions used, µ_0: initial value of the quadratic penalty parameter)
3.6 Results for SM1
4.1 The protein test set for the side-chain placement (log10conf: log total conformations, optimal: optimal objective value, T_LP: LP solver solution time, T_IP: IP solver solution time, T_DEE: DEE/A* solution time, symbol -: skipped case, symbol *: failed, symbol F: fractional LP solution)
4.2 Results for the side-chain placement test (ΔE: the difference from the optimal objective value, symbol -: skipped)
4.3 The fraction of incorrect rotamers from RL, MP, and DL
4.4 Performance comparison of MP and DEE/A* on side-chain placement (optimal: solution value from DEE/A*, E_MP: solution value from MP, ΔE: difference between 'optimal' and E_MP, IR_MP: number of incorrectly predicted rotamers by MP, T_DEE: time taken for DEE/A* in seconds, T_MP: time taken for MP in seconds)
4.5 Fraction of incorrectly predicted rotamers by MP and statistics of the estimated entropy for the second side-chain placement test (IR fraction: fraction of incorrectly predicted rotamers, avg S_i: average estimated entropy, max S_i: maximum estimated entropy, min S_i: minimum estimated entropy, min_{i∈IR} S_i: minimum entropy of incorrectly predicted rotamers)
4.6 The protein test set for sequence design (symbol ?: unknown optimal value, symbol *: failed)
4.7 Results for sequence design test cases whose optimal values are known
4.8 Results for sequence design test cases whose optimal values are unknown (E: solution value the method obtained)
Chapter 1

Introduction

1.1 Global minimum energy conformation

The biological function of a protein is determined by its three-dimensional (3D) structure. Therefore, an understanding of the 3D structures of proteins is a fundamental element in studying the mechanism of life. The widely used experimental techniques for determining the 3D structures of proteins are X-ray crystallography and NMR spectroscopy, but their use is difficult and sometimes impossible because of the high cost and technical limits. On the other hand, advances in genome sequencing techniques have produced an enormous number of amino-acid sequences whose folded structures are unknown. Researchers have made remarkable achievements in exploiting the sequences to predict the structures computationally. Currently, due to the increasing number of experimentally known structures and computational prediction techniques such as threading, we can obtain approximate structures corresponding to new sequences in many cases. With this trend, we expect to be able to have approximate structures for all the available sequences in the near future. However, approximate structures are not useful enough for many practical purposes, such as understanding the molecular mechanism of a protein, or designing an amino-acid sequence compatible with a given target structure.

Homology modeling of proteins [7] and design of novel protein sequences [6] are often based on the prediction of energetically favorable side-chain conformations. The space of side-chain conformations can be approximated by a discrete space of probabilistically representative side-chain orientations (called rotamers) [34]. The discrete model of the protein energy function we use in this work is described in terms of:

1. the self-energy of a backbone template (denoted $e_{\text{backbone}}$)

2. the interaction energy between the backbone and residue $i$ in its rotamer conformation $r$ in the absence of other free side-chains (denoted $e_{i_r}$)

3. the interaction energy between residue $i$ in the rotamer conformation $r$ and residue $j$ in the rotamer conformation $s$, $i \neq j$ (denoted $e_{i_r j_s}$)

In the discrete conformation space, the total energy of a protein in a specific conformation C can be written as follows:

$$E_C = e_{\text{backbone}} + \sum_i e_{i_r} + \sum_i \sum_{j>i} e_{i_r j_s}. \qquad (1.1)$$

The problem is, then, to find a rotamer selection for the modeled residues that minimizes the energy function $E_C$; the minimizing selection is often called the global minimum energy conformation (GMEC). In this work, we call this problem the GMEC problem.

1.1.1 NP-hardness of the GMEC problem

The GMEC problem is a hard optimization problem. We can easily see that it is equivalent to maximum capacity representatives (MCR), an NP-hard optimization problem. A Compendium of NP Optimization Problems by Crescenzi [5] describes the MCR as follows:

Instance: Disjoint sets $S_1, \ldots, S_m$ and, for any $i \neq j$, $x \in S_i$, and $y \in S_j$, a nonnegative capacity $c(x, y)$.

Solution: A system of representatives $T$, i.e., a set $T$ such that, for any $i$, $|T \cap S_i| = 1$.

Measure: The capacity of the system of representatives, i.e., $\sum_{x, y \in T} c(x, y)$.

To reduce the MCR to the GMEC problem, we regard each set $S_i$ as a residue and its elements as the residue's rotamers. Then, we take the negative of the capacity between two elements in two different sets as the interaction energy between the corresponding rotamers, and switch the maximization problem to a minimization problem. This is a GMEC problem with $e_{i_r}$ equal to 0 for all $i$ and $r$.
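As a small illustration of the reduction, with toy numbers of our own (not from [5]):

$$S_1 = \{a, b\}, \quad S_2 = \{c, d\}, \qquad c(a,c) = 3,\; c(a,d) = 1,\; c(b,c) = 0,\; c(b,d) = 2.$$

The best system of representatives is $T = \{a, c\}$, with capacity 3. The corresponding GMEC instance has two residues with rotamer sets $R_1 = \{a, b\}$ and $R_2 = \{c, d\}$, every $e_{i_r} = 0$, and interaction energies $e_{ac} = -3$, $e_{ad} = -1$, $e_{bc} = 0$, $e_{bd} = -2$; its minimum energy conformation selects $a$ and $c$, with energy $-3$, the negative of the maximum capacity.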
A rigorous proof of the NP-hardness of the GMEC problem can be found in [38]. As an NP-hard optimization problem, the general form of the GMEC problem is a challenging combinatorial problem that has many interesting aspects in both theory and practice.

1.1.2 Purpose and scope

The purpose of our study is generally two-fold. First, we aim to find a better method to solve the GMEC problem efficiently. Despite the theoretical hardness, we often find that many instances of the GMEC problem are easily solved by exact methods such as Dead-End Elimination (DEE) combined with A* (DEE/A*) [8, 26]. However, DEE's elimination criteria are not always powerful enough to significantly reduce a problem's complexity. Though there have been modifications and improvements over the original elimination conditions [12, 33, 11], the method is still not quite a general solution to the problem, especially in the application of sequence design. In this work, we explore both analytical and approximate methods through computational experiments. We compare their performance with one another to identify possible alternatives to DEE/A*. There exists a comparative work by Voigt et al. [40], which examines the performance of DEE against other well-known methods such as Monte Carlo, genetic algorithms, and self-consistent mean-field, and concludes that DEE is the most feasible method. We intend to investigate new approaches that have not been used or studied well, but that have good standing as appropriate techniques for the GMEC problem.

In the second place, we want to understand the working mechanisms of the methods. There are methods that are not well understood theoretically, but show extraordinary performance in practice. On the other hand, some methods are effective only for a specific type of instance. For example, the belief propagation and LP approaches are not generally good solutions to the GMEC problem with random artificial energy terms, but they are very accurate for GMEC instances with protein energy terms. Our ultimate goal is to be able to explain why, or when, a specific method succeeds and fails.

The scope of this work is mainly the computational aspects of the side-chain placement problem using rotamer libraries. Therefore, we leave issues such as protein energy models and the characteristics of rotamer libraries out of this work.

1.2 Related work

1.2.1 Integer linear programming (ILP)

The polyhedral approach is a popular technique for solving hard combinatorial optimization problems. The main idea behind the technique is to iteratively strengthen the LP formulation by adding violated valid inequalities. Althaus et al. [1] presented an ILP approach for side-chain demangling in the rotamer representation of side-chain conformations. Using an ILP formulation, they identified classes of facet-defining inequalities and devised a separation algorithm for a subclass of the inequalities. On average, their branch-and-cut algorithm was about five times slower than their heuristic approach.

Eriksson et al. [9] also formulated the side-chain positioning problem as an ILP problem. However, in their computational experiments, they found that the LP-relaxation of every test instance has an integral solution and that, therefore, integer programming (IP) techniques are not necessary. They conjecture that the GMEC problem always has integral solutions in LP-relaxation.

1.2.2 Probabilistic methods

A seminal work using the self-consistent mean field theory was done by Koehl and Delarue [21].
The method calculates the mean field energy as the sum of interaction energies weighted by the conformational probabilities. The conformational probabilities are related to the mean field energy by the Boltzmann law. Iterative updates of the probabilities and the mean field energy are performed until they converge. At convergence, the rotamer with the highest probability at each residue is selected as the conformation. The method is not exact, but has linear time complexity.

Yanover and Weiss [42] applied belief propagation (BP), generalized belief propagation (GBP), and the mean field method to finding minimum energy side-chain configurations and compared the results with those from SCWRL, a side-chain prediction program. Their energy function is approximate in that only local interactions between neighboring residues are considered, which results in incomplete structures of the graphical models. The energies found by each method are compared with one another, rather than with optimal values from exact methods.

1.3 Our approaches

In this research, we use roughly three different approaches to solve the GMEC problem. In Chapter 2, we use an integer programming formulation and the branch-and-price algorithm to obtain exact solutions. There are known ILP formulations that can be directly applied to the GMEC problem. We review these formulations and compare their effectiveness using CPLEX optimizers. At the same time, we do preliminary work on applying branch-and-price to the GMEC problem. We review the algorithm, suggest possible decomposition schemes, and assess the feasibility of the decomposition schemes by a computational experiment.

As the second approach, we use nonlinear programming techniques in Chapter 3. This work mainly relies on the continuation approach by Ng [31]. Ng's continuation approach finds a suboptimal solution to a discrete optimization problem by solving a sequence of related nonlinear programming problems. We implement the algorithm and do a computational experiment to examine the performance of the method on the GMEC problem.

In Chapter 4, we use probabilistic inference methods to infer the GMEC using the energy terms translated into probability distributions. We implement probabilistic relaxation labeling, the max-product belief propagation (BP) algorithm, and the MIME double loop algorithm to test on side-chain placement examples as well as some sequence design cases. The performance of the three methods is also compared with the ILP method and DEE/A*. The preliminary results suggest that the probabilistic inference methods, especially the max-product BP algorithm, are the most effective and fastest among all the tested methods. Though the max-product BP algorithm is an approximate method, its speed and accuracy are comparable to those of DEE/A* in side-chain placement, and overall superior in sequence design. Probabilistic relaxation labeling shows slightly weaker performance than the max-product BP, but the method works well up to medium-sized cases. On the other hand, the ILP approach, the nonlinear programming approach, and the MIME double loop algorithm turn out not to be competitive. Though these three methods have different appearances, they are all based on mathematical formulations and optimization techniques. We find that such traditional approaches require a good understanding of the methods and considerable experimental effort and expertise. However, we also present the results from these methods to provide a reference for future research.
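For concrete reference in the chapters that follow, evaluating the discrete energy (1.1) for a candidate rotamer selection is straightforward. The sketch below uses an illustrative data layout of our own, not that of any implementation discussed later:

```c
#include <stddef.h>

/* Total energy E_C of Eq. (1.1) for one conformation.
 * n          : number of modeled residues
 * e_backbone : self-energy of the backbone template
 * e_self     : e_self[i][r] = e_{i_r}, rotamer-backbone interaction energy
 * e_pair     : e_pair[i][r][j][s] = e_{i_r j_s}, rotamer pair energy
 * conf       : conf[i] = index of the rotamer selected for residue i
 */
double conformation_energy(size_t n, double e_backbone, double **e_self,
                           double ****e_pair, const size_t *conf)
{
    double E = e_backbone;
    size_t i, j;

    for (i = 0; i < n; i++) {
        E += e_self[i][conf[i]];
        for (j = i + 1; j < n; j++)      /* each pair counted once (j > i) */
            E += e_pair[i][conf[i]][j][conf[j]];
    }
    return E;
}
```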
Chapter 2

Integer linear programming approach

In this chapter, we describe different integer linear programming (ILP) formulations of the GMEC problem. We first present a classical ILP formulation by Eriksson et al., and two similar formulations adopted from related combinatorial optimization problems. Based on these classical ILP formulations, we review the mathematical notion of the branch-and-price algorithm and consider three different decomposition schemes for the column generation. Finally, some preliminary results from implementations of the branch-and-price algorithm are presented.

2.1 ILP formulations

2.1.1 The first classical formulation by Eriksson, Zhou, and Elofsson

In the ILP formulation of the GMEC problem by Eriksson et al. [9], the self-energy of each rotamer is evenly distributed over every interaction energy involving the rotamer. A residue's chosen rotamer interacts with every other residue's chosen rotamer. Therefore, the self-energies can be completely incorporated into the interaction energies, without affecting the total energy, by modifying the interaction energies as follows:

$$e'_{i_r j_s} = e_{i_r j_s} + \frac{e_{i_r} + e_{j_s}}{n - 1}, \qquad (2.1)$$

where $n$ is the number of residues in the protein. Then, the total energy of a given conformation C can be written as

$$E_C = \sum_i \sum_{j>i} e'_{i_r j_s}. \qquad (2.2)$$

Since the total energy now depends only on the interaction energies, $E_C$ can be expressed through value assignments to binary variables that decide whether an interaction between two residues in a certain conformation exists or not. We let $x_{i_r j_s}$ be a binary variable whose value is 1 if residue $i$ is in the rotamer conformation $r$ and residue $j$ is in the rotamer conformation $s$, and 0 otherwise. We also let $R_i$ denote the set of all possible rotamers of residue $i$. Then, the total energy in conformation C is given by

$$E_C = \sum_i \sum_{r \in R_i} \sum_{j>i} \sum_{s \in R_j} e'_{i_r j_s} x_{i_r j_s}. \qquad (2.3)$$

On the other hand, there should exist exactly one interaction between any pair of residues. Therefore, we have the following constraints on the values of $x_{i_r j_s}$:

$$\sum_{r \in R_i} \sum_{s \in R_j} x_{i_r j_s} = 1 \quad \text{for all } i \text{ and } j,\ i < j. \qquad (2.4)$$

Under constraint (2.4), more than one rotamer can still be chosen for a residue. To make the choice of rotamers consistent across all residues, we need the following constraints on $x_{i_r j_s}$:

$$\sum_{p \in R_g} x_{g_p i_r} = \sum_{q \in R_h} x_{h_q i_r}, \qquad (2.5)$$

$$\sum_{s \in R_j} x_{i_r j_s} = \sum_{t \in R_k} x_{i_r k_t}, \qquad (2.6)$$

for all $g, h, i, j, k$ such that $g, h < i < j, k$, and for all $r \in R_i$. Finally, by adding the integrality constraints

$$x_{i_r j_s} \in \{0, 1\}, \qquad (2.7)$$

we have an integer program that minimizes the cost function (2.3) under the constraints (2.4)-(2.7). We denote (2.3)-(2.7) by F1.

Eriksson et al. devised this formulation to be used within the framework of integer programming algorithms, but they found that the LP-relaxation of this formulation always has an integral solution, and hypothesized that every instance of the GMEC problem can be solved by linear programming. However, in our experiments, we found that some instances have fractional solutions to the LP-relaxation of F1, which is not surprising since the GMEC problem is an NP-hard optimization problem. Nonetheless, except for two cases, all solved LP-relaxations of F1 had integral solutions.

2.1.2 The second classical formulation

Here, we present the second classical ILP formulation of the GMEC problem. This is a minor modification of a formulation for the maximum edge-weighted clique problem (MEWCP) by Hunting et al. [17]. The goal of the MEWCP is to find a clique with the maximum sum of edge weights.
If we take a graph-theoretic approach to the GMEC problem, we can model each rotamer $r$ of residue $i$ as a node $i_r$, and the interaction between two rotamers $r$ and $s$ of residues $i$ and $j$, respectively, as an edge $(i_r, j_s)$. Then, the GMEC problem reduces to finding the minimum edge-weighted maximum clique of the graph.

We introduce binary variables $x_{i_r}$ for the nodes $i_r$, for all $i$ and $r \in R_i$. We also adopt binary variables $y_{i_r j_s}$ for the edges $(i_r, j_s)$, for all $i$ and $j$, and for all $r \in R_i$ and $s \in R_j$. Variable $x_{i_r}$ takes value 1 if node $i_r$ is included in the selected clique, and 0 otherwise. Variable $y_{i_r j_s}$ takes value 1 if edge $(i_r, j_s)$ is in the clique, and 0 otherwise. We define V and E as follows:

$$V = \{ i_r \mid \forall r \in R_i,\ \forall i \}, \qquad (2.8)$$

$$E = \{ (i_r, j_s) \mid \forall i_r, j_s \in V \}. \qquad (2.9)$$

Then, the adapted ILP formulation of the MEWCP is given by

$$\min \sum_{i_r \in V} e_{i_r} x_{i_r} + \sum_{(i_r, j_s) \in E} e_{i_r j_s} y_{i_r j_s} \qquad (2.10)$$

$$y_{i_r j_s} - x_{i_r} \le 0, \quad \forall (i_r, j_s) \in E, \qquad (2.11)$$

$$y_{i_r j_s} - x_{j_s} \le 0, \quad \forall (i_r, j_s) \in E, \qquad (2.12)$$

$$x_{i_r} + x_{j_s} - 2 y_{i_r j_s} \le 1, \quad \forall (i_r, j_s) \in E, \qquad (2.13)$$

$$x_{i_r} \in \{0, 1\}, \quad \forall i_r \in V, \qquad (2.14)$$

$$y_{i_r j_s} \ge 0, \quad \forall (i_r, j_s) \in E. \qquad (2.15)$$

To complete the formulation of the GMEC problem, we add the following set of constraints, which implies that exactly one rotamer should be chosen for each residue:

$$\sum_{r \in R_i} x_{i_r} = 1, \quad \forall i. \qquad (2.16)$$

We denote (2.10)-(2.16) by F2. When the CPLEX MIP solver was used to solve a given GMEC instance in both F1 and F2, F1 was faster than F2. This is mainly because F2 depends heavily on the integrality of the variables $x_{i_r}$. On the other hand, F2 has an advantage when used in a branch-and-cut framework, since polyhedral results and Lagrangean relaxation techniques are available for the MEWCP that can be readily applied to F2 [17, 28].

2.1.3 The third classical formulation

Koster et al. [25] presented another formulation that captures characteristics of the two previous formulations. Koster et al. studied the solution of the frequency assignment problem (FAP) via tree-decomposition. There are many variants of the FAP and, interestingly, the minimum interference frequency assignment problem (MI-FAP) studied by Koster et al. has exactly the same problem setting as the GMEC problem, except that the node weights and edge weights of the MI-FAP are positive integers. The formulation suggested by Koster et al. uses node variables and combines the two styles of constraints from the previous formulations. If we transform the problem setting of the FAP into that of the GMEC problem, we obtain the following ILP formulation for the GMEC problem:

$$\min \sum_{i_r \in V} e_{i_r} x_{i_r} + \sum_{(i_r, j_s) \in E} e_{i_r j_s} y_{i_r j_s} \qquad (2.17)$$

$$\sum_{r \in R_i} x_{i_r} = 1, \quad \forall i, \qquad (2.18)$$

$$\sum_{s \in R_j} y_{i_r j_s} = x_{i_r}, \quad \forall j \neq i,\ \forall i,\ \forall r \in R_i, \qquad (2.19)$$

$$x_{i_r} \in \{0, 1\}, \quad \forall i_r \in V, \qquad (2.20)$$

$$y_{i_r j_s} \ge 0, \quad \forall (i_r, j_s) \in E. \qquad (2.21)$$

We denote (2.17)-(2.21) by F3. In F3, (2.18) restricts the number of selected rotamers for each residue to one, as (2.16) does in F2. On the other hand, (2.19) enforces that the selection of interactions be consistent with the selection of rotamers. Koster et al. studied the polytope represented by the formulation and developed facet-defining inequalities [22, 23, 24].

2.1.4 Computational experience

We performed an experimental comparison of the classical ILP formulations F1, F2, and F3. The formulations were implemented in C code using the CPLEX Callable Library.
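To illustrate what this involves, the following sketch builds the F3 model (2.17)-(2.21) through the Callable Library for a uniform number m of rotamers per residue. The names, the column layout, and the omission of status checks are ours; this is a reconstruction, not the code used in the experiments.

```c
#include <stdlib.h>
#include <ilcplex/cplex.h>

/* Column index of x_{i_r}. */
static int xcol(int i, int r, int m) { return i * m + r; }

/* Column index of y_{i_r j_s}; callers must pass i < j. */
static int ycol(int n, int m, int i, int r, int j, int s)
{
    int pair = i * n - i * (i + 1) / 2 + (j - i - 1);  /* rank of pair (i,j) */
    return n * m + (pair * m + r) * m + s;
}

/* Build F3, Eqs. (2.17)-(2.21), for n residues with m rotamers each.
 * obj holds the e_{i_r} and then the e_{i_r j_s}, in column order. */
void build_F3(CPXENVptr env, CPXLPptr lp, int n, int m, const double *obj)
{
    int ncols = n * m + (n * (n - 1) / 2) * m * m;
    int beg = 0, i, j, r, s;
    double rhs;
    char sense = 'E';
    double *lb = malloc(ncols * sizeof(double));
    double *ub = malloc(ncols * sizeof(double));
    char *ctype = malloc(ncols);
    int *ind = malloc((m + 1) * sizeof(int));
    double *val = malloc((m + 1) * sizeof(double));

    /* Columns: x_{i_r} binary (2.20); y_{i_r j_s} continuous in [0,1] (2.21). */
    for (i = 0; i < ncols; i++) {
        lb[i] = 0.0;
        ub[i] = 1.0;
        ctype[i] = (i < n * m) ? 'B' : 'C';
    }
    CPXnewcols(env, lp, ncols, obj, lb, ub, ctype, NULL);

    /* (2.18): exactly one rotamer per residue. */
    rhs = 1.0;
    for (i = 0; i < n; i++) {
        for (r = 0; r < m; r++) { ind[r] = xcol(i, r, m); val[r] = 1.0; }
        CPXaddrows(env, lp, 0, 1, m, &rhs, &sense, &beg, ind, val, NULL, NULL);
    }

    /* (2.19): sum_s y_{i_r j_s} - x_{i_r} = 0 for every i, r in R_i, j != i. */
    rhs = 0.0;
    for (i = 0; i < n; i++)
        for (r = 0; r < m; r++)
            for (j = 0; j < n; j++) {
                if (j == i) continue;
                for (s = 0; s < m; s++) {
                    ind[s] = (i < j) ? ycol(n, m, i, r, j, s)
                                     : ycol(n, m, j, s, i, r);
                    val[s] = 1.0;
                }
                ind[m] = xcol(i, r, m);
                val[m] = -1.0;
                CPXaddrows(env, lp, 0, 1, m + 1, &rhs, &sense, &beg,
                           ind, val, NULL, NULL);
            }

    free(lb); free(ub); free(ctype); free(ind); free(val);
}
```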
The test cases for lbpi, lamm, larb, and 256b were generated using a common rotamer library (called LIB1 throughout this work), but a denser library (LIB2) was used to generate the test cases for 2end. The choice of modeled proteins followed the work of Voigt et al. [40]. We made several cases with different sequence lengths from the same protein to control the complexity of the test cases. The energy terms were calculated by the CHARMM22 script. All experiments were done on a Debian workstation with a 2.20 GHz Intel Xeon processor and 1 GByte of memory. We used both the CPLEX LP Optimizer and the Mixed Integer Optimizer to compare the solution times of the formulations. The results are listed in Table 2.1.

Table 2.1: Comparison of three classical formulations (protein: PDB code, #res: number of modeled residues, LP solution: I - integral, F - fractional, T_LP: time taken by the CPLEX LP Optimizer, T_IP: time taken by the CPLEX MIP Optimizer, symbol -: skipped, symbol *: failed).

                    LP solution        T_LP (sec)             T_IP (sec)
protein   #res     F1   F2   F3      F1      F2     F3       F1     F3
lbpi        10      I    F    I       0       1      0        0      0
lbpi        20      I    F    I       0       8      1        0      0
lbpi        25      I    F    I       0     134      0        3      1
lbpi        46      I    F    I      29    6487     14       37     16
lamm        10      I    F    I       0       0      0        0      0
lamm        20      I    F    I       1      96      0        1      0
lamm        70      I    F    I      42   33989     29       45     48
lamm        80      I    F    I      97       *     69       99    102
larb        10      I    I    I       0       0      0        0      0
larb        20      I    F    I       0       0      0        0      0
larb        30      I    F    I       0      15      0        1      0
larb        78      I    F    I      24   10384     21       27     25
256b        30      I    -    I       3       -      2        6      3
256b        40      I    -    I      10       -      4       14      6
256b        50      I    -    I      26       -     12       15     36
256b        60      I    -    I      60       -     36       87     37
256b        70      F    -    F     116       -     98      156    112
2end        15      I    -    I      31       -     35       34     41
2end        25      I    -    I     281       -    214      343    240

The results from the LP Optimizer show that F2 is far less effective, in terms of both the lower bounds it provides and the time needed by the LP Optimizer. F2 obtained only fractional solutions, whereas F1 and F3 solved most of the cases optimally. F1 and F3 show similar performance, though F3 is slightly more effective than F1 in LP-relaxation. Since F2 showed inferior performance in LP-relaxation, we measured the Mixed Integer Optimizer's running time only with F1 and F3. As shown in the last two columns of Table 2.1, the Mixed Integer Optimizer mostly took more time with F1 than with F3, but by no more than a constant factor, as was also the case with the LP Optimizer.

The CPLEX optimizers were able to solve the small- to medium-sized cases, but using the optimizers only with a fixed formulation turns out to be not very efficient. On the other hand, DEE/A* is mostly faster than the CPLEX optimizers using the classical formulations, which makes an ILP approach relying entirely on general optimizers look somewhat futile. We also observe cases where both F1 and F3 obtain fractional solutions in LP-relaxation, while solving large IPs usually takes significantly more time than solving the corresponding LP-relaxations. This suggests that we should explore the development of a problem-specific IP solver that exploits the special structure of the GMEC problem. As an effort in this direction, in Section 2.2 we investigate the possible use of the branch-and-price algorithm based on the F1 formulation.

2.2 Branch-and-price

Branch-and-price (also known as IP column generation) is a branch-and-bound algorithm that performs the bounding operation using LP-relaxation [3]. In particular, the LP-relaxation at each node of the branch-and-bound tree is solved using the delayed column generation technique. The method is usually based on an IP formulation that introduces a huge number of variables but has a tighter convex hull in LP-relaxation than a simpler formulation with fewer variables. The key idea of the method is to split the given integer programming problem into a master problem (MP) and a subproblem (SP) by Dantzig-Wolfe decomposition, and then to exploit the problem-specific structure in solving the subproblem.

2.2.1 Formulation

In this section, we first review the general notion of the branch-and-price formulation developed by Barnhart et al.
[3] and devise appropriate Dantzig-Wolfe decompositions of F1. The previous classical ILP formulations can all be captured in the following general form of ILP:

$$\min c'x \qquad (2.22)$$

$$Ax \le b, \qquad (2.23)$$

$$x \in S, \quad x \in \{0, 1\}^n, \qquad (2.24)$$

where $c \in R^n$ is a constant vector, and A is an $m \times n$ real-valued matrix. The basic idea of the Dantzig-Wolfe decomposition is to split the constraints into two separate sets of constraints, (2.23) and (2.24), and to represent the set

$$S^* = \{ x \in S \mid x \in \{0, 1\}^n \} \qquad (2.25)$$

by its extreme points. S* is represented by a finite set of vectors. In particular, if S is bounded, S* is a finite set of points, $S^* = \{y^1, \ldots, y^p\}$, where $y^k \in R^n$, $k = 1, \ldots, p$. Now, if we are given $S^* = \{y^1, \ldots, y^p\}$, any point $y \in S^*$ can be represented as

$$y = \sum_{1 \le k \le p} \lambda_k y^k, \qquad (2.26)$$

subject to the convexity constraint

$$\sum_{1 \le k \le p} \lambda_k = 1, \qquad (2.27)$$

$$\lambda_k \in \{0, 1\}, \quad k = 1, \ldots, p. \qquad (2.28)$$

Let $c_k = c'y^k$ and $a_k = Ay^k$. Then, we obtain the general form of the branch-and-price formulation for the ILP given by (2.22)-(2.24) as follows:

$$\min \sum_{1 \le k \le p} c_k \lambda_k \qquad (2.29)$$

$$\sum_{1 \le k \le p} a_k \lambda_k \le b, \qquad (2.30)$$

$$\sum_{1 \le k \le p} \lambda_k = 1, \qquad (2.31)$$

$$\lambda_k \in \{0, 1\}, \quad k = 1, \ldots, p. \qquad (2.32)$$

The fundamental difference between the classical ILP formulation (2.22)-(2.24) and the branch-and-price formulation (2.29)-(2.32) is that S* is replaced by a finite set of points. Moreover, any fractional solution to the LP-relaxation of (2.22)-(2.24) is a feasible solution of the LP-relaxation of (2.29)-(2.32) if and only if the solution can be represented by a convex combination of extreme points of conv(S*). Therefore, it is easily inferred that the LP-relaxation of (2.29)-(2.32) is at least as tight as that of (2.22)-(2.24) and more effective in branch-and-bound. Now we recognize that the success of the branch-and-price algorithm will, to a great extent, depend on the choice of base formulation we use for (2.22)-(2.24) and on how we decompose it. In this work, we use F1 as the base ILP formulation, for it usually has the smallest number of variables and constraints among F1, F2, and F3, and also has good lower bounds. Designing the decomposition scheme is equivalent to defining S*. Naturally, the constraints corresponding to (2.23) will be those of (2.4)-(2.7) that are not used to define S*. We consider three different definitions of S* below.

Set of edge sets

Figure 2-1 is a graphical illustration of the problem setting, and Figure 2-2 shows a feasible solution when all constraints are used. In fact, any maximum clique of the graph is a feasible solution.

Figure 2-1: Problem setting

Figure 2-2: A feasible solution to the GMEC problem

In this decomposition scheme, we consider only the constraint that one interaction is chosen between every pair of residues. Formally, we have

$$S^* = \{ Q \mid Q \text{ is a set of edges } (i_r, j_s),\ r \in R_i,\ s \in R_j, \text{ containing one edge for each residue pair } i, j \}. \qquad (2.33)$$

We denote this definition of S* by S1. The subproblem is to find the smallest-weight edge between each pair of residues. A solution to the subproblem will generally look like Figure 2-3.

Figure 2-3: A set of minimum weight edges

The size of S1 is exponentially larger than the original solution space, but each subproblem can be solved within O(n) time.
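A sketch of this subproblem, with a data layout of our own (in column generation, the weights would be the reduced costs introduced in Section 2.2.2):

```c
#include <float.h>

/* Subproblem for S* = S1: for every residue pair (i, j), i < j, pick the
 * edge (i_r, j_s) of minimum weight, as in Figure 2-3.  w[i][r][j][s] is
 * the weight of edge (i_r, j_s) (assumed symmetric); the chosen rotamers
 * are written to pick_r[p] and pick_s[p], where p ranks the pair (i, j).
 * Returns the total weight of the selected edge set.
 */
double min_edge_set(int n, int m, double ****w, int *pick_r, int *pick_s)
{
    double total = 0.0;
    int i, j, r, s, p = 0;

    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++, p++) {
            double best = DBL_MAX;
            for (r = 0; r < m; r++)
                for (s = 0; s < m; s++)
                    if (w[i][r][j][s] < best) {
                        best = w[i][r][j][s];
                        pick_r[p] = r;
                        pick_s[p] = s;
                    }
            total += best;
        }
    return total;
}
```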
Set of paths

We can define S* to be the set of paths that start from a node $i_r$ and arrive at the same node by a sequence of edges connecting each pair of residues, as illustrated by Figure 2-4.

Figure 2-4: Path starting from rotamer 2 of residue 1 and arriving at the same rotamer

A column generated this way is a feasible solution to the GMEC problem only if the number of visited rotamers at each residue is one. We denote S* defined this way by S2. The size of S2 is approximately the square root of the size of S1. In this case, the subproblem becomes a shortest path problem if we replicate a residue, together with its rotamers, and append the copy to the path every time the residue is visited. Figure 2-5 shows how we can view the graph in order to apply a shortest path algorithm.

Figure 2-5: An elongated graph to calculate shortest paths

The subproblem is less simple than in the case S* = S1, but we can efficiently generate multiple columns at a time if we use a k-shortest path algorithm. The inherent hardness of the combinatorial optimization problem is independent of the base ILP formulation or the decomposition scheme used in the branch-and-price algorithm. Therefore, the decomposition can be regarded as the design of a balance point between the size of S* and the hardness of the subproblem. The size of S2 is huge because more than one rotamer can be chosen from every residue on the path but the first. To make the column generation more efficient, we can try fixing several residues' rotamers on the path, as we do for the first residue.

Set of cliques

Based on the idea of partitioning a feasible solution, or a maximum clique, into small cliques, the third decomposition we consider is to define S3 as a set of cliques. Figure 2-6 illustrates this method. A maximum clique consisting of five nodes is fragmented into two three-node cliques composed of nodes $1_2$, $2_2$, $3_3$ and nodes $1_2$, $3_3$, $5_3$, respectively. To complete the maximum clique, four edges represented by dashed lines are added (an edge can also be regarded as a clique of two nodes).

Figure 2-6: Fragmentation of the maximum clique

The branch-and-price formulation for this decomposition differs from (2.29)-(2.32) in that (2.31) and (2.32) are not necessary. All the assembly of small cliques is done by (2.30), and it is possible to obtain an integral solution when (2.29)-(2.32) has a fractional optimal solution from LP-relaxation. Therefore, the branching is also determined by examining the integrality of the original variables x of (2.22)-(2.24). The size of S3 is obviously smaller than those of S1 and S2. If we let n be the number of residues and m be the maximum number of rotamers for one residue, the size of S2 is $O(m^{n(n-1)/2})$. In comparison, the size of S3 is $O((m+1)^n)$. The subproblem turns out to be the minimum edge-weighted clique problem (MEWCP), which is an NP-hard optimization problem. Macambira and de Souza [28] have investigated the MEWCP with polyhedral techniques, and there is also a Lagrangian relaxation approach to the problem by Hunting et al. [17]. However, it is more efficient to solve the subproblem by heuristics such as probabilistic relaxation labeling [32, 15, 16], which will be examined in Chapter 4, and to use the analytical techniques only if the heuristics fail to find any new column.

On the other hand, there exists a pitfall in the application of this decomposition: by generating edges as columns, we may end up with the same basis as for the LP-relaxation of the base formulation. To obtain a better lower bound, we need to find a basis that better approximates the convex hull of integral feasible points than that of the LP-relaxed base formulation.
However, by generating the edge columns necessary to connect the small cliques, the column generation will possibly converge only after exploring the edge columns. One easily conceivable remedy is to generate only cliques with at least three nodes, but it is a question whether this is a better decomposition scheme in practice than, say, S* = S2.

The Held-Karp heuristic and the branch-and-price algorithm

Since we do not have separable structures in the GMEC problem, it seems more appropriate to first relax the constraints to generate columns and then to sieve them with the rest of the constraints by LP. Defining S* to be S2 is closer to this scheme than defining it to be S3. We show that this scheme is, in fact, compatible with the idea behind the Held-Karp heuristic, and suggest that we should take a similar approach to develop an efficient branch-and-price algorithm for the GMEC problem.

Let G be a graph with node set V and edge set E, and let $a_{ij}$ be the weight on edge $(i, j)$. Then, the traveling salesman problem (TSP) is often represented as

$$\min \sum a_{ij} x_{ij} \qquad (2.34)$$

$$\sum_{j \in V,\, j \neq i} x_{ij} = 1, \quad \forall i \in V, \qquad (2.35)$$

$$\sum_{i \in V,\, i \neq j} x_{ij} = 1, \quad \forall j \in V, \qquad (2.36)$$

$$\sum_{i \in S,\, j \notin S} (x_{ij} + x_{ji}) \ge 2, \quad \forall S \subset V,\ S \neq \emptyset, \qquad (2.37)$$

$$x_{ij} \in \{0, 1\}, \quad \forall (i, j) \in E. \qquad (2.38)$$

(2.35) and (2.36) are conservation-of-flow constraints. (2.37) is a subtour elimination constraint [4]. In the Held-Karp heuristic, (2.37) and (2.38) are replaced by

$$(V, \{(i, j) \mid x_{ij} = 1\}) \text{ is a 1-tree.} \qquad (2.39)$$

If we assign the Lagrangian multipliers u and v, the Lagrangian function is

$$L(x, u, v) = \sum_{i, j \in V,\, i \neq j} (a_{ij} + u_i + v_j) x_{ij} - \sum_{i \in V} u_i - \sum_{j \in V} v_j. \qquad (2.40)$$

Finally, if we let

$$S^* = \{ x \mid x_{ij} \in \{0, 1\} \text{ such that } (V, \{(i, j) \mid x_{ij} = 1\}) \text{ is a 1-tree} \}, \qquad (2.41)$$

the value produced by the Held-Karp heuristic is essentially the optimal dual value given by

$$q^* = \sup_{u, v}\ \inf_{x \in S^*} L(x, u, v). \qquad (2.42)$$

When the cost and the inequality constraints are linear, the lower bounds obtained by Lagrangian relaxation and by relaxing the integer constraints are equal if S* is a finite set of points, say $S^* = \{y^1, \ldots, y^p\}$, $y^k \in R^{|E|}$, $k = 1, \ldots, p$ [4]. Therefore, the optimal value of the following LP is also q*:

$$\min \sum_{i \in V} \sum_{j \in V,\, j \neq i} a_{ij} \sum_{1 \le k \le p} \lambda_k y^k_{ij} \qquad (2.43)$$

$$\sum_{j \in V,\, j \neq i} \sum_{1 \le k \le p} \lambda_k y^k_{ij} = 1, \quad \forall i \in V, \qquad (2.44)$$

$$\sum_{i \in V,\, i \neq j} \sum_{1 \le k \le p} \lambda_k y^k_{ij} = 1, \quad \forall j \in V, \qquad (2.45)$$

$$0 \le \lambda_k \le 1, \quad k = 1, \ldots, p, \qquad (2.46)$$

where $y^k \in S^*$. Note that (2.43)-(2.46) form an LP-relaxation of the branch-and-price formulation of the original ILP, or simply a Dantzig-Wolfe decomposition. Thus, we confirm that the Held-Karp heuristic is essentially equivalent to the branch-and-price algorithm applied to the TSP.

2.2.2 Branching and the subproblem

For all decomposition schemes, we use branching on the original variables. In other words, when the column generation converges for a node U of the branch-and-bound tree and it is found that the LP-relaxation of the restricted master problem (RMP) at U has a fractional solution, we branch on a variable $x_t$ of (2.22)-(2.24) rather than on a variable $\lambda_k$ of (2.29)-(2.32). Formally, if we have the RMP for node U in the form of (2.29)-(2.32), U is branched on $x_t$ into two nodes $U_0$ and $U_1$, where the RMP at $U_i$ is given by

$$\min \sum_{k:\, 1 \le k \le p,\ y^k_t = i} c_k \lambda_k \qquad (2.47)$$

$$\sum_{k:\, 1 \le k \le p,\ y^k_t = i} a_k \lambda_k \le b, \qquad (2.48)$$

$$\sum_{k:\, 1 \le k \le p,\ y^k_t = i} \lambda_k = 1, \qquad (2.49)$$

$$\lambda_k \in \{0, 1\}, \quad k = 1, \ldots, p. \qquad (2.50)$$

In our implementation, the branching variable $x_t$ is an edge variable, for we use (2.3)-(2.7) as the base formulation. The branching variable $x_t$ is determined by calculating the value of x from the fractional solution λ and taking the variable whose value is closest to 1. Ties can be broken by the order of the indices.
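A minimal sketch of this selection rule, assuming our own dense storage of the columns $y^k$ and an illustrative integrality tolerance:

```c
/* Recover x = sum_k lambda_k y^k from the RMP solution and return the
 * index t of the fractional x_t closest to 1; ties are broken by the
 * order of the indices.  Returns -1 if x is integral.
 * y[k][t] in {0,1} is entry t of column k; nvars is the dimension of x.
 */
int pick_branch_var(int ncols, int nvars, const double *lambda,
                    unsigned char **y)
{
    const double tol = 1e-6;
    double best_gap = 2.0;  /* distance to 1; any fractional x_t beats this */
    int t, k, best = -1;

    for (t = 0; t < nvars; t++) {
        double xt = 0.0;
        for (k = 0; k < ncols; k++)
            xt += lambda[k] * y[k][t];
        if (xt > tol && xt < 1.0 - tol && 1.0 - xt < best_gap) {
            best_gap = 1.0 - xt;
            best = t;
        }
    }
    return best;
}
```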
On the other hand, if we let an m-dimensional vector p be the dual variable for (2.23) and a scalar q be the dual variable for (2.24), the pricing subproblem for node $U_i$ is given by

$$\min\ (c' - p'A)\, y - q \qquad (2.51)$$

$$y \in S^*, \quad y_t = i. \qquad (2.52)$$

Regarding p and q in (2.51) as constants, the reduced cost vector $d' = c' - p'A$ represents edge weights for the graph of the GMEC problem. Therefore, the subproblem becomes finding an element of S* with $y_t = i$ that has the minimum sum of edge weights when calculated with d.

2.2.3 Implementation

To empirically find out how efficient the decomposition schemes of Section 2.2.1 are, we implemented the branch-and-price algorithm for the GMEC problem using SYMPHONY [35]. SYMPHONY is a generic framework for branch, cut, and price that handles branching control, subproblem tree management, LP solver interfacing, cut pool management, etc. The framework, however, does not have the complete functionality necessary for the branch-and-price algorithm. Thus, we had to augment the framework to allow column pool control by a non-enumerative scheme, and branching on the original variables. This required more than trivial changes, especially in the branching control and the column generation control. As a result, the functionality was implemented only roughly, and we could not give enough consideration to making the implementation memory- or time-efficient. We tested the implementation only with small cases, to get an idea of whether column generation is a viable option for the GMEC problem.

The branch-and-price algorithm requires a number of feasible solutions to the base formulation, both to use as LP bases for the initial RMP and to set the initial upper bound. We obtained a set of feasible solutions using an implementation of probabilistic relaxation labeling (RL). For a detailed description and the theory of probabilistic relaxation labeling, see [32, 15, 16]; Chapter 4 also briefly describes the method and the implementation. We started with a random label distribution and did 200 iterations of RL to find at most 20 feasible solutions. This scheme was successful in the small cases we tested the branch-and-price implementations on, in that it gave the optimal solution as the initial upper bound, and the rest of the branch-and-price effort was spent on confirming optimality. Using RL to obtain initial feasible solutions may have reduced the total number of LP-relaxations solved, or the CPU time, to some extent, but the general trend of the performance as the complexity grows will not be affected much. In fact, most implementations of either the branch-and-cut or the branch-and-price algorithm opt to use the most effective heuristic for the problem to find initial feasible solutions and an upper bound.

Another issue in the implementation is the use of an artificial variable to prevent infeasibility of the child RMPs in branching. When the parent RMP reaches dual feasibility with a fractional optimal solution and decides to branch, its direct children can become infeasible because a variable of the parent RMP will be fixed to 0 or 1. Rather than heuristically finding new feasible solutions that satisfy the fixed values of the branching variables, we can add an artificial variable with a huge coefficient to make the child RMP always feasible. Below, after a brief sketch of the pricing computation, we add some implementation details specific to each decomposition scheme.
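The pricing test itself amounts to evaluating (2.51) for a candidate column; a sketch with a dense constraint matrix (our own layout, kept dense purely for brevity):

```c
/* Reduced cost (c' - p'A) y^k - q of a candidate column y^k, where p and
 * q are the duals of the RMP rows (2.48) and (2.49).  The column is worth
 * adding only if the returned value is negative.
 * A is the nrows x nvars matrix of (2.23); yk[t] in {0,1}.
 */
double column_reduced_cost(int nrows, int nvars, const double *c,
                           double **A, const double *p, double q,
                           const unsigned char *yk)
{
    double rc = -q;
    int t, row;

    for (t = 0; t < nvars; t++) {
        if (!yk[t])
            continue;
        rc += c[t];
        for (row = 0; row < nrows; row++)
            rc -= p[row] * A[row][t];
    }
    return rc;
}
```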
Set of edge sets

We implemented only the basic idea of S* = S1: after solving the LP-relaxation of the RMP, the program collects the edge between each pair of residues that has the minimum reduced cost. To do better than this, we could also sort the edges between each pair of residues and take the best k of them, generating $k^{\binom{n}{2}}$ candidate columns at each column generation, where n is the total number of residues. We denote this implementation by BRP1.

Set of paths

For the implementation of S* = S2, we had to assume that the number of residues is a prime number. This is because we can make a tour of a complete graph's edges only if the complete graph has a prime number of nodes. For more general cases that do not have a prime number of nodes, we can augment the graph by adding a proper number of nodes. In our experiment, we only used test cases that have a prime number of residues. To solve the subproblem when S* = S2, we used the Recursive Enumeration Algorithm (REA), an algorithm for the k shortest paths problem [18]. The residue with the minimum number of rotamers was chosen as the starting residue of the paths. For each rotamer of the starting residue, we calculated 20 shortest paths and priced the results, adding the paths with negative reduced cost as new columns. We denote this implementation by BRP2.

Set of cliques

To avoid obtaining the same lower bound as from the base formulation, we restrict the column generation to cliques with four nodes. Therefore, S* is given by

$$S^* = \{ Q \mid Q \text{ is a clique consisting of } i_r, j_s, k_t, l_u \text{ such that } r \in R_i,\ s \in R_j,\ t \in R_k,\ u \in R_l, \text{ and } i \neq j \neq k \neq l \}. \qquad (2.53)$$

We took every possible quadruple of residues and used RL on each to solve the MEWCP approximately. For example, if we have a total of 11 residues in the graph, $\binom{11}{4}$ different subgraphs, or instances of the MEWCP, can be made from it. Setting the initial label distribution randomly, we ran RL four times on each subgraph and priced out the resulting cliques to generate new columns. The test cases were restricted to those that have more than four residues. We denote this implementation by BRP3.

2.2.4 Computational results and discussions

We tested the implementations of the three decomposition schemes on small side-chain placement cases. We wanted to see the difference in effectiveness of the three decomposition schemes by examining the lower bounds they obtain, the number of LP-relaxations, the number of generated columns, and the running time. We are also interested in the number of branchings, and in the latency between the point at which the optimal solution value is attained and the point at which the column generation actually converges. The test cases were generated from small segments of six different proteins. LIB1 was used to generate energy files for lbpi, lamm, larb, and 256b. Test cases for 2end and lilq were generated using the denser library LIB2. All program code was written in C and compiled with the GNU compiler. The experiment was performed on a Debian workstation with a 2.20 GHz Intel Xeon processor and 1 GByte of memory.

The results for BRP1 are summarized in Table 2.2. For reference, we included in the table the total number of feasible columns, |S*|, when it was possible to calculate it, so that the number of explored columns can be compared against it. For most cases, however, we had overflow during the calculation of |S*|. The number of columns in the table includes one artificial variable for the feasibility guarantee. We stopped running the branch-and-price programs if they did not find a solution within two hours.
All test cases that were solved were fathomed in the root node of the branch-and-bound tree. This is not unexpected, since the base formulation also provides the optimal integral solution when its LP-relaxation is solved. Unfortunately, we could not find any small protein example that has a fractional solution for the LP-relaxed base formulation. Since our implementations of the decomposition schemes were too slow to solve either of the two cases that have fractional solutions, we were not able to compare the effectiveness of the base formulation and the branch-and-price formulations on protein examples.

In all solved test cases but one, the optimal values were found from the first LP-relaxation. This is because RL, used as a heuristic to find the initial feasible solutions, actually finds the optimal solution, and these solutions are used as LP bases of the RMP. However, since RL uses a random initial label distribution, this is not always the case, as we can confirm in the third case of lamm. Another point to note, along with the early discovery of the optimal value, is the subsequent number of column generations until the column generation does converge. Confirming optimality by reaching dual feasibility is a different matter from finding the optimal value. In fact, one of the well-known issues in column generation is the tailing-off effect, which refers to the tendency of the improvement in the objective value to slow down as the column generation progresses, often resulting in as many iterations to prove optimality as were needed to come across the optimal value. There are techniques to deal with the tailing-off effect, such as early termination with a guarantee on the LP-relaxation lower bound [39], but we did not go into them.

Table 2.3 and Table 2.4 list the results from BRP2 and BRP3, respectively. Comparing the results from BRP2 with those from BRP1, BRP2 obviously performs better than BRP1. This seems to be mainly due to the efficient column generation using the k shortest paths algorithm and the smaller size of S*. It is interesting that BRP1 manages to work comparably to BRP2 for small cases, considering that |S1| is huge even for the small cases. Note that some cases that could not be solved within two hours by BRP1 were solved by BRP2. Looking at the results from BRP2 and BRP3, BRP2 shows slightly better performance in CPU time and the number of generated columns for the cases solved by both, but BRP3 was able to find optimal solutions for some cases where BRP2 failed to do so within two hours. We suspect the size of S* is the main factor determining the rate of convergence, rather than the column generation method. Since the base formulation has the smallest S*, its convergence will be faster than that of any other branch-and-price formulation; yet the purpose of adopting the branch-and-price algorithm is to obtain integral solutions more efficiently by using a tighter LP-relaxation. Unfortunately, we could not validate this concept by testing the branch-and-price implementations on protein energy examples. Instead, we performed a few additional tests of BRP2 and BRP3 with artificial energy examples whose number of rotamers at each position is the same as in the previous test cases, but whose energy terms are replaced with random numbers between 0 and 1.
Instead, we performed a few additional tests of BRP2 and BRP3 with artificial energy examples whose number of 36 Table 2.2: Test results for BRP1 (IS*I: total number of feasible columns, #nodes: number of explored branch-and-bound nodes, #LP: number of solved LPs until convergence, #cols: number of added columns until convergence, (frac): #cols to IS*I ratio, #LPOP: number of solved LPs until reaching the optimal value, #colsop: number of added columns until reaching the optimal value, TLP: time taken for solving LPs in seconds, T,b: time taken for solving subproblems in seconds, symbol - : IS*I calculation overflow, symbol * : stopped while running). protein lamm lbpi 256B larb 2end lilq #res 3 5 7 11 [ I #nodes I #LP S*I 2.6x10 - 4 1 1 1 1 2 1 1403 228 13 - * * 3 5 7 11 5.7x10 5 - 1 1 1 1 1 1 20 220 13 - * * 3 5 576 1.7x10 6 1 4 * 1 27 1 #cols (frac) 5 (0.02%) 7(0.00%) 1423 (0.00%) 240 (0.00%) #LPOp 1 1 1381 1 6 4 25 225 I #colsop TLP [ Tsub 4 7 1401 13 0 0 0 1 0 0 0 1 * * * * * (0.00%) (0.00%) (0.00%) (0.00%) 1 1 1 1 6 4 6 6 0 0 0 1 0 0 0 2 * 1 20 * * 1 1 * 1 1 1 6 7 * 5 12 15 0 0 * 0 0 0 0 0 * 0 0 0 * * * * 1 4 0 0 7 - 3 5 7 64 - 1 1 * 1 1 1 11 - * * 3 - 1 1 5 - * * * 1 11 * * 7 - 1 1603 1608(0.00%) 1 6 33 2 11 - * * * 1 13 * * 6 (1.04%) 10 (0.00%) * 5 (7.81%) 38 (0.00%) 15 (0.00%) * 4 (0.00%) 37 protein lamm ibpi 256B larb 2end liq #res 3 5 7 11 13 3 5 7 f Table 2.3: Test results for BRP2. #nodes #LP #cols (frac) #LPOP S* 162 7.9x104 1 1 1 1 1 1 23 15 * * 1 1 1 1 1 2 9 (1.19%) 8 (0.00%) 28 (0.00%) 1 * 6 * 121 (0.00%) * 1 432 1 1 1 2 9 684 8 5.0x10 5 * * 1 1 1 2 6 - 756 1.8 x10 9 - 11 - 13 3 5 7 11 13 3 5 24 1 - 8 16 50 281 (4.94%) (0.01%) (0.00%) (0.00%) #cols, 1 1 1 1 1 1 1 1 8 16 3 21 21 9 8 21 1 21 5 (62.5%) 26 (0.00%) 1 1 1 8 1 1 1 1 21 14 16 381 21 21 5 21 * 14 17 383 11719 (58.3%) (3.94%) (0.00%) (0.00%) * TLp T,,s 0 0 0 0 0 0 1 2 * * 0 0 0 0 0 0 1 * 2 * 0 0 0 577 0 0 0 149 * * 0 0 0 0 0 * 1 * 6 21 0 0 0 2 1 21 * * 1 21 3 3 1 21 * * #colso 12 3 7 11 1.4x10 - 1 * 1 * 21 (0.00%) * 1 * 21 * 3 5 72400 - 1 1 1 9 6 (0.00%) 96 (0.00%) 1 1 7 - * * * 7 11 - 1 187 3741 (0.00%) - * * * Table 2.4: Test result for BRP3. protein lamm lbpi 256B IS*1 #res 5 7 23166 8.3 x 107 11 8.9 x 106 13 1.3 x 10 5 7 11 13 4.2 1.3 7.1 2.9 #nodes 1 1 1 #LP 3 12 #cols (frac) 19 (0.08%) 208 (0.00%) #LPOP 1 1 11 1286 (0.01%) 1 5 7 22 17 20 85 1626 3715 (0.00%) (0.07%) (0.00%) (0.01%) 1 1 1 1 TLP Tsub 0 0 0 1 21 1 8 11 21 21 21 0 0 3 42 0 0 78 219 8 x 105 x 10 5 x 10 7 x 107 1 1 1 1 5 2034 1 4 24 (1.18%) 1 16 0 0 7 11 6.8 x 106 9.8 x 106 1 1 15 82 446 (0.00%) 6181 (0.06%) 9 6 383 2015 0 286 3 793 13 1.7 x 10 8 * * * 1 21 * * 5 7 11 43908 7710 1.4 x 107 1 1 1 5 4 12 29 (0.07%) 83 (1.08%) 1342 (0.01%) 1 1 1 21 21 21 0 0 1 0 0 5 2.5 x 10 7 2end 13 5 1 1 90 10 9464 (0.04%) 42 (0.00%) 1 1 21 21 1262 0 2755 9 lilq 7 7 5.9 x 106 1 1 45 16 1388 (0.00%) 554 (0.01%) 1 1 21 21 10 0 328 4 11 - 1 8 16110 (0.00%) 1 21 44 543 larb - 38 Table 2.5: Test results with random energy examples (#nodes: number of explored branch-and-bound nodes, LB: lower bound from LP-relaxation, T: time taken to solve the instance exactly, symbol * : stopped while running). 
We ran the programs on each case for no more than 20 minutes, stopping them unless they converged. The results are summarized in Table 2.5; we list only the cases where the base formulation has fractional solutions in its LP-relaxation.

Table 2.5: Test results with random energy examples (#nodes: number of explored branch-and-bound nodes, LB: lower bound from LP-relaxation, T: time taken to solve the instance exactly, symbol *: stopped while running). [The table body, covering 1amm, 256b, and 1arb at 3-11 residues and comparing lower bounds, node counts, and times of formulation F1 under CPLEX with those of BRP2 and BRP3, is not reproduced here; its entries were scrambled in extraction.]

All of BRP2, BRP3, and the CPLEX optimizers took more time to solve the random energy examples than the protein energy examples. The results illustrate the usefulness of the branch-and-price formulations to some extent: all cases of Table 2.5 have weaker lower bounds with the base formulation than with BRP2 or BRP3. BRP3 mostly found optimal solutions at the initial node, whereas BRP2 had overall weaker lower bounds than BRP3 or failed to attain convergence of the column generation. From our preliminary experiment, BRP3 turns out to be more efficient than BRP2. We think the performance of BRP2 could be improved by changing the size of the building cliques or the residue combination rule.

2.3 Summary

In this chapter, we examined the application of integer programming techniques to the GMEC problem. We reviewed the three known classical ILP formulations that capture the structure of the GMEC problem, and compared them by letting CPLEX optimizers use each formulation on protein energy examples. By this experiment, we found that Koster et al.'s formulation is more efficient than the others; yet simply choosing one of the formulations to use with a general solver becomes more and more inefficient as the problem size grows, as shown in Table 4.1. Motivated to find an efficient ILP method that can exploit problem-specific structure, we investigated the use of the branch-and-price algorithm, an exact method for large-scale hard combinatorial problems. We reviewed the idea behind the method and developed three decomposition schemes of Eriksson et al.'s ILP formulation. We implemented the methods using SYMPHONY, a generic branch, cut, and price framework, and tested them on protein energy examples. The implementations were able to handle small cases and found optimal solutions at the root node of the tree when the column generation converged, but the results could illustrate no more than the convergence properties of the different decomposition schemes, since the relaxed Eriksson et al. formulation also obtained integral solutions for the same cases. To validate the use of the decomposition schemes, we also performed tests with random energy examples, where the branch-and-price formulations often had tighter lower bounds than the base formulation. Though we were not able to obtain practical performance with our implementations, we believe that a more thorough examination of the problem will reveal a better way to apply the branch-and-price algorithm or similar methods, and will contribute to understanding the combinatorial structure of the GMEC problem.

Chapter 3

Nonlinear programming approach

In this chapter, we explore a nonlinear programming approach to the GMEC problem. The LP- or ILP-based methods of Chapter 2 have the advantage that they can exploit a fast and accurate LP solver.
However, LP or ILP formulations often involve a large number of variables and constraints. Considering that the necessary number of variables is roughly at least the square of the total number of rotamers, an LP- or ILP-based method can be of little practical use when the problem size grows very large, unless enormous computing power is available. As a natural extension of this concern, we turn our interest to the rich theory and techniques of nonlinear programming. We use a quadratic formulation of the GMEC problem that contains only as many variables as the total number of rotamers and whose number of constraints equals the number of residues. Since the continuous version of the formulation is not generally convex, we expect that obtaining the optimal solution will be hard, but we aim to evaluate nonlinear programming as an efficient candidate method for computing suboptimal solutions to the GMEC problem.

There have been several attempts to apply nonlinear programming to discrete optimization problems [20, 41, 37]. However, some of them are only effective for special forms of problems, or must be used in conjunction with other heuristics and combinatorial optimization frameworks. In this work, we mainly rely on Ng's framework for nonlinear nonconvex discrete optimization [31], which is simple and purely based on nonlinear programming techniques.

The rest of this chapter is organized as follows. Section 3.1 reviews Ng's smoothing algorithm based on the continuation approach. Section 3.2 presents tailored versions of the preconditioned conjugate gradient method and the adaptive linesearch method. Section 3.3 describes the application of the algorithm to the GMEC problem and discusses the computational results. Finally, Section 3.4 concludes the chapter.

3.1 Ng's continuation approach

In this section, we present and review Ng's work [31].

3.1.1 Continuation method

The continuation method solves a system of nonlinear equations by solving a sequence of simpler equations that eventually converges to the original problem. Suppose we need to solve a system of equations F(x) = 0, where x \in R^n and F : R^n \to R^n. For some G : R^n \to R^n, a new function H : R^n \times [0, 1] \to R^n can be defined by

H(x, \lambda) = \lambda F(x) + (1 - \lambda) G(x).    (3.1)

If we can easily find a root of H(x, 0) = G(x) = 0, or a point x^0 such that H(x^0, \lambda^0) = 0 for some \lambda^0 < 1, we can incrementally approximate the solution of the original equations H(x, 1) = F(x) = 0 by starting from x^0 and solving H(x, \lambda) = 0 as we increase \lambda from \lambda^0 to 1. Solving H(x, \lambda) = 0 is more advantageous than directly solving F(x) = 0 because iterative methods like Newton's method behave well for H(x, \lambda_{k+1}) = 0 when the initial point is a solution of H(x, \lambda_k) and \lambda_{k+1} is sufficiently close to \lambda_k.
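As a concrete illustration of the idea, the following sketch tracks the root of a scalar equation F(x) = 0 through the homotopy (3.1), using G(x) = x - x^0 so that the \lambda = 0 problem is trivially solved by x^0. The example equation is our own choice, not one from the thesis.

```python
import math

def newton(h, dh, x, tol=1e-10, iters=50):
    """Basic Newton iteration for a scalar equation h(x) = 0."""
    for _ in range(iters):
        step = h(x) / dh(x)
        x -= step
        if abs(step) < tol:
            break
    return x

def continuation_solve(F, dF, x0, steps=20):
    """Track the root of H(x, lam) = lam*F(x) + (1 - lam)*(x - x0)
    as lam increases from 0 to 1; each Newton solve is warm-started
    from the root found for the previous lam."""
    x = x0
    for k in range(1, steps + 1):
        lam = k / steps
        h  = lambda x, lam=lam: lam * F(x) + (1 - lam) * (x - x0)
        dh = lambda x, lam=lam: lam * dF(x) + (1 - lam)
        x = newton(h, dh, x)
    return x

# Example: F(x) = x - cos(x); the homotopy carries x from x0 = 0 to the
# root near 0.739.
print(continuation_solve(lambda x: x - math.cos(x),
                         lambda x: 1 + math.sin(x), 0.0))
```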
3.1.2 Smoothing algorithm

There have been active studies on optimization of convex functions over convex regions; both the theory and the practice are well established in this area. However, minimization of a nonconvex function is still very difficult due to the existence of local minima. The smoothing method modifies the original problem by adding a strictly convex function to eliminate poor local minima while trying to preserve significant ones. As in the continuation method, the smoothing method iteratively minimizes the modified function as it incrementally reduces the proportion of the convex term.

Suppose, for a convex region X \subseteq R^n, we want to minimize f(x) : R^n \to R, and \Phi(x) : R^n \to R is strictly convex over X. Then we define F(x, \mu) by

F(x, \mu) = f(x) + \mu \Phi(x).    (3.2)

If we let \lambda = 1/(1 + \mu) and H(x, \lambda) = \lambda F(x, \mu), we can easily observe that it is in the same form as (3.1). F(x, \mu) is strictly convex over X if \nabla^2 \Phi(x) is sufficiently positive definite for all x \in X (for a more rigorous statement of the claim and its proof, see [4]). Therefore, minimization of F(x, \mu) for a large positive value \mu is a relatively easy task. Once the solution x(\mu_k) for F(x, \mu_k) is obtained, the smoothing method subsequently minimizes F(x, \mu_{k+1}) for 0 < \mu_{k+1} < \mu_k from the starting point x(\mu_k).

One concern about the smoothing method is that it may generate a sequence of x(\mu_k) that is close to minimizers of \Phi(x) regardless of their quality as minimizers of f(x). A remedy for this problem is to use a \Phi(x) whose minimizers are close to none of the minimizers of f(x). This can be achieved, when x is a discrete variable, by using a \Phi(x) that resembles a penalty function of the barrier method.

Let us consider a nonlinear binary optimization problem:

minimize f(x)    (3.3)
subject to Ax = b,    (3.4)
x \in \{0, 1\}^n,    (3.5)

where x \in R^n, A \in R^{m \times n}, and b \in R^m. By relaxing the discrete variable constraint, we obtain

minimize f(x)    (3.6)
subject to Ax = b,    (3.7)
x \in [0, 1]^n.    (3.8)

To solve the above problem by the smoothing method, we define \Phi(x) as

\Phi(x) = -\sum_{j=1}^{n} \ln x_j - \sum_{j=1}^{n} \ln(1 - x_j).    (3.9)

The function is well defined for x \in (0, 1)^n. It is strictly convex, and \Phi(x) \to \infty as x_j \downarrow 0 or x_j \uparrow 1. Therefore, \Phi(x) also functions as a logarithmic barrier function that eliminates the inequality constraints on x and allows the results from the barrier method to be applied. The transformed problem for the smoothing method is as follows:

minimize f(x) - \mu \sum_{j=1}^{n} \{\ln x_j + \ln(1 - x_j)\}    (3.10)
subject to Ax = b,    (3.11)
x \in (0, 1)^n.    (3.12)

The barrier method approximately finds the solution x(\mu) of the problem for a sequence of decreasing \mu until \mu is very close to 0, and the limit point of x(\mu) is the global minimum of the original problem (3.6)-(3.8) [4]. Since the variables of the converged solution may not be close enough to 0 or 1 for binary rounding, we use a quadratic penalty function that guides the convergence to binary values. The transformed problem after adding the extra penalty function is as follows:

minimize F(x) = f(x) - \mu \sum_{j=1}^{n} \{\ln x_j + \ln(1 - x_j)\} + \gamma \sum_{j \in J} x_j (1 - x_j)    (3.13)
subject to Ax = b,    (3.14)
x \in (0, 1)^n,    (3.15)

where J is the set of variable indices that need the enforcement of the quadratic penalty term. This approach is justified by the fact that the problem

minimize g(x)    (3.16)
subject to Ax = b,    (3.17)
x \in \{0, 1\}^n    (3.18)

and the problem

minimize g(x) + \gamma \sum_{j=1}^{n} x_j (1 - x_j)    (3.19)
subject to Ax = b,    (3.20)
x \in [0, 1]^n    (3.21)

have the same minimizers for a sufficiently large \gamma [31].
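A minimal sketch of the transformed objective (3.13) and its gradient for a generic f; the function names are ours, and J is the index set subject to the quadratic penalty.

```python
import numpy as np

def transformed_objective(f, grad_f, mu, gamma, J):
    """Build F(x) = f(x) - mu * sum_j [ln x_j + ln(1 - x_j)]
                  + gamma * sum_{j in J} x_j (1 - x_j)
    and its gradient, valid for x strictly inside (0, 1)^n; cf. (3.13)."""
    def F(x):
        barrier = -mu * (np.log(x).sum() + np.log(1.0 - x).sum())
        penalty = gamma * (x[J] * (1.0 - x[J])).sum()
        return f(x) + barrier + penalty
    def grad_F(x):
        g = grad_f(x) - mu * (1.0 / x - 1.0 / (1.0 - x))
        g[J] += gamma * (1.0 - 2.0 * x[J])   # derivative of x(1-x) is 1-2x
        return g
    return F, grad_F
```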
3.1.3 Solving the transformed problem

By transforming the original discrete nonlinear optimization problem, we obtain a sequence of optimization problems in a continuous domain. Nonetheless, the transformed optimization problem in the continuous domain is no less hard than the original problem, because there exist many local minimizers and saddle points. We use a second-order method to ensure the solution satisfies the second-order optimality conditions.

Once we obtain a solution x^0 of (3.13)-(3.15) for the initial choice of penalty parameters \mu^0 and \gamma^0, we can work in the null space of A for the rest of the algorithm. Suppose Z is a matrix whose columns form a basis of the null space of A. Then AZ = 0, and the feasible region can be described as {x | x = x^0 + Zy, y \in R^{n-m} such that x \in (0, 1)^n}. Therefore, the problem (3.13)-(3.15) is reduced to optimization over a subset of R^{n-m}.

The quadratic approximation of F(x) around the current point x_k is given by

F_k(x) = F(x_k) + \nabla F(x_k)'(x - x_k) + \frac{1}{2}(x - x_k)' \nabla^2 F(x_k)(x - x_k).    (3.22)

Setting the derivative of F_k(x) to zero, we obtain an x_{k+1} that satisfies the first-order optimality condition:

\nabla F(x_k) + \nabla^2 F(x_k)(x_{k+1} - x_k) = 0.    (3.23)

To remain feasible, x_{k+1} should satisfy x_{k+1} = x_k + Zy for some y \in R^{n-m}. Therefore, by substituting for x_{k+1} - x_k and premultiplying both sides by Z', we obtain a reduced-Hessian system that yields a descent direction:

Z' \nabla^2 F(x_k) Z y = -Z' \nabla F(x_k).    (3.24)

Newton's method is iteratively applied to (3.13)-(3.15) until Z'\nabla F(x_k) is sufficiently close to 0 and Z'\nabla^2 F(x_k)Z is positive semidefinite. The Newton's method embedded in the smoothing algorithm is given in Table 3.1.

Since Z'\nabla^2 F(x_k)Z is usually a large dense matrix, (3.24) is solved by an iterative method such as the conjugate gradient (CG) method rather than by explicitly forming the inverse. In particular, we use the preconditioned conjugate gradient (PCG) method. CG is an algorithm for solving a positive semidefinite system; in case Z'\nabla^2 F(x_k)Z is not positive definite, the algorithm should terminate. On the other hand, to satisfy the second-order optimality condition, we need to find a negative curvature direction when one exists. This can be done within the CG method by finding a vector q such that q'Z'\nabla^2 F(x_k)Zq < 0. Since we want the eigenvalue corresponding to the negative curvature direction to be as close to the smallest eigenvalue as possible, we may opt to use the inverse iteration method to enhance the q obtained by the CG method.

Table 3.1: Logarithmic barrier smoothing algorithm.
Input: F(x): transformed function for the smoothing algorithm; Z: matrix whose columns constitute the null space of A; ε_F: tolerance for function evaluation; ε_µ: minimum allowed value of µ; M: maximum allowed value of γ; N: maximum number of times the Newton's method is applied; θ_µ: reduction ratio for µ; θ_γ: growth ratio for γ; µ^0: initial value of µ; γ^0: initial value of γ; x^0: any feasible starting point.
Output: x: a local minimum of the problem (3.6)-(3.8).
Smoothing iteration:
  Set µ := µ^0, γ := γ^0, and x := x^0.
  while γ < M or µ > ε_µ do:
    for k = 1 to N do:
      if ||Z'∇F(x)|| < ε_F then stop.
      else
        Apply the CG method to Z'∇²F(x)Z y = -Z'∇F(x).
        Do a linesearch: obtain a descent direction d and a step length α.
        Set x := x + αd.
      endif
    endfor
    Set µ := θ_µ µ, and γ := θ_γ γ.
  endwhile
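The loop structure of Table 3.1 can be rendered compactly as follows. This is an illustrative sketch only: a dense least-squares solve and a crude backtracking step stand in for the PCG solve of (3.24) and the adaptive linesearch described later in Table 3.3, and `make_F` is a name of our own choosing.

```python
import numpy as np

def smoothing(make_F, Z, x0, mu0=100.0, gamma0=0.002, theta_mu=0.95,
              theta_gamma=1.02, eps_F=1e-5, eps_mu=1e-5, M=1000.0, N=300):
    """make_F(mu, gamma) returns (F, grad_F, hess_F) for the current
    penalty parameters; x0 must be strictly inside (0, 1)^n."""
    mu, gamma, x = mu0, gamma0, x0.copy()
    while gamma < M or mu > eps_mu:
        F, grad_F, hess_F = make_F(mu, gamma)
        for _ in range(N):
            g = Z.T @ grad_F(x)
            if np.linalg.norm(g) < eps_F:
                break
            # reduced-Hessian Newton system (3.24): Z'HZ y = -Z'g
            y = np.linalg.lstsq(Z.T @ hess_F(x) @ Z, -g, rcond=None)[0]
            d = Z @ y
            alpha = 1.0
            while (np.any(x + alpha * d <= 0) or np.any(x + alpha * d >= 1)
                   or F(x + alpha * d) > F(x)):
                alpha *= 0.5                 # stay inside (0,1)^n and descend
                if alpha < 1e-12:
                    break
            x = x + alpha * d
        mu, gamma = theta_mu * mu, theta_gamma * gamma  # relax barrier, tighten penalty
    return x
```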
3.2 Algorithm implementation

In this section, we describe implementation aspects of the smoothing algorithm. Since we deal with large-scale problems, efficient implementation of each step, based on the numerical properties of the problems, is important for obtaining reasonable performance. We first describe a modified CG method that solves the reduced-Hessian system and also finds a negative curvature direction. The adaptive linesearch algorithm determines a descent direction from the results of the CG method and calculates a step length. We also describe a modified penalty parameter control scheme that exploits the bundled structure of the variables and early elimination of converged variables.

3.2.1 The conjugate gradient method

We use the CG method to solve reduced-Hessian systems. For indefinite systems, the standard CG method may still converge if all its search directions lie in a subspace on which the quadratic form is convex. If it ends up with a search direction that has negative curvature, we use both the intermediate result and the negative curvature direction to determine the descent direction and the step length. The modified preconditioned CG method in Table 3.2 solves Ax = b or outputs a negative curvature direction when it finds one [31].

Table 3.2: The preconditioned CG method for possibly indefinite systems.
Input: A; b; M: preconditioner; N: maximum number of iterations; x^0: starting point; ε_r: residual error tolerance; ε_c: negative curvature error tolerance.
Output: a solution x of Ax = b with tolerance ε_r, or a negative curvature direction c.
PCG iteration:
  Set x := x^0, i := 0, r := b - Ax^0, d := M^{-1}r, δ_new := r'd, δ_0 := δ_new.
  while i < N and δ_new > ε_r² δ_0 do:
    Set q := Ad.
    if d'q < ε_c ||d||² then
      Set c := d and stop.
    else
      Set α := δ_new / d'q, x := x + αd, r := r - αq, s := M^{-1}r,
          δ_old := δ_new, δ_new := r's, β := δ_new / δ_old, d := s + βd, i := i + 1.
    endif
  endwhile

3.2.2 The adaptive linesearch algorithm

The linesearch algorithm must choose the search direction from the Newton-type direction and the negative curvature direction, and calculate a suitable step length in the chosen direction. Pursuing the Newton-type direction brings fast convergence to a point that satisfies the first-order optimality condition, while the negative curvature direction is necessary to escape from local nonconvexity. Moré and Sorensen [29] and Lucidi et al. [27] describe a curvilinear linesearch algorithm that uses a combination of the two directions. However, the relative scaling of the two direction vectors can be an issue, especially when the negative curvature direction is given too little weight even though it may give a significant reduction in the objective value. In our work, we used the adaptive linesearch algorithm of Gould et al. [13], where only one of the two directions is used at a time. Before we exploit the negative curvature direction from the conjugate gradient method, we also try to enhance it by the inverse iteration method. Table 3.3 summarizes the algorithm.

In steps 3 and 4 of the algorithm, an upper bound α_max on the possible step length needs to be calculated because we have the constraints 0 < x_i < 1 for each i. The upper bound is determined by α_max = min_i β_i, where

β_i = (1 - x_i^k)/p_i if p_i > 0,  β_i = -x_i^k/p_i if p_i < 0,  β_i = ∞ if p_i = 0.    (3.25)

The parameters θ, ρ, and σ_k control the behavior of the algorithm. We let σ_k be the step length calculated from the last linesearch in the negative curvature direction.

Table 3.3: The adaptive linesearch algorithm.
Input: s^k: a Newton-type direction; d^k: a negative curvature direction; g^k = ∇F(x^k); H^k = ∇²F(x^k); N; σ_k; θ ∈ (0, 1); ρ ∈ (0, 1/2); the function F(x); the model decrease ΔF(y) = g^k'y + (1/2)y'H^k y.
Output: α_k: step length.
1. Negative curvature enhancement:
  if (Zs)'H^k(Zs) < 0 then
    if (Zs)'H^k(Zs) > -1 then set δ := -1 else set δ := (Zs)'H^k(Zs). endif
    Apply the inverse iteration on Z'H^kZ with shift value δ to obtain an eigenvalue λ and the eigenvector v.
    if λ < (Zs)'H^k(Zs) then set s := v. endif
  endif
2. Selection of the search direction:
  if d^k = 0 then set p^k := s^k and go to 3.
  else set d^k := d^k / ||d^k||;
    if g^k's^k ≤ ΔF(d^k) then set p^k := s^k and go to 3 else set p^k := d^k and go to 4. endif
  endif
3. Linesearch in a gradient-related direction:
  Find α_max, and set α_k := α_max, c_k := min{0, p^k'H^k p^k}, i := 1.
  while i ≤ N and F(x^k + α_k p^k) > F(x^k) + ρ α_k (g^k'p^k + (1/2) α_k c_k) do:
    Set α_k := θ α_k and i := i + 1.
  endwhile
4. Linesearch in a negative curvature direction:
  Find α_max, and set α_k := σ_k, i := 1.
  if F(x^k + α_k p^k) < F(x^k) + ρ ΔF(α_k p^k) then
    while i ≤ N and α_k/θ ≤ α_max and F(x^k + (α_k/θ) p^k) < F(x^k) + ρ ΔF((α_k/θ) p^k) do:
      Set α_k := α_k/θ and i := i + 1.
    endwhile
  else
    while i ≤ N and F(x^k + α_k p^k) ≥ F(x^k) + ρ ΔF(α_k p^k) do:
      Set α_k := θ α_k and i := i + 1.
    endwhile
  endif
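Before moving to the computational results, a direct Python transcription of the PCG method of Table 3.2 may be useful for reference (an illustrative sketch; `Minv` is the preconditioner's inverse, applied as a matrix).

```python
import numpy as np

def pcg_indefinite(A, b, Minv, x0=None, max_iters=400, eps_r=1e-5, eps_c=1e-5):
    """Preconditioned CG following Table 3.2: returns ('solution', x) when
    the residual test passes, or ('negative_curvature', d) when a direction
    with d'Ad < eps_c * ||d||^2 is encountered. A must be symmetric but may
    be indefinite."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    d = Minv @ r
    delta_new = r @ d
    delta_0 = delta_new
    for _ in range(max_iters):
        if delta_new <= eps_r**2 * delta_0:
            break
        q = A @ d
        if d @ q < eps_c * (d @ d):          # negative (or tiny) curvature
            return 'negative_curvature', d
        alpha = delta_new / (d @ q)
        x = x + alpha * d
        r = r - alpha * q
        s = Minv @ r
        delta_old, delta_new = delta_new, r @ s
        d = s + (delta_new / delta_old) * d
    return 'solution', x
```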
endif Apply the inverse iteration on Z'HkZ with shift value 6 to obtain an eigenvalue A and the eigenvector v. if A < (Zs)'Hk(Zs) then set s := v. endif endif 2. Selection of the search direction: if dk - 0, then set pk := Sk and goto 3. dk else dk - IjdkIV endif TAF(dk) then set pk := sk and goto 3. if 9 k else set pk := dk and goto 4. endif 3. Linesearch in a gradient-related direction: Find amax, and set ak := amax, ck := minr{0,pk'Hkpk},j := 1. 2ck) while i < N and F(xk - akpk)> F(xk) + Set ak - ak and i := i+1. endwhile 4. Linesearch in a negative curvature direction: Find amax, and set ak : 0k, i:= 1. if F(xk + akpk) < F(xk) + pLF(akpk) then while i < N and a < cmax and F(xk + Lpk) <F(xk) + pF( Set ak:= a, and i := i+1. endwhile else while i < N and F(xk + akpk ) < F(xk) + puAF(akpk) Set ak : endwhile Oak and i := i+ 1. endif 51 fPk) We define a symmetric matrix Q= (Qi,,j,) as IE(ijrs) Qjr,js ={ E(ir) if i#j, (3.29) if i = j and r = s, otherwise, 0 and an n x m matrix A = (Air), where m = Ei(# rotamers of residue i), as (3.30) Air = ith column of the m x m identity matrix Im. If we also let i, be the p x 1 one vector, and x be the vector (xi,), then (3.26)-(3.28) is simplified as minimize X'QX (3.31) subject to Ax = 1, (3.32) E {0, 1}, Vi, Vr E Ri. Xir (3.33) (3.31)-(3.33) is in the same form of (3.3)-(3.5). Therefore, we only need to calculate the null-space matrix Z of A to apply the smoothing algorithm of Table 3.1. By the simple structure of A, Z can be given by the form Z = (Z1Z2 ...Zn),i where n is the number of residues and Zi is a m x (# rotamers of residue i) matrix. If we let mi be the number of rotamers of residue i, then Zi is given by Zi = [0mi-1x& my where Opxq - 1 mi-1 Imi-1 Om-lxZj.. (3.34) m~], is the m x n zero matrix, lP is the p x 1 one vector, and 4, is the p x p identity matrix. When mi is one, Z is not defined by the previous formula, but the corresponding residue can be obviously excluded from the computation because it has only one possible rotamer choice. 52 Energy clipping and preconditioning the reduced-Hessian 3.3.1 system The numerical properties play a key role in determining the performance of the algorithm. Since the constraints are in the simple fixed form and the function is given by a pure quadratic form, all the numerical properties come from the matrix Q. We perform a simple preprocessing on Q by clipping too large values in the matrix to a smaller fixed value. By making a cumulative distribution of values in Q, we find the number Qrnax that is larger than most of the values in Q, say 99% of them and replace any value larger than that with Qmax. In this way, we can adaptively prevent extreme ill-conditioning while not affecting the original nature of the problem too much. One of the segments of the algorithm whose performance is most strongly influenced by the numerical properties is the iterative solving routine of the reducedHessian system by the conjugate gradient method. Since the reduced-Hessian system is dense and particularly large for our problem, the preconditioning can be an important factor in improving the performance of the conjugate gradient method. Nash and Sofer [30] describe a set of techniques for positive definite matrices of the form Z'GZ to construct approximation to (Z'GZ)-l and suggest that the simple approximation of W'M 1 W can be effective for general use, where W' is a left inverse of Z and M is a positive definite matrix such that M ~ G. 
3.3.2 Parameter control and variable elimination

We introduced minor modifications to make the algorithm work more efficiently on the GMEC problem. A condition of the GMEC problem that differs markedly from the applications in Ng's thesis [31] is that n ≪ m, that is, m_i is very large for some i and not for others. By applying Ng's smoothing algorithm, we could find optimal solutions, or solutions fairly close to them, for the cases where m_i is small for all i. However, when we applied it to a case where the difference between max_i m_i and min_i m_i is large, the solution we obtained was not good even though n is small. We suspect that the problem arises from the monotonic decrease of a single barrier penalty parameter µ for every variable. Variables of residues with a small number of rotamers tend to converge at an early stage of the algorithm, while those with a large number of rotamers take longer to find their way to convergence. Using a single µ therefore seems to force a premature convergence that is more likely to end in a poor local minimum. We instead assign a separate barrier penalty parameter µ_i to each set of variables corresponding to the rotamers of residue i. In the same fashion, we also assign a separate quadratic penalty parameter γ_i to force the binary rounding to a different degree for each residue. The transformed function is then given by

F(x) = f(x) - \sum_i \mu_i \sum_{r \in R_i} \{\ln x_{ir} + \ln(1 - x_{ir})\} + \sum_i \gamma_i \sum_{r \in R_i} x_{ir}(1 - x_{ir}).    (3.35)

Various schemes can be devised to control the decrease of µ_i or the increase of γ_i. In our experiment, we decreased µ_i only if the deviation of the variables x_{ir} stayed the same or increased after an iteration of the smoothing algorithm. We let SM1 and SM2 denote Ng's smoothing algorithm without and with this parameter control scheme, respectively.

However, by introducing a separate barrier penalty parameter for each residue and letting the residues with a small number of rotamers converge faster, we face another numerical problem: as the variables of a residue approach the value 0 or 1 while some others are still quite far from binary values, the calculation of the gradient and Hessian, as well as most other matrix computations, grows numerically unstable, because the magnitude of 1/x_{ir} or 1/(1 - x_{ir}) becomes boundlessly large. Therefore, we need to eliminate the variables that have converged to one of the binary values from the subsequent computation. The criterion for elimination can be ambiguous, and several schemes can be tried. We assumed a variable x_{ir} had reached 0 if its value was smaller than a factor 10^{-4} of its initial value 1/m_i. After elimination of variables of a residue, if there are still remaining variables for it, we need to normalize the values of the remaining variables so that their sum equals 1 and the computation can proceed in the null space of A. On the other hand, if any of the variables of a residue is close to 1, say larger than 0.9, we regard the residue as having converged to a rotamer and exclude all of its variables from the subsequent computation. Elimination of variables in the middle of the algorithm reduces the dimension of the matrices and may contribute to a speed-up of the overall performance.
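The per-residue parameter control and the variable elimination rule can be sketched as follows. This is illustrative only; the conditional decrease of µ_i used in SM2 is omitted for brevity, and the data layout is ours.

```python
import numpy as np

def update_parameters_and_eliminate(x, blocks, mu, gamma, active,
                                    theta_mu=0.95, theta_gamma=1.02, tol=1e-4):
    """Each residue i has its own mu[i] and gamma[i]; variables stuck near 0
    are dropped and the residue block is renormalized, and a residue whose
    largest variable exceeds 0.9 is fixed to that rotamer. blocks[i] lists
    residue i's variable indices; active marks live variables."""
    for i, idx in enumerate(blocks):
        live = [j for j in idx if active[j]]
        if not live:
            continue                              # residue already fixed
        if x[live].max() > 0.9:                   # converged to a single rotamer
            for j in live:
                active[j] = False
            continue
        dropped = [j for j in live if x[j] < tol / len(idx)]
        for j in dropped:                         # eliminate variables near 0
            active[j] = False
        rest = [j for j in live if active[j]]
        if dropped and rest:
            x[rest] /= x[rest].sum()              # renormalize the residue block
        mu[i] *= theta_mu                         # residue-wise barrier decrease
        gamma[i] *= theta_gamma                   # residue-wise penalty increase
    return x, mu, gamma, active
```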
On the other hand, if any of the variables corresponding to a residue is close to 1, say larger than 0.9, then we regard the residue has converged to a rotamer and exclude all of the variables for the residue from the subsequent computation. Elimination of variable in the middle of the algorithm reduces the dimension of matrices and may contribute to speed-up of the overall performance. 3.3.3 Results We implemented the algorithm using MATLAB 6.5 on a PC with a 1.7 GHz Pentium 4 processor, 256 MBytes memory and Windows XP for its OS. Throughout the experiment, parameter settings as shown in Table 3.4 were used. The results are summarized in Table 3.5. 55 Table 3.5: Results for SM2. (protein: PDB code, #res: number of residues, #var: number of variables, optimal: optimal objective value, smoothing: objective value from the smoothing algorithm, #SM: number of smoothing iterations, #CG: number of CG calls, time: execution time in seconds, #NC: number of negative curvature directions used, IT0: initial value of the quadratic penalty parameter.) protein 256b 256b lamm lbpi J7- #res 5 6 7 8 10 11 15 19 20 #var 11 13 148 24 137 94 110 194 211 optimal -54.16980 -70.95571 -70.18879 -87.68481 -110.18409 -94.69729 -116.8101 -152.34157 -209.37536 smoothing -53.72762 -70.51478 -68.52338 -87.24479 -90.40045 -82.68505 -114.58994 -144.27939 -173.51333 #SM 165 165 309 196 285 248 235 364 275 #CG 431 467 926 561 954 829 715 1166 840 time 1.11 1.28 33.40 2.02 27.48 11.30 14.19 70.81 64.14 #NC 0 0 8 1 6 7 10 12 13 25 350 -242.85081 -158.74741 286 928 333.23 42 0 5 6 7 8 10 11 15 19 20 11 13 148 24 137 94 110 194 211 -54.16980 -70.95571 -70.18879 -87.68481 -110.18409 -94.69729 -116.8101 -152.34157 -209.37536 -53.72762 -70.51478 -68.52338 -87.24479 -90.40045 -81.97510 -115.08332 -144.27939 -202.06752 165 165 310 196 332 268 245 358 274 433 468 934 563 1090 866 731 1144 863 1.15 1.28 33.05 2.16 28.35 11.48 14.28 70.04 65.22 0 0 8 3 8 4 11 14 12 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 25 350 -242.85081 -156.88740 278 894 324.30 30 0.002 10 15 20 78 153 232 -62.64278 -134.63023 -167.31388 -59.63623 -127.46733 999841.81683 378 377 276 1621 1632 937 13.86 37.28 106.58 6 14 27 0.002 0.002 0.002 25 277 138.61011 1063.02470 830 2374 1100.66 50 0.002 10 15 20 128 144 214 -75.37721 -57.13359 -75.8774 -74.10404 -54.72438 14.94464 286 408 405 794 1867 1453 19.42 42.67 99.17 8 19 31 0.002 0.002 0.002 25 303 -93.65443 -69.08175 396 2154 262.39 29 0.002 56 0 0 0 0 0 0 0 0 0 0 . A very We first experimented with the initial quadratic penalty parameter -Y small -y0 and the increasing rate Ol close to 1 is desirable because the quadratic penalty term should not significantly affect the initial behavior of the algorithm. Table 3.5 shows the results from using y0 = 0.002 and not using the quadratic penalty ('f = 0) when other parameter settings are identical. There is not noticeable difference in both the quality of the solution and the performance for the data from 256b, which implies that the algorithm was able to converge to a solution close to binary without the enforcement by the quadratic penalty. However, the algorithm did not converge to binary values when it was applied to the data from lamm and lbpi. For the rest of the experiment, we used y0 = 0.002. The objective values from the smoothing algorithm are fairly close to optimal when the problem size is small, but the results were not stable for the cases with more than 200 variables. 
We suspect this is an implementation-specific problem that can be addressed by careful inspection of, and experimentation with, the numerical aspects of the algorithm, such as improving the negative curvature direction obtained, the termination criteria of the CG method, and the kind of linesearch method used. At the same time, although we used rather conservative parameter settings, they are far from suitable for every case. For example, merely setting the barrier penalty decrease rate θ_µ to 0.98 instead of 0.95 resulted in an improved solution with objective value -64.12139 for 20 residues of 1bpi, compared to 14.94464 when θ_µ = 0.95.

We also ran the SM1 algorithm on the same data set for comparison, with identical parameter settings. The results are summarized in Table 3.6. Interestingly, SM1 showed more stable behavior for large cases and was able to obtain reasonable solutions where SM2 failed to do so; for 20 residues of 1amm, SM1 obtained an objective value of -164.31605 while SM2 produced an absurd value of 999841.81683. However, SM1 also failed to produce any solution close to optimal for the cases of 19 and 25 residues of 256b, where SM2 obtained near-optimal results. SM1's solutions were also often inferior to those from SM2, and SM1 usually took a larger number of CG calls and consequently a longer execution time. The larger number of CG calls is mainly due to the ill-conditioning caused by variables approaching binary values, as described in Section 3.3.2.

Table 3.6: Results for SM1 (columns as in Table 3.5). [The table body, covering 256b, 1amm, and 1bpi at 5-25 residues with γ^0 = 0.002, is not reproduced here; its entries were scrambled in extraction.]

3.4 Summary

In this chapter, we explored a nonlinear programming approach to the GMEC problem. We relaxed the binary quadratic formulation of the GMEC problem to obtain a continuous version of it, and then applied Ng's continuation approach to solve the resulting nonlinear nonconvex minimization problem. We presented a modified version of the preconditioned CG method for possibly indefinite systems, and an adaptive linesearch algorithm that exploits the negative curvature direction. We also suggested minor modifications to Ng's algorithm, such as separate parameter control and variable elimination. The computational results suggest that the continuation approach works well for small instances where each residue does not have many rotamer choices, but the quality of the local minima it finds for large instances is not guaranteed. The modification we made to Ng's algorithm was not effective for large problems in terms of the quality of the solutions found; however, it contributed to reducing the ill-conditioning and to making the algorithm work more efficiently.
Chapter 4

Probabilistic inference approach

In the previous chapters, we investigated discrete and continuous optimization methods for the GMEC problem formulated as a mathematical programming problem. In this chapter, we use probabilistic inference methods to infer the GMEC, using the energy terms transformed into probability distributions. Probabilistic inference methods are generally approximate methods, but empirically their computation time is short and the solutions found are usually very close to optimal. We test three different probabilistic inference methods on side-chain placement and sequence design to compare their performance among themselves and with the results from CPLEX optimizers and DEE/A*. The main result of this chapter is the promising performance of the max-product belief propagation (BP) algorithm as a tool for protein sequence design.

4.1 Methods

In this section, we review the probabilistic inference methods that will be used in the computational experiments of the next section.

4.1.1 Probabilistic relaxation labeling

Relaxation labeling is a method for the graph labeling problem, which often arises as a fundamental problem in image analysis and computer vision. The objective of graph labeling is to assign the best label to each node of a graph, given compatibility relations that indicate how compatible pairs of labels are. Obviously, the GMEC problem is closely related to graph labeling. Probabilistic relaxation labeling is an iterative method that looks for a fixed point of

P_i^{k+1}(x_i) = \frac{1}{Z} P_i^k(x_i) \prod_{j \in N(i)} \sum_{x_j} r_{ij}(x_i, x_j) P_j^k(x_j),    (4.1)

which formalizes how the compatibilities between two nodes, weighted by the information from neighboring nodes, update the local probability distribution at each iteration. The local probability distribution P_i^k(x_i) over the possible labels x_i of node i at iteration k is updated using the compatibility r_{ij}(x_i, x_j) and the probability distributions P_j^k(x_j) of the neighboring nodes N(i); Z is a normalization constant. The updated probability distribution is kept at the node and is used to update the probability distributions of the node itself and of its neighbors at the next iteration. For a more detailed description and the theory of probabilistic relaxation labeling, see [32, 15, 16].
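One synchronous sweep of the update (4.1) can be written as follows. This is a sketch with names of our own choosing; it assumes a compatibility matrix r[(i, j)] for every ordered pair of neighbors, with r[(j, i)] equal to the transpose of r[(i, j)].

```python
import numpy as np

def relaxation_labeling_step(P, r, neighbors):
    """One synchronous update of (4.1). P[i] is node i's label distribution
    (a 1-D numpy array), r[(i, j)] the compatibility matrix between the
    labels of i and j, and neighbors[i] the nodes adjacent to i."""
    P_new = {}
    for i, p in P.items():
        support = p.copy()
        for j in neighbors[i]:
            support *= r[(i, j)] @ P[j]     # sum_{x_j} r_ij(x_i, x_j) P_j(x_j)
        P_new[i] = support / support.sum()  # normalize (the 1/Z factor)
    return P_new
```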
4.1.2 BP algorithm

In the BP algorithm, node s sends to its neighbor node i the message m_{si}(x_i), which can be interpreted as how likely node s thinks node i is to be in state x_i. The BP algorithm computes exact marginal probabilities in a singly connected factor graph by an iterative message passing and update protocol. Since our energy function is a sum of self and pairwise terms, through the Boltzmann mapping we can view it as a probability function that is a product of factors referencing at most two variables. The factor graph for a problem with three residues therefore looks like Figure 4-1.

Figure 4-1: An example factor graph with three residues (variable nodes x1, x2, x3 connected by pairwise factor nodes f(x1, x2), f(x1, x3), and f(x2, x3)).

Then, as shown in [19], the belief at node i is given by the product of the singleton factor φ_i(x_i) associated with variable x_i and all the messages coming into it:

P(x_i) = k φ_i(x_i) \prod_{s \in N'(i)} m_{si}(x_i),    (4.2)

where k is a normalization constant and N'(i) denotes the neighboring nodes of i except the singleton factor node associated with x_i. The message update rule at node i is given by

m_{si}(x_i) \leftarrow \sum_{x_j} φ_j(x_j) ψ_{ij}(x_i, x_j) \prod_{t \in N'(j) \setminus s} m_{tj}(x_j),    (4.3)

where ψ_{ij}(x_i, x_j) is the pairwise factor of x_i and x_j, and s is the factor node connecting them. Due to the form of the message update rule, the BP algorithm is also called the sum-product algorithm. BP is not guaranteed to work in an arbitrary topology, but Yedidia et al. [43] have shown that BP can only converge to a stationary point of the Bethe free energy.

4.1.3 Max-product BP algorithm

Suppose X is the set of all possible configurations of the random variables {x}. Then the MAP (maximum a posteriori) assignment is given by

x_{MAP} = \arg\max_{\{x\} \in X} P(\{x\}).    (4.4)

Finding the MAP assignment is generally NP-hard. However, if the factor graph is singly connected, the max-product BP algorithm can solve the problem efficiently. The max-product BP algorithm is identical to the standard BP algorithm except that the message update rule takes the maximum of the product instead of the sum, i.e.,

m_{si}(x_i) \leftarrow \max_{x_j} φ_j(x_j) ψ_{ij}(x_i, x_j) \prod_{t \in N'(j) \setminus s} m_{tj}(x_j).    (4.5)

The belief update is given by (4.2). When the belief update converges, the belief at any node i satisfies

b_i(x_i) = k \max_{\{x\} \in X} P(\{x\} | x_i),    (4.6)

and the MAP assignment can be found by

(x_{MAP})_i = \arg\max_{x_i} b_i(x_i).    (4.7)

The max-product BP has no guarantee of finding the MAP in graphs with loops. Freeman and Weiss have shown that the assignment from the fixed points of the max-product BP is a neighborhood maximum of the posterior probability [10].

4.1.4 Double loop algorithm

BP is guaranteed to work correctly in a singly connected factor graph, but its performance on loopy graphs has not been well understood. In practical applications, however, such as decoding turbo codes, the method has been very successful even in loopy graphs. An important discovery about the behavior of BP was made by relating it to Gibbs free energy approximation [43], and this elucidated BP as a method for approximate inference in general graphs. Despite this theoretical progress in understanding the fixed points of BP, its applications are limited without reliable convergence properties. The convergence problem of BP, together with BP's analogy to Bethe free energy minimization, motivated the development of algorithms that attempt to minimize the Bethe free energy itself. In our application of the max-product BP algorithm to the GMEC problem, the method performed surprisingly well, but we also found cases where it converges to an unreasonably large energy value. In this work, to evaluate approximate inference based on Bethe free energy minimization as an alternative to the max-product BP when the latter does not show good convergence, we test the double loop algorithm derived from the mutual information minimization and marginal entropy maximization (MIME) principle [36].
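A compact sketch of max-product BP on a pairwise model follows, with one message per ordered pair of nodes as in (4.5); the energies enter through the Boltzmann-type mapping described in Section 4.1.5. Function and variable names are ours, and the model is assumed connected.

```python
import numpy as np

def max_product(phi, psi, n_iters=50):
    """Max-product BP on a pairwise model. phi[i]: evidence vector for node i
    (e.g., exp(-(e_i - e_min))); psi[(i, j)]: pairwise compatibility matrix
    for i < j (e.g., exp(-(e_ij - e_min))). Returns the argmax of each
    node's belief, i.e., the approximate MAP assignment of (4.7)."""
    nodes = sorted(phi)
    pairs = dict(psi)
    pairs.update({(j, i): M.T for (i, j), M in psi.items()})
    nbrs = {i: [j for j in nodes if (i, j) in pairs] for i in nodes}
    m = {(j, i): np.ones(len(phi[i])) for i in nodes for j in nbrs[i]}
    for _ in range(n_iters):
        new_m = {}
        for i in nodes:
            for j in nbrs[i]:
                # message j -> i: max over x_j of phi_j * psi_ij * other msgs
                prod = phi[j].copy()
                for t in nbrs[j]:
                    if t != i:
                        prod = prod * m[(t, j)]
                msg = (pairs[(i, j)] * prod[None, :]).max(axis=1)
                new_m[(j, i)] = msg / msg.sum()   # normalize against underflow
        m = new_m
    beliefs = {i: phi[i] * np.prod([m[(j, i)] for j in nbrs[i]], axis=0)
               for i in nodes}
    return {i: int(np.argmax(b)) for i, b in beliefs.items()}
```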
4.1.5 Implementation

We implemented probabilistic relaxation labeling (RL), the max-product BP (MP), and the MIME double loop algorithm (DL). When applying the probabilistic methods, self-energies and pairwise energies need to be transformed into probability distributions. In RL, the initial estimate of P_i(x_i) was given by

P_i(x_i = r) = \frac{1}{Z}\left(e_{i_r} - \min_{j,s} e_{j_s}\right),    (4.8)

where Z is a normalization constant. A uniform distribution as the initial estimate for RL was found to be much worse than (4.8). The significant link connecting energy minimization and probabilistic inference is the Boltzmann law, which is generally accepted as true in statistical physics. The compatibility for RL was calculated by

r_{ij}(x_i = r, x_j = s) = e^{-e'_{i_r j_s}},    (4.9)

where e'_{i_r j_s} is given by (2.1). In MP and DL, both the compatibility ψ_{ij}(x_i, x_j) and the evidence φ_i(x_i) are necessary. These were calculated by the following relations:

ψ_{ij}(x_i = r, x_j = s) = e^{-(e_{i_r j_s} - e_{min})},    (4.10)
φ_i(x_i = r) = e^{-(e_{i_r} - e_{min})},    (4.11)

where e_{min} is defined by e_{min} = min{min_{i,r} e_{i_r}, min_{i<j,r,s} e_{i_r j_s}}. The initial estimates of the messages m_{ij}(x_j) and of the probability distributions P_i(x_i) or P_{ij}(x_i, x_j) turned out to be almost irrelevant to the performance of MP and DL.

The convergence of RL, MP, and DL was determined by the change in the sum of minus log probabilities; we accepted convergence when the change rate fell below a threshold of 10^-. The convergence of DL's inner loop is determined by the parameter c_thr, for which we used the value 10^{-4}. We used zero for both δ and the other parameter of DL. The rest of the implementation is fairly straightforward for all three methods. One possible problem in the implementation is numerical underflow. In particular, in the double loop algorithm, the update ratio of the joint probability P_{ij}(x_i, x_j) and that of the marginal probability P_i(x_i) are reciprocal; care should therefore be taken that update ratios near zero or infinity are not used in the iteration.

4.2 Results and discussions

We tested the implementations of the three methods presented in Section 4.1 on the protein data examples. Since RL and MP are fast and converge for most cases, we were able to carry out fairly large side-chain placements as well as some sequence designs. All tests were performed on a Debian workstation with a 2.20 GHz Intel Xeon processor and 1 GB of memory; the programs were written in C and compiled with the GNU compiler.

4.2.1 Side-chain placement

The protein test set for side-chain placement is presented in Table 4.1. Each test case is identified by the protein's PDB code and the number of residues modeled; for brevity we will refer to each test case by combining the two, for example 256b-50. Note that all the energy calculations for 1bpi, 1amm, 256b, and 1arb were done using the same rotamer library (call it LIB1), and the test cases of 2end and 1ilq were generated using a different library (call it LIB2). LIB2 consists of denser samples of χ angles, and therefore has a larger number of rotamers for each amino acid than LIB1. As a measure of the combinatorial complexity of each test case, we added a column for the log of the total number of conformations. We know the optimal value of each test case from one or more of the CPLEX optimizers and Altman's DEE/A* implementation [2]. We have extraordinarily large optimal values for some cases of 1amm and 1arb; we believe this results from clashes between rotamers, which can happen because of the non-ideal characteristics of a rotamer library that does not always represent the actual conformation of a side-chain very well. We nevertheless included these cases in our test to see whether the characteristics of the rotamer library affect the performance of the methods. For reference, we included the time taken to solve each test case by the different solvers. We used the IP solver (CPLEX MIP Optimizer) only if we could not obtain integral solutions using the LP solver (CPLEX LP Optimizer); for example, in the cases of 256b-70 and 256b-80, the optimal solutions of the LP relaxation are fractional. On the other hand, for 2end-35, 2end-40, and 2end-49, the LP solver broke down because of the system's limited memory. The IP solver is not expected to handle these cases, considering its even greater effort to find integral solutions.
Altman's DEE/A* implementation was much faster than the other two solvers in all test cases of Table 4.1.

Table 4.1: The protein test set for side-chain placement (log10conf: log total conformations, optimal: optimal objective value, T_LP: LP solver solution time, T_IP: IP solver solution time, T_DEE: DEE/A* solution time, symbol -: skipped case, symbol *: failed, symbol F: fractional LP solution). [The table body, covering 1bpi (10-46 residues), 1amm (10-80), 256b (10-80), 1arb (10-78), 2end (15-49), and 1ilq (7 and 11), is not reproduced here; its entries were scrambled in extraction.]

The test results for RL, MP, and DL are shown together in Table 4.2. Noticeably, MP outperforms the other methods and calculates optimal solutions for all test cases but two of 256b. An interesting point is that MP failed only for the cases where the LP relaxation does not have an integral solution, but we do not have good insight into whether there is a more general connection between the two methods. (Throughout our entire experiments, including those not presented in this thesis, we did not observe any other protein energy example with a fractional solution in the LP relaxation.) On the other hand, the consistently worst performer is DL. DL was able to find approximate solutions only for small cases, and took longer than both RL and MP. Its probability update mechanism, which uses Lagrangean multipliers to minimize the Bethe free energy approximation, turned out to be not very effective in finding the GMEC. DL was accordingly not tested on the large cases of side-chain placement or on any of the sequence design cases.

Table 4.2: Results for the side-chain placement test (ΔE: the difference from the optimal objective value, symbol -: skipped). [The table body, listing the numbers of incorrect rotamers, ΔE, and running times of RL, MP, and DL on the test set of Table 4.1, is not reproduced here; its entries were scrambled in extraction.]

RL's performance was generally better than DL's and fairly accurate up to medium-sized cases. The fraction of incorrect rotamers from each method, as the log complexity changes, is shown in Table 4.3. It is interesting that RL's incorrect prediction rate stays around 0.1 despite the huge ΔE's for some cases in Table 4.2. We can also observe that both RL and MP are considerably faster than LP but slower than DEE/A*; RL and MP are almost equal in speed.

Table 4.3: The fraction of incorrect rotamers from RL, MP, and DL.

log10conf   | RL    | MP    | DL
4.6 ~ 10    | 0.008 | 0     | 0.139
10 ~ 20     | 0.047 | 0     | 0.212
20 ~ 30     | 0.080 | 0     | 0.306
30 ~ 40     | 0.102 | 0     | -
40 ~ 60     | 0.082 | 0     | -
60 ~ 97.4   | 0.151 | 0.041 | -

Considering speed and prediction accuracy, only MP can be an alternative to DEE/A* in side-chain placement. To compare MP and DEE/A* more closely, we tested only these two methods on larger cases of side-chain placement. We generated test cases consisting of 35-60 residues using LIB2; the lengths of the modeled protein sequences are around the average of those in Table 4.1, but each position has a larger number of rotamer conformations. The description of the test cases and the results are presented in Table 4.4. All of the optimal values in Table 4.4 were obtained by DEE/A*; however, we found cases where the method fails to find the optimal solution during the A* search because it exhausts the available memory.

Table 4.4: Performance comparison of MP and DEE/A* on side-chain placement (optimal: solution value from DEE/A*, E_MP: solution value from MP, ΔE: difference between optimal and E_MP, IR_MP: number of incorrectly predicted rotamers by MP, T_DEE: time taken for DEE/A* in seconds, T_MP: time taken for MP in seconds, symbol ?: unknown, symbol *: failed).

protein | #res | log10conf | optimal      | E_MP         | ΔE       | IR_MP | T_DEE | T_MP
1cbn    | 35   | 50.3      | 999802.90413 | 999818.01809 | 15.1139  | 3     | 130   | 58
1isu    | 49   | 76.7      | -329.62462   | -329.62462   | 0        | 0     | 646   | 1406
1igd    | 50   | 87.0      | -376.93007   | -349.05877   | 27.87193 | 7     | 14    | 762
9rnt    | 50   | 91.6      | -391.89270   | -391.89270   | 0        | 0     | 875   | 435
1whi    | 50   | 92.1      | -450.82517   | -450.82517   | 0        | 0     | 282   | 1568
1ctj    | 50   | 95.5      | -439.59678   | -439.59678   | 0        | 0     | 70    | 878
1aac    | 50   | 98.5      | -549.36958   | -547.33491   | 2.03467  | 2     | 467   | 1396
1cex    | 50   | 100.3     | -519.22413   | -518.85970   | 0.36443  | 5     | 162   | 851
1xnb    | 60   | 110.1     | ?            | -437.26643   | *        | *     | *     | 1087
1pic    | 60   | 113.3     | -331.36163   | -330.80530   | 0.55633  | 1     | 289   | 1342
2hbg    | 60   | 119.0     | -560.59013   | -560.28258   | 0.30755  | 4     | 179   | 2865
2ihl    | 60   | 118.1     | ?            | -527.61321   | *        | *     | *     | 3254
The failure during the A* search implies that DEE's elimination was not effective enough to make A*'s job feasible, and also suggests that DEE is not the ultimate solution even for side-chain placement. MP was able to find four known optimal solutions, but its overall performance was not as impressive as in the previous cases.

The graphical model used in MP is a complete graph whose nodes represent the residues. Therefore, the convergence of MP, and the quality of the solution it finds, will depend solely on the numerical properties of the energies. As an indirect measure of these numerical properties, we analyze the estimated entropy of each side-chain i, which is given by

S_i = -\sum_{x_i} p_i(x_i) \log p_i(x_i).    (4.12)

An S_i close to 0 implies that essentially one rotamer is a good candidate for the GMEC choice at residue i, whereas a high S_i suggests that many rotamers have an almost equal chance of being the conformation. Connecting this intuition with MP's performance, we suspect that cases easily solved by MP will have overall lower entropy than those that are not. As a rough examination of this conjecture, we take a look at MP's estimated entropy of each side-chain.
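For reference, the per-residue entropy (4.12) is computed directly from the normalized beliefs; a minimal sketch:

```python
import numpy as np

def estimated_entropy(belief):
    """Per-residue entropy (4.12) from a belief vector; values near zero
    indicate a residue whose rotamer is effectively decided."""
    p = belief / belief.sum()
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return float(-(p * np.log(p)).sum())
```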
Figure 4-2 shows the initial entropy distribution and its change when MP converged after 4 iterations on 1amm-80.

Figure 4-2: The change in the estimated entropy distribution from MP for 1amm-80 (estimated entropy vs. residue number; initial distribution and after 4 iterations).

The entropy of each side-chain generally decreased, from an average of 0.888 to 0.627. However, contrary to our expectation that most of the side-chains would have entropy close to 0 because MP converged so easily for this case, we observe that more than 20% of the side-chains have entropy larger than 1 at the end. For comparison, we also plotted the change of entropy for 256b-80, which has similar complexity to 1amm-80 but is one of only two cases of Table 4.1 on which MP failed. Figure 4-3 shows the result.

Figure 4-3: The change in the estimated entropy distribution from MP for 256b-80 (estimated entropy vs. residue number; initial distribution and after 5 iterations).

256b-80's average entropy dropped from 1.123 to 0.719, but the general trend of its entropy change looks similar to that of 1amm-80. Another view of the two distributions is given by the entropy histogram of Figure 4-4, where the final entropy distribution of 256b-80 is only slightly more extended toward large entropy values.

Figure 4-4: Histogram of estimated entropy for 1amm-80 and 256b-80 at convergence of MP.

This observation, together with the statistics of the estimated entropy and the fraction of incorrectly predicted rotamers listed in Table 4.5, suggests that it is hard to find a direct correlation between MP's prediction error and the estimated entropy. However, the last column of Table 4.5 at least tells us that a residue position with an incorrect rotamer prediction might have a threshold minimum of estimated entropy (in our test results, approximately 1.00). This fact may provide a way to use MP as a preprocessing procedure that fixes the conformations of residues with low estimated entropy before an exact method such as DEE/A* solves the rest of the problem.

Table 4.5: Fraction of incorrectly predicted rotamers by MP and statistics of the estimated entropy for the second side-chain placement test (IR fraction: fraction of incorrectly predicted rotamers, avg S_i: average estimated entropy; the max S_i, min S_i, and min_{i∈IR} S_i columns of the original were scrambled in extraction and are omitted).

protein | #res | IR fraction | avg S_i
1cbn    | 35   | 0.09        | 1.77
1isu    | 49   | 0           | 1.80
1igd    | 50   | 0.14        | 2.16
9rnt    | 50   | 0           | 2.14
1whi    | 50   | 0           | 1.88
1ctj    | 50   | 0           | 2.32
1aac    | 50   | 0.04        | 2.17
1cex    | 50   | 0.10        | 2.16
1pic    | 60   | 0.02        | 2.22
2hbg    | 60   | 0.07        | 2.25

Figure 4-5 shows the empirical relation between the log complexity and the time taken by MP in seconds; the time complexity is roughly polynomial in the log complexity.

Figure 4-5: Execution time for MP in seconds vs. log total conformations.
4.2.2 Sequence design

Table 4.6 shows the protein test set for sequence design. Optimal solutions were obtained for only three cases; the remaining three cases could not be solved by either LP or DEE/A*. Both the CPLEX LP Optimizer and Altman's DEE/A* implementation failed on R15R23, sweet7-NOW, and P2P2prime due to the system's limited memory. On the other hand, the LP solver managed to solve 1ilq-design3 after five hours; DEE/A* broke down for this case during the A* search.

Table 4.6: The protein test set for sequence design (symbol ?: unknown optimal value, symbol *: failed).

Case ID       | log10conf | optimal     | T_LP  | T_DEE
1ilq-design1  | 19.1      | -186.108150 | 11    | 2
1ilq-design2  | 21.2      | -190.791630 | 616   | 40
1ilq-design3  | 23.6      | -194.170360 | 19522 | *
R15R23        | 22.5      | ?           | *     | *
sweet7-NOW    | 43.6      | ?           | *     | *
P2P2prime     | 60.1      | ?           | *     | *

We let RL and MP solve the sequence design cases; the results are summarized in Table 4.7. For all the cases with known optimal solutions, MP computed the solutions exactly in extremely short time. In protein design, unlike side-chain placement, the usual measure of accuracy is the fraction of amino acids predicted incorrectly, but MP's results are correct down to the rotamers. RL's performance was slightly worse than MP's, but also impressive: RL produced two and one incorrect rotamers for 1ilq-design2 and 1ilq-design3, respectively, but its amino acid predictions were exact.

Table 4.7: Results for the sequence design test cases whose optimal values are known.

Case ID       | #amino acids incorrect (RL / MP) | #rotamers incorrect (RL / MP) | ΔE (RL / MP)  | Time in sec (RL / MP)
1ilq-design1  | 0 / 0                            | 0 / 0                         | 0 / 0         | 3 / 1
1ilq-design2  | 0 / 0                            | 2 / 0                         | 0.22507 / 0   | 9 / 3
1ilq-design3  | 0 / 0                            | 1 / 0                         | 0.23092 / 0   | 13 / 24

In the other three extremely large cases of sequence design, whose optimal values are unknown, MP's accuracy drops somewhat. We know that MP's solutions for R15R23 and P2P2prime are not optimal, because RL's solutions have lower energy values (Table 4.8). Since both RL and MP are probabilistic inference methods, the probability that they converge to the correct rotamer will decrease as the number of allowed rotamers at each position increases. Despite this expected limitation, RL and MP prove themselves good alternatives to DEE/A* in sequence design. Seeing that DEE/A* fails on design cases whose sequence length and log complexity are even smaller than those of the side-chain placement cases it solved, DEE/A*'s elimination criteria seem less powerful in sequence design in particular. We think that combining the two approaches, say by feeding DEE's incompletely eliminated output to MP or RL as input, may make a good scheme.

Table 4.8: Results for the sequence design test cases whose optimal values are unknown (E: solution value the method obtained).

Case ID     | E (RL)       | E (MP)       | Time in sec (RL) | Time in sec (MP)
R15R23      | 999874.33767 | 999926.45661 | 63               | 157
sweet7-NOW  | 999664.69920 | -178.83174   | 300              | 285
P2P2prime   | -384.30067   | -382.10387   | 248              | 396

4.3 Summary

In this chapter, we took a probabilistic inference approach to the GMEC problem. By transforming the self and pairwise energies into probability distributions, we were able to apply techniques such as relaxation labeling, the max-product belief propagation, and the MIME double loop algorithm. We had the most satisfactory results with the max-product BP algorithm, in both side-chain placement and sequence design, whereas DEE/A* had difficulty dealing with sequence design and complex side-chain placement.
Probabilistic relaxation labeling was reasonably fast and often found good approximate solutions, but the MIME double loop algorithm mostly ended up in poor local minima. We also attempted an analysis of the correlation between the max-product BP's error and the estimated entropy of the side-chains.

Chapter 5

Conclusions and future work

In this work, we investigated three different approaches to the global minimum energy conformation (GMEC) problem.

We first compared the effectiveness of three known ILP formulations on protein energy examples using CPLEX optimizers. Interestingly, most solved LP-relaxations of Eriksson et al.'s and Koster et al.'s formulations had integral solutions for protein energy examples. We then devised three decomposition schemes of Eriksson et al.'s ILP formulation to solve the GMEC problem using the branch-and-price algorithm. From the computational experiments, we observed that the implementations of the branch-and-price formulations can obtain tight lower bounds and often find the optimal solution at the root node, for protein energy examples as well as random energy examples. Though we could not obtain practical performance with the simple implementations, we think the performance can be improved by adopting various known techniques for the branch-and-price algorithm, such as control of the tailing-off effect. The results from random energy examples suggest that the branch-and-price approach can be more favorable for hard instances. Developing different decomposition schemes or subproblem solution methods is an interesting topic for further investigation.

In the context of the ILP approach, another interesting direction not pursued here is a randomized rounding scheme. We note that the approximation results for the DENSE-k-SUBGRAPH problem (DSP) can be used to obtain an approximation ratio for rounding a fractional LP solution to the GMEC problem. However, we need to obtain a good ratio for the case k < n, a different condition from that studied in Han et al.'s work [14].
Various approaches could be taken to combine several efficient methods, such as DEE/A* and the max-product BP, in a serial or parallel fashion to obtain a method that is both accurate and efficient. On the other hand, one of the most interesting and important issues in using the max-product BP for protein side-chain placement will be characterizing its convergence behavior and its optimality conditions. We suspect the concurrence of fractional LP solutions and max-product BP errors noted in Section 4.2.1 might help guide the search for answers.

Bibliography

[1] Ernst Althaus, Oliver Kohlbacher, Hans-Peter Lenhof, and Peter Müller. A combinatorial approach to protein docking with flexible side-chains. In R. Shamir, S. Miyano, S. Istrail, P. Pevzner, and M. Waterman, editors, Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, pages 15-24. ACM Press, 2000.

[2] Michael D. Altman. DEE/A* implementation. The Tidor group, AI Lab, MIT, 2003.

[3] Cynthia Barnhart, Ellis L. Johnson, George L. Nemhauser, Martin W. P. Savelsbergh, and Pamela H. Vance. Branch-and-price: column generation for solving huge integer programs. Operations Research, 46:316-329, 1998.

[4] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.

[5] Pierluigi Crescenzi and Viggo Kann. A compendium of NP optimization problems. Available at http://www.nada.kth.se/~viggo/problemlist/compendium.html.

[6] Bassil I. Dahiyat and Stephen L. Mayo. Protein design automation. Protein Science, 5:895-903, 1996.

[7] Marc De Maeyer, Johan Desmet, and Ignace Lasters. All in one: a highly detailed rotamer library improves both accuracy and speed in the modeling of side-chains by dead-end elimination. Folding & Design, 2:53-66, 1997.

[8] Johan Desmet, Marc De Maeyer, Bart Hazes, and Ignace Lasters. The dead-end elimination theorem and its use in protein side-chain positioning. Nature, 356:539-542, 1992.

[9] Olivia Eriksson, Yishao Zhou, and Arne Elofsson. Side chain-positioning as an integer programming problem. In WABI, volume 2149 of Lecture Notes in Computer Science, pages 128-141. Springer, 2001.

[10] William T. Freeman and Yair Weiss. On the fixed points of the max-product algorithm. Technical Report TR-99-39, MERL, 2000.

[11] D. Ben Gordon and Stephen L. Mayo. Branch-and-terminate: a combinatorial optimization algorithm for protein design. Structure with Folding and Design, 7(9):1089-1098, 1999.

[12] D. Benjamin Gordon and Stephen L. Mayo. Radical performance enhancements for combinatorial optimization algorithms based on the dead-end elimination theorem. Journal of Computational Chemistry, 19(13):1505-1514, 1998.

[13] Nicholas I. M. Gould, Stefano Lucidi, Massimo Roma, and Philippe L. Toint. Exploiting negative curvature directions in linesearch methods for unconstrained optimization. Technical Report RAL-TR-97-064, Rutherford Appleton Laboratory, 1997.

[14] Qiaoming Han, Yinyu Ye, and Jiawei Zhang. Approximation of dense-k-subgraph, 2000.

[15] Robert M. Haralick. An interpretation for probabilistic relaxation. Computer Vision, Graphics, and Image Processing, 22:388-395, 1983.

[16] Robert A. Hummel and Steven W. Zucker. On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(3):267-287, 1983.

[17] Marcel Hunting, Ulrich Faigle, and Walter Kern. A Lagrangian relaxation approach to the edge-weighted clique problem. European Journal of Operational Research, 131(1):119-131, May 2001.
[18] Víctor M. Jiménez and Andrés Marzal. An algorithm for efficient computation of k shortest paths. Technical Report DSIC-I/38/94, Depto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia, Spain, 1994.

[19] Michael I. Jordan. An Introduction to Probabilistic Graphical Models. University of California, Berkeley, 2003.

[20] Narendra Karmarkar, Mauricio G. C. Resende, and K. G. Ramakrishnan. An interior point algorithm to solve computationally difficult set covering problems. Mathematical Programming, 52:597-618, 1991.

[21] Patrice Koehl and Marc Delarue. Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. Journal of Molecular Biology, 239:249-275, 1994.

[22] Arie M.C.A. Koster. Frequency Assignment: Models and Algorithms. PhD thesis, Maastricht University, November 1999.

[23] Arie M.C.A. Koster, Stan P.M. van Hoesel, and Antoon W.J. Kolen. The partial constraint satisfaction problem: facets and lifting theorems. Operations Research Letters, 23(3-5):89-97, 1998.

[24] Arie M.C.A. Koster, Stan P.M. van Hoesel, and Antoon W.J. Kolen. Lower bounds for minimum interference frequency assignment problems. Technical Report RM 99/026, Maastricht University, October 1999.

[25] Arie M.C.A. Koster, Stan P.M. van Hoesel, and Antoon W.J. Kolen. Solving frequency assignment problems via tree-decomposition. Technical Report RM 99/011, Maastricht University, 1999.

[26] Andrew R. Leach and Andrew P. Lemon. Exploring the conformational space of protein side chains using dead-end elimination and the A* algorithm. PROTEINS: Structure, Function, and Genetics, 33:227-239, 1998.

[27] Stefano Lucidi, Francesco Rochetich, and Massimo Roma. Curvilinear stabilization techniques for truncated Newton methods in large scale unconstrained optimization. SIAM Journal on Optimization, 8(4):916-939, 1998.

[28] Elder Magalhães Macambira and Cid Carvalho de Souza. The edge-weighted clique problem: valid inequalities, facets and polyhedral computations. European Journal of Operational Research, 123(2):346-371, June 2000.

[29] Jorge J. Moré and Danny C. Sorensen. On the use of directions of negative curvature in a modified Newton method. Mathematical Programming, 16:1-20, 1979.

[30] Stephen G. Nash and Ariela Sofer. Preconditioning reduced matrices. SIAM Journal on Matrix Analysis and Applications, 17(1):47-68, 1996.

[31] Kien-Ming Ng. A continuation approach for solving nonlinear optimization problems with discrete variables. PhD thesis, Stanford University, 2002.

[32] Shmuel Peleg. A new probabilistic relaxation scheme. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2(4):362-369, 1980.

[33] N. A. Pierce, J. A. Spriet, J. Desmet, and S. L. Mayo. Conformational splitting: a more powerful criterion for dead-end elimination. Journal of Computational Chemistry, 21:999-1009, 2000.

[34] J. W. Ponder and F. M. Richards. Tertiary templates for proteins. Journal of Molecular Biology, 193(4):775-791, 1987.

[35] Ted K. Ralphs. SYMPHONY version 3.0 user's guide. Available at http://www.branchandcut.org/SYMPHONY, 2003.

[36] Anand Rangarajan and Alan L. Yuille. MIME: mutual information minimization and entropy maximization for Bayesian belief propagation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 873-880, Cambridge, MA, 2002. MIT Press.

[37] Mohit Tawarmalani and Nikolaos V. Sahinidis.
Global optimization of mixed-integer nonlinear programs: a theoretical and computational study. Mathematical Programming, submitted, 1999.

[38] Ron Unger and John Moult. Finding the lowest free energy conformation of a protein is an NP-hard problem: proof and implications. Bulletin of Mathematical Biology, 55(6):1183-1198, 1993.

[39] François Vanderbeck and Laurence Wolsey. An exact algorithm for IP column generation. Operations Research Letters, 19(4):151-159, 1996.

[40] Christopher A. Voigt, D. Benjamin Gordon, and Stephen L. Mayo. Trading accuracy for speed: a quantitative comparison of search algorithms in protein sequence design. Journal of Molecular Biology, 299:789-803, 2000.

[41] J. P. Warners, T. Terlaky, C. Roos, and B. Jansen. Potential reduction algorithms for structured combinatorial optimization problems. Operations Research Letters, 21:55-64, 1997.

[42] Chen Yanover and Yair Weiss. Approximate inference and protein folding. In Proceedings of Neural Information Processing Systems, 2002.

[43] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Technical Report TR2001-16, MERL, 2001.