Artificial Intelligence and Robotics Methods in Computational Biology: Papers from the AAAI 2013 Workshop

An Evolutionary Search Algorithm to Guide Stochastic Search for Near-Native Protein Conformations with Multiobjective Analysis

Brian Olson (1) and Amarda Shehu (1,2,3)
(1) Department of Computer Science, (2) Department of Bioengineering, (3) School of Systems Biology
George Mason University, Fairfax, VA, 22030

Copyright (c) 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Predicting the native conformations of a protein from its sequence is known as de novo structure prediction and is a central challenge in computational biology. Most computational protocols employ Monte Carlo sampling. Evolutionary search algorithms have also been proposed to enhance sampling of near-native conformations. These approaches bias stochastic search by an energy function, even though current energy functions are known to be inaccurate and drive sampling to non-native energy minima. This paper proposes a multiobjective approach which employs Pareto dominance, rather than total energy, to evaluate a conformation. This multiobjective approach accounts for the fact that terms in an energy function are conflicting optimization criteria. Our analysis is conducted on a diverse set of 20 proteins. Results show that employing Pareto dominance, rather than total energy, to guide stochastic search is more effective at sampling conformations which are both lower in energy and near the protein native structure.

Introduction

Millions of protein-encoding sequences extracted from organismal genomes lack any structural or functional characterization (Lee, Redfern, and Orengo 2007). Yet, a detailed structural characterization of the biologically-active or native state of a protein is key to understanding protein function and essential in engineering novel proteins, predicting stability, modeling molecular interactions, and designing novel drug compounds (Shehu 2013). Doing so from knowledge of only the protein's amino-acid sequence, a problem known as de novo structure prediction, is an outstanding challenge in computational biology (Lee, Wu, and Zhang 2009; Shehu 2010; Moult et al. 2011).

Current de novo structure prediction protocols employ stochastic search guided by an energy function to iterate over low-energy conformations of a chain of amino acids. The operating principle is that native conformations are associated with the lowest energies in the energy surface that underlies the protein conformational space (Dill and Chan 1997). Most protocols use Monte Carlo (MC) sampling in place of Molecular Dynamics (MD) due to the higher sampling capability of MC.

State-of-the-art protocols handle the high dimensionality of the protein conformational space in two ways. First, reduced/coarse-grained representations of the protein chain are employed to lower the number of dimensions. Such representations largely sacrifice side chains, modeling backbone heavy atoms and a designated atom or pseudo-atom per side chain. Second, the molecular fragment replacement technique is employed to sample new conformations. Rather than sampling angle values for each of the backbone dihedral angles independently, the technique couples backbone dihedral angles of consecutive amino acids in a fragment and samples an entire fragment configuration at a time from libraries pre-compiled from known protein native structures (Han and Baker 1996). Coarse-grained representations and molecular fragment replacement have greatly advanced de novo structure prediction (Bradley, Misura, and Baker 2005; Hegler et al. 2009; Shehu 2009; DeBartolo et al. 2010; Shehu and Olson 2010; Olson, Molloy, and Shehu 2011; Olson et al. 2012b; Xu and Zhang 2012; Simoncini et al. 2012; Molloy, Saleh, and Shehu 2013). Recently, this domain-specific expertise has been incorporated in evolutionary search algorithms (EAs) (Olson, De Jong, and Shehu 2013; Olson and Shehu 2012b; 2013; Saleh, Olson, and Shehu 2012; 2013).

EAs have been proposed for protein conformational search before, using either lattice or all-atom representations (Chira, Horvath, and Dumitrescu 2010; Islam, Chetty, and Murshed 2011; Cutello et al. 2011; Garza-Fabre, Toscano-Pulido, and Rodriguez-Tello 2012; Narzisi, Nicosia, and Stracquadanio 2010). Currently, EAs that employ lattice or all-atom representations have limited applicability and are not competitive with MC-based approaches that employ backbone representations. Recent work has shown that incorporating such representations and molecular fragment replacement makes even very simple EAs, such as basin hopping (Olson and Shehu 2011; 2012b; Olson et al. 2012a; Olson and Shehu 2012a; 2013), or more powerful population-based EAs (Saleh, Olson, and Shehu 2013; Olson, De Jong, and Shehu 2013) competitive with MC-based algorithms for de novo structure prediction.

Currently, many stochastic search algorithms are shown to have high sampling capability. However, inaccuracies in energy functions are considered primary reasons why de novo structure prediction remains challenging. Recent work shows that even state-of-the-art coarse-grained energy functions, including the Rosetta energy function, have non-native energy minima that are lower than the one containing the experimentally-known native structure (Shmygelska and Levitt 2009; Das 2011; Molloy, Saleh, and Shehu 2013). This is not surprising, as energy functions, particularly those that interface with coarse-grained representations, are known to be inaccurate due to the process by which they are obtained. Many terms in them are conflicting. However, this development seems to represent an impasse in computational structural biology. Recent studies have advocated sacrificing efficiency and doing away with coarse-grained energy functions (Bowman and Pande 2009), though the presence of inaccuracies is not disputed even among all-atom energy functions (Verma et al. 2006; Hornak et al. 2006; Roe et al. 2007).

In this paper we propose to change the framework in which stochastic search, whether MC- or EA-based, is guided by an energy function. We propose to treat the different terms or groups of terms in a given energy function as conflicting optimization criteria. We do so in the context of an EA that combines both local and global search, which is known as a hybrid or memetic EA. In essence, the EA evolves a population of conformations over generations. The algorithm employs the coarse-grained representation in the Rosetta protocol, the Rosetta energy functions, and the molecular fragment replacement technique. Fragment lengths of 9 and 3 amino acids are used.

The proposed algorithm is guided by Pareto dominance rather than the total potential energy of a conformation. Prior to adding a conformation to the evolving population, the algorithm decomposes the energy of the conformation into various terms. The values of these terms are compared to those of other conformations maintained in an archive. The conformation is then added to the population based on a multiobjective analysis detailed below in Methods. The resulting population thus corresponds to the Pareto front of all of the conformations sampled during the search.

The proposed algorithm is tested on 20 protein sequences with experimentally-determined native structures. Analysis of sampled conformations and comparison with the known native structures show that employing Pareto dominance to guide stochastic search rather than total energy is more effective at sampling low-energy near-native conformations.

Methods

In the proposed EA, a population of conformations evolves through a series of generations guided by Pareto dominance (detailed below) rather than the total energy of a conformation. Different fragment lengths and different Rosetta energy functions are used at various generations.

Molecular Representation

A conformation is represented as a vector of 3n angles, which are the φ, ψ, ω backbone dihedral angles of each amino acid in a protein chain of n amino acids. Application of forward kinematics allows obtaining Cartesian coordinates from these angles (Zhang and Kavraki 2002). The only atoms modeled are the heavy backbone atoms N, Cα, C, O, and a pseudo-atom centered at the side chain of each amino acid. This representation is the one employed in the Rosetta de novo structure prediction protocol.

Initial Population

The initial population P_0 is obtained by conducting p independent two-stage Metropolis MC trajectories starting at the fully extended conformation. The first stage of each trajectory consists of 200 moves and uses the score0 Rosetta energy function with a temperature of zero. The second stage uses the score1 Rosetta energy function with a low temperature close to room temperature. Stage two runs until n consecutive MC moves have failed, where n is the number of amino acids. A move in this 0th generation consists of replacing the configuration of a randomly-selected fragment of 9 amino acids with a configuration sampled from the fragment configuration library constructed with the latest protocol described in (Leaver-Fay et al. 2011). The score0 Rosetta energy function consists of only a soft steric repulsion, and its usage in P_0 is to obtain a diverse population of conformations free of steric clashes. The application of score1 allows formation of secondary structure.

Evolving Population

In each subsequent generation i, the algorithm switches to employing fragments of length 3, and the population P_i is obtained as follows. All conformations of the previous population P_(i-1) are first duplicated, then subjected to mutation and projected to a nearby local minimum through a local search. The mutation consists of replacing a configuration of a randomly-sampled fragment of length 3. The local search is a greedy search that terminates when m consecutive replacements fail to lower energy. Analysis in previous work suggests setting m to the number of amino acids in the target protein sequence (Olson and Shehu 2012a). The energy function used for the local search is the score3 Rosetta energy function, which corresponds to the full coarse-grained Rosetta energy function that is a linear combination of 10 different energy terms measuring repulsion, amino-acid propensities, residue environment, residue pair interactions, interactions between secondary structure elements, density, and compactness.

Population Selection Method

The result of this process is p child conformations which are not automatically added to population P_i. Instead, they are compared to an archive that maintains every child conformation sampled in the algorithm. The archive gives a broad view of conformational space in order to select conformations to add to the population. The comparison is conducted on three groupings of energy terms in the score4 Rosetta energy function. In score4, three additional terms are added to score3: short-range hydrogen bonding, long-range hydrogen bonding, and Ramachandran. These are organized into three terms: shb, which corresponds to short-range hydrogen bonding; lhb, which corresponds to long-range hydrogen bonding; and all-else, which groups together Ramachandran and all other remaining energy terms. Our analysis indicates that this grouping is most effective (data not shown). Once the energy of a conformation is split into these 3 terms, a conformation can essentially be regarded as having 3 scores. A child conformation is first added to the archive, and then each conformation in the archive, including the newly added child conformation, is re-evaluated according to these 3 scores.

Multiobjective Analysis for Selection: Pareto Dominance

A conformation C_i in the archive is said to dominate another conformation C_j in the archive when each score of C_i is lower than the corresponding score in C_j. The Pareto rank of a conformation is the number of conformations which dominate it. Conformations with a Pareto rank of 0 are said to be non-dominated and belong to the Pareto front. Conformations in the Pareto front are considered equivalent with respect to a multiobjective analysis. We note that the Pareto rank of a conformation in the archive can change over time, so a conformation that starts in the Pareto front will likely fall out of the Pareto front over time. This is the reason the algorithm re-evaluates the entire archive after adding a child conformation to it.

In addition to child conformations, the best l parent conformations from population P_(i-1) are added to population P_i. This is known as elitism, and its purpose is to preserve good solutions captured in previous generations. The resulting population is reduced down to the same constant size of p individuals through truncation selection. For both elitism and truncation selection, conformations are ranked first by Pareto rank and then by total energy for conformations with the same Pareto rank.

Experiments and Results

We compare the proposed algorithm, which we refer to as MOEA for Multi-objective EA, with an EA that does not use Pareto rank, but rather only employs total energy to determine whether to add a child conformation to a population based on truncation selection. Our analysis compares EA to the MOEA in terms of lowest energies (score4) reached, the lowest RMSD to the native structure reached, and the entire distribution of energy vs. RMSD values for sampled conformations. RMSD measures the root-mean-square Euclidean distance between corresponding Cα atoms of a given conformation and the known native structure. Lower values mean better proximity to the native structure.

Implementation Details

Each algorithm is run 5 times on each of the 20 proteins employed for our analysis. A fixed budget of 10,000,000 energy function evaluations is used, which takes 7 to 24 hours of CPU time on a 2.4 GHz Core i7 processor, depending on protein length. The size of each population is p = 100, and the elitism rate is set to l = 25 for EA and l = 100 for MOEA. In the local search, m is set to the number of amino acids in the particular protein sequence under consideration.

Summary Analysis

Table 1 provides details on the 20 protein systems selected for the analysis here. These vary from 53 to 146 amino acids in length and have different native folds. The lowest RMSD to the native structure (the lowest over all 5 runs) is shown for EA in column 5 and compared to that reached by MOEA, shown in column 6. In 17/20 cases, highlighted in bold, MOEA reaches lower or comparable (within 0.5 Å) lowest RMSDs. Columns 7-8 show that the percentage of conformations within 5 Å of the native structure is also higher in MOEA, and this difference is dramatic in 3 cases, which are highlighted in bold. In only one case is MOEA outperformed by EA (protein system 11). Columns 9-10 compare the algorithms in terms of lowest score4 energy value reached. In 12/20 cases, highlighted in bold, MOEA reaches lower or comparable (within 2.0 kcal/mol) energy values than EA. In summary, these results suggest that MOEA reaches both lower-energy and lower-RMSD conformations, thus enhancing sampling of near-native conformations.

Detailed Analysis

The rest of the analysis provides more detail on the actual distribution of RMSDs and energies of sampled conformations. The left panel of Fig. 1 compares the distribution of RMSD values (all conformations sampled by all 5 runs are combined for each algorithm). The distribution for conformations obtained by MOEA is plotted in a dotted blue line and is superimposed over that obtained by EA, plotted in a black line. Three systems are selected to highlight a case where the MOEA results in significantly more conformations with lower RMSDs to the native structure than the EA, shown in Fig. 1(a), a case where the distributions are comparable, shown in Fig. 1(c), and a rare case where EA performs better, shown in Fig. 1(e) (this is the only case where MOEA is outperformed, as indicated in Table 1).

The right panel of Fig. 1 compares the algorithms in terms of the energy vs. RMSD distribution of the conformations they sample (conformations from all 5 runs are combined for each algorithm). The same three systems are highlighted. The distribution obtained by the MOEA is superimposed in blue over that obtained by EA in red. Comparison of these distributions allows making a few observations. First, as before, MOEA reaches lower energy values than EA, even though it is guided by Pareto analysis rather than total energy. Second, the energy surfaces sampled are rich in non-native minima. Only in the case of PDB id 1dtjA is the energy surface funneled towards the native structure. In the case of PDB id 1aoy, where MOEA performs worse than EA in terms of distribution of RMSDs, the Pareto analysis seems to have steered the search towards a minimum that is 7 to 9 Å away from the native structure.
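The greedy local search described under Evolving Population (repeated fragment replacements, stopping after m consecutive failures to lower the energy) can be illustrated with a minimal sketch. Everything named here is a stand-in introduced for illustration: `toy_energy` is a hypothetical placeholder for the score3 Rosetta energy function, the fragment library is a hand-written list rather than a Rosetta fragment library, and a conformation is simply a list of (φ, ψ, ω) triples.

```python
import random
from typing import Callable, List, Sequence, Tuple

Angles = Tuple[float, float, float]   # (phi, psi, omega) of one amino acid
Conformation = List[Angles]
Fragment = Sequence[Angles]           # configuration of consecutive amino acids


def toy_energy(conf: Sequence[Angles]) -> float:
    """Hypothetical stand-in for score3; NOT a Rosetta energy function."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi, _ in conf)


def greedy_local_search(
    conf: Sequence[Angles],
    energy: Callable[[Sequence[Angles]], float],
    library: Sequence[Fragment],
    max_failures: int,
    rng: random.Random,
) -> Tuple[Conformation, float]:
    """Accept a fragment replacement only if it lowers the energy.

    Stops once `max_failures` consecutive replacements fail to lower it
    (the parameter m in the text, set to the chain length).
    """
    best = list(conf)
    best_energy = energy(best)
    failures = 0
    while failures < max_failures:
        fragment = rng.choice(library)
        start = rng.randrange(len(best) - len(fragment) + 1)
        candidate = list(best)
        candidate[start:start + len(fragment)] = fragment
        candidate_energy = energy(candidate)
        if candidate_energy < best_energy:
            best, best_energy, failures = candidate, candidate_energy, 0
        else:
            failures += 1
    return best, best_energy
```

Because a move is accepted only when it strictly lowers the energy, the search behaves as a projection onto a nearby local minimum of whatever energy function is passed in; the returned energy can never exceed that of the starting conformation.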
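The Pareto rank used in the population selection method can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it follows the dominance definition given in the text (every score strictly lower), which is stricter than the common textbook definition, and it assumes that the total energy used to break ties is the plain sum of the three grouped scores (shb, lhb, all-else). The function names and the example archive are hypothetical.

```python
from typing import List, Sequence, Tuple

Scores = Tuple[float, float, float]   # (shb, lhb, all_else); lower is better


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True when every score of `a` is strictly lower than that of `b`."""
    return all(x < y for x, y in zip(a, b))


def pareto_ranks(archive: Sequence[Scores]) -> List[int]:
    """Pareto rank of each entry: how many archive entries dominate it."""
    return [sum(1 for other in archive if dominates(other, c)) for c in archive]


def pareto_front(archive: Sequence[Scores]) -> List[int]:
    """Indices of the non-dominated entries (Pareto rank 0)."""
    return [i for i, rank in enumerate(pareto_ranks(archive)) if rank == 0]


def truncation_select(archive: Sequence[Scores], p: int) -> List[int]:
    """Indices of the p best entries, ordered by Pareto rank, then total energy.

    Assumption: total energy is taken as the sum of the three grouped scores.
    """
    ranks = pareto_ranks(archive)
    order = sorted(range(len(archive)), key=lambda i: (ranks[i], sum(archive[i])))
    return order[:p]


# Hypothetical archive of four conformations, each reduced to three scores.
example_archive: List[Scores] = [
    (-3.0, -2.0, -40.0),
    (-1.0, -1.0, -30.0),
    (-4.0, -0.5, -35.0),
    (-0.5, -0.4, -20.0),
]
```

In the example archive, the first and third entries trade off short-range against long-range hydrogen bonding, so neither dominates the other and both sit on the Pareto front, while the last entry is dominated by all three others. Ranks are recomputed over the whole archive on every call, mirroring the re-evaluation step described in the text; this costs quadratic time in the archive size.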
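The two RMSD-based measures used in the evaluation (lowest Cα-RMSD and the percentage of conformations below 5 Å) can be made concrete with a short sketch. It assumes that the two sets of Cα coordinates have already been optimally superimposed; the text does not state how superposition is handled, so that step is deliberately left out here. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]   # Cartesian coordinates of one C-alpha atom


def ca_rmsd(conf: Sequence[Point], native: Sequence[Point]) -> float:
    """Root-mean-square distance between corresponding C-alpha atoms.

    Assumption: both structures are already superimposed; no fitting is done.
    """
    if len(conf) != len(native) or not conf:
        raise ValueError("structures must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(conf, native)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(conf))


def percent_within(rmsds: Sequence[float], cutoff: float = 5.0) -> float:
    """Percentage of sampled conformations with RMSD strictly below `cutoff`."""
    if not rmsds:
        raise ValueError("no RMSD values given")
    return 100.0 * sum(1 for r in rmsds if r < cutoff) / len(rmsds)
```

For instance, shifting every atom of a structure by the vector (0, 3, 4) gives an RMSD of exactly 5 Å, which falls just outside the strict 5 Å cutoff.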
Figure 1: Panels (a, b) 1dtjA, 76 aas, α/β; (c, d) 1ail, 70 aas, α; (e, f) 1aoy, 78 aas, α/β. Left (a, c, e): Distribution of RMSDs of MOEA-obtained conformations from known native structure (dotted blue line) are superimposed over distribution obtained by EA (black line). Right (b, d, f): Distribution of energies vs. RMSDs from native structure of MOEA-obtained conformations (transparent blue) are superimposed over distribution obtained by EA (red).

Table 1: Summary of comparison between EA and MOEA on 20 protein sequences.

                      Native fold   min Cα-RMSD (Å)    % < 5Å Cα-RMSD     Rosetta score4
 #   PDB Id  Length   topology      EA      MOEA       EA       MOEA      EA       MOEA
 1   1bq9      53     α/β            3.0     3.4        0.093    0.128    -50.5    -45.8
 2   1dtdB     61     α/β            4.4     5.3        0.006    0.000    -55.0    -74.5
 3   1isuA     62     α/β            6.6     6.4        0.000    0.000    -46.5    -48.4
 4   1c8cA     64     α/β            4.8     3.6        0.001    0.003    -86.4    -98.4
 5   1sap      66     α/β            3.7     3.7        0.015    0.008   -121.4   -120.1
 6   1hz6A     67     α/β            1.9     2.1       13.938   35.418   -130.9   -135.6
 7   1wapA     68     β              6.3     6.4        0.000    0.000   -132.5   -117.5
 8   1fwp      69     α/β            4.3     3.4        0.007    0.107    -84.4    -92.8
 9   1ail      70     α              1.4     1.9        1.747    2.056    -56.1    -67.1
10   1dtjA     76     α/β            4.2     2.3        0.004    8.174    -82.2    -97.4
11   1aoy      78     α/β            3.9     3.7        0.368    0.187    -98.1   -102.0
12   2ci2      83     α/β            3.7     3.9        0.006    0.001   -109.8   -105.7
13   1cc5      83     α              4.7     4.9        0.001    0.001    -68.6    -67.8
14   1tig      88     α/β            3.2     2.5        1.095   11.368   -128.0   -151.7
15   2ezk      93     α              3.4     3.2        0.060    0.493   -100.7    -93.4
16   1hhp      99     β              8.8     8.6        0.000    0.000   -104.5    -97.3
17   2hg6     106     α/β            9.3     9.6        0.000    0.000   -102.6    -95.7
18   3gwl     106     α              5.4     5.8        0.000    0.000   -100.0    -95.3
19   2h5nD    123     α              6.2     7.5        0.000    0.000   -129.0   -126.6
20   1aly     146     β             11.2    11.4        0.000    0.000    -81.1   -117.1

Discussion

Taken together, the results show that guiding search by multiobjective analysis rather than total energy can be more effective and enhance sampling of both low-energy and near-native conformations. This direction seems particularly promising in the context of inaccurate energy functions and warrants further investigation in de novo structure prediction.
Already, researchers in computational biology are investigating multiobjective optimization in the context of protein design (Nivon, Moretti, and Baker 2013). In future work we will consider different energy functions, different groupings of energy terms, and variations of the Pareto-based analysis. It is expected that progress in this direction will not only advance decoy sampling for de novo structure prediction, but will also provide high-quality decoys for improvements in the process of computational design of protein energy functions.

Acknowledgment

This work is supported in part by NSF CCF No. 1016995 and NSF IIS CAREER Award No. 1144106.

References

Bowman, G. R., and Pande, V. S. 2009. Simulated tempering yields insight into the low-resolution Rosetta scoring functions. Proteins: Struct. Funct. Bioinf. 74(3):777–788.

Bradley, P.; Misura, K. M.; and Baker, D. 2005. Toward high-resolution de novo structure prediction for small proteins. Science 309(5742):1868–1871.

Chira, C.; Horvath, D.; and Dumitrescu, D. 2010. An Evolutionary Model Based on Hill-Climbing Search Operators for Protein Structure Prediction. Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics 38–49.

Cutello, V.; Morelli, G.; Nicosia, G.; Pavone, M.; and Scollo, G. 2011. On discrete models and immunological algorithms for protein structure prediction. Natural Computing 10(1):91–102.

Das, R. 2011. Four small puzzles that Rosetta doesn't solve. PLoS ONE 6(5):e20044.

DeBartolo, J.; Hocky, G.; Wilde, M.; Xu, J.; Freed, K. F.; and Sosnick, T. R. 2010. Protein structure prediction enhanced with evolutionary diversity: SPEED. Protein Sci. 19(3):520–534.

Dill, K. A., and Chan, H. S. 1997. From Levinthal to pathways to funnels. Nat. Struct. Biol. 4(1):10–19.

Garza-Fabre, M.; Toscano-Pulido, G.; and Rodriguez-Tello, E. 2012. Locality-based multiobjectivization for the HP model of protein structure prediction. In GECCO '12: Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference. ACM.

Han, K. F., and Baker, D. 1996. Global properties of the mapping between local amino acid sequence and local structure in proteins. Proc. Natl. Acad. Sci. USA 93(12):5814–5818.

Hegler, J. A.; Laetzer, J.; Shehu, A.; Clementi, C.; and Wolynes, P. G. 2009. Restriction vs. guidance: fragment assembly and associative memory hamiltonians for protein structure prediction. Proc. Natl. Acad. Sci. USA 106(36):15302–15307.

Hornak, V.; Abel, R.; Okur, A.; Strockbine, B.; Roitberg, A.; and Simmerling, C. 2006. Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins: Struct. Funct. Bioinf. 65(3):712–725.

Islam, M. K.; Chetty, M.; and Murshed, M. 2011. Novel local improvement techniques in clustered memetic algorithm for protein structure prediction. In Evolutionary Computation (CEC), 2011 IEEE Congress on, 1003–1011.

Leaver-Fay, A., et al. 2011. ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. 487:545–574.

Lee, D.; Redfern, O.; and Orengo, C. 2007. Predicting protein function from sequence and structure. Nat. Rev. Mol. Cell Biol. 8(12):995–1005.

Lee, J.; Wu, S.; and Zhang, Y. 2009. Ab initio protein structure prediction. In Rigden, D., ed., Ab Initio Protein Structure Prediction. Springer Science + Business Media B.V. chapter 1.

Molloy, K.; Saleh, S.; and Shehu, A. 2013. Probabilistic search and energy guidance for biased decoy sampling in ab-initio protein structure prediction. IEEE Trans. Comp. Biol. and Bioinf. in press.

Moult, J.; Fidelis, K.; Kryshtafovych, A.; and Tramontano, A. 2011. Critical assessment of methods of protein structure prediction (CASP) round IX. Proteins: Struct. Funct. Bioinf. Suppl(10):1–5.

Narzisi, G.; Nicosia, G.; and Stracquadanio, G. 2010. Robust Bio-active Peptide Prediction Using Multi-objective Optimization. In Biosciences (BIOSCIENCESWORLD), 2010 International Conference on, 44–50.

Nivon, L. G.; Moretti, G.; and Baker, D. 2013. A Pareto-optimal refinement method for protein design scaffolds. PLoS One 8(4):e59004.

Olson, B., and Shehu, A. 2011. Populating local minima in the protein conformational space. In IEEE Intl Conf on Bioinf and Biomed (BIBM), 114–117.

Olson, B., and Shehu, A. 2012a. Efficient basin hopping in the protein energy surface. In IEEE Intl Conf on Bioinf and Biomed. in press.

Olson, B., and Shehu, A. 2012b. Evolutionary-inspired probabilistic search for enhancing sampling of local minima in the protein energy surface. Proteome Sci. in press.

Olson, B., and Shehu, A. 2013. Rapid sampling of local minima in protein energy surface and effective reduction through a multi-objective filter. Proteome Sci. in press.

Olson, B.; Hashmi, I.; Molloy, K.; and Shehu, A. 2012a. Basin hopping as a general and versatile optimization framework for the characterization of biological macromolecules. Advances in AI J 2012(674832).

Olson, B. S.; Molloy, K.; Hendi, S.-F.; and Shehu, A. 2012b. Guiding search in the protein conformational space with structural profiles. J Bioinf and Comp Biol 10(3):1242005.

Olson, B.; De Jong, K. A.; and Shehu, A. 2013. Off-lattice protein structure prediction with homologous crossover. In Genet. and Evol. Comput. Conf. (GECCO). in press.

Olson, B.; Molloy, K.; and Shehu, A. 2011. In search of the protein native state with a probabilistic sampling approach. J. Bioinf. and Comp. Biol. 9(3):383–398.

Roe, D. R.; Okur, A.; Wickstrom, L.; Hornak, V.; and Simmerling, C. 2007. Secondary structure bias in generalized Born solvent models: Comparison of conformational ensembles and free energy of solvent polarization from explicit and implicit solvation. J. Phys. Chem. B 111(7):1846–1857.

Saleh, S.; Olson, B.; and Shehu, A. 2012. A population-based evolutionary algorithm for sampling minima in the protein energy surface. In He, J.; Shehu, A.; Haspel, N.; and B., C., eds., Comput Struct Biol Workshop, 48–55.

Saleh, S.; Olson, B.; and Shehu, A. 2013. A population-based evolutionary search approach to the multiple minima problem in de novo protein structure prediction. BMC Struct. Biol. in press.

Shehu, A., and Olson, B. 2010. Guiding the search for native-like protein conformations with an ab-initio tree-based exploration. Int. J. Robot. Res. 29(8):1106–1127.

Shehu, A. 2009. An ab-initio tree-based exploration to enhance sampling of low-energy protein conformations. In Robot: Sci. and Sys., 241–248.

Shehu, A. 2010. Conformational search for the protein native state. In Rangwala, H., and Karypis, G., eds., Protein Structure Prediction: Method and Algorithms. Fairfax, VA: Wiley Book Series on Bioinformatics. chapter 21.

Shehu, A. 2013. Probabilistic search and optimization for protein energy landscapes. In Aluru, S., and Singh, M., eds., Handbook of Computational Molecular Biology. Chapman & Hall/CRC Computer Information Series.

Shmygelska, A., and Levitt, M. 2009. Generalized ensemble methods for de novo structure prediction. Proc. Natl. Acad. Sci. USA 106(5):1415–1420.

Simoncini, D.; Berenger, F.; Shrestha, R.; and Zhang, K. Y. J. 2012. A probabilistic fragment-based protein structure prediction algorithm. PLoS ONE 7(7):e38799.

Verma, A.; Schug, A.; Lee, K. H.; and Wenzel, W. 2006. Basin hopping simulations for all-atom protein folding. J. Chem. Phys. 124(4):044515.

Xu, D., and Zhang, Y. 2012. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins: Struct. Funct. Bioinf. 80(7):1715–1735.

Zhang, M., and Kavraki, L. E. 2002. A new method for fast and accurate derivation of molecular conformations. J. Chem. Inf. Comput. Sci. 42(1):64–70.