Methods in Molecular Biology 2405

Computational Peptide Science: Methods and Protocols

Edited by Thomas Simonson
Lab de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France

METHODS IN MOLECULAR BIOLOGY
Series Editor: John M. Walker, School of Life and Medical Sciences, University of Hertfordshire, Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-0716-1854-7
ISBN 978-1-0716-1855-4 (eBook)
https://doi.org/10.1007/978-1-0716-1855-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

This work is subject to copyright.
All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

Computational peptide science is a broad and fast-moving field. “Peptides” have many shapes, and computations come in many forms. This book provides a collection of protocols and approaches, compiled by many of today’s leaders in the field. While diverse and important, the topics are far from exhaustive and the choices partly subjective. The methodologies include mining properties from sequence databases, predicting structure, dynamics, and interactions using molecular modeling, and designing peptides computationally. But the diversity starts with the peptides.
Many natural peptides are produced in cells. They can be genetically encoded or nonribosomal and have important functions, acting as ligands, inhibitors, messengers, hormones, toxins, or structural building blocks. For example, natural antimicrobial peptides target bacterial ribosomes as part of the innate immunity of animals and insects [1, 2]. Other peptides arise as by-products of protein cleavage, maturation, degradation, or misfolding. Their accumulation or aggregation, for example in amyloid fibers, can have major consequences for the health of cells and tissues [3]. Exogenous peptides are processed by the major histocompatibility complex (MHC) for immunity. Protein regions that are intrinsically or transiently disordered have many of the same properties as peptides. Thus, protein–protein interactions are often mediated by short, weakly structured, linear peptide motifs [4] or by larger protein regions that only become structured upon binding [5, 6]. Synthetic peptides and peptidomimetics are another category that have potential applications as antibacterial ligands [1, 2], miniproteins [7], or for the formation of assemblies and biomaterials [8, 9]. We would like to understand and engineer all these systems. However, it is hard to characterize the structure and dynamics of peptides experimentally. They are often disordered when they are not engaged by another macromolecule in a complex. They can sample many conformations over many timescales, similar to a denatured protein [10, 11]. They may interact with lipid membranes, which are themselves dynamic and fluid. Such structures are not readily solved by crystallography or NMR. Mean properties like the radius of gyration or diffusion coefficient can be measured in vitro, and peptides can be probed chemically by protease digestion or hydrogen exchange. But their precise conformations and dynamics remain elusive, and their behavior in vivo even more so [12]. Computational approaches are another route. 
They are increasingly attractive as computer power continues to grow. Massive sequence databases can be mined to identify low complexity regions, propensities for disorder [13], or amyloid formation [14]. Molecular modeling can be used to predict structure and binding [15]. Molecular dynamics can explore conformational space with atomic resolution, revealing conformer populations, timescales, solvent or lipid structure, and the underlying physical interactions. Virtual directed evolution of peptides or miniproteins can be done with the methods of protein design [7, 16, 17].

Here, too, there are difficulties. Structure, flexibility, binding, and specificity arise from a competition and balance among many interactions, most of them weak, involving peptides, solvents, ions, and possibly receptors or membranes [18]. Peptide recognition often involves conformational selection or induced fit. Enthalpic and entropic effects are both essential. To capture all these effects at a manageable cost, molecular modeling introduces many approximations. Widely used force fields are quite simple, with constant atomic point charges, simple, transferable Lennard–Jones interactions, and water molecules described by just three particles [15]. To increase throughput, an essential further step is to treat solvent implicitly, usually as a dielectric continuum [19]. This is a drastic approximation even for polar interactions, and it does not include the nonpolar interactions with solvent, which require specific treatment. For receptor binding, pose selection and scoring are hard combinatorial problems, and further approximations are often used, such as a rigid receptor and even simpler solvent models.

Methodologies continue to develop and improve. Force fields have been refined for peptidomimetics and for unfolded proteins [20]. Electronic polarizability can be treated explicitly for ionic interactions [21, 22].
Powerful methods to sample conformations are increasingly available, like adaptive landscape flattening [23–25]. Coarse-grained models allow very long simulation times [26]. High-throughput methods for peptide docking [27] and design [7, 16, 17] continue to improve. Powerful machine learning approaches are under development to mine sequence databases [28, 29].

This volume introduces many of these methodologies. A few chapters have the form of literature reviews. Most are practical tutorials for specific methods. The first four describe methods to infer peptide properties from their sequences: antimicrobial activity, foldability, sheet formation. The next five describe methods to simulate the structure and dynamics of peptides, including amyloid formers and membrane-active peptides, using tools of increasing sophistication. Five chapters describe the design and modeling of peptides to form organized assemblies and to bind protein interfaces, and the prediction of peptide–MHC complexes. Gallicchio reviews advanced free energy simulations for peptide binding. Finally, the last four chapters describe methods for high-throughput peptide or miniprotein design.

This is an exciting time for computational peptide science. The concepts, methods, and guidelines laid out below should help both novices and experienced workers benefit from the new opportunities and challenges, now and in the future.

Paris, France
Thomas Simonson

References

1. Krizsan A, Volke D, Weinert S, Sträter N, Knappe D, Hoffmann R (2014) Insect-derived proline-rich antimicrobial peptides kill bacteria by inhibiting bacterial protein translation at the 70S ribosome. Angew Chem Int Ed 53:12236–12239
2. Seefeldt AC, Nguyen F, Antunes S, Pérébaskine N, Graf M, Arenz S, Inampudi KK, Douat C, Guichard G, Wilson DN, Innis CA (2016) The proline-rich antimicrobial peptide onc112 inhibits translation by blocking and destabilizing the initiation complex. Nat Struct Mol Biol 22:470–475
3. Scheckel C, Aguzzi A (2018) Prions, prionoids and protein misfolding disorders. Nat Rev Genet 19:405–418
4. Borg JP (ed) (2020) PDZ mediated interactions: methods and protocols, vol 2256. Springer Verlag, New York
5. Shoemaker BA, Portman JJ, Wolynes PG (2000) Speeding molecular recognition by using the folding funnel: the fly-casting mechanism. Proc Natl Acad Sci U S A 97:8868–8873
6. Gianni S, Dogan J, Jemth P (2016) Coupled binding and folding of intrinsically disordered proteins: what can we learn from kinetics? Curr Opin Struct Biol 36:18–24
7. Cao L, Goreshnik I, Coventry B, Case JB, Miller L, Kozodoy L, Chen RE, Carter L, Walls L, Park Y-J, Stewart L, Diamond M, Veesler D, Baker D (2020) De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 370:426–431
8. Reches M, Gazit E (2003) Casting metal nanowires within discrete self-assembled peptide nanotubes. Science 300:625–627
9. Wei G, Su Z, Reynolds NP, Arosio P, Hamley IW, Gazit E, Mezzenga R (2017) Self-assembling peptide and protein amyloids: from structure to tailored function in nanotechnology. Chem Soc Rev 46:4661–4708
10. Ptitsyn OB (1995) Molten globule and protein folding. Adv Protein Chem 47:83–229
11. Korzhnev DM, Religa TL, Banachewicz W, Fersht AR, Kay LE (2010) A transient and low-populated protein-folding intermediate at atomic resolution. Science 329:1312–1316
12. Theillet FX, Binolfi A, Bekei B, Martorana A, Rose HM, Stuiver M, Verzini S, Lorenz D, van Rossum M, Goldfarb D, Selenko P (2016) Structural disorder of monomeric α-synuclein persists in mammalian cells. Nature 530:45–50
13. Barik A, Katuwawala A, Hanson J, Paliwal K, Zhou Y, Kurgan L (2020) Depicter: intrinsic disorder and disorder function prediction server. J Mol Biol 432:3379–3387
14. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L (2004) Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotech 22:1302–1306
15. Becker O, MacKerell AD Jr, Roux B, Watanabe M (eds) (2001) Computational biochemistry & biophysics. Marcel Dekker, New York
16. Stoddard B (ed) (2016) Design and creation of ligand binding proteins, vol 1414. Springer Verlag, New York
17. Mignon D, Druart K, Michael E, Opuu V, Polydorides S, Villa F, Gaillard T, Panel N, Archontis G, Simonson T (2020) Physics-based computational protein design: an update. J Phys Chem A 124:10637–10648
18. Simonson T (2015) The physical basis of ligand binding. In: Casavotto C (ed) In silico drug discovery and design: theory, methods, challenges, and applications, chapter 1. CRC Press, Boca Raton
19. Roux B, Simonson T (1999) Implicit solvent models. Biophys Chem 78:1–20
20. Best RB (2017) Computational and theoretical advances in studies of intrinsically disordered proteins. Curr Opin Struct Biol 42:147–154
21. Panel N, Villa F, Fuentes EJ, Simonson T (2018) Accurate PDZ-peptide binding specificity with additive and polarizable free energy simulations. Biophys J 114:1091–1101
22. Rackers JA, Wang Z, Lu C, Laury ML, Lagardere L, Schnieders MJ, Piquemal J-P, Ren PY, Ponder JW (2018) Tinker 8: software tools for molecular design. J Chem Theory Comput 14:5273–5289
23. Lu C, Li X, Wu D, Zheng L, Yang W (2016) Predictive sampling of rare conformational events in aqueous solution: designing a generalized orthogonal space tempering method. J Chem Theory Comput 12:41–52
24. Villa F, Panel N, Chen X, Simonson T (2018) Adaptive landscape flattening in amino acid sequence space for the computational design of protein:peptide binding. J Chem Phys 149:072302
25. Yalinca H, Gehin CJC, Oleinikovas V, Lashuel HA, Gervasio FL, Pastore A (2019) The role of post-translational modifications on the energy landscape of Huntingtin N-terminus. Front Mol Biosci 6:95
26. Souza P, Alessandri R, Barnoud J, Thallmair S, Faustino I, Grunewald F, Patmanidis I, Abdizadeh H, Bruininks B, Wassenaar T, Kroon P, Melcr J, Nieto V, Corradi V, Khan H, Domanski J, Javanainen M, Martinez-Seara H, Reuter N, Best R, Vattulainen I, Monticelli L, Periole X, Tieleman P, de Vries AH, Marrink SJ (2021) Martini 3: a general purpose force field for coarse-grained molecular dynamics. Nat Methods 18(4):382–388
27. Goodsell DS, Sanner MF, Olson AJ, Forli S (2021) The AutoDock suite at 30. Prot Sci 30:31–43
28. Gao W, Mahajan SP, Sulam J, Gray JJ (2020) Deep learning in protein structural modeling and design. Patterns 1:1–23
29. Cannataro M, Guzzi PH, Agapito G, Zucco C, Milano M (2021) Artificial intelligence in bioinformatics: from omics analysis to deep learning and network mining. Elsevier, Amsterdam

Contents

Preface
Contributors

1 Machine Learning Prediction of Antimicrobial Peptides
  Guangshun Wang, Iosif I. Vaisman, and Monique L. van Hoek
2 Tools for Characterizing Proteins: Circular Variance, Mutual Proximity, Chameleon Sequences, and Subsequence Propensities
  Mihaly Mezei
3 Exploring the Peptide Potential of Genomes
  Chris Papadopoulos, Nicolas Chevrollier, and Anne Lopes
4 Computational Identification and Design of Complementary β-Strand Sequences
  Yoonjoo Choi
5 Dynamics of Amyloid Formation from Simplified Representation to Atomistic Simulations
  Phuong Hoang Nguyen, Pierre Tufféry, and Philippe Derreumaux
6 Predicting Membrane-Active Peptide Dynamics in Fluidic Lipid Membranes
  Charles H. Chen, Karen Pepper, Jakob P. Ulmschneider, Martin B. Ulmschneider, and Timothy K. Lu
7 Coarse-Grain Simulations of Membrane-Adsorbed Helical Peptides
  Manuel N. Melo
8 Peptide Dynamics and Metadynamics: Leveraging Enhanced Sampling Molecular Dynamics to Robustly Model Long-Timescale Transitions
  Joseph Clayton, Lokesh Baweja, and Jeff Wereszczynski
9 Metadynamics Simulations to Study the Structural Ensembles and Binding Processes of Intrinsically Disordered Proteins
  Rui Zhou and Mojie Duan
10 Computational and Experimental Protocols to Study Cyclo-dihistidine Self- and Co-assembly: Minimalistic Bio-assemblies with Enhanced Fluorescence and Drug Encapsulation Properties
  Asuka A. Orr, Yu Chen, Ehud Gazit, and Phanourios Tamamis
11 Computational Tools and Strategies to Develop Peptide-Based Inhibitors of Protein-Protein Interactions
  Maxence Delaunay and Tâp Ha-Duong
12 Rapid Rational Design of Cyclic Peptides Mimicking Protein–Protein Interfaces
  Brianda L. Santini and Martin Zacharias
13 Structural Prediction of Peptide–MHC Binding Modes
  Marta A. S. Perez, Michel A. Cuendet, Ute F. Röhrig, Olivier Michielin, and Vincent Zoete
14 Molecular Simulation of Stapled Peptides
  Victor Ovchinnikov, Aravinda Munasinghe, and Martin Karplus
15 Free Energy-Based Computational Methods for the Study of Protein-Peptide Binding Equilibria
  Emilio Gallicchio
16 Computational Evolution Protocol for Peptide Design
  Rodrigo Ochoa, Miguel A. Soler, Ivan Gladich, Anna Battisti, Nikola Minovski, Alex Rodriguez, Sara Fortuna, Pilar Cossio, and Alessandro Laio
17 Computational Design of Miniprotein Binders
  Younes Bouchiba, Manon Ruffini, Thomas Schiex, and Sophie Barbe
18 Computational Design of Peptides with Improved Recognition of the Focal Adhesion Kinase FAT Domain
  Eleni Michael, Savvas Polydorides, and Georgios Archontis
19 Knowledge-Based Unfolded State Model for Protein Design
  Vaitea Opuu, David Mignon, and Thomas Simonson

Index

Contributors

GEORGIOS ARCHONTIS • Department of Physics, University of Cyprus, Nicosia, Cyprus
SOPHIE BARBE • TBI, Université de Toulouse, CNRS, INRAE, INSA, ANITI, Toulouse, France
ANNA BATTISTI • SISSA, Trieste, Italy
LOKESH BAWEJA • Department of Physics and the Center for Molecular Study of Condensed Soft Matter, Illinois Institute of Technology, Chicago, IL, USA
YOUNES BOUCHIBA • TBI, Université de Toulouse, CNRS, INRAE, INSA, ANITI, Toulouse, France
CHARLES H. CHEN • Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, USA
YU CHEN • Department of Molecular Microbiology and Biotechnology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
NICOLAS CHEVROLLIER • Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette, cedex, France
YOONJOO CHOI • Combinatorial Tumor Immunotherapy MRC, Chonnam National University Medical School, Hwasun-gun, Jeollanam-do, Republic of Korea
JOSEPH CLAYTON • Department of Physics and the Center for Molecular Study of Condensed Soft Matter, Illinois Institute of Technology, Chicago, IL, USA
PILAR COSSIO • Biophysics of Tropical Diseases, Max Planck Tandem Group, University of Antioquia, Medellin, Colombia; Department of Theoretical Biophysics, Max Planck Institute of Biophysics, Frankfurt am Main, Germany
MICHEL A. CUENDET • Molecular Modelling Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland; Oncology Department, Centre Hospitalier Universitaire Vaudois (CHUV), Precision Oncology Center, Lausanne, Switzerland
MAXENCE DELAUNAY • Université Paris-Saclay, CNRS, BioCIS, Châtenay-Malabry, France
PHILIPPE DERREUMAUX • Laboratoire de Biochimie Théorique, CNRS, Université de Paris, UPR 9080, Paris, France; Institut de Biologie Physico-Chimique, Fondation Edmond de Rothschild, PSL Research University, Paris, France
MOJIE DUAN • State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan, People’s Republic of China
SARA FORTUNA • Italian Institute of Technology (IIT), Genova, Italy; Department of Chemical and Pharmaceutical Sciences, University of Trieste, Trieste, Italy
EMILIO GALLICCHIO • Department of Chemistry, Ph.D. Program in Biochemistry and Ph.D. Program in Chemistry at The Graduate Center of the City University of New York, Brooklyn College of the City University of New York, New York, NY, USA
EHUD GAZIT • Department of Molecular Microbiology and Biotechnology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
IVAN GLADICH • Qatar Environment and Energy Research Institute, Hamad Bin Khalifa University, Doha, Qatar; SISSA, Trieste, Italy
TÂP HA-DUONG • Université Paris-Saclay, CNRS, BioCIS, Châtenay-Malabry, France
MARTIN KARPLUS • Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA; Laboratoire de Chimie Biophysique, ISIS, Université de Strasbourg, Strasbourg, France
ALESSANDRO LAIO • The Abdus Salam International Centre for Theoretical Physics, Trieste, Italy; SISSA, Trieste, Italy
ANNE LOPES • Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette, cedex, France
TIMOTHY K. LU • Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
MANUEL N. MELO • Instituto de Tecnologia Química e Biologica Antonio Xavier, Universidade Nova de Lisboa, Oeiras, Portugal
MIHALY MEZEI • Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
ELENI MICHAEL • Department of Physics, University of Cyprus, Nicosia, Cyprus
OLIVIER MICHIELIN • Molecular Modelling Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland; Oncology Department, Centre Hospitalier Universitaire Vaudois (CHUV), Precision Oncology Center, Lausanne, Switzerland
DAVID MIGNON • Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
NIKOLA MINOVSKI • Department of Chemical and Pharmaceutical Sciences, University of Trieste, Trieste, Italy; Theory Department, Laboratory for Cheminformatics, National Institute of Chemistry, Ljubljana, Slovenia
ARAVINDA MUNASINGHE • Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
PHUONG HOANG NGUYEN • Laboratoire de Biochimie Théorique, CNRS, Université de Paris, UPR 9080, Paris, France; Institut de Biologie Physico-Chimique, Fondation Edmond de Rothschild, PSL Research University, Paris, France
RODRIGO OCHOA • Biophysics of Tropical Diseases, Max Planck Tandem Group, University of Antioquia, Medellin, Colombia
VAITEA OPUU • Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
ASUKA A. ORR • Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX, USA
VICTOR OVCHINNIKOV • Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
CHRIS PAPADOPOULOS • Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette, cedex, France
KAREN PEPPER • Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, USA
MARTA A. S. PEREZ • Computer-aided Molecular Engineering Group, Department of Oncology UNIL-CHUV, Lausanne University, Lausanne, Switzerland; Ludwig Institute for Cancer Research, Lausanne, Switzerland; Molecular Modelling Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
SAVVAS POLYDORIDES • Department of Physics, University of Cyprus, Nicosia, Cyprus
ALEX RODRIGUEZ • The Abdus Salam International Centre for Theoretical Physics, Trieste, Italy
UTE F. RÖHRIG • Molecular Modelling Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
MANON RUFFINI • TBI, Université de Toulouse, CNRS, INRAE, INSA, ANITI, Toulouse, France; Université Fédérale de Toulouse, ANITI, INRAE, UR 875, Toulouse, France
BRIANDA L. SANTINI • Center for Functional Protein Assemblies, Physics Department T38, Technical University of Munich, Ernst-Otto-Fischer-Straße 8, Garching, Germany
THOMAS SCHIEX • Université Fédérale de Toulouse, ANITI, INRAE, UR 875, Toulouse, France
THOMAS SIMONSON • Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
MIGUEL A. SOLER • Italian Institute of Technology (IIT), Genova, Italy
PHANOURIOS TAMAMIS • Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX, USA; Department of Materials Science and Engineering, Texas A&M University, College Station, TX, USA
PIERRE TUFFÉRY • Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm, RPBS, Paris, France
JAKOB P. ULMSCHNEIDER • Department of Physics, Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China
MARTIN B. ULMSCHNEIDER • Department of Chemistry, King’s College London, London, UK
IOSIF I. VAISMAN • School of Systems Biology, George Mason University, Manassas, VA, USA
MONIQUE L. VAN HOEK • School of Systems Biology, George Mason University, Manassas, VA, USA
GUANGSHUN WANG • Department of Pathology and Microbiology, College of Medicine, University of Nebraska Medical Center, 985900 Nebraska Medical Center, Omaha, NE, USA
JEFF WERESZCZYNSKI • Department of Physics and the Center for Molecular Study of Condensed Soft Matter, Illinois Institute of Technology, Chicago, IL, USA
MARTIN ZACHARIAS • Center for Functional Protein Assemblies, Physics Department T38, Technical University of Munich, Ernst-Otto-Fischer-Straße 8, Garching, Germany
RUI ZHOU • State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan, People’s Republic of China
VINCENT ZOETE • Computer-aided Molecular Engineering Group, Department of Oncology UNIL-CHUV, Lausanne University, Lausanne, Switzerland; Ludwig Institute for Cancer Research, Lausanne, Switzerland; Molecular Modelling Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland

Chapter 1

Machine Learning Prediction of Antimicrobial Peptides

Guangshun Wang, Iosif I. Vaisman, and Monique L. van Hoek

Abstract

Antibiotic resistance constitutes a global threat and could lead to a future pandemic. One strategy is to develop a new generation of antimicrobials. Naturally occurring antimicrobial peptides (AMPs) are recognized templates and some are already in clinical use. To accelerate the discovery of new antibiotics, it is useful to predict novel AMPs from the sequenced genomes of various organisms. The antimicrobial peptide database (APD) provided the first empirical peptide prediction program. It also facilitated the testing of the first machine-learning algorithms. This chapter provides an overview of machine-learning predictions of AMPs.
Most of the predictors, such as AntiBP, CAMP, and iAMPpred, involve a single-label prediction of antimicrobial activity. This type of prediction has been expanded to antifungal, antiviral, antibiofilm, anti-TB, hemolytic, and anti-inflammatory peptides. The multiple functional roles of AMPs annotated in the APD also enabled multi-label predictions (iAMP-2L, MLAMP, and AMAP), which include antibacterial, antiviral, antifungal, antiparasitic, antibiofilm, anticancer, anti-HIV, antimalarial, insecticidal, antioxidant, chemotactic, spermicidal, and protease-inhibiting activities. Also considered in predictions are peptide posttranslational modification, 3D structure, and microbial species-specific information. We compare important amino acids of AMPs implied from machine learning with the frequently occurring residues of the major classes of natural peptides. Finally, we discuss advances, limitations, and future directions of machine-learning predictions of antimicrobial peptides. Ultimately, we may assemble a pipeline of such predictions beyond antimicrobial activity to accelerate the discovery of novel AMP-based antimicrobials.

Key words Multidrug resistance, Antimicrobial peptides, Database, Machine learning, Peptide prediction

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

The discovery and production of antibiotics has saved millions of lives. It is regarded as one of the greatest achievements of humankind in the twentieth century. However, pathogens fight back, leading to reduced potency of conventional antibiotics. To minimize toxic effects, bacteria can pump the drug out of the cells, reduce drug affinity to specific targets via mutations, and degrade antibiotics by enzymes. Among various multidrug-resistant (MDR) microbes, the ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter species) account for 90% of infections in hospitals [1]. There are also other emerging resistant pathogens, including human immunodeficiency virus type 1 (HIV-1), SARS-CoV-2, Ebola and Zika viruses, and resistant microbes such as Mycobacterium tuberculosis, Salmonella, Candida, Neisseria gonorrhoeae, and Clostridioides difficile. If no action is taken, the projected annual deaths could reach ten million by 2050 [2]. To meet this challenge, one fundamental strategy is to develop a new generation of antimicrobials that are capable of eliminating those MDR pathogens. Antimicrobial peptides (AMPs) are considered an alternative to conventional non-peptide antibiotics.

This chapter focuses on the prediction of antimicrobial peptides. First, we provide a brief introduction to AMPs. Second, we discuss the major prediction methods of AMPs. Third, both the data sets for predictions and the algorithms of machine learning are described. Fourth, we discuss the major machine-learning predictions of AMPs. Fifth, we compare the prediction outcomes of machine learning in terms of accuracy on the same platform. We report results from test runs using new peptides not included in the training sets, and the important amino acids implied by machine learning are compared with those derived from our database analysis of the major classes of natural AMPs. Then, we outline additional predictions that may speed up computer-aided novel antimicrobial discovery. Finally, we summarize the major achievements and limitations of AMP predictions and discuss future directions.

2 Innate Immune Antimicrobial Peptides

Naturally occurring antimicrobial peptides are important components of innate immune systems. Such peptides are deployed in a variety of organisms such as plants and animals.
They play a critical role in protecting organisms from infections. AMPs have remained potent for millions of years. As a consequence, they are recognized candidates for developing novel antimicrobials since they can kill drug-resistant pathogens, including bacteria, fungi, viruses, and parasites. AMPs are usually gene-encoded and can be expressed constitutively to guard certain niches or induced in response to invading pathogens [3–8]. According to the newly programmed antimicrobial peptide database (APD, https://aps.unmc.edu; also http://aps.unmc.edu/AP and https://wangapd3.com), over 3000 natural AMPs have been discovered from six life kingdoms (bacteria, archaea, protists, fungi, plants, and animals) [9–11]. At present, 74% of the peptides originated from animals, while 11.2% and 11.1% were discovered in bacteria and plants, respectively. Most natural AMPs (88%) are cationic, and only a small portion (6%) are anionic. Anionic AMPs, such as daptomycin already in clinical use, may need metal ions to be active [12]. Another 6% of AMPs have a net charge of zero. In the APD, the majority of AMPs have a hydrophobic content (Pho, defined in Table 1) between 10% and 70%. Only about 1% of such peptides have a very high (>70%) or very low (<10%) Pho. In terms of length, 2879 peptides in the current APD3 (88%) are shorter than 50 amino acids. The average length of all AMPs (3257 as of January 2021) in the APD3 is 33.2 amino acids, with an average net charge of +3.3. The most frequently occurring amino acids (>8%) are glycine (G), lysine (K), and leucine (L) [10], while the least frequent amino acids (<2%) are methionine (M) and tryptophan (W) (Table 1). Such frequencies are generally proportional to the percentage of natural AMPs containing one of the 20 amino acids, also calculated in Table 1.

Machine Learning Prediction of Antimicrobial Peptides

Table 1 Amino acid properties, frequency, and peptide count in the antimicrobial peptide database (APD)
Single letter | Full name | Molecular weight | Class (a) | Peptide count | Count% | Frequency (2020)
I | Isoleucine | 113.16 | Phobic | 2511 | 0.77 | 5.9%
V | Valine | 99.13 | Phobic | 2492 | 0.76 | 5.69%
L | Leucine | 113.16 | Phobic | 2835 | 0.87 | 8.26%
F | Phenylalanine | 147.18 | Phobic | 2240 | 0.69 | 4.09%
C | Cysteine | 103.14 | Phobic | 1721 | 0.53 | 6.81%
M | Methionine | 131.2 | Phobic | 959 | 0.29 | 1.27%
A | Alanine | 71.08 | Phobic | 2511 | 0.77 | 7.68%
W | Tryptophan | 186.21 | Phobic | 1185 | 0.36 | 1.65%
G | Glycine | 57.05 | Special | 2950 | 0.91 | 11.51%
P | Proline | 97.12 | Special | 1958 | 0.60 | 4.67%
T | Threonine | 101.11 | Polar | 2053 | 0.63 | 4.48%
S | Serine | 87.08 | Polar | 2483 | 0.76 | 6.07%
Y | Tyrosine | 163.18 | Polar | 1266 | 0.39 | 2.49%
Q | Glutamine | 128.13 | Polar | 1352 | 0.42 | 2.59%
N | Asparagine | 114.1 | Polar | 1968 | 0.60 | 3.86%
E | Glutamic acid | 129.12 | Acidic | 1465 | 0.45 | 2.68%
D | Aspartic acid | 115.09 | Acidic | 1463 | 0.45 | 2.7%
H | Histidine | 137.14 | Basic | 1231 | 0.38 | 2.17%
K | Lysine | 128.17 | Basic | 2782 | 0.85 | 9.51%
R | Arginine | 156.19 | Basic | 1843 | 0.57 | 5.88%
Frequency in 3257 AMPs. Visited January 2021
(a) Phobic = hydrophobic. In the APD, the hydrophobic content (Pho) is the ratio between the total hydrophobic amino acids and total amino acids in a peptide sequence [9].

Fig. 1 Important amino acids derived from amino acid composition profiles of classic classes of antimicrobial peptides [3]: (a) α-helical and β-sheet families and (b) amino acid-rich families, including Trp-rich, His-rich, Pro-rich, and Leu-rich AMPs. Data obtained from the APD [13] in Dec 2020

The variation of the amino acid (composition) signatures of natural AMPs in different structure, activity, and source groups has been tabulated elsewhere [13]. Figure 1 displays amino acid signatures for known α-helical and β-sheet peptides (panel a) and for tryptophan-rich (Trp-rich), histidine-rich (His-rich), and proline-rich (Pro-rich) AMPs and leucine-rich (Leu-rich) temporins (panel b). It is evident that such signatures depend on the amino acid composition of a group of AMPs in the APD. The amino acid sequence of a peptide, however, clearly plays a role as well in determining peptide structure and activity [6, 14].
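The peptide parameters discussed above (length, net charge, and hydrophobic content, Pho) are simple to compute from a sequence. The following Python sketch is an illustration only and is not the APD's own implementation. The hydrophobic residue set follows the "Phobic" class of Table 1, whereas the net-charge rule (Lys/Arg counted as +1, Asp/Glu as -1, with His, termini, and modifications ignored) and the function name peptide_parameters are assumptions made here for simplicity.

```python
# Hydrophobic residues follow the "Phobic" class of Table 1.
HYDROPHOBIC = set("IVLFCMAW")
# Assumption: only Lys/Arg count as +1 and Asp/Glu as -1; His, the termini,
# and posttranslational modifications are ignored in this simplified estimate.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def peptide_parameters(sequence):
    """Return length, net charge, and hydrophobic content (Pho, in %) of a peptide."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    length = len(seq)
    net_charge = sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)
    pho_percent = 100.0 * sum(aa in HYDROPHOBIC for aa in seq) / length
    return {"length": length, "net_charge": net_charge, "pho_percent": pho_percent}


# Human cathelicidin LL-37 as a worked example.
LL37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"
params = peptide_parameters(LL37)
print(params)
```

For LL-37, this simplified rule gives a length of 37, a net charge of +6, and a Pho of about 35%, which places the peptide well inside the 10–70% Pho range that covers most AMPs in the database.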
Another important player is posttranslational modification (e.g., amidation, glycosylation, halogenation, hydroxylation, and cyclization) of peptide sequences, with 24 types of modifications annotated in the current APD3 as of October 2020 [11, 15]. Typically, cationic AMPs target anionic bacterial membranes due to the formation of the classic amphipathic helix structure [3–6]. However, such peptides can also attack other targets such as bacterial cell walls and ribosomes. It is believed that the simultaneous attack of more than one target renders it difficult for bacteria to develop resistance to AMPs. Beyond bacterial killing and biofilm inhibition, AMPs are found to have other functional roles, ranging from pathogen toxin neutralization and wound healing to host immune regulation [4, 5, 16]. A total of 24 types of AMP functions are annotated in the APD3 [11, 13].

3 An Overview of Prediction Methods of Antimicrobial Peptides
The majority of natural AMPs were identified using the classic isolation and characterization methods [3–5]. Such peptide identification procedures are laborious and time-consuming. One alternative is to predict AMPs by computer based on current peptide knowledge and the sequenced genomes of numerous organisms [9, 17–19]. These prediction methods are grouped into five classes based on the information considered in programming [20]: (1) mature peptide (i.e., AMPs), (2) propeptide, (3) mature peptide and propeptide, (4) processing enzyme, and (5) genomic context (Fig. 2). Some AMPs such as cathelicidins possess a conserved pro-sequence domain prior to the mature peptide. Such a conserved sequence pattern became one method for identifying uncharacterized cathelicidins from sequenced genomes of mammals, fish, reptiles, birds, and amphibians (method 2).
The human cathelicidin was initially predicted as FALL-39 [21], which is merely 1–2 residues longer than the mature forms isolated from human neutrophils (LL-37) and the reproductive system (ALL-38) [22, 23]. In the same vein, the discovery of bacteriocins from bacteria has been expanded from highly conserved processing enzymes (method 4a) to transporters (method 4b) and the entire gene clusters (i.e., genomic context; method 5). Computer programs such as BAGEL, antiSMASH, and BACIIα have been established for bacteriocin identification [24–26]. Occasionally, both precursor and mature sequences (method 3) were considered in clustering AMPs, probably due to the nature of the particular data set then available [27]. The most widely explored information for prediction is the mature peptide (method 1). Sequence patterns such as multiple disulfide bonds were utilized for identifying defensin-like AMPs in plants, cattle, mice, and humans [28–30]. A GXC γ-core motif has also been identified in these peptides and utilized for AMP prediction [31].

Fig. 2 Five information-content based methods for prediction of antimicrobial peptides [20]

The construction of databases for AMPs greatly facilitated the development of computer-based design [32] and prediction methods. Table 2 provides a list of databases for AMPs [11, 18, 33–49]. In 2004, the APD and ANTIMIC were simultaneously published in the database issue of Nucleic Acids Research [9, 50]. The APD, with a focus on structure and activity of mature AMPs, was widely accepted and utilized by the AMP field [9]. Since then, more databases have been established with varying scopes or by entering additional details (Table 2). A systematic review of such databases is available elsewhere [51]. Because of the model role of the APD, it is useful to describe its data scope and evolution.
In the first two versions [9, 10], the APD attempted to cover all AMP sequences: experimentally determined, predicted, and synthetic. This history can be seen from a small number of synthetic and predicted entries remaining in the current APD (72 synthetic peptides and 211 predicted peptides without activity data). There are three types of activity data annotated in the APD: (1) minimal inhibitory concentration (MIC); (2) diffusion distance; and (3) optical density decrease as evidence of inhibition. Because they are convenient to obtain, MIC values based on microdilution assays are frequently measured and reported. Since predicted peptides might not be true AMPs [11], it was decided to postpone the collection of such peptides in the APD. Also, a large number of synthetic peptides derived from the same template tended to dominate data filtering in the database, thereby shifting the filtering results from natural wisdom toward artificial peptides. As a consequence, the APD also postponed the collection of synthetic peptides. Thus, the third version of the APD (APD3) uses the following criteria to register AMPs: (1) natural peptides, (2) peptides with known amino acid sequences, (3) peptides with known activity (MIC <100 μM), and (4) peptides of less than 100 amino acids [11]. The last condition was relaxed to 200 amino acids to incorporate important human antimicrobial proteins. This practice generates a widely utilized core data set for AMP search, prediction, and design.

Table 2 Web accessible databases dedicated to antimicrobial peptides (a)
Databases and prediction algorithms | Link | Notes | Citing references
APD3 | http://aps.unmc.edu/AP/main.php | Antimicrobial peptide database, with curated, experimentally verified antimicrobial peptides from bacteria, archaea, protists, fungi, plants, and animals | [11]
CAMPR3 | http://www.camp3.bicnirrh.res.in/ | Collection of antimicrobial peptides | [18]
DBAASP v3 | https://dbaasp.org | Database of antimicrobial activity and structure of peptides | [33]
Defensins knowledgebase | http://defensins.bii.a-star.edu.sg/ | Antimicrobial peptides from the defensin family | [34]
BaAMPs | http://www.baamps.it/ | Database of biofilm-active antimicrobial peptides | [35]
BACTIBASE | http://bactibase.hammamilab.org/about.php | Bacteriocin-type naturally occurring antimicrobial peptides | [36]
DADP | http://split4.pmfst.hr/dadp/ | Database of anuran (frog or toad) defense peptides | [37]
DRAMP | http://dramp.cpu-bioinfor.org | Database of AMPs including clinical trial data on peptides | [38]
Peptaibol | http://peptaibol.cryst.bbk.ac.uk/introduction.htm | Database of peptaibols, mainly antifungal peptides | [39]
LAMP | http://biotechlab.fudan.edu.cn/database/lamp/index.php | AMPs taken from other databases | [40]
YADAMP | http://www.yadamp.unisa.it/default.aspx | Yet another database of antimicrobial peptides | [41]
PhytAMP | http://phytamp.pfba-lab-tun.org/main.php | A database dedicated to plant AMPs | [42]
InverPep | https://ciencias.medellin.unal.edu.co/gruposdeinvestigacion/prospeccionydisenobiomoleculas/InverPep/public/home_en | AMPs from invertebrates from other databases | [43]
HIPdb | http://crdd.osdd.net/servers/hipdb | Manually curated database of experimentally validated HIV inhibitory peptides | [44]
Thiobase | https://db-mml.sjtu.edu.cn/THIOBASE/ | Sulfur-rich, highly modified heterocyclic peptide antibiotics | [45]
EnzyBase | http://biotechlab.fudan.edu.cn/database/EnzyBase/home.php | Lysins, bacteriocins, autolysins, and lysozymes | [46]
ParaPep | http://crdd.osdd.net/raghava/parapep/ | Antiparasitic peptides | [47]
dbAMP | Not accessible | AMPs | [48]
AntiTbPdb | https://webs.iiitd.edu.in/raghava/antitbpdb/ | Anti-TB peptides | [49]
(a) Adapted and updated based on the APD Links [13, 20]

Based on mature peptides, the first computer-based prediction was programmed in the APD in 2003 [9]. The program informs users whether the input sequence is likely to be an AMP based on known AMP properties, such as positive charge and amphipathic nature. Later, it was improved based on the peptide parameter space (net charge, hydrophobic content, and peptide length) defined by the entire database [19]. If such parameters of a new sequence fall outside this space, the program informs the user that the input sequence is less likely to be an AMP. The APD also outputs the five peptide sequences most similar to the user's input. Subsequently, Lata et al. first programmed an artificial neural network (ANN), quantitative matrices (QM), and a support vector machine (SVM) in 2007 based on the APD data set [17]. Since then, there has been a growing interest in AMP prediction at both the single-label and multi-label levels. The single-label prediction estimates the likelihood of a peptide being antimicrobial, while multi-label predictions were developed based on the different functions of AMPs annotated in the APD3 [11], such as chemotaxis, toxin neutralization, protease inhibition, and wound healing. The first multi-label prediction [52] predicts antibacterial activity in the initial stage, followed by predictions of other types of activities, including antifungal, antiviral, anti-HIV, and anticancer activities. CAMP collected both synthetic and predicted peptides. Its prediction tool [18, 53] enables three tasks.
First, users can predict the antimicrobial activity of a peptide sequence with four different models. Second, users can predict the antimicrobial region within a peptide sequence. Third, users can generate a large combinatorial list of sequences from a user-defined sequence and then predict the effect of single-residue substitutions on antimicrobial activity using the AMP predictor. Table 3 lists some major machine-learning prediction programs [48, 53–77].

4 Training Data Sets, Machine-Learning Models, and Algorithms for Classification and Prediction of Antimicrobial Peptides
Machine-learning models are commonly used for classification and prediction of AMPs. Nearly all machine-learning predictions of AMPs are supervised. The quality of these models is determined by a number of different factors. Among the most important contributors to model performance are the training sets consisting of antimicrobial and non-antimicrobial peptides, the features used to represent the peptides, the classification schemes, and the machine-learning algorithms.

4.1 Training Sets for Predictions
4.1.1 Positive Training Set
The quality of the training set is critically important for model performance, since it is the only source of information the model uses to learn. AMP sequences for the training set are usually extracted from one or more AMP databases. The growing number of AMP databases (some examples are listed in Table 2) represents a wide range of approaches to data collection, data curation, and data management.
Table 3 Machine learning prediction of antimicrobial peptides
Tool name | URL | Algorithms | Features | Year | References
AntiBP | http://crdd.osdd.net/raghava/antibp2 | SVM, QM, ANN | Single-label | 2007 | [17]
CAMP | http://www.bicnirrh.res.in/antimicrobial | SVM, RF, DA | Single-label | 2010 | [18, 53]
(none given) | http://amp.biosino.org/ | BLASTP, NNA | Single-label | 2011 | [54]
AMPA | http://tcoffee.crg.cat/apps/ampa | (none given) | AMP region scan | 2012 | [55]
ANFIS | NA | ANFIS | Single-label | 2012 | [56]
Peptide Locator | http://bioware.ucd.ie/ | BRNN | Single-label | 2013 | [57]
iAMP-2L | http://www.jci-bioinfo.cn/iAMP-2L | FKNN | Two-level, Multi-label | 2013 | [52]
DBAASP | https://dbaasp.org/prediction/general | Thresholds | (none given) | 2014 | [33]
SVM-LZ | NG (BioMed Research International) | SVM | Single-label | 2015 | [58]
ADAM | http://bioinformatics.cs.ntou.edu.tw/ADAM/ | SVM, HMM | Single-label | 2015 | [59]
MLAMP | http://www.jci-bioinfo.cn/MLAMP | RF-MLSMOTE | Multi-label | 2016 | [60]
iAMPpred | http://cabgrid.res.in:8080/amppred/ | SVM | Single-label | 2017 | [61]
AmPEP | http://cbbio.cis.umac.mo/software/AmPEP/ | RF | Single-label | 2018 | [62]
AMP scanner | http://www.ampscanner.com | DNN | Single-label, Large scale | 2018 | [63]
AntiMPmod | https://webs.iiitd.edu.in/raghava/antimpmod/ | SVM | Single-label, PTM/3D | 2018 | [64]
dbAMP | http://csb.cse.yzu.edu.tw/dbAMP/ | RF | Single-label | 2019 | [48]
AMAP | http://faculty.pieas.edu.pk/fayyaz/software.html#AMAP | SVM, XGBoost | Multi-label | 2019 | [65]
(none given) | NA | IDQD | Single-label | 2019 | [66]
AMPfun | http://fdblab.csie.ncu.edu.tw/AMPfun/index.html | CART | Multi-label | 2020 | [67]
AMP0 | http://ampzero.pythonanywhere.com | ZSL, FSL | Single-label, Species-specific | 2020 | [68]
MIV-RF | NA | RF | Single-label, Sequence | 2020 | [69]
Deep-AmPEP30 | https://cbbio.cis.um.edu.mo/AxPEP | CNN | Genome search | 2020 | [70]
ACEP | https://github.com/Fuhaoyi/ACEP | DNN | High-throughput predictions | 2020 | [71]
IAMPE | http://cbb1.ut.ac.ir/ | KNN, SVM, RF | Single-label | 2020 | [72]
Macrel | https://big-data-biology.org/software/macrel | RF | Genome search | 2020 | [73]
(none given) | https://github.com/mtyoumans/lstm_peptides | LSTM RNN | Single-label | 2020 | [74]
Ampir | https://github.com/legana/ampir | SVM | Genome wide | 2020 | [75]
amPEPpy | https://github.com/tlawrence3/amPEPpy | RF | Genome wide | 2020 | [76]
Ensemble-AMPPred | http://ncrna-pred.com/Hybrid_AMPPred.htm | Ensemble model | Single-label | 2021 | [77]

For the purpose of training set design, it is important to take into account that AMP databases vary in size, sources of information, amount and quality of annotations, and other parameters. In terms of size, the current versions range from over 3000 peptides in the APD [9–11] to 10,000 in CAMP [18, 53], 12,000 in dbAMP [48], 16,000 in DBAASP [33], and 23,000 in LAMP2 [40]. Some of the larger databases (e.g., LAMP2 [40]) may contain the entire content of the smaller ones, having copied the peptide entries from existing databases. At the same time, non-overlapping components are frequently present, primarily in the scope of synthetic peptides and due to different definitions of AMPs. Some specialized databases have expanded the data set by including other types of peptides, which do not necessarily fall under the definition of classic AMPs [44, 49]. For instance, antiviral peptides can also be designed in the laboratory based on the viral machinery, such as proteases. As a result, the distribution of peptides by sequence length can differ between databases as well. The APD contains mostly natural AMPs, which are templates for making synthetic peptides. For example, there are hundreds of LL-37–derived peptides.
In the APD, 88% of the entries are shorter than 50 amino acids, and only 80 peptides out of 3257 are longer than 100 residues. Similarly, most peptides in the DBAASP database are shorter than 50 residues. Only 20 entries in DBAASP are longer than 100 residues, while CAMP contains 1850 such sequences. The longest sequence in the APD and DBAASP is less than 190 residues, compared to 1256 residues in CAMP.
The first training set for machine-learning model testing was extracted from the APD [17]. Another data set used in AMP prediction was derived from CAMP [18]. Because the majority of natural AMPs in CAMP were taken from the APD, there is a significant overlap between these two data sets. Some recent studies generated a hybrid data set by merging the peptide sequences from different databases [61, 62, 69, 70, 77]. The size of the positive data set appears to influence prediction outcome [61]. Species-specific predictions of AMPs [68] were made based on DBAASP, which annotates antimicrobial activity in more detail [33]. For 3D structural data, the APD has direct links to the Protein Data Bank (PDB) [78]. Hence, a list of training peptides with 3D structures can also be generated without redundancy (redundancy arises because multiple sets of structural coordinates may exist for the same peptide, determined by different methods, at different resolutions, or under different conditions).

4.1.2 Negative Data Set
Ideally, the negative set should consist of peptides that were tested experimentally and displayed no antimicrobial activity against one or more relevant pathogens. Non-AMP sequences are a natural byproduct of any wet lab screening for antimicrobial peptides. However, negative results are rarely published, and as a result, large sets of validated non-AMP sequences are likely sitting in the drawers of investigators and not available to the public.
Creating a database of non-AMP sequences and convincing researchers to contribute data to this database would be a helpful step in improving the quality of the training sets. Bioinformaticians and computing scientists have taken an alternative approach to obtaining negative data sets. AntiBP [17] generated the first negative data set based on UniProt [79]. The negative part of the training set is usually selected from random sequences in the protein sequence database that are not annotated as antimicrobial, secretory, toxins, etc. Sequences in the negative set can be controlled by the level of sequence identity, sequence composition, similarity to the sequences in the positive set, and structural and other properties. Since the protein sequence databases are very large (the October 2020 release of the UniProt database contains more than 200 million sequences) [79], the supply of sequences for the negative sets is practically unlimited. There are caveats with these data. The sequences in the negative set may possess antimicrobial properties, although the probability of this is relatively low. Also, antimicrobial activities of AMPs are very sensitive to sequence variation [80]. Such features may not be represented in the current negative data set. Training the models on different combinations of a positive set with several independent negative sets may provide insights into the scale of negative set contamination by hitherto unknown antimicrobial peptides.
In many cases, it is advisable to use a balanced training set, where the AMP and non-AMP sequences are equally represented. AMP sequences can be selected from AMP databases (Table 2). Normally, only a subset of the entire database (or several databases) can be used to compile the positive part of the training set. Sequences from the database are filtered by length, activity, sequence identity, and other parameters.
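The negative-set selection and balancing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any published predictor: the function name build_balanced_set, the keyword list, the length window, and the candidate sequences with their annotations are all hypothetical, and a real workflow would additionally filter by sequence identity with a dedicated clustering tool.

```python
import random


def build_balanced_set(positives, candidates, excluded_keywords,
                       min_len=10, max_len=50, seed=0):
    """Pair the AMPs with an equal number of randomly drawn presumed non-AMPs.

    positives  : list of AMP sequences (e.g., exported from an AMP database)
    candidates : list of (sequence, annotation) tuples from a protein database
    """
    positive_set = set(positives)
    pool = set()
    for seq, annotation in candidates:
        text = annotation.lower()
        if any(word in text for word in excluded_keywords):
            continue                      # annotated as antimicrobial, toxin, ...
        if not (min_len <= len(seq) <= max_len):
            continue                      # outside the chosen length window
        if seq in positive_set:
            continue                      # identical to a known AMP
        pool.add(seq)
    pool = sorted(pool)                   # fixed order keeps sampling reproducible
    if len(pool) < len(positives):
        raise ValueError("not enough negative candidates for a balanced set")
    negatives = random.Random(seed).sample(pool, len(positives))
    # Label 1 = AMP, label 0 = presumed non-AMP.
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]


# Two well-known AMPs as positives: LL-37 and magainin 2.
positives = ["LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES", "GIGKFLHSAKKFGKAFVGEIMNS"]
# Hypothetical candidate records; sequences and annotations are made up.
candidates = [
    ("MKTAYIAKQRQISFVKSHFSRQ", "ribosomal protein fragment"),
    ("GIGKFLHSAKKFGKAFVGEIMNS", "antimicrobial peptide"),
    ("MSTNPKPQRKTKRNTNRR", "toxin component"),
    ("AGSE", "metabolic enzyme fragment"),
    ("DEDLSEEDLQFAERYLRSYYHPT", "structural protein fragment"),
    ("SSGENLYFQGHMASMTGGQQMGR", "expression tag region"),
]
training_set = build_balanced_set(
    positives, candidates, excluded_keywords=("antimicrobial", "secret", "toxin"))
print(training_set)
```

Repeating the draw with different values of seed yields several independent negative sets, which is one simple way to probe how sensitive a model is to the particular negatives chosen.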
In most studies, the positive sets range from several hundred to several thousand sequences, while the negative set from UniProt can be much larger. However, the data sets for numerous species-specific predictions were much smaller due to limited MIC data [68].

4.2 Descriptors and Features
Many different features of peptides can be used to characterize their antimicrobial activity and to discriminate between antimicrobial and non-antimicrobial peptides. Frequently, these features are based on identities, physicochemical properties, structural properties, and compositions of individual amino acid residues and their combinations [61, 81–83]. Physical and chemical properties of amino acids that are most likely to improve machine-learning (ML) model performance include hydrophobicity, electrostatic charge, and polarity. Similarly important are structural properties such as helical propensity and solvent accessibility. In many models, feature vectors include residue locations in the sequence, compositional characteristics, and sequence patterns. The overall number of features can be very large; in those cases, feature selection can help to reduce the size of the feature vector by removing features with relatively low contributions to model performance.

4.3 Machine-Learning Algorithms
A large number of different machine-learning algorithms (Table 3) have been implemented in AMP classification and prediction models since the first papers reporting this approach were published in 2007 [17, 27, 84]. ML methods successfully used in AMP modeling include K-nearest neighbor [52, 72], hidden Markov models (HMMER) [27], naïve Bayes [72], neural networks (NN) (including their deep learning varieties) [63, 70, 71, 74, 85–87], support vector machines [17, 18, 58, 59, 61, 64, 65, 72, 75], random forests (RF) [18, 48, 60, 62, 69, 73, 76], zero-shot learning (ZSL) [68], and many others (Table 3).
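Whatever algorithm is chosen, each peptide must first be converted into a fixed-length numeric feature vector of the kind outlined in Subheading 4.2. The Python sketch below shows one simple possibility, assumed here for illustration only: the 20 amino acid composition fractions followed by net charge, hydrophobic fraction, and length. The function name featurize and this particular choice of descriptors are not taken from any specific published predictor, which typically use far richer descriptor sets.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("IVLFCMAW")   # "Phobic" class of Table 1


def featurize(sequence):
    """Encode a peptide as 20 composition fractions plus three global descriptors."""
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    # Fraction of each of the 20 standard amino acids, in a fixed order.
    composition = [seq.count(aa) / n for aa in AMINO_ACIDS]
    # Assumption: simplified net charge (Lys/Arg = +1, Asp/Glu = -1).
    net_charge = seq.count("K") + seq.count("R") - seq.count("D") - seq.count("E")
    hydrophobic_fraction = sum(aa in HYDROPHOBIC for aa in seq) / n
    return composition + [float(net_charge), hydrophobic_fraction, float(n)]


LL37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"
vector = featurize(LL37)
print(len(vector), vector[-3:])
```

Because every peptide maps to a vector of the same length (23 values here), peptides of different lengths can be fed to the same classifier; feature selection would then operate on the columns of the resulting matrix.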
Support vector machine classification maps feature vectors representing the peptides in the training set into a higher dimensional space. Then the algorithm constructs an optimal hyperplane that separates the two classes of peptides, AMPs and non-AMPs, with the maximal margin of separation between the classes. This hyperplane serves as a decision boundary in the original space. The hyperplane divides the entire higher dimensional space into two half-spaces, and each new peptide from the prediction set is located in one of these two half-spaces. This location determines the predicted class for the new peptide.
Decision tree (DT) classifiers have the form of a rooted binary tree. A divide-and-conquer approach is used during model training. The tree is grown starting from the root, and at each node, the input feature that best separates the output classes is selected. Learned trees are frequently pruned to decrease overfitting. After the tree is created using a training set, a new peptide can be sorted down the tree: at each node, the value of the corresponding input feature determines which branch is followed to the next node. The recursive process terminates once the peptide reaches a leaf node, where the peptide class, AMP or non-AMP, is identified.
The random forest algorithm is an ensemble method based on decision trees. It generates multiple bootstrapped data sets; each data set trains a classification tree in which a fixed-size subset of the available predictors is randomly selected for splitting at each node, and predictions are made by majority vote over all trees. Random forests help to avoid many pitfalls of the decision tree algorithm, particularly overfitting.
While most of the predictions aimed to discriminate AMP and non-AMP (i.e., single-label), several labs have attempted a multi-label prediction based on the multifunctional data annotated in the APD3 [11, 13].
The four multi-label predictions (iAMP-2L, MLAMP, AMAP, and AMPfun) all conduct predictions on two levels [52, 60, 65, 67]. Similar to the single-label prediction described above, the first level of the multi-label prediction predicts whether the peptide is an AMP or non-AMP. If it is an AMP, the program moves on to the second-level prediction to predict the likelihood of other functions the peptide may have. These can include antibacterial, antibiofilm, antiviral, anti-HIV, antifungal, antiparasitic, antimalarial, anticancer, insecticidal, antioxidant, chemotactic, enzyme-inhibiting, and spermicidal activities. It appears that AMAP is best in terms of accuracy. It also predicted more biological functions of AMPs at the second level.
To evaluate the performance of an algorithm on a training set, cross-validation (CV) and a random split into two subsets are commonly used. Implementation of tenfold CV begins with a random grouping of the training set peptides into ten equally sized subsets. Stratification is applied to maintain the class proportions of the full training set in each of the subsets. At the next step, one of the subsets is held out while the remaining nine subsets (90% of the original training set) are combined into one set that is used to train a model. The held-out subset (10% of the original training set) is then treated as a test set, and the trained model predicts the class for each peptide in the subset. Then the procedure is repeated for the remaining nine combinations. The iterative procedure yields a single prediction for each of the peptides in the original training set, which is then compared to the actual class. These comparisons allow calculation of the numbers of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.
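The stratified tenfold procedure just described, and the tallying of TP, TN, FP, and FN, can be sketched in plain Python as follows. This is a hedged illustration: the helper names, the toy one-feature data, and the threshold "model" are assumptions made here to keep the example self-contained, and a real study would plug in one of the classifiers from Table 3. A few summary measures computed from the four counts are included for convenience.

```python
import math
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)   # deal each class round-robin
    return folds


def cross_validate(features, labels, train_fn, k=10, seed=0):
    """Return exactly one held-out prediction per sample."""
    predictions = [None] * len(labels)
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in held_out:
            predictions[i] = model(features[i])
    return predictions


def confusion_counts(labels, predictions):
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, predictions))
    tn = sum(y == 0 and p == 0 for y, p in zip(labels, predictions))
    fp = sum(y == 0 and p == 1 for y, p in zip(labels, predictions))
    fn = sum(y == 1 and p == 0 for y, p in zip(labels, predictions))
    return tp, tn, fp, fn


def summary_metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision,
            "balanced_error_rate": 1.0 - (sensitivity + specificity) / 2.0,
            "mcc": mcc}


def train_threshold(train_x, train_y):
    """Toy stand-in for a classifier: cut halfway between the class means."""
    pos = [x[0] for x, y in zip(train_x, train_y) if y == 1]
    neg = [x[0] for x, y in zip(train_x, train_y) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: 1 if x[0] > cut else 0


# Hypothetical, cleanly separable toy data (one feature, e.g., a net charge).
features = [[5.0 + 0.1 * i] for i in range(20)] + [[-1.0 + 0.1 * i] for i in range(20)]
labels = [1] * 20 + [0] * 20
predictions = cross_validate(features, labels, train_threshold, k=10)
counts = confusion_counts(labels, predictions)
metrics = summary_metrics(*counts)
print(counts, metrics)
```

On this artificial, perfectly separable data set every held-out peptide is classified correctly; real AMP data are far less clean, so such figures should not be read as typical.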
Commonly used performance measures, such as sensitivity, specificity, precision, balanced error rate, and the Matthews correlation coefficient, are all functions of these four numbers. Many published ML models report CV accuracy values that are close to 100%. The actual real-world performance of these models in predicting novel antimicrobial peptides may be lower, due in part to the extremely complex AMP activity landscape.

5 Machine Learning Predictions of Special Antimicrobial Peptides
5.1 Utility and Main Drawbacks of AMP Prediction Algorithms
Overall, our ability to accurately predict the antimicrobial, hemolytic, or cytotoxic activity of any peptide sequence is still developing. While advances in machine learning, positive and negative data sets, and analytic approaches have been made, the accuracy of predicting the properties of a new peptide sequence is still too low to be of reliable use in a screening step, for example. Improvements in peptide sorting and analysis, especially accounting for the different surface properties of Gram-negative and Gram-positive bacteria, could yield significant gains in accuracy and advance the field. This lack of reliability is the main drawback of AMP prediction algorithms and the main hindrance to their use in high-throughput design programs to generate new AMPs.

5.2 Antiviral Peptide Predictors and Data
The antiviral activity of antimicrobial peptides is of considerable interest. In particular, antiviral peptides (AVPs) appear to have activity against membrane-enveloped viruses, as exemplified by LL-37 against influenza virus [88, 89]. Some peptides (e.g., LL-37 and θ-defensins) have been found to have HIV inhibitory activities [90].
AVPs have been shown to exert their activities at various steps in the viral life cycle, including impeding attachment to host cells, altering viral replication within cells, or acting indirectly by recruiting other parts of the immune system to promote host defense [90]. The antimicrobial peptide LL-37 has been shown to effectively inhibit attachment and entry of the influenza virus [88, 89]. As an example of the indirect mode of antiviral activity, the rhesus θ-defensin has been shown to be indirectly antiviral against SARS-CoV-1 [91], with the major effect being an increase in the host defense that allows mice to survive this infection. LL-37 is also active against Zika virus [92]. Recently, several highly effective AMPs were designed that show significant activity against Ebola virus (EBOV) infection of cells [93]. These peptides were designed or "engineered" fragments of the LL-37 peptide [7] and were found to strongly inhibit EBOV entry into cell lines and human primary macrophages, but not viral replication [93]. This study represents an exciting advance in both the design of active antiviral peptides and their application to important diseases such as Ebola. Several websites [94–96] have been established to assist the prediction of AVPs (Table 4).

Table 4 Prediction algorithm websites for antiviral peptides (AVPs)
Prediction algorithms | Link | Notes | References
AVPPred | http://crdd.osdd.net/servers/avppred/ | Webserver for collecting and detecting effective AVPs | [94]
AVPdb | http://crdd.osdd.net/servers/avpdb | A database of experimentally validated antiviral peptides | [95]
FIRM-AVP | https://msc-viz.emsl.pnnl.gov/AVPR | "Feature-informed reduced machine learning for antiviral peptide prediction" | [96]
Using database analysis and a feature reduction technique (the recursive feature elimination algorithm, or RFE), one group developed a software tool to predict antiviral peptides, Feature-Informed Reduced Machine Learning for Antiviral Peptide Prediction (FIRM-AVP) [96]. The analysis assembled 649 features that correlated with antiviral activity and then reduced the number of features to 169 based on Pearson's correlation coefficient and computed MDGI (mean decrease of Gini index) values. The authors then applied the RFE technique to order the features by importance and to identify the most important ones. Three features that were identified in common between two different parts of the analysis are "PseAAC (pseudo amino acid composition) feature for leucine (L) amino acid," "PseAAC feature for lysine (K) amino acid," and "Location oriented feature for α-helix" [96]. This suggests that these features may contribute strongly to the physicochemical profile of an effective antiviral peptide. Overall, this is in agreement with the general observation that antiviral peptides are often α-helical and positively charged [90].

5.3 Antifungal Peptide Predictors and Data
Specific databases and prediction models [97, 98] have been developed for antifungal peptides (AFPs) (Table 5). Antifungal peptides appear to be enriched in the amino acids cysteine (C), glycine (G), histidine (H), lysine (K), arginine (R), and tyrosine (Y) [98]. A similar set of frequently occurring amino acids, L, C, alanine (A), G, K, and R, was obtained when 1210 antifungal AMPs in the APD were statistically analyzed [11]. Positional analysis suggests that the amino terminus of antifungal peptides may predominantly be R, valine (V), or K, while C and H are predominant at the carboxyl terminus of the peptide.
This is different from the most common amino acids (G, L, A, and K) found in antibacterial helical peptides [10, 11].

Table 5 Prediction algorithm websites for antifungal peptides (AFPs)
Database | Link | Notes | References
PlantAFP | http://bioinformatics.cimap.res.in/sharma/PlantAFP/ | Plant-derived peptides | [97]
AntiFP | https://webs.iiitd.edu.in/raghava/antifp/algo.php | | [98]

5.4 Specific and Unique Peptide Prediction Tools

Many other specialized prediction algorithms for peptides have been developed in recent years [99–101]. While anti-inflammatory and pro-inflammatory activities are closely linked to infection outcomes, these peptides may not be directly antimicrobial. However, they may be of interest to antimicrobial peptide researchers, especially since many antimicrobial peptides, such as LL-37, are known to have host-directed effects in addition to antibacterial effects [105]. Some websites have been developed for predicting very specific kinds of activities that may be of interest to antimicrobial peptide researchers, including anti-inflammatory peptides, pro-inflammatory peptides, and antitubercular peptides (Table 6).

Table 6 Prediction algorithm websites for other specific and unique kinds of peptides
Databases and prediction algorithms | Link | Notes | References
AIPpred | www.thegleelab.org/AIPpred | Anti-inflammatory peptides | [99]
PIP-EL | www.thegleelab.org/PIP-EL | Pro-inflammatory peptides | [100]
AntiTBpred | http://webs.iiitd.edu.in/raghava/antitbpred/ | Antitubercular peptides | [101]

5.5 Tuberculosis

Tuberculosis (TB) continues to be a plague on humanity, infecting more than ten million people each year worldwide, and is responsible for approximately two million annual deaths globally. The emergence of multidrug-resistant and extensively drug-resistant (XDR) strains of TB, especially in prisons and other enclosed conditions, is an extreme challenge to society and to the medical community to develop new approaches to treat these infections.
Antimicrobial peptides may represent one new approach to treating Mycobacterium infection [102–104], likely in combination with other treatments. The AntiTBpred website has been developed to help researchers parse through antimicrobial peptide sequences and to try to identify candidates that might be useful against this recalcitrant and challenging organism. Using LL-37, the human cathelicidin, as an example, AntiTBpred analysis suggests (Table 7) that this peptide either may or may not be an antitubercular peptide, depending on the model used. Studies have shown that in vitro and in vivo, LL-37 is antibacterial for Mycobacterium tuberculosis (MTb) and can reduce bacilli counts in a mouse model [105]. Further studies have shown that LL-37 is required to control intracellular MTb replication [103–105]. The antimicrobial peptide HBD2 has also been shown to have antibacterial activity against MTb in vitro [106]. In the output example below, these two peptide sequences were analyzed using all four models within AntiTBpred. Only 1 of the 4 models correctly predicted (gray highlights) that HBD2 was anti-TB, and it also predicted that LL-37 would be anti-TB.

Table 7 AntiTBpred output for the activity of LL-37 against tuberculosis
Prediction method ID AntiTB_MD SVM ensemble LL37 AntiTB_RD SVM ensemble Score Prediction ID Anti-TB peptide HBD2 0.30 LL37 0.25 Non Anti-TB peptide HBD2 0.202 Non Anti-TB peptide AntiTB_MD Hybrid method LL37 0.25 Non Anti-TB peptide HBD2 0.053 Non Anti-TB peptide AntiTB_RD Hybrid method * LL37 HBD2 0.673 Anti-TB peptide 0.78 0.317 Anti-TB peptide Score Prediction Non Anti-TB peptide

5.6 Antibiofilm Peptide Predictors and Data

Biofilm formation by bacteria is a major contributor to colonization, persistence, and difficulty in treatment of bacterial infections.
Chronic, nonhealing diabetic wounds on the lower extremities, lung infections in cystic fibrosis patients, infections of hip replacements and other orthopedic implants, and chronic bladder infections all have bacterial biofilm as a major component of their etiology. In recent years, as our understanding of bacterial biofilms has increased [107–109], it has become clear that some antimicrobial peptides have the ability either to prevent the attachment and formation of biofilm or to induce the dispersal of bacterial biofilms [110–117]. Several databases and websites [11, 35, 118–120] have been developed to gather the information on antibiofilm peptides and to try to predict their activity (Table 8). Although not strictly a peptide-focused resource, a related tool, aBiofilm (https://bioinfo.imtech.res.in/manojk/abiofilm/) [121], may be of interest to antibiofilm peptide researchers. This tool provides a database, an antibiofilm predictor, and data-visualization tools.

Table 8 Prediction algorithm websites for antibiofilm peptides
Databases and prediction algorithms | Link | Notes | References
BaAMPs | http://www.baamps.it/ | Database of biofilm-active antimicrobial peptides | [35]
dPABBs | http://ab-openlab.csir.res.in/abp/antibiofilm/ | Predicts the antibiofilm activity of peptides, generates possible peptide variants, and predicts their antibiofilm activity | [118]
BIPEP | http://cbb1.ut.ac.ir/BIPClassifier/Index | Uses NMR and physicochemical descriptors | [119]
BioFIN | http://metagenomics.iiserb.ac.in/biofin/ and http://metabiosys.iiserb.ac.in/biofin/ | | [120]

6 Antimicrobial Prediction Outcome Comparison

6.1 Prediction Comparison on the Same Platform

The prediction accuracy of AMPs can be determined by numerous factors, ranging from data sets and peptide sequence information encoding to algorithms. Which data set to use depends on the aim of the prediction and personal knowledge.
How to represent the peptide faithfully in a manner that is understandable by computers is a challenging task by itself. This is further complicated by the numerous types of chemical modifications annotated in the APD3 [11]. An optimized prediction requires a sufficient definition of both the types and numbers of peptide features. The number of such peptide features ranges from a dozen to hundreds. The algorithms or models may be used alone or in combination.

6.1.1 Data Sets

A reliable data set is critical to obtain useful predictions. Machine-learning predictions normally use a balanced positive and negative data ratio of 1:1 to avoid a biased prediction toward the larger data set. CAMP used a positive:negative ratio of 1:1.5 [18]. AmPEP tested numerous ratios and achieved a higher accuracy when a 1:3 ratio was utilized [62]. Too high a ratio is undesirable because the prediction will tilt toward negative sequences, thereby reducing the overall performance of machine learning in predicting AMPs. Meher and colleagues tested the effect of the number of positive peptides. They found that the more positive peptides, the better the prediction [61]. This makes sense because the prediction program is better trained with more positive examples (synthetic + natural AMPs). When more and more synthetic peptides are included, however, the prediction accuracy toward natural AMPs may drop. This is undesirable when the goal is to scan genomes to discover novel antibiotics.

6.1.2 Peptide Features

A thorough description of the peptide sequence would require numerous features. The first prediction study recognized the need for a more complete representation of peptide information: a higher accuracy was achieved when the peptide features from both the N- and C-termini were considered [17]. Wang et al. [54] utilized 270 sequence features to represent each AMP.
These include the compositions of the 20 standard amino acids (AAC) and 50 pseudo-amino acid compositions (PseAAC) that describe the peptide sequence based on positional correlations between amino acids. Each PseAAC is also linked with five features: polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge (50 × 5), giving 20 + 250 = 270 features in total. However, the peptide features may not all play the same role in prediction. In pattern recognition, it is most important to identify the major features significant for peptide classification. CAMP started with 257 features and found that 64 features were best for RF [18]. It is possible to further reduce the peptide features required for prediction. Bhadra et al. were able to reduce the features from 105 to 23 without a loss of prediction accuracy [62]. Tripathi and Tripathi utilized merely 15 peptide features to reach a comparable prediction accuracy, including the consideration of the sequence shuffling effect [69]. It appears that only a dozen or so key peptide features are needed to achieve a comparable prediction accuracy.

6.1.3 Algorithms/Models

Tripathi and Tripathi applied different algorithms (RF, J48, SVM, and Naïve Bayes) to peptide prediction based on the same data set. They found that random forest performed best [69]. Also, Yan et al. found that deep learning (CNN) performed similarly to RF but better than SVM [70]. Nevertheless, both SVM (8 studies) and RF (7 studies) are popular in Table 3. To reduce overfitting, an ensemble approach involving multiple models has also been attempted [77]. Lin and Xu [60] reported a higher accuracy for the more recent multi-label prediction methods such as iAMP-2L and MLAMP (92.2% and 94.7%) than for those programmed in CAMP (SVM, RF, and DA at 57.8–77.5% accuracy) [18]. It appears that the high accuracy reported for machine learning does not match the outcomes of real tests (below). There is room to improve for all the existing programs.
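As a minimal illustration of the amino acid composition (AAC) features discussed above, the sketch below encodes a peptide as a 20-element vector of residue fractions. It also shows why composition alone cannot separate a peptide from its shuffled version: both give the same vector. The helper name `aac` and the fixed random seed are assumptions made for this example; published predictors use richer encodings such as PseAAC.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aac(sequence):
    """Return the amino acid composition as 20 fractions that sum to 1."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


LL37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"
features = aac(LL37)

# Shuffle the residues (fixed seed, chosen arbitrarily, for reproducibility).
residues = list(LL37)
random.Random(0).shuffle(residues)
shuffled = "".join(residues)

print("Lysine fraction:", round(features[AMINO_ACIDS.index("K")], 3))
print("Same AAC after shuffling:", aac(shuffled) == features)
```

Order-sensitive descriptors are therefore needed if a predictor is expected to treat shuffled sequences differently from the parent peptide.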
6.2 Testing the Prediction Outcomes by Using Peptides Not Included in the Training Set

How each program performs in AMP prediction can be put to a practical test. We tested the AntiBP program by using newly discovered natural AMPs, which were not included in the training set. Among the 17 peptides with known activity, 71% were predicted correctly [20]. Another test was conducted in 2015 using 10 new peptides (APD ID: 2399–2408) [51]. AntiBP SVM predicted 70% correctly, whereas the RF, SVM, ANN, and DA programs in CAMP [18] obtained 60–80% correctness. iAMP-2L [52] achieved a similar prediction of 80%. Bishop et al. [122] identified 568 novel peptides from alligator plasma. From 45 predicted to be AMPs by CAMP [18], eight peptides were chemically synthesized and subjected to antibacterial assays. Five were experimentally proved to be antimicrobial (a prediction accuracy of 5/8 = 62.5%). Yan et al. [70] developed Deep-AmPEP30 and predicted three antimicrobial sequences from the genome of Candida glabrata, and one peptide was proved active against the Gram-positive bacterium Bacillus subtilis and the Gram-negative Vibrio parahaemolyticus. These tests underscore the limitations of existing programs. Porto et al. [80] found that the machine-learning programs worked well only for peptides resembling the training data set. However, the programs failed to predict sequence-shuffled peptides [14], indicating an insufficient consideration of peptide sequence information.

6.3 Comparison with Existing AMP Knowledge

Every machine-learning algorithm is essentially a black box. It is not surprising that there is no direct link between the computing outcome and AMP biology. AmPEP compared various descriptors that distinguish AMPs from non-AMPs and identified charge as the most important descriptor [62]. The iAMPpred program [61] also found the importance of net charge, followed by the isoelectric point, of the peptides in the training set.
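Because charge recurs as a leading descriptor, a minimal sketch of a net charge calculation may be helpful. It assumes the simplest convention: +1 for each lysine or arginine, −1 for each aspartate or glutamate, histidine treated as neutral, and the free termini and any chemical modifications ignored. Web tools may use pH-dependent models instead, so values can differ slightly.

```python
def net_charge(sequence):
    """Integer net charge near neutral pH: (K + R) minus (D + E).

    Simplifying assumptions: histidine is neutral, termini and
    chemical modifications are ignored.
    """
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return positive - negative


LL37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"
print("LL-37 net charge:", net_charge(LL37))
```

For LL-37 this count gives 11 basic and 5 acidic residues, which agrees with the net charge of +6 quoted for this peptide in this chapter.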
The iAMP-2L program reveals that amino acid composition accounts for 60% of the weightings [52]. Taken together, AMP charge and composition are two major features for AMP differentiation. Overall, these machine-learning findings agree with the research results on AMPs that cationicity and hydrophobicity are the two most important factors that determine peptide antimicrobial activity. Amino acid composition is important in determining the peptide activity spectrum as well [9, 123, 124]. Some programs documented selected amino acids to be important predictors of AMPs. Based on the APD3 data set, the AMAP study [65] identified amino acids C, K, V, and phenylalanine (F) for AMP prediction, whereas aspartic acid (D), glutamic acid (E), L, Y, proline (P), R, and asparagine (N) are indicators for non-AMPs. Using a merged data set, iAMPpred identified amino acids K, P, C, and isoleucine (I) [61]. Wang et al. [54] found C, P, R, W, and H based on both natural and patented AMPs in the CAMP database. In another study, amino acids G, F, P, and W were identified [44] based on the DBAASP data set [33]. It is evident that there is a low level of consensus among different prediction studies. This may result from differences in the training data sets, algorithms, and the assessment of important features during prediction. It may be useful to compare the above amino acids with the frequently occurring amino acids (~10%) discovered from analyses of the major classes of natural peptides in the APD3 [10]. K, L, G, and A are frequently occurring (abundant) amino acids (~10% or more) in 463 known helical AMPs. In contrast, amino acids C, G, and R are abundant in natural AMPs with a known β-sheet structure (87 in the APD3) (Fig. 1a). For the “rich” families, His-rich AMPs are clearly rich in H and G, while Pro-rich AMPs are rich in P and R. Also, Trp-rich peptides are rich in W and R (Fig. 1b).
When combined, we have G, L, A, K, C, R, H, P, and W. Most of the amino acids discovered by machine learning correspond to part of the frequently occurring amino acids of AMPs discovered in the APD3 [13]. Machine learning also identified the hydrophobic amino acids V, F, and I. While F and I are abundant in helical AMPs from fish and mammals, V is abundant in lactone and lactam types of bacteriocins [13]. It is puzzling that neither L nor A was identified as an AMP predictor by any machine-learning study. Leucine is, on average, abundant in 121 amphibian temporins (Fig. 1b) and important for peptide design [32]. Alanine is particularly high in amphibian AMPs from South America [13]. Increased conversations between AMP researchers and bioinformaticians may improve the prediction outcomes in the future.

7 Beyond Antimicrobial Properties and Proposed Prediction Integration Toward Future Medicine

7.1 Antimicrobial Peptide Properties that Contribute to AMP Activity

As discussed above, the general properties of peptides that appear to be positively correlated with AMP activity have been identified from experience and usually include the following physicochemical parameters: (1) peptide length, (2) amphipathicity, (3) hydrophobicity, and (4) cationicity. However, translating these general principles into specific physicochemical rules by which sequences can be included, excluded, or predicted to have antimicrobial activity has been the challenge of the decades since their discovery. As discussed above, there are many detailed bioinformatic and computational approaches that seek to solve this problem of AMP prediction (Table 3).

7.2 Important Antimicrobial Peptide Properties in Addition to AMP Activity

Additional properties of peptides, besides AMP activity, will contribute to them being “successful” antimicrobial peptides.
These properties, beyond antimicrobial peptide activity, include toxicity towards host cells, ability to penetrate microbial or eukaryotic membranes, susceptibility to host proteases, and “stickiness” (the propensity to be bound to albumin or other high-abundance proteins in the host), among others. Host-cell toxicity can include hemolytic activity and cytotoxicity, or it can be observed in vivo through toxicity trials. Cell permeability of the peptide can be a critical factor if the target of the AMP is an intracellular bacterium, for example. “Stickiness” to high-abundance host proteins or high susceptibility to host proteases can affect the in vivo availability of the peptide and its half-life, aspects of pharmacodynamics (PD) and pharmacokinetics (PK) that have significant implications for future clinical success. Unfortunately, the PK/PD data for AMPs are sparse, since most of the peptides have not been advanced to that level [6]. Some of the major parameters for consideration and possible inclusion in a computational approach are listed in Table 9. Many tools for computing these properties are available online, for example, in R (Peptides, https://rdrr.io/cran/Peptides/man/), ExPASy (expasy.org), and the calculation tool of the APD3 [11]. LL-37 is a widely studied human cathelicidin peptide encoded by the single CAMP gene. It is stored in and released from neutrophils and expressed in other types of human cells as well. Depending on the cells and physiological conditions, the precursor of human cathelicidin may be cleaved into different mature peptides. This peptide has been found to be antibacterial against many pathogens, including resistant strains, persisters, and biofilms. It belongs to the classic amphipathic helical family with a short tail at the C-terminus (PDB: 2K6O) [7]. In Table 9A, the major physicochemical properties of LL-37 are shown as computed by one of the many websites described below.
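As a hedged illustration of how one such sequence-derived property can be computed offline, the sketch below evaluates the grand average of hydropathy (GRAVY) as the mean Kyte-Doolittle hydropathy value over all residues. The function name is illustrative; the scale values are the standard Kyte-Doolittle ones. On this scale LL-37 comes out at about −0.724, that is, hydrophilic overall.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


LL37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"
print("Length:", len(LL37))
print("GRAVY:", round(gravy(LL37), 3))
```

Other indices, such as the Boman index, follow the same pattern with a different per-residue scale, so the same function shape can be reused once the appropriate scale is supplied.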
This peptide is short (37 aa), amphipathic (>1), cationic (net charge +6), has a high pI (>10), and has a low molecular weight (under 5 kDa). The ExPASy ProtParam tool provides the instability index (23.34) and the aliphatic index (89.46). The APD website calculates GRAVY (-0.724), Boman index (2.99 kcal/mol), and Wimley-White whole residue hydrophobicity (12.83) for LL-37. Because LL-37 is well studied, we will use it as an example in our discussion of the online tools described below.

7.3 Host-Cell Toxicity and Hemolysis

Host-cell cytotoxicity and hemolysis are critical to the clinical potential of any antimicrobial peptide. Thus, we propose that this issue be considered early, right after identification of the desired antimicrobial activity of any peptide, as a potentially strong counter-selection criterion. Although sequence features such as multiple lysines and high hydrophobicity are known to contribute to host-cell cytotoxicity, it remains challenging to “design out” host-directed toxicity of active peptides while retaining the desired antimicrobial activity of the sequence. The combined AMP selection and counter-selection procedure leads to a short list of AMPs with high therapeutic indexes for experimental validation. There are multiple online programs available for the computational prediction of toxicity and hemolysis of antimicrobial peptides. For example, Gupta et al. have published a method of in silico toxicity prediction for peptides [125, 126]. The corresponding website, ToxinPred, has two algorithms available, ToxinPred SVM-SwissProt and ToxinPred QM-di-SwissProt. To illustrate the use of this website, we submitted the sequence of LL-37, the human cathelicidin, to compare the prediction versus in-laboratory data (Table 9A, B). It can be seen that experimentally the cytotoxicity of LL-37 is dose-dependent and increases with increasing concentration of peptide (Table 9B).
Table 9 Hemolytic prediction of activity for LL-37 human cathelicidin peptide
(A) Predicted Toxicity of LL-37 on ToxinPred (validated via ExPASy ProtParam tool)
Peptide sequence | SVM score | Prediction | Hydrophobicity | Hydropathicity | Amphipathicity | Hydrophilicity | Net charge | pI | Mol wt
LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES | 1.58 | Nontoxin | 0.34 | -0.72 | 1.06 | 0.62 | +6.0 | 10.61 | 4493.32
(B) Experimental cytotoxicity activity of human cathelicidin LL-37
Peptide | Cell line | Assay | Result | References
A549 A431 squamous cell carcinoma cells pMSC MA-104 Thermally wounded human skin equivalents MTT (HSE) Scrambled LL-37 LL-37 LL-37 LL-37 LL-37 MTT, Neutral red MTT MTT MTT MTT A549 LL-37 [131] No cytotoxicity at up to 200 μg/model [129] No toxicity up to 10 μg/mL [130] [128] Cytotoxic at 20 μg/mL. Not toxic at 5 μg/mL Statistically significant cytotoxicity (>10%) observed 20–50 μg/mL [127] Not cytotoxic up to 50 μg/mL [127] Not cytotoxic up to 50 μg/mL

However, this subtlety of peptide concentration is not captured by the predictors, which predict a single result for some unknown concentration of peptide. Thus, just like a stopped clock is correct twice a day, the predictor is correct at some concentrations of LL-37 and is incorrect at higher concentrations. This concentration-dependence of the real-life data needs to be integrated with computational predictors in the future, perhaps by including the concentrations at which a peptide was recorded in the data set as an “antibacterial” or “noncytotoxic” peptide. Hemolytic activity is the ability of a peptide to lyse red blood cells. This assay is normally performed with a washed 2% solution of red blood cells, following a standard protocol [132, 133].
Many different defibrinated red blood cell types can be used, depending on the intent of the experiment, such as sheep [132–134], horse [135], chicken [136], or mouse [137, 138], which may be more sensitive to peptide hemolysis than human red blood cells [138]. Often it is desirable to use deidentified human blood to test hemolytic activity, which can be obtained from companies like BioIVT and used in these assays [138]. Computational predictors can be used to estimate hemolytic activity. For example, HemoPred [139], HemoPI/Hemolytik [140], and HAPPENN [141] are some of the websites currently available (Table 10). HemoPred utilizes a random forest classifier based on amino acid sequence, dipeptide composition, and physicochemical parameters [139]. HemoPI is based on comparing a data set of highly hemolytic peptides to a random data set of peptides from SwissProt [140]. Finally, the HAPPENN tool employs neural networks based on classification of known peptides as hemolytic and nonhemolytic to predict the hemolytic activity from a new peptide’s primary sequence [141]. As an exercise, we ran the sequence of the LL-37 peptide through the various hemolysis predictors (Table 11) and compared the results to published laboratory-generated data regarding hemolytic activity (Table 12). From the literature, the following hemolysis data were obtained for the LL-37 peptide (Table 12), as an example. This is not a comprehensive meta-analysis, but shows data from several papers that together cover a wide range of peptide concentrations and hemolytic results [142–148]. Of course, there is no indication from these computational predictors of the dose-dependence of the effect, although “the dose makes the poison” in most cases with antimicrobial peptides, including LL-37. The prediction results vary from one end of the hemolytic activity spectrum to the other: one result is “Not Hemolytic,” one is “Somewhat hemolytic,” and one is “Hemolytic.” This small analysis suggests that there is significant room for improvement in the accuracy of these predictors compared to actual experimental data generated in the laboratory (Table 13 and Fig. 3).

Table 10 Hemolytic predictor websites
Name | Link | References
HemoPred | http://codes.bio/hemopred/ | [139]
HemoPI/Hemolytik | https://webs.iiitd.edu.in/raghava/hemopi/index.php or http://crdd.osdd.net/raghava/hemopi/ | [140]
HAPPENN | https://research.timmons.eu/happenn | [141]

Table 11 Hemolytic prediction of activity for LL-37 human cathelicidin peptide
Test sequence LL-37: LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES
Program used | Predicted result | Notes
HemoPred | Hemolytic |
HemoPI | PROB score 0.34 (SVM (HemoPI-1) based); 0.72 (SVM (HemoPI-2) based) (Hemolytic); 0.88 (SVM (HemoPI-3) based) (Hemolytic) | Note from website: PROB score is the normalized SVM score and ranges between 0 and 1, i.e., 1 very likely to be hemolytic, 0 very unlikely to be hemolytic
HAPPENN | PROB score 0.089 (Not Hemolytic) | Note from website: PROB score is the normalized sigmoid score and ranges between 0 and 1. 0 is predicted to be most likely nonhemolytic, 1 is predicted to be most likely hemolytic

Table 12 Summary of reported percent hemolysis results with different amounts of LL-37 peptide against human red blood cells
Hemolysis of human red blood cells | References
8% hemolysis at 20 μM | [144]
~30% hemolysis at 20 μM | [147]
4.47% hemolysis at 38.8 μM | [146]
~10% hemolysis at 60 μM | [143]
9% hemolysis at 100 μM | [148]
~60% hemolysis at 100 μM | [142]
~50% hemolysis at 200 μM | [145]
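The dose-dependence in Table 12 can be summarized with an ordinary least-squares line, which is presumably how the best-fit line reported with Fig. 3 (y = 0.2142x + 8.0017) was obtained. The sketch below is an assumption-laden reconstruction: the approximate values marked “~” in Table 12 are taken at face value, and the function name is illustrative. With these inputs the fit gives a slope of about 0.214 and an intercept of about 8.0.

```python
def least_squares_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# (LL-37 concentration in uM, percent hemolysis) pairs read from Table 12.
table_12 = [
    (20, 8), (20, 30), (38.8, 4.47), (60, 10),
    (100, 9), (100, 60), (200, 50),
]

slope, intercept = least_squares_line(table_12)
print(f"y = {slope:.4f}x + {intercept:.4f}")
```

The wide scatter around this line (for example, 9% versus about 60% hemolysis at 100 μM) is a reminder that a single fitted line hides large laboratory-to-laboratory differences.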
Table 13 Peptide parameters for integrated prediction
Parameter of interaction | Commonly used parameters | Comments
Antibacterial activity | MIC >8 μg/mL is often considered “active” performed under CLSI guidelines using CA-MHB and designated concentrations of peptide. The peptide is defined as inactive in the APD with MIC >100 μg/mL or μM | Different methods and conditions for antimicrobial activity make it difficult to compare peptide activity. Does not account for peptide binding to serum proteins or being cleaved by serum factors in vivo. PK/PD data are lacking for AMPs, and they are not addressed by this metric
Host-cell cytotoxicity | Cytotoxicity at 100 μg/mL or less; TC50 should be <10–20% at the MIC, depending on the assay used | The relationship of this value in vitro with in vivo/whole body toxicity has not been established. Often the level of LL-37 is taken as a benchmark, since it is native to the human body
Hemolysis | Hemolysis at 100 μg/mL or HC50 should be <10–20% at MIC | The relationship of this value to in vivo/whole body toxicity has not been measured. Often the level of LL-37 is taken as a benchmark, since it is native to the human body
Host cell permeability | An important parameter if the target microorganism has an intracellular step to its infectious life-cycle |
Pathogen cell permeability | An important parameter if the target of the peptide at sub-MIC concentrations might be an intracellular component of the bacteria, such as target enzymes or DNA | Assays to measure intracellular bacterial targets such as enzymes or DNA in the presence of extracellular peptide are useful to assess this parameter [149–151]. Assays to measure intracellular replication of bacteria in the presence of extracellular peptide are useful to assess this parameter [113]
Stickiness to other proteins (Boman index) | “This function computes the potential protein interaction index proposed by Boman [3] based in the amino acid sequence of a protein. The index is equal to the sum of the solubility values for all residues in a sequence, it might give an overall estimate of the potential of a peptide to bind to membranes or other proteins as receptors, to normalize it is divided by the number of residues. A protein has high binding potential if the index value is higher than 2.48” | Initially called protein-binding potential [3], Boman index was renamed and programmed in the APD for every peptide [9]. It is also available in the calculation and prediction interface of the APD for any other peptides. This parameter is also programmed in R at https://rdrr.io/cran/Peptides/man/boman.html
Propensity for host protease cleavage | Protease cleavage will reduce the activity and half-life of the peptide | Can be predicted using Expasy server PeptideCutter. https://web.expasy.org/peptide_cutter/
Other negative effects | References | Comments
Carcinogenic effect | None | No reports were found on the carcinogenic effect of antimicrobial peptides. Work is being done to use AMPs to fight cancer [155, 156]
Antigenicity | None | It is very difficult to raise antibodies against antimicrobial peptides. This is accomplished, if at all, by coupling KLH to the peptide. To our knowledge, there have been no reports of spontaneous antibody production against naturally produced AMP, which is too small
Cell penetrating properties | [152] | Cell penetrating properties of peptides are probably a negative property on net, especially in seeking a bactericidal mechanism. Websites are available to select for CPPs; this could be a counter-selection or down-selection step in an AMP design protocol unless this property is used to target intracellular pathogens

Fig. 3 Percent hemolysis results with different amounts of LL-37 peptide against human red blood cells. The data from Table 12 were plotted. The best-fit line is y = 0.2142x + 8.0017. The shaded gray area represents a 95% confidence interval

7.4 Bacterial Cell-Penetrating Peptides

Another factor that may need to be considered in computational prediction of AMP activity is the penetration of the pathogen cell itself: bacterium, membrane-enveloped virus, fungal cell, etc. While the main mechanism of action of AMPs is clearly membrane targeting and disruption, there are multiple, well-defined examples of intra-bacterial targets of AMPs that may contribute to their physiological effect, especially at sub-MIC levels in vivo. These can include targeting bacterial enzymes critical for bacterial survival, or direct interference of the AMP with the bacterial DNA.
One example of the association of AMPs with critical bacterial enzymes is the identification of acyl carrier protein as a target of LL-37, the human cathelicidin peptide. This association was first determined biochemically by binding the bacterial proteins to immobilized peptide and identifying high-affinity binding proteins [149]. Another example of intra-bacterial targets of AMPs is the association of LL-37 directly with bacterial DNA within the cell, leading to mutations of critical genes [150, 151]. This work includes a compelling visualization of the AMP inside live Pseudomonas bacteria, associated with the DNA. This ability of AMPs to enter the bacteria and exert a direct, non-membrane-acting effect could be computationally assessed using cell-penetrating peptide (CPP) analysis, such as is done for other well-known CPPs [152]. Unlike AMPs, CPPs for bacterial pathogens should have the property of being non-killing but membrane-penetrating, and comparison of these sets of peptide sequences may reveal some interesting differences. It might be possible to use the CPP algorithm for counter-selection, retaining only peptides that do not have this property, if a membrane-targeting peptide is desired to achieve bactericidal activity.

7.5 Inclusion of Additional Parameters in Drug Development

It would be useful if these computational predictors could be used in a combinatorial fashion to achieve the goals of the researcher in designing new AMPs, as was done in the database filtering technology approach [153, 154]. For example, perhaps one seeks a short, helical antimicrobial peptide that has activity against Gram-negative bacteria and especially has antibiofilm activity and low hemolytic activity. It would be useful to have separate analytical tools linked together to generate the desired output.
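A minimal sketch of such linked tools is a chain of selection and counter-selection filters applied to each candidate. Everything here is illustrative: the thresholds, the toy sequences, and the hemolysis probabilities for the toy peptides are hypothetical, and in practice the scores would come from external predictors such as those in Tables 10 and 11. Only the LL-37 sequence and its HAPPENN PROB score (0.089, Table 11) are taken from this chapter.

```python
def net_charge(sequence):
    """Integer net charge: (K + R) minus (D + E); histidine and termini ignored."""
    return sum(sequence.count(a) for a in "KR") - sum(sequence.count(a) for a in "DE")


LL37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"

candidates = [
    {"name": "LL-37", "sequence": LL37, "hemolysis_prob": 0.089},         # HAPPENN score, Table 11
    {"name": "toy-1", "sequence": "GLKKLLKKLG", "hemolysis_prob": 0.10},  # hypothetical
    {"name": "toy-2", "sequence": "DDEEGGSS", "hemolysis_prob": 0.05},    # hypothetical, anionic
    {"name": "toy-3", "sequence": "KKLLKKLLKK", "hemolysis_prob": 0.90},  # hypothetical, hemolytic
]

# Each filter is a (label, predicate) pair; thresholds are arbitrary examples.
filters = [
    ("length <= 20", lambda c: len(c["sequence"]) <= 20),
    ("net charge >= +2", lambda c: net_charge(c["sequence"]) >= 2),
    ("hemolysis_prob < 0.5", lambda c: c["hemolysis_prob"] < 0.5),
]


def run_pipeline(candidates, filters):
    """Return names that pass every filter, plus the first failed filter for the rest."""
    kept, rejected = [], {}
    for cand in candidates:
        failed = next((label for label, ok in filters if not ok(cand)), None)
        if failed is None:
            kept.append(cand["name"])
        else:
            rejected[cand["name"]] = failed
    return kept, rejected


kept, rejected = run_pipeline(candidates, filters)
print("Kept:", kept)
print("Rejected:", rejected)
```

Recording which filter removed each candidate makes it easier to see whether a short list is being shaped mainly by activity criteria or by counter-selection criteria such as hemolysis.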
With the ever-increasing number of modules available in R, and of web-based prediction and analysis tools, this analysis could be done from small scale to high-throughput sequence analysis to design novel peptides. If the computational predictors could be made more accurate, this could be useful in drug-development projects upstream of in vitro screening programs, for example, to increase hit efficacy. The inclusion of prescreening for hemolysis and cytotoxicity would be very useful to reduce the number of hits that have poor in vivo performance characteristics. In addition, high-throughput peptide sequencing could enable the generation of high-quality training sets and negative data sets.

8 Current Achievements and Future Directions

8.1 Achievements

In summary, antimicrobial peptide prediction is in essence a peptide classification problem. Different supervised learning algorithms have been trained to predict AMPs (Table 3). The major achievements include the following:

1. Construction of AMP databases that facilitated machine-learning prediction. The APD database, initially online in 2003 and updated regularly, provides a platform for understanding the structure and activity relationship of natural AMPs.
2. Generation of hypothetically negative data sets based on UniProt.
3. Successful encoding of peptide features for machine-learning prediction.
4. Programming of various machine-learning algorithms with more or less similar prediction outcomes.
5. Execution of both single and multi-label predictions as well as ensemble predictions of AMPs.
6. Consideration of the impact of the peptide sequence in addition to amino acid composition.
7. Consideration of posttranslational modifications and 3D structure of AMPs, although rare.
8. Species-specific prediction of AMPs.

8.2 Future Directions

Machine-learning prediction of AMPs remains a challenging task.
The success rate remains modest because numerous factors are in play. We anticipate that the quality of AMP prediction will improve with developments in the following areas:
1. A more complete positive data set for AMPs from continued peptide search and database update. There are two types of positive data. First, a continued expansion of natural AMPs in the APD will increase the accuracy of identifying natural AMP sequences. Second, data merging from different databases is anticipated to continue, and a large data set with more and more synthetic peptides may improve the prediction of artificial sequences.
2. Experimentally validated negative data sets for AMPs. Our ongoing collection of such peptides may reduce false positives in ML predictions.
3. Ranking peptide activity data based on the same scale (e.g., MIC, diffusion distance, and E-test). This is a challenging task due to limited activity analysis under various lab conditions. A recommended guide for antimicrobial assays of AMPs may be helpful.
4. Increased use of information about the target organism in classification and analysis of AMPs (e.g., whether the target is Gram-positive or Gram-negative bacteria, or a specific pathogen).
5. Continued improvement of peptide encoding for rapid and accurate computational identification.
6. Increased use of peptide information on chemical modifications and their relationship with activity.
7. More high-quality 3D structures and their application in AMP prediction. This is yet another challenging task, as currently only ~13% of AMPs in the APD3 are known to have 3D structures, and high-quality structures are not easy to obtain [11].
8. Development of more powerful machine-learning/artificial intelligence algorithms or models to better handle the sequence and structural diversity and the data imbalance of AMPs. Combined use of various ML models (i.e., ensemble) may improve predictions.
9.
Increased communication between AMP investigators and machine learning/AI scientists.
10. Establishment of a pipeline for predicting the peptide properties required of a medicine, considering antimicrobial activity, cell toxicity in vitro and in vivo, and peptide bioavailability for efficacy in vivo.

Besides AMP prediction, another goal of the APD database is to help design novel peptides to combat antibiotic-resistant pathogens [9]. Different methods have been demonstrated [32]. The frequently occurring amino acids, such as glycine, leucine, and lysine, are sufficient for designing peptides with antibacterial activity comparable to that of human cathelicidin LL-37 [10, 13]. Interestingly, substituting isoleucine or valine for leucine in the database-designed peptide DFTamP1 decreased activity or solubility [153], underscoring the significance of nature’s choice of leucine as a frequently occurring amino acid in AMPs [10]. Also, there is an inverse correlation between peptide length and leucine content among over 1000 amphibian peptides in the APD [157]. Our screening of representative peptides from the APD led to the identification of different sets of AMPs against methicillin-resistant Staphylococcus aureus (MRSA) and HIV-1 [158, 159]. The grammar approach emphasizes the unique sequences in the database and their combinations [14]. The database filtering technology (DFT) is an ab initio approach, thereby providing another avenue [153]. The database-derived parameters are useful for making peptide mimics [160] or for designing short peptides with a lower production cost [6]. Our expansion of the DFT from in silico filtering to in vitro and in vivo filtering establishes a pipeline for peptide discovery [154]. This idea can be harnessed to establish a pipeline of machine-learning predictions to accelerate peptide discovery.
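The database-derived parameters discussed above, such as amino acid composition and the relation between peptide length and leucine content, are simple to compute. The Python sketch below is illustrative only: apart from the LL-37 sequence, the peptides are made-up toy sequences rather than APD entries, and the helper names `composition` and `pearson` are assumptions of this example.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Sequence

LL37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"  # human cathelicidin LL-37


def composition(seq: str) -> Dict[str, float]:
    """Fraction of each amino acid (one-letter code) present in the sequence."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy sequences (hypothetical): longer peptides here carry a smaller Leu fraction.
toy_peptides = ["LLKKLLKK", "GLKKLAKKLGKA", "GAKKLGKKAGKKLGKA"]
lengths = [len(p) for p in toy_peptides]
leu_fraction = [composition(p).get("L", 0.0) for p in toy_peptides]
r = pearson(lengths, leu_fraction)

print(f"Leu fraction in LL-37: {composition(LL37)['L']:.3f}")
print(f"length vs Leu fraction, Pearson r = {r:.2f}")
```

A negative `r` on a real data set would correspond to the inverse correlation described in the text; the toy values here only exercise the calculation.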
When quantitative MIC values are used to train an ML algorithm, it becomes possible to rank peptide activity and identify the most potent sequences [161]. Likewise, a subsequent counterselection can be conducted by ranking peptide toxicity to host cells (Table 10) so that less toxic peptides can be selected for experimental validation. Ultimately, one may be able to generate an expert system that automatically designs and produces personalized antimicrobials with a designed activity spectrum and molecular target for patients, to treat an infection caused by a particular pathogen. The multiple functions of AMPs annotated in the APD3 imply other potential applications as well.

Acknowledgments

This study was supported by the Joint Warfighter Medical Research Program (JWMRP) JW200188 (MVH), the NIH grant R01GM138552, and the University of Nebraska Collaborative Initiation Grant (GW). Thanks to Fahad Alsaab and Maxwell Tabarrok for assistance with the hemolytic data.

References

1. Boucher HW, Talbot GH, Bradley JS, Edwards JE, Gilbert D, Rice LB, Scheld M, Spellberg B, Bartlett J (2009) Bad bugs, no drugs: no ESKAPE! An update from the Infectious Diseases Society of America. Clin Infect Dis 48:1–12 2. O’Neill J (2016) Tracking drug resistant infections globally: Final report and recommendations, The review on antimicrobial resistance, Wellcome Trust, HM Government. 3. Boman HG (2003) Antibacterial peptides: basic facts and emerging concepts. J Intern Med 254:197–215 4. Mangoni ML, McDermott AM, Zasloff M (2016) Antimicrobial peptides and wound healing: biological and therapeutic considerations. Exp Dermatol 25:167–173 5. Hancock REW, Sahl HG (2006) Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nat Biotechnol 24:1551–1557 6.
Lakshmaiah Narayana J, Mishra B, Lushnikova T, Wu Q, Chhonker YS, Zhang Y, Zarena D, Salnikov ES, Dang X, Wang F, Murphy C, Foster KW, Gorantla S, Bechinger B, Murry DJ, Wang G (2020) Two distinct amphipathic peptide antibiotics with systemic efficacy. Proc Natl Acad Sci U S A 117:19446–19454 7. Wang G, Narayana JL, Mishra B, Zhang Y, Wang F, Wang C, Zarena D, Lushnikova T, Wang X (2019) Design of Antimicrobial Peptides: Progress made with human cathelicidin LL-37. Adv Exp Med Biol 1117:215–240 8. Browne K, Chakraborty S, Chen R, Willcox MD, Black DS, Walsh WR, Kumar N (2020) A new era of antibiotics: the clinical potential of antimicrobial peptides. Int J Mol Sci 21:7047 9. Wang Z, Wang G (2004) APD: the antimicrobial peptide database. Nucleic Acids Res 32:D590–D592 10. Wang G, Li X, Wang Z (2009) The updated antimicrobial peptide database and its application in peptide design. Nucleic Acids Res 37:D933–D937 11. Wang G, Li X, Wang Z (2016) APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res 44:D1087–D1093 12. Kreutzberger MA, Pokorny A, Almeida PF (2017) Daptomycin-Phosphatidylglycerol domains in lipid membranes. Langmuir 33:13669–13679 13. Wang G (2020) The antimicrobial peptide database provides a platform for decoding the design principles of naturally occurring antimicrobial peptides. Protein Sci 29(1):8–18 14. Loose C, Jensen K, Rigoutsos I, Stephanopoulos G (2006) A linguistic model for the rational design of antimicrobial peptides. Nature 443(7113):867–869 15. Wang G (2012) Post-translational modifications of natural antimicrobial peptides and strategies for peptide engineering. Curr Biotechnol 1:72–79 16. Wang G, Mishra B, Lau K, Lushnikova T, Golla R, Wang X (2015) Antimicrobial peptides in 2014. Pharmaceuticals (Basel) 8:123–150 17. Lata S, Sharma BK, Raghava GP (2007) Analysis and prediction of antibacterial peptides. BMC Bioinformatics 8:263 18.
Thomas S, Karnik S, Barai RS, Jayaraman VK, Idicula-Thomas S (2010) CAMP: a useful resource for research on antimicrobial peptides. Nucleic Acids Res 38:D774–D780 19. Wang G (2015) Improved methods for classification, prediction, and design of antimicrobial peptides. Methods Mol Biol 1268:43–66 20. Wang G (2010) Antimicrobial peptides: discovery, design and novel therapeutic strategies, 2nd edn. CABI, England. published in 2017 21. Gudmundsson GH, Agerberth B, Odeberg J, Bergman T, Olsson B, Salcedo R (1996) The human gene FALL39 and processing of the cathelin precursor to the antibacterial peptide LL-37 in granulocytes. Eur J Biochem 238: 325–332 22. Sørensen O, Arnljots K, Cowland JB, Bainton DF, Borregaard N (1997) The human antibacterial cathelicidin, hCAP-18, is synthesized in myelocytes and metamyelocytes and localized to specific granules in neutrophils. Blood 90:2796–2803 23. Sørensen OE, Gram L, Johnsen AH, Andersson E, Bangsbøll S, Tjabringa GS, Hiemstra PS, Malm J, Egesten A, Borregaard N (2003) Processing of seminal plasma hCAP-18 to ALL-38 by gastricsin: a novel mechanism of generating antimicrobial peptides in vagina. J Biol Chem 278(31): 28540–28546 24. de Jong A, van Heel AJ, Kok J, Kuipers OP (2010) BAGEL2: mining for bacteriocins in genomic data. Nucleic Acids Res 38: W647–W651 25. Blin K, Kazempour D, Wohlleben W, Weber T (2014) Improved lanthipeptide detection and prediction for antiSMASH. PLoS One 9(2): e89420 26. Yount NY, Weaver DC, de Anda J, Lee EY, Lee MW, Wong GCL, Yeaman MR (2020) Discovery of novel type II Bacteriocins using a new high-dimensional Bioinformatic algorithm. Front Immunol 11:1873 27. Fjell CD, Hancock RE, Cherkasov A (2007) AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics 23:1148–1155 28. 
Dos Santos-Silva CA, Zupin L, Oliveira-Lima M, Vilela LMB, Bezerra-Neto JP, Ferreira-Neto JR, Ferreira JDC, de Oliveira-Silva RL, Pires CJ, Aburjaile FF, de Oliveira MF, Kido EA, Crovella S, Benko-Iseppon AM (2020) Plant antimicrobial peptides: state of the art, in silico prediction and perspectives in the omics era. Bioinform Biol Insights 14:1177932220952739 29. Jia HP, Mills JN, Barahmand-Pour F, Nishimura D, Mallampali RK, Wang G, Wiles K, Tack BF, Bevins CL, McCray PB Jr (1999) Molecular cloning and characterization of rat genes encoding homologues of human beta-defensins. Infect Immun 67:4827–4833 30. Wang CK, Kaas Q, Chiche L, Craik DJ (2008) CyBase: a database of cyclic protein sequences and structures, with applications in protein discovery and engineering. Nucleic Acids Res 36:D206–D210 31. Yount NY, Andrés MT, Fierro JF, Yeaman MR (2007) The gamma-core motif correlates with antimicrobial activity in cysteine-containing kaliocin-1 originating from transferrins. Biochim Biophys Acta 1768(11):2862–2872 32. Wang G (2013) Database-guided discovery of potent peptides to combat HIV-1 or superbugs. Pharmaceuticals (Basel) 6(6):728–758 33. Pirtskhalava M et al (2021) DBAASP v3: database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Res 49:D288–D297 34. Seebah S et al (2007) Defensins knowledgebase: a manually curated database and information source focused on the defensins family of antimicrobial peptides. Nucleic Acids Res 35:D265–D268 35. Di Luca M et al (2015) BaAMPs: the database of biofilm-active antimicrobial peptides. Biofouling 31:193–199 36. Hammami R, Zouhir A, Le Lay C, Ben Hamida J, Fliss I (2010) BACTIBASE second release: a database and tool platform for bacteriocin characterization. BMC Microbiol 10:22 37. Novković M, Simunić J, Bojović V, Tossi A, Juretić D (2012) DADP: the database of anuran defense peptides.
Bioinformatics 28: 1406–1407 38. Kang X et al (2019) DRAMP 2.0, an updated data repository of antimicrobial peptides. Sci Data 6:148 39. Whitmore L, Wallace BA (2004) The Peptaibol database: a database for sequences and structures of naturally occurring peptaibols. Nucleic Acids Res 32:D593–D594 40. Zhao X, Wu H, Lu H, Li G, Huang Q (2013) LAMP: a database linking antimicrobial peptides. PLoS One 8:e66557 41. Piotto SP, Sessa L, Concilio S, Iannelli P (2012) YADAMP: yet another database of antimicrobial peptides. Int J Antimicrob Agents 39:346–351 42. Hammami R, Ben Hamida J, Vergoten G, Fliss I (2009) PhytAMP: a database dedicated to antimicrobial plant peptides. Nucleic Acids Res 37(Database issue):D963–D968 43. Gómez EA, Giraldo P, Orduz S (2017) InverPep: a database of invertebrate antimicrobial peptides. J Glob Antimicrob Resist 8:13–17 44. Qureshi A, Thakur N, Kumar M (2013) HIPdb: a database of experimentally validated HIV inhibiting peptides. PLoS One 8:e54908 45. Li J, Qu X, He X, Duan L, Wu G, Bi D, Deng Z, Liu W, Ou HY (2012) ThioFinder: a web-based tool for the identification of thiopeptide gene clusters in DNA sequences. PLoS One 7(9):e45878 46. Wu H, Lu H, Huang J et al (2012) EnzyBase: a novel database for enzybiotic studies. BMC Microbiol 12(1):54 47. Mehta D, Anand P, Kumar V, Joshi A, Mathur D, Singh S, Tuknait A, Chaudhary K, Gautam SK, Gautam A, Varshney GC, Raghava GP (2014) ParaPep: a web resource for experimentally validated antiparasitic peptide sequences and their structures. Database 2014:bau051 48. Jhong JH, Chi YH, Li WC et al (2019) dbAMP: an integrated resource for exploring antimicrobial peptides with functional activities and physicochemical properties on transcriptome and proteome data. Nucleic Acids Res 47:D285–D297 49. Usmani SS, Kumar R, Kumar V, Singh S, Raghava GPS (2018) AntiTbPdb: a knowledgebase of anti-tubercular peptides. Database (Oxford) 2018:bay025 50. 
Brahmachary M, Krishnan SP, Koh JL, Khan AM, Seah SH, Tan TW, Brusic V, Bajic VB (2004) ANTIMIC: a database of antimicrobial sequences. Nucleic Acids Res 32(Database issue):D586–D589 51. Wang G (2015) Database resources dedicated to antimicrobial peptides. In: Chen C, Yan X, Jackson CR (eds) Antimicrobial resistance and food safety. Academic Press, Cambridge, Massachusetts, pp 365–384 52. Xiao X, Wang P, Lin WZ, Jia JH, Chou KC (2013) iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem 436:168–177 53. Waghu FH, Barai RS, Gurung P, Idicula-Thomas S (2016) CAMPR3: a database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res 44(D1):D1094–D1097 54. Wang P, Hu L, Liu G, Jiang N, Chen X, Xu J, Zheng W, Li L, Tan M, Chen Z, Song H, Cai YD, Chou KC (2011) Prediction of antimicrobial peptides based on sequence alignment and feature selection methods. PLoS One 6(4):e18476 55. Torrent M, Di Tommaso P, Pulido D, Nogués MV, Notredame C, Boix E, Andreu D (2012) AMPA: an automated web server for prediction of protein antimicrobial regions. Bioinformatics 28(1):130–131 56. Fernandes FC, Rigden DJ, Franco OL (2012) Prediction of antimicrobial peptides based on the adaptive neuro-fuzzy inference system application. Biopolymers 98(4):280–287 57. Mooney C, Haslam NJ, Holton TA, Pollastri G, Shields DC (2013) PeptideLocator: prediction of bioactive peptides in protein sequences. Bioinformatics 29(9):1120–1126 58. Ng XY, Rosdi BA, Shahrudin S (2015) Prediction of antimicrobial peptides based on sequence alignment and support vector machine-pairwise algorithm utilizing LZ-complexity. Biomed Res Int 2015:212715 59. Lee HT, Lee CC, Yang JR et al (2015) A large-scale structural classification of antimicrobial peptides. Biomed Res Int 2015:475062 60. Lin W, Xu D (2016) Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types.
Bioinformatics 32(24):3745–3752 61. Meher PK, Sahu TK, Saini V, Rao AR (2017) Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci Rep 7:42362 62. Bhadra P, Yan J, Li J, Fong S, Siu SWI (2018) AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 8:1697 63. Veltri D, Kamath U, Shehu A (2018) Deep learning improves antimicrobial peptide recognition. Bioinformatics 34:2740–2747 64. Agrawal P, Raghava GPS (2018) Prediction of antimicrobial potential of a chemically modified peptide from its tertiary structure. Front Microbiol 9:2551 65. Gull S, Shamim N, Minhas F (2019) AMAP: hierarchical multi-label prediction of biologically active and antimicrobial peptides. Comput Biol Med 107:172–181 66. Feng P, Wang Z, Yu X (2019) Predicting antimicrobial peptides by using increment of diversity with quadratic discriminant analysis method. IEEE/ACM Trans Comput Biol Bioinform 16:1309–1312 67. Chung CR, Kuo TR, Wu LC et al (2019) Characterization and identification of antimicrobial peptides with different functional activities. Brief Bioinform:bbz043. https://doi.org/10.1093/bib/bbz043 68. Gull S, Minhas FUAA (2020) AMP0: species-specific prediction of anti-microbial peptides using zero and few shot learning. IEEE/ACM Trans Comput Biol Bioinform. https://doi.org/10.1109/TCBB.2020.2999399 69. Tripathi V, Tripathi P (2020) Detecting antimicrobial peptides by exploring the mutual information of their sequences. J Biomol Struct Dyn 38:5037–5043 70. Yan J, Bhadra P, Li A, Sethiya P, Qin L, Tai HK, Wong KH, Siu SWI (2020) DeepAmPEP30: improve short antimicrobial peptides prediction with deep learning. Mol Ther Nucleic Acids 20:882–894 71.
Fu H, Cao Z, Li M et al (2020) ACEP: improving antimicrobial peptides recognition through automatic feature fusion and amino acid embedding. BMC Genomics 21:597 72. Kavousi K, Bagheri M, Behrouzi S et al (2020) IAMPE: NMR-assisted computational prediction of antimicrobial peptides. J Chem Inf Model 60:4691–4701 73. Santos-Junior CD, Pan S, Zhao XM et al (2020) Macrel: antimicrobial peptide screening in genomes and metagenomes. PeerJ 8:e10555 74. Youmans M, Spainhour JCG, Qiu P (2020) Classification of antibacterial peptides using long short-term memory recurrent neural networks. IEEE/ACM Trans Comput Biol Bioinform 17:1134–1140 75. Fingerhut L, Miller DJ, Strugnell JM et al (2020) Ampir: an R package for fast genome-wide prediction of antimicrobial peptides. Bioinformatics 36:5262–5263 76. Lawrence TJ, Carper DL, Spangler MK et al (2020) amPEPpy 1.0: A portable and accurate antimicrobial peptide prediction tool. Bioinformatics 37(14):2058–2060. https://doi.org/10.1093/bioinformatics/btaa917 77. Lertampaiporn S, Vorapreeda T, Hongsthong A et al (2021) Ensemble-AMPPred: robust AMP prediction and recognition using the ensemble learning method with a new hybrid feature for differentiating AMPs. Genes (Basel) 12:137 78. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28(1):235–242 79. MacDougall A et al (2020) UniRule: a unified rule resource for automatic annotation in the UniProt knowledgebase. Bioinformatics 36(17):4643–4648 80. Porto WF, Pires ÁS, Franco OL (2017) Antimicrobial activity predictors benchmarking analysis using shuffled and designed synthetic peptides. J Theor Biol 426:96–103 81. Othman M, Ratna S, Tewari A et al (2017) Classification and prediction of antimicrobial peptides using N-gram representation and machine learning. Proceedings of the 8th ACM international conference on bioinformatics, computational biology, and health informatics.
Boston, Massachusetts, USA: Association for Computing Machinery, 605 82. Mooney C, Haslam NJ, Pollastri G et al (2012) Towards the improved discovery and design of functional peptides: common features of diverse classes permit generalized prediction of bioactivity. PLoS One 7:e45012 83. Burdukiewicz M, Sidorczuk K, Rafacz D et al (2020) Proteomic screening for prediction and design of antimicrobial peptides with AmpGram. Int J Mol Sci 21:4310 84. Kaplan N, Morpurgo N, Linial M (2007) Novel families of toxin-like peptides in insects and mammals: a computational approach. J Mol Biol 369:553–566 85. Muller AT, Kaymaz AC, Gabernet G et al (2016) Sparse neural network models of antimicrobial peptide-activity relationships. Mol Inform 35:606–614 86. Schneider P, Muller AT, Gabernet G et al (2017) Hybrid network model for "deep learning" of chemical data: application to antimicrobial peptides. Mol Inform 36 87. Su X, Xu J, Yin Y et al (2019) Antimicrobial peptide identification using multi-scale convolutional network. BMC Bioinformatics 20:730 88. Tripathi S et al (2015) Antiviral activity of the human cathelicidin, LL-37, and derived peptides on seasonal and pandemic influenza A viruses. PLoS One 10:e0124706 89. Barlow PG et al (2011) Antiviral activity and increased host defense against influenza infection elicited by the human cathelicidin LL-37. PLoS One 6:e25333 90. Wang G (2012) Natural antimicrobial peptides as promising anti-HIV candidates. Curr Top Pept Protein Res 13:93–110 91. Wohlford-Lenane CL et al (2009) Rhesus theta-defensin prevents death in a mouse model of severe acute respiratory syndrome coronavirus pulmonary disease. J Virol 83:11385–11390 92. He M, Zhang H, Li Y, Wang G, Tang B, Zhao J, Huang Y, Zheng J (2018) Cathelicidin-derived antimicrobial peptides inhibit Zika virus through direct inactivation and interferon pathway. Front Immunol 9:722 93.
Yu Y et al (2020) Engineered human cathelicidin antimicrobial peptides inhibit Ebola virus infection. iScience 23:100999 94. Thakur N, Qureshi A, Kumar M (2012) AVPpred: collection and prediction of highly effective antiviral peptides. Nucleic Acids Res 40:W199–W204 95. Qureshi A, Thakur N, Tandon H, Kumar M (2014) AVPdb: a database of experimentally validated antiviral peptides targeting medically important viruses. Nucleic Acids Res 42:D1147–D1153 96. Chowdhury AS et al (2020) Better understanding and prediction of antiviral peptides through primary and secondary structure feature importance. Sci Rep 10:19260 97. Tyagi A et al (2019) PlantAFP: a curated database of plant-origin antifungal peptides. Amino Acids 51:1561–1568 98. Agrawal P et al (2018) In silico approach for prediction of antifungal peptides. Front Microbiol 9:323 99. Manavalan B et al (2018) AIPpred: sequence-based prediction of anti-inflammatory peptides using random forest. Front Pharmacol 9:276 100. Manavalan B et al (2018) PIP-EL: a new ensemble learning method for improved Proinflammatory peptide predictions. Front Immunol 9:1783 101. Usmani SS, Bhalla S, Raghava GPS (2018) Prediction of Antitubercular peptides from sequence information using ensemble classifier and hybrid features. Front Pharmacol 9:954 102. Gupta K, Singh S, van Hoek ML (2015) Short, synthetic cationic peptides have antibacterial activity against Mycobacterium smegmatis by forming pores in membrane and synergizing with antibiotics. Antibiotics (Basel) 4:358–378 103. Torres-Juarez F et al (2015) LL-37 immunomodulatory activity during Mycobacterium tuberculosis infection in macrophages. Infect Immun 83:4495–4503 104. Rao Muvva J et al (2019) Polarization of human monocyte-derived cells with vitamin D promotes control of Mycobacterium tuberculosis infection. Front Immunol 10:3157 105.
Rivas-Santiago B et al (2013) Activity of LL-37, CRAMP and antimicrobial peptide-derived compounds E2, E6 and CP26 against Mycobacterium tuberculosis. Int J Antimicrob Agents 41:143–148 106. Corrales-Garcia L et al (2013) Bacterial expression and antibiotic activities of recombinant variants of human beta-defensins on pathogenic bacteria and M. tuberculosis. Protein Expr Purif 89:33–43 107. Wong GC, O’Toole GA (2011) All together now: integrating biofilm research across disciplines. MRS Bull 36:339–342 108. O’Toole GA (2003) To build a biofilm. J Bacteriol 185:2687–2689 109. O’Toole GA (2011) Microtiter dish biofilm formation assay. J Vis Exp 47:2437 110. de la Fuente-Nunez C et al (2014) Broad-spectrum anti-biofilm peptide that targets a cellular stress response. PLoS Pathog 10:e1004152 111. de la Fuente-Nunez C et al (2012) Inhibition of bacterial biofilm formation and swarming motility by a small synthetic cationic peptide. Antimicrob Agents Chemother 56:2696–2704 112. Overhage J et al (2008) Human host defense peptide LL-37 prevents bacterial biofilm formation. Infect Immun 76:4176–4182 113. Chung EMC et al (2017) Komodo dragon-inspired synthetic peptide DRGN-1 promotes wound-healing of a mixed-biofilm infected wound. NPJ Biofilms Microbiomes 3:9 114. Duplantier AJ, van Hoek ML (2013) The human cathelicidin antimicrobial peptide LL-37 as a potential treatment for Polymicrobial infected wounds. Front Immunol 4:143 115. Dean SN, Bishop BM, van Hoek ML (2011) Susceptibility of Pseudomonas aeruginosa biofilm to alpha-helical peptides: D-enantiomer of LL-37. Front Microbiol 2:128 116. Dean SN, Bishop BM, van Hoek ML (2011) Natural and synthetic cathelicidin peptides with anti-microbial and anti-biofilm activity against Staphylococcus aureus. BMC Microbiol 11:114 117. Amer LS, Bishop BM, van Hoek ML (2010) Antimicrobial and antibiofilm activity of cathelicidins and short, synthetic peptides against Francisella.
Biochem Biophys Res Commun 396:246–251 118. Sharma A et al (2016) dPABBs: a novel in silico approach for predicting and designing anti-biofilm peptides. Sci Rep 6:21839 119. Fallah F et al (2020) BIPEP: sequence-based prediction of biofilm inhibitory peptides using a combination of NMR and physicochemical descriptors. ACS Omega 5: 7290–7297 120. Gupta S et al (2016) Prediction of biofilm inhibiting peptides: an in silico approach. Front Microbiol 7:949 121. Rajput A, Thakur A, Sharma S, Kumar M (2018) aBiofilm: a resource of anti-biofilm agents and their potential implications in targeting antibiotic drug resistance. Nucleic Acids Res 46:D894–D900 122. Bishop BM, Juba ML, Devine MC, Barksdale SM, Rodriguez CA, Chung MC, Russo PS, Vliet KA, Schnur JM, van Hoek ML (2015) Bioprospecting the American alligator (Alligator mississippiensis) host defense peptidome. PLoS One 10:e0117394 123. Cherkasov A, Hilpert K, Jenssen H, Fjell CD, Waldbrook M, Mullaly SC, Volkmer R, Hancock RE (2009) Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic-resistant superbugs. ACS Chem Biol 4(1):65–74 124. Wang X, Mishra B, Lushnikova T, Narayana JL, Wang G (2018) Amino acid composition determines peptide activity Spectrum and hot-spot-based Design of Merecidin. Adv Biosyst 2(5):1700259 125. Gupta S et al (2013) In silico approach for predicting toxicity of peptides and proteins. PLoS One 8:e73957 126. Gupta S et al (2015) Peptide toxicity prediction. Methods Mol Biol 1268:143–157 127. Gordon YJ et al (2005) Human cathelicidin (LL-37), a multifunctional peptide, is expressed by ocular surface epithelia and has potent antibacterial and antiviral activity. Curr Eye Res 30:385–394 128. Wang W et al (2017) Antimicrobial peptide LL-37 promotes the viability and invasion of skin squamous cell carcinoma by upregulating YB-1. Exp Ther Med 14:499–506 129. 
Oliveira-Bravo M et al (2016) LL-37 boosts immunosuppressive function of placenta-derived mesenchymal stromal cells. Stem Cell Res Ther 7:189 130. Hosseini Z et al (2020) The human cathelicidin LL-37, a defensive peptide against rotavirus infection. Int J Pept Res Ther 26:911–919 131. Haisma EM et al (2014) LL-37-derived peptides eradicate multidrug-resistant Staphylococcus aureus from thermally wounded human skin equivalents. Antimicrob Agents Chemother 58:4411–4419 132. Barksdale SM, Hrifko EJ, van Hoek ML (2017) Cathelicidin antimicrobial peptide from Alligator mississippiensis has antibacterial activity against multi-drug resistant Acinetobacter baumanii and Klebsiella pneumoniae. Dev Comp Immunol 70:135–144 133. Barksdale SM, Hrifko EJ, Chung EM, van Hoek ML (2016) Peptides from American alligator plasma are antimicrobial against multi-drug resistant bacterial pathogens including Acinetobacter baumannii. BMC Microbiol 16:189 134. Hitt SJ, Bishop BM, van Hoek ML (2020) Komodo-dragon cathelicidin-inspired peptides are antibacterial against carbapenem-resistant Klebsiella pneumoniae. J Med Microbiol 69:1262–1272 135. de Latour FA et al (2010) Antimicrobial activity of the Naja atra cathelicidin and related small peptides. Biochem Biophys Res Commun 396:825–830 136. van Dijk A et al (2009) Identification of chicken cathelicidin-2 core elements involved in antibacterial and immunomodulatory activities. Mol Immunol 46:2465–2473 137. Nizet V et al (2001) Innate antimicrobial peptide protects the skin from invasive bacterial infection. Nature 414:454–457 138. Gao J et al (2020) Design of a sea Snake Antimicrobial Peptide Derivative with therapeutic potential against drug-resistant bacterial infection. ACS Infect Dis 6:2451–2467 139. Win TS et al (2017) HemoPred: a web server for predicting the hemolytic activity of peptides. Future Med Chem 9:275–291 140. Chaudhary K et al (2016) A web server and Mobile app for computing hemolytic potency of peptides.
Sci Rep 6:22843 141. Timmons PB, Hewage CM (2020) HAPPENN is a novel tool for hemolytic activity prediction for therapeutic peptides which employs neural networks. Sci Rep 10:10869 142. Oren Z et al (1999) Structure and organization of the human antimicrobial peptide LL-37 in phospholipid membranes: relevance to the molecular basis for its non-cell-selective activity. Biochem J 341(Pt 3):501–513 143. Ciornei CD, Sigurdardottir T, Schmidtchen A, Bodelsson M (2005) Antimicrobial and chemoattractant activity, lipopolysaccharide neutralization, cytotoxicity, and inhibition by serum of analogs of human cathelicidin LL-37. Antimicrob Agents Chemother 49:2845–2850 144. Al-Adwani S et al (2020) Studies on citrullinated LL-37: detection in human airways, antibacterial effects and biophysical properties. Sci Rep 10:2376 145. Rajasekaran G, Kim EY, Shin SY (2017) LL-37-derived membrane-active FK-13 analogs possessing cell selectivity, anti-biofilm activity and synergy with chloramphenicol and anti-inflammatory activity. Biochim Biophys Acta Biomembr 1859:722–733 146. Luo Y et al (2017) The naturally occurring host defense peptide, LL-37, and its truncated mimetics KE-18 and KR-12 have selected biocidal and Antibiofilm activities against Candida albicans, Staphylococcus aureus, and Escherichia coli in vitro. Front Microbiol 8:544 147. Koro C et al (2016) Carbamylated LL-37 as a modulator of the immune response. Innate Immun 22:218–229 148. Murakami M et al (2004) Postsecretory processing generates multiple cathelicidins for enhanced topical antimicrobial defense. J Immunol 172:3070–3077 149. Chung MC, Dean SN, van Hoek ML (2015) Acyl carrier protein is a bacterial cytoplasmic target of cationic antimicrobial peptide LL-37. Biochem J 470:243–253 150. Limoli DH et al (2014) Cationic antimicrobial peptides promote microbial mutagenesis and pathoadaptation in chronic infections. PLoS Pathog 10:e1004083 151.
Limoli DH, Wozniak DJ (2014) Mutagenesis by host antimicrobial peptides: insights into microbial evolution during chronic infections. Microb Cell 1:247–249 152. Oikawa K et al (2018) Screening of a cell-penetrating peptide library in Escherichia coli: relationship between cell penetration efficiency and cytotoxicity. ACS Omega 3:16489–16499 153. Mishra B, Wang G (2012) Ab initio design of potent anti-MRSA peptides based on database filtering technology. J Am Chem Soc 134(30):12426–12429 154. Mishra B, Lakshmaiah Narayana J, Lushnikova T, Wang X, Wang G (2019) Low cationicity is important for systemic in vivo efficacy of database-derived peptides against drug-resistant Gram-positive pathogens. Proc Natl Acad Sci U S A 116(27):13517–13522 155. Beheshtirouy S, Mirzaei F, Eyvazi S, Tarhriz V (2020) Recent advances on therapeutic peptides for breast cancer treatment. Curr Protein Pept Sci. https://doi.org/10.2174/1389203721999201117123616 156. Marqus S, Pirogova E, Piva TJ (2017) Evaluation of the use of therapeutic peptides for cancer treatment. J Biomed Sci 24:21 157. Wang G (2020) Bioinformatic analysis of 1000 amphibian antimicrobial peptides uncovers multiple length-dependent correlations for peptide design and prediction. Antibiotics (Basel) 9(8):491 158. Wang G, Watson KM, Peterkofsky A, Buckheit RW Jr (2010) Identification of novel human immunodeficiency virus type 1-inhibitory peptides based on the antimicrobial peptide database. Antimicrob Agents Chemother 54(3):1343–1346 159. Menousek J, Mishra B, Hanke ML, Heim CE, Kielian T, Wang G (2012) Database screening and in vivo efficacy of antimicrobial peptides against methicillin-resistant Staphylococcus aureus USA300. Int J Antimicrob Agents 39(5):402–406 160. Dong Y, Lushnikova T, Golla RM, Wang X, Wang G (2017) Small molecule mimics of DFTamP1, a database designed antistaphylococcal peptide. Bioorg Med Chem 25(3):864–869 161.
Chapter 2

Tools for Characterizing Proteins: Circular Variance, Mutual Proximity, Chameleon Sequences, and Subsequence Propensities

Mihaly Mezei

Abstract

For the characterization of various aspects of protein structures, four useful concepts are discussed: chameleon sequences, circular variance, mutual proximity, and a subsequence-based foldability score. These concepts were used in estimating the foldability of globular, intrinsically disordered, and fold-switching proteins; in characterizing protein–protein interfaces; in quantifying sphericity; in helping to improve protein–protein docking scores; and in estimating the effect of mutations on stability. A conjecture about the Achilles' heel of proteins is presented as well.

Key words Circular variance, Mutual proximity, Chameleon sequence, Amino acid propensity, Foldability score

1 Introduction

The study of proteins is made difficult by the fact that their structure emerges from a polymer sequence that, at first glance, appears to be random. However, the chance that a sequence chosen randomly from the huge space of possible amino acid (AA) sequences results in a well-folded protein is virtually zero [1], which suggests that beyond the seeming randomness there must be a plethora of information. On the other hand, the existence of chameleon sequences [2] and fold-switching sequences [3] points to the limits of what can be inferred from sequence alone. One aim of this chapter is to discuss some of the works that attempt to tease out from a sequence the information about foldability.

The other difficulty of dealing with protein structures is their irregular shape. While they are often able to form crystals, this is not always the case, and, in any event, the regularity required for the formation of periodic structures is achieved by filling the empty space, mostly with water.
Characterizing irregular shapes is not a simple geometric problem, and the other aim of this chapter is to show how the concept of circular variance (CV) can help in this endeavor.

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

Most proteins function as part of a complex. The interface between proteins thus has to be able to maintain the complex; being able to characterize the interface is necessary for understanding its formation. One aspect of characterizing the interface is the identification of contact pairs, usually done using distance thresholds. A further aim of this chapter is to show the usefulness of the concept that defines contacts as mutually proximal pairs of atoms. In addition, CV was also shown to be helpful for characterizing geometric properties of the interfaces.

2 Materials and Methods

2.1 Data Sets Used

Most studies described in this chapter relied on the Protein Data Bank (PDB) [4]. The PDB is the repository of protein structures, carefully annotated for structural features (e.g., helix or sheet) as well as for measures of the reliability of the data (i.e., temperature factors and resolution). Unresolved regions are also indicated. The PDB provided a file called ss.txt that lists the sequences of 394869 protein chains (domains) of the structures in the PDB as of 2018. This set is referred to as the PDB set and was used for the chameleon search and for the establishment of various statistics. For the test of foldability predictions, a new set of 40351 chains was obtained from structures deposited after 2018, i.e., not included in the PDB set. This set is referred to as the new PDB set.
Both the PDB and the new PDB set were filtered to ensure that no pair in the set has more than 50% sequence identity; the filtered sets are referred to as the PDB50 and newPDB50 sets, containing 35667 and 4735 sequences, respectively.

For the study of protein–protein interfaces [5], 1172 protein complexes were selected from the PDB, referred to as the PDB-C set. For each complex, only the pair with the largest number of contacts was used, as defined by the biological oligomer annotation.

The sequences of intrinsically disordered proteins (IDPs) were obtained from the DisProt [6] data set, the list of fold-switching (F-S) sequences was obtained from the paper of Porter and Looger [3], and the data on mutations were obtained from the data set compiled by Pucci et al. [7].

For the test of the contact potential, protein complexes were selected from the DOCKGROUND [8] and ZLAB [9] benchmark sets. Docked protein complexes were generated by ClusPro [10] and PatchDock [11].

2.2 Circular Variance

The circular variance (CV) of a set of angles [12] is a measure of the spread of the angles (in dial plots it is shown as a vertical bar on the y-axis inside the dial). For a set of n angles $\{\varphi_i\}$, CV is defined as:

$$CV = 1 - \frac{1}{n}\sqrt{\left(\sum_{i=1}^{n}\sin\varphi_i\right)^{2} + \left(\sum_{i=1}^{n}\cos\varphi_i\right)^{2}} \tag{1}$$

Since $\sin\varphi_i$ and $\cos\varphi_i$ are the x and y components of a unit vector $\vec{e}_i$, the definition of CV generalizes to three dimensions as:

$$CV = 1 - \frac{1}{n}\left|\sum_{i=1}^{n}\vec{e}_i\right| \tag{2}$$

For a set of vectors of arbitrary length $\vec{v}_i$, the weighted circular variance $CV_w$ can be defined [13] as:

$$CV_w = 1 - \left|\sum_{i=1}^{n}\vec{v}_i\right| \Big/ \sum_{i=1}^{n}\left|\vec{v}_i\right| \tag{3}$$

The calculation of CV and CVw has been implemented in the program Simulaid [14], available at the URL https://mezeim01.u.hpc.mssm.edu/simulaid.
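As an illustration of Eqs. (1)–(3), the following is a minimal sketch in Python; it is not the Simulaid implementation. The function names, the choice of degrees for the angles, and the convention of skipping a point that coincides with the reference point are assumptions made for this example only.

```python
import math

def circular_variance(angles_deg):
    """Eq. (1): CV of a set of angles given in degrees (0 = no spread, 1 = full spread)."""
    n = len(angles_deg)
    sum_sin = sum(math.sin(math.radians(a)) for a in angles_deg)
    sum_cos = sum(math.cos(math.radians(a)) for a in angles_deg)
    return 1.0 - math.hypot(sum_sin, sum_cos) / n

def cv_3d(center, points, weighted=False):
    """Eq. (2), or Eq. (3) if weighted=True, for the vectors from `center` to `points`."""
    sx = sy = sz = norm = 0.0
    for x, y, z in points:
        dx, dy, dz = x - center[0], y - center[1], z - center[2]
        length = math.sqrt(dx * dx + dy * dy + dz * dz)
        if length == 0.0:
            continue  # assumption: a point coinciding with the center is skipped
        scale = 1.0 if weighted else 1.0 / length  # unit vectors for the unweighted CV
        sx += dx * scale
        sy += dy * scale
        sz += dz * scale
        norm += length if weighted else 1.0
    return 1.0 - math.sqrt(sx * sx + sy * sy + sz * sz) / norm
```

With this sketch, a point surrounded evenly on all sides gets a CV near 1, while a point that sees all other points in roughly one direction gets a CV near 0, which is the property the later subheadings rely on to distinguish buried from exposed atoms.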
2.3 Mutual Proximity

Mutual proximity is used to provide a parameter-free definition of contacts between two proteins [15]. Figure 1 demonstrates the relation between mutual proximity and contact: red arrows are placed between the two mutually proximal atoms. Note that the concept was also found to be useful in designing Monte Carlo moves satisfying microscopic reversibility [16].

Fig. 1 Illustration of the relation between mutual proximity and contact. Red two-headed arrows are drawn between mutually proximal pairs of atoms

2.4 Definition of Surface and Interface Atoms and their Properties

Surface atoms were defined by two criteria: (a) the exposed accessible surface fraction exceeds 3% and (b) the CV calculated with respect to the rest of the protein atoms is less than 0.8. Since the crystal structures did not contain hydrogen coordinates, only heavy atoms were considered. The CV filter was needed because some proteins contained cavities large enough for internal atoms to have significant accessible surface.

The ruggedness indicator of a surface atom i, RG(i), is defined as:

$$RG(i) = \sum_{k:\ |\vec{r}_k - \vec{r}_i| < R_N} \frac{\left|CV(i) - CV(k)\right|}{\left|N_n(i)\right|} \tag{4}$$

where $\vec{r}_k$ is the position of a surface neighbor atom k of atom i (i.e., a surface atom within the distance $R_N$) and $N_n(i)$ is the number of surface neighbors of atom i.

The assignment of interface atoms relied on the mutual proximity criterion: an atom is considered to be on the interface (a) if it forms a contact (i.e., it is on the list of mutually proximal atom pairs) or (b) if it is within 4 Å of a contact atom.

The contact propensity $PR_{i,j}$ of AA pair {i,j} is defined as:

$$PR_{i,j} = \frac{N_{i,j}}{\sum_{i,j=1}^{20} N_{i,j}} \Big/ \left[P_i P_j \left(2 - \delta_{i,j}\right)\right] \tag{5}$$

where $P_i$ is the probability of AA i being on the surface, $N_{i,j}$ is the number of {i,j} pairs found in the data set, and $\delta_{i,j}$ is the Kronecker delta.
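The mutual proximity criterion of Subheading 2.3 can be illustrated with a short Python sketch. It assumes that two atoms, one from each protein, are mutually proximal when each is the nearest atom of the partner protein to the other; the chapter itself does not spell the definition out, so this reading, the function name, and the brute-force search are assumptions for illustration rather than the implementation of ref. [15].

```python
import math

def mutually_proximal_pairs(atoms_a, atoms_b):
    """Return index pairs (i, j) such that atoms_b[j] is the atom of B nearest to
    atoms_a[i] AND atoms_a[i] is the atom of A nearest to atoms_b[j].
    Brute force, O(len(A) * len(B)); no distance threshold is needed."""
    def nearest(point, others):
        return min(range(len(others)), key=lambda k: math.dist(point, others[k]))

    nearest_in_b = [nearest(p, atoms_b) for p in atoms_a]
    nearest_in_a = [nearest(q, atoms_b and atoms_a) for q in atoms_b]
    return [(i, j) for i, j in enumerate(nearest_in_b) if nearest_in_a[j] == i]

# Toy coordinates (hypothetical): only the first atoms of A and B are mutually proximal.
protein_a = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
protein_b = [(1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
pairs = mutually_proximal_pairs(protein_a, protein_b)
```

Under this reading, interface atoms as defined above would be the atoms appearing in `pairs` plus any atom within 4 Å of one of them.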
2.5 Contact Score and Docking Score Correction

With the contact propensity of residue pairs {i,j}, $PR_{i,j}$, known, a contact score S and its normalized variant $S_N$ can be defined as:

$$S = \sum_{\{i,j\}} kT \ln PR_{i,j}\,; \qquad S_N = S / N_{ct} \tag{6}$$

where $N_{ct}$ is the number of contacts in the complex. A putative improvement to a docking score $S_D$ can be defined as:

$$S_{D'} = S_D - w\,S \tag{7}$$

where w is a scaling parameter. Analogous corrections can be made with $S_N$.

2.6 Foldability Scores

Based on the propensities of various p-tuples in folded proteins of known structure, two kinds of scores were defined: $SC_p$ when the statistics of p-tuples are adequate (i.e., p ≤ 5) and $SSC_p$ when the statistics are sparse. For a sequence of length N + p − 1,

$$SC_p = \frac{1}{N}\sum_{i=0}^{N} \ln\frac{PN_i}{PR_i} \tag{8}$$

where $PN_i$ and $PR_i$ are the probabilities of finding the i-th p-tuple in the PDB50 set and in the RANW set (randomly generated sequences that follow the AA propensities; see Subheading 3.6), respectively. If the i-th p-tuple was missing from the PDB50 set, $PN_i$ was set to $0.5/20^p$. Clearly, for $SC_p$ to be meaningful, there have to be ample statistics on the p-tuples. For situations where this is not the case, the simpler score $SSC_p$ was defined:

$$SSC_p = \frac{1}{N}\sum_{i=0}^{N} n_i \tag{9}$$

where $n_i$ is the number of occurrences of the p-tuple i in the experimental set.

3 Applications

3.1 Chameleon Sequences

Chameleon sequences, i.e., identical subsequences that adopt different secondary structures in different proteins, provide an important indication of the limits of predicting the secondary structure of a protein from its sequence. The number and length of known chameleon sequences grew significantly as the number of known protein structures grew: in 1998 the longest chameleons had seven residues and only three such chameleons were found [17], while the recent search [2] found chameleons up to length 11 (not considering some longer ones found on the same protein in different structures) and many more of the shorter ones. This suggests that the issue of chameleon propensities can now be treated with reasonable precision since there are enough data to reduce statistical fluctuations.
The recent work used all the sequences available at the time the project was started. While this involved many similar sequences, use of a reduced set was likely to result in missing chameleons. Thus, it was important to develop an efficient algorithm since the complexity of the brute-force approach is O(N^2). The algorithm used was based on sorting at various steps, making its complexity O(N log N) [2]; an implementation is available at the URL https://mezeim01.u.hpc.mssm.edu/cham.

Table 1 gives the chameleon propensities of each of the 20 AAs (the % of residues in chameleons/% AA) for chameleons of length 5–8, the lengths with ample statistics. Two interesting observations can be made: (1) the propensities vary significantly with length for some of the AAs and (2) the residues with the highest chameleon propensity (ALA, VAL, LEU) also have higher than average AA propensities. This raises the question whether higher than average chameleon propensities are simply a consequence of the high AA propensities, since the probability of finding the same AA in two different sequences is proportional to the square of its propensity. To see whether this is the case, the Spearman rank correlation was also calculated between the AA propensities and the chameleon propensities of the chameleons of different length, also shown in Table 1.

Table 1 Chameleon propensities of chameleon sequences

AA                    Length 5   Length 6   Length 7   Length 8   % AA
ALA                   1.0        1.5        1.6        1.6        8.0
CYS                   0.8        0.3        0.2        0.2        1.4
ASP                   0.6        0.5        0.5        0.4        5.6
GLU                   0.8        1.0        1.1        1.0        6.6
PHE                   1.4        0.9        0.8        0.8        3.9
GLY                   0.6        0.6        0.6        0.6        7.4
HIS                   0.7        0.3        0.3        0.8        2.7
ILE                   1.6        1.5        1.3        1.2        5.6
LYS                   0.8        0.8        0.9        1.1        5.9
LEU                   1.2        1.8        1.9        1.6        9.0
MET                   1.0        0.5        0.5        0.5        2.3
ASN                   0.7        0.4        0.4        0.6        4.2
PRO                   0.2        0.1        0.1        0.1        4.6
GLN                   0.9        0.7        0.6        0.9        3.8
ARG                   1.0        1.0        1.0        0.7        5.2
SER                   0.9        0.8        0.8        1.1        6.3
THR                   1.3        1.0        0.9        0.9        5.6
VAL                   1.8        2.1        2.0        2.0        7.0
TRP                   1.0        0.3        0.3        0.6        1.3
TYR                   1.6        0.9        0.8        0.6        3.4
# of chameleons       187,803    36,669     1822       79
Corr. with AA prop.   0.20       0.22       0.35       0.08
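The idea of replacing the O(N^2) pairwise comparison by sorting can be sketched in a few lines of Python. This is only an illustration, not the algorithm of ref. [2]: it assumes that each chain comes with a per-residue secondary-structure string using 'H' for helix and 'E' for sheet, and it counts a k-residue subsequence as a chameleon when it occurs fully helical in one place and fully in a sheet in another. Both the input format and this working definition are assumptions.

```python
from itertools import groupby

def find_chameleons(chains, k):
    """chains: list of (sequence, secondary_structure) string pairs of equal length.
    Returns the sorted k-mers that occur both as all-helix ('H') and all-sheet ('E')."""
    records = []
    for seq, ss in chains:
        for i in range(len(seq) - k + 1):
            window = ss[i:i + k]
            if window == "H" * k or window == "E" * k:
                records.append((seq[i:i + k], window[0]))
    records.sort()  # O(M log M): identical k-mers end up next to each other
    chameleons = []
    for kmer, group in groupby(records, key=lambda r: r[0]):
        if len({state for _, state in group}) == 2:  # seen as helix and as sheet
            chameleons.append(kmer)
    return chameleons

# Toy input (hypothetical sequences and assignments).
toy_chains = [("AAVLKAVLA", "CHHHHHHHC"), ("GGAVLKAGG", "CCEEEEECC")]
found = find_chameleons(toy_chains, 5)
```

After sorting, a single linear pass over the grouped k-mers is enough, so the cost is dominated by the sort rather than by comparing every subsequence with every other one.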
The correlations are indeed positive, but quite small, suggesting that there is more to chameleon propensities than just the effect of the AA propensity.

Furthermore, chameleon propensities were also found not to be a significant factor in fold-switching proteins [18]. Looking at chameleons of length 5, 6, and 7, the propensities of chameleons in fold-switching proteins were only slightly larger: 15.6 vs. 13.7, 1.2 vs. 0.8, and 0.1 vs. 0.0, for lengths 5, 6, and 7, respectively.

3.2 Shape and Smoothness of the Protein–Protein Interface

The atoms forming the interface in the complexes of the PDB-C set were examined by asking two questions: (a) is the surface at the interface more or less smooth than elsewhere and (b) does the interface protrude relative to the rest of the surface, or the opposite?

The smoothness (or lack thereof) was measured by the ruggedness indicator RG defined by Eq. (4) above. The average RG values of the interface atoms and of all surface atoms in the PDB-C set were found to be 0.069 (s.d. = 0.007) and 0.066 (s.d. = 0.005), respectively. Since this difference is significant at the level p = 0.001, it can be concluded that interaction surfaces are less smooth than the rest.

As for the shape of the interface, it can be characterized by comparing the CV values of the atoms in the interface and of the atoms elsewhere. Again using the PDB-C set, the average CV values of atoms at the interface and of atoms at the surface were found to be 0.454 (s.d. = 0.044) and 0.464 (s.d. = 0.029), respectively. While this difference is quite small, it is significant at the level p = 0.001 according to the Student t test. Since a lower CV corresponds to a more exposed position, we can conclude that the interaction surface is likely to be protruding, even if only slightly.

3.3 Contact Statistics and its Application

Correlated mutations have been proven to be an important tool in analyzing protein–protein associations [19].
This suggests that the residue pairs in contact are far from random, and that there are specific propensities of residue pairs to be in contact at protein–protein interfaces. Indeed, the analysis of protein complexes in the PDB-C set showed large variations in the propensities of AA pairs to be in contact. Table 2 presents the propensities $PR_{i,j}$ of each AA pair to be in contact. The large variations in these contact propensities indicate the importance of these data.

The contact propensities were used to help improve protein–protein docking (PPD) results. When attempting to find the interface between two proteins known to form a complex, PPD algorithms provide a number of putative solutions, most of them looking quite reasonable. The problem is that the PPD results depend strongly on small details of the conformation: when the monomers are in the conformations that they adopt in the complex (i.e., redocking), the top-scoring conformation is the experimentally known one in most cases. However, if the monomer conformations are obtained independently of the complex structure, it is very rare that the top-scoring conformation is the experimental one, suggesting that there is room for improvement.

The first test of the contact potentials defined by Eq. (6) above used 18 complexes where the ensemble of docked models contained a model close to the crystal structure. The first question was how many of the models have better contact scores than the crystal structure and how many of them have a better score than the model close to the crystal structure.
If the answer is zero for both, then this contact potential is fully capable of ranking the docked poses. At the other end of the spectrum, if the answer is 50% or more, this contact potential has no relevance. Table 3 shows the contact score comparison between redocked models generated by ClusPro and the crystal structure and the native-like model. The average percentage is 33.7 and 43.7 for S and SN, respectively, i.e., better than 50%, albeit not by much. The fact that S performs better than SN suggests that the number of contacts is an important contribution to the score. Also, the fact that in four out of 18 complexes S gave the best score to the crystal structure suggests that the performance is better than the 33.7% average indicates.

Table 2 Contact pair propensities normalized by surface propensities
[triangular 20 × 20 matrix; the row/column layout is not recoverable, values are listed in extraction order]

GLY ALA VAL LEU ILE PHE TRP TYR PRO MET SER THR CYS ASN HIS GLN ASP GLU LYS ARG

0.65 0.31 0.59 0.59 0.68 0.77 1.19 1.00 1.04 0.61 0.46 0.55 1.55 0.77 2.07 0.68 0.30 0.37 0.38 1.23

0.92 0.32 0.21 0.47 0.47 0.46 1.42 0.92 0.65 0.45 0.54 0.73 1.03 1.99 1.68 0.99 0.81 0.52 0.70 1.04 0.45 0.85 0.64 1.20 1.12 0.73 1.09 0.90 0.89 1.17 1.01 1.56 1.78 1.73 1.09 1.46 1.36 0.76 0.41 0.39 0.37 0.80 0.69 0.84 1.60 0.68 0.45 1.64 1.24 1.59 1.73 2.51 2.60 1.29 1.13 0.71 0.70 0.94 1.00 1.52 1.22 1.25 0.88 0.39 1.28 0.87 2.35 1.23 2.99 2.30 2.56 0.96 0.84 0.90 1.46 1.81 1.39 1.31 1.82 1.37 3.44 3.05 1.37 1.19 1.00 0.64 2.21 4.37 1.48 0.63 0.26 0.54 0.64 1.07 0.40 3.20 2.92 1.18 1.23 4.28 0.45 0.64 0.79 0.99 0.61 1.05 1.50 0.87 3.41 2.17 3.23 2.25 1.70 0.95 7.40 3.80 1.36 2.69 2.75 3.54 1.85 9.12 3.24 10.0

3.65 0.79 0.62 0.96 1.25 1.56 0.79 0.75 1.10 0.93 3.68 0.81 0.50 0.77 0.82 0.56 0.77 0.68 0.35 0.41 0.58 0.86 0.53 0.46 0.60 1.07 0.99 0.77 0.34 0.62 1.54 0.92 0.47 0.17 0.91 2.30 1.56 17.2 1.35 0.71 0.62 0.76 0.65 1.37 1.02 1.32 0.52 1.32 2.30 1.89 2.78 1.91 0.49 1.46 0.63 1.17 2.88 1.12 0.31 0.31 1.67 1.16 0.21 0.58 0.47 0.86

Table 3 Contact score comparison between redocked models generated by ClusPro and the crystal structure and the native-like model

PDB ID   # of models   # of models beating    # of models beating     # of models beating      RMSD of the
                       the crystal S score    the crystal SN score    the native-like S score  best model
4ODS     30            0     0%               4     13%               4                        2.0
4ONL     19            4     21%              4     21%               4                        2.8
4POZ     30            1     3%               12    40%               0                        4.3
2QKO     24            4     16%              3     13%               15                       3.6
4QVF     7             3     42%              4     57%               3                        4.7
4UHP     16            5     31%              7     43%               0                        4.8
4X7S     25            0     0%               7     28%               0                        4.6
4YII     20            1     5%               4     20%               4                        6.0
4YON     30            10    33%              10    33%               0                        4.0
4Z95     25            0     0%               13    52%               13                       5.1
4GUZ     24            11    45%              7     29%               2                        3.4
4I4N     15            6     40%              8     53%               0                        4.7
4OFW     30            28    93%              27    90%               12                       8.6
4PGG     30            0     0%               0     0%                1                        6.0
4PVC     30            5     16%              5     16%               6                        5.8
4R1N     26            24    92%              22    84%               7                        5.6
4WOY     30            21    70%              30    100%              19                       49.6
4WUM     30            30    100%             29    96%               9                        4.5

The crucial test is whether, using Eqs. (6) and (7), the contact score can improve the ranking given by the docking servers. Table 4 shows, using different weight factors w, the rescoring results for unbound docking ensembles, i.e., where the docking started from monomer structures obtained independently of the complex structure. Note that for most unbound docking ensembles, there was no structure close to the experimental complex structure, hence the small number of tests shown.

Table 4 Rescoring results for unbound docking ensembles

         Best model                     # of models beating the model with the best RMSD
                                        using the correction factor w below
PDB ID   RMSD    Rank   Software        0.0   1.0   2.0   5.0   10.0   20.0   50.0
1DE4     5.5     8      ClusPro         7     7     7     7     7      7      8
1E6E     9.2     4      ClusPro         3     2     2     2     2      1      0
1E6J     6.0     17     ClusPro         16    17    16    16    14     14     14
1HIA     7.8     16     ClusPro         15    15    15    15    16     16     15
1HIA     8.5     3      PatchDock       2 0 0
1MAH     7.9     2      ClusPro         1     1     1     1     2      3      8
1MLC     5.4     19     ClusPro         18    19    19    23    24     23     22
1N8O     10.0    7      ClusPro         6     6     6     5     6      7      6
3MXW     2.2     6      ClusPro         5     4     3     1     1      1      1
3SIC     5.7     2      ClusPro         1     1     0     0     0      0      0
[the column positions of the three PatchDock entries and of the additional entries for w = 100.0 and 200.0 (0 0 0 0) are not recoverable]

While the original rank of the experimental-like structure was not always raised to one, in several cases it was improved. More importantly, it did not worsen the rank.
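The bookkeeping behind Eqs. (6) and (7) can be sketched as follows. This is not the Rescore program: the function names are hypothetical, kT is set to 1 for convenience, the propensity values are made up for the example (they are not the values of Table 2), and Eq. (7) is read here with a minus sign, which presumes a docking score for which lower is better. For a score where higher is better, the sign of the correction would have to be reversed.

```python
import math

def contact_scores(contacts, propensity, kT=1.0):
    """Eq. (6): S = sum over the contact pairs of kT ln PR_ij, and S_N = S / N_ct."""
    s = sum(kT * math.log(propensity[tuple(sorted(pair))]) for pair in contacts)
    return s, s / len(contacts)

def corrected_docking_score(s_d, s, w):
    """Eq. (7), read here as S_D' = S_D - w S (sign convention is an assumption)."""
    return s_d - w * s

# Made-up propensities for illustration only; NOT the values of Table 2.
propensity = {("ARG", "ASP"): 2.0, ("LEU", "LEU"): 1.5, ("GLY", "PRO"): 0.5}
contacts = [("ASP", "ARG"), ("LEU", "LEU")]  # residue pairs in contact in one model

s, s_n = contact_scores(contacts, propensity)
rescored = corrected_docking_score(-10.0, s, w=2.0)
```

Applying `corrected_docking_score` to every model of an ensemble and re-sorting the models gives the kind of re-ranking that Table 4 reports for different values of w.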
The calculation of the docking score adjustment of ClusPro and PatchDock docked ensembles has been implemented in the program Rescore. It is available at the URL https://mezeim01.u.hpc.mssm.edu/rescore.

3.4 Characterization of Sphericity

When comparing regular geometric shapes like ellipsoids, it is a simple matter to measure their deviation from spherical. For objects that are irregular, like a globular protein, the answer is far from simple. It turned out that CV could help with this issue as well. The idea is to calculate the CV of each atom with respect to all the others and compare the distribution of the CV values with the distribution of CV values calculated from points uniformly distributed in a sphere [20].

The first step is the calculation of the reference distributions of the CVs and CVws, i.e., their distributions for points within a sphere. To that effect, 1,600,000 uniformly distributed points were generated by a Monte Carlo procedure. However, direct comparison of the reference distribution function with the distribution function of the CV values of a protein is not likely to give a meaningful answer: most globular proteins (protein domains) have far fewer atoms than the number of points used in the reference distribution, so, unlike the reference distribution, the protein distribution would be either too noisy (if the bin sizes are small) or too crude (if the bin sizes are large).

To see which properties of the distribution can be used to characterize the extent of distortion from the spherical, the point set used for the reference distribution was progressively scaled along one direction to produce points within progressively more elongated ellipsoids. Next, various properties of the distribution were calculated and their ability to correlate properly with the degree of distortion was tested.
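The numerical experiment just described can be mimicked on a small scale. The sketch below is not the CVDISTR program: it uses 300 points instead of 1,600,000, rejection sampling as the Monte Carlo procedure, and the mean of CV squared as one example of a power-sum descriptor. All names and parameter values are assumptions chosen to keep the example short.

```python
import math
import random

def points_in_sphere(n, rng):
    """n points uniformly distributed inside the unit sphere (rejection sampling)."""
    points = []
    while len(points) < n:
        p = (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
        if p[0] ** 2 + p[1] ** 2 + p[2] ** 2 <= 1.0:
            points.append(p)
    return points

def cv_of_each_point(points):
    """CV (Eq. 2) of every point with respect to all the other points; O(n^2)."""
    values = []
    for i, (cx, cy, cz) in enumerate(points):
        sx = sy = sz = 0.0
        count = 0
        for j, (x, y, z) in enumerate(points):
            dx, dy, dz = x - cx, y - cy, z - cz
            d = math.sqrt(dx * dx + dy * dy + dz * dz)
            if i == j or d == 0.0:
                continue
            sx += dx / d
            sy += dy / d
            sz += dz / d
            count += 1
        values.append(1.0 - math.sqrt(sx * sx + sy * sy + sz * sz) / count)
    return values

def mean_power(values, power):
    """Average of CV**power, one simple power-sum descriptor of the distribution."""
    return sum(v ** power for v in values) / len(values)

rng = random.Random(0)
sphere = points_in_sphere(300, rng)
for stretch in (1.0, 2.0, 4.0):
    # Scale along z to obtain points inside progressively more elongated ellipsoids.
    ellipsoid = [(x, y, z * stretch) for x, y, z in sphere]
    print(stretch, round(mean_power(cv_of_each_point(ellipsoid), 2), 3))
```

A single number such as the mean of CV squared avoids the binning problem mentioned above, because it can be computed directly from the few hundred or few thousand CV values of a protein.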
These properties included the average (absolute or squared) differences between the density distributions and between the cumulative probability distributions, the differences between the sums of various CV powers, as well as the various moments of the CV distributions. Comparisons were made using both the CV and the CVw values. Interestingly, the Pearson correlations between the various measures and the difference between the sphericities calculated from the volume and surface of the elongated ellipsoids were generally much higher when using the power sums or the moments than when looking directly at the (discretized) distributions.

The sphericity calculations were also performed on the set that Hass and Kohl used for their calculations of sphericity based on accessible surface [21]. The power- and moment-based correlations were again higher than the ones based on the explicit distributions. The largest correlation (0.78) was obtained using the second power of the CVw sums. This correlation indicates that the difference between the two measures is larger than what one would expect from the various approximations involved in either algorithm. Rather, it shows that the sphericity of an irregular object is not a uniquely defined concept.

The calculation of the CV-based sphericity measure has been implemented in the program CVDISTR. It is available at the URL https://mezeim01.u.hpc.mssm.edu/cvdistr.

3.5 Achilles' Heel of Proteins

One demonstration of the usefulness of the CV for determining the degree to which an atom is buried involved shaving off atoms from a protein one by one: at each step, the atom that had the largest exposed surface area was removed. It was found that the ordering of the atoms by this shaving procedure correlates well with their ordering by the circular variance calculated with respect to the rest of the atoms. This correlation suggests that shaving can also be done meaningfully based on CV instead of exposed surface area.
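A CV-based shaving loop can be sketched as follows. The sketch assumes that the most exposed atom is the one with the lowest CV with respect to the atoms still present, which is consistent with the surface criterion of Subheading 2.4 but is an interpretation made here; it is not the CVSHAVE program. It recomputes every CV after each removal, so it costs O(n^3) and is meant for illustration only, and it assumes that no two atoms share the same coordinates.

```python
import math

def cv(center, others):
    """Eq. (2): CV of the unit vectors pointing from `center` to each of `others`."""
    sx = sy = sz = 0.0
    for x, y, z in others:
        dx, dy, dz = x - center[0], y - center[1], z - center[2]
        d = math.sqrt(dx * dx + dy * dy + dz * dz)  # assumed nonzero
        sx += dx / d
        sy += dy / d
        sz += dz / d
    return 1.0 - math.sqrt(sx * sx + sy * sy + sz * sz) / len(others)

def shaving_order(coords):
    """Indices of the atoms in the order in which they are shaved off."""
    remaining = list(range(len(coords)))
    order = []
    while len(remaining) > 1:
        def current_cv(i):
            return cv(coords[i], [coords[j] for j in remaining if j != i])
        victim = min(remaining, key=current_cv)  # lowest CV = most exposed (assumption)
        order.append(victim)
        remaining.remove(victim)
    return order + remaining

# Toy "protein": one central atom (index 0) surrounded by six atoms on the axes.
toy = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
order = shaving_order(toy)
```

In the toy example the central atom has CV = 1 at the start, so one of the outer atoms is shaved first; on a real structure, the interesting events are the steps at which a backbone atom far from the chain ends is removed.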
During this shaving (either by the use of the exposed surface area or by CV), most of the time side chain atoms or backbone atoms at the chain end were removed; it was rare that a backbone atom that was not near the chain end was the one removed. It is thus proposed that these spots on the protein have special properties, perhaps vulnerabilities. Hence the suggested name: Achilles' heel (AH) of a protein.

CV-based shavings were run on each protein domain in the PDB-C set. The residue of a backbone atom that was shaved and was at least 10 residues from the nearest chain end was considered an AH residue. On average, 0.6% of the residues were found to be AH residues. Table 5 shows the percent occurrence of the 20 AAs in the PDB-C set and among the AH residues, as well as the propensity of an AA to be an AH residue, expressed as the ratio AH%/PDB-C%. Remarkably, except for alanine and glutamine, this fraction significantly differs from one, suggesting that AH propensity may indeed have a role in the function of the protein.

Table 5 Achilles' heel residue propensities

Residue   % in PDB-C   % AH residues   % AH residues/% in PDB-C
GLY       6.91         11.62           1.68
ALA       6.84         7.28            1.06
VAL       6.72         3.54            0.53
LEU       8.77         4.46            0.51
ILE       5.28         2.30            0.44
SER       5.95         8.19            1.38
THR       5.92         5.80            0.98
ASP       5.58         8.99            1.61
GLU       6.76         10.09           1.49
ASN       4.28         6.38            1.49
GLN       4.44         4.51            1.01
LYS       6.30         9.32            1.48
HIS       2.40         1.64            0.68
ARG       4.83         3.56            0.74
PHE       3.78         1.43            0.38
TYR       4.13         1.62            0.39
TRP       2.06         1.13            0.55
CYS       1.87         0.85            0.45
MET       1.94         1.12            0.58
PRO       5.01         6.20            1.24

The secondary-structure element propensity of the AH residues has also been examined. Their propensities to be in helix, sheet, and loop were found to be 16.7%, 10.0%, and 73.3%, respectively. This is not surprising since most of the time loops are on the surface, sheets are in the interior, and helices have one side exposed.
Finally, the correlation of the AH propensity of an amino acid with several other molecular properties was examined. The Pearson correlation coefficient with AA hydrophobicity, using the hydrophobicity scale of Rose et al. [22], turned out to be −0.89. This means that amino acids with high AH propensity are likely to be polar, which is indeed the case. This, in turn, may have a structural consequence: the backbone of AH residues is likely to experience a pull from the favorable solvation of the polar side chains, resulting in it sticking out more than the backbone of nonpolar residues. The Pearson correlation coefficient with the molecular volume, using the values of Darby and Creighton [23], is −0.47, and with the average length of the AH residue side chains it is −0.20. While these are rather weak correlations, they are in the direction expected: larger/longer side chains tend to "protect" the backbone.

The shaving by CV has been implemented in the program CVSHAVE. It is available at the URL https://mezeim01.u.hpc.mssm.edu/cvshave.

3.6 Foldability of a Sequence

For a long time, proteins were viewed in "black and white," i.e., as either folded into a unique structure or disordered. This picture was slowly refined by the discovery of intrinsically disordered proteins and of proteins that can exist in different conformations: at first just a few, involved in diseases, and later the larger set of fold-switching proteins [3]. Given that only a negligible fraction of possible sequences is capable of folding, the question arises: what information can be obtained from a sequence about its capability of folding?

The first observation is that the distribution of AAs in known proteins is far from uniform. Furthermore, the specific propensities show little variation when different types of organism are compared [24], indicating that the deviation from the uniform distribution has an important role in biology.
Digging deeper, the propensities of different AAs to be present in different secondary structure elements (SSEs), i.e., helices or sheets, are different, leading to a whole family of methods predicting the SSEs that a given sequence might form. The question thus arose: can folding and non-folding sequences be differentiated by the amount of SSEs predicted?

To answer this question, the Garnier-Osguthorpe-Robson (GOR) method was used to predict the % of residues forming helix or sheet in the PDB50 set, the IDP set, as well as two sets of randomly generated sequences, one using the uniform distribution and the other using the amino acid propensities, referred to as RANU and RANW, respectively. For the PDB50, IDP, and RANW sets, the average percentages, 62 ± 11, 59 ± 13, and 59 ± 7, respectively, were rather close, but the average percentage for the RANU set was significantly less, 43 ± 8. Incidentally, the average of the experimental percentage was 53 ± 13. Here ± represents one standard deviation. The interesting part of this result is that there is a significant difference between the RANW and RANU results, confirming the importance of the nonuniformity of AA propensities for the ability of a sequence to fold. The slight difference between the PDB50 and IDP sets and between the PDB50 and RANW sets indicates that, indeed, there does exist a subtle difference between these sequences. The importance of the nonuniformity of AA propensities has also been observed, using different arguments, by Mittal et al. [25].

Beyond simple propensities, the logical next question concerns the propensities of various, progressively more complex combinations of AAs. Here the difficulty is that increasing the complexity of the combinations investigated increases the chance of obtaining useful information, but the statistics become progressively less adequate for producing reliable results.
The increase in the size of the PDB helps in improving the statistics, but the number of structures in the PDB increases at a much lower rate than the size of the p-tuple space grows as a function of p.

Recent work [26, 27] examined a large number of protein sequences with known folded structures, using the PDB50 set, for signals that would set these sequences apart from non-folding ones. With increasing complexity, the signal strength increased. Furthermore, even when the statistics became woefully inadequate, some useful signals were still found.

Comparing the order of AAs in each neighboring pair, 26 of the 190 AA pairs showed an asymmetry (the ratio of occurrence of the pair in different order) >1.15 or <1/1.15, and a few pairs showed significantly larger deviations: MET-HIS (0.56), PRO-GLU (1.62), PRO-CYS (0.72), MET-THR (1.32), MET-TRP (0.76), and PRO-HIS (0.77). Not surprisingly, prolines are prominent in this list.

The propensities of various AAs to be separated by $n_s$ residues (irrespective of order) were also examined. These propensities were normalized by the propensity at $n_s$ = 10. Some pairs showed significant effects, not just for $n_s$ = 0 (i.e., neighbors). Here HIS-HIS and CYS-CYS pairs showed the largest deviation from one: 1.71 and 0.74, respectively, for $n_s$ = 0, and even for $n_s$ = 2, the ratios were 1.46 and 0.56, respectively; however, most of the pairs showed no effect for $n_s$ > 0.

Looking at the SCp scores for p = 3, 4, and 5, significant differences were obtained when using different types of input sets [18, 26, 27]. Table 6 lists, for p = 3, 4, and 5, the average scores and their S.D. for the PDB50, IDP, F-S, RANW, and RANU sets, as well as the overlaps between the PDB50 set and all the other sets. The coverage of the p-tuple space is also given in the table.

Table 6 Foldability scores of various data sets

Set      Quantity            p = 3     p = 4     p = 5
PDB50    <SCp>               0.047     0.094     0.338
         S.D.                0.062     0.097     169
IDP      <SCp>               0.044     0.074     0.203
         S.D.                0.067     0.100     0.197
         Ov(PDB50,IDP)       0.91      0.88      0.68
F-S      <SCp>               0.028     0.061     0.323
         S.D.                0.060     0.093     241
         Ov(PDB50,F-S)       0.82      0.79      0.71
RANW     <SCp>               0.043     0.082     0.205
         S.D.                0.032     0.048     0.075
         Ov(PDB50,RANW)      0.30      0.19      0.324
RANU     <SCp>               0.051     0.109     0.044
         S.D.                0.033     0.052     0.089
         Ov(PDB50,RANU)      0.27      0.14      0.009
p-tuple space coverage       100%      99.58%    69.92%

For all three p values, there was a clear progression in the average foldability scores from PDB50, the experimentally known folded set, to IDP, the intrinsically disordered set, followed by F-S, the fold-switching set, and, significantly farther, the two random sets RANW and RANU, with the uniformly random set being the farthest from the experimental set. The distance between the averages was also reflected in the overlaps, which showed that the nonrandom sets, while distinct, were still fairly close to each other; a real difference was only seen between the random and nonrandom sets.

Given the significant separation between the PDB50 and random sets, the question arose: how reliably can the foldability score predict whether a given sequence will fold? For the test, the new PDB50 set was used; randomly generated sets were prepared using random number seeds different from the ones used in generating and comparing score distributions. Table 7 collects the folding predictions using SCp, p = 3, 4, and 5, on the new PDB50, RANW, and RANU sets. In view of the smaller overlap between the PDB50 and RANW distributions for p = 4 vs. p = 3, it is not surprising that SC4 performs better than SC3. The performance of SC5, however, was uneven, even though extending the tuple length is expected to yield better performance. The likely reason for not meeting this expectation is that while the triplet and quadruplet spaces are essentially fully covered by the PDB50 set, the coverage of the pentuplet space is significantly lower.
However, the surprisingly low rate of false negatives with SC5 suggested the combination of SC4 and SC5 in the following way: whenever SC4 predicts randomness but SC5 predicts foldability, change the prediction to foldable; otherwise keep the SC4 prediction. As also shown in Table 7, the combination significantly improved the reliability of the folding predictions.

Table 7 Foldability predictions

Score                    Prediction    New PDB50    RANW      RANU
Triplet                  Folded        75.8%        9.4%      6.3%
                         Non-folded    24.3%        90.6%     93.7%
Quadruplet               Folded        79.3%        4.3%      4.0%
                         Non-folded    20.7%        95.7%     96.0%
Pentuplet                Folded        79.4%        0.5%      3.6%
                         Non-folded    20.6%        99.5%     96.4%
Quadruplet + Pentuplet   Folded        88.4%        4.2%      31.1%
                         Non-folded    11.6%        95.8%     68.9%

A recent work used a modified score and refined the foldability predictions by making the scores depend not only on the AAs in the p-tuplets but also on the secondary structure predicted from the sequence [28]. It performed slightly better than the combination of SC4 and SC5 described above.

Given the success of the foldability scores SCp, the logical next question is whether they can serve as a measure of stability as well. The short answer is: not really. The Pearson correlation between the melting temperatures (Tm) and SCp is only 0.1, 0.19, and 0.14 for p = 3, 4, and 5, resp. The Spearman (rank) correlations are similarly low: 0.10, 0.18, and 0.19 for p = 3, 4, and 5, resp. The discouragingly low correlations prompted another look at longer p-tuples, despite the progressively worse coverage in the PDB50 set (11% and 0.7% for p = 6 and 7, resp.), using the SSCp score. The first question raised was whether the change in stability upon a mutation can be predicted by the change in the foldability score. Not surprisingly, the answer was again negative; the correlations between the score change upon mutation and the change in Tm were 0.04, 0.03, 0.11, 0.10, and 0.18 for p = 3, 4, 5, 6, and 7, resp.
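The rule for combining the SC4 and SC5 predictions described above can be written down in a few lines of Python. This is a minimal sketch of the rule as stated in the text, not an excerpt from the FOLD program; the function name combine_predictions and the boolean encoding (True meaning "folded") are assumptions made for illustration.

```python
def combine_predictions(sc4_folded: bool, sc5_folded: bool) -> bool:
    """Combine quadruplet- and pentuplet-based calls (True = folded).

    Rule from the text: if SC4 predicts random but SC5 predicts
    folded, report folded; otherwise keep the SC4 prediction.
    The function name and boolean encoding are illustrative only.
    """
    if not sc4_folded and sc5_folded:
        return True
    return sc4_folded


if __name__ == "__main__":
    # Enumerate the four possible pairs of individual predictions.
    for sc4 in (True, False):
        for sc5 in (True, False):
            print(sc4, sc5, combine_predictions(sc4, sc5))
```

Written this way, the rule is equivalent to a logical OR of the two individual calls: a sequence is reported as folded if either score predicts foldability.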
The next question was a “coarse-grained” version of the previous one: is it possible to predict just the sign of the stability change upon a mutation? Here the answer was a qualified yes. Table 8 shows the percentage of mutations for which the sign of the change in the foldability score matched the sign of the change in Tm for p = 3, 4, 5, 6, and 7. Not surprisingly, for p = 3 and 4, the predictions were too close to 50% to be of any use. However, for p = 6 and 7, the prediction accuracy was over 70%. Furthermore, for mutations where the sign predictions are the same for more than one p, the accuracy can be further increased; the best pair performance was found using p = 6 and 7 (72.1%), and the best three-way combination was found using p = 5, 6, and 7 (73.7%). While this is still rather low, it can be used to prioritize planned mutations in an experimental project, since the success rate can be significantly improved at negligible computational cost.

Table 8 Mutation sign prediction

p    Percent sign match
3    50.8%
4    57.4%
5    67.9%
6    71.4%
7    70.6%

The calculations of the foldability measures and estimates have also been implemented in the program FOLD. It is available at the URL https://mezeim01.u.hpc.mssm.edu/fold. Another recent work also took advantage of the fact that the missing coverage also contains information [29]. In this work, the authors found certain short sequences that occurred only in IDPs.

4 Notes and Protocols

All programs mentioned in this chapter can be downloaded free of charge by academic users. They are written in Fortran 77. The download package includes the source code, documentation, and, wherever applicable, data files and/or sample inputs; compiled executables are also available. The programs are run from the command line. The general format of the commands is:

    <Program name> [directive argument] [directive argument] . . .
If only the program name is typed, a list of possible directives, arguments, and, if available, defaults is printed. For example, calling the chameleon search program without any directives prints

    The possible command line options are:
    -mn : minimum chameleon length default: 5
    -mx : maximum chameleon length default: 24
    -ro : file name root default: chameleon
    -db : debug level default: 0
    -lf : list of files/sequences default: ss.txt
    -ff : file format (pdb|cif|ann) default: ann
    -mp : residue mapping file

In general, the directives can be in any order. Some of the programs also offer an interactive quiz to enter the run information when just the program name is typed. All programs check the input for consistency, limits, and file availability. If limits are violated, the program tells which size parameters to change, if possible; the documentation (.html file) indicates how data limits can be changed, if necessary.

A typical chameleon calculation would run as follows:

    cham -mn 5 -mx 10 -ro demo -lf ss.txt -ff ann
    Chameleon finder - written by Mihaly Mezei - Version 07/21/2018
    The program needs a file listing the PDB/CIF files or PDBids to process or a FASTA file with seqences and SS annotations
    Memory use: over 1421 Mb
    Opened list file ss.txt
    Opened result file demo.res
    Opened detail file demo.dtl
    Opened debug file demo.log
    ...
    Read/analyzed 139870 PDB ids and 394364 chains
    Average chain length= 260.3 residues
    Number of 10-residue chameleons found= 32
    Number of 10-residue sequences used= 5380037

A typical part of the chameleon list in the output file demo.res is below:

    HX:1AMB A ir= 11 SH:5OQV F ir= 11 Nctot= 6
    EVHHQKLVFF GLU VAL HIS HIS GLN LYS LEU VAL PHE PHE
    Full list:1AMB A 11 h 1AMC A 11 h 1IYT A 11 h 1Z0Q A 11 h 2LMQ K 11 s
    Full list:5OQV F 11 s

A typical rescoring calculation would run as follows:

    rescore -sf 4GUZ.cp_sc -md cluspro.4GUZ.24 -mr model.000 -xf 4GUZ_AD.pdb -ds CP
    Rescoring a set of docked P-P models using the contact propensity matrix
    Written by Mihaly Mezei - Version 08/11/2020
    Opening file 4GUZ.cp_sc (ClusPro result file)
    Opening file 4GUZ_AD.pdb (X-ray structure file)
    Opening file cluspro.4GUZ.24_rescore.res
    Result will be printed to file cluspro.4GUZ.24_rescore.res
    Read 24 scores
    # of ATOM records= 4482
    Opening file cluspro.4GUZ.24/model.000.00.pdb (First docked model)
    ...
    Opening file cluspro.4GUZ.24/model.000.23.pdb (24th docked model)
    # of ATOM records= 5448
    4GUZ Average model score= 3.138 Range: [ -3.115, 9.134]
    4GUZ Average normalized model score= 0.075 Range: [ -0.074, 0.203]
    4GUZ X-ray dimer scoresum= 3.87 # of contacts= 30 sc/n= 0.129
    4GUZ # of models=24 nsc_xp=11 nscn_xp= 7 nCA:568 284 nCT= 30
    RMSDmin= 3.36 (# 6) wfac= 0.0 # beating RMSDmin= 8
    ...
    RMSDmin= 3.36 (# 6) wfac= 50.0 # beating RMSDmin= 3
    im_min= 6 scmin= 6.34919596
    SCM: 4.7 6.2 5.5 3.6 2.8 6.3 2.9 9.1 4.9 1.5
    SCM: 1.2 4.9 1.9 0.2 5.1 -1.6 1.6 -3.1 4.5 2.6
    SCM: -1.5 5.2 6.7 -0.2
    Number of non-native structures with better contact score than the native= 2

A typical sphericity calculation would run as follows:

    cvdistr -cvcomplist pdb.list > cvdistr.res

where pdb.list is a list of PDB files for which sphericity parameters will be calculated. The results are redirected to the file cvdistr.res.
A typical output for one PDB file would be:

    Compare UNNORMALIZED CV ID=1a02N2 NN CVd= 0.03405 CVd_cum= 0.21472 CVdA= 0.68107 CVdA_cum= 2.12160
    Difference (abs) of the powers of the CV distribution
    0.141155E-01 0.174501E-01 0.126555E-01 0.817334E-02 0.525148E-02
    0.347859E-02 0.239483E-02 0.170957E-02 0.125860E-02 0.950353E-03
    0.732546E-03 0.574280E-03 0.456523E-03 0.367119E-03 0.298048E-03
    0.243868E-03 0.200791E-03 0.166126E-03 0.137933E-03 0.114784E-03
    Cumulative difference (abs) of the powers of the CV distribution
    0.141155E-01 0.315656E-01 0.442211E-01 0.523944E-01 0.576459E-01
    0.611245E-01 0.635193E-01 0.652289E-01 0.664875E-01 0.674379E-01
    0.681704E-01 0.687447E-01 0.692012E-01 0.695683E-01 0.698664E-01
    0.701103E-01 0.703110E-01 0.704772E-01 0.706151E-01 0.707299E-01
    Difference (abs) of the moments of the CV distribution
    0.365169E-08 0.923624E-02 0.165961E-03 0.725276E-03 0.154974E-03
    0.967322E-04 0.374331E-04 0.183825E-04 0.825398E-05 0.382788E-05
    0.172708E-05 0.737525E-06 0.259078E-06 0.862984E-07 0.240465E-07
    0.390821E-07 0.472241E-07 0.434107E-07 0.327017E-07 0.286229E-07
    Cumulative difference (abs) of the moments of the CV distribution
    0.365169E-08 0.923625E-02 0.940221E-02 0.101275E-01 0.102825E-01
    0.103792E-01 0.104166E-01 0.104350E-01 0.104433E-01 0.104471E-01
    0.104488E-01 0.104496E-01 0.104498E-01 0.104499E-01 0.104499E-01
    0.104500E-01 0.104500E-01 0.104501E-01 0.104501E-01 0.104501E-01

A typical shaving calculation would run as follows:

    cvshave -lf PDB_list -rd 10 -iv 0 -of PDB_d10.out
    CV-based shaving of proteins for Achilles heel search
    Written by Mihaly Mezei - version 08/15/2020
    File PDB_list opened OK as unit 10
    File PDB_d10.out opened OK as unit 30
    CV cutoff= 15.0 Residue distance limit= 10
    ...
    Processed 1179 PDB files; analyzed 4016 protein domains
    Average % of BB break residues= 0.56
    1 GLY aa%= 6.91 aa_ach%= 11.62 aa_ach%/aa%= 1.68
    2 ALA aa%= 6.84 aa_ach%= 7.28 aa_ach%/aa%= 1.06
    ...
A typical foldability prediction calculation would run as follows:

1. Filter the chains in ss.txt to no more than 50% sequence identity.
    fold -op FILT -da READ -in ss.txt -pm 50 -ou ss_nr50.out
2. Calculate the propensities of all quadruplets (file quad_nr50.dat).
    fold -op QUAD -da READ -in ss_nr50.txt -qd quad_nr50.dat -ou quad_nr50.out
3. Calculate the quadruplet-based foldability scores of the sequences in ss_nr50.txt.
    fold -op SC04 -da READ -in ss_nr50.txt -sf SSPR -qd quad_nr50.dat -ou quad_nr50_score.out
4. Predict the foldabilities of the sequences in ss_new_nr50.txt.
    fold -op PRFO -da READ -in ss_new_nr50.txt -qd quad_nr50.dat -us QUAD -ou prfo_nr50_quad.out -lf prfo.list

A typical output segment for the foldability prediction:

    6QMB:B <HX len>= 8.2 A sc= 0.0000 T sc= 0.0000 Q sc= 0.0247 P sc= 0.0000 foldprop= 17.98 ln(fp)= 2.8893 ibin= 78 GF
    6QMM:A <HX len>= 9.2 A sc= 0.0000 T sc= 0.0000 Q sc= 0.1358 P sc= 0.0000 foldprop= 1.00 ln(fp)= 1.0000 ibin= 59 FD
    6QPI:B <HX len>= 7.6 A sc= 0.0000 T sc= 0.0000 Q sc=-0.0757 P sc= 0.0000 foldprop= 0.14 ln(fp)= -1.9942 ibin= 30 GR
    6QPP:A <HX len>= 5.7 A sc= 0.0000 T sc= 0.0000 Q sc=-0.0217 P sc= 0.0000 foldprop= 0.74 ln(fp)= -0.3070 ibin= 46 GR?

Here, the labels GF, FD, GR, and GR? stand for Guessed Folded, Folded, Guessed Random, and Probably Random, resp.

Acknowledgments

Conversations with Prof. George Rose on the ideas described here are gratefully acknowledged. This work was supported in part through the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai.

References

1. Dokholyan N (2009) Protein designability and engineering. In: Structural bioinformatics, 2nd edn. Wiley-Blackwell, Hoboken, NJ
2. Mezei M (2018) Revisiting chameleon sequences in the Protein Data Bank. Algorithms 11:114. https://doi.org/10.3390/a11080114
3. Porter LL, Looger LL (2018) Extant fold-switching proteins are widespread.
Proc Natl Acad Sci U S A 115:5968–5973. https://doi.org/10.1073/pnas.1800168115
4. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242. https://doi.org/10.1093/nar/28.1.235
5. Mezei M (2015) Statistical properties of protein-protein interfaces. Algorithms 8:92–99. https://doi.org/10.3390/a8020092
6. Piovesan D, Tabaro F, Marco IM, Quaglia NF, Oldfield CJ, Aspromonte MC, Davey NE, Davidović R, Dosztányi Z, Elofsson A, Gasparini A, Hatos A, Kajava AV, Kalmar L, Leonardi E, Lazar T, Macedo-Ribeiro S, Macossay-Castillo M, Meszaros A, Minervini G, Murvai N, Pujols J, Roche DB, Salladini E, Schad E, Schramm A, Szabo B, Tantos A, Tonello F, Tsirigos KD, Veljković N, Ventura S, Vranken W, Warholm P, Uversky VN, Dunker AK, Longhi S, Silvio P, Tosatto CE (2016) DisProt 7.0: a major update of the database of disordered proteins. Nucleic Acids Res 45:D219–D227. https://doi.org/10.1093/nar/gkw1279
7. Pucci F, Bourgeas R, Rooman M (2016) High-quality thermodynamic data on the stability changes of proteins upon single-site mutations. J Phys Chem Ref Data 45. https://doi.org/10.1063/1.4947493
8. Kirys T, Ruvinsky AM, Singla D, Tuzikov AV, Kundrotas PJ, Vakser IA (2015) Simulated unbound structures for benchmarking of protein docking in the DOCKGROUND resource. BMC Bioinformatics 16:243
9. Vreven T, Moal I, Vangone A, Pierce B, Kastritis P, Torchala M, Chaleil R, Jimenez-Garcia B, Bates P, Fernandez-Recio J, Bonvin A, Weng Z (2015) Updates to the integrated protein-protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. J Mol Biol 427:3031–3041
10. Comeau SR, Gatchell DW, Vajda S, Camacho CJ (2004) ClusPro: a fully automated algorithm for protein-protein docking. Nucleic Acids Res 32:W96–W99. https://doi.org/10.1093/nar/gkh354
11.
Schneidman-Duhovny D, Inbar Y, Nussinov R, Wolfson HJ (2005) PatchDock and SymmDock: servers for rigid and symmetric docking. Nucleic Acids Res 33:W363–W367. https://doi.org/10.1093/nar/gki481
12. Mardia KV, Jupp PE (2000) Directional statistics. John Wiley & Sons, Ltd, Chichester
13. Mezei M (2003) A new method for mapping macromolecular topography. J Mol Graph Model 21(5):463–472
14. Mezei M (2010) Simulaid: a simulation facilitator and analysis program. J Comput Chem 31(14):2658–2668. https://doi.org/10.1002/jcc.21551
15. Mezei M, Zhou M-M (2007) Pspace: a program to plan the covering of a protein space. Source Code Biol Med 2:6. https://doi.org/10.1186/1751-0473-2-6
16. Mezei M (2003) Efficient Monte Carlo sampling for long molecular chains using local moves, tested on a solvated lipid bilayer. J Chem Phys 118:3874–3880. https://doi.org/10.1063/1.1539839
17. Mezei M (1998) Chameleon sequences in the PDB. Prot Engng 11:411–414. https://doi.org/10.1093/protein/11.6.411
18. Mezei M (2020) Foldability and chameleon propensity of fold-switching protein sequences. Proteins 89:3–5. https://doi.org/10.1002/prot.25989
19. Göbel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins 18:309–317. https://doi.org/10.1002/prot.340180402
20. Mezei M (2015) Use of circular variance to quantify the deviation of a macromolecule from the spherical shape. J Math Chem 53:2184–2189. https://doi.org/10.1007/s10910-015-0540-4
21. Hass J, Koehl P (2014) How round is a protein? Exploring protein structures for globularity using conformal mapping. Front Mol Biosci 1:1–15. https://doi.org/10.3389/fmolb.2014.00026
22. Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH (1985) Hydrophobicity of amino acid residues in globular proteins. Science 229:834–838. https://doi.org/10.1126/science.4023714
23. Creighton NJDaTE (1993) Protein structure. In: Focus. IRL Press, Oxford University Press, Oxford.
https://doi.org/10.1016/0307-4412(95)90200-7
24. Gaur RK (2014) Amino acid frequency distribution among eukaryotic proteins. IIOAB J 5:6–11
25. Mittal A, Jayaram B, Shenoy S, Bawa TS (2010) A stoichiometry driven universal spatial organization of backbones of folded proteins: are there Chargaff’s rules for protein folding? J Biomol Struct Dyn 28:133–142. https://doi.org/10.1080/07391102.2010.10507349
26. Mezei M (2020) On predicting foldability of a protein from its sequence. Proteins 88:355–356. https://doi.org/10.1002/prot.25811
27. Mezei M (2019) Exploiting sparse statistics for a sequence-based prediction of the effect of mutations. Algorithms 12:214. https://doi.org/10.3390/a12100214
28. Kaushik R, Zhang KYJ (2020) A protein sequence fitness function for identifying natural and nonnatural proteins. Proteins 88(10):1271–1284. https://doi.org/10.1002/prot.25900
29. Mittal A, Changani AM, Taparia S (2021) Unique and exclusive peptide signatures directly identify intrinsically disordered proteins from sequences without structural information. J Biomol Struct Dyn 39(8):2885–2893. https://doi.org/10.1080/07391102.2020.1756410

Chapter 3

Exploring the Peptide Potential of Genomes

Chris Papadopoulos, Nicolas Chevrollier, and Anne Lopes

Abstract

Recent studies attribute a central role to the noncoding genome in the emergence of novel genes. The widespread transcription of noncoding regions and the pervasive translation of the resulting RNAs offer organisms a vast reservoir of novel peptides. Although the majority of these peptides are expected to be deleterious or neutral, and thereby to be degraded right away or to be short-lived in evolutionary history, some of them can confer an advantage to the organism. The latter can be further subjected to natural selection and be established as novel genes.
In any case, characterizing the structural properties of these pervasively translated peptides is crucial to understand (1) their impact on the cell and (2) how some of these peptides, derived from presumed noncoding regions, can give rise to structured and functional de novo proteins. Therefore, we present a protocol that aims to explore the potential of a genome to produce novel peptides. It consists of annotating all the open reading frames (ORFs) of a genome (i.e., coding and noncoding ones) and characterizing the fold potential and other structural properties of their corresponding potential peptides. Here, we apply our protocol to a small genome and show how to apply it to very large genomes. Finally, we present a case study which aims to probe the fold potential of a set of 721 translated ORFs in mouse lncRNAs, identified with ribosome profiling experiments. Interestingly, we show that the distribution of their fold potential is different from that of the nontranslated lncRNAs and more generally from the other noncoding ORFs of the mouse.

Key words Noncoding DNA, Fold potential, De novo genes, Small ORF-encoded peptides, ORFtrack, ORFold

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_3, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

Many studies attribute a central role to the noncoding genome in novel gene birth and more generally in the emergence of genetic novelty. As a matter of fact, thousands of small open reading frames (ORFs) have been identified in noncoding regions of various genomes. Interestingly, the wide use of transcriptomics revealed a highly pervasive transcription of noncoding regions, and ribosome profiling experiments have shown that an important fraction of the resulting RNAs is translated [1–4]. In addition, mass spectrometry experiments conducted on mammals, bacteria, or plants [5–11] confirm the existence of these translation products in the cell, with the identification of hundreds of peptides derived from noncoding regions. The fact that these noncanonical products are short, are present in low abundance, and use alternative start codons makes them difficult to identify and suggests that their number is largely underestimated. Interestingly, their sequences are more conserved than those of noncoding sequences, suggesting that they are subjected to purifying selection [5, 6] and that they could be functional. It has been proposed that these noncanonical translation products are consequently exposed to natural selection and, thereby, provide the organism with the raw material for the emergence of genetic novelty.

However, how noncoding sequences can give rise to novel genes remains unclear. In particular, noncoding sequences are not expected to fold into a stable and specific structure, and they have not been subjected to the purifying selection that would keep them from being deleterious to the cell. One can ask how these pervasively translated products can (1) be tolerated by the cell and (2) give rise to functional products, since most proteins achieve their function through a well-defined 3D structure. Indeed, noncoding sequences display different sequence features from coding ones, being shorter and characterized by different nucleotide compositions [5, 12]. They are rather expected to encode disordered, misfolded, or aggregation-prone peptides, and we can hypothesize that they would be rapidly degraded or short-lived in evolutionary history. Nevertheless, it has been demonstrated that proteins from random libraries could fold in silico or in vitro, some of them being even beneficial in Escherichia coli [13–16].
All these results place the foldability of noncoding ORFs at the center of novel gene birth and strengthen the need to characterize the fold potential (including the propensities for disorder, folded state, and aggregation), not only of the experimentally observed de novo peptides but also of all the amino acid sequences “encoded” by presumed noncoding ORFs, which could give rise to novel peptides upon pervasive translation.

Therefore, we present a protocol that enables, in an automated way, (1) the extraction and annotation of all possible ORFs of a genome and (2) the prediction of their fold potential along with their propensities for disorder and aggregation. It relies on the ORFmine package (unpublished but available at https://github.com/i2bc/ORFmine), which aims to annotate a genome’s ORFs and probe their fold potential and structural properties. ORFmine consists of two independent programs, ORFtrack and ORFold. ORFtrack works in a stand-alone fashion and is very flexible, enabling different levels of annotation depending on the user request. ORFold relies on three gold-standard programs, HCA [17–20], Tango [21–23], and IUPred2A [24–26], which predict respectively the fold potential, the aggregation propensity, and the disorder propensity of an amino acid sequence. Here, we consider as foldable the amino acid sequences that are able to fold to a stable 3D structure or to a molten globule state, in which the specific tertiary structure is lost but the secondary structures are intact.

Our protocol can be applied to any completely sequenced genome and takes a few hours on a personal computer for a small genome (bacteria, archaea, or fungi), although we recommend launching the pipeline on a cluster for larger genomes (e.g., plant or mammal genomes). Here we present a detailed application of our protocol to the small genome of E. coli. Then we show how to apply our protocol to very large genomes (Mus musculus).
In the last part, we present a case study based on a ribosome profiling experiment performed on the mouse. In this example, we probe the fold potential of 721 ORFs present in lncRNAs which are translated, not conserved across species, and which show weak or no signature of selective pressure (i.e., presumed noncoding). We then show how ORFold can be used to compare the fold potential of a subset of ORFs of interest (e.g., translated ORFs present in lncRNAs) with those of the coding and noncoding ORFs of the genome they belong to. The latter protocol can be extended to any set of sequences of interest, including, for example, peptides identified in mass spectrometry experiments carried out in different conditions, de novo peptides associated with specific diseases, or even designed sequences.

2 Materials

2.1 ORFmine

ORFmine is a package that we developed in order to explore the peptide potential of a noncoding genome, with the extraction and annotation of all the possible ORFs present in noncoding regions. The ORFmine package is not published yet, but is available at: https://github.com/i2bc/ORFmine. It consists of two independent programs, ORFtrack and ORFold, that can be combined or used independently (Fig. 1). Used together, ORFtrack and ORFold provide a global picture of the fold potential and the structural properties of all the potential peptides of a genome. Otherwise, ORFtrack can simply be used to extract and annotate the ORFs of a genome, while ORFold can estimate the fold potential of any set of sequences without using genomic information.

Fig. 1 Pipeline of ORFmine. The inputs and outputs are represented with gray rectangles while the main scripts are shown with red circles. The mandatory inputs necessary to the ORF annotation and the estimation of their structural properties (e.g., fold potential and disorder and aggregation propensities), as well as their corresponding outputs, are connected to their related scripts with black arrows. The classical pipeline of ORFmine provides the user with a plot representing the distribution of the fold potential of the input ORFs (red box). Optionally, a genome annotation file (GFF format) can be given to ORFold (dashed arrows). In this case, ORFold produces new GFF files (one per studied structural property) where all input ORFs are associated with the score of the corresponding property. The GFF produced by ORFtrack and ORFold can be subsequently uploaded to a genome viewer (black boxes) where ORFs will be colored according to their annotation (black box on the left) or their structural properties (black box on the right)

2.1.1 ORFtrack

ORFtrack aims at extracting and annotating all the possible ORFs of a genome according to a set of defined genomic features. It takes as inputs a FASTA file containing all the chromosome or contig sequences and its corresponding annotation GFF file (for more details, see the GFF3 file format description at https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). ORFtrack searches, in the six possible frames, for all possible ORFs of at least 60 nucleotides bounded by STOP codons (i.e., it does not search for start codons). In order to annotate each resulting ORF (e.g., intergenic ORF, noncoding ORF that overlaps a coding sequence, coding ORF), its location is subsequently compared with those of all genomic features annotated in the GFF file (e.g., CDS, tRNA, rRNA, or any other feature defined by the user in the third column of the GFF file) (Figs. 2 and 3). There are four main categories of ORFs:

(1) Coding ORFs (c_CDS), which correspond to ORFs that include a coding sequence (CDS) (i.e., in the same frame as a CDS). They are generally larger than the CDS since they are defined from STOP to STOP.
(2) Noncoding intergenic ORFs (nc_intergenic), which do not overlap any genomic feature.
(3) Noncoding ORFs which overlap a genomic feature on the same strand (nc_ovp_same-x, with x standing for the corresponding genomic feature).
(4) Noncoding ORFs which overlap a genomic feature on the opposite strand (nc_ovp_opp-x, with x standing for the corresponding genomic feature) (Figs. 2 and 3).

The user has to keep in mind that ORFtrack provides an ORF-centered point of view of the input genome and that ORFs do not correspond to real biological objects but rather to the potential peptides that could be produced upon pervasive translation, with no information on the localization of their first translated codon. For example, a noncoding ORF overlapping a tRNA does not correspond to a tRNA, which by definition has neither phase nor a corresponding amino acid sequence, but to the corresponding peptide which could be produced upon the pervasive translation of the tRNA gene with no knowledge of the first translated codon.

Fig. 2 Decision tree of ORFtrack. ORFs are annotated according to four main categories: c_CDS for coding ORFs (orange box), noncoding intergenic ORFs (gray box), and noncoding ORFs that overlap a genomic feature on the same strand (blue box) or on the opposite strand (green box)

Fig. 3 Schematic representation of the six frames of a DNA section. The genomic features annotated in the original GFF file are represented in the middle line. The ORFs of the six frames are colored with respect to their ORFtrack annotation. The overlap between an ORF and a genomic feature is illustrated with a rectangle colored according to the ORF annotation

If a noncoding ORF overlaps more than one genomic feature, ORFtrack applies the following priority rules:

1. The noncoding ORF overlaps a CDS and any other genomic feature: it is annotated as a noncoding ORF overlapping a CDS (same or opposite strand) (e.g., nc_ovp_(same/opp)-CDS).
2. The noncoding ORF overlaps a genomic feature on the same strand and any other genomic feature on the other strand (except CDS): it is annotated as a noncoding ORF overlapping the feature on the same strand (e.g., nc_ovp_same-x).
3. The noncoding ORF overlaps two or more genomic features located on the same strand that can correspond to the same or the opposite strand of the noncoding ORF: it is annotated as overlapping the genomic feature that has the larger overlap with it (e.g., nc_ovp_(same/opp)-x).

The program provides the user with a new GFF file containing all the identified ORFs annotated according to the four categories defined previously. ORFget (a tool provided with ORFtrack) generates a FASTA file containing the amino acid sequences of all identified ORFs or of a subset of ORFs selected with respect to their annotation category (e.g., c_CDS, nc_intergenic, nc_ovp_same, nc_ovp_opp) or to their complete annotation for a finer selection (e.g., nc_ovp_same-lncRNAs and nc_ovp_opp-lncRNAs, if the user seeks to investigate whether ORFs overlapping lncRNAs display specific properties compared to other noncoding ORFs; see Subheading 3.3 for an example). Finally, ORFget allows the user to extract in a FASTA file the amino acid sequences of all annotated proteins and to reconstruct all isoforms of multi-exonic genes if they are annotated in the input GFF file.

2.1.2 ORFold

ORFold aims at estimating the fold potential of a set of amino acid sequences using the HCA method [17–20]. In addition, it can predict their disorder or aggregation propensities, with IUPred and Tango, respectively [21–26]. Although HCA is very fast and can handle all ORFs of a small genome in a few minutes, the calculation of the disorder and aggregation propensities slows down ORFold (around 3 h on a single CPU (2 GHz processor, 16 GB RAM) for all the ORFs of E. coli). Consequently, the user can turn off the calculation of the disorder and aggregation propensities.
ORFold takes as input a FASTA file containing the amino acid sequences to treat. The output of ORFold is a table containing the fold potential and/or the disorder and aggregation propensities of each input sequence. Optionally, the user can provide ORFold with the genome annotation GFF file of the input genome. In this case, the fold potential and/or the disorder and aggregation propensities of each ORF will be added to the GFF file. The latter can be uploaded subsequently to a genome viewer such as IGV [27], enabling the visual inspection and manual analysis of the distribution of the fold potential and the other structural properties along the genome. The program can handle several FASTA files at the same time and will generate as many outputs as given FASTA files. Finally, ORFold can also provide the user with plots representing the distribution of the fold potential of the input sequences along with those of a dataset of globular proteins used as reference, taken from Mészáros et al. [24].

HCA

ORFold estimates the fold potential with the HCA (Hydrophobic Cluster Analysis) approach [19, 28]. The HCA toolkit is available at https://github.com/T-B-F/pyHCA. It splits an amino acid sequence into hydrophobic clusters and linkers. The former gather strong hydrophobic residues (V, I, L, F, M, Y, W) and cysteines, while the latter correspond to stretches of at least four non-hydrophobic residues or a proline. Hydrophobic clusters usually indicate one or several regular secondary structures connected by short loops, which constitute signatures of globular domains. Linkers correspond to loops or disordered regions. The fold potential of a sequence is determined by its composition in hydrophobic clusters and linkers and is reflected by the HCA score. The latter ranges from -10 to +10, with low HCA scores indicating sequences that are enriched in linkers and expected to be disordered.
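The splitting of a sequence into hydrophobic clusters and linkers can be illustrated with a deliberately simplified, one-dimensional Python sketch that follows only the verbal definition given above. It is not the pyHCA implementation and it does not compute the HCA score; the function name hydrophobic_clusters, the min_gap parameter, and the example sequence are hypothetical.

```python
from typing import List, Tuple

# Strong hydrophobic residues plus cysteine, as listed in the text.
HYDROPHOBIC = set("VILFMYWC")


def hydrophobic_clusters(seq: str, min_gap: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start, end, subsequence) for each cluster (0-based, end exclusive).

    Simplified reading of the text (an assumption, not pyHCA itself):
    a cluster starts and ends on a hydrophobic residue and is broken
    by a proline or by a run of at least `min_gap` non-hydrophobic
    residues; everything between clusters is treated as linker.
    """
    seq = seq.upper()
    clusters = []
    start = None  # index of the first hydrophobic residue of the open cluster
    last = None   # index of the most recent hydrophobic residue
    for i, res in enumerate(seq):
        if res in HYDROPHOBIC:
            if start is None:
                start = i
            elif i - last - 1 >= min_gap:
                # Long non-hydrophobic stretch: close the cluster, open a new one.
                clusters.append((start, last + 1, seq[start:last + 1]))
                start = i
            last = i
        elif res == "P" and start is not None:
            # A proline always terminates the current cluster.
            clusters.append((start, last + 1, seq[start:last + 1]))
            start = last = None
    if start is not None:
        clusters.append((start, last + 1, seq[start:last + 1]))
    return clusters


if __name__ == "__main__":
    for cluster in hydrophobic_clusters("AVLAAAAAIFPLW"):  # made-up sequence
        print(cluster)
```

In this toy reading, a sequence rich in long non-hydrophobic stretches or prolines yields few, short clusters, which corresponds to the linker-rich, low-score regime described in the text.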
High HCA scores correspond to sequences with a high density of hydrophobic clusters, which are likely to form aggregates in solution, though some of them may be able to fold in lipidic environments. Sequences that are able to fold in solution are usually characterized by intermediate HCA scores, as shown by the HCA scores of the reference dataset of globular proteins in Fig. 5.

Tango
ORFold calculates the aggregation propensity of a sequence with Tango [21–23], which is available at http://tango.crg.es upon request to the developers. Following the criteria proposed by Linding et al. [21], a sequence segment is considered as aggregation-prone if it is composed of at least five consecutive residues predicted as populating a β-aggregated conformation with a percentage occupancy greater than 5%. The aggregation propensity of a sequence is then calculated as the fraction of residues predicted in an aggregation-prone segment.

IUPred
ORFold calculates the disorder propensity with IUPred [24–26, 29]. We use version 2A of IUPred [24, 25], which is available at https://iupred2a.elte.hu upon request to the developers. Consistent with the criteria used for the definition of an aggregation-prone region, we consider as disordered a region composed of at least five consecutive residues displaying a disorder probability higher than 0.5. As for the aggregation propensity, the disorder propensity of a sequence is calculated as the fraction of residues predicted in a disorder-prone segment.

3 Methods

3.1 Classical Use: Probing the Fold Potential of a Complete Genome
Here we seek to probe the fold potential and the aggregation and disorder propensities of all noncoding ORFs of E. coli str. K-12 substr. MG1655 (E. coli), regardless of whether they overlap a genomic feature. As a reference, we will also characterize these properties for all CDS of E. coli.

3.1.1 FASTA and GFF Files Used in this Example
1.
E_coli.fna (available at https://github.com/i2bc/ORFmine in the “examples” directory).
2. E_coli.gff (available at https://github.com/i2bc/ORFmine in the “examples” directory).

3.1.2 Annotation of the ORFs of E. coli with ORFtrack
The following ORFtrack instruction displays all the genomic features annotated in the E. coli genome:
> orftrack -fna E_coli.fna -gff E_coli.gff --show-types
Up to 12 different genomic features are annotated in the E. coli genome, including CDS, tRNA, rRNA (see Note 1). We then annotate all the possible ORFs of E. coli with the following instruction:
> orftrack -fna E_coli.fna -gff E_coli.gff
The execution time on a single CPU (2 GHz processor, 16 GB RAM) is 38 s. ORFtrack generates a new GFF file (mapping_orf_E_coli.gff) that contains 135097 annotated ORFs of which 130637 are annotated as noncoding. Table 1 shows the distribution of the output ORFs across the different annotation categories with various levels of annotations. This information is available in the summary file produced by ORFtrack (summary.log). Notice that it is also possible to scan all the annotated ORFs by loading the new GFF into a genome viewer.

Table 1 Counts of E. coli ORFs for each annotation category

Total ORFs                                                   135,097
  Coding (c_CDS)                                               4460
  Noncoding (nc_*)                                           130,637
    Noncoding intergenic (nc_intergenic)                      18,318
    Noncoding overlapping with a genomic feature (nc_ovp_*)  112,319
      On the same strand (nc_ovp_same-x)                      47,880
      On the opposite strand (nc_ovp_opp-x)                   64,439

With x standing for:       Same strand   Opposite strand
CDS                        45,053        62,354
Repeat region              1136          545
Sequence feature           626           566
r-RNA                      607           528
nc-RNA                     140           130
t-RNA                      119           114
Pseudogene                 119           109
Mobile genomic element     77            87
Origin of replication      3             4
Recombination feature      0             2

3.1.3 Extraction and Writing of the Noncoding ORFs and the CDS of E. coli

Extraction of Noncoding ORFs
In this example, we consider all the 130637 noncoding ORFs and do not differentiate noncoding intergenic ORFs from those that overlap a genomic feature. Therefore, we extract and write the amino acid sequences of all noncoding ORFs (i.e., nc_intergenic, nc_ovp_same, and nc_ovp_opp) with ORFget with the following command line (see Note 2):
> orfget -fna E_coli.fna -gff mapping_orf_E_coli.gff -features_include nc -o E_coli_noncoding
ORFget generates a FASTA file with the resulting 130637 amino acid sequences.

Extraction of CDS
Finally, in order to compare the structural properties of CDS with those of the potential peptides “encoded” in noncoding regions, we extract and rebuild the amino acid sequences of each CDS of E. coli according to the original annotation GFF file:
> orfget -fna E_coli.fna -gff E_coli.gff -features_include CDS -o E_coli_CDS
We obtain a FASTA file of 4316 protein sequences.

3.1.4 Characterization of the Fold Potential, and the Disorder and Aggregation Propensities of the ORFs and CDS of E. coli with ORFold
We aim to characterize the fold potential and the disorder and aggregation propensities of the noncoding ORFs (intergenic and overlapping ORFs) and CDS of E. coli. ORFold can handle the two datasets at the same time with the following instruction:
> orfold -fna E_coli_noncoding.pfasta E_coli_CDS.pfasta -gff mapping_orf_E_coli.gff E_coli.gff -options HIT
The execution time on a single CPU is around 3 h. ORFold generates two tables (one per dataset) containing, for each sequence, its fold potential as well as its disorder and aggregation propensities calculated by HCA, IUPred, and Tango, respectively.
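The disorder and aggregation propensities reported in these tables follow the segment rule given in Subheading 2.1.2: a residue is counted only if it belongs to a run of at least five consecutive residues above the threshold (0.5 for the IUPred disorder probability, 5% for the Tango β-aggregation occupancy). The Python sketch below illustrates that rule only. The function name `segment_propensity` and the input format (one score per residue) are assumptions made for illustration and are not taken from the ORFold code.

```python
from typing import Sequence


def segment_propensity(scores: Sequence[float],
                       threshold: float,
                       min_len: int = 5) -> float:
    """Fraction of residues that lie in a run of at least `min_len`
    consecutive residues whose score is strictly above `threshold`.

    Hypothetical helper illustrating the rule of Subheading 2.1.2;
    it is not part of ORFold.
    """
    if not scores:
        return 0.0
    flagged = 0   # residues belonging to a qualifying segment
    run = 0       # length of the current run above the threshold
    for score in scores:
        if score > threshold:
            run += 1
        else:
            if run >= min_len:
                flagged += run
            run = 0
    if run >= min_len:   # a run may extend to the end of the sequence
        flagged += run
    return flagged / len(scores)


# Disorder example: six residues above 0.5 followed by four below it.
disorder = segment_propensity([0.9] * 6 + [0.2] * 4, threshold=0.5)  # 0.6
# Aggregation example: a run of only four residues above 5% does not count.
aggregation = segment_propensity([12.0] * 4 + [0.0] * 6, threshold=5.0)  # 0.0
```

Short runs are discarded rather than partially counted, which is why the second example returns zero even though four residues exceed the threshold.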
In addition, ORFold writes the output values in a new GFF file that can be uploaded into a genome viewer. The original GFF can be uploaded as well, providing a reference with the exact localization of the genomic features annotated in the original GFF. We recall that ORFtrack identifies and annotates all the possible ORFs of a genome, which do not correspond to real objects but rather to the potential peptides that could be produced if their corresponding DNA region were transcribed and the resulting RNA subsequently translated. Figure 4 shows the two DNA strands of a genomic section of E. coli represented by the genome viewer IGV [27] after uploading the original GFF (blue genes in the middle) and the new GFF returned by ORFtrack (small ORFs in panels 2 and 4). Although the genome of E. coli is very compact, with few intergenic regions, there is a high density of noncoding ORFs that overlap with the coding genes of E. coli and that represent a high potential of novel peptides in case of ribosomal frameshifting. Interestingly, the distribution of the fold potential along the genome is not homogeneous. We observe an island of noncoding ORFs with high HCA values (ORFs in light and dark red in the middle of the figure). These ORFs potentially encode peptides enriched in hydrophobic residues that are likely to be foldable (light red ORFs) or expected to form aggregates in solution (dark red ORFs). The GFF returned by ORFold containing the Tango or IUPred values can provide the user with complementary information (data not shown). The genomic regions around the island of ORFs with high HCA values are enriched in ORFs with intermediate HCA values typical of foldable sequences (ORFs in light red and light blue). Overall, it is interesting to note that the fold potential seems to be quite conserved among the three frames of a strand, though it can vary along the strand. This recalls the observation made by Bartonek et al. [30], who showed that the hydrophobicity profiles of protein sequences are preserved in the +1 and −1 frames through the structure of the genetic code. Moreover, the visual inspection of the distribution of the fold potential of noncoding ORFs suggests that there is a vast number of ORFs that potentially encode foldable peptides (light blue and light red boxes corresponding to intermediate HCA values). Whether these peptides would fold to a specific 3D structure or to a molten globule is a crucial and very difficult question that deserves further investigation.

Fig. 4 Screenshot of a genomic section of E. coli represented by IGV. Genomic features present in the original GFF file (CDS in this example) are represented with blue boxes in the middle of the figure (panel 3). Panels 2 and 4 represent the noncoding ORFs identified by ORFtrack in the positive and negative strands, respectively. They are colored according to their annotation category (gray, blue, and green for nc_intergenic, nc_ovp_same, and nc_ovp_opp, respectively). Panels 1 and 5 represent the same ORFs colored with respect to their HCA scores. ORFs with low HCA scores are colored in blue, whereas ORFs with high HCA scores are colored in red. For more clarity, c_CDS that correspond to ORFs including a CDS in the same frame are not shown, since the corresponding CDS are already represented with the blue boxes in the middle panel

Finally, we plot the distributions of the fold potential of the two datasets with ORFplot. Notice that ORFplot can deal with several inputs and will plot as many distributions as given tables.
> orfplot -tab E_coli_CDS.tab E_coli_noncoding.tab -names "E. coli CDS" "E. coli noncoding ORFs"

Fig. 5 Distribution of the HCA scores calculated for the CDS and the noncoding ORFs of E. coli (dark blue and light blue curves, respectively). The HCA score distribution of the set of globular proteins is represented by the gray histogram.
Dotted black lines delineate the boundaries of the low, intermediate, and high HCA score bins so that 95% of the globular proteins fall into the intermediate HCA score bin. Each distribution is compared with that of the globular protein set with a Kolmogorov-Smirnov test. Asterisks on the plot denote the level of significance: *** P < 0.001

Figure 5 shows the fold potential distributions of the noncoding ORFs and the CDS of E. coli as plotted by ORFplot. Furthermore, as a reference, ORFplot plots the distribution of the HCA scores of a set of globular protein sequences taken from [24]. The fold potential distribution of the CDS is clearly different from that of the noncoding sequences (KS test, P = 9.9 × 10^-18). The CDS are enriched in intermediate HCA values typical of foldable proteins, as shown by the HCA scores of the globular proteins. Conversely, noncoding ORFs display a wide range of HCA values reflecting foldable, disordered, or aggregation-prone potential peptides. Nevertheless, it is interesting to note that most of them (~64%) exhibit similar HCA scores to globular proteins, revealing an important potential of foldable peptides, in line with the observation made in Fig. 4.

3.2 Application to Large Genomes and Comparison with Other Species
The execution time and the size of the outputs increase with the size of the input genome. This can become problematic for very large genomes such as those of mammals or plants. Even if the execution time for ORFtrack and ORFget is acceptable, it becomes prohibitive for ORFold. Furthermore, the sizes of the outputs are very large. In this section, we present alternatives to reduce the computational time and the size of the generated outputs.

3.2.1 FASTA and GFF Files Used in this Example
1. M_musculus.fna.
2. M_musculus.gff (downloadable at https://www.ncbi.nlm.nih.gov/genome/?term=mus+musculus).
3. E_coli.fna.
4.
E_coli.gff (downloadable at https://www.ncbi.nlm.nih.gov/genome/?term=e+coli).
5. H_volcanii.fna.
6. H_volcanii.gff (downloadable at https://www.ncbi.nlm.nih.gov/genome/?term=haloferax+volcanii).
7. D_melanogaster.fna.
8. D_melanogaster.gff (downloadable at https://www.ncbi.nlm.nih.gov/genome/?term=drosophila+melanogaster).

3.2.2 Annotation of ORFs of M. musculus with ORFtrack
In order to reduce the execution time (around 64 h on a single CPU), we recommend running ORFtrack on a cluster. The following command displays all the “seqid” values contained in the first column of the input GFF file (usually chromosomes and contigs):
> orftrack -fna M_musculus.fna -gff M_musculus.gff --show-chr
The ORF annotation can therefore be distributed over multiple CPUs (i.e., one job per “seqid”), substantially reducing the computational time. ORFtrack must then be launched once for each “seqid” present in the original GFF. Here, ORFtrack is launched on the chromosome NC_000067.7 with the following instruction:
> orftrack -fna M_musculus.fna -gff M_musculus.gff -chr NC_000067.7

3.2.3 Extraction and Writing of the ORFs and CDS of M. musculus with ORFget

Definition of a Minimal Subset Size to Characterize the Fold Potential and Structural Properties of Noncoding ORFs
Extracting all annotated ORFs with ORFget takes around 3 h on a single CPU and generates a 7.5 GB FASTA file containing up to 89 × 10^6 noncoding ORFs. Characterizing their fold potential and disorder and aggregation propensities with ORFold would take about 6 months on a single CPU. Consequently, we recommend running ORFold on a representative subset of noncoding ORFs. Indeed, a subset of 20,000 ORFs is sufficient to estimate the fold potential and the disorder and aggregation propensities of the whole dataset of noncoding ORFs. The Kolmogorov-Smirnov test p-value calculated for the comparison of the HCA score distribution obtained with a subset of 20,000 randomly selected noncoding ORFs with that of the complete set of noncoding ORFs of Drosophila melanogaster is not significant. The same observations are made for the IUPred and Tango score distributions and also hold for other species such as Haloferax volcanii and E. coli. Consequently, in the next section, ORFold will be applied to a set of 20,000 randomly selected noncoding ORFs extracted from the complete set of mouse noncoding ORFs.

Extraction and Writing of the Amino Acid Sequences of a Dataset of 20,000 Noncoding ORFs
The following instruction allows the extraction of a subset of 20,000 noncoding ORFs (see Note 3 for more advanced examples):
> orfget -fna M_musculus.fna -gff mapping_orf_M_musculus.gff -features_include nc -o M_musculus_noncoding -N 20000
Then, in order to compare the fold potential and the disorder and aggregation propensities of the noncoding ORFs of M. musculus with those of the CDS, we reconstruct the amino acid sequences of all the isoforms annotated in the original GFF file:
> orfget -fna M_musculus.fna -gff M_musculus.gff -features_include CDS -o M_musculus_CDS

3.2.4 Characterization of the Fold Potential and the Structural Properties of a Set of 20,000 Noncoding ORFs Along with those of M. musculus CDS
We execute ORFold on the small dataset of randomly selected noncoding ORFs and the complete set of mouse isoforms:
> orfold -fna M_musculus_noncoding.pfasta M_musculus_CDS.pfasta -options HIT
ORFold provides us with two tables, containing the fold potential and the disorder and aggregation propensities of the 20,000 noncoding ORFs and the 92,473 mouse isoforms (around 40 h on a single CPU).

3.2.5 Comparison of the Fold Potential of the Noncoding ORFs and the CDS Calculated for Different Species
ORFplot can handle multiple datasets at the same time. Following the same protocol as the one used for the mouse, we also calculated the fold potential of a subset of 20,000 noncoding ORFs and all CDS of H.
volcanii, E. coli, and D. melanogaster. We then present the HCA score distributions of all datasets on the same graph.
> orfplot -tab E_coli_CDS.tab H_volcanii_CDS.tab D_melanogaster_CDS.tab M_musculus_CDS.tab -names "E. coli" "H. volcanii" "D. melanogaster" "M. musculus"
> orfplot -tab E_coli_noncoding.tab H_volcanii_noncoding.tab D_melanogaster_noncoding.tab M_musculus_noncoding.tab -names "E. coli" "H. volcanii" "D. melanogaster" "M. musculus"

Fig. 6 (a) Distribution of the HCA scores calculated for the CDS of E. coli, H. volcanii, D. melanogaster, and M. musculus (dark blue, light blue, dark orange, and light orange curves, respectively). (b) Distribution of the HCA scores calculated for the noncoding ORFs of E. coli, H. volcanii, D. melanogaster, and M. musculus (dark blue, light blue, dark orange, and light orange curves, respectively). The HCA score distribution of the globular proteins is presented with the gray histogram. Each distribution is compared with that of the globular protein set with a Kolmogorov-Smirnov test. Asterisks on the plot denote the level of significance: *** P < 0.001

Figure 6 shows, for the four species, the HCA score distributions of the corresponding CDS (Fig. 6a) and noncoding ORFs (Fig. 6b). Although the fold potential distributions of the CDS display slight variations among the four species, the vast majority (more than 85%) exhibit intermediate HCA scores typical of the scores obtained for the globular proteins. This indicates that being foldable is a trait that has been strongly selected during evolution. However, the fold potential distribution of the noncoding ORFs calculated for H. volcanii is clearly different from those of the other species. Indeed, the other species are mostly characterized by noncoding ORFs that, similarly to CDS, encode peptides predicted as foldable. Conversely, the noncoding ORFs of H.
volcanii are enriched in sequences with low HCA scores that are likely to encode disordered peptides. Whether this enrichment in hydrophilic sequences comes from the fact that this species lives in hypersaline environments is an exciting question that deserves further investigation.

3.3 Probing the Fold Potential of a Set of Mouse Noncoding ORFs Shown to Be Pervasively Translated
Recently, Ruiz-Orera et al. [1] revealed, with ribosome profiling experiments, the translation of 721 ORFs in mouse lncRNAs (i.e., translated lncRNA-ORFs). These ORFs are neither conserved across neighboring species nor subjected to selective pressure. The authors propose them as intermediates between noncoding ORFs and de novo genes [1]. This prompts us to ask whether their corresponding peptides display specific structural properties compared to peptides encoded by ORFs in other lncRNAs (i.e., nontranslated lncRNA-ORFs). Therefore, in this section, we characterize their respective HCA score distributions, along with those of the CDS and the subset of 20,000 randomly selected noncoding ORFs defined in Subheading 3.2. The amino acid sequences of all translated products identified in Ruiz-Orera et al. [1] (i.e., products coming from protein coding genes or noncoding regions) can be downloaded at https://figshare.com/articles/dataset/Ruiz-Orera_et_al_2017_/4702375?file=10323906. We extracted the sequences of the 721 translated lncRNA-ORFs by searching the sequences containing either the “lncRNAa:translated:NC” or the “novel:translated:NC” pattern in their annotation.
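The pattern-based selection described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the annotation of each sequence is carried by its FASTA header line; the function names `read_fasta` and `select_by_pattern` and the example headers are hypothetical, and only the two search patterns come from the text.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# The two annotation patterns quoted in the text.
PATTERNS = ("lncRNAa:translated:NC", "novel:translated:NC")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_pattern(records: Iterable[Tuple[str, str]],
                      patterns: Sequence[str] = PATTERNS
                      ) -> List[Tuple[str, str]]:
    """Keep the records whose header contains at least one pattern."""
    return [(h, s) for h, s in records if any(p in h for p in patterns)]


# Hypothetical headers, for illustration only.
example = [
    ">orf1|lncRNAa:translated:NC", "MKTLLV", "AAG",
    ">orf2|coding:translated", "MSSPQ",
    ">orf3|novel:translated:NC", "MLLK",
]
selected = select_by_pattern(read_fasta(example))
# selected keeps orf1 and orf3, with multi-line sequences joined.
```

With a real file, the same functions could be applied to an open file handle, and the selected records written back in FASTA format before being passed to ORFold.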
Then, 20,000 nontranslated lncRNA-ORFs were extracted randomly from the GFF generated with ORFtrack in Subheading 3.2 with the following instruction:
> orfget -fna M_musculus.fna -gff mapping_orf_M_musculus.gff -features_include nc_ovp_same-lncRNA -o M_musculus_nc_ovp_same-lncRNA -N 20000
The amino acid sequences of the 721 translated lncRNA-ORFs and the 20,000 nontranslated lncRNA-ORFs can be directly given as input to ORFold:
> orfold -fna M_musculus_nc_ovp_same-lncRNA.pfasta M_musculus_translated_721_orfs.pfasta -options H
We subsequently plot the fold potentials of the four sets of ORFs with ORFplot:
> orfplot -tab M_musculus_CDS.tab M_musculus_noncoding.tab M_musculus_nc_ovp_same-lncRNA.tab M_musculus_translated_721_orfs.tab -names "CDS" "Noncoding ORFs" "Nontranslated lncRNA-ORFs" "Translated lncRNA-ORFs"

Fig. 7 Distribution of the HCA scores calculated for the CDS, the 20,000 noncoding ORFs, the 20,000 nontranslated lncRNA-ORFs, and the 721 translated lncRNA-ORFs of M. musculus (dark blue, light blue, dark orange, and light orange curves, respectively). The HCA score distribution of the set of globular proteins is presented with the gray histogram. Each distribution is compared with that of the globular proteins with a Kolmogorov-Smirnov test. Asterisks on the plot denote the level of significance: *** P < 0.001

Figure 7 shows the HCA score distributions of the four sets of ORFs. While the nontranslated lncRNA-ORFs display similar HCA scores to noncoding ORFs (Kolmogorov-Smirnov test, P = 0.46), the 721 translated lncRNA-ORFs exhibit a clearly different HCA value distribution from the three other datasets (Kolmogorov-Smirnov test, P = 5.9 × 10^-6, 4.8 × 10^-6, and 2.4 × 10^-6 with nontranslated lncRNA-ORFs, noncoding ORFs, and CDS, respectively).
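For readers who wish to repeat such comparisons outside ORFplot, the two-sample Kolmogorov-Smirnov statistic is simply the largest gap between the two empirical cumulative distributions. The sketch below computes that statistic in plain Python; the function name `ks_statistic` is hypothetical, and the code is an illustration of the test itself rather than of ORFplot's implementation. For a p-value, a statistics library such as SciPy (`scipy.stats.ks_2samp`) can be used on the same two lists of HCA scores.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the maximum absolute difference between the empirical
    cumulative distribution functions (ECDFs) of the two samples.
    Hypothetical helper for illustration; not part of ORFmine.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The ECDFs only change at observed values, so it is enough to
    # evaluate their difference at every data point of both samples.
    for x in a + b:
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


# Toy HCA-like scores: identical samples give D = 0, and samples with
# no overlap give D = 1.
same = ks_statistic([-1.0, 0.5, 2.0], [-1.0, 0.5, 2.0])      # 0.0
disjoint = ks_statistic([-3.0, -2.0, -1.0], [1.0, 2.0, 3.0])  # 1.0
```

A small D with a non-significant p-value, as reported for the nontranslated lncRNA-ORFs versus the noncoding ORFs, indicates that the two score distributions cannot be distinguished at the chosen level.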
Although they are characterized by a majority of intermediate HCA score sequences expected to be foldable, they are clearly enriched in disorder-prone sequences, recalling the observation made by Wilson et al. [31] that young proteins are more disordered than old ones. That said, it is interesting to note that, similarly to the two other noncoding ORF categories, the translated lncRNA-ORFs exhibit a majority of sequences that potentially encode peptides expected to be foldable. Further investigations are needed to determine whether their corresponding peptides fold to a well-defined and stable 3D structure or to a molten globule.

4 Conclusion
Here, we presented three protocols that all aim at characterizing the fold potential and the structural properties of different sets of ORFs, including coding sequences, the ensemble or a representative subset of the noncoding ORFs of a genome, or a specific subset of sequences of interest. ORFtrack is very fast, annotating a million ORFs in a few hours. In addition, it allows the user to deal with different levels of annotation and various combinations of selection patterns, thereby facilitating the definition of many ORF categories. ORFold can handle many inputs and enables the simultaneous visualization of the fold potential calculated for different datasets or the manual inspection of the fold potential or structural properties of all annotated ORFs of a genome with a genome viewer. In addition, ORFold can be used to probe the fold potential and the structural properties of any set of amino acid sequences without any genomic information including, for instance, designed peptides or de novo peptides identified with mass spectrometry in different tissues or conditions. Finally, ORFmine opens up new applications in peptide discovery and characterization. In particular, recent studies have reported the existence of de novo peptides associated with human diseases [11, 32–37].
ORFtrack can be used to mine noncoding genomes for the identification of de novo peptides, which are usually difficult to identify with mass spectrometry experiments (for example, peptides resulting from the translation of RNAs associated with diseases). In turn, ORFold provides valuable and complementary information with the characterization of their fold potential and structural properties.

5 Notes
1. Notice that the genomic features of a GFF3 file follow a specific hierarchy. For example, the feature “gene” has children (e.g., CDS, exons, tRNAs, rRNAs). In addition, features of the same level can overlap with each other (e.g., a CDS and its corresponding exon). By default, the features “gene” and “exon” are not considered. ORFs that match the feature “gene” will be annotated according to its children or related features (mRNA, tRNA, etc.). For example, ORFs overlapping tRNAs on the same strand necessarily overlap the parent genes as well, but for a more precise annotation, ORFtrack will annotate them as nc_ovp_same-tRNA instead of nc_ovp_same-gene. Finally, an ORF that matches the feature “CDS” usually matches the corresponding “exon” feature as well. However, the “exon” feature is not considered, and the ORF will be annotated as c_CDS if it is in the same frame as the CDS, or as nc_ovp_(same/opp)-CDS if it is in another frame.
2. Notice that the following instruction leads to the same result:
> orfget -fna E_coli.fna -gff mapping_orf_E_coli.gff -features_include nc_intergenic nc_ovp -o E_coli_noncoding
3. Notice that ORFget can extract a random subset of ORFs belonging to a specific category (e.g., extraction of 20,000 noncoding ORFs overlapping lncRNAs on the same strand) as follows:
> orfget -fna M_musculus.fna -gff mapping_orf_M_musculus.gff -features_include nc_ovp_same-lncRNA -o M_musculus_nc_ORF_ovp_same-lnRNA -N 20000

References
1.
Ruiz-Orera J, Verdaguer-Grau P, Villanueva-Cañas J et al (2018) Translation of neutrally evolving peptides provides a basis for de novo gene evolution. Nat Ecol Evol 2:890–896
2. Chen J, Brunner A-D, Cogan JZ et al (2020) Pervasive functional translation of noncanonical human open reading frames. Science 367:1140–1146
3. Ingolia NT, Lareau LF, Weissman JS (2011) Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147:789–802
4. Li J, Liu C (2019) Coding or noncoding, the converging concepts of RNAs. Front Genet 10:496
5. Slavoff SA, Mitchell AJ, Schwaid AG et al (2013) Peptidomic discovery of short open reading frame–encoded peptides in human cells. Nat Chem Biol 9:59
6. Prabakaran S, Hemberg M, Chauhan R et al (2014) Quantitative profiling of peptides from RNAs classified as noncoding. Nat Commun 5:5429
7. Samayoa J, Yildiz FH, Karplus K (2011) Identification of prokaryotic small proteins using a comparative genomic approach. Bioinformatics 27:1765–1771
8. Hobbs EC, Fontaine F, Yin X, Storz G (2011) An expanding universe of small proteins. Curr Opin Microbiol 14:167–173
9. Eguen T, Straub D, Graeff M, Wenkel S (2015) MicroProteins: small size–big impact. Trends Plant Sci 20:477–482
10. Deng Y, Bamigbade AT, Hammad MA et al (2018) Identification of small ORF-encoded peptides in mouse serum. Biophys Rep 4:39–49
11. Wang S, Mao C, Liu S (2019) Peptides encoded by noncoding genes: challenges and perspectives. Signal Transduct Target Ther 4:1–12
12. Carvunis A-R, Rolland T, Wapinski I et al (2012) Proto-genes and de novo gene birth. Nature 487:370–374
13. Schaefer C, Schlessinger A, Rost B (2010) Protein secondary structure appears to be robust under in silico evolution while protein disorder appears not to be. Bioinformatics 26:625–631
14. Tretyachenko V, Vymětal J, Bednárová L et al (2017) Random protein sequences can form defined secondary structures and are well-tolerated in vivo. Sci Rep 7:1–9
15. Keefe AD, Szostak JW (2001) Functional proteins from a random-sequence library. Nature 410:715–718
16. Neme R, Amador C, Yildirim B et al (2017) Random sequences are an abundant source of bioactive RNAs or peptides. Nat Ecol Evol 1:1–7
17. Faure G, Callebaut I (2013) Comprehensive repertoire of foldable regions within whole genomes. PLoS Comput Biol 9:e1003280
18. Faure G, Callebaut I (2013) Identification of hidden relationships from the coupling of hydrophobic cluster analysis and domain architecture information. Bioinformatics 29:1726–1733
19. Bitard-Feildel T, Callebaut I (2018) HCAtk and pyHCA: A toolkit and python API for the hydrophobic cluster analysis of protein sequences. bioRxiv 249995
20. Lamiable A, Bitard-Feildel T, Rebehmed J et al (2019) A topology-based investigation of protein interaction sites using hydrophobic cluster analysis. Biochimie 167:68–80
21. Linding R, Schymkowitz J, Rousseau F et al (2004) A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. J Mol Biol 342:345–353
22. Fernandez-Escamilla A-M, Rousseau F, Schymkowitz J, Serrano L (2004) Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotechnol 22:1302–1306
23. Rousseau F, Schymkowitz J, Serrano L (2006) Protein aggregation and amyloidosis: confusion of the kinds? Curr Opin Struct Biol 16:118–126
24. Mészáros B, Erdős G, Dosztányi Z (2018) IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res 46:W329–W337
25. Erdős G, Dosztányi Z (2020) Analyzing protein disorder with IUPred2A. Curr Protoc Bioinformatics 70:e99
26. Dosztányi Z (2018) Prediction of protein disorder based on IUPred. Protein Sci 27:331–340
27. Robinson JT, Thorvaldsdóttir H, Winckler W et al (2011) Integrative genomics viewer. Nat Biotechnol 29:24–26
28. Bitard-Feildel T, Callebaut I (2017) Exploring the dark foldable proteome by considering hydrophobic amino acids topology. Sci Rep 7:1–13
29. Mészáros B, Simon I, Dosztányi Z (2009) Prediction of protein binding regions in disordered proteins. PLoS Comput Biol 5:e1000376
30. Bartonek L, Braun D, Zagrovic B (2020) Frameshifting preserves key physicochemical properties of proteins. Proc Natl Acad Sci U S A 117:5907–5912
31. Wilson BA, Foy SG, Neme R, Masel J (2017) Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nat Ecol Evol 1:1–6
32. Yin X, Jing Y, Xu H (2019) Mining for missed sORF-encoded peptides. Expert Rev Proteomics 16:257–266
33. Lawrence MS, Stojanov P, Polak P et al (2013) Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499:214–218
34. Yadav M, Jhunjhunwala S, Phung QT et al (2014) Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing. Nature 515:572–576
35. Sendoel A, Dunn JG, Rodriguez EH et al (2017) Translation from unconventional 5′ start sites drives tumour initiation. Nature 541:494–499
36. Barbosa C, Peixeiro I, Romão L (2013) Gene expression regulation by upstream open reading frames and human disease. PLoS Genet 9:e1003529
37. von Bohlen AE, Böhm J, Pop R et al (2017) A mutation creating an upstream initiation codon in the SOX9 5′ UTR causes acampomelic campomelic dysplasia. Mol Genet Genomic Med 5:261–268

Chapter 4
Computational Identification and Design of Complementary β-Strand Sequences
Yoonjoo Choi

Abstract
The β-sheet is a regular secondary structure element which consists of linear segments called β-strands. They are involved in many important biological processes, and some are known to be related to serious diseases such as neurologic disorders and amyloidosis.
The self-assembly of β-sheet peptides also has practical applications in materials science since they can be building blocks of repeated nanostructures. Therefore, computational algorithms for identification of β-sheet formation can offer useful insight into the mechanism of disease-prone protein segments and the construction of biocompatible nanomaterials. Despite the recent advances in structure-based methods for the assessment of atomic interactions, identifying amyloidogenic peptides has proven to be extremely difficult since they are structurally very flexible. Thus, an alternative strategy is required to describe β-sheet formation. It has been hypothesized and observed that there are certain amino acid propensities between β-strand pairs. Based on this hypothesis, a database search algorithm, B-SIDER, has been developed for the identification and design of β-sheet forming sequences. Given a target sequence, the algorithm identifies exact or partial matches from the structure database and constructs a position-specific score matrix. The score matrix can be utilized to design novel sequences that can form a β-sheet specifically with the target.

Key words Beta strand, Beta sheet, Complementary sequence, Amyloid, Computational design, Amino acid propensity

1 Introduction
One of the major elements of protein structure is the β-sheet, which consists of adjacent linear strands in a parallel or antiparallel arrangement. A β-sheet is composed of two or more β-strands connected by hydrogen bonds. Though the bonds are formed between backbone atoms, there are certain statistical propensities of residue pairs between β-strands [1, 2]. For example, the diphenylalanine (FF) motif can self-assemble through π-stacking interactions [3]. Amino acids with aromatic side chains tend to form β-sheet pairings with adjacent valine or glycine [2]. Charge–charge interactions between neighboring strands may also be important to form β-sheets [4].

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_4, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

β-strand forming peptides have practical applications in many fields. The self-assembling nature of β-sheet forming amyloidogenic peptides has profound applications in nanomedical sciences. Various peptides derived from amyloid-β (Aβ), which is an important biomarker for Alzheimer’s disease [5], are utilized to construct diverse nanostructures [6–9]. Their mechanical stability, thermal robustness, and biocompatibility have significant advantages in medical engineering such as tissue regeneration and engineering [10, 11]. On the other hand, serious diseases such as Alzheimer’s, Parkinson’s, type II diabetes, and amyloidosis [12, 13] are related to β-sheet aggregation, and there are therapeutic agents for such aggregate-prone regions [14, 15]. Recent studies have shown that amyloidogenic peptides can be used for β-sheet stacking [16, 17], implying that complementary β-strand sequences can be utilized as therapeutic peptides [15, 18].

The importance and applicability of β-sheet forming peptides have attracted great attention, and a number of computational algorithms for the identification of β-sheet forming regions have been developed based on known β-strand pairing patterns [19–26]. However, utilization of such patterns has critical limitations because they have poor commonalities and thus are in general very noisy [27–29]. Therefore, an alternative strategy for the recognition of meaningful patterns is required. Here, we provide a step-by-step guide to the B-SIDER (β-Stacking Interaction DEsign for Reciprocity) [30] software, which finds complementary β-strand sequences for a given peptide.
The computational method generates target-specific statistical patterns for the query from a well-curated structure database. Significant statistical patterns are amplified by overlapping partially matched pairs (Fig. 1).

2 Materials

2.1 Software
B-SIDER is a Python-implemented database search algorithm. The software can run on any standard desktop machine with the necessary software components.
1. Python 3.4 or higher (https://www.python.org).
2. The B-SIDER software:
(a) Make a root directory for the software (e.g., [/home/user/B-SIDER]). Henceforth, directories and file names are in [italics]; customize as desired.
(b) Download B-SIDER (available at https://github.com/yoonjoolab/B-SIDER). The main script of B-SIDER is a single executable Python file. The database required to run B-SIDER and other scripts are in [B-SIDER/database]. A Jupyter notebook file is also included in the root folder.
3. The database builder file [B-SIDER/database/B-SIDER_DB_builder.py] is written for PyMOL [31].
(a) PyMOL 1.8 or higher (https://pymol.org/2/#download).

Fig. 1 Overview of the B-SIDER protocol. (a) Identical sequences (YLLYY) form an antiparallel β-sheet (PDB ID: 4E0L). (b) B-SIDER divides the query sequence into smaller peptides (minimum length by default: 3) and finds exact matches from a predefined database (see Fig. 2). Complementary sequences of the matches are extracted. (c) A position-specific score matrix (PSSM) is constructed from the identified complementary sequences. In this example, hydrophobic amino acids are likely to be complementary to the target sequence. (d) Based on the PSSM, it can be estimated how likely a pair of identical target sequences can form a β-sheet. The rank is evaluated against randomly generated sequences. In this case, YLLYY may be self-assembled as an antiparallel β-sheet with a high probability

2.2 Database Files
The complementarity sequence database is an essential component of B-SIDER. It is pre-built for users and placed in [B-SIDER/database/comp_seq_DB.db]. The background frequency file is optional and built-in, but users may utilize other values for their own purposes. The files should follow specific formats described in the Notes.
1. comp_seq_DB.db: Complementarity sequence database, in SQLite3 format (see Note 1).
2. background_frequency.csv: Amino acid background frequencies [30, 32], in CSV format (see Note 2). The values are precalculated and hard-coded in B-SIDER.

2.3 Output Files
The primary output values of B-SIDER are the best-matched complementary sequence for the given target sequence and its complementarity score (the lower the better). Users can also extract intermediate result files as output.
1. Position-specific score matrix (PSSM), in CSV (comma-separated value) format (see Note 3).
2. Sequences used to construct the PSSM, in plain text format (see Note 4).

3 Methods
The methods are illustrated with a case study application to a β-sheet amyloid mimic [17]. In this example, two identical sequences (YLLYY) form an antiparallel β-sheet. This self-assembling nature is frequently observed in disease-related amyloidogenic peptides [33]. B-SIDER takes an amino acid sequence as input. The query is then split into smaller linear fragments and compared to sequences in a database. If there are any exact matches, their complementary sequences are extracted. The final output is a PSSM score and the best sequence based on the PSSM. The B-SIDER software can also calculate the complementarity between two given sequences and compare it with their complementarity to randomly generated sequences, to estimate how likely it is that the pair can form a β-sheet.

3.1 Database Construction
B-SIDER identifies highly probable complementary sequences from a well-defined database.
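The fragment-splitting step mentioned in the Methods overview (the query is divided into linear fragments from the full length down to a minimum length, 3 by default) can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name `enumerate_fragments` and the returned (start, fragment) tuples are hypothetical and are not taken from the B-SIDER source code, which may organize this step differently.

```python
def enumerate_fragments(target, min_frag=3):
    """List every contiguous fragment of `target`, longest first.

    Hypothetical helper (not part of B-SIDER): returns (start, fragment)
    tuples for all lengths from len(target) down to `min_frag`.
    """
    if min_frag < 1:
        raise ValueError("min_frag must be a positive integer")
    fragments = []
    # Walk from the full-length query down to the minimum fragment length.
    for length in range(len(target), min_frag - 1, -1):
        # Slide a window of this length along the query.
        for start in range(len(target) - length + 1):
            fragments.append((start, target[start:start + length]))
    return fragments


if __name__ == "__main__":
    for start, frag in enumerate_fragments("YLLYY"):
        print(start, frag)
```

Keeping the start index with each fragment is one simple way to remember which query positions a database match would contribute to when the per-position statistics are later accumulated.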
A full database is provided in the B-SIDER software. If one wants to update the existing database, or construct a new one from scratch, the PyMOL script [B-SIDER/database/B-SIDER_DB_builder.py] can be utilized. In this section, a crystal structure of S. enterica CheB methylesterase (PDB ID: 1CHD chain A) is used as an example. There are nine strands in the 1CHD:A structure: one pair makes an antiparallel β-sheet and seven consecutive parallel strands form a parallel β-sheet (Fig. 2).
1. The script takes two arguments: the structure file, which is required, and the database file ([example.db] in the following example), which is optional.
    $ pymol 1chdA.pdb -cq B-SIDER_DB_builder.py -- 1chdA.pdb example.db
2. If the argument for the database is missing, [comp_seq_DB.db] is automatically generated.
    $ pymol 1chdA.pdb -cq B-SIDER_DB_builder.py -- 1chdA.pdb
3. Users can create their own databases using either a new set of protein structures or in-house scripts that follow the structure of the database (see Note 1).

Fig. 2 B-SIDER database. (a) There are nine strands (numbered from 0 to 8) which form two β-sheets in 1CHD:A (parallel β-sheet in blue and antiparallel in red). (b) An example of a B-SIDER database. The PyMOL script identifies β-strands and their complementary sequence information from the structure

3.2 Identification of Complementary Sequence and Construction of Position-Specific Score Matrix
Once a database is constructed, the only essential query input is a target sequence. There are also several other options available (see Note 5). Some basic commands are as follows:
1. To find complementary sequences of a query sequence (e.g., YLLYY), the basic command line is:
    $ python B-SIDER.py -t YLLYY
2. The default complementarity direction is antiparallel. For a parallel arrangement, the direction needs to be specified (0 for antiparallel and 1 for parallel).
    $ python B-SIDER.py -t YLLYY -p 1
3. All the processes are printed as standard output. One may suppress the intermediate output by controlling verbosity (0 to suppress standard output and 1 to print it).
    $ python B-SIDER.py -t YLLYY -v 0
4. If a new database file is used for the sequence, one can explicitly specify the database. For example, if the new database is [example.db]:
    $ python B-SIDER.py -t YLLYY -d example.db
5. All matched complementary sequences can be saved in a text file.
    $ python B-SIDER.py -t YLLYY -d example.db -o complementary_seq.txt
6. The position-specific score matrix can be saved in CSV format (see Note 3).
    $ python B-SIDER.py -t YLLYY -d example.db -o complementary_seq.txt -s score.csv
7. There are other minor options, which are provided but not recommended unless necessary (see Note 5).
8. B-SIDER can be loaded as a Python module.
(a) Load B-SIDER.
    >>> import importlib
    >>> bsider = importlib.import_module("B-SIDER")
(b) Initialize a B-SIDER class object (see Note 6). The class initialization needs to specify a database. If not given, the default database file [./database/comp_seq_DB.db] is loaded.
    >>> a = bsider.B_SIDER("./database/comp_seq_DB.db")
(c) Specify a target sequence, parallelity, and sequence output file, and search for complementary sequences from the database.
    >>> t = "YLLYY"  # target sequence
    >>> p = False  # parallelity: antiparallel
    >>> n = 3  # minimum fragment length (residues) for the search
    >>> s = "ex_complementary_sequences.txt"  # sequence output
    >>> a.comp_seq_search(target=t, parallel=p, min_frag=n, output=s)
(d) Build a PSSM. If necessary, specify the background frequency file (see Note 2). The score matrix is printed as standard output, but can be saved as a CSV file.
    >>> b = "./database/background_frequency.csv"
    >>> o = "ex_score_output.csv"
    >>> a.build_score_matrix(background_frequency=b, output=o)

3.3 Estimation of Complementarity Against Random Sequences
Given two sequences, B-SIDER can estimate how likely the two sequences are to form a β-sheet together, compared to forming one by associating with randomly generated sequences. According to our previous study [30], B-SIDER scores of amyloidogenic sequences, which are prone to aggregation and forming β-sheets, tend to rank approximately within the top 5% when compared with randomly generated sequences.
1. The basic execution only requires two sequences. The other options (see Note 5) can also be specified if required.
    $ python B-SIDER.py -t YLLYY -c YYLLY
2. By default, 10,000 randomly generated sequences are used for comparison. The number of random sequences (e.g., 100,000) can be specified as follows:
    $ python B-SIDER.py -t YLLYY -c YYLLY -n 100000
3. If loaded as a Python module, call the following method.
    >>> c = "YYLLY"
    >>> n = 100000
    >>> a.compare_complementarity(comp_seq=c, randnum=n)
4. The quantile rank score of the two sequences is on average 1.94% (σ: 0.14), which indicates that the sequences are highly likely to form a β-sheet (Fig. 1d).

4 Notes
1. The file for the complementary strand sequences consists of two tables, pdb and strand, where
(a) pdb has six elements as follows:
- code: The name of the structure (e.g., 1chdA.pdb).
- parallel: Parallelity of two strands; “0” for antiparallel and “1” for parallel β-sheet.
- strand1: The index of strand 1.
- strand2: The index of strand 2.
- seq1: The sequence of strand 1.
- seq2: The sequence of strand 2.
(b) strand has four elements: code as in the pdb table, chain for the chain identifier, strand_num for the strand index, and range for residue numbers separated by commas.
(c) The current sequence database is built based on a precompiled PISCES [34] list (sequence identities <90% by chain, resolution <3 Å, R-factor < 0.3, sequence length from 40 to 10,000). The original work additionally used TM-score < 0.7 [30, 35]. However, although TM-score calculation is time-consuming, very few sequences are actually filtered out (data not shown), so this filter is no longer used in B-SIDER.
2. The background amino acid frequency file is in two-column CSV format, with each row listing an amino acid (“AA”) and its frequency (“freq”). The background frequency is calculated from HOMSTRAD [32, 36]. The values are hard-coded in B-SIDER, so the file does not have to be specified if one uses default values.
    AA  Freq
    A   0.0628
    C   0.0193
    E   0.0537
    D   0.0709
    G   0.0513
    F   0.0315
    I   0.0356
    H   0.0224
    K   0.0577
    M   0.0155
    L   0.0637
    N   0.0523
    Q   0.0343
    P   0.0784
    S   0.0694
    R   0.0427
    T   0.0642
    W   0.0106
    V   0.0499
    Y   0.03
3. The position-specific score matrix contains the log-scaled amino acid frequency for each position [30, 32].
    Pos     1      2      3      4      5
    Target  Y      L      L      Y      Y
    A       0.173  0.103  0.058  0.301  0.380
    C       0.129  0.325  0.151  0.135  0.049
    D       0.295  1.301  1.472  0.932  0.993
    ...     ...    ...    ...    ...    ...
    T       0.042  0.097  0.115  0.249  0.488
    V       0.847  0.987  0.970  0.959  0.946
    W       1.371  0.598  0.602  0.968  2.454
    Y       0.644  0.795  0.620  0.892  0.699
4. The target sequence is divided into shorter linear fragments ranging from the full length to a user-defined number of residues (default: 3; see Note 5i).
5. The following options are available for B-SIDER.
(a) -h or --help: Help message and exit.
(b) -v or --verbose [0 or 1]: Output verbosity. “1” for True and “0” for False. The default is True.
(c) -t or --target [sequence]: Target sequence.
(d) -p or --parallel [0 or 1]: Parallelity. “1” for parallel sheet and “0” for antiparallel. The default is False.
(e) -d or --database [database file]: Database file. If not given, the script tries to use [./database/comp_seq_DB.db] (see Note 1).
(f) -b or --background_freq [background file]: Background frequency file in CSV. If not given, default values are used (see Note 2).
(g) -s or --score_output [score output file]: Output file for the position-specific score matrix in CSV (see Note 3). The default is none, and if the verbosity is true, the score matrix is printed as standard output.
(h) -o or --seq_output [sequence output file]: List of sequences in text used to construct the position-specific score matrix. The default is none, and if the verbosity is true, the score matrix is printed as standard output.
(i) -m or --min_frag [integer value > 0]: The minimum fragment length (number of residues) for the search. The default is 3 (see Note 4).
(j) -c or --compare [sequence]: Sequence for comparison. If present, the complementarity of this sequence for the target is compared against a certain number of random sequences.
(k) -n or --randnum [integer value > 0]: Number of random sequences to compare. The default is 10,000.
6. The B_SIDER class object has three main methods, as follows:
(a) comp_seq_search(self, target, parallel, min_frag=3, output=None): Database search.
(b) build_score_matrix(self, output=None, background_frequency=None): Construction of a PSSM.
(c) compare_complementarity(self, comp_seq, randnum=10000): Estimation of complementarity against random sequences.

5 Conclusion
Through an extensive benchmark study and experimental validation [30], we showed that B-SIDER is practically applicable to the design of novel peptides. Future developments will address the design of novel β-sheet proteins and nanostructure scaffolds.

Acknowledgments
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1A5A2024181).

References
1.
Mandel-Gutfreund Y, Zaremba SM, Gregoret LM (2001) Contributions of residue pairing to β-sheet formation: conservation and covariation of amino acid residue pairs on antiparallel β-strands. J Mol Biol 305(5):1145–1159 2. Steward RE, Thornton JM (2002) Prediction of strand pairing in antiparallel and parallel β-sheets using information theory. Proteins 48 (2):178–191 3. Reches M, Gazit E (2003) Casting metal nanowires within discrete self-assembled peptide nanotubes. Science 300(5619):625–627 4. Shammas SL, Knowles TP, Baldwin AJ, MacPhee CE, Welland ME, Dobson CM, Devlin GL (2011) Perturbation of the stability of amyloid fibrils through alteration of electrostatic interactions. Biophys J 100(11):2783–2791 5. Lansbury PT, Lashuel HA (2006) A centuryold debate on protein aggregation and neurodegeneration enters the clinic. Nature 443 (7113):774–779 6. Ryu J, Park CB (2008) High-temperature selfassembly of peptides into vertically well-aligned nanowires by aniline vapor. Adv Mater 20 (19):3754–3758 7. Smith AM, Williams RJ, Tang C, Coppo P, Collins RF, Turner ML, Saiani A, Ulijn RV (2008) Fmoc-diphenylalanine self assembles to a hydrogel via a novel architecture based on π–π interlocked β-sheets. Adv Mater 20 (1):37–41 8. Yan X, Cui Y, He Q, Wang K, Li J (2008) Organogels based on self-assembly of diphenylalanine peptide and their application to immobilize quantum dots. Chem Mater 20 (4):1522–1526 9. Yan X, He Q, Wang K, Duan L, Cui Y, Li J (2007) Transition of cationic dipeptide nanotubes into vesicles and oligonucleotide delivery. Angew Chem 119(14):2483–2486 10. Stupp SI (2010) Self-assembly and biomaterials. Nano Lett 10(12):4783–4786 94 Yoonjoo Choi 11. Zhang S (2003) Fabrication of novel biomaterials through molecular self-assembly. Nat Biotechnol 21(10):1171–1178 12. Chiti F, Dobson CM (2017) Protein misfolding, amyloid formation, and human disease: a summary of progress over the last decade. Annu Rev Biochem 86:27–68 13. 
Richardson JS, Richardson DC (2002) Natural β-sheet proteins use negative design to avoid edge-to-edge aggregation. Proc Natl Acad Sci 99(5):2754–2759 14. Giorgetti S, Greco C, Tortora P, Aprile FA (2018) Targeting amyloid aggregation: an overview of strategies and mechanisms. Int J Mol Sci 19(9):2677 15. Sormanni P, Aprile FA, Vendruscolo M (2015) Rational design of antibodies targeting specific epitopes within intrinsically disordered proteins. Proc Natl Acad Sci 112(32):9902–9907 16. Gallardo R, Ramakers M, De Smet F, Claes F, Khodaparast L, Khodaparast L, Couceiro JR, Langenberg T, Siemons M, Nyström S (2016) De novo design of a biologically active amyloid. Science 354(6313):aah4949 17. Liu C, Zhao M, Jiang L, Cheng P-N, Park J, Sawaya MR, Pensalfini A, Gou D, Berk AJ, Glabe CG (2012) Out-of-register β-sheets suggest a pathway to toxic amyloid aggregates. Proc Natl Acad Sci 109(51):20913–20918 18. Kumar DKV, Choi SH, Washicosky KJ, Eimer WA, Tucker S, Ghofrani J, Lefkowitz A, McColl G, Goldstein LE, Tanzi RE (2016) Amyloid-β peptide protects against microbial infection in mouse and worm models of Alzheimer’s disease. Sci Transl Med 8 (340):340ra372 19. Bryan AW Jr, Menke M, Cowen LJ, Lindquist SL, Berger B (2009) BETASCAN: probable β-amyloids identified by pairwise probabilistic analysis. PLoS Comput Biol 5(3):e1000333 20. Fernandez-Escamilla A-M, Rousseau F, Schymkowitz J, Serrano L (2004) Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotechnol 22(10):1302–1306 21. Kim C, Choi J, Lee SJ, Welsh WJ, Yoon S (2009) NetCSSP: web application for predicting chameleon sequences and amyloid fibril formation. Nucleic acids research 37 (suppl_2):W469–W473 22. Maurer-Stroh S, Debulpaep M, Kuemmerer N, De La Paz ML, Martins IC, Reumers J, Morris KL, Copland A, Serpell L, Serrano L (2010) Exploring the sequence determinants of amyloid structure using position-specific scoring matrices. Nat Methods 7(3):237–242 23. 
Tartaglia GG, Vendruscolo M (2008) The Zyggregator method for predicting protein aggregation propensities. Chem Soc Rev 37 (7):1395–1401 24. Trovato A, Chiti F, Maritan A, Seno F (2006) Insight into the structure of amyloid fibrils from the analysis of globular proteins. PLoS Comput Biol 2(12):e170 25. Tsolis AC, Papandreou NC, Iconomidou VA, Hamodrakas SJ (2013) A consensus method for the prediction of ‘aggregation-prone’ peptides in globular proteins. PLoS One 8(1): e54175 26. Zibaee S, Makin OS, Goedert M, Serpell LC (2007) A simple algorithm locates β-strands in the amyloid fibril core of α-synuclein, Aβ, and tau using the amino acid sequence alone. Protein Sci 16(5):906–918 27. Bhattacharjee N, Biswas P (2010) Positionspecific propensities of amino acids in the β-strand. BMC Struct Biol 10(1):1–10 28. Fujiwara K, Toda H, Ikeguchi M (2012) Dependence of α-helical and β-sheet amino acid propensities on the overall protein fold type. BMC Struct Biol 12(1):1–15 29. Hutchinson EG, Sessions RB, Thornton JM, Woolfson DN (1998) Determinants of strand register in antiparallel β-sheets of proteins. Protein Sci 7(11):2287–2300 30. Yu T-G, Kim H-S, Choi Y (2019) B-SIDER: computational algorithm for the design of complementary β-sheet sequences. J Chem Inf Model 59(10):4504–4511 31. Schrödinger LLC The PyMOL molecular graphics system. Version 20 32. Choi Y, Deane CM (2010) FREAD revisited: accurate loop structure prediction using a database search algorithm. Proteins 78 (6):1431–1440 33. Haass C, Selkoe DJ (2007) Soluble protein oligomers in neurodegeneration: lessons from the Alzheimer’s amyloid β-peptide. Nat Rev Mol Cell Biol 8(2):101–112 34. Wang G, Dunbrack RL Jr (2003) PISCES: a protein sequence culling server. Bioinformatics 19(12):1589–1591 35. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33 (7):2302–2309 36. 
Mizuguchi K, Deane CM, Blundell TL, Overington JP (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci 7(11):2469–2471

Chapter 5
Dynamics of Amyloid Formation from Simplified Representation to Atomistic Simulations
Phuong Hoang Nguyen, Pierre Tufféry, and Philippe Derreumaux

Abstract
Amyloid fibril formation is an intrinsic property of short peptides, non-disease proteins, and proteins associated with neurodegenerative diseases. Aggregates of the Aβ and tau proteins, the α-synuclein protein, and the prion protein are observed in the brain of Alzheimer’s, Parkinson’s, and prion disease patients, respectively. Due to the transient short-range and long-range interactions of all species and their high aggregation propensities, the conformational ensembles of these devastating proteins, with the exception of the monomeric prion protein, remain elusive to standard structural biology methods in bulk solution and in lipid membranes. To overcome these limitations, an increasing number of simulations using different sampling methods and protein models have been performed. In this chapter, we first review our main contributions to the field of amyloid protein simulations aimed at understanding the early aggregation steps of short linear amyloid peptides, the conformational ensemble of the Aβ40/42 dimers in bulk solution, and the stability of Aβ aggregates in lipid membrane models. Then we focus on our studies of the interactions between amyloid peptides and inhibitors to prevent aggregation, and of long amyloid sequences, including new results on a monomeric tau construct.

Key words Amyloid, Aggregation, Simulations, Intrinsically disordered proteins, Bulk solution, Membranes, Inhibitors, Aβ, Tau, α-synuclein

1 Introduction
Aβ, tau, α-synuclein, and prion protein misfolding and aggregation in the central nervous system lead to three neurodegenerative diseases in the elderly: Alzheimer’s, Parkinson’s, and prion diseases [1–3].
The senile extracellular Alzheimer’s disease plaques between neurons contain Aβ42 and Aβ40 peptides, Aβ42 being more toxic than Aβ40, with the Aβ sequence consisting of a charged hydrophilic N-terminus (residues 1–16), the central hydrophobic core (CHC, residues 17–21), a charged region (residues 22–29), and a hydrophobic C-terminus (residues 30–40/42). The Aβ peptides result from the proteolytic cleavage of the transmembrane amyloid precursor protein (APP) by secretases [2]. The senile intracellular Alzheimer’s disease filaments in neurons are made of hyperphosphorylated tau proteins of 441 residues consisting of a charged N-terminal region (residues 1–207), a proline-rich region, a microtubule-binding domain of four repeats spanning residues 244–368, and a C-terminal region [4]. The α-synuclein protein has a membrane-binding domain (residues 1–66), a hydrophobic region spanning residues 67–96, and a highly charged C-terminal region (residues 97–140) implicated in calcium binding and in protein–protein interactions at the surface of synaptic vesicles [1, 5]. In contrast to the monomeric disordered Aβ, tau, and α-synuclein structures, the monomeric prion structure obtained by solution nuclear magnetic resonance (NMR) on the full-length protein and the C-terminal domain of both mouse and Syrian hamster prion reveals an unstructured N-terminal fragment (23–124) and a C-terminal globular domain (125–228) with three α-helices and a short antiparallel β-sheet between helices 1 and 2 [6]. Despite low sequence identity, the four proteins aggregate over time to amyloid fibrils with a common cross-β structure [7–10].

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_5, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
Aggregation is a nonequilibrium process in which the amyloid proteins form fibrils with sigmoidal kinetic profiles: the proteins self-assemble (lag phase) prior to fibril elongation (growth phase), followed by a plateau where the fibrils and free monomers are in equilibrium (saturation phase) [7, 11]. The lag-phase time of fibrillation and the fibril growth time vary with amino acid length, mutation, temperature, pH, concentration, agitation, shear forces, metal ions, crowding, and the presence of lipids such as cholesterol [11–15]. Of note, in vitro aggregation of full-length tau requires the addition of heparin due to the highly charged N- and C-terminal regions and is impacted by the pattern of hyperphosphorylation [16].

The molecular mechanisms involved in amyloid fibril formation are well described by primary nucleation, (fragmentation and surface-catalyzed) secondary nucleation, and elongation. It is now possible to determine the rate constants of each of these microscopic processes operating at multiple time scales by fitting the solutions of nonlinear chemical master equations to the experimental aggregation curves [11, 15]. Such knowledge is essential to design molecular inhibitors that selectively target mature amyloid fibrils or intermediate oligomeric species and interfere with the desired microscopic aggregation step.

Interestingly, the Aβ, tau, and α-synuclein proteins have hydrophobic aggregation-prone regions which by themselves form amyloid fibrils in vitro, e.g., the CHC consisting of 17LVFFA21 and the C-terminus 37GGVIA42 in Aβ, the PHF6* (275VQIINK280) and PHF6 (306VQIVYK311) motifs in tau, and the NAC region in α-synuclein [17–19]. These fragments have therefore served as ideal models for understanding amyloid fibril formation by computational studies [20, 21].
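The microscopic steps named earlier in this section (primary nucleation, secondary nucleation, and elongation) are often summarized by two coupled rate equations for the fibril number concentration P and the fibril mass concentration M. The sketch below is a hedged illustration only: it assumes a commonly used moment-closure form that neglects fragmentation, it is not claimed to be the specific master equation fitted in refs. [11, 15], and the function name `simulate_aggregation` and all rate constants are hypothetical values chosen merely to produce a sigmoidal curve.

```python
def simulate_aggregation(m_tot=3e-6, k_n=3e-4, k_plus=3e6, k_2=1e4,
                         n_c=2, n_2=2, t_end=20000.0, dt=1.0):
    """Integrate a simple nucleation/elongation model with explicit Euler steps.

    Assumed (illustrative) rate equations, with m = m_tot - M the free monomer:
        dP/dt = k_n * m**n_c + k_2 * m**n_2 * M   (primary + secondary nucleation)
        dM/dt = 2 * k_plus * m * P                (elongation at both fibril ends)
    All parameter values are hypothetical. Returns a list of
    (time in s, fraction of monomer converted to fibrils).
    """
    P = 0.0  # fibril number concentration (M)
    M = 0.0  # fibril mass concentration (M, in monomer units)
    trajectory = [(0.0, 0.0)]
    n_steps = int(round(t_end / dt))
    for step in range(1, n_steps + 1):
        m = max(m_tot - M, 0.0)                  # free monomer left
        dP = k_n * m ** n_c + k_2 * m ** n_2 * M
        dM = 2.0 * k_plus * m * P
        P += dP * dt
        M = min(M + dM * dt, m_tot)              # mass cannot exceed total monomer
        trajectory.append((step * dt, M / m_tot))
    return trajectory


if __name__ == "__main__":
    traj = simulate_aggregation()
    for t, frac in traj[::2000]:
        print(f"{t:8.0f} s  fibril fraction = {frac:.3f}")
```

With these illustrative numbers the curve shows a lag phase, a growth phase, and a plateau; in practice such rate constants would be obtained by fitting to measured aggregation curves rather than set by hand.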
It has also been demonstrated that the early formed aggregates of Aβ, tau, and α-synuclein are toxic [2, 22, 23]. This finding has also motivated many computational studies to disclose the structures of the oligomeric species in various environments. This is a challenge computationally due to the size and heterogeneity of the conformational ensemble to be explored and the need for accurate force fields to model intrinsically disordered proteins [24, 25]. The next sections review our main contributions to the field of amyloid protein simulations. We also present new results of the application of the PEP-FOLD framework to a monomer of a tau construct.

2 Early Aggregation Steps of Short Linear Amyloid Peptides in Bulk Solution
To accelerate aggregation, all simulations use a millimolar (mM) peptide concentration, namely three and six orders of magnitude higher than in vitro and in vivo conditions, respectively. Various oligomers of the Aβ16–22, Aβ37–42, Aβ25–35, Aβ11–25, KFFE, GNNQQNY, and NNQQ from yeast protein Sup35, NFGAIL, and SNNFGAILSS from islet amyloid polypeptide (IAPP), and NHVTLSQ from β2-microglobulin peptides (Table 1) were explored by coarse-grained (OPEP) and atomistic simulations, based on molecular dynamics (MD), Monte Carlo (MC), the Activation-Relaxation Technique (ART), standard replica exchange MD (REMD), and replica exchange with solute tempering (REST2) [26–46]. OPEP, which consists of an all-atom representation of the backbone, one bead per side chain (with the exception of proline), and an implicit solvent model, was parametrized on well-folded peptides and proteins [47–52]. ART accelerates the search for lowest-energy microstates by repeatedly finding activated mechanisms that connect minima via first-order saddle points and accepting or rejecting them by the Metropolis criterion [53–55].
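The accept/reject step mentioned for ART above follows the standard Metropolis rule: a move to a lower-energy state is always accepted, and an uphill move of size ΔE is accepted with probability exp(−ΔE/kT). The snippet below is a minimal sketch of that rule alone; the function name `metropolis_accept` and the energy units are illustrative assumptions, and the event generation that ART performs (activation to a saddle point and relaxation to a new minimum) is not reproduced here.

```python
import math
import random


def metropolis_accept(delta_e, kT, rng=random):
    """Return True if a move with energy change `delta_e` is accepted.

    Standard Metropolis rule: downhill (or flat) moves are always accepted,
    uphill moves are accepted with probability exp(-delta_e / kT).
    `delta_e` and `kT` must be in the same (arbitrary) energy units.
    """
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
    kT = 0.6                # illustrative value, e.g. kcal/mol near room temperature
    for delta_e in (-1.0, 0.3, 0.6, 3.0):
        accepted = sum(metropolis_accept(delta_e, kT, rng) for _ in range(10000))
        print(f"dE = {delta_e:5.2f}  acceptance ratio = {accepted / 10000:.3f}")
```

Passing the random number generator explicitly keeps the sketch reproducible and easy to test.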
Table 1 Summary of the different amyloid systems studied
Applications^a | Sampling | Force field | References
Aggregation of short linear peptides | ART^b | OPEP | [26–32]
 | MD^c | OPEP, all-atom | [33–37]
 | REMD^d | All-atom | [38–40]
 | REMD^d | OPEP | [41–46]
 | REST2^e | All-atom | [42]
Nucleation sizes of Aβ16–22 and Aβ37–42 | REMC | On-lattice OPEP | [71–73]
Impact of hydrodynamics on early aggregation steps | LBMD | OPEP | [78, 79]
Aβ40 WT, Aβ42 WT dimers | REMD | All-atom | [87, 88]
Aβ40 WT, Aβ42 WT, D23N monomers and dimers | ART, REMD | OPEP | [89, 90]
Aβ16–35 monomer and dimer | REMD | OPEP | [92]
Aβ42 (S8C) dimers | REMD | All-atom | [93]
Aβ42 H6R, D7N dimers | MD | All-atom | [94–96]
Aβ42 A2T, A2V dimers | REMD | All-atom | [97, 99]
Aβ1–28 WT and A2V monomers | REMD | All-atom | [98]
Trimeric Aβ11–42 structures in DPPC membrane | REMD | All-atom | [115]
Tetrameric Aβ40, Aβ42 WT, D23N, A2T β-barrels and Aβ helix bundles in DPPC lipid membrane | REMD | All-atom | [112, 113, 116]
Aβ29–42 dimer with membranes containing omega-3 and omega-6 fatty acids | MD | All-atom | [118]
Aβ1–28/NQ-Trp | REMD | All-atom | [124, 125]
Aβ16–22 oligomers/N-methylated inhibitor | MD | OPEP | [120]
Aβ17–42 trimers/small molecule inhibitors | REMD | All-atom | [121]
Aβ40/42 dimers/small molecule inhibitors | MD, REMD | All-atom | [122, 123, 127]
Aβ16–22, Aβ25–35 oligomers/carbon nanotubes | MD, REMD | All-atom | [134, 135]
Prion (127–164) monomer | MC | OPEP | [142]
Prion WT monomer | MD | All-atom | [143, 144, 147]
 | REMD | All-atom | [145]
Prion (125–228) dimer | MD | OPEP | [148]
Dimeric tau (306–378) construct and phosphorylation | REMD, MD | All-atom | [149]
α-synuclein monomer | PEP-FOLD | | [25]
Monomeric tau (306–378) | PEP-FOLD | OPEP | This work
a Simulations in bulk solution unless specified
b Aβ16–22 dimer and trimer, Aβ11–25 tetramer, NFGAIL dodecamer, KFFE dimer up to heptamer, and 3-mers, 12-mers and 20-mers of GNNQQNY
c All-atom for Aβ16–22 dimer, OPEP for 4-mers up to 16-mers of β2m(83–89), Aβ(16–22) 8-mers and IAPP(20–29) trimer
d All-atom for Aβ16–22 dimer and trimer, 7-mers of β2m(83–89) and 16-mers of Aβ37–42; OPEP for ccβ peptide, 2-mers and 3-mers of Aβ16–22, 6-mers of Aβ25–35, 20-mers of NNQQ and 3-mers
e 12-mers and 20-mers of GNNQQNY

Independently of the protein (coarse-grained or atomistic) and solvent (implicit or explicit) representation and the amino acid sequence, self-assembly of up to 20 peptides starts by a hydrophobic collapse and the formation of disordered oligomers, which evolve in time to transient β-rich aggregates [35–37]. These heterogeneous β-rich assemblies can have various sheet-to-sheet pairing angles ranging from parallel to perpendicular, and form open and closed β-barrels [28–30, 37, 39, 42]. Of note, these topologies have been observed in vitro by X-ray crystallography of macrocyclic β-sheet mimics [56] and designed amyloid peptides [57]. For all peptides, however, the free energy landscape is dominated by amorphous aggregates [34, 42, 43]. Our studies, like other computational studies [58, 59], highlight the impact of the monomeric amyloid-competent state, which corresponds to an extended conformation for short amyloid sequences, on (1) the kinetics of aggregation, e.g., the association/dissociation times of all oligomers and the conversion time between architectures, and (2) the thermodynamics of aggregation, such as the population of the substates having different β-strand mismatches (mixed parallel/antiparallel β-strands), β-sheet sizes, and topologies [27, 35]. The simulations also emphasize a complex free energy landscape with β-rich oligomers having in-register and out-of-register conformations and the existence of many kinetic traps [26–31, 40, 42], as has been observed experimentally in the late steps of aggregation [60] or computationally by Markov chain models in the lock phase of oligomers [61]. It is not yet understood which modern force field is the most appropriate for describing the aggregation pathways and the equilibrium ensemble of intrinsically disordered proteins [33, 62–66].
Similarly, the nucleus size (N*) for primary nucleation, corresponding to the highest free energy aggregate from which fibrils can grow, remains elusive, with simulations predicting N* between 5 and 40 [21, 67–74]. Yet, this knowledge is important, as nucleation seed size determines amyloid clearance and establishes a barrier to prion appearance in yeast [75]. Another important issue, when performing coarse-grained simulations with an implicit solvent representation, is that the exchange of momenta between the peptide’s particles and the solvent, as well as solvent-mediated correlations, are ignored. In this context, we coupled Lattice-Boltzmann MD simulation, which naturally includes hydrodynamic interactions, to the CG OPEP model [76, 77] and performed simulations of Aβ16–22 peptides in bulk solution [78–80]. Two main results are important. For a system of 100 peptides, hydrodynamic interactions increase the oligomer size of the first two clusters and the exchange of peptides compared to the results of standard Langevin dynamics [78]. For a system of 1000 peptides at a concentration of 60 mM, we can follow the formation and growth of a large elongated oligomer and its slow β-sheet structuring resulting from the fusion and dissociation of small disordered aggregates. Many mechanisms are observed, from elongation to surface-catalyzed effects, and in particular the lateral growth mechanism on the surface of prefibrillar states, which is sustained by long-range hydrodynamic correlations and allows the formation of large branched structures consisting of 600 peptides, spanning a few tens of nanometers and hosting annular pores of dimensions 3–5 nm [79, 80]. This simulation, at a quasi-atomistic representation, of a system of unprecedented size (previous simulations explored at most 125 peptides with very few degrees of freedom) illustrates the critical contribution of secondary nucleation to amyloid aggregation kinetics.
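As a rough consistency check on the system size quoted above, the volume needed to hold 1000 peptides at 60 mM follows from V = N/(c N_A). The short sketch below is illustrative arithmetic only: it assumes a cubic simulation cell, which the text does not specify, and the function name is hypothetical.

```python
# Illustrative arithmetic: box size implied by a peptide count and a concentration.
# Assumption (not stated in the text): the simulation cell is a cube.
AVOGADRO = 6.02214076e23   # particles per mole
NM3_PER_LITER = 1.0e24     # 1 L = 1 dm^3 = (1e8 nm)^3


def cubic_box_edge_nm(n_peptides: int, conc_molar: float) -> float:
    """Edge length (nm) of a cube holding n_peptides at conc_molar (mol/L)."""
    volume_liters = n_peptides / (conc_molar * AVOGADRO)
    volume_nm3 = volume_liters * NM3_PER_LITER
    return volume_nm3 ** (1.0 / 3.0)


edge = cubic_box_edge_nm(1000, 0.060)
print(f"cubic box edge: {edge:.1f} nm")
```

Under this assumption the edge comes out at about 30 nm, which is the same order as the "few tens of nanometers" spanned by the branched aggregates.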
It is interesting that amyloid plaques with annular pores have been evidenced by electron microscopy and antibodies in the brain of Alzheimer’s disease patients [81]. Of note, our early phase of aggregation would approach the second time scale at nM (in vivo) concentrations.

3 Dimeric Aβ40/42 States in Bulk Solution

The wild-type (WT) Aβ40/42 dimers, known from experiments to be structurally heterogeneous, are of high interest as they are the smallest species to lead to tau hyperphosphorylation and neuritic degeneration, and their levels have been found to increase sharply and correlate with plaque load in brain tissue of Alzheimer’s disease patients [82]. Experiments have shown that the English (H6R) familial Alzheimer’s disease mutation, the Tottori (D7N) mutation, the Flemish (A21G) mutation, and the Iowa (D23N) mutation speed up the fibril formation process of both Aβ40 and Aβ42 peptides in vitro and increase toxicity to cells [21]. In contrast, the engineered disulfide-bond-locked double (S8C) mutant has been shown to form an exclusive homogeneous and neurotoxic dimer [83]. Mutations at position 2 have a dramatic impact on AD risk: A2V is causative, whereas A2T is protective. The A2V mutation enhances aggregation kinetics, while the A2T mutation only retards amyloid fibril formation by increasing the lag phase time. Interestingly, the mixture of WT and A2V also retards fibril formation and protects against Alzheimer’s disease [84–86]. The Aβ40 and Aβ42 alloforms were subjected to atomistic REMD in explicit solvent [87, 88]. Using several force fields, the equilibrium configurations are found to be disordered, with collision cross sections, hydrodynamic radii, and SAXS profiles of the ensembles independent of the force field. Intramolecular β-hairpins spanning residues 17–21 and 30–36 are, however, observed with a population varying from 1.5 to 13% according to the force field.
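To put the 1.5–13% spread in perspective, a population p can be converted into a two-state free energy difference, ΔG = −RT ln[p/(1 − p)]. The sketch below performs this conversion. The two-state treatment and the temperature of 300 K are assumptions made for illustration; they are not taken from the cited simulations, and the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def two_state_dg(population: float, temperature_k: float = 300.0) -> float:
    """Free energy (kcal/mol) of a state with the given fractional population,
    relative to the rest of the ensemble, in a two-state picture."""
    return -R_KCAL * temperature_k * math.log(population / (1.0 - population))


dg_low = two_state_dg(0.015)   # β-hairpin population of 1.5%
dg_high = two_state_dg(0.13)   # β-hairpin population of 13%
shift = dg_low - dg_high
print(f"dG(1.5%) = {dg_low:.2f}, dG(13%) = {dg_high:.2f}, shift = {shift:.2f} kcal/mol")
```

Under these assumptions, the force-field dependence of the β-hairpin population corresponds to a shift of roughly 1.4 kcal/mol, that is, about 2 kT at 300 K.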
The ensemble of both alloforms is stabilized by nonspecific interactions with many hydrophobic residues exposed to solvent, which explains their propensity to be toxic at the dimeric level. Simulations also reveal that the Aβ42 dimer has a higher propensity than the Aβ40 dimer to form β-strands at the CHC and at the C-terminus (residues 30–40), consistent with other computational studies [89, 90]. The dimers have no defined interfaces, and the random organization with transient secondary structures is substantially preferred over two chains with β-hairpin and fibril-like conformations [88]. The formation of, and the transition between, the latter two β-rich dimers have also been discussed in other studies [91, 92]. In atomistic simulations, the dimers of S8C and WT Aβ42 have the same secondary structures and collision cross sections. Upon S8C mutation, the lifetime of the intramolecular three-stranded sheet spanning residues 17–21, 30–36, and 39–41 is increased by a factor of 3. This single common structural feature shared by both species, which does not exist in the Aβ40 WT dimers, is likely to contribute to Aβ42 toxicity [93]. The H6R, D7N, A21G, and D23N mutations change the population of β-strand at the CHC region and at the C-terminus and impact the population of intramolecular and intermolecular salt bridges involving E22, D23, and K28 by reducing the formation time of the loop region for a fibril-like conformation. We also find, consistent with experimental studies, that these mutations produce different effects on the Aβ dimer depending on whether they occur in Aβ40 or Aβ42 [90, 94–96]. Comparing atomistic REMD simulations of WT, WT-A2V, and A2V-A2V Aβ40 dimers, we find that upon single mutation, the intrinsic disorder and the intermolecular potential energies are reduced, and the population of intramolecular three-stranded β-sheets is increased [97].
A reduced intrinsic disorder upon A2V mutation was already discussed for the Aβ28 monomer [98]. Analyzing REMD simulations of the WT-A2T Aβ40 dimer [99], we provide evidence that the longer lag phase of fibrillation results from an increase of intrapeptide stability over interpeptide stability in the heterozygous dimers. This finding was further corroborated by other simulations of A2T and A2V Aβ42 dimers [100, 101]. Investigating the local structure and dynamics of hydration, it was also found that the survival probability of ordered water molecules decays more rapidly for the Aβ N-terminus AD causative (A2V, H6R, D7N and D7H) mutants than for the A2T protective mutant [102].

4 Oligomeric States of Aβ in Lipid Membrane Models

Interactions of amyloid oligomers with cell membranes are believed to contribute to toxicity. On the one hand, the membrane can trigger amyloid fibril formation at a lower peptide concentration [103, 104]. On the other hand, accumulation of oligomers on the membrane surface can impart unequal stress on the bilayer, extract lipids and contribute to the formation of stable phospholipid/oligomer complexes, or create pores transporting Ca2+ ions through the membrane [105–110]. Due to the structural heterogeneity of the oligomers, many pores of various inner diameters made of different numbers of Aβ subunits have been modeled based on atomic force microscopy images and MD simulations [110]. Based on size exclusion chromatography, transmission electron microscopy, circular dichroism, and NMR information on the structural environment experienced by the Q15, N27, and M35 side chains [111], we recently modeled a tetrameric β-barrel consisting of two distinct β-hairpins, with an asymmetric arrangement of eight antiparallel β-strands and an inner pore diameter of 0.7 nm for both alloforms of Aβ [112].
Using extensive atomistic REMD simulations, we found that this barrel exists transiently for Aβ42 and not for Aβ40 within a dipalmitoylphosphatidylcholine (DPPC) lipid bilayer membrane, and this may explain the higher toxicity of Aβ42 relative to its Aβ40 counterpart [112]. The same simulations indicate that the lower and higher induced toxicity of the A2T and D23N mutants cannot be correlated to their tetrameric β-barrel pore-forming probabilities, at least in a DPPC membrane environment [113], but addition of cholesterol in the membrane composition could make a difference [114]. As the structures of amyloid oligomers inserted into a membrane remain elusive, we explored the stability of Aβ11–42 trimers with parallel (U-shape fibril) and antiparallel (β-hairpin) β-sheet structures in a DPPC membrane. Our REMD simulations strongly suggest that these two assemblies represent minimal seeds or nuclei for the formation of amyloid fibrils, a variety of β-barrel pores and various aggregates for Aβ sequences [115]. Apart from β-barrel pores, α-helical pores are also possible [107, 116]. The equilibrium structures are in all cases dependent, however, on the lipid composition [117]. For instance, it is known that the omega-3 polyunsaturated fatty acid slows the progression of Alzheimer’s disease, while its omega-6 counterpart is linked to increased risk. We showed by MD simulations that variation in the abundance of 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC), omega-3, and omega-6 modulates the conformational ensemble of the Aβ29–42 dimer [118].

5 Study of Inhibitors/Aβ Amyloid Oligomers

Understanding the mechanistic determinants of Aβ amyloid-inhibitor interaction is continuously pursued to develop more efficient drugs [70]. A plethora of inhibitors aimed at targeting either oligomers, the secondary nucleation or fibril elongation, have been tested in bulk solution and in vivo [119].
We performed numerous simulations to study the detailed interactions of Aβ fragments, Aβ40 and Aβ42 oligomers with a large set of inhibitors including N-methylated peptide, EGCG, curcumin, resveratrol, 1,4-naphthoquinon-2-yl-L-tryptophan (NQ-Trp), astaxanthin, and betanin [120–127]. We have shown, like other computational studies [128–130], that there are many binding sites with small occupancies and contact surfaces. Even with a binding pocket, there are multiple binding modes, demonstrating the transient character of Aβ oligomer/inhibitor interactions in bulk solution. This is consistent with the absence of specific interactions as evidenced by NMR and the low affinity of drugs for Aβ monomers and small oligomers [131]. Besides small molecules, carbon nanoparticles such as fullerene and carbon nanotubes can also impede the fibrillation of Aβ and β2m proteins [132, 133]. In this context, we have shown by atomistic REMD simulations that carbon nanotubes impact both the primary and secondary nucleation mechanisms by destabilizing β-rich oligomers and fibril assemblies for both the Aβ16–22 and Aβ25–35 peptides [134, 135]. Several reasons have been put forward to explain the repeated failures of small drugs and antibodies targeting Aβ oligomers [131]. As the lipid membrane can catalyze the formation of toxic amyloid intermediates, the interplay between amyloid oligomers, inhibitors, and the lipid membrane should also be considered more systematically [136, 137]. Similarly, clinical trials might consider a synergy of multiple drugs targeting Aβ and tau along with the use of laser or bubble cavitation techniques to destabilize amyloid aggregates [138–141].

6 Long Amyloid Constructs

The conformational ensemble of several monomeric prion fragments has been explored by coarse-grained OPEP Monte Carlo [48], and atomistic MD and REMD simulations.
The CG simulations on PrP(143–158), starting from random states, showed that this fragment forms a helix by itself, consistent with experiments [142]. In contrast, the fragment PrP(128–164) was found to code either for an α/β topology, as found in the NMR structure of recombinant full-length PrP, or a β-hairpin spanning residues 142–167, as found by NMR for PrP(142–167) [142]. Using atomistic MD, we showed that the helix H1 (residues 143–158) is rather stable upon P102L, M129V, and G131V mutation and deletion of the residues coding the first or the second β-strand [143, 144]; using REMD simulations, we found an intermediate state characterized by a significant detachment of helix H1 from the PrP core [145], consistent with other studies [146], which forms very easily for the E211D mutation in standard MD simulations [147]. This is of interest as mutation at position 211 drives a switch between Creutzfeldt-Jakob disease (CJD) and Gerstmann-Sträussler-Scheinker syndrome [147]. Apart from the implication of H1 detachment in the early steps of aggregation, we found that the CJD-causing T183A mutation accelerates the conversion of the helix H2 into β-sheets in the monomer and dimer of PrP [148]. In comparison to the Aβ protein, the number of computational studies on the α-synuclein monomer and long tau constructs is very small. We recently performed atomistic REMD simulations on the tau R3–R4 domain (residues 306–378) starting from the fibril topology [149]. Note that cryo-electron microscopy of full-length tau fibrils in the brain of an individual with Alzheimer’s disease reveals an R3–R4 domain with a C-shaped topology of eight β-sheets, while the rest of the protein is flexible [9]. We found that the WT R3–R4 dimer populates elongated, U-shaped, V-shaped, and globular topologies rather than C-shaped forms.
MD simulations revealed that upon phosphorylation of Ser356 (pSer356) there is a substantial decrease of intermediates near the fibril-like conformers, compared to its WT counterpart [149]. This result explains why WT K18 develops seeding activity more rapidly than pSer356 K18 [150]. Recently, we applied the chain-growth PEP-FOLD approach, developed for ab initio structure prediction of well-folded proteins [151–155] and protein–peptide complexes [156], to the monomeric α-synuclein protein [25]. As expected from experiments, the α-synuclein monomer was found to be highly dynamic (Rg distribution varying between 1.5 nm and 2.5 nm) and without any well-defined structure (65% of turn/coil). We observed, however, a high propensity of broken helices in the N-terminus, consistent with α-synuclein’s membrane binding properties [25]. Using a total of 500 PEP-FOLD simulations, we explored here the conformations of the R3–R4 tau monomer, the R3–R4 domain being known to be the core of the amyloid fibril. As seen in the free energy landscape, there are no dominant structures, the Rg distribution varying between 1.2 and 2.0 nm and the Cα end-to-end distance between 0.5 and 5.0 nm (Fig. 1). Yet, β-strand signals are detected by the structural alphabet profile along the sequence, and regions with a β-strand propensity of more than 0.2 encompass residues 306–315, 317–321, 329–332, 335–345, 349–354, 361–364, and 367–378 (Fig. 2). It is interesting that these regions fit well the β-strands observed in the fibril, which encompass residues 306–311, 313–322, 327–331, 336–341, 343–347, 349–354, 356–363, and 368–378 [9]. Most structures are, however, devoid of β-sheets and are random coil with some helical content (Fig. 1a–c), and a few structures have the propensity to form a double U-shape free of intramolecular H-bonds (Fig. 1d). Overall, we do not find any evidence of a fibril-like conformation encoded in the monomeric ensemble.
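The statement that the predicted β-strand regions "fit well" the fibril β-strands can be made quantitative with a simple per-residue overlap count. The sketch below uses only the residue ranges quoted in the text; the choice of overlap measures (coverage of fibril residues and fraction of predicted residues that fall in fibril strands) is an assumption made for illustration and is not an analysis reported by the authors.

```python
# Per-residue overlap between predicted β-strand regions (propensity > 0.2)
# and the β-strands of the tau fibril core, using the ranges given in the text.
PREDICTED = [(306, 315), (317, 321), (329, 332), (335, 345),
             (349, 354), (361, 364), (367, 378)]
FIBRIL = [(306, 311), (313, 322), (327, 331), (336, 341),
          (343, 347), (349, 354), (356, 363), (368, 378)]


def residues(ranges):
    """Expand inclusive (first, last) residue ranges into a set of residue numbers."""
    out = set()
    for first, last in ranges:
        out.update(range(first, last + 1))
    return out


pred = residues(PREDICTED)
fib = residues(FIBRIL)
shared = pred & fib

coverage = len(shared) / len(fib)     # fibril strand residues that are predicted
precision = len(shared) / len(pred)   # predicted residues that lie in fibril strands
print(f"shared residues: {len(shared)} of {len(fib)} fibril / {len(pred)} predicted")
print(f"coverage = {coverage:.2f}, precision = {precision:.2f}")
```

With these ranges, 46 of the 57 fibril β-strand residues are covered by the prediction, and 46 of the 52 predicted residues fall within fibril strands.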
7 Conclusions

We have reviewed our contributions to the field of amyloid simulations. Conformational ensembles of amyloid monomers and oligomers have been addressed in bulk solution, in membrane environments, and in complexes with different classes of inhibitors. Improved coarse-grained and atomistic force fields should tell us more about the link between oligomers and toxicity. This requires, however, mimicking in vivo conditions such as crowding and lipids and exploring aging parameters such as variation of the fluid flow in the brain extracellular space [157, 158].

Fig. 1 PEP-FOLD free energy landscape (in kcal/mol) of the tau (306–378) monomer projected on the radius of gyration and the end-to-end distance. The N-terminus is denoted by a sphere, and the C-terminus by a square

Acknowledgments

We acknowledge support by the “Initiative d’Excellence” program from the French State (Grant “DYNAMO”, ANR-11-LABX-0011-01, and “CACSICE”, ANR-11-EQPX-0008).

Notes

The authors declare no competing financial interest.

Fig. 2 Structural alphabet profile predicted for tau-306-378. Green tones correspond to extended conformations, red tones to helical conformations and blue ones to coil regions. Each column corresponds to a fragment of four amino acids (first column corresponds to VQIV, last column corresponds to KLTF)

References

1. Goldberg MS, Lansbury PT Jr (2000) Is there a cause-and-effect relationship between alpha-synuclein fibrillization and Parkinson’s disease? Nat Cell Biol 2:E115–E119 2. Hardy J, Selkoe DJ (2002) The amyloid hypothesis of Alzheimer’s disease: progress and problems on the road to therapeutics. Science 297:353–356 3. Scheckel C, Aguzzi A (2018) Prions, prionoids and protein misfolding disorders. Nat Rev Genet 19:405–418 4.
Barthélemy NR, Li Y, Joseph-Mathurin N, Gordon BA, Hassenstab J, Benzinger TLS, Buckles V, Fagan AM, Perrin RJ, Goate AM et al; Dominantly Inherited Alzheimer Network (2020) A soluble phosphorylated tau signature links tau, amyloid and the evolution of stages of dominantly inherited Alzheimer’s disease. Nat Med 26:398–407 5. Auluck PK, Caraveo G, Lindquist S (2010) Alpha-Synuclein: membrane interactions and toxicity in Parkinson’s disease. Annu Rev Cell Dev Biol 26:211–233 6. Riek R, Hornemann S, Wider G, Billeter M, Glockshuber R, Wuthrich K (1996) NMR structure of the mouse prion protein domain PrP(121-231). Nature 382:180–182 7. Dobson CM (1999) Protein misfolding, evolution and disease. Trends Biochem Sci 24:329–332 8. Lu JX, Qiang W, Yau WM, Schwieters CD, Meredith SC, Tycko R (2013) Molecular structure of β-amyloid fibrils in Alzheimer’s disease brain tissue. Cell 154:1257–1268 9. Fitzpatrick AWP, Falcon B, He S, Murzin AG, Murshudov G, Garringer HJ, Crowther RA, Ghetti B, Goedert M, Scheres SHW (2017) Cryo-EM structures of tau filaments from Alzheimer’s disease. Nature 547:185–190 10. Guerrero-Ferreira R, Taylor NM, Mona D, Ringler P, Lauer ME, Riek R, Britschgi M, Stahlberg H (2018) Cryo-EM structure of alpha-synuclein fibrils. eLife 7:e36402 11. Meisl G, Michaels TCT, Linse S, Knowles TPJ (2018) Kinetic analysis of amyloid formation. Methods Mol Biol 1779:181–196 12. Buttstedt A, Wostradowski T, Ihling C, Hause G, Sinz A, Schwarz E (2013) Different morphology of amyloid fibrils originating from agitated and non-agitated conditions. Amyloid 2:86–92 13. Luo XD, Kong FL, Dang HB, Chen J, Liang Y (2016) Macromolecular crowding favors the fibrillization of β2-microglobulin by accelerating the nucleation step and inhibiting fibril disassembly. Biochim Biophys Acta 1864:1609–1619 14.
Xu W, Zhang C, Derreumaux P, Gräslund A, Morozova-Roche L, Mu Y (2011) Intrinsic determinants of Aβ(12-24) pH-dependent self-assembly revealed by combined computational and experimental studies. PLoS One 6:e24329 15. Habchi J, Chia S, Galvagnion C, Michaels TCT, Bellaiche MMJ, Ruggeri FS, Sanguanini M, Idini I, Kumita JR, Sparr E et al (2018) Cholesterol catalyses Aβ42 aggregation through a heterogeneous nucleation pathway in the presence of lipid membranes. Nat Chem 10:673–683 16. Sibille N, Sillen A, Leroy A, Wieruszeski JM, Mulloy B, Landrieu I, Lippens G (2006) Structural impact of heparin binding to full-length Tau as studied by NMR spectroscopy. Biochemistry 45:12560–12572 17. Balbach JJ, Ishii Y, Antzutkin ON, Leapman RD, Rizzo NW, Dyda F, Reed J, Tycko R (2000) Amyloid fibril formation by A beta 16-22, a seven-residue fragment of the Alzheimer’s beta-amyloid peptide, and structural characterization by solid state NMR. Biochemistry 39:13748–13759 18. Inouye H, Sharma D, Goux WJ, Kirschner DA (2006) Structure of core domain of fibril-forming PHF/Tau fragments. Biophys J 90:1774–1789 19. Bodles AM, Guthrie DJ, Greer B, Irvine GB (2001) Identification of the region of non-Abeta component (NAC) of Alzheimer’s disease amyloid responsible for its aggregation and toxicity. J Neurochem 78:384–395 20. Mousseau N, Derreumaux P (2005) Exploring the early steps of amyloid peptide aggregation by computers. Acc Chem Res 38:885–891 21. Nasica-Labouze J, Nguyen PH, Sterpone F, Berthoumieu O, Buchete NV, Côté S, De Simone A, Doig AJ, Faller P, Garcia A et al (2015) Amyloid beta protein and Alzheimer’s disease: when computer simulations complement experimental studies. Chem Rev 115:3518–3563 22. Lasagna-Reeves CA, Castillo-Carranza DL, Guerrero-Muñoz MJ, Jackson GR, Kayed R (2010) Preparation and characterization of neurotoxic tau oligomers. Biochemistry 49:10039–10041 23.
Fusco G, Chen SW, Williamson PTF, Cascella R, Perni M, Jarvis JA, Cecchi C, Vendruscolo M, Chiti F, Cremades N et al (2017) Structural basis of membrane disruption and cellular toxicity by α-synuclein oligomers. Science 358:1440–1443 24. Graen T, Klement R, Grupi A, Haas E, Grubmüller H (2018) Transient secondary and tertiary structure formation kinetics in the intrinsically disordered state of α-synuclein from atomistic simulations. ChemPhysChem 19:2507–2511 25. Nguyen PH, Derreumaux P (2020) Structures of the intrinsically disordered Aβ, tau and α-synuclein proteins in aqueous solution from computer simulations. Biophys Chem 264:106421 26. Santini S, Wei G, Mousseau N, Derreumaux P (2004) Pathway complexity of Alzheimer’s beta-amyloid Abeta16-22 peptide assembly. Structure 12:1245–1255 27. Santini S, Mousseau N, Derreumaux P (2004) In silico assembly of Alzheimer’s Abeta16-22 peptide into beta-sheets. J Am Chem Soc 126:11509–11516 28. Wei G, Mousseau N, Derreumaux P (2004) Sampling the self-assembly pathways of KFFE hexamers. Biophys J 87:3648–3656 29. Melquiond A, Boucher G, Mousseau N, Derreumaux P (2005) Following the aggregation of amyloid-forming peptides by computer simulations. J Chem Phys 122:174904 30. Melquiond A, Mousseau N, Derreumaux P (2006) Structures of soluble amyloid oligomers from computer simulations. Proteins 65:180–191 31. Boucher G, Mousseau N, Derreumaux P (2006) Aggregating the amyloid Abeta(11-25) peptide into a four-stranded beta-sheet structure. Proteins 65:877–888 32. Melquiond A, Gelly JC, Mousseau N, Derreumaux P (2007) Probing amyloid fibril formation of the NFGAIL peptide by computer simulations. J Chem Phys 126:065101 33. Man VH, He X, Derreumaux P, Ji B, Xie XQ, Nguyen PH, Wang J (2019) Effects of all-atom molecular mechanics force fields on amyloid peptide assembly: the case of Aβ16-22 dimer. J Chem Theory Comput 15:1440–1452 34.
Mo Y, Lu Y, Wei G, Derreumaux P (2009) Structural diversity of the soluble trimers of the human amylin(20-29) peptide revealed by molecular dynamics simulations. J Chem Phys 130:125101 35. Lu Y, Derreumaux P, Guo Z, Mousseau N, Wei G (2009) Thermodynamics and dynamics of amyloid peptide oligomerization are sequence dependent. Proteins 75:954–963 36. Wei G, Song W, Derreumaux P, Mousseau N (2008) Self-assembly of amyloid-forming peptides by molecular dynamics simulations. Front Biosci 13:5681–5692 37. Song W, Wei G, Mousseau N, Derreumaux P (2008) Self-assembly of the beta2-microglobulin NHVTLSQ peptide using a coarse-grained protein model reveals a beta-barrel species. J Phys Chem B 112:4410–4418 38. Nguyen PH, Li MS, Derreumaux P (2011) Effects of all-atom force fields on amyloid oligomerization: replica exchange molecular dynamics simulations of the Aβ(16-22) dimer and trimer. Phys Chem Chem Phys 13:9778–9788 39. De Simone A, Derreumaux P (2010) Low molecular weight oligomers of amyloid peptides display beta-barrel conformations: a replica exchange molecular dynamics study in explicit solvent. J Chem Phys 132:165103 40. Nguyen PH, Derreumaux P (2013) Conformational ensemble and polymorphism of the all-atom Alzheimer’s Aβ(37-42) amyloid peptide oligomers. J Phys Chem B 117:5831–5840 41. Spill YG, Pasquali S, Derreumaux P (2011) Impact of thermostats on folding and aggregation properties of peptides using the optimized potential for efficient structure prediction coarse-grained model. J Chem Theory Comput 7:1502–1510 42. Nasica-Labouze J, Meli M, Derreumaux P, Colombo G, Mousseau N (2011) A multiscale approach to characterize the early aggregation steps of the amyloid-forming peptide GNNQQNY from the yeast prion sup-35. PLoS Comput Biol 7:e1002051 43. Lu Y, Wei G, Derreumaux P (2012) Structural, thermodynamical, and dynamical properties of oligomers formed by the amyloid NNQQ peptide: insights from coarse-grained simulations.
J Chem Phys 137:025101 44. Nguyen PH, Okamoto Y, Derreumaux P (2013) Communication: simulated tempering with fast on-the-fly weight determination. J Chem Phys 138:061102 45. Chebaro Y, Pasquali S, Derreumaux P (2012) The coarse-grained OPEP force field for non-amyloid and amyloid proteins. J Phys Chem B 116:8741–8752 46. Wei G, Mousseau N, Derreumaux P (2007) Computational simulations of the early steps of protein aggregation. Prion 1:3–8 47. Derreumaux P (1999) From polypeptide sequences to structures using Monte Carlo simulations and an optimized potential. J Chem Phys 5:2301–2310 48. Derreumaux P (2001) Generating ensemble averages for small proteins from extended conformations by Monte Carlo simulations. Phys Rev Lett 1:206–209 49. Maupetit J, Tuffery P, Derreumaux P (2007) A coarse-grained protein force field for folding and structure prediction. Proteins 69:394–408 50. Sterpone F, Nguyen PH, Kalimeri M, Derreumaux P (2013) Importance of the ion-pair interactions in the OPEP coarse-grained force field: parametrization and validation. J Chem Theory Comput 9:4574–4584 51. Sterpone F, Melchionna S, Tuffery P, Pasquali S, Mousseau N, Cragnolini T, Chebaro Y, St-Pierre JF, Kalimeri M, Barducci A et al (2014) The OPEP protein model: from single molecules, amyloid formation, crowding and hydrodynamics to DNA/RNA systems. Chem Soc Rev 43:4871–4893 52. Kalimeri M, Derreumaux P, Sterpone F (2015) Are coarse-grained models apt to detect protein thermal stability? The case of OPEP force field. J Non-Cryst Solids 407:494–501 53. Mousseau N, Derreumaux P, Barkema GT, Malek R (2001) Sampling activated mechanisms in proteins with the activation-relaxation technique. J Mol Graph Model 19:78–86 54. Wei GH, Derreumaux P, Mousseau N (2003) Sampling the complex energy landscape of a simple beta-hairpin. J Chem Phys 13:6403–6406 55. Mousseau N, Derreumaux P (2008) Exploring energy landscapes of protein folding and aggregation. Front Biosci 13:4495–4516 56.
Kreutzer AG, Nowick JS (2018) Elucidating the structures of amyloid oligomers with macrocyclic β-hairpin peptides: insights into Alzheimer’s disease and other amyloid diseases. Acc Chem Res 51:706–718 57. Laganowsky A, Liu C, Sawaya MR, Whitelegge JP, Park J, Zhao M, Pensalfini A, Soriaga AB, Landau M, Teng PK et al (2012) Atomic view of a toxic amyloid small oligomer. Science 335:1228–1231 58. Matthes D, Gapsys V, Brennecke JT, de Groot BL (2016) An atomistic view of amyloidogenic self-assembly: structure and dynamics of heterogeneous conformational states in the pre-nucleation phase. Sci Rep 6:33156 59. Levine ZA, Shea JE (2017) Simulations of disordered proteins and systems with conformational heterogeneity. Curr Opin Struct Biol 43:95–103 60. Decatur SM (2006) Elucidation of residue-level structure and dynamics of polypeptides via isotope-edited infrared spectroscopy. Acc Chem Res 39:169–175 61. Jia Z, Schmit JD, Chen J (2020) Amyloid assembly is dominated by misregistered kinetic traps on an unbiased energy landscape. Proc Natl Acad Sci U S A 117:10322–10328 62. Samantray S, Yin F, Kav B, Strodel B (2020) Different force fields give rise to different amyloid aggregation pathways in molecular dynamics simulations. J Chem Inf Model 60:6462–6475 63. Robustelli P, Piana S, Shaw DE (2018) Developing a molecular dynamics force field for both folded and disordered protein states. Proc Natl Acad Sci U S A 115:E4758–E4766 64. Rahman MU, Rehman AU, Liu H, Chen HF (2020) Comparison and evaluation of force fields for intrinsically disordered proteins. J Chem Inf Model 60:4912–4923 65. Huang J, Rauscher S, Nawrocki G, Ran T, Feig M, de Groot BL, Grubmüller H, MacKerell AD Jr (2017) CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat Methods 14:71–73 66.
Carballo-Pacheco M, Ismail AE, Strodel B (2018) On the applicability of force fields to study the aggregation of amyloidogenic peptides using molecular dynamics simulations. J Chem Theory Comput 14:6063–6075 67. Baftizadeh F, Pietrucci F, Biarnés X, Laio A (2013) Nucleation process of a fibril precursor in the C-terminal segment of amyloid-β. Phys Rev Lett 110:168103 68. Šarić A, Michaels TCT, Zaccone A, Knowles TPJ, Frenkel D (2016) Kinetics of spontaneous filament nucleation via oligomers: insights from theory and simulation. J Chem Phys 145:211926 69. Lee CT, Terentjev EM (2017) Mechanisms and rates of nucleation of amyloid fibrils. J Chem Phys 147:105103 70. Nguyen P, Derreumaux P (2014) Understanding amyloid fibril nucleation and Aβ oligomer/drug interactions from computer simulations. Acc Chem Res 47:603–611 71. Tran TT, Nguyen PH, Derreumaux P (2016) Lattice model for amyloid peptides: OPEP force field parametrization and applications to the nucleus size of Alzheimer’s peptides. J Chem Phys 144:205103 72. Chiricotto M, Tran TT, Nguyen PH, Melchionna S, Sterpone F, Derreumaux P (2017) Coarse-grained and all-atom simulations towards the early and late steps of amyloid fibril formation. Isr J Chem 57:564–573 73. Sterpone F, Doutreligne S, Tran TT, Melchionna S, Baaden M, Nguyen PH, Derreumaux P (2018) Multiscale simulations of biological systems using the OPEP coarse-grained model. Biochem Biophys Res Commun 498:296–304 74. Szała-Mendyk B, Molski A (2020) Clustering and fibril formation during GNNQQNY aggregation: a molecular dynamics study. Biomol Ther 10:1362 75. Villali J, Dark J, Brechtel TM, Pei F, Sindi SS, Serio TR (2020) Nucleation seed size determines amyloid clearance and establishes a barrier to prion appearance in yeast. Nat Struct Mol Biol 27:540–549 76. Sterpone F, Derreumaux P, Melchionna S (2015) Protein simulations in fluids: coupling the OPEP coarse-grained force field with hydrodynamics. J Chem Theory Comput 11:1843–1853 77.
Chiricotto M, Sterpone F, Derreumaux P, Melchionna S (2016) Multiscale simulation of molecular processes in cellular environments. Philos Trans A Math Phys Eng Sci 374:20160225 78. Chiricotto M, Melchionna S, Derreumaux P, Sterpone F (2016) Hydrodynamic effects on β-amyloid (16-22) peptide aggregation. J Chem Phys 145:035102 79. Chiricotto M, Melchionna S, Derreumaux P, Sterpone F (2019) Multiscale aggregation of the amyloid Aβ16-22 peptide: from disordered coagulation and lateral branching to amorphous prefibrils. J Phys Chem Lett 10:1594–1599 80. Nguyen PH, Sterpone F, Derreumaux P (2020) Aggregation of disease-related peptides. Prog Mol Biol Transl Sci 170:435–460 81. Lasagna-Reeves CA, Glabe CG, Kayed R (2011) Amyloid-β annular protofibrils evade fibrillar fate in Alzheimer disease brain. J Biol Chem 286:22122–22130 82. Lesné SE, Sherman MA, Grant M, Kuskowski M, Schneider JA, Bennett DA, Ashe KH (2013) Brain amyloid-β oligomers in ageing and Alzheimer’s disease. Brain 136:1383–1398 83. Müller-Schiffmann A, Andreyeva A, Horn AH, Gottmann K, Korth C, Sticht H (2011) Molecular engineering of a secreted, highly homogeneous, and neurotoxic Aβ dimer. ACS Chem Neurosci 2:242–248 84. Rousseau F, Schymkowitz J, De Strooper B (2014) The Alzheimer disease protective mutation A2T modulates kinetic and thermodynamic properties of amyloid-β (Aβ) aggregation. J Biol Chem 289:30977–30989 85. Zheng X, Liu D, Roychaudhuri R, Teplow DB, Bowers MT (2015) Amyloid β-protein assembly: differential effects of the protective A2T mutation and recessive A2V familial Alzheimer’s disease mutation. ACS Chem Neurosci 6:1732–1740 86. Messa M, Colombo L, del Favero E, Cantù L, Stoilova T, Cagnotto A, Rossi A, Morbin M, Di Fede G, Tagliavini F et al (2014) The peculiar role of the A2V mutation in amyloid-β (Aβ) 1-42 molecular assembly. J Biol Chem 289:24143–24152 87.
Tarus B, Tran TT, Nasica-Labouze J, Sterpone F, Nguyen PH, Derreumaux P (2015) Structures of the Alzheimer’s wildtype Aβ1-40 dimer from atomistic simulations. J Phys Chem B 119:10478–10487 88. Man VH, Nguyen PH, Derreumaux P (2017) High-resolution structures of the amyloid-β 1-42 dimers from the comparison of four atomistic force fields. J Phys Chem B 121:5977–5987 89. Côté S, Derreumaux P, Mousseau N (2011) Distinct morphologies for amyloid beta protein monomer: Aβ1-40, Aβ1-42, and Aβ1-40 (D23N). J Chem Theory Comput 7:2584–2592 90. Côté S, Laghaei R, Derreumaux P, Mousseau N (2012) Distinct dimerization for various alloforms of the amyloid-beta protein: Aβ (1-40), Aβ(1-42), and Aβ(1-40)(D23N). J Phys Chem B 116:4043–4055 91. Cao Y, Jiang X, Han W (2017) Self-assembly pathways of β-sheet-rich amyloid-β(1-40) dimers: markov state model analysis on millisecond hybrid-resolution simulations. J Chem Theory Comput 13:5731–5744 92. Chebaro Y, Mousseau N, Derreumaux P (2009) Structures and thermodynamics of Alzheimer’s amyloid-beta Abeta(16-35) monomer and dimer by replica exchange molecular dynamics simulations: implication for full-length Abeta fibrillation. J Phys Chem B 113:7668–7675 93. Man VH, Nguyen PH, Derreumaux P (2017) Conformational ensembles of the wild-type and S8C Aβ1-42 Dimers. J Phys Chem B 121:2434–2442 94. Viet MH, Nguyen PH, Derreumaux P, Li MS (2014) Effect of the English familial disease mutation (H6R) on the monomers and dimers of Aβ40 and Aβ42. ACS Chem Neurosci 5:646–657 95. Viet MH, Nguyen PH, Ngo ST, Li MS, Derreumaux P (2013) Effect of the Tottori familial disease mutation (D7N) on the monomers and dimers of Aβ40 and Aβ42. ACS Chem Neurosci 4:1446–1457 96. Huet A, Derreumaux P (2006) Impact of the mutation A21G (Flemish variant) on Alzheimer’s beta-amyloid dimers by molecular dynamics simulations. Biophys J 91:3829–3840 97. 
Nguyen PH, Sterpone F, Campanera JM, Nasica-Labouze J, Derreumaux P (2016) Impact of the A2V mutation on the heterozygous and homozygous Aβ1-40 dimer structures from atomistic simulations. ACS Chem Neurosci 7:823–832 98. Nguyen PH, Tarus B, Derreumaux P (2014) Familial Alzheimer A2 V mutation reduces the intrinsic disorder and completely changes the free energy landscape of the Aβ1-28 monomer. J Phys Chem B 118:501–510 99. Nguyen PH, Sterpone F, Pouplana R, Derreumaux P, Campanera JM (2016) Dimerization mechanism of Alzheimer Aβ40 peptides: the high content of intrapeptidestabilized conformations in A2V and A2T heterozygous dimers retards amyloid fibril formation. J Phys Chem B 120:12111–12126 100. Das P, Chacko AR, Belfort G (2017) Alzheimer’s protective cross-interaction between wild-type and A2T variants alters Aβ42 dimer structure. ACS Chem Neurosci 8:606–618 101. Li H, Nam Y, Salimi A, Lee JY (2020) Impact of A2V mutation and histidine tautomerism on Aβ42 monomer structures from atomistic simulations. J Chem Inf Model 60:3587–3592 102. Aggarwal L, Biswas P (2020) Effect of Alzheimer’s disease causative and protective mutations on the hydration environment of amyloid-β. J Phys Chem B 124:2311–2322 103. Banerjee S, Hashemi M, Lv Z, Maity S, Rochet JC, Lyubchenko YL (2017) A novel pathway for amyloids self-assembly in aggregates at nanomolar concentration mediated by the interaction with surfaces. Sci Rep 7:45592 104. Alvarez AB, Caruso B, Rodrı́guez PEA, Petersen SB, Fidelio GD (2020) Aβ-amyloid fibrils are self-triggered by the interfacial lipid environment and low peptide content. Langmuir 36:8056–8065 105. Farrugia MY, Caruana M, Ghio S, Camilleri A, Farrugia C, Cauchi RJ, Cappelli S, Chiti F, Vassallo N (2020) Toxic oligomers of the amyloidogenic HypF-N protein form pores in mitochondrial membranes. Sci Rep 10:17733 106. 
Ghio S, Camilleri A, Caruana M, Ruf VC, Schmidt F, Leonov A, Ryazanov S, Griesinger C, Cauchi RJ, Kamp F et al (2019) Cardiolipin promotes pore-forming activity of alpha-synuclein oligomers in mitochondrial membranes. ACS Chem Neurosci 10:3815–3829 107. Österlund N, Moons R, Ilag LL, Sobott F, Gr€aslund A (2019) Native ion mobility-mass spectrometry reveals the formation of β-barrel shaped amyloid-β hexamers in a membranemimicking environment. J Am Chem Soc 141:10440–10450 108. Ait-Bouziad N, Lv G, Mahul-Mellier AL, Xiao S, Zorludemir G, Eliezer D, Walz T, Dynamics of Amyloid Proteins by Simulations Lashuel HA (2017) Discovery and characterization of stable and toxic Tau/phospholipid oligomeric complexes. Nat Commun 8:1678 109. Jang H, Arce FT, Ramachandran S, Kagan BL, Lal R, Nussinov R (2014) Disordered amyloidogenic peptides may insert into the membrane and assemble into common cyclic structural motifs. Chem Soc Rev 43:6750–6764 110. Connelly L, Jang H, Arce FT, Capone R, Kotler SA, Ramachandran S, Kagan BL, Nussinov R, Lal R (2012) Atomic force microscopy and MD simulations reveal porelike structures of all-D-enantiomer of Alzheimer’s β-amyloid peptide: relevance to the ion channel mechanism of AD pathology. J Phys Chem B 116:1728–1735 111. Serra-Batiste M, Ninot-Pedrosa M, Bayoumi M, Gairı́ M, Maglia G, Carulla N (2016) Aβ42 assembles into specific β-barrel pore-forming oligomers in membranemimicking environments. Proc Natl Acad Sci U S A 113:10866–10871 112. Nguyen PH, Campanera JM, Ngo ST, Loquet A, Derreumaux P (2019) Tetrameric Aβ40 and Aβ42 β-barrel structures by extensive atomistic simulations. I. In a bilayer mimicking a neuronal membrane. J Phys Chem B 123:3643–3648 113. Ngo ST, Nguyen PH, Derreumaux P (2020) Impact of A2T and D23N mutations on tetrameric Aβ42 barrel within a dipalmitoylphosphatidylcholine lipid bilayer membrane by replica exchange molecular dynamics. J Phys Chem B 124:1175–1182 114. 
Di Scala C, Yahi N, Boutemeur S, Flores A, Rodriguez L, Chahinian H, Fantini J (2016) Common molecular mechanism of amyloid pore formation by Alzheimer’s β-amyloid peptide and α-synuclein. Sci Rep 6:28781 115. Ngo ST, Nguyen PH, Derreumaux P (2020) Stability of Aβ11-40 trimers with parallel and antiparallel β-sheet organizations in a membrane-mimicking environment by replica exchange molecular dynamics simulation. J Phys Chem B 124:617–626 116. Ngo ST, Derreumaux P, Vu VV (2019) Probable transmembrane amyloid α-helix bundles capable of conducting Ca2+ ions. J Phys Chem B 123:2645–2653 117. Sahoo A, Matysiak S (2019) Computational insights into lipid assisted peptide misfolding and aggregation in neurodegeneration. Phys Chem Chem Phys 21:22679–22694 118. Lu Y, Shi XF, Nguyen PH, Sterpone F, Salsbury FR Jr, Derreumaux P (2019) Amyloid-β (29-42) dimeric conformations in membranes 111 rich in omega-3 and omega-6 polyunsaturated fatty acids. J Phys Chem B 123:2687–2696 119. Doig AJ, Derreumaux P (2015) Inhibition of protein aggregation and amyloid formation by small molecules. Curr Opin Struct Biol 30:50–56 120. Chebaro Y, Derreumaux P (2009) Targeting the early steps of Abeta16-22 protofibril disassembly by N-methylated inhibitors: a numerical study. Proteins 75:442–452 121. Chebaro Y, Jiang P, Zang T, Mu Y, Nguyen PH, Mousseau N, Derreumaux P (2012) Structures of Aβ17-42 trimers in isolation and with five small-molecule drugs using a hierarchical computational procedure. J Phys Chem B 116:8412–8422 122. Zhang T, Zhang J, Derreumaux P, Mu Y (2013) Molecular mechanism of the inhibition of EGCG on the Alzheimer Aβ(1-42) dimer. J Phys Chem B 117:3993–4002 123. Zhang T, Xu W, Mu Y, Derreumaux P (2014) Atomic and dynamic insights into the beneficial effect of the 1,4-naphthoquinon-2-yl-Ltryptophan inhibitor on Alzheimer’s Aβ1-42 dimer in terms of aggregation and toxicity. ACS Chem Neurosci 5:148–159 124. 
Tarus B, Nguyen PH, Berthoumieu O, Faller P, Doig AJ, Derreumaux P (2015) Molecular structure of the NQTrp inhibitor with the Alzheimer Aβ1-28 monomer. Eur J Med Chem 91:43–50 125. Berthoumieu O, Nguyen PH, Castillo-Frias MP, Ferre S, Tarus B, Nasica-Labouze J, Noël S, Saurel O, Rampon C, Doig AJ et al (2015) Combined experimental and simulation studies suggest a revised mode of action of the anti-Alzheimer disease drug NQ-Trp. Chemistry 21:12657–12666 126. Minh Hung H, Nguyen MT, Tran PT, Truong VK, Chapman J, Quynh Anh LH, Derreumaux P, Vu VV, Ngo ST (2020) Impact of the astaxanthin, betanin, and EGCG compounds on small oligomers of amyloid Aβ40 peptide. J Chem Inf Model 60:1399–1408 127. Nguyen PH, Del Castillo-Frias MP, Berthoumieux O, Faller P, Doig AJ, Derreumaux P (2018) Amyloid-β/drug interactions from computer simulations and cell-based assays. J Alzheimers Dis 64:S659–S672 128. Liang C, Savinov SN, Fejzo J, Eyles SJ, Chen J (2019) Modulation of amyloid-β42 conformation by small molecules through nonspecific binding. J Chem Theory Comput 15:5169–5174 112 Phuong Hoang Nguyen et al. 129. Tran L (2018) Understanding the binding mechanism of amyloid-β inhibitors from molecular simulations. Curr Pharm Des 24:3341–3346 130. Zhu M, De Simone A, Schenk D, Toth G, Dobson CM, Vendruscolo M (2013) Identification of small-molecule binding pockets in the soluble monomeric form of the Aβ42 peptide. J Chem Phys 139:035101 131. Doig AJ, Del Castillo-Frias MP, Berthoumieu O, Tarus B, Nasica-Labouze J, Sterpone F, Nguyen PH, Hooper NM, Faller P, Derreumaux P (2017) Why is research on amyloid-β failing to give new drugs for Alzheimer’s disease? ACS Chem Neurosci 8:1435–1437 132. Kim JE, Lee M (2003) Fullerene inhibits beta-amyloid peptide aggregation. Biochem Biophys Res Commun 303:576–579 133. Linse S, Cabaleiro-Lago C, Xue WF, Lynch I, Lindman S, Thulin E, Radford SE, Dawson KA (2007) Nucleation of protein fibrillation by nanoparticles. 
Proc Natl Acad Sci U S A 104:8691–8696 134. Fu Z, Luo Y, Derreumaux P, Wei G (2009) Induced beta-barrel formation of the Alzheimer’s Abeta25-35 oligomers on carbon nanotube surfaces: implication for amyloid fibril inhibition. Biophys J 97:1795–1803 135. Li H, Luo Y, Derreumaux P, Wei G (2011) Carbon nanotube inhibits the formation of β-sheet-rich oligomers of the Alzheimer’s amyloid-β(16-22) peptide. Biophys J 101:2267–2276 136. Limbocker R, Mannini B, Ruggeri FS, Cascella R, Xu CK, Perni M, Chia S, Chen SW, Habchi J, Bigi A et al (2020) Trodusquemine displaces protein misfolded oligomers from cell membranes and abrogates their cytotoxicity through a generic mechanism. Commun Biol 3:435 137. Cox SJ, Lam B, Prasad A, Marietta HA, Stander NV, Joel JG, Sahoo BR, Guo F, Stoddard AK, Ivanova MI et al (2020) Highthroughput screening at the membrane interface reveals inhibitors of amyloid-β. Biochemistry 59:2249–2258 138. Man VH, Derreumaux P, Li MS, Roland C, Sagui C, Nguyen PH (2015) Picosecond dissociation of amyloid fibrils with infrared laser: a nonequilibrium simulation study. J Chem Phys 143:155101 139. Man VH, Derreumaux P, Nguyen PH (2016) Nonequilibrium all-atom molecular dynamics simulation of the bubble cavitation and application to dissociate amyloid fibrils. J Chem Phys 145:174113 140. Kawasaki T, Man VH, Sugimoto Y, Sugiyama N, Yamamoto H, Tsukiyama K, Wang J, Derreumaux P, Nguyen PH (2020) Infrared laser-induced amyloid fibril dissociation: a joint experimental/theoretical study on the GNNQQNY peptide. J Phys Chem B 124:6266–6277 141. Man VH, Wang J, Derreumaux P, Nguyen PH (2021) Nonequilibrium Molecular Dynamics Simulations Of Infrared LaserInduced Dissociation of a tetrameric Aβ42 β -barrel in a neuronal membrane model. Chem Phys Lipids 234:105030 142. Derreumaux P (2001) Evidence that the 127-164 region of prion proteins has two equi-energetic conformations with beta or alpha features. Biophys J 81:1657–1665 143. 
Santini S, Derreumaux P (2004) Helix H1 of the prion protein is rather stable against environmental perturbations: molecular dynamics of mutation and deletion variants of PrP (90-231). Cell Mol Life Sci 61:951–960 144. Santini S, Claude JB, Audic S, Derreumaux P (2003) Impact of the tail and mutations G131V and M129V on prion protein flexibility. Proteins 51:258–265 145. De Simone A, Zagari A, Derreumaux P (2007) Structural and hydration properties of the partially unfolded states of the prion protein. Biophys J 93:1284–1292 146. Wille H, Dorosh L, Amidian S, SchmittUlms G, Stepanova M (2019) Combining molecular dynamics simulations and experimental analyses in protein misfolding. Adv Protein Chem Struct Biol 118:33–110 147. Peoc’h K, Levavasseur E, Delmont E, De Simone A, Laffont-Proust I, Privat N, Chebaro Y, Chapuis C, Bedoucha P, Brandel JP et al (2012) Substitutions at residue 211 in the prion protein drive a switch between CJD and GSS syndrome, a new mechanism governing inherited neurodegenerative disorders. Hum Mol Genet 21:5417–5428 148. Chebaro Y, Derreumaux P (2009) The conversion of helix H2 to beta-sheet is accelerated in the monomer and dimer of the prion protein upon T183A mutation. J Phys Chem B 113:6942–6948 149. Derreumaux P, Man VH, Wang J, Nguyen PH (2020) Tau R3-R4 domain dimer of the wild type and phosphorylated ser356 sequences. I. In solution by atomistic simulations. J Phys Chem B 124:2975–2983 150. Haj-Yahya M, Gopinath P, Rajasekhar K, Mirbaha H, Diamond MI, Lashuel HA (2020) Site-specific hyperphosphorylation inhibits, rather than promotes, tau Dynamics of Amyloid Proteins by Simulations fibrillization, seeding capacity, and its microtubule binding. Angew Chem Int Ed Engl 59:4059–4067 151. Maupetit J, Derreumaux P, Tuffery P (2009) PEP-FOLD: an online resource for de novo peptide structure prediction. Nucleic Acids Res 37:W498–W503 152. 
Maupetit J, Derreumaux P, Tufféry P (2010) A fast method for large-scale de novo peptide and miniprotein structure prediction. J Comput Chem 31:726–738 153. Thévenet P, Shen Y, Maupetit J, Guyon F, Derreumaux P, Tufféry P (2012) PEP-FOLD: an updated de novo structure prediction server for both linear and disulfide bonded cyclic peptides. Nucleic Acids Res 40: W288–W293 154. Shen Y, Maupetit J, Derreumaux P, Tufféry P (2014) Improved PEP-FOLD approach for peptide and miniprotein structure prediction. J Chem Theory Comput 10:4745–4758 155. Sutherland GA, Grayson KJ, Adams NBP et al (2018) Probing the quality control 113 mechanism of the Escherichia coli twinarginine translocase with folding variants of a de novo-designed heme protein. J Biol Chem 293:6672–6681 156. Lamiable A, Thévenet P, Rey J, Vavrusa M, Derreumaux P, Tufféry P (2016) PEP-FOLD3: faster de novo structure prediction for linear peptides in solution and in complex. Nucleic Acids Res 44:W449–W454 157. Ngo ST, Nguyen PH, Derreumaux P (2021) Cholesterol molecules alter the energy landscape of small Aβ 1-42 oligomers. J Phys Chem B 125(9):2299–2307. https://doi. org/10.1021/acs.jpcb.1c00036 158. Ramamoorthy A, Sahoo BR, Zheng J, Chiricotto M, Straub JE, Dominguez L, Shea J-E, Dokholyan NV, De Simone A et al (2021) Amyloid oligomers: a joint experimental/computational perspective on Alzheimer’s disease, Parkinson’s disease, type II diabetes and amyotrophic lateral sclerosis. Chem Rev 121(4):2545–2647. https://doi. org/10.1021/acs.chemrev.0c01122 Chapter 6 Predicting Membrane-Active Peptide Dynamics in Fluidic Lipid Membranes Charles H. Chen, Karen Pepper, Jakob P. Ulmschneider, Martin B. Ulmschneider, and Timothy K. Lu Abstract Understanding the interactions between peptides and lipid membranes could not only accelerate the development of antimicrobial peptides as treatments for infections but also be applied to finding targeted therapies for cancer and other diseases. 
However, designing biophysical experiments to study molecular interactions between flexible peptides and fluidic lipid membranes has been an ongoing challenge. Recently, with hardware advances, algorithm improvements, and more accurate parameterizations (i.e., force fields), all-atom molecular dynamics (MD) simulations have been used as a “computational microscope” to investigate the molecular interactions and mechanisms of membrane-active peptides in cell membranes (Chen et al., Curr Opin Struct Biol 61:160–166, 2020; Ulmschneider and Ulmschneider, Acc Chem Res 51(5):1106–1116, 2018; Dror et al., Annu Rev Biophys 41:429–452, 2012). In this chapter, we describe how to use MD simulations to predict and study peptide dynamics and how to validate the simulations by circular dichroism, intrinsic fluorescent probes, membrane leakage assays, electrical impedance, and isothermal titration calorimetry. Experimentally validated MD simulations open a new route towards peptide design, starting from sequence and structure and leading to desirable functions.

Key words Protein design, Molecular dynamics simulations, Membrane-active peptides, Protein folding, Pore formation

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_6, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

Membrane-active peptides (MAPs) are a ubiquitous part of the innate immune defense system and also play a prominent role in protein misfolding diseases, such as Alzheimer’s disease and Parkinson’s disease. Antimicrobial peptides (AMPs), a large subgroup of MAPs, are typically amphiphilic peptides that selectively target and kill bacteria at low micromolar concentrations, often without harming mammalian cells [4–6]. To date, more than 3000 AMPs have been reported and characterized, seven of which have been approved as antibacterial agents by the U.S. Food and Drug Administration (FDA) [7]. These 3000 AMPs vary widely in size, sequence, secondary structure, and physicochemical properties (e.g., hydrophobic content and net charge), and no common sequence motif has been discovered to date. The lack of a known sequence–function relationship is unusual for proteins.

Pore-forming AMPs bind to the cell membrane and spontaneously assemble in the lipid bilayer into channel- or pore-like structures, though not all are cytolytic; the root causes of membrane-disruption activity are not yet understood. For example, a small number of amino acid mutations in melittin, a powerful helical AMP, can significantly change pore stability [8], as well as other functional properties, such as antimicrobial activity [9, 10] and cell selectivity [10]. Wiedman et al. showed that just a few amino acid substitutions in MelP5, a gain-of-function melittin variant, can yield peptides that disrupt liposomes only in acidic conditions [11, 12]; these minor sequence changes also affect the potency against cell membranes and the membrane pore size.

In nature, minor changes in a MAP sequence can promote membrane disruption and cause protein-misfolding diseases. Typical examples are peptides implicated in neurodegeneration, such as amyloid-beta, alpha-synuclein, and TDP-43 C-terminal fragments, in which minor mutations can result in protein misfolding and correlate with neurodegenerative diseases [13–18].

Peptide length is also important: it governs the hydrophobic mismatch between a peptide and the membrane it must span [19–21]. Ulrich et al.
reported several rationally designed helical peptides built from repeated KIAGKIA motifs, with lengths between 14 and 28 amino acids, and showed that peptide length affects a peptide’s ability to damage cell membranes [20] and to penetrate into the membrane [21]. Longer peptides were more likely to damage cell membranes than shorter peptides with similar amino acid content.

Although these studies have improved our understanding of the correlation between peptide sequence and pore formation, the pore-forming mechanisms and the multimeric functional structures in the membrane remain largely undetermined. Several studies have addressed the difficulty of predicting the form of ensembles of transient channel structures and capturing highly dynamic peptide–peptide and peptide–lipid interactions in a fluid lipid bilayer [2, 7, 22]. Multimeric MAP pore models and pore-forming mechanisms based on molecular dynamics (MD) simulations have been proposed [23–25]. However, these studies generally assume an initial pore configuration, which, given the near-infinite number of possible structures, is unlikely to yield a firm basis for functional optimization. We have demonstrated the feasibility of using unbiased long-timescale MD simulations to predict the functional pore structures of AMPs [26, 27]. Advanced MD simulations can provide insight, at the level of atomic detail, into the diverse dynamic structures formed in complex fluid cell membranes.

2 Materials

2.1 Molecular Dynamics (MD) Simulations

MD simulations can be used as a computational microscope [3] to study the atomic details of peptide folding and interactions with other molecules, e.g., peptide, lipid, and water.

1. The GROningen MAchine for Chemical Simulations (GROMACS) program can be downloaded from http://manual.gromacs.org/documentation/ [28].

2. The HIPPO simulation package can be downloaded from https://www.biowerkzeug.com.

3. The Visual Molecular Dynamics (VMD) molecular visualization program can be downloaded from http://www.ks.uiuc.edu/Research/vmd/ [29].

4. The CHARMM-GUI web-based graphical user interface, used to build the simulation model, can be accessed at http://www.charmm-gui.org/ [30].

2.2 Combinatorial Peptide Libraries

A combinatorial peptide library can be adapted to various kinds of screening experiments and offers a useful approach to studying sequence–function relationships.

1. Preparing beads: Tentagel® NH2 macrobeads (280–320 μm particle size) (Rapp-Polymere; MB300002) are solvated in methanol and incubated overnight.

2. Reaction vials: Peptide synthesis vessels, solid phase, T-bore PTFE stopcocks, 10 mL (Chemglass Life Science; CG-186401).

3. Photolinker: Fmoc-photolabile linker (Advanced Chem Tech; RT1095).

4. Preparing cocktail for deprotection of the side chains: 88% vol TFA, 5% vol phenol (preheated on a 40 °C hot plate), 5% vol pure water, and 2% vol triisopropylsilane (added with a needle) are mixed in a glass vial. The mixture is allowed to cool at 20 °C for 15 min.

5. Treating peptides before dissolving for stock solution: The synthesized peptides on the resin are dissolved in the predissolving buffer (50% vol hexafluoroisopropanol [HFIP] and 50% vol water) and treated with UV light for 3 h until dry. This step is performed in the hood because HFIP is highly volatile.

2.3 Circular Dichroism (CD) Spectroscopy

CD spectroscopy can be used to characterize the secondary structure of the peptide in aqueous conditions with or without lipid vesicles.

1. Cuvette: Macro quartz rectangular cuvette, 1 mm (Fisher Scientific; 14958110).

2. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 is prepared using 0.292 g sodium phosphate, monobasic (MW: 137.99) and 0.773 g sodium phosphate, dibasic (MW: 268.07) in 500 mL Milli-Q® water.
The pH is adjusted using 1 M hydrochloric acid. The buffer is filtered using a Stericup-GV Sterile Vacuum Filtration System.

2.4 Oriented Circular Dichroism (OCD) Spectroscopy

OCD spectroscopy can be used to characterize the transmembrane activity of the peptide in lipid bilayers.

1. Quartz glass plate: Circular high-performance quartz glass plates (200 nm to 2500 nm) with 20 mm diameter and 1.25 mm thickness (Hellma Analytics; 202-QS) are used.

2.5 Liposome Fluorescent Leakage Assay

Liposome fluorescent leakage assays can be used to determine the extent of membrane lysis or poration by the peptide. This assay can be utilized as a screening platform to evaluate membrane-active peptides.

1. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 with 100 mM potassium chloride is prepared using 0.292 g sodium phosphate, monobasic (MW: 137.99), 0.773 g sodium phosphate, dibasic (MW: 268.07), and 3.728 g potassium chloride in 500 mL Milli-Q® water. The pH is adjusted using 1 M hydrochloric acid. The buffer is filtered using a Stericup-GV Sterile Vacuum Filtration System.

2. Fluorescent buffer: 0.061 g HEPES (4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid), 0.059 g sodium chloride, 0.268 g ANTS (8-aminonaphthalene-1,3,6-trisulfonic acid, disodium salt), and 0.950 g DPX (p-xylene-bis-pyridinium bromide) are mixed in 10 mM sodium phosphate pH 7 buffer with 100 mM potassium chloride.

3. Chromatography column: DWK Life Sciences Kimble™ Kontes™ FlexColumn™ Economy Columns (Fisher Scientific; K4204010750) are used.

2.6 Tryptophan Fluorescence Quenching Assay

The tryptophan fluorescence quenching assay can be used to study peptide–lipid interactions and evaluate peptide-binding specificity.

1. 96-well microplate: Greiner UV-Star® 96 flat-bottom well plates made of clear cyclic olefin copolymer (COC) (Sigma-Aldrich; M3812-40EA) are used.

2. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 with 100 mM potassium chloride is prepared using 0.292 g sodium phosphate, monobasic (MW: 137.99), 0.773 g sodium phosphate, dibasic (MW: 268.07), and 3.728 g potassium chloride in 500 mL Milli-Q® water. The pH is adjusted using 1 M hydrochloric acid. The buffer is filtered using a Stericup-GV Sterile Vacuum Filtration System.

2.7 Electrical Impedance Spectroscopy

Electrical impedance spectroscopy can be used to monitor the lipid membrane (i.e., resistance and conductance) and study peptide–lipid interactions.

1. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 with 100 mM potassium chloride is prepared using 0.292 g sodium phosphate, monobasic (MW: 137.99), 0.773 g sodium phosphate, dibasic (MW: 268.07), and 3.728 g potassium chloride in 500 mL Milli-Q® water. The pH is adjusted using 1 M hydrochloric acid. The buffer is filtered using a Stericup-GV Sterile Vacuum Filtration System.

2. Silicon plate: Polished n-type silicon wafers (<111>, ρ = 0.001–0.005 Ω cm) (Silicon Quest International, San Jose, CA) are used.

2.8 Isothermal Titration Calorimetry (ITC)

ITC can be used to characterize the thermodynamic parameters of peptide–lipid interactions, e.g., binding stoichiometry, binding enthalpy, and binding constant.

1. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 with 100 mM potassium chloride is prepared using 0.292 g sodium phosphate, monobasic (MW: 137.99), 0.773 g sodium phosphate, dibasic (MW: 268.07), and 3.728 g potassium chloride in 500 mL Milli-Q® water. The pH is adjusted using 1 M hydrochloric acid. The buffer is filtered using a Stericup-GV Sterile Vacuum Filtration System. The buffer is degassed under vacuum over 30 min.

3 Methods

Peptide assembly and oligomerization in the cell membrane play a critical role in many biological processes [31–35].
These peptides adsorb to, fold in, and cross lipid bilayers, where they form functional structures or aggregates; these processes often involve transient structures and mechanisms. The difficulty of observing these transient structures experimentally [2, 26, 33, 36, 37] limits our understanding of how peptides interact with lipids in biological systems, especially as most experiments report on equilibrium states rather than directly providing molecular details of the transitions [21, 38–42]. Thus, the details of the molecular mechanisms underpinning activity and the chemical interactions driving them remain unclear. Atomic-detail MD simulations, fueled by hardware advances, algorithm improvements, and more accurate force fields, are becoming an increasingly powerful way to study these dynamic and transient events [2, 43–46]. This method allows us to study atomic details, movements, kinetics, chemical interactions, and assemblies of peptides in cell membranes or model lipid bilayers [1, 2, 26, 36, 47, 48].

MD calculates the physical movement of atoms by applying Newton’s laws of motion at the atomic level. An atom can be represented by a point of mass m and charge q.

$$m \frac{\partial^2 \vec{r}}{\partial t^2} = \vec{F}(\vec{r}, \vec{v}, t) \tag{1}$$

The force ($\vec{F}$) on each atom is a function of its coordinate ($\vec{r}$), its velocity ($\vec{v}$), and time (t); m is the mass of the atom. The overall potentials and parameters are determined by the force field. Each atom interacts with other atoms either through springs with force constants k_i (covalent bonds) or through electromagnetic forces. Gravitation is neglected in MD simulations of biomolecules because gravitational forces are negligible compared with the corresponding electrostatic forces. There are two types of interaction between any two atoms: bonded (covalent bonds) and nonbonded (electromagnetism).
Therefore, the overall potential energy (V_total) function is given by:

$$V_{\text{total}} = V_{\text{bonded}} + V_{\text{nonbonded}} \tag{2}$$

The bonded potential energy can be represented as:

$$V_{\text{bonded}} = V_{\text{bonds}} + V_{\text{angles}} + V_{\text{improper dihedrals}} + V_{\text{torsion angles}} \tag{3}$$

The nonbonded potential energy can be represented by a combination of the Lennard-Jones potential and the Coulomb potential:

$$V_{\text{nonbonded}} = V_{\text{LJ}} + V_{\text{Coulomb}} \tag{4}$$

All atoms in the system are then simulated by integrating Newton’s equation, Eq. (1), which can be done using the Verlet algorithm. This finite-difference scheme is widely used in MD simulations because it is stable, time-reversible, and conserves energy well.

$$m \frac{\vec{r}(t + \Delta t) + \vec{r}(t - \Delta t) - 2\vec{r}(t)}{\Delta t^2} = \vec{F}(t) \tag{5}$$

3.1 Molecular Dynamics (MD) Simulations

In the following example, we applied unbiased all-atom MD simulations (see Note 1) using the GROMACS [28] and Hippo BETA (http://www.biowerkzeug.com) simulation packages and the VMD molecular visualization program (http://www.ks.uiuc.edu/Research/vmd/) [29]. The coordinates and structures of extended peptides were generated using Hippo BETA [49]. These initial structures were relaxed in the isothermal–isobaric (NPT) ensemble for 200 MC steps using atomic-detail Monte Carlo (MC) simulations, a computational algorithm that relies on repeated random sampling to achieve energy minimization. The relaxation step allows the initial structures to reach a low-energy configuration and prevents unfavorable interactions between atoms. Water was treated implicitly using a generalized Born implicit solvent (GBIS) [50]. The combination of MC simulations with GBIS greatly accelerates structural relaxation and equilibrates angles, torsions, and distances between atoms.
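As a minimal sketch of Eqs. (1) and (5), the Python snippet below integrates a single particle on a one-dimensional harmonic spring with the position-Verlet update. The spring constant, mass, time step, and function names are illustrative assumptions and are not taken from GROMACS or Hippo; production MD codes use related but more elaborate integrators.

```python
import math


def harmonic_force(r, k=1.0):
    """Force of a one-dimensional spring with force constant k (assumed toy potential)."""
    return -k * r


def verlet_trajectory(r0, v0, m=1.0, k=1.0, dt=0.01, n_steps=1000):
    """Integrate m * d2r/dt2 = F(r) with the position-Verlet update of Eq. (5):

    r(t + dt) = 2 r(t) - r(t - dt) + F(t) * dt**2 / m
    """
    # Eq. (5) needs two past positions, so r(-dt) is bootstrapped
    # from a second-order Taylor expansion around t = 0.
    r_prev = r0 - v0 * dt + 0.5 * harmonic_force(r0, k) / m * dt**2
    r_curr = r0
    positions = [r_curr]
    for _ in range(n_steps):
        r_next = 2.0 * r_curr - r_prev + harmonic_force(r_curr, k) / m * dt**2
        r_prev, r_curr = r_curr, r_next
        positions.append(r_curr)
    return positions


if __name__ == "__main__":
    dt, n_steps = 0.01, 1000
    traj = verlet_trajectory(r0=1.0, v0=0.0, dt=dt, n_steps=n_steps)
    exact = math.cos(dt * n_steps)  # analytical solution for m = k = 1, v0 = 0
    print(f"Verlet: {traj[-1]:.5f}  exact: {exact:.5f}")
```

Comparing the final position with the analytical cosine solution gives a quick check that the chosen time step is small enough for the stiffest motion in the system, which is the same consideration behind the femtosecond time steps used in all-atom MD.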
After relaxation, the peptides were placed in all-atom peptide/lipid/water systems containing model membranes (i.e., lipid bilayers) with 100 mM potassium and chloride ions using CHARMM-GUI (http://www.charmm-gui.org/), a web-based graphical user interface that generates the input files for the simulation setup [30]. For the folding simulations, the systems were first equilibrated for 10 ns with position restraints applied to the peptide. For pore-forming simulations, single peptides were allowed to fold in the lipid bilayer for ~600 ns; subsequently, the systems were replicated as a 2 × 2 grid in the x and y directions, giving four copies of the original system.

MD simulations were performed with GROMACS 5.0.4 using the CHARMM36 force field [28, 51] in conjunction with the TIP3P water model [52]. Electrostatic interactions were computed using particle mesh Ewald (PME) [53], and a cutoff of 10 Å was used for van der Waals interactions [54]. Bonds involving hydrogen atoms were constrained using the LINCS algorithm [55]. The integration time step was 2 fs, and neighbor lists were updated every five steps. All simulations were performed in the NPT ensemble, without any restraints or biasing potentials. Water and the peptide were each coupled separately to a heat bath (i.e., the simulation temperature) with a time constant τ_T = 0.5 ps using velocity-rescale temperature coupling [56]. Atmospheric pressure of 1 bar was maintained using weak semi-isotropic pressure coupling with compressibility κ_z = κ_xy = 4.6 × 10⁻⁵ bar⁻¹ and time constant τ_P = 1 ps.

To reveal the most highly populated pore assemblies during the simulations, a complete list of all oligomers was constructed for each trajectory frame. A transmembrane (TM) pore assembly of order n (an oligomer of n peptides) is defined as any set of n TM peptides in mutual contact, that is, with a minimum distance of <3.5 Å between their heavy atoms (nitrogen, carbon, or oxygen).
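One way to build such an oligomer list for a single frame is to treat peptides as nodes of a contact graph and collect its connected clusters. The sketch below does this in plain Python. It assumes that the heavy-atom coordinates of each TM peptide have already been extracted (in Å), interprets "mutual contact" as membership in the same connected cluster, and ignores periodic boundary conditions for brevity; the function names and toy coordinates are hypothetical and do not reproduce the authors' analysis code.

```python
from itertools import combinations
from math import dist

CONTACT_CUTOFF = 3.5  # Å, minimum heavy-atom distance that defines a contact


def in_contact(peptide_a, peptide_b, cutoff=CONTACT_CUTOFF):
    """True if any heavy-atom pair from the two peptides is closer than cutoff."""
    return any(dist(a, b) < cutoff for a in peptide_a for b in peptide_b)


def find_oligomers(peptides, cutoff=CONTACT_CUTOFF):
    """Group peptides into oligomers linked by chains of contacts.

    peptides: list of peptides, each a list of (x, y, z) heavy-atom coordinates in Å.
    Returns a list of peptide-index lists, one per oligomer.
    """
    parent = list(range(len(peptides)))  # union-find forest over peptide indices

    def root(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Merge the clusters of every pair of peptides that are in contact.
    for i, j in combinations(range(len(peptides)), 2):
        if in_contact(peptides[i], peptides[j], cutoff):
            parent[root(i)] = root(j)

    clusters = {}
    for i in range(len(peptides)):
        clusters.setdefault(root(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Toy frame: peptides 0, 1, and 2 form a chain of contacts; peptide 3 is isolated.
    frame = [
        [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
        [(4.0, 0.0, 0.0), (5.5, 0.0, 0.0)],
        [(8.0, 0.0, 0.0)],
        [(30.0, 0.0, 0.0)],
    ]
    for members in find_oligomers(frame):
        print(f"oligomer of order n = {len(members)}: peptides {members}")
```

The all-pairs distance loop is quadratic in the number of atoms, which is acceptable for a handful of peptides per frame; for large systems a neighbor-search structure would be the usual replacement.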
This definition frequently overcounts the oligomeric state because numerous transient surface-bound (S-state) peptides are only loosely attached to the transmembrane-inserted peptides that make up the core of the oligomer. These S-state peptides on the membrane frequently change position or enter and leave the stable part of the TM pore assembly. To focus the analysis on true longer-lived TM pores, a cut-off criterion of 65° on the tilt angle τ of the peptides was introduced. Any peptide with τ ≥ 65° was considered to be in the S-state (i.e., the peptide stayed at the membrane interface and did not span the membrane) and was removed from the oligomeric analysis. This strategy greatly reduced the background noise in the oligomeric clustering algorithm (i.e., it eliminated S-state peptides near the TM pore assembly) by focusing on the true long-lived pore structures. Population plots of the occupation percentage of oligomer n multiplied by its number of peptides (n) were then constructed. These plots reveal how many peptides were concentrated in which oligomeric state during the simulation time.

3.2 Combinatorial Peptide Libraries

Combinatorial peptide libraries (see Note 2) can be synthesized on Tentagel® NH2 macrobeads with 280–320 μm particle size (~65,550 beads/g) using Fmoc solid-phase peptide synthesis. The sequences are made using a split-and-pool method [57]. Briefly, the batch is split into several portions after the first reaction, and each portion is coupled with a different amino acid. The completed portions are pooled, mixed, and then split again into several portions. This cycle is repeated until the combinatorial peptide library has been synthesized. Each macrobead carries only one peptide sequence, attached via a photolinker between peptide and bead.
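The size of a split-and-pool library, and the screening coverage estimate given later in this subsection (Eq. 6), can be sketched as below. The number of amino acid choices per position is a made-up example, and the function names are hypothetical; the coverage formula assumes simple random sampling of beads with replacement.

```python
import math

def library_size(choices_per_position):
    """Number of distinct sequences N: product of choices at each position."""
    return math.prod(choices_per_position)

def coverage(n_screened, library_n):
    """Expected fraction P of sequences seen at least once (Eq. 6)."""
    return 1.0 - (1.0 - 1.0 / library_n) ** n_screened

def beads_needed(target_coverage, library_n):
    """Smallest n with coverage >= target, from solving Eq. (6) for n."""
    n = math.log(1.0 - target_coverage) / math.log(1.0 - 1.0 / library_n)
    return math.ceil(n)

# Hypothetical library: five varied positions with 2, 3, 4, 2 and 3 choices.
N = library_size([2, 3, 4, 2, 3])
n_min = beads_needed(0.80, N)
p_at_n_min = coverage(n_min, N)
```

For large N the required sample size is roughly 1.6 × N for 80% coverage, because (1 − 1/N)^n is close to exp(−n/N).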
The molecular weight and peptide sequence of the peptide library are verified by matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) mass spectrometry and Edman sequencing. Edman sequencing, first developed by Pehr Edman in 1950 [58], can be divided into four steps: (1) coupling: the amino group at the N-terminal end of the peptide is coupled to phenyl isothiocyanate; (2) cleavage: the first peptide bond is cleaved in strong acid (trifluoroacetic acid; TFA), resulting in smaller peptide fragments and a cyclized anilinothiazolinone (ATZ) amino acid; (3) conversion: the ATZ amino acid is separated from the peptide fragment by organic extraction with ethyl acetate and converted to a phenylthiohydantoin (PTH) amino acid in 25% TFA (v/v in ddH2O); and (4) analysis using MALDI-TOF mass spectrometry.

The beads are solvated in a minimum amount of methanol, spread as a dispersed single layer on a glass plate, and dried under air. The photolinker between peptide and bead is cleaved by exposing the dry beads to low-power UV light for 5 h. The UV-cleaved beads are transferred to 96-well microplates at one bead per well. The peptide on each bead is dissolved in HFIP/water (1:1 ratio) and exposed to low-power ultraviolet (UV) light (dual wavelength, 365 nm and 405 nm) for another 3 h until dry, then dissolved in water or DMSO, quantified by tryptophan absorbance (the molar extinction coefficient of tryptophan at 280 nm is 5690 M⁻¹ cm⁻¹) using a NanoDrop spectrophotometer, and stored at −20 °C. The sample size of the screen can be estimated by assuming simple random sampling and requiring >80% coverage of the sequences (P):

$$P = 1 - \left(1 - \frac{1}{N}\right)^{n} \tag{6}$$

where N denotes the total number of sequences in the peptide library and n indicates the sample size for screening.

Fig. 1 Secondary structure of peptide. CD spectra of three different peptides with varied secondary structures: random coil (black; HSP1 peptide in aqueous buffer), alpha helix (red; LDKA peptide in lipids), and beta strand (blue; TDP-43 D1 peptide in lipids at high temperature, where it misfolds)

3.3 Circular Dichroism (CD) Spectroscopy

CD spectroscopy uses circularly polarized light to investigate optically active chiral molecules and records the differential absorption of left- and right-handed circularly polarized light (Δε = εL − εR), which can yield an estimate of the secondary structure (e.g., helix, beta strand, and random coil) of proteins in native environments [59]. Different secondary structures result in different CD spectra (Fig. 1). Alpha helices have negative bands at 222 nm and 208 nm and a positive band at 193 nm, and beta strands have a negative band at 218 nm and a positive band at 195 nm [60].

As an example of the use of CD spectroscopy (see Note 3), peptide solutions (50 μM) in 10 mM phosphate buffer (pH 7.0) were co-incubated with 800 μM large unilamellar vesicles (LUVs) in identical buffer. LUVs were made by lipid extrusion [61]. Briefly, lipids were dissolved in chloroform, mixed, and dried under nitrogen gas in a glass vial, and the remaining chloroform was removed under vacuum overnight. The lipids were then resuspended in 10 mM sodium phosphate buffer (pH 7) with 100 mM potassium chloride. LUVs were generated by extruding the lipid suspension 10 times through 0.1 μm Nuclepore polycarbonate filters to give LUVs of 100 nm diameter. CD spectra were recorded using synchrotron radiation circular dichroism (SRCD) spectroscopy with the CD beamlines on ASTRID2 at Aarhus University in Denmark and ANKA at Karlsruhe Institute of Technology in Germany.
Spectra were recorded from 270 to 170 nm with a step size of 0.5 nm, a bandwidth of 0.5 nm, and a dwell time of 2 s. The averaged baseline was subtracted from each spectrum, and the result was averaged over three repeat scans. The averaged spectra were normalized to molar ellipticity [θ] per residue, which is a common unit for estimating secondary structure in proteins, peptides, and polymers. The ellipticity θ reported by the instrument is defined through the ratio of the minor to major axis of the polarization ellipse: tan θ = (E_L − E_R)/(E_L + E_R), where E_L and E_R are the magnitudes of the electric field vectors of the left- and right-circularly polarized light, respectively. The raw data were analyzed using DichroWeb (http://dichroweb.cryst.bbk.ac.uk/) [59, 62, 63], a web-based server whose analysis algorithms use reference datasets derived from characterized peptides/proteins with known structures.

3.4 Oriented Circular Dichroism (OCD) Spectroscopy

Membrane-active peptides can spontaneously span the cell membrane through peptide–peptide interactions, membrane defects, or water permeation. As they traverse the membrane, peptides assume one of two common states: an S-state or a TM state. S-state and TM peptides in lipid membranes can be identified by OCD spectroscopy (see Note 4). As an example, we studied a small membrane-active AMP from Hyla punctata in a lipid bilayer [47]. Peptide (20 μg) was dissolved in chloroform and added to lipid(s) in chloroform at a specific molar ratio, e.g., peptide:lipid (P:L) = 1:10. The mixtures were dried under a low flow of nitrogen gas followed by high vacuum overnight. The dried sample was resuspended in 40 μL of pure HFIP, an organic solvent that gives a consistent thin lipid film on a glass surface. Of this peptide/lipid mixture in HFIP, 20 μL was dripped and spread on a glass plate. The other 20 μL was used for the replicate.
After vacuum drying to remove the HFIP, 2 μL ddH2O (sterile-filtered) was added to the glass plate to saturate the lipid film, and the plate was placed in a chamber containing a saturated solution of potassium sulfate (120 g/L at 25 °C). Oriented bilayers formed after equilibrating the lipid sample in the chamber at 25 °C overnight. Spectra were recorded from 270 to 160 nm with a step size of 0.5 nm, a bandwidth of 0.5 nm, and a dwell time of 2 s, and averaged over eight rotation angles obtained by rotating the sample through 360° around the beam axis. Each spectrum was averaged over three repeat scans. The averaged baseline was subtracted from the spectra, which were then normalized to molar ellipticity [θ] using the equation:

$$[\theta] = \frac{100\,\theta}{C\,l} \tag{7}$$

where θ is the ellipticity, C is the peptide density in molar concentration, and l is the pathlength of the film. OCD procedures on the ANKA beamline at Karlsruhe Institute of Technology gave similar results, even though the cuvette settings were different in that ANKA had a hydration chamber and an automatic rotation system for the rotation angles.

3.5 Liposome Fluorescent Leakage Assay

The liposome fluorescent leakage assay is a common biophysical method for studying interactions between peptides and lipid membranes (see Note 5). The fluorescent dyes can be self-quenched or quenched by another compound at a threshold concentration when the donor–acceptor distance is within 15 Å, a process called Dexter electron transfer. At that proximity, an electron of the quencher (donor) is transferred to the lowest unoccupied molecular orbital (LUMO) of the excited fluorescent dye (acceptor), and one electron from the acceptor moves to the highest occupied molecular orbital (HOMO) of the donor, thus stopping the fluorescence.
When the peptides induce pore formation or membrane disruption, the release of fluorescent dyes from the liposome changes the fluorescence intensity, which can be recorded and used to calculate the peptide-induced leakage fraction from the vesicles. Fluorescent dyes of different sizes can be used to measure the peptide-induced pore size in the membrane. Two examples are given below: the ANTS/DPX leakage assay (small fluorescent dye; MW = 427) and the macromolecule release assay (large fluorescent dye; MW > 1000; the size depends on the dextran).

3.5.1 ANTS/DPX Leakage Assay

Lipids in chloroform were mixed and dried under nitrogen gas in a glass vial, and the remaining organic solvent was removed under vacuum overnight. The lipids were then resuspended in phosphate buffer at pH 7 (10 mM sodium phosphate with 100 mM potassium chloride) containing 5 mM 8-aminonaphthalene-1,3,6-trisulfonic acid (ANTS) and 12.5 mM p-xylene-bis(N-pyridinium bromide) (DPX). The dyes were entrapped in LUVs extruded to 0.1 μm diameter. Gel filtration chromatography on Sephadex G-100 (GE Healthcare Life Sciences Inc) was used to separate free external ANTS/DPX from the LUVs with entrapped contents. LUVs were diluted to 0.5 mM lipid and used to measure leakage activity after addition of aliquots of peptide. Leakage was measured from fluorescence emission spectra after 3 h of incubation. The spectra were recorded for ANTS/DPX using excitation and emission wavelengths of 350 nm and 510 nm, respectively, with a BioTek Synergy H1 Hybrid Multi-Mode Reader. The nonionic surfactant Triton X-100 (Triton; 10% vol) was used as the positive control to measure the maximum leakage of the vesicles. The leakage fraction can be calculated using the equation:

$$\%\,\mathrm{leakage} = \frac{I_{\mathrm{peptide}} - I_{0}}{I_{\mathrm{Triton}} - I_{0}} \tag{8}$$

where I_peptide, I_0, and I_Triton are the fluorescence intensities of the peptide-treated vesicles, untreated vesicles, and Triton-treated vesicles, respectively.
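Eq. (8) can be applied to a row of plate-reader wells as in the following minimal sketch. The intensity values are invented for illustration, and the function name `leakage_fraction` is hypothetical; multiply the result by 100 to express it as a percentage.

```python
import numpy as np

def leakage_fraction(i_peptide, i_untreated, i_triton):
    """Fractional leakage from Eq. (8): (I_peptide - I_0) / (I_Triton - I_0)."""
    span = float(i_triton) - float(i_untreated)
    if span == 0.0:
        raise ValueError("Triton and untreated controls give the same signal")
    return (np.asarray(i_peptide, dtype=float) - float(i_untreated)) / span

# Made-up readings (arbitrary fluorescence units) for four peptide doses.
untreated_control = 100.0   # I_0, vesicles without peptide
triton_control = 1100.0     # I_Triton, fully lysed vesicles
peptide_wells = [100.0, 350.0, 600.0, 1100.0]

fractions = leakage_fraction(peptide_wells, untreated_control, triton_control)
percent = 100.0 * fractions
```

Values slightly below 0 or above 1 can occur with noisy controls; how to treat them is an analysis choice that this sketch leaves open.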
3.5.2 Macromolecule Release Assay

Dextrans of several sizes were prepared and coupled with both 5-carboxytetramethylrhodamine (TAMRA) and biotin to give TAMRA-biotin–dextran (TBD) conjugates. The TBD conjugate was entrapped in LUVs as described above. Free external TBD conjugate was removed by incubation with an immobilized streptavidin agarose resin, which has high affinity for the biotin in the conjugate. The resin was spun down, and the TBD-containing vesicles in the supernatant were transferred to a new glass vial. Streptavidin labeled with an Alexa-488 fluorophore was added during the leakage experiment with the peptide, as previously described [8, 12]. The sample was incubated for 3 h before measuring Alexa-488 fluorescence. A control without added peptide served as the 0% leakage signal, and addition of the detergent Triton (10% vol) served as the positive control defining 100% leakage.

TBD conjugate released from the vesicles binds to the Alexa-488-labeled streptavidin, which leads to Förster resonance energy transfer (FRET) between Alexa-488 (donor) and the released TAMRA (acceptor) in TBD. FRET differs from the Dexter mechanism: energy, not an electron, is transferred from the electronically excited donor to the acceptor through nonradiative dipole–dipole coupling, which acts over a distance. The fluorescence of Alexa-488 was measured using excitation and emission wavelengths of 490 nm and 525 nm, respectively, with a BioTek Synergy H1 Hybrid Multi-Mode Reader. The normalized leakage fraction can be calculated using the equation:

$$\%\,\mathrm{leakage} = \frac{I_{0} - I_{\mathrm{peptide}}}{I_{0} - I_{\mathrm{Triton}}} \tag{9}$$

where I_peptide, I_0, and I_Triton are the fluorescence intensities of the peptide-treated vesicles, untreated vesicles, and Triton-treated vesicles, respectively.
3.6 Tryptophan Fluorescence Quenching Assay

Tryptophan fluorescence quenching is a biophysical method used for determining the degree of burial of a tryptophan side chain [64]. It is usually applied to quantify conformational changes in protein folding and the strength of peptide binding to membranes (see Note 6). The intrinsic fluorescence of aromatic tryptophan residues is highly sensitive to their environment and is quenched in nonpolar environments, e.g., in lipid bilayers, in hydrophobic protein cores, or when buried in the interface with a binding partner. The emission spectrum of the quenched fluorescence shifts toward lower wavelengths (blueshift) with increasing hydrophobicity of the local environment. This process is dynamic and reversible. Tryptophan residues in aqueous solution emit fluorescence at a wavelength of 350 nm, but when tryptophan is fully buried in a lipid membrane (which acts as a quencher), the emission undergoes a blueshift to ~320 nm.

The spectrum of tryptophan fluorescence emission between 300 and 350 nm was recorded following excitation at 280 nm to monitor the interactions between peptides and lipids. Peptides (50 μM) and extruded LUVs (600 μM) were prepared in 10 mM phosphate buffer (pH 7.0). The solutions were incubated and measured after 60 min. Excitation was fixed at 280 nm (slit 9 nm), and emission was collected from 300 to 450 nm (slit 9 nm). The spectra were recorded using a Synergy H1 Hybrid Multi-Mode Reader and a Cytation™ 5 Cell Imaging Multi-Mode Reader from BioTek and were averaged over three scans. The scattering contribution to the fluorescence spectrum was normalized [65] to evaluate the shift of the emission maximum [40, 66]. The negative control consisted of free peptide in buffer.
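The shift of the emission maximum between free and vesicle-bound peptide can be read from two spectra as sketched below. The spectra here are synthetic Gaussians, not measured data, and the scattering normalization of ref. [65] is not implemented; the function names are hypothetical.

```python
import numpy as np

def emission_maximum(wavelengths_nm, intensities):
    """Wavelength at which the emission spectrum peaks."""
    return float(wavelengths_nm[int(np.argmax(intensities))])

def blueshift_nm(wavelengths_nm, free_spectrum, bound_spectrum):
    """Positive value: the bound peptide emits at a shorter wavelength."""
    return (emission_maximum(wavelengths_nm, free_spectrum)
            - emission_maximum(wavelengths_nm, bound_spectrum))

def synthetic_spectrum(wavelengths_nm, centre_nm, width_nm=25.0):
    """Gaussian stand-in for a tryptophan emission band (illustration only)."""
    return np.exp(-0.5 * ((wavelengths_nm - centre_nm) / width_nm) ** 2)

# Emission window from the text: 300 to 450 nm, here on a 1 nm grid.
wavelengths = np.arange(300.0, 451.0, 1.0)
free_peptide = synthetic_spectrum(wavelengths, centre_nm=350.0)   # in buffer
with_vesicles = synthetic_spectrum(wavelengths, centre_nm=330.0)  # with LUVs

shift = blueshift_nm(wavelengths, free_peptide, with_vesicles)
```

With noisy experimental spectra, smoothing or fitting the band before locating the maximum is usually more robust than a plain `argmax`.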
Additional LUVs were titrated into a fixed concentration of the peptide until the blueshift reached a plateau, i.e., until the tryptophan was fully buried in the hydrophobic core of the membrane. The membrane partitioning can be calculated according to White et al. [65] and Rodnin et al. [67] using the following equation:

$$I = 1 + (I_{\infty} - 1)\,\frac{K_{x}\,[L]}{[W] + K_{x}\,[L]} \tag{10}$$

where I is the relative emission intensity of tryptophan, [L] is the lipid concentration, I_∞ is the emission intensity of tryptophan at infinite lipid saturation, [W] is the concentration of water (55.3 M), and K_x is the mole fraction partitioning coefficient:

$$K_{x} = \frac{[P_{\mathrm{bil}}]/[L]}{[P_{\mathrm{water}}]/[W]} \tag{11}$$

where [P_bil] and [P_water] are the bulk concentrations of peptide associated with the lipid membrane and in water, respectively. The calculated K_x can be used to determine the membrane partitioning free energy (ΔG):

$$\Delta G = -RT \ln K_{x} \tag{12}$$

where R is the gas constant (1.987 × 10⁻³ kcal/(mol·K)) and T is the temperature in Kelvin.

However, this technique may not be equally accurate for all peptide structures. Although most peptides have tryptophan fluorescence peaks at ~348 nm (indicative of monomeric peptides or low multimeric soluble aggregates), some peptides can fold into a helix and form multimeric aggregates in the aqueous phase or at higher concentration that bury the tryptophan in a hydrophobic core. This folding results in blue-shifted spectra and small spectral widths and affects the accuracy of the measurements.

3.7 Electrical Impedance Spectroscopy

Electrical impedance spectroscopy measures the resistance and capacitance of a lipid bilayer supported on a silicon plate (see Note 7). This method, which involves a three-electrode setup with a silver/silver chloride reference electrode and a platinum counter electrode, can be used to monitor the status of a lipid bilayer over time, e.g., membrane lysis and membrane poration [8].
The supported bilayer preparation and the impedance measurement were adapted from techniques first established by the Hristova and Searson Labs [8, 68]. As an example, the top leaflet of the bilayers contained 100% POPC (1-palmitoyl-2-oleoyl-glycero-3-phosphocholine), and the bottom leaflet consisted of 18.5% wt PEG (polyethylene glycol; average Mn = 2000) and 81.5% wt POPC. The bilayers were prepared on a silicon plate with (111) orientation (the Miller indices label atomic planes in the crystal lattice) and were deposited by the Langmuir–Blodgett (LB) method. The LB method is used to compress a lipid–polymer (POPC-PEG) monolayer on the water surface and deposit the monolayer on the silicon plate. The plate is then transferred to the three-electrode setup and connected to the electrical impedance spectroscope. Impedance was measured over a frequency range of 10⁵ to 1 Hz with a 20 mV root-mean-squared (RMS) AC perturbation and at a potential of 0 V with respect to the reference electrode. Spectra were recorded at 2-min intervals in the first hour and at 1-h intervals subsequently. The experiments were performed in the dark to prevent photo effects in the silicon.

The results were fitted to an equivalent circuit model to determine the resistance and capacitance of the semiconductor–liquid interface (R_p and C_p) and of the lipid bilayer membrane (R_m and C_m). The analysis was conducted using Electrochemical Impedance Spectroscopy Software (Gamry Instruments Inc., Pennsylvania, USA). The values were used to determine the normalized membrane resistance (R_m/R_0, which reports on the permeation of the membrane) and the change of the capacitance (C_m − C_0, which is correlated to the membrane thickness).
The membrane resistance, which opposes the flow of electric current across the membrane, reflects the integrity of the membrane and can be used to study membrane poration and membrane lysis. The change of the capacitance provides the relative thickness of the membrane, according to the equation:

$$\frac{C}{A} = \frac{\varepsilon}{d} \tag{13}$$

where C is the capacitance, A is the surface area, ε is the permittivity, and d is the thickness of the membrane. The change of membrane conductance was calculated from the difference of the inverse bilayer resistance at 0 and 17 h after addition of the peptide. The conductance per mole of peptide was calculated from the experimental peptide-to-lipid ratio. The lipid concentrations on the silicon plate were quantified from the change of lipid concentration in the buffer using a colorimetric Stewart assay with ammonium ferrothiocyanate [69].

3.8 Isothermal Titration Calorimetry

Isothermal titration calorimetry (ITC) measurements with a MicroCal VP-ITC microcalorimeter (Microcal, Inc.) were performed to determine the thermodynamic parameters of the peptide–lipid interactions [70, 71] (see Note 8). The lipid vesicle suspension (16.23 mM) was titrated into a peptide solution (38.1 μM) in the sample cell. All samples were prepared in 10 mM phosphate buffer (pH 7.0) and degassed under vacuum over 30 min. The lipid suspension was added in 6 μL aliquots, with an injection duration of 12 s, into the reaction cell (volume = 1.46 mL) containing the peptide solution. The equilibration time between titration steps was 15 min. The first 6 μL titration was disregarded so that any premixing of the two solutions during the equilibration time would not affect the analysis. The stirring speed of the injection syringe was 307 revolutions per minute (rpm). The thermodynamic parameters were calculated using the standard ITC software, which utilizes a stoichiometric model of binding. Membrane partitioning, however, is not stoichiometric [65]; consequently, the actual errors in the free energy determination might be larger due to cross correlation of the binding stoichiometry (n) and dissociation constant (Kd) fitting parameters.

4 Notes

1. Molecular dynamics (MD) simulations: MD simulations provide atomic details of how peptides fold in water and interact with lipid membranes [1, 2]. Many rare events and thermodynamic parameters can be studied and validated by this method, e.g., peptide binding and folding at the membrane interface and peptide aggregation and assembly within the lipid membrane. Disordered aggregates have been observed with several small peptides. Peptide-induced water permeation and ion flux can be monitored throughout the simulation [17, 27, 47]. MD simulations allow us to determine the critical amino acids that bind and interact with lipids, peptides, and other compounds [26, 27, 36, 43, 47, 72–74]. The lifetime of a peptide assembly (e.g., a functional channel-like structure) can be estimated using the Arrhenius equation by performing the simulations at different temperatures [26]. Simulations performed at higher temperatures speed up sampling kinetics and allow us to study rare events, such as peptide folding, bilayer partitioning, and pore assembly, without the need for advanced sampling techniques [75, 76]. However, this technique is suitable only for thermostable peptides and lipid membranes; therefore, other experiments are needed for verification. Most of the analysis can be conducted using the GROMACS package, VMD software, and basic programming (e.g., Python).

2. Combinatorial peptide libraries vs MD simulations: Our previous study has shown that the hydrophobic moment can be correlated with peptide binding to zwitterionic and anionic lipid bilayers [22, 47]. The peptides were evaluated using the liposome fluorescent leakage assay with fixed peptide and lipid concentrations (0.5 μM peptide against 0.5 mM lipid; peptide-to-lipid ratio of 1:1000). This assay allowed us to determine whether the peptides can porate or lyse the membrane. The selected peptides were then studied in MD simulations with two different lipid bilayers. The peptides can be placed on one side or on both sides of the bilayer, and the model can be built using the GROMACS package, the CHARMM-GUI web-based interface, and VMD software. Over several microseconds of simulation, the interactions between peptides and lipid membranes can vary. Snapshots can be captured using VMD software, and the trajectories can be processed and analyzed using the GROMACS package (e.g., gmx trjconv and gmx mindist).

3. Circular dichroism (CD) spectroscopy vs MD simulations: The simulated secondary structure of the peptide can be averaged and compared with the experimental secondary structure from CD spectroscopy to validate its accuracy [72]. In CD spectroscopy, secondary structure can be characterized in aqueous conditions, at varied lipid concentrations, and for different lipid types. The fractional content of alpha helix and beta sheet derived from CD spectroscopy can be quantified using DichroWeb [59, 62, 63] and compared with the averaged secondary structure from the simulations. The simulated secondary structure can be analyzed using VMD software (Extensions → Analysis → Timeline → Calculate → Calc. Sec. Struct) and the GROMACS package (e.g., gmx helix).

4. Oriented circular dichroism (OCD) spectroscopy vs MD simulations: Some membrane-active peptides that can insert into the membrane and traverse it have a tilt angle that depends on the peptide length and membrane thickness. As noted above (Subheading 3.4), OCD spectroscopy offers an averaged fraction of the TM peptides and S-state peptides [76]. The results of OCD spectroscopy can be compared with the MD simulations as an indication of how likely the peptides are to reach the TM state. The simulations can be analyzed using the GROMACS package (e.g., gmx helixorient).

5. Liposome fluorescent leakage assay vs MD simulations: Some membrane-active peptides can pierce the cell membrane. Liposome fluorescent leakage assays are an easy and quick approach to measure the resulting pore size using fluorescent dyes [8, 12, 27], such as small fluorescent dyes (ANTS and DPX; MW = 422–427) or macromolecular dyes (TAMRA-labeled dextran and AF488-labeled streptavidin; MW = 3k or 10k). The size of the peptide-induced pore can be characterized in the simulations by measuring the size of the aggregate and the water flux using the GROMACS package (e.g., gmx hole [77]).

6. Tryptophan fluorescence quenching assay vs MD simulations: The tryptophan fluorescence quenching assay can be used as a platform to evaluate peptide binding to different types of lipids [22, 67]. This technique can be limited by the hydrophobicity of the peptides and their aggregation and folding in aqueous buffer, and binding may not involve a two-state transition. Nevertheless, the tryptophan fluorescence quenching assay is useful for screening peptide libraries consisting of 10² to 10⁴ peptides to identify membrane-binding peptides for specific applications. The assay is done in microwell plates read by a microplate reader. The simulations can be used to study how peptides interact with the lipid bilayer and bind to the membrane interface using VMD software and the GROMACS package (e.g., gmx mindist).

7. Electrical impedance spectroscopy vs MD simulations: Electrical impedance spectroscopy is a tool with which to monitor the resistance and capacitance of the lipid bilayer on a silicon plate [8, 68, 78], which can be correlated to the membrane permeability and membrane thickness, respectively. Electrical impedance spectroscopy can reveal whether membrane poration is a transient event or an equilibrium state and whether a particular peptide promotes membrane poration or lyses the membrane. For membrane poration, the resistance decreases while the capacitance remains constant after the peptide is added to the chamber. For membrane lysis, the resistance also decreases but the capacitance increases, because the peptide can peel the membrane off the silicon plate and reduce the membrane thickness. Electrical impedance spectroscopy can be compared with the peptide assembly simulations using the GROMACS package (e.g., gmx trjconv and gmx traj).

8. Isothermal titration calorimetry (ITC) vs MD simulations: ITC can measure the thermodynamic parameters of peptide–lipid interactions [22, 47], e.g., binding stoichiometry, binding enthalpy, and binding constant. As with the tryptophan fluorescence quenching assay, the measured values may not accurately represent a two-state transition and can be difficult to analyze, and the errors of the measured quantities may be larger due to cross correlation of the binding stoichiometry and binding-constant fitting parameters. However, ITC can still be useful to determine whether the peptide has selectivity for certain lipid types, and ITC measurements can be checked against properties obtained from simulated peptide folding (e.g., helical fraction).
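The tilt-angle criterion used in Subheading 3.1, which also underlies the comparison with OCD in Note 4, can be sketched as below. This is not the gmx helixorient algorithm: the peptide axis is approximated here by the vector joining the centroids of the first and last few Cα atoms, the membrane normal is assumed to lie along z, and the function names are hypothetical. The 65° cut-off is the one quoted in the text.

```python
import numpy as np

S_STATE_CUTOFF_DEG = 65.0  # tilt-angle cut-off from Subheading 3.1

def tilt_angle_deg(ca_coords, normal=(0.0, 0.0, 1.0), n_end=4):
    """Angle (0-90 degrees) between the peptide axis and the membrane normal.

    Assumption: the axis runs between the centroids of the first and last
    n_end C-alpha atoms; the bilayer normal is the z axis by default.
    """
    ca = np.asarray(ca_coords, dtype=float)
    axis = ca[-n_end:].mean(axis=0) - ca[:n_end].mean(axis=0)
    normal = np.asarray(normal, dtype=float)
    cos_tau = abs(axis @ normal) / (np.linalg.norm(axis) * np.linalg.norm(normal))
    return float(np.degrees(np.arccos(np.clip(cos_tau, 0.0, 1.0))))

def is_surface_bound(ca_coords, cutoff_deg=S_STATE_CUTOFF_DEG):
    """True if the peptide counts as S-state and is dropped from clustering."""
    return tilt_angle_deg(ca_coords) >= cutoff_deg

# Idealized test geometries (1.5 Å rise per residue, made-up coordinates).
rise = 1.5 * np.arange(10)
zeros = np.zeros(10)
tm_peptide = np.column_stack([zeros, zeros, rise])       # along z
surface_peptide = np.column_stack([rise, zeros, zeros])  # along x

tm_tilt = tilt_angle_deg(tm_peptide)
surface_tilt = tilt_angle_deg(surface_peptide)
```

Applying `is_surface_bound` to every peptide in a frame before contact clustering reproduces the filtering step described for the oligomer analysis, under the stated assumptions.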
5 Conclusions and Future Perspective

Recent developments in all-atom MD simulations of polypeptides have provided insights into the molecular details of their mechanism of action and the pathways utilized for membrane interaction, making MD simulations essential tools to complement biophysical and in vitro experiments [2, 23–26, 43, 45, 48, 76, 79–82]. In addition, they are useful for in silico protein design for drug discovery. Several studies have applied MD simulations to design peptides for pharmaceutical applications, e.g., peptide chaperones for stabilizing human butyrylcholinesterase [73] and more potent antimicrobial peptides [48].

Although MD simulations are a powerful tool, experimental techniques are required to validate their accuracy. Different proteins and environmental conditions may require comparisons of several force fields [43, 44, 83–85], with a special focus on protein–lipid interactions [86] and more realistic multicomponent membrane compositions [87]. The bottlenecks of simulation timescale and simulation box size also limit our understanding. Simulation events that are observed may not have reached their equilibrium state. For example, amyloid peptides may induce fibril formation or other complicated pathways that take much longer than the available simulation time. A small simulation box can also yield less accurate results than would be obtained at a more realistic scale.

Here, we have shown several experimental techniques that can be used in conjunction with MD simulations to validate the results. There are also many other experimental techniques that have not been mentioned in this chapter, e.g., fluorescent-labeled peptides for membrane partitioning [67], nuclear magnetic resonance [25, 81], and neutron diffraction [81]. Ultimately, MD simulations provide a wide range of applications for structure prediction, mechanism and pathway elucidation, protein design, and drug discovery.
The results of MD simulations compare well with those of other computational techniques, e.g., machine learning [88], high-throughput screening [89], and molecular docking [90]. The advancement of computing hardware and algorithms will extend the timescales, allow for building larger and more realistic simulation systems, and in the near future increase our understanding of complex biological functions.

Acknowledgments

This work was supported by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (NIH) under Award Number U19AI142780. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The authors thank Kalina Hristova at Johns Hopkins University, William Wimley at Tulane University, Alexey Ladokhin at University of Kansas Medical Center, Jochen Bürck at Karlsruhe Institute of Technology, Nykola Jones at Aarhus University, Katherine Tripp at Johns Hopkins University, Gregory Wiedman at Seton Hall University, Sarah Kim at Duke University, Evan Troendle at King’s College London, and Yukun Wang at Yale University for valuable discussions about the experimental setups and simulations.

References

1. Chen CH et al (2020) Understanding and modelling the interactions of peptides with membranes: from partitioning to self-assembly. Curr Opin Struct Biol 61:160–166
2. Ulmschneider JP, Ulmschneider MB (2018) Molecular dynamics simulations are redefining our view of peptides interacting with biological membranes. Acc Chem Res 51(5):1106–1116
3. Dror RO et al (2012) Biomolecular simulation: a computational microscope for molecular biology. Annu Rev Biophys 41:429–452
4. Zasloff M (1987) Magainins, a class of antimicrobial peptides from Xenopus skin: isolation, characterization of two active forms, and partial cDNA sequence of a precursor. Proc Natl Acad Sci U S A 84(15):5449–5453
5. Lehrer RI et al (1989) Interaction of human defensins with Escherichia coli. Mechanism of bactericidal activity. J Clin Invest 84(2):553–561
6. Yeaman MR, Yount NY (2003) Mechanisms of antimicrobial peptide action and resistance. Pharmacol Rev 55(1):27–55
7. Chen CH, Lu TK (2020) Development and challenges of antimicrobial peptides for therapeutic applications. Antibiotics 9(1):24
8. Wiedman G et al (2014) Highly efficient macromolecule-sized poration of lipid bilayers by a synthetically evolved peptide. J Am Chem Soc 136(12):4724–4731
9. Krauson AJ, He J, Wimley WC (2012) Gain-of-function analogues of the pore-forming peptide melittin selected by orthogonal high-throughput screening. J Am Chem Soc 134(30):12732–12741
10. Krauson AJ et al (2015) Conformational fine-tuning of pore-forming peptide potency and selectivity. J Am Chem Soc 137(51):16144–16152
11. Wiedman G, Wimley WC, Hristova K (2015) Testing the limits of rational design by engineering pH sensitivity into membrane-active peptides. Biochim Biophys Acta 1848(4):951–957
12. Wiedman G et al (2017) pH-triggered, macromolecule-sized poration of lipid bilayers by synthetically evolved peptides. J Am Chem Soc 139(2):937–945
13. Sreedharan J et al (2008) TDP-43 mutations in familial and sporadic amyotrophic lateral sclerosis. Science 319(5870):1668–1672
14. Chen AK et al (2010) Induction of amyloid fibrils by the C-terminal fragments of TDP-43 in amyotrophic lateral sclerosis. J Am Chem Soc 132(4):1186–1187
15. Liu GC et al (2013) Delineating the membrane-disrupting and seeding properties of the TDP-43 amyloidogenic core. Chem Commun 49(95):11212–11214
16. Sun CS et al (2014) The influence of pathological mutations and proline substitutions in TDP-43 glycine-rich peptides on its amyloid properties and cellular toxicity. PLoS One 9(8):e103644
17. Chen CH et al (2016) Mechanisms of membrane pore formation by amyloidogenic peptides in amyotrophic lateral sclerosis. Chemistry 22(29):9958–9961
18. Laos V et al (2019) Characterizing TDP-43(307–319) oligomeric assembly: mechanistic and structural implications involved in the etiology of amyotrophic lateral sclerosis. ACS Chem Neurosci 10(9):4112–4123
19. Gagnon MC et al (2017) Influence of the length and charge on the activity of α-helical amphipathic antimicrobial peptides. Biochemistry 56(11):1680–1695
20. Grau-Campistany A et al (2015) Hydrophobic mismatch demonstrated for membranolytic peptides, and their use as molecular rulers to measure bilayer thickness in native cells. Sci Rep 5:9388
21. Grau-Campistany A et al (2016) Extending the hydrophobic mismatch concept to amphiphilic membranolytic peptides. J Phys Chem Lett 7(7):1116–1120
22. Chen CH et al (2020) Rational tuning of a membrane-perforating antimicrobial peptide to selectively target membranes of different lipid composition. bioRxiv:2020.11.01.364091
23. Leveritt JM, Pino-Angeles A, Lazaridis T (2015) The structure of a melittin-stabilized pore. Biophys J 108(10):2424–2426
24. Perrin BS, Pastor RW (2016) Simulations of membrane-disrupting peptides I: alamethicin pore stability and spontaneous insertion. Biophys J 111(6):1248–1257
25. Perrin BS et al (2016) Simulations of membrane-disrupting peptides II: AMP Piscidin 1 favors surface defects over pores. Biophys J 111(6):1258–1266
26. Wang Y et al (2016) Spontaneous formation of structurally diverse membrane channel architectures from a single antimicrobial peptide. Nat Commun 7:13535
27. Chen C et al (2019) Simulation-guided rational de novo design of a small pore-forming antimicrobial peptide. J Am Chem Soc 141(12):4839–4848
28. Pronk S et al (2013) GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29(7):845–854
29. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14(1):33–38, 27-8
30. Lee J et al (2016) CHARMM-GUI input generator for NAMD, GROMACS, AMBER, OpenMM, and CHARMM/OpenMM simulations using the CHARMM36 additive force field. J Chem Theory Comput 12(1):405–413
31. Quist A et al (2005) Amyloid ion channels: a common structural link for protein-misfolding disease. Proc Natl Acad Sci U S A 102(30):10427–10432
32. Li J et al (2017) Membrane active antimicrobial peptides: translating mechanistic insights to design. Front Neurosci 11:73
33. Guha S et al (2019) Mechanistic landscape of membrane-permeabilizing peptides. Chem Rev 119(9):6040–6085
34. Sani MA, Separovic F (2016) How membrane-active peptides get into lipid membranes. Acc Chem Res 49(6):1130–1138
35. Mangoni ML, McDermott AM, Zasloff M (2016) Antimicrobial peptides and wound healing: biological and therapeutic considerations. Exp Dermatol 25(3):167–173
36. Ulmschneider JP (2017) Charged antimicrobial peptides can translocate across membranes without forming channel-like pores. Biophys J 113(1):73–81
37. Wimley WC, Hristova K (2011) Antimicrobial peptides: successes, challenges and unanswered questions. J Membr Biol 239(1–2):27–34
38. Kreutzberger MA, Pokorny A, Almeida PF (2017) Daptomycin-phosphatidylglycerol domains in lipid membranes. Langmuir 33(47):13669–13679
39. Lee MT et al (2018) Comparison of the effects of daptomycin on bacterial and model membranes. Biochemistry 57(38):5629–5639
40. Kim SY et al (2019) Mechanism of action of peptides that cause the pH-triggered macromolecular poration of lipid bilayers. J Am Chem Soc 141(16):6706–6718
41. Kurgan KW et al (2019) Retention of native quaternary structure in racemic melittin crystals. J Am Chem Soc 141(19):7704–7708
42. Keener JE et al (2019) Chemical additives enable native mass spectrometry measurement of membrane protein oligomeric state within intact nanodiscs. J Am Chem Soc 141(2):1054–1061
43. Wang Y et al (2014) How reliable are molecular dynamics simulations of membrane active antimicrobial peptides? Biochim Biophys Acta 1838(9):2280–2288
44. Huang J, MacKerell AD (2018) Force field development and simulations of intrinsically disordered proteins. Curr Opin Struct Biol 48:40–48
45. Venable RM, Krämer A, Pastor RW (2019) Molecular dynamics simulations of membrane permeability. Chem Rev 119(9):5954–5997
46. Pan AC et al (2019) Atomic-level characterization of protein-protein association. Proc Natl Acad Sci U S A 116(10):4244–4249
47. Chen CH, Ulmschneider JP, Ulmschneider MB (2020) Mechanisms of a small membrane-active antimicrobial peptide from Hyla punctata. Aust J Chem 73(3):236–245
48. Chen CH et al (2019) Simulation-guided rational de novo design of a small pore-forming antimicrobial peptide. J Am Chem Soc 141(12):4839–4848
49. Ulmschneider JP, Ulmschneider MB, Di Nola A (2006) Monte Carlo vs molecular dynamics for all-atom polypeptide folding simulations. J Phys Chem B 110(33):16733–16742
50. Ulmschneider JP, Jorgensen WL (2004) Polypeptide folding using Monte Carlo sampling, concerted rotation, and continuum solvation. J Am Chem Soc 126(6):1849–1857
51. Huang J, MacKerell AD (2013) CHARMM36 all-atom additive protein force field: validation based on comparison to NMR data. J Comput Chem 34(25):2135–2145
52. Jorgensen WL, Chandrasekhar J, Madura JD (1983) Comparison of simple potential functions for simulating liquid water. J Chem Phys 79:926–935
53. Essmann U, Perera L, Berkowitz ML (1995) A smooth particle mesh Ewald method. J Chem Phys 103(19):8577–8593
54. Huang K, García AE (2014) Effects of truncating van der Waals interactions in lipid bilayer simulations. J Chem Phys 141(10):105101
55. Hess B, Bekker H, Berendsen HJC, Fraaije JGEM (1997) LINCS: a linear constraint solver for molecular simulations. J Comput Chem 18(12):1463–1472
56. Mor A, Ziv G, Levy Y (2008) Simulations of proteins with inhomogeneous degrees of freedom: the effect of thermostats. J Comput Chem 29(12):1992–1998
57. Lam KS et al (1991) A new type of synthetic peptide library for identifying ligand-binding activity. Nature 354(6348):82–84
58. Edman P (1949) A method for the determination of amino acid sequence in peptides. Arch Biochem 22(3):475
59. Whitmore L, Wallace BA (2008) Protein secondary structure analyses from circular dichroism spectroscopy: methods and reference databases. Biopolymers 89(5):392–400
60. Greenfield NJ (2006) Using circular dichroism spectra to estimate protein secondary structure. Nat Protoc 1(6):2876–2890
61. Hope MJ et al (1985) Production of large unilamellar vesicles by a rapid extrusion procedure: characterization of size distribution, trapped volume and ability to maintain a membrane potential. Biochim Biophys Acta 812(1):55–65
62. Whitmore L, Wallace BA (2004) DICHROWEB, an online server for protein secondary structure analyses from circular dichroism spectroscopic data. Nucleic Acids Res 32(Web Server issue):W668–W673
63. Lobley A, Whitmore L, Wallace BA (2002) DICHROWEB: an interactive website for the analysis of protein secondary structure from circular dichroism spectra. Bioinformatics 18(1):211–212
64. Akbar SM, Sreeramulu K, Sharma HC (2016) Tryptophan fluorescence quenching as a binding assay to monitor protein conformation changes in the membrane of intact mitochondria. J Bioenerg Biomembr 48(3):241–247
65. White SH et al (1998) Protein folding in membranes: determining energetics of peptide-bilayer interactions. Methods Enzymol 295:62–87
66. Ladokhin AS, Jayasinghe S, White SH (2000) How to measure and analyze tryptophan fluorescence in membranes properly, and why bother? Anal Biochem 285(2):235–245
67. Rodnin MV et al (2020) Experimental and computational characterization of oxidized and reduced protegrin pores in lipid bilayers. J Membr Biol 253(3):287–298
68. Lin J et al (2008) Impedance spectroscopy of bilayer membranes on single crystal silicon. Biointerphases 3(2):FA33
69. Stewart JC (1980) Colorimetric determination of phospholipids with ammonium ferrothiocyanate. Anal Biochem 104(1):10–14
70. Breukink E et al (2000) Binding of Nisin Z to bilayer vesicles as determined with isothermal titration calorimetry. Biochemistry 39(33):10247–10254
71. Abraham T et al (2005) Isothermal titration calorimetry studies of the binding of a rationally designed analogue of the antimicrobial peptide gramicidin s to phospholipid bilayer membranes. Biochemistry 44(6):2103–2112
72. Chen CH et al (2014) Absorption and folding of melittin onto lipid bilayer membranes via unbiased atomic detail microsecond molecular dynamics simulation. Biochim Biophys Acta 1838(9):2243–2249
73. Wang Q et al (2018) Proline-rich chaperones are compared computationally and experimentally for their abilities to facilitate recombinant butyrylcholinesterase tetramerization in CHO cells. Biotechnol J 13(3):e1700479
74. Ulmschneider MB et al (2015) Peptide folding in translocon-like pores. J Membr Biol 248(3):407–417
75. Ulmschneider MB et al (2010) Mechanism and kinetics of peptide partitioning into membranes from all-atom simulations of thermostable peptides. J Am Chem Soc 132(10):3452–3460
76. Ulmschneider MB et al (2014) Spontaneous transmembrane helix insertion thermodynamically mimics translocon-guided insertion. Nat Commun 5:4863
77. Smart OS, Goodfellow JM, Wallace BA (1993) The pore dimensions of gramicidin A. Biophys J 65(6):2455–2460
78. Wiedman G et al (2013) The electrical response of bilayers to the bee venom toxin melittin: evidence for transient bilayer permeabilization. Biochim Biophys Acta 1828(5):1357–1364
79. Upadhyay SK et al (2015) Insights from microsecond atomistic simulations of melittin in thin lipid bilayers. J Membr Biol 248(3):497–503
80. Pino-Angeles A, Lazaridis T (2018) Effects of peptide charge, orientation, and concentration on melittin transmembrane pores. Biophys J 114(12):2865–2874
81. Mihailescu M et al (2019) Structure and function in antimicrobial piscidins: histidine position, directionality of membrane insertion, and pH-dependent permeabilization. J Am Chem Soc 141(25):9837–9853
82. Westerfield J et al (2019) Ions modulate key interactions between pHLIP and lipid membranes. Biophys J 117(5):920–929
83. Robustelli P, Piana S, Shaw DE (2018) Developing a molecular dynamics force field for both folded and disordered protein states. Proc Natl Acad Sci U S A 115(21):E4758–E4766
84. Poger D, Caron B, Mark AE (2016) Validating lipid force fields against experimental data: progress, challenges and perspectives. Biochim Biophys Acta 1858(7, Part B):1556–1565
85. van Gunsteren WF et al (2018) Validation of molecular simulation: an overview of issues. Angew Chem Int Ed Engl 57(4):884–902
86. Corradi V et al (2019) Emerging diversity in lipid-protein interactions. Chem Rev 119(9):5775–5848
87. Marrink SJ et al (2019) Computational modeling of realistic cell membranes. Chem Rev 119(9):6184–6226
88. Häse F et al (2019) How machine learning can assist the interpretation of. Chem Sci 10(8):2298–2307
89. Doerr S et al (2016) HTMD: high-throughput molecular dynamics for molecular discovery. J Chem Theory Comput 12(4):1845–1852
90. Salmaso V, Moro S (2018) Bridging molecular docking to molecular dynamics in exploring ligand-protein recognition process: an overview. Front Pharmacol 9:923

Chapter 7

Coarse-Grain Simulations of Membrane-Adsorbed Helical Peptides

Manuel N. Melo

Abstract

The amphipathic α-helix is a common motif for peptide adsorption to membranes. Many physiologically relevant events involving membrane-adsorbed peptides occur over time and size scales readily accessible to coarse-grain molecular dynamics simulations.
This methodological suitability, however, comes with a number of pitfalls. Here, I exemplify a multi-step adsorption equilibration procedure on the antimicrobial peptide Magainin 2. It involves careful control of peptide freedom to promote optimal membrane adsorption before other interactions are allowed. This shortens preparation times prior to production simulations while avoiding divergence into unrealistic or artifactual configurations.

Key words Peptide, Alpha-helix, Amphipathicity, Molecular dynamics, Coarse grain, Membrane adsorption, Equilibration

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_7, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

Amphipathicity in proteins has long been recognized as a driving feature for adsorption to lipid membranes; one that takes advantage of the membrane’s own amphipathic environment at the interface between the lipids’ aliphatic tails and their polar headgroups [1]. In this context, amphipathic α-helices are a common adsorption motif [2], in which amino acid residues are organized around a helix in a way that segregates apolar side chains from polar/charged ones, usually along the helical diameter. In proteins, the structural role of amphipathicity in α-helices is not restricted to membrane adsorption: Schiffer and Edmundson first proposed the amphipathicity-highlighting representation now commonly known as “Edmundson wheel” (Fig. 1) for the visualization of soluble protein features [4]. Besides wheel representations, metrics such as the hydrophobic moment [5] or hydrophobic angle [6] can be used in the quantification and visualization of amphipathicity. In bioactive peptides, however, amphipathicity is a hallmark of membrane interaction. Prominent examples can be found, among others, in antimicrobial peptides [7], in cell-penetrating peptides [8, 9] and in membrane-remodeling peptides [10].

Fig. 1 (a): Edmundson wheel representation of the sequence and α-helical structure of Magainin 2, highlighting the polar/apolar segregation of residues along its surface (drawn with the aid of the NetWheels tool [3]); residues are color-coded red for charged anionic, blue for charged cationic, green for polar uncharged, and white for apolar. The same color code is used throughout the structural representations in this chapter. (b): The two types of position restraints employed in this chapter for peptide equilibration; full line: harmonic position restraint (from Eq. 1); dashed line: flat-bottom harmonic position restraint (from Eq. 2). (c): The Martini α-helical CG structure for Magainin 2, where for simplicity only backbone particles are shown; the black arrow represents the hydrophobic moment. Also in (c) are represented the degrees of freedom affected by the restraints in (b), with red arrows indicating restrained translations and green arrows indicating free translations/rotations. (d): Initial system setup for single Magainin 2 peptides on a POPC membrane (at a global 1:312 peptide-to-lipid ratio). (e): Initial system setup for multiple 28 Magainin 2 peptides on a POPC membrane (at a global 1:48 peptide-to-lipid ratio)

Over the past two decades, the membrane activity of bioactive peptides gradually came into the scope of molecular dynamics (MD) simulations. This was boosted both by the continued increase in computational power and by the development of coarse-grain (CG) MD models. Biomolecular CG models, such as the popular Martini framework [11, 12], simplify the structural representation of molecules, exchanging fine detail for large simulation speedup.
This allows the extension of simulation times into the microsecond to millisecond range, and system sizes up to hundreds of nanometers [13]. CG MD is also useful as an accelerated step prior to conversion to full detail for subsequent simulation at atomistic resolution [14].

1.1 Simulation Pitfalls

This chapter focuses on the setup of CG MD systems of membrane-adsorbed peptides. The process of membrane adsorption by amphipathic peptides can, in principle, be followed by MD (see the caveat for CG in point 1 below). However, phenomena of interest usually occur after peptides are adsorbed. To obviate this potentially time-consuming step, I describe methods to directly prepare systems with already-adsorbed peptides. These take into account a number of considerations:

1. Amphipathic helical peptides often fold as a response to an amphipathic environment and are unstructured otherwise [1]. This is one of the reasons why simulating the adsorption process from the aqueous phase can be time consuming. Another limitation is that CG models do not always have the ability to reproduce folding dynamics: Martini, for instance, restrains protein structure to a given input throughout the simulation [15, 16], and a Martini helix will always be a helix. A corollary is that any peptide configuration simulated as an amphipathic α-helix away from the membrane’s hydrophilic/hydrophobic interface is likely to be unrealistic.
2. Besides the adsorption distance in the previous point, another aspect for stability is helix orientation with respect to the membrane. Peptides will likely orient so that their hydrophobic residues face the membrane core [2]. Simulation of configurations that are not stably equilibrated in this respect is, again, unrealistic.
3. Simulated systems may have multiple adsorbed peptides, or other membrane components such as transmembrane proteins.
Peptide interactions with one another or with additional components must only be allowed after proper equilibration according to points 1 and 2.

Ignoring the above points can have a range of consequences. When simulating isolated peptides in membranes, a poorly equilibrated adsorption can result in peptide diffusion back to the aqueous phase (see the illustrative example with the Magainin 2 antimicrobial peptide in Fig. 2) or in adsorption at non-representative depths/orientations. While the former entails a waste of computational resources, the latter, if undetected, will yield erroneous measurements of membrane interaction parameters.

The most serious consequences of inadequate equilibration, however, occur when simulated systems contain multiple adsorbed peptides (incidentally, a condition of many of the peptides’ proposed mechanisms [17]). Premature peptide–peptide contacts, before hydrophilic/hydrophobic interactions with the membrane are satisfied, will lead to artificial oligomerization propensities. Namely, exposed peptidic hydrophobic patches will tend to bind and form aggregates that will then have reduced drive for deeper membrane interaction. When large numbers of mis-equilibrated peptides are allowed to interact in a simulation, disordered peptide aggregates may accumulate atop of the membrane and actively perturb it in an unrealistic fashion (compare the interaction features of high densities of Magainin 2 in Fig. 3e–h and in refs. [18, 20]). In reality, by contrast, peptides that are water-soluble are less likely to have structures with spatially segregated hydrophobic residues before complete adsorption and are therefore also unlikely to aggregate with one another at that stage.

Fig. 2 (a) and (b): z-distance between the peptides’ center-of-geometry and the membrane’s top leaflet PO4 layer, for 12 independent single-peptide systems; in (a), the 12 systems were simulated unrestrained, and traces for peptides that ultimately leave the membrane are highlighted in red; in (b), a flat-bottom position restraint in z was applied to the peptide backbone particles, with onset (rfb) at 3.0 nm from the membrane’s center (roughly 1 nm above the PO4 layer) and force constant 500 kJ mol⁻¹ nm⁻² (the average position of the onset is represented in (b) by the dotted line). (c): Helix orientation relative to the z-axis for the systems simulated with z-restraints, expressed as the dihedral angle between the plane containing the hydrophobic moment and the helix axis (see Fig. 1) and the plane containing the helix axis and the z-axis; adsorbed peptides converge on orientations where their hydrophobic moment points roughly away from the +z direction and into the membrane core. (d): Final structure of one of the simulations without restraints where the peptide left the membrane. (e): Representative final structure of the peptides simulated with flat-bottom restraining potentials preventing their desorption back into the aqueous phase; the restraining potential shape relative to the membrane is illustrated by the dashed line. Overall, the use of restraints in z promotes a quick and consistent membrane adsorption, even if the restraining potential, with onset outside of the membrane, does not actively affect the adsorbed state.

Fig. 3 Adsorption equilibration for simulations under different restraints. (a) and (b): z-distance to the PO4 layer and hydrophobic moment angle with the z-axis, as in Fig. 2a–c, for 28 Magainin 2 peptides simulated simultaneously, with only flat-bottom restraints in z; (c) and (d) are the same measurements for an analogous system, but with the first and last backbone particles of each peptide also pinned in the xy-plane by harmonic potentials of force constant 500 kJ mol⁻¹ nm⁻². (e): Snapshot after 30 ns of a 28-peptide system simulated without any restraints, where peptides quickly and stably aggregate with one another, away from the membrane interface. (f) and (g) are the final snapshots corresponding to the simulations in panels (a)/(b) and (c)/(d), respectively. In (g), yellow arrows indicate the xy-pinned backbone particles for one of the peptides; this xy restraining effectively prevents peptide lateral diffusion, yet panel (d) shows that peptides retain their ability to orient relative to the membrane. In (f), without lateral restraints, part of the peptides is able to correctly associate into dimers [18, 19] but proper membrane adsorption is delayed (compare (a) and (b) with (c) and (d)) and at least 5 peptides form an artifactual aggregate protruding into the aqueous phase (yellow arrow). Finally, (h) shows that after the restraints in (g) are lifted, the system quickly (600 ns) progresses towards a realistic distribution of dimers and monomers, all membrane-adsorbed

1.2 Equilibration Strategy

To properly equilibrate membrane-adsorbed peptides according to the above requirements, specific restraints to their freedom must be imposed while they converge to stable adsorption depths and orientations [18, 21].
This is akin to the usual practice of restraining atomistic protein backbone motion when initially equilibrating a system after solvation: as much as possible, introduced instability should be resolved by the faster degrees of freedom, rather than being allowed to drive the system into states from which convergence back to representative configurations may be too slow.

In this chapter I use the adsorption of the antimicrobial peptide Magainin 2 onto a palmitoyl-oleoyl-phosphatidylcholine (POPC) bilayer (Fig. 1) to exemplify in practice how restraints along the z axis can be used to keep peptides in the membrane vicinity, promoting the equilibration of adsorption depth (Fig. 2). For systems with multiple peptides I employ further restraints, pinning peptide termini in the xy plane, which lets adsorption depth and orientation equilibrate while preventing lateral diffusion and untimely peptide–peptide contacts (Fig. 3). See Fig. 1b and c for a depiction of the employed restraints and their effect on the peptides’ degrees of freedom.

2 Software and Models

2.1 Forcefield

The protocols in this chapter were tested with the Martini 2.2 forcefield, but instructions should hold unaltered for most Martini protein implementations [15, 16]. See Note 1 for applicability to other forcefields.

2.2 Simulation Package

System preparation and simulation is exemplified with the GROMACS 2020.5 [22] simulation package, but the procedure is compatible with GROMACS versions 5.0 or higher (when flat-bottom position restraints were introduced). See Note 2 for other compatibility considerations.

2.3 System Construction

CG structures and topologies are constructed using the martinize2 tool [23] from α-helical atomistic structures (in the examples in Figs. 1, 2 and 3, the starting Magainin 2 atomistic structure was first constructed as an ideal helix using Avogadro v1.2.0 [24]). Previous versions of martinize2 or of martinize [25] can also be used.
Membranes are constructed with the insane.py script [26] but any other source of flat, equilibrated Martini membranes is acceptable (such as those generated by the CHARMM-GUI tool [27]). Peptide–membrane juxtaposition is done using the MDAnalysis v1.0.0 Python package [28] together with tools from the GROMACS suite.

2.4 Restraining Potentials

Two types of restraining potentials are needed. The first is a simple harmonic potential V that restrains particle position r along a given dimension to a reference position r0, with force constant k, according to Eq. 1:

$$V = k\,(r - r_0)^2 \qquad (1)$$

The second restraining potential is a piecewise extension of the harmonic potential, in that the potential only starts increasing at a distance rfb from r0. The potential is flat at zero between r0 − rfb and r0 + rfb, hence the name “flat-bottom potential”:

$$V = \begin{cases} k\,(r - r_0 - r_{\mathrm{fb}})^2, & r > r_0 + r_{\mathrm{fb}} \\ k\,(r - r_0 + r_{\mathrm{fb}})^2, & r < r_0 - r_{\mathrm{fb}} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

The shape of the two potentials can be compared in Fig. 1b. Either potential can be independently applied to each of the x, y, and z dimensions. For a membrane with normal aligned with z, the procedures in this chapter use flat-bottom restraining potentials along z and harmonic restraining potentials on x and y.

2.5 Equilibration Monitoring

Evolution of equilibration in Figs. 2 and 3 was monitored visually, using VMD v1.9.3 [29], and quantitatively, using custom tools written in Python using the MDAnalysis, NumPy v1.19 [30], and Matplotlib v3.3.3 [31] packages. Two metrics were followed:

• Each peptide’s center-of-geometry position in z relative to the top leaflet’s PO4 layer (assuming peptides are being adsorbed onto the top leaflet).
• The alignment with the z-axis of a reference vector for each peptide helix. Alignment can be the simple angle of the reference vector with +z or, as in Figs. 2 and 3, it can be the dihedral torsional angle around the helical axis between the reference vector and +z. Figures 1, 2 and 3 depict/employ as reference vector the hydrophobic moment (as implemented in the 3D-HM tool [32]); this highlights hydrophobic orientation towards the membrane core during equilibration, but see Note 4 for simpler metrics.

3 Methods

These steps assume that typical Martini run parameters [33, 34] for energy minimization, pressure and temperature equilibration, and production are used, but these can be adapted if other forcefields are employed. Pressure coupling should be done semi-isotropically (in xy and z separately) unless the specific application demands otherwise. When employing pressure coupling together with position restraints GROMACS requests that you decide how to scale the restraint reference points (r0) with pressure scaling. For the restraints used here it is advisable to set the refcoord_scaling=com run parameter.

These instructions also assume that the peptides will be added to only one of the membrane leaflets. Nonetheless, the steps are readily extensible to equilibrating adsorption on both leaflets simultaneously by simply adding two layers of peptides. The involved restraints do not require any adjustment in that case and only the equilibration monitoring must be adapted to reverse distance/angle signs for part of the peptides.

3.1 Common Preparation

1. Create a membrane of suitable size and composition using insane.py. Energy-minimize it, equilibrate pressure and temperature, and then equilibrate lipid mixing, if needed, for a suitable amount of time (which will depend on membrane size and composition).
2. Obtain the CG topology and structure for your helical peptide using martinize2.
3. Ensure that the peptide lies with its helical axis parallel to the membrane surface. The gmx editconf GROMACS command can do this using the -princ flag and subsequently the -rotate flag, if needed.
Alternatively, MDAnalysis can be used to orient the molecule programmatically in Python.

3.2 Single-Peptide Adsorption

1. Modify the peptide topology to add a flat-bottom position restraint to all backbone particles. Make this restraint operate along the z-axis to confine the particles to a horizontal slab with a force constant of 500 kJ mol⁻¹ nm⁻². The flat-bottom distance rfb should be 3.0 nm; the reference point will later be set to the membrane center, so this potential will leave a clearance of about 1 nm on either side of the membrane (assuming a typical membrane thickness of about 2 nm per leaflet; adjust if working with membranes of significantly different thickness). For a GROMACS topology, the position restraint directive will look like this:

    [ position_restraints ]
    ; atom  funct  geometry  r_fb (nm)  k (kJ mol-1 nm-2)
    1       2      5         3.0        500
    ...

2. Optionally, modify the topology to enclose the position_restraints block in a GROMACS preprocessing #ifdef MACRO_NAME/#endif directive (the actual name for MACRO_NAME can be chosen by the user). This enables easy restraint control in run parameter (.mdp) files using the define keyword.
3. Add the peptide’s topology to the membrane system description. Juxtapose the peptide’s structure coordinates with those of the membrane system, making sure that the peptide backbone is placed close to, but above the phosphate layer (or below, if adding peptides to the bottom leaflet); any needed vertical displacement can be done prior to juxtaposition using gmx editconf, MDAnalysis, or even interactively, using the structure modification capabilities of VMD. The juxtaposition itself can be done using the MDAnalysis.Merge functionality, or by concatenating structure files by hand (with due care to keep file format integrity).
4. Generate a reference structure for GROMACS by setting r0 for each position restraint.
This is done by creating a copy of the juxtaposed structure file in which every backbone particle is placed at the z-level of the membrane center. This is most easily accomplished with MDAnalysis:

    import MDAnalysis as mda

    u = mda.Universe('juxtaposed.gro')
    # adjust for the appropriate lipid selection
    membrane = u.select_atoms('resname POPC')
    # adjust backbone name if not using Martini
    bb = u.select_atoms('name BB')
    # z-coordinate of the membrane center of geometry
    membrane_zcog = membrane.center_of_geometry()[2]
    # move every backbone bead to that z-level; x and y are left unchanged
    pos = bb.positions
    pos[:, 2] = membrane_zcog
    bb.positions = pos
    u.atoms.write('reference.gro')

5. You can now energy-minimize and equilibrate the system. The Martini CG forcefield is usually robust to the blunt coordinate juxtaposition strategy used here (see Note 3). Flat-bottom restraints should be active until adsorption and orientation converge, and then switched off for production runs. This equilibration may be carried out simultaneously with pressure/temperature equilibration.
6. Equilibration in the previous step can be monitored as in Figs. 2 and 3, by measuring peptide–PO4 distances and helix orientations. Distances can be measured using the gmx traj GROMACS command but the gmx helixorient command, unfortunately, cannot process CG structures. MDAnalysis can be used to measure both distance and helix orientation (see Note 4).

3.3 Multiple Peptide Adsorption

1. Modify the peptide topology to add a flat-bottom position restraint to all backbone particles, as in Subheading 3.2, step 1. Add a second set of position restraints on the first and last backbone beads of the peptide (see Note 5 for other possibilities when peptide density is low). These restraints should be of the harmonic type, and act only in the x and y dimensions, with the same force constant as their flat-bottom counterparts:

    [ position_restraints ]
    1 1 500 500 0
    51 1 500 500 0

2. Optionally, you can split the restraints in the topology into separate [ position_restraints ] blocks, each under its own GROMACS preprocessing #ifdef/#endif directive. This enables independent restraint control.
3. Multiply the peptide structure in x and y, using the gmx genconf GROMACS tool to achieve the desired number of peptides. To control inter-peptide spacing use either the -dist flag or adjust the empty space around the template peptide structure using gmx editconf with the -d flag. See Note 5 on how to reduce bias in peptide distribution at this step.
4. Add the peptide’s topology to the membrane system description. Juxtapose the peptides’ structure coordinates with those of the membrane system, following the same considerations as in Subheading 3.2, step 3. See Note 6 if using large membranes where buckling prevents proper peptide placement or if the membrane buckling amplitude is larger than the z-restraints’ flat-bottom region.
5. Generate a reference structure for GROMACS as in Subheading 3.2, step 4. The code snippet from that step is still valid in this context. If generating the reference by other means, note that whereas only the z-coordinate of the beads in the reference file mattered in Subheading 3.2, step 4, the x and y coordinates are no longer arbitrary when restraining in those dimensions and must be kept unchanged (so that peptide termini are pinned to their initial xy position).
6. As in Subheading 3.2, step 5, you can now energy-minimize and equilibrate the system, with restraints active until adsorption and orientation converge (monitor convergence as in Subheading 3.2, step 6). Afterwards, include an additional equilibration period in your unrestrained production run to allow for unbiased peptide redistribution. Depending on the system, this may take from hundreds of nanoseconds to many microseconds.

4 Notes

1. The procedure in this chapter is, in principle, generic and applicable to different forcefields, including atomistic ones.
However, for models that allow folding/unfolding [35] or that have breakable/soft elastic networks [36], care must be taken that peptides remain in their desired adsorption structures; otherwise, peptides may misfold and become trapped in less representative membrane-interacting configurations.
2. While harmonic position restraints are ubiquitous across simulation software, the flat-bottom restraints used here are much less common. When unavailable, soft harmonic position restraints in z, centered on the expected peptide adsorption depth, can be used as a substitute. These can also be used when membranes have a non-flat geometry for which no readily usable flat-bottom potential exists. This alternative has the disadvantage of imposing a nonzero restraint force even when peptides are close to their adsorption equilibrium depths.
3. Energy minimization after direct juxtaposition of peptide coordinates may not converge if particles overlap. While energy minimization with Martini is typically robust to close contacts, large systems or juxtapositions too deep into the membrane may be too unstable. Possible solutions are (i) shallower juxtapositions, (ii) removal of solvent molecules in the immediate vicinity of peptides, or (iii) minimization using soft-core Lennard-Jones potentials (which do not have a singularity at zero distance; GROMACS allows the use of such potentials when free-energy mode is activated in the run parameters). If this procedure is extended to atomistic systems, solution (ii) is likely the only viable route to a stable system construction.
4. The use of the hydrophobic moment as the reference vector for measuring helix orientation is needlessly complex (it involves defining it for the initial atomistic structure and then expressing it as a function of CG particle positions); it was only used here so that the orientation of hydrophobic residues towards the membrane core could also be visualized in Figs.
2c, 3b, and 3d. Any vector with a significant component orthogonal to the helix axis can be used to gauge orientation (for instance, an i → i + 1 backbone–backbone vector). Likewise, measuring the simple angle with +z (rather than the dihedral torsion angle) is also sufficient to monitor convergence. These two simplifications can be easily implemented using MDAnalysis with the following snippet (exemplified for the single-peptide case):

    import MDAnalysis as mda
    import numpy as np

    # adjust for the appropriate topology/trajectory files
    u = mda.Universe('topology.tpr', 'trajectory.xtc')
    # adjust backbone name if not using Martini
    bbs = u.select_atoms('name BB')
    # index of a backbone bead near the helix center
    mid_bb = len(bbs)//2
    angles = []
    for frame in u.trajectory:
        # i -> i + 1 backbone-backbone vector at the helix center
        vec = bbs.positions[mid_bb+1] - bbs.positions[mid_bb]
        norm = np.linalg.norm(vec)
        # angle (in radians) between that vector and +z
        angles.append(np.arccos(vec[2]/norm))

5. It is good practice, when multiplying a structure, to assign random rotations in xy to each copy so as to minimize initial structure bias. gmx genconf can do this using the -rot and -maxrot flags, but note that at high peptide densities there may be no other option than to set peptides parallel to one another. If space does allow for random rotation, then xy-restraining should be applied not at the termini, but on a single residue at the helix center, to allow rotation in place around the z-axis also during equilibration.
6. If the target membrane is large enough to spontaneously buckle with significant amplitude, a solution is to employ z-axis flat-bottom position restraints also on the lipids, thus damping the buckling. Such restraints can be applied to the lipids’ glycerol moieties, restraining them to be within 2.0 nm of the membrane center. This should only be done during adsorption equilibration, to allow proper action of the peptide restraints.

References

1. Sankaram MB, Marsh D (1993) Protein-lipid interactions with peripheral membrane proteins.
In: Watts A (ed) Protein-lipid interactions, new comprehensive biochemistry, vol 25. Elsevier, chap 6, pp 127–162. https://doi.org/10.1016/S0167-7306(08)60235-5
2. Hristova K, Wimley WC, Mishra VK, Anantharamiah GM, Segrest JP, White SH (1999) An amphipathic α-helix at a membrane interface: a structural study using a novel X-ray diffraction method. J Molecular Biol 290(1):99–117. https://doi.org/10.1006/jmbi.1999.2840
3. Mól AR, Castro MS, Fontes W (2018) NetWheels: a web application to create high quality peptide helical wheel and net projections. bioRxiv. https://doi.org/10.1101/416347
4. Schiffer M, Edmundson AB (1967) Use of helical wheels to represent the structures of proteins and to identify segments with helical potential. Biophys J 7(2):121–135. https://doi.org/10.1016/S0006-3495(67)86579-2
5. Eisenberg D, Weiss RM, Terwilliger TC (1982) The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature 299(5881):371–374. https://doi.org/10.1038/299371a0
6. Wieprecht T, Dathe M, Epand RM, Beyermann M, Krause E, Maloy WL, MacDonald DL, Bienert M (1997) Influence of the angle subtended by the positively charged helix face on the membrane activity of amphipathic, antibacterial peptides. Biochemistry 36(42):12869–12880. https://doi.org/10.1021/bi971398n
7. Tossi A, Sandri L, Giangaspero A (2000) Amphipathic, alpha-helical antimicrobial peptides. Biopolymers 55(1):4–30. https://doi.org/10.1002/1097-0282(2000)55:1<4::AID-BIP30>3.0.CO;2-M
8. Zaro JL, Shen WC (2015) Cationic and amphipathic cell-penetrating peptides (CPPs): Their structures and in vivo studies in drug delivery. Front Chem Sci Eng 9(4):407–427. https://doi.org/10.1007/s11705-015-1538-y
9. Henriques ST, Melo MN, Castanho MARB (2006) Cell-penetrating peptides and antimicrobial peptides: how different are they? Biochem J 399(1):1–7. https://doi.org/10.1042/BJ20061100
10. Drin G, Antonny B (2010) Amphipathic helices and membrane curvature.
FEBS Lett 584(9):1840–1847. https://doi.org/10.1016/j.febslet.2009.10.022
11. Marrink SJ, Risselada HJ, Yefimov S, Tieleman DP, de Vries AH (2007) The MARTINI force field: coarse grained model for biomolecular simulations. J Phys Chem B 111(27):7812–7824. https://doi.org/10.1021/jp071097f
12. Bruininks BMH, Souza PCT, Marrink SJ (2019) A practical view of the martini force field, Springer, New York, pp 105–127. https://doi.org/10.1007/978-1-4939-9608-7_5
13. Pezeshkian W, König M, Wassenaar TA, Marrink SJ (2020) Backmapping triangulated surfaces to coarse-grained membrane models. Nature Commun 11(1). https://doi.org/10.1038/s41467-020-16094-y
14. Rzepiela AJ, Sengupta D, Goga N, Marrink SJ (2009) Membrane poration by antimicrobial peptides combining atomistic and coarse-grained descriptions. Faraday Discussions 144:431–443. https://doi.org/10.1039/b901615e
15. Monticelli L, Kandasamy SK, Periole X, Larson RG, Tieleman DP, Marrink SJ (2008) The MARTINI coarse-grained force field: extension to proteins. J Chem Theory Comput 4(5):819–834. https://doi.org/10.1021/ct700324x
16. Periole X, Cavalli M, Marrink SJ, Ceruso MA (2009) Combining an elastic network with a coarse-grained molecular force field: structure, dynamics, and intermolecular recognition. J Chem Theory Comput 5(9):2531–2543. https://doi.org/10.1021/ct9002114
17. Melo MN, Ferre R, Castanho MARB (2009) Antimicrobial peptides: linking partition, activity and high membrane-bound concentrations. Nat Rev Microbiol 7(3):245–50. https://doi.org/10.1038/nrmicro2095
18. Su J, Marrink SJ, Melo MN (2020) Localization preference of antimicrobial peptides on liquid-disordered membrane domains. Front Cell Develop Biol 8. https://doi.org/10.3389/fcell.2020.00350
19. Mukai Y, Matsushita Y, Niidome T, Hatekeyama T, Aoyagi H (2002) Parallel and antiparallel dimers of magainin 2: their interaction with phospholipid membrane and antibacterial activity.
J Peptide Sci 8(10):570–577. https://doi.org/10.1002/psc.416
20. Woo HJ, Wallqvist A (2011) Spontaneous buckling of lipid bilayer and vesicle budding induced by antimicrobial peptide magainin 2: a coarse-grained simulation study. J Phys Chem B 115(25):8122–8129. https://doi.org/10.1021/jp2023023
21. Su J, Thomas AS, Grabietz T, Landgraf C, Volkmer R, Marrink SJ, Williams C, Melo MN (2018) The N-terminal amphipathic helix of Pex11p self-interacts to induce membrane remodelling during peroxisome fission. Biochimica et Biophysica Acta (BBA) - Biomembranes 1860(6):1292–1300. https://doi.org/10.1016/j.bbamem.2018.02.029
22. Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E (2015) GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2:19–25. https://doi.org/10.1016/j.softx.2015.06.001
23. Kroon PC (2021) Martinize2 and Vermouth. https://github.com/marrink-lab/vermouth-martinize. Accessed 28 Jan 2021
24. Hanwell MD, Curtis DE, Lonie DC, Vandermeersch T, Zurek E, Hutchison GR (2012) Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. J Cheminform 4(1):17. https://doi.org/10.1186/1758-2946-4-17
25. Martinize (2017) http://cgmartini.nl/index.php/tools2/proteins-and-bilayers/204-martinize. Accessed 28 Jan 2021
26. Wassenaar TA, Ingólfsson HI, Böckmann RA, Tieleman DP, Marrink SJ (2015) Computational lipidomics with insane: a versatile tool for generating custom membranes for molecular simulations. J Chem Theory Comput 11(5):2144–2155. https://doi.org/10.1021/acs.jctc.5b00209
27. Lee J, Hitzenberger M, Rieger M, Kern NR, Zacharias M, Im W (2020) CHARMM-GUI supports the Amber force fields. J Chem Phys 153(3). https://doi.org/10.1063/5.0012280
28.
Gowers RJ, Linke M, Barnoud J, Reddy TJE, Melo MN, Seyler SL, Domański J, Dotson DL, Buchoux S, Kenney IM, Beckstein O (2016) MDAnalysis: a Python package for the rapid analysis of molecular dynamics simulations. In: Benthall S, Rostrup S (eds) Proceedings of the 15th Python in science conference, SciPy, pp 98–105. https://doi.org/10.25080/Majora-629e541a-00e
29. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Molecular Graph 14(1):33–38. https://doi.org/10.1016/0263-7855(96)00018-5
30. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2. arXiv:2006.10256
31. Hunter JD (2007) Matplotlib: A 2D graphics environment. Comput Sci Eng 9(3):90–95. https://doi.org/10.1109/MCSE.2007.55
32. Reißer S, Strandberg E, Steinbrecher T, Ulrich AS (2014) 3D hydrophobic moment vectors as a tool to characterize the surface polarity of amphiphilic peptides. Biophys J 106(11):2385–2394. https://doi.org/10.1016/j.bpj.2014.04.020
33. De Jong DH, Baoukina S, Ingólfsson HI, Marrink SJ (2016) Martini straight: boosting performance using a shorter cutoff and GPUs. Comput Phys Commun 199:1–7. https://doi.org/10.1016/j.cpc.2015.09.014
34. Martini run input parameters (2017). http://cgmartini.nl/index.php/force-field-parameters/input-parameters. Accessed 28 Jan 2021
35. Darré L, Machado MR, Brandner AF, González HC, Ferreira S, Pantano S (2015) SIRAH: a structurally unbiased coarse-grained force field for proteins with aqueous solvation and long-range electrostatics. J Chem Theory Comput 11(2):723–739. https://doi.org/10.1021/ct5007746
36.
Poma AB, Cieplak M, Theodorakis PE (2017) Combining the MARTINI and Structure-Based Coarse-Grained Approaches for the Molecular Dynamics Studies of Conformational Transitions in Proteins. J Chem Theory Comput 13(3):1366–1374. https://doi.org/10.1021/acs.jctc.6b00986

Chapter 8

Peptide Dynamics and Metadynamics: Leveraging Enhanced Sampling Molecular Dynamics to Robustly Model Long-Timescale Transitions

Joseph Clayton, Lokesh Baweja, and Jeff Wereszczynski

Abstract

Molecular dynamics simulations can in theory reveal the thermodynamics and kinetics of peptide conformational transitions at atomic-level resolution. However, even with modern computing power, they are limited in the timescales they can sample, which is especially problematic for peptides that are fully or partially disordered. Here, we discuss how the enhanced sampling methods accelerated molecular dynamics (aMD) and metadynamics can be leveraged in a complementary fashion to quickly explore conformational space and then robustly quantify the underlying free energy landscape. We apply these methods to two peptides that have an intrinsically disordered nature, the histone H3 and H4 N-terminal tails, and use metadynamics to compute the free energy landscape along collective variables discerned from aMD simulations. Results show that these peptides are largely disordered, with a slight preference for α-helical structures.

Key words Peptide dynamics, Accelerated molecular dynamics, Collective variables, Metadynamics

1 Introduction

Molecular dynamics (MD) simulations have become an invaluable tool in the study of biomolecular structure, function, and dynamics [1, 2]. Through the development of carefully optimized force fields [3, 4], they model biologically relevant motions by integrating Newton’s equations of motion in complex heterogeneous systems.
This can lead to powerful insights into the atomic-level descriptions of biologically relevant systems, producing models of biomolecular mechanisms across vast time and length scales, providing novel insights into experimental results, and aiding the design of new experiments.

Although there has been a significant rise in computational power over the past decade, due in no small part to the development of GPU programming [5, 6] and special purpose machines [7, 8], MD simulations are still typically limited to microsecond or shorter timescales, primarily because of the small timestep required to capture motions between bonded atoms. Because of this, long-timescale events and biological equilibrium are often impossible to observe through conventional molecular dynamics (cMD) simulations, especially for highly flexible systems such as peptides. To circumvent this limitation, enhanced sampling methods have been developed to efficiently extend MD to characterize the thermodynamic properties of hard-to-sample systems.

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_8, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

In this chapter, we describe two complementary approaches for using enhanced sampling methods to determine the conformations of peptides in silico, along with their respective solution populations and free energies. First, we describe accelerated molecular dynamics (aMD) simulations. In aMD, the energy landscape of a system is flattened through the addition of a “boost” potential [9]. This lowers energy barriers and promotes the rapid sampling of novel conformational states [10–14]; however, it can be difficult to robustly reweight the results to compute the free energy of states in the physically relevant unaccelerated system.
Second, we describe the method of metadynamics, in which energy is adaptively added to the system along a small number of predefined reaction coordinates [15–17]. In metadynamics, more care must be taken than in aMD to ensure the reaction coordinates are properly chosen, making these simulations more difficult to set up. However, metadynamics allows for the robust calculation of the system’s underlying free energy landscape [18–20].

Given the strengths and weaknesses of both approaches, it is natural to use aMD and metadynamics in a hierarchical protocol. Here, we illustrate how this can be achieved for two intrinsically disordered proteins (IDPs): the N-terminal tails of the H3 and H4 histones [21–23]. aMD simulations are used to quickly sample phase space, gain a qualitative understanding of these systems, and define appropriate reaction coordinates, whereas metadynamics calculations are used to refine these results and quantitatively compute the underlying free energy landscape. Both methods are implemented in a wide array of popular MD packages, and here we have used NAMD v2.14 [6] for all calculations.

2 Finding Peptide Conformations Using Accelerated Molecular Dynamics (aMD)

2.1 Theory

Sampling in cMD simulations is limited in part by the presence of high energy barriers between conformations. To overcome this, aMD aims to increase the rate of sampling rare motions and configurations by altering a system’s potential energy landscape through the addition of a “boost” potential [5, 9, 10, 24]:

\[
V^{*}(r) = V(r) + V_{\mathrm{boost}}(r), \qquad
V_{\mathrm{boost}}(r) =
\begin{cases}
0 & V(r) > E_{\mathrm{thresh}} \\[4pt]
\dfrac{\left(E_{\mathrm{thresh}} - V(r)\right)^{2}}{\alpha + E_{\mathrm{thresh}} - V(r)} & \text{else}
\end{cases}
\tag{1}
\]

Fig. 1 A demonstration of the alterations done by accelerated MD to a given potential, dictated by Eq. 1. The altered potentials (dashed lines) all have the same threshold energy (dotted line), but have different tuning parameters. The threshold energy dictates which portion of the potential is affected, while the tuning parameter controls the alteration; as the tuning parameter approaches zero, the altered region becomes constant and approaches the threshold energy. [Plot: E(x) versus x, with E_thresh marked and curves for α = 0.5, 2.0, and 10.0]

The altered potential reduces the depth of energy wells and flattens the overall landscape, making energy barriers easier to cross and rare conformations accessible on a shorter timescale. The functional form of the boost potential has two parameters: E_thresh, which is the threshold energy below which the boost potential is applied, and α, a tuning parameter that dictates how deep the energy wells are in the accelerated landscape. Figure 1 demonstrates how the aMD potential landscape is altered by the choice of α; smaller values of α produce landscapes with shallower wells and lower barriers, but only for regions below the threshold energy.

Practically, running an aMD simulation requires choosing not only the parameters E_thresh and α but also the degrees of freedom to which aMD should be applied. This is especially important since, although Fig. 1 demonstrates aMD on a simple one-dimensional landscape, even simple biomolecular systems have thousands to millions of degrees of freedom. Although any combination of potential energy terms could be used, in general, there are three typical implementations of aMD:

1. aMD may be applied to the total dihedral energy. Given their importance in driving biomolecular structures, accelerating the
Since the total potential energy is dominated by electrostatic and, in explicit solvent simulations, solvent–solvent interactions, accelerating along the total potential landscape may help speed sampling between states where charge–charge interactions provide stabilizing forces and it may increase the diffusive properties of the system. 3. As a hybrid of these two approaches, in the “dual boost” method there are two aMD potentials applied: one to the dihedral potential energy and the other to the total potential energy. This combines the advantages of both approaches: increased conformational sampling of backbones and sidechains, increased breaking and formation of hydrogen bonds and other electrostatically driven interactions, and reduced solvent viscosity. In general, the dual boost approach is the one we recommend most new users try for their system of interest. In theory, other choices of the aMD potential form may be made, such as applying aMD to only selected dihedrals of importance in a protein’s binding site; however, the act of choosing the appropriate degrees of freedom to accelerate increases the complexity of implementing aMD, which takes away from the algorithm’s simplicity [13, 25] (see Note 1). In addition to choosing the aMD implementation, one must also select a set of aMD parameters. In general, we have found that there are a wide range of potential aMD parameters for any given system, but a good place to start is to set the tuning parameter (s) according to the size of the system and the threshold energy (or energies) according to the average for the potential in a short cMD simulation. 
Here, in a dual boost implementation, we chose the following:

\[
\begin{aligned}
\alpha_{D} &= \tfrac{1}{5} \times (3.5\ \mathrm{kcal/mol}) \times (\text{num. of residues}) \\
E_{\mathrm{thresh},D} &= \alpha_{D} + (\text{avg. dihed. potential}) \\
\alpha_{\mathrm{total}} &= \tfrac{1}{5} \times (1\ \mathrm{kcal/mol}) \times (\text{num. of atoms}) \\
E_{\mathrm{thresh,total}} &= \alpha_{\mathrm{total}} + (\text{avg. total potential})
\end{aligned}
\]

Here α_D and α_total are the tuning parameters for the dihedral and total energy terms, and E_thresh,D and E_thresh,total the respective threshold energies. If running only a dihedral or total boost simulation, then only the respective α and E_thresh should be used. In some cases, we have found that these parameters may be insufficient to achieve the desired level of sampling, in which case one may try increasing each E_thresh parameter by an additional value of α. However, care must be taken not to “over-accelerate” and create unwanted distortions in the system of interest, such as melting of stable secondary structure elements.

2.2 Example: Exploring the Conformational Space of the Histone H3 and H4 N-Terminal Tails

For the purpose of this example, we took the N-terminal tails of histone H3 and histone H4 from the nucleosome core particle and modeled each as an individual peptide, of 42 and 23 residues respectively, using the AMBER19SB force field [4]. Both have been shown to have disordered states in solution and provide excellent examples of difficult-to-sample IDPs based on their lengths and cationic nature. To further enhance sampling, we used a generalized Born implicit solvent model [26, 27], which speeds sampling by drastically reducing both the system sizes and the friction within each system.

To find suitable tuning parameters and threshold energies, we first obtained the average dihedral and total potential energies by running a short 500 ps cMD simulation of each system. We then set the tuning parameter and energy thresholds using the protocol detailed above.
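The boost potential of Eq. 1 and the starting-parameter rules given above can be written out in a few lines of Python. This is only a minimal sketch for checking numbers by hand: the function names are hypothetical, the input values are made up for illustration (they are not the H3 or H4 values), and nothing here reflects how NAMD implements aMD internally.

```python
def boost_potential(v, e_thresh, alpha):
    """Boost energy of Eq. 1 for a potential energy v.

    All three arguments share the same energy units. The comparison uses >=
    rather than >; both give zero boost at v == e_thresh, and >= avoids a
    0/0 division in the limiting case alpha == 0.
    """
    if v >= e_thresh:
        return 0.0
    gap = e_thresh - v
    return gap * gap / (alpha + gap)


def dual_boost_parameters(n_residues, n_atoms, avg_dihedral, avg_total):
    """Starting dual-boost parameters (kcal/mol) from the rules of thumb above."""
    alpha_dihedral = 3.5 * n_residues / 5.0   # (1/5) x 3.5 kcal/mol per residue
    alpha_total = 1.0 * n_atoms / 5.0         # (1/5) x 1 kcal/mol per atom
    return {
        "alpha_dihedral": alpha_dihedral,
        "e_thresh_dihedral": alpha_dihedral + avg_dihedral,
        "alpha_total": alpha_total,
        "e_thresh_total": alpha_total + avg_total,
    }


# Hypothetical inputs, chosen only to make the arithmetic easy to follow
params = dual_boost_parameters(n_residues=20, n_atoms=300,
                               avg_dihedral=100.0, avg_total=-600.0)
```

With these made-up inputs the dihedral tuning parameter is 14 kcal/mol and the dihedral threshold 114 kcal/mol. The average energies would in practice come from the short cMD run described above.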
Here, the average dihedral and total potential energies for H3 were 199 and −1361 kcal/mol; using the suggestion above, the tuning parameters were set to 30.8 and 136.2 kcal/mol, and the thresholds were set to 230 and −1225 kcal/mol for the dihedral and total potential boosts, respectively. Using NAMD (see Notes 2 and 3), we simulated 600 ns of aMD for each system with both the dihedral and total potential energies boosted using the above recommended values, as well as cMD simulations for comparison. For each of these simulations, the root-mean-square deviation (RMSD) matrix of each frame compared to the rest of the simulation is shown in Fig. 2. In the aMD simulations, the RMSD varies quickly between frames, with differences as high as 12 and 7 Å for the H3 and H4 tails, respectively. In contrast, the cMD simulations show reduced sampling for both peptides, with systems remaining in long-lived/stable conformations for significantly longer periods of time, as seen by the reduction in the bright-colored lines and the appearance of “boxes” of low RMSD values.

Fig. 2 The root-mean-square deviation (RMSD) matrix for the H3 and H4 aMD simulations (left), with the cMD simulations (right) for comparison

Since the altered potential lowers energy barriers, aMD simulations can show states not easily sampled in cMD simulations, thus revealing motions that occur on timescales longer than the simulation. These states and motions can frequently be defined through a set of collective variables. There are multiple ways to use aMD to determine collective variables, including dimension reduction analysis like principal component analysis (PCA) [28] and leveraging previous experimental and computational results (known conformational changes [29–31], for example). Visualization is a good first step; here we chose to use VMD, as it has a graphical user interface (GUI) plugin that allows the user to define a collective variable and visualize how the quantity evolves over the course of the trajectory [32] (see Note 4). This GUI uses the Colvars module [33], the NAMD implementation of collective variables, so any variable defined in the GUI can be easily used in a new simulation.

From our observations, we found that both the H3 and H4 tails sampled a range of compact and extended conformations with helical regions (Fig. 3). Both had an average helicity near 0.5, and H3 sampled end-to-end distances up to 60 Å, whereas H4 only sampled extensions up to 45 Å due to its shorter length. Based on this, we defined two collective variables: the distance between the backbone atoms of the terminal residues and the alpha Colvars component, which estimates the overall helicity (see Notes 5 and 6). In general, these are natural collective variables for sampling peptides that have a helical propensity [34]. These were utilized in the next section, in which the underlying free energy landscape was rigorously quantified with metadynamics calculations.

Fig. 3 Sampling along the two selected collective variables (helicity and end-to-end distance) for the aMD simulations. A sample structure from each system (left) shows both peptides can form short helices separated by unstructured loops. [Histograms for H3 and H4: normalized bin count versus helicity and versus end-to-end distance (Å)]

3 Quantifying Free Energy Landscapes with Metadynamics

3.1 Theory

In theory, converged potentials of mean force (PMFs) can be calculated by performing a Boltzmann inversion of the sampling in cMD simulations. However, the computational effort required to do this for even small systems is typically intractable, since they will spend the majority of their time in local minima and fail to sample transitions and new free energy minima. Metadynamics helps to overcome this issue by adding a history-dependent bias along a small number of collective variables [15–17].
This bias consists of Gaussian deposits that are periodically added at time intervals of τ:

\[
V_{\mathrm{bias}}(x, t) = w \sum_{\substack{t' = \tau,\, 2\tau,\, \ldots \\ t' \le t}} \exp\left( -\frac{\left(x - x(t')\right)^{2}}{2\,\delta x^{2}} \right)
\tag{2}
\]

where x is a point in collective variable space, x(t′) is the value of the collective variable at deposition time t′, and w and δx are the height and width parameters of the Gaussian deposits, respectively. As the system samples a local free energy minimum, the introduced bias slowly grows until it “fills” the minimum, allowing the system to escape and sample other states. To show how the bias grows, an example of a one-dimensional metadynamics simulation is shown in Fig. 4; the system initially remains in the global minimum, causing the bias to increase in that region over time. The bias eventually compensates for the minimum, allowing the system to easily sample the second minimum; eventually the bias compensates for both minima, making the effective landscape flat. Note that the flattening of the landscape is akin to aMD; unlike aMD, however, the bias from metadynamics is time dependent and is not uniform over the course of the simulation.

Fig. 4 An illustration of metadynamics calculations along a single variable with two stable states separated by a barrier. The simulation initially starts in one of these states, and as it progresses the periodic deposits to the bias “fill” the potential well (top row), allowing the simulation to cross the barrier and sample the second state. As the simulation progresses and samples the second state, the bias fully compensates the underlying free energy surface and provides an estimation of the surface along the variable (bottom row). [Panels: Initial, Intermediate, Final, and Estimated G(x), each plotted against the collective variable x]
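The deposition scheme of Eq. 2 can be illustrated with a toy one-dimensional model. The sketch below is an assumption-laden illustration rather than the Colvars implementation: it uses reduced units, an invented double-well free energy h(x² − 1)², overdamped Langevin dynamics, and hypothetical function names and parameter values.

```python
import numpy as np


def bias(x, centers, w, dx):
    """Eq. 2: sum of Gaussians (height w, width dx) placed at the deposit centers."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    centers = np.asarray(centers, dtype=float)
    if centers.size == 0:
        return np.zeros_like(x)
    diff = x[:, None] - centers[None, :]
    return w * np.exp(-diff**2 / (2.0 * dx**2)).sum(axis=1)


def bias_force(x, centers, w, dx):
    """Force from the bias, -dV_bias/dx, at a single position x."""
    centers = np.asarray(centers, dtype=float)
    if centers.size == 0:
        return 0.0
    diff = x - centers
    return float((w * diff / dx**2 * np.exp(-diff**2 / (2.0 * dx**2))).sum())


def run_metadynamics(n_steps=20000, stride=100, w=0.2, dx=0.2,
                     barrier=5.0, kT=1.0, dt=1e-3, seed=0):
    """Overdamped Langevin walker on G(x) = barrier * (x^2 - 1)^2 plus the bias."""
    rng = np.random.default_rng(seed)
    x = -1.0                      # start in the left-hand well
    centers = []
    for step in range(1, n_steps + 1):
        force = -4.0 * barrier * x * (x**2 - 1.0) + bias_force(x, centers, w, dx)
        x += force * dt / kT + np.sqrt(2.0 * dt) * rng.normal()
        if step % stride == 0:    # deposit one Gaussian every `stride` steps (tau)
            centers.append(x)
    return np.array(centers)


if __name__ == "__main__":
    centers = run_metadynamics()
    grid = np.linspace(-2.0, 2.0, 81)
    v_bias = bias(grid, centers, w=0.2, dx=0.2)
    print(len(centers), "deposits; largest bias on the grid:", v_bias.max())
```

Plotting `v_bias` against `grid` at increasing simulation lengths gives a picture comparable in spirit to Fig. 4, with the bias accumulating first in the starting well.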
Since the bias aims to sample and match the underlying landscape, the negative of the bias will estimate the shape of the system's free energy landscape [35–37]:

$$\lim_{t \to \infty} V_{\mathrm{bias}}(x,t) = -G(x) + C$$

Once the underlying energy surface has been balanced by the bias, any additional Gaussian deposits introduce error into the estimate and cause it to fluctuate around the true landscape. To prevent this fluctuation, Barducci et al. developed a "well-tempered" version where the height of the Gaussian deposits decreases as they are deposited in the same region [38]:

$$w'(x,t) = w \exp\left(-\frac{V_{\mathrm{bias}}(x,t)}{\Delta T}\right), \qquad G(x) = -\frac{T + \Delta T}{\Delta T}\, V_{\mathrm{bias}}(x,t) \qquad (3)$$

Here a new parameter, ΔT (expressed in energy units inside the exponential of Eq. 3, i.e., multiplied by the Boltzmann constant), determines how quickly deposits decrease in height as a minimum is filled. Since simulations cannot be run indefinitely, this parameter also introduces a maximum threshold to V_bias; this threshold can be used to limit the collective variable sampling to only biologically relevant regions [38] (see Note 7).

In NAMD, collective variables can be defined by activating the Colvars module, which takes a configuration file as input. This configuration file consists of blocks that define the variables and biases; an example Colvars configuration file is shown in Fig. 5, where our two collective variables (helicity and end-to-end distance) and a metadynamics protocol are defined for the H3 tail system. For this system, we created a grid with resolutions of 0.025 and 2.5 Å for the helicity and distance, respectively, which results in approximately 40 bins along both the helicity and distance coordinates; NAMD uses this spacing to determine the width of the Gaussian deposits and the resolution of the resulting energy landscape estimate. Increasing the resolution (i.e., decreasing the grid width) will give more detailed estimates; however, the bias will evolve more slowly and thus the landscape estimate will require more simulation time to converge.
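The well-tempered relations of Eq. 3 and the grid arithmetic above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the hill height (0.5 kcal/mol), bias temperature (2000 K, assumed here to play the role of ΔT), boundaries, and widths are the values from Fig. 5, whereas the simulation temperature of 300 K and the function names are hypothetical choices made only for illustration.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def wt_hill_height(w0, v_bias, delta_t):
    """Eq. 3: height of a new hill, w' = w * exp(-V_bias / (k_B * dT)), dT in kelvin."""
    return w0 * math.exp(-v_bias / (K_B * delta_t))

def free_energy_from_bias(v_bias, temperature, delta_t):
    """Eq. 3: G = -((T + dT) / dT) * V_bias."""
    return -(temperature + delta_t) / delta_t * v_bias

def n_bins(lower, upper, width):
    """Number of grid bins between two boundaries for a given width."""
    return round((upper - lower) / width)

if __name__ == "__main__":
    # Hill height after different amounts of bias have accumulated locally
    for v in (0.0, 2.0, 5.0, 10.0):
        print(f"V_bias = {v:5.1f} kcal/mol -> new hill = {wt_hill_height(0.5, v, 2000.0):.4f} kcal/mol")
    # Scaling factor between the bias and the free energy estimate (T = 300 K is assumed)
    print("G / V_bias =", free_energy_from_bias(1.0, 300.0, 2000.0))
    # Grid sizes implied by the boundaries and widths in Fig. 5
    print("helicity bins:", n_bins(0.0, 1.0, 0.025))
    print("distance bins:", n_bins(0.0, 120.0, 2.5))
```

With the Fig. 5 boundaries, the helicity grid has 40 bins and the distance grid has 48, in line with the "approximately 40" quoted above.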
3.2 Example: Using Metadynamics to Quantify Free Energy Landscapes of the Histone H3 and H4 N-Terminal Tails

To examine the thermodynamics of the H3 and H4 tails, both peptides were solvated using the OPC model [39] in a 150 mM NaCl environment. We then performed a single 2 μs well-tempered metadynamics simulation for each model, utilizing the two collective variables based on our aMD results and the metadynamics parameters discussed above. In each of these simulations, there was significant sampling of the end-to-end distance and helicity coordinate spaces as the peptides rapidly transitioned between diverse configurations (see Fig. 6 for details).

There are multiple methods for assessing convergence in metadynamics simulations. Here, we take advantage of the property inherent in well-tempered simulations that the hill heights decrease as a region of phase space is repeatedly sampled. The heights of the Gaussian deposits were monitored over time (Fig. 6), and while they started at the initial value of 0.5 kcal/mol in both systems, new deposit heights approached zero around 1.2 μs in the case of the H3 tail and 1.5 μs for the H4 tail. This indicates that additional sampling will have minimal effect on the computed PMF, as is also shown by the cumulative hill heights leveling off around these times. Indeed, we observed little difference in the PMFs computed after 1.2 μs in H3 and 1.5 μs in
H4 tails. Plotting the hill height as a function of time is also instructive for highlighting when systems sample a new region of collective variable space, as sudden increases in the hill height (such as in H3 around 0.7 μs) are indicative of sampling regions without previously deposited hills.

    colvarsTrajFrequency 500
    colvarsRestartFrequency 1000

    colvar {
        name heli
        width 0.025
        lowerBoundary 0.0
        upperBoundary 1.0
        alpha {
            residueRange 2-43
        }
    }

    colvar {
        name dist
        width 2.5
        lowerBoundary 0.0
        upperBoundary 120.0
        distance {
            group1 {
                # Selection: "resid 2 and backbone"
                atomNumbers 7 9 15 17
            }
            group2 {
                # Selection: "resid 43 and backbone"
                atomNumbers 652 654 674 675
            }
        }
    }

    metadynamics {
        name meta-H3
        colvars heli dist
        hillWeight 0.5
        newHillFrequency 500
        dumpFreeEnergyFile yes
        writeHillsTrajectory on
        hillwidth 1.0
        wellTempered on
        biasTemperature 2000
    }

Fig. 5 An example of a Colvars configuration file, generated from the Colvars Dashboard VMD plugin. The file consists of two types of code blocks: a type that defines a variable and one that defines a bias or protocol. Here two blocks define the helicity and distance parameter, while the final block defines a metadynamics protocol. Note that the first two lines are not set in a block; these are two global parameters in the Colvars module that set how often output files are written

Fig. 6 Sampling and convergence of metadynamics calculations. Both the H3 and H4 peptides rapidly transition between helical and nonhelical, as well as extended vs compact states.
The hill heights as a function of time can be used to gauge convergence, as the heights of hills added in well-sampled landscapes will approach zero, and the cumulative hill heights will level off

The final free energy landscapes are shown in Fig. 7. The global minima for both systems correspond to an end-to-end distance of 4 Å, which reflects interactions between the N- and C-terminal residues; however, both landscapes are relatively flat, cover a wide range of helicities and end-to-end distances, and have no large free energy barriers dividing metastable states. From this, we can conclude that both peptides are capable of sampling a wide range of conformations in solution and are not locked into discrete states.

Fig. 7 Potentials of mean force (PMFs) for H3 and H4 tails as computed from well-tempered metadynamics calculations. Both landscapes are overall relatively flat, with broad energy wells indicating that conformations covering a wide range of helicities and end-to-end distances are easily accessible in solution. All energies are in kcal/mol

While the sampling in these simulations is similar to that in the implicit aMD simulations above (Fig. 3), those simulations suggested a stronger preference for helical structures. For example, the H4 tail strongly sampled helicities between 0.4 and 0.6 in the aMD simulations, but the metadynamics PMFs reveal a range of accessible states between 0.2 and 0.8 with little preference for a particular value. This discrepancy is likely due to both the more robust reweighting mechanisms used in metadynamics and the different solvent models; in this case, the explicit OPC water model is known to match experiments for the H4 tail [40]. Nevertheless, an implicit model can be useful for finding long-timescale motions, as the simplified model is less computationally intensive.
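The hill-height convergence check used in this section can be sketched in a few lines. The example below is only an illustration: the threshold of 5% of the initial height, the function names, and the short synthetic time series are assumptions, and extracting the actual hill heights from the NAMD/Colvars output is deliberately left out.

```python
def convergence_time(times, heights, tol=0.05):
    """Return the first time after which every hill stays below tol * initial height.

    Late spikes (new regions of CV space being visited) push the estimate back,
    because the search uses the last hill that is still above the threshold.
    Returns None if the heights never settle below the threshold.
    """
    threshold = tol * heights[0]
    last_above = -1
    for i, h in enumerate(heights):
        if h >= threshold:
            last_above = i
    if last_above + 1 >= len(heights):
        return None
    return times[last_above + 1]

def cumulative(heights):
    """Running sum of hill heights; a plateau indicates that little new bias is added."""
    total, out = 0.0, []
    for h in heights:
        total += h
        out.append(total)
    return out

if __name__ == "__main__":
    # Synthetic data for illustration only (time in microseconds, heights in kcal/mol)
    times = [0.0, 0.5, 1.0, 1.5, 2.0]
    heights = [0.5, 0.2, 0.05, 0.01, 0.005]
    print("converged after:", convergence_time(times, heights))
    print("cumulative heights:", cumulative(heights))
```

In practice one would also compare PMFs computed at several time points, as done in the text, rather than rely on a single threshold.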
4 Conclusions

Here, we described a hierarchical approach for characterizing the conformational space of peptides. Initially, accelerated MD simulations are used to quickly scan the accessible peptide states. Given that aMD already disturbs the potential energy landscape, we elected to use an implicit solvent model in this stage, as the goal was to qualitatively describe the energetically accessible peptide conformations. These aMD simulations were then manually inspected to determine appropriate collective variables, which were then used in well-tempered metadynamics simulations with explicit solvent. Although explicit solvent systems run at a fraction of the speed of implicit solvent simulations, these were able to accurately quantify the underlying free energy landscapes of each system. Both aMD and metadynamics have their respective strengths and weaknesses, and combining them can lead to the efficient and rigorous characterization of peptide states.

5 Notes

1. Several variations of aMD exist, including selective aMD [25], windowed aMD [41], replica exchange aMD [42], rotatable dihedral aMD [43], and Gaussian aMD [40, 44]. Some of these methods provide ways to approximate the underlying free energy by reweighting frames directly from the aMD trajectories [45]; here, we elected to use aMD with implicit solvent only to quickly sample different peptide configurations, and then to robustly compute the free energy surface with metadynamics in explicit solvent.
2. Here we chose to use NAMD, but many molecular dynamics engines have aMD and free energy estimation methods implemented, including umbrella sampling [46, 47], steered molecular dynamics [48, 49], adaptive biasing force [50], and adaptively biased molecular dynamics [51]. These methods work similarly to metadynamics, as they each incorporate a bias into the system to sample and estimate the free energy along a collective variable space.
Each varies in how the bias is applied and how the free energy is calculated, but all have been well studied and developed [52, 53]; if metadynamics is not implemented in the desired engine, these can be suitable alternatives.
3. Here we boosted both the dihedral and total potential terms; however, the implementation in NAMD applies a boost to the dihedral and (total − dihedral) potentials. The reason behind this is to avoid boosting the dihedral potential twice, as the total potential includes the dihedral energy. Our recommended method of finding optimal aMD parameters thus should be adjusted, such that the energy threshold for the total potential uses the average total potential energy from a short cMD simulation without the dihedral term. This discrepancy is not present in AMBER, as it will boost the total potential and add an additional boost to the dihedral term while in dual boost mode.
4. While other visualization packages exist, VMD has dashboard plugins that work well with the Colvars module and PLUMED, another collective variable module used in GROMACS [54] (as well as NAMD v2.12 and later). These plugins provide several useful tools, including defining variables from VMD's atom selection, plotting the evolution of a variable over the trajectory, and plotting one variable against a second. The user can thus create and visualize a collective variable, compare variables to each other, and output a configuration file once a suitable set has been found.
5. Finding good collective variables can be difficult and can depend on the question at hand. Here we used implicit solvent and dual boost aMD to enhance sampling; however, both of these techniques can lead to artifacts. A good rule of thumb is to ensure that known secondary structure, if any, is conserved; if it is not, reduce the aMD bias by reducing the threshold energies by factors of alpha.
Here we used both methods in order to quickly determine extreme motions, which may or may not be relevant to biological processes of the histone tails.
6. Here we focused on using two collective variables, but in theory one can use any number of variables in metadynamics. However, increasing the number of variables exponentially increases the sampling space, which in turn increases the simulation effort needed to reach equilibrium, and in practice it is often not feasible to use more than three dimensions. The aim should therefore always be to find the minimum number of variables needed to describe the quantity in question. The correlation between two collective variables can be estimated by plotting one variable against another; such a pairwise plot can be easily made using the Colvars dashboard.
7. This chapter discussed the original and well-tempered metadynamics methods; however, other variants exist, including multiple-walker metadynamics [55], ensemble-biased metadynamics [56], and the merging of metadynamics with the adaptive biasing force algorithm [57]. The multiple-walker variant is a popular method, as it allows the user to run parallel simulations; since the communications between simulations are minimal, this method can efficiently scale on clusters of loosely coupled nodes. One could follow our example here and use aMD to seed different initial conditions for multiple walkers, thus sampling different regions of collective variable space simultaneously.

Acknowledgments

This work in the Wereszczynski group was supported by the National Science Foundation [MCB-1716099] and the National Institutes of Health [1R35GM119647].

References

1. Karplus M, McCammon JA (2002) Molecular dynamics simulations of biomolecules. Nat Struct Biol 9:646–652. https://doi.org/10.1038/nsb0902-646
2. Hollingsworth SA, Dror RO (2018) Molecular dynamics simulation for all. Neuron 99:1129–1143. https://doi.org/10.1016/j.neuron.2018.08.011
3.
Huang J, Rauscher S, Nawrocki G et al (2017) CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat Methods 14:71–73. https://doi.org/10.1038/nmeth.4067
4. Tian C, Kasavajhala K, Belfon KAA et al (2020) ff19SB: amino-acid-specific protein backbone parameters trained against quantum mechanics energy surfaces in solution. J Chem Theory Comput 16:528–552. https://doi.org/10.1021/acs.jctc.9b00591
5. Salomon-Ferrer R, Götz AW, Poole D et al (2013) Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. Explicit solvent particle mesh Ewald. J Chem Theory Comput 9:3878–3888. https://doi.org/10.1021/ct400314y
6. Phillips JC, Hardy DJ, Maia JDC et al (2020) Scalable molecular dynamics on CPU and GPU architectures with NAMD. J Chem Phys 153:044130. https://doi.org/10.1063/5.0014475
7. Shaw DE, Deneroff MM, Dror RO et al (2008) Anton, a special-purpose machine for molecular dynamics simulation. Commun ACM 51:91–97. https://doi.org/10.1145/1364782.1364802
8. Ohmura I, Morimoto G, Ohno Y et al (2014) MDGRAPE-4: a special-purpose computer system for molecular dynamics simulations. Phil Trans R Soc A 372:20130387. https://doi.org/10.1098/rsta.2013.0387
9. Hamelberg D, Mongan J, McCammon JA (2004) Accelerated molecular dynamics: a promising and efficient simulation method for biomolecules. J Chem Phys 120:11919–11929. https://doi.org/10.1063/1.1755656
10. Hamelberg D, de Oliveira CAF, McCammon JA (2007) Sampling of slow diffusive conformational transitions with accelerated molecular dynamics. J Chem Phys 127:155102. https://doi.org/10.1063/1.2789432
11. Grant BJ, Gorfe AA, McCammon JA (2009) Ras conformational switching: simulating nucleotide-dependent conformational transitions with accelerated molecular dynamics. PLoS Comput Biol 5:e1000325. https://doi.org/10.1371/journal.pcbi.1000325
12.
de Oliveira CAF, Grant BJ, Zhou M, McCammon JA (2011) Large-scale conformational changes of Trypanosoma cruzi proline racemase predicted by accelerated molecular dynamics simulation. PLoS Comput Biol 7:e1002178. https://doi.org/10.1371/journal.pcbi.1002178
13. Doshi U, Hamelberg D (2015) Towards fast, rigorous and efficient conformational sampling of biomolecules: advances in accelerated molecular dynamics. Biochim Biophys Acta Gen Subj 1850:878–888. https://doi.org/10.1016/j.bbagen.2014.08.003
14. Kamenik AS, Lessel U, Fuchs JE et al (2018) Peptidic macrocycles - conformational sampling and thermodynamic characterization. J Chem Inf Model 58:982–992. https://doi.org/10.1021/acs.jcim.8b00097
15. Laio A, Gervasio FL (2008) Metadynamics: a method to simulate rare events and reconstruct the free energy in biophysics, chemistry and material science. Rep Prog Phys 71:126601. https://doi.org/10.1088/0034-4885/71/12/126601
16. Barducci A, Bonomi M, Parrinello M (2011) Metadynamics. WIREs Comput Mol Sci 1:826–843. https://doi.org/10.1002/wcms.31
17. Bussi G, Laio A (2020) Using metadynamics to explore complex free-energy landscapes. Nat Rev Phys 2:200–212. https://doi.org/10.1038/s42254-020-0153-0
18. Bochicchio D, Panizon E, Ferrando R et al (2015) Calculating the free energy of transfer of small solutes into a model lipid membrane: comparison between metadynamics and umbrella sampling. J Chem Phys 143:144108. https://doi.org/10.1063/1.4932159
19. Capelli R, Bochicchio A, Piccini G et al (2019) Chasing the full free energy landscape of neuroreceptor/ligand unbinding by metadynamics simulations. J Chem Theory Comput 15:3354–3361. https://doi.org/10.1021/acs.jctc.9b00118
20. Tanida Y, Matsuura A (2020) Alchemical free energy calculations via metadynamics: application to the theophylline-RNA aptamer complex. J Comput Chem 41:1804–1819. https://doi.org/10.1002/jcc.26221
21.
Potoyan DA, Papoian GA (2011) Energy landscape analyses of disordered histone tails reveal special organization of their conformational dynamics. J Am Chem Soc 133:7405–7415. https://doi.org/10.1021/ja1111964
22. Iwasaki W, Miya Y, Horikoshi N et al (2013) Contribution of histone N-terminal tails to the structure and stability of nucleosomes. FEBS Open Bio 3:363–369. https://doi.org/10.1016/j.fob.2013.08.007
23. Erler J, Zhang R, Petridis L et al (2014) The role of histone tails in the nucleosome: a computational study. Biophys J 107:2911–2922. https://doi.org/10.1016/j.bpj.2014.10.065
24. Wang Y, Harrison CB, Schulten K, McCammon JA (2011) Implementation of accelerated molecular dynamics in NAMD. Comput Sci Disc 4:015002. https://doi.org/10.1088/1749-4699/4/1/015002
25. Wereszczynski J, McCammon JA (2010) Using selectively applied accelerated molecular dynamics to enhance free energy calculations. J Chem Theory Comput 6:3285–3292. https://doi.org/10.1021/ct100322t
26. Onufriev A, Bashford D, Case DA (2000) Modification of the generalized born model suitable for macromolecules. J Phys Chem B 104:3712–3720. https://doi.org/10.1021/jp994072s
27. Onufriev A, Bashford D, Case DA (2004) Exploring protein native states and large-scale conformational changes with a modified generalized born model. Proteins 55:383–394. https://doi.org/10.1002/prot.20033
28. Wereszczynski J, McCammon JA (2012) Nucleotide-dependent mechanism of Get3 as elucidated from free energy calculations. Proc Natl Acad Sci 109:7759–7764. https://doi.org/10.1073/pnas.1117441109
29. Bešker N, Gervasio FL (2012) Using metadynamics and path collective variables to study ligand binding and induced conformational transitions. In: Baron R (ed) Computational drug discovery and design. Springer, New York, NY, pp 501–513
30. Matsunaga Y, Komuro Y, Kobayashi C et al (2016) Dimensionality of collective variables for describing conformational changes of a multi-domain protein. J Phys Chem Lett 7:1446–1451.
https://doi.org/10.1021/acs.jpclett.6b00317
31. Ahalawat N, Mondal J (2018) Assessment and optimization of collective variables for protein conformational landscape: GB1 β-hairpin as a case study. J Chem Phys 149:094101. https://doi.org/10.1063/1.5041073
32. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14:33–38. https://doi.org/10.1016/0263-7855(96)00018-5
33. Fiorin G, Klein ML, Hénin J (2013) Using collective variables to drive molecular dynamics simulations. Mol Phys 111:3345–3362. https://doi.org/10.1080/00268976.2013.813594
34. Hazel A, Chipot C, Gumbart JC (2014) Thermodynamics of deca-alanine folding in water. J Chem Theory Comput 10:2836–2844. https://doi.org/10.1021/ct5002076
35. Laio A, Rodriguez-Fortea A, Gervasio FL et al (2005) Assessing the accuracy of metadynamics. J Phys Chem B 109:6714–6721. https://doi.org/10.1021/jp045424k
36. Bussi G, Laio A, Parrinello M (2006) Equilibrium free energies from nonequilibrium metadynamics. Phys Rev Lett 96:090601. https://doi.org/10.1103/PhysRevLett.96.090601
37. Crespo Y, Marinelli F, Pietrucci F, Laio A (2010) Metadynamics convergence law in a multidimensional system. Phys Rev E 81:055701. https://doi.org/10.1103/PhysRevE.81.055701
38. Barducci A, Bussi G, Parrinello M (2008) Well-tempered metadynamics: a smoothly converging and tunable free-energy method. Phys Rev Lett 100:020603. https://doi.org/10.1103/PhysRevLett.100.020603
39. Izadi S, Anandakrishnan R, Onufriev AV (2014) Building water models: a different approach. J Phys Chem Lett 5:3863–3871. https://doi.org/10.1021/jz501780a
40. Shabane PS, Izadi S, Onufriev AV (2019) General purpose water model can improve atomistic simulations of intrinsically disordered proteins. J Chem Theory Comput 15:2620–2634. https://doi.org/10.1021/acs.jctc.8b01123
41.
Sinko W, de Oliveira CAF, Pierce LCT, McCammon JA (2012) Protecting high energy barriers: a new equation to regulate boost energy in accelerated molecular dynamics simulations. J Chem Theory Comput 8:17–23. https://doi.org/10.1021/ct200615k
42. Fajer M, Hamelberg D, McCammon JA (2008) Replica-exchange accelerated molecular dynamics (REXAMD) applied to thermodynamic integration. J Chem Theory Comput 4:1565–1569. https://doi.org/10.1021/ct800250m
43. Doshi U, Hamelberg D (2012) Improved statistical sampling and accuracy with accelerated molecular dynamics on rotatable torsions. J Chem Theory Comput 8:4004–4012. https://doi.org/10.1021/ct3004194
44. Miao Y, Feher VA, McCammon JA (2015) Gaussian accelerated molecular dynamics: unconstrained enhanced sampling and free energy calculation. J Chem Theory Comput 11:3584–3595. https://doi.org/10.1021/acs.jctc.5b00436
45. Miao Y, Sinko W, Pierce L et al (2014) Improved reweighting of accelerated molecular dynamics simulations for free energy calculation. J Chem Theory Comput 10:2677–2689. https://doi.org/10.1021/ct500090q
46. Kumar S, Rosenberg JM, Bouzida D et al (1992) The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. J Comput Chem 13:1011–1021. https://doi.org/10.1002/jcc.540130812
47. Kumar S, Rosenberg JM, Bouzida D et al (1995) Multidimensional free-energy calculations using the weighted histogram analysis method. J Comput Chem 16:1339–1350. https://doi.org/10.1002/jcc.540161104
48. Park S, Khalili-Araghi F, Tajkhorshid E, Schulten K (2003) Free energy calculation from steered molecular dynamics simulations using Jarzynski's equality. J Chem Phys 119:3559–3566. https://doi.org/10.1063/1.1590311
49. Jarzynski C (1997) Nonequilibrium equality for free energy differences. Phys Rev Lett 78:2690–2693. https://doi.org/10.1103/PhysRevLett.78.2690
50.
Darve E, Rodríguez-Gómez D, Pohorille A (2008) Adaptive biasing force method for scalar and vector free energy calculations. J Chem Phys 128:144120. https://doi.org/10.1063/1.2829861
51. Babin V, Roland C, Sagui C (2008) Adaptively biased molecular dynamics for free energy calculations. J Chem Phys 128:134101. https://doi.org/10.1063/1.2844595
52. Wereszczynski J, McCammon JA (2012) Statistical mechanics and molecular dynamics in evaluating thermodynamic properties of biomolecular recognition. Q Rev Biophys 45:1–25. https://doi.org/10.1017/S0033583511000096
53. Chipot C (2014) Frontiers in free-energy calculations of biological systems. WIREs Comput Mol Sci 4:71–89. https://doi.org/10.1002/wcms.1157
54. Abraham MJ, Murtola T, Schulz R et al (2015) GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2:19–25. https://doi.org/10.1016/j.softx.2015.06.001
55. Raiteri P, Laio A, Gervasio FL et al (2006) Efficient reconstruction of complex free energy landscapes by multiple walkers metadynamics. J Phys Chem B 110:3533–3539. https://doi.org/10.1021/jp054359r
56. Marinelli F, Faraldo-Gómez JD (2015) Ensemble-biased metadynamics: a molecular simulation method to sample experimental distributions. Biophys J 108:2779–2782. https://doi.org/10.1016/j.bpj.2015.05.024
57. Fu H, Shao X, Cai W, Chipot C (2019) Taming rugged free energy landscapes using an average force. Acc Chem Res 52:3254–3264. https://doi.org/10.1021/acs.accounts.9b00473

Chapter 9

Metadynamics Simulations to Study the Structural Ensembles and Binding Processes of Intrinsically Disordered Proteins

Rui Zhou and Mojie Duan

Abstract

The structures of intrinsically disordered proteins (IDPs) are highly dynamic. It is hard to characterize the structures of these proteins experimentally.
Molecular dynamics (MD) simulation is a powerful tool for understanding the dynamic structures and functions of proteins. This chapter describes the application of metadynamics-based enhanced sampling methods to study how phosphorylation regulates the structure of the kinase-inducible domain (KID). The structural properties of free pKID and KID were obtained by the parallel tempering metadynamics combined with well-tempered ensemble (PTMetaD-WTE) method, and the binding free energy surfaces of pKID/KID and KIX were characterized by bias-exchange metadynamics (BE-MetaD) simulations.

Key words Structure ensemble, Intrinsically disordered protein, Binding processes, Molecular dynamics simulations, Metadynamics, Kinase-inducible domain

1 Introduction

In this chapter, we focus on how to use metadynamics simulations to study intrinsically disordered proteins [1]. The kinase-inducible domain (KID) is used as an example [2, 3]. KID is a phosphorylation-inducible domain: phosphorylation of Ser133 of KID stimulates gene expression, which depends on the interaction between KID and the KIX domain of the transcriptional coactivator CREB-binding protein (CBP) [4]. The phosphorylated KID (pKID) undergoes a disordered-to-helical structure transition upon binding to KIX. The bound structure of pKID is formed by two α-helices, i.e., αA (from residue 120 to 129) and αB (from residue 132 to 144) [5]. The phosphorylation of Ser133 is critical for the binding between pKID and KIX and increases the affinity by almost two orders of magnitude [6, 7].

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol.
2405, https://doi.org/10.1007/978-1-0716-1855-4_9, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

The conformational spaces of free KID and pKID were sampled by parallel-tempering metadynamics combined with the well-tempered ensemble (PTMetaD-WTE) [8], and the binding process between pKID and KIX was studied by bias-exchange metadynamics (BE-MetaD) [9]. Metadynamics [10, 11] was developed to compute the free energy profile along predefined reaction coordinates (RCs), also called collective variables (CVs). The evolution of the system is driven by an external bias potential that is updated with a given period. However, metadynamics suffers from two major problems: (a) it is hard to evaluate the convergence of the free energy surface and decide when to stop a simulation; (b) it is not trivial to select appropriate CVs for describing complex processes. Here, two advanced metadynamics techniques, i.e., PTMetaD-WTE and BE-MetaD, were employed to overcome these problems. In PTMetaD-WTE, multiple replicas are simulated at different temperatures and periodic exchanges between replicas are performed. In this way, the approach allows the system to overcome high free energy barriers. BE-MetaD exchanges conformations between replicas biased along different CVs and thereby considerably increases the sampling efficiency.

The results show that both pKID and KID are disordered, with some transient helical structures [1]. However, more hydrophobic interactions are formed in pKID. Our results revealed that the binding of the intrinsically disordered pKID follows a flexible conformational selection mechanism.

2 Theory

2.1 Metadynamics

In metadynamics, a history-dependent external bias potential is added to the energy function of the system.
This potential can be written as a sum of Gaussians deposited as a function of the CVs to prevent the system from visiting conformations similar to those that have already been sampled. The bias potential is:

$$V(S,t) = \int_0^{t} \omega \exp\left(-\sum_{i=1}^{m} \frac{\left(S_i(R) - S_i(R(t'))\right)^2}{2\sigma_i^2}\right) dt' \qquad (1)$$

where ω and σ_i are the height and width of the Gaussian bias, respectively, and S_i(R) is the value of the ith CV for coordinates R.

2.2 PTMetaD-WTE

In temperature-based replica exchange MD (tREMD), multiple replicas run simultaneously at different temperatures, and exchanges between adjacent replicas are attempted and accepted according to the Metropolis criterion. Exchanging a low-temperature replica with a high-temperature replica makes it possible to overcome energetic barriers. The acceptance probability for an exchange involving replicas a and b is:

$$\min\left\{1, \exp\left[(\beta_b - \beta_a)\left(U(R_b) - U(R_a)\right) + \beta_a\left(V_a(S(R_a)) - V_a(S(R_b))\right) + \beta_b\left(V_b(S(R_b)) - V_b(S(R_a))\right)\right]\right\} \qquad (2)$$

where β = 1/(k_B T), k_B is the Boltzmann constant, and T is the temperature of a given replica. U(R) is the potential energy of the system and V is the bias potential. If the exchange is accepted, the coordinates of replicas a and b are exchanged.

2.3 BE-MetaD

Like PTMetaD, bias-exchange metadynamics (BE-MetaD) also exchanges configurations between replicas. However, whereas the replicas in PTMetaD correspond to the system at different temperatures, the replicas in BE-MetaD are biased along different reaction coordinates at the same temperature. With this strategy, the method can handle a larger number of CVs simultaneously and can reach equilibration efficiently.

3 MD Settings

3.1 System Settings

The initial structures of free pKID and KID were built based on the experimental structure of the pKID-KIX complex (PDB ID: 1KDX [2]). The phosphorylated Ser133 was mutated back to serine for KID.
The pKID and KID were capped with acetyl (ACE) and amine (NH2) groups at the N- and C-terminus, respectively. The box size was set to 53 × 53 × 63 Å³ for free pKID and 55 × 55 × 64 Å³ for KID. TIP3P [12] water molecules were added to solvate the systems. Sodium and chloride ions were added to neutralize the systems, and the final ion concentration was set to 100 mM. The amber99SB-ILDN force field [13] was employed. Unbiased molecular dynamics simulations were performed to equilibrate the conformations of free pKID and KIX in aqueous solution. The pressure was coupled isotropically. The Particle-Mesh Ewald method [14] was employed to calculate long-range electrostatics with a real-space cutoff of 10 Å. The temperature was kept at 300 K with the V-rescale method [15], and the pressure was controlled by the Parrinello-Rahman barostat [16].

3.2 PTMetaD-WTE

The initial structures of free pKID/KID in the PTMetaD-WTE simulation were obtained from unbiased molecular dynamics simulations at high temperature. Twelve replicas were simulated at the following temperatures: 288 K, 300 K, 313 K, 327 K, 342 K, 359 K, 377 K, 398 K, 421 K, 446 K, 475 K, and 508 K. The PTMetaD-WTE simulations were implemented in a two-step scheme. First, parallel tempering simulations in the well-tempered ensemble (PT-WTE), biasing the potential energy, were performed. The bias factor was set to 30. The height of the initial bias energy was 1.0 kJ/mol and the width was 300 kJ/mol (Eq. 1). Exchange of configurations between adjacent replicas was attempted every 150 fs. After 30 ns of simulation for each replica, the height of the bias energy decreased to a value close to 0 and the exchange acceptance probability between adjacent replicas was about 0.3 (Eq. 2).

Fig. 1 The system energy in the replicas under different temperatures as a function of simulation time. (a) pKID; (b) KID
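The exchange test of Eq. 2 and the spacing of the temperature ladder listed above can be sketched as follows. This is an illustrative calculation only, not the GROMACS/PLUMED implementation: the function and argument names are hypothetical, the energies in the example are made-up numbers in kJ/mol, and only the twelve temperatures are taken from the text.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)

# Temperature ladder from the PTMetaD-WTE setup (K)
TEMPERATURES = [288, 300, 313, 327, 342, 359, 377, 398, 421, 446, 475, 508]

def exchange_probability(t_a, t_b, u_a, u_b, va_ra=0.0, va_rb=0.0, vb_rb=0.0, vb_ra=0.0):
    """Eq. 2: acceptance probability for swapping the coordinates of replicas a and b.

    u_a, u_b : potential energies U(R_a), U(R_b)
    va_ra    : bias of replica a evaluated at its own coordinates, V_a(S(R_a))
    va_rb    : bias of replica a evaluated at the coordinates of b, V_a(S(R_b))
    vb_rb, vb_ra : the same quantities for the bias of replica b
    """
    beta_a = 1.0 / (K_B * t_a)
    beta_b = 1.0 / (K_B * t_b)
    delta = ((beta_b - beta_a) * (u_b - u_a)
             + beta_a * (va_ra - va_rb)
             + beta_b * (vb_rb - vb_ra))
    # min{1, exp(delta)} without overflowing for large positive delta
    return 1.0 if delta >= 0.0 else math.exp(delta)

def ladder_ratios(temperatures):
    """Ratios between neighbouring temperatures; a near-constant ratio means a near-geometric ladder."""
    return [hi / lo for lo, hi in zip(temperatures, temperatures[1:])]

if __name__ == "__main__":
    print("neighbour ratios:", [round(r, 3) for r in ladder_ratios(TEMPERATURES)])
    # Made-up energies: the warmer replica sits 10 kJ/mol higher, no bias difference
    print("p(300 K <-> 313 K):", exchange_probability(300.0, 313.0, -1000.0, -990.0))
```

The neighbour ratios of this ladder stay between about 1.04 and 1.07, i.e., the spacing is close to geometric and widens slightly at high temperature.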
The potential energy underwent large fluctuations and was exchanged between neighboring replicas (Fig. 1). The average potential energy in the PT-WTE simulation remained close to the canonical value but showed large fluctuations. Next, simulations of all replicas were performed with a static energy bias in potential energy space, constructed in the PT-WTE step. A history-dependent bias was added on two collective variables to enhance sampling of the structure of the αA and αB regions of KID, i.e., the α-score for residues 120–129 and the α-score for residues 134–144 [17]. These CVs are defined as follows:

$$\alpha\text{-score} = \sum_i \frac{1 - \left(\frac{r_i}{0.08}\right)^{8}}{1 - \left(\frac{r_i}{0.08}\right)^{12}} \tag{3}$$

where $r_i$ is the deviation of the $i$th backbone segment from an ideal α-helix, as evaluated by the ALPHARMSD collective variable in the PLUMED input of Section 4.1. The bias factor γ was set to 16, the height of the initial bias was 1.0 kJ/mol, and the width was 0.2 rad. The MetaD bias was deposited every 500 steps, where each step was 1.5 fs. Exchanges of configurations between neighboring replicas were attempted every 750 fs.

3.3 BE-MetaD

Fig. 2 The collective variables as a function of simulation time in BE-MetaD simulations. (a) The CV1 values of pKID. (b) The CV1 values of KID

BE-MetaD combines replica exchange with metadynamics: configurations are exchanged between replicas, each of which is biased along a different collective variable. In this work, the initial structures for the bias-exchange metadynamics (BE-MetaD) simulations were built from the experimental complex structure (PDB ID: 1KDX). As in the regular simulations, sodium and chloride ions were added to neutralize the systems, and the final ion concentration was set to 100 mM. 10,092 and 9746 water molecules were added to solvate the pKID+KIX and KID+KIX systems, respectively. The box sizes were 74 × 74 × 73 Å³ and 74 × 74 × 72 Å³ for pKID+KIX and KID+KIX, respectively. Four biased replicas were run, one for each of the four CVs. Exchanges between the replicas were attempted every 4 ps.
A 450 ns simulation was performed for each replica, giving a total of 1.8 μs for each system. The simulations were considered to have reached equilibrium when the systems covered the CV space (Fig. 2). The Gaussian bias was applied to four CVs: the α-score for residues 120–129 in pKID or KID (CV1), the α-score for residues 134–144 in pKID or KID (CV2), the COM distance between pKID/KID and KIX (CV3), and the number of native contacts between pKID/KID and KIX (CV4). The α-score CVs were employed to describe the folding of pKID, CV3 was used to describe the binding process, and CV4 describes the progress of binding between pKID/KID and KIX. The COM distance in CV3 was limited to less than 4.0 nm with a harmonic restraining potential during the simulation, to focus sampling on the relevant regions of configurational space. The harmonic potential had the following form:

$$V_M = \begin{cases} \frac{1}{2} k (S - S_0)^2, & \text{if } S > S_0 \\ 0, & \text{if } S \le S_0 \end{cases} \tag{4}$$

where $S$ corresponds to the COM distance between pKID/KID and KIX. $S_0$ was 3.0 nm. The force constant $k$ was 500 kJ/(mol·nm²). CV4 was calculated as a sum of switching functions:

$$Q = \sum_{ij} \frac{1}{1 + \exp\left[\beta\left(r_{ij} - \lambda r_{ij}^{0}\right)\right]} \tag{5}$$

where $r_{ij}$ represents the distance between heavy atoms in pKID/KID and KIX that are closer than 0.45 nm in the experimental structure. We used $r_{ij}^{0}$ = 0.45, λ = 1.8, and β = 50 nm⁻¹ [18]. The Gaussian height ω was set to 2.0 kJ/mol for all CVs; the Gaussian width was 0.2 for CV1 to CV3 and 10 for CV4. The bias factor was set to 32 in all replicas. The Gaussian bias was deposited every 5 ps.

4 Implementation

4.1 PT-WTE Metadynamics

1. Preprocessing of the protein. Software: Gromacs2018 [19] gmx. Module usage: pdb2gmx, editconf, solvate, genion.

2. Energy minimization. Software: Gromacs2018 gmx. Module usage: grompp, mdrun.
Command:
    gmx grompp -f minim.mdp -c pKID_solv_ions.gro -p pKID.top -o em.tpr
    gmx mdrun -s em.tpr -deffnm em

3. NVT equilibration. Software: Gromacs2018 gmx. Module usage: grompp, mdrun.
Command:
    gmx grompp -f tem_nvt0.mdp -c em.gro -p pKID.top -n index.ndx -o pKID-$TEMP-anne-nvt.tpr
    gmx mdrun -s pKID-$TEMP-anne-nvt.tpr -deffnm pKID-$TEMP-anne-nvt

4. NPT equilibration. Software: Gromacs2018 gmx. Module usage: grompp, mdrun.
Command:
    gmx grompp -f npt.mdp -c pKID-$TEMP-anne-nvt.gro -t pKID-$TEMP-anne-nvt.cpt -p pKID.top -n index.ndx -o pKID-$TEMP-npt.tpr
    gmx mdrun -s pKID-$TEMP-npt.tpr -deffnm pKID-$TEMP-npt -gpu_id 01 -nt 12

5. Parallel tempering simulation. Software: Gromacs2018 with plumed-2.4 [20, 21] patched. Module usage: grompp, mdrun.
Command:
    gmx grompp -f remd.mdp -c pKID-$TEMP-npt.gro -t pKID-$TEMP-npt.cpt -p pKID.top -o pKID-PT.tpr
    mpirun -np 12 -hostfile hosts mdrun_mpi -s pKID-PT-charm5ns -deffnm pKID-PT-5ns -plumed plumed_PT.dat -multi 12 -replex 500 -gpu_id 0011

6. Parallel tempering with well-tempered ensemble. Software: Gromacs2018 with plumed-2.4 patched. Module usage: grompp, mdrun.
Command:
    gmx grompp -f remd.mdp -c pKID-PT-5ns0.gro -t pKID-PT-5ns0.cpt -p pKID.top -o pKID-PTWTE-30ns0.tpr
    mpirun -np 12 -hostfile hosts mdrun_mpi -s pKID-PTWTE-30ns -plumed plumed_PTWTE -deffnm pKID-PTWTE-30ns- -multi 12 -replex 100 -gpu_id 0011

7. PT-WTE metadynamics. Software: Gromacs2018 with plumed-2.4 patched. Module usage: grompp, mdrun.
Command:
    gmx grompp -f remd.mdp -c pKID-PTWTE-30ns.gro -t pKID-PTWTE-30ns-.cpt -p pKID.top -o pKID-PTMetaDWTE-300ns0.tpr
    mpirun -np 12 -hostfile hosts mdrun_mpi -s pKID-PTMetaDWTE-300ns -plumed plumed_PTMetaDWTE -deffnm pKID-PTMetaDWTE-300ns -multi 12 -replex $steps -gpu_id 0011

Plumed files:

    ######### plumed file for PT-WTE #########
    MOLINFO STRUCTURE=pKID-exp.pdb
    # set up the two CVs and total energy
    ALPHARMSD RESIDUES=2-11 TYPE=DRMSD LESS_THAN={RATIONAL R_0=0.08 NN=8 MM=12} LABEL=CV1
    ALPHARMSD RESIDUES=16-26 TYPE=DRMSD LESS_THAN={RATIONAL R_0=0.08 NN=8 MM=12} LABEL=CV2
    ene: ENERGY
    # Activate metadynamics in ene
    # Well-tempered metadynamics is activated
    #
    wte: METAD ARG=ene PACE=500 HEIGHT=$HEIGHT SIGMA=$SIGMA FILE=HILLS_PTWTE_ BIASFACTOR=20 TEMP=$TEMP
    # monitor the three variables and the metadynamics bias potential
    PRINT STRIDE=1000 ARG=CV1.lessthan,CV2.lessthan,ene,wte.bias FILE=COLVAR_PTWTE_PT-WTE
    ############################################################

    ######### PT-WTE Metadynamics #########
    RESTART
    MOLINFO STRUCTURE=pKID-exp.pdb
    # set up two CVs and total energy
    ALPHARMSD RESIDUES=2-11 TYPE=DRMSD LESS_THAN={RATIONAL R_0=0.08 NN=8 MM=12} LABEL=CV1
    ALPHARMSD RESIDUES=16-26 TYPE=DRMSD LESS_THAN={RATIONAL R_0=0.08 NN=8 MM=12} LABEL=CV2
    ene: ENERGY
    # Activate metadynamics in ene
    # Well-tempered metadynamics is activated
    #
    wte: METAD ARG=ene PACE=999999999 HEIGHT=$HEIGHT SIGMA=$SIGMA FILE=HILLS_PTWTE_ BIASFACTOR=20 TEMP=$TEMP
    # activate metadynamics, depositing a Gaussian every 500 time steps
    metad: METAD ARG=CV1.lessthan,CV2.lessthan PACE=500 HEIGHT=$HEIGHT SIGMA=$S1,$2 FILE=HILLS_PTMetaDWTE BIASFACTOR=16 TEMP=$TEMP
    # monitor the three variables and the metadynamics bias potential
    PRINT STRIDE=1000 ARG=CV1.lessthan,CV2.lessthan,ene,wte.bias,metad.bias FILE=COLVAR_PTMetaDWTE
    ############################################################

4.2 BE-MetaD

1.
The preprocessing, energy minimization, and equilibration steps were similar to those for the PT-WTE metadynamics.

2. BE-MetaD. Software: Gromacs2018 with plumed-2.4 patched. Module: grompp, mdrun.
Command:
    gmx grompp -f mdrun.mdp -cpt npt_$replic -p pKID-KIX.top -o mdrun_$replica.tpr
    mpirun -np 4 mdrun -s mdrun_ -plumed plumed_be.dat -deffnm mdrun_extend_ -gpu_id 01 -nt 4

    ############## plumed.1 file for BE-MetaD ##############
    INCLUDE FILE=plumed-common.dat # include the definition of CVs
    be: METAD ARG=CV1.lessthan HEIGHT=$HEIGHT SIGMA=$SIGMA PACE=2500 BIASFACTOR=32 GRID_MIN=$CV_MIN GRID_MAX=$CV_MAX GRID_BIN=100 FILE=HILLS
    PRINT ARG=CV1.lessthan,CV2.lessthan,CV3,CV4,be.bias,duwall.bias STRIDE=2500 FILE=COLVAR
    ##############################################################################

5 Concluding Notes

1. Both pKID and KID are disordered with some transient helical structures.

2. More hydrophobic interactions are formed in the phosphorylated KID, which promote the formation of the special hydrophobic residue cluster (HRC).

3. The binding mechanism of the intrinsically disordered pKID follows a flexible conformational selection mechanism.

References

1. Liu N, Guo Y, Ning S, Duan M (2020) Phosphorylation regulates the binding of intrinsically disordered proteins via a flexible conformation selection mechanism. Commun Chem 3:123
2. Radhakrishnan I et al (1997) Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: a model for activator:coactivator interactions. Cell 91(6):741–752
3. Sugase K, Dyson HJ, Wright PE (2007) Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature 447(7147):1021–1025
4. Zor T et al (2002) Roles of phosphorylation and helix propensity in the binding of the KIX domain of CREB-binding protein by constitutive (c-Myb) and inducible (CREB) activators. J Biol Chem 277(44):42241–42248
5.
Radhakrishnan I et al (1998) Conformational preferences in the Ser(133)-phosphorylated and non-phosphorylated forms of the kinase inducible transactivation domain of CREB. FEBS Lett 430(3):317–322
6. Dahal L, Shammas SL, Clarke J (2017) Phosphorylation of the IDP KID modulates affinity for KIX by increasing the lifetime of the complex. Biophys J 113(12):2706–2712
7. Zor T et al (2002) Roles of phosphorylation and helix propensity in the binding of the KIX domain of CREB-binding protein by constitutive (c-Myb) and inducible (CREB) activators. J Biol Chem 277(44):42241–42248
8. Prakash MK, Barducci A, Parrinello M (2011) Replica temperatures for uniform exchange and efficient roundtrip times in explicit solvent parallel tempering simulations. J Chem Theory Comput 7(7):2025–2027
9. Piana S, Laio A (2007) A bias-exchange approach to protein folding. J Phys Chem B 111(17):4553–4559
10. Laio A, Parrinello M (2002) Escaping free-energy minima. Proc Natl Acad Sci U S A 99(20):12562–12566
11. Valsson O, Tiwary P, Parrinello M (2016) Enhancing important fluctuations: rare events and metadynamics from a conceptual viewpoint. Annu Rev Phys Chem 67:159–184
12. Jorgensen WL, Chandrasekhar J, Madura JD et al (1983) Comparison of simple potential functions for simulating liquid water. J Chem Phys 79:926–935
13. Lindorff-Larsen K et al (2010) Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins 78(8):1950–1958
14. Essmann U, Perera L, Berkowitz ML (1995) A smooth particle mesh Ewald method. J Chem Phys 103:8577
15. Bussi G, Donadio D, Parrinello M (2007) Canonical sampling through velocity rescaling. J Chem Phys 126(1):014101
16. Parrinello M, Rahman A (1980) Crystal structure and pair potentials: a molecular-dynamics study. Phys Rev Lett 45:1196
17. Pietrucci F, Laio A (2009) A collective variable for the efficient exploration of protein beta-sheet structures: application to SH3 and GB1. J Chem Theory Comput 5(9):2197–2201
18.
Best RB, Hummer G, Eaton WA (2013) Native contacts determine protein folding mechanisms in atomistic simulations. Proc Natl Acad Sci U S A 110(44):17874–17879
19. Berendsen HJ, Spoel D, Drunen R (1995) GROMACS: a message-passing parallel molecular dynamics implementation. Comput Phys Commun 91:43–56
20. Tribello GA et al (2014) PLUMED2: new feathers for an old bird. Comput Phys Commun 185:604
21. Bussi G, Tribello GA (2019) Analyzing and biasing simulations with PLUMED. Methods Mol Biol 2022:529–578

Chapter 10

Computational and Experimental Protocols to Study Cyclo-dihistidine Self- and Co-assembly: Minimalistic Bio-assemblies with Enhanced Fluorescence and Drug Encapsulation Properties

Asuka A. Orr, Yu Chen, Ehud Gazit, and Phanourios Tamamis

Abstract

Our published studies on the self- and co-assembly of cyclo-HH peptides demonstrated their capacity to coordinate with Zn(II), their enhanced photoluminescence, and their ability to self-encapsulate epirubicin, a chemotherapy drug. Here, we provide a detailed description of computational and experimental methodology for the study of cyclo-HH self- and co-assembling mechanisms, photoluminescence, and drug encapsulation properties. We outline the experimental protocols, which involve fluorescence spectroscopy, transmission electron microscopy, and atomic force microscopy, as well as the computational protocols, which involve structural and energetic analysis of the assembled nanostructures. We suggest that the computational and experimental methods presented here are generalizable, and thus can be applied to the investigation of self- and co-assembly systems involving other short peptides, encapsulated compounds, and ion binding, beyond the particular ones presented here.
Key words Molecular dynamics, Nanostructure, Biomaterials, Drug delivery, Electron microscopy, Charmm program, Generalized Born, Association free energy

Asuka A. Orr and Yu Chen contributed equally to this work.

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_10, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

Supramolecular self-assembly of biomolecules into nanostructures with diverse hierarchical architectures is essential to physiological functions across all kingdoms of life [1]. As the keys to fundamental working principles of biology, proteins and peptides are endowed with the propensity to form complex architectures uniquely suited for specialized functions. These multiple, well-defined, supramolecular self-assemblies, with different sequences, shapes, and functions, enable living systems to respond to internal and external stimuli and engage in complex behavior [1, 2]. Peptides and peptide derivatives have received significant attention as potential nanotechnology building blocks due to their flexibility, variability in molecular design, and ease of large-scale synthesis [2]. In the past few years, many short peptides and their synthetic analogues have been assembled into structures according to the minimalist approach originally described by DeGrado and coworkers [3]. We and others have taken a reductionist approach to form numerous self-assembled peptide structures. One of the prominent examples is diphenylalanine (FF) [4], a self-assembling dipeptide sequence initially identified as the minimal core recognition motif of amyloid β-protein, the amyloidogenic polypeptide associated with Alzheimer’s disease [5].
FF peptide self-assemblies have been shown to display intriguing features, including biosensing, energy storage, superhydrophobic surfaces, and photoluminescence [4–10]. Furthermore, recent studies revealed that cyclo-dipeptides with 2,5-diketopiperazine backbone configurations, derived from dehydration condensation of linear dipeptides, self-assemble into oligomeric nanostructures [11]. In particular, inspired by the molecular structure of BFPms1 [12], we successfully constructed a short fluorescent peptide core encapsulated by the peptide scaffold building module to implement the concept of a “self-assembly locking strategy” [13, 14]. We reported the demonstration of a bright fluorescent peptide with quantum yields of up to 70% for green fluorescence, exemplifying the potential of such structures to serve as bioinspired, organic, supramolecular alternatives to complement their state-of-the-art inorganic counterparts [14]. Importantly, our studies also aimed to provide fundamental insights into the underlying molecular self-assembling mechanisms and the modulation of the photoluminescence properties of these materials, which remain difficult to elucidate. Such insights can be of utmost importance for further utilization and future exploitation of the assemblies, toward novel biological materials with advanced functional applications, including but not limited to cancer drug delivery.

In this chapter, we provide a detailed description of selected computational and experimental methods for the study of cyclo-HH peptide self- and co-assembly mechanisms, photoluminescence, and drug encapsulation properties, included in our recently published papers [13, 14].
We first give an overview of the experimental protocols used to study the co-assembly of cyclo-HH with different ions and molecules to ultimately produce a fluorescent drug nanocarrier, and we describe three key experimental protocols (fluorescence spectroscopy, transmission electron microscopy, and atomic force microscopy) for studying the morphology and fluorescence of the formed nanostructures. We then give an overview of the computational protocol, based on molecular dynamics (MD) simulations, and the structural and energetic analysis of the early stages of cyclo-HH self-assembly and co-assembly. The protocol was used to study cyclo-HH self-assembly in the presence and absence of Zn(II) ions [13], and the co-assembly of cyclo-HH with Zn(II) ions in the presence or absence of nitrate ions, and with Zn(II) ions, nitrate ions, and the chemotherapy drug epirubicin (EPI) [14]. We particularly focus here on the theoretical developments used to study these systems. We consider that the computational and experimental methods presented here in detail are generalizable, and thus can be applied to the self- and co-assembly of systems involving other short peptides, encapsulated compounds, and ion binding, beyond cyclo-HH.

2 Materials

2.1 Peptide

The peptide, cyclic(L-histidine-D-histidine) (cyclo-HH), was purchased from GL Biochem (Shanghai, China) and had a degree of purity higher than 95%. Zinc nitrate (Zn(NO3)2), dimethylformamide (DMF), dimethyl sulfoxide (DMSO), and isopropanol were purchased from Sigma-Aldrich (Rehovot, Israel). Epirubicin hydrochloride was purchased from Glentham Life Sciences. All materials were used as received without further purification. Water was processed using a Millipore purification system (Darmstadt, Germany) with a minimum resistivity of 18.2 MΩ cm.

2.2 Simulation and Analysis Software
In the computational methods described, the CHARMM [15] program (http://charmm.chemistry.harvard.edu) was used to perform MD simulations and additional energetic calculations. The analysis was primarily performed using FORTRAN programs and other programs listed below. Visual inspection of simulations was performed with VMD [16].

3 Methods

3.1 Experimental Methods

3.1.1 Co-assembly

1. To study assembly, cyclo-HH and co-assembling molecules or ions are mixed under controlled experimental conditions, resulting in the formation of nanostructures. For the assembly of cyclo-HH in the presence of Zn(II) and nitrate ions, we first prepared a fresh stock solution of cyclo-HH. 5.48 mg of lyophilized cyclo-HH peptide powder was dissolved in 5% (v/v) DMF/isopropanol mixed solvent in a 2 mL scintillation vial at a concentration of 0.02 mmol mL⁻¹. Then, 2.97 mg of the metal salt Zn(NO3)2 was added to the peptide solution under vigorous sonication for 5 min. The vial was heated at 80 °C for 1 h with a ramping rate of 1 °C/min and was cooled down to room temperature overnight. The color of the solution subsequently changes to light yellow. For co-assembly of cyclo-HH in the presence of Zn(II) ions, nitrate ions, and EPI, lyophilized cyclo-HH peptide powder was dissolved in 4% (v/v) DMSO/isopropanol at a concentration of 5.48 mg mL⁻¹. Following that, 0.25 mg EPI and 2.97 mg of the metal salt Zn(NO3)2 were added under vigorous sonication, followed by a 1 h incubation in an 80 °C water bath (with a temperature rise rate of 1 °C/min) and subsequent cooling to ambient temperature. The elevated temperature is intended to accelerate self-assembly, followed by cooling to equilibrium at room temperature. The obtained red suspension was then centrifuged at 12,557 × g for 20 min, and the precipitates were washed three times with Milli-Q water to remove any excess EPI and Zn(NO3)2.

3.1.2 Fluorescence Spectroscopy

2.
Fluorescence spectroscopy is a key measurement for assessing the optical properties of peptide assemblies. For the assembly of cyclo-HH in the presence of Zn(II) and nitrate ions, a 600 μL sample solution of the assembled material formed by cyclo-HH in the presence of Zn(II) and nitrate ions (cyclo-HH-Zn(NO3)2) was pipetted into a 1.0 cm path-length quartz cuvette (Hellma Analytics, item no.: HL108-F-10-40, light path: 10 × 4 mm), and the spectrum was collected using a FluoroMax-4 spectrofluorometer (Horiba Jobin Yvon, Kyoto, Japan) at ambient temperature. The excitation and emission wavelengths were set at 300–500 nm and 300–700 nm, respectively, with a slit of 2 nm (see Note 1). The resulting excitation–emission matrix contour profile shows that the cyclo-HH-Zn(NO3)2 material has bright fluorescence properties (Fig. 1).

3.1.3 Transmission Electron Microscopy and Atomic Force Microscopy

3. Transmission electron microscopy (TEM) and atomic force microscopy (AFM) were employed to identify the morphologies of the nanostructures newly formed by the co-assembled cyclo-HH. To study the material formed by the co-assembly of cyclo-HH in the presence of Zn(II) and nitrate ions through AFM, we first attached a mica sheet to a microscope slide using a non-conductive double-sided adhesive tab (as shown in Fig. 2). Then, the mica sheet (see Note 2) (Highest Grade V1 Mica Discs, 12 mm, item no.: 50-12, Ted Pella, Inc) was rinsed with water and gently purge-dried with nitrogen (99.99%). 5 μL of cyclo-HH-Zn(NO3)2 sample solution was dropped onto the freshly cleaved mica surface and dried by N2 purge. A topographic image was recorded with a Dimension Icon AFM (Bruker) in tapping mode at ambient temperature, with a 512 × 512 pixel resolution and a scanning speed of 1.0 Hz. Nanoscope Analysis software was used for data collection and analysis. For TEM sample preparation, cyclo-HH-Zn(NO3)2 was first sonicated for 10 min.
Then, 10 μL cyclo-HH-Zn(NO3)2 sample solution was gently dropped onto a glow-discharged copper grid coated with a thin carbon film (Formvar Carbon Film 400 mesh, Copper, item no.: FCF400-CU, Electron Microscopy Sciences). After 2 min, the excess solution was removed with a filter paper. TEM images were viewed using an FEI Tecnai F20 electron microscope operating at 80 kV. AFM and TEM images show that the morphology of cyclo-HH-Zn(NO3)2 consists of nanoparticles of about 30 nm (Fig. 3).

Fig. 1 Typical excitation–emission matrix contour profile of cyclo-HH-Zn(NO3)2

Fig. 2 Mica sheet attached to microscope slide

3.2 Computational Methods

In the following steps, we describe the methodology followed to simulate and analyze the produced trajectories to obtain insights into the short peptide self-assembly properties. The short peptide simulation systems highlighted here correspond to cyclo-HH self- and co-assembly in different environments. Specifically, we considered the self-assembly of cyclo-HH in the presence and absence of Zn(II) ions in methanol [13] and the co-assembly of cyclo-HH with Zn(II) ions in the presence or absence of nitrate ions, and with Zn(II) ions, nitrate ions, and EPI in isopropanol [14]. An overview of the computational methodology is presented in Fig. 4.

Fig. 3 AFM (left) and TEM (right) images of the morphology of cyclo-HH-Zn(NO3)2

Fig. 4 Schematic of the overall computational methodology to study cyclo-HH self-assembly

3.2.1 MD Simulation Setup and Execution

1. The 3D structures of the molecules under investigation are constructed to match the experimental systems. The 3D structures for compounds can be obtained from existing structure databases, such as the ZINC Database [17] or PubChem [18], or can be manually built through programs such as Marvin Sketch [19].
For the studies of cyclo-HH co-assembly [13, 14], the 3D structures of all molecules were manually built through Marvin Sketch to ensure the correct protonation state observed in the experimentally resolved crystal structures.

2. Molecular force fields are chosen to describe the interactions of the molecules and solvents. In our studies, we used polarizable and nonpolarizable force fields in separate studies of cyclo-HH self- and co-assembly [13, 14]. In the studies involving EPI and nitrate ions, polarizable force fields were not used because parameters and topologies for nitrate ions and EPI are not readily available (see Note 3). To ensure that a force field can adequately describe the co-assembly system under investigation, computationally derived results can be compared to experimental results or to additional computationally derived results of higher accuracy. For the simulations of cyclo-HH co-assembly using nonpolarizable force fields, elementary structures of cyclo-HH with Zn(II) ions were in line with both experimentally derived crystal structures and elementary structures derived from simulations using the Drude polarizable force field [13, 14, 20].

3. Different initial conformations of the molecules under investigation are generated through short infinite dilution simulations. These conformations are used as starting points for the finite dilution simulations described in the following steps (see Note 4). To generate different configurations of the molecules under investigation, infinite dilution simulations were performed for each molecule independently. In the infinite dilution simulations, two bonded atoms are aligned and fixed to remove the translation and rotation of the molecule. Translations and rotations are introduced to copies of the molecule when they are initially placed on a grid to build the starting structure of the finite dilution simulations (see step 4).
In the simulations of cyclo-HH self- and co-assembly, the monomer molecules (cyclo-HH and EPI) were independently simulated for 10 ns with structures extracted every 10 ps to generate 1000 possible configurations of each molecule [13, 14].

4. The initial structures of the finite dilution simulations are generated by placing the molecules in random configurations and orientations on a grid. The configurations are randomly selected from the pool of 1000 possible configurations (see step 3) for each molecule. The molecules are placed on a grid such that they are equally spaced; the initial distance between the nearest atoms of neighboring molecules is within the cutoff of nonbonded interactions to facilitate the formation of an initial aggregate [21]. In this way, each molecule can initially “interact” with another molecule within the simulations. The number of molecules and ions within the simulation system should be sufficiently large to enhance statistical analysis of the interactions formed [21].

5. The initial configuration of the grid of molecules and ions under investigation is solvated in solvent boxes. The solvent molecules used to build the solvent box should be in line with experiments (methanol and isopropanol for references [13] and [14], respectively). The solvent box is periodically replicated through periodic boundary conditions, and the molecules and ions within the simulations are free to move within the replicated solvent box. The size of the solvent box can be kept relatively small to facilitate the interaction of the molecules and ions and to enhance the sampling and formation of clusters within the simulations [21] (see Note 5). Subsequently, the charge of the simulation systems is neutralized by introducing counterions through Monte Carlo simulations [22, 23].
In the simulations of cyclo-HH self- and co-assembly, the size of the grid of molecules and the solvent boxes was selected to increase the simulated concentration of the co-assembling molecules, compared to experiments, to facilitate self-assembly [13, 14]. Simulations were performed at 300 K, in line with the room temperature at which the systems were left to cool and equilibrate in the experiments. Simulation input files followed the general flow indicated by CHARMM-GUI [22, 23], adjusted for the systems under investigation.

6. Prior to the production simulation runs, the simulation systems are first energetically minimized and equilibrated. An energy minimization is performed on the starting structures. The simulation systems are subsequently subjected to a position-restrained equilibration, which aims to avoid any unnecessary and sudden structural distortion when initiating the MD simulation production stage. In this step, all heavy atoms are constrained to their starting positions, allowing the solvent molecules and ions to equilibrate around the assembling molecules.

7. Finally, the simulation systems are investigated using multi-nanosecond MD simulations with all constraints imposed in the previous steps released. The duration of the MD simulations can be tailored to the systems under investigation (see Note 6). In the production stage, simulation snapshots are saved throughout the duration of the simulations for subsequent structural and energetic analysis (Fig. 4, see steps 9–31). For the simulations of cyclo-HH self- and co-assembly, 100 ns MD simulations were performed; this duration proved to be sufficiently long to observe the formation and reformation of aggregates within the simulation and the convergence of structural and energetic properties [13, 14]. Additional details on the simulation methodology are provided in references [13] and [14].

8.
During the construction of the starting structures of each simulation system and the subsequent MD simulations, it is recommended to check the simulation visually and the corresponding output files to ensure that the starting structures were built appropriately and that the simulations are progressing properly.

3.2.2 Structural Analysis

9. The simulation trajectories provide structures of aggregates formed by the self-assembling molecules. These aggregates can be characterized in terms of the specific interactions formed within the aggregates or the overall structural properties (e.g., compactness, solvent accessibility) by postprocessing the simulation trajectories in structural analysis programs (Fig. 4). In simulations of cyclo-HH self- and co-assembly, the structural analysis programs focused on the formation of specific interactions within the formed aggregates as well as the overall geometric properties of the aggregates (composition, compactness, location of molecules within the aggregates) [13, 14].

10. Pair-wise interactions, based on atom-to-atom distances, between co-assembling molecules in the simulations can be recorded and characterized by post-processing the simulation trajectories in structural analysis programs. The interactions that are chosen to be tracked and their corresponding distance cutoffs can be guided by experimental results (e.g., crystal structures) and/or visual inspection of the MD simulation trajectories (see Note 7).

11. Information on which molecules or ions are interacting can be tabulated in suitably defined matrices, which can be defined in FORTRAN programs. The raw data from the trajectories are used to populate matrices of the form g(axis,entity,atom,resid,i) containing the coordinates of each atom. The index “axis” runs from 1 to 3 and corresponds to the x-, y-, and z-axes, respectively.
The index “entity” runs from 1 to k, the total number of molecule or ion types (two for cyclo-HH and Zn(II) ions, three for cyclo-HH, Zn(II) ions, and nitrate ions, and four for cyclo-HH, Zn(II) ions, nitrate ions, and EPI). The index “atom” runs from 1 to a, the total number of atoms in “entity.” The index “resid” runs from 1 to j(entity), the total number of molecule or ion copies of “entity”; and index “i” runs from 1 to S, the total number of snapshots to be analyzed in the trajectory.

12. To determine whether two molecules or ions are bonded and through what type of interaction, the distance between each of their atoms per simulation snapshot is first calculated from matrix g. A nested FORTRAN loop exhaustively calculates the distances between atoms in each simulation snapshot and stores the distances in a temporary variable, “dist.” If “dist” is less than the defined distance cutoff (3.5 Å for the simulations of cyclo-HH self- and co-assembly) for a given pair of atoms belonging to different individual molecules or ions, then the two atoms (and their corresponding molecules) are considered to be bonded. Once a pair of atoms is within the distance cutoff, the program compares the interacting atoms to a list of interaction type definitions, and information on how the atoms are bonded is stored in matrix flag(entity1, resid1, entity2, resid2, type, i). The indices “entity1” and “entity2” run from 1 to k, the total number of molecule or ion types; the indices “resid1” and “resid2” run from 1 to j(entity1) or j(entity2), the total number of molecule or ion copies of “entity1” or “entity2”; the index “type” runs from 1 to T, the total number of possible interaction types in the library of user-defined interactions; and index “i” runs from 1 to S, the total number of snapshots to be analyzed in the trajectory.
If an atom of entity1 and resid1 is bonded to another atom of entity2 and resid2 through interaction type T1, where T1 is a number between 1 and T (see Note 8), then flag(entity1, resid1, entity2, resid2, T1, i) will be populated with a 1; otherwise, it will be populated with a 0. After all possible bonded atoms between each pair of molecules or ions are identified in all analyzed simulation snapshots, the data are printed in an output text file for further analysis with the first, second, third, fourth, fifth, and sixth columns populated with i, entity1, resid1, entity2, resid2, and flag, respectively. In this way, each row contains information on how two molecules or ions interact and at what snapshot in the simulation the interaction is formed. For the study of cyclo-HH self- and co-assembly, this file is named “pairs.dat.”

13. The output of the previous step, “pairs.dat,” can be used to group interacting molecules and ions into clusters such that a number of s molecules or ions (cyclo-HH molecules, Zn(II) ions, nitrate ions, or EPI) are defined to form a cluster when a molecule or ion of any entity is in the vicinity of at least one other. The clustering can be viewed as a two-stage process in which temporary clusters of increasing size are detected in the first stage (Fig. 5a), and a final list of clusters is output in the second stage, with redundancies or the presence of smaller clusters within larger clusters removed (Fig. 5b).

Fig. 5 Schematic of how clusters are detected in programs. (a) Temporary clusters are detected and expanded through comparisons to a list of interacting pairs in “pairs.dat.” (b) Redundant temporary clusters are removed in “cluster.dat” such that repeated clusters or smaller clusters that are simultaneously present within larger clusters are removed. Clusters larger than two molecules or ions are colored according to the cluster they belong to.
Colored lines indicate matches between molecules or ions within temporary clusters and bonded pairs of molecules or ions listed in “pairs.dat.” Larger clusters were observed in the simulations [13, 14] and are omitted for clarity.

14. In the first stage, for each simulation snapshot, clusters of increasing size are identified with individual molecules or ions added one at a time (Fig. 5a). In a given snapshot, temporary clusters of two molecules or ions are compared to a list of bonded pairs of molecules or ions. Both the temporary clusters of two and the list of bonded pairs of molecules or ions correspond to “pairs.dat.” If any one of the molecules or ions present in the list of bonded pairs is present in the temporary cluster of two, then the molecule or ion is added to the cluster and the cluster size is expanded to a temporary cluster of three. Subsequently, the temporary clusters of three molecules or ions are compared to the same list of bonded pairs of molecules or ions. If any one of the molecules or ions present in the list of bonded pairs is present in the cluster of three, then the molecule or ion is added to the cluster and the cluster size is expanded to a temporary cluster of four. This process is repeated until no larger clusters are detected. Through this process, temporary clusters of smaller sizes can also be detected within larger clusters and the same temporary cluster may be listed repeatedly, with the molecules or ions in different orders (Fig. 5a, clusters in green, blue, and red).

15. In the second stage, after the largest temporary cluster is detected, the presence of duplicate clusters and the presence of smaller clusters within larger clusters are removed (Fig. 5b). For example, if resid 17 of entity 2 is present in a cluster of 5 at snapshot 210 (Fig. 5b), then it is no longer considered to be part of a cluster of 4, 3, or 2 (Fig. 5a).
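The outcome of the two-stage detection in steps 13–15 can be sketched compactly in Python. Note that this sketch swaps in a different technique: instead of growing temporary clusters one member at a time and then pruning duplicates, it uses a union-find (connected components) pass over the bonded pairs of a single snapshot. Under the assumption that a final cluster is simply a set of molecules or ions linked through bonded pairs, this should give the same non-redundant clusters. The function name and the (entity, resid) labels are hypothetical.

```python
from collections import defaultdict


def find_clusters(pairs):
    """Group molecules/ions linked by bonded pairs into clusters (one snapshot).

    pairs: iterable of (member_a, member_b), e.g. the rows of a pair list
    restricted to a single snapshot. Members are any hashable, sortable labels.
    """
    parent = {}

    def find(x):
        # Follow parent links to the representative of x, shortening the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(list)
    for member in parent:
        groups[find(member)].append(member)
    # Largest clusters first; members sorted for a reproducible listing.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Toy pair list for one snapshot; labels are (entity, resid), values are made up.
pairs = [((1, 1), (1, 2)), ((1, 2), (2, 1)), ((1, 3), (1, 4))]
for cluster in find_clusters(pairs):
    print(len(cluster), cluster)
```

Each printed row gives the cluster size s followed by its members, so prefixing the snapshot index would give rows of the form i, s, members.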
The remaining data are printed in an output text file for further structural and energetic analysis with the first, second, and third columns populated with i, s, and the molecules or ions belonging to the cluster. In this way, each row corresponds to an isolated cluster within the specified simulation snapshot i, and the listed molecules or ions belong to the same isolated cluster. For the study of cyclo-HH self- and co-assembly, this file is named “cluster.dat” (Fig. 5b). A cluster of size s can be composed of several combinations of entities. For example, a cluster composed of 7(cyclo-HH) + 3(EPI) + 4(Zn(II)) + 6(nitrate) and another cluster composed of 6(cyclo-HH) + 1(EPI) + 5(Zn(II)) + 8(nitrate) both have a cluster size of 20. Processing “pairs.dat” prior to the detection of clusters, or processing “cluster.dat” through Unix commands and FORTRAN programs, can focus the analysis on clusters containing desired interactions or compositions (see Note 9).

16. The clusters of molecules and ions detected by the structural analysis programs are extracted from the simulation trajectories and further analyzed. For the simulations of cyclo-HH self- and co-assembly, each cluster was independently extracted and analyzed as described in steps 17–20.

17. The percent solvent exposure of a molecule or ion within a cluster can provide insights into the geometric properties of the cluster and the location of each molecule or ion within the cluster. The solvent accessible surface area (SASA) provides a metric for how “buried” a molecule or ion is within a cluster; the larger the SASA of a molecule or ion, the more exposed it is and the more likely it is to be at the surface of the cluster; the smaller the SASA of a molecule or ion, the more “buried” it is and the more likely it is to be encapsulated in the interior of the cluster.
The percent solvent exposure of a molecule or ion can be measured by the SASA of the molecule or ion divided by the total molecular surface area (TSA) of the same molecule or ion. For such calculations, it is important to select a probe radius in accordance with the solvent used in the simulations (see Note 10). For the simulations of cyclo-HH co-assembly in isopropanol, the percent solvent exposure was calculated for each molecule and ion within a given cluster, as defined by “cluster.dat,” to determine the existence of exterior and interior layers and the composition within the layers in the observed clusters. As isopropanol was used as the solvent in the simulations, the probe radius used in the calculations was set to 2.2 Å [24, 25]. Subsequently, the running average percent exposure of each molecule or ion within a given cluster was calculated, beginning with the most buried entity (lowest percent exposure) and moving outwards. Molecules or ions with a running average solvent exposure equal to or less than a specific percentage (chosen to be 45% [14]) were considered to be in the interior layer of the clusters, whereas molecules or ions with a running average solvent exposure greater than this percentage were considered to be at the exterior of the clusters. The probe radius and the percent cutoff can be tuned to ensure that the cutoff adequately identifies and differentiates the molecules or ions at the interior and the exterior of the cluster. Subsequently, the percent population of cyclo-HH, Zn(II) ions, and nitrate ions or cyclo-HH, Zn(II) ions, nitrate ions, and EPI within the interior and exterior layers of the clusters was calculated. The analysis showed that EPI and Zn(II) ions were predominantly located in the interior of the clusters, nitrate ions were predominantly located at the exterior of the clusters, and cyclo-HH was located in both the interior and exterior layers of the clusters [14].

18.
The radius of gyration of specific molecules or ions within the formed aggregates of the simulation systems can provide insights into their compactness within the aggregates or their location with respect to other molecules or ions of the aggregates. The radius of gyration (Rg) is the square root of the mean squared deviation of the positions of the N atoms (r_k) from their geometric center (r̄), and can be calculated using trajectory analysis tools such as Wordom [26, 27]:

$$R_g = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(r_k - \bar{r}\right)^2} \qquad (1)$$

When comparing the radius of gyration of specific molecules or ions across different simulation systems, it is important to ensure a “fair” comparison. In particular, in the calculations comparing the compactness of Zn(II) ions in clusters across different simulation systems, the comparison was performed between clusters containing the same number of cyclo-HH, and the numbers of Zn(II) ions within clusters of the same size were similar across the two systems. Thus, the difference in radius of gyration was not due to a lower number of Zn(II) ions present in the clusters of one system over the other. In the simulations of cyclo-HH co-assembly with Zn(II) ions in the absence and presence of nitrate ions, the radius of gyration calculations suggested that Zn(II) ions are more densely packed within cyclo-HH assemblies formed with nitrate ions [14].

19. For co-assembled clusters containing different entities, the radius of gyration of one entity within a cluster can be compared to the radius of gyration of another within the same cluster to indicate if one entity is encapsulated by the other. To compare the radius of gyration of different entities within a given cluster, the radius of gyration of one entity (e.g., EPI) should be subtracted from the radius of gyration of another (e.g., cyclo-HH) within the same cluster.
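As a worked illustration of the radius of gyration and of the per-cluster difference between two entities, a minimal Python sketch follows. It is not the Wordom implementation; it assumes an unweighted geometric center, and the coordinates and the names `radius_of_gyration`, `outer`, and `inner` are made up for illustration.

```python
import math


def radius_of_gyration(coords):
    """Radius of gyration of a set of (x, y, z) positions about their
    unweighted geometric center (no mass weighting is applied)."""
    n = len(coords)
    center = [sum(c[axis] for c in coords) / n for axis in range(3)]
    mean_sq_dev = sum(math.dist(c, center) ** 2 for c in coords) / n
    return math.sqrt(mean_sq_dev)


# Toy cluster with made-up coordinates: an "outer" entity on a ring of radius
# 1 Å and an "inner" entity on a segment of half-length 0.5 Å, same center.
outer = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
inner = [(0.5, 0.0, 0.0), (-0.5, 0.0, 0.0)]

rg_outer = radius_of_gyration(outer)
rg_inner = radius_of_gyration(inner)
# A positive difference indicates that the first entity is spread farther
# from its center than the second within the same cluster.
print(rg_outer, rg_inner, rg_outer - rg_inner)
```

Because each radius of gyration is taken about that entity's own center, a positive difference is most informative when the two entities share roughly the same center, as in a layered cluster.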
It is also important to ensure a “fair comparison” when comparing the radius of gyration of specific molecules or ions within different clusters of the same simulation system. Restricting the analysis to clusters of a given percent composition of each entity and calculating the difference in radius of gyration between the entities of interest per cluster (rather than comparing the average radius of gyration for each entity across all clusters) can enable a “fair comparison.” For the simulations of cyclo-HH in the presence of Zn(II) ions, nitrate ions, and EPI, the calculations were performed for aggregates containing at least 10 molecules, with a composition ranging from 30% cyclo-HH and 70% EPI to 70% cyclo-HH and 30% EPI [14]. The percent composition criterion was introduced to ensure that each cluster had a sufficient number of cyclo-HH and EPI [14]. These calculations suggested that cyclo-HH was encapsulating EPI (Fig. 6) [14].

Fig. 6 Molecular graphics image of EPI (red) and Zn(II) ions (yellow) encapsulated by cyclo-HH (blue) and nitrate ions (green).

20. Time evolution analysis, tracking geometric properties with respect to simulation time, can provide insights into the process by which the molecules and ions of the clusters co-assemble. Using the data in “pairs.dat,” each interaction type can be plotted with respect to simulation time to observe the order in which the interactions are formed. Using data from “cluster.dat,” the composition of the clusters formed within the simulations can also be plotted with respect to simulation time, to observe the order in which the molecules or ions aggregate into clusters.
The analysis can be tuned to include only molecules or ions that eventually form large clusters specified by “cluster.dat.” For the simulations comparing the self-assembly of cyclo-HH in the absence and presence of Zn(II) in methanol, we tracked interactions between pairs of cyclo-HH to uncover the order in which interactions were formed between the pairs to ultimately form ordered elementary structures of bonded cyclo-HH pairs [13]. For the simulations of cyclo-HH with Zn(II) ions in the presence or absence of nitrate ions, and the simulations of cyclo-HH, Zn(II) ions, nitrate ions, and EPI, we investigated the time evolution of clusters composed of individual molecules or ions that eventually form large clusters [14]. In this case, due to the presence of additional components, the definition of a “large” cluster should balance statistical significance with cluster complexity. For example, a larger cluster may be sufficiently complex, but occur only once within the simulation. Likewise, a small cluster may occur many times within a simulation, allowing for sufficient statistical significance, but may not be complex enough to describe a cluster. For the studies involving cyclo-HH with Zn(II) in the presence or absence of nitrate ions, we focused on clusters containing at least 10 cyclo-HH. For the simulations involving cyclo-HH, Zn(II) ions, nitrate ions, and EPI, we focused on clusters that eventually lead to clusters containing at least 10 molecules (either cyclo-HH or EPI) with a composition ranging from 30% cyclo-HH and 70% EPI to 70% cyclo-HH and 30% EPI. The additional percent composition criterion for the simulations of cyclo-HH, Zn(II) ions, nitrate ions, and EPI ensured that all clusters analyzed had a sufficient number of cyclo-HH and EPI molecules to observe both interior and exterior layers of the clusters.
The plotted data showed that the interior of the cluster (composed predominantly of EPI, Zn(II) ions, and cyclo-HH) forms first, followed by exterior molecules and ions of the cluster (composed predominantly of cyclo-HH and nitrate ions) wrapping around the preformed interior [14].

3.2.3 Energetic Analysis

21. Association free energy calculations can provide valuable insights into the mechanism and driving forces leading to the co-assembly and stabilization of clusters formed by cyclo-HH (Fig. 4). The energy calculations can also serve to complement the structural analysis and ensure that the conclusions derived from the analyses correlate and are consistent. The MM-GBSA approximation [28, 29] provides a relatively fast and effective means to evaluate the association free energy of the clusters in “thought” energy calculations examining different potential pathways of co-assembly [13, 14].

22. In these association free energy calculations, the Generalized Born with a simple Switching (GBSW) implicit-solvent model [30] was used to account for the solvent. In the implicit-solvent model, the dielectric constant can be tuned in accordance with the solvent used within the simulations and experiments. For the studies involving methanol, we used a dielectric constant of 33.5 to account for the dielectric environment of methanol [13, 31]; for the studies involving isopropanol, we used a dielectric constant of 18.4 to account for the dielectric environment of isopropanol [14, 32].

23. The inclusion of nonpolar solvation effects in these calculations is important and can cause inaccuracies in the calculated energy values if not accounted for carefully. Such contributions can be calculated through a surface tension coefficient multiplying the solvent accessible surface area, and the surface tension coefficient can be obtained by fitting the experimental hydration energies [33].
We suggest that if comprehensive studies of the appropriate surface tension coefficient in balance with the GBSW implicit solvent are lacking for the solvent under investigation, nonpolar solvation effects may be omitted to avoid any bias or inaccuracy driven by an arbitrarily chosen value. Nevertheless, additional calculations can be performed using the surface tension coefficient’s default value for GBSW, which corresponds to 0.03 kcal mol⁻¹ Å⁻², determined for water solvation [30], to ensure that the overall trends remain the same. While the inclusion of nonpolar solvation effects using the default surface tension coefficient value may not be sufficiently accurate for other solvents, such a calculation could be used to verify that its inclusion does not change the overall trends, but only affects the resulting absolute values. In the simulations of cyclo-HH self- and co-assembly, energy calculations were performed with the nonpolar solvation effects omitted, as a consensus surface tension coefficient for isopropanol had not been previously reported [13, 14]. However, we confirmed that the overall trends remained the same using the default value for the GBSW surface tension coefficient.

24. Clusters detected in the simulations are isolated from the simulation systems and undergo a series of “thought” energy calculations. To perform the calculations for all detected clusters independently, a FORTRAN program reads each line of “cluster.dat.” For each line, the program writes a CHARMM [15] script file for each “thought” energy calculation, executes the CHARMM [15] script, extracts energetic data from the CHARMM [15] output, and calculates the association free energies normalized by the size of the cluster. The FORTRAN program executes CHARMM [15] and extracts data from the CHARMM [15] output through system calls combined with Unix commands.
The calculated normalized energies for each “thought” energy calculation are printed into separate data files for further analysis. It is recommended that any conclusions derived from such “thought” energy calculations be cross-validated with the structural analysis described in the sections above (see Note 11).

25. In the series of “thought” energy calculations, the isolated cluster is subjected to different conditions. Figure 7 shows possible “thought” free energy calculations that can be used to examine different possible pathways of cyclo-HH co-assembly. In these energy calculations, different initial states of the molecules and ions within the clusters are explored. For example, all molecules and ions comprising a cluster can be assumed to initially be completely isolated and immersed in the surrounding solvent for one set of energy calculations. In another set of energy calculations, a portion of the molecules and ions composing a cluster can be assumed to be preformed initially prior to co-assembly. Insights gained from the energy calculations presented in Fig. 7 can lead to the formulation of additional “thought” energy calculations exploring intermediate states that lead to the final formation of the cluster, as presented in Fig. 8.

26. To represent the free energy for isolated individual molecules or ions, which are part of a cluster, to spontaneously self-assemble into the cluster, the MM-GBSA association free energy is calculated through Eq. 2. This corresponds to pathways B, C, E, I of Fig. 7.

$$\Delta G(s) = E_{\mathrm{cluster}} - \sum_{i=1}^{s} E_i \qquad (2)$$

The energy of the cluster, Ecluster, represents the energy of the constituent molecules and ions, with all other molecules or ions deleted. The energy of each of the isolated, individual molecules or ions of the cluster, Ei, is calculated by assuming that each molecule or ion, i, has the same conformation as within the cluster, but isolated and fully immersed in solution, with all other molecules or ions within the cluster deleted.

Fig. 7 Schematic of “thought” energy calculations performed to examine possible pathways of self-assembly for (a–d) cyclo-HH in the presence of Zn(II) ions, (e–h) cyclo-HH in the presence of Zn(II) and nitrate ions, and (i–l) cyclo-HH in the presence of Zn(II) ions, nitrate ions, and EPI in references [13] and [14].

Fig. 8 Schematic of potential pathways related to the most energetically favorable pathway of cyclo-HH-Zn(NO3)2-EPI cluster formation according to Fig. 7. The free energy to associate the interior cluster from the individual EPI and Zn(II) ions was unfavorable according to energy calculations (indicated by the red “X”). However, the free energy to associate the interior layer is favorable if the Zn(II) ions are in a peptide-like environment prior to association, and the interior layer is in a peptide-like environment prior to the formation of the cluster.

27. To represent the free energy for isolated individual molecules or ions, which are part of a cluster, to aggregate onto a preformed portion of the same final cluster, the MM-GBSA association free energy is calculated through Eq. 3. This corresponds to pathways F, G, J, K of Fig. 7.

$$\Delta G(s) = E_{\mathrm{cluster}} - E_{\mathrm{preformed}} - \sum_{i=1}^{s} E_i \qquad (3)$$

The energy of the cluster, Ecluster, represents the energy of the constituent molecules and ions, with all other molecules or ions deleted.
The energy of the preformed portion, Epreformed, represents the energy of the molecules and ions constituting the preformed portion of the cluster, with all other molecules or ions deleted. The energy of the isolated, individual molecules or ions of the cluster, Ei, is calculated by assuming that each molecule or ion, i, has the same conformation as within the cluster, but isolated and fully immersed in solution, with all other molecules or ions within the cluster deleted.

28. To represent the free energy for two preformed portions of the cluster to aggregate with each other, the MM-GBSA association free energy is calculated through Eq. 4. This corresponds to pathways A, H, L of Fig. 7.

$$\Delta G(s) = E_{\mathrm{cluster}} - E_{\mathrm{preformed\,1}} - E_{\mathrm{preformed\,2}} \qquad (4)$$

The energy of the cluster, Ecluster, represents the energy of the constituent molecules and ions, with all other molecules or ions deleted. The energy Epreformed 1 represents the energy of the molecules and ions constituting the first preformed portion of the cluster, with all other molecules or ions deleted. The energy Epreformed 2 represents the energy of the molecules and ions constituting the second preformed portion of the cluster, with all other molecules or ions deleted.

29. In the aforementioned energy calculations, the cluster, preformed portions of the cluster, and/or individual molecules or ions of the cluster are assumed to be fully immersed in pure solvent (methanol and isopropanol in references [13] and [14], respectively) through the deletion of molecules or ions not involved in the energy calculation. Additional energy calculations can be performed to examine hypothetical pathways in which the interior of the cluster, exterior of the cluster, and/or individual molecules or ions of the cluster are in a peptide-like environment, such as Fig. 7d.
In such hypothetical calculations, the nonpolar component of the association free energies is calculated in the same way as described in Eqs. 2–4. The polar component of the association free energies is calculated by setting to zero the charge of all molecules or ions of the cluster, except for the molecules or ions involved in the energy calculation. For example, to examine the contribution of molecules or ions co-assembling in a peptide-like environment, the association free energy would be calculated through Eq. 2, except that the polar component of Ei for a given molecule or ion is calculated by setting the charge of all other molecules and ions within the cluster to zero (Fig. 7d). In this way, the calculations represent the energy in a peptide-like dielectric environment, rather than a pure solvent dielectric environment (see Note 12).

30. To provide insights into the role of Zn(II) ions in clusters formed by cyclo-HH in the presence of Zn(II), association “thought” free energy calculations were performed under several assumptions: (a) preformed assemblies of cyclo-HH and preformed assemblies of Zn(II) join to form the final cluster (Fig. 7a), (b) individual cyclo-HH molecules and Zn(II) ions spontaneously assemble to form the final cluster (Fig. 7b), (c) individual cyclo-HH molecules spontaneously assemble to form the final cluster with Zn(II) ions not contributing energetically (Fig. 7c), and (d) individual cyclo-HH molecules and Zn(II) ions spontaneously assemble to form the final cluster, with Zn(II) ions being within the dielectric environment of the final cluster (Fig. 7d).
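The bookkeeping behind the three pure-solvent association free energies (spontaneous assembly, aggregation onto a preformed portion, and association of two preformed portions) can be sketched in Python as follows. This is only an arithmetic illustration: all energy values are invented, the function names are hypothetical, and the energies themselves would come from the MM-GBSA calculations in CHARMM, which are not reproduced here. For the preformed-portion case, the sketch assumes that the sum runs over the molecules or ions that are added individually.

```python
def dg_spontaneous(e_cluster, e_isolated):
    """Association free energy for isolated species assembling into the
    cluster: cluster energy minus the sum of the isolated energies."""
    return e_cluster - sum(e_isolated)


def dg_onto_preformed(e_cluster, e_preformed, e_isolated):
    """Association free energy for isolated species aggregating onto a
    preformed portion of the cluster."""
    return e_cluster - e_preformed - sum(e_isolated)


def dg_two_preformed(e_cluster, e_preformed_1, e_preformed_2):
    """Association free energy for two preformed portions joining."""
    return e_cluster - e_preformed_1 - e_preformed_2


def normalize(dg, cluster_size):
    """Normalize by the number of molecules or ions in the cluster."""
    return dg / cluster_size


# Invented energies (kcal/mol) for a hypothetical cluster of size 4.
e_cluster = -120.0
dg_a = dg_spontaneous(e_cluster, [-10.0, -10.0, -10.0, -10.0])
dg_b = dg_onto_preformed(e_cluster, -50.0, [-10.0, -10.0])
dg_c = dg_two_preformed(e_cluster, -50.0, -30.0)
for label, dg in (("spontaneous", dg_a), ("onto preformed", dg_b),
                  ("two preformed", dg_c)):
    print(label, dg, normalize(dg, 4))
```

Comparing the normalized values across pathways for the same cluster is what allows the pathways to be ranked; the absolute numbers in this toy example carry no physical meaning.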
These calculations revealed that cyclo-HH co-assembles with Zn(II) ions through an “environment switching mechanism” by which Zn(II) ions are first pulled from the dielectric environment of the surrounding methanol solvent by coordinating with individual or pairs of cyclo-HH, followed by the assembly of the coordinated Zn(II) ions and cyclo-HH into the final clusters [13].

31. To understand the mechanism by which cyclo-HH co-assembles with Zn(II) and nitrate ions, as well as the mechanism by which cyclo-HH co-assembles with Zn(II) ions, nitrate ions, and EPI, association “thought” free energy calculations were also performed under several assumptions: (a) the individual molecules and ions forming the cluster are initially completely immersed in pure isopropanol and spontaneously self-assemble into a cluster (Fig. 7e, i), (b) the interior layer assembly is preformed in pure isopropanol and individual molecules and ions of the exterior layer, completely immersed in pure isopropanol, subsequently aggregate onto the preformed interior layer to form a cluster (Fig. 7f, j), (c) the exterior layer assembly is preformed in pure isopropanol and individual molecules and ions forming the interior layer, completely immersed in pure isopropanol, subsequently aggregate on the preformed exterior layer to form a cluster (Fig. 7g, k), and (d) the interior layer and the exterior layer assemblies are individually preformed, initially not interacting with each other and completely immersed in pure isopropanol, then subsequently aggregate to form a cluster (Fig. 7h, l). These calculations suggested that the most energetically favored pathway is when the interior nucleus assembles first, followed by individual exterior components wrapping around the interior nucleus to form the clusters [14]. Additional energy calculations examining the most favorable pathway according to Fig.
7 were performed to gain more insights into the formation of clusters within the simulations of cyclo-HH in the presence of Zn(II) ions, nitrate ions, and EPI (Fig. 8). These calculations were in line with structural calculations (see Note 11) and suggested that Zn(II) ions are pulled from the isopropanol environment into a more peptide-like environment by individual molecules or pairs of cyclo-HH, enabling the self-encapsulation of EPI, which further facilitates the co-assembly of individual cyclo-HH and nitrate ions wrapping around the preformed EPI-Zn(II) interior [14].

4 Notes

1. If the fluorescence response is low (<1 × 10⁵ CPS), the slit value can be increased. If the material has a high fluorescence response (>1.7 × 10⁷ CPS), then the slit value can be reduced accordingly in order to avoid damage to the detector.

2. The substrate mica sheet should be first peeled off with tape to expose a fresh surface.

3. The accessibility of polarizable force fields has been increased since the debut of FFParam [34], through which users can generate and optimize polarizable force fields compatible with the Drude force field [20]. If polarizable force fields are available for all components of the simulation system or the user has access to CGenFF [35, 36], FFParam [34], and Gaussian [37] or Psi4 [38], then the use of polarizable force fields would be recommended as in reference [13].

4. The use of multiple, replicate simulations with different initial configurations and conditions can be advantageous to check for reproducibility of computational results across all runs. Additionally, replicate simulations can allow for the analysis of statistical errors or reveal any “trapping” (failure to explore important configurations outside of an energetic well) within the simulations [39].

5.
If, within the simulations, the molecules and ions spend a large portion of the simulation in remote parts of the solvent box without interacting with other molecules or ions within the simulation, then the size of the solvent box may be decreased as a means to facilitate and accelerate the self-assembly process.

6. Convergence of MD simulations can be checked throughout all the structural and energetic analysis. Plotting geometric properties such as radius of gyration of the entire system, number of clusters formed, or the types of interactions formed as a function of time can provide a visual indication of whether the plotted values become steady as the simulations progress. Likewise, plotting the running average energy of the system can also indicate whether longer simulation times are needed.

7. The interactions tracked in the simulations of cyclo-HH self- and co-assembly were guided by experimentally resolved crystal structures and visual inspection of the simulation trajectories. This helped to ensure that elements of the crystal structure were reproduced in the simulations.

8. Each interaction type number corresponds to a specific interaction between atoms belonging to bonded molecules or ions.

9. To examine clusters containing a particular interaction type or set of interaction types, “pairs.dat” can be processed to only include the interaction types of interest prior to the detection of clusters. In this way, the molecules or ions in the detected clusters will be “connected” through the interactions of interest. This can be particularly useful when searching for ordered structures within the simulations. To examine clusters of a certain composition of molecules or ions, for example, 50% cyclo-HH and 50% Zn(II) ions, “cluster.dat” can be processed to isolate the clusters with the desired composition.

10. The choice of the probe radius can affect the calculated surface area.
It is recommended that the user consult the literature to set the probe radius for the solvent under investigation.

11. The conclusions derived from the results of structural analysis can be verified with the results of the energetic analysis, and vice versa. If the results do not align, other potential thermodynamic pathways could be evaluated in the energetic analysis, and any user-defined criteria of the structural analysis could be compared with visual inspection of the simulation trajectories.

12. The energy calculations represent “bounds of actual scenarios,” since in reality, the molecules and ions of the cluster cannot be fully immersed in pure solvent (methanol and isopropanol in the references [13] and [14], respectively) or be in the dielectric environment of the formed cluster. However, such calculations can provide insights into the pathways of co-assembly and be cross-validated with the results of the structural analysis.

Acknowledgments

A.A.O. acknowledges the Texas A&M University Graduate Diversity Fellowship from the Texas A&M University Graduate and Professional School. All MD simulations and computational analysis were conducted using the Ada supercomputing cluster at the Texas A&M High Performance Research Computing Facility, and additional facilities at Texas A&M University. E.G. acknowledges support in part by the European Research Council under the European Union Horizon 2020 research and innovation program (No. 694426). E.G. also acknowledges support from NSF-BSF Joint Funding Research Grants (No. 2020752). Y.C. gratefully acknowledges the Center for Nanoscience and Nanotechnology of Tel Aviv University for financial support. P.T. acknowledges support from the National Science Foundation (Award Number 2104558; NSF-BSF: Computational and Experimental Design of Novel Peptide Nanocarriers for Cancer Drugs).

References

1. Wang H, Feng Z, Xu B (2017) Bioinspired assembly of small molecules in cell milieu.
Chem Soc Rev 46:2421–2436
2. Wei G, Su Z, Reynolds NP, Arosio P, Hamley IW, Gazit E, Mezzenga R (2017) Self-assembling peptide and protein amyloids: from structure to tailored function in nanotechnology. Chem Soc Rev 46:4661–4708
3. DeGrado WF, Wasserman ZR, Lear JD (1989) Protein design, a minimalist approach. Science 243:622–628
4. Reches M, Gazit E (2003) Casting metal nanowires within discrete self-assembled peptide nanotubes. Science 300:625–627
5. Gazit E (2007) Self assembly of short aromatic peptides into amyloid fibrils and related nanostructures. Prion 1:32–35
6. Yemini M, Reches M, Gazit E, Rishpon J (2005) Peptide nanotube-modified electrodes for enzyme-biosensor applications. Anal Chem 77:5155–5159
7. Handelman A, Kuritz N, Natan A, Rosenman G (2016) Reconstructive phase transition in ultrashort peptide nanostructures and induced visible photoluminescence. Langmuir 32:2847–2862
8. Guo C, Arnon ZA, Qi R, Zhang Q, Adler-Abramovich L, Gazit E, Wei G (2016) Expanding the nanoarchitectural diversity through aromatic di- and tri-peptide coassembly: nanostructures and molecular mechanisms. ACS Nano 10:8316–8324
9. Nikitin T, Kopyl S, Shur VY, Kopelevich YV, Kholkin AL (2016) Low-temperature photoluminescence in self-assembled diphenylalanine microtubes. Phys Lett A 380:1658–1662
10. Guo C, Luo Y, Zhou R, Wei G (2014) Triphenylalanine peptides self-assemble into nanospheres and nanorods that are different from the nanovesicles and nanotubes formed by diphenylalanine peptides. Nanoscale 6:2800–2811
11. Tao K, Fan Z, Sun L, Makam P, Tian Z, Ruegsegger M, Shaham-Niv S, Hansford D, Aizen R, Pan Z, Galster S, Ma J, Yuan F, Si M, Qu S, Zhang M, Gazit E, Li J (2018) Quantum confined peptide assemblies with tunable visible to near-infrared spectral range. Nat Commun 9:3217
12. Barondeau DP, Kassmann CJ, Tainer JA, Getzoff ED (2002) Structural chemistry of a green fluorescent protein Zn biosensor. J Am Chem Soc 124:3522–3524
13.
Tao K, Chen Y, Orr AA, Tian Z, Makam P, Gilead S, Si M, Rencus-Lazar S, Qu S, Zhang M, Tamamis P, Gazit E (2020) Enhanced fluorescence for bioassembly by environment-switching doping of metal ions. Adv Funct Mater 30:1909614
14. Chen Y, Orr AA, Tao K, Wang Z, Ruggiero A, Shimon LJW, Schnaider L, Goodall A, Rencus-Lazar S, Gilead S, Slutsky I, Tamamis P, Tan Z, Gazit E (2020) High-efficiency fluorescence through bioinspired supramolecular self-assembly. ACS Nano 14:2798–2807
15. Brooks BR, Brooks CL, Mackerell AD, Nilsson L, Petrella RJ, Roux B, Won Y, Archontis G, Bartels C, Boresch S, Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer S, Gao J, Hodoscek M, Im W, Kuczera K, Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor RW, Post CB, Pu JZ, Schaefer M, Tidor B, Venable RM, Woodcock HL, Wu X, Yang W, York DM, Karplus M (2009) CHARMM: the biomolecular simulation program. J Comput Chem 30:1545–1614
16. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14:33–38, 27–28
17. Sterling T, Irwin JJ (2015) ZINC 15 – ligand discovery for everyone. J Chem Inf Model 55:2324–2337
18. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L, Zhang J, Bolton EE (2019) PubChem 2019 update: improved access to chemical data. Nucleic Acids Res 47:D1102–D1109
19. ChemAxon (n.d.) ChemAxon MarvinSketch. Version 17.1.30. http://www.chemaxon.com. Accessed 18 Dec 2020
20. Lin F-Y, Huang J, Pandey P, Rupakheti C, Li J, Roux BT, MacKerell AD (2020) Further optimization and validation of the classical Drude polarizable protein force field. J Chem Theory Comput 16:3221–3239
21. Tamamis P, Kasotakis E, Archontis G, Mitraki A (2014) Combination of theoretical and experimental approaches for the design and study of fibril-forming peptides. Methods Mol Biol 1216:53–70
22.
Lee J, Cheng X, Swails JM, Yeom MS, Eastman PK, Lemkul JA, Wei S, Buckner J, Jeong JC, Qi Y, Jo S, Pande VS, Case DA, Brooks CL, MacKerell AD, Klauda JB, Im W (2016) CHARMM-GUI Input Generator for NAMD, GROMACS, AMBER, OpenMM, and CHARMM/OpenMM Simulations Using the CHARMM36 Additive Force Field. J Chem Theory Comput 12:405–413
23. Jo S, Kim T, Iyer VG, Im W (2008) CHARMM-GUI: a web-based graphical user interface for CHARMM. J Comput Chem 29:1859–1865
24. Mayer SW (1963) A molecular parameter relationship between surface tension and liquid compressibility. J Phys Chem 67:2160–2164
25. Tang KE, Bloomfield VA (2000) Excluded volume in solvation: sensitivity of scaled-particle theory to solvent size and density. Biophys J 79:2222–2234
26. Seeber M, Cecchini M, Rao F, Settanni G, Caflisch A (2007) Wordom: a program for efficient analysis of molecular dynamics simulations. Bioinformatics 23(19):2625–2627
27. Seeber M, Felline A, Raimondi F, Muff S, Friedman R, Rao F, Caflisch A, Fanelli F (2011) Wordom: a user-friendly program for the analysis of molecular structures, trajectories, and free energy surfaces. J Comput Chem 32:1183–1194
28. Gohlke H, Case DA (2004) Converging free energy estimates: MM-PB(GB)SA studies on the protein-protein complex Ras-Raf. J Comput Chem 25:238–250
29. Hayes JM, Archontis G (2012) MM-GB(PB)SA calculations of protein-ligand binding free energies. Molecular dynamics - studies of synthetic and biological macromolecules
30. Im W, Lee MS, Brooks CL (2003) Generalized Born model with a simple smoothing function. J Comput Chem 24:1691–1702
31. Wohlfarth C (2015) Static dielectric constants of pure liquids and binary liquid mixtures: supplement to volume IV/17
32. Khimenko MT, Litinskaya VV, Khomenko GP (1982) Effect of concentration on the polarizability of isopropyl alcohol in dimethyl sulfoxide. Zh Fiz Khim 56:867–870
33.
Zhang J, Zhang H, Wu T, Wang Q, van der Spoel D (2017) Comparison of implicit and explicit solvent models for the calculation of solvation free energy in organic solvents. J Chem Theory Comput 13:1034–1043
34. Kumar A, Yoluk O, MacKerell AD (2020) FFParam: Standalone package for CHARMM additive and Drude polarizable force field parametrization of small molecules. J Comput Chem 41:958–970
35. Vanommeslaeghe K, MacKerell AD (2012) Automation of the CHARMM General Force Field (CGenFF) I: bond perception and atom typing. J Chem Inf Model 52:3144–3154
36. Vanommeslaeghe K, Raman EP, MacKerell AD (2012) Automation of the CHARMM General Force Field (CGenFF) II: assignment of bonded parameters and partial atomic charges. J Chem Inf Model 52:3155–3168
37. Frisch MJ, Trucks GW, Schlegel HB, Scuseria GE, Robb MA, Cheeseman JR et al (2016) Gaussian 03. Gaussian, Inc., Wallingford, CT
38. Parrish RM, Burns LA, Smith DGA, Simmonett AC, DePrince AE, Hohenstein EG, Bozkaya U, Sokolov AY, Di Remigio R, Richard RM, Gonthier JF, James AM, McAlexander HR, Kumar A, Saitow M, Wang X, Pritchard BP, Verma P, Schaefer HF, Patkowski K, King RA, Valeev EF, Evangelista FA, Turney JM, Crawford TD, Sherrill CD (2017) Psi4 1.1: an open-source electronic structure program emphasizing automation, advanced libraries, and interoperability. J Chem Theory Comput 13:3185–3197
39. Grossfield A, Zuckerman DM (2009) Quantifying uncertainty and sampling quality in biomolecular simulations. Annu Rep Comput Chem 5:23–48

Chapter 11

Computational Tools and Strategies to Develop Peptide-Based Inhibitors of Protein-Protein Interactions

Maxence Delaunay and Tâp Ha-Duong

Abstract

Protein-protein interactions play crucial and subtle roles in many biological processes, and modifications of their fine mechanisms generally result in severe diseases.
Peptide derivatives are very promising therapeutic agents for modulating protein-protein associations, with sizes and specificities between those of small compounds and antibodies. For the same reasons, the rational design of peptide-based inhibitors naturally borrows and combines computational methods from both the protein-ligand and protein-protein research fields. In this chapter, we aim to provide an overview of the computational tools and approaches used for identifying and optimizing peptides that target protein-protein interfaces with high affinity and specificity. We hope that this review will help readers implement appropriate in silico strategies for peptide-based drug design that build on the information available for the systems of interest.

Key words Sequence-based peptide design, Peptide conformation-based methods, Protein-peptide interface characterization, Peptide hit identification and optimization

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_11, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

Association and dissociation of proteins are molecular events at the basis of many crucial cellular processes. Therefore, perturbation of protein interaction networks generally leads to severe human diseases such as cancer or degenerative diseases. Infectious diseases also involve interactions between host and pathogen proteins [1]. Accordingly, protein-protein interactions (PPIs) have become the target of an increasing number of modulator molecules, both for therapeutic purposes and as chemical biology tools to study protein interactions [2]. Notably, one advantage of targeting PPIs rather than single proteins is a reduced probability of drug resistance. Indeed, since protein-protein interfaces are highly complementary, a mutation in one protein would require a second complementary mutation in its partner to preserve their association, which is very unlikely [3].

Despite the generally large protein surfaces involved in PPIs, several small molecules have been successfully developed for inhibiting protein-protein associations [4–6]. Nonetheless, just as small molecules competitively inhibit enzymes by mimicking endogenous substrates, peptide derivatives which mimic peptide segments at protein-protein interfaces should be highly specific lead compounds. Moreover, many protein-protein associations, particularly in signaling pathways, are characterized by low affinities [7]. This leaves considerable room for developing peptidic binders with higher affinity. For these reasons, peptides are very promising starting points for deriving potent and selective PPI inhibitors [8, 9].

However, peptides exhibit well-known in vivo issues which have to be addressed in drug development projects: they are easily degraded by proteolytic enzymes, they generally have poor membrane permeability, and they can induce unwanted immune responses [10]. These drawbacks can be limited by reducing their peptidic nature with various approaches such as cyclization, N-methylation, or incorporation of non-natural amino acids. The resulting peptide derivatives are expected to retain potent activity on the targeted protein-protein interactions, but with improved pharmacokinetic properties. Nonetheless, the daunting task in designing peptide derivatives to modulate protein-protein associations remains to find their optimal sequence of natural or non-natural amino acids.
To assist medicinal chemists and chemical biologists in this task, many computational approaches have been developed over the past decades. We attempt here to categorize the various in silico strategies that use these computational tools to design peptide-based molecules modulating PPIs. As with the design of small molecules inhibiting enzymes or receptors, strategies to develop therapeutic peptide derivatives can be classified into ligand-based and structure-based approaches. In the first case, a peptide segment with interesting bioactivity has been identified but its target protein remains unknown. Peptide derivatives with improved properties are then searched for by similarity with the initial peptide segment. In the second case, the structure of the protein-protein complex is partially or entirely known, and the strategies consist of finding peptide derivatives with optimal structural and chemical complementarity to the targeted protein surfaces. Accordingly, the present chapter is organized into two parts, the peptide-based and the target-based strategies.

2 Peptide-Based Strategies

In ligand-based virtual screening, the search for small molecules similar to a query compound consists of comparing molecular descriptors, which encode chemical and structural information as numbers. For example, molecular fingerprints encode in bit vectors the presence in molecules of particular substructures or fragments [11], and pharmacophore models capture in three-dimensional arrangements the chemical features of a ligand that are necessary for its bioactivity [12]. For peptides, chemical information is encoded in their sequence, and conformational information mainly lies in their secondary structures. Accordingly, we have subdivided this section into sequence-based and conformation-based strategies.
2.1 Sequence-Based Approaches

Various properties of a peptide can be inferred by searching for homologous sequences in relevant databases. This can be routinely performed by using sequence alignment methods, such as the basic local alignment search tool (BLAST) [13]. For example, BLAST searches in the database of essential genes (DEG) [14] or the protein subcellular localization database (PSORTdb) [15] can rapidly provide useful information regarding the biological function of query protein sequences to identify novel targets [16]. Likewise, many databases dedicated to peptides, such as the database of antimicrobial activity and structure of peptides (DBAASP) [17] or the database of bioactive peptides (BIOPEP-UWM™) [18], can be mined using BLAST to obtain information about the biological and therapeutic activities of query peptide sequences.

However, similarity searches based on a purely sequential representation of proteins or peptides can fail in the case of low-coverage or low-quality sequence alignments. Thus, alternative descriptors of protein and peptide chemical information can be used, such as the amino acid composition (AAC), which is simply the occurrence frequencies of the 20 native amino acids in the sequence, the dipeptide composition, or the pseudo amino acid composition (PseAAC), which additionally includes some sequence-order information via correlation factors of various chemical properties between (i,i+1), (i,i+2), and (i,i+3) pairs of amino acids [19]. By comparing these sequence descriptors with those of proteins and peptides in relevant databases, it is again possible to predict some of the functional, physical, chemical, and structural features of a query sequence [20].
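The descriptors mentioned above can be sketched in a few lines of Python. The block below is a minimal illustration, not the implementation used in the cited studies. The function names are ours, and the PseAAC-like vector is simplified to a single property (the Kyte-Doolittle hydropathy scale), whereas the original PseAAC formulation combines several properties. The default weight of 0.05 and the three correlation tiers (matching the (i,i+1), (i,i+2), and (i,i+3) pairs) are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale, used here as the single property
# for the sequence-order terms (a simplifying assumption).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def aac(seq):
    """Amino acid composition: 20 occurrence frequencies."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Frequencies of the 400 possible adjacent residue pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def _standardize(scale):
    """Shift and scale a property table to zero mean and unit variance."""
    mean = sum(scale.values()) / len(scale)
    sd = (sum((v - mean) ** 2 for v in scale.values()) / len(scale)) ** 0.5
    return {k: (v - mean) / sd for k, v in scale.items()}


def pse_aac(seq, lam=3, weight=0.05):
    """Simplified PseAAC: 20 composition terms + lam sequence-order terms."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    h = _standardize(KD)
    n = len(seq)
    # theta_k: mean squared property difference between residues i and i+k
    thetas = [sum((h[seq[i]] - h[seq[i + k]]) ** 2 for i in range(n - k)) / (n - k)
              for k in range(1, lam + 1)]
    freqs = aac(seq)
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


if __name__ == "__main__":
    peptide = "ACDKLLGFW"  # arbitrary example sequence
    print(len(aac(peptide)), len(dipeptide_composition(peptide)), len(pse_aac(peptide)))
```

The sketch assumes sequences written with the 20 standard one-letter codes; non-standard residues would need an extended property table.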
The common thread of sequence-based predictions of peptide bioactivity is first to build a training set of peptides experimentally validated as having or lacking the desired property, then to choose some peptide chemical descriptors, and finally to train a machine learning algorithm to determine the relevant chemical features which separate the desired from the non-desired peptides (Fig. 1). In a second stage, the trained machine learning algorithm is applied to unknown data sets of peptides to predict those with desired and non-desired properties according to the most discriminating features. In the subsections below, we highlight some studies which applied these sequence-based approaches to predict peptide bioactivity.

Fig. 1 Schematic description of sequence-based prediction of peptide therapeutic property through machine learning algorithms

2.1.1 Prediction of Peptide Therapeutic Property

Many studies have applied this general framework to identify peptides active against almost all major classes of pathology. Among them, the prediction of anticancer peptides has attracted great interest from several research groups [21–23]. By building specialized data sets of antiangiogenic peptides, other predictors were developed for identifying more specific peptides that inhibit angiogenesis, a promising cancer treatment [24, 25]. A second major class of diseases for which sequence-based searches for therapeutic peptides have been reported is infectious diseases. Depending on the property covered by the training data sets, several models were developed to predict antimicrobial [26], antibacterial [27], or antiviral [28] peptides.

To avoid peptide-induced unwanted immune responses or, conversely, to design peptide-based vaccines, it can be useful to anticipate peptide immunogenicity.
To this end, several studies have combined data sets of binders and non-binders to the major histocompatibility complex (MHC) proteins with various machine learning algorithms to predict whether a peptide is likely to be a T-cell epitope presented on the cell surface [29, 30]. It is worth noting that most recent peptide training data sets were built from the Immune Epitope Database (IEDB) [31], which also includes biological data about peptides involved in inflammatory disorders. Thus, by extracting data sets of peptides which trigger the secretion of inflammatory cytokines, it is possible to develop predictors of peptides with pro- or anti-inflammatory properties [32, 33].

Finally, it should be mentioned that these sequence-based methods can be applied to investigate virtually any peptide property, provided that high-quality training data sets can be built from validated peptides having and not having the desired property [34]. They have been widely used, for example, to predict the capability of peptides to cross membranes and penetrate into cells [35–38]. Another peptide property which can be investigated by sequence-based approaches is the capability to bind proteins. The studies which used such methods to predict protein-peptide interactions are reviewed in the next subsection.

2.1.2 Prediction of Protein-Peptide Interactions

Three different levels of detail about protein-peptide interactions can be obtained using sequence-based approaches [39]. The first one is to identify proteins that bind a query sequence. In that case, it is first necessary to build training data sets of peptides which bind proteins and of others which do not. Then, machine learning algorithms are trained to discriminate binders from non-binders by using relevant peptide chemical descriptors.
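The train-then-predict workflow described above can be illustrated with a deliberately simple classifier. The sketch below uses a nearest-centroid rule on amino acid composition vectors as a stand-in for the support vector machines, random forests, or neural networks typically used in practice; it is not the method of any cited study. The sequences are invented toy data with no experimental meaning, and all function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition descriptor (20 frequencies)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(binders, non_binders):
    """'Training' here is just storing one centroid per class."""
    return {"binder": centroid([aac(s) for s in binders]),
            "non-binder": centroid([aac(s) for s in non_binders])}


def predict(model, seq):
    """Assign the class whose centroid is closest in descriptor space."""
    x = aac(seq)
    return min(model, key=lambda label: sq_dist(x, model[label]))


# Invented toy sequences (NOT experimental data): the "binders" are
# lysine/arginine-rich and the "non-binders" aspartate/glutamate-rich.
TOY_BINDERS = ["KKLRKAIRKL", "RRWKKLLKRA", "KLKKAKRLKK"]
TOY_NON_BINDERS = ["DDEELSGDEE", "EEDGSDDEAE", "DSEEDDLEEG"]

if __name__ == "__main__":
    model = train(TOY_BINDERS, TOY_NON_BINDERS)
    for query in ("KRKLAKKRLL", "EDDEESGDDE"):
        print(query, predict(model, query))
```

In a real study, the descriptor, the learning algorithm, and above all the composition of the positive and negative training sets would have to be chosen and validated (for example by cross-validation and an independent test set), which this toy example does not attempt.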
It should be noted that many variants of this approach were originally developed for large-scale predictions of protein-protein interactions, with a view to deciphering the interactomes of various organisms [40–42]. They were also applied to predict interactions between human and bacterial or viral proteins by using relevant data sets of host-pathogen protein-protein interactions [43–45].

Besides these genome-scale predictions, several sequence-based studies have been conducted to specifically identify peptide segments that mediate protein-protein associations. These peptides can be classified into two types: the short linear motifs (SLiMs), which generally have fewer than ten residues, a few highly conserved ones, and no particular secondary structure, and the molecular recognition features (MoRFs), which have a longer sequence and generally undergo a disorder-to-order transition upon binding. Accordingly, several sequence-based algorithms were specifically developed for mining SLiMs [46–48] and others for detecting MoRFs [49–51] in various data sets of protein-protein complexes. It is worth noting that SLiMs generally bind to specific recognition modules such as SH3 or PDZ domains. Therefore, many specialized peptide motif predictors were trained and developed on specific data sets of these prevalent domains [46, 47, 52–54].

The second level of information that can be investigated with sequence-based approaches is the identification of amino acid residues at protein-peptide interfaces. Just like homology modeling, which aims at predicting protein tertiary structures from sequences and data sets of experimentally resolved structures, comparative studies can also be used for determining protein-protein interfaces by searching for homologs of query sequences in data sets of known protein-protein complexes [55, 56].
Nonetheless, instead of searching for homologous proteins with sequence alignment tools such as BLAST, most sequence-based predictions of interface residues employ numerical vectors encoding key chemical features of protein sequences, together with various machine learning algorithms. The latter are generally trained on data sets of interface and non-interface residues built from known protein-peptide complexes. Commonly, interface residues are defined as those whose solvent accessible surface area (SASA) decreases by more than 1 Å² upon binding, or as those lying within a threshold distance of the protein partner. It is worth noting that most sequence-based machine learning approaches were applied to predict the residues of a protein which are likely to be involved in binding any other protein partner [57–59]. Nonetheless, several other studies tackled the problem of predicting the interface residues on both proteins of specific complexes [60–62], providing valuable detailed information on protein-protein binding modes, especially for transient complexes and those mediated by SLiMs and MoRFs.

Regarding protein-peptide interfaces specifically, very few studies based on sequences only have been reported in the literature. We found only two sequence-based predictors of protein-peptide binding sites, namely SPRINT [63] and SVMpep [64], both using support vector machines (SVM) to classify binding and non-binding residues. It can be noted that, among the input physicochemical features of amino acid residues, SVMpep includes intrinsic disorder information predicted by the IUPred web server [65], which seems to improve the prediction accuracy.

Finally, the third level of information that can be inferred from sequence-based approaches is the protein-peptide binding affinity.
Although machine learning classifiers were developed to discriminate protein-protein interactions with low or high affinity [66, 67], quantitative predictions of binding free energies (ΔG) from sequences are generally performed with machine learning regression methods. Using training data sets of experimental binding free energies of known protein-protein complexes and various sequence descriptors, the correlations obtained between predicted and experimental affinities vary widely, ranging from 0.3 to 0.8 with an average value around 0.6, depending on the selected sequence features and the external data sets used for testing [67–71]. However, these approaches seem to perform better in predicting changes in binding free energies (ΔΔG) upon mutations in one of the two partner sequences [52, 68]. Indeed, thanks to data sets of experimental ΔΔG values for mutations at the interface of protein-protein complexes [72, 73], different machine learning regression methods yielded correlations between predicted and experimental changes in binding free energies in the narrower range of 0.7–0.9 on various tested data sets [52, 68, 74–76]. It should be noted that most of these studies trained their machine learning algorithms with descriptors extracted from three-dimensional structures of protein-protein complexes. Only two purely sequence-based predictors have so far been reported in the literature [77, 78]. Moreover, it is important to mention that when predictors are blind tested on completely independent data sets of protein-protein complexes, the correlations between predicted and experimental ΔΔG upon mutation drop significantly, to a range of 0.3–0.6 [74–78], indicating that there is still room for improving these predictors.
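The correlations quoted above compare predicted and experimental values over a test set. As a minimal sketch, the change in binding free energy upon mutation can be written as ΔΔG = ΔG(mutant) − ΔG(wild type), a sign convention we assume here since the text does not state one, and the agreement between two lists of values can be measured with the Pearson correlation coefficient. The function names are ours, and the numbers in the usage example are made-up placeholders, not data from the cited studies.

```python
import math


def ddg(dg_wild_type, dg_mutant):
    """Change in binding free energy upon mutation (assumed convention:
    positive values mean the mutation weakens binding)."""
    return dg_mutant - dg_wild_type


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values in kcal/mol, invented for illustration only.
    experimental = [0.2, 1.5, 0.8, 2.9, 0.1]
    predicted = [0.5, 1.1, 1.0, 2.2, 0.4]
    print(round(pearson_r(predicted, experimental), 2))
```

A high correlation on the training set says little by itself; as the text notes, the value obtained on a completely independent test set is the more meaningful figure.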
Lastly, although these sequence-based approaches were mainly developed for protein-protein complexes, some of them have been applied to protein-peptide interactions, including PDZ-peptide associations [52, 68, 71] and complexes of MDM2 with the p53 MoRF [74, 75]. These studies pave the way for the investigation and design of peptide sequences with optimal binding free energies for target proteins.

2.2 Conformation-Based Approaches

It is now recognized that the sequence of a protein determines its three-dimensional conformational ensemble, which in turn confers its biological activity. Therefore, many developments of peptide derivatives have been based on or oriented toward the structural properties of identified bioactive peptides. We highlight here two main in silico conformation-based approaches to discover or design new peptide derivatives: peptide pharmacophore screening and the stabilization of secondary structure mimics (Fig. 2).

Fig. 2 Schematic description of a peptide pharmacophore screening method (left) and molecular simulation use for predicting pre-organized conformations of a constrained peptide (right)

2.2.1 Peptide Pharmacophore-Based Screening

Among the ligand-based approaches in drug discovery, pharmacophore virtual screening is an efficient and popular computational tool which can harness the knowledge of peptide conformations. Indeed, when a peptide segment is known to bind a target, the residue side chains that are important for binding (hot spots) naturally allow 3D pharmacophore models to be generated. These, in turn, are used to screen compound libraries and identify new binders with a similar three-dimensional pharmacophoric arrangement. It should be mentioned that, although such drug developments are centered on a known peptide ligand, they often require knowledge of its three-dimensional structure when bound to its target, which makes these approaches not purely ligand-based.
This method was employed to discover inhibitors of several protein-protein complexes [79–81], notably ones involved in host-pathogen interactions [82–84]. Nonetheless, it should be noted that, after defining the peptide-based pharmacophore models, these studies often screened libraries of commercially available small compounds, which generally leads to hits that are far from being peptides.

2.2.2 Constrained Secondary Structure Mimics

Since many protein-protein interactions are mediated by peptide segments which are structured into α-helices, β-strands, or turns, a promising drug design strategy is to stabilize or constrain the unbound state of the peptide in these common secondary structures, in order to minimize the entropy cost of binding and improve the affinity [85]. This can be achieved by two main approaches, either peptide cyclization or backbone stiffening. The first approach includes α-helix stapling, which consists of linking the side chains of two residues located on the same side of an α-helix with, for example, hydrocarbon, lactam, or triazole staples [86, 87], and β-sheet closure, which consists of linking the two proximate residues at the extremities of a pair of β-strands using hairpin loops or β-turn mimics [88, 89]. On the other hand, the backbone stiffening approach generally consists of inserting a chemical modification into the peptide backbone, such as disubstitution of the α-carbon [90] or substitution of the amide nitrogen [91, 92], in order to restrain its accessible conformational space.

In both strategies, a particularly helpful computational tool which can assist the design of these constrained peptides is molecular dynamics (MD) simulation. This technique numerically solves Newton’s equations of motion for a system of particles whose interactions are described by empirical potential functions usually referred to as force fields.
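To make the idea of numerically solving Newton’s equations concrete, the sketch below integrates a single particle in one dimension with the velocity Verlet scheme, a time-stepping algorithm widely used in MD. It is a toy illustration only: the harmonic force stands in for a real force field, units are arbitrary, and the function name and parameters are ours rather than those of any MD package.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * d2x/dt2 = force(x) and return the list of (x, v) states."""
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


def harmonic_force(x, k=1.0):
    """Toy 'force field': a spring with force constant k (arbitrary units)."""
    return -k * x


if __name__ == "__main__":
    traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                           mass=1.0, dt=0.01, n_steps=1000)
    x_end, v_end = traj[-1]
    energy = 0.5 * v_end ** 2 + 0.5 * x_end ** 2
    # For this oscillator the exact solution is x(t) = cos(t).
    print(x_end, math.cos(10.0), energy)
```

In a real simulation the force is the negative gradient of the force field energy over all atoms, and the time step is limited by the fastest motions in the system, which is one reason why long timescales are costly to reach.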
When their timescales are sufficiently long, MD simulations can efficiently sample peptide conformational ensembles and correctly predict their propensity to form secondary structures [93–95]. It should be noted that enhanced sampling techniques, such as replica exchange molecular dynamics [96] or metadynamics simulations [97], can also be used to generate more exhaustive conformational ensembles, especially for constrained cyclic peptides. Hence, more and more peptide derivative developments include MD studies to anticipate the impact of chemical modifications on the stabilization of secondary structures, as briefly presented below.

Regarding stapled helices, molecular simulations generally confirmed that they have a more restricted conformational space than their non-stapled counterparts, but that they still keep a high degree of conformational flexibility [98–100]. Importantly, these studies demonstrated that staples do not necessarily increase the helical propensity (or helicity) of stapled peptides, which seems to result from a fine balance between the peptide sequence and the position, length, and chemical nature of the staple [100–102]. Also, MD studies of stapled helices in free and bound states emphasized that high helicity of stapled peptides does not necessarily correlate with high binding affinity [98, 100, 101]. This could be because the peptides need sufficient flexibility to adjust their structure in the partner binding site, and/or because staples participate in the binding and therefore modulate it [103].

Enhanced molecular simulations of cyclic peptides mimicking turns or β-structures also showed that cyclization does reduce the heterogeneity of their conformations, but still allows a significant amount of flexibility [104–107].
Notably, cyclic backbones can still sample multiple conformational states, from compact to elongated structures, and a major question raised in these studies is whether, among them, there is a pre-organized one close to a bioactive conformation [105, 108–111]. If such a bioactive conformation can be identified within the peptide conformational ensemble, then additional chemical modifications of the peptide backbone, such as α-carbon disubstitution or N-methylation, can be introduced to shift the conformational equilibrium in its favor. Here again, enhanced MD simulations can help to rationalize and optimize the impact of these modifications on the conformational space of peptide derivatives [92, 110, 112, 113]. Altogether, these peptide conformation-based studies can guide chemists away from less interesting modulators and thereby limit the costs of long synthesis campaigns.

3 Target-Based Strategies

Target-based approaches for designing peptides that modulate protein-protein interactions require gathering structural information about the studied complexes. In many cases, these data are difficult to obtain experimentally because of technical limitations, but also because of the low affinity and/or the transient character of many protein-protein associations [114]. In that context, several computational tools have been developed in recent decades to gain better insight into the structural determinants of these interactions. In this section, we first describe different in silico approaches to investigate the interface of protein-peptide complexes, by collecting data about cavities and hot spots or by performing protein-peptide docking. Next, we show how these tools and information can help the rational design of regulatory peptides, by finding minimal recognition motifs or by virtual screening of peptide libraries. We also discuss the optimization methods used to enhance the affinity and specificity of these compounds.
3.1 Structural Characterization of Protein-Peptide Interfaces

When a protein-peptide binding mode is unknown but the tertiary structure of the unbound protein is resolved, it is possible to anticipate the ligand binding sites on the protein by predicting its cavities and/or the few amino acids that predominantly contribute to the binding free energy (hot spots). It is also possible to model the three-dimensional structure of the complex with protein-peptide docking techniques (Fig. 3).

3.1.1 Cavity Detection

In classical structure-based drug design, an exploration of druggable cavities on a protein surface is generally performed prior to chemical library virtual screening or fragment-based design approaches [115]. This can be done with the web servers CASTp [116] or FPOCKET [117], for example. However, as far as we know, these algorithms have mostly been applied to detect protein cavities for small ligands, and not peptide binding sites, which are wider and more difficult to identify.

Fig. 3 Computational approaches that can provide structural information about a protein-peptide interface. Computational alanine-scanning is one method to identify hot spots from the three-dimensional structure of protein-peptide complexes

We found in the literature only one study which investigates the binding pocket on a protein involved in protein-peptide interactions [118]. Using accelerated molecular dynamics simulations and a pocket identification method called VISM-CFA [119], the authors characterized the dynamic behavior of the Bad peptide binding site on the Bcl-xL protein. They showed that the binding pocket of the unbound protein is often in a non-druggable closed state with a volume below 100 Å³. Nevertheless, they could also identify minor conformations of apo Bcl-xL (10%) with a more open binding pocket which could accommodate the Bad peptide or small ligands [118].
This study reminds us that the detection of druggable pockets on an unbound protein should preferably be performed on its conformational ensemble rather than on a single structure.

3.1.2 Hot Spot Identification

As mentioned in the sequence-based section, hot spots of protein-protein complexes can be determined with machine learning algorithms trained on data sets of interface and non-interface residues. The sequence descriptors used in those cases are generally intrinsic properties of the protein amino acids (polarity, hydrophilicity, hydrophobicity, etc.). However, the accuracy of these predictors can be greatly improved by including structural properties such as residue solvent accessible surface areas or inter-residue distances in known tertiary and quaternary protein structures [120]. Thus, several high-performance hot spot predictors using protein three-dimensional structures have been developed and successfully applied to many protein-protein complexes [121–125].

Alternative physics-based or energy-based methods were also developed to identify hot spots on protein surfaces. Most of them use fragment-based approaches which aim at determining the preferential binding sites of small organic probes on known protein structures. This can be achieved by running docking calculations of small compounds into target cavities with classical protein-ligand docking programs, such as Gold [126] or Autodock Vina [127], as demonstrated in the study by Wang et al. of human activin receptor hot spots [128]. Another possibility for exploring fragment binding sites on proteins is to use molecular simulations, such as the grand canonical ensemble Monte Carlo simulations employed by Kulp III et al. to identify hot spots on various proteins, including lysozyme, RecA, HIV protease, dihydrofolate reductase, elastase, MDM2, and peptide deformylase [129, 130].
Importantly, these studies indicate that hot spots are predicted more accurately by locating high affinity binding sites for organic fragments which are also low affinity binding sites for water molecules.

Experimentally, protein hot spots can be identified by alanine-scanning mutagenesis [131]. In the same spirit, they can be predicted by computational alanine-scanning (CAS). Starting from a known quaternary structure of a protein-peptide complex, the technique consists in estimating the binding free energy change (ΔΔG) upon mutation of interface residues into alanine. Mutations that significantly impair the protein-peptide binding energy identify the hot spots. This general scheme was implemented in several molecular modeling software packages, such as Rosetta (Flex_ddG) [132] or BUDE (BudeAlaScan) [133]. The main difference between these programs lies in the methods used to compute binding free energies, which can be fast empirical energy functions, MM/PBSA calculations, or thermodynamic integration [134]. Thanks to its speed and low cost, computational alanine-scanning was applied to identify hot spots of many protein-protein interactions [135–139], including, recently, the binding of the SARS-CoV-2 spike glycoprotein to host ACE2 receptors.

3.1.3 Protein-Peptide Docking

Knowledge of the three-dimensional structure of a targeted protein-protein complex is invaluable information for the structure-based design of PPI inhibitors. When only the tertiary structures of the two unbound partners are known, protein-protein or protein-peptide docking are the main computational tools to generate structural models of their binding mode. The first protein-protein docking programs commonly consider proteins as rigid bodies. They generally consist of two or three steps. First, the shape complementarity between the two protein structures is optimized [140, 141].
Then the resulting quaternary structures are re-scored by taking into account physical criteria such as electrostatic interactions, van der Waals interactions, or desolvation energies [142, 143]. Frequently, these two steps are performed simultaneously. Generally, they are followed by a third step consisting of molecular dynamics simulations to allow local relaxation of the protein-protein interface. Naturally, rigid docking methods are not appropriate for highly flexible proteins, especially for those which bind their partner through peptide segments such as SLiMs or MoRFs. In these cases, protein-peptide docking programs should be preferred since they take into account the peptide flexibility at an early stage.

As for hot spot predictions, one can distinguish knowledge-based from physics-based protein-peptide docking. In knowledge-based approaches, also called template-based docking, the protein structure and peptide sequences are first used to search for homologous protein-peptide complexes in databases of experimentally resolved quaternary structures. Then, similarly to homology modeling, protein structure alignment and peptide sequence alignment are used to generate models of the protein-peptide binding mode. In this type of docking, the peptide backbone flexibility is taken into account through the different homologous peptide structures found in the database. Most often, model building is followed by an energy-based optimization to allow further structural flexibility, such as in GalaxyPepDock [144], HDOCK [145], or InterPep2 [146].

The physics-based methods for flexible peptide docking can be subdivided into three different approaches: ensemble docking, ab initio docking, and fragment-based docking. In ensemble docking, the unbound peptide conformations are pre-sampled and the representative structures are rigidly docked into the protein. PepATTRACT [147], MdockPep [148], PIPER-FlexPepDock [149], or HPEPDOCK [150] can be classified as ensemble docking methods.
In ab initio approaches, the peptide conformations are sampled on-the-fly during the docking process, using mainly molecular simulations as in FlexPepDock [151], AnchorDock [152], or CABS-dock [153]. In fragment-based methods, the peptide is cut into shorter compounds and the fragments are docked onto the protein. Then, the best binding modes of each fragment are linked to generate the binding mode of the initial peptide. DINC [154] and IDP-LZerD [155] belong to this type of protein-peptide docking.

3.2 Identification and Optimization of Peptide Hits

In drug design, hit identification is the process of finding compounds which bind a target and modify its activity. In this subsection, we describe computational methods to identify peptide hits modulating protein-protein interactions. Peptide hits can be derived mainly from minimal recognition motifs at structurally known protein-protein interfaces or with (structure-based) virtual screening of peptide libraries (Fig. 4).

Fig. 4 Computational approaches used to identify a peptide hit and to optimize its sequence for higher affinity and selectivity

3.2.1 Derivation of Minimal Recognition Motifs

At many protein-protein interfaces, even those involving globular proteins, a short peptide segment predominantly contributes to the binding energy and is required to stabilize the complex [156]. Finding this hot segment, also called minimal recognition motif or self-inhibitory peptide, is often a good starting point for developing potent protein-protein inhibitors [157]. When the quaternary structure of a complex is known, several computational tools can help researchers derive these minimal recognition motifs.

The first approach consists in identifying the hot spots of a complex and then in extracting the shortest peptide segment which contains as many hot spots as possible [158, 159].
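The extraction step of this first approach can be sketched as a small search problem. The example assumes that hot spots are already available as residue numbers on one chain and that the user decides how many of them the segment must cover. The function name `shortest_hot_segment` and the residue numbers are hypothetical and do not come from the cited studies.

```python
from typing import List, Tuple


def shortest_hot_segment(hot_spots: List[int], n_required: int) -> Tuple[int, int]:
    """Return (first, last) residue numbers of the shortest contiguous segment
    that contains at least n_required hot spots."""
    positions = sorted(set(hot_spots))
    if n_required < 1 or n_required > len(positions):
        raise ValueError("n_required must be between 1 and the number of hot spots")
    # An optimal segment starts and ends on a hot spot, so it is enough to
    # compare runs of n_required consecutive hot spots.
    best = (positions[0], positions[n_required - 1])
    for i in range(1, len(positions) - n_required + 1):
        first, last = positions[i], positions[i + n_required - 1]
        if (last - first) < (best[1] - best[0]):
            best = (first, last)
    return best


# Hypothetical hot spot residue numbers on one partner of a complex.
segment = shortest_hot_segment([12, 15, 16, 19, 40, 57], n_required=4)
```

Varying `n_required` makes the trade-off explicit: covering the isolated hot spots at positions 40 and 57 would require a much longer peptide than covering the cluster between residues 12 and 19.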
Generally, the binding energies of these hot segments with their targets are subsequently estimated by using docking calculations or molecular simulations and compared to the initial protein-protein interactions to support the minimal recognition motif design [156, 158–160]. Another example of this approach was reported in two different studies of the same target, the Hsp90 dimer. The identification by computational alanine-scanning of four hot spots on the protein C-terminal α-helix served as a basis for designing several peptide-based inhibitors of Hsp90 [161, 162].

In the previous approach, the first and last residues of the minimal recognition motif still have to be chosen by the researchers, and validation of these choices by computing binding energies can be quite tedious. Thus, a systematic method called Rosetta Peptiderive has been developed to automatically identify hot segments from the three-dimensional structure of a given protein-protein complex [163]. In this algorithm, a sliding window of user-defined size runs along one protein sequence and, at each position, isolates a peptide segment whose binding energy with the protein partner is computed using the Rosetta energy function [156]. Peptides which contribute the most to the protein-protein interaction are selected as hot segments. Peptiderive was made available to the scientific community through a web server [163] and allowed several groups to rapidly design, from the identified self-inhibitory peptides, inhibitors of various protein-protein interactions [160, 164–166]. It should be noted that, if the hot segments found have the appropriate geometry, Peptiderive can automatically derive cyclic peptides by mutating their terminal residues into cysteine and linking them by a disulfide bond [163].
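The sliding-window scan described above can be sketched in a few lines. This is only an illustration of the windowing logic under a strong simplifying assumption: each residue is given a precomputed contribution to the interface energy, and the energy of a segment is taken as the sum of these contributions. It is not the Rosetta energy function nor the Peptiderive implementation, and the function name `best_window` and the energy values are hypothetical.

```python
from typing import List, Tuple


def best_window(residue_energies: List[float], window: int) -> Tuple[int, float]:
    """Slide a fixed-size window along the chain and return (start_index, energy)
    of the segment with the lowest (most favorable) summed energy."""
    n = len(residue_energies)
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and the chain length")
    current = sum(residue_energies[:window])
    best_start, best_energy = 0, current
    for start in range(1, n - window + 1):
        # Running sum: add the residue entering the window, drop the one leaving it.
        current += residue_energies[start + window - 1] - residue_energies[start - 1]
        if current < best_energy:
            best_start, best_energy = start, current
    return best_start, best_energy


# Hypothetical per-residue contributions to the interface energy (arbitrary units).
energies = [0.0, -0.5, -2.0, -3.5, -1.0, 0.0, -0.5, 0.0]
start, energy = best_window(energies, window=3)

# Share of the total interface energy carried by the selected segment.
fraction = energy / sum(energies)
```

Reporting the segment energy as a fraction of the total interface energy gives a simple way to judge whether a short peptide is likely to capture most of the interaction.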
3.2.2 Structure-Based Virtual Screening

When the tertiary structure of a protein is known, one major strategy for drug discovery consists in docking millions of compounds from various chemical libraries into identified target cavities. Structure-based virtual screening has been applied to search for inhibitors of various protein-protein interactions, but mainly within libraries of small organic molecules [167–169]. Probably because docking peptides requires more computational resources than docking small compounds, few papers have reported the discovery of protein-protein inhibitors by structure-based screening of peptide libraries. Nevertheless, with the continuous increase of computing power, recent studies using peptide screening have been reported in the literature. In these studies, libraries of natural peptides extracted from food were docked into angiotensin-converting enzymes [170, 171] or xanthine oxidase [172] to identify potent peptide-based inhibitors of these proteins. Nonetheless, it should be mentioned that the libraries used were mainly composed of very short tri- or tetrapeptides, limiting the possibility of discovering peptides long enough to competitively inhibit large protein-protein interfaces.

In this respect, it is worth mentioning that several computational tools can boost virtual screening of peptides by facilitating the generation of libraries of various peptides. The Robetta server, for example, can be used to easily generate libraries of helical, loop, or extended peptides [173]. Another example is the program CycloPs, which can generate large and diverse libraries of cyclic peptides from natural and commercially available non-natural amino acids [174]. However, as far as we know, no structure-based virtual screening of CycloPs libraries has been reported in the literature so far.
This could be due again to the computationally demanding calculations required for reliably docking several thousands of peptides with more than five residues.

3.2.3 Improving Peptide Affinity and Selectivity by Sequence Optimization

Once a peptide hit has been found, it is generally worthwhile to increase its affinity for its target to improve its inhibition potency. Moreover, selectivity of therapeutic compounds is an important requirement in drug development to lower the risk of off-target effects. In the case of peptide-based inhibitors, computational tools can help to improve the affinity and selectivity of identified peptide hits by optimizing their sequence. The guiding principle of this hit-to-lead process is similar to that used in protein redesign to improve stability [175], since the physical forces that drive protein folding also drive protein-protein and protein-peptide binding.

In favorable cases where the protein-peptide quaternary structure is known, redesign techniques generally consist in exploring the sequence space of the fixed-backbone peptide and finding the sequences which minimize an energy score. This score can be the binding free energy variation (ΔΔG) relative to the initial peptide sequence for affinity improvement, or the difference between the binding free energies of the same sequence for two different protein partners for selectivity enhancement. In essence, these approaches are similar to the computational alanine-scanning technique, except that each residue of the redesigned peptide can be mutated into all possible amino acids. Rosetta [176] is probably the most used software to design or redesign proteins, peptides, and their associations, but several other programs can be used to perform these tasks, including K* [177], ORBIT [178], Proteus [179], or dTERMen [180]. These programs exploit different algorithms to explore the protein and peptide sequence space, such as minimization methods, genetic algorithms, or Monte Carlo sampling.
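As an illustration of sequence-space exploration by Monte Carlo sampling, the sketch below runs a Metropolis search over fixed-length sequences. It is a generic toy example and does not reproduce any of the programs cited above. The scoring function `toy_score` is a hypothetical stand-in for an energy score such as ΔΔG, and the names `monte_carlo_design`, `temperature`, and `seed` are assumptions made for this example.

```python
import math
import random
from typing import Callable, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def monte_carlo_design(start: str, score: Callable[[str], float], steps: int = 2000,
                       temperature: float = 1.0, seed: int = 0) -> Tuple[str, float]:
    """Metropolis search over sequences of fixed length.
    Returns the lowest-scoring sequence visited and its score."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(steps):
        # Propose a single-point mutation at a random position.
        pos = rng.randrange(len(current))
        candidate = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


def toy_score(seq: str) -> float:
    """Hypothetical additive score (lower is better): rewards hydrophobic residues
    at even positions and charged residues at odd positions."""
    total = 0.0
    for i, aa in enumerate(seq):
        if i % 2 == 0 and aa in "AILMFVW":
            total -= 1.0
        elif i % 2 == 1 and aa in "DEKR":
            total -= 1.0
    return total


designed, designed_score = monte_carlo_design("GGGGGGGG", toy_score)
```

For selectivity, the same search could be run with a score defined as the difference between the scores of one sequence against two partners, which mirrors the second definition of the energy score given above.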
These programs also differ in their energy functions, which combine to varying degrees ingredients of physics-based all-atom force fields, implicit solvation models, and knowledge-based potentials derived from protein complex structures [181, 182]. Interestingly, dTERMen uses a scoring function derived from statistical potentials between tertiary structural motifs (TERMs) frequently observed in protein three-dimensional structures [183]. Since these TERMs have characteristic sequence preferences [184], the structure-based interactions are converted into sequence-based scoring functions which are extremely fast to evaluate, allowing exhaustive exploration of the sequence spaces of long peptides and proteins [180].

Many applications of computational protein-peptide interface redesign have been reported in the literature, and subsequent experimental validations of their predictions highlight the reliability of these approaches to improve the affinity and selectivity of peptides for their target proteins. Among the success stories, highly selective peptides were computationally designed against bZIP proteins [185, 186], PDZ domains [177, 187, 188], amyloid fibrils [189], the cytokine TNFα [190, 191], and several anti-apoptotic proteins of the Bcl-2 family [180, 192, 193]. Interestingly, two of the studies cited above redesigned peptide inhibitors with D-amino acids [189, 191], paving the way for the development of peptide-based drugs with high affinity, selectivity, and metabolic stability.

4 Conclusions

In this review, we classified the computational tools and strategies for designing peptide-based inhibitors of PPIs into the two conventional ligand-based and structure-based categories. Nevertheless, the border between the two classes is becoming more and more porous, and several peptide developments have combined both approaches.
For example, sequence-based predictions of hot spots at protein-peptide interfaces by machine learning algorithms are more accurate when molecular descriptors include structural information, such as solvent accessible surface areas or inter-residue distances. Hybrid approaches will probably become more frequent in the near future.

In both categories, the main challenge remains to determine the optimal peptide sequences which bind a protein target with the best affinity and selectivity. This requires computing binding free energies, and their relation to sequences, structures, and dynamics, as accurately as possible. Notably, because peptides generally have more degrees of freedom than small organic compounds, this objective calls for correctly characterizing their conformational ensemble to quantitatively estimate the entropy cost of association, especially for peptides which undergo a disorder-to-order transition upon binding.

Lastly, in the perspective of drug design, it remains crucial to reduce the peptidic nature of the identified peptide hits to increase their stability against proteolytic enzymes (without decreasing their affinity and selectivity). This can be achieved by introducing non-natural amino acids, such as D-amino acids, peptoids, or chemically modified side chains, in the early stages of the development of PPI peptide-based inhibitors. Their membrane permeability is also an important property which is worth investigating as early as possible in order to maximize the chances of success in clinical trials.

References

1. Ryan DP, Matthews JM (2005) Protein-protein interactions in human disease. Curr Opin Struct Biol 15:441–446
2. Milroy L-G, Grossmann TN, Hennig S, Brunsveld L, Ottmann C (2014) Modulators of protein–protein interactions. Chem Rev 114:4695–4748
3.
Archakov AI, Govorun VM, Dubanov AV, Ivanov YD, Veselovsky AV, Lewi P, Janssen P (2003) Protein-protein interactions as a target for drugs in proteomics. Proteomics 3:380–391
4. Sheng C, Dong G, Miao Z, Zhang W, Wang W (2015) State-of-the-art strategies for targeting protein–protein interactions by small-molecule inhibitors. Chem Soc Rev 44:8238–8259
5. Modell AE, Blosser SL, Arora PS (2016) Systematic targeting of protein–protein interactions. Trends Pharmacolog Sci 37:702–713
6. Wichapong K, Poelman H, Ercig B, Hrdinova J, Liu X, Lutgens E, Nicolaes GA (2019) Rational modulator design by exploitation of protein–protein complex structures. Future Med Chem 11:1015–1033
7. Yugandhar K, Gromiha MM (2016) Analysis of protein-protein interaction networks based on binding affinity. Current Protein Peptide Sci 17:72–81
8. Nevola L, Giralt E (2015) Modulating protein–protein interactions: the potential of peptides. Chem Commun 51:3302–3315
9. Cunningham AD, Qvit N, Mochly-Rosen D (2017) Peptides and peptidomimetics as regulators of protein–protein interactions. Current Opin Struct Biol 44:59–66
10. Fosgerau K, Hoffmann T (2015) Peptide therapeutics: current status and future directions. Drug Discovery Today 20:122–128
11. Cereto-Massagué A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallvé S, Pujadas G (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63
12. Kaserer T, Beck K, Akram M, Odermatt A, Schuster D (2015) Pharmacophore models and pharmacophore-based virtual screening: concepts and applications exemplified on hydroxysteroid dehydrogenases. Molecules 20:22799–22832
13. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
14. Zhang R, Ou H-Y, Zhang C-T (2004) DEG: a database of essential genes. Nucleic Acids Res 32:D271–D272
15. Rey S, Acab M, Gardy JL, Laird MR, deFays K, Lambert C, Brinkman FSL (2005) PSORTdb: a protein subcellular localization database for bacteria.
Nucleic Acids Res 33:D164–D168
16. Gawade P, Ghosh P (2018) Genomics driven approach for identification of novel therapeutic targets in Salmonella enterica. Gene 668:211–220
17. Pirtskhalava M, Gabrielian A, Cruz P, Griggs HL, Squires RB, Hurt DE, Grigolava M, Chubinidze M, Gogoladze G, Vishnepolsky B, Alekseev V, Rosenthal A, Tartakovsky M (2016) DBAASP v.2: an enhanced database of structure and antimicrobial/cytotoxic activity of natural and synthetic peptides. Nucleic Acids Res 44:D1104–D1112
18. Minkiewicz P, Iwaniak A, Darewicz M (2019) BIOPEP-UWM database of bioactive peptides: current opportunities. Int J Mol Sci 20:5978
19. Chou K-C (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct Funct Genet 43:246–255
20. Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ (2011) Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 39:W385–W390
21. Chen W, Ding H, Feng P, Lin H, Chou K-C (2016) iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 7:16895–16909
22. Xu L, Liang G, Wang L, Liao C (2018) A novel hybrid sequence-based model for identifying anticancer peptides. Genes 9:158
23. Wei L, Zhou C, Chen H, Song J, Su R (2018) ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 34:4007–4016
24. Blanco JL, Porto-Pazos AB, Pazos A, Fernandez-Lozano C (2018) Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection. Sci Rep 8:15688
25. Laengsri V, Nantasenamat C, Schaduangrat N, Nuchnoi P, Prachayasittikul V, Shoombuatong W (2019) TargetAntiAngio: a sequence-based tool for the prediction and analysis of anti-angiogenic peptides. Int J Mol Sci 20:2950
26.
Bhadra P, Yan J, Li J, Fong S, Siu SWI (2018) AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 8:1697
27. Khosravian M, Kazemi Faramarzi F, Mohammad Beigi M, Behbahani M, Mohabatkar H (2013) Predicting antibacterial peptides by the concept of Chou’s Pseudo-amino acid composition and machine learning methods. Protein Peptide Lett 20:180–186
28. Schaduangrat N, Nantasenamat C, Prachayasittikul V, Shoombuatong W (2019) Meta-iAVP: a sequence-based meta-predictor for improving the prediction of antiviral peptides using effective feature representation. Int J Mol Sci 20:5743
29. Tung C-W, Ziehm M, Kämper A, Kohlbacher O, Ho S-Y (2011) POPISK: T-cell reactivity prediction using support vector machines and string kernels. BMC Bioinf 12:446
30. Jorgensen KW, Rasmussen M, Buus S, Nielsen M (2014) NetMHCstab - predicting stability of peptide-MHC-I complexes; impacts for cytotoxic T lymphocyte epitope discovery. Immunology 141:18–26
31. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, Wheeler DK, Sette A, Peters B (2019) The immune epitope database (IEDB): 2018 update. Nucleic Acids Res 47:D339–D343
32. Gupta S, Mittal P, Madhu MK, Sharma VK (2017) IL17eScan: a tool for the identification of peptides inducing IL-17 response. Front Immunol 8:1430
33. Manavalan B, Shin TH, Kim MO, Lee G (2018) AIPpred: sequence-based prediction of anti-inflammatory peptides using random forest. Front Pharmacol 9:276
34. Wei L, Zhou C, Su R, Zou Q (2019) PEPredSuite: improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinf 35:4272–4280
35. Tang H, Su Z-D, Wei H-H, Chen W, Lin H (2016) Prediction of cell-penetrating peptides with feature selection techniques. Biochem Biophys Res Commun 477:150–154
36.
Wei L, Xing P, Su R, Shi G, Ma ZS, Zou Q (2017) CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency. J Proteome Res 16:2044–2053
37. Pandey P, Patel V, George NV, Mallajosyula SS (2018) KELM-CPPpred: Kernel extreme learning machine based prediction model for cell-penetrating peptides. J Proteome Res 17:3214–3222
38. Arif M, Ahmad S, Ali F, Fang G, Li M, Yu D-J (2020) TargetCPP: accurate prediction of cell-penetrating peptides from optimized multi-scale features using gradient boost decision tree. J Comput Aided Mol Des 34:841–856
39. Chen M, Ju CJT, Zhou G, Chen X, Zhang T, Chang K-W, Zaniolo C, Wang W (2019) Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35:i305–i314
40. Hashemifar S, Neyshabur B, Khan AA, Xu J (2018) Predicting protein-protein interactions through sequence-based deep learning. Bioinformatics 34:i802–i810
41. Tran L, Hamp T, Rost B (2018) ProfPPIdb: Pairs of physical protein-protein interactions predicted for entire proteomes. PLOS One 13:e0199988
42. Romero-Molina S, Ruiz-Blanco YB, Harms M, Münch J, Sanchez-Garcia E (2019) PPI-Detect: a support vector machine model for sequence-based prediction of protein-protein interactions. J Comput Chem 40:1233–1242
43. Eid F-E, ElHefnawi M, Heath LS (2016) DeNovo: virus-host sequence-based protein–protein interaction prediction. Bioinf 32:1144–1150
44. Lian X, Yang S, Li H, Fu C, Zhang Z (2019) Machine-learning-based predictor of human-bacteria protein-protein interactions by incorporating comprehensive host-network properties. J Proteome Res 18:2195–2205
45. Kösesoy I, Gök M, Öz C (2019) A new sequence based encoding for prediction of host–pathogen protein interactions. Comput Biol Chem 78:170–177
46.
Tan S-H, Hugo W, Sung W-K, Ng S-K (2006) A correlated motif approach for finding short linear motifs from protein interaction networks. BMC Bioinf 7:502
47. Leung HC-M, Siu M-H, Yiu S-M, Chin FY-L, Sung KW-K (2009) Clustering-based approach for predicting motif pairs from protein interaction data. J Bioinf Comput Biol 07:701–716
48. Hugo W, Ng S-K, Sung W-K (2011) D-SLIMMER: domain-SLiM interaction motifs miner for sequence based protein-protein interaction data. J Proteome Res 10:5285–5295
49. Disfani FM, Hsu W-L, Mizianty MJ, Oldfield CJ, Xue B, Dunker AK, Uversky VN, Kurgan L (2012) MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics 28:i75–i83
50. Malhis N, Gsponer J (2015) Computational identification of MoRFs in protein sequences. Bioinformatics 31:1738–1744
51. He H, Zhao J, Sun G (2019) Computational prediction of MoRFs based on protein sequences and minimax probability machine. BMC Bioinf 20:529
52. Chen JR, Chang BH, Allen JE, Stiffler MA, MacBeath G (2008) Predicting PDZ domain–peptide interactions from primary sequences. Nat Biotechnol 26:1041–1045
53. Reimand J, Hui S, Jain S, Law B, Bader GD (2012) Domain-mediated protein interaction prediction: from genome to network. FEBS Lett 586:2751–2763
54. Sarkar D, Jana T, Saha S (2018) LMDIPred: a web-server for prediction of linear peptide sequences binding to SH3, WW and PDZ domains. PLOS One 13:e0200430
55. Xue LC, Dobbs D, Honavar V (2011) HomPPI: a class of sequence homology based protein-protein interface prediction methods. BMC Bioinf 12:244
56. Garcia-Garcia J, Valls-Comamala V, Guney E, Andreu D, Muñoz FJ, Fernandez-Fuentes N, Oliva B (2017) iFrag: a protein–protein interface prediction server based on sequence fragments. J Mol Biol 429:382–389
57. Dhole K, Singh G, Pai PP, Mondal S (2014) Sequence-based prediction of protein–protein interaction sites with L1-logreg classifier.
J Theoret Biol 348:47–54
58. Jia J, Liu Z, Xiao X, Liu B, Chou K-C (2016) iPPBS-Opt: a sequence-based ensemble classifier for identifying protein-protein binding sites by optimizing imbalanced training datasets. Molecules 21:95
59. Hou Q, De Geest PFG, Griffioen CJ, Abeln S, Heringa J, Feenstra KA (2019) SeRenDIP: SEquential REmasteriNg to DerIve profiles for fast and accurate predictions of PPI interface positions. Bioinformatics 35:4794–4796
60. Afsar Minhas FuA, Geiss BJ, Ben-Hur A (2014) PAIRpred: partner-specific prediction of interacting residues from sequence and structure. Proteins: Struct Funct Bioinf 82:1142–1155
61. Meyer MJ, Beltrán JF, Liang S, Fragoza R, Rumack A, Liang J, Wei X, Yu H (2018) Interactome INSIDER: a structural interactome browser for genomic studies. Nat Methods 15:107–114
62. Sanchez-Garcia R, Sorzano COS, Carazo JM, Segura J (2019) BIPSPI: a method for the prediction of partner-specific protein-protein interfaces. Bioinf 35:470–477
63. Taherzadeh G, Yang Y, Zhang T, Liew AW-C, Zhou Y (2016) Sequence-based prediction of protein-peptide binding sites using support vector machine. J Comput Chem 37:1223–1229
64. Zhao Z, Peng Z, Yang J (2018) Improving sequence-based prediction of protein–peptide binding residues by introducing intrinsic disorder and a consensus method. J Chem Inf Model 58:1459–1468
65. Dosztányi Z, Csizmok V, Tompa P, Simon I (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21:3433–3434
66. Yugandhar K, Gromiha MM (2014) Feature selection and classification of protein–protein complexes based on their binding affinities using machine learning approaches. Proteins: Struct Funct Bioinf 82:2088–2096
67.
Srinivasulu Y, Wang J-R, Hsu K-T, Tsai M-J, Charoenkwan P, Huang W-L, Huang H-L, Ho S-Y (2015) Characterizing informative sequence descriptors and predicting binding affinities of heterodimeric protein complexes. BMC Bioinf 16:S14
68. Shao X, Tan CSH, Voss C, Li SSC, Deng N, Bader GD (2011) A regression framework incorporating quantitative and negative interaction data improves quantitative prediction of PDZ domain–peptide interaction from primary sequence. Bioinformatics 27:383–390
69. Moal IH, Agius R, Bates PA (2011) Protein–protein binding affinity prediction on a diverse set of structures. Bioinformatics 27:3002–3009
70. Luo J, Guo Y, Zhong Y, Ma D, Li W, Li M (2014) A functional feature analysis on diverse protein–protein interactions: application for the prediction of binding affinity. J Comput Aided Mol Design 28:619–629
71. Kamisetty H, Ghosh B, Langmead CJ, Bailey-Kellogg C (2015) Learning sequence determinants of protein:protein interaction specificity with sparse graphical models. J Comput Biol 22:474–486
72. Jemimah S, Yugandhar K, Michael Gromiha M (2017) PROXiMATE: a database of mutant protein–protein complex thermodynamics and kinetics. Bioinf 33:2787–2788
73. Jankauskaitė J, Jiménez-García B, Dapkūnas J, Fernández-Recio J, Moal IH (2019) SKEMPI 2.0: an updated benchmark of changes in protein–protein binding energy, kinetics and thermodynamics upon mutation. Bioinformatics 35:462–469
74. Geng C, Vangone A, Folkers GE, Xue LC, Bonvin AMJJ (2019) iSEE: Interface structure, evolution, and energy-based machine learning predictor of binding affinity changes upon mutations. Proteins: Struct Funct Bioinf 87:110–119
75. Rodrigues CHM, Myung Y, Pires DEV, Ascher DB (2019) mCSM-PPI2: predicting the effects of mutations on protein–protein interactions. Nucleic Acids Res 47:W338–W344
76.
Zhang N, Chen Y, Lu H, Zhao F, Alvarez RV, Goncearenco A, Panchenko AR, Li M (2020) MutaBind2: predicting the impacts of single and multiple mutations on protein-protein interactions. iScience 23:100939
77. Jemimah S, Sekijima M, Gromiha MM (2019) ProAffiMuSeq: sequence-based method to predict the binding free energy change of protein–protein complexes upon mutation using functional classification. Bioinformatics 36:1725–1730
78. Li G, Pahari S, Krishna Murthy A, Liang S, Fragoza R, Yu H, Alexov E (2020) SAAMBE-SEQ: a sequence-based method for predicting mutation effect on protein-protein binding affinity. Bioinformatics 37:btaa761
79. Massa SM, Xie Y, Longo FM (2003) Alzheimer’s therapeutics. J Mol Neurosci 20:323–326
80. Parthasarathi L, Casey F, Stein A, Aloy P, Shields DC (2008) Approved drug mimics of short peptide ligands from protein interaction motifs. J Chem Inf Model 48:1943–1948
81. Fayaz SM, Rajanikant GK (2015) Modelling the molecular mechanism of protein–protein interactions and their inhibition: CypD–p53 case study. Mol Diversity 19:931–943
82. Caporuscio F, Tafi A, González E, Manetti F, Esté JA, Botta M (2009) A dynamic target-based pharmacophoric model mapping the CD4 binding site on HIV-1 gp120 to identify new inhibitors of gp120–CD4 protein–protein interactions. Bioorganic Med Chem Lett 19:6087–6091
83. Hall PR, Leitão A, Ye C, Kilpatrick K, Hjelle B, Oprea TI, Larson RS (2010) Small molecule inhibitors of hantavirus infection. Bioorganic Med Chem Lett 20:7085–7091
84. Pihan E, Delgadillo RF, Tonkin ML, Pugnière M, Lebrun M, Boulanger MJ, Douguet D (2015) Computational and biophysical approaches to protein–protein interaction inhibition of Plasmodium falciparum AMA1/RON2 complex. J Comput Aided Mol Design 29:525–539
85. Jesus Perez de Vega M, Martin-Martinez M, Gonzalez-Muniz R (2007) Modulation of protein-protein interactions by stabilizing/mimicking protein secondary structure elements. Current Topics Med Chem 7:33–62
86.
Klein M (2017) Stabilized helical peptides: overview of the technologies and its impact on drug discovery. Expert Opin Drug Disc 12:1117–1125
87. Guarracino DA, Riordan JA, Barreto GM, Oldfield AL, Kouba CM, Agrinsoni D (2019) Macrocyclic control in Helix Mimetics. Chem Rev 119:9915–9949
88. Khakshoor O, Nowick JS (2008) Artificial β-sheets: chemical models of β-sheets. Current Opin Chem Biol 12:722–729
89. Laxio Arenas J, Kaffy J, Ongeri S (2019) Peptides and peptidomimetics as inhibitors of protein–protein interactions involving β-sheet secondary structures. Current Opin Chem Biol 52:157–167
90. Tanaka M (2007) Design and synthesis of chiral α,α-disubstituted amino acids and conformational study of their oligopeptides. Chem Pharmaceut Bull 55:349–358
91. Chatterjee J, Rechenmacher F, Kessler H (2013) N-Methylation of peptides and proteins: an important element for modulating biological functions. Angew Chem Int Edition 52:254–269
92. Sarnowski MP, Pedretty KP, Giddings N, Woodcock HL, Del Valle JR (2018) Synthesis and β-sheet propensity of constrained N-amino peptides. Bioorganic Med Chem 26:1162–1166
93. Matthes D, de Groot BL (2009) Secondary structure propensities in peptide folding simulations: a systematic comparison of molecular mechanics interaction schemes. Biophys J 97:599–608
94. Rauscher S, Gapsys V, Gajda MJ, Zweckstetter M, de Groot BL, Grubmüller H (2015) Structural ensembles of intrinsically disordered proteins depend strongly on force field: a comparison to experiment. J Chem Theory Comput 11:5513–5524
95. Chan-Yao-Chong M, Deville C, Pinet L, van Heijenoort C, Durand D, Ha-Duong T (2019) Structural characterization of N-WASP domain V using MD simulations with NMR and SAXS data. Biophys J 116:1216–1227
96. Sugita Y, Okamoto Y (1999) Replica-exchange molecular dynamics method for protein folding. Chem Phys Lett 314:141–151
97. Laio A, Parrinello M (2002) Escaping free-energy minima.
Proc Natl Acad Sci 99: 12562–12566 98. Joseph TL, Lane DP, Verma CS (2012) Stapled BH3 peptides against MCL-1: mechanism and design using atomistic simulations. PLOS One 7:e43985 99. Damas JM, Filipe LC, Campos SR, Lousa D, Victor BL, Baptista AM, Soares CM (2013) Predicting the thermodynamics and kinetics of Helix formation in a cyclic peptide model. J Chem Theory Comput 9:5148–5157 100. Cornillie SP, Bruno BJ, Lim CS, Cheatham TE (2018) Computational modeling of stapled peptides toward a treatment strategy for CML and broader implications in the design of lengthy peptide therapeutics. J Phys Chem B 122:3864–3875 101. Lama D, Quah ST, Verma CS, Lakshminarayanan R, Beuerman RW, Lane DP, Brown CJ (2013) Rational optimization of conformational effects induced by hydrocarbon staples in peptides and their binding interfaces. Sci Rep 3:3451 102. Zhu J, Wei S, Huang L, Zhao Q, Zhu H, Zhang A (2020) Molecular modeling and rational design of hydrocarbon-stapled/ halogenated helical peptides targeting CETP self-binding site: Therapeutic implication for atherosclerosis. J Mol Graph Modell 94: 107455 103. Tan YS, Lane DP, Verma CS (2016) Stapled peptide design: principles and roles of computation. Drug Discovery Today 21:1642–1653 104. Spitaleri A, Ghitti M, Mari S, Alberici L, Traversari C, Rizzardi G-P, Musco G (2011) Use of metadynamics in the design of isoDGR-based αvβ3 antagonists to fine-tune the conformational ensemble. Ang Chem Int Edition 50:1832–1836 105. Yedvabny E, Nerenberg PS, So C, HeadGordon T (2015) Disordered structural ensembles of vasopressin and oxytocin and their mutants. J Phys Chem B 119:896–905 106. Yu H, Lin, Y-S (2015) Toward structure prediction of cyclic peptides. Phys Chem Chem Phys 17:4210–4219 107. McHugh SM, Rogers JR, Solomon SA, Yu H, Lin Y-S (2016) Computational methods to design cyclic peptides. Current Opin Chem Biol 34:95–102 108. 
Quartararo JS, Eshelman MR, Peraro L, Yu H, Baleja JD, Lin Y-S, Kritzer JA (2014) A bicyclic peptide scaffold promotes phosphotyrosine mimicry and cellular uptake. Bioorganic Med Chem 22:6387–6391 109. Razavi AM, Wuest WM, Voelz VA (2014) Computational screening and selection of cyclic peptide hairpin mimetics by molecular simulation and kinetic network models. J Chem Inf Model 54:1425–1432 110. Wakefield AE, Wuest WM, Voelz VA (2015) Molecular simulation of conformational pre-organization in cyclic RGD peptides. J Chem Inf Model 55:806–813 111. Est CB, Mangrolia P, Murphy RM (2019) ROSETTA-informed design of structurally stabilized cyclic anti-amyloid peptides. Protein Eng Design Select 32:47–57 112. Paissoni C, Ghitti M, Belvisi L, Spitaleri A, Musco G (2015) Metadynamics simulations rationalise the conformational effects induced by N-methylation of RGD cyclic hexapeptides. Chem A Europ J 21:14165–14170 113. Slough DP, Yu H, McHugh SM, Lin Y-S (2017) Toward accurately modeling N-methylated cyclic peptides. Phys Chem Chem Phys 19:5377–5388 114. Lensink MF, Velankar S, Wodak SJ (2017) Modeling protein-protein and proteinpeptide complexes: CAPRI 6th edition: modeling protein-protein and protein-peptide complexes. Proteins Struct Funct Bioinf 85: 359–377 115. Gowthaman R, Miller SA, Rogers S, Khowsathit J, Lan L, Bai N, Johnson DK, Liu C, Xu L, Anbanandam A, Aubé J, Roy A, Karanicolas J (2016) DARC: mapping surface topography by ray-casting for effective virtual screening at protein interaction sites. J Med Chem 59:4152–4170 116. Binkowski TA, Naghibzadeh S, Liang J (2003) CASTp: computed atlas of surface In Silico Design of Peptide-Based PPI Inhibitors topography of proteins. Nucleic Acids Res 31: 3352–3355 117. Le Guilloux V, Schmidtke P, Tuffery P (2009) Fpocket: an open source platform for ligand pocket detection. BMC Bioinf 10:168 118. 
Guo Z, Thorarensen A, Che J, Xing L (2016) Target the more druggable protein states in a highly dynamic protein–protein interaction system. J Chem Inf Model 56:35–45 119. Guo Z, Li B, Dzubiella J, Cheng L-T, McCammon JA, Che J (2013) Evaluation of hydration free energy by level-set variational implicit-solvent model with coulomb-field approximation. J Chem Theory Comput 9: 1778–1787 120. Liu S, Liu C, Deng L (2018) Machine learning approaches for protein–protein interaction hot spot prediction: progress and comparative assessment. Molecules 23:2535 121. Tuncbag N, Gursoy A, Keskin O (2009) Identification of computational hot spots in protein interfaces: combining solvent accessibility and inter-residue potentials improves the accuracy. Bioinf 25:1513–1520 122. Xia J-F, Zhao X-M, Song J, Huang D-S (2010) APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility. BMC Bioinf 11:174 123. Wang L, Liu Z-P, Zhang X-S, Chen L (2012) Prediction of hot spots in protein interfaces using a random forest model with hybrid features. Protein Eng Design Select 25:119–126 124. Deng L, Guan J, Wei X, Yi Y, Zhang QC, Zhou S (2013) Boosting prediction performance of protein-protein interaction hot spots by using structural neighborhood properties. J Comput Biol 20:878–891 125. Qiao Y, Xiong Y, Gao H, Zhu X, Chen P (2018) Protein-protein interface hot spots prediction based on a hybrid feature selection strategy. BMC Bioinf 19:14 126. Jones G, Willett P, Glen RC, Leach AR, Taylor R (1997) Development and validation of a genetic algorithm for flexible docking11Edited by F. E. Cohen. J Mol Biol 267: 727–748 127. Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31:455–461 128. 
Wang L, Hou Y, Quan H, Xu W, Bao Y, Li Y, Fu Y, Zou S (2013) A compound-based computational approach for the accurate determination of hot spots. Protein Sci 22: 1060–1070 227 129. Kulp JL, Kulp JL, Pompliano DL, Guarnieri F (2011) Diverse fragment clustering and water exclusion identify protein hot spots. J Amer Chem Soc 133:10740–10743 130. Kulp JL, Cloudsdale IS, Kulp JL, Guarnieri F (2017) Hot-spot identification on a broad class of proteins and RNA suggest unifying principles of molecular recognition. PLOS One 12:e0183327 131. Cunningham BC, Wells JA (1989) Highresolution epitope mapping of hGH-receptor interactions by alaninescanning mutagenesis. Science 244: 1081–1085 132. Barlow KA, Ó Conchúir S, Thompson S, Suresh P, Lucas JE, Heinonen M, Kortemme T (2018) Flex ddG: Rosetta ensemble-based estimation of changes in protein-protein binding affinity upon mutation. J Phys Chem B 122:5389–5399 133. Ibarra AA, Bartlett GJ, Hegedüs Z, Dutt S, Hobor F, Horner KA, Hetherington K, Spence K, Nelson A, Edwards TA, Woolfson DN, Sessions RB, Wilson AJ (2019) Predicting and experimentally validating hot-spot residues at protein–protein interfaces. ACS Chem Biol 14:2252–2263 134. Martins SA, Perez M AS, Moreira IS, Sousa SF, Ramos MJ, Fernandes PA (2013) Computational alanine scanning mutagenesis: MM-PBSA vs TI. J Chem Theory Comput 9: 1311–1319 135. Yang XQ, Liu JY, Li XC, Chen MH, Zhang YL (2014) Key amino acid associated with acephate detoxification by cydia pomonella carboxylesterase based on molecular dynamics with alanine scanning and site-directed mutagenesis. J Chem Inf Model 54:1356–1370 136. Dapiaggi F, Pieraccini S, Sironi M (2015) In silico study of VP35 inhibitors: from computational alanine scanning to essential dynamics. Mol BioSyst 11:2152–2157 137. He L, Bao J, Yang Y, Dong S, Zhang L, Qi Y, Zhang JZH (2019) Study of SHMT2 inhibitors and their binding mechanism by computational alanine scanning. J Chem Inf Model 59:3871–3878 138. 
Laurini E, Marson D, Aulic S, Fermeglia M, Pricl S (2020) Computational Alanine scanning and structural analysis of the SARS-CoV2 Spike protein/angiotensin-converting enzyme 2 complex. ACS Nano 14: 11821–11830 139. Zhao J, Yin B, Sun H, Pang L, Chen J (2020) Identifying hot spots of inhibitor-CDK2 bindings by computational alanine scanning. Chem Phys Lett 747:137329 228 Maxence Delaunay and Tâp Ha-Duong 140. Chen R, Li L, Weng Z (2003) ZDOCK: an initial-stage protein-docking algorithm. Proteins 52:80–87 141. Baspinar A, Cukuroglu E, Nussinov R, Keskin O, Gursoy A (2014) PRISM: a web server and repository for prediction of protein–protein interactions and modeling their 3D complexes. Nucleic Acids Res 42: W285–W289 142. Cheng TM-K, Blundell TL, Fernandez-Recio J (2007) pyDock: electrostatics and desolvation for effective scoring of rigid-body protein-protein docking. Proteins 68:503–515 143. Degryse B, Fernandez-Recio J, Citro V, Blasi F, Cubellis MV (2008) In silico docking of urokinase plasminogen activator and integrins. BMC Bioinf 9:S8 144. Lee H, Heo L, Lee MS, Seok C (2015) GalaxyPepDock: a protein-peptide docking tool based on interaction similarity and energy optimization. Nucleic Acids Res 43: W431–435 145. Yan Y, Wen Z, Wang X, Huang S-Y (2017) Addressing recent docking challenges: a hybrid strategy to integrate template-based and free protein-protein docking. Proteins Struct Funct Bioinf 85:497–512 146. Johansson-Åkhe I, Mirabello C, Wallner B (2020) InterPep2: global peptide–protein docking using interaction surface templates. Bioinformatics 36:2458–2465 147. Schindler C, de Vries S, Zacharias M (2015) Fully blind peptide-protein docking with pepATTRACT. Structure 23:1507–1515 148. Yan C, Xu X, Zou X (2016) Fully blind docking at the atomic level for protein-peptide complex structure prediction. Structure 24: 1842–1853 149. 
Alam N, Goldstein O, Xia B, Porter KA, Kozakov D, Schueler-Furman O (2017) High-resolution global peptide-protein docking using fragments-based PIPER-FlexPepDock. PLOS Comput Biol 13:e1005905 150. Zhou P, Jin B, Li H, Huang S-Y (2018) HPEPDOCK: a web server for blind peptide–protein docking based on a hierarchical algorithm. Nucleic Acids Res 46: W443–W450 151. Raveh B, London N, Schueler-Furman O (2010) Sub-angstrom modeling of complexes between flexible peptides and globular proteins. Proteins 78:2029–2040 152. Ben-Shimon A, Niv MY (2015). AnchorDock: blind and flexible anchor-driven peptide docking. Structure 23:929–940 153. Kurcinski M, Jamroz M, Blaszczyk M, Kolinski A, Kmiecik S (2015) CABS-dock web server for the flexible docking of peptides to proteins without prior knowledge of the binding site. Nucleic Acids Res 43: W419–W424 154. Antunes DA, Moll M, Devaurs D, Jackson KR, Lizée G, Kavraki LE (2017) DINC 2.0: a new protein-peptide docking webserver using an incremental approach. Cancer Res 77:e55–e57 155. Peterson LX, Roy A, Christoffer C, Terashi G, Kihara D (2017) Modeling disordered protein interactions from biophysical principles. PLOS Comput Biol 13:e1005485 156. London N, Raveh B, Movshovitz-Attias D, Schueler-Furman O (2010) Can selfinhibitory peptides be derived from the interfaces of globular protein-protein interactions? Proteins 78:3140–3149 157. London N, Raveh B, Schueler-Furman O (2013) Druggable protein-protein interactions? from hot spots to hot segments. Current Opin Chem Biol 17:952–959 158. Nomme J, Takizawa Y, Martinez SF, Renodon-Cornière A, Fleury F, Weigel P, Yamamoto K-i, Kurumizaka H, Takahashi M (2008) Inhibition of filament formation of human Rad51 protein by a small peptide derived from the BRC-motif of the BRCA2 protein. Genes Cells 13:471–481 159. 
Nomme J, Renodon-Cornière A, Asanomi Y, Sakaguchi K, Stasiak AZ, Stasiak A, Norden B, Tran V, Takahashi M (2010) Design of potent inhibitors of human RAD51 recombinase based on BRC motifs of BRCA2 protein: modeling and experimental validation of a chimera peptide. J Med Chem 53:5782–5791 160. Jafary F, Ganjalikhany MR, Moradi A, Hemati M, Jafari S (2019) Novel peptide inhibitors for lactate dehydrogenase a (LDHA): a survey to inhibit ldha activity via disruption of protein-protein interaction. Sci Rep 9:4686 161. Gavenonis J, Jonas NE, Kritzer JA (2014) Potential C-terminal-domain inhibitors of heat shock protein 90 derived from a C-terminal peptide helix. Bioorganic Med Chem 22:3989–3993 162. Bopp B, Ciglia E, Ouald-Chaib A, Groth G, Gohlke H, Jose J (2016) Design and biological testing of peptidic dimerization inhibitors of human Hsp90 that target the C-terminal domain. Biochim et Biophys Acta 1860:1043–1055 163. Sedan Y, Marcu O, Lyskov S, SchuelerFurman O (2016) Peptiderive server: derive peptide inhibitors from protein–protein interactions. Nucleic Acids Res 44:W536–W541 In Silico Design of Peptide-Based PPI Inhibitors 164. Horita S, Nomura Y, Sato Y, Shimamura T, Iwata S, Nomura N (2016) High-resolution crystal structure of the therapeutic antibody pembrolizumab bound to the human PD-1. Sci Rep 6:35297 165. Li D, Song H, Mei H, Fang E, Wang X, Yang F, Li H, Chen Y, Huang K, Zheng L, Tong Q (2018) Armadillo repeat containing 12 promotes neuroblastoma progression through interaction with retinoblastoma binding protein 4. Nat Commun 9:2829 166. Tarsia C, Danielli A, Florini F, Cinelli P, Ciurli S, Zambelli B (2018) Targeting Helicobacter pylori urease activity and maturation: in-cell high-throughput approach for drug discovery. Bioch et Biophys Acta 1862: 2245–2253 167. 
Geppert T, Bauer S, Hiss JA, Conrad E, Reutlinger M, Schneider P, Weisel M, Pfeiffer B, Altmann K-H, Waibler Z, Schneider G (2012) Immunosuppressive small molecule discovered by structure-based virtual screening for inhibitors of protein–protein interactions. Angew Chem Int Edition 51: 258–261 168. Johnson DK, Karanicolas J (2016) Ultrahigh-throughput structure-based virtual screening for small-molecule inhibitors of protein–protein interactions. J Chem Inf Model 56:399–411 169. Koes DR, Dömling A, Camacho CJ (2018) AnchorQuery: rapid online virtual screening for small-molecule protein–protein interaction inhibitors. Protein Sci 27:229–232 170. Wu H, Liu Y, Guo M, Xie J, Jiang X (2014) A virtual screening method for inhibitory peptides of angiotensin i–converting enzyme J Food Sci 79:C1635–C1642 171. Yu Z, Fan Y, Zhao W, Ding L, Li J, Liu J (2018) Novel angiotensin-converting enzyme inhibitory peptides derived from oncorhynchus mykiss nebulin: virtual screening and in silico molecular docking study. J Food Sci 83:2375–2383 172. Yu Z, Kan R, Wu S, Guo H, Zhao W, Ding L, Zheng F, and Liu, J. (2020). Xanthine oxidase inhibitory peptides derived from tuna protein: virtual screening, inhibitory activity, and molecular mechanisms. J Sci Food Agric 173. Kim DE, Chivian D, Baker D (2004) Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 32: W526–W531 174. Duffy FJ, Verniere M, Devocelle M, Bernard E, Shields DC, Chubb AJ (2011) CycloPs: generating virtual libraries of cyclized and constrained peptides including 229 nonnatural amino acids. J Chem Inf Model 51:829–836 175. Huang P-S, Boyken SE, Baker D (2016) The coming of age of de novo protein design. Nature 537:320–327 176. Kortemme T, Joachimiak LA, Bullock AN, Schuler AD, Stoddard BL, Baker D (2004) Computational redesign of protein-protein interaction specificity. Nat Struct Mol Biol 11:371–379 177. 
Roberts KE, Cushing PR, Boisguerin P, Madden DR, Donald BR (2012) Computational design of a PDZ domain peptide inhibitor that rescues CFTR activity. PLOS Comput Biol 8:e1002477 178. Sharabi O, Shirian J, Shifman J (2013) Predicting affinity- and specificity-enhancing mutations at protein–protein interfaces. Biochem Soc Trans 41:1166–1169 179. Simonson T, Gaillard T, Mignon D, Schmidt am Busch M, Lopes A, Amara N, Polydorides S, Sedano A, Druart K, Archontis G (2013) Computational protein design: the Proteus software and selected applications. J Comput Chem 34:2472–2484 180. Frappier V, Jenson JM, Zhou J, Grigoryan G, Keating AE (2019) Tertiary structural motif sequence statistics enable facile prediction and design of peptides that bind anti-apoptotic Bfl-1 and Mcl-1. Structure 27:606–617.e5 181. Poole AM, Ranganathan R (2006) Knowledge-based potentials in protein design. Current Opin Struct Biol 16: 508–513 182. Boas FE, Harbury PB (2007) Potential energy functions for protein design. Current Opin Struct Biol 17:199–204 183. Mackenzie CO, Zhou J, Grigoryan G (2016) Tertiary alphabet for the observable protein structural universe. Proc Natl Acad Sci 113: E7438–E7447 184. Zheng F, Zhang J, Grigoryan G (2015) Tertiary structural propensities reveal fundamental Sequence/structure relationships. Structure 23:961–971 185. Grigoryan G, Reinke AW, Keating AE (2009) Design of protein-interaction specificity gives selective bZIP-binding peptides. Nature 458: 859–864 186. Chen TS, Reinke AW, Keating AE (2011) Design of peptide inhibitors that bind the bZIP Domain of Epstein–barr virus protein BZLF1 J Mol Biol 408:304–320 187. Smith CA, Kortemme T (2010) Structurebased prediction of the peptide sequence space recognized by natural and synthetic PDZ domains. J Mol Biol 402:460–474 230 Maxence Delaunay and Tâp Ha-Duong 188. 
Zheng F, Jewell H, Fitzpatrick J, Zhang J, Mierke DF, Grigoryan G (2015) Computational design of selective peptides to discriminate between similar PDZ domains in an oncogenic pathway. J Mol Biol 427:491–510 189. Sievers SA, Karanicolas J, Chang HW, Zhao A, Jiang L, Zirafi O, Stevens JT, Münch J, Baker D, Eisenberg D (2011) Structure-based design of non-natural amino-acid inhibitors of amyloid fibril formation. Nature 475:96–100 190. Zhang C, Shen Q, Tang B, Lai L (2013) Computational design of helical peptides targeting TNFα. Angew Chem Int Edition 52: 11059–11062 191. Yang W, Zhang Q, Zhang C, Guo A, Wang Y, You H, Zhang X, Lai L (2019) Computational design and optimization of novel d-peptide TNFα inhibitors. FEBS Lett 593:1292–1302 192. Foight GW, Ryan JA, Gullá SV, Letai A, Keating AE (2014) Designed BH3 peptides with high affinity and specificity for targeting Mcl-1 in cells. ACS Chem Biol 9:1962–1968 193. Berger S, Procko E, Margineantu D, Lee EF, Shen BW, Zelter A, Silva D-A, Chawla K, Herold MJ, Garnier J-M, Johnson R, MacCoss MJ, Lessene G, Davis TN, Stayton PS, Stoddard BL, Fairlie WD, Hockenbery DM, Baker D (2016) Computationally designed high specificity inhibitors delineate the roles of BCL2 family proteins in cancer. eLife 5: e20352 Chapter 12 Rapid Rational Design of Cyclic Peptides Mimicking Protein–Protein Interfaces Brianda L. Santini and Martin Zacharias Abstract The cPEPmatch approach is a rapid computational methodology for the rational design of cyclic peptides to target desired regions of protein–protein interfaces. The method selects cyclic peptides that structurally match backbone structures of short segments at a protein–protein interface. In a second step, the cyclic peptides act as templates for designed binders by adapting the amino acid side chains to the side chains found in the target complex. A link to access the different tools that comprise the cPEPmatch method and a detailed step-by-step guide is provided. 
We outline the protocol by following the application to a trypsin protease in complex with the bovine pancreatic trypsin inhibitor (BPTI). An extension of our original approach is also presented, in which we describe in detail how to use the cPEPmatch methodology to identify hot regions of protein–protein interfaces prior to the matching. This extension reduces the number of putative cyclic peptides that must be evaluated and restricts the design to those that compete with the strongest protein–protein binding regions. It is illustrated by an application to an MHC class I protein complex.

Key words Protein–protein interactions, Protein interaction inhibition, Protein binding modulation, Peptidomimetics, Cyclic peptide design, Drug design with cyclic peptides, Rational cyclic peptide binders

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_12, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

Protein–protein interactions (PPIs) play a critical role in nearly all cellular processes, such as signaling, regulation, metabolism, and proliferation, making them promising drug targets for broad-spectrum therapeutic interests [1]. Hence, modulating PPIs is of great clinical relevance, and considerable effort has been put into targeting protein–protein interfaces by rational drug design in order to interfere with or even disrupt interactions. Typical interfaces of PPIs tend to be large, flat, and mainly hydrophobic, and only a few interface residues are crucial for protein–protein binding [2–5]. These residues, often referred to as hot spots, are major determinants of affinity and specificity [6, 7]. It has also been observed that such hot spots interact cooperatively and tend not to be uniformly spread across the interface, but grouped within tightly packed regions, known as hot loops [6, 8, 9]. Most notably, hot loop regions have shown secondary structural features, such as α-helices, β-strands, and turns [10, 11], which have established them as strong candidates for peptidomimetic drug design approaches.

The key to peptidomimetic strategies is to graft suitable side chains onto stable backbone scaffolds. In the case of PPIs, a detailed analysis of hot spots at the native interface can serve as a guide for the rational design of the mimetic. An important step that determines the affinity and specificity of a peptidomimetic drug is the appropriate selection of the scaffold, which is influenced by the location of the hot spots and by structural details [12].

We have recently proposed a cyclic peptide matching approach (cPEPmatch), a straightforward in silico process for the rapid search and optimization of cyclic peptides as putative PPI inhibitors based on using known cyclic peptide structures as scaffolds [13]. The great advantages of cyclic peptides compared to linear peptides are the often-improved cell membrane permeability (important for usage as drug molecules) and the relatively rigid three-dimensional backbone structure, which lowers the entropic cost of binding in a defined conformation. A standard linear peptide includes two rotatable bonds for each amino acid along the peptide main chain (N–Cα, Cα–C), allowing for many combinations of conformational backbone substates. By contrast, the backbone flexibility of a cyclic peptide is drastically reduced, although not totally eliminated, so the structure can still make small conformational adjustments upon binding [14].
However, a severe drawback is that it can be difficult to reliably predict the correct three-dimensional structure of arbitrary cyclic peptides. The general idea of our cPEPmatch method [13] is to start with a library of existing crystal structures of cyclic peptides and use them as templates for the construction of cyclic peptides that closely mimic interface sections of a partner protein in PPI complexes. This strategy avoids the de novo prediction of a putative cyclic peptide structure that could fit to the interface. We start the cPEPmatch process by characterizing backbone motifs of epitopes found at the PPI interfaces and comparing them to backbone motifs in a database of cyclic peptides with known structure. If a backbone match is found, the cyclic peptide structure is superimposed, and the corresponding interface amino acids of the cyclic peptide are substituted by those of the binding epitope to closely replicate that same interface. In our preliminary proof-of-principle study, our automated approach was tested against 154 protein–protein complexes; for the majority (~71%), we identified cyclic peptide motifs that resulted in stable bound complexes during MD refinement [13]. It was also possible to predict the structure of cyclic peptide binders that were previously found in experiments to strongly bind to proteins at a known interface. In this first work, however, key hot regions and their specific hot spots were not identified prior to the application of cPEPmatch; instead, the method was applied to the whole interface.

Chapter Overview: Here, the main goal is to outline the step-by-step application of the cPEPmatch approach on examples. First, we describe the application to the complex of the trypsin protease with the inhibitor protein BPTI.
In an extension of our original approach, we include the identification of regions in the PPI that contribute most to the interaction (hot regions) prior to the matching process, in order to identify putative cyclic peptides that potentially compete with such regions for binding. This limits the number of complexes that need to be evaluated using molecular dynamics (MD) simulations in combination with MM/GBSA (Molecular Mechanics Generalized Born Surface Area) calculations. Subsequently, we give a detailed description of the usage and analysis of the cPEPmatch methodology focusing on the hot loop sequences.

2 Materials

The cPEPmatch method requires a known protein–protein complex structure as input target and the selection of which partner should be the target for cyclic peptide template selection. For the present examples, the 4dg4.pdb file (trypsin receptor, chain A, in complex with trypsin inhibitor protein, chain B) and the MHC class I complex (5fa3.pdb) can be downloaded from the Protein Data Bank (www.rcsb.org). For all energy minimization and MD simulations, we use the Amber18 package [15], which can be obtained from https://ambermd.org/ (requires an academic license). The code for performing the cPEPmatch calculations can be downloaded freely from https://www.groups.ph.tum.de/t38/downloads (for academic use). It runs on any PC with a Linux operating system and includes installation instructions.

Another important input for the cPEPmatch approach is a collection of known cyclic peptide structures that can be used for superposition with structural motifs found at the interface of PPIs. We have expanded our library from the one presented previously [13] to a set of 72 cyclic peptides that vary in cyclization type and size.
The cyclic peptide structure types have been chosen to represent common portions of protein–protein interaction hot spots such as α-helices, β-strands, turns, and loops and can be downloaded from: www.groups.ph.tum.de/t38/downloads/. A future goal is to keep improving the database as more cyclic peptide crystal structures are solved. The user can add to the data set file or create a new cyclic peptide data set (see Note 1).

3 Methods

Our cyclic peptide matching approach is divided into three main steps (apart from the initial library construction described in the previous section and Note 1), outlined below for targeting the interface structure of the complex of trypsin inhibitor protein (BPTI) and trypsin (pdb4dg4, see Fig. 1). For this case, a cyclic peptide (called sunflower peptide) with high binding affinity for trypsin that can effectively compete with BPTI is known, and its known structure in complex with trypsin (pdb1sfi) can be used for comparison to the cPEPmatch result.

3.1 PPI Motif Characterization and Matching

The first step is to characterize the PPI interface in terms of distances between four consecutive backbone (CA) atoms and match them with the corresponding distances in the data set of cyclic peptides. The program int_analyse identifies all neighboring protein residues at the interface of the input PPI complex (pdb4dg4) and, subsequently, characterizes its backbone atom motifs. Besides the input complex structure (PDB file), it requires the name of the cyclic peptide database, the cutoff for atoms counted as belonging to the interface, and a threshold for the mean distance deviation that defines a match between a backbone atom segment at the interface and a given cyclic peptide backbone. In previous work, we found 7 Å sufficient for the interface cutoff definition to identify all relevant cyclo-peptide matches [13].
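The distance-based comparison described above can be illustrated with a short sketch. This is not the int_analyse implementation: it assumes that a motif is described by the six pairwise distances among its four Cα atoms and that a match is declared when the mean absolute difference between the two distance sets does not exceed a user-chosen threshold. All function names are hypothetical.

```python
import itertools
import math


def motif_distances(ca_coords):
    """Return the six pairwise distances among four CA positions (in Å)."""
    if len(ca_coords) != 4:
        raise ValueError("a motif is defined here by exactly four CA atoms")
    return [math.dist(a, b) for a, b in itertools.combinations(ca_coords, 2)]


def mean_distance_deviation(motif_a, motif_b):
    """Mean absolute difference between corresponding pairwise distances.

    Assumption: both motifs list their CA atoms in the same sequential order.
    """
    dist_a = motif_distances(motif_a)
    dist_b = motif_distances(motif_b)
    return sum(abs(x - y) for x, y in zip(dist_a, dist_b)) / len(dist_a)


def is_match(interface_motif, peptide_motif, threshold):
    """Accept the pair if the mean deviation does not exceed the threshold (Å)."""
    return mean_distance_deviation(interface_motif, peptide_motif) <= threshold
```

Because only internal distances are compared, the test is independent of how the two structures are positioned in space, which is why a superposition is needed only after a match has been found.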
We set the threshold for the mean distance deviation to 0.5 Å to accept only sufficiently precise matches. Hence, the int_analyse command is given as:

    ./int_analyse 4dg4.pdb cPEPdatabase.dat 7.0 0.5

Fig. 1 Trypsin protein (yellow surface) in complex with (a) trypsin inhibitor protein (pdb4dg4, cyan carbons) after 10 ns of MD simulation, (b) best matched cyclic peptide (based on pdb3avb, pink carbons) mimicking the trypsin inhibitor protein after 10 ns MD simulation, and (c) sunflower cyclic peptide (pdb1sfi, silver carbons) shown for comparison

The output is a match_list.dat file that contains a list of the matched cyclic peptides in the order in which they were found. For each match, there is a single row like:

    3avb | 748 752 760 768 8 16 25 33

This can be read as: the Cα atoms 8, 16, 25, and 33 of the cyclic peptide with pdb code 3avb structurally match the consecutive Cα atoms with numbers 748, 752, 760, and 768 in the target 4dg4.pdb file.

3.2 Superposition of Cyclic Peptides on Interface Segments

The next step is performed using the pose_motif tool, which superimposes the coordinates of the identified cyclic peptide on the matching backbone interface motif segment of the target PPI. It requires the name of the complex structure, the cyclic peptide structure, and the backbone atom numbers as input, e.g., for our example:

    ./pose_motif 4dg4.pdb 3avb.pdb 748 752 760 768 8 16 25 33

This step returns a coordinate file of the cyclic peptide and a Fit-RMSD value that measures how closely the backbone motif at the interface is represented by the corresponding backbone structure in the cyclic peptide. The process can be repeated for other matches listed in the match_list.dat file (a script is provided in the download to automate this process). After this step, a visual analysis can be useful to decide which of the superimposed structures from the match list have a good steric fit.
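As an illustration of how the rows of match_list.dat map onto pose_motif calls, the following sketch parses one row and assembles the corresponding command line. It is independent of the script shipped with the download and assumes the row format exactly as shown in the example above; the function names, and the convention that the cyclic peptide file is named after its pdb code, are assumptions made for this sketch.

```python
def parse_match_line(line):
    """Split a row such as '3avb | 748 752 760 768 8 16 25 33'.

    Returns (pdb_code, target_ca_atoms, peptide_ca_atoms).
    """
    code, _, numbers = line.partition("|")
    fields = [int(token) for token in numbers.split()]
    if len(fields) != 8:
        raise ValueError(f"expected 8 atom numbers, got {len(fields)}: {line!r}")
    # First four numbers: CA atoms in the target complex;
    # last four numbers: CA atoms in the cyclic peptide.
    return code.strip(), fields[:4], fields[4:]


def pose_motif_command(target_pdb, line):
    """Build the argument list for one pose_motif call from one match row."""
    code, target_atoms, peptide_atoms = parse_match_line(line)
    atom_args = [str(number) for number in target_atoms + peptide_atoms]
    # Assumption: the peptide structure is stored as '<pdb code>.pdb'.
    return ["./pose_motif", target_pdb, f"{code}.pdb", *atom_args]


def commands_from_match_list(target_pdb, rows):
    """Build one command per non-empty row of a match list."""
    return [pose_motif_command(target_pdb, row) for row in rows if row.strip()]
```

Each returned list can be passed to `subprocess.run(command, check=True)` to execute the superposition for that match.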
Criteria to consider in the visual inspection of the superimposed structures include proper size, secondary structure similarity, and alignment. Only selected putative matches should be processed for sequence adaptation and analysis.

3.3 Sequence Adaptation

The final step of cPEPmatch before the evaluation analysis is to adapt the sequence of the cyclic peptide to mimic the interface of the PPI as closely as possible. The standard procedure we proposed [13] is to replace the side chains in the selected cyclic peptide by the side chains found in the original PPI complex. This standard replacement is included in the pose_motif tool. It provides an out.pdb file with the receptor protein coordinates (trypsin) and the cyclic peptide coordinates including the side chains copied from the interface of the target complex.

4 Evaluation of Matched Complexes

4.1 System Preparation

For the evaluation and scoring of protein–cyclic peptide complexes, we use the Amber18 package (see Materials). Structures are processed for energy minimization (EM) and MD simulations using the tleap module of Amber18 following standard procedures (see also the Amber18 manual). Note that, for the simulation of disulfide-bonded cyclic peptides, the input PDB for tleap preparation must use the residue name CYX instead of the regular CYS for the cysteine residues participating in disulfide bonds. The pdb4amber tool of the package can add this automatically. Special steps have to be added when preparing head-to-tail or similarly cyclized peptides, as described in Note 2. Following the standard setup in Amber18, protein parameters are taken from the ff14SB force field [16]. The complexes are then neutralized by the addition of Na+ or Cl− ions and are solvated in an orthorhombic box with a minimum distance to the box boundaries of 10 Å using explicit TIP3P water molecules [17]. First, all simulation systems are energy minimized with the steepest descent method in 2000 steps by using the Amber18 sander module.
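The CYS to CYX renaming required for disulfide-bonded cysteines can be illustrated with a short sketch. In practice pdb4amber performs this step; the code below only shows the underlying idea and is not how pdb4amber is implemented. It assumes fixed-column PDB ATOM records and treats two cysteine SG atoms closer than 2.5 Å as bonded; both the cutoff and the function names are assumptions.

```python
import itertools
import math


def _is_cys_atom(line):
    """True for ATOM/HETATM records that belong to a residue named CYS."""
    return line.startswith(("ATOM", "HETATM")) and line[17:20] == "CYS"


def disulfide_residues(pdb_lines, cutoff=2.5):
    """Return (chain, residue number) keys of CYS residues with bonded SG atoms."""
    sulfurs = []
    for line in pdb_lines:
        if _is_cys_atom(line) and line[12:16].strip() == "SG":
            key = (line[21], line[22:27])  # chain ID, residue number + insertion code
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            sulfurs.append((key, xyz))
    bonded = set()
    for (key_a, xyz_a), (key_b, xyz_b) in itertools.combinations(sulfurs, 2):
        if math.dist(xyz_a, xyz_b) <= cutoff:  # assumed S-S bonding criterion
            bonded.update((key_a, key_b))
    return bonded


def rename_cys_to_cyx(pdb_lines, cutoff=2.5):
    """Rename every atom record of disulfide-bonded CYS residues to CYX."""
    bonded = disulfide_residues(pdb_lines, cutoff)
    renamed = []
    for line in pdb_lines:
        if _is_cys_atom(line) and (line[21], line[22:27]) in bonded:
            line = line[:17] + "CYX" + line[20:]
        renamed.append(line)
    return renamed
```

Free cysteines are left unchanged, so the output can be compared with the pdb4amber result as a sanity check before running tleap.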
All subsequent MD simulations can be performed with the pmemd.cuda module, which allows faster simulations than the Sander program. Initially, the systems are heated to 310 K in three stages (100 K, 200 K, and 310 K). Each stage is simulated for 100 ps with positional restraints on all non-hydrogen atoms with respect to the starting conformation. Subsequently, the positional restraints are gradually reduced from the initial 25 to 0.5 kcal·mol−1·Å−2 in five consecutive simulations of 100 ps at 310 K and at a constant pressure of 1 bar. The equilibrated structures serve as input for the unrestrained production runs of each system. Data gathering simulations are carried out for 10 ns. Coordinates are written out every 500 steps. A time step of 2 fs is used, and all bonds involving hydrogens are constrained to their optimal length using SHAKE [18].

4.2 Trajectory Analysis to Score the Matches

Stable binding of the cyclic peptide can be assessed first by visual analysis of the MD trajectory. We use the MM/GBSA (Molecular Mechanics Generalized Born Surface Area) method for analyzing the mean interaction energy, following the well-established single trajectory method [19] as implemented in the MMPBSA.py module of Amber18. Calculations are carried out using 500 snapshots retrieved from the last 5 ns of the production MD simulation, employing the modified GB model (igb = 5) with mbondi2 radii, and α, β, and γ values of 1.0, 0.8, and 4.85, respectively. Dielectric constants for the solvent and the solute are 80 and 5, respectively. As output, the approach gives the mean interaction energy between the cyclic peptide and the protein partner.
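The simulation settings above fix how many frames are available for scoring. The short Python sketch below works through that arithmetic for a 10 ns production run with a 2 fs time step and coordinates written every 500 steps. It assumes that the 500 MM/GBSA snapshots are evenly spaced over the last 5 ns, which the text does not state explicitly, and the function names are hypothetical.

```python
def production_frames(length_ns=10.0, timestep_fs=2.0, write_every=500):
    """Return (number of MD steps, number of saved frames) for a production run."""
    steps = round(length_ns * 1e6 / timestep_fs)  # 1 ns = 1e6 fs
    return steps, steps // write_every


def snapshot_stride(window_ns=5.0, n_snapshots=500, timestep_fs=2.0, write_every=500):
    """Return (frames in the analysis window, stride giving n_snapshots frames).

    Assumes evenly spaced snapshots within the window.
    """
    frames_in_window = round(window_ns * 1e6 / timestep_fs) // write_every
    return frames_in_window, frames_in_window // n_snapshots


if __name__ == "__main__":
    steps, frames = production_frames()
    window_frames, stride = snapshot_stride()
    print(f"{steps} steps, {frames} saved frames")
    print(f"last 5 ns: {window_frames} frames, use every {stride}th frame")
```

Under these assumptions the production run comprises 5,000,000 steps and 10,000 saved frames (one per picosecond), and taking every 10th frame of the last 5,000 frames yields the 500 snapshots.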
For our 4dg4.pdb (trypsin/BPTI) example, two cyclo-peptide matches are evaluated using the above procedure. The best scoring cyclic peptide closely resembles a turn motif at the trypsin/BPTI interface and is also in very close agreement with the structure of the known sunflower cyclic peptide inhibitor (root-mean-square deviation (RMSD) at the interface < 0.5 Å, see Fig. 1).

5 Focusing on Hot Spot Regions at the Protein–Protein Interface

5.1 Application to the MHC PPI

MHC class I complexes bind small antigenic peptides and present them to the immune system at the cell surface, controlling which fragments of a pathogen or cancer antigen are presented to cytotoxic T cells for immune recognition [20]. Peptide binding is strongly coupled to the complexation of the heavy chain with the β2-microglobulin (β2m) partner. Dissociation of β2m also leads to loss of peptide binding [21]. Hence, cyclic peptides designed to interfere with this interaction can potentially be used to control and modulate the immune response (including many undesired autoimmune reactions). For our application, we use the pdb-entry 5fa3 (a human MHC class I molecule). In the present example, our goal is to target the heavy chain by replacing/mimicking the β2m partner with a cyclic peptide. In Fig. 2, two main contact regions between both partners (chains A and C in the original 5fa3.pdb) are indicated.

Fig. 2 Contact regions 1 and 2 between the heavy chain (yellow cartoon, with the marked subdomains α1, α2, and α3) and the β2m partner (cyan) of the major histocompatibility complex (MHC) class I complex (pdb5fa3). The bound antigenic peptide (located in the binding cavity formed by the α1 and α2 subdomains) is indicated in orange. The dotted boxes mark the two main interface regions between the heavy chain and the β2m partner targeted by the cPEPmatch approach

5.2 Identification of Hot Loops for cPEPmatch

Since there is more than one contact region between chains A and C in 5fa3.pdb, and each includes multiple contact motifs, a direct application of cPEPmatch resulted in a large number of matches. In order to reduce the number of potential complexes that need to be evaluated and to design only the strongest competitors, we used an extension of our original approach. In this extension, we first perform a short MD simulation and MM/GBSA analysis to predict the interface segments that contribute most to the interaction in the heavy chain/β2m complex (hot loops). We then focus the search for cyclic peptide binders on these hot loop interface regions. The setup and MD simulation of the MHC complex are performed in exactly the same way as described above for the scoring of the cyclo-peptide/protein complexes.

5.2.1 Trajectory Analysis and Binding Hot Spot Identification

Residues that contribute most to the interaction in the MHC class I complex are identified using the MM/GBSA method as described in Subheading 4.2, but employing the option to include a per-residue interaction energy decomposition (ΔGres), according to the single trajectory method as implemented in the MMPBSA.py module of Amber18 [19, 22]. As output, one obtains the mean interaction energy contribution for each residue along the sequence (Fig. 3). In our case, we use all residues belonging to the β2m partner because our designed cyclic peptide should superimpose on a backbone segment of the β2m partner. We define hot spot residues potentially important for binding as those with a total ΔGres < −kBT ≈ −0.6 kcal·mol−1 (kB: Boltzmann constant; T: temperature, 300 K) and with the majority of their interaction energy contribution due to side chain interactions. Hot loops are segments of 8 to 10 residues that include at least four hot spot residues. For the present example, two hot segments can be clearly identified (Fig. 3).
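The hot spot and hot loop criteria can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the threshold is read as ΔGres < −kBT ≈ −0.6 kcal·mol−1 (favorable contributions are negative), "majority due to side chain interactions" is interpreted as the side chain term being more negative than the backbone term, residues are assumed to be consecutive along the sequence, and the decomposition values in EXAMPLE are hypothetical numbers chosen for the demonstration.

```python
K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)

# Hypothetical per-residue decomposition: (residue number, total, side chain, backbone)
EXAMPLE = [
    (1, -0.1, -0.05, -0.05), (2, -2.4, -2.5, 0.1), (3, 0.2, -0.1, 0.3),
    (4, -3.3, -3.6, 0.3), (5, 0.1, -0.1, 0.2), (6, -6.9, -7.0, 0.1),
    (7, -1.0, 0.0, -1.0), (8, -3.0, -3.8, 0.8), (9, -0.3, -0.2, -0.1),
    (10, -0.2, -0.1, -0.1), (11, 0.0, 0.0, 0.0), (12, 0.1, 0.1, 0.0),
]


def hot_spots(residues, temperature=300.0):
    """Residues with total below -kB*T and a dominant side chain contribution."""
    threshold = -K_B * temperature  # about -0.6 kcal/mol at 300 K
    return {
        number
        for number, total, side_chain, backbone in residues
        if total < threshold and side_chain < backbone
    }


def hot_loops(residues, min_len=8, max_len=10, min_hot=4, temperature=300.0):
    """All windows of min_len..max_len residues holding at least min_hot hot spots."""
    numbers = [entry[0] for entry in residues]
    hot = hot_spots(residues, temperature)
    loops = []
    for length in range(min_len, max_len + 1):
        for start in range(len(numbers) - length + 1):
            window = numbers[start:start + length]
            n_hot = sum(1 for number in window if number in hot)
            if n_hot >= min_hot:
                loops.append((window[0], window[-1], n_hot))
    return loops


if __name__ == "__main__":
    print(sorted(hot_spots(EXAMPLE)))
    for first, last, n_hot in hot_loops(EXAMPLE):
        print(f"residues {first}-{last}: {n_hot} hot spots")
```

In this toy data set, residue 7 passes the energy threshold but is rejected because its contribution comes from the backbone. Overlapping candidate windows would still need to be reduced to a single loop per contact region, for example by visual inspection.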
Loop 1 is chosen from residues 278 to 287, and it contains four hot spot residues: Lys279, Gln281, Tyr283, and Arg285. Loop 2 is selected from residues 326 to 335 and contains five hot spot residues: Asp326, Leu327, Phe329, Trp333, and Phe335. The side chain and backbone contributions for all the hot spots are shown in Table 1. The coordinates of both loops were extracted from the last frame of the MD simulation and used as input for cPEPmatch. It should be noted that alternative methods to identify important interaction regions can also be used at this step.

Fig. 3 Hot segment selection for each contact region in the MHC class I heavy chain interface to the β2m subunit. (a) Per-residue contributions to the effective interaction energy as calculated by MM/GBSA decomposition. The chosen hot loops along the β2m sequence are shown inside blue rectangles. (b) Loop 1 interface with labeled hot spots, (c) same as (b) but for loop 2. All subunits are represented as cartoons (α1, α2, α3 segments of the heavy chain: yellow, β2m: cyan)

5.3 Application of cPEPmatch to Hot Loop Regions

The application of the cPEPmatch approach follows the same procedure as described in Subheading 4. However, instead of the original MHC class I complex file, we use as input a modified complex file that contains the heavy chain coordinates (receptor) and just the identified hot loop structures (in two separate complex pdb files). With the command

./int_analyse receptor-and-loop.pdb database.dat 7.0 0.5

we obtain a list of matches stored in a match_list.dat file. The matches are again used to construct complexes of the cyclo-peptide with the MHC class I heavy chain, and the mean interaction energy is calculated following the protocol of Subheading 4.

5.4 Selected Results for Targeting the MHC α Chain

Three stable cyclic peptide matches were found for each hot loop (Table 2).
Although it is not possible to include every hot spot in each structure, as many as possible were mutated in each match. Figure 4 shows one representative match for each loop after 10 ns of MD simulation. In both cases, we observe that the mutated hot spots have orientations similar to those found in the β2m interface to the heavy chain (which they are mimicking). Also, the decomposition of the binding free energy shows similar behavior for all of the matches. A total of six putative cyclic peptides have been found to target the α chain of MHC class I. Previous work [13] indicated that for known complexes of peptides binding to proteins, calculated interaction energies of similar magnitude are obtained (< 30 kcal·mol−1). Hence, some of the suggested cyclopeptides may indeed show stable binding to the target structure.

Table 1 MM/GBSA free energy decomposition for the two chosen hot loops in the β2m subunit. Three free energy values (in kcal·mol−1) are shown for each residue: total (ΔGres), side chain (ΔGres-ss), and backbone (ΔGres-bb) contributions

Loop 1                                   Loop 2
Residue   ΔGres   ΔGres-ss   ΔGres-bb    Residue   ΔGres   ΔGres-ss   ΔGres-bb
PRO 278   0.02    0.06       0.08        ASP 326   9.11    9.03       0.08
LYS 279   2.37    2.49       0.12        LEU 327   1.88    2.38       0.49
ILE 280   0.20    0.10       0.30        SER 328   0.92    0.42       0.51
GLN 281   3.26    3.60       0.35        PHE 329   5.35    5.54       0.20
VAL 282   0.10    0.14       0.24        SER 330   0.70    0.18       0.51
TYR 283   6.89    6.99       0.10        LYS 331   0.64    0.28       0.36
SER 284   1.05    0.04       1.10        ASP 332   0.87    0.70       0.17
ARG 285   3.00    3.77       0.77        TRP 333   8.35    7.28       1.07
HIE 286   0.32    0.18       0.14        SER 334   0.06    0.01       0.05
PRO 287   1.16    0.80       0.36        PHE 335   2.83    2.77       0.06

Table 2 Best matches found for both hot loops identified for the MHC class I system

Loop   Match(a)   Substitutions in the cyclic peptide       F-RMSD (Å)   ΔGinteraction (kcal·mol−1)
1      1ebp       9-VAL, 10-TYR, 11-SER, 12-ARG             0.24         25.17
1      5eoc       2-LYS, 3-ILE, 4-GLN, 5-VAL, 6-TYR         0.04         15.75
1      4w4z       3-LYS, 5-GLN, 6-VAL, 7-TYR                0.09         24.62
2      3zwz       8-ASP, 9-LEU, 10-SER, 11-PHE              0.16         39.53
2      3avb       1-ASP, 2-TRP, 3-SER, 4-PHE                0.06         20.88
2      1ebp       14-SER, 15-ASP, 16-LEU                    0.14         51.63

(a) Indicates the pdb-entry of the matching cyclic peptide

Fig. 4 Representative matches and modeled structures of protein–cyclic peptide complexes for the MHC class I heavy chain (yellow) targeting the interaction with β2m. (a) Cyclic peptide match pdb5eoc mimicking Loop 1. (b) Cyclic peptide match pdb3zwz mimicking Loop 2. In both cases, the heavy chain (yellow) and matched cyclic peptides (pink) are shown as cartoon. The labeled residues (sticks) correspond to the β2m residues in the native MHC class I complex that are replaced in the complex with the cyclic peptides

6 Concluding Notes

• We recently reported the cPEPmatch approach for the rational design of cyclic peptides that target protein–protein interfaces. Hundreds of PPIs can be screened within a few seconds for cyclic peptides that match backbone structures at the PPI interface, and even with a relatively small set of cyclic peptide templates, we have shown that it is possible to identify putative stably bound cyclic peptide–protein complexes [13].
• An advantage of our cPEPmatch method compared to experimental studies is that we base the construction of a desired cyclic peptide on known stable (high resolution) cyclic template structures, avoiding the uncertainty about how well a selected cyclization of a given motif will resemble a desired binding region.
• A key to finding a stable binder is the adaptation of the cyclic peptide sequence to closely mimic the essential protein–protein interface interactions.
• We described an extension of our original cPEPmatch approach to target PPIs that have multiple and/or large binding sites. It consists of a short MD simulation and MM/GBSA analysis for the identification of hot loops in the PPI prior to the matching process in order to identify putative cyclic peptides that target such regions.
• Hot loop identification reduces the number of putative cyclic peptides to be evaluated and restricts the design to those cyclic peptides that specifically compete with the strongest protein–protein binding regions.
• The cPEPmatch hot loop extension was applied to target the heavy chain of an MHC class I example, and six putative cyclic peptide binders are suggested.

7 Notes

1. Extension of the cyclo-peptide database. Additional cyclic peptides can be added to the database by using our backbo FORTRAN tool. This program calculates distance matrices for sets of four consecutive Cα atoms by iterating through every set of four residues. The output is a set of motif values, which specifies the measured distances and corresponding amino acid positions. Backbo can be run directly from a UNIX terminal. An example for the 8-residue 3avb cyclic peptide is shown below. Run the command:

$ backbo -i 3avb.pdb >> cPEPdatabase.dat

This produces output of the following form and appends it to an existing data file:

start 3avb
1 5.64 8.90 6.95 2 8 16 25
2 6.95 8.55 5.34 8 16 25 33
3 5.34 4.81 5.60 16 25 33 41
4 5.60 5.80 6.23 25 33 41 49
5 6.23 9.01 5.38 33 41 49 57

The 3avb cyclic peptide has five motifs, one per output row. The first number on each row is the motif number. The next three numbers are the distance values of that motif, while the last four numbers specify the corresponding Cα atom numbers. All the cyclic peptide sets of motifs should be stored in a database.dat file, which is used by the int_analyse tool during the matching process.

2. Special steps have to be added when dealing with the preparation of head-to-tail or similarly cyclized peptides. There are three steps to take: (1) modification of the AMBER force field “leaprc.ff14SB” parameter file to eliminate the mapping of terminal residues to allow the cyclic bond.
First, a copy of the file, which should be found in “$AMBERHOME/dat/leap/cmd/,” must be saved in the current working directory. A new name should be given to the file, e.g., “leaprc.cPep.” Then, the section that contains the residue mapping that defines terminal residues as the N- or C-terminal variants of those residues must be removed from the copy. It looks as follows:

addPdbResMap {
  { 0 "HYP" "NHYP" } { 1 "HYP" "CHYP" }
  { 0 "ALA" "NALA" } { 1 "ALA" "CALA" }
  { 0 "ARG" "NARG" } { 1 "ARG" "CARG" }
  { 0 "ASN" "NASN" } { 1 "ASN" "CASN" }
  { 0 "ASP" "NASP" } { 1 "ASP" "CASP" }
  { 0 "CYS" "NCYS" } { 1 "CYS" "CCYS" }
  { 0 "CYX" "NCYX" } { 1 "CYX" "CCYX" }
  { 0 "GLN" "NGLN" } { 1 "GLN" "CGLN" }
  { 0 "GLU" "NGLU" } { 1 "GLU" "CGLU" }
  { 0 "GLY" "NGLY" } { 1 "GLY" "CGLY" }
  { 0 "HID" "NHID" } { 1 "HID" "CHID" }
  { 0 "HIE" "NHIE" } { 1 "HIE" "CHIE" }
  { 0 "HIP" "NHIP" } { 1 "HIP" "CHIP" }
  { 0 "ILE" "NILE" } { 1 "ILE" "CILE" }
  { 0 "LEU" "NLEU" } { 1 "LEU" "CLEU" }
  { 0 "LYS" "NLYS" } { 1 "LYS" "CLYS" }
  { 0 "MET" "NMET" } { 1 "MET" "CMET" }
  { 0 "PHE" "NPHE" } { 1 "PHE" "CPHE" }
  { 0 "PRO" "NPRO" } { 1 "PRO" "CPRO" }
}

(2) Manual editing of the input PDB file: remove the terminal OXT atom and add, at the end of the file, a “CONECT” record linking the C-terminal carbon and the N-terminal nitrogen. (3) Subsequently, the standard tleap preparation protocol is performed using the modified PDB file as input and sourcing the modified “leaprc.cPep” parameter file.

Acknowledgments

This research was conducted within the Max Planck School Matter to Life supported by the German Federal Ministry of Education and Research (BMBF) in collaboration with the Max Planck Society. We also acknowledge the Leibniz Supercomputing Centre (LRZ) for providing supercomputer support through grant pr27za.

References

1.
Fontaine F, Overman J, François M (2015) Pharmacological manipulation of transcription factor protein-protein interactions: opportunities and obstacles. Cell Regen 4:2
2. Bahadur RP, Zacharias M (2018) The interface of protein-protein complexes: analysis of contacts and prediction of interactions. Cell Mol Life Sci 65:1059–1072. https://doi.org/10.1007/s00018-007-7451-x
3. Murray JK, Gellman SH (2007) Targeting protein–protein interactions: lessons from p53/MDM2. Biopolymers 88:657–686. https://doi.org/10.1002/bip.20741
4. Corbi-Verge C, Kim PM (2016) Motif mediated protein-protein interactions as drug targets. Cell Commun Signal 14:8
5. Conte LL, Chothia C, Janin J (1999) The atomic structure of protein-protein recognition sites. J Mol Biol 285:2177–2198. https://doi.org/10.1006/jmbi.1998.2439
6. Keskin O, Ma B, Nussinov R (2005) Hot regions in protein-protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol 345:1281–1294. https://doi.org/10.1016/j.jmb.2004.10.077
7. Metz A, Pfleger C, Kopitz H et al (2012) Hot spots and transient pockets: predicting the determinants of small-molecule binding to a protein-protein interface. J Chem Inf Model 52:120–133. https://doi.org/10.1021/ci200322s
8. Wells JA, McClendon CL (2007) Reaching for high-hanging fruit in drug discovery at protein-protein interfaces. Nature 450:1001–1009. https://doi.org/10.1038/nature06526
9. Arkin MR, Tang Y, Wells JA (2014) Small-molecule inhibitors of protein-protein interactions: progressing toward the reality. Chem Biol 21:1102–1114. https://doi.org/10.1016/j.chembiol.2014.09.001
10. Qiu Y, Li X, He X et al (2020) Computational methods-guided design of modulators targeting protein-protein interactions (PPIs). Eur J Med Chem 207:112764. https://doi.org/10.1016/j.ejmech.2020.112764
11.
Scott DE, Bayly AR, Abell C, Skidmore J (2016) Small molecules, big targets: drug discovery faces the protein-protein interaction challenge. Nat Rev Drug Discov 15:533–550. https://doi.org/10.1038/nrd.2016.29
12. Andrei SA, de Vink P, Sijbesma E et al (2018) Rationally designed semisynthetic natural product analogues for stabilization of 14-3-3 protein-protein interactions. Angew Chemie 130:13658–13662. https://doi.org/10.1002/ange.201806584
13. Santini BL, Zacharias M (2020) Rapid in silico design of potential cyclic peptide binders targeting protein-protein interfaces. Front Chem 8:2134. https://doi.org/10.3389/fchem.2020.573259
14. Duffy FJ, Devocelle M, Shields DC (2015) Computational approaches to developing short cyclic peptide modulators of protein–protein interactions. Methods Mol Biol 1268:241–271. https://doi.org/10.1007/978-1-4939-2285-7_11
15. Case DA, Belfon K, Ben-Shalom IY, Brozell SR, Cerutti DS, Cheatham TE III, Cruzeiro VWD, Darden TA, Duke RE, Giambasu G, Gilson MK, Gohlke H, Goetz AW, Harris R, Izadi S, Izmailov SA, Kasavajhala K, Kovalenko A, Krasny R, York DM, Kollman PA (2018) AMBER 2018. University of California, San Francisco
16. Maier JA, Martinez C, Kasavajhala K et al (2015) ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J Chem Theory Comput 11:3696–3713. https://doi.org/10.1021/acs.jctc.5b00255
17. Jorgensen WL, Chandrasekhar J, Madura JD et al (1983) Comparison of simple potential functions for simulating liquid water. J Chem Phys 79:926–935. https://doi.org/10.1063/1.445869
18. Ryckaert JP, Ciccotti G, Berendsen HJC (1977) Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J Comput Phys 23:327–341. https://doi.org/10.1016/0021-9991(77)90098-5
19. Wang C, Greene D, Xiao L et al (2018) Recent developments and applications of the MMPBSA method. Front Mol Biosci 4:201–215
20.
Maenaka K, Jones Y (1999) MHC superfamily structure and the immune system. Curr Opin Struct Biol 9:745–753
21. Montealegre S, Venugopalan V, Fritzsche S et al (2015) Dissociation of β2-microglobulin determines the surface quality control of major histocompatibility complex class I molecules. FASEB J 29:2780–2788. https://doi.org/10.1096/fj.14-268094
22. Gohlke H, Kiel C, Case DA (2003) Insights into protein-protein binding by binding free energy calculation and free energy decomposition for the Ras-Raf and Ras-RalGDS complexes. J Mol Biol 330:891–913. https://doi.org/10.1016/S0022-2836(03)00610-7

Chapter 13

Structural Prediction of Peptide–MHC Binding Modes

Marta A. S. Perez, Michel A. Cuendet, Ute F. Röhrig, Olivier Michielin, and Vincent Zoete

Abstract

The immune system is constantly protecting its host from the invasion of pathogens and the development of cancer cells. The specific CD8+ T-cell immune response against virus-infected cells and tumor cells is based on the T-cell receptor recognition of antigenic peptides bound to class I major histocompatibility complexes (MHC) at the surface of antigen presenting cells. Consequently, the peptide binding specificities of the highly polymorphic MHC have important implications for the design of vaccines, for the treatment of autoimmune diseases, and for personalized cancer immunotherapy. Evidence-based machine-learning approaches have been successfully used for the prediction of peptide binders and are currently being developed for the prediction of peptide immunogenicity. However, understanding and modeling the structural details of peptide/MHC binding is crucial for a better understanding of the molecular mechanisms triggering the immunological processes, estimating peptide/MHC affinity using universal physics-based approaches, and driving the design of novel peptide ligands.
Unfortunately, due to the large diversity of MHC allotypes and possible peptides, the growing number of 3D structures of peptide/MHC (pMHC) complexes in the Protein Data Bank only covers a small fraction of the possibilities. Consequently, there is a growing need for rapid and efficient approaches to predict 3D structures of pMHC complexes. Here, we review the key characteristics of the 3D structure of pMHC complexes before listing databases and other sources of information on pMHC structures and MHC specificities. Finally, we discuss some of the most prominent pMHC docking software.

Key words Immune system, Major histocompatibility complex, T-cell receptor, Peptide antigen, Peptide docking, Docking algorithms, Molecular mechanics, Ligand binding, Databases

Marta A.S. Perez and Michel A. Cuendet contributed equally to this work.

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_13, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

The immune system is constantly defending the host against the invasion of a wide range of infectious pathogens such as viruses, bacteria, and fungi, but also against the emergence of cancer cells. Several groups of molecules and cells are in charge of fighting infections and maintaining a healthy organism. The cellular immune response is based on the natural proteolytic degradation of proteins within cells, producing peptides that are subsequently displayed at the cell surface by the major histocompatibility complex (MHC) molecule [1, 2]. Foreign peptides arising from virus infection or abnormal peptides originating from a malignant transformation can be recognized in complex with an MHC (i.e., a pMHC complex) by a T-cell receptor (TCR).
This recognition constitutes a key step in the regulation of T-cell activation and further immune responses against virus-infected cells or tumor cells. There are two major classes of MHC molecules. MHC class I molecules are produced by almost all cells in the body, while MHC class II molecules are found in antigen presenting cells of the immune system such as dendritic cells [3, 4].

With the rise of immunotherapy treatments against cancer [5–12] and the objective to address autoimmune diseases [13], many approaches, most of them based on sequence data and machine-learning algorithms, have been developed to predict peptide ligands of different MHC allotypes or the immunogenicity of pMHC complexes [14, 15]. Simultaneously, molecular modeling approaches able to predict the binding of a peptide to a given MHC molecule and to determine the corresponding binding mode have been reemerging as a subject of strong interest [16]. Indeed, this information could provide additional insights into the mechanism of peptide binding to MHC, open the door to structure-based estimation of peptide/MHC affinity [17], uncover unknown structural drivers of T-cell activation, and guide the design of novel peptide ligands [18–21].

More than 600 pMHC class I three-dimensional (3D) structures are currently available in the Protein Data Bank. However, these data cover only some tens of the 19,000 known human MHC alleles [18] and a tiny fraction of the peptides that can be produced from the human proteome via proteasomal processing [19]. Despite important progress, experimental methods for protein–ligand structure determination are too time-consuming and expensive to address the missing information or to be routinely used in the context of personalized immunotherapy of cancer, for instance. Consequently, there is a need for rapid and efficient approaches to predict 3D structures of pMHC complexes [17].
However, many challenges have to be overcome to computationally dock a peptide into an MHC molecule, due to the size and the flexibility of the peptide ligands. To address these challenges, many suitable approaches have been developed based on the structural particularities of the pMHC complex, as well as available experimental data. Here, after summarizing the structural characteristics of pMHC class I and II complexes, we briefly review the numerous sources of information regarding the TCRpMHC system in general and the pMHC complex in particular, which can be of interest for developing and evaluating pMHC docking software. Finally, we review some of these docking approaches, focusing mainly on docking to MHC class I.

2 The Structure of the pMHC Complex

2.1 The pMHC Class I Structure

MHC class I molecules are heterodimeric complexes composed of a ~44 kDa membrane-anchored polymorphic heavy chain and a 12 kDa invariant soluble β2-microglobulin (β2m) (Fig. 1) [1, 2, 20]. From N- to C-terminus, the heavy chain comprises three extracellular domains, α1, α2, and α3, a transmembrane segment, and a cytoplasmic tail. β2m is noncovalently linked to the α3 domain of the heavy chain. Both β2m and α3 show an immunoglobulin-like fold. The α3 domain also interacts with the CD8 co-receptor of T-cells when present [21, 24]. Although β2m and α3 are not in direct contact with the peptide, β2m interacts with α1 and α2, stabilizing the heavy chain and enhancing peptide binding [25–27]. Very importantly, the α1 and α2 domains of the heavy chain form a β-sheet of eight β-strands, with four strands coming from each of α1 and α2. The β-sheet plane is roofed by two helices, one originating from α1 and the second one from α2. These two α-helices overhanging the β-sheet floor form a polymorphic peptide-binding groove (Fig. 2a).
The most variable residues among the different allotypes point inside this groove as well as in the direction of the TCR, conferring unique peptide and TCR-binding selectivity to each MHC molecule. Most of the variable residues are situated in the center of the cleft. On the contrary, the most conserved residues are present at both ends of the groove, where they form walls that determine the length of the peptide— which is generally restricted to 8 to 11 residues with a majority of 9-mers—and the position of its N- and C-termini (Fig. 2b). Noticeably, these conserved residues are holding the peptide by delimiting the ends of the binding groove and by fixing them through a conserved network of hydrogen bonds with the peptide’s N- and C-termini (Fig. 3). Of note, some exceptions to this rule can be found [28, 29]. Characteristically for MHC class I molecules, the polymorphic residues in the peptide-binding groove generate pockets that can strongly accommodate preferred amino acid side chains of the peptide. The nature of the preferred peptide residues depends on the MHC sequence. The interactions between the peptide and these MHC pockets anchor the peptide in the peptide-binding cleft of the MHC. Six major pockets, labeled A to F, can be identified in the MHC groove (Fig. 4) [30]. Two deep pockets are particularly important. The first one is pocket B, which generally—but not always—holds the second residue of the peptide. The second one is pocket F, which accommodates the side chain of the C-terminal residue of the peptide. The peptide residues in these positions, called the anchor residues, play an important role in peptide/MHC interaction. Secondary anchor residues, generally 248 Marta A. S. Perez et al. Fig. 1 Experimental structure of an MHC HLA-A*02:01 in complex with the nonapeptide ALGIGILTV (PDB ID 1JHT [22]). The MHC heavy chain is colored in light brown and the β2m molecule in orange. 
Secondary structure elements are shown in ribbon representation, while the solvent accessible surface is displayed in transparent, colored according to the underlying protein. The peptide is shown in ball and stick representation, colored according to the atom types. (All figures were generated with UCSF Chimera [23]) located in the center of the peptide, bind more weakly to other pockets of the MHC and play a less important role for the recognition strength of a given peptide by a given MHC, but they can substantially influence the conformation of the peptide in the MHC groove. The complementarity between peptide anchor residues and MHC anchoring pockets, together with the more limited role of secondary anchor residues, is critical for the specificity of MHC class I. The pockets generated by the polymorphic residues in the MHC groove complement a small number of specific amino acids at given positions in the binding peptide, thus determining which peptides can bind to a specific MHC allotype. In summary, peptides bound to MHC class I proteins have allele-specific sequence motifs characterized by strong preferences for a few amino acids at given positions of the peptide and large permissiveness in the other positions [31, 32]. Of note, an analysis of experimentally determined MHC class I-presented 9-mers showed that they exhibit a localization bias to helical fragments in the source proteins, which could be explained by the fact that, prior to loading on the HLA complexes, the peptides must be cleaved or trimmed at the N- and C-termini to Structural Prediction of Peptide–MHC Binding Modes 249 Fig. 2 (a) The β-sheet forming the floor of the peptide binding cleft is composed of four β-strands from domain α1 and four β-strands from domain α2. Two helices, one from α1 and the other one from α2, are flanking the peptide binding cleft. (b) Surface of the MHC class I molecule, showing the walls that limit the extension of the peptide within the MHC cleft. 
(Figure made using the experimental structure PDB ID 1JHT [22]) Fig. 3 Hydrogen bond network between the N- and C-termini of the peptide and conserved residues of the MHC. Hydrogen bonds are displayed as green dotted lines. (a) Overall view, (b) zoom on the N-terminus, and (c) zoom on the C-terminus of the peptide. (Figure made using the experimental structure PDB ID 1JHT [22]) 250 Marta A. S. Perez et al. Fig. 4 Major binding pockets of the MHC, labeled from A to F and displayed in different colors. Figure made using the experimental structure PDB ID 1JHT [22] be available, but at the same time must be stable enough to be displayed by MHC. Therefore, the higher resistance of helices to proteolysis could explain the higher frequency of helical regions among HLA-I binding molecules [33]. As mentioned above, peptides bound to MHC class I molecules are generally anchored in the B and F pockets of the MHC, while their N- and C-termini residues are fixed through numerous hydrogen bonds with conserved MHC residues. Most of the conformational diversity of the peptide binding mode thus results from the central residues, which can protrude from the surface and potentially interact with TCRs. Given that the extremities of the peptide are fixed in the MHC, the longer the peptide, the more bulged its central conformation is [34]. Analysis of pMHC class I experimental structures suggest that the peptide backbone shows a conserved conformation despite the diversity of amino acid sequences [35, 36]. The MHC class I groove also shows a limited plasticity upon binding diverse peptides [37]. Fagerberg et al. [38] compared 21 experimental structures of pMHC HLA-A*02:01, spanning different peptides. They found an average root mean square deviation of only 0.4 and 0.9 Å for the backbone and all heavy atoms of the MHC residues, respectively, reflecting the inherent flexibility and crystal packing effect of proteins independently of the nature of the bound peptide. 
A given MHC class I thus shows only limited induced fit upon binding of different peptide amino acid sequences. Two exceptions were, however, observed in the binding groove: the β-sheet residues Arg97 and Tyr116, which showed average global displacements of 1.2 Å and 1.6 Å of their heavy atoms, respectively. Of note, due to their spatial proximity, the conformations of these two residues are correlated and exhibit only two possible combinations depending on the peptide sequence and properties. Importantly, other residues exhibit conformational variations as a function of the crystal structure: Glu19, Glu58, Arg65, Arg75, Lys146, Glu154, and Gln155 (Fig. 5). However, due to their position in the MHC, their flexibility is more likely to impact TCR binding than the peptide epitope position.

Fig. 5 Superimposition of 4 pMHC HLA-A*02:01 experimental 3D structures with resolutions ranging from 1.30 to 1.60 Å: 3D25 [39], 3MRG [40], 5C0G [41], and 2V2X [42]. Some conformationally variable residues are shown in stick representation

Despite these restrictions, the accessible conformational space of pMHC class I remains large due to the length and number of degrees of freedom (DoF) of the peptide, making structural predictions of binding modes particularly challenging, as for any peptide docking. However, the abovementioned conformational rules can be used to design efficient sampling strategies for the docking of peptides in the binding groove of an MHC molecule.

2.2 The pMHC Class II Structure

MHC class II molecules are also heterodimers consisting of two noncovalently associated polymorphic subunits: the 34 kDa α chain and the 29 kDa β chain (Fig. 6). Both α and β chains are anchored to the membrane and contain two extracellular domains (α1 and α2, as well as β1 and β2, respectively) and an intracellular domain. The structure of the MHC class II molecules is very similar to that of the MHC class I molecules [3, 36, 41, 42], with the α1 and β1 domains forming a peptide-binding groove following the model of the α1 and α2 domains of MHC class I (Fig. 6b). The other domains, α2 and β2, play a structural role similar to that of α3 and β2m of MHC class I, respectively (Fig. 6a). The CD4 co-receptor of T-cells, when present, interacts with both the β2 and α2 domains of MHC class II [44]. While the α1 and β1 domains of MHC class II, which form the peptide binding groove, are highly polymorphic, the α2 and β2 segments are highly conserved among the allotypes of a particular class II gene.

Fig. 6 MHC class II, HLA-DRA1 and HLA-DRB1, in complex with the 15-mer alpha-enolase peptide 326–340; PDB ID 5NI9 [43]. (a) Extracellular domains α1 and α2 (in rosy brown), as well as β1 and β2 (in brown). Secondary structure elements are shown in ribbon representation, while the solvent accessible surface is shown as a transparent surface, colored according to the underlying protein. The alpha-enolase peptide is displayed in ball and stick representation, colored according to the atom types. (b) Surface of the MHC class II molecule showing the absence of walls delimiting the peptide binding groove, which allows the accommodation of a large peptide extending from the N- and C-termini

The peptide-binding site of MHC II molecules is formed by the N-terminal α1 and β1 domains and, similarly to MHC class I, is composed of an 8-stranded β-sheet defining the floor of the binding site and two helices creating its borders. The most striking difference between the MHC class I and II binding sites is the presence of binding groove walls in MHC class I, which limit the size of the peptides that MHC class I can bind, and which are absent in MHC class II. Consequently, MHC class II can accommodate much longer peptides, composed of 13 to 25 amino acids [45], which extend at both the N- and the C-termini compared to class I binding peptides [34, 46].
Because of the larger number of degrees of freedom of the peptides in the binding groove, docking into MHC class II is generally much more challenging than docking into MHC class I [17]. In contrast to the MHC class I groove, which has only two major anchoring pockets, the groove of MHC class II can present three or four major anchoring pockets to accommodate the primary peptide anchor residues.

3 Databases and Resources

The efficiency of a computational docking method is frequently assessed based on its ability to reproduce the native bound conformation as determined experimentally with X-ray crystallography. Therefore, structural databases of pMHC complexes hold important information to benchmark docking methods. Databases containing sequence-based information on pMHC binding are also extremely relevant as (i) docking methods are able to predict a pMHC structure by using only the peptide sequence as an input and (ii) the output of a docking method suitable to model thousands of pMHC complexes can be used as a basis to discriminate between peptide binders and nonbinders, for example, through a scoring function. In this context, sequence-based tools to qualitatively or quantitatively predict pMHC binding affinities can be used as a source of comparison to assess a docking tool.

Beyond pMHC binding, the most relevant endpoint for many applications is the ability of antigenic peptides to elicit a T-cell response. To this end, structural data on TCRpMHC complexes and other sources of information on TCR and pMHC interactions provide essential input. The points outlined above illustrate the key importance of the availability of information on pMHC and TCRpMHC to guide further research in the field.
Therefore, we describe below available resources such as datasets, databases, and software tools, giving an overview of relevant material to benchmark or develop new docking software and pMHC prediction methods (Table 1).

3.1 Resources Related to pMHC (and TCRpMHC) Structures

The Protein Data Bank (PDB [47, 48]; https://www.ebi.ac.uk/pdbe [70–73]) is the worldwide archive of structural data of biological macromolecules. Established in 1971, it now contains more than 1092 structures of pMHC class I and more than 212 structures of pMHC class II (as of 25.11.2020). Approximately 650 of these structures are peptide-HLA complexes, and more than one-third of these relate to the same MHC allotype (HLA-A*02:01). PDB data are freely and publicly available for download without restrictions. Each entry contains summary information about the structure and experiment, atomic coordinates, and in most cases a reference to a corresponding scientific publication. Individually and in bulk, PDB structures can be downloaded and/or analyzed and visualized online, for example, using tools at PDBe.
However, creating datasets from the PDB to benchmark docking tools requires data curation and analysis, for instance, to remove low-resolution structures, mutated proteins, and proteins with missing residues, or to restrict the analysis to human molecules or to a particular allotype, to name just a few. To address this need, several databases provide curated and analyzed sets of pMHC structures (and TCRpMHC structures) from the PDB.

Table 1 Resources (databases, datasets, and some software tools) available for an efficient benchmark/development of pMHC docking tools

Resources for pMHC (and TCR-pMHC) structures (databases)
PDB [47, 48]: Single worldwide archive of structural data of biological macromolecules
CrossTope [49]: Curated repository of 3D structures of peptide–MHC class I
MPID-T [50]: Manually curated repository of pMHC and TCR-p-MHC
MPID-T2 [51]: Repository of pMHC and TCR-p-MHC with emphasis on structural characterization
IMGT/3Dstructure-DB [52]: Repository of 3D structures annotated according to the IMGT-ONTOLOGY
TCR3D [53]: TCRpMHC structures reporting main axes and angles in the complex
STCRDab [54]: Annotated TCRpMHC from PDB
ATLAS [55]: Wild-type and mutant pMHC together with measured affinities

Resources for pMHC sequences
Calis et al. [56] (dataset): Immunogenic and nonimmunogenic peptides
Chowell et al. [57] (dataset): Immunogenic and nonimmunogenic peptides
IEDB [58] (database): A compendium of T- and B-cell epitope assays
IEDB-analysis resource [59] (tools): Provides a substantial set of updated and novel features for epitope prediction and analysis
NetMHC [60, 61], NetMHCpan [62], NetMHCcons [63], MHCflurry [64], MHCSeqNet [65], MHCAttnNet [66] (tools): Examples of predictive tools for peptide–MHC qualitative and quantitative binding affinities

Resources for MHC sequences (databases)
IPD-MHC [67]: Centralized repository for curated MHC sequences of different species
IPD-IMGT/HLA [18]: Specialist database for sequences of the human MHC

Other resources with information on pMHC and TCR (databases)
VDJdb [68]: TCRs with known antigen specificity
10x Genomics: Linking highly multiplexed antigen recognition to immune repertoire and phenotype
McPAS-TCR [69]: TCRs with known antigen, pathogen, and pathology association

CrossTope [49] (http://crosstope.com) is a highly curated repository of 3D structures of peptide-MHC class I. The complexes hosted in this database were obtained from the PDB and from in silico modeling. The database contains 182 nonredundant complexes from two human and two murine alleles. From the CrossTope web server, the user can download pMHC class I coordinate files as well as topological and charge distribution map images of their T-cell receptor-interacting surface.

The peptide/MHC interaction database version T, MPID-T [50], is a manually curated database containing experimentally determined structures of 187 pMHC complexes and 16 TCRpMHC complexes taken from the PDB. Each structure is manually verified, classified, and analyzed for intermolecular interactions (a) between the MHC and its corresponding bound peptide and (b) between the TCR and its bound pMHC complex, when TCR structural information is available. The MPID-T database retrieval system has precomputed interaction parameters that include solvent accessibility, hydrogen bonds, gap volume, and gap index.

The MHC–peptide interaction database-T version 2, MPID-T2 [51], contains pMHC and TCRpMHC complexes with emphasis on structural characterization. As of November 2020, MPID-T2 contains 415 entries from five MHC sources (282 human, 127 murine, 3 rat, 2 chicken, and 1 monkey), spanning 56 alleles. MPID-T2 covers 353 pMHC and 62 TCRpMHC structures. Overall, 327 entries are nonredundant (279 MHC class I and 48 MHC class II). Nonclassical structures and complexes with nonstandard residues are also included in this version.
Of note, and as far as we know, CrossTope, MPID-T, and MPID-T2 only include structural data available before their publication date.

IMGT/3Dstructure-DB [52] contains information on the sequences and 3D structures of TCR, pMHC, and related proteins of the immune system from human and other vertebrate species. Experimental 3D data are taken from the PDB and expertly annotated information is provided according to the IMGT criteria, using IMGT/DomainGapAlign, and based on the IMGT-ONTOLOGY concepts and axioms. IMGT/3Dstructure-DB provides standardized identification (IMGT keywords), a standardized nomenclature (IMGT gene and allele names), a standardized description (IMGT labels), and a standardized numbering (IMGT unique numbering).

The T-cell receptor structural repertoire database, TCR3D [53], is a comprehensive, curated collection of T-cell receptor structures from the PDB, analyzed for structure, sequence, and antigen recognition, as well as TCR germline gene sequences from http://www.imgt.org and TCR sequencing data from various studies. Users can interactively view TCR structures, search sequences of interest against known structures and sequences, and download curated datasets of structurally characterized TCRs. This database is updated on a weekly basis and can serve as a centralized resource for the community studying T-cell receptors and their recognition. Users can download a curated dataset of more than 167 nonredundant TCRpMHC class I and more than 58 nonredundant TCRpMHC class II complexes. Through the database, users can also access updated information regarding the PDB code, description, release date, and resolution of MHC class I and II structures.

The Structural T-cell Receptor Database, STCRDab [54], is an online resource that automatically collects and curates TCR structural data from the Protein Data Bank.
For each entry, the database provides annotations, such as the α/β or γ/δ chain pairings, MHC details, and, when available, antigen binding affinities. In addition, the orientation between the variable domains and the canonical forms of the complementarity-determining region loops are also provided. Users can select, view, and download individual or bulk sets of structures based on these criteria. When available, STCRDab also finds antibody structures that are similar to TCRs, helping users to explore the relationship between TCRs and antibodies. STCRDab is linked with TCRBuilder [74], a structural TCR modeling tool that returns a model or an ensemble of models covering the potential conformations of the binding site from a paired αβTCR sequence.

The Altered TCR Ligand Affinities and Structures database, ATLAS [55], is a manually curated repository containing the binding affinities for wild-type and mutant TCRs and their peptide-MHC antigens. The database links experimentally measured binding affinities with the corresponding 3D structures for TCR-pMHC complexes. ATLAS contains a dataset of TCRpMHC structures with the following curations: renaming of chains, truncation of chains to the binding interface, and removal of water molecules. For ATLAS entries lacking full experimental 3D structures, models were generated from template structures using the Rosetta protein modeling suite [75]. The latest update of the website was done in 2017 [55].

Of note, TCRpMHC structures cannot be directly used for comparison of the docking poses of the pMHC alone, as the latter can adopt different conformations depending on whether or not it is bound to the TCR [76, 77].
3.2 Resources Related to pMHC Sequences

The Immune Epitope Database [58] (IEDB) is an up-to-date resource that captures experiments which identify and characterize epitopes and epitope-specific immune receptors along with various other details such as host organism, immune exposures, and induced immune responses. Note that, while most of the components of IEDB can be found separately in other resources, no other database contains them all. A companion site, IEDB-Analysis Resource (IEDB-AR), provides a substantial set of updated and novel features for epitope prediction and analysis [59]. New epitope prediction and analysis tools are regularly added to the IEDB-AR with features useful to advance epitope-based therapeutics and vaccine development. IEDB-AR includes, among others, a tool to predict peptides that are naturally processed by the MHC class I pathway and bind to MHC class I molecules, MHCI-NP [78], and a tool to predict naturally processed MHC class II ligands, MHCII-NP [79]. The tools available in IEDB-AR are summarized in Dhanda S.K. et al. [59].

The IEDB represents a huge body of knowledge regarding which peptide epitopes are presented by which MHC molecules. Peptidomes of various MHC molecules can be utilized to build highly accurate predictors of MHC binding. Predictive tools for qualitative and quantitative pMHC binding affinities include NetMHC [60, 61], NetMHCpan [62], NetMHCcons [63], MHCflurry [64], the IEDB tools [59], MHCSeqNet [65], and MHCAttnNet [66]. Most of these tools use machine-learning-based techniques, require a large amount of training data, and can be rather weak predictors for MHC alleles for which the data is scarce. NetMHCpan [62], however, is a pan-specific artificial neural network method trained on binding affinity and eluted ligand data that leverages the information from both data types and seeks to alleviate the problem of data scarcity.
NetMHCpan, MHCSeqNet [65], and the recently published MHCAttnNet [66] aim to predict MHC–peptide binding for unseen alleles. Docking methods for screening MHC binding peptides can be tested using IEDB, and their efficiency can be compared with one or several of the abovementioned prediction tools.

In the context of precision medicine, searching for epitopes that are not presented by a patient’s MHCs, even if they are related to pathogens of interest, makes little sense as it is unlikely that they can elicit a strong immune response. In such cases, smaller datasets of allele-specific or disease-specific peptide-MHCs can also be relevant [56, 57].

We would also like to mention databases that offer additional information, such as the IPD-MHC [67] database, which provides a centralized repository for curated MHC sequences from a number of different species, and the IPD-IMGT/HLA database, which provides a specialized database for sequences of the human MHC [18].

4 Computational Approaches for Peptide–MHC Binding Mode Prediction

4.1 Docking

Ligand-protein docking approaches aim to computationally predict the most probable position, orientation, and conformation of a small drug-like molecule at the surface of a targeted protein [32–38]. Although intensively studied for decades, ligand-protein docking remains largely an unsolved problem [80–83]. Generally speaking, docking software can be decomposed into two components: a sampling algorithm in charge of generating possible geometries of the ligand at the protein surface (i.e., the binding modes) and a scoring function whose purpose is to rank the binding modes according to their probability to correspond to the experimental true binding mode (also called the native binding mode).
Since the native binding mode corresponds in principle to the one with the lowest binding free energy for a given ligand-protein pair, the scoring function of a docking software is often trained to achieve two objectives: selecting the native binding mode among all possible binding modes and estimating the binding free energy of the ligand for the target. It thus allows the comparison of different ligands in terms of affinity and opens the door to structure-based virtual screening and drug design.

Two different approaches can be used to assess a docking algorithm under development and to benchmark published docking software, namely redocking and cross-docking. Redocking consists of docking a ligand into the protein 3D structure that was experimentally determined in complex with that same ligand. In contrast, in cross-docking experiments, the ligand is docked into a protein conformation that was experimentally determined in complex with another ligand or in its apo form. The first exercise is easier since the protein conformation already displays the induced fit necessary to bind the ligand of interest. Although this exercise is very different from the typical use of a docking software, it allows estimating different factors important for its efficiency, such as its ability to correctly sample the conformational space of the ligand and to find binding modes close to the native one (knowing that the protein is in its optimal conformation). Cross-docking is more similar to the typical usage of a docking software, where the induced fit of the protein corresponding to a given ligand is unknown. Successful cross-docking might necessitate the sampling of the conformational space of the protein in addition to that of the ligand. As such, results of cross-docking benchmarks are generally considered more relevant to assess the overall efficiency of docking software.
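The difference between the two benchmark designs can be made concrete with a minimal sketch. The `Complex` record, the placeholder identifiers, and the rule used to pair ligands with receptor conformations are illustrative assumptions, not part of any published benchmark protocol. Cross-docking into an apo structure is not covered here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Complex:
    """One experimental ligand-protein structure (all fields are placeholders)."""
    structure_id: str  # identifier of the experimental structure
    receptor: str      # name of the protein, e.g., an MHC allotype
    ligand: str        # name or sequence of the bound ligand


def redocking_pairs(complexes):
    """Dock each ligand back into the receptor conformation it was solved with."""
    return [(c.ligand, c.structure_id) for c in complexes]


def cross_docking_pairs(complexes):
    """Dock each ligand into conformations of the same receptor
    that were solved in complex with a different ligand."""
    return [
        (a.ligand, b.structure_id)
        for a, b in product(complexes, repeat=2)
        if a.receptor == b.receptor and a.ligand != b.ligand
    ]


# Hypothetical dataset: two structures of receptor X, one of receptor Y.
dataset = [
    Complex("cplx_A", "receptor_X", "ligand_1"),
    Complex("cplx_B", "receptor_X", "ligand_2"),
    Complex("cplx_C", "receptor_Y", "ligand_3"),
]

redock = redocking_pairs(dataset)
cross = cross_docking_pairs(dataset)
```

In this toy dataset, `ligand_3` has no cross-docking partner because only one conformation of its receptor is available, which illustrates why cross-docking sets are usually smaller than redocking sets built from the same structures.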
The ability of a docking algorithm to predict the binding modes of a set of ligand–protein complexes for which the native binding mode is known thanks to available experimental structures can be quantified by several metrics. The most employed one remains the root mean square deviation (RMSD) of heavy atom positions between the binding mode calculated by the docking algorithm and the native binding mode. The RMSD, which is related to a distance between two ligand positions, is generally given in Å. To allow easy comparison of docking tools, it is generally assumed that a docking run is successful if this RMSD is lower than 2 Å [81].

The efficiency of ligand–protein docking software decreases with increasing number of degrees of freedom of the ligand, especially when they exceed about 10, because of the complexity and the size of the conformational space to explore [84, 85]. Due to the size of the peptides that bind to MHC grooves, even in the case of class I MHC (8 to 11 residues), typical small molecule docking codes are generally inefficient at docking such ligands. However, as we will see below, peptide-MHC docking algorithms are comparable to small-molecule docking programs in many respects, including some of the sampling engines and scoring functions.

The knowledge of the nature of interactions between a peptide and an MHC protein (see section “The structure of the pMHC complex”), notably for MHC class I, can be used to facilitate the docking of peptides, despite the fact that their number of degrees of freedom makes them intractable by standard docking approaches. Following the nomenclature proposed by Antunes et al. [17], we can distinguish approaches that rely on a constrained backbone, on constrained termini, or on incremental peptide reconstruction.
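As a side note on the RMSD criterion introduced earlier in this section, the following minimal sketch shows how the metric and the resulting success rate could be computed. It assumes that the two poses list the same atoms in the same order and are expressed in the same receptor frame, so that no superposition of the ligand is performed. The function names are illustrative, and the 2 Å default is the conventional threshold quoted above.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD (in the units of the input, typically Å) between two poses.

    Each pose is a sequence of (x, y, z) tuples for matching atoms.
    Assumption: both poses share the same receptor frame, so the ligand
    is not refitted before the deviation is computed.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("poses must contain the same, non-zero number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def success_rate(rmsd_values, threshold=2.0):
    """Fraction of docking runs whose RMSD is strictly below the threshold."""
    if not rmsd_values:
        raise ValueError("at least one RMSD value is required")
    return sum(value < threshold for value in rmsd_values) / len(rmsd_values)


# Toy example: a two-atom pose shifted by 1 Å along z.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
shift_rmsd = rmsd(native, docked)

# Toy benchmark of four docking runs (values in Å).
rate = success_rate([0.5, 1.9, 2.0, 3.1])
```

Restricting the coordinate lists to a subset of atoms before calling `rmsd` gives subset-specific variants of the same metric.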
All approaches benefit from the fact that the MHC molecules exhibit little induced fit as a function of the peptide nature [38] and that the overall position and N/C-terminus orientation of the peptide in the MHC groove is well known.

Constrained backbone approaches employ sampling strategies based on the fact that peptides with the same number of residues and binding to the same MHC allotype show a limited number of backbone conformations in experimental structures [43, 47]. These approaches generally start by constructing conformations of the peptide to dock, bound or unbound to the MHC, based on experimentally determined backbone conformations of other peptides of the same size.

Constrained termini approaches use the preserved networks of hydrogen bonds that exist between conserved residues of the MHC and the N- and C-termini of the peptide to restrain the corresponding peptide atoms in these positions during the docking. In these conditions, docking a peptide into an MHC boils down to a loop closure problem.

Approaches based on incremental peptide reconstruction try to limit the effect of the numerous internal degrees of freedom in the peptide by reconstructing it within the MHC groove in consecutive steps, in a way that only a small number of these degrees of freedom are considered at once.

In the following paragraphs, we describe some of the main peptide–MHC docking approaches belonging to these different categories, focusing on docking to MHC class I (Table 2).

Table 2 Summary of the docking approaches reviewed in this chapter

pDock [86] (constrained backbone)
Sampling algorithm: Single docking and refinement using ICM and a Monte Carlo algorithm
Scoring approach: Internal energy of the peptide and peptide–MHC interaction energy, plus a solvation energy term. ECEPP/3 force field
Availability: N/A
Predictive ability: Tested on a nonredundant set of 186 pMHC complexes (149 of MHC class I and 37 of MHC class II). A predicted binding mode within 1.0 Å Cα RMSD for 83% of the class I and 95% of the class II complexes. Average Cα RMSD about 0.6 Å over the entire dataset

DockTope [87] (constrained backbone)
Sampling algorithm: Generation of the peptide conformation based on a preselected template (one per MHC allotype), followed by two Autodock Vina docking rounds (with rigid MHC and rigid peptide backbone) separated by an energy minimization performed with GROMACS (flexible MHC and peptide)
Scoring approach: Autodock Vina score used to filter the calculated poses. Best result selected as the one closest in RMSD to all other calculated binding modes
Availability: Freely available as a web server [86]
Predictive ability: Tested on 135 pMHC complexes of class I, covering the 5 MHC allotype and peptide-length combinations. Average RMSD between 0.4 and 1.1 Å for the Cα atoms as a function of the MHC allotypes (from 1.7 to 2.5 Å for all-atom RMSD), with an average of 0.9 Å over all allotypes (2.0 Å for all-atom RMSD)

FlexPepDock [88] (constrained backbone)
Sampling algorithm: FlexPepDock refinement protocol applied to a coarse-grained binding mode generated by sequence threading on an experimental backbone conformation and positioning of the peptide in the MHC groove respecting the anchor residues
Availability: Freely available as a web server (http://piperfpd.furmanlab.cs.huji.ac.il) and as a standalone program (https://www.rosettacommons.org)
Predictive ability: Tested on 30 experimental structures of pMHC class I complexes. When starting the docking from a peptide template bound to the same MHC allotype, 84% success rate at 1 Å backbone RMSD and 52% at 2 Å all-atom RMSD among the top-five best binding modes. When starting from a peptide template bound to a different MHC allotype, 60% success rate at 1 Å backbone RMSD and 55% success rate at 2 Å all-atom RMSD among the top-five best binding modes

GradDock [89] (constrained termini)
Sampling algorithm: Generation of conformations for the unbound peptide based on constrained termini and loop closure algorithm, followed by binding simulation using a steered insertion of the peptide into the MHC groove
Scoring approach: Reparameterized Rosetta score
Availability: Freely available for academic purposes. Program available at [49]
Predictive ability: Tested by self-redocking on 107 nonredundant pMHC class I, covering 82 class I MHCs and 8- to 10-mer peptides, as well as on cross-docking of 70 complexes. RMSDs around 1.2 Å and 2.5 Å for backbone and all atoms, respectively, in both self- and cross-docking tests

Park et al. [90] (constrained termini)
Sampling algorithm: Homology modeling, followed by all-atom MD simulated annealing and MD simulation
Scoring approach: MODELLER score, or abundance of the binding mode after simulated annealing, or detection of conformational transition
Availability: N/A
Predictive ability: Tested on 17 HLA-A*02:01 pMHC complexes. Average all-atom RMSD of 1.6 Å after simulated annealing

Yanover et al. [91] (constrained termini)
Sampling algorithm: Generation of conformations for the peptide backbone in the MHC groove based on constrained anchor residues and loop closure algorithm, followed by Monte Carlo refinement
Scoring approach: Modified Rosetta all-atom scoring function
Availability: N/A
Predictive ability: Tested on 29 MHC class I (11 HLA-A and 18 HLA-B). Docking and sequence-optimizing of thousands of peptides provided computed PFMs very close to the experimental PFMs

APE-Gen [92] (constrained termini)
Sampling algorithm: Generation of conformations for the peptide backbone in the MHC groove based on constrained anchor residues and loop closure algorithm, followed by energy minimization
Scoring approach: SMINA force field
Availability: APE-Gen is open-source and freely available at https://github.com/KavrakiLab/APE-Gen. It is also available within the HLA-Arena platform, which is accessible here: [50]
Predictive ability: Tested on 535 pMHC complexes of class I, with 8- to 11-mer peptides. Average RMSD of 0.9 Å for the Cα atoms and 2.0 Å for all atoms between the native binding mode and the closest sampled peptide conformation

Bordner et al. [93] (constrained termini)
Sampling algorithm: ICM + Monte Carlo, with restraints on the atoms of the N- and C-termini
Scoring approach: ECEPP/3 force field
Availability: ICM is available under paid license at [51]
Predictive ability: Tested by cross-docking of 14 peptides into HLA-A*02:01 and 9 peptides into H-2-Kb, as well as docking peptides into homology models for five different MHC allotypes. Average backbone RMSD of 1.1 Å and 0.7 Å for cross-docking on HLA-A*02:01 and H-2-Kb, respectively. Average backbone RMSD of 1.1 Å for docking on homology models

Fagerberg et al. [38] (constrained termini)
Sampling algorithm: All-atom MD simulated annealing
Scoring approach: CHARMM force field including the GB-MV2 implicit solvent model
Availability: N/A
Predictive ability: Tested by the redocking of 14 HLA-A*02:01 pMHC and 27 non-HLA-A*02:01 pMHC. For HLA-A*02:01 and selection of output by cluster size, success rate of 86% for backbone RMSD lower than 1.0 Å and 71% for heavy atom RMSD lower than 1.5 Å. For non-HLA-A*02:01, success rates of 70 and 59% for backbone and heavy atom RMSD, respectively. For selection by the mean effective energy, success rates of 100% and 93% for HLA-A*02:01 pMHC, and of 74 and 67% for non-HLA-A*02:01 pMHC

DINC [94, 95] and DINC 2.0 [96] (incremental peptide reconstruction)
Sampling algorithm: Reconstruction of the ligand by incrementally docking larger and larger overlapping peptide fragments
Scoring approach: Autodock 4 or Autodock Vina scoring functions
Availability: Freely available as a web server [52]
Predictive ability: Tested via the redocking of 25 pMHC complexes, spanning 10 different MHC class I and peptides ranging from 8- to 10-mers. Averaged Cα and all-atom RMSD of 1.0 and 1.9 Å, respectively

DynaPred [97] (incremental peptide reconstruction)
Sampling algorithm: Reconstruction of the ligand by reconnection of amino acid conformations obtained by MD simulations and energy minimization
Scoring approach: OPLSAA/L force field
Availability: N/A
Predictive ability: Tested by cross-docking of 20 complexes of 9-mer peptides in MHC HLA-A*02:01. Average backbone RMSD of 1.5 Å

Due to the difficulty of docking long peptides, and to the particular flexibility of the side chains, the success rate of peptide–MHC docking software is generally quantified using not only the RMSD calculated on all heavy atoms of the peptide but also the RMSD calculated only on the backbone atoms or even on the Cα atoms. The backbone RMSD makes it possible to estimate whether the docking software was successful in reproducing the backbone conformation of the native binding mode, even though the positioning of the side chains may not be correct.

4.2 Constrained Backbone

Docking of peptides to MHC class I or class II using pDock [86] requires some preparation steps, followed by a single-step docking and refinement based on the Internal Coordinate Mechanics (ICM) algorithm [98, 99]. First, the peptide and the MHC are prepared for docking by adding missing residues, side chains, and polar hydrogen atoms.
A docking grid is positioned to ensure that the peptide ligand will be situated in the vicinity of the MHC binding site. The authors claim that high-quality homology models of the MHC can be used with pDock, although the assessment was only performed through redocking to existing X-ray structures. The peptide is positioned based on existing X-ray structures. Next, the ICM docking algorithm is used to perform a flexible docking of the peptide into the MHC binding groove. During this docking, torsion angle values of the ligand side chains are sampled using a Monte Carlo procedure. The energy function used during this procedure is the sum of the internal energy of the peptide and the interaction energy between the peptide and the MHC, including the internal van der Waals interaction, the hydrophobic potential between the peptide and the MHC, the hydrogen bonding energy, the configurational/conformational entropy, and a surface-based solvation energy, based on the ECEPP/3 force field [100]. Loose restraints are imposed on the position of the peptide to keep it close to the starting conformation during the docking. Finally, all peptide and MHC residues (in the vicinity of the peptide) are refined to eliminate or minimize peptide–MHC atom clashes, again using ICM and a Monte Carlo procedure.

pDock was tested on a nonredundant set of 186 pMHC complexes (149 MHC class I and 37 MHC class II) with 3D structures determined by X-ray crystallography. A predicted binding mode within 1.0 Å RMSD from the native binding mode, calculated on the Cα atoms of the nonameric core of the peptide, was obtained for 83% of the class I complexes and 95% of the class II complexes. The average Cα RMSD between the redocked and experimental poses was about 0.6 Å over the entire dataset.

DockTope [87] is based on the so-called D1-EM-D2 approach [36] (see below for the definition of D1, EM, and D2) for the modeling of pMHC class I previously published by Antunes et al.
This technique divides peptide–MHC docking into four steps.
First, capitalizing on the known conservation of the backbone conformation of peptides binding to the same MHC [36], the input peptide sequence is transformed into a three-dimensional structure. This is performed by threading its sequence on the constrained backbone of a peptide-epitope 3D pattern preselected by the authors. DockTope provides five such patterns (PDB IDs 1LK2 [101], 2V2W [42], 2A83 [102], 1WBX [103], and 1WBY [103]) covering four MHC allotypes, thus allowing the docking of 8-mers into H-2-Kb, 9-mers into HLA-A*02:01, HLA-B*27:05 and H-2-Db, and 10-mers into H-2-Db. This threading is followed by an energy minimization to mildly relax the conformation of the peptide, following a protocol identical to the one described in the third step, below.
Second, starting from the 3D conformation generated in the first step, an initial molecular docking (D1) is performed using the Autodock Vina program [104]. Before docking, the system is prepared using Autodock Tools [105]. This preparation consists of adding all hydrogens to the MHC macromolecule to calculate the Gasteiger charges of each protein atom, before removing the nonpolar hydrogens. The peptide ligand is set up using the same protocol. The grid box defining the Vina search space is configured to allow the sampling of the peptide poses inside the MHC cleft. Of note, the ϕ and ψ backbone torsional angles of the peptide are excluded from the degrees of freedom, such that only the peptide side chain conformations are optimized during the docking. Twenty independent docking runs are performed, each one providing a best-predicted binding mode with a corresponding calculated binding energy.
The best-predicted binding mode among these 20 runs is obtained by (a) removing all binding modes with binding energies lower than the average binding energy of the 20 calculated modes and (b) selecting the binding mode with the lowest average RMSD to all other remaining binding modes.
Third, starting from the calculated binding mode generated by D1, an energy minimization (EM) is performed with the steepest descent algorithm, using the GROMACS package [106] and the GROMOS 53A5 force field [107], to correct possible steric clashes between the docked peptide and the MHC. Interestingly, the pMHC system is embedded in a box filled with explicit water molecules and with a 0.15 mol/l NaCl concentration during the minimization to take the solvent effect into account.
Fourth, a second docking (D2) is performed in order to refine the structure, because the MHC side chain conformations have been modified during the energy minimization step in the presence of the peptide. This docking step follows the same procedure as D1 for the sampling and scoring as well as for the selection of the final binding mode.
DockTope was tested on 135 pMHC class I complexes, covering the five MHC allotypes and the corresponding peptide lengths mentioned above. Given that the MHC structures used for the docking were systematically taken from the five preselected PDB files listed above, this assessment was de facto a cross-docking experiment. The averaged RMSD between the predicted and native binding modes ranged from 0.4 to 1.1 Å for the Cα atoms depending on the MHC allotype (from 1.7 to 2.5 Å for all-atom RMSD), with an average of 0.9 Å over all allotypes (2.0 Å for all-atom RMSD). Of note, DockTope is freely available as a web server.
Liu et al. tested the Rosetta FlexPepDock [88] refinement protocol in the context of peptide–MHC docking [108].
This protocol can be used when an approximate model of the peptide–protein interaction is already available. It uses a Monte Carlo energy minimization to iteratively optimize the peptide backbone and its rigid-body orientation while sampling the side chain flexibility of the peptide and the protein receptor. Applying a refinement protocol requires first generating coarse-grained models of the peptide binding mode in the MHC groove. The authors used two approaches for this: threading the target sequence in experimentally determined backbone positions of peptides bound either to the same MHC allotype or to different MHC allotypes. The conformers obtained this way were then oriented manually into the peptide binding groove so as to position the anchor residues (positions 2 and 9 of 9-mers) into the respective B and F MHC pockets (Fig. 7). The resulting pMHC coarse-grained binding modes were then used as input for the FlexPepDock refinement.

Fig. 7 Anchor residues Leu2 and Val9 of the ALGIGILTV peptide in the HLA-A*02:01 peptide groove (PDB ID 1JHT [22]). Peptide Leu2 is situated in pocket B of MHC, while the Val9 residue is in pocket F. Their surface is displayed and colored in magenta. The position of the MHC N- and C-termini walls is also indicated

This double protocol was used to test the influence of the origin of the backbone conformation on the quality of the prediction. For each peptide, 1000 independent FlexPepDock refinement calculations were performed to efficiently sample the conformational space. The resulting binding modes were ranked based on the Rosetta full-atom energy function [109]. The approach was tested on 30 experimental structures of pMHC class I complexes. When starting the docking from a peptide template bound to the same MHC allotype, the authors found that 84% of the complexes were docked with a backbone RMSD from the native binding mode lower than 1 Å if they considered the five best binding modes.
In those conditions, the success rate at 2 Å all-atom RMSD among the five best binding modes was 52%. When starting from a peptide template bound to a different MHC allotype, the success rate at 1 Å backbone RMSD among the five best binding modes decreased to 60%. However, in this case, the success rate at 2 Å all-atom RMSD among the five best binding modes was essentially unchanged, at 55%. Of note, FlexPepDock is freely accessible as a web server.

4.3 Constrained Termini
In GradDock, Kyeong et al. [89] decompose the peptide–MHC docking procedure into three main steps.
First, three-dimensional conformations are generated for the unbound peptide. Exploiting the high conservation of the N- and C-termini conformation of the peptides presented by MHC thanks to sequence-independent hydrogen bonds (Fig. 3), GradDock generates the unbound peptide, with only backbone atoms, by growing and joining half-peptides from the two fixed termini taken from a selected experimental pMHC structure (PDB ID 1DUZ [110]). These half-peptides are produced using random ϕ and ψ angle values. After removing the ϕ/ψ combinations with the lowest probabilities, the remaining half-peptides are randomly paired and assembled using the cyclic coordinate descent algorithm originally developed for loop closure [111].
Second, the unbound peptide conformations are inserted into the MHC-I molecule through a so-called binding simulation: starting 20 Å above the MHC-I groove, the unbound peptide is pushed into the latter following the binding axis. During this steered insertion, the peptide is moved by the gradient descent algorithm, where the gradient is iteratively calculated and added to the physical forces. The GROMOS 54A7 force field parameters are used to calculate the nonbonded interactions [112], while the bond lengths, angles, and proper and improper dihedral angles are maintained by harmonic restraints.
At the bound position, side chains of the terminal residues are optimized using a Monte Carlo approach applied to the torsion angles. Then, the peptide is submitted to a gradient descent energy minimization before applying a topological correction that refines the position of the backbone atoms of the bound peptides using Ramachandran probability maps [113]. The peptide is ultimately fully hydrogenated using REDUCE [114]. During this binding simulation, the structure of the MHC molecule is held fixed and treated as an AutoDock-style grid [105].
Third, the resulting candidate poses from step two are ranked by the GradDock algorithm to provide the final prediction. Of note, Kyeong et al. reparameterized the Rosetta scoring terms using a linear programming approach. All Rosetta score terms were calculated for the native and calculated binding modes of several pMHC systems, before being normalized. Each of these energy terms was attributed a weight. Following the hypothesis that the crystal structure is in the minimum energy state, the energy of each calculated binding mode of a given peptide constitutes an energy inequality against the corresponding native binding mode. The authors determined the optimal weights of the Rosetta score terms by solving the resulting system of linear inequalities. The new ranking functions were validated by cross-validation using self-docking as well as cross-docking results.
GradDock was tested by redocking using a set of 107 nonredundant pMHC class I systems, covering 82 class I MHCs and 8- to 10-mer peptides, and further challenged on cross-docking of 70 complexes. GradDock was found to provide robust cross-docking predictions, with a predictive ability similar to that of redocking, i.e., RMSDs around 1.2 Å and 2.5 Å for backbone and all atoms, respectively.
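The energy inequalities behind this reparameterization can be illustrated with a minimal sketch. It is not the GradDock implementation: the authors solve a linear program over normalized Rosetta score terms, whereas this toy example swaps in a simple perceptron-style update, and the two score terms, their values, and all function names are hypothetical. It only shows how each calculated binding mode (decoy) contributes one inequality, E(decoy) > E(native), that the weights must satisfy.

```python
def energy(weights, terms):
    """Weighted sum of score terms for one binding mode."""
    return sum(w * t for w, t in zip(weights, terms))


def violations(weights, native, decoys, margin=0.0):
    """Decoys whose energy does not exceed the native energy by more than margin."""
    e_native = energy(weights, native)
    return [d for d in decoys if energy(weights, d) - e_native <= margin]


def fit_weights(systems, n_terms, step=0.1, margin=1.0, max_epochs=1000):
    """Perceptron-style stand-in (an assumption) for the linear program.

    systems is a list of (native_terms, [decoy_terms, ...]) pairs.
    """
    weights = [0.0] * n_terms
    for _ in range(max_epochs):
        changed = False
        for native, decoys in systems:
            for decoy in decoys:
                # Inequality violated: the decoy is not clearly above the native.
                if energy(weights, decoy) - energy(weights, native) < margin:
                    weights = [w + step * (d - n)
                               for w, d, n in zip(weights, decoy, native)]
                    changed = True
        if not changed:
            break
    return weights


# Hypothetical normalized score terms (two terms per binding mode).
native = (-5.0, 2.0)
decoys = [(-3.0, 2.5), (-4.0, 4.0), (-6.0, 5.0)]

naive_weights = [1.0, 0.0]   # first term only: ranks the third decoy below the native
fitted_weights = fit_weights([(native, decoys)], n_terms=2)
```

With the first term alone, one decoy scores better than the native pose; after fitting, every decoy lies above it. A linear program additionally lets one optimize an objective (for instance, the margin) over all systems at once, which this stand-in does not attempt.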
Of note, although GradDock provides calculated binding modes with an averaged RMSD to the native binding modes similar to the standard Rosetta score-based approach [91], it was found to provide good predictions for three times more targets in cross-docking.
Another approach using constrained termini was provided by the work of Park et al. [90]. Their method makes use of all-atom molecular dynamics (MD) and simulated annealing (SA) simulations. Their protocol starts by preparing an initial peptide–MHC structure using homology modeling, with MODELLER [115], Rosetta [75], or PRIME [116]. However, only the MODELLER structure was used to initiate the SA protocol in their study. Each SA cycle consists of heating the system from 300 K to 1500 K during 80 ps, followed by an equilibration at 1500 K for another 80 ps and finally a cooling to 300 K in 800 ps. The MD simulations during the SA cycles were performed using Langevin dynamics, calculated using the AMBER9 program [117] and AMBER force field [118]. In total, 100 SA cycles were performed during which the MHC atoms were restrained to their initial position, and the distances of the four hydrogen bonds between the MHC and the N- and C-termini of the peptide were also maintained. The most frequent conformation among the 100 generated ones was selected as the predicted binding mode.
The approach was tested on 17 pMHC complexes, all HLA-A*02:01. For each one, the experimental structure corresponding to PDB ID 2V2W [42] was used as a template for the homology modeling step. The authors found that homology modeling already provided calculated binding modes with an averaged peptide all-atom RMSD from the native binding mode of only 1.5 Å for MODELLER, and 3.1 Å for Rosetta and PRIME. The SA protocol, which started from the MODELLER-generated conformations, did not improve the predictions, with an all-atom RMSD of 1.6 Å. The authors decided to change the criteria of selection of the final binding mode.
For this, starting from the binding mode generated by the SA protocol, a 10 ns MD simulation was performed at 283 K, again using AMBER and the same restraints on the peptide and MHC. Each trajectory was analyzed to find conformational transitions. Finally, the authors selected the most probable conformational state as the one most frequently reached through conformational transitions. They found that this new MD-based protocol could correct the worst MODELLER prediction, obtained for the complex with PDB ID 2V2X [42], decreasing the peptide all-atom RMSD from 2.4 Å after homology modeling to 1.4 Å after simulated annealing followed by MD simulation. Tested on three other complexes for which MODELLER provided a successful prediction, the MD-based protocol provided results similar to those of homology modeling. In addition, the authors used their method to perform a blind docking of three peptides on HLA-A*02:01. Their predicted binding modes were validated by X-ray crystallography.
A constrained termini approach, using the Rosetta scoring function, was also used by Yanover et al. [91] to calculate position-specific frequency matrices (PFM) for several MHC alleles. In contrast to the other approaches mentioned here, their method was developed to also explore the sequence of the peptides during docking. Their docking approach proceeds in two stages.
First, a low-resolution backbone model for the peptide bound to the MHC is obtained by fragment assembly. The backbone of the peptide is built outward from the two canonical anchor positions by assembling three-residue fragments from proteins of known structure and similar local sequence [119]. The peptide is randomly cut in two fragments between the two anchor positions to perform an independent sampling of the two halves. The cyclic coordinate descent (CCD) loop closure algorithm [120, 121] is finally applied to provide peptide conformations. Several cut/closure cycles are performed to increase the sampling.
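The principle of CCD loop closure can be sketched on a toy planar chain. This is only an illustration under strong simplifying assumptions: actual CCD implementations rotate backbone ϕ/ψ torsions in three dimensions and superimpose several backbone atoms on their targets, whereas here each joint of a 2D chain is rotated in turn so that the free end points toward a single target position. All names and numerical values are hypothetical.

```python
import math


def chain_points(lengths, angles):
    """Joint positions of a planar chain anchored at the origin.

    angles[i] is the bend at joint i, relative to the previous link.
    """
    points = [(0.0, 0.0)]
    heading = 0.0
    for length, angle in zip(lengths, angles):
        heading += angle
        x, y = points[-1]
        points.append((x + length * math.cos(heading),
                       y + length * math.sin(heading)))
    return points


def gap_to(target, point):
    """Distance between the target and a point."""
    return math.hypot(target[0] - point[0], target[1] - point[1])


def ccd_close(lengths, angles, target, tol=1e-3, max_sweeps=200):
    """Rotate one joint at a time so the chain end moves toward the target."""
    angles = list(angles)
    for _ in range(max_sweeps):
        for i in reversed(range(len(angles))):
            points = chain_points(lengths, angles)
            px, py = points[i]     # pivot of the current rotation
            ex, ey = points[-1]    # current position of the chain end
            to_end = math.atan2(ey - py, ex - px)
            to_target = math.atan2(target[1] - py, target[0] - px)
            # Align the pivot-to-end direction with the pivot-to-target direction.
            angles[i] += to_target - to_end
        if gap_to(target, chain_points(lengths, angles)[-1]) < tol:
            break
    return angles, gap_to(target, chain_points(lengths, angles)[-1])


# Hypothetical four-link chain and closure target.
lengths = [1.0, 1.0, 1.0, 1.0]
start_angles = [0.3, 0.3, 0.3, 0.3]
target = (2.0, 1.0)
start_gap = gap_to(target, chain_points(lengths, start_angles)[-1])
closed_angles, final_gap = ccd_close(lengths, start_angles, target)
```

Each single-joint rotation can only reduce, or leave unchanged, the distance between the chain end and the target, which is why the procedure can simply be iterated until the gap is small enough.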
In addition, the orientation of the anchor positions is sampled by replacement with orientations derived from a peptide–MHC complex of known structure. In this low-resolution stage, both the peptide and the MHC are represented only by their backbone, and their energy is calculated using a knowledge-based scoring function.
Second, a high-resolution modeling step is performed by adding all side chain atoms to the low-resolution backbone model. Peptide conformations obtained this way are refined using a Monte Carlo optimization procedure. During this stage, the peptide sequence is considered as a degree of freedom and is optimized by Monte Carlo moves. Preferred peptides and binding modes are selected based on the binding energy. The potential energy function used in the second stage is the Rosetta all-atom potential [122] modified to incorporate a short-ranged electrostatics term and an implicit solvation model. Docking and sequence-optimizing of thousands of peptides using this approach provided computed PFMs very close to the experimental PFMs for 29 MHC class I (11 HLA-A and 18 HLA-B).
The ICM docking approach was also tested in the context of a constrained termini approach by Bordner et al. [93] using a flexible all-atom model of the complete peptide. The authors used their biased-probability Monte Carlo [123] conformational search method implemented in the ICM program [98, 99] to sample the conformational space of the peptide into a grid potential derived from an X-ray structure of the MHC molecule. Based on the fact that the MHC side chain conformations from all available X-ray crystal structures of HLA-A*02:01 MHC were found to cluster into only two groups, one representative structure from each group, PDB ID 1JF1 [22] and 1I7U [124], was used for docking.
A quadratic restraint energy was applied between atoms on the peptide to be docked and the corresponding atoms of the N- and C-termini of the peptide in the original pMHC structure to account for the conserved position of the peptide termini. The 50 lowest energy conformations from the grid docking simulations were ranked using the energy of an all-atom model of the complex, using the ECEPP/3 force field [100], after local minimization of the peptide and nearby MHC residues. The lowest energy conformation was chosen as the final docking solution.
The approach was tested by cross-docking of 14 peptides into HLA-A*02:01 and 9 peptides into H-2-Kb, as well as docking peptides into homology models for five different MHC allotypes. For the cross-docking on HLA-A*02:01 and H-2-Kb, the authors obtained an averaged backbone RMSD to the native binding mode of 1.1 Å and 0.7 Å, respectively. For the docking on homology models, the authors obtained an averaged backbone RMSD to the native binding mode of 1.1 Å.
Recently, the team of Kavraki and coworkers, which provided several important contributions to the field of peptide–MHC docking, notably with DINC [94–96] and HLA-Arena (see below), proposed the APE-Gen [92] (Anchored Peptide-MHC Ensemble Generator) approach to generate ensembles of bound conformations of pMHC complexes. The development of this approach followed the observation that pMHC complexes are dynamic systems and that taking their flexibility into account can significantly enhance functional interpretations [125]. The objective of APE-Gen is therefore to generate an ensemble of conformations of the pMHC system, as opposed to producing only the most probable one as done in docking. Consequently, APE-Gen is not a docking method strictly speaking. As usual, the approach is decomposed into several steps. First, the MHC structure can be provided as a PDB file if an experimental structure is available.
Otherwise, the MHC structure is obtained by homology modeling using MODELLER [115] and selecting the best solution according to the DOPE [126] score. Then, APE-Gen places the termini atoms of the peptide backbone in the MHC groove using a template pMHC conformation (which can be different from the template used by MODELLER above) so as to capitalize on the conserved position of the N- and C-termini of the peptides in the pMHC complexes. Second, 100 backbone conformations are generated for the peptide by loop modeling based on the random coordinate descent [127] (RCD) algorithm, which is a modification of the previously mentioned CCD approach [120, 121]. Third, for each of the backbone conformations generated above, the side chains are added with PDBFixer from OpenMM [128], and the resulting full peptide and MHC side chains are energy minimized with SMINA.
APE-Gen was tested on 535 pMHC complexes of class I, with peptide lengths ranging from 8- to 11-mers, and for which an experimental X-ray structure is available. For each of these test cases, APE-Gen was run 10 times. The averaged RMSD between the native binding mode and the closest sampled peptide conformation, over the entire test set, is 0.9 Å for the Cα atoms and 2.0 Å for all atoms, showing that, although APE-Gen is not a docking software, it can generate peptide conformations close to the native one. APE-Gen is open source and freely available as standalone software. It is also available within HLA-Arena, a recently developed platform for structural modeling and analysis of pMHC complexes [129].
In their work, Fagerberg et al. [38] designed an MD conformational sampling protocol based on SA cycles with near-logarithmic cooling. Here, the pMHC system is described using the CHARMM force field [130], and calculations are performed using the CHARMM Molecular Mechanics package [131].
Starting from the native binding mode (for redocking runs) or a near-native binding mode (for cross-docking), the system is submitted to successive cycles of SA. An SA cycle starts by assigning random velocities at 100 K to the peptide atoms. The system is heated over 3 ps to a temperature of 1300 K using Langevin dynamics before being equilibrated at this temperature for another 3 ps and subsequently being cooled for 25 ps. The SA cycle is terminated by a minimization of the system, and the final conformation is stored for additional analysis before being used as a starting point for the next SA cycle. The SA cycle is repeated until a collection of 1000 peptide conformers is obtained. During this process, the solvent effect is approximated by using a distance-dependent dielectric constant (ε = 4r). The MHC molecule is kept rigid during the entire sampling but no constraints are applied to the peptide. Therefore, although the approach is not totally agnostic about the particularities of the typical peptide–MHC binding since it initiates the process from a native or near-native binding mode, no constraint is applied to the backbone conformation nor to the N- and C-termini. However, the rigidity of the MHC during the SA might fix de facto the N- and C-termini in their original position. The authors demonstrated that increasing the temperature up to 1300 K erased very rapidly most of the memory of the starting native state during the SA: some of the conformers generated early in the process lost most of the native contacts, and minor pockets of the MHC were unfilled. Consequently, this SA protocol is expected to erase the structural memory of the native binding mode and to only keep the memory of the global orientation of the peptide in the groove as well as some of the interactions in the anchor binding pockets.
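The temperature profile of one such SA cycle can be written down as a short sketch. The start temperature (100 K), the plateau (1300 K), and the durations (3 ps heating, 3 ps equilibration, 25 ps cooling) come from the description above. Everything else is an assumption made for illustration: the heating ramp is taken as linear, the final temperature is taken as 100 K, and the "near-logarithmic" cooling is modeled with a classical logarithmic schedule whose time constant is arbitrary. The function and constant names are hypothetical and do not correspond to CHARMM commands.

```python
import math

T_START_K = 100.0   # from the text: velocities assigned at 100 K
T_HOT_K = 1300.0    # from the text: plateau temperature
HEAT_PS = 3.0       # from the text: heating time
HOLD_PS = 3.0       # from the text: equilibration time
COOL_PS = 25.0      # from the text: cooling time
T_END_K = 100.0     # ASSUMPTION: final temperature is not given in the text
TAU_PS = 1.0        # ASSUMPTION: time constant of the logarithmic decay


def sa_temperature(t_ps):
    """Target temperature (K) at time t_ps (ps) within one SA cycle."""
    if t_ps <= HEAT_PS:
        # ASSUMPTION: linear heating ramp.
        return T_START_K + (T_HOT_K - T_START_K) * t_ps / HEAT_PS
    if t_ps <= HEAT_PS + HOLD_PS:
        return T_HOT_K
    # ASSUMPTION: logarithmic cooling, T = T_hot / (1 + a * ln(1 + t / tau)),
    # with a chosen so that T_END_K is reached after COOL_PS.
    t_cool = min(t_ps - HEAT_PS - HOLD_PS, COOL_PS)
    a = (T_HOT_K / T_END_K - 1.0) / math.log(1.0 + COOL_PS / TAU_PS)
    return T_HOT_K / (1.0 + a * math.log(1.0 + t_cool / TAU_PS))


# Temperature sampled every picosecond over the 31 ps cycle.
profile = [sa_temperature(float(t)) for t in range(0, 32)]
```

A logarithmic schedule drops quickly just after the plateau and then slows down, so most of the cooling time is spent at low temperature where the peptide settles into a nearby minimum.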
At the end of the SA cycles, the 1000 minimized peptide conformations are clustered based on their relative RMSD. Their effective energy, including the CHARMM energy and a GB-MV2 [132, 133] implicit estimation of the solvent effect, is calculated. The final binding mode is the center of a cluster of binding poses, itself selected based on the size of the cluster or on the mean effective energy of its members.
The approach was challenged by the redocking of 14 HLA-A*02:01 pMHC and 27 non-HLA-A*02:01 pMHC. When selecting the final result based on the cluster size, 86% of the calculated binding modes for HLA-A*02:01 pMHC have an RMSD to the native binding mode lower than 1.0 Å for the backbone atoms and 71% an RMSD lower than 1.5 Å for all heavy atoms. For non-HLA-A*02:01 pMHC, these success rates become 70 and 59%, respectively. When replacing the cluster size by the mean effective energy to select the output, the success rates increase to 100% and 93% for HLA-A*02:01 pMHC and to 74 and 67% for non-HLA-A*02:01 pMHC.

4.4 Incremental Peptide Reconstruction
The predictive ability of standard small-molecule/protein docking software drops if the number of internal degrees of freedom (DoF) of the ligand is larger than 10. However, the number of DoF of 8- to 11-mer peptides largely surpasses this limit, making standard docking software inadequate for the peptide–MHC system. Based on this observation, Antunes et al. [94] proposed a new algorithm, derived from their initial DINC docking software [95]. Their approach addresses this constraint by limiting the number of peptide-related DoF to 6 at every step of the docking process through an incremental reconstruction of the peptide, rather than by docking the entire peptide and exploring all its degrees of freedom at once. Briefly, DINC (Docking Incrementally) starts by docking only a small fragment of the peptide.
This initial root fragment is chosen so as to maximize the potential for hydrogen bonds (i.e., by counting the number of hydrogen bond donors and acceptors), while limiting the number of DoF to 6. This fragment is docked using a standard docking software, and the 10 best binding modes according to the calculated binding free energy are selected for fragment expansion. These docked fragment poses are expanded by adding atoms following the same heuristic as above (i.e., optimizing the number of hydrogen bond donors and acceptors). The expanded fragment, in the different selected binding modes, is used as input for a second round of dockings in which a new set of 6 DoF is considered flexible regardless of the fragment size. These new DoF are selected to involve some of the newly added atoms and some of the atoms that were already present in the previous fragment. This process of docking and expansion is repeated until the peptide has been entirely reconstructed and docked. Of note, DINC is a meta-docking approach and its code is only in charge of the selection of the initial fragment, of its incremental expansion, and of the choice of the DoF used in dockings. In the discussed version of the algorithm, the docking of the fragments itself is delegated to the AutoDock 4 software [105] used with standard parameters.
Notably, DINC was developed to dock peptides in general, not necessarily in the context of peptide–MHC docking. Consequently, its algorithm does not use the characteristics of the pMHC system, such as the conservation of the position of the N- and C-termini, or the limited number of backbone conformations observed for peptides in the grooves of MHC class I. The approach was tested via the redocking of 25 pMHC complexes [94], spanning 10 different MHC class I and peptides ranging from 8- to 10-mers. The averaged Cα and all-atom RMSD between the calculated and native binding modes were 1.0 and 1.9 Å, respectively.
This result is particularly impressive given that the DINC algorithm does not capitalize on any specific constraints resulting from the known features of the pMHC complexes. Recently, a new version of DINC was released. DINC 2.0 was modified to use AutoDock Vina and to be more efficient on larger peptides. It was made accessible as a freely available docking web server [96].
An alternative form of incremental peptide reconstruction is offered by DynaPred [97]. Briefly, the latter performs MD simulations to approximate the binding free energy of each peptide residue inside the binding pockets of the MHC cleft. The structural information obtained by these simulations is used to construct the 3D structure of the pMHC complexes. To stabilize the peptide conformations, single residues are extended to peptide trimers and dimers by adding glycine residues at both sides. The final docked poses are obtained by connecting the residue conformations from the simulation runs and performing a short energy minimization. The approach was tested by cross-docking on 20 complexes of 9-mer peptides in MHC HLA-A*02:01. The authors obtained an average backbone RMSD of 1.5 Å.
Of note, the incremental peptide reconstruction strategy is used in other peptide docking algorithms [35, 134], but these will not be detailed here since they were not intensively tested on pMHC complexes.

5 Conclusion
Progress in experimental approaches has brought a wealth of information regarding MHC specificities for peptides and TCR specificities for pMHC epitopes. Given the importance of this information for the treatment of cancer and autoimmune diseases, these data have been curated and collected in several freely accessible databases. In return, this allowed the development of several rapid and efficient machine-learning or deep-learning approaches to predict these specificities, which are now widely used in immunoinformatics.
The number of available pMHC and TCR-pMHC 3D structures is also continuously growing. However, it remains negligible compared to the huge diversity of pMHC that results from the number of possible peptides and MHC allotypes. To address this limitation, several pMHC docking algorithms have been developed and benchmarked, some of them very recently. Docking peptides, even limited to 8 to 11 amino acids in length, is particularly challenging in view of the flexibility of such ligands and their large number of degrees of freedom. However, pMHC docking codes, notably for class I MHC, can rely on different data-driven assumptions regarding the overall orientation of the peptide in the MHC groove, the position of the N- and C-termini of the peptide and the general conformation of its backbone. Thanks to this knowledge, several approaches have been developed and can be categorized according to the nomenclature proposed by Antunes et al. [17] as constrained backbone, constrained termini, and incremental peptide reconstruction approaches. Some of these approaches demonstrated a satisfactory predictive ability in redocking and cross-docking of some pMHC complexes. However, there is still room for improvement in terms of speed and availability. Of note, the predictive ability of the approaches is generally tested on an increasingly large number of diverse pMHC complexes. However, the number of such complexes in the test sets remains small in view of the real diversity in terms of MHC allotypes and peptide sequence and length. We can expect that the increasing amount of experimental data available will nurture the design of new pMHC docking software and the enhancement of existing ones.

Acknowledgments
This work was supported by the University of Lausanne, Department of Oncology UNIL-CHUV, the Ludwig Institute for Cancer Research, Lausanne Branch, the SIB Swiss Institute of Bioinformatics, SNSF grants to VZ (#205321_192019, CRSII5_193749 and CRSK-3_190400) and OM (#31003A_176168), and funds from Research for Life to OM.

References
1. Hansen TH, Bouvier M (2009) MHC class I antigen presentation: learning from viral evasion strategies. Nat Rev Immunol 9:503–513. https://doi.org/10.1038/nri2575
2. Hewitt EW (2003) The MHC class I antigen presentation pathway: strategies for viral immune evasion. Immunology 110:163–169. https://doi.org/10.1046/j.1365-2567.2003.01738.x
3. Jones EY, Fugger L, Strominger JL, Siebold C (2006) MHC class II proteins and disease: a structural perspective. Nat Rev Immunol 6:271–282. https://doi.org/10.1038/nri1805
4. Roche PA, Furuta K (2015) The ins and outs of MHC class II-mediated antigen processing and presentation. Nat Rev Immunol 15:203–216. https://doi.org/10.1038/nri3818
5. Schumacher TN, Schreiber RD (2015) Neoantigens in cancer immunotherapy. Science 348:69–74. https://doi.org/10.1126/science.aaa4971
6. Tran E, Turcotte S, Gros A et al (2014) Cancer immunotherapy based on mutation-specific CD4+ T cells in a patient with epithelial cancer. Science 344:641–645. https://doi.org/10.1126/science.1251102
7. Sahin U, Türeci Ö (2018) Personalized vaccines for cancer immunotherapy. Science 359:1355–1360. https://doi.org/10.1126/science.aar7112
8. Wirth TC, Kühnel F (2017) Neoantigen targeting-dawn of a new era in cancer immunotherapy? Front Immunol 8:1848. https://doi.org/10.3389/fimmu.2017.01848
9. Tran E, Robbins PF, Rosenberg SA (2017) “Final common pathway” of human cancer immunotherapy: targeting random somatic mutations. Nat Immunol 18:255–262. https://doi.org/10.1038/ni.3682
10. Lizée G, Overwijk WW, Radvanyi L et al (2013) Harnessing the power of the immune system to target cancer. Annu Rev Med 64:71–90. https://doi.org/10.1146/annurev-med-112311-083918
11. Galluzzi L, Chan TA, Kroemer G et al (2018) The hallmarks of successful anticancer immunotherapy. Sci Transl Med 10:eaat7807. https://doi.org/10.1126/scitranslmed.aat7807
12. Comber JD, Philip R (2014) MHC class I antigen presentation and implications for developing a new generation of therapeutic vaccines. Ther Adv Vaccines 2:77–89. https://doi.org/10.1177/2051013614525375
13. Yin Y, Li Y, Mariuzza RA (2012) Structural basis for self-recognition by autoimmune T-cell receptors. Immunol Rev 250:32–48. https://doi.org/10.1111/imr.12002
14. Gfeller D, Bassani-Sternberg M, Schmidt J, Luescher IF (2016) Current tools for predicting cancer-specific T cell immunity. Oncoimmunology 5:e1177691. https://doi.org/10.1080/2162402X.2016.1177691
15. Mösch A, Raffegerst S, Weis M et al (2019) Machine learning for cancer immunotherapies based on epitope recognition by T cell receptors. Front Genet 10:1141. https://doi.org/10.3389/fgene.2019.01141
16. Adams JJ, Narayanan S, Birnbaum ME et al (2016) Structural interplay between germline interactions and adaptive recognition determines the bandwidth of TCR-peptide-MHC cross-reactivity. Nat Immunol 17:87–94. https://doi.org/10.1038/ni.3310
17. Antunes DA, Abella JR, Devaurs D et al (2018) Structure-based methods for binding mode and binding affinity prediction for peptide-MHC complexes. Curr Top Med Chem 18:2239–2255. https://doi.org/10.2174/1568026619666181224101744
18. Robinson J, Barker DJ, Georgiou X et al (2020) IPD-IMGT/HLA Database. Nucleic Acids Res 48:D948–D955. https://doi.org/10.1093/nar/gkz950
19. Gfeller D, Bassani-Sternberg M (2018) Predicting antigen presentation-what could we learn from a million peptides? Front Immunol 9:1716. https://doi.org/10.3389/fimmu.2018.01716
20. Klein J, Sato A (2000) The HLA system. First of two parts. N Engl J Med 343:702–709. https://doi.org/10.1056/NEJM200009073431006
21. Gao GF, Rao Z, Bell JI (2002) Molecular coordination of alphabeta T-cell receptors and coreceptors CD8 and CD4 in their recognition of peptide-MHC ligands. Trends Immunol 23:408–413. https://doi.org/10.1016/s1471-4906(02)02282-2
22. Sliz P, Michielin O, Cerottini JC et al (2001) Crystal structures of two closely related but antigenically distinct HLA-A2/melanocyte-melanoma tumor-antigen peptide complexes. J Immunol 167:3276–3284
23. Pettersen EF, Goddard TD, Huang CC et al (2004) UCSF chimera--a visualization system for exploratory research and analysis. J Comput Chem 25:1605–1612. https://doi.org/10.1002/jcc.20084
24. Gao GF, Tormo J, Gerth UC et al (1997) Crystal structure of the complex between human CD8alpha(alpha) and HLA-A2. Nature 387:630–634. https://doi.org/10.1038/42523
25. Wang H, Capps GG, Robinson BE, Zúñiga MC (1994) Ab initio association with beta 2-microglobulin during biosynthesis of the H-2Ld class I major histocompatibility complex heavy chain promotes proper disulfide bond formation and stable peptide binding. J Biol Chem 269:22276–22281. https://doi.org/10.1016/S0021-9258(17)31787-8
26. Shields MJ, Kubota R, Hodgson W et al (1998) The effect of human beta2-microglobulin on major histocompatibility complex I peptide loading and the engineering of a high affinity variant. Implications for peptide-based vaccines. J Biol Chem 273:28010–28018. https://doi.org/10.1074/jbc.273.43.28010
27. Uger RA, Chan SM, Barber BH (1999) Covalent linkage to beta2-microglobulin enhances the MHC stability and antigenicity of suboptimal CTL epitopes. J Immunol 162:6024–6028
28. Collins EJ, Garboczi DN, Wiley DC (1994) Three-dimensional structure of a peptide extending from one end of a class I MHC binding site. Nature 371:626–629. https://doi.org/10.1038/371626a0
29. Guillaume P, Picaud S, Baumgaertner P et al (2018) The C-terminal extension landscape of naturally presented HLA-I ligands. Proc Natl Acad Sci U S A 115:5083–5088. https://doi.org/10.1073/pnas.1717277115
30. Matsui M, Hioe CE, Frelinger JA (1993) Roles of the six peptide-binding pockets of the HLA-A2 molecule in allorecognition by human cytotoxic T-cell clones. Proc Natl Acad Sci U S A 90:674–678. https://doi.org/10.1073/pnas.90.2.674
31. Deres K, Beck W, Faath S et al (1993) MHC/peptide binding studies indicate hierarchy of anchor residues. Cell Immunol 151:158–167. https://doi.org/10.1006/cimm.1993.1228
32. Bassani-Sternberg M, Chong C, Guillaume P et al (2017) Deciphering HLA-I motifs across HLA peptidomes improves neo-antigen predictions and identifies allostery regulating HLA specificity. PLoS Comput Biol 13:e1005725. https://doi.org/10.1371/journal.pcbi.1005725
33. Perez MAS, Bassani-Sternberg M, Coukos G et al (2019) Analysis of secondary structure biases in naturally presented HLA-I ligands. Front Immunol 10:2731. https://doi.org/10.3389/fimmu.2019.02731
34. Liu J, Gao GF (2011) Major histocompatibility complex: interaction with peptides. eLS. https://doi.org/10.1002/9780470015902.a0000922.pub2
35. Sezerman U, Vajda S, DeLisi C (1996) Free energy mapping of class I MHC molecules and structural determination of bound peptides. Protein Sci 5:1272–1281. https://doi.org/10.1002/pro.5560050706
36. Antunes DA, Vieira GF, Rigo MM et al (2010) Structural allele-specific patterns adopted by epitopes in the MHC-I cleft and reconstruction of MHC:peptide complexes to cross-reactivity assessment. PLoS One 5:e10353. https://doi.org/10.1371/journal.pone.0010353
37. Schueler-Furman O, Elber R, Margalit H (1998) Knowledge-based structure prediction of MHC class I bound peptides: a study of 23 complexes. Fold Des 3:549–564. https://doi.org/10.1016/S1359-0278(98)00070-4
38. Fagerberg T, Cerottini J-C, Michielin O (2006) Structural prediction of peptides bound to MHC class I. J Mol Biol 356:521–546. https://doi.org/10.1016/j.jmb.2005.11.059
39.
Nicholls S, Piper KP, Mohammed F et al (2009) Secondary anchor polymorphism in the HA-1 minor histocompatibility antigen critically affects MHC stability and TCR recognition. Proc Natl Acad Sci U S A 106:3889–3894. https://doi.org/10.1073/ pnas.0900411106 40. Reiser J-B, Legoux F, Gras S et al (2014) Analysis of relationships between peptide/ MHC structural features and naive T cell frequency in humans. J Immunol 193:5816–5826. https://doi.org/10.4049/ jimmunol.1303084 41. Cole DK, Bulek AM, Dolton G et al (2016) Hotspot autoimmune T cell receptor binding underlies pathogen and insulin peptide crossreactivity. J Clin Invest 126:2191–2204. https://doi.org/10.1172/JCI85679 42. Lee JK, Stewart-Jones G, Dong T et al (2004) T cell cross-reactivity and conformational changes during TCR engagement. J Exp Med 200:1455–1466. https://doi.org/10. 1084/jem.20041251 43. Pieper J, Dubnovitsky A, Gerstner C et al (2018) Memory T cells specific to citrullinated α-enolase are enriched in the rheumatic joint. J Autoimmun 92:47–56. https://doi. org/10.1016/j.jaut.2018.04.004 44. Wang JH, Meijers R, Xiong Y et al (2001) Crystal structure of the human CD4 N-terminal two-domain fragment complexed to a class II MHC molecule. Proc Natl Acad Sci U S A 98:10799–10804. https://doi.org/ 10.1073/pnas.191124098 45. Chicz RM, Urban RG, Lane WS et al (1992) Predominant naturally processed peptides bound to HLA-DR1 are derived from MHC-related molecules and are heterogeneous in size. Nature 358:764–768. https:// doi.org/10.1038/358764a0 46. Achour A (2001) Major histocompatibility complex: interaction with peptides. eLS. https://doi.org/10.1038/npg.els.0000922 47. Burley SK, Berman HM, Kleywegt GJ et al (2017) Protein data Bank (PDB): the single global macromolecular structure archive. Methods Mol Biol 1607:627–641. https:// doi.org/10.1007/978-1-4939-7000-1_26 48. Berman HM (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242. https://doi. org/10.1093/nar/28.1.235 49. 
Sinigaglia M, Antunes DA, Rigo MM et al (2013) CrossTope: a curate repository of 3D structures of immunogenic peptide: MHC complexes. Database 2013:bat002. https:// doi.org/10.1093/database/bat002 50. Tong JC, Kong L, Tan TW, Ranganathan S (2006) MPID-T: database for sequencestructure-function information on T-cell receptor/peptide/MHC interactions. Appl Bioinforma 5:111–114. https://doi.org/10. 2165/00822942-200605020-00005 51. Khan JM, Cheruku HR, Tong JC, Ranganathan S (2011) MPID-T2: a database for sequence-structure-function analyses of pMHC and TR/pMHC structures. Bioinformatics 27:1192–1193. https://doi.org/10. 1093/bioinformatics/btr104 52. Kaas Q, Ruiz M, Lefranc M-P (2004) IMGT/ 3Dstructure-DB and IMGT/StructuralQuery, a database and a tool for immunoglobulin, T cell receptor and MHC structural data. Nucleic Acids Res 32:D208–D210. https:// doi.org/10.1093/nar/gkh042 53. Gowthaman R, Pierce BG (2019) TCR3d: the T cell receptor structural repertoire database. Bioinformatics 35:5323–5325. https:// doi.org/10.1093/bioinformatics/btz517 54. Leem J, de Oliveira SHP, Krawczyk K, Deane CM (2018) STCRDab: the structural T-cell receptor database. Nucleic Acids Res 46: D406–D412. https://doi.org/10.1093/ nar/gkx971 55. Borrman T, Cimons J, Cosiano M et al (2017) ATLAS: a database linking binding affinities with structures for wild-type and mutant TCR-pMHC complexes. Proteins 85:908–916. https://doi.org/10.1002/ prot.25260 56. Calis JJA, Maybeno M, Greenbaum JA et al (2013) Properties of MHC class I presented peptides that enhance immunogenicity. PLoS Comput Biol 9:e1003266. https://doi.org/ 10.1371/journal.pcbi.1003266 57. Chowell D, Krishna S, Becker PD et al (2015) TCR contact residue hydrophobicity is a hallmark of immunogenic CD8+ T cell epitopes. Structural Prediction of Peptide–MHC Binding Modes PNAS 112:E1754–E1762. https://doi.org/ 10.1073/pnas.1500973112 58. Vita R, Overton JA, Greenbaum JA et al (2015) The immune epitope database (IEDB) 3.0. 
Nucleic Acids Res 43: D405–D412. https://doi.org/10.1093/ nar/gku938 59. Dhanda SK, Mahajan S, Paul S et al (2019) IEDB-AR: immune epitope database-analysis resource in 2019. Nucleic Acids Res 47: W502–W506. https://doi.org/10.1093/ nar/gkz452 60. Nielsen M, Lundegaard C, Worning P et al (2003) Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci 12:1007–1017. https://doi.org/10.1110/ps.0239403 61. Andreatta M, Nielsen M (2016) Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics 32:511–517. https://doi. org/10.1093/bioinformatics/btv639 62. Jurtz V, Paul S, Andreatta M et al (2017) NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J Immunol 199:3360–3368. https:// doi.org/10.4049/jimmunol.1700893 63. Karosiene E, Lundegaard C, Lund O, Nielsen M (2012) NetMHCcons: a consensus method for the major histocompatibility complex class I predictions. Immunogenetics 64:177–186. https://doi.org/10.1007/ s00251-011-0579-8 64. O’Donnell TJ, Rubinsteyn A, Bonsack M et al (2018) MHCflurry: open-source class I MHC binding affinity prediction. Cell Syst 7:129–132.e4. https://doi.org/10.1016/j. cels.2018.05.014 65. Phloyphisut P, Pornputtapong N, Sriswasdi S, Chuangsuwanich E (2019) MHCSeqNet: a deep neural network model for universal MHC binding prediction. BMC Bioinformatics 20:270–210. https://doi.org/10.1186/ s12859-019-2892-4 66. Venkatesh G, Grover A, Srinivasaraghavan G, Rao S (2020) MHCAttnNet: predicting MHC-peptide bindings for MHC alleles classes I and II using an attention-based deep neural model. Bioinformatics 36:i399–i406. https://doi.org/10.1093/bioinformatics/ btaa479 67. Maccari G, Robinson J, Ballingall K et al (2017) IPD-MHC 2.0: an improved interspecies database for the study of the major histocompatibility complex. Nucleic Acids 279 Res 45:D860–D864. 
https://doi.org/10. 1093/nar/gkw1050 68. Shugay M, Bagaev DV, Zvyagin IV et al (2018) VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Res 46:D419–D427. https://doi.org/10.1093/nar/gkx760 69. Tickotsky N, Sagiv T, Prilusky J et al (2017) McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences. Bioinformatics 33:2924–2929. https://doi.org/10.1093/bioinformatics/ btx286 70. Armstrong DR, Berrisford JM, Conroy MJ et al (2020) PDBe: improved findability of macromolecular structure data in the PDB. Nucleic Acids Res 48:D335–D343. https:// doi.org/10.1093/nar/gkz990 71. Velankar S, Alhroub Y, Best C et al (2012) PDBe: Protein Data Bank in Europe. Nucleic Acids Res 40:D445–D452. https://doi.org/ 10.1093/nar/gkr998 72. Gutmanas A, Alhroub Y, Battle GM et al (2014) PDBe: Protein Data Bank in Europe. Nucleic Acids Res 42:D285–D291. https:// doi.org/10.1093/nar/gkt1180 73. Velankar S, Best C, Beuth B et al (2010) PDBe: Protein Data Bank in Europe. Nucleic Acids Res 38:D308–D317. https://doi.org/ 10.1093/nar/gkp916 74. Wong WK, Marks C, Leem J et al (2020) TCRBuilder: multi-state T-cell receptor structure prediction. Bioinformatics 36:3580–3581. https://doi.org/10.1093/ bioinformatics/btaa194 75. Raman S, Vernon R, Thompson J et al (2009) Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins 77 Suppl 9:89–99. https://doi.org/10.1002/prot. 22540 76. Mazza C, Auphan-Anezin N, Gregoire C et al (2007) How much can a T-cell antigen receptor adapt to structurally distinct antigenic peptides? EMBO J 26:1972–1983. https:// doi.org/10.1038/sj.emboj.7601605 77. Buckle AM, Borg NA (2018) Integrating experiment and theory to understand TCR-pMHC dynamics. Front Immunol 9:2898. https://doi.org/10.3389/fimmu. 2018.02898 78. Giguère S, Drouin A, Lacoste A et al (2013) MHC-NP: predicting peptides naturally processed by the MHC. J Immunol Methods 400-401:30–36. 
https://doi.org/10.1016/ j.jim.2013.10.003 280 Marta A. S. Perez et al. 79. Paul S, Karosiene E, Dhanda SK et al (2018) Determination of a predictive cleavage motif for eluted major histocompatibility complex class II ligands. Front Immunol 9:1795. https://doi.org/10.3389/fimmu.2018. 01795 80. Wang Z, Sun H, Yao X et al (2016) Comprehensive evaluation of ten docking programs on a diverse set of protein-ligand complexes: the prediction accuracy of sampling power and scoring power. Phys Chem Chem Phys 18:12964–12975. https://doi.org/10. 1039/c6cp01555g 81. Gathiaka S, Liu S, Chiu M et al (2016) D3R grand challenge 2015: evaluation of proteinligand pose and affinity predictions. J Comput Aided Mol Des 30:651–668. https://doi. org/10.1007/s10822-016-9946-8 82. Mey ASJS, Juárez-Jiménez J, Hennessy A, Michel J (2016) Blinded predictions of binding modes and energies of HSP90-α ligands for the 2015 D3R grand challenge. Bioorg Med Chem 24:4890–4899. https://doi. org/10.1016/j.bmc.2016.07.044 83. Xu X, Ma Z, Duan R, Zou X (2019) Predicting protein-ligand binding modes for CELPP and GC3: workflows and insight. J Comput Aided Mol Des 33:367–374. https://doi. org/10.1007/s10822-019-00185-0 84. Pagadala NS, Syed K, Tuszynski J (2017) Software for molecular docking: a review. Biophys Rev 9:91–102. https://doi.org/10. 1007/s12551-016-0247-1 85. Kontoyianni M, McClellan LM, Sokol GS (2004) Evaluation of docking performance: comparative data on docking algorithms. J Med Chem 47:558–565. https://doi.org/ 10.1021/jm0302997 86. Khan JM, Ranganathan S (2010) pDOCK: a new technique for rapid and accurate docking of peptide ligands to major histocompatibility complexes. Immunome Res 6 Suppl 1:S2. https://doi.org/10.1186/1745-7580-6S1-S2 87. Rigo MM, Antunes DA, de Freitas MV et al (2015) DockTope: a web-based tool for automated pMHC-I modelling. Sci Rep 5:18413. https://doi.org/10.1038/srep18413 88. 
London N, Raveh B, Cohen E et al (2011) Rosetta FlexPepDock web server--high resolution modeling of peptide-protein interactions. Nucleic Acids Res 39:W249–W253. https://doi.org/10.1093/nar/gkr431 89. Kyeong H-H, Choi Y, Kim H-S (2018) GradDock: rapid simulation and tailored ranking functions for peptide-MHC class I docking. Bioinformatics 34:469–476. https://doi. org/10.1093/bioinformatics/btx589 90. Park M-S, Park SY, Miller KR et al (2013) Accurate structure prediction of peptideMHC complexes for identifying highly immunogenic antigens. Mol Immunol 56:81–90. https://doi.org/10.1016/j.molimm.2013. 04.011 91. Yanover C, Bradley P (2011) Large-scale characterization of peptide-MHC binding landscapes with structural simulations. PNAS 108:6981–6986. https://doi.org/10.1073/ pnas.1018165108 92. Abella JR, Antunes DA, Clementi C, Kavraki LE (2019) APE-gen: a fast method for generating ensembles of bound peptide-MHC conformations. Molecules 24:881. https:// doi.org/10.3390/molecules24050881 93. Bordner AJ, Abagyan R (2006) Ab initio prediction of peptide-MHC binding geometry for diverse class I MHC allotypes. Proteins 63:512–526. https://doi.org/10.1002/ prot.20831 94. Antunes DA, Devaurs D, Moll M et al (2018) General prediction of peptide-MHC binding modes using incremental docking: a proof of concept. Sci Rep 8:4327. https://doi.org/ 10.1038/s41598-018-22173-4 95. Dhanik A, McMurray JS, Kavraki LE (2013) DINC: a new AutoDock-based protocol for docking large ligands. BMC Struct Biol 13 (Suppl 1):S11–S14. https://doi.org/10. 1186/1472-6807-13-S1-S11 96. Antunes DA, Moll M, Devaurs D et al (2017) DINC 2.0: a new protein-peptide docking webserver using an incremental approach. Cancer Res 77:e55–e57. https://doi.org/ 10.1158/0008-5472.CAN-17-0511 97. Antes I, Siu SWI, Lengauer T (2006) DynaPred: a structure and sequence based method for the prediction of MHC class I binding peptide sequences and conformations. Bioinformatics 22:e16–e24. 
https:// doi.org/10.1093/bioinformatics/btl216 98. Abagyan R, Totrov M, Kuznetsov D (1994) Icm - a new method for protein modeling and design - applications to docking and structure prediction from the distorted native conformation. J Comput Chem 15:488–506. https://doi.org/10.1002/jcc.540150503 99. Abagyan RA, Totrov M (1999) Ab InitioFolding of peptides by the optimal-Bias Monte Carlo minimization procedure. J Comput Phys 151:402–421. https://doi. org/10.1006/jcph.1999.6233 Structural Prediction of Peptide–MHC Binding Modes 100. Nemethy G, Gibson KD, Palmer KA et al (2002) Energy parameters in polypeptides. 10. Improved geometrical parameters and nonbonded interactions for use in the ECEPP/3 algorithm, with application to proline-containing peptides. J Phys Chem 96:6472–6484. https://doi.org/10.1021/ j100194a068 101. Rudolph MG, Shen LQ, Lamontagne SA et al (2004) A peptide that antagonizes TCR-mediated reactions with both syngeneic and allogeneic agonists: functional and structural aspects. J Immunol 172:2994–3002. https://doi.org/10.4049/jimmunol.172.5. 2994 102. Rückert C, Fiorillo MT, Loll B et al (2006) Conformational dimorphism of self-peptides and molecular mimicry in a disease-associated HLA-B27 subtype. J Biol Chem 281:2306–2316. https://doi.org/10.1074/ jbc.M508528200 103. Meijers R, Lai C-C, Yang Y et al (2005) Crystal structures of murine MHC class I H-2 D (b) and K(b) molecules in complex with CTL epitopes from influenza A virus: implications for TCR repertoire selection and immunodominance. J Mol Biol 345:1099–1110. https://doi.org/10.1016/j.jmb.2004.11. 023 104. Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31:455–461. https://doi.org/10.1002/jcc. 21334 105. Morris GM, Huey R, Lindstrom W et al (2009) AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J Comput Chem 30:2785–2791. 
https://doi.org/10.1002/jcc.21256 106. Abraham MJ, Murtola T, Schulz R et al (2015) GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1-2:19–25 107. Oostenbrink C, Villa A, Mark AE, van Gunsteren WF (2004) A biomolecular force field based on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5 and 53A6. J Comput Chem 25:1656–1676. https://doi.org/10.1002/ jcc.20090 108. Liu T, Pan X, Chao L et al (2014) Subangstrom accuracy in pHLA-I modeling by Rosetta FlexPepDock refinement protocol. J Chem Inf Model 54:2233–2242. https:// doi.org/10.1021/ci500393h 281 109. Rohl CA, Strauss CEM, Misura KMS, Baker D (2004) Protein structure prediction using Rosetta. Methods Enzymol 383:66–93. https://doi.org/10.1016/S0076-6879(04) 83004-0 110. Khan AR, Baker BM, Ghosh P et al (2000) The structure and stability of an HLA-A*0201/octameric tax peptide complex with an empty conserved peptide-N-terminal binding site. J Immunol 164:6398–6405. https://doi.org/10.4049/ jimmunol.164.12.6398 111. Canutescu AA, Dunbrack RL (2003) Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Sci 12:963–972. https://doi.org/10.1110/ps.0242703 112. Schmid N, Eichenberger AP, Choutko A et al (2011) Definition and testing of the GROMOS force-field versions 54A7 and 54B7. Eur Biophys J 40:843–856. https://doi. org/10.1007/s00249-011-0700-9 113. Ting D, Wang G, Shapovalov M et al (2010) Neighbor-dependent Ramachandran probability distributions of amino acids developed from a hierarchical Dirichlet process model. PLoS Comput Biol 6:e1000763. https://doi. org/10.1371/journal.pcbi.1000763 114. Word JM, Lovell SC, Richardson JS, Richardson DC (1999) Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J Mol Biol 285:1735–1747. https://doi.org/10.1006/ jmbi.1998.2401 115. 
Eswar N, Eramian D, Webb B et al (2008) Protein structure modeling with MODELLER. In: Biomolecular simulations. Humana Press, Totowa, NJ, pp 145–159 116. McRobb FM, Capuano B, Crosby IT et al (2010) Homology modeling and docking evaluation of aminergic G protein-coupled receptors. J Chem Inf Model 50:626–637. https://doi.org/10.1021/ci900444q 117. Case DA, Cheatham TE, Darden T et al (2005) The Amber biomolecular simulation programs. J Comput Chem 26:1668–1688. https://doi.org/10.1002/jcc.20290 118. Wang J, Cieplak P, Kollman PA (2000) How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J Comput Chem 21:1049–1074. https://doi.org/10.1002/1096-987X( 200009)21:12<1049::AID-JCC3>3.0. CO;2-F 119. London N, Movshovitz-Attias D, SchuelerFurman O (2010) The structural basis of 282 Marta A. S. Perez et al. peptide-protein binding strategies. Structure 18:188–199. https://doi.org/10.1016/j.str. 2009.11.012 120. Wang C, Bradley P, Baker D (2007) Proteinprotein docking with backbone flexibility. J Mol Biol 373:503–519. https://doi.org/10. 1016/j.jmb.2007.07.050 121. Rohl CA, Strauss CEM, Chivian D, Baker D (2004) Modeling structurally variable regions in homologous proteins with rosetta. Proteins 55:656–677. https://doi.org/10.1002/ prot.10629 122. Kuhlman B, Dantas G, Ireton GC et al (2003) Design of a novel globular protein fold with atomic-level accuracy. Science 302:1364–1368. https://doi.org/10.1126/ science.1089427 123. ABAGYAN R, Totrov M (1994) Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins. J Mol Biol 235:983–1002. https:// doi.org/10.1006/jmbi.1994.1052 124. Buslepp J, Zhao R, Donnini D et al (2001) T cell activity correlates with oligomeric peptide-major histocompatibility complex binding on T cell surface. J Biol Chem 276:47320–47328. https://doi.org/10. 1074/jbc.M109231200 125. 
Fodor J, Riley BT, Borg NA, Buckle AM (2018) Previously hidden dynamics at the TCR-peptide-MHC Interface revealed. J Immunol 200:4134–4145. https://doi.org/ 10.4049/jimmunol.1800315 126. Shen M-Y, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15:2507–2524. https://doi.org/10.1110/ps.062416606 127. Chys P, Chacón P (2013) Random coordinate descent with spinor-matrices and geometric filters for efficient loop closure. J Chem Theory Comput 9:1821–1829. https://doi. org/10.1021/ct300977f 128. Eastman P, Swails J, Chodera JD et al (2017) OpenMM 7: rapid development of high performance algorithms for molecular dynamics. PLoS Comput Biol 13:e1005659. https:// doi.org/10.1371/journal.pcbi.1005659 129. Antunes DA, Abella JR, Hall-Swan S et al (2020) HLA-arena: a customizable environment for the structural modeling and analysis of peptide-HLA complexes for cancer immunotherapy. JCO Clin Cancer Inform 4:623–636. https://doi.org/10.1200/CCI. 19.00123 130. Mackerell AD, Bashford D, Bellott M et al (1998) All-atom empirical potential for molecular modeling and dynamics studies of proteins. J Phys Chem B 102:3586–3616. https://doi.org/10.1021/jp973084f 131. Brooks BR, Brooks CL, Mackerell AD et al (2009) CHARMM: the biomolecular simulation program. J Comput Chem 30:1545–1614. https://doi.org/10.1002/ jcc.21287 132. Lee MS, Salsbury FR, Brooks CL (2002) Novel generalized born methods. J Chem Phys 116:10606. https://doi.org/10.1063/ 1.1480013 133. Lee MS, Feig M, Salsbury FR, Brooks CL (2003) New analytic approximation to the standard molecular volume definition and its application to generalized born calculations. J Comput Chem 24:1348–1356. https://doi. org/10.1002/jcc.10272 134. Desmet J, Wilson IA, Joniau M et al (1997) Computation of the binding of fully flexible peptides to proteins with flexible side chains. FASEB J 11:164–172. https://doi.org/10. 
Chapter 14

Molecular Simulation of Stapled Peptides

Victor Ovchinnikov, Aravinda Munasinghe, and Martin Karplus

Abstract

Constrained peptides represent a relatively new class of biologic therapeutics, which have the potential to overcome several limitations of small-molecule drugs and of designed antibodies. Because of their modest size, the rational design of such peptides is becoming increasingly amenable to computer simulation; multi-microsecond molecular dynamics (MD) simulations are now routinely possible on consumer-grade graphics processing units (GPUs). Here, we describe the procedures for performing and analyzing MD simulations of hydrocarbon-stapled peptides using the CHARMM energy function, in isolation and in complex with a binding partner, to investigate their conformational properties and to compute changes in their binding affinity upon mutation.

Key words Molecular dynamics, Peptide design, MMGBSA, Binding free energy, MDM2, p53

1 Introduction

Peptide therapeutics continue to attract pharmaceutical interest as important alternatives to small-molecule and protein-based treatments [1]. Peptide-based drugs generally have favorable pharmacokinetic profiles and can achieve high specificity and selectivity via optimization of their amino acid sequence. The main obstacles to the use of peptides as therapeutics are (1) a potentially high sequence-dependent propensity for aggregation, (2) often lower cellular penetration compared to small-molecule therapeutics, (3) immunogenicity, especially for longer sequences, (4) susceptibility to cellular proteases, and (5) conformational heterogeneity due to the presence of several rotatable bonds per residue. Obstacles (1)–(3) can be partially overcome by optimizing peptide hydrophobicity, solubility, and amino acid sequence length. However, overcoming obstacles (4)–(5) generally requires chemical modification. One possibility, which is considered here, is hydrocarbon-based (or other, e.g.
urea-based [2]) “stapling,” i.e., the introduction of a cross-link between different parts of the peptide [3]. Stapling reduces the number of thermally accessible peptide conformations, which decreases the configurational entropic penalty of binding and therefore improves the binding free energy, provided that the designed conformation has a high affinity for the target. Furthermore, the stapling can impart resistance to proteases if the staple sterically hinders protease access or if the stapling reduces the population of conformations that are susceptible to proteolysis. In view of the relatively small size and low conformational heterogeneity of most stapled peptides and of the ongoing increases in the speed of molecular dynamics (MD) simulation, meaningful computational analyses of stapled peptides in isolation and in complex with a protein target are now possible using general-purpose computing hardware (rather than supercomputing clusters) and software that is freely available for nonprofit use [4–7]. Consequently, the present protocol, which describes a step-by-step method to simulate hydrocarbon-stapled peptides using MD simulation and to analyze the simulation data, appears to us especially timely. As examples, we consider two stapled peptide systems: (1) stapled deca-alanine in solvent and (2) stapled inhibitor peptides bound to the oncoprotein MDM2.

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_14, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

2 Materials

In this section, we list the computer hardware and software used to perform and analyze the simulations described here (Table 1). All of the software is available freely to academic or nonprofit users.
(A freely available version of the program CHARMM [5, 8] with a reduced set of features is sufficient for this protocol.) For-profit users are encouraged to contact the relevant software publishers to investigate available licensing options. We note that some of the software listed here can be replaced with other software with similar functionality. For example, some features of CHARMM used to prepare simulation files can be replaced using the psfgen tool distributed with the program VMD [6], which is widely used by the NAMD2 [14] community. However, input scripts to psfgen are not provided with this protocol. We also note that the simulations described here use the CHARMM36 energy function [15, 16]. Although the protocol could be modified to use other energy functions or MD simulation software (e.g., Amber [17], OPLS [18], and GROMOS [19]), the modifications are extensive and are not described here, as our focus is on functionality, not on generality. Users interested in alternative energy functions are encouraged to read the relevant computational literature on stapled peptide simulations [2, 20, 21].

In addition to listing the software required to perform the simulations and analyses in this protocol, we list all of the input scripts provided with this protocol in a supplement. These scripts could be useful starting points for customizing the simulation and analysis procedures and for applying the methods to different proteins and/or peptides.

Table 1 Computational resources used in this protocol

Computer hardware
AMD Ryzen 3900X CPU
Nvidia GTX 2080 Ti GPU
32 GB DDR4-3200 RAM

| Specialized software | Version | Used for | Availability |
|---|---|---|---|
| CHARMM | 44 | Preparation, MD simulation, and analysis | https://charmm.chemistry.harvard.edu [8] |
| CHARMM topology, parameters | 36 | MD simulation and energy analysis | https://mackerell.umaryland.edu/charmm_ff.shtml [9] |
| VMD | 1.9.3 | Analysis and visualization | www.ks.uiuc.edu/Research/vmd/ [10] |
| OpenMM | a770038f125588ab0cced8cd67f04e5083f24e0d (a) | MD simulation and analysis | http://openmm.org [11] |
| MDTraj | 19bf371eb4ab2792ff88c2ad80272e4e54327c82 (a) | Analysis | http://mdtraj.org [12] |

| General-purpose software | Version | Used for | Availability |
|---|---|---|---|
| GNU Compiler suite | 8 | Compiling CHARMM and OpenMM | Arch Linux distribution |
| Python | 3.8 | MD simulation and analysis | Arch Linux distribution |
| Cuda Toolkit | 11 | MD simulations and analysis | Arch Linux distribution |
| GNU Octave [13] | 6.1 | Data plotting and analysis | Arch Linux distribution |
| Arch Linux rolling release | Kernel 5.8.7 | Operating system | Arch Linux distribution |
| Bash shell | 5.1.0 | Driver/shell scripts | Arch Linux distribution |

| Essential scripts: Name | Purpose | Software | Location (relative to root) |
|---|---|---|---|
| buildvac.inp | Prepare structure in vacuum | CHARMM | util/charmm |
| solvate.inp | Add explicit water to structure | CHARMM | util/charmm |
| watercube.str | Subroutine for solvate.inp | CHARMM | util/charmm |
| mdcube.inp | MD in explicit solvent via CHARMM/OpenMM | CHARMM | util/charmm |
| mdene.inp | Compute energy from MD trajectory | CHARMM | util/charmm |
| radius_gbsw.str | Subroutine for mdene.inp | CHARMM | util/charmm |
| read-smallbox.str | Subroutine for solvate.inp | CHARMM | util/toppar |
| staples-toppar.str | Energy function parameters for staples | CHARMM and OpenMM | util/toppar |
| toppar-waterions.str | Energy function parameters for water | CHARMM and OpenMM | util/toppar |
| align.vmd | Compute RMSD (b) from trajectory | VMD | util/vmd |
| rmwat.vmd | Remove explicit water from trajectory | VMD | util/vmd |
| rgyr.vmd | Compute Rgyr (c) from trajectory | VMD | util/vmd |
| mdvac.py | MD in vacuum or implicit solvent | Python/OpenMM | util/python |
| mdcube.py | MD in explicit solvent | Python/OpenMM | util/python |
| mdene.py | Energy analysis using OpenMM/MDTraj | Python | util/python |
| mdsasa.py | SASA (d) analysis using MDTraj | Python | util/python |
| mddssp.py | Secondary structure analysis using MDTraj | Python | util/python |
| show.m | Plot RMSD or Rgyr | Octave | ala10 |
| dssp.m | Plot peptide helicity | Octave | ala10 |
| dgnp.m, dgnps.m (e) | Nonpolar solvation energy difference | Octave | 3V3B |
| dgobc2.m, dgobc2s.m (e) | Solvation energy difference | Octave | 3V3B |
| dggbsw.m, dggbsws.m (e) | Solvation energy difference | Octave | 3V3B |
| dgnp.m | Nonpolar solvation energy difference | Octave | 3V3B |
| dgobc2.m | Solvation energy difference | Octave | 3V3B |
| dggbsw.m | Solvation energy difference | Octave | 3V3B |
| acorr.m | Compute sample correlation length | Octave | util/matlab |
| getchain | Extract a single chain from a PDB file | Bash shell | util |
| addseg | Add a segment name to a PDB file | Bash shell | util |
| toppar.str | Read energy function parameters | CHARMM | util |
| rst2xsc | Convert box size from CHARMM rst file to xsc | bash | util |
| prepare | Prepare simulation files | bash | ala10, 3V3B/complexes, 3V3B/peptides |
| run | Perform MD simulations | bash | ala10, 3V3B/complexes, 3V3B/peptides |
| post | Analyze MD simulations | bash | ala10, 3V3B/complexes, 3V3B/peptides |

(a) Git commit ID
(b) Root-mean-square deviation
(c) Radius of gyration
(d) Solvent-accessible surface area
(e) Modified for single-trajectory analysis
A compressed archive containing the input and analysis scripts, along with partial analysis data, is provided as Supporting Material accompanying this chapter.

3 Methods

In the following, we discuss two examples of stapled peptide simulations: (1) a stapled deca-alanine (Ala10) and (2) a complex between the oncoprotein MDM2 and stapled peptides derived from the p53 protein, taken from the PDB entry 3V3B [22]. The two structures are shown in Fig. 1. These examples should serve as starting points for simulating stapled peptides in more complex environments, such as within lipid membranes [23].

Fig. 1 Stapled peptide systems simulated in this protocol: (a) Ala10 with the i, i+4 staple [3]; (b) Ala10 with the i, i+7 staple; (c)–(e) MDM2 protein in complex with three peptides; the peptide backbones are drawn as red ribbons; to show detail, peptide residues are also drawn in stick representations (hydrogens are omitted); and the MDM2 protein in (c)–(e) is shown as a green ribbon. Several residues that are at the interface with MDM2 are indicated. Peptide chemical structures are given in Tables 2 and 3

3.1 Preliminaries

We assume that the user has access to a Linux computer with a modern graphics processing unit (GPU) capable of performing computations via CUDA or OpenCL and that the necessary software packages have been installed (see Table 1). For installation procedures, users should visit the web pages listed in the table. We further assume that the user has basic familiarity with the general principles of molecular dynamics (MD) simulations, which will not be discussed in this protocol. The user is referred to textbooks on the subject [24–26]. Each example below consists of a preparation stage, an MD simulation stage, and a post-processing stage. Each of the stages is itself composed of steps that involve running the software as described below.
For the user’s convenience, most of the steps are organized into three shell scripts, “prepare,” “run,” and “post,” which can be run from the command prompt. However, it is recommended that the user examine the contents of all the scripts to understand the details of the procedures. In the description of the steps below, we provide the line number and name of the shell script associated with the step in the format #[line_number], [script_name].

Table 2 Stapled Ala10 peptides. Ac and NMet denote acetylation and N-methylation tags, respectively. Columns: Peptide (i, i+4 and i, i+7); Sequence & Staple Structure [structure drawings not reproduced]

3.2 Deca-alanine (Ala10)

In this example, we set up and perform MD simulations of deca-alanine cross-linked using two different hydrocarbon staples, i, i+4 and i, i+7 [3], shown in Table 2. This is a simple example designed to illustrate the basics of system preparation and simulation.

1. Decide on the peptide sequence. In this example, we start with Ala10 and modify two residues in the sequence to introduce a staple. Following the nomenclature by Verdine and Hilinski [3], we have prepared a special CHARMM topology and parameter file (see Note 1) to describe two nonstandard amino acid residues that correspond to the legs of the staple, R8 and S5, where the letter describes the chirality at the α carbon of the residue, and the digit is the number of carbons in the hydrocarbon chain, excluding the α carbon. The corresponding sequences are Ala3-S5-Ala3-S5-Ala2 and Ala-R8-Ala6-S5-Ala.
2. Generate peptide structure files in vacuum. The sequences from the previous step are used as input to the CHARMM script buildvac.inp (#24, ala10/prepare) along with a flag (“qstaple”) that determines whether stapling is to be performed. If qstaple = 1, the CHARMM script will connect the staple legs. The script is set up for the two types of staples considered here.
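The compact sequence notation used in steps 1 and 2 can be checked with a short script. The following Python sketch is independent of the chapter's scripts; the function names and the set STAPLE_RESIDUES are hypothetical and introduced only for illustration. It expands each sequence into a residue list and reports where the staple legs sit, so that the i, i+4 and i, i+7 spacings can be confirmed before building the structures.

```python
import re

# Hypothetical helper set: the two nonstandard staple residues named in the text.
STAPLE_RESIDUES = {"R8", "S5"}


def expand_sequence(compact):
    """Expand notation such as 'Ala3-S5-Ala3-S5-Ala2' into a list of residue names."""
    residues = []
    for token in compact.split("-"):
        # For staple residues the digit is a chain length, not a repeat count.
        if token in STAPLE_RESIDUES:
            residues.append(token)
            continue
        match = re.fullmatch(r"([A-Za-z]+)(\d*)", token)
        if match is None:
            raise ValueError(f"Unrecognized token: {token!r}")
        name, count = match.groups()
        residues.extend([name.upper()] * (int(count) if count else 1))
    return residues


def staple_positions(residues):
    """Return the 1-based positions of the staple residues."""
    return [i + 1 for i, res in enumerate(residues) if res in STAPLE_RESIDUES]


for seq in ("Ala3-S5-Ala3-S5-Ala2", "Ala-R8-Ala6-S5-Ala"):
    res = expand_sequence(seq)
    first, second = staple_positions(res)
    print(seq, "->", len(res), "residues; staple at i, i+%d" % (second - first))
```

The digit in R8 and S5 is deliberately not treated as a repeat count, which is why the staple residues are matched before the generic residue pattern.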
In this example, the deca-alanine coordinates are generated in α-helical geometry by setting the backbone dihedral angles (ϕ, ψ) to (−57°, −47°). The peptide N- and C-termini are acetylated and methylated, respectively.
3. If a simulation in explicit solvent is desired, immerse the structure in a cubic box of explicit TIP3 water, using the CHARMM script solvate.inp (#26, ala10/prepare). At this point, the structures are ready for MD simulation (see Note 2).
4. Decide whether to perform an MD simulation in implicit or explicit solvent. For the user’s convenience, we include two examples of explicit solvent simulation at constant pressure and temperature (NPT ensemble) [24]; one uses the CHARMM/OpenMM interface via the CHARMM script “mdcube.inp” (#28, ala10/run), and the other uses the Python interface to OpenMM via “mdcube.py” (#33, ala10/run). The two MD simulations use slightly different simulation parameters, and the Python/OpenMM simulation uses hydrogen mass repartitioning, which allows the use of a 4 fs time step; this is also possible with CHARMM/OpenMM but is not illustrated here (see Notes 3 and 4). An implicit solvent MD simulation is illustrated in the Python script “mdvac.py,” which uses the OBC2 Generalized Born implicit solvent model [27]. In this example, the MD simulations are performed for 200 ns (CHARMM/OpenMM) or 400 ns (Python/OpenMM). For research purposes, multi-microsecond simulations are typically used [2], although simulation convergence depends on the biological system under study. An example of calculating statistical errors from simulation data is given as part of the MDM2/peptide test case.
5. After the MD simulation is complete, the recorded coordinates from the simulation trajectory are analyzed to compute various properties of interest.
Since the present protocol concerns the simulations of stapled peptides, the analysis here quantifies the conformational differences between stapled and unstapled versions of the two peptides. If one is not interested in the properties of solvent around the peptide, explicit solvent can be removed from the simulation trajectories. This is accomplished by the VMD script “rmwat.vmd” (#40, #59, ala10/post). To quantify the conformational differences between the peptides, we compute the root-mean-square deviation (RMSD) of the peptide structure from the starting α-helical configuration using “align.vmd” (#28, #49, #67, ala10/post), the radius of gyration (Rgyr) using “rgyr.vmd” (#28, #49, #67, ala10/post), and perform secondary structure analysis using the software MDTraj via the Python script “mddssp.py” (#35, #54, #73, ala10/post) to quantify peptide helicity.

Table 3 Stapled peptides bound to the MDM2 oncoprotein simulated in this study. Inhibition constants Ki are taken from Chang et al. [30]: † corresponds to peptide ATSP1800, ‡ corresponds to peptide ATSP3900, and ∗ corresponds to peptide ATSP7342
Peptide | Ki (nM)
1 | 25.9†
2 | 1.0‡
3 | 536∗
[Sequence & Staple Structure column: structure drawings not reproduced]

6. The analysis results can be plotted using the GNU Octave scripts “ala10/show.m” (RMSD and Rgyr) and “ala10/dssp.m” (secondary structure). We note that the Octave scripts are also compatible with Matlab [28]. The results are shown in Fig. 2, which shows that stapling generally decreases the RMSD to the helical structure (Fig. 2a), the peptide radius of gyration (Fig. 2b), and the probability of extended coil conformation (Fig. 2c). These results are expected in view of the constraints provided by the staples.
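As a rough illustration of two of the quantities plotted in Fig. 2, the sketch below computes a mass-weighted radius of gyration for a single frame and a percent-coil value from per-residue secondary-structure codes. It is plain Python and is not the protocol's rgyr.vmd or mddssp.py; the function names and toy inputs are hypothetical. It assumes that secondary structure is available as one string per frame with "H" marking helical residues, and it takes percent coil as 100 minus percent helix, following the definition in the caption of Fig. 2.

```python
import math


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration of one frame.

    coords: list of (x, y, z) tuples; masses: list of atomic masses.
    """
    total_mass = sum(masses)
    # Center of mass, one Cartesian component at a time.
    com = [
        sum(m * xyz[k] for m, xyz in zip(masses, coords)) / total_mass
        for k in range(3)
    ]
    # Mass-weighted mean squared distance from the center of mass.
    msd = sum(
        m * sum((xyz[k] - com[k]) ** 2 for k in range(3))
        for m, xyz in zip(masses, coords)
    ) / total_mass
    return math.sqrt(msd)


def percent_coil(dssp_frames, helix_code="H"):
    """Percent coil over all frames, taken as 100 minus percent helix."""
    n_total = sum(len(frame) for frame in dssp_frames)
    if n_total == 0:
        raise ValueError("No secondary-structure assignments given")
    n_helix = sum(frame.count(helix_code) for frame in dssp_frames)
    return 100.0 - 100.0 * n_helix / n_total


# Toy inputs (hypothetical): two equal-mass atoms 2 Å apart, and two frames
# of per-residue codes for a 10-residue peptide.
print(radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [12.0, 12.0]))
print(percent_coil(["CHHHHHHHHC", "CCCCHHHHCC"]))
```

In practice the per-frame codes would come from a DSSP implementation such as the one in MDTraj; how the capping groups and staple residues are counted is a choice left to the user.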
We note that the conformational ensemble of deca-alanine in solution is generally not α-helical, but composed of partially unstructured coils [29].

Fig. 2 Comparison of simulation statistics between stapled and unstapled peptides to show conformational differences: (a) RMSD from the starting (α-helical) conformation; (b) radius of gyration; and (c) percent coil (defined as 100 minus percent helix, as computed by the DSSP algorithm [7]); the colored bars correspond to simulations with connected staple legs, and the transparent bars correspond to peptides with unstapled legs. Error bars represent one standard deviation; they are indicated in only one direction for clarity. PyOMM corresponds to the Python/OpenMM interface, and ChOMM corresponds to the CHARMM/OpenMM interface

3.3 MDM2/Peptide Complex

In this example, we set up and perform MD simulations of three peptides that bind the oncoprotein MDM2 [30]. The peptides were designed to competitively disrupt the p53/MDM2 protein–protein interaction; they use a partial sequence of p53 [22]. The stapled peptide sequences are given in Table 3. Using MD simulation trajectories, we compute approximate binding affinity changes upon mutating peptide 1 to peptides 2 and 3 [31] and compare the results with experimental data [30]. Many details of the preparation procedure are the same as in the previous case, and the descriptions here are therefore shortened. An important difference is the availability of a high-resolution crystal structure for one of the complexes [22], which is the basis for all simulations discussed below.

1. Download the structure 3V3B from the Protein Data Bank (PDB) (#12, 3V3B/complexes/prepare). The protein MDM2 (chain A) has one missing N-terminal residue, and the peptide (chain D) has three missing N-terminal residues. The missing residues appear not to be involved in the stability of the complex and are therefore omitted from the simulation (see Note 5).
2. Ensure that the staple atoms in the PDB structure are named consistently with the topology file “util/toppar/staples-toppar.str,” and store the stapled peptide coordinates in a new file SAH7.pdb (#22–#27, 3V3B/complexes/prepare).
3. Extract chain A (MDM2) from the PDB file, and set the segment name to MDM2 (#44, 3V3B/complexes/prepare).
4. Specify the desired peptide sequences to test for binding (#50, 3V3B/complexes/prepare). The sequences here are mutants of the original PDB sequence; missing coordinates for mismatched residues will be generated by CHARMM.
5. For each mutant peptide sequence, generate simulation structure files that can be used for MD simulations in vacuum (#74, 3V3B/complexes/prepare).
6. If MD simulations in explicit water are desired, add solvent to the structure (#75, 3V3B/complexes/prepare). Our experience indicates that simulations in explicit solvent generally give superior results to those in implicit solvent, in particular as regards protein structure stability.
7. For each peptide sequence, perform an MD simulation in implicit or explicit solvent (#32, #37, #42, 3V3B/complexes/run). For the purpose of this protocol, as in the case of Ala10, we performed 400 ns-long simulations in explicit solvent; however, longer simulations are strongly recommended to improve statistical sampling (discussed below).
8. Repeat the above steps to simulate the stapled peptides in isolation from MDM2; these substantially identical steps are performed in the shell scripts “3V3B/peptides/prepare” and “3V3B/peptides/run.”
9. If needed, repeat the above steps to simulate the MDM2 protein in isolation from the peptides (3V3B/protein/prepare and 3V3B/protein/run). This step is not required if only the changes in binding affinity upon mutation are desired, because the protein terms cancel (see Eqs. 1 and 2 below).
10. Free energy differences due to mutations will be computed using the Molecular Mechanics Generalized Born Surface Area (MMGBSA) approach [31]. In the MMGBSA analysis, the free energy of protein/peptide binding is approximated using the equation

\[
\Delta G_{\mathrm{bind}} = \bar{E}^{\mathrm{vdW}}_{\mathrm{cmplx}} + \bar{E}^{\mathrm{elec}}_{\mathrm{cmplx}} + \bar{E}^{\mathrm{GBorn}}_{\mathrm{cmplx}} + \gamma\,\overline{SA}_{\mathrm{cmplx}}
- \left( \bar{E}^{\mathrm{vdW}}_{\mathrm{prot}} + \bar{E}^{\mathrm{elec}}_{\mathrm{prot}} + \bar{E}^{\mathrm{GBorn}}_{\mathrm{prot}} + \gamma\,\overline{SA}_{\mathrm{prot}} \right)
- \left( \bar{E}^{\mathrm{vdW}}_{\mathrm{pept}} + \bar{E}^{\mathrm{elec}}_{\mathrm{pept}} + \bar{E}^{\mathrm{GBorn}}_{\mathrm{pept}} + \gamma\,\overline{SA}_{\mathrm{pept}} \right). \tag{1}
\]

ΔG_bind is the difference between the values of the energy components of the protein and peptide in complex and those of the separated protein and peptide. The components are nonbonded (van der Waals and electrostatic) interaction energies and the polar and nonpolar solvation energies, represented by $\bar{E}^{\mathrm{GBorn}}$ and $\gamma\,\overline{SA}$, respectively (the overbar denotes trajectory averaging, and $\overline{SA}$ is the averaged solvent-accessible surface area [SASA]). The nonpolar solvation energy calculation was performed using the standard water probe radius of 1.4 Å to compute $\overline{SA}$, and γ was set to 0.00542 kcal/mol/Å² [32]. The contribution from the protein and peptide configurational and rototranslational entropy changes is neglected in Eq. 1, because (i) it is difficult to compute accurately and precisely [33] and (ii) its differences are expected to be small for mutations that do not perturb the structures significantly; such mutations are considered here. If desired, the user can include entropy differences using harmonic or quasiharmonic analysis [34]. Binding free energy differences of sequence mutation 1 → i were computed as

\[
\Delta\Delta G_{i} = \Delta G^{i}_{\mathrm{bind}} - \Delta G^{1}_{\mathrm{bind}}. \tag{2}
\]

Finally, we note that the energy terms corresponding to the protein and peptide in separation from each other can be computed from the trajectory of the bound complex, by alternately deleting the peptide or protein from the trajectory, respectively, and repeating trajectory analysis.
This variant of the method is called the “single-trajectory” method; it is theoretically incorrect because the conformations of the separated protein and peptide are drawn from incorrect thermodynamic ensembles. However, the method is often used because it is less computationally demanding, since separate MD simulations of system components are not needed [2]. Both the single- and the multi-trajectory methods are illustrated here.
11. Explicit solvent should be removed from MD trajectories prior to solvation energy analysis (#44, 3V3B/complexes/post). MD simulation trajectories of the protein/peptide complexes in explicit solvent may have periodic wrapping artifacts, whereby the peptide or protein alternately appears on opposing sides of the periodic box. These artifacts need to be corrected prior to solvation energy analysis. This step is performed using the “pbctools” package in VMD (#46–64, 3V3B/complexes/post). Simulations of the protein or a peptide in isolation do not require this step (see Note 6).
12. Compute the nonbonded interaction energies and the polar solvation energies from the different trajectories using an appropriate solvation model. In this protocol, we provide two alternatives: the OBC2 solvation model [27], available from OpenMM through either the CHARMM or the Python interface (“mdene.py,” #69, 3V3B/complexes/post), or the GBSW [35] solvation model in CHARMM/OpenMM (“mdene.inp,” #72, 3V3B/complexes/post). Other Generalized Born (GB) models are available in CHARMM, which could potentially yield more accurate results. However, they do not currently have GPU acceleration. The user is encouraged to consult the CHARMM documentation for details of their use (see Note 7).
13. Compute the SASA from each trajectory from which explicit solvent has been removed using MDTraj software via the Python script “mdsasa.py” (#75, 3V3B/complexes/post).
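The bookkeeping behind Eqs. 1 and 2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the chapter's mdene.py or Octave scripts: the function names and dictionary keys are hypothetical, and the inputs are assumed to be per-frame energies (kcal/mol) and SASA values (Å²) that have already been extracted from solvent-free trajectories. Only the value of γ is taken from the text.

```python
from statistics import fmean

GAMMA = 0.00542  # kcal/mol/Å^2, the coefficient quoted in the text


def mean_energy(e_vdw, e_elec, e_gb, sasa, gamma=GAMMA):
    """Trajectory average of E_vdW + E_elec + E_GBorn + gamma * SASA for one species."""
    return fmean(e_vdw) + fmean(e_elec) + fmean(e_gb) + gamma * fmean(sasa)


def delta_g_bind(cmplx, prot, pept, gamma=GAMMA):
    """Eq. 1: complex minus separated protein minus separated peptide.

    Each argument is a dict with keys 'e_vdw', 'e_elec', 'e_gb', 'sasa'
    holding per-frame values (hypothetical layout).
    """
    return (
        mean_energy(gamma=gamma, **cmplx)
        - mean_energy(gamma=gamma, **prot)
        - mean_energy(gamma=gamma, **pept)
    )


def delta_delta_g(dg_mutant, dg_reference):
    """Eq. 2: binding free energy change for the mutation reference -> mutant."""
    return dg_mutant - dg_reference
```

For the multi-trajectory variant, the three dictionaries would be filled from separate simulations of the complex, the protein, and the peptide; for the single-trajectory variant, all three would be derived from the complex trajectory by deleting one partner at a time. Entropy terms are omitted, as in Eq. 1.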
The SASA differences are used to provide the nonpolar contribution to the solvation energy in Eq. 1. The Octave script “mkdgnp.m” computes the nonpolar solvation energy differences.
14. Compute the averages in Eq. 1 and the standard error of the mean (SEM). The calculation of SEM is complicated by the fact that the trajectory snapshots are correlated, which requires additional analysis to estimate the number of uncorrelated samples in the time series. In this protocol, we explicitly set the size of correlated trajectory blocks to twice the correlation time scale [36],

\[
t_{\mathrm{block}} = 2 \int_{t_0 = 0}^{t_1 \to \infty} C(t)\, dt, \tag{3}
\]

where C(t) is the autocorrelation function of the nonbonded energies, computed using the Fourier transform in Octave via the script acorr.m (this script is called automatically from the parent scripts mkdgobc2.m and mkdggbsw.m, depending on which GB model is used for analysis). In Eq. 3, t1 is taken to be the time at which the autocorrelation falls below 0.01 (see Note 8). Typical correlation times computed in this way were around 10 ns, corresponding to about 40 uncorrelated samples in a 400 ns trajectory. This relatively small number leads to uncertainties in ΔΔG that exceed 2 kcal/mol (Fig. 3), making clear the necessity of long MD simulations to reduce statistical errors.
15. The ΔΔG results are compared to experimental values in Fig. 3, which is created by the Octave scripts mkdgobc2.m and mkdggbsw.m. The experimental binding free energies were approximated from inhibition constants (Ki) reported by Chang et al. [30]. The second mutant peptide has a cyclobutane amino acid (Cba) at position 26, whereas our sequence has leucine. However, other mutants in the experimental dataset of Chang et al. [30] suggest that the binding free energy difference between having these two amino acids at this position is small compared with the differences observed in this protocol. Both GB models, OBC2 [27] and GBSW [35], gave results that are qualitatively consistent with the experimental values, but OBC2 produced values that are in better agreement with experiment (Fig. 3). This difference underscores the importance of trying several GB models to gain more confidence in the simulation results. Furthermore, because of the large statistical uncertainties, the values reported here should be considered qualitative illustrations of the protocol. Much longer simulations would be needed to obtain quantitative results.

Fig. 3 Binding free energy differences of peptide mutations: (a) OBC2 model, separate MD trajectories; (b) GBSW model, separate MD trajectories; (c) OBC2 model, single MD trajectory; and (d) GBSW model, single MD trajectory. Note that the energy scale is different for the subplots. Experimental values are taken from Chang et al. [30] (see Table 3). The results were obtained from the same explicit solvent simulations performed using the Python interface to OpenMM; the differences are in the choice of GB model used to compute free energies and whether the single or multiple trajectory method was used (see text). [Plot axes: ΔΔG (kcal/mol) for ΔΔG1→2 and ΔΔG1→3; legend: Experiment; MD, ΔΔG from OBC2 or GBSW]

Fig. 4 Conformational properties of MDM2/peptide simulations in explicit water: (a) RMSD of MDM2/peptide complexes; (b) RMSD of peptides without MDM2; (c) percent coil for peptides without MDM2; structure of peptide #1 (d) at t = 0 ns; and (e) at t = 390 ns

16. If desired, conformational properties of the simulation structures can be computed, as done for deca-alanine (e.g.
RMSD and secondary structure analysis, #51 and #62 in 3V3B/peptides/post, respectively). For example, in Fig. 4a, we show that the MDM2/peptide trajectories are stable for the duration of the simulations. The peptides in solvent partially unwind at the termini (e.g., Fig. 4e), which explains the higher RMSD shown in Fig. 4b. However, they maintain greater helicity than the deca-alanine stapled peptides (Fig. 4c vs. Fig. 2c).

4 Notes

1. The parameter files used here were prepared manually by analogy with existing CHARMM parameters for lysine, 1- and 2-butenes, and propene. An alternative (automatic) method for CHARMM-compatible parameter generation involves submitting the desired chemical structure to the CGENFF [37] website, www.paramchem.org (accessed December 25, 2020). More sophisticated parametrization approaches, involving fitting to quantum mechanical calculations, may be needed for chemical modifications other than hydrocarbon staples or to improve parameter accuracy [38]. We also note that other energy functions and simulation software can be used for MD simulations [17–19] and for stapled peptides, in particular [2].
2. In this protocol, for simplicity, we do not add ions to the explicit solvent, as is usually done to ensure that the simulation system is electrostatically neutral. In this case, the long-range electrostatic solvers used by MD programs (e.g., particle mesh Ewald) impose “tin-foil” boundary conditions (see, e.g., Ref. 39 for a discussion of electrostatics in free energy simulations). Although charge neutralization is a common practice, its omission is not expected to influence the present results significantly because computation of interaction energies is performed using an implicit solvation model without long-range electrostatic forces.
3. Using 4 fs integration steps in CHARMM is possible after explicit hydrogen mass repartitioning (HMR), i.e.
the mass of each hydrogen atom is increased to at most 4 a.m.u., and the mass of the corresponding parent heavy atom is decreased to preserve the total mass (see scalar.doc in the CHARMM documentation). Note that HMR requires rigid bonds (e.g., using the SHAKE method [40]).
4. Other simulation parameters were as follows. The cutoff for van der Waals (vdW) and near-space electrostatic calculations was 10 Å. The vdW interactions were smoothly attenuated to zero in the range of 8.5–10 Å using the CHARMM VSWITCH function (CHARMM/OpenMM) or OpenMM switching function (Python/OpenMM). Long-range electrostatics were treated using fourth-order PME with default OpenMM parameters. The Langevin dynamics integrator was used to maintain the temperature at 298 K using a friction constant of 0.1/ps, and the Monte Carlo barostat from OpenMM was used to maintain the pressure at 1 atm. All bonds involving hydrogen were treated as rigid.
5. In structures with missing internal (or otherwise important) residues, the user may need to use modeling software such as Modeller [41] or Rosetta [42] as an additional preparation step.
6. In the MD simulations performed here, the connectivity of each protein chain is preserved when coordinates are wrapped across boundaries, which implies that only inter-protein (but not intra-protein) distances can be affected by wrapping. Thus, in simulations of single amino acid chains (e.g., protein or peptide in solvent), wrapping does not change the energy computed from the GB analysis.
7. Generalized Born (GB) models are developed to provide fast pairwise approximations to the solution of the Poisson–Boltzmann (PB) equation. Some users may prefer to use a Poisson–Boltzmann solver [5, 43–45] in place of GB.
However, just as there are important tunable parameters that enter into GB models (such as GB radii), PB solvers also have parameters, such as the type of solver used, resolution, atomic radii, dielectric constant, or Stern layer thickness [5, 43, 45]. For this protocol, we set the nonbonded cutoff distance for the GB energy analysis to 20 Å. However, larger cutoffs may be desirable (which will increase the computational cost of the analysis).
8. Other methods of correlated trajectory analysis exist, such as block bootstrap [46], or checking for the correct asymptotic behavior of the standard error of the mean (SEM) as a function of the number of independent samples, e.g., finding a range of block sizes for which the SEM decreases as the reciprocal square root of the number of blocks [25].

References

1. Fosgerau K, Hoffmann T (2015) Peptide therapeutics: current status and future directions. Drug Discov Today 20(1):122–128. https://doi.org/10.1016/j.drudis.2014.10.003
2. Cornillie S, Bruno B, Lim C, Cheatham T (2018) Computational modeling of stapled peptides toward a treatment strategy for CML and broader implications in the design of lengthy peptide therapeutics. J Phys Chem B 122(14):3864–3875. https://doi.org/10.1021/acs.jpcb.8b01014
3. Verdine GL, Hilinski GJ (2012) Stapled peptides for intracellular drug targets, vol 503, 1st edn. Elsevier Inc., Amsterdam. https://doi.org/10.1016/B978-0-12-396962-0.00001-X
4. Friedrichs M, Eastman P, Vaidyanathan V, Houston M, Legrand S, Beberg A, Ensign D, Bruns C, Pande V (2009) Accelerating molecular dynamic simulation on graphics processing units. J Comput Chem 30:864–872
5. Brooks B, Brooks III C, Mackerell Jr A, Nilsson L, Petrella R, Roux B, Won Y, Archontis G, Bartels C, Boresch S, et al (2009) CHARMM: the biomolecular simulation program. J Comput Chem 30:1545–1614, PMC2810661
6. Humphrey W, Dalke A, Schulten K (1996) VMD - visual molecular dynamics. J Mol Graphics 14:33–38
7. McGibbon R, Beauchamp K, Harrigan M, Klein C, Swails J, Hernández C, Schwantes C, Wang L, Lane T, Pande V (2015) MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys J 109(8):1528–1532. https://doi.org/10.1016/j.bpj.2015.08.015
8. (2020) CHARMM development project. https://charmm.chemistry.harvard.edu. Accessed 25 Dec 2020
9. (2020) CHARMM force field. https://mackerell.umaryland.edu/charmm_ff.shtml. Accessed 25 Dec 2020
10. (2020) Visual molecular dynamics. https://www.ks.uiuc.edu/Research/vmd/. Accessed 25 Dec 2020
11. (2020) OpenMM. http://openmm.org. Accessed 25 Dec 2020
12. (2020) MDTraj. http://mdtraj.org. Accessed 25 Dec 2020
13. Eaton JW, Bateman D, Hauberg S, Wehbring R (2015) GNU octave version 4.0.0 manual: a high-level interactive language for numerical computations. http://www.gnu.org/software/octave/doc/interpreter
14. Phillips J, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C, Skeel R, Kale L, Schulten K (2005) Scalable molecular dynamics with NAMD. J Comput Chem 26:1781–1802
15. Best R, Zhu X, Shim J, Lopes P, Mittal J, Feig M, MacKerell Jr A (2012) Optimization of the additive CHARMM all-atom protein force field targeting improved sampling of the backbone ϕ, ψ and side-chain χ1 and χ2 dihedral angles. J Chem Theor Comput 8:3257–3273
16. Guvench O, Mallajosyula S, Raman E, Hatcher E, Vanommeslaeghe K, Foster T, Jamison F, Mackerell A (2011) CHARMM additive all-atom force field for carbohydrate derivatives and its utility in polysaccharide and carbohydrate-protein modeling. J Chem Theor Comput 7(10):3162–3180. https://doi.org/10.1021/ct200328p
17. Pearlman DA, Case DA, Caldwell JW, Ross WS, Cheatham TE, DeBolt S, Ferguson D, Seibel G, Kollman P (1995) Amber, a package of computer programs for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to simulate the structural and energetic properties of molecules.
Comput Phys Commun 91(1):1–41. https://doi.org/10.1016/0010-4655(95)00041-D. http://www.sciencedirect.com/science/article/pii/001046559500041D
18. Shivakumar D, Harder E, Damm W, Friesner RA, Sherman W (2012) Improving the prediction of absolute solvation free energies using the next generation OPLS force field. J Chem Theory Comput 8(8):2553–2558. https://doi.org/10.1021/ct300203w
19. Hess B, Kutzner C, van der Spoel D, Lindahl E (2008) GROMACS 4: algorithms for highly efficient, load-balanced, and scalable molecular simulation. J Chem Theor Comput 4(3):435–447. https://doi.org/10.1021/ct700301q. http://pubs.acs.org/doi/pdf/10.1021/ct700301q
20. Brown CJ, Quah ST, Jong J, Goh AM, Chiam PC, Khoo KH, Choong ML, Lee Ma, Yurlova L, Zolghadr K, Joseph TL, Verma CS, Lane DP (2013) Stapled peptides with improved potency and specificity that activate P53. ACS Chem Biol 8(3):506–512. https://doi.org/10.1021/cb3005148
21. Morrone J, Perez A, Deng Q, Ha S, Holloway M, Sawyer T, Sherborne B, Brown F, Dill K (2017) Molecular simulations identify binding poses and approximate affinities of stapled α-helical peptides to MDM2 and MDMX. J Chem Theor Comput 13(2):863–869. https://doi.org/10.1021/acs.jctc.6b00978
22. Baek S, Kutchukian PS, Verdine GL, Huber R, Holak Ta, Lee KW, Popowicz GM (2012) Structure of the stapled P53 peptide bound to MDM2. J Am Chem Soc 134(1):103–106. https://doi.org/10.1021/ja2090367
23. Ovchinnikov V, Stone TA, Deber C, Karplus M (2018) Structure of the EmrE multidrug transporter and its use for inhibitor peptide design. Proc Natl Acad Sci USA 115(34):E7942
24. Frenkel D, Smit B (2001) Understanding molecular simulation: from algorithms to applications, 2nd edn. Academic, San Diego
25. Allen MP, Tildesley DJ (1989) Computer simulation of liquids. Clarendon Press, New York, NY
26. Rapaport DC (1996) The art of molecular dynamics simulation. Cambridge University Press, New York, NY
27. Onufriev A, Bashford D, Case D (2004) Exploring protein native states and large-scale conformational changes with a modified Generalized Born model. Proteins 55(2):383–394. https://doi.org/10.1002/prot.20033
28. MATLAB (2010) Version 7.10.0 (R2010a). The MathWorks Inc., Natick, MA
29. Hazel A, Chipot C, Gumbart J (2014) Thermodynamics of deca-alanine folding in water. J Chem Theor Comput 10(7):2836–2844. https://doi.org/10.1021/ct5002076
30. Chang Y, Graves B, Guerlavais V, Tovar C, Packman K, To K, Olson K, Kesavan K, Gangurde P, Mukherjee A, Baker T, Darlak K, Elkin C, Filipovic Z, Qureshi F, Cai H, Berry P, Feyfant E, Shi X, Horstick J, Annis D, Manning A, Fotouhi N, Nash H, Vassilev L, Sawyer T (2013) Stapled α-helical peptide drug development: a potent dual inhibitor of MDM2 and MDMX for p53-dependent cancer therapy. Proc Natl Acad Sci USA 110(36):E3445–3454. https://doi.org/10.1073/pnas.1303002110
31. Brice A, Dominy B (2011) Analyzing the robustness of the MM/PBSA free energy calculation method: application to DNA conformational transitions. J Comput Chem 32(2):1431–1440
32. Srinivasan J, Cheatham TE, Cieplak P, Kollman PA, Case DA (1998) Continuum solvent studies of the stability of DNA, RNA, and phosphoramidate–DNA helices. J Am Chem Soc 120(37):9401–9409
33. Ovchinnikov V, Cecchini M, Karplus M (2013) A simplified confinement method (SCM) for calculating absolute free energies and free energy and entropy differences. J Phys Chem B 117:750–762. https://doi.org/10.1021/jp3080578. PMC3569517
34. Brooks B, Janežič D, Karplus M (1995) Harmonic analysis of large systems. I. Methodology. J Comput Chem 16:1522–1542
35. Im W, Feig M, Brooks III C (2003) An implicit membrane generalized Born theory for the study of structure, stability, and interactions of membrane proteins. Biophys J 85:2900–2918
36. Shirts M (2012) Best practices in free energy calculations for drug design. Methods Mol Biol 819:425–467.
https://doi.org/10.1007/978-1-61779-465-0_26
37. Vanommeslaeghe K, Hatcher E, Acharya C, Kundu S, Zhong S, Shim J, Darwin E, Guvench O, Lopes P, Vorobyev I, MacKerell Jr A (2009) CHARMM general force field: a force field and drug-like molecules compatible with the CHARMM all-atom additive biological force fields. J Comput Chem 31:671–690
38. Mayne CG, Saam J, Schulten K, Tajkhorshid E, Gumbart JC (2013) Rapid parameterization of small molecules using the force field toolkit. J Comput Chem 34(32):2757–2770. https://doi.org/10.1002/jcc.23422
39. Simonson T, Roux B (2016) Concepts and protocols for electrostatic free energies. Mol Simul 42(13):1090–1101. https://doi.org/10.1080/08927022.2015.1121544
40. Ryckaert JP, Ciccotti G, Berendsen H (1977) Numerical integration of the Cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J Comput Phys 23:327–341
41. Eswar N, Webb B, Marti-Renom M, Madhusudhan M, Eramian D, Shen M, Pieper U, Sali A (2006) Comparative protein structure modeling using modeller. Curr Prot Bioinf 54:5.6.1–5.6.37. https://doi.org/10.1002/0471250953.bi0506s15
42. Leaver-Fay A, Tyka M, Lewis S, Lange O, Thompson J, Jacak R, Kaufman K, Renfrew P, Smith C, Sheffler W, Davis I, Cooper S, Treuille A, Mandell D, Richter F, Ban Y, Fleishman S, Corn J, Kim D, Lyskov S, Berrondo M, Mentzer S, Popović Z, Havranek J, Karanicolas J, Das R, Meiler J, Kortemme T, Gray J, Kuhlman B, Baker D, Bradley P (2011) Rosetta3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487:545–574. https://doi.org/10.1016/B978-012-381270-4.00019-6
43. Li L, Li C, Sarkar S, Zhang J, Witham S, Zhang Z, Wang L, Smith N, Petukh M, Alexov E (2012) DelPhi: a comprehensive suite for DelPhi software and associated resources. BMC Biophysics 5:9. https://doi.org/10.1186/2046-1682-5-9
44. Roux B (1997) Influence of the membrane potential on the free energy of an intrinsic protein. Biophys J 73:2980–2989
45. Baker N, Sept D, Joseph S, Holst M, McCammon J (2001) Electrostatics of nanosystems: application to microtubules and the ribosome. Proc Natl Acad Sci USA 98:10037–10041
46. Zoubir AM, Boashash B (1998) The bootstrap and its application in signal processing. IEEE Signal Process Mag 15(1):56–76. https://doi.org/10.1109/79.647043

Chapter 15

Free Energy-Based Computational Methods for the Study of Protein-Peptide Binding Equilibria

Emilio Gallicchio

Abstract

This chapter discusses the theory and application of physics-based free energy methods to estimate protein-peptide binding free energies. It presents a statistical mechanics formulation of molecular binding, which is then specialized in three methodologies: (1) alchemical absolute binding free energy estimation with implicit solvation, (2) alchemical relative binding free energy estimation with explicit solvation, and (3) potential of mean force binding free energy estimation. Case studies of protein-peptide binding applications taken from the recent literature are discussed for each method.

Key words Free energy, Binding free energy, Equilibrium binding constant, Alchemical perturbation, Potential of mean force, Protein-peptide binding modeling, Molecular dynamics, Molecular recognition, Statistical mechanics

1 Introduction

Peptide and peptide-derived molecules are widely used to target protein-protein interactions for medicinal purposes and basic biological research. In-silico models play an increasingly significant role in the study of protein-peptide interactions. As excellently reviewed elsewhere [1–3], computational methods for studying protein-peptide interactions have evolved on somewhat separate tracks from those used for small molecule-protein interactions. These differences are partly due to the greater flexibility and size of peptides and their tendency to interact with proteins through many relatively weak interactions.
Nevertheless, because the same fundamental physical forces regulate all molecular recognition phenomena, it is helpful to relate computational models under a standard set of principles.

This chapter is devoted to a class of physics-based free energy methods considered the most accurate and detailed for modeling the thermodynamics of molecular binding equilibria. These methods model the interactions between molecules as well as their motion at the atomic level. We derive each method discussed from a well-established statistical mechanics theory of non-covalent molecular association. The chapter attempts to demystify the theory and the seemingly arcane formulas and computational procedures used in the field and to point out the specific features of the methods that make them more or less suitable for studying protein-peptide interactions. There is an implicit acknowledgment here that an understanding of these methodologies and how to select and apply them appropriately cannot be accomplished fully without referring to the underlying theory. The treatment employed here requires only a basic familiarity with concepts of statistics (probability distributions, averages, marginalization) and of classical statistical thermodynamics (classical partition functions and their manipulations, and their relationship with the free energy). After presenting the theory and methods, we illustrate their applications by discussing three case studies. We hope that this format will help convey the characteristics and relationships between the various methodologies and the fundamental principles on which they are based.

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_15, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
2 Statistical Mechanics Formulation

In this section, we derive and discuss a statistical mechanics theory of molecular binding. The concepts and the formulas expressed here will be used later to rationalize the specific computational methods and practices used in the case studies reviewed in Subheading 3. We attempt to use unambiguous notation throughout, but sometimes we adopt a simplified notation to unclutter the equations. For example, in intermediate formulas, we often omit limits of integration and Jacobian factors for curvilinear coordinates when they do not affect the form and interpretation of the final result. In some places, we use function arguments to distinguish two functions. For example, we might denote the ligand and receptor's potential energy functions with the same symbol $U$, as $U(x_L)$ and $U(x_R)$, even though they are different mathematical functions.

2.1 The Standard Free Energy of Binding

We consider here the reversible non-covalent binding equilibrium between receptor molecules R and ligand molecules L to form a complex RL in an ideal solution:

\[ R + L \rightleftharpoons RL, \tag{1} \]

with the dimensionless equilibrium constant

\[ K_b = \left[ \frac{[RL]/C^\circ}{([R]/C^\circ)\,([L]/C^\circ)} \right]_{\mathrm{eq}}, \tag{2} \]

where $[\ldots]$ are concentrations, $C^\circ$ is the standard state concentration (conventionally set to 1 M, equivalent to 1 molecule per 1660 Å$^3$), and the "eq" subscript states that all concentrations are evaluated at equilibrium. The Gibbs molar standard binding free energy, which is the main objective of the computational models of binding discussed here, is defined as

\[ \Delta G^\circ_b = -k_B T \ln K_b, \tag{3} \]

where $k_B$ is Boltzmann's constant and $T$ is the absolute temperature (in the following we assume constant temperature and pressure conditions).

Implicit in this quasi-chemical description of the binding equilibrium is the idea that the separated species in solution R and L, as well as the complex RL, are defined in some way.
In an experimental setting, the apparatus used to measure equilibrium concentrations provides a working definition of the species. The nature of the experimental reporter used to monitor the formation of the complex is of particular relevance [4]. The change of a spectroscopic signal, as in NMR and UV/VIS fluorescence assays [5], likely probes a set of conformations of the complex in which specific groups of the receptor and the ligand are in contact. Hence, different spectroscopic reporters would, in general, yield different estimates of the standard free energy of binding [6]. Spectroscopic reporters stand in contrast to experimental reporters, such as those in calorimetric, surface plasmon resonance (SPR), amplified luminescent proximity (AlphaScreen), and equilibrium dialysis binding assays, that probe nonspecific molecular association [4, 6–9]. Here, we focus mainly on computational models that define the complex using structural means (typically specific distances and angles between groups of atoms [10]) and are therefore more suitable for describing measurements of binding constants with specific spectroscopic experimental reporters.

In practice, the association between a peptide ligand and a protein receptor is also often monitored by biochemical means, such as enzymatic inhibition [11] or pull-down assays [12], that are only indirectly related to the equilibrium binding constant of the ligand-receptor complex. The ability of computational models to reproduce or explain this type of data is expected to be semiquantitative at best, as would be a correlation between experimental binding constants and activity data.
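As a concrete illustration of Eqs. 2 and 3, the following minimal Python sketch converts a set of equilibrium concentrations into a dimensionless binding constant and a standard binding free energy. The function names, the choice of kcal/mol units, and the concentrations in the usage lines are assumptions made here for illustration; they are not taken from any of the studies discussed in this chapter.

```python
import math

K_B = 0.0019872041        # Boltzmann constant in kcal/(mol K), molar units
AVOGADRO = 6.02214076e23  # molecules per mole


def volume_per_molecule(conc_molar):
    """Volume per molecule, in cubic angstroms, at the given molar concentration."""
    # 1 L = 1e27 cubic angstroms
    return 1.0e27 / (AVOGADRO * conc_molar)


def binding_constant(conc_rl, conc_r, conc_l, c_std=1.0):
    """Dimensionless K_b of Eq. 2 from equilibrium molar concentrations."""
    return (conc_rl / c_std) / ((conc_r / c_std) * (conc_l / c_std))


def standard_binding_free_energy(k_b, temperature=300.0):
    """Eq. 3: standard binding free energy in kcal/mol."""
    return -K_B * temperature * math.log(k_b)


# Hypothetical equilibrium concentrations (1 micromolar each, in mol/L)
k_b = binding_constant(1.0e-6, 1.0e-6, 1.0e-6)
dg = standard_binding_free_energy(k_b)
print(f"K_b = {k_b:.3g}, standard binding free energy = {dg:.2f} kcal/mol")
print(f"Volume per molecule at 1 M: {volume_per_molecule(1.0):.1f} cubic angstroms")
```

With these hypothetical inputs the complex has a micromolar dissociation constant, and the standard binding free energy comes out at roughly -8 kcal/mol at 300 K. Changing the standard concentration `c_std` shifts the result, which is why the standard state must always be stated alongside a binding free energy.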
While ambiguities in relating molecular computer simulations to experimental biophysical data of molecular binding exist for any molecular complex, the issue is explicitly discussed here because it is expected to be particularly widespread in the study of interactions involving peptides, which are generally more flexible than most small-molecule drug compounds and engage protein receptors over a large binding surface in a variety of binding modes. It is useful to keep these issues in mind when designing a computational model and deciding which answers one can reasonably extract from it. Computational modeling can be a valuable tool when used judiciously by exploiting its strengths while managing its unavoidable limitations.

2.2 Statistical Mechanics Theory of Non-covalent Molecular Binding

Under the assumptions above, Gilson et al. [13] derived a statistical mechanics expression for the binding constant (Eq. 2) which, with a few reasonable approximations (discussed below), can be written as [14]:

\[ K_b = \frac{C^\circ}{8\pi^2} \frac{z_{RL}}{z_R z_L}, \tag{4} \]

where $z_i$ is the intramolecular configurational partition function of one molecule of species $i$ in solution. A full derivation of Eq. 4 is beyond the scope of this chapter. However, it is briefly outlined here to introduce the notation. Equation 4 is derived by writing the molar standard binding free energy as the difference of the standard chemical potentials of the complex and those of the receptor and ligand

\[ \Delta G^\circ_b = \mu^\circ_{RL} - \mu^\circ_R - \mu^\circ_L \tag{5} \]

and employing the McMillan-Mayer expression for the standard chemical potential of a solute in an ideal solution [15]

\[ \mu^\circ_i = -k_B T \ln \frac{\phi_i}{\Lambda_i^3 C^\circ}, \tag{6} \]

where $\phi_i$ is the internal canonical molecular partition function of solute $i$ in solution and $\Lambda_i$ is the thermal de Broglie wavelength of the center of mass of the solute.
The internal molecular partition function includes only the internal degrees of freedom of the solute obtained after separating the translational degrees of freedom of the molecular center of mass. Furthermore, the solute's internal canonical molecular partition function in solution is understood in the context of the concept of the solvent potential of mean force [16], in which the solvent degrees of freedom are averaged out. While a quantum-mechanical treatment is required in general, adopting a classical expression for the molecular partition function is appropriate for the present discussion, which is limited to non-covalent molecular association equilibria that do not involve the formation or breaking of chemical bonds. The internal canonical molecular partition function is written as

\[ \phi_i = \frac{8\pi^2 z_i}{\prod_j \lambda_j^3}, \tag{7} \]

where the denominator comes from the integration over the momenta,^1 the factor of $8\pi^2$ comes from the integration over the orientational degrees of freedom of the solute,^2 and $z_i$ is the vibrational molecular configurational partition function

\[ z_i = \int dx_i \, e^{-\beta \Psi_i(x_i)}, \tag{8} \]

where $\beta = 1/(k_B T)$ is the inverse temperature, $x_i$ denotes the collection of the vibrational degrees of freedom of solute $i$, and $\Psi_i$ is the solvent-averaged potential of mean force of a specific configuration of the solute in solution.^3 In the present notation,

\[ e^{-\beta \Psi_i(x_i)} = \frac{1}{Z_N} \int dr_v^N \, e^{-\beta U_i(r_v^N, x_i)}, \tag{9} \]

where $r_v^N$ denotes the collection of degrees of freedom of $N$ solvent molecules, $U_i(r_v^N, x_i)$ is the potential energy of the mixture of $N$ solvent molecules and one solute molecule $i$, and $Z_N$ is the configurational partition function of the pure solvent,^4 expressed as the integral in Eq. 9 but without the solute. Equation 4 is obtained by inserting Eq. 6 for each species, using Eqs. 7–9, into Eq. 5, noticing that the kinetic energy factors cancel out, and finally inverting Eq. 3.
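For readers who wish to see the intermediate algebra, the steps just described can be written out as follows. This is only a sketch: the symbol $\kappa$ that collects the momentum factors is introduced here for convenience and is not part of the notation of the chapter, and setting it to 1 relies on the cancellation of the kinetic energy factors noted above.

```latex
\begin{aligned}
\Delta G^\circ_b
 &= -k_B T \ln \frac{\phi_{RL}}{\Lambda_{RL}^3 C^\circ}
    + k_B T \ln \frac{\phi_{R}}{\Lambda_{R}^3 C^\circ}
    + k_B T \ln \frac{\phi_{L}}{\Lambda_{L}^3 C^\circ} \\
 &= -k_B T \ln \left[ C^\circ \, \frac{\phi_{RL}}{\phi_R \, \phi_L} \,
    \frac{\Lambda_R^3 \Lambda_L^3}{\Lambda_{RL}^3} \right] \\
 &= -k_B T \ln \left[ \frac{C^\circ}{8\pi^2} \, \frac{z_{RL}}{z_R \, z_L} \, \kappa \right],
 \qquad
 \kappa = \frac{\Lambda_R^3 \Lambda_L^3}{\Lambda_{RL}^3} \,
          \frac{\prod_{j \in R} \lambda_j^3 \, \prod_{j \in L} \lambda_j^3}
               {\prod_{j \in RL} \lambda_j^3} = 1 .
\end{aligned}
```

The first line inserts Eq. 6 into Eq. 5, the second collects the logarithms, and the third uses Eq. 7 for each species. Comparing the last line with Eq. 3 gives Eq. 4.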
The definition of the intramolecular configurational partition function of the complex, $z_{RL}$, receives special consideration in this theory [13]. In the complex, the translational and orientational degrees of freedom of the ligand are represented by the internal degrees of freedom of the complex that specify the position and orientation of the ligand with respect to a coordinate system attached to the receptor [10]. Furthermore, the integration along these coordinates is limited to some specified range of configurational space that encodes our structural definition of what constitutes a valid configuration of a ligand "bound" to the receptor (see the discussion in Subheading 2.1). The structural definition of the bound complex is a necessary and somewhat arbitrary input of the theory [4, 7, 13, 14]. Without it, the free energy of the bound complex relative to the unbound state is undefined and, consequently, the standard binding free energy and the binding constant would also be undefined in this theory. It is customary to represent the bound region of the complex by an indicator function $I(\zeta_L)$, where $\zeta_L$ represents the collection of the six coordinates^5 that specify the position and orientation of the ligand relative to the receptor [10].^6 The indicator function is set to 1 if the position and orientation of the ligand are such that receptor and ligand are considered bound and to zero otherwise, so that $z_{RL}$ can be written as

\[ z_{RL} = \int dx_R \, dx_L \, d\zeta_L \, I(\zeta_L) \, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L)}. \tag{10} \]

^1 The details are omitted since the contributions from momenta cancel out in this classical treatment.
^2 We assume that the orientational degrees of freedom can be separated from the vibrational degrees of freedom without significant loss of accuracy. This is generally an excellent approximation at moderate temperatures.
^3 It should be noted that the solvent potential of mean force formalism does not introduce assumptions or approximations beyond those already adopted. In this context, it is only a convenient notational aid. We will discuss later implicit solvation models, which approximate the solvent potential of mean force.
^4 The notation can be easily extended to solvent mixtures including ions and co-solvents.
^5 Three translations and three orientations for a non-linear ligand.
^6 The specific choice of the $\zeta_L$ coordinates is arbitrary as long as they do not couple, directly or indirectly, the intramolecular coordinates of the receptor or the ligand.

2.3 The Binding Free Energy Formula

Since the direct evaluation of partition functions is not generally feasible, Eq. 4 is not amenable to direct computation. One strategy is to transform it into an average over the conformational ensemble in which receptor and ligand are uncoupled. To do so, we reorganize the integration variables in the numerator so that they match exactly those in the denominator. First, define

\[ \int d\zeta_L \, I(\zeta_L) = V_{\mathrm{site}} \, \Omega_{\mathrm{site}}, \tag{11} \]

which measures the spatial ($V_{\mathrm{site}}$) and angular ($\Omega_{\mathrm{site}}$) extent of the bound state of the complex when receptor and ligand are uncoupled.^7 Then, multiply and divide Eq. 4 by Eq. 11, keeping the integral form in the denominator and the integrated form in the numerator. The result is

\[ K_b = C^\circ V_{\mathrm{site}} \frac{\Omega_{\mathrm{site}}}{8\pi^2} \left\langle e^{-\beta u} \right\rangle_0, \tag{12} \]

where

\[ \left\langle e^{-\beta u} \right\rangle_0 = \int dx_R \, dx_L \, d\zeta_L \, e^{-\beta u(x_R, x_L, \zeta_L)} \, \rho_0(x_R, x_L, \zeta_L) \tag{13} \]

is the ensemble average of the Boltzmann weight of the effective binding energy, $u$, defined as the difference between the effective potential energy of the complex in the specified configuration and that of the separated receptor and ligand without changing their internal configurations

\[ u(x_R, x_L, \zeta_L) = \Psi_{RL}(x_R, x_L, \zeta_L) - \Psi_R(x_R) - \Psi_L(x_L). \tag{14} \]

The average is taken with the normalized probability density function

\[ \rho_0(x_R, x_L, \zeta_L) = \frac{I(\zeta_L) \, e^{-\beta \Psi_R(x_R)} \, e^{-\beta \Psi_L(x_L)}}{\int dx_R \, dx_L \, d\zeta_L \, I(\zeta_L) \, e^{-\beta \Psi_R(x_R)} \, e^{-\beta \Psi_L(x_L)}}, \tag{15} \]

which corresponds to an unphysical state of the complex in which the ligand is bound to the receptor (the density is zero unless $I(\zeta_L) = 1$) but does not interact with it (the potential function lacks receptor-ligand coupling terms). We will hereafter refer to this state as the decoupled state of the complex. Conversely, the coupled state of the complex is the physical state in which the bound ligand and the receptor interact through the $\Psi_{RL}(x_R, x_L, \zeta_L)$ potential function.

^7 Equation 11 is colloquially referred to as the volume of the receptor binding site. The notation used here suggests that translational and orientational components are not coupled in the definition of $I(\zeta_L)$. The present treatment is still valid if this is not the case, except that the value of the integral of the indicator function is then not written as the product of spatial and orientational components. Finally, $\Omega_{\mathrm{site}} = 8\pi^2$ if the definition of the bound complex does not involve orientational coordinates, that is, when only the position of the ligand is used to judge whether it is bound to the receptor.

Inserting Eq. 12 into Eq. 3 yields the following expression for the standard free energy of binding

\[ \Delta G^\circ_b = \Delta G^\circ_{b,\mathrm{id}} + \Delta G_b, \tag{16} \]

where

\[ \Delta G^\circ_{b,\mathrm{id}} = -k_B T \ln C^\circ V_{\mathrm{site}} - k_B T \ln \frac{\Omega_{\mathrm{site}}}{8\pi^2} \tag{17} \]

is the ideal component of the standard free energy of binding, corresponding to the reversible work for transferring a ligand from an ideal solution at concentration $C^\circ$ to the binding site region in the absence of ligand-receptor interactions, and

\[ \Delta G_b = -k_B T \ln \left\langle e^{-\beta u} \right\rangle_0 \tag{18} \]

is the excess component of the standard free energy of binding, corresponding to the reversible work for turning on the receptor-ligand interactions while the ligand is sequestered within the binding site region of the receptor. The goal of the computational models discussed in this chapter is the estimation of the excess free energy of binding.
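The ideal term of Eq. 17 is simple enough to evaluate by hand, but a short Python sketch may help fix the units and signs. The spherical binding site of 5 Å radius, the tenfold orientational restriction, and the function names below are hypothetical choices made for illustration only; they do not correspond to the binding site definition of any study discussed here.

```python
import math

K_B = 0.0019872041              # Boltzmann constant in kcal/(mol K), molar units
AVOGADRO = 6.02214076e23        # molecules per mole
C_STD = AVOGADRO / 1.0e27       # 1 M expressed in molecules per cubic angstrom
FULL_ROTATION = 8.0 * math.pi ** 2


def sphere_volume(radius):
    """Volume of a spherical binding site region (cubic angstroms)."""
    return 4.0 / 3.0 * math.pi * radius ** 3


def ideal_binding_free_energy(v_site, omega_site=FULL_ROTATION, temperature=300.0):
    """Eq. 17 in kcal/mol: -kT ln(C0 V_site) - kT ln(Omega_site / 8 pi^2)."""
    kt = K_B * temperature
    translational = -kt * math.log(C_STD * v_site)
    orientational = -kt * math.log(omega_site / FULL_ROTATION)
    return translational + orientational


# Hypothetical site: sphere of radius 5 angstroms, no orientational restriction
dg_position_only = ideal_binding_free_energy(sphere_volume(5.0))
# Same site, with orientations restricted to one tenth of the full 8 pi^2
dg_restricted = ideal_binding_free_energy(sphere_volume(5.0), FULL_ROTATION / 10.0)
print(f"{dg_position_only:.3f} kcal/mol, {dg_restricted:.3f} kcal/mol")
```

The ideal term is positive whenever the site volume is smaller than the volume per molecule at the standard concentration, and each additional restriction on the orientation adds a further penalty. A tighter definition of the bound state therefore raises the ideal term, which the excess term of Eq. 18 must compensate if the binding constant is to be insensitive to the details of the definition.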
The ideal component is generally computed analytically by integration of the expression that defines the indicator function of the bound complex.

Equation 18 provides, in principle, a computational route to evaluate the binding free energy. The process is often called alchemical because it is unrealizable in Nature. Nevertheless, it produces estimates that can be compared to experimental measurements. It instructs us to (1) obtain a sample of Boltzmann-distributed conformations of the complex in the uncoupled state (typically by molecular dynamics), (2) evaluate the binding energy function $u$ (Eq. 14) for each sample by turning on, without conformational rearrangements, the coupling between ligand and receptor, and finally (3) average the Boltzmann weight $\exp(-\beta u)$ over the samples. While straightforward, this process is numerically ill-conditioned, and it fails for all but the simplest systems. This problem arises because atoms of the ligand and the receptor are very likely to clash when uncoupled. Consequently, the binding energy $u$ is large and positive, and $\exp(-\beta u)$ is negligibly small for the vast majority of samples. Effectively, the sampling process generates mostly zeros, and the average is dominated by the very rare cases when, by chance, ligand and receptor do not clash and are primed to form favorable interactions even in the absence of such interactions.

Fig. 1 The probability density $p_0(u)$ (right, green curve) of the binding energy for the alchemical uncoupled state ($\lambda = 0$) of the complex between 3-iodotoluene and the L99A mutant of T4-lysozyme (left) at 300 K [17]. The red curve is $p_1(u)$, the probability density of the binding energy in the coupled ensemble ($\lambda = 1$), which is proportional to the integrand in Eq. 19, $\exp(-\beta u)\, p_0(u)$ [18]. Note that the y-axis and the positive x-axis are on a logarithmic scale

To appreciate more quantitatively the severity of this numerical problem, let us rewrite the ensemble average in Eq.
18 as a statistical average

\[ \left\langle e^{-\beta u} \right\rangle_0 = \int_{-\infty}^{+\infty} du \, e^{-\beta u} \, p_0(u), \tag{19} \]

where $p_0(u)$ is the probability density distribution of the binding energy in the uncoupled state. As shown, for example, in Fig. 1 for the complex between 3-iodotoluene and the L99A mutant of T4-lysozyme [17], $p_0(u)$ (in green) is greatest for large and positive values of the binding energy. For this system, the probability of finding a conformation for which the integrand of Eq. 19 is significant (the red curve) is six or more orders of magnitude smaller than the probability of occurrence of conformations with atomic clashes. It would take a prohibitively large number of independent samples of the decoupled ensemble to obtain a sufficiently large subset at favorable binding energies to estimate the binding free energy with any precision. Effectively, the binding free energy is dominated by the low binding energy tail of $p_0(u)$, which is difficult to estimate precisely and which is greatly amplified by the exponential term in Eq. 19.^8

^8 It is tempting to try to address the clashes caused by coupling by reversing the process, that is, by decoupling. However, an equilibrium thermodynamic process like this one must be reversible, so the direction of the process is irrelevant.

The 3-iodotoluene/T4-lysozyme complex illustrated in Fig. 1 is a rather simple system. The severity of the numerical problem is far greater for peptide ligands, which are significantly larger and more flexible than a small molecule. Random placement of a peptide molecule in the protein receptor site will almost inevitably result in conformations with atomic clashes that do not contribute significantly to the binding free energy. Moreover, peptides can assume a large variety of conformations when decoupled from the receptor, with only a small fraction of them compatible with binding, thereby further reducing the probability of generating useful bound conformations.
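The sampling problem described above can be reproduced with a toy model. The Python sketch below assumes, purely for illustration, that the reduced binding energy $\beta u$ in the uncoupled state is Gaussian-distributed, in which case the exponential average of Eq. 19 is known in closed form and the estimate from a finite sample can be checked against it. Real distributions such as the one in Fig. 1 are far from Gaussian, so the numbers printed here have no physical meaning; the function names and parameter values are hypothetical.

```python
import math
import random


def free_energy_from_samples(beta_u):
    """Return -ln <exp(-beta u)>_0 (in units of kT) from samples of beta*u."""
    shift = min(beta_u)  # factor out the largest Boltzmann weight to avoid overflow
    total = sum(math.exp(-(x - shift)) for x in beta_u)
    return shift - math.log(total / len(beta_u))


def gaussian_reference(mean, sigma):
    """Exact -ln <exp(-beta u)>_0 when beta*u is normal with the given mean and sigma."""
    return mean - 0.5 * sigma ** 2


def estimate(mean, sigma, n_samples, seed=2022):
    """Exponential-average estimate from n_samples Gaussian draws of beta*u."""
    rng = random.Random(seed)
    samples = [rng.gauss(mean, sigma) for _ in range(n_samples)]
    return free_energy_from_samples(samples)


# Narrow distribution: coupled and uncoupled ensembles overlap well
narrow_error = estimate(5.0, 0.5, 20000) - gaussian_reference(5.0, 0.5)
# Wide distribution: the relevant low-energy tail is essentially never sampled
wide_error = estimate(5.0, 10.0, 5000) - gaussian_reference(5.0, 10.0)
print(f"error (narrow) = {narrow_error:.3f} kT, error (wide) = {wide_error:.1f} kT")
```

For the narrow distribution the estimate agrees closely with the exact value, whereas for the wide distribution the estimate is too positive by many $k_B T$, because the exact result is controlled by binding energies many standard deviations below the mean that a sample of this size does not contain.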
In practice, various strategies ranging from stratification (breaking up the binding process by introducing appropriate intermediate states) to importance sampling (preferential sampling of bound states) have been devised to overcome the numerical problems in alchemical free energy averages. Some of these strategies will be discussed in the case studies later in this chapter. While often very useful, applying these advanced strategies to protein-peptide complexes remains very challenging, as reflected in the paucity of successful alchemical absolute binding free energy calculations for protein-peptide complexes reported in the literature.

2.3.1 The Double-Decoupling Method

Equation 18 is not directly applicable to the calculation of binding free energies unless the solvent potential of mean force, $\Psi_i(x_i)$, or a suitable implicit solvent approximation for it, is available for the ligand, the receptor, and their complex. The solvent potential of mean force is required for conformational sampling and for the evaluation of effective binding energies for each sample using Eqs. 14 and 9. The alternative is to employ an explicit representation of the solvent. The relevant partition functions then include integration over the internal degrees of freedom of the solutes and over the degrees of freedom of the solvent molecules. The result is a binding free energy formulation known as double-decoupling [13], involving two exponential averages of the same form as Eq. 18, one for coupling the ligand from vacuum to the solvated receptor and another for coupling the ligand to the pure solvent. These two processes, the second of which is related to the solvation of the ligand, are part of a thermodynamic cycle that brings the ligand from the solvent bulk to the solvated receptor through an intermediate state in which the ligand is in vacuum (Fig. 2). The double-decoupling method is regarded as the leading computational model for calculating protein-small molecule binding free energies.
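The bookkeeping of the thermodynamic cycle just described can be summarized in a few lines of Python. This is a minimal sketch with hypothetical function names and made-up free energy values; it takes as inputs the two coupling free energies (ligand from vacuum to the solvated receptor site, and ligand from vacuum to pure solvent) and combines them with an ideal term as in Eq. 16.

```python
def excess_binding_free_energy(dg_couple_receptor, dg_couple_solvent):
    """Excess binding free energy from the two coupling free energies.

    dg_couple_receptor: ligand brought from vacuum into the solvated receptor site
    dg_couple_solvent:  ligand brought from vacuum into the pure solvent
    The decoupling legs of the cycle have the opposite signs, so this is the
    same quantity as (decoupling from solvent) - (decoupling from receptor).
    """
    return dg_couple_receptor - dg_couple_solvent


def standard_binding_free_energy(dg_couple_receptor, dg_couple_solvent, dg_ideal):
    """Standard binding free energy: ideal term plus excess term (Eq. 16)."""
    return dg_ideal + excess_binding_free_energy(dg_couple_receptor, dg_couple_solvent)


# Hypothetical values in kcal/mol, for illustration only
dg_standard = standard_binding_free_energy(-25.0, -15.0, 0.7)
print(f"standard binding free energy = {dg_standard:.1f} kcal/mol")
```

The subtraction makes explicit why the method is delicate for large ligands: the binding free energy is a comparatively small difference between two large coupling free energies, so the statistical error of each leg must be much smaller than the leg itself.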
However, because of the size of peptides, it is not generally applicable to them. It is presented here because it forms the basis for the relative binding free energy method employed in the case study of Subheading 3.2. To see why double-decoupling is not readily applicable to peptides, consider, for example, the first leg in Fig. 2, which is the inverse of the coupling of the peptide to the solvated receptor. For the same reasons outlined above concerning Eq. 18, it would be very challenging to compute the free energy of this process because, in addition to the many atomic clashes with the receptor atoms, the uncoupled peptide will also clash with solvent molecules that would be present in the binding site. Similar challenges would exist for the hydration leg.

Fig. 2 Schematic illustration of the thermodynamic cycle of the double-decoupling method for the calculation of the binding free energy between a molecular receptor (orange doughnut) and a ligand (black circle). The dashed circle within the receptor represents the binding site region. The blue boxes represent the solvent. The bound and unbound end states are transformed to a common intermediate state in which the ligand is in vacuum (white). The excess binding free energy is the difference of the free energy changes of the two legs, $\Delta G_b = \Delta G_2 - \Delta G_1$

The double-decoupling formula is derived from the statistical mechanics theory outlined in Subheading 2.2 by first inserting the definition of the solvent potential of mean force (Eq. 9) in each of the configurational partition functions in Eq. 4 and then multiplying and dividing by the configurational partition function of the ligand in vacuum, $Z_{0,L}$, to obtain

\[ K_b = \frac{C^\circ}{8\pi^2} \, \frac{Z_{N,RL}}{Z_{N,R} \, Z_{0,L}} \, \frac{Z_N \, Z_{0,L}}{Z_{N,L}}, \tag{20} \]

where $Z_{N,i}$ is the configurational partition function of a system with $N$ solvent molecules and one molecule of species $i$ whose position and orientation, as in Subheading 2.2, are fixed.
So, for example,

\[ Z_{N,RL} = \int dx_R \, dx_L \, d\zeta_L \, I(\zeta_L) \, dr_v^N \, e^{-\beta U(x_R, x_L, \zeta_L, r_v^N)}, \tag{21} \]

where $U(x_R, x_L, \zeta_L, r_v^N)$ is the potential energy function of a system with $N$ solvent molecules containing the receptor-ligand complex RL in the configuration specified by the internal degrees of freedom $x_R$, $x_L$, and $\zeta_L$. $Z_{0,L}$ represents the configurational partition function of the ligand in vacuum. The reciprocal of the last term in Eq. 20 can be written as

\[ \frac{Z_{N,L}}{Z_N \, Z_{0,L}} = \frac{\int dx_L \, dr_v^N \, e^{-\beta U(x_L, r_v^N)}}{\int dx_L \, dr_v^N \, e^{-\beta U(x_L)} \, e^{-\beta U(r_v^N)}} = \left\langle e^{-\beta u_L} \right\rangle_{N+L} = e^{\beta \Delta G_2}, \tag{22} \]

where $u_L = U(x_L, r_v^N) - U(x_L) - U(r_v^N)$ is the instantaneous change in potential energy for bringing the ligand from vacuum to solution and $\langle \ldots \rangle_{N+L}$ indicates the ensemble average over pure solvent and the ligand in vacuum. As indicated in Eq. 22, this term is related to the solvation free energy of the ligand,^9 that is, to the opposite process of leg 2 in Fig. 2. The ratio of partition functions corresponding to the complex in Eq. 20 is converted to an average by multiplying and dividing by $V_{\mathrm{site}} \Omega_{\mathrm{site}}$, as done earlier to derive Eq. 12:

\[ \frac{Z_{N,RL}}{Z_{N,R} \, Z_{0,L}} = V_{\mathrm{site}} \, \Omega_{\mathrm{site}} \left\langle e^{-\beta u_{RL}} \right\rangle_{N,R+L} = V_{\mathrm{site}} \, \Omega_{\mathrm{site}} \, e^{\beta \Delta G_1}, \tag{23} \]

where $u_{RL} = U(x_R, x_L, \zeta_L, r_v^N) - U(x_R, r_v^N) - U(x_L)$ is the instantaneous change in potential energy for bringing the ligand from vacuum to a position and orientation $\zeta_L$ relative to the receptor in a solution containing the receptor, and $\langle \ldots \rangle_{N,R+L}$, similarly to Eq. 18, indicates the ensemble average over the uncoupled ensemble in which the ligand is bound to the receptor ($I(\zeta_L) = 1$) but does not interact with either the receptor or the solvent. As indicated in Eq. 23, this ensemble average gives the free energy of the inverse of leg 1 in Fig. 2. Combining Eqs.
22, 23, 20, 16, 17, and 3 we finally arrive at the double-decoupling expression for the excess binding free energy:

\[ \Delta G_b = \Delta G_2 - \Delta G_1, \tag{24} \]

as illustrated in Fig. 2. Note that the free energy formula for each leg has the same form of an exponential average (Eq. 23) of the alchemical potential energy change as the direct binding free energy formula we derived in Subheading 2.3. Thus, similar considerations apply for each leg of double-decoupling. In each case, the formula instructs us to obtain samples of configurations of either the system with the ligand in solution or the system with the ligand in the solvated receptor, in their decoupled ensembles. It then instructs us to average over the set of samples the Boltzmann weight of the potential energy change for turning on the coupling between the ligand and the environment without conformational rearrangements. Here too, each leg's averaging process is expected to be numerically ill-conditioned (see, for example, Fig. 1) and not generally applicable directly in molecular simulations. Some numerical approaches to this problem are illustrated in the Case Studies section of this chapter.

^9 Specifically, the free energy of transferring a solute from a fixed position and orientation in vacuum to a fixed position and orientation in solution, a quantity also known as the solvation free energy in the Ben-Naim standard state [19, 20].

2.4 The Potential of Mean Force Method

In this section we derive a non-alchemical formulation of the statistical mechanics expression 4, which leads to the potential of mean force formula for the binding constant. Using the definition of the internal configurational partition function of the complex in Eq. 10 and the analogous ones for the receptor and ligand, Eq.
4 is written as

\[ K_b = \frac{C^\circ}{8\pi^2} \, \frac{\int dx_R \, dx_L \, d\zeta_L \, I(\zeta_L) \, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L)}}{\int dx_R \, dx_L \, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L^*)}}, \tag{25} \]

where we have written the product $z_R z_L$ of the separated receptor and ligand as the partition function of a single system in which the ligand is placed in an arbitrary position $\zeta_L^*$ sufficiently removed from the receptor so that it does not interact with it. Equation 25 is then written as

\[ K_b = \frac{C^\circ}{8\pi^2} \int_{\mathrm{site}} d\zeta_L \, e^{-\beta \Delta F(\zeta_L)}, \tag{26} \]

where the integration is within the binding site region where $I(\zeta_L) \neq 0$, and the potential of mean force (PMF) function is defined as

\[ e^{-\beta \Delta F(\zeta_L)} = \frac{\int dx_R \, dx_L \, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L)}}{\int dx_R \, dx_L \, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L^*)}}, \tag{27} \]

where $\Delta F(\zeta_L)$ is the value of the PMF at $\zeta_L$ relative to the value far away from the receptor. With this definition the PMF is zero at any point far away from the receptor. The PMF as defined corresponds to the probability density $p(\zeta_L)$ of finding the ligand in the orientation and position $\zeta_L$ relative to the receptor:

\[ p(\zeta_L) = \frac{\int dx_R \, dx_L \, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L)}}{\int dx_R \, dx_L \, d\zeta_L' \, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L')}} = \left\langle \delta(\zeta_L' - \zeta_L) \right\rangle, \tag{28} \]

so that

\[ \Delta F(\zeta_L) = -k_B T \ln \frac{p(\zeta_L)}{p(\zeta_L^*)}. \tag{29} \]

The potential of mean force expression 26 formally instructs us to map out the probability density 28 of observing the ligand around the receptor in orientation and position $\zeta_L$, including far away from the receptor and within the binding site region, and then to integrate it within the binding site region to obtain the binding constant using Eq. 26.

Some comments are in order. First, the PMF function can be obtained in the solvent potential of mean force formulation, as suggested by Eq. 27, or by using an explicit representation of the solvent by inserting the definitions of the effective potential energy $\Psi$ and of the solvent potential of mean force (Eq. 9) into Eq.
27:

\[ e^{-\beta \Delta F(\zeta_L)} = \frac{\int dx_R \, dx_L \, dr_v^N \, e^{-\beta U(x_R, x_L, r_v^N, \zeta_L)}}{\int dx_R \, dx_L \, dr_v^N \, e^{-\beta U(x_R, x_L, r_v^N, \zeta_L^*)}}. \tag{30} \]

It is evident therefore that the PMF is obtained by monitoring the probability of occurrence of the ligand at $\zeta_L$ whether an implicit or explicit description of the solvent is used. Second, the potential of mean force formula for the binding constant 26 does not require knowledge of the probability density $p(\zeta_L)$ everywhere around the receptor. It requires it only within the binding site region and at one arbitrary point $\zeta_L^*$ far away from the receptor in the solvent bulk to compute $\Delta F(\zeta_L)$ from Eq. 29. The latter is a fundamental point. It is not sufficient to study the distribution of placements of the ligand in the binding site to compute the binding free energy. We also require the probability of finding the ligand in the binding site relative to finding it somewhere in the solvent bulk. In practice, the PMF is obtained in a volume that includes both the binding site and positions far away from the receptor to connect the two regions in a statistical sense [21–23]. Finally, the PMF is rarely obtained over all six degrees of freedom of $\zeta_L$ (three positions and three orientations). In practice, the PMF is collected only along some of the dimensions by averaging over the others. The averaging procedure is formally described by marginalization of $p(\zeta_L)$. For example, to obtain the probability of the position $r_L$ of the ligand regardless of its orientation we integrate $p(\zeta_L) = p(r_L, \theta_1, \psi_1, \psi_2)$ over the three Euler angles $\theta_1$, $\psi_1$, and $\psi_2$:

\[ p(r_L) = \int d(\cos\theta_1) \, d\psi_1 \, d\psi_2 \, p(r_L, \theta_1, \psi_1, \psi_2). \tag{31} \]

In the bulk, the ligand distribution does not depend on the orientation and we get

\[ p(r_L^*) = \int d(\cos\theta_1) \, d\psi_1 \, d\psi_2 \, p(r_L^*, \theta_1, \psi_1, \psi_2) = 8\pi^2 \, p(\zeta_L^*). \tag{32} \]

Next, integrate Eq.
26 over $\theta_1$, $\psi_1$, and $\psi_2$, assuming that the binding site definition does not depend on orientations, and expressing $e^{-\beta \Delta F(\zeta_L)}$ as $p(\zeta_L)/p(\zeta_L^*)$, to obtain

$$K_b = \frac{C^\circ}{8\pi^2} \int_{\rm site} dr_L\, \frac{p(r_L)}{p(\zeta_L^*)} = C^\circ \int_{\rm site} dr_L\, e^{-\beta \Delta F(r_L)}, \tag{33}$$

where

$$\Delta F(r_L) = -k_B T \ln \frac{p(r_L)}{p(r_L^*)} \tag{34}$$

and we have used Eqs. 31 and 32. The implementation of Eq. 33 requires the PMF with respect to the position of the ligand regardless of its orientation.

3 Case Studies of Applications of Free Energy Methods to Protein-Peptide Binding Free Energy Estimation

In this section, we review some applications of the free energy methods derived from the statistical mechanics theory of non-covalent molecular binding introduced in Subheading 2.2 to the study of protein-peptide binding phenomena. We focus in particular on theoretical and methodological aspects, which are introduced and discussed as needed. The following case studies are far from an exhaustive representation of the literature in the field. They have been selected primarily to illustrate the application of the theory and methods presented in Subheading 2. We also do not attempt to review each work exhaustively.

3.1 Binding of Cyclic Peptides to HIV Integrase with the Single-Decoupling Method and Implicit Solvation

As part of the infection cycle, HIV inserts its genome into a human chromosome. The HIV integrase (IN) enzyme responsible for this process is recruited to the nuclear chromatin by the human lens epithelium-derived growth factor (LEDGF) transcriptional coactivator [24]. There have been significant attempts [8, 25–27] to develop therapies against HIV based on disrupting the interaction of LEDGF with HIV IN, which occurs at the so-called LEDGF binding domain of integrase (Fig. 3). The study of the interaction of LEDGF and LEDGF-derived synthetic peptides with HIV IN has provided useful insights for the design of competitive inhibitors [28, 29]. As an example, Fig.
3 illustrates the crystal structure of the LEDGF binding domain of the HIV IN dimer complexed with a cyclic peptide [29]. Building upon an earlier successful application of alchemical binding free energy calculations to small-molecule inhibitors targeting the LEDGF/HIV IN interaction [30], Kilburg and Gallicchio [31] modeled the binding free energies between HIV IN and five of the thirteen cyclic peptides assayed by Rhodes et al. [29]. The alchemical binding free energy study by Kilburg and Gallicchio recapitulated the trends observed in the experimental assays and identified the specific structural and energetic signatures responsible for favorable binding. Conversely, the calculations provided explanations for the lack of binding observed for two sequences for which structural information is not available.

Fig. 3 The 3AVN crystal structure of the dimer of the LEDGF binding domain of HIV integrase (multi-color ribbons) bound to SHKIDNLD cyclic peptides (red tube) [29]

The study by Kilburg and Gallicchio [31] remains one of the few examples of the successful application of alchemical free energy methods to the computation of the absolute binding free energies of protein-peptide complexes. This was made possible by employing an implementation of Eq. 16 that was first reported under the name Binding Energy Distribution Analysis Method (BEDAM) [18, 32] as part of the IMPACT molecular simulation program [33]. The latest implementation, a plugin of the OpenMM molecular dynamics library [34], has been named the Single-Decoupling Method (SDM) [16] (footnote 10), a name chosen to better place it in the same theoretical context as the Double-Decoupling Method (DDM) [13] discussed in Subheading 2.3.1. In the following, we will use the latter name to refer to both implementations. SDM has been used in two studies involving protein-peptide binding to date [31, 35]. The implementation of Eq.
16 requires the averaging of the Boltzmann weight of the effective binding energy in Eq. 14, which in turn requires the specification of the intramolecular potential energy and the solvent potential of mean force for each configuration $x_i$ of the molecular species involved. The former is available from a molecular mechanics force field (OPLS-AA [36] in the applications discussed here), while the solvent potential of mean force is approximated by an implicit solvent model [16]. SDM employs the Analytical Generalized Born plus Non-Polar (AGBNP) implicit solvent model [37, 38], which is now maintained as an OpenMM plugin [39] (footnote 11).

Footnote 10: github.com/rajatkrpal/openmm_sdm_plugin.

3.1.1 Alchemical Pathways and Stratification

We use this case study to illustrate the very general concept of an alchemical pathway and the idea of performing conformational sampling along the pathway to improve the convergence characteristics of the basic binding free energy formula (Eq. 18). This technique, commonly known in the field as stratification, is used in many free energy estimation problems [40]. As discussed in Subheading 2.3, Eq. 18 is not directly applicable in numerical simulations because, fundamentally, the coupled and uncoupled ensembles preferentially visit distinct regions of conformational space (see Fig. 1, for example). The free energy, however, is a thermodynamic state function, and it should be possible to compute it as the sum of free energy changes over a series of intermediate states, each sufficiently similar to its neighbors that free energy estimation formulas such as Eq. 18 applied between them are numerically well-behaved [41, 42] (footnote 12). The intermediate, so-called alchemical, states are generally implemented by means of an alchemical progress parameter $\lambda$ that tunes the system's potential energy function such that $\lambda = 0$ corresponds to the initial state and $\lambda = 1$ corresponds to the final state.
A simple, though not necessarily the most efficient [17, 43], choice is a linear interpolating function of the form

$$U_\lambda(x) = U_0(x) + \lambda u(x), \tag{35}$$

where $U_0(x)$ is the potential energy function that describes the initial state and $u(x) = U_1(x) - U_0(x)$, where $U_1(x)$ is the potential energy function of the final state, is the perturbation potential. The progress parameter $\lambda$ and the specific parameterization of the alchemical potential are said to define an alchemical path that connects, in a thermodynamic sense, the initial and final states.

The specific alchemical potential energy function adopted by Kilburg and Gallicchio [31] to study peptide binding is, in the notation of Subheading 2.3,

$$\Psi_\lambda(x_R, x_L, \zeta_L) = \Psi(x_R) + \Psi(x_L) + \lambda u(x_R, x_L, \zeta_L), \tag{36}$$

where the first two terms on the r.h.s. constitute the potential energy function of the decoupled ensemble (corresponding to $U_0(x)$ in Eq. 35) and the binding energy function $u$ is defined by Eq. 14 (footnote 13). It is straightforward to see that $\Psi_\lambda$ at $\lambda = 1$ is the potential energy function of the coupled state.

Footnote 11: github.com/egallicc/openmm_agbnp_plugin.

Footnote 12: This concept has since evolved into rigorous statistical interpretations and numerical algorithms, some of which are discussed later in this section.

Footnote 13: To improve convergence, Kilburg and Gallicchio actually used a soft-core form of the binding energy function [17, 44]. Soft-core functions are critical aspects of alchemical binding free energy calculations.

An alchemical binding free energy profile, $\Delta G(\lambda)$, is defined along the thermodynamic path. It corresponds to the free energy of the intermediate alchemical state at $\lambda$ relative to the uncoupled state ($\lambda = 0$) [18]:

$$\Delta G(\lambda) = -k_B T \ln \langle e^{-\beta \lambda u} \rangle_0, \tag{37}$$

which is Eq. 18 with $u$ replaced by $\lambda u$, the perturbation energy of the alchemical state at $\lambda$.
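As a purely illustrative sketch of Eqs. 35 and 37, consider a one-dimensional toy model that is not part of the work reviewed here: $U_0(x) = x^2/2$ and $u(x) = a x$ in units where $k_B T = 1$. Each alchemical state is then a Gaussian that can be sampled exactly, and the free energy profile is known analytically, $\Delta G(\lambda) = -\lambda^2 a^2/2$. The function names are hypothetical. The sketch compares the single-step estimate of Eq. 37 at $\lambda = 1$, which uses only samples from the uncoupled state, with the sum of small steps between neighboring λ-states.

```python
import math
import random


def exp_average(values):
    """Return -ln <exp(-v)> for a list of dimensionless perturbation energies v."""
    m = min(values)  # shift by the minimum for numerical stability
    return m - math.log(sum(math.exp(-(v - m)) for v in values) / len(values))


def sample_state(lam, a, n, rng):
    """Exact Boltzmann samples of U_lambda(x) = x^2/2 + lam*a*x with kT = 1.

    Completing the square shows the state is a unit-variance Gaussian
    centered at -lam*a (toy model, assumed for illustration only).
    """
    return [rng.gauss(-lam * a, 1.0) for _ in range(n)]


def direct_estimate(a, n, rng):
    """Single-step estimate of Delta G(lambda=1) from uncoupled-state samples (Eq. 37)."""
    xs = sample_state(0.0, a, n, rng)
    return exp_average([a * x for x in xs])


def stratified_estimate(a, lambdas, n, rng):
    """Sum of exponential averages between neighboring lambda states."""
    total = 0.0
    for lam0, lam1 in zip(lambdas[:-1], lambdas[1:]):
        xs = sample_state(lam0, a, n, rng)  # sample the lower state of each pair
        total += exp_average([(lam1 - lam0) * a * x for x in xs])
    return total


if __name__ == "__main__":
    rng = random.Random(2022)
    a = 4.0                                   # strong perturbation: poor end-state overlap
    lambdas = [i / 10 for i in range(11)]     # hypothetical evenly spaced schedule
    print("exact      :", -a * a / 2)
    print("direct     :", direct_estimate(a, 2000, rng))
    print("stratified :", stratified_estimate(a, lambdas, 2000, rng))
```

With a strong perturbation the direct estimate depends on rare samples in the tail of the uncoupled distribution, whereas each stratified step involves two strongly overlapping states. This is the motivation for stratification in this simplified setting.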
By definition, the excess free energy of binding (Eq. 18) is the difference between the end points of the alchemical binding free energy profile:

$$\Delta G_b = \Delta G(\lambda = 1) - \Delta G(\lambda = 0). \tag{38}$$

In Kilburg and Gallicchio's study, the alchemical path was subdivided into 26 intermediate states, mostly linearly spaced between 0 and 1 except in the region near $\lambda = 0$, which required more closely spaced points. Conformational sampling was conducted at each λ-state by molecular dynamics (MD) (footnote 14) using the alchemical potential energy function (Eq. 36). The binding energy function (Eq. 14) and its gradients were evaluated at each MD time step by first evaluating the potential energy of the complex, $\Psi_{RL}(x_R, x_L, \zeta_L)$, and then displacing the peptide in the implicit solvent medium to a large distance from the protein receptor to evaluate the potential energy $\Psi_R(x_R) + \Psi_L(x_L)$ without protein-peptide interactions (footnote 15). Samples of the decoupled energy $\Psi_0 = \Psi_R(x_R) + \Psi_L(x_L)$ and of the binding energy $u$ were saved at regular intervals at each alchemical state. As discussed in Subheading 3.1.3, these are the inputs for the estimation of the binding free energy profile and of the excess binding free energy through Eq. 38.

3.1.2 Replica-Exchange Conformational Sampling

Stratification implies that an alchemical binding free energy calculation is commonly carried out as a collection of molecular simulations, each with a different alchemical potential energy function (Eq. 35) at one of a series of values of the alchemical progress parameter $\lambda$. The accuracy of alchemical free energy calculations depends heavily on the quality of the conformational sampling at each λ-state. In this context, the challenge of conformational sampling is to generate a diverse set of configurations distributed according to the Boltzmann distribution for the given temperature and potential energy function. It is not sufficient, as in molecular docking, to propose a set of low-energy configurations.
The configurations should also appear according to their probability of occurrence.

Footnote 14: Specifically by replica-exchange molecular dynamics in temperature and λ space, as described in Subheading 3.1.2.

Footnote 15: The ligand displacement approach to compute the alchemical potential energy was made necessary by the many-body nature of the implicit solvation model. As briefly discussed in Subheading 3.2, with pairwise decomposable potentials it is more common for λ to be integrated into the calculation of individual interatomic interaction energies.

Conformational sampling in alchemical simulations is carried out by Monte Carlo and, more often, Molecular Dynamics (MD). MD conformational sampling is limited by the slow time-scales of biomolecular motions, and a host of advanced conformational sampling algorithms have been devised to overcome this limitation [45]. Kilburg and Gallicchio employed two-dimensional replica-exchange conformational sampling in temperature and alchemical spaces [31, 46].

It is useful to consider the problem of sampling intermolecular degrees of freedom (the position and orientation of the ligand relative to the receptor, denoted by $\zeta_L$ above) separately from the sampling of intramolecular degrees of freedom (the individual conformations of the peptide and the receptor, denoted by $x_L$ and $x_R$). The first problem is related to the simulation algorithm's ability to explore all relevant binding modes of the receptor-peptide complex for fixed receptor and peptide conformations. Missing the most stable binding mode would, of course, lead to an underestimate of the binding affinity. The sampling of intermolecular degrees of freedom is straightforward near the decoupled state ($\lambda \simeq 0$), where protein-peptide interactions are weak and the peptide can translate and rotate nearly freely within the binding site volume.
In contrast, because of receptor-peptide interactions, rotations and translations are severely hindered near the coupled state ($\lambda \simeq 1$), where the peptide visits alternative binding modes only very rarely. One solution to this problem is to let the MD thread evolve the system in λ space as well as in conformational space. In this way, new binding modes are formed when λ is small and, if they are sufficiently stable, they are retained when the MD thread visits more strongly coupled states at $\lambda \simeq 1$. Conversely, an MD thread in a metastable binding mode at $\lambda \simeq 1$ has an opportunity to acquire a smaller λ and convert to another binding mode. Of course, the excursions in λ space must be conducted in such a way that a canonical ensemble of conformations is generated at each alchemical state. The replica-exchange algorithm achieves this by evolving as many MD threads as there are alchemical states. At any one point in time, each MD thread is assigned the λ value of a unique alchemical state. The collection of threads, called replicas, forms an ensemble of independent canonical systems with the joint canonical statistical weight function

$$\rho_{RE}(x_1, \ldots, x_n \mid \lambda_1, \ldots, \lambda_n) = \exp\left[-\beta \sum_{j=1}^{n} \Psi_{\lambda_j}(x_j)\right], \tag{39}$$

where $\Psi_\lambda(x)$ is the alchemical potential energy function (Eq. 36), $x_j$ denotes the configuration of replica $j$, and $\lambda_j$ is the value of λ assigned to it. The joint distribution is sampled by alternating updates of the coordinates $x_j$ at a fixed assignment of λ values, which are carried out independently for each replica by conventional constant-temperature MD, with updates of the λ assignments. The latter are performed by proposing permutations of the λ assignments, $\{\lambda_1, \ldots, \lambda_n\} \rightarrow \{\lambda'_1, \ldots, \lambda'_n\}$, at fixed configurations $x_j$ and accepting or rejecting the move using the Metropolis Monte Carlo algorithm based on the ratio of the values of the proposed and original weight functions,

$$\frac{\rho_{RE}(x_1, \ldots, x_n \mid \lambda'_1, \ldots, \lambda'_n)}{\rho_{RE}(x_1, \ldots, x_n \mid \lambda_1, \ldots, \lambda_n)}. \tag{40}$$

There are many variations of replica-exchange, differing in the nature of the replicas, the scheme of permutations of state assignments, and the computational implementation [47]. Schemes that modify the parameters of the potential energy function, such as the one illustrated above, are known in the field as Hamiltonian replica-exchange algorithms [48]. Kilburg and Gallicchio used the Gibbs Independent Sampling Algorithm [17] for Hamiltonian reassignments and an asynchronous implementation [46] of replica-exchange that allows running the collection of replica simulations on heterogeneous and potentially unreliable computational resources such as computational grids [49].

Hamiltonian replica-exchange addresses the sampling of intermolecular degrees of freedom. However, because λ couples receptor-peptide interactions, it has only an indirect influence on the rate at which intramolecular degrees of freedom are sampled. Peptides are very flexible and often change conformation upon binding. They often interact with the protein over an extended surface and cause substantial induced-fit reorganization of the receptor. Conformational rearrangements of peptides occur very slowly at room temperature, especially for the cyclic peptides investigated in this study. The temperature replica-exchange algorithm, one of the first versions of replica-exchange proposed [50], is very useful for accelerating the sampling of the conformational space of peptides and proteins [51, 52] and is applicable to free energy calculations [53]. Kilburg and Gallicchio adopted a two-dimensional replica-exchange scheme in which both the λ and the temperature assignments undergo permutations. The joint canonical weight is generalized as

$$\rho_{RE}[x_1, \ldots, x_n \mid (\beta, \lambda)_1, \ldots, (\beta, \lambda)_n] = \exp\left[-\sum_{j=1}^{n} \beta_j \Psi_{\lambda_j}(x_j)\right], \tag{41}$$

where $\beta_j$ and $\lambda_j$ are the inverse temperature and the value of λ assigned to replica $j$, and $(\beta, \lambda)$ is one of the $n$ pair combinations of a set of inverse temperatures and alchemical states. Kilburg and Gallicchio employed 8 temperatures between 300 and 379 K and 26 alchemical states, for a total of 208 replicas for each protein-peptide complex. The multi-dimensional replica-exchange algorithm made it possible to explore simultaneously multiple conformations of the peptide and multiple binding modes of each conformation.

3.1.3 Multi-State Free Energy Estimation

While Eq. 18 is formally correct, it is not an optimal free energy estimator. Optimal here refers to a free energy estimator's ability to return, from a given finite set of samples, a free energy estimate with the smallest bias relative to the true free energy (accuracy) and the smallest variance (precision). Kilburg and Gallicchio employed the Unbinned Weighted Histogram Analysis Method (UWHAM) estimator [44], which is considered an optimal free energy estimator when no information about the system is known other than the samples from the molecular simulations. The statistical and mathematical origins of the method [44, 54] are beyond the scope of this chapter. The main idea is to arrive at an estimate of the free energy $\Delta G(\lambda)$ (Eq. 37) at λ by using the data collected at all λ-states. UWHAM can be interpreted as an extension of the familiar Weighted Histogram Analysis Method (WHAM) [55] applied to Eq. 19 for the maximum likelihood estimation of the distribution of binding energies in the uncoupled ensemble, $p_0(u)$, from the corresponding distributions along the alchemical path, $p_\lambda(u)$. In this case, Kilburg and Gallicchio collected data as a function of temperature as well as λ on a grid of 208 states.
UWHAM provides, in this case, optimal estimates of the dimensionless free energy factor for each state, defined, up to an additive constant (footnote 16), as

$$F_r = \ln z_{RL}(\beta_r, \lambda_r), \tag{42}$$

where $\beta_r$ and $\lambda_r$ are the values of the inverse temperature and of the alchemical progress parameter of state $r$ and $z_{RL}(\beta, \lambda)$ is defined by Eq. 10. Given the free energy factors, the free energy profile as a function of temperature and λ is given by Eq. 37, or (footnote 17)

$$\Delta G(\beta_r, \lambda_r) = -k_B T F_r. \tag{43}$$

The dimensionless free energy factors minimize the convex objective function [44] (footnote 18)

$$\frac{1}{N} \sum_{s=1}^{N} \ln\left[\sum_{r=1}^{n} \frac{N_r}{N}\, e^{-F_r} e^{-v_{rs}}\right] + \sum_{r=1}^{n} \frac{N_r}{N} F_r, \tag{44}$$

where $N$ is the total number of samples collected at any of the $n$ states, $N_r$ is the number of samples collected at state $r$, and

$$v_{rs} = \beta_r \left[\Psi_{0,s} + \lambda_r u_s\right] \tag{45}$$

is the dimensionless energy of sample $s$ in state $r$, where $\Psi_{0,s}$ and $u_s$ are, respectively, the values of the decoupled potential energy and of the binding energy of the sample collected during the replica-exchange alchemical simulations. The UWHAM optimizer implemented in the statistical program R was used to obtain the dimensionless free energy factors (cran.r-project.org/web/packages/UWHAM) (footnote 19).

Footnote 16: Note that, because $z_{RL}$ is not dimensionless, the ambiguity of the additive constant is also related to the arbitrariness of the units chosen to evaluate the logarithm.

Footnote 17: Because the free energy estimates are known up to a temperature-dependent additive factor, differences between free energies at different temperatures are generally meaningless. However, differences along λ at different temperatures can be compared. For example, the binding free energy at one temperature, $\Delta G_b(\beta) = \Delta G(\beta, \lambda = 1) - \Delta G(\beta, \lambda = 0)$, can be compared to the binding free energy estimate at a different temperature to, for example, estimate the binding entropy.

Footnote 18: The convexity property guarantees that there is a unique minimum.
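The minimization of an objective of the form of Eq. 44 can be sketched with plain gradient descent on a toy problem. This is not the R UWHAM optimizer used in the study. The sketch assumes the sign convention in which $F_r$ is the logarithm of the partition function of state $r$, and it replaces the replica-exchange data with exact samples from three one-dimensional harmonic states with dimensionless energy $k_r x^2/2$, for which $F_r - F_1 = \frac{1}{2}\ln(k_1/k_r)$. All function names, the step size, and the number of iterations are hypothetical choices.

```python
import math
import random


def uwham_objective_and_gradient(F, v, counts):
    """Objective of the form of Eq. 44 and its gradient with respect to F.

    v[r][s] is the dimensionless energy of pooled sample s evaluated in state r;
    counts[r] is the number of samples that were drawn from state r.
    """
    n_states, n_total = len(F), len(v[0])
    value = sum(counts[r] / n_total * F[r] for r in range(n_states))
    grad = [counts[r] / n_total for r in range(n_states)]
    for s in range(n_total):
        # ln[(N_r/N) exp(-F_r - v_rs)], shifted by the maximum for stability
        logw = [math.log(counts[r] / n_total) - F[r] - v[r][s] for r in range(n_states)]
        m = max(logw)
        w = [math.exp(x - m) for x in logw]
        tot = sum(w)
        value += (m + math.log(tot)) / n_total
        for r in range(n_states):
            grad[r] -= w[r] / tot / n_total
    return value, grad


def minimize_uwham(v, counts, step=2.0, n_iter=100):
    """Fixed-step gradient descent; F[0] is pinned to zero (additive constant)."""
    F = [0.0] * len(counts)
    for _ in range(n_iter):
        _, grad = uwham_objective_and_gradient(F, v, counts)
        F = [f - step * g for f, g in zip(F, grad)]
        F = [f - F[0] for f in F]
    return F


def harmonic_demo(seed=0, n_per_state=300, force_constants=(1.0, 2.0, 4.0)):
    """Toy data: exact Gaussian samples from each harmonic state, pooled together."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0 / math.sqrt(k)) for k in force_constants for _ in range(n_per_state)]
    v = [[0.5 * k * x * x for x in xs] for k in force_constants]
    estimate = minimize_uwham(v, [n_per_state] * len(force_constants))
    exact = [0.5 * math.log(force_constants[0] / k) for k in force_constants]
    return estimate, exact


if __name__ == "__main__":
    estimate, exact = harmonic_demo()
    for r, (f_est, f_ref) in enumerate(zip(estimate, exact)):
        print(f"state {r}: estimated F = {f_est:.3f}, exact F = {f_ref:.3f}")
```

Setting the gradient computed above to zero reproduces the self-consistent form of Eq. 46, so a fixed-point iteration is an equally valid way to reach the same minimum. Gradient descent is used here only because it makes the role of the objective function explicit.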
Note that setting to zero the gradient of the UWHAM objective function leads to the self-consistent equations

$$f_r^{-1} = \sum_{s=1}^{N} \frac{e^{-v_{rs}}}{\sum_{r'=1}^{n} N_{r'} f_{r'}\, e^{-v_{r's}}}, \tag{46}$$

where $f_r = e^{-F_r}$. Equation 46 is the basis of the equivalent Multi-state Bennett Acceptance Ratio (MBAR) method to obtain the free energy factors [57]. The UWHAM formulation of multi-state reweighting has been found to be more generalizable than MBAR's [56]. For example, it has been recently employed to impose global restraints on the free energy solutions [58].

3.2 Effect of Mutations on the Binding Affinity of Peptides to PDZ Protein Domains

PDZ protein domains are widespread protein-protein interaction modules. They specifically recognize the 4 to 8 amino acids at the C-terminus of proteins. Peptides and peptide derivatives that mimic these binding motifs are investigated as potential therapeutics for many diseases [59]. Panel et al. [60] studied the binding free energies between the TIAM1 PDZ domain and a series of peptides derived from its syndecan-1 and caspr4 protein targets (Fig. 4) using an alchemical relative binding free energy computational method generally known in the field as Free Energy Perturbation (FEP) [62, 63]. The study's goal was to validate the methodology for protein-peptide binding and to obtain physical and structural insights into the recognition mechanisms that allow PDZ domains to target specific sequences.

3.2.1 Theory of Relative Binding Free Energy Calculations

The dataset considered by Panel et al. [60] included the TIAM1 PDZ domain bound to the wild-type peptides and to a series of single and double mutants. As discussed in Subheading 2.3.1, peptides are generally too large and complex to be studied by double-decoupling absolute binding free energy calculations with explicit solvation. Instead, the study employed a relative FEP method that yields the difference between the binding free energy of a peptide and that of a reference peptide.
The approach is based on the thermodynamic cycle illustrated in Fig. 5. The reference peptide $L_1$ is alchemically transformed into a mutant $L_2$ both when bound to the receptor and when free in water. The difference between the free energies $\Delta G_{\rm bound}$ and $\Delta G_{\rm solv}$ of these two processes yields the difference in the binding free energy of the two complexes. Therefore, the method allows probing the effect of different mutations on the binding affinity between the peptide and the receptor.

Fig. 4 The 4GVD crystal structure of the complex between the TIAM1 PDZ domain (multi-color ribbons) and the pTKQEEFYA peptide (red tube) [61]

Fig. 5 The thermodynamic cycle used in the relative free energy perturbation method. The vertical transformations correspond to the association equilibrium between the receptor R and one of two ligands $L_1$ and $L_2$. The horizontal legs correspond to the alchemical transformation of one ligand into the other alone in solution (top) or in the complex (bottom)

Footnote 19: Ding, Vilseck, and Brooks [56] developed a GPU implementation of UWHAM called FastMBAR (github.com/xqding/FastMBAR).

The statistical mechanics formula at the basis of this approach can be derived, for example, from Eq. 20 by considering the ratio of the binding constants $K_b(2)$ and $K_b(1)$ for the $RL_2$ and $RL_1$ complexes, respectively. When taking the ratio, the constant factors and the partition functions of the solvent, of the receptor in the solvent, and of the ligands in vacuum cancel, yielding

$$\frac{K_b(2)}{K_b(1)} = \frac{Z_{N,RL_2}\, Z_{N,L_1}}{Z_{N,RL_1}\, Z_{N,L_2}} = e^{-\beta\left[\Delta G_{\rm bound} - \Delta G_{\rm solv}\right]}, \tag{47}$$

where the ratio of partition functions involving the receptor corresponds to the free energy difference $\Delta G_{\rm bound}$ between the complex with ligand $L_2$ in the solvent and the same system but with ligand $L_2$ replaced by $L_1$.
Similarly, the ratio of partition functions of the ligands in solution corresponds to the free energy difference $\Delta G_{\rm solv}$ (footnote 20). Finally, using Eq. 2, we obtain

$$\Delta\Delta G_b := \Delta G_b(2) - \Delta G_b(1) = \Delta G_{\rm bound} - \Delta G_{\rm solv}, \tag{48}$$

which is the key formula of the relative binding FEP method.

Let us now turn to the evaluation of $\Delta G_{\rm bound}$ and $\Delta G_{\rm solv}$ by alchemical computer simulations. As usual, the strategy is to compute ratios of partition functions as ensemble averages. However, the expression

$$\frac{Z_{N,L_2}}{Z_{N,L_1}} = \frac{\int dx_{L_2}\, dr_v^N\, e^{-\beta U(x_{L_2}, r_v^N)}}{\int dx_{L_1}\, dr_v^N\, e^{-\beta U(x_{L_1}, r_v^N)}}, \tag{49}$$

for example, cannot be directly turned into the form of an ensemble average because, in general, the number and kind of the internal degrees of freedom of the two ligands differ. Panel et al. [60] adopted the so-called dual-topology strategy to address this issue (footnote 21), in which the simulation is conducted with a hybrid peptide in which the wild-type, say, and the mutated amino acid sidechains are both represented at the same time (Fig. 6). The alchemical potential energy function is constructed so that the environment (the water solution or the solvated receptor) interacts with the atoms of both forms of the sidechain with a strength that depends on the alchemical charging parameter λ. Similarly, the intramolecular potential energy function is designed so that the atoms of the peptide backbone interact by bond stretching, bond angle, torsional, and 1,4 non-bonded interactions with both forms of the sidechain. The atoms of the two forms of the sidechain being mutated never interact directly with each other.

Formally, the dual-topology approach is derived from Eq. 47 by multiplying and dividing each term by an appropriate partition function that introduces the additional degrees of freedom needed to turn each peptide into the hybrid peptide with both forms of the sidechain. For example, if the $Z_{N,L_1}$ term represents the peptide with the phenylalanine (PHE) sidechain in solution (Fig.
6, red), multiplying and combining it with

$$Z_{\rm ILE} = \int d\zeta_{\rm ILE}\, dx_{\rm ILE}\, e^{-\beta U(\zeta_{\rm ILE})}\, e^{-\beta U(x_{\rm ILE})}, \tag{50}$$

where, as in Eq. 10, $\zeta_{\rm ILE}$ represents the six external coordinates that specify the position and orientation of the added isoleucine (ILE) sidechain relative to the peptide backbone, $U(\zeta_{\rm ILE})$ (footnote 22) represents the potential energy terms that anchor the ILE sidechain to the peptide backbone (footnote 23), $x_{\rm ILE}$ represents the other internal degrees of freedom of the ILE sidechain, and $U(x_{\rm ILE})$ represents the intramolecular potential energy function that couples the atoms of ILE together (footnote 24), transforms it into the partition function, which we will denote by $Z_{N,L_{1(2)}}$, of the hybrid peptide in the PHE state in which the ILE sidechain is "turned off," by which we mean that the ILE sidechain interacts only with the backbone through the $U(\zeta_{\rm ILE})$ potential and does not otherwise interact with the environment.

Fig. 6 Representation of the dual-topology alchemical mutation of a phenylalanine (PHE, red) to isoleucine (ILE, green) of the TKQEEFYA peptide considered by Panel et al. [60]. The illustration shows the peptide in solution. A similar transformation is applied to the peptide bound to the PDZ domain

Footnote 20: Comparing the free energies of systems with different atomic composition and number of degrees of freedom is arguably physically meaningless at this level of theory. However, note that the overall ratio of partition functions in Eq. 47 is physically well defined. It represents the free energy difference between two systems, the first composed of two solutions, one containing the complex with $L_2$ and the other containing $L_1$, and the second in which $L_2$ and $L_1$ have swapped places. Evidently, the free energy difference $\Delta G_{\rm bound} - \Delta G_{\rm solv}$, which is the target of the theory, is physically well defined even though the individual components may not be.

Footnote 21: There is an analogous single-topology strategy [64], which we do not discuss here.
The same procedure applied to the partition function $Z_{N,RL_1}$ of the complex of the original peptide bound to the receptor, in the denominator of Eq. 47, yields the partition function $Z_{N,RL_{1(2)}}$ of the hybrid peptide in the PHE state bound to the receptor. Similarly, multiplying and dividing by the term $Z_{\rm PHE}$, analogous to Eq. 50, to install a PHE sidechain onto the peptide with the ILE sidechain yields the partition functions $Z_{N,L_{(1)2}}$ and $Z_{N,RL_{(1)2}}$ for the hybrid peptides in solution and bound to the receptor in their ILE states.

Footnote 22: As explained there, this function acquires an "SD" superscript in the next section.

Footnote 23: Other attachment modalities, including to the β carbon, are possible.

Footnote 24: As further discussed later, here we have explicitly singled out the ζ degrees of freedom that couple the added sidechain to the backbone to emphasize that they must be appropriately chosen, using, for example, the scheme described by Boresch and Karplus [10], to avoid introducing spurious indirect interactions between backbone atoms that would affect the conformational distribution of the original peptide [65].

With these preparations, Eq. 47 is finally rewritten as

$$\frac{K_b(2)}{K_b(1)} = \frac{Z_{N,RL_{(1)2}}\, Z_{N,L_{1(2)}}}{Z_{N,RL_{1(2)}}\, Z_{N,L_{(1)2}}} = e^{-\beta \Delta G_{\rm bound}}\, e^{+\beta \Delta G_{\rm solv}}, \tag{51}$$

where

$$\Delta G_{\rm bound} = -k_B T \ln \frac{Z_{N,RL_{(1)2}}}{Z_{N,RL_{1(2)}}} = -k_B T \ln \langle e^{-\beta u_2} \rangle_1, \tag{52}$$

where $u_2$ is the change in potential energy of the system, for a given configuration of the solvated complex with the hybrid peptide, due to, in this example, turning off the PHE sidechain and turning on the ILE sidechain, and $\langle \ldots \rangle_1$ represents the average over the ensemble in which the PHE sidechain is on and the ILE sidechain is off. An analogous ensemble average gives $\Delta G_{\rm solv}$ for the transformation of PHE into ILE in solution.
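The exponential average of Eq. 52 and its combination into the relative binding free energy of Eq. 48 can be sketched as follows. The perturbation energy samples are made-up numbers, not data from Panel et al., and the function names are hypothetical. The single-step average is shown only because it is the formal expression. In practice the transformation is stratified over intermediate λ-states, as described in Subheading 3.2.2.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def exp_average_free_energy(u_samples, temperature=300.0):
    """Return -kT ln <exp(-u/kT)> for perturbation energies u in kcal/mol (Eq. 52 form)."""
    kt = KB * temperature
    u_min = min(u_samples)  # factor out the smallest energy to avoid overflow
    mean_boltz = sum(math.exp(-(u - u_min) / kt) for u in u_samples) / len(u_samples)
    return u_min - kt * math.log(mean_boltz)


def relative_binding_free_energy(u_bound, u_solv, temperature=300.0):
    """Delta Delta G_b = Delta G_bound - Delta G_solv (Eq. 48)."""
    dg_bound = exp_average_free_energy(u_bound, temperature)
    dg_solv = exp_average_free_energy(u_solv, temperature)
    return dg_bound - dg_solv


if __name__ == "__main__":
    # Hypothetical perturbation energies (kcal/mol) sampled in the reference state
    u_bound = [1.2, 0.4, 2.1, 0.9, 1.6]
    u_solv = [2.3, 1.9, 3.0, 2.2, 2.7]
    print("ddG_b (kcal/mol):", relative_binding_free_energy(u_bound, u_solv))
```

Because the exponential average weights low-energy samples most heavily, the result is always at or below the arithmetic mean of the perturbation energies. This is one reason why the estimator converges slowly when low-energy configurations are rarely sampled.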
3.2.2 Alchemical Transformations for Relative Binding Free Energies

As discussed in Subheadings 2.3 and 3.1.1, the free energies $\Delta G_{\rm solv}$ and $\Delta G_{\rm bound}$ for mutating one sidechain into another are calculated in practice using a hybrid alchemical potential energy function $U_\lambda(x)$ parametrized by a progress parameter λ. Panel et al. [60] used the NAMD molecular simulation package [66], which implements the alchemical potential [65]

$$U_\lambda(x) = U_{L_{12}}(x) + (1 - \lambda)\, U_{L_1}(x, 1 - \lambda) + \lambda\, U_{L_2}(x, \lambda), \tag{53}$$

where $x$ is the collection of all of the degrees of freedom of the dual-topology peptide system and $U_{L_{12}}(x)$ contains the potential energy terms that do not depend on λ,

$$U_{L_{12}}(x) = U_0(x) + U^{SD}_{L_1}(\zeta_1) + U^{SD}_{L_2}(\zeta_2), \tag{54}$$

where $U_0(x)$ is the unperturbed component of the potential energy (including the intramolecular potential energy terms of the dual-topology sidechains not affected by the transformation, but excluding interactions between the two sidechains), the terms $U^{SD}_{L_i}(\zeta_i)$ represent the auxiliary restraints used in the dual-topology scheme to anchor each sidechain to the backbone (see Eq. 50), and

$$U_{L_i}(x, \lambda) = U^{NB}_{L_i}(x, \lambda) + U^{SS}_{L_i}(x) + U^{SD}_{L_i}(x), \tag{55}$$

where $U^{NB}_{L_i}$ denotes the non-bonded interactions between the sidechain atoms and the environment, $U^{SS}_{L_i}$ denotes the bonded interactions (1–2, 1–3, and 1–4) among backbone atoms with sidechain $i$, and $U^{SD}_{L_i}$ is the corresponding term for bonded interactions between the backbone atoms and the sidechain (footnote 25).

Footnote 25: The S symbol stands for the single-topology region (the backbone in this case), and D stands for the dual-topology region (the two sidechains) [65, 67].

As illustrated by Eq. 55, the non-bonded component has an explicit λ dependence due to the use of separation-shifted soft-core pair potentials [65, 67] to describe the non-bonded interactions between the dual-topology sidechains and the rest of the system. It is straightforward to see that Eq.
53 evaluated at $\lambda = 0$ describes the $L_{1(2)}$ state of the dual-topology peptide, with sidechain 2 turned off, and, conversely, that $\lambda = 1$ describes the $L_{(1)2}$ state. Panel et al. [60] simulated 11 alchemical states from $\lambda = 0$ to $\lambda = 1$. The change in free energy from $\lambda_r$ to $\lambda_{r+1}$ was evaluated using the Bennett Acceptance Ratio (BAR) method, which is MBAR (Eq. 46) for two states and where, in this case,

$$v_{rs} = \beta\, U_{\lambda_r}(x_s) \tag{56}$$

is the alchemical potential energy at $\lambda_r$ of the conformational sample $x_s$ collected at either $\lambda_r$ or $\lambda_{r+1}$ (footnote 26).

3.3 Potential of Mean Force Study of the Binding of the MEEVD Peptide to the TPR2A Receptor

The heat shock organizing protein (Hop) binds specifically to the heat shock protein Hsp90 through its tetratricopeptide repeat (TPR) domain TPR2A. TPR modules are widespread protein domains responsible for the specific recognition patterns of many proteins. Due to their molecular recognition characteristics, engineered TPR domains are seen as potential alternatives to antibody-derived biological medicines. Lapelosa [22] studied the binding of the MEEVD peptide from Hsp90 to the TPR2A domain of Hop (Fig. 7) using the potential of mean force methodology outlined in Subheading 2.4. The work yielded an estimate of the standard free energy of binding between TPR2A and MEEVD in good agreement with experimental measurements. It also provided structural insights into the mechanism of entry and exit of the peptide from the receptor binding site.

3.3.1 Calculation of the Standard Binding Free Energy

Lapelosa [22] computed a one-dimensional radial potential of mean force (PMF), $\Delta F(r)$, along the center of mass separation $r$ between the receptor and the peptide (Fig. 7) using the Adaptive Biasing Force (ABF) method described in the next section. The PMF was then employed to compute the free energy of binding. The expression of the binding constant in terms of the radial PMF is derived from Eqs.
33 and 34 by expressing the integral in terms of spherical polar coordinates $(r, \theta, \phi)$, where $r$ is the distance between the centers of mass of the receptor and the peptide, $\theta$ is the angle between the line connecting the centers of mass and the axis connecting the C-α atoms of two chosen residues of the receptor, and $\phi$ is an azimuthal angle (which can be considered arbitrary because neither the conical sampling region nor the binding site region depends on it). Following a procedure similar to the one that yielded Eq. 33 from Eq. 26, we carry out the integration in Eq. 33 over the $\theta$ and $\phi$ coordinates to obtain

$$K_b = C^\circ \int_0^{r_b} dr\, r^2\, \frac{p(r)}{p(r^*, \theta^*, \phi^*)}, \tag{57}$$

where $p(r^*, \theta^*, \phi^*) = p(r_L^*)$, and

$$p(r) = \int_{\cos\theta_0}^{1} d(\cos\theta) \int_0^{2\pi} d\phi\; p(r, \theta, \phi)$$

is the polar angle-averaged probability density of the ligand position in the conical region. Considering the value of the radial probability density at distance $r^*$ in the conical region far away from the receptor and integrated over the polar angles,

$$p(r^*) = 2\pi (1 - \cos\theta_0)\, p(r^*, \theta^*, \phi^*), \tag{58}$$

we finally obtain (see footnote 27)

$$K_b = 2\pi (1 - \cos\theta_0)\, C^\circ \int_0^{r_b} dr\, r^2\, e^{-\beta \Delta F(r)}, \tag{59}$$

where $r_b = 20$ Å is the limiting radial distance of the binding site region, $\theta_0 = 60^\circ$ is the angle of aperture of the cone, and

$$\Delta F(r) = -k_B T \ln \frac{p(r)}{p(r^*)} \tag{60}$$

is the radial PMF relative to the bulk distance $r^* = 30$ Å.

Footnote 26: The numerator and the denominator of Eq. 46 are often combined to cast the formula in terms of energy differences $v_{r's} - v_{rs}$.

Footnote 27: Probably because of a typo, the $2\pi$ factor is missing in the corresponding expression (Eq. 2) of the paper by Lapelosa [22].

Fig. 7 Illustration of the calculation of the binding free energy of the complex between the Hop TPR2A domain (multi-colored ribbon) and the MEEVD peptide (red and pink tubes). The MEEVD peptide is shown in its position in the crystal structure (PDB id: 1ELR [68], pink tube) and in a representative position and orientation (red tube) within the simulation cone (the yellow shaded region). The potential of mean force is collected along the distance (black arrow) between the center of mass of the receptor and the center of mass of the peptide while the peptide is kept within the cone. The arc across the cone delineates the binding site region $r < r_b$

3.3.2 Calculation of the Potential of Mean Force Using the Adaptive Biasing Force Method

The peptide's radial PMF, $\Delta F(r)$, was evaluated using the Adaptive Biasing Force (ABF) method [69]. ABF serves the dual purpose of accelerating the sampling of the peptide positions relative to the receptor and providing an estimate of the PMF. ABF introduces a fictitious biasing force $f_b(r)$ along the radial direction such that the observed distribution of distances with the addition of the biasing force, $p_{\rm obs}(r)$, is flat within the sampling region (in this case the region within the cone illustrated in Fig. 7, with a $\theta_0 = 60^\circ$ angle of aperture and up to $r < r^* = 30$ Å). A derivation of ABF is beyond the scope of this chapter. To motivate it, however, first note that differentiation of Eq. 30 leads to the conclusion that the gradient of the PMF with respect to $\zeta_L$ is the average gradient of the system potential energy function,

$$\frac{\partial \Delta F(\zeta_L)}{\partial \zeta_L} = \left\langle \frac{\partial U}{\partial \zeta_L} \right\rangle_{\zeta_L}, \tag{61}$$

where $U$ is the potential energy function of the solvated system and $\langle \ldots \rangle_{\zeta_L}$ represents an ensemble average at fixed $\zeta_L$. In other words, the negative of the gradient of the PMF is the system force averaged over the degrees of freedom of the system other than those along which the PMF is defined, thereby justifying the name potential of mean force for $\Delta F(\zeta_L)$.
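Eqs. 59 to 61 can be connected numerically. The following is a minimal sketch, not the procedure used by Lapelosa [22]: it assumes that a radial PMF gradient is available on a grid (with any Jacobian contribution of the radial coordinate already accounted for), integrates it to a PMF that vanishes at the bulk distance, and evaluates Eq. 59. The function names, the temperature of 300 K, the units (kcal/mol and Å), and the synthetic gradient are all illustrative assumptions; the relation $\Delta G^\circ = -k_B T \ln K_b$ is the standard one.

```python
import numpy as np

KB = 0.0019872041   # Boltzmann constant in kcal/(mol K)
C0 = 1.0 / 1660.5   # standard concentration (1 M) in molecules per cubic angstrom


def pmf_from_gradient(r, dF_dr):
    """Integrate dF/dr with the trapezoid rule; the PMF is zeroed at the last grid point (bulk)."""
    steps = 0.5 * (dF_dr[1:] + dF_dr[:-1]) * np.diff(r)
    pmf = np.concatenate(([0.0], np.cumsum(steps)))
    return pmf - pmf[-1]


def binding_constant(r, pmf, r_b=20.0, theta0_deg=60.0, temperature=300.0):
    """Eq. 59: K_b = 2 pi (1 - cos theta0) C0 * integral_0^{r_b} r^2 exp(-beta F(r)) dr."""
    beta = 1.0 / (KB * temperature)
    inside = r <= r_b + 1e-9                      # grid points in the binding site region
    ri, fi = r[inside], pmf[inside]
    integrand = ri**2 * np.exp(-beta * fi)
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(ri))
    cone = 2.0 * np.pi * (1.0 - np.cos(np.radians(theta0_deg)))
    return cone * C0 * integral


def standard_binding_free_energy(k_b, temperature=300.0):
    """Standard relation: Delta G = -k_B T ln K_b."""
    return -KB * temperature * np.log(k_b)


# Synthetic (made-up) gradient producing a single well near r = 8 angstroms.
r = np.linspace(0.0, 30.0, 3001)
gradient = 0.8 * (r - 8.0) * np.exp(-((r - 8.0) / 4.0) ** 2)
pmf = pmf_from_gradient(r, gradient)
k_b = binding_constant(r, pmf)
dg = standard_binding_free_energy(k_b)
```

With a flat PMF the integral reduces to the volume of the conical binding site region times $C^\circ$, which gives a quick sanity check of the prefactor.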
The conclusion that the negative gradient of the PMF is a mean force also applies to forms of the PMF averaged over some coordinates such as ligand orientations (Eq. 29), including the 1-dimensional radial PMF, $\Delta F(r)$, considered in the work of Lapelosa (see footnote 28). Also, note that the PMF along a coordinate is proportional to the logarithm of the probability distribution for that coordinate (Eq. 29). Thus, a flat distribution indicates that the overall force along the coordinate, that is, the mean force plus the biasing force, is zero or, equivalently, that the added biasing force is equal and opposite to the mean force. This implies that the potential of mean force can be obtained by integrating the biasing force that flattens the radial distribution. The additional benefit of having a flat distribution is that the dynamics along the chosen coordinate are more likely to be diffusive and not impeded by free energy barriers. Indeed, several independent binding/unbinding events have been reported in the study by Lapelosa [22].

Footnote 28: In this case the radial force is interpreted in terms of the force of a central potential, and Eq. 61 has additional terms due to the Jacobian of the radial coordinate [69].

4 Conclusion

This chapter has shown how a statistical mechanics formulation of non-covalent molecular association from first principles gives rise to different computational methods to estimate the binding free energies of protein-peptide complexes. The three case studies illustrate the application of each method to particular molecular complexes and how they are tailored to achieve specific goals. It is much more challenging to apply rigorous binding free energy estimation methods to protein-peptide complexes than to small-molecule binding. We hope that this chapter illustrates how a good appreciation of the underlying theories and their computational implementations helps in understanding the practices connected with each approach and its strengths and limitations.

Acknowledgement

E.G.
acknowledges support from the National Science Foundation (NSF CAREER 1750511). References 1. Kastritis PL, Bonvin AM (2013) On the binding affinity of macromolecular interactions: daring to ask why proteins interact. J Roy Soc Interface 10(79):20120835 2. Kilburg D, Gallicchio E (2016) Recent advances in computational models for the study protein-peptide interactions. Adv Prot Chem Struct Biol 105:27–57 3. D’Annessa I, Di Leva FS, La Teana A, Novellino E, Limongelli V, Di Marino D (2020) Bioinformatics and biosimulations as toolbox for peptides and peptidomimetics design: where are we? Front Molecular Biosci, 7 4. Mihailescu M, Gilson MK (2004) On the theory of noncovalent binding. Biophys J 87: 23–36 5. Gibb CL, Gibb BC (2013) Binding of cyclic carboxylates to octa-acid deep-cavity cavitand. J Comp Aided Mol Des 28:1–7 6. Judy E, Kishore N (2020) Discrepancies in thermodynamic information obtained from calorimetry and spectroscopy in ligand binding reactions: Implications on correct analysis in systems of biological importance. Bullet Chem Soc Jpn 94 7. Simonson T (2016) The physical basis of ligand binding. In: In silico drug discovery and design, pp 3–43 8. Tsiang M, Jones GS, Hung M, Mukund S, Han B, Liu X, Babaoglu K, Lansdon E, Chen X, Todd J, Cai T, Pagratis N, Sakowicz R, Geleziunas R (2009) Affinities between the binding partners of the HIV-1 integrase dimer-lens epithelium-derived growth factor (IN dimer-LEDGF) complex. J Biol Chem 284(48):33580–33599 9. Ranganathan A, Heine P, Rudling A, Pluckthun A, Kummer L, Carlsson J (2017) Ligand discovery for a peptide-binding GPCR by structure-based screening of fragment-and lead-like chemical libraries. ACS Chem Biol 12(3):735–745 10. Boresch S, Tettinger F, Leitgeb M, Karplus M (2003) Absolute binding free energies: a quantitative approach for their calculation. J Phys Chem B 107:9535–9551 11. 
Marcotrigiano J, Gingras A-C, Sonenberg N, Burley SK (1999) Cap-dependent translation initiation in eukaryotes is regulated by a molecular mimic of eIF4G. Mol Cell 3(6):707–716 12. Wysocka J (2006) Identifying novel proteins recognizing histone modifications using peptide pull-down assay. Methods 40(4):339–343 13. Gilson MK, Given JA, Bush BL, McCammon JA (1997) The statistical-thermodynamic basis for computation of binding affinities: a critical review. Biophys J 72:1047–1069 14. Gallicchio E, Levy RM (2011) Recent theoretical and computational advances for modeling protein-ligand binding affinities. Adv Prot Chem Struct Biol 85:27–80 15. Hill TL (1986) An introduction to statistical thermodynamics. Dover, New York 16. Roux B, Simonson T (1999) Implicit solvent models. Biophys Chem 78:1–20 17. Pal RK, Gallicchio E (2019) Perturbation potentials to overcome order/disorder transitions in alchemical binding free energy calculations. J Chem Phys 151(12):124116 18. Gallicchio E, Lapelosa M, Levy RM (2010) Binding energy distribution analysis method (BEDAM) for estimation of protein-ligand binding affinities. J Chem Theory Comput 6:2961–2977 19. Ben Naim A (1974) Water and aqueous solutions. Plenum, New York 20. Gallicchio E, Kubo MM, Levy RM (1998) Entropy-enthalpy compensation in solvation and ligand binding revisited. J Am Chem Soc 120:4526–4527 21. Limongelli V, Bonomi M, Parrinello M (2013) Funnel metadynamics as accurate binding free-energy method. Proc Natl Acad Sci 110(16):6358–6363 22. Lapelosa M (2017) Free energy of binding and mechanism of interaction for the MEEVD-TPR2A peptide–protein complex. J Chem Theory Comput 13(9):4514–4523 23. Cruz J, Wickstrom L, Yang D, Gallicchio E, Deng N (2020) Combining alchemical transformation with a physical pathway to accelerate absolute binding free energy calculations of charged ligands to enclosed binding sites. J Chem Theory Comput 16(4):2803–2813 24.
Cherepanov P, Maertens G, Proost P, Devreese B, Van Beeumen J, Engelborghs Y, De Clercq E, Debyser Z (2003) HIV-1 integrase forms stable tetramers and associates with LEDGF/p75 protein in human cells. J Biol Chem 278(1):372–381 25. Peat TS, Rhodes DI, Vandegraaff N, Le G, Smith JA, Clark LJ, Jones ED, Coates JA, Thienthong N, Newman J, et al (2012) Small molecule inhibitors of the LEDGF site of human immunodeficiency virus integrase identified by fragment screening and structure based design. PloS One 7:e40147 26. Fader LD, Malenfant E, Parisien M, Carson R, Bilodeau F, Landry S, Pesant M, Brochu C, Morin S, Chabot C, et al (2014). Discovery of BI 224436, a noncatalytic site integrase inhibitor (NCINI) of HIV-1. ACS Med Chem Lett 5(4):422–427 27. Zhang F-H, Debnath B, Xu Z-L, Yang L-M, Song L-R, Zheng Y-T, Neamati N, Long Y-Q (2017) Discovery of novel 3-hydroxypicolinamides as selective inhibitors of HIV-1 integrase-LEDGF/p75 interaction. Eur J Med Chem 125:1051–1063 28. Cherepanov P, Ambrosio AL, Rahman S, Ellenberger T, Engelman A (2005) Structural basis for the recognition between HIV-1 integrase and transcriptional coactivator p75. Proc Natl Acad Sci 102(48):17308–17313 29. Rhodes DI, Peat TS, Vandegraaff N, Jeevarajah D, Newman J, Martyn J, Coates JAV, Ede NJ, Rea P, Deadman JJ (2011) Crystal structures of novel allosteric peptide inhibitors of HIV integrase identify new interactions at the LEDGF binding site. ChemBioChem 12(15):2311–2315 30. Gallicchio E, Deng N, He P, Perryman AL, Santiago DN, Forli S, Olson AJ, Levy RM (2014) Virtual screening of integrase inhibitors by large scale binding free energy calculations: the SAMPL4 challenge. J Comp Aided Mol Des 28:475–490 31. Kilburg D, Gallicchio E (2018) Assessment of a single decoupling alchemical approach for the calculation of the absolute binding free energies of protein-peptide complexes. Front Molecular Biosci 5:22 32. 
Lapelosa M, Gallicchio E, Levy RM (2012) Conformational transitions and convergence of absolute binding free energy calculations. J Chem Theory Comput 8:47–60 33. Banks J, Beard J, Cao Y, Cho A, Damm W, Farid R, Felts A, Halgren T, Mainz D, Maple J, Murphy R, Philipp D, Repasky M, Zhang L, Berne B, Friesner R, Gallicchio E, Levy R (2005) Integrated modeling program, applied chemical theory (IMPACT). J Comp Chem 26:1752–1780 34. Eastman P, Swails J, Chodera JD, McGibbon RT, Zhao Y, Beauchamp KA, Wang LP, Simmonett AC, Harrigan MP, Stern CD, et al (2017) Openmm 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Comp Bio 13(7):e1005659 35. Di Marino D, D’Annessa I, Tancredi H, Bagni C, Gallicchio E (2015) A unique binding mode of the eukaryotic translation initiation factor 4E for guiding the design of novel peptide inhibitors. Prot Sci 24:1370–1382 36. Kaminski GA, Friesner RA, Tirado-Rives J, Jorgensen WL (2001) Evaluation and reparameterization of the OPLS-AA force field for proteins via comparison with accurate quantum chemical calculations on peptides. J Phys Chem B 105:6474–6487 37. Gallicchio E, Levy R (2004) AGBNP: an analytic implicit solvent model suitable for molecular dynamics simulations and high-resolution modeling. J Comput Chem 25:479–499 38. Gallicchio E, Paris K, Levy RM (2009) The AGBNP2 implicit solvation model. J Chem Theory Comput 5:2544–2564 39. Zhang B, Kilburg D, Eastman P, Pande VS, Gallicchio E (2017) Efficient gaussian density formulation of volume and surface areas of macromolecules on graphical processing units. J Comp Chem 38:740–752 40. Chipot, Pohorille (eds) (2007) In: Free energy calculations. theory and applications in chemistry and biology. Springer series in chemical physics. Springer, Berlin 41. Zwanzig RW (1954) High-temperature equation of state by a perturbation method. i. nonpolar gases. J Chem Phys 22(8):1420–1426 42.
Jorgensen WL, Thomas LL (2008) Perspective on free-energy perturbation calculations for chemical equilibria. J Chem Theory Comput 4:869–876 43. Khuttan S, Azimi S, Wu J, Gallicchio E (2021) Alchemical transformations for concerted hydration free energy estimation with explicit solvation. J Chem Phys 154:054103 44. Tan Z, Gallicchio E, Lapelosa M, Levy RM (2012) Theory of binless multi-state free energy estimation with applications to protein-ligand binding. J Chem Phys 136:144102 45. Gallicchio E, Levy RM (2011) Advances in all atom sampling methods for modeling protein-ligand binding affinities. Curr Opin Struct Biol 21:161–166 46. Gallicchio E, Xia J, Flynn WF, Zhang B, Samlalsingh S, Mentes A, Levy RM (2015) Asynchronous replica exchange software for grid and heterogeneous computing. Comput Phys Commun 196:236–246 47. Chodera J, Shirts M (2011) Replica exchange and expanded ensemble simulations as Gibbs sampling: simple improvements for enhanced mixing. J Chem Phys 135:194110 48. Sugita Y, Kitao A, Okamoto Y (2000) Multidimensional replica-exchange method for free-energy calculations. J Chem Phys 113:6042–6051 49. Xia J, Flynn W, Gallicchio E, Uplinger K, Armstrong JD, Forli S, Olson AJ, Levy RM (2019) Massive-scale binding free energy simulations of HIV integrase complexes using asynchronous replica exchange framework implemented on the IBM WCG distributed network. J Chem Inf Model 59(4):1382–1397 50. Sugita Y, Okamoto Y (1999) Replica-exchange molecular dynamics method for protein folding. Chem Phys Lett 314:141–151 51. Felts AK, Harano Y, Gallicchio E, Levy RM (2004) Free energy surfaces of beta-hairpin and alpha-helical peptides generated by replica exchange molecular dynamics with the AGBNP implicit solvent model. Proteins: Struct Funct Bioinf 56:310–321 52. Andrec M, Felts AK, Gallicchio E, Levy RM (2005) Protein folding pathways from replica exchange simulations and a kinetic network model. Proc Natl Acad Sci USA 102:6801–6806 53.
Rick SW (2006) Increasing the efficiency of free energy calculations using parallel tempering and histogram reweighting. J Chem Theory Comput 2:939–946 54. Tan Z (2004) On a likelihood approach for monte Carlo integration. J Am Stat Assoc 99: 1027–1036 55. Gallicchio E, Andrec M, Felts AK, Levy RM (2005) Temperature weighted histogram analysis method, replica exchange, and transition paths. J Phys Chem B 109:6722–6731 56. Ding X, Vilseck JZ, Brooks III CL (2019) Fast solver for large scale multistate Bennett acceptance ratio equations. J Chem Theory Comput 15(2):799–802 57. Shirts MR, Chodera JD (2008) Statistically optimal analysis of samples from multiple equilibrium states. J Chem Phys 129:124105 58. Giese TJ, York DM (2021) Variational method for networkwide analysis of relative ligand binding free energies with loop closure and experimental constraints. J Chem Theory Comput 17:1326–1336 59. Subbaiah VK, Kranjec C, Thomas M, Banks L (2011) PDZ domains: the building blocks regulating tumorigenesis. Biochem J 439(2): 195–205 60. Panel N, Villa F, Fuentes EJ, Simonson T (2018) Accurate PDZ/peptide binding specificity with additive and polarizable free energy simulations. Biophys J 114(5):1091–1102 61. Liu X, Shepherd TR, Murray AM, Xu Z, Fuentes EJ (2013) The structure of the tiam1 PDZ domain/phospho-syndecan1 complex reveals a ligand conformation that modulates protein dynamics. Structure 21(3):342–354 62. Clark AJ, Gindin T, Zhang B, Wang L, Abel R, Murret CS, Xu F, Bao A, Lu NJ, Zhou T, Kwong PD, Shapiro L, Honig B, Friesner RA (2017) Free energy perturbation calculation of relative binding free energy between broadly neutralizing antibodies and the GP120 glycoprotein of HIV-1. J Mol Biol 429(7):930–947 63. Clark AJ, Negron C, Hauser K, Sun M, Wang L, Abel R, Friesner RA (2019) Relative binding affinity prediction of charge-changing sequence mutations with FEP in protein–protein interfaces. J Mol Biol 431(7):1481–1493 64.
Mey ASJS, Allen BK, Macdonald HEB, Chodera JD, Hahn DF, Kuhn M, Michel J, Mobley DL, Naden LN, Prasad S, Rizzi A, Scheen J, Shirts MR, Tresadern G, Xu H (2020) Best practices for alchemical free energy calculations [article v1.0]. Living J Comput Mol Sci 2(1):18378 65. Jiang W, Chipot C, Roux B (2019) Computing relative binding affinity of ligands to receptor: an effective hybrid single-dual-topology free-energy perturbation approach in NAMD. J Chem Inf Model 59(9):3794–3802 66. Phillips JC, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C, Skeel RD, Kale L, Schulten K (2005) Scalable molecular dynamics with NAMD. J Comp Chem 26(16):1781–1802 67. Steinbrecher T, Mobley DL, Case DA (2007) Nonlinear scaling schemes for Lennard-Jones interactions in free energy calculations. J Chem Phys 127:214108 68. Scheufler C, Brinker A, Bourenkov G, Pegoraro S, Moroder L, Bartunik H, Hartl FU, Moarefi I (2000) Structure of TPR domain–peptide complexes: critical elements in the assembly of the Hsp70–Hsp90 multichaperone machine. Cell 101(2):199–210 69. Comer J, Gumbart JC, Hénin J, Lelièvre T, Pohorille A, Chipot C (2015) The adaptive biasing force method: everything you always wanted to know but were afraid to ask. J Phys Chem B 119(3):1129–1151

Chapter 16

Computational Evolution Protocol for Peptide Design

Rodrigo Ochoa, Miguel A. Soler, Ivan Gladich, Anna Battisti, Nikola Minovski, Alex Rodriguez, Sara Fortuna, Pilar Cossio, and Alessandro Laio

Abstract

Computational peptide design is useful for therapeutics, diagnostics, and vaccine development. To select the most promising peptide candidates, the key is to describe accurately the peptide–target interactions at the molecular level. Here we review a computational peptide design protocol whose key feature is the use of all-atom explicit solvent molecular dynamics for describing the different peptide–target complexes explored during the optimization.
We describe the milestones behind the development of this protocol, which is now implemented in an open-source code called PARCE. We provide a basic tutorial to run the code for an antibody fragment design example. Finally, we describe three additional applications of the method to design peptides for different targets, illustrating the broad scope of the proposed approach.

Key words Peptide design, In silico antibody maturation, Molecular dynamics, Consensus scoring functions, Sensor technology, Evolutionary algorithm, Antibody design, Affinity optimization

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_16, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

The design of synthetic peptides is widely considered to have enormous potential for biomedical applications, in the emerging field of nanomedicine [1–3] as well as in medicinal chemistry [4]. Their versatility enables their use as alternatives to antibodies in targeted drug delivery and biomarker detection [5, 6]. Indeed, like antibodies, they can be mounted on detection devices or on nanoparticles to form ordered capturing arrays [7–10]. They can display pharmacological activity [11–15] and can be employed as modulators of protein/protein interactions [16, 17], with fewer adverse effects and a higher binding specificity than traditional drugs [18]. All these applications rely on the ability to identify suitable hits.

The state of the art of peptide design is strongly rooted in biotechnology. Phage display library screening is used to assess interactions between different types of macromolecules, including peptides [13, 19]. With this technique, it is possible to massively screen potential peptide binders.
If a binding partner is known, a suitable sequence corresponding to the minimal sub-domain responsible for binding can be extracted from the partner itself [17]. However, these approaches require specialized infrastructure and are expensive.

A more cost-effective alternative, which has emerged in recent years, is computational design. Due to enormous advances in computer power and to a better understanding of the chemical properties of natural amino acids, it is nowadays possible to rationally design peptides or proteins with a high probability of being active in vitro and in vivo [20]. An advantage of computational techniques is that they can describe the binding mechanisms at the atomistic level, allowing rational control of the properties of the designed binders. For instance, they make it possible to control, at the molecular level, the binding site on the target protein, enhancing the selectivity of the designed binders [21]. However, all these benefits do not come for free. Computational design of peptides and proteins requires the efficient exploration of the sequence space, the accurate description of the bound (and unbound) conformations, and an accurate prediction of the peptide–target binding affinities (Fig. 1).

Fig. 1 The challenges in computational peptide design: exploring efficiently the sequence space, the bound conformations, and predicting the binding affinity of the protein–peptide complexes

Different strategies have been developed to face the challenges associated with computational peptide design. For example, the design can be performed using an in silico panning method with structural information [22], using a genetic algorithm [23] for sequence optimization. In this approach, conformational optimization and binding energy estimation are performed by a docking program.
In [24], the authors use a Gaussian Network Model [25] to identify the binding site and, from that, an approximate position of the peptide backbone. Then, they systematically attempt docking 400 dipeptides at the positions determined by this procedure in order to maximize the interaction energy, simultaneously checking the quality of the peptide conformation by characterizing the ϕ–ψ propensities of the dipeptides. These approaches can be classified as template-based protocols. They are computationally very efficient but require prior knowledge of the structure of a template.

Conversely, de novo methods are computationally more expensive but can also be used when a template is not available (Fig. 1). This is the case of the pepspec protocol [26] included in the Rosetta software suite [27]. The pepspec tool follows an "anchor and grow" strategy of flexible backbone docking, starting from one key residue and optimizing peptide sequences and structures from this point [28]. Another example is the VitAl approach [29], which generates the peptides by sequentially docking a pair of residues and selecting the best fit by scoring the binding energies with AutoDock. PepComposer [30] retrieves from a database patches similar to the query, together with the peptide fragments that interact with these patches. It then merges these fragments into an initial proposal that is further optimized using a set of iterative mutations and controlled backbone movements.

Our design algorithm, called PARCE (Protocol for Amino acid Refinement through Computational Evolution), belongs to this second group. PARCE, like most other de novo computational approaches, generates successive single-point mutations on the peptide or protein binder sequence. Each mutation is then accepted or rejected by analyzing the behavior of the complex using explicit solvent molecular dynamics trajectories.
This makes the approach much more computationally expensive than the design schemes based on docking, but at the same time it makes it possible to describe the conformational changes induced by binding with a level of accuracy that is limited only by the quality of the force field used in the simulation.

Chapter Overview

In the following sections, we describe in detail the design protocol implemented in PARCE [31]. We then provide a detailed tutorial and manual-like example for the design. Afterward, three additional applications of the protocol are presented. Finally, we discuss the open problems that require further development for the code and the field.

2 PARCE: Protocol for Amino Acid Refinement Through Computational Evolution

The evolution of PARCE up to its current version can be summarized in the timeline of Fig. 2. The original idea of this design approach was proposed by Laio and collaborators in 2012 [32], as an in silico mutagenesis platform for the optimization of amino acid-based binders. The approach can be used to design not only peptides but also antibody fragments, or other proteins whose amino acid sequence requires accurate engineering for applications in biosensing, biomedicine, and bioengineering. The protocol explores the sequence space of peptides bound to proteins or small molecules, using a Monte Carlo approach that integrates various simulation and prediction techniques [33].

The method, already in its original formulation [32], was based on a sequence of single-point mutations. One of the first applications was the design of peptides capable of binding with high affinity to an organic molecule in a denaturing solvent [34]. The idea of performing MD in explicit solvent was originally motivated by precisely this need, but it was later adopted in general, including when the design is performed in water. In Ref. 34, the quality of the complexes was estimated by computing the average value over the trajectory of a single suitable scoring function (Vina [35] in this case). This approach was refined and specialized to the design of protein binders in Ref. 36. The process is repeated many times, with the aim of evolving the original sequence toward novel sequences with better predicted affinities toward their targets [21, 32, 34, 36–38]. In 2019, Ref. 39 introduced another key idea: the suitability of an attempted mutation is assessed by a consensus mechanism using a set of binding scoring functions [40]. This makes the results much less dependent on the accuracy and the quality of a single scoring function. The approach has been successfully used to design peptides and protein fragments capable of binding to protein targets [21, 36, 39].

Fig. 2 The PARCE timeline describing the main development milestones and the publications supporting the progress of the peptide design protocol (timeline marks: 2012, 2015, 2017, 2019, 2021)

The PARCE code is distributed as open-source software (https://github.com/PARCE-project/PARCE-1) and enables designing peptides or proteins capable of binding with higher affinity to a generic target, as long as this can be accurately described by a classical force field. The method, in its current formulation, combines several computational biophysics and bioinformatics tools in order to strike a balance between accuracy and computational efficiency. In general, a design run yields several peptide candidates, whose number can be increased by performing several statistically independent runs (if enough computational resources are available). This increases the pool of sequences for further filtering and validation and is an advantage over more deterministic or brute force alternatives.
Moreover, since the code is open source, it is possible to adapt it according to the needs of the research project. A graphical representation of the protocol and the required dependencies are shown in Fig. 3. In the following, we explain its main steps.

2.1 Mutation Protocol

The core of the algorithm is an iterative sequence optimization. At every iteration step, a single-point mutation on the peptide sequence is generated by selecting at random a position along the peptide chain and by replacing the selected residue with a random amino acid (i.e., a mutation). A key element of the algorithm is generating a reliable structure of the mutant. If the mutated side chain is placed incorrectly, it will likely make severe steric clashes with other side chains or with the target. To heal these clashes, it would be necessary to perform very long MD equilibrations, which are not affordable. In PARCE, the configuration of the mutated amino acids can be generated with either Scwrl4 [41] or FASPR [42]. These programs were selected based on a study that assessed whether a mutation protocol is able to predict amino acid rotamers similar to those that would be generated in a long MD run [43].

After performing the mutation, a first minimization of just the predicted side chain is performed. In order to avoid clashes between the mutated amino acid and the surrounding water molecules, a second minimization of the new amino acid and the water molecules within 2 Å of it is carried out. Finally, a minimization of the full system is performed, followed by an NVT equilibration of typically 100 ps. Other standard parameters employed in the equilibration are described in the next section. After the minimization and equilibration steps, the new system is sampled by performing an MD simulation.

Fig. 3 Schematic representation of the PARCE pipeline. It includes four main phases: a single-point mutation, the conformational sampling of the new protein–binder complex, the scoring of the conformations of the new complex, and the acceptance or rejection of the mutation. The protocol is iterated to improve the binder sequence. Several open-source dependencies are required to run the protocol

2.2 Conformational Sampling with Molecular Dynamics

For each mutation, an MD simulation is run to sample the conformations of the complex. This step can be seen as the "fingerprint" of this design approach, the feature that makes it different from most other design strategies. The specific force field can be chosen based on the experience of the user, and the MD setup can be adapted to the physico-chemical conditions of the environment in which the binding should happen. For example, for a design of peptides capable of binding a protein in water solution at ambient conditions, one can use the Amber99SB-ILDN protein force field [44], a TIP3P water model [45], a modified Berendsen thermostat [46], and a Parrinello–Rahman barostat [47]. In general, the complex formed by the peptide (or protein binder) and its target is solvated in a cubic box with periodic boundaries placed at a distance of at least 8 Å from any atom of the complex. By default, Na+ and Cl− counterions are included in the solvent to make the box neutral, but the concentration and the ion type can be easily changed to take into account a specific ionic strength. In general, the electrostatic interactions are calculated by using the Particle Mesh Ewald (PME) method, with 1.0 nm short-range electrostatic and van der Waals cutoffs [48], and the equations of motion are solved with the leapfrog integrator [49], using a timestep of 2 fs.
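The example settings above can be gathered into a single parameter table. The sketch below is only illustrative and rests on assumptions that the text does not state: it writes the values in the style of a GROMACS `.mdp` parameter file, and it maps the "modified Berendsen thermostat" to the `V-rescale` option. It is a fragment, not a complete simulation input (coupling groups, reference temperature and pressure, output control, and run length are omitted), and the names `MD_SETTINGS` and `render_mdp` are hypothetical.

```python
# Settings quoted in the text, expressed with GROMACS-style parameter names (an assumption).
MD_SETTINGS = {
    "integrator": "md",              # leap-frog integrator
    "dt": 0.002,                     # timestep in ps, i.e. 2 fs
    "coulombtype": "PME",            # Particle Mesh Ewald electrostatics
    "rcoulomb": 1.0,                 # short-range electrostatic cutoff in nm
    "rvdw": 1.0,                     # van der Waals cutoff in nm
    "tcoupl": "V-rescale",           # assumed equivalent of the modified Berendsen thermostat
    "pcoupl": "Parrinello-Rahman",   # barostat
    "pbc": "xyz",                    # periodic boundaries in all directions
}


def render_mdp(settings):
    """Format a settings dictionary as aligned 'key = value' lines."""
    width = max(len(key) for key in settings)
    lines = [f"{key:<{width}} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


mdp_text = render_mdp(MD_SETTINGS)
```

Keeping the settings in one dictionary makes it straightforward to change, for example, the cutoffs or the barostat when the design is carried out under different physico-chemical conditions.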
2.3 Scoring and Mutation-Acceptance Strategies

After the conformational sampling of the mutated peptide–protein (or peptide–ligand) complex, the trajectory is scored with a chosen set of scoring functions used for protein–protein, protein–peptide, or protein–ligand affinity predictions. The mutation can be accepted or rejected based on three different strategies, outlined in the following.

2.3.1 Monte Carlo Optimization

The simplest optimization strategy is based on Monte Carlo and on the use of a single scoring function for estimating the binding affinity. At each step of the mutation cycle, the peptide chain is randomly mutated by selecting one amino acid from the sequence and replacing it with a different amino acid. The protocol offers the possibility to select the amino acid positions in the peptide chain that are eligible for mutations, as well as the list of possible amino acids selected for the replacement. For example, in Ref. 34, the design is performed on cyclic peptides: the terminal CYS positions were never mutated in order to conserve the cyclic geometry, while GLY was removed from the amino acid list used for the replacement, avoiding undesired mobility in the new peptide chain. After each mutation step, meaningful conformations of the mutated peptide/target complex in explicit solvent are generated by finite-temperature MD, employing the methodology described in the previous section. The binding affinity of the mutated peptide toward the target ligand is then estimated using a single scoring function. In Ref. 34, a cluster analysis is performed over the last part of the trajectory (the last 1 ns of a 5 ns NPT production run) to extract statistically relevant conformations of the peptide–ligand complex. Poorly populated clusters (those with fewer than 15 conformations) were discarded, while for the central structure of each remaining cluster, the peptide–ligand affinity was scored with the Vina scoring function [35].
In the PARCE implementation, the cluster analysis is no longer performed, and the binding affinity is simply estimated as the average value of the scoring function over the whole MD trajectory, neglecting only its first part (whose length can be set by the user). The new peptide sequence at step k is accepted or rejected based on the Metropolis criterion, with a probability

$$\min\left(1, \exp\left[-(E_k - E_{k-1})/T_e\right]\right), \qquad (1)$$

where $E_{k-1}$ is the estimated binding affinity before the mutation, $E_k$ is the binding affinity after the mutation, and $T_e$ is an effective temperature that controls the acceptance rate. If the sequence is accepted, a new mutation cycle is started from the mutated sequence; otherwise, the former sequence from step k − 1 is used as the starting point for a new mutation attempt. The mutation cycle described above is iterated up to a desired number of mutations.

2.3.2 Replica Exchange Optimization

The exploration of the sequence space can be increased using a replica exchange scheme, by running simultaneous and independent mutation cycles at many different effective temperatures. At the end of each step, a swap between two randomly selected replicas (e.g., r and r′) is attempted. The swap is accepted according to a parallel tempering scheme, with a probability

$$\min\left(1, \exp\left[(E_r - E_{r'})\left(\frac{1}{T_r} - \frac{1}{T_{r'}}\right)\right]\right), \qquad (2)$$

where $E_r$ and $E_{r'}$ are the peptide/target binding energies in replicas r and r′, and $T_r$ and $T_{r'}$ are the corresponding effective temperatures. If the swap is accepted, the replica indexes are swapped. The replica exchange scheme is not currently implemented in PARCE.

2.3.3 Consensus Optimization

The two optimization approaches described above attempt to optimize the binding affinity estimated by a single scoring function.
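To make the two single-score acceptance rules concrete before turning to their limitations, the Metropolis test of Eq. (1) and the replica swap of Eq. (2) can be sketched as follows. This is a minimal illustration, not the PARCE source code: the function names are hypothetical, and the energies and effective temperatures are assumed to be expressed in the same (arbitrary) units.

```python
import math
import random


def metropolis_accept(e_new, e_old, t_eff, rng=random):
    """Eq. (1): accept with probability min(1, exp[-(E_k - E_{k-1}) / T_e]).

    Scores are binding energies, so a lower value is a better binder.
    """
    delta = e_new - e_old
    if delta <= 0:
        return True  # improved (or equal) score: always accepted
    return rng.random() < math.exp(-delta / t_eff)


def swap_probability(e_r, e_rp, t_r, t_rp):
    """Eq. (2): min(1, exp[(E_r - E_r')(1/T_r - 1/T_r')]) for replicas r and r'."""
    exponent = (e_r - e_rp) * (1.0 / t_r - 1.0 / t_rp)
    return 1.0 if exponent >= 0 else math.exp(exponent)


# Illustrative numbers only (not taken from any published run).
print(metropolis_accept(-8.0, -7.0, t_eff=0.5))   # better score, accepted
print(swap_probability(-5.0, -8.0, 0.3, 0.9))     # hot replica holds the better score
print(swap_probability(-8.0, -5.0, 0.3, 0.9))     # cold replica already holds it
```

A larger effective temperature makes unfavourable mutations more likely to be accepted, which is what allows the high-temperature replicas to explore the sequence space more broadly.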
If, for example, one estimates the binding affinity with Vina, the evaluation is based on counting the number and type of peptide–ligand contacts and assigning to each of them an energy value, assuming that the complex is fully solvated in an aqueous environment [35]. For this reason, binding affinities were not necessarily meaningful in non-aqueous environments, and the scoring was used with the sole intent of screening the most viable peptide–ligand complexes observed during the MD trajectory. These limitations motivated us to improve the scoring scheme of the protocol with a consensus optimization scheme. In this strategy, the mutation is accepted following a consensus-based approach using N scoring functions. If a particular number n of scoring functions agrees on an improvement of the binding affinity of the mutated peptide B with respect to the one prior to the mutation, i.e., peptide A, then the final consensus will accept the attempted mutation [39]. Formally, the consensus score C is defined as

$$C = \sum_{k=1}^{N} c_k, \qquad (3)$$

where $c_k$ for the scoring function k is

$$c_k = \begin{cases} 1, & S_k^B - S_k^A < 0, \\ 0, & \text{otherwise,} \end{cases} \qquad (4)$$

where $S_k^I$ is the value of the average score for peptide I. It should be noted that all employed scoring functions are defined as binding energies, so that lower values mean higher binding affinities. The criterion to evaluate whether a consensus among the scoring functions is achieved is based on the comparison of C to a predefined threshold T (with a value between 1 and N). If C ≥ T, the mutated sequence is accepted. The scores are estimated as an average over all the snapshots of the trajectory. The next section presents a tutorial on installing and running PARCE and an example guide on designing a nanobody paratope region bound to a protein fragment.

3 Tutorial

3.1 Installing and Running PARCE

PARCE can be downloaded from https://github.com/PARCEproject/PARCE-1 and installed under any Linux operating system.
A README file with instructions is included in the repository. The code was initially optimized for Debian and Ubuntu OS server distributions. We note that all the dependencies required to run PARCE are open-source software, but some of them, such as Scwrl4 [41], require academic licenses. In such cases, it is recommended to install these packages following the developer’s documentation and then to provide their paths to the code. To verify that the additional tools and dependencies are functioning, a set of tests is provided in the repository. A docker container is also available in case the user wants to skip the installation of third-party tools. After installing PARCE, one has to set up the configuration file that contains instructions to start the system and launch the protocol. It describes the path and the characteristics of the input files, as well as the necessary parameters to run the design protocol. An explanation of the input parameters is provided in Table 1. Before running the protocol, we recommend equilibrating the initial complex with an MD simulation. Then, the protocol is run with the command:

python3 run_protocol.py [-h] -c CONFIG_FILE

The design protocol results are summarized in the output file called mutation_report.txt, which contains details per mutation step, such as the type of mutation, the average scores, the binder sequence, and whether the mutation was accepted or not. The mutation is defined by the syntax: [old amino acid][binder chain][position][new amino acid]. An example of a mutation is AB2P, which means that an alanine located at position 2 of chain B is replaced by a proline.
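The mutation syntax described above is simple enough to parse with a regular expression, which can be handy when post-processing a report. The sketch below is not part of PARCE: the `Mutation` tuple and `parse_mutation` function are hypothetical names, and it assumes one-letter amino acid codes and a single-character chain identifier, as in the AB2P example.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    old_aa: str    # one-letter code of the residue being replaced
    chain: str     # chain identifier of the binder
    position: int  # residue number along the chain
    new_aa: str    # one-letter code of the new residue


# [old amino acid][binder chain][position][new amino acid], e.g. AB2P
_MUTATION_RE = re.compile(r"^([A-Z])([A-Za-z])(\d+)([A-Z])$")


def parse_mutation(label):
    """Split a label such as 'AB2P' into its four components."""
    match = _MUTATION_RE.match(label.strip())
    if match is None:
        raise ValueError(f"not a valid mutation label: {label!r}")
    old_aa, chain, position, new_aa = match.groups()
    return Mutation(old_aa, chain, int(position), new_aa)


print(parse_mutation("AB2P"))
```

For the example in the text, the parsed fields are alanine (A), chain B, position 2, and proline (P).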
Table 1 Parameters provided by the user in the configuration file

Parameter            Explanation
Folder               Name of the folder that has all the input and output files of the protocol
src_route            Route of the PARCE folder where the src folder is located
Mode                 The design mode, which has three possible options, including start and restart modes
peptide_reference    The sequence of the peptide, or protein fragment that will be modified
pdbID                Name of the structure that is used as input
Chain                Chain id of the peptide/protein in the structural complex
sim_time             Time in nanoseconds that will be used to sample the complex after each mutation
num_mutations        Number of mutations that will be attempted
residues_mod         These are the specific positions of the residues that want to be modified.
md_route             Path to the folder containing the input files used during the previous MD sampling of the system
md_original          Name of the system file located in the folder containing the previous MD sampling
score_list           List of the scoring functions that will be used to calculate the consensus.
half_flag            Flag that controls which part of the trajectory is used to obtain the average score.
Threshold            Threshold used for the consensus scoring.
mutation_method      Protocol to perform the single-point mutations
scwrl_path           Provide the path to Scwrl4 in case it is not installed in a PATH folder.
gmxrc_path           Provide the path to GMXRC for Gromacs

In addition, the report file includes failed attempts based on minimization or equilibration problems in MD. To overcome these issues, the protocol automatically attempts a number of mutations using the last accepted structure. If the simulation keeps failing after a certain number of attempts (defined by the user in the input file with the keyword try_mutations), a new mutation will be attempted but using the complex that was accepted previous to the current one. If the problem persists more than the number of try_mutations, the design run is stopped.
If the protocol is successful, the number of attempted mutations is set by the keyword num_mutations (Table 1). PARCE has an MIT license that allows for the distribution of the code and its improvement through new functionalities, for example, new scoring functions. We note that the computational resources required for running PARCE are determined by the complexity of the system, since the design is based on running MD. HPC versions of the code are available upon request.

Fig. 4 The structure of the VHH antibody fragment. The peptides to be optimized correspond to the complementarity-determining regions (highlighted in red). The framework sequence (yellow) is not mutated throughout the whole process

3.2 Tutorial Example: The Optimization of Anti-HER2 Antibody Fragments

The human epidermal growth factor receptor 2 (HER2) is a transmembrane protein whose overexpression is associated with specific classes of breast cancer. It is thus a widely recognized biomarker employed for monitoring cancer progression, as well as a key pharmacological target for cancer therapy [50, 51]. For this particular example, the goal is to design a novel antibody fragment of camelid origin (or VHH, Fig. 4) capable of detecting HER2 in a patient’s biological fluids [8, 10]. The idea is to optimize a peptide, or a set of peptides, already embedded into an existing protein to recognize the target. In particular, we aim to design peptide fragments that are part of the antibody binding domain, also known as complementarity-determining regions (CDRs). This process, called antibody maturation, is usually performed in vivo, by animal immunization. Using an in silico process reduces the use of animals for binder discovery. A further advantage of the computational design is that it enables choosing a priori the binding site on the target protein.
The selection of the binding site (or epitope) to be targeted is of paramount importance for the development of new nanodevices, for targeted therapies, and for drug design [52]. An example of VHH optimization performed by PARCE is described in Ref. 39. Here, we show how to set up the design and how it is possible to employ a different set of scoring functions to obtain an ex novo designed antibody fragment for an arbitrarily selected binding site on HER2.

3.2.1 Design Methodology

To get started with any design, it is necessary to have a reasonable starting model of the initial complex. This can be obtained from either a crystallographic complex or a conformation generated by docking the binder to the target. If there are no experimental 3D structures available, these can be constructed by homology modelling. We also note that, when working with a new system and before getting started with the optimization, all scoring functions should be benchmarked on that particular system [40, 53]. As a second step, it is necessary to identify the residues that should be mutated. In the case of an antibody fragment, these can be the residues belonging to one, two, or all three CDRs (highlighted in Fig. 4). Only this selected region will be optimized, leaving the sequence of the rest of the protein unchanged throughout the whole process. The input files for the design are the starting topologies for the MD and the configuration file. Examples of these files, aiming to reproduce the results of Ref. 39, can be found in the folder design_input/protein_protein. The CONFIG_FILE contains the input parameters shown in Table 1. The individual peptide residues to be optimized should be explicitly listed in the CONFIG_FILE. For instance, for the optimization of a single antibody fragment loop, the config_vhh.txt would read

residues_mod: 54,55,56,57,58,59,60,61

Instead, to optimize all the VHH residues highlighted in Fig.
4, namely residues 29–35 corresponding to the first CDR, 55–61 corresponding to the second CDR, and 101–109 corresponding to the third CDR, one should write

residues_mod: 29,30,31,32,33,34,35,55,56,57,58,59,60,61,101,102,103,104,105,106,107,108,109

simply listing all residues even if they belong to different regions of the system. While the example of the optimization of a single CDR can be found in Ref. 39, here we show how the same process can lead to the optimization of all three VHH CDRs.

3.2.2 Design Optimization Results

A typical optimization path is reported in Fig. 5a, where each score is a proxy measure of the binding affinity between the two components of the system, in this case, the whole VHH and its target. It is important to note that, even if only the selected residues are mutated, the score is calculated over the entire complex. An optimization is considered concluded when all scoring functions reach a plateau. For a collective view of the whole optimization process, a rank-based analysis can be used. First, one computes the rank $r^i_k$ associated with each sequence i according to the score obtained with a single scoring function k. Accordingly, $r^i_k$ can be normalized as

Fig. 5 Design of an antibody fragment (VHH) bound to the HER2 terminal domain. (a) Evolution of the six scoring functions during the design. The dots in the curve represent the mutations that were accepted. The scoring functions used are BMF-Bluues [75, 76] (gray), Rosetta [27] (magenta), PiePisa [73] (orange), Haddock [70] (blue), Bach6 [77, 78] (mauve), and Bluues [76] (cyan). (b) Ranking of the configurations: both the normalized ranks from the single scoring functions ($\hat{r}^i_k$, stars) and the global normalized rank $R_i$ (black line) for each peptide i. In the insets, starting and final configuration of the VHH/HER2 complexes.
Color code: HER2 (gray), VHH framework (yellow), starting residues (red), and optimized residues (green)

$$\hat{r}^i_k = \frac{r^i_k}{N}, \qquad (5)$$

where N is the total number of accepted mutations obtained in the runs. From the collection of $\hat{r}^i_k$ (indicated by stars in Fig. 5b), a global ranking score $R_i$ for each sequence is defined (black dots in Fig. 5b) as

$$R_i = \frac{1}{N_s} \sum_{k=1}^{N_s} \hat{r}^i_k, \qquad i = 1, \ldots, N, \qquad (6)$$

where $N_s$ is the number of scoring functions. If the ranks of a certain sequence i are consistently low for all the scoring functions, then $R_i$ is small. In the particular case illustrated in Fig. 5, $R_i$ decreases when more mutations are performed, as expected. By comparing the initial and the final configurations of the system, the former with the sequence associated with $\max(R_i)$ and the latter with $\min(R_i)$ (insets in Fig. 5b), it is possible to see how the initial VHH evolves into a final VHH by changing its orientation to maximize its contacts with the target, defining a larger contact area between the two. The optimized VHHs, or rather a selection of the lowest-ranking sequences, will then need to undergo extensive MD simulations and stability tests [54]. VHHs passing all the computational tests will then be ready to be expressed in bacterial cells [55]. In the next section, several additional examples of peptide design are presented.

4 Additional Peptide Design Examples

4.1 Drug-Binding Peptide Design in Different Environments

Reference 34 was the first work performing peptide design in explicit solvent with our scheme. It reports the design of high-affinity cyclic peptides toward irinotecan (CPT-11). CPT-11 is a chemotherapy drug, and its choice was motivated by the need to engineer sensors for therapeutic drug monitoring in a denaturant solvent (e.g., methanol), which were afterwards validated experimentally [37].
Compared to the original protocol of 2012 [32], three important innovations were introduced: (1) the conformational search for viable peptide–ligand conformations during the mutation cycle was carried out by finite-temperature molecular dynamics and not by flexible docking in vacuum, (2) cyclic peptides were adopted for the design, and (3) the design was performed with the peptide–ligand complexes fully solvated in a simulation box with an explicit atomistic description of the solvent molecules [34]. Computationally intense design in explicit solvent was made possible by the advent of GPU-based computing, which started to be efficiently implemented in commonly used MD packages in those years. The protocol adopted was essentially the same as that employed in the current version of PARCE and described in Subheading 2, using in particular Replica Exchange optimization with a single scoring function (Vina [35]). Two independent designs were performed, one in water and one in methanol. The procedure started from a deca-alanine cyclized by a disulfide bridge between two terminal cysteines. CPT-11 was initially inserted within the cyclic peptide, and one randomly selected amino acid of the peptide chain was mutated at each step. The terminal cysteines were not selected for the mutation in order to conserve the cyclic geometry. After each mutation, MD simulations of 5 ns were performed for the peptide–ligand complex fully solvated in water (or methanol), and relevant peptide–ligand structures were selected from the last part of the MD trajectory by cluster analysis (see Subheading 2.3.1). For the selected structures, the peptide–ligand affinities were estimated using the Vina scoring function, and the mutation was accepted or rejected according to a Metropolis criterion. To further enhance the exploration of the sequence space, a replica exchange scheme with 5 effective temperatures was employed, as described in Subheading 2.3.2 (Fig. 6a).
Fig. 6 Design of peptides for CPT-11 in aqueous and methanol solutions. (a) The Vina score as a function of the mutation steps for the design in water. The five different colors report the binding affinities observed at the five effective temperatures employed during the procedure (see main text for details). (b) The best seven peptides (A–G), in terms of binding affinity toward CPT-11, from the design in water. Black dots are the binding energies as predicted during the mutation cycle, while blue squares are the binding energies after 100 ns MD at 300 K in water. The green diamonds display the binding affinity of the A–G peptides after 100 ns MD in methanol. (c) Same as panel (b) for the seven best peptides (α–η) designed in methanol. (d) The experimental dissociation constant, K_D, in methanol solution vs. the computationally predicted binding energies. The green dots show results for two peptides designed in methanol, black dots for three peptides designed in vacuum using the flexible docking approach [32]. The experimental values were taken from Ref. 37. In the green inset, the peptide backbone around CPT-11 at 0, 25, 75, and 100 ns of MD simulation for one of the peptides designed in methanol. Panels (a)–(c) adapted from Ref. 34, copyright 2015 American Chemical Society

After 400 mutations, the best seven peptide–ligand complexes, in terms of binding energies, were selected and their stability further assessed by longer (at least 100 ns) MD simulations in explicit solvent at different temperatures (i.e., from 300 K up to 450 K). The designed peptides revealed a solvent specificity, namely peptides designed in an aqueous environment do not necessarily bind the ligand in a different solvent, such as methanol. This is evident in Fig. 6b: once peptides designed in water are solvated in methanol, the binding becomes weaker and, occasionally, some peptides can even detach from the ligand [34].
This solvent specificity is a consequence of the explicit description of the solvation environment during the design. Peptides created in water are, indeed, richer in aromatic residues than those designed in methanol: once peptides designed in water are immersed in methanol, the aromatic side chains are more easily exposed to the solvent, competing with the binding toward the ligand. The high affinity of the designed peptides in methanol has been confirmed experimentally in a follow-up paper using surface plasmon resonance and fluorescence spectroscopy [37]. The peptides displayed an experimental micromolar affinity toward CPT-11 in methanol solution, and MD simulations revealed peptide–drug complexes more stable in solvent than those designed in vacuum using flexible docking (Fig. 6d). Interestingly, the designed peptides were selective toward the target and unable to bind SN-38, an active metabolite of CPT-11 lacking the carbamate and piperidyl-piperidine groups [37]. A similar procedure was also adopted to design peptides that bind chlorogenic acid (CGA), a compound present in coffee blends, in water solution [56]. Electrochemical measurements and circular dichroism and fluorescence spectroscopy confirmed the high affinity of the designed peptides, showing a remarkable peptide selectivity toward CGA and not toward other related phenolic compounds [56].

4.2 Peptides for Protein Recognition

The protocol introduced in Ref. 34 was subsequently employed for the design of peptides for protein recognition [9, 10, 21, 38, 54, 57]. While the approach still relied on a single scoring function, namely Vina [35], it allowed for an unprecedented versatility in the choice of the binding site. In particular, after having successfully designed linear peptides for a well-defined protein pocket with the docking-based code [32, 33], the new approach introduced in Ref. 34 allowed designing ligands for surface-exposed binding sites (Fig.
7), which are generally regarded as “undruggable.”

Fig. 7 Design of peptides for B2M in vacuum. (a) The Vina score as a function of the mutation steps. The three different colors report the binding affinities observed at the three effective temperatures employed during the procedure (see main text for details). Yellow crosses indicate peptides that underwent computational and experimental screening. The best peptide/protein complex is shown in (b). (c–d) Top and side view of the two computationally designed peptides discussed in the text. Adapted from Ref. 21

To show the versatility of PARCE in terms of choice of the target binding site, we selected two sites on opposite sides of a globular protein that does not possess pockets: beta-2-microglobulin (B2M). Due to the large system size, the design was performed in vacuo, followed by a screening in explicit solvent using MD simulations and a final experimental validation. The first binding site chosen for the design was a surface-exposed site, which is known to interact with the human histocompatibility antigen [54, 57]. Among the generated peptides, five were experimentally tested, giving dose–response surface plasmon resonance (SPR) signals with dissociation constants in the micromolar range. The result was confirmed by means of isothermal titration calorimetry and nuclear magnetic resonance, showing that the approach is capable of designing binders for an arbitrarily selected binding site. We then identified another site on the opposite side of B2M and attempted to generate a second peptide (Fig. 7a). Once again, SPR confirmed the dissociation constant to be in the micromolar range. Competition experiments further showed that the two peptides bind to non-overlapping binding sites, thus confirming the theoretical predictions (Fig. 7b and c). We further showed that this design approach can be exploited for bottom-up design of smart nanodevices.
Indeed, the peptides designed in these projects were employed as sensing elements to build a self-assembled nanochip capable of capturing a target protein by means of preselected binding sites [21], allowing for the immobilization of the chosen protein in a predefined orientation [54].

4.3 MHC Class II Peptide-Binder Design

The major histocompatibility complex (MHC) class II is a complex of encoding proteins responsible for regulating the immune system in humans [58] through the interaction with antigen proteins and peptide subunits. Different experimental and computational strategies have been implemented to predict affinities of peptides toward relevant alleles within the population [59], as a strategy for the development of more efficient vaccines [60]. The field known as immunoinformatics has provided an extensive set of tools, mainly sequence-based strategies, for predicting the affinity between a peptide and MHC class I or class II molecules [61]. However, structural information is also crucial to rationally study peptides bound to the MHC class II binding interface, which is characterized by a large groove located between the solvent-exposed α and β structural subunits [62] (Fig. 8). Specific interactions created between some protein pockets and core amino acids of the peptides contribute to the molecular affinity [63]. The latter has been correlated with immunogenic properties, as well as other events during the MHC editing process [64].

Fig. 8 Summary of the scoring strategy used in the design protocol. (a) The structure of MHC class II in complex with the peptide at step 0, and examples of a rejected mutation at the 20th step (colored in red) and of an accepted mutation at the 50th step (colored in green). (b) Representation of the accepted mutations (green circles) and the rejected ones (red circles), with the rejected and accepted examples depicted by dashed lines
This motivates the use of structure- and dynamics-based approaches such as PARCE to engineer peptides with better affinities for this molecular receptor. In this example, the starting complex for the design was the MHC class II allele DRB1*01:01 bound to a peptide of 14 amino acids that is part of an influenza virus antigen (YPKYVKQNTLKLAT) (Fig. 8). This sequence has a reported bioactivity of IC50 = 130 nM from a curated dataset of peptide binders against multiple MHC class II alleles [65]. As reference, we used the crystal structure 1DLH [66] from the Protein Data Bank (PDB) [67], which has a missing tyrosine at the N-terminal flanking region. The missing amino acid was modelled using the Rosetta Remodel functionality [68] with the full protein–peptide complex as a template. The side chains of the complex were relaxed using Rosetta with the protein backbone fixed. The refined protein–peptide structure was equilibrated by an MD simulation of 100 nanoseconds (ns), with previous minimization and NVT/NPT equilibration, using Gromacs v5.1 [69]. Despite the linear conformation of the peptide in the bound state, the complex remains stable during the simulation, mostly due to the hydrogen bonds between the receptor and the peptide backbone atoms. The final snapshot of the MD was used as the starting conformation for the design. We then applied the PARCE protocol explained in Subheading 2. Specifically, we configured the protocol to mutate randomly any amino acid of the peptide. We iterated the mutation process and sampled each mutated protein–peptide complex for 5 ns, at a temperature of 350 K. A high temperature was chosen to allow a more efficient exploration of the conformational space. All the protein atoms located at a distance greater than 12 Å from the peptide were restrained to keep the system stable at the selected temperature.
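The distance-based selection of restrained atoms can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the original work: it interprets "distance from the peptide" as the minimum distance to any peptide atom, expects Cartesian coordinates in Å, and the function name `atoms_to_restrain` is hypothetical.

```python
import math


def atoms_to_restrain(protein_xyz, peptide_xyz, cutoff=12.0):
    """Return indices of protein atoms farther than `cutoff` (in Å) from every peptide atom.

    Assumption: the distance to the peptide is the minimum over all peptide atoms.
    """
    selected = []
    for index, atom in enumerate(protein_xyz):
        if all(math.dist(atom, pep_atom) > cutoff for pep_atom in peptide_xyz):
            selected.append(index)
    return selected


# Toy coordinates (Å), made up for illustration only.
protein = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
peptide = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(atoms_to_restrain(protein, peptide))
```

For a real complex the double loop would normally be replaced by a neighbour search from a structure-analysis library, but the selection criterion stays the same.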
The design was performed by using the consensus scheme described in Subheading 2.3.3, using six scoring functions that were previously benchmarked for this specific system [53]: Haddock [70], Vina [35], a combination of DFIRE and GOAP (DFIRE-GOAP) [71, 72], Pisa [73], FireDock [74], and BMF-BLUUES [75, 76]. The threshold parameter T (Subheading 2.3.3) was set equal to 3, following Ref. 39. This means that if 3 or more scoring functions predict better scores for the new mutation, the mutation is accepted. During the design, 100 mutations were attempted, with an acceptance ratio of around 20–30%. The evolution of the scores with accepted and rejected mutations is shown in Fig. 8. As expected, the mutations minimize the majority of the scoring functions. This specific example illustrates the usefulness of the consensus strategy with respect to the standard design strategy, which is based on a Monte Carlo optimization of a single scoring function [34] (see Subheading 4.1). Indeed, each scoring function is typically affected by errors, but very often the errors of different scoring functions are uncorrelated and might compensate. The consensus criterion allows the different empirical and physics-based terms of the scoring functions to complement each other [39]. As shown in Fig. 9, the scoring functions, on average, decrease in value over the course of multiple mutation trials. Nonetheless, due to the stochastic nature of the search and the definition of the consensus score, the single scoring functions can also increase. We note that to explore the sequence space, it might be beneficial to run multiple replicas of the protocol starting from the same initial complex but with different random seeds. In this case, the peptides from the different runs can be combined following the same re-ranking procedure using the average rank from all the scoring functions calculated from MD simulations (similarly to that described in Subheading 3.2.2).
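The two ingredients used in this example, the consensus acceptance rule of Eqs. (3) and (4) with T = 3 and the rank-based re-ranking of Eqs. (5) and (6), can be sketched as follows. This is a minimal illustration rather than the PARCE implementation: the function names are hypothetical, all score values are invented, N is taken to be the number of sequences being ranked, and ties are broken by input order.

```python
def consensus_accept(scores_before, scores_after, threshold):
    """Eqs. (3)-(4): count scoring functions whose average score decreases.

    Returns (accepted, C), where the mutation is accepted if C >= threshold.
    """
    c = sum(1 for name in scores_before
            if scores_after[name] - scores_before[name] < 0)
    return c >= threshold, c


def global_ranks(scores_by_function):
    """Eqs. (5)-(6): average normalized rank R_i of each sequence i.

    scores_by_function maps a scoring-function name to a list with one
    score per sequence; rank 1 is the lowest (best) score.
    """
    n_seq = len(next(iter(scores_by_function.values())))
    totals = [0.0] * n_seq
    for scores in scores_by_function.values():
        order = sorted(range(n_seq), key=lambda i: scores[i])
        for rank, i in enumerate(order, start=1):
            totals[i] += rank / n_seq          # normalized rank, Eq. (5)
    n_functions = len(scores_by_function)
    return [total / n_functions for total in totals]  # Eq. (6)


# Invented average scores for six scoring functions (illustration only).
before = {"haddock": -100.0, "vina": -7.0, "dfire_goap": -300.0,
          "pisa": -5.0, "firedock": -40.0, "bmf_bluues": -20.0}
after = {"haddock": -104.0, "vina": -7.5, "dfire_goap": -295.0,
         "pisa": -4.8, "firedock": -42.0, "bmf_bluues": -19.0}
print(consensus_accept(before, after, threshold=3))

print(global_ranks({"a": [3.0, 1.0, 2.0], "b": [30.0, 10.0, 20.0]}))
```

In the toy case, three of the six functions improve, so the mutation just meets the threshold; with T = 4 the same mutation would be rejected.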
Fig. 9 Evolution of the scoring functions for the design of peptides bound to the MHC class II. We used six scoring functions to calculate the consensus. The dots in the curve represent the mutations that were accepted. The scoring functions used were (a) BMF-Bluues [75, 76], (b) Vina [35], (c) Firedock [74], (d) Haddock [70], (e) DFIRE-GOAP [71, 72], and (f) Pisa [73]

5 Concluding Notes and Perspectives

The exponential increase of computational resources enables the use of novel strategies to complement, assist, or even replace traditional experimental methods for designing and screening novel peptide ligands, for applications ranging from biomarker detection and drug delivery to drug design and vaccine development. This chapter presented the methods that our team has developed to address this exciting challenge. We developed, implemented, tested, and validated a modular algorithm for the ex novo optimization of amino acid-based binders, named PARCE [31]. It enables the optimization of the peptide sequence to maximize its (predicted) binding affinity toward a molecular target. The protocol, initially introduced as an evolutionary algorithm based on iterative docking [32, 33], evolved into a comprehensive open-source design protocol, embedding a number of functionalities that have been tested and improved over the years, thus enhancing its outreach. Indeed, the key to PARCE’s success is its modularity: it has been designed so that when novel, more accurate approaches become available, these can be easily embedded into the existing code. The explicit description of the solvation environment during the design procedure was crucial for selecting successful candidates that are solvent-specific and target-selective.
This procedure and the ongoing improvement of force fields for MD have opened the possibility of exploring new design conditions (e.g., binding in nonstandard solvents and under extreme pressures and temperatures), which may be hardly accessible with other computational approaches. Another determinant for successful designs was the inclusion of multiple scoring functions, in the form of a consensus criterion. This enabled, for example, the in silico unsupervised maturation of an antibody fragment [39]. All these improvements are expected to push forward the limits of peptide design by reaching affinities analogous to those reached by nature. PARCE is only limited by the accuracy of the predictors it relies upon, which can be updated as new techniques become available. We are thus looking forward to novel advances in structure prediction and free energy evaluations. However, we note that accurate predictors typically involve more computational resources. For the case of PARCE, MD simulations involve costs which are orders of magnitude higher than docking methodologies. Nevertheless, we believe that using MD helps improve the quality of the design. We foresee that the continuous growth in computing power will shift the trade-off between computational cost and accuracy more and more toward accuracy.

Acknowledgements
R.O. and P.C. were supported by MinCiencias, Ruta N, University of Antioquia, Colombia, and the Max Planck Society, Germany. N.M. was supported by the Alternatives Research & Development Foundation (Annual Open Grant, PI: S.F., M.A.S.). S.F. would like to acknowledge the Italian Association for Cancer Research (AIRC) through the grant “My First AIRC grant,” Rif.18510, and the CINECA Awards N. HP10B3JT25, 2020, for the availability of high performance computing resources and support.

Conflict of Interest
The authors declare that they have no competing interest.

References
1. Kim BY, Rutka JT, Chan WC (2010) Nanomedicine.
N Engl J Med 363(25):2434–2443 2. Zhang X-X, Eden HS, Chen X (2012) Peptides in cancer nanomedicine: drug carriers, targeting ligands and protease substrates. J Control Release 159(1):2–13 3. Chung EJ (2016) Targeting and therapeutic peptides in nanomedicine for atherosclerosis. Exp Biol Med 241(9):891–898 4. Brayden DJ, Hill T, Fairlie D, Maher S, Mrsny R (2020) Systemic delivery of peptides by the oral route: formulation and medicinal chemistry approaches. Adv Drug Deliv Rev 5. Kurrikoff K, Aphkhazava D, Langel Ü (2019) The future of peptides in cancer treatment. Curr Opin Pharmacol 47:27–32 6. Deutscher S (2019) Phage display to detect and identify autoantibodies in disease. N Engl J Med 381(1):89–91 7. Cretich M, Damin F, Pirri G, Chiari M (2006) Protein and peptide arrays: recent trends and new directions. Biomol Eng 23(2–3):77–88 8. Ambrosetti E, Paoletti P, Bosco A, Parisse P, Scaini D, Tagliabue E, De Marco A, Casalis L (2017) Quantification of circulating cancer biomarkers via sensitive topographic measurements on single binder nanoarrays. ACS Omega 2(6):2618–2629 9. Adedeji AF, Ambrosetti E, Casalis L, Castronovo M (2018a) Spatially resolved peptide-DNA nanoassemblages for biomarker detection: a synergy of DNA-directed immobilization and nanografting. In: DNA nanotechnology. Springer, New York, pp 151–162 10. Adedeji AF, Ambrosetti E, Casalis L, Castronovo M (2018b) Spatially resolved peptide-DNA nanoassemblages for biomarker detection: a synergy of DNA-directed immobilization and nanografting. In: DNA nanotechnology. Springer, New York, pp 151–162 11. Ciemny M, Kurcinski M, Kamel K, Kolinski A, Alam N, Schueler-Furman O, Kmiecik S (2018) Protein–peptide docking: opportunities and challenges. Drug Discov Today 23(8):1530–1537 12. Diller DJ, Swanson J, Bayden AS, Jarosinski M, Audie J (2015) Rational, computer-enabled peptide drug design: principles, methods, applications and future directions. Future Med Chem 7(16):2173–2193 13.
Fosgerau K, Hoffmann T (2015) Peptide therapeutics: current status and future directions. Drug Discov Today 20(1):122–128 14. La Manna S, Di Natale C, Florio D, Marasco D (2018) Peptides as therapeutic agents for inflammatory-related diseases. Int J Mol Sci 19(9):2714 15. Lee AC-L, Harris JL, Khanna KK, Hong J-H (2019) A comprehensive review on current advances in peptide drug development and design. Int J Mol Sci 20(10):2383 16. Sillerud LO, Larson RS (2005) Design and structure of peptide and peptidomimetic antagonists of protein-protein interaction. Curr Protein Peptide Sci 6(2):151–169 17. Russo A, Aiello C, Grieco P, Marasco D (2016) Targeting “undruggable” proteins: design of synthetic cyclopeptides. Curr Med Chem 23(8):748–762 18. Vlieghe P, Lisowski V, Martinez J, Khrestchatisky M (2010) Synthetic therapeutic peptides: science and market. Drug Discov Today 15(1–2):40–56 19. Leurs U, Lohse B, Ming S, Cole PA, Clausen RP, Kristensen JL, Rand KD (2014) Dissecting the binding mode of low affinity phage display peptide ligands to protein targets by hydrogen/deuterium exchange coupled to mass spectrometry. Anal Chem 86(23):11734–11741 20. Fjell CD, Hiss JA, Hancock REW, Schneider G (2011) Designing antimicrobial peptides: form follows function. Nat Rev Drug Discov 2 (Mic):31–45 21. Adedeji Olulana AF, Soler MA, Lotteri M, Vondracek H, Casalis L, Marasco D, Castronovo M, Fortuna S (2021) Computational evolution of beta2-microglobulin binding peptides for nanopatterned surface sensors. Int J Mol Sci 22(2):812 22. Yagi Y, Terada K, Noma T, Ikebukuro K, Sode K (2007) In silico panning for a non-competitive peptide inhibitor. BMC Bioinform 8(1):11 23. Mitchell M (1998) An introduction to genetic algorithms. MIT Press, Cambridge 24. Besray Unal E, Gursoy A, Erman B (2010) Vital: Viterbi algorithm for de novo peptide design. PLoS One 5(6):e10926 25.
Haliloglu T, Seyrek E, Erman B (2008) Prediction of binding sites in receptor-ligand complexes with the Gaussian network model. Phys Rev Lett 100(22):228102 26. Raman S, Vernon R, Thompson J, Tyka M, Sadreyev R, Pei J, Kim D, Kellogg E, DiMaio F, Lange O, Kinch L, Sheffler W, Kim B-H, Das R, Grishin NV, Baker D (2009) Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins Struct Funct Bioinform 77(S9):89–99 27. Alford RF, Leaver-Fay A, Gonzales L, Dolan EL, Gray JJ (2017) A cyber-linked undergraduate research experience in computational biomolecular structure prediction and design. PLoS Comput Biol 13(12):e1005837 28. King CA, Bradley P (2010) Structure-based prediction of protein-peptide specificity in Rosetta. Proteins Struct Funct Bioinform 78(16):3437–3449 29. Unal EB, Gursoy A, Erman B (2010) Vital: Viterbi algorithm for de novo peptide design. PLoS One 5(6):1–15 30. Obarska-Kosinska A, Iacoangeli A, Lepore R, Tramontano A (2016) PepComposer: computational design of peptides binding to a given protein surface. Nucleic Acids Res 44(W1):W522–W528 31. Ochoa R, Soler M, Laio A, Cossio P (2020) PARCE: protocol for amino acid refinement through computational evolution. Comput Phys Commun 260:107716 32. Hong Enriquez RP, Pavan S, Benedetti F, Tossi A, Savoini A, Berti F, Laio A (2012) Designing short peptides with high affinity for organic molecules: a combined docking, molecular dynamics, and Monte Carlo approach. J Chem Theor Comput 8(3):1121–1128 33. Russo A, Scognamiglio PL, Enriquez RPH, Santambrogio C, Grandori R, Marasco D, Giordano A, Scoles G, Fortuna S (2015) In silico generation of peptides by replica exchange Monte Carlo: docking-based optimization of maltose-binding-protein ligands. PLoS One 10(8):1–16 34. Gladich I, Rodriguez A, Hong Enriquez RP, Guida F, Berti F, Laio A (2015) Designing high-affinity peptides for organic molecules by explicit solvent molecular dynamics. J Phys Chem B 119(41):12963–12969 35.
Trott O, Olson AJ (2010) Autodock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31(2):455–461 36. Soler MA, Rodriguez A, Russo A, Adedeji AF, Dongmo Foumthuim CJ, Cantarutti C, Ambrosetti E, Casalis L, Corazza A, Scoles G, Marasco D, Laio A, Fortuna S (2017) Computational design of cyclic peptides for the customized oriented immobilization of globular proteins. Phys Chem Chem Phys 19(4):2740–2748 37. Guida F, Battisti A, Gladich I, Buzzo M, Marangon E, Giodini L, Toffoli G, Laio A, Berti F (2017) Peptide biosensors for anticancer drugs: design in silico to work in denaturizing environment. Biosens Bioelectron 100:298–303 38. Chi LA, Vargas MC (2020) In silico design of peptides as potential ligands to resistin. J Mol Model 26:1–14 39. Soler MA, Medagli B, Semrau MS, Storici P, Bajc G, de Marco A, Laio A, Fortuna S (2019) A consensus protocol for the in silico optimisation of antibody fragments. Chem Commun 55(93):14043–14046 40. Soler MA, Fortuna S, de Marco A, Laio A (2018) Binding affinity prediction of nanobody-protein complexes by scoring of molecular dynamics trajectories. Phys Chem Chem Phys 20(5):3438–3444 41. Peterson LX, Kang X, Kihara D (2014) Assessment of protein side-chain conformation prediction methods in different residue environments. Proteins Struct Funct Bioinform 82(9):1971–1984 42. Huang X, Pearce R, Zhang Y (2020) FASPR: an open-source tool for fast and accurate protein side-chain packing. Bioinformatics 36:3758–3765 43. Ochoa R, Soler MA, Laio A, Cossio P (2018) Assessing the capability of in silico mutation protocols for predicting the finite temperature conformation of amino acids. Phys Chem Chem Phys 20(40):25901–25909 44. Lindorff-Larsen K, Piana S, Palmo K, Maragakis P, Klepeis JL, Dror RO, Shaw DE (2010) Improved side-chain torsion potentials for the Amber ff99SB protein force field.
Proteins Struct Funct Bioinform 78(8): 1950–1958 45. Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, Klein ML (1983) Comparison of simple potential functions for simulating liquid water. J Chem Phys 79(2):926–935 46. Bussi G, Donadio D, Parrinello M (2007) Canonical sampling through velocity rescaling. J Chem Phys 126(1):014101 47. Parrinello M, Rahman A (1980) Crystal structure and pair potentials: a molecular dynamics study. Phys Rev Lett 45(14):1196–1199 48. Di Pierro M, Elber R, Leimkuhler B (2015) A stochastic algorithm for the isobaric-isothermal ensemble with Ewald summations for all long range forces. J Chem Theor Comput 11(12): 5624–5637 49. Janežič D, Merzel F (1995) An efficient symplectic integration algorithm for molecular dynamics simulations. J Chem Inf Comput Sci 35(2):321–326 50. Hicks DG, Kulkarni S (2008) HER2+ breast cancer: review of biologic relevance and optimal use of diagnostic tools. Am J Clin Pathol 129(2):263–273 51. Oh D-Y, Bang Y-J (2020) HER2-targeted therapies-a role beyond breast cancer. Nat Rev Clin Oncol 17(1):33–48 52. Sawant MS, Streu CN, Wu L, Tessier PM (2020) Toward drug-like multispecific antibodies by design. Int J Mol Sci 21(20):7496 53. Ochoa R, Laio A, Cossio P (2019) Predicting the affinity of peptides to major histocompatibility complex class II by scoring molecular dynamics simulations. J Chem Inf Model 59(8):3464–3473 54. Soler MA, De Marco A, Fortuna S (2016) Molecular dynamics simulations and docking enable to explore the biophysical factors controlling the yields of engineered nanobodies. Sci Rep 6:34869 55. Medagli B, Soler MA, de Zorzi R, Fortuna S (2021) Antibody affinity maturation using computational methods: from an initial hit to small scale expression of optimised binders. In: Computer-aided antibody design. Springer, in press 56. 
Del Carlo M, Capoferri D, Gladich I, Guida F, Forzato C, Navarini L, Compagnone D, Laio A, Berti F (2016) In silico design of short peptides as sensing elements for phenolic compounds. ACS Sensors 1(3):279–286 57. Soler M, Fortuna S, Scoles G (2015) Computational design of peptides as probes for the recognition of protein biomarkers. In: 10th European-biophysical-societies-association (EBSA) European biophysics congress, vol 44. Springer, New York, pp 149–149 58. Negroni MP, Stern LJ (2018) The N-terminal region of photocleavable peptides that bind HLA-DR1 determines the kinetics of fragment release. PLoS One 13(7):e0199704 59. Peters B, Nielsen M, Sette A (2020) T cell epitope predictions. Ann Rev Immunol 38(1): 123–145 60. Purcell AW, McCluskey J, Rossjohn J (2007) More than one reason to rethink the use of peptides in vaccine design. Nat Rev Drug Discov 6(5):404–414 61. Wang P, Sidney J, Dow C, Mothé B, Sette A, Peters B (2008) A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach. PLoS Comput Biol 4(4):e1000048 62. Bjorkman PJ (2015) Not second class: the first class II MHC crystal structure. J Immunol 194(1):3–4 63. Unanue ER, Turk V, Neefjes J (2016) variations in MHC class II antigen processing and presentation in health and disease. Annu Rev Immunol 34(1):265–297 64. Weaver JM, Sant AJ (2009) Understanding the focused CD4 T cell response to antigen and pathogenic organisms. Immunol Res 45(2–3): 123–143 65. Wang P, Sidney J, Kim Y, Sette A, Lund O, Nielsen M, Peters B (2010) Peptide binding predictions for HLA DR, DP and DQ molecules. BMC Bioinform 11(1):568 66. Stern LJ, Brown JH, Jardetzky TS, Gorga JC, Urban RG, Strominger JL, Wiley DC (1994) Crystal structure of the human class II MHC protein HLA-DR1 complexed with an influenza virus peptide. Nature 368(6468): 215–221 67. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. 
Nucleic Acids Res 28(1):235–242 68. Huang P-S, Ban Y-EA, Richter F, Andre I, Vernon R, Schief WR, Baker D (2011) RosettaRemodel: a generalized framework for flexible backbone protein design. PLoS One 6(8):e24109 69. Hess B, Kutzner C, van der Spoel D, Lindahl E (2008) GROMACS 4: algorithms for highly efficient, load balanced, and scalable molecular simulations. J Chem Theor Comput 4:435–447 70. Dominguez C, Boelens R, Bonvin AMJJ (2003) HADDOCK: a protein–protein docking approach based on biochemical or biophysical information. J Am Chem Soc 125(7):1731–1737 71. Yang Y, Zhou Y (2008) Specific interactions for Ab Initio folding of protein terminal regions with secondary structures. Proteins Struct Funct Genet 72(2):793–803 72. Zhou H, Skolnick J (2011) GOAP: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction. Biophys J 101(8):2043–2052 73. Krissinel E, Henrick K (2007) Inference of macromolecular assemblies from crystalline state. J Mol Biol 372(3):774–797 74. Andrusier N, Nussinov R, Wolfson HJ (2007) FireDock: fast interaction refinement in molecular docking. Proteins Struct Funct Bioinform 69(1):139–159 75. Berrera M, Molinari H, Fogolari F (2003) Amino acid empirical contact energy definitions for fold recognition in the space of contact maps. BMC Bioinform 4(1):8 76. Fogolari F, Corazza A, Yarra V, Jalaru A, Viglino P, Esposito G (2012) Bluues: a program for the analysis of the electrostatic properties of proteins based on generalized Born radii. BMC Bioinform 13(Suppl 4):S18 77. Cossio P, Granata D, Laio A, Seno F, Trovato A (2012) A simple and efficient statistical potential for scoring ensembles of protein structures. Sci Rep 2:1–8 78. Sarti E, Zamuner S, Cossio P, Laio A, Seno F, Trovato A (2013) Bachscore. A tool for evaluating efficiently and reliably the quality of large sets of protein structures.
Comput Phys Commun 184(12):2860–2865

Chapter 17
Computational Design of Miniprotein Binders
Younes Bouchiba, Manon Ruffini, Thomas Schiex, and Sophie Barbe

Abstract
Miniprotein binders hold great interest as a class of drugs that bridges the gap between monoclonal antibodies and small molecule drugs. Like monoclonal antibodies, they can be designed to bind to therapeutic targets with high affinity, but they are more stable and easier to produce and to administer. In this chapter, we present a generic structure-based computational approach for miniprotein inhibitor design. Specifically, we describe step-by-step the implementation of the approach for the design of miniprotein binders against the SARS-CoV-2 coronavirus, using available structural data on the SARS-CoV-2 spike receptor binding domain (RBD) in interaction with its native target, the human receptor ACE2. As structural data become increasingly accessible for many protein–protein interaction systems, this method might be applied to the design of miniprotein binders against numerous therapeutic targets. The computational pipeline exploits provable and deterministic artificial intelligence-based protein design methods, with some recent additions in terms of binding energy estimation, multistate design and diverse library generation.

Key words Computational protein design, Miniprotein binders, Multistate protein design, Binding affinity, Protein–protein interaction, SARS-CoV-2.

1 Introduction
Miniprotein binders can have antibody-like affinity and functionality with the advantages of stability and amenability to synthesis over monoclonal antibodies [1–3]. They can also avoid weaknesses of larger scaffolds such as poor tissue or cell penetration, protease and reduction sensitivity. Miniprotein binders can also have several advantages over small molecule drugs, notably the capacity to block protein–protein interactions when a deep binding pocket is missing at the interface.
Therefore, they have the potential to span the gap between monoclonal antibodies and small molecule drugs and thus greatly impact therapies and diagnoses. The ability to design stable miniprotein binders with tight affinity for a given target is thus of great interest for a wide range of medical applications [2, 4]. We present here a computational pipeline for the design of miniprotein binders based on our advanced methods in the structure-based computational protein design (CPD) field.

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_17, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

CPD has become a valuable approach in protein engineering to identify protein sequences that can adopt a given 3D fold and possess desired properties, from the exploration of combinatorial sequence spaces astronomically larger than those that can be tested experimentally [5–8]. The most usual description of the CPD problem relies on a pairwise decomposable energy function, a discretized description of the amino acid conformational space based on a library of frequent side chain conformations (i.e., rotamers) and a single fixed backbone conformation. Under such assumptions, referred to as single state design (SSD), the problem of searching for a sequence with the global minimum energy conformation (GMEC) is known to be NP-hard [9]. Because of this, most CPD approaches rely on stochastic optimization algorithms [10–12], mainly Monte Carlo simulated annealing [13] (as implemented in Rosetta [14]), which provide only asymptotic convergence guarantees. Although they have the advantage of providing a solution at any time, they neither guarantee finding the GMEC in finite time nor a bounded energetic distance to the optimal solution.
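To make the single state design formulation concrete, the sketch below enumerates a toy problem exhaustively, assuming a pairwise decomposable energy made of one-body and two-body terms. All numbers, names (`total_energy`, `brute_force_gmec`), and the data layout are hypothetical and chosen for illustration; exhaustive enumeration is only feasible for tiny instances, which is precisely why stochastic or provable solvers are needed in practice.

```python
from itertools import product


def total_energy(assignment, unary, pairwise):
    """Pairwise decomposable energy: sum of one-body and two-body terms."""
    n = len(assignment)
    energy = sum(unary[i][assignment[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            # Missing pair entries are treated as zero interaction energy.
            energy += pairwise.get((i, j), {}).get(
                (assignment[i], assignment[j]), 0.0)
    return energy


def brute_force_gmec(unary, pairwise):
    """Enumerate every rotamer assignment and return the lowest-energy one."""
    choices = [range(len(terms)) for terms in unary]
    best = min(product(*choices),
               key=lambda a: total_energy(a, unary, pairwise))
    return best, total_energy(best, unary, pairwise)


# Hypothetical toy instance: 3 positions with 2 rotamers each (8 assignments).
unary = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]
pairwise = {
    (0, 1): {(0, 0): 2.0, (1, 1): -1.5},
    (1, 2): {(0, 0): 0.3, (1, 1): -0.5},
}

gmec, gmec_energy = brute_force_gmec(unary, pairwise)
print(gmec, round(gmec_energy, 3))
```

In this toy instance the favorable two-body terms outweigh the unfavorable one-body term at the first position, so the best assignment is not the one obtained by picking the best rotamer at each position independently. The search space grows as the product of the numbers of rotamers per position, so the enumeration above cannot be used beyond a handful of positions.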
To try to circumvent this limitation, multiple independent runs are performed (each with a predefined number of steps) in order to cover, as well as possible, a rugged energy landscape. However, the accuracy of metaheuristic methods may drastically decrease as problem size increases [15, 16]. In contrast, provable methods [17], historically based on the Dead-End-Elimination (DEE) theorem and the A* algorithm [18], provably identify the optimal minimum energy design in finite time. Unfortunately, they are rapidly outstripped by the size of the search space and often do not provide any solution in reasonable time [19–21]. The capacity to efficiently identify sequences with optimal conformations is thus challenging, especially in de novo design, where the full sequence is designed. Relying on state-of-the-art “automated reasoning” artificial intelligence algorithms, more specifically cost function networks (CFNs) [22], we developed a provable CPD method [16, 19–21] that speeds up searching by several orders of magnitude compared to previous provable methods. It thus enables, with provable guarantees and on standard hardware, the identification of the GMEC as well as suboptimal solutions within a given energy gap to the GMEC, from a combinatorial search space whose size is far beyond what has been solved with previous provable methods. The remarkable efficiency of CFN algorithms in handling vast search spaces was then exploited to push back some limits in the formulation of the CPD problem. In particular, the fixed backbone approximation in CPD may lead to the rejection of sequences that would be accepted if a slightly different backbone conformation was allowed. Furthermore, protein backbone flexibility plays a key role in protein properties and intermolecular interactions, and this can be even more crucial for peptides and miniproteins.
To address this, positive multistate protein design (MSD) with backbone ensembles is an attractive approach [23–26]. In MSD, sequence design relies on the energy contributions of several protein backbone conformations. Combinations of rotamers that would be removed using a single fixed backbone conformation could be accepted when several backbone conformations are used to inform sequence selection. Two main MSD approaches based on distinct criteria (or fitness) can be considered for sequence design. The first one seeks to design a sequence for any of the considered backbone states. The Boltzmann-weighted average of the energies (defined as the sum of optimal energies, weighted by their Boltzmann probabilities) in each state may be an attractive criterion [27]. Because this gives an exponential advantage to the backbone with lowest energy, the computation of the fitness has been approximated by the minimum optimal energy [24, 28, 29], defining what is called “multistate analysis” (MSA) or minMSD. The second approach aims to design a sequence that simultaneously fits several conformational states mimicking protein flexibility required to ensure targeted function. In this case, it is usual to optimize the average of the optimal energies over all states, a problem denoted Σ-MSD [26]. We recently introduced efficient reductions of positive MSD problems to CFNs, with the two criteria [30]. These CFN-based methods, implemented in POMPd, have emerged as efficient MSD approaches, outperforming state-of-the-art provable methods to identify the guaranteed optimal solution and exhaustively enumerate suboptimal sequences. They can solve, in reasonable time, MSD problems with several backbone conformations defining search sizes unreachable up to now except with stochastic MSD approaches with no guarantees of quality [30]. 
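The three multistate fitness criteria mentioned above (Boltzmann-weighted average, minimum, and plain average of the per-state optimal energies) can be compared on a small example. The sketch below is illustrative only and is not the POMPd implementation: the function names, the value of kT (about 0.593 kcal/mol, i.e., room temperature), and the per-state energies are assumptions.

```python
import math

# Assumption: energies in kcal/mol and kT at about 298 K.
KT = 0.593


def boltzmann_weighted_average(energies, kt=KT):
    """Average of the per-state optimal energies weighted by Boltzmann probabilities."""
    e_min = min(energies)
    # Shift by the minimum before exponentiating for numerical stability.
    weights = [math.exp(-(e - e_min) / kt) for e in energies]
    partition = sum(weights)
    return sum(w * e for w, e in zip(weights, energies)) / partition


def min_msd_fitness(energies):
    """minMSD (MSA) criterion: the lowest optimal energy over all backbone states."""
    return min(energies)


def sigma_msd_fitness(energies):
    """Sigma-MSD criterion: the average of the optimal energies over all states."""
    return sum(energies) / len(energies)


# Hypothetical optimal energies of one sequence on three backbone states.
state_energies = [-120.0, -118.5, -110.0]

bw = boltzmann_weighted_average(state_energies)
lowest = min_msd_fitness(state_energies)
mean = sigma_msd_fitness(state_energies)
print(round(bw, 2), lowest, round(mean, 2))
```

Because the Boltzmann weights decay exponentially with the energy gap, the weighted average stays close to the lowest state energy, which is the rationale given in the text for approximating it by the minimum. The plain average, in contrast, penalizes a sequence that fits poorly on any one of the states.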
Beyond the identification of the GMEC and the enumeration of suboptimal sequences within an energy threshold of the GMEC, we also extended our CPD approaches to generate libraries of sequences which are both diverse and of low energies [31]. Indeed, producing diverse sequences with a known energy distance to the GMEC can be useful to alleviate the effect of the approximations that exist in CPD models. Based on an incremental CFN approach using sequence diversity constraints that lower-bound the Hamming distance between sequences, the developed method enables the efficient identification of sequences that satisfy guarantees on both sequence diversity and energy quality. In addition to all these, our CFN-based method, EasyE, can efficiently estimate protein–protein binding energies of a large number of mutants [32]. The binding energy is estimated by the difference in energy of the bound and unbound proteins in their globally optimal rotameric side chain conformations. Compared to state-of-the-art computational methods, EasyE shows better correlation coefficients between predicted and experimental values. EasyE is thus highly useful to rank mutant sequences of libraries according to their binding energy. The CFN-based CPD methods are generic and can be applied to the design of different types of proteins. They recently enabled the engineering of a highly stable artificial self-assembling symmetrical eight-bladed β-propeller [33], optimized enzymes in terms of activity and thermostability, as well as new, highly stable nanobody scaffolds (data not published). Based on these methods, in this chapter, we present a pipeline for miniprotein binder design. We provide a step-by-step guide to the pipeline and illustrate it with an application to the design of miniprotein binders against the SARS-CoV-2 coronavirus that emerged in late 2019 and has since caused a global pandemic [34].
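The notion of a diversity constraint that lower-bounds the Hamming distance between library sequences can be illustrated as follows. This sketch is a simple greedy post-filter over an already scored list of sequences; it is not the incremental CFN method of Ref. 31 and, unlike that method, carries no guarantee on energy quality. The function names and the toy sequences and energies are hypothetical.

```python
def hamming(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def greedy_diverse_library(scored_sequences, min_distance, size):
    """Greedy illustration of a Hamming-distance diversity constraint.

    Sequences are visited by increasing energy and kept only if they
    differ from every already selected sequence at `min_distance`
    positions or more.
    """
    library = []
    for seq, energy in sorted(scored_sequences, key=lambda pair: pair[1]):
        if all(hamming(seq, kept) >= min_distance for kept, _ in library):
            library.append((seq, energy))
        if len(library) == size:
            break
    return library


# Hypothetical four-residue sequences with made-up energies.
candidates = [
    ("AAAA", -10.0),
    ("AAAT", -9.8),
    ("AATT", -9.5),
    ("TTTT", -9.0),
    ("ATTT", -8.0),
]

library = greedy_diverse_library(candidates, min_distance=2, size=3)
print([seq for seq, _ in library])
```

Here the second-best sequence is skipped because it differs from the best one at a single position, so the library trades a small amount of energy for diversity.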
SARS-CoV-2 initiates its entry into host cells by binding to the human angiotensin-converting enzyme 2 (ACE2) via the receptor binding domain (RBD) of its spike protein [35, 36]. Miniprotein binders that can block SARS-CoV-2 RBD from binding to ACE2 may potentially prevent the virus from entering human cells and serve as an effective antiviral drug [37]. Numerous works aim to develop such antiviral drugs [38]. As miniproteins are more suitable than small molecules for disrupting protein–protein interactions, by specifically binding to the interface binding region, some studies have recently focused on the design of such anti-SARS-CoV-2 binders, based on the analysis of the 3D structure of the ACE2-SARS-CoV-2 RBD complex [39–41]. Compared to monoclonal antibodies, miniprotein binders can also have the advantage of reduced immunogenicity. This chapter reports a complete computational approach to design such miniprotein binders, starting from the de novo building of a 3D scaffold. The overall and versatile pipeline based on CFN technology can be used for different targets. It includes several successive steps (Fig. 1). From the construction of a 3D scaffold of the miniprotein, a multistate design of the core region is performed using several backbone states generated with a backrub-like backbone simulation. An initial binding mode between the miniprotein and the target protein is then built by superimposing some structural motifs of the miniprotein on the protein–protein complex to be inhibited. This initial binding mode is used as reference during an automated docking step. For several selected binding modes, a multistate design of the binding interface (and residues outside the core region) is performed with several backbone states (also generated using a backrub-like backbone simulation), generating a library of diverse and low energy sequences.
The last step of the pipeline consists in ranking the library sequences according to the binding energy to the target and the energy of the minibinder. Finally, candidate sequences can be selected for experimental testing. The detailed description of the implementation of the computational pipeline is given below and exemplified with the design of a triple alpha helix anti-SARS-CoV-2 miniprotein. The choice of a triple helix bundle domain results from the analysis of the structural motifs involved in the interaction of ACE2 with the SARS-CoV-2 RBD [42].

Fig. 1 Computational design pipeline. The ellipses, rectangles, and lozenges indicate data types, operations, and criteria of selection, respectively. The main steps of the pipeline include: (1) the construction of the 3D scaffold domain; (2) multistate design of the core region with several backbone states generated using a backrub-like backbone simulation; (3) docking of the miniprotein on the target and selection of binding modes; (4) multistate design of the binding interface (and residues out of the core region) and generation of a sequence library; (5) ranking of the sequences according to the binding energy ΔEbinding and the energy of the minibinder Eminiprotein

2 Materials
The computational miniprotein design process relies on POMPd [30] and EasyE [32], implemented using the toulbar2 prover [22] and PyRosetta version 4 for energy computation (based on the Rosetta beta_nov16 energy function [43]). The overall computational pipeline also involves Rosetta version 3.11 [14] for energy relaxation, docking, and backrub-type backbone sampling.
AMBER18 and AmberTools v19.12 [44] are used for molecular modelling and analyses. Python 3 is required for computations (we used version 3.6.9). Analysis and graphics were performed using R version 3.4.4. Molecular models were visualized and aligned using PyMOL (Schrödinger, LLC). WebLogo3 (http://weblogo.threeplusone.com/create.cgi) was used for the generation of sequence logos. The hardware used consisted of a single workstation with Intel Xeon E5-2650 2.3 GHz CPUs, running the Ubuntu LTS distribution 18.04.

3 Methods
3.1 Building of Miniprotein Scaffold and Core Design
This section explains the different steps, ranging from the building of the 3D triple alpha helix scaffold of the miniprotein up to its core design using MSD with backbone conformations generated from a backrub-like backbone simulation that recapitulates natural protein conformational variability [45] (Fig. 2).

3.1.1 Miniprotein 3D Scaffold Construction
1. Access the CCBuilder 2.0 web server (coiledcoils.chm.bris.ac.uk/ccbuilder2/builder).
2. Using the advanced options menu, set the oligomeric state to three and tick the Anti Parallel box for the second chain.
3. Provide an input sequence. The input sequence is dependent on knowledge of the targeted interaction. For our system, we used STIEEQAKTFLDKFNHEA, GDKWSAFLKEQSTLAQMY, and AQNLQNLTVKLQLQALQ, originating, respectively, from ACE2 alpha helix regions 19–36, 66–83, and 89–101, involved in the binding interface with SARS-CoV-2 spike RBD (PDB ID: 6LZG).

Fig. 2 Triple alpha helix miniprotein backbone building and its core design. The triple alpha helix bundle backbone was constructed using CCBuilder (a coiled-coil modeling server) [46].
The 3D backbone scaffold was generated using, as sequence, the amino acid types of the ACE2 alpha helix regions involved in the binding with SARS-CoV-2 RBD, identified by visualization of the ACE2/SARS-CoV-2 X-ray structure (PDB ID: 6LZG). To connect the helices, two serine residues were added to form each linker, and the 3D model was then minimized using the ff14SB force field of AMBER18 [47]. The full 3D miniprotein (57 amino acid residues) was then relaxed using RosettaRelax [48] with the beta_nov16 energy function [43]. Backbone conformations were sampled using the Rosetta backrub-like backbone simulation method. After clustering, three backbone conformations were selected, and MSD was performed for designing the core region of the miniprotein. The beta_nov16 energy function was also used for MSD with POMPd [30]

4. Add linkers to connect the helices and minimize the miniprotein. In our case, linkers formed by two serines were added to connect the helices, using the Build functionality of PyMOL (Schrödinger, LLC), and the system was minimized (5000 steps of steepest descent and 5000 steps of conjugate gradient), using the ff14SB force field and the sander module of AMBER18 [44] (see Note 1).
5. Relax the 3D miniprotein model using RosettaRelax with the beta_nov16 energy function and the command:

nohup relax.linuxgccrelease -s MyMiniProtein.pdb -nstruct 10 -keep_input_protonation_state -overwrite -beta_nov16 &

Check that the relaxed structure has not deviated too much from the initial conformation (backbone RMSD < 0.5 Å) and pick the lowest energy relaxed model; otherwise, consider using the following options:

-relax:constrain_relax_to_start_coords
-relax:coord_cst_stdev 0.5
-relax:coord_constrain_sidechains

3.1.2 Miniprotein Core Design
The amino acids of the core of the 3D miniprotein model previously built and relaxed (MyMiniProtein.pdb) are designed using the MSD method (POMPd), with the Σ-MSD criterion.
The objective is to stabilize the miniprotein core before designing the amino acids that bind the target.

1. Generate backbone conformations for MSD using the RosettaBackrub method [45] with the command:

    nohup mpirun -np 10 backrub.mpi.linuxgccrelease -in:file:s MyMiniProtein.pdb -beta_nov16 -nstruct 100 -backrub:ntrials 10000 -keep_input_protonation_state &

Cluster the generated conformations by backbone RMSD using the K-means algorithm, requesting N clusters, as implemented in the cpptraj module [49] of AMBER18 [44] (see Note 2). Select the lowest-energy conformation of each cluster and name the corresponding PDB files MyMiniProtein1.pdb, MyMiniProtein2.pdb, etc. In our example, three backbone conformational states were selected and considered as input for MSD, in addition to the initial relaxed conformation (MyMiniProtein.pdb).

2. Select the amino acids defining the core of the miniprotein, for example, by visual inspection of the 3D model using PyMOL.

3. Create a so-called Resfile that specifies the amino acid residues to be designed (the core residues defined in the previous step) and the residues whose side chains are only flexible in MSD (all other residues of the miniprotein). An example Resfile is given in Note 3. An identical Resfile is required for each backbone considered in MSD (named MyMiniProtein.resfile, MyMiniProtein1.resfile, MyMiniProtein2.resfile, etc.).

4. Clone the POMPd git repository, copy the 3D structural models (MyMiniProtein.pdb, MyMiniProtein1.pdb, etc.) and the Resfiles (MyMiniProtein.resfile, MyMiniProtein1.resfile, etc.) into the positive/ directory of the repository, compile the toulbar2 solver, and submit MSD as follows:

    cd /PATH/TO/YOUR/DIRECTORY/
    git clone https://forgemia.inra.fr/thomas.schiex/pompd
    cp MyMiniProtein*.pdb MyMiniProtein*.resfile pompd/positive/
    cd pompd/positive/
    make toulbar2
    cd pompd/positive
    nohup make positive.gmec &

Progress is logged in the nohup.out file. The designed sequence is saved in the positive.seq output file, where it is repeated once for each state. The script below extracts one copy of the sequence and maps it onto each backbone state by side chain placement.

    size=$(awk '{printf $1}' MyMiniProtein.nat | wc -m)
    seq=$(cut -c1-"$size" positive.seq)
    for i in *.pdb
    do
        python3 ./exe/tb2cpd.py --doscp --scpseq "$seq" -i "$i"
    done

The 3D miniprotein model corresponding to the designed sequence mapped on the initial relaxed backbone state (saved as MyMiniProteinCD.pdb) is used in the following steps. In our example, this miniprotein is named Gyro.

3.2 Docking on the Protein Target

This section describes the exploration and selection of binding modes of the miniprotein (MyMiniProteinCD) on the protein target using RosettaDock [50]. Available structural data on the protein–protein interaction to be blocked by the miniprotein (Complex_ref.pdb) can be used as a reference for the docking. An initial binding mode between the miniprotein and the target protein can be built by superimposing the miniprotein on the corresponding structural motifs (Binding_native_residues) of the native binding protein. The resulting 3D model can then be used as the starting binding mode for docking. In our case, the structure of the SARS-CoV-2 RBD in complex with ACE2 (PDB ID: 6LZG) [42] was used, and the miniprotein was aligned to the structural motifs (regions 19–36, 66–83, and 89–101) of ACE2 involved in binding to the SARS-CoV-2 RBD (Fig. 3).

1. Using PyMOL, fetch Complex_ref.pdb from the Protein Data Bank (rcsb.org) [51], load MyMiniProteinCD.pdb, select the Binding_native_residues, and align MyMiniProteinCD.pdb to the Binding_native_residues.

    PyMOL>fetch Complex_ref.pdb
    PyMOL>load MyMiniProteinCD.pdb
    PyMOL>select Complex_ref and resid [Binding_native_residues]
    PyMOL>align MyMiniProteinCD, sele, cycles=0

Save the complex between the target and the miniprotein as MyMiniProtein_Target.pdb (in our example, the complex of the RBD target (residues 333 to 527 in 6LZG.pdb) with the miniprotein).

2. Relax the complex MyMiniProtein_Target.pdb using the same command and procedure as described in Subheading 3.1.1. Name the output structural model MyMiniProtein_Target_relax.pdb.

3. Run the docking simulation using the following command:

    nohup mpirun -np 2 docking_protocol.mpi.linuxgccrelease -s MyMiniProtein_Target_relax.pdb -ex1 -ex2aro -partners B_A -dock_pert 2 5 -docking:sc_min -nstruct 100 -use_input_sc -beta_nov16

4. Minimize each binding model.

    minimize.default.linuxgccrelease -s *.pdb -run:min_type lbfgs_armijo_nonmonotone -run:min_tolerance 0.001 -out:suffix .min -beta_nov16

Select several binding models and save them as MyMiniProtein_Target_pose1.pdb, MyMiniProtein_Target_pose2.pdb, etc. In our example, six distinct docking poses (Fig. 3) were selected among the 100 generated. The distribution of the 100 protein binding modes according to their energy score is shown in Fig. 3. As the interaction surface is subsequently redesigned, the selection of the protein binding modes was based on the analysis of the shape complementarity of the surfaces of the two proteins, rather than specifically on the best energy scores.

[Fig. 3 panels: (A) miniprotein (Gyro), native binding protein (ACE2), target (SARS-CoV-2 RBD), miniprotein-target complex relaxation; (B) docking and minimization, histogram of Counts versus Energy [kcal/mol], with minimized docking poses 1 to 6]

Fig. 3 Miniprotein/SARS-CoV-2 RBD docking. (a) The miniprotein model named Gyro (colored in green) was superposed on ACE2 (colored in cyan), based on the corresponding binding interface regions in the complex between ACE2 and the SARS-CoV-2 RBD (in green). The resulting model of the miniprotein with the SARS-CoV-2 RBD was relaxed using the Rosetta beta_nov16 score function. The relaxed complex was then used as the starting conformation for docking. Each generated docking pose was minimized using the beta_nov16 score function. (b) The distribution of the 100 docking poses according to their energy is shown. Six models were selected, giving priority to those that can give rise to the largest interaction surface.

3.3 Design of Diverse Sequence Libraries

This section concerns the redesign of the binding interface between the miniprotein and the protein target for each of the binding poses previously selected. MSD [30] is performed with an ensemble of backbone conformational states of the complex, sampled using RosettaBackrub [52], and a library of diverse, low-energy sequences is generated. For each selected docking pose, the following protocol is applied:

1. Perform backrub-like backbone simulations and select conformational states using the procedure of Subheading 3.1.2. Save the selected states as MyMiniProtein_Target_pose[n]_1.pdb, MyMiniProtein_Target_pose[n]_2.pdb, etc. The selected states and the initial conformation of the binding pose (MyMiniProtein_Target_pose[n].pdb) are used as input in MSD. In our case, two conformational states of the Gyro/SARS-CoV-2 RBD complex were selected and considered as input for MSD, in addition to the initial relaxed conformation of the complex.

2. Create the Resfile that specifies the amino acid residues to be designed (all residues except the core residues defined in Subheading 3.1.2, which are only considered as flexible) (see Note 3). One copy of this file must be saved for each state considered in MSD (MyMiniProtein_Target_pose[n].resfile, MyMiniProtein_Target_pose[n]_1.resfile, etc.).

3. Create a file that specifies the sequence diversity constraint (based on the Hamming distance between sequences) for the generation of the library, and name it with the same prefix as the .pdb file and the .resfile, using the .div suffix, for example MyMiniProtein_Target_Pose[n].div. The file is formatted as JSON and should look like the following example:

    {
      "regions": ["1-57 A"],
      "diversity": 6,
      "solutions": 10
    }

This example indicates that the residues considered are residues 1 to 57 of chain A and that 10 sequences are requested, with a diversity constraint of at least six mutations between any two sequences. Then generate the library with:

    make MyMiniProtein_Target_Pose[n].opt+lib

The output files are:

• positive.seq: contains all optimal sequences of the library with the total energy of each sequence over all the conformational states considered in MSD. Note that each sequence is repeated once for each input state provided. Please refer to Subheading 3.1.2 for commands to extract single sequences.

• MyMiniProtein_Target_Pose[n].opt+[m], for each of the m sequences of the library: the 3D model (in PDB format) of the sequence mapped on the MyMiniProtein_Target_pose[n] backbone state.

For our miniprotein design, Gyro, we previously retained six binding poses from docking simulations, and for each of them, we generated three diverse sequence libraries using MSD with diversity constraints of 1, 3, and 6 mutations. We requested 10 sequences per library, leading to a total of 180 designed sequences.
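The library bookkeeping above can be sketched in a few lines of Python. The snippet below builds the content of a .div file with the three fields shown in the JSON example, checks a pairwise Hamming diversity constraint on a set of sequences, and recomputes the number of requested sequences (6 poses × 3 diversity constraints × 10 sequences). The helper names and the toy sequences are hypothetical; only the JSON keys are taken from the example, and nothing here calls POMPd itself.

```python
import json
from itertools import combinations


def hamming(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def satisfies_diversity(sequences, min_mutations):
    """True if every pair of sequences differs at min_mutations positions or more."""
    return all(hamming(a, b) >= min_mutations
               for a, b in combinations(sequences, 2))


def diversity_spec(regions, diversity, solutions):
    """JSON text with the three keys used in the .div example."""
    return json.dumps(
        {"regions": regions, "diversity": diversity, "solutions": solutions},
        indent=2,
    )


# Content for a .div file requesting 10 sequences at least 6 mutations apart.
print(diversity_spec(["1-57 A"], 6, 10))

# Toy 8-residue sequences (invented): the last two differ at only 4 positions.
toy_library = ["AAAAAAAA", "AAWWWWWW", "WWWWWWAA"]
print(satisfies_diversity(toy_library, 4))
print(satisfies_diversity(toy_library, 6))

# Library size in the Gyro example:
# poses x diversity constraints x sequences per library.
n_poses, constraints, per_library = 6, (1, 3, 6), 10
total_requested = n_poses * len(constraints) * per_library
print(total_requested)
```

A check of this kind can be run on the sequences extracted from positive.seq to confirm that a generated library respects the requested constraint before moving on to ranking.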
We considered three conformational states of the Gyro/SARS-CoV-2 RBD complex when performing MSD for each docking pose. Each design took roughly 15 min for three states of a Gyro/SARS-CoV-2 RBD complex with 30 mutable positions and the 222 other residues treated as flexible. The resulting library comprises 159 unique sequences out of 180. The average energies and amino acid variability of the designed sequences are shown in Fig. 4. In the library, the amino acid residues of the first two helices are more variable than those of the C-terminal helix. The first two helices are the ones mainly involved in binding the RBD in the native binding mode of ACE2 (PDB ID: 6LZG). This is consistent with the entropy profile: these regions sit in a more chemically diverse neighborhood than the C-ter helix, which is mainly exposed to the solvent. As shown in Fig. 4, the ranking of the poses is essentially the same as after docking, except for a swap between poses 3 and 6.

Fig. 4 Designed miniproteins. (a) Energy and sequence entropy per docking pose considered. The average energy of the designed miniproteins in complex with the spike RBD, over the conformational space used for MSD, is presented for every design, together with the Shannon entropy associated with the sequences produced for each docking pose. (b) All sequences were merged for WebLogo construction, portraying the sequence space mapped for RBD binding. The Shannon entropy of all sequences (n = 180) is mapped on the scaffold of the initial relaxed conformation of Gyro.

3.4 Designed Miniproteins Ranking and Analysis

For each designed miniprotein, the binding energy with the protein target, ΔEbinding, is estimated using EasyE [32] in order to rank the sequences of the library according to their affinity for the target protein. EasyE can be downloaded and installed using the commands:

    git clone https://git.renater.fr/anonscm/git/easy-jayz/easy-jayz.git
    cd easy-jayz/exes/
    sh toulbar2-install.sh

Note that its usage requires specific Python libraries, as listed in the Quick_start.pdf file. An example of usage is provided in the Example/ folder and described in the Quick_start.pdf instruction file in the downloaded git archive.

As input files, EasyE requires:

• A .pdb file corresponding to the 3D structural model of the miniprotein scaffold with the initial sequence, used as input in the previous MSD (see Subheading 3.3): MyMiniProtein_Target_Pose[n].pdb.

• A .seqE file that specifies, for each designed sequence, the changes of amino acid types before and after design: mut.seqE. This file should contain all the sequences to be mapped on a given scaffold for binding energy computations. A simple procedure to generate such a file is detailed in Note 4.

The ΔEbinding calculation can be performed using the MyMiniProtein_Target_Pose[n].pdb file and the mut.seqE file in the easy-jayz/ directory with the following command:

    ./exes/EasyE.py --pdb ./MyMiniProtein_Target_Pose[n].pdb --seq ./mut.seqE --partner A_B --score beta_nov16 --v 1 --min 1 --rec 1 --forced_out 1 --lig 1

Here is a short description of each argument used:

• --partner: The interaction chains, following the nomenclature [Designed partner]_[Binding target].

• --score: Score function used, here beta_nov16 [43].

• --v: Verbose output.

• --min: Performs minimization with harmonic restraints of 0.5 kcal/mol standard deviation and a 0.001 tolerance threshold.

• --rec: Forces the computation of the receptor energy.

• --lig: Forces the computation of the ligand energy.

• --forced_out: Forces the output of energies.

Two output files are produced:

• MyMiniProtein_Target_Pose[n].DeltaE: contains the values of the binding energy, ΔEbinding.

• MyMiniProtein_Target_Pose[n].E: contains the energy values of the bound (Eminiprot/target) and unbound states (Eminiprot and Etarget).

For ranking, we use ΔEbinding as a metric of binding affinity and Eminiprot to check the miniprotein stability. A miniprotein of choice would have a high affinity for its intended target and yet be sufficiently stable in its unbound form. For the post-analysis of our design results, we scored each of the states used in MSD with each designed sequence of the library mapped onto it, and took the state with the minimum ΔEbinding as the one defining the potential affinity of the sequence. With this criterion, Pose 1 appears as the best scaffold for binding design. While it has only a slightly lower binding energy than Pose 4, its unbound form is far more stable, as indicated by its lower Eminiprot. Interestingly, the entropy profile of the designed libraries based on the best pose scaffold (Pose 1) shows low sequence entropy at the N-ter of the scaffold and higher entropy at the C-ter helix, which points toward the solvent (Fig. 5). However, the lowest-energy design under the ΔEbinding ranking criterion carries 7 mutations at this site, most of which are hydrophobic (6W, 9I, 12F, 13W, 16V, and 17I) or negatively charged at the N-ter (2D). The high proportion of hydrophobic residues packed at the RBD binding site seems to favorably accommodate the surface residues of the RBD binding site, which are mainly hydrophobic (Fig. 5).

Fig. 5 Stability and affinity assessment for designed miniproteins. (a) The values of the best ΔEbinding (first panel) and Eminiprot (second panel) are given in kcal/mol. The optimal design for each criterion, ΔEbinding or Eminiprot, is presented below. (b) Sequence entropy profiles for the best designs are mapped on Gyro's scaffold, and the WebLogo representing amino acid diversity is shown beside them. The 3D model of the lowest ΔEbinding design, in interaction with the RBD, is presented with interface residues highlighted in lines representation. The mutated residues are shown in sticks representation and are written in red in the sequence on the right. The design space is shown with a light blue background in the sequence.

4 Conclusion

We described a generic method for miniprotein inhibitor design, illustrating its feasibility using structural data on the SARS-CoV-2 RBD in interaction with its native target, the human receptor ACE2. As structural data become increasingly accessible for many molecular binding systems, this method could be applied to a variety of protein complexes. Compared to common heuristic methods, our method is reliable and convenient to use; it combines optimality guarantees with computational efficiency. As an indication, generating 10 sequences with a Hamming diversity constraint of six mutations between all produced sequences, using three conformational states as input, 33 mutable positions, and 222 flexible positions, took around 15 min on standard hardware. The main advantage of diversity constraint-based library generation is that it better accounts for the fact that energy functions are still approximate; it spans the mutation profiles of a given protein scaffold more efficiently, yielding a more varied list of mutants to be tested and increasing the probability that a functional mutant is identified [31]. Common methods might get trapped around the minimal energy sequence and hence fail to generate diverse yet low-energy sequences. We compared the best miniprotein obtained with this protocol to recently proposed minibinders of comparable size that bind the same target [40]. The LCB1 and LCB3 minibinders show EasyE-estimated binding energies of −75.81 kcal/mol and −55.77 kcal/mol, respectively. The best solution produced by our protocol reached an estimated ΔEbinding of −131.9 kcal/mol.
This makes an experimental assay of the proposed miniprotein attractive, to confirm its practical functionality.

5 Notes

1. Initial models were prepared and energy-minimized with AMBER18 [44] using the ff14SB force field [47] with a 9 Å cutoff. The commands below can be entered either interactively in the tleap terminal or saved in a file leap.in and run with the command tleap -f leap.in. The output files can then be used for minimization. An input file is provided below, along with the command to run the minimization with sander [44].

Leap input file:

    source oldff/leaprc.ff14SB
    source leaprc.gaff
    source leaprc.water.tip3p
    mol = loadpdb MyMiniProtein.pdb
    saveAmberParm mol MyMiniProtein.prmtop MyMiniProtein.inpcrd
    quit

Sander input file:

    MINIMIZATION
    &cntrl
    imin=1, ntx=1, irest=0, ntpr=50, ntf=1, ntb=0,
    cut=9.0, nsnb=10, ntr=1, maxcyc=10000, ncyc=5000, ntmin=1,
    &end
    END

Then run the following command:

    sander -O -i min.in -p MyMiniProtein.prmtop -c MyMiniProtein.inpcrd -r MyMiniProtein_min.rst -o mini.out -ref MyMiniProtein.inpcrd

2. Backrub simulation clustering can be achieved through K-means clustering, as implemented in the cpptraj module of AmberTools19 [49]. Here is an example script for clustering the .pdb output of a RosettaBackrub simulation. This method ensures access to conformationally diverse states. The procedure could be simplified by taking the lowest-energy conformations of the backrub ensemble.

cpptraj_cluster.in:

    parm MyProtein.pdb
    trajin *pdb
    average crdset MyAvg
    run
    rms ref MyAvg :1-57&@C,CA,N,O
    cluster c1 \
      kmeans clusters 3 randompoint maxit 500 \
      rms :1-57&@CA,C,N,O \
      out cluster_cnumvtime.dat \
      summary cluster_summary.dat \
      info cluster_info.dat
    run

3. Resfiles stipulate which residues are respectively mutable, flexible, or rigid during design, using Rosetta syntax. Mutable residues are allowed to adopt any rotamer of any specified residue type. All residue types can be allowed with the ALLAA keyword, while specific types can be given using PIKAA. If no resfile is provided, all residues will be considered as flexible only. You can force a residue to be flexible using the NATAA keyword. Rigid residues are ignored in the energy optimization and are specified by the NATRO keyword. Note that rigid residues will not appear in the output sequences (.seq extension). This can be cumbersome, for example, when using the substitution_names_easyE.py script provided for Subheading 3.4. A simple trick to get the native sequence without its rigid residues is to perform a side chain packing of your protein with all residues, except the rigid ones, set as flexible. These sequences can then be used for design comparison. For more detailed resfile syntax, please refer to the Rosetta documentation (https://www.rosettacommons.org/docs/latest/rosetta_basics/file_types/resfiles). The first lines of an example resfile used in this chapter are presented below:

    NATAA USE_INPUT_SC
    start
    1 A ALLAA USE_INPUT_SC
    2 A ALLAA USE_INPUT_SC
    3 A ALLAA USE_INPUT_SC
    5 A ALLAA USE_INPUT_SC
    6 A ALLAA USE_INPUT_SC
    9 A ALLAA USE_INPUT_SC
    ...

4. EasyE relies on a specific sequence format for mutation specification. The file follows a [Residue Number]_[Mutation type] syntax. Multiple mutations must be separated with an underscore. A Python script is provided in the easy-jayz/exes/ folder for converting your designed sequences into the proper format: substitution_name.py. This script can be used to convert a toulbar2 output sequence (MyMiniProtein_Target_Pose[n].nat native sequences and positive.seq or MyMiniProtein_Target_Pose[n].seq for respectively MSD and SSD) to an EasyE readable file.
For each design condition (pose, diversity constraint, conformational state), use:

    awk '{print $1}' MyProtein_Target_min.nat > ref.seq
    size=$(awk '{printf $1}' MyProtein_Target_min.nat | wc -m)
    awk '{print $1}' positive.seq | cut -c1-"$size" > mut.seq
    ./exes/substitution_name_easyE.py --mut mut.seq --ref ref.seq > mut.seqE

The output file mut.seqE contains the designed sequences in the proper format, i.e., as mutations relative to the native sequence. Please note that rigid residues will be absent from the output sequence, as they are ignored in the energy optimization. Hence, we recommend referring to Note 3 for more detailed instructions on how to handle designs with rigid residues.

Acknowledgments

The authors thank the French ANR for financial support through the grant ANR-19-PI3A-0004. We also thank the Computing mesocenter of Région Midi-Pyrénées (CALMIP, Toulouse, France) for providing access to the HPC resources.

References

1. Vazquez-Lombardi R, Phan TG, Zimmermann C et al (2015) Challenges and opportunities for non-antibody scaffold drugs. Drug Discov Today 20:1271–1283. https://doi.org/10.1016/j.drudis.2015.09.004
2. Crook ZR, Nairn NW, Olson JM (2020) Miniproteins as a powerful modality in drug development. Trends Biochem Sci 45:332–346. https://doi.org/10.1016/j.tibs.2019.12.008
3. Gebauer M, Skerra A (2020) Engineered protein scaffolds as next-generation therapeutics. Annu Rev Pharmacol Toxicol 60:391–415. https://doi.org/10.1146/annurev-pharmtox-010818-021118
4. Chevalier A, Silva D-A, Rocklin GJ et al (2017) Massively parallel de novo protein design for targeted therapeutics. Nature 550:74–79. https://doi.org/10.1038/nature23912
5. Mignon D, Druart K, Michael E et al (2020) Physics-based computational protein design: an update. J Phys Chem A 124:10637–10648. https://doi.org/10.1021/acs.jpca.0c07605
6. Setiawan D, Brender J, Zhang Y (2018) Recent advances in automated protein design and its future challenges. Expert Opin Drug Discov 13:587–604. https://doi.org/10.1080/17460441.2018.1465922
7. Samish I (2017) Computational protein design. Humana Press
8. Kuhlman B, Bradley P (2019) Advances in protein structure prediction and design. Nat Rev Mol Cell Biol 20:681–697. https://doi.org/10.1038/s41580-019-0163-x
9. Pierce NA, Winfree E (2002) Protein Design is NP-hard. Protein Eng Des Sel 15:779–782. https://doi.org/10.1093/protein/15.10.779
10. Kuhlman B, Dantas G, Ireton GC et al (2003) Design of a novel globular protein fold with atomic-level accuracy. Science 302:1364–1368. https://doi.org/10.1126/science.1089427
11. Villa F, Panel N, Chen X, Simonson T (2018) Adaptive landscape flattening in amino acid sequence space for the computational design of protein:peptide binding. J Chem Phys 149:072302. https://doi.org/10.1063/1.5022249
12. Mignon D, Simonson T (2016) Comparing three stochastic search algorithms for computational protein design: Monte Carlo, replica exchange Monte Carlo, and a multistart, steepest-descent heuristic. J Comput Chem 37:1781–1793. https://doi.org/10.1002/jcc.24393
13. Kuhlman B, Baker D (2000) Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci U S A 97:10383–10388
14. Leaver-Fay A, Tyka M, Lewis SM et al (2011) Rosetta3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487:545–574. https://doi.org/10.1016/B978-0-12-381270-4.00019-6
15. Voigt CA, Gordon DB, Mayo SL (2000) Trading accuracy for speed: a quantitative comparison of search algorithms in protein sequence design. J Mol Biol 299:789–803. https://doi.org/10.1006/jmbi.2000.3758
16. Simoncini D, Allouche D, de Givry S et al (2015) Guaranteed discrete energy optimization on large protein design problems. J Chem Theory Comput 11:5980–5989. https://doi.org/10.1021/acs.jctc.5b00594
17. Hallen MA, Donald BR (2019) Protein design by provable algorithms. Commun ACM 62:76–84. https://doi.org/10.1145/3338124
18. Leach AR, Lemon AP (1998) Exploring the conformational space of protein side chains using dead-end elimination and the A* algorithm. Proteins 33:227–239. https://doi.org/10.1002/(SICI)1097-0134(19981101)33:2<227::AID-PROT7>3.0.CO;2-F
19. Traoré S, Allouche D, André I et al (2013) A new framework for computational protein design through cost function network optimization. Bioinformatics 29:2129–2136. https://doi.org/10.1093/bioinformatics/btt374
20. Traoré S, Allouche D, André I et al (2017) Deterministic search methods for computational protein design. Methods Mol Biol 1529:107–123. https://doi.org/10.1007/978-1-4939-6637-0_4
21. Allouche D, André I, Barbe S et al (2014) Computational protein design as an optimization problem. Artif Intell 212:59–79. https://doi.org/10.1016/j.artint.2014.03.005
22. Hurley B, O'Sullivan B, Allouche D et al (2016) Multi-language evaluation of exact solvers in graphical model discrete optimization. Constraints 21:413–434. https://doi.org/10.1007/s10601-016-9245-y
23. Druart K, Bigot J, Audit E, Simonson T (2016) A hybrid Monte Carlo scheme for multibackbone protein design. J Chem Theory Comput 12:6035–6048. https://doi.org/10.1021/acs.jctc.6b00421
24. Davey JA, Chica RA (2012) Multistate approaches in computational protein design. Protein Sci 21:1241–1252. https://doi.org/10.1002/pro.2128
25. Davey JA, Chica RA (2014) Improving the accuracy of protein stability predictions with multistate design using a variety of backbone ensembles. Proteins 82:771–784. https://doi.org/10.1002/prot.24457
26. Sauer MF, Sevy AM, Crowe JE Jr, Meiler J (2020) Multi-state design of flexible proteins predicts sequences optimal for conformational change. PLoS Comput Biol 16:e1007339. https://doi.org/10.1371/journal.pcbi.1007339
27. Davey JA, Damry AM, Euler CK et al (2015) Prediction of stable globular proteins using negative design with non-native backbone ensembles. Structure 23:2011–2021. https://doi.org/10.1016/j.str.2015.07.021
28. Davey JA, Chica RA (2017) Multistate computational protein design with backbone ensembles. Methods Mol Biol 1529:161–179. https://doi.org/10.1007/978-1-4939-6637-0_7
29. Karimi M, Shen Y (2018) iCFN: an efficient exact algorithm for multistate protein design. Bioinformatics 34:i811–i820. https://doi.org/10.1093/bioinformatics/bty564
30. Vucinic J, Simoncini D, Ruffini M et al (2020) Positive multistate protein design. Bioinformatics 36:122–130. https://doi.org/10.1093/bioinformatics/btz497
31. Ruffini M, Vucinic J, de Givry S et al (2019) Guaranteed diversity quality for the weighted CSP. In: 2019 IEEE 31st international conference on tools with artificial intelligence (ICTAI), pp 18–25
32. Viricel C, de Givry S, Schiex T, Barbe S (2018) Cost function network-based design of protein-protein interactions: predicting changes in binding affinity. Bioinformatics 34:2581–2589. https://doi.org/10.1093/bioinformatics/bty092
33. Noguchi H, Addy C, Simoncini D et al (2019) Computational design of symmetrical eight-bladed β-propeller proteins. IUCrJ 6:46–55. https://doi.org/10.1107/S205225251801480X
34. Hui DS, Azhar EI, Madani TA et al (2020) The continuing 2019-nCoV epidemic threat of novel coronaviruses to global health — the latest 2019 novel coronavirus outbreak in Wuhan, China. Int J Infect Dis 91:264–266. https://doi.org/10.1016/j.ijid.2020.01.009
35. Zhou P, Yang X-L, Wang X-G et al (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579:270–273. https://doi.org/10.1038/s41586-020-2012-7
36. Hoffmann M, Kleine-Weber H, Schroeder S et al (2020) SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell 181:271–280.e8. https://doi.org/10.1016/j.cell.2020.02.052
37. Shyr ZA, Gorshkov K, Chen CZ, Zheng W (2020) Drug discovery strategies for SARS-CoV-2. J Pharmacol Exp Ther 375:127–138. https://doi.org/10.1124/jpet.120.000123
38. Pomplun S (2021) Targeting the SARS-CoV-2-spike protein: from antibodies to miniproteins and peptides. RSC Med Chem 12(2):197–202. https://doi.org/10.1039/D0MD00385A
39. Linsky TW, Vergara R, Codina N et al (2020) De novo design of potent and resilient hACE2 decoys to neutralize SARS-CoV-2. Science 370:1208–1214. https://doi.org/10.1126/science.abe0075
40. Cao L, Goreshnik I, Coventry B et al (2020) De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 370:426–431. https://doi.org/10.1126/science.abd9909
41. Han Y, Král P (2020) Computational design of ACE2-based peptide inhibitors of SARS-CoV-2. ACS Nano 14:5143–5147. https://doi.org/10.1021/acsnano.0c02857
42. Wang Q, Zhang Y, Wu L et al (2020) Structural and functional basis of SARS-CoV-2 entry by using human ACE2. Cell 181:894–904.e9. https://doi.org/10.1016/j.cell.2020.03.045
43. Alford RF, Leaver-Fay A, Jeliazkov JR et al (2017) The Rosetta all-atom energy function for macromolecular modeling and design. J Chem Theory Comput 13:3031–3048. https://doi.org/10.1021/acs.jctc.7b00125
44. Case DA, Ben-Shalom IY, Brozell SR et al (2018) AMBER. University of California, San Francisco
45. Smith CA, Kortemme T (2008) Backrub-like backbone simulation recapitulates natural protein conformational variability and improves mutant side-chain prediction. J Mol Biol 380:742–756. https://doi.org/10.1016/j.jmb.2008.05.023
46. Wood CW, Woolfson DN (2018) CCBuilder 2.0: powerful and accessible coiled-coil modeling. Protein Sci 27:103–111. https://doi.org/10.1002/pro.3279
47. Maier JA, Martinez C, Kasavajhala K et al (2015) ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J Chem Theory Comput 11:3696–3713. https://doi.org/10.1021/acs.jctc.5b00255
48. Conway P, Tyka MD, DiMaio F et al (2014) Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci 23:47–55. https://doi.org/10.1002/pro.2389
49. Roe DR, Cheatham TE (2013) PTRAJ and CPPTRAJ: software for processing and analysis of molecular dynamics trajectory data. J Chem Theory Comput 9:3084–3095. https://doi.org/10.1021/ct400341p
50. Gray JJ, Moughon S, Wang C et al (2003) Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 331:281–299. https://doi.org/10.1016/s0022-2836(03)00670-3
51. Berman HM, Westbrook J, Feng Z et al (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
52. Davis IW, Arendall WB, Richardson DC, Richardson JS (2006) The backrub motion: how protein backbone shrugs when a sidechain dances. Structure 14:265–274. https://doi.org/10.1016/j.str.2005.10.007

Chapter 18

Computational Design of Peptides with Improved Recognition of the Focal Adhesion Kinase FAT Domain

Eleni Michael, Savvas Polydorides, and Georgios Archontis

Abstract

We describe a two-stage computational protein design (CPD) methodology for the design of peptides binding to the FAT domain of the protein focal adhesion kinase. The first stage involves high-throughput CPD calculations with the Proteus software. The energies of the folded state are described by a physics-based energy function, and those of the unfolded peptides by a knowledge-based model that reproduces amino acid compositions consistent with a helicity scale. The obtained sequences are filtered in terms of the affinity and the stability of the complex. In the second stage, designed sequences are further evaluated by all-atom molecular dynamics simulations and binding free energy calculations with a molecular mechanics/implicit solvent free energy function.
Key words: Computational peptide design, Computational protein design, Proteus program, Molecular mechanics, Monte Carlo

1 Introduction

The protein focal adhesion kinase (FAK) is required for the efficient assembly and disassembly of focal adhesions and controls many biological processes including cell adhesion, growth and survival, embryonic development, and wound healing [1–3]. Due to its increased expression in many cancers [3–5], it constitutes a promising cancer therapy target [6–8]. Numerous efforts to inhibit FAK have targeted its interactions with other proteins [3, 9–13]. An important target is the complex between FAK and the protein Paxillin, whose formation is a necessary and sufficient condition for FAK localization at focal adhesions.

In the present chapter we employ high-throughput computational protein design (CPD) calculations and all-atom simulation/free energy analysis calculations to optimize peptides that recognize the kinase focal adhesion targeting (FAT) domain. The obtained sequences could serve as a starting point for the design of peptidomimetic inhibitors of the FAK:Paxillin complex. Such inhibitors are promising against protein-protein interactions, which typically involve areas with sizes of 1000–4000 Å², much larger than the contact areas of small drug-like molecules (a few hundred Å² [14, 15]).

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_18, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

High-throughput CPD calculations assess a large number (10^9) of sequences and conformations in a few hours, using a desktop computer. Key ingredients for the efficiency and accuracy of such calculations are the energy function and the sequence/conformation sampling method.
We describe the structure of the folded states (FAT complexes and unbound peptides) by a fixed backbone/discrete rotamer approximation and their interactions by a physics-based energy function that includes molecular mechanics terms and describes solvent effects by a Generalized Born term, a Lazaridis-Karplus term and a dispersion-interaction term (MM/GBDILK approximation) [16, 17]. We describe the unfolded state of the unbound peptide implicitly by a knowledge-based model, adjusted to yield peptide sequences consistent with a helicity scale [18]. We sample the sequence/conformation space by a Monte Carlo (MC) procedure that produces sequences weighted according to their relative stabilities. We filter the designed sequences with respect to the stabilities and affinities of the FAK complex and evaluate selected sequences via all-atom simulations and binding free energy calculations with the same MM/GBDILK free energy function [16, 17].

The CPD calculations are conducted with the Proteus 3.0 CPD software [19, 20], which is freely available to academic and government scientists at https://proteus.polytechnique.fr. The distribution contains source code, binaries for Intel processors, detailed documentation, and tutorials. Scientists in industry can consult the above address for instructions on obtaining the software. The all-atom simulations are conducted with the NAMD program [21].

Description of the System

The focal adhesion kinase FAT domain folds into a four-helix bundle [22–24]. As shown in Fig. 1, two sites, located respectively at the helix 1/helix 4 interface (site 14) and at the helix 2/helix 3 interface (site 23), interact with conserved segments on Paxillin with the consensus sequence LDXLLXXL, known as LD motifs (reviewed in [25]). Selected LD motifs of the Paxillin-family proteins are listed in Table 1.
In the FAK:Paxillin complex, motifs LD2 and LD4 interact, respectively, with sites 14 and 23, but short peptides with the Paxillin LD2 or LD4 sequences can bind at both sites [26–29]. The affinities of LD-motif containing peptides for LD-binding domains are on the order of μM [17].

Fig. 1 Structure of the focal adhesion kinase FAT complex with two LD-motif peptides, bound at the interfaces of helices 1 and 4 (site 14) and helices 2 and 3 (site 23). The signature residues of the LDXLLXXL motif (positions 0 to +7) are indicated

Table 1 Sequence alignment of natural LD motifs of the Paxillin-family of proteins Paxillin, Leupaxin and HIC-5

Residue position   −3  −2  −1   0  +1  +2  +3  +4  +5  +6  +7  +8
Paxillin LD1        M   D   D   L   D   A   L   L   A   D   L   E
Paxillin LD2        L   S   E   L   D   R   L   L   L   E   L   N
Paxillin LD3        R   P   S   V   E   S   L   L   D   E   L   E
Paxillin LD4        T   R   E   L   D   E   L   M   A   S   L   S
Paxillin LD5        G   S   Q   L   D   S   M   L   G   S   L   Q
Leupaxin LD1        M   E   E   L   D   A   L   L   E   E   L   E
Leupaxin LD4        A   A   Q   L   D   E   L   M   A   H   L   T
Leupaxin LD5        K   A   S   L   D   S   M   L   G   G   L   E
Hic5 LD1            M   E   D   L   D   A   L   L   S   D   L   E
Hic5 LD2            L   C   E   L   D   R   L   L   Q   E   L   N
Hic5 LD4            T   L   E   L   D   R   L   M   A   S   L   S
Hic5 LD5            K   G   S   L   D   T   M   L   G   L   L   Q
Consensus motif                 L   D   X   L   L   X   X   L

In what follows, we first describe briefly the employed CPD methodology. We present the energy function and structural model for the folded state and the knowledge-based model for the unfolded peptide. We then outline a step-by-step procedure to identify sequences that optimize the peptide-protein affinities or the complex stabilities. A final step involves testing selected sequences by all-atom MD simulations in explicit solvent and binding free energy calculations with the MM/GBDILK model.

2 Methodology

The design methodology is summarized in the flowchart of Fig. 2. Below, we describe the various steps in more detail.

1. Determine the target of sequence and conformational optimization
2. Describe the folded state (complex and unbound peptide): energy function and structural model
3. Compute the Interaction Energy Matrix (IEM)
4. Determine the knowledge-based model for the unfolded peptide. Stability design of the unbound peptide and complex
5. Filter the designed sequences (binding affinity, stability, chemical similarity)
6. Study selected complexes by all-atom MD and binding free energy analysis

Fig. 2 Flowchart of the various steps used in the peptide design

2.1 Target of Optimization

In a standard CPD methodology [19, 20], the molecule is partitioned into three groups: (1) “fixed” residues (usually the backbone and all glycines and prolines), kept in the same conformation; (2) “inactive” residues that explore a small set of conformations, usually from a discrete rotamer database [30]; (3) “active” residues that change both chemical type and conformation and correspond to the design target. The structural relaxation of the fixed backbone and the discrete sidechain rotamers is treated implicitly via the choice of the protein dielectric constant [31].

2.2 Description of the Folded State

2.2.1 Energy Function

The energy of a particular sequence (S)/rotamer {R_i} combination of the folded complex and the folded, unbound peptide is described by a physics-based free energy function with molecular mechanics (MM) and solvation free energy terms:

$$E_X = \underbrace{E_X^{\mathrm{bonded}} + E_X^{\mathrm{vdw}} + E_X^{\mathrm{Coulomb}}}_{E_{\mathrm{MM}}} + \underbrace{E_X^{\mathrm{GB}} + E_X^{\mathrm{DI}} + E_X^{\mathrm{LK}}}_{E_{\mathrm{solv}}} \tag{1}$$

Here, X denotes the complex (C) or the unbound peptide (P). The MM energy terms correspond to bonded, van der Waals (vdw) and Coulombic interactions and are modeled by the Amber ff99SB force field [32]. The generalized Born (GB) term $E_X^{\mathrm{GB}}$ models the interaction of the solute charges with the solvent polarization and corresponds to the Hawkins-Cramer-Truhlar approximation [33–35]. The term $E_X^{\mathrm{DI}}$ models solute-water attractive dispersion interactions [16, 36, 37].
The Lazaridis-Karplus term $E_X^{\mathrm{LK}}$ models the tendency of various groups to be exposed to solvent [38]. The combination of GB with the DI and LK terms (MM/GBDILK model) was parameterized in [16]. Our recent MD simulations and free energy analysis have shown that this model reproduces well the relative affinities of FAK complexes with LD-motif containing peptides [17]. The protein and solvent dielectric constants are set to 6.8 and 80 [16, 17].

Basic ingredients of the GB term are the atomic solvation radii, which approximate the atomic distances from the solvent interface. These radii depend on the entire protein geometry, rendering the GB term a many-body function. The pretabulation of energies in an interaction energy matrix (IEM) prior to the design (see below) requires a pairwise-additive energy function. For the GB term, we achieve this by the “Fluctuating Dielectric Boundary Method,” described in Note 1 and refs. 39, 40.

2.2.2 Structural Model

In “fixed backbone” methods, such as the one used here, an optimization is performed for one or a few specific backbone conformations, usually taken from an experimental structure (e.g., an x-ray structure) or an MD trajectory. The backbone choice determines the atomic coordinates and the interactions of the entire system and may affect the design. Multi-backbone methods have also been developed that iterate between sequence and backbone conformation optimizations [41], sample sequence/backbone conformations simultaneously [42, 43] or generate backbone motions via MC [42] or MD [17, 44].

The design system employed here consists of FAK residues 916–1049 (in Homo sapiens numbering) and two 12-residue peptides bound at the LD-motif recognition sites 14 (interface of FAK helices 1 and 4) and 23 (interface of FAK helices 2 and 3) (see Fig. 1).
Our previous all-atom MD simulations of FAK complexes with peptides containing the paxillin LD2 or LD4 motifs showed that site 14 has a stronger affinity for LD2 and site 23 has a similar affinity for either motif [17]. The structural model employed here is taken from a simulation of the complex with two LD2-motif peptides (see Note 2).

2.3 Evaluation of the Interaction Energy Matrix (IEM)

Prior to the construction of the IEM, a set of positions are designated as active or inactive. At this stage, an extended range of chemical types and rotamer conformations can be defined (e.g., the natural aminoacids and the conformations of a rotamer database [30], augmented by conformations seen in an experimental structure). During design, the range of active and inactive residues and the list of chemical types/rotamer conformations sampled can be restricted to a subset of the choices included in the IEM construction. Extended interaction sites at protein-protein interfaces can be handled by performing sequential design simulations targeting selected subgroups of positions.

To perform the IEM computation, each active residue is replaced by a “giant residue” that includes all possible sidechain types attached to its backbone Cα atom. For all available chemical types and conformations at each active or inactive position I, we compute the interaction energies of I with itself and the fixed portion of the system (diagonal IEM elements), and with every other active or inactive residue J (off-diagonal elements). The calculation is performed separately for each residue position and is trivially parallelizable. The construction of the giant molecule and the computation of the IEM elements are performed by suitable input files and shell scripts. For more details we refer to the Proteus 3.0 manual [45]. To alleviate steric clashes due to the fixed backbone/discrete rotamer approximation, we conduct short energy minimizations during the IEM construction (see Note 3).
With this protocol we avoid performing on-the-fly minimizations during MC design and sample states from a Boltzmann ensemble [19, 46].

2.4 Stability Design

The physical meaning of the MC in sequence space and its statistical ensemble is explained in [19]. Briefly, we consider a solution that contains equal concentrations of all possible sequences of a molecule X, distributed into folded (F) and unfolded (U) states according to their relative stabilities. A sequence change S1 → S2 corresponds to a process where one folded molecule S1 unfolds and one unfolded molecule S2 folds [19]. The corresponding energy change in the Metropolis MC criterion is

$$\Delta E_X^{S_1 \to S_2} = \left[ E_X^F(S_2) - E_X^F(S_1) \right] - \left[ E_X^U(S_2) - E_X^U(S_1) \right] \tag{2}$$

Application of Eq. 2 requires defining the folded and unfolded states. We discuss separately the stability calculations of the unbound peptide and the complex.

2.4.1 The Unbound Peptide

We consider the helical conformation as the “folded” state of the unbound peptide, since structural and secondary-structure prediction studies show that LD-motif peptides can form α-helices also in solution. We model the unfolded unbound peptide P as a collection of independent aminoacids [19]. In this model, an aminoacid with chemical type t is associated with a characteristic “reference energy” $E_t^{\mathrm{ref}}$. The total energy of a particular, unfolded peptide sequence is the sum of the reference energies of its constituent aminoacids:

$$E_P^U(S) = \sum_{t \in \mathrm{aa}} n_t(S)\, E_t^{\mathrm{ref}} \tag{3}$$

In Eq. 3, $n_t(S)$ is the number of aminoacids of type t in sequence S and the sum is over all aminoacid types.

We compute the aminoacid-dependent reference energies $E_t^{\mathrm{ref}}$ via a maximum-likelihood formalism employed previously in whole-protein redesign [19, 47]. In our case, we adjust the reference energies to ensure that the designed sequences are consistent with the AGADIR helix-propensity scale of Muñoz and Serrano [18].
To achieve this, we compute the average aminoacid frequencies $\langle f_t \rangle$ during a stability design of the unbound peptide, and compare them with the aminoacid frequencies in the helicity scale, $f_t^{\mathrm{hel}}$. We adjust the values $E_t^{\mathrm{ref}}$ via the following linear update rule [47]:

$$E_t^{\mathrm{ref}}(\nu + 1) = E_t^{\mathrm{ref}}(\nu) + \delta E \left[ f_t^{\mathrm{hel}} - \langle f_t \rangle(\nu) \right] \tag{4}$$

where $E_t^{\mathrm{ref}}(\nu)$ and $\langle f_t \rangle(\nu)$ are the reference energies and the running frequencies at MC iteration ν; the quantity δE is an empirical constant with dimensions of energy, set to 0.5 kcal/mol in the present work. The reference energies are updated until the design frequencies converge to the helicity frequencies, $\langle f_t \rangle \approx f_t^{\mathrm{hel}}$, ∀ t ∈ aa. During design, we treat all peptide sidechains as active. The design protocol of the unbound peptide is detailed in Note 4.

We improve sampling via a Replica-Exchange MC (REMC) method, where multiple copies of the same system (replicas) are simulated at a range of temperatures and are allowed to exchange temperatures at periodic intervals. The procedure is performed by the C module protMC.

2.4.2 The Complex

In this case, the folded state is the structure of the complex. The unfolded state consists of the unbound, unfolded peptide P and the unbound, unfolded protein. The energy change in the MC Metropolis criterion due to a peptide sequence mutation S1 → S2 in the complex C is

$$\Delta E_C^{S_1 \to S_2} = \left[ E_C^F(S_2) - E_C^F(S_1) \right] - \left[ E_P^U(S_2) - E_P^U(S_1) \right] \tag{5}$$

The energies $E_C^F(S_1)$ and $E_C^F(S_2)$ of the protein complexes with peptide sequences S1 and S2 are evaluated via Eq. 1; the energies $E_P^U(S_1)$ and $E_P^U(S_2)$ of the unbound, unfolded peptides are evaluated via Eq. 3. The energy of the unbound, unfolded protein is also approximated as a sum of reference energies, dependent on the aminoacid type. Since the protein composition is the same in the two complexes, the unfolded protein energy does not contribute in Eq. 5.
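Equations 3–5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Proteus implementation (in Proteus these steps are carried out by protMC): the function names, the toy reference energies, and the folded-state energies below are hypothetical, and a real calculation would obtain the folded-state energies from the IEM.

```python
import math
import random
from collections import Counter


def unfolded_energy(sequence, e_ref):
    """Eq. 3: sum of reference energies over the aminoacids of `sequence`."""
    counts = Counter(sequence)
    return sum(n * e_ref[t] for t, n in counts.items())


def delta_e_complex(e_fold_s1, e_fold_s2, seq1, seq2, e_ref):
    """Eq. 5: folded-complex energy change minus unfolded-peptide energy change."""
    d_folded = e_fold_s2 - e_fold_s1
    d_unfolded = unfolded_energy(seq2, e_ref) - unfolded_energy(seq1, e_ref)
    return d_folded - d_unfolded


def metropolis_accept(delta_e, kT=0.592, rng=random):
    """Metropolis test: downhill moves always pass, uphill with exp(-dE/kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def update_reference_energies(e_ref, f_hel, f_design, delta=0.5):
    """Eq. 4: linear update toward the target (helicity-scale) frequencies."""
    return {t: e_ref[t] + delta * (f_hel[t] - f_design.get(t, 0.0)) for t in e_ref}


# Hypothetical toy values (kcal/mol), for illustration only.
e_ref = {"L": -1.0, "D": 0.5, "E": 0.2}
d_e = delta_e_complex(-10.0, -10.5, "LDL", "LEL", e_ref)
print(f"{d_e:.2f}")            # -0.20
print(metropolis_accept(d_e))  # True (downhill move)

new_ref = update_reference_energies(
    e_ref,
    f_hel={"L": 0.50, "D": 0.25, "E": 0.25},
    f_design={"L": 0.70, "D": 0.10, "E": 0.20},
)
print({t: round(v, 3) for t, v in new_ref.items()})
```

In the update step, a type that is under-represented in the design relative to the target receives a higher reference energy, which lowers the folding energy of sequences containing it and thus raises its frequency in the next cycle.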
Peptide sidechains are treated as active and protein sidechains within 8 Å of any peptide atom as “inactive,” excluding prolines and glycines. Other sidechain and backbone atoms are kept fixed. The design protocol is detailed in Note 5.

2.5 Filtering of Sequences

The stability design calculations produce a large number of peptide sequences in the environment of the complex and the unbound peptide. To identify a set of promising sequences for further analysis, we apply various filtering criteria.

2.5.1 Binding Affinity

A main selection criterion is the binding affinity. Using the populations of a sequence S in the simulations of the complex ($p_C(S)$) and the unbound peptide ($p_P(S)$), we first compute the stabilities of the various sequences (S) relative to a reference sequence R in the complex and the unbound peptide:

$$\Delta\Delta G_C^{\mathrm{stab}}(S) = -k_B T \ln \frac{p_C(S)}{p_C(R)}, \qquad \Delta\Delta G_P^{\mathrm{stab}}(S) = -k_B T \ln \frac{p_P(S)}{p_P(R)} \tag{6}$$

The peptide sequence relative stabilities are generally different in the complex and unbound state, due to the protein-peptide interactions. Subtracting the above equations, we obtain an estimate of the relative binding free energy of sequence S in terms of the sequence populations:

$$\Delta\Delta G_{\mathrm{bind}}(S) = -k_B T \ln \frac{p_C(S)}{p_C(R)} + k_B T \ln \frac{p_P(S)}{p_P(R)} \tag{7}$$

Extraction of the populations and calculation of the free energies are performed with a Python script.

The above procedure can identify sequences with good stabilities and binding affinities. For a subset of the most promising sequences, we perform additional rotamer optimization simulations of the unbound helical peptide and the complex, and compute the corresponding average solution energies $\langle E_P^H(S) \rangle_{\mathrm{cf}}$ and $\langle E_C(S) \rangle_{\mathrm{cf}}$, where $\langle \cdot \rangle_{\mathrm{cf}}$ denotes the conformational average.
We evaluate the binding affinities relative to a reference sequence via the expression:

$$\Delta\Delta G_{\mathrm{bind}}(S) = \left[ \langle E_C(S) \rangle_{\mathrm{cf}} - \langle E_P^H(S) \rangle_{\mathrm{cf}} \right] - \left[ \langle E_C(R) \rangle_{\mathrm{cf}} - \langle E_P^H(R) \rangle_{\mathrm{cf}} \right] \tag{8}$$

2.5.2 Additional Criteria

We filter further the obtained sequences, using as criteria the stability of the complex and the unbound peptide and the chemical similarity of the obtained sequences. More details are presented in the results.

2.6 All-Atom MD Simulations of Selected Complexes

Selected complexes predicted with good stabilities or binding affinities are subjected to all-atom MD simulations in explicit solvent. Details of the simulation protocol are in Note 6. For each complex, we extract coordinate sets at 10-ps intervals and compute the solution free energies of the complex and the unbound molecules via Eq. 1. We adopt the “single-trajectory” approximation, in which the protein and peptide have identical conformations in the complex and unbound states. In this approximation, bonded and intramolecular vdw and Coulomb energies are identical in the complex and the unbound molecules and do not contribute to the binding free energy. The resulting binding affinities are

$$\Delta G_{\mathrm{bind}} = \left( \Delta E_C^{\mathrm{vdw}} + \Delta E^{\mathrm{DI}} + \Delta E^{\mathrm{LK}} \right) + \left( \Delta E_C^{\mathrm{Coul}} + \Delta E^{\mathrm{GB}} \right) \equiv \Delta G_{\mathrm{bind}}^{\mathrm{np}} + \Delta G_{\mathrm{bind}}^{\mathrm{p}} \tag{9}$$

The last two terms correspond to nonpolar (np) and polar (p) contributions to the binding free energies. Structural relaxation accompanying complex formation is ignored in the above approximation. Contributions to binding free energy due to this relaxation could be estimated by conducting independent simulations of the complex and unbound molecules. These contributions could be associated with large uncertainties and may not improve the results [48].

3 Results

3.1 Optimization of the Unfolded Peptide Reference Energies

We optimized the reference energies $E_t^{\mathrm{ref}}$ of the unfolded peptide via stability design calculations of a 12-residue peptide, terminated by acetylated (ACE) and N-methylated (CT3) blocking groups. The logos of Fig. 3a show the design choices at each position.
The size of each letter reflects the corresponding aminoacid frequency. The various frequencies are fairly uniform (4–11%), in accord with the Muñoz and Serrano (MS) scale [18]. Thus, the computed reference energies almost flatten the sequence landscape of the unbound peptide.

Fig. 3 Logo representation of the design sequences for (a) the unbound peptide; (b) the peptide bound at site 14 (top panel) or site 23 (bottom panel). The consensus LD motif LDXLLXXL is in the residue index range 0 to +7. Amino acid types are shown in one-letter representation. The letter sizes are proportional to the aminoacid probabilities. The letter color is associated with the aminoacid physical properties

3.2 Preliminary Unrestricted Stability Design of the Complex

Using these reference energies, we performed a preliminary stability design test in the FAT complex with two 12-residue peptides bound at sites 14 and 23. All 24 peptide sidechains were treated as active. With an average of 10 rotamers per aminoacid type, the resulting sequence/conformation space has a size of (18 × 10)^24 × 10^124 ≈ 10^178 states, rendering the accurate computation of sequence populations intractable. Nevertheless, the design is an instructive test of the energy function and the computational protocol. The simulations employed 8 replicas in the range 88 K–1510 K.

The logos of Fig. 3b show the sequences of the room-temperature replica. The upper and lower plots correspond, respectively, to peptides bound at sites 14 and 23. Comparison with Fig. 3a shows that the intermolecular interactions with the FAT domain introduce strong sequence preferences at various positions. Despite the enormous size of the sequence space, the design calculation correctly predicts hydrophobic residues at positions −3, 0, +3, +4, and +7 at both binding sites, in accord with the consensus LD motif LDXLLXXL (positions 0 to +7).
Leucine appears most frequently in nine out of these ten peptide positions, and solely at positions 0, +4 at site 14 and +3, +7 at site 23. These results suggest that hydrophobic residues at these positions make key contacts with nonpolar protein residues that stabilize the complex. The design also places the signature aspartic acid (D) of the LD motif as the most probable choice at position +1, for the peptide bound at site 14; a glutamic acid (E) is also predicted as the fifth most probable choice. At site 23, a larger range of aminoacid types are accepted at position +1. At the remaining positions −2, −1, +2, +5, +6, +8, a large variety of chemical types is inserted. Overall, the test design reproduces key features of the LD-motif containing sequences, despite the enormous space of possible sequence combinations.

3.3 Focused Stability Design

To improve sampling, we restricted the list sampled by active residues to chemical types observed at the same positions in Paxillin-family proteins; the choices are summarized in Table 2. Even with these restrictions, the total number of possible sequences is very large (10^7). To facilitate design, we considered separately the segments (−3) to (+2) (segment A) and (+2) to (+8) (segment B). With the chemical types of Table 2 the two segments form 25,600 and 7680 sequences, respectively. We designed each segment independently, maintaining the other segment in the paxillin LD2 sequence. At the end, we created 12-residue sequences by combining designed segments A and B with the same residue type at the common position +2.

Table 2 Aminoacid types sampled during design

Residue position   Aminoacid type                    Total number of types
−3                 Leu, Thr                          2
−2                 Polar or charged                  10
−1                 Polar or negatively charged       8
0                  Leu, Val                          2
+1                 Polar or negatively charged       8
+2                 Polar or charged                  10
+3                 Leu                               1
+4                 Leu, Met                          2
+5                 All except positively charged     16
+6                 Polar or negatively charged       8
+7                 Leu                               1
+8                 Ser, Asn, Glu                     3
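The sequence counts quoted above follow directly from the per-position type counts of Table 2. The short Python sketch below reproduces that arithmetic and the joining of two segments at the shared position +2; the function and variable names are hypothetical and are used only for illustration.

```python
from math import prod

# Number of allowed chemical types per residue position, copied from Table 2.
N_TYPES = {-3: 2, -2: 10, -1: 8, 0: 2, 1: 8, 2: 10,
           3: 1, 4: 2, 5: 16, 6: 8, 7: 1, 8: 3}


def n_sequences(first, last):
    """Number of sequences for the positions first..last (inclusive)."""
    return prod(N_TYPES[p] for p in range(first, last + 1))


def join_segments(seg_a, seg_b):
    """Join segment A (-3..+2) and segment B (+2..+8).

    The two segments overlap at position +2; they are joined only if they
    carry the same residue type there. Returns None otherwise.
    """
    if seg_a[-1] != seg_b[0]:
        return None
    return seg_a + seg_b[1:]


print(n_sequences(-3, 2))   # 25600 (segment A)
print(n_sequences(2, 8))    # 7680 (segment B)
print(n_sequences(-3, 8))   # 19660800, i.e. of order 10^7
print(join_segments("LRTLDE", "ELLATLE"))  # LRTLDELLATLE
```

Designing the two segments separately replaces one search over about 2 × 10^7 sequences by two searches over 25,600 and 7680 sequences, at the cost of neglecting coupling between the segments during sampling.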
Choice of Designed Sequences Based on Affinity and Stability

The stability calculations considered the complex with peptides bound at either of the two sites 14 and 23. Details of the design are in Note 5. For segment A, we obtained 19,660 distinct sequences at site 14 and 22,629 sequences at site 23. For segment B, we obtained 4782 sequences at site 14 and 7461 sequences at site 23. In the case of the unbound peptide, we obtained 24,040 sequences for segment A and 6380 sequences for segment B. In order to choose a set of sequences with good affinity and stability, we applied a set of filtering criteria.

• The binding free energies are estimated from the sequence probabilities via Eq. 7. Low-probability sequences should be discarded, as their energies have large uncertainties. In our case, we conducted MC runs of 8 × 10^8 steps. If all 26,000 segment-A sequences were equally favored, each would appear about 30,000 times in the design solutions. We discarded sequences that appear less frequently by a factor of 10 (fewer than 3000 times).

• We selected sequences with better binding free energies than the corresponding paxillin LD2 segment and better stabilities than the corresponding native complex. We augmented this list with sequences having a small positive stability relative to LD2 and containing a D or E residue at position −1 (for segment A).

• To create 12-residue sequences, we combined segments A and B containing the same type at the common position +2. For example, segments LRTLDE (positions −3 to +2) and ELLATLE (positions +2 to +8) were joined to form LRTLDELLATLE (positions −3 to +8). This procedure produced 9762/15,271 sequences at sites 14/23.

• We subjected the resulting complexes and unbound peptides to rotamer optimization via MC simulations of 10^6 steps at room temperature and computed the binding affinities of the various complexes relative to the native LD2 complexes via Eq. 8.

Profiles of sequences with better binding affinities than the native complex are shown in Fig. 4.
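The first filtering criterion and the population-based estimate of Eq. 7 can be sketched as follows. This is a hedged illustration, not the script used by the authors: the function names and the toy counts are hypothetical, and the thermal energy of 0.592 kcal/mol is the room-temperature value quoted in Note 4. Within one simulation, the population ratio of two sequences equals the ratio of their visit counts, so raw counts can be used directly.

```python
import math

KT_ROOM = 0.592  # kcal/mol, room-temperature thermal energy (Note 4)


def count_threshold(n_steps, n_sequences, factor=10.0):
    """Minimum visit count: the uniform-sampling expectation divided by `factor`."""
    return n_steps / n_sequences / factor


def relative_binding_free_energy(n_c_s, n_c_r, n_p_s, n_p_r, kT=KT_ROOM):
    """Eq. 7 from visit counts.

    n_c_s, n_c_r: counts of sequence S and reference R in the complex run.
    n_p_s, n_p_r: counts of S and R in the unbound-peptide run.
    """
    return -kT * math.log(n_c_s / n_c_r) + kT * math.log(n_p_s / n_p_r)


# Threshold for segment A: 8 x 10^8 MC steps over about 26,000 sequences.
print(round(count_threshold(8e8, 26000)))  # 3077, i.e. about 3000

# Hypothetical counts: S is visited twice as often as R in the complex
# and equally often in the unbound peptide.
ddg = relative_binding_free_energy(60000, 30000, 20000, 20000)
print(f"{ddg:.3f}")  # -0.410 (kcal/mol)
```

A sequence visited only a handful of times gives a noisy logarithm in Eq. 7, which is why the count threshold is applied before the free energies are compared.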
At site 14, positions 0 and +4 are occupied almost exclusively by leucine. At site 23, a valine is preferred over leucine at position 0; methionine is almost as probable as leucine at position +4. A valine residue is encountered at position 0 in paxillin motif LD3 and a methionine at position +4 in paxillin, leupaxin, and Hic-5 motif LD4 (Table 1). Position +1 prefers negatively charged sidechains, at both sites, in accordance with the features of the “LD” motif.

Fig. 4 Sequence logos obtained from refined design of the peptide in site 14 (above) and site 23 (below). The profiles are Boltzmann-weighted by the sequence binding free energies

A leucine is favored over threonine at position −3, as in the paxillin LD2 motif. This is in accord with a recent MD study [17] of paxillin LD2:FAK and LD4:FAK complexes, where L−3 formed improved nonpolar interactions relative to T−3 at both binding sites. Paxillin motifs LD2 and LD4 contain a negative (E) residue at position −1. In the simulations of ref. 17, the E sidechain made several unfavorable long-range polar interactions with the protein. The design substitutes E by polar residues at this position. Polar residues are also encountered in Paxillin LD5, Leupaxin LD4, and Hic-5 LD5 motifs (Table 1).

In our earlier MD simulations, LD2 motif residue E+6 was not engaged in stable hydrogen bonds at site 14 and formed a persistent hydrogen bond with K1002 at site 23 [17]. In agreement with this, the design predicts mainly polar aminoacids for position +6 at site 14 and mainly negative aminoacids at site 23. The C-terminal position +8 is occupied by N and S residues in Paxillin motifs LD2 and LD4; Leupaxin LD4 motif contains a T residue, and several other LD motifs contain a negatively charged E residue. The design favors an E residue at position +8, for site 14. The E sidechain might form favorable hydrogen-bonding interactions with a nearby histidine (H1025).
3.4 Clustering Based on Sequence Similarity

The design procedure identified several thousand sequences with potentially improved binding affinity for FAK with respect to the native LD2 motif. A next step is to choose a manageable number (e.g., around ten) of sequences for further theoretical analysis and experimental testing. For this purpose, we partitioned the solutions into clusters of sequences with similar physicochemical properties (see Note 7). We identified 655 sequence clusters at site 14 and 707 clusters at site 23. Figure 5 shows the top affinity sequences for some of the clusters, ranked by their binding affinity relative to the LD2 sequence.

Fig. 5 Selected top affinity sequences at sites 14 and 23 for some of the clusters identified by sequence similarity analysis. Affinities are relative to the LD2 motif

Site 14          ΔΔG [kcal/mol]     Site 23          ΔΔG [kcal/mol]
LKTLDELLAYLE     −2.37              LRQVEKLLEELE     −5.40
LKTLDELLLSLE     −2.32              LQQVEKLLEELE     −5.39
LKYLDELLAHLE     −2.25              LTYVQKLMEELE     −5.04
LKYLDELLLYLE     −2.18              LKEVQKLMEELE     −5.01
LRTLDDLLLYLE     −2.15              LRQVEKLLFDLE     −4.98
LRTLDDLLAHLE     −2.14              LQQVEKLLYDLE     −4.91
..                                  ..
LKTLDRLLAELE     −1.49              LRYLEQLLAELE     −3.15
LYTLDKLLLELE     −1.48              LRYLERLLDELE     −3.06
LSTLDRLLMELE     −1.48              LTYLERLLDELE     −2.92
..                                  ..

3.5 All-Atom MD Simulations of Selected Complexes

The efficiency of the CPD calculations is based on approximations such as the selection of one or a few structural models for the IEM, the fixed backbone and the discretization of the rotamer conformations. Checking the design solutions by explicit-solvent MD simulations is an important additional step that can validate designs or eliminate false positives. Below we present some examples of both cases.

Prior to the simulations, we optimized the rotamer conformations with the protX program [45]. The solvation and MD simulation setup was prepared with the CHARMM-GUI interface [49] and the simulations were conducted with the NAMD program [21]. The binding affinities, shown in Table 3, were computed via Eq. 9. The decomposition of peptide-protein interaction energies, relative to the LD2 complex, of Fig. 6 provides insights on contributions from specific residues.

Table 3 Binding affinities of selected designed sequences for the focal adhesion kinase FAT domain, evaluated by all-atom MD simulations of the corresponding complexes and post-processing analysis in the MM/GBDILK approximation (Eq. 9)

Binding affinity terms (kcal/mol)
Peptide sequence     Binding site   ΔG_bind        ΔE_C^Coul      ΔE_GB         ΔE_C^vdW       ΔE_LK         ΔE_DI        ΔG_bind^p     ΔG_bind^np
LSELDRLLLELN (LD2)   14             −25.1 (1.6)    4.3 (0.6)      −1.0 (0.7)    −46.2 (2.2)    11.9 (0.2)    5.9 (0.2)    3.3 (0.5)     −28.4 (1.9)
LSELDRLLLELN (LD2)   23             −25.7 (1.0)    −17.9 (1.1)    17.7 (0.7)    −46.7 (1.4)    14.7 (0.5)    6.6 (0.3)    −0.3 (1.1)    −25.4 (0.9)
LKTLDELLAYLE         14             −28.5 (0.3)    −1.0 (2.7)     4.2 (2.6)     −50.7 (0.6)    12.6 (0.4)    6.3 (0.1)    3.2 (0.1)     −31.7 (0.3)
LRQVEKLLEELE         23             −28.8 (2.3)    −21.3 (6.0)    21.1 (4.9)    −49.7 (3.8)    14.3 (0.3)    6.7 (0.1)    −0.1 (7.8)    −28.7 (3.8)
LKTLDRLLAELE         14             −24.9 (1.0)    −2.8 (0.3)     5.4 (0.2)     −45.2 (1.2)    11.6 (0.3)    6.0 (0.0)    2.6 (0.5)     −27.5 (1.5)
LRYLEQLLAELE         23             −21.5 (5.3)    −26.9 (1.6)    25.9 (2.1)    −39.5 (7.3)    13.4 (0.6)    5.6 (0.7)    −1.0 (0.6)    −20.5 (5.9)

Fig. 6 Decomposition of binding free energies for sequences LKTLDELLAYLE at site 14 (left) and LRQVEKLLEELE at site 23 (right). The values are computed relative to the native LD2 sequence (LSELDRLLLELN). Error bars correspond to standard deviations. The labels “nX/Y” denote chemical types encountered at position n in the LD2 (X) and designed (Y) sequence

According to the design, sequences LKTLDELLAYLE and LRQVEKLLEELE have the highest relative binding affinity for sites 14 and 23, respectively (Fig. 5). In the MD analysis, both sequences retain improved affinities by 3.1–3.4 kcal/mol, relative to the native LD2 complex. A dimer with these sequences connected by a suitable linker might be a successful inhibitor of the FAK:Paxillin complex. In the case of the first sequence, improved binding at site 14 is mainly due to the nonpolar free energy component (Table 3). The substitution N+8E (with respect to the LD2 motif) improves nonpolar contacts with proximal residues and electrostatic interactions with site 14 lysines; the substitution E−1T alleviates unfavorable polar contacts with a nearby tyrosine (Y925). Notably, additional contributions to affinity are due to the conserved leucines L−3, L0, and L+4. For sequence LRQVEKLLEELE, the stronger relative affinity for site 23 is mainly due to the polar interactions of the substitution E/Q−1 and L−3. The substitution L/V0 reduces unfavorable polar contributions, while the lysine sidechain is electrostatically preferred over the arginine at position +2. Additionally, E+6 makes slightly better polar and nonpolar interactions with respect to the native LD2 motif.

Two sequences predicted by the MD analysis to bind more weakly than the LD2 sequence are also included in Table 3. Sequence LKTLDRLLAELE binds at site 14 with a relative affinity of −1.5 kcal/mol (design) or +0.2 kcal/mol (MD). Similarly, sequence LRYLEQLLAELE has a relative affinity for site 23 of −3.2 kcal/mol (design) and +4.2 kcal/mol (MD).

4 Notes

1. In the “Fluctuating Dielectric Boundary Method,” the GB interaction term between two residues R and R′ is expressed as a polynomial of the residue solvation radii, with coefficients precomputed and stored [39, 40]. The solvation radii are updated during the MC simulation and the GB term is computed efficiently from the polynomial expression. A simpler method is the “Native Environment Approximation,” where the solvation radius of a particular residue is computed with the rest of the molecule in the wildtype sequence and native structure [50].

2. A total of 1000 MD trajectory frames were extracted at 20-ps intervals, aligned with respect to the backbone non-hydrogen coordinates and clustered with Wordom [51].
Frames were in the same cluster if their peptide and proximal protein Cα atoms (within 5 Å of the peptide) had an RMSD less than 1.8 Å. For the most populated cluster (62% of frames), we chose the snapshot with the highest binding affinity in the MM/GBDILK approximation (Eq. 9) [16, 17] as the structural model for the construction of the IEM of the complex and unbound peptide.

3. In the computation of diagonal terms, we alleviate steric repulsions and improve sidechain orientations by 15 steps of Powell’s conjugate gradient minimization. The sidechain is retained in its rotamer via dihedral-angle harmonic restraints with a force constant of 200 kcal/mol/rad² and a tolerance range of 5° around the initial rotamer angle; only atoms beyond Cβ are allowed to move. Rotamers that retain high energies after minimization can be excluded from the sampling space. Interactions between two sidechains I and J are computed if the minimum distance l_min between any of their atoms is smaller than 12 Å. If l_min ≤ 3 Å, the orientations of the two sidechains are optimized by a 15-step minimization; the interactions of the two sidechains and the fixed part are taken into account and the sidechains are retained in their specific rotamer orientations by dihedral harmonic restraints.

4. We conducted iterative replica-exchange Monte Carlo (REMC) simulations, using eight replicas with temperatures between 88 K and 1510 K (thermal energies ranging from 0.175 to 3 kcal/mol). Swaps between neighboring replicas were attempted every 50,000 steps. Each cycle had a length of 5 × 10⁶ MC steps per replica. At the end of a cycle, the average amino acid composition was computed from the sequences generated by the room-temperature replica (k_BT = 0.592 kcal/mol) and the reference energies were updated via Eq. 4.

5. REMC simulations were conducted as in the previous Note. Each had a length of 8 × 10⁸ MC steps, with swaps every 4000 steps.
At each step, one or two positions were randomly chosen and their rotamers and/or chemical types were modified with the following frequencies: one-position rotamer changes 57%, chemical type changes 11%; two-position rotamer changes 23%, rotamer/chemical type changes 6%, type changes 3%. The modifications were accepted or rejected according to the Metropolis criterion.

6. Prior to the simulations, the “active” and “inactive” sidechain rotamers were optimized in the presence of the backbone and remaining fixed sidechains with the protX program [45]. The solvation and simulation setup was performed with the CHARMM-GUI interface [49] and the simulations were conducted with the NAMD program [21]. The structures were solvated in a truncated octahedral water box, creating a hydration layer with a minimum thickness of 15 Å. The solvated complexes were minimized by 500 conjugate gradient and 500 adopted-basis Newton-Raphson steps, with harmonic restraints of 1.0 kcal/mol/Å² on backbone and 0.1 kcal/mol/Å² on sidechain non-hydrogen atoms. Each complex was equilibrated by a 0.5-ns simulation in the NVT ensemble at 300 K, with backbone harmonic restraints and periodic velocity reassignment. During production, the system temperature was maintained around 300 K by Langevin dynamics with a friction coefficient of 5 ps⁻¹. The pressure was kept constant at 1 atm using a Nosé-Hoover Langevin piston with a period of 200 fs. Electrostatic interactions were computed every 2 steps by the Particle Mesh Ewald method. A distance cutoff of 12 Å was set for all non-bonded interactions. Each production run had a time-step of 2 fs and a total duration of 10 ns.

7. This can be achieved with the aid of a similarity substitution matrix [52, 53], which takes into account amino acid physicochemical properties.
For example, substitutions that preserve the sidechain size, charge, and polar or nonpolar character can be classified as conservative; the corresponding sequences can be grouped in the same cluster. Within each cluster, the sequence with the best binding affinity can be chosen as the representative member of the cluster.

Acknowledgements

This work was co-funded by the European Regional Development Fund and the Republic of Cyprus through the Research and Innovation Foundation (Project: INFRASTRUCTURES/1216/0060). EM was supported by a graduate student fellowship from the University of Cyprus.

References

1. Schaller MD (2010) Cellular functions of FAK kinases: insight into molecular mechanisms and novel functions. J Cell Sci 123(7):1007–1013
2. Walkiewicz KW, Girault J, Arold ST (2015) How to awaken your nanomachines: site-specific activation of focal adhesion kinases through ligand interactions. Prog Biophys Mol Bio 119(1):60–71
3. Naser R, Aldehaiman A, Díaz-Galicia E, Arold ST (2018) Endogenous control mechanisms of FAK and PYK2 and their relevance to cancer development. Cancers 10(6):196
4. Sulzmaier FJ, Jean C, Schlaepfer DD (2014) FAK in cancer: mechanistic findings and clinical applications. Nat Rev Cancer 14(9):598–610
5. Shen T, Guo Q (2018) Role of Pyk2 in human cancers. Med Sci Monitor 24:8172–8182
6. Liu S, Chen L, Xu Y (2018) Significance of PYK2 level as a prognosis predictor in patients with colon adenocarcinoma after surgical resection. Oncotargets Ther 11:7625–7634
7. Quiroga MN, Díaz MR, Moreno J, Aguilar RG, Ibarra ML, Sánchez PP, Cabrero IA, Gómez GV, Zavaleta LR, Aranda DA, Gómez FS (2019) Increased expression of FAK isoforms as potential cancer biomarkers in ovarian cancer. Oncol Lett 17:4779–4786
8. Pan M-R, Wu C-C, Kan J-Y, Li Q-L, Chang S-J, Wu C-C, Li C-L, Ou-Yang F, Hou M-F, Yip H-K, Luo C-W (2019) Impact of FAK expression on the cytotoxic effects of CIK therapy in triple-negative breast cancer. Cancers 12(1):94
9. Golubovskaya VM, Ho B, Zheng M, Magis A, Ostrov D, Morrison C, Cance WG (2013) Disruption of focal adhesion kinase and p53 interaction with small molecule compound r2 reactivated p53 and blocked tumor growth. BMC Cancer 13(1):342
10. Golubovskaya VM, Palma NL, Zheng M, Ho B, Magis A, Ostrov D, Cance WG (2013) A small-molecule inhibitor, 5′-O-tritylthymidine, targets FAK and Mdm-2 interaction, and blocks breast and colon tumorigenesis in vivo. Anti-Cancer Agent Me 13(4):532–545
11. Ucar DA, Magis AT, He D, Lawrence NJ, Sebti SM, Kurenova E, Zajac-Kaye M, Zhang J, Hochwald SN (2013) Inhibiting the interaction of cMET and IGF-1r with FAK effectively reduces growth of pancreatic cancer cells in vitro and in vivo. Anti-Cancer Agent Me 13(4):595–602
12. Ucar DA, Kurenova E, Garrett TJ, Cance WG, Nyberg C, Cox A, Massoll N, Ostrov DA, Lawrence N, Sebti SM, Zajac-Kaye M, Hochwald SN (2014) Disruption of the protein interaction between FAK and IGF-1R inhibits melanoma tumor growth. Cell Cycle 11(17):3250–3259
13. Lv P-C, Jiang A-Q, Zhang W-M, Zhu H-L (2018) FAK inhibitors in cancer, a patent review. Expert Opinion Therapeutic Patents 28(2):139–145. PMID: 29210300
14. Alvarado C, Stahl E, Koessel K, Rivera A, Cherry BR, Pulavarti SVSRK, Szyperski T, Cance W, Marlowe T (2019) Development of a fragment-based screening assay for the focal adhesion targeting domain using SPR and NMR. Molecules 24(18):3352
15. Mabonga L, Kappo AP (2020) Peptidomimetics: a synthetic tool for inhibiting protein–protein interactions in cancer. Int J Peptide Res Therapeutics 26(1):225–241
16. Michael E, Polydorides S, Simonson T, Archontis G (2017) Simple models for nonpolar solvation: parameterization and testing. J Comp Chem 38(29):2509–2519
17. Michael E, Polydorides S, Promponas V, Skourides P, Archontis G (2021) Recognition of LD motifs by the focal adhesion targeting domains of focal adhesion kinase and proline-rich tyrosine kinase 2beta: insights from molecular dynamics simulations. Proteins 89(29):29–52
18. Munoz V, Serrano L (1995) Elucidating the folding problem of helical peptides using empirical parameters. ii. helix macrodipole effects and rational modification of the helical content of natural peptides. J Mol Biol 245(3):275–296
19. Mignon D, Druart K, Michael E, Opuu V, Polydorides S, Villa F, Gaillard T, Panel N, Archontis G, Simonson T (2020) Physics-based computational protein design: an update. J Phys Chem A 2020:10637–10648
20. Simonson T, Gaillard T, Mignon D, am Busch MS, Lopes A, Amara N, Polydorides S, Sedano A, Druart K, Archontis G (2013) Computational protein design: the proteus software and selected applications. J Comput Chem 34(28):2472–2484
21. Phillips JC, Hardy DJ, Maia JDC, Stone JE, Ribeiro JV, Bernardi RC, Buch R, Fiorin G, Henin J, Jiang W, McGreevy R, Melo MCR, Radak BK, Skeel RD, Singharoy A, Wang Y, Roux B, Aksimentiev A, Luthey-Schulten Z, Kale LV, Schulten K, Chipot C, Tajkhorshid E (2020) Scalable molecular dynamics on CPU and GPU architectures with NAMD. J Chem Phys 153:044130
22. Hayashi I, Vuori K, Liddington RC (2002) The focal adhesion targeting (FAT) region of focal adhesion kinase is a four-helix bundle that binds paxillin. Nat Struct Mol Biol 9(2):101–106
23. Arold ST, Hoellerer MK, Noble MEM (2002) The structural basis of localization and signaling by the focal adhesion targeting domain. Structure 10(3):319–327
24. Lulo J, Yuzawa S, Schlessinger J (2009) Crystal structures of free and ligand-bound focal adhesion targeting domain of Pyk2. Biochem Bioph Res Co 383(3):347–352
25. Alam T, Alazmi M, Gao X, Arold ST (2014) How to find a leucine in a haystack? structure, ligand recognition and regulation of leucine–aspartic acid (LD) motifs. Biochem J 460(3):317–329
26. Liu G, Guibao CD, Zheng J (2002) Structural insight into the mechanisms of targeting and signaling of focal adhesion kinase. Mol Cell Biol 22(8):2751–2760
27. Hoellerer MK, Noble MEM, Labesse G, Campbell ID, Werner JM, Arold ST (2003) Molecular recognition of paxillin LD motifs by the focal adhesion targeting domain. Structure 11(10):1207–1217
28. Gao G, Prutzman KC, King ML, Scheswohl DM, DeRose EF, London RE, Schaller MD, Campbell SL (2004) NMR solution structure of the focal adhesion targeting domain of focal adhesion kinase in complex with a paxillin LD peptide: evidence for a two-site binding model. J Biolog Chem 279(9):8441–8451
29. Bertolucci CM, Guibao CD, Zheng J (2005) Structural features of the focal adhesion kinase-paxillin complex give insight into the dynamics of focal adhesion assembly. Prot Sci 14(3):644–652
30. Tuffery P, Etchebest C, Hazout S, Lavery R (1991) A new approach to the rapid determination of protein side chain conformations. J Biomol Struct Dyn 8(6):1267–1289
31. Simonson T (2013) What is the dielectric constant of a protein when its backbone is fixed? JCTC 9:4603–4608
32. Cornell W, Cieplak P, Bayly C, Gould I, Merz K, Ferguson D, Spellmeyer D, Fox T, Caldwell J, Kollman P (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 117:5179–5197
33. Still WC, Tempczyk A, Hawley RC, Hendrickson T (1990) Semianalytical treatment of solvation for molecular mechanics and dynamics. J Am Chem Soc 112(16):6127–6129
34. Hawkins GD, Cramer CJ, Truhlar DG (1995) Pairwise solute descreening of solute charges from a dielectric medium. Chem Phys Lett 246(1–2):122–129
35. Schaefer M, Karplus M (1996) A comprehensive analytical treatment of continuum electrostatics. J Phys Chem 100(5):1578–1599
36. Weeks JD, Chandler D, Andersen HC (1971) Role of repulsive forces in determining the equilibrium structure of simple liquids. J Chem Phys 54(12):5237–5247
37. Aguilar B, Shadrach R, Onufriev AV (2010) Reducing the secondary structure bias in the generalized born model via r6 effective radii. J Chem Theory Comp 6(12):3613–3630
38. Lazaridis T, Karplus M (1999) Effective energy function for proteins in solution. Proteins 35(2):133–152
39. Archontis G, Simonson T (2005) A residue-pairwise generalized born scheme suitable for protein design calculations. J Phys Chem B 109(47):22667–22673
40. Villa F, Mignon D, Polydorides S, Simonson T (2017) Comparing pairwise-additive and many-body generalized born models for acid/base calculations and protein design. J Comput Chem 38(28):2396–2410
41. Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D (2003) Design of a novel globular protein fold with atomic-level accuracy. Science 302:1364–1368
42. Ollikainen N, de Jong RM, Kortemme T (2015) Coupling protein side-chain and backbone flexibility improves the re-design of protein-ligand specificity. PLOS Comput Biol 11(9):1–22
43. Druart K, Bigot J, Audit E, Simonson T (2016) A hybrid Monte Carlo scheme for multibackbone protein design. J Chem Theory Comp 12:6035–6048
44. Hayes RL, Armacost KA, Vilseck JZ, Brooks III CL (2017) Adaptive landscape flattening accelerates sampling of alchemical space in multisite λ dynamics. J Phys Chem B 121(15):3626–3635
45. Simonson T (2020) PROTEUS 3.0 Manual. https://proteus.polytechnique.fr
46. Mignon D, Simonson T (2016) Comparing three stochastic search algorithms for computational protein design: Monte Carlo, replica exchange Monte Carlo, and a multistart, steepest-descent heuristic. J Comput Chem 37(19):1781–1793
47. Mignon D, Panel N, Chen X, Fuentes EJ, Simonson T (2017) Computational design of the Tiam1 PDZ domain and its ligand binding. J Chem Theory Comput 13(5):2271–2289
48. Genheden S, Ryde U (2015) The MM/PBSA and MM/GBSA methods to estimate ligand-binding affinities. Expert Opin Drug Dis 10(5):449–461
49. Jo S, Kim T, Iyer VG, Im W (2008) CHARMM-GUI: a web-based graphical user interface for CHARMM. J Comp Chem 29(11):1859–1865
50. Polydorides S, Simonson T (2013) Monte Carlo simulations of proteins at constant pH with generalized Born solvent, flexible sidechains, and an effective dielectric boundary. J Comput Chem 34:2742–2756
51. Seeber M, Cecchini M, Rao F, Settanni G, Caflisch A (2007) Wordom: a program for efficient analysis of molecular dynamics simulations. Bioinf 23:2625–2627
52. Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185(4154):862–864
53. Sneath P (1966) Relations between chemical structure and biological activity in peptides. J Theoret Biol 12(2):157–195

Chapter 19

Knowledge-Based Unfolded State Model for Protein Design

Vaitea Opuu, David Mignon, and Thomas Simonson

Abstract

The design of proteins and miniproteins is an important challenge. Designed variants should be stable, meaning the folded/unfolded free energy difference should be large enough. Thus, the unfolded state plays a central role. An extended peptide model is often used, where side chains interact with solvent and nearby backbone, but not each other. The unfolded energy is then a function of sequence composition only and can be empirically parametrized. If the space of sequences is explored with a Monte Carlo procedure, protein variants will be sampled according to a well-defined Boltzmann probability distribution. We can then choose unfolded model parameters to maximize the probability of sampling native-like sequences. This leads to a well-defined maximum likelihood framework. We present an iterative algorithm that follows the likelihood gradient. The method is presented in the context of our Proteus software, as a detailed downloadable tutorial. The unfolded model is combined with a folded model that uses molecular mechanics and a Generalized Born solvent. It was optimized for three PDZ domains and then used to redesign them.
The sequences sampled are native-like and similar to those of a recent PDZ design study that was experimentally validated.

Key words Monte Carlo, Proteus software, Molecular mechanics, Implicit solvent, Machine learning, Maximum likelihood, PDZ domain

Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols, Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_19, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

Computational protein design (CPD) is an exciting field that has had many successes. One important application is to design or redesign whole proteins and miniproteins [1–9]. A new fold was produced [2], SARS-CoV-2 miniprotein inhibitors were obtained recently [8], and multiprotein assemblies as large as 43 nm have been designed [10, 11]. For these applications, the unfolded state of the protein plays a central role. Indeed, protein variants are chosen according to their stability, which is the free energy difference between the folded and unfolded states. One possible unfolded model is an extended peptide structure [12, 13]. Several chapters in this volume describe the dynamics of extended peptides, which are complex and expensive to explore thoroughly [14–17]. Another route is to posit a statistical distribution of residue–residue interactions, such as would exist in a fluctuating, moderately compact polymer [18, 19], like a Gaussian chain [20]. This route gives valuable insights, but has not been applied to, and may not be accurate enough for, whole protein design.

Fig. 1 Extended peptide unfolded model: each residue interacts with nearby backbone and solvent, leading to the picture on the right. The unfolded energy has become a function of the sequence only

Rather, for whole protein design, an empirical knowledge-based unfolded model is normally used [21–24]. This model makes two main assumptions.
First, in the unfolded state, the side chains do not interact with each other, but only with solvent and nearby backbone. With this assumption, the energy of the unfolded state depends on the sequence but not on any specific 3D structure, as depicted in Fig. 1. Second, we assume that a good way to parametrize the unfolded model is to choose parameters so that the CPD calculation reproduces the experimental amino acid composition, for example, that of a specific protein family of interest [22–24]. This can be achieved by an iterative trial-and-error approach: run a design calculation, compute the resulting amino acid frequencies, tweak parameters to push the individual frequencies up or down as appropriate, and start over. This idea can be cast as a likelihood maximization problem, leading to well-defined machine learning algorithms, such as the one described below. The model does not require a precise 3D structure for the unfolded state.

To search the vast space of protein variants with stability as a guiding principle is a hard problem. Considerable work has gone into developing effective search algorithms [25–29]. One important approach is to use Monte Carlo (MC) simulations. At each MC step, a trial mutation R → R′ is introduced at a particular position (or a set of positions) in the folded protein. The resulting stability change is computed, by taking energy differences between R and R′ in their folded and unfolded states. This is shown in Fig. 2. We see that when a trial mutation is introduced for the folded state, we need to consider the reverse mutation in the unfolded state. This double mutation is equivalent to a double change in protein conformations: we effectively unfold R and refold R′. The corresponding energy change is indeed the stability change due to the mutation. Because of this equivalence between a trial mutation and a double conformational change, it is easy to show [24, 30] that the MC simulation mimics a precise statistical ensemble: a collection of all protein variants, present at equal concentrations, distributed between their folded and unfolded states according to their stabilities. Thus, the MC simulation mimics a large, equimolar, combinatorial library of sequences, such as would be produced experimentally [31, 32].

Fig. 2 A Monte Carlo mutation move. A point mutation is introduced in the folded protein (above); the reverse mutation is introduced in the unfolded protein (below). The double mutation is equivalent to unfolding the initial variant R and refolding the new one R′

When we explore this statistical ensemble with MC, we are constantly picking variants to unfold and others to refold. If the MC move probabilities are correctly chosen, a long simulation will produce a trajectory where the populations of folded protein variants follow a well-defined distribution: a Boltzmann distribution controlled by the folding free energy [30, 33, 34]. It is this well-defined probability distribution that allows us to formulate the unfolded state parametrization as a likelihood maximization. This leads to practical algorithms.

In the next section, “Theory,” we briefly recall the unfolded model and the maximum likelihood formalism. The model contains one or two unfolded energy parameters per amino acid type t, say $E_t^u$. We also outline the computational procedure. In the “Materials” section, we describe the energy function, the MC procedure, and the experimental sequences used. After that, in the “Methods” section, we present the procedure as a tutorial, in the context of our Proteus software [22, 24]. Proteus uses a molecular mechanics energy function with a Generalized Born (GB) solvent model for the folded state. However, the methodology and practical steps are general and could be implemented with other energy functions and CPD programs, like Rosetta [35, 36] or Osprey [37].
In the tutorial, we consider three PDZ proteins, and we adjust the $E_t^u$ parameters iteratively, during a series of MC simulations. The goal is that a long MC simulation should give sequences with the same amino acid composition as a reference set of natural PDZ proteins. In the “Illustrative Application” section, we briefly characterize the designed sequences, to illustrate the quality of the model. We focus on the so-called native sequence recovery: to what extent are the designed sequences similar to natural sequences from the PDZ family? This is an established benchmark for CPD models. Finally, the “Notes” section provides some caveats. We end with a conclusion.

2 Theory

2.1 Extended Peptide Model

For the unfolded state, a widespread model is based on a fully extended, fully solvated peptide [13, 38]. This model is a useful first step before introducing the knowledge-based unfolded model. Indeed, with a fully extended peptide, it is natural to assume that each amino acid interacts with solvent and nearby backbone, but not the other amino acids. For a peptide whose sequence is denoted S, the unfolded state energy then has the form:

$$E^u(S) = \sum_{i \in S} E^u(t_i). \tag{1}$$

The sum is over all amino acids; $t_i$ represents the side chain type of amino acid i. The type-dependent “unfolded energies” $E^u(t) \equiv E_t^u$ can be computed from the 3D structure of an extended peptide. This model is important because it expresses the unfolded energy as a function only of the protein sequence composition. A natural next step is to use empirical values for the $E_t^u$, instead of ones based on a peptide structure. This leads to the knowledge-based model, described next.

2.2 Knowledge-Based Unfolded State Model

For whole protein design, the folding energy is essential and an extended peptide model is not accurate enough. Instead, the $E_t^u$ can be chosen empirically, to reproduce the amino acid composition of a given set of natural proteins.
This was done in most successful whole protein designs [1–9]. The $E_t^u$ can be thought of as effective chemical potentials. MC generates an ensemble where the population of each sequence follows a Boltzmann distribution [30]. This makes it possible to choose unfolded energies that maximize the probability of the natural sequences. We briefly recall the method [23, 24]. Let $\mathcal{S}$ be a set of N natural “target” sequences S, compatible with one or more folds of interest. The Boltzmann probability of S is

$$p(S) = \frac{1}{Z} \exp\left(-\beta \Delta G_f(S)\right), \tag{2}$$

where $\Delta G_f(S) = G_f(S) - E^u(S)$ is the folding free energy of S, $G_f(S)$ is the free energy of the folded form, $\beta = 1/kT$ is the inverse temperature, and Z is a normalizing constant (the partition function). We denote by $\mathcal{L}$ the probability of the entire sequence set, which depends on the $E_t^u$; we refer to $\mathcal{L}$ as their likelihood [39]. To maximize $\mathcal{L}$, its derivatives with respect to the $E_t^u$ should be zero. After some algebra, we obtain [23]

$$\frac{\partial \ln \mathcal{L}}{\partial E_t^u} = \beta \sum_{S} \left( n_S(t) - \langle n(t) \rangle \right) = \beta \left( N(t) - N \langle n(t) \rangle \right), \tag{3}$$

where $n_S(t)$ is the number of amino acids of type t in the target sequence S, $\langle n(t) \rangle$ is the mean number of amino acids of type t per sequence sampled during the simulations, and $N(t)$ is the number in the whole dataset $\mathcal{S}$ [23]. Therefore,

$$\mathcal{L} \ \text{maximum} \ \Rightarrow \ \frac{N(t)}{N} = \langle n(t) \rangle, \quad \forall t \in \text{aa}. \tag{4}$$

Thus, to maximize $\mathcal{L}$, we choose $\{E_t^u\}$ such that a long simulation gives the same amino acid frequencies as the target database. Second derivatives also have a simple expression:

$$\frac{\partial^2 \ln \mathcal{L}}{\partial E_t^u \, \partial E_w^u} = -N \beta^2 \left( \langle n(t) n(w) \rangle - \langle n(t) \rangle \langle n(w) \rangle \right). \tag{5}$$

With the first and second derivatives in hand, various gradient search methods can be used.

2.3 Gradient Search Method

We use an iterative method to approach the maximum likelihood $\{E_t^u\}$ values. At iteration n, let $\{E_t^u(n)\}$ be the current parameter guess. We begin by running a simulation with these parameters.
We then update the parameters by moving along the gradient of $\mathcal{L}$, using the update rule [39]:

$$E_t^u(n+1) = E_t^u(n) + \alpha \frac{\partial}{\partial E_t^u} \ln \mathcal{L} = E_t^u(n) + \delta E \left( n_t^{\rm exp} - \langle n(t) \rangle_n \right). \tag{6}$$

Here, α is a constant, $n_t^{\rm exp} = N(t)/N$ is the mean population of amino acid type t in the target database, $\langle \cdot \rangle_n$ indicates an average over a simulation done with the current unfolded energies $\{E_t^u(n)\}$, and δE is an empirical constant with the dimension of an energy, referred to as the update amplitude. This update procedure is repeated until convergence.

3 Materials

3.1 Energy Function for the Folded State

The energy function for the folded state has the form

$$E^f = E_{\rm intra} + E_{\rm GB} + E_{\rm nonpolar}. \tag{7}$$

The first term is the protein internal energy. In our work, it is taken from the Amber ff99SB force field [40]. The other two are solvent contributions. The Generalized Born (GB) term $E_{\rm GB}$ captures the main electrostatic effects [41], while $E_{\rm nonpolar}$ represents dispersion and hydrophobic effects through a Lazaridis–Karplus (LK) term [42]. The GB term involves atomic solvation radii $b_i$ that approximate the distance from atom i to the protein surface and depend on all coordinates. With a “Native Environment Approximation” (NEA) [43, 44], each $b_i$ is computed ahead of time, with the rest of the system in its native sequence and conformation. This removes the many-body character of the GB solvent and leads to a pairwise-additive energy. When a pairwise-additive energy is combined with a fixed backbone and a discrete rotamer library, residue interaction energies can be computed ahead of time and stored in a lookup table or “energy matrix” [1]. We also developed an exact method where the $b_i$ are computed on the fly, during MC, yet an energy matrix can still be used, with the GB energies represented by a lookup table of lookup tables [44, 45]. This method is referred to as the “Fluctuating Dielectric Boundary” method or FDB.
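To make the update rule of Eq. 6 (Subheading 2.3) concrete, the following minimal Python sketch iterates it on a toy problem. It is not the Proteus implementation. The MC simulation is replaced by a hypothetical stand-in, `toy_frequencies`, which returns exact Boltzmann frequencies for a single mutating position; the folded energies `G_f`, the target composition, and the update amplitude `delta_E = 0.5` kcal/mol are illustrative assumptions.

```python
import math

KT = 0.592  # thermal energy at room temperature, kcal/mol


def toy_frequencies(E_u, G_f):
    """Hypothetical stand-in for an MC run: exact Boltzmann frequencies
    for one mutating position, p(t) ~ exp(-(G_f[t] - E_u[t]) / kT)."""
    weights = {t: math.exp(-(G_f[t] - E_u[t]) / KT) for t in E_u}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}


def update_unfolded_energies(E_u, target, sampled, delta_E):
    """One step of Eq. 6: raise E_u[t] when type t is undersampled,
    lower it when type t is oversampled."""
    return {t: E_u[t] + delta_E * (target[t] - sampled[t]) for t in E_u}


def optimize(G_f, target, delta_E=0.5, n_iter=500):
    """Repeat 'simulate, compare, update' a fixed number of times."""
    E_u = {t: 0.0 for t in G_f}
    for _ in range(n_iter):
        sampled = toy_frequencies(E_u, G_f)
        E_u = update_unfolded_energies(E_u, target, sampled, delta_E)
    return E_u


# Illustrative (made-up) folded energies and target composition.
G_f = {"ALA": -1.0, "ASP": 0.5, "LYS": 2.0}
target = {"ALA": 0.5, "ASP": 0.3, "LYS": 0.2}
E_u = optimize(G_f, target)
final = toy_frequencies(E_u, G_f)
```

In a real application, `toy_frequencies` would presumably be replaced by the amino acid frequencies measured in an MC or REMC run with the current parameters, which carry statistical noise. For sequences of fixed length, only differences between the $E_t^u$ affect the sampled populations, so the parameters are defined up to a common additive constant.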
3.2 Monte Carlo Exploration

MC simulations with Proteus use moves where either rotamers, amino acid types, or both are changed at one or two positions. Mutating positions are user-defined and depend on the problem. Sampling can be enhanced by Replica Exchange Monte Carlo (REMC), where several MC simulations are run in parallel at different temperatures [30]. With a precalculated energy matrix, one billion REMC steps can be run in a few hours on a desktop machine.

3.3 Experimental Sequences and Alignment

The tutorial below considers three PDZ domains: 1G9O, 1R6J, and 2BYG. To define the target amino acid frequencies, we collected homologous sequences for each domain by searching the non-redundant (NR) database with NCBI/Blast [46], the Blosum62 scoring matrix, and the PDB sequence as query. We retained homologs with sequence identities to the query above 60%. We used the HMMER algorithm [47] and the Superfamily tool [48] to identify and eliminate any Blast hits that did not belong to the same protein family as the query, leaving a total of 199 homologs. Finally, we aligned each query and its homologs with Clustal Omega [49]. Experimental amino acid frequencies were averaged over the alignment, with separate averaging for buried and exposed positions. Burial was determined by the fractional burial observed for each position in the 3D structures of the test proteins.

3.4 Amino Acid Group Constraints

To improve convergence, we usually apply constraints during the early iterations [23]. The amino acid types are grouped into classes, whose parameters are linked. For example, Asp and Glu could form a class. The difference between their $E_t^u$ values is computed from molecular mechanics (the energy function in Eq. 7) and kept fixed. During this early phase, the Asp and Glu values are thus interdependent and chosen to reproduce the total Asp+Glu frequency in the target dataset.
Later, the class constraints are released, and the parameters for the individual types (Asp and Glu) evolve independently. A detailed optimization schedule is shown further on.

3.5 Proteus Software Files and Documentation

Proteus 3.0 is freely available from https://proteus.polytechnique.fr to academic and government scientists. Scientists from companies should contact the corresponding author. The distribution includes source code, binaries for Intel processors, extensive test cases, and detailed documentation. Files to run the tutorial below are included.

4 Methods: A Proteus Tutorial

4.1 Overview

The present tutorial documents the procedure to parametrize the unfolded state model in the context of CPD. It uses a set of three PDZ domains: NHERF, Syntenin, and DLG2, referred to by their PDB codes: 1G9O, 1R6J, and 2BYG. We assume these systems have already been set up for Proteus. In particular, the 3D structural models are in place, and the three energy matrices have been computed, following a procedure documented elsewhere [24, 50]. The structural models (setup.pdb files) include information on the solvent accessibility of the protein residues. In addition, a sequence alignment has been constructed, by searching a sequence database using the three domains as queries [24]. During the tutorial, we will

• compute the target frequencies using the proteins from the sequence alignment as a reference,
• compute initial guesses for the unfolded energy parameters,
• optimize the parameters iteratively, using Proteus.

The main data files and scripts are listed in Table 1. The top directory is referred to as TOP. The main computational steps (controlled by iteration.sh) will typically be run in parallel on a small collection of laboratory computers running Unix. One of them will serve as a “master” node, hosting all the input data and software and coordinating the calculations.
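The first step listed in the overview, computing target frequencies from the alignment with separate averaging for buried and exposed positions, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not one of the Proteus scripts: the aligned sequences are passed as equal-length strings with "-" for gaps, the buried alignment columns are passed as a set of column indices, and the function name `target_frequencies` is hypothetical. The result is expressed as fractions within each burial class.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def target_frequencies(aligned_seqs, buried_columns):
    """Return amino acid fractions for buried and exposed alignment columns.

    aligned_seqs   : list of equal-length strings, '-' marks a gap
    buried_columns : set of 0-based column indices treated as buried
    """
    counts = {"buried": Counter(), "exposed": Counter()}
    for seq in aligned_seqs:
        for col, aa in enumerate(seq.upper()):
            if aa not in AMINO_ACIDS:  # skip gaps and unknown symbols
                continue
            key = "buried" if col in buried_columns else "exposed"
            counts[key][aa] += 1
    freqs = {}
    for key, c in counts.items():
        total = sum(c.values())
        freqs[key] = {aa: (c[aa] / total if total else 0.0)
                      for aa in AMINO_ACIDS}
    return freqs


# Toy alignment (made-up sequences), with columns 0 and 3 taken as buried.
example = target_frequencies(["LKT-D", "LRTAD", "IK-AE"], {0, 3})
```

Reading a Clustal-format alignment file and deriving the buried columns from the structural models are left out of the sketch; both would be needed before applying it to real data.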
At each iteration, for each test protein, one or more MC simulations are run with the current parameter guess. In each simulation, all or part of the protein residues are allowed to mutate. In the tutorial, two simulations are done per protein, each with half of the residues mutating (every other residue). Relevant information is provided by the user in a file TOP/machine.info on the master node (see also Note 1):

Table 1 Main tutorial files

The test proteins: 1G9O, 1R6J, 2BYG
TOP/1G9O/      matrix.bb     Diagonal elements of the energy matrix
               matrix.pw     Off-diagonal elements of the energy matrix
               setup.pdb     3D structure, with explicit residue burial information
1R6J/, 2BYG/   idem
TOP/lib/       all_seq.aln   Sequence alignment of natural homologs, Clustal format
Scripts and environment variables
TOP/           project.info