Thomas Simonson - Computational Peptide Science Methods and Protocols-Humana (2022)

Methods in
Molecular Biology 2405
Thomas Simonson Editor
Computational
Peptide Science
Methods and Protocols
METHODS
IN
MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, UK
For further volumes:
http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and
methodologies in the critically acclaimed Methods in Molecular Biology series. The series was
the first to introduce the step-by-step protocols approach that has become the standard in all
biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents
needed to complete the experiment, and followed by a detailed procedure that is supported
with a helpful notes section offering tips and tricks of the trade as well as troubleshooting
advice. These hallmark features were introduced by series editor Dr. John Walker and
constitute the key ingredient in each and every volume of the Methods in Molecular Biology
series. Tested and trusted, comprehensive and reliable, all protocols from the series are
indexed in PubMed.
Computational Peptide Science
Methods and Protocols
Edited by
Thomas Simonson
Lab de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-0716-1854-7
ISBN 978-1-0716-1855-4 (eBook)
https://doi.org/10.1007/978-1-0716-1855-4
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part
of Springer Nature 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and
retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter
developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been
made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer
Nature.
The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface
Computational peptide science is a broad and fast-moving field. “Peptides” have many
shapes, and computations come in many forms. This book provides a collection of protocols
and approaches, compiled by many of today’s leaders in the field. While diverse and
important, the topics are far from exhaustive and the choices partly subjective. The methodologies include mining properties from sequence databases, predicting structure, dynamics, and interactions using molecular modeling, and designing peptides computationally.
But the diversity starts with the peptides.
Many natural peptides are produced in cells. They can be genetically encoded or
nonribosomal and have important functions, acting as ligands, inhibitors, messengers,
hormones, toxins, or structural building blocks. For example, natural antimicrobial peptides
target bacterial ribosomes as part of the innate immunity of animals and insects [1, 2]. Other
peptides arise as by-products of protein cleavage, maturation, degradation, or misfolding.
Their accumulation or aggregation, for example in amyloid fibers, can have major consequences for the health of cells and tissues [3]. Exogenous peptides are processed by the
major histocompatibility complex (MHC) for immunity. Protein regions that are intrinsically or transiently disordered have many of the same properties as peptides. Thus, protein–protein interactions are often mediated by short, weakly structured, linear peptide
motifs [4] or by larger protein regions that only become structured upon binding [5, 6].
Synthetic peptides and peptidomimetics are another category that have potential applications as antibacterial ligands [1, 2], miniproteins [7], or for the formation of assemblies and
biomaterials [8, 9].
We would like to understand and engineer all these systems. However, it is hard to
characterize the structure and dynamics of peptides experimentally. They are often disordered when they are not engaged by another macromolecule in a complex. They can sample
many conformations over many timescales, similar to a denatured protein [10, 11]. They
may interact with lipid membranes, which are themselves dynamic and fluid. Such structures
are not readily solved by crystallography or NMR. Mean properties like the radius of
gyration or diffusion coefficient can be measured in vitro, and peptides can be probed
chemically by protease digestion or hydrogen exchange. But their precise conformations
and dynamics remain elusive, and their behavior in vivo even more so [12].
Computational approaches are another route. They are increasingly attractive as computer power continues to grow. Massive sequence databases can be mined to identify low
complexity regions, propensities for disorder [13], or amyloid formation [14]. Molecular
modeling can be used to predict structure and binding [15]. Molecular dynamics can
explore conformational space with atomic resolution, revealing conformer populations,
timescales, solvent or lipid structure, and the underlying physical interactions. Virtual
directed evolution of peptides or miniproteins can be done with the methods of protein
design [7, 16, 17].
Here, too, there are difficulties. Structure, flexibility, binding, and specificity arise from a
competition and balance among many interactions, most of them weak, involving peptides,
solvents, ions, and possibly receptors or membranes [18]. Peptide recognition often involves
conformational selection or induced fit. Enthalpic and entropic effects are both essential. To
capture all these effects at a manageable cost, molecular modeling introduces many
approximations. Widely used force fields are quite simple, with constant atomic point
charges, simple, transferable Lennard–Jones interactions, and water molecules described
by just three particles [15]. To increase throughput, an essential further step is to treat
solvent implicitly, usually as a dielectric continuum [19]. This is a drastic approximation even
for polar interactions, and it does not include the nonpolar interactions with solvent, which
require specific treatment. For receptor binding, pose selection and scoring are hard combinatorial problems, and further approximations are often used, such as a rigid receptor and
even simpler solvent models.
Methodologies continue to develop and improve. Force fields have been refined for
peptidomimetics and for unfolded proteins [20]. Electronic polarizability can be treated
explicitly for ionic interactions [21, 22]. Powerful methods to sample conformations are
increasingly available, like adaptive landscape flattening [23–25]. Coarse-grained models
allow very long simulation times [26]. High-throughput methods for peptide docking
[27] and design [7, 16, 17] continue to improve. Powerful machine learning approaches
are under development to mine sequence databases [28, 29].
This volume introduces many of these methodologies. A few chapters have the form of
literature reviews. Most are practical tutorials for specific methods. The first four describe
methods to infer peptide properties from their sequences: antimicrobial activity, foldability,
sheet formation. The next five describe methods to simulate the structure and dynamics of
peptides, including amyloid formers and membrane-active peptides, using tools of increasing sophistication. Five chapters describe the design and modeling of peptides to form
organized assemblies and to bind protein interfaces, and the prediction of peptide–MHC
complexes. Gallicchio reviews advanced free energy simulations for peptide binding. Finally,
the last four chapters describe methods for high-throughput peptide or miniprotein design.
This is an exciting time for computational peptide science. The concepts, methods, and
guidelines laid out below should help both novices and experienced workers benefit from
the new opportunities and challenges, now and in the future.
Paris, France
Thomas Simonson
References
1. Krizsan A, Volke D, Weinert S, Sträter N, Knappe D, Hoffmann R (2014) Insect-derived proline-rich
antimicrobial peptides kill bacteria by inhibiting bacterial protein translation at the 70S ribosome.
Angew Chem Int Ed 53:12236–12239
2. Seefeldt AC, Nguyen F, Antunes S, Pérébaskine N, Graf M, Arenz S, Inampudi KK, Douat C,
Guichard G, Wilson DN, Innis CA (2016) The proline-rich antimicrobial peptide onc112 inhibits
translation by blocking and destabilizing the initiation complex. Nat Struct Mol Biol 22:470–475
3. Scheckel C, Aguzzi A (2018) Prions, prionoids and protein misfolding disorders. Nat Rev Genet 19:
405–418
4. Borg JP (ed) (2020) PDZ mediated interactions: methods and protocols, vol 2256. Springer Verlag,
New York
5. Shoemaker BA, Portman JJ, Wolynes PG (2000) Speeding molecular recognition by using the folding
funnel: the fly-casting mechanism. Proc Natl Acad Sci U S A 97:8868–8873
6. Gianni S, Dogan J, Jemth P (2016) Coupled binding and folding of intrinsically disordered proteins:
what can we learn from kinetics? Curr Opin Struct Biol 36:18–24
7. Cao L, Goreshnik I, Coventry B, Case JB, Miller L, Kozodoy L, Chen RE, Carter L, Walls L, Park Y-J,
Stewart L, Diamond M, Veesler D, Baker D (2020) De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 370:426–431
8. Reches M, Gazit E (2003) Casting metal nanowires within discrete self-assembled peptide nanotubes.
Science 300:625–627
9. Wei G, Su Z, Reynolds NP, Arosio P, Hamley IW, Gazit E, Mezzenga R (2017) Self-assembling
peptide and protein amyloids: from structure to tailored function in nanotechnology. Chem Soc Rev
46:4661–4708
10. Ptitsyn OB (1995) Molten globule and protein folding. Adv Protein Chem 47:83–229
11. Korzhnev DM, Religa TL, Banachewicz W, Fersht AR, Kay LE (2010) A transient and low-populated
protein-folding intermediate at atomic resolution. Science 329:1312–1316
12. Theillet FX, Binolfi A, Bekei B, Martorana A, Rose HM, Stuiver M, Verzini S, Lorenz D, van
Rossum M, Goldfarb D, Selenko P (2016) Structural disorder of monomeric α-synuclein persists in
mammalian cells. Nature 530:45–50
13. Barik A, Katuwawala A, Hanson J, Paliwal K, Zhou Y, Kurgan L (2020) Depicter: intrinsic disorder
and disorder function prediction server. J Mol Biol 432:3379–3387
14. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L (2004) Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotech 22:
1302–1306
15. Becker O, MacKerell AD Jr, Roux B, Watanabe M (eds) (2001) Computational biochemistry &
biophysics. Marcel Dekker, New York
16. Stoddard B (ed) (2016) Design and creation of ligand binding proteins, vol 1414. Springer Verlag,
New York
17. Mignon D, Druart K, Michael E, Opuu V, Polydorides S, Villa F, Gaillard T, Panel N, Archontis G,
Simonson T (2020) Physics-based computational protein design: an update. J Phys Chem A 124:
10637–10648
18. Simonson T (2015) The physical basis of ligand binding. In: Cavasotto C (ed) In silico drug discovery
and design: theory, methods, challenges, and applications, chapter 1. CRC Press, Boca Raton
19. Roux B, Simonson T (1999) Implicit solvent models. Biophys Chem 78:1–20
20. Best RB (2017) Computational and theoretical advances in studies of intrinsically disordered proteins.
Curr Opin Struct Biol 42:147–154
21. Panel N, Villa F, Fuentes EJ, Simonson T (2018) Accurate PDZ-peptide binding specificity with
additive and polarizable free energy simulations. Biophys J 114:1091–1101
22. Rackers JA, Wang Z, Lu C, Laury ML, Lagardere L, Schnieders MJ, Piquemal J-P, Ren PY, Ponder JW
(2018) Tinker 8: software tools for molecular design. J Chem Theory Comput 14:5273–5289
23. Lu C, Li X, Wu D, Zheng L, Yang W (2016) Predictive sampling of rare conformational events in
aqueous solution: designing a generalized orthogonal space tempering method. J Chem Theory
Comput 12:41–52
24. Villa F, Panel N, Chen X, Simonson T (2018) Adaptive landscape flattening in amino acid sequence
space for the computational design of protein:peptide binding. J Chem Phys 149:072302
25. Yalinca H, Gehin CJC, Oleinikovas V, Lashuel HA, Gervasio FL, Pastore A (2019) The
role of post-translational modifications on the energy landscape of Huntingtin
N-terminus. Front Mol Biosci 6:95
26. Souza P, Alessandri R, Barnoud J, Thallmair S, Faustino I, Grunewald F, Patmanidis I, Abdizadeh H,
Bruininks B, Wassenaar T, Kroon P, Melcr J, Nieto V, Corradi V, Khan H, Domanski J, Javanainen M,
Martinez-Seara H, Reuter N, Best R, Vattulainen I, Monticelli L, Periole X, Tieleman P, de Vries AH,
Marrink SJ (2021) Martini 3: a general purpose force field for coarse-grained molecular dynamics. Nat
Methods 18(4):382–388
27. Goodsell DS, Sanner MF, Olson AJ, Forli S (2021) The AutoDock suite at 30. Prot Sci 30:31–43
28. Gao W, Mahajan SP, Sulam J, Gray JJ (2020) Deep learning in protein structural
modeling and design. Patterns 1:1–23
29. Cannataro M, Guzzi PH, Agapito G, Zucco C, Milano M (2021) Artificial intelligence
in bioinformatics: from omics analysis to deep learning and network mining. Elsevier,
Amsterdam
Contents
Preface
Contributors
1 Machine Learning Prediction of Antimicrobial Peptides
Guangshun Wang, Iosif I. Vaisman, and Monique L. van Hoek
2 Tools for Characterizing Proteins: Circular Variance, Mutual Proximity, Chameleon Sequences, and Subsequence Propensities
Mihaly Mezei
3 Exploring the Peptide Potential of Genomes
Chris Papadopoulos, Nicolas Chevrollier, and Anne Lopes
4 Computational Identification and Design of Complementary β-Strand Sequences
Yoonjoo Choi
5 Dynamics of Amyloid Formation from Simplified Representation to Atomistic Simulations
Phuong Hoang Nguyen, Pierre Tufféry, and Philippe Derreumaux
6 Predicting Membrane-Active Peptide Dynamics in Fluidic Lipid Membranes
Charles H. Chen, Karen Pepper, Jakob P. Ulmschneider, Martin B. Ulmschneider, and Timothy K. Lu
7 Coarse-Grain Simulations of Membrane-Adsorbed Helical Peptides
Manuel N. Melo
8 Peptide Dynamics and Metadynamics: Leveraging Enhanced Sampling Molecular Dynamics to Robustly Model Long-Timescale Transitions
Joseph Clayton, Lokesh Baweja, and Jeff Wereszczynski
9 Metadynamics Simulations to Study the Structural Ensembles and Binding Processes of Intrinsically Disordered Proteins
Rui Zhou and Mojie Duan
10 Computational and Experimental Protocols to Study Cyclo-dihistidine Self- and Co-assembly: Minimalistic Bio-assemblies with Enhanced Fluorescence and Drug Encapsulation Properties
Asuka A. Orr, Yu Chen, Ehud Gazit, and Phanourios Tamamis
11 Computational Tools and Strategies to Develop Peptide-Based Inhibitors of Protein-Protein Interactions
Maxence Delaunay and Tâp Ha-Duong
12 Rapid Rational Design of Cyclic Peptides Mimicking Protein–Protein Interfaces
Brianda L. Santini and Martin Zacharias
13 Structural Prediction of Peptide–MHC Binding Modes
Marta A. S. Perez, Michel A. Cuendet, Ute F. Röhrig, Olivier Michielin, and Vincent Zoete
14 Molecular Simulation of Stapled Peptides
Victor Ovchinnikov, Aravinda Munasinghe, and Martin Karplus
15 Free Energy-Based Computational Methods for the Study of Protein-Peptide Binding Equilibria
Emilio Gallicchio
16 Computational Evolution Protocol for Peptide Design
Rodrigo Ochoa, Miguel A. Soler, Ivan Gladich, Anna Battisti, Nikola Minovski, Alex Rodriguez, Sara Fortuna, Pilar Cossio, and Alessandro Laio
17 Computational Design of Miniprotein Binders
Younes Bouchiba, Manon Ruffini, Thomas Schiex, and Sophie Barbe
18 Computational Design of Peptides with Improved Recognition of the Focal Adhesion Kinase FAT Domain
Eleni Michael, Savvas Polydorides, and Georgios Archontis
19 Knowledge-Based Unfolded State Model for Protein Design
Vaitea Opuu, David Mignon, and Thomas Simonson
Index
Contributors
GEORGIOS ARCHONTIS • Department of Physics, University of Cyprus, Nicosia, Cyprus
SOPHIE BARBE • TBI, Université de Toulouse, CNRS, INRAE, INSA, ANITI, Toulouse,
France
ANNA BATTISTI • SISSA, Trieste, Italy
LOKESH BAWEJA • Department of Physics and the Center for Molecular Study of Condensed
Soft Matter, Illinois Institute of Technology, Chicago, IL, USA
YOUNES BOUCHIBA • TBI, Université de Toulouse, CNRS, INRAE, INSA, ANITI, Toulouse,
France
CHARLES H. CHEN • Synthetic Biology Group, Research Laboratory of Electronics,
Massachusetts Institute of Technology, Cambridge, MA, USA
YU CHEN • Department of Molecular Microbiology and Biotechnology, George S. Wise Faculty
of Life Sciences, Tel Aviv University, Tel Aviv, Israel
NICOLAS CHEVROLLIER • Institute for Integrative Biology of the Cell (I2BC), Université
Paris-Saclay, Gif-sur-Yvette, cedex, France
YOONJOO CHOI • Combinatorial Tumor Immunotherapy MRC, Chonnam National
University Medical School, Hwasun-gun, Jeollanam-do, Republic of Korea
JOSEPH CLAYTON • Department of Physics and the Center for Molecular Study of Condensed
Soft Matter, Illinois Institute of Technology, Chicago, IL, USA
PILAR COSSIO • Biophysics of Tropical Diseases, Max Planck Tandem Group, University of
Antioquia, Medellin, Colombia; Department of Theoretical Biophysics, Max Planck
Institute of Biophysics, Frankfurt am Main, Germany
MICHEL A. CUENDET • Molecular Modelling Group, SIB Swiss Institute of Bioinformatics,
Lausanne, Switzerland; Oncology Department, Centre Hospitalier Universitaire Vaudois
(CHUV), Precision Oncology Center, Lausanne, Switzerland
MAXENCE DELAUNAY • Université Paris-Saclay, CNRS, BioCIS, Châtenay-Malabry, France
PHILIPPE DERREUMAUX • Laboratoire de Biochimie Théorique, CNRS, Université de Paris,
UPR 9080, Paris, France; Institut de Biologie Physico-Chimique, Fondation Edmond de
Rothschild, PSL Research University, Paris, France
MOJIE DUAN • State Key Laboratory of Magnetic Resonance and Atomic and Molecular
Physics, National Center for Magnetic Resonance in Wuhan, Innovation Academy for
Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan,
People’s Republic of China
SARA FORTUNA • Italian Institute of Technology (IIT), Genova, Italy; Department of
Chemical and Pharmaceutical Sciences, University of Trieste, Trieste, Italy
EMILIO GALLICCHIO • Department of Chemistry, Ph.D. Program in Biochemistry and Ph.D.
Program in Chemistry at The Graduate Center of the City University of New York,
Brooklyn College of the City University of New York, New York, NY, USA
EHUD GAZIT • Department of Molecular Microbiology and Biotechnology, George S. Wise
Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
IVAN GLADICH • Qatar Environment and Energy Research Institute, Hamad Bin Khalifa
University, Doha, Qatar; SISSA, Trieste, Italy
TÂP HA-DUONG • Université Paris-Saclay, CNRS, BioCIS, Châtenay-Malabry, France
MARTIN KARPLUS • Department of Chemistry and Chemical Biology, Harvard University,
Cambridge, MA, USA; Laboratoire de Chimie Biophysique, ISIS, Université de Strasbourg,
Strasbourg, France
ALESSANDRO LAIO • The Abdus Salam International Centre for Theoretical Physics, Trieste,
Italy; SISSA, Trieste, Italy
ANNE LOPES • Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay,
Gif-sur-Yvette, cedex, France
TIMOTHY K. LU • Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts
Institute of Technology, Cambridge, MA, USA; Department of Biological Engineering,
Massachusetts Institute of Technology, Cambridge, MA, USA
MANUEL N. MELO • Instituto de Tecnologia Química e Biológica António Xavier,
Universidade Nova de Lisboa, Oeiras, Portugal
MIHALY MEZEI • Department of Pharmacological Sciences, Icahn School of Medicine at
Mount Sinai, New York, NY, USA
ELENI MICHAEL • Department of Physics, University of Cyprus, Nicosia, Cyprus
OLIVIER MICHIELIN • Molecular Modelling Group, SIB Swiss Institute of Bioinformatics,
Lausanne, Switzerland; Oncology Department, Centre Hospitalier Universitaire Vaudois
(CHUV), Precision Oncology Center, Lausanne, Switzerland
DAVID MIGNON • Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654),
Ecole Polytechnique, Palaiseau, France
NIKOLA MINOVSKI • Department of Chemical and Pharmaceutical Sciences, University of
Trieste, Trieste, Italy; Theory Department, Laboratory for Cheminformatics, National
Institute of Chemistry, Ljubljana, Slovenia
ARAVINDA MUNASINGHE • Department of Chemistry and Chemical Biology, Harvard
University, Cambridge, MA, USA
PHUONG HOANG NGUYEN • Laboratoire de Biochimie Théorique, CNRS, Université de Paris,
UPR 9080, Paris, France; Institut de Biologie Physico-Chimique, Fondation Edmond de
Rothschild, PSL Research University, Paris, France
RODRIGO OCHOA • Biophysics of Tropical Diseases, Max Planck Tandem Group, University of
Antioquia, Medellin, Colombia
VAITEA OPUU • Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole
Polytechnique, Palaiseau, France
ASUKA A. ORR • Artie McFerrin Department of Chemical Engineering, Texas A&M
University, College Station, TX, USA
VICTOR OVCHINNIKOV • Department of Chemistry and Chemical Biology, Harvard
University, Cambridge, MA, USA
CHRIS PAPADOPOULOS • Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette, cedex, France
KAREN PEPPER • Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts
Institute of Technology, Cambridge, MA, USA
MARTA A. S. PEREZ • Computer-aided Molecular Engineering Group, Department of
Oncology UNIL-CHUV, Lausanne University, Lausanne, Switzerland; Ludwig Institute
for Cancer Research, Lausanne, Switzerland; Molecular Modelling Group, SIB Swiss
Institute of Bioinformatics, Lausanne, Switzerland
SAVVAS POLYDORIDES • Department of Physics, University of Cyprus, Nicosia, Cyprus
ALEX RODRIGUEZ • The Abdus Salam International Centre for Theoretical Physics, Trieste,
Italy
UTE F. RÖHRIG • Molecular Modelling Group, SIB Swiss Institute of Bioinformatics,
Lausanne, Switzerland
MANON RUFFINI • TBI, Université de Toulouse, CNRS, INRAE, INSA, ANITI, Toulouse,
France; Université Fédérale de Toulouse, ANITI, INRAE, UR 875, Toulouse, France
BRIANDA L. SANTINI • Center for Functional Protein Assemblies, Physics Department T38,
Technical University of Munich, Ernst-Otto-Fischer-Straße 8, Garching, Germany
THOMAS SCHIEX • Université Fédérale de Toulouse, ANITI, INRAE, UR 875, Toulouse,
France
THOMAS SIMONSON • Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654),
Ecole Polytechnique, Palaiseau, France
MIGUEL A. SOLER • Italian Institute of Technology (IIT), Genova, Italy
PHANOURIOS TAMAMIS • Artie McFerrin Department of Chemical Engineering, Texas A&M
University, College Station, TX, USA; Department of Materials Science and Engineering,
Texas A&M University, College Station, TX, USA
PIERRE TUFFÉRY • Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm,
RPBS, Paris, France
JAKOB P. ULMSCHNEIDER • Department of Physics, Institute of Natural Sciences, Shanghai
Jiao Tong University, Shanghai, China
MARTIN B. ULMSCHNEIDER • Department of Chemistry, King’s College London, London, UK
IOSIF I. VAISMAN • School of Systems Biology, George Mason University, Manassas, VA, USA
MONIQUE L. VAN HOEK • School of Systems Biology, George Mason University, Manassas, VA,
USA
GUANGSHUN WANG • Department of Pathology and Microbiology, College of Medicine,
University of Nebraska Medical Center, 985900 Nebraska Medical Center, Omaha, NE,
USA
JEFF WERESZCZYNSKI • Department of Physics and the Center for Molecular Study of
Condensed Soft Matter, Illinois Institute of Technology, Chicago, IL, USA
MARTIN ZACHARIAS • Center for Functional Protein Assemblies, Physics Department T38,
Technical University of Munich, Ernst-Otto-Fischer-Straße 8, Garching, Germany
RUI ZHOU • State Key Laboratory of Magnetic Resonance and Atomic and Molecular
Physics, National Center for Magnetic Resonance in Wuhan, Innovation Academy for
Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan,
People’s Republic of China
VINCENT ZOETE • Computer-aided Molecular Engineering Group, Department of Oncology
UNIL-CHUV, Lausanne University, Lausanne, Switzerland; Ludwig Institute for
Cancer Research, Lausanne, Switzerland; Molecular Modelling Group, SIB Swiss Institute
of Bioinformatics, Lausanne, Switzerland
Chapter 1
Machine Learning Prediction of Antimicrobial Peptides
Guangshun Wang, Iosif I. Vaisman, and Monique L. van Hoek
Abstract
Antibiotic resistance constitutes a global threat and could lead to a future pandemic. One strategy is to
develop a new generation of antimicrobials. Naturally occurring antimicrobial peptides (AMPs) are recognized templates and some are already in clinical use. To accelerate the discovery of new antibiotics, it is
useful to predict novel AMPs from the sequenced genomes of various organisms. The antimicrobial peptide
database (APD) provided the first empirical peptide prediction program. It also facilitated the testing of the
first machine-learning algorithms. This chapter provides an overview of machine-learning predictions of
AMPs. Most of the predictors, such as AntiBP, CAMP, and iAMPpred, involve a single-label prediction of
antimicrobial activity. This type of prediction has been expanded to antifungal, antiviral, antibiofilm, anti-TB, hemolytic, and anti-inflammatory peptides. The multiple functional roles of AMPs annotated in the
APD also enabled multi-label predictions (iAMP-2L, MLAMP, and AMAP), which cover antibacterial,
antiviral, antifungal, antiparasitic, antibiofilm, anticancer, anti-HIV, antimalarial, insecticidal, antioxidant,
chemotactic, spermicidal, and protease-inhibiting activities. Also considered in predictions are
peptide posttranslational modification, 3D structure, and microbial species-specific information. We compare important amino acids of AMPs implied from machine learning with the frequently occurring residues
of the major classes of natural peptides. Finally, we discuss advances, limitations, and future directions of
machine-learning predictions of antimicrobial peptides. Ultimately, we may assemble a pipeline of such
predictions beyond antimicrobial activity to accelerate the discovery of novel AMP-based antimicrobials.
Key words Multidrug resistance, Antimicrobial peptides, Database, Machine learning, Peptide
prediction
1 Introduction
The discovery and production of antibiotics has saved millions of
lives. It is regarded as one of the greatest achievements of humankind in the twentieth century. However, pathogens fight back,
leading to reduced potency of conventional antibiotics. To minimize toxic effects, bacteria can pump the drug out of the cells,
reduce drug affinity to specific targets via mutations, and degrade
antibiotics by enzymes. Among various multidrug-resistant (MDR)
microbes, the ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii,
Pseudomonas aeruginosa, and Enterobacter species) account for 90%
of infections in hospitals [1]. There are also other emerging resistant pathogens, including human immunodeficiency virus type
1 (HIV-1), SARS-CoV-2, Ebola, and Zika viruses, and resistant microbes such as
Mycobacterium tuberculosis, Salmonella, Candida, Neisseria gonorrhoeae, and Clostridioides difficile. If no action is taken, the projected annual deaths could reach ten million by 2050 [2]. To meet
this challenge, one fundamental strategy is to develop a new generation of antimicrobials that are capable of eliminating those MDR
pathogens.
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_1,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
Antimicrobial peptides (AMPs) are considered as an alternative
to conventional non-peptide antibiotics. This chapter focuses on
prediction of antimicrobial peptides. First, we provide a brief introduction to AMPs. Second, we discuss the major prediction methods of AMPs. Third, both the data sets for predictions and the
algorithms of machine learning are described. Fourth, we discuss
the major machine-learning predictions of AMPs. Fifth, we compare
the prediction outcomes of machine learning in terms of accuracy
on the same platform, reporting results from test runs that use new peptides
not included in the training sets. We also compare the important amino acids
implied by machine learning with those derived
from our database analysis of the major classes of natural AMPs.
Then, we outline additional predictions that may speed up
computer-aided novel antimicrobial discovery. Finally, we summarize the major achievements and limitations of AMP predictions
and discuss future directions.
2 Innate Immune Antimicrobial Peptides
Naturally occurring antimicrobial peptides are important components of innate immune systems. Such peptides are deployed in a
variety of organisms such as plants and animals. They play a critical
role in protecting organisms from infections. AMPs have remained
potent for millions of years. As a consequence, they are recognized
candidates for developing novel antimicrobials since they can kill
drug-resistant pathogens, including bacteria, fungi, viruses, and
parasites. AMPs are usually gene-encoded and can be expressed
constitutively to guard certain niches or induced in response to
invading pathogens [3–8]. According to the newly programmed antimicrobial peptide database (APD; https://aps.unmc.edu, http://aps.unmc.edu/AP, https://wangapd3.com), over
3000 natural AMPs have been discovered from six life kingdoms
(bacteria, archaea, protists, fungi, plants, and animals) [9–11]. At
present, 74% of the peptides originated from animals, while 11.2%
and 11.1% were discovered in bacteria and plants, respectively.
Most natural AMPs (88%) are cationic, and only a small portion
(6%) are anionic.
Machine Learning Prediction of Antimicrobial Peptides
Table 1
Amino acid properties, frequency, and peptide count in the antimicrobial peptide database (APD)
Single letter | Full name | Molecular weight | Class^a | Peptide count | Frequency in 3257 AMPs | Count% (2020)
I | Isoleucine | 113.16 | Phobic | 2511 | 0.77 | 5.9%
V | Valine | 99.13 | Phobic | 2492 | 0.76 | 5.69%
L | Leucine | 113.16 | Phobic | 2835 | 0.87 | 8.26%
F | Phenylalanine | 147.18 | Phobic | 2240 | 0.69 | 4.09%
C | Cysteine | 103.14 | Phobic | 1721 | 0.53 | 6.81%
M | Methionine | 131.2 | Phobic | 959 | 0.29 | 1.27%
A | Alanine | 71.08 | Phobic | 2511 | 0.77 | 7.68%
W | Tryptophan | 186.21 | Phobic | 1185 | 0.36 | 1.65%
G | Glycine | 57.05 | Special | 2950 | 0.91 | 11.51%
P | Proline | 97.12 | Special | 1958 | 0.60 | 4.67%
T | Threonine | 101.11 | Polar | 2053 | 0.63 | 4.48%
S | Serine | 87.08 | Polar | 2483 | 0.76 | 6.07%
Y | Tyrosine | 163.18 | Polar | 1266 | 0.39 | 2.49%
Q | Glutamine | 128.13 | Polar | 1352 | 0.42 | 2.59%
N | Asparagine | 114.1 | Polar | 1968 | 0.60 | 3.86%
E | Glutamic acid | 129.12 | Acidic | 1465 | 0.45 | 2.68%
D | Aspartic acid | 115.09 | Acidic | 1463 | 0.45 | 2.7%
H | Histidine | 137.14 | Basic | 1231 | 0.38 | 2.17%
K | Lysine | 128.17 | Basic | 2782 | 0.85 | 9.51%
R | Arginine | 156.19 | Basic | 1843 | 0.57 | 5.88%
^a Phobic = hydrophobic. In the APD, the hydrophobic content (Pho) is the ratio between the total hydrophobic amino acids and total amino acids in a peptide sequence [9]. Visited January 2021
Anionic AMPs, such as daptomycin already in clinical use, may need metal to be active [12]. Another 6% of AMPs
have a net charge of zero. In the APD, the majority of AMPs
have a hydrophobic content (Pho, defined in Table 1) between 10% and 70%.
Only about 1% of such peptides have a very high
(>70%) or very low (<10%) Pho. In terms of length, 2879 peptides
in the current APD3 (88%) are shorter than 50 amino acids. The
average length of all AMPs (3257 as of January 2021) in the APD3
is 33.2 residues, with an average net charge of +3.3. The most frequently
occurring amino acids (>8%) are glycine (G), lysine (K), and leucine (L) [10], while the least frequent amino acids (<2%) include
methionine (M) and tryptophan (W) (Table 1). Such frequencies
broadly follow the fraction of natural AMPs that contain each
of the 20 amino acids, which is also listed in Table 1.
Fig. 1 Important amino acids derived from amino acid composition profiles of classic classes of antimicrobial peptides [3]: (a) α-helical and β-sheet families and (b) amino acid-rich families, including Trp-rich, His-rich, Pro-rich, and Leu-rich AMPs. Data obtained in the APD [13] in Dec 2020
The variation of the
amino acid (composition) signatures of natural AMPs in different
structure, activity, and source groups has been tabulated elsewhere
[13]. Figure 1 displays amino acid signatures for known α-helical,
β-sheet peptides (panel A), tryptophan-rich (Trp-rich), histidine-rich (His-rich), proline-rich (Pro-rich) AMPs, and leucine-rich
(Leu-rich) temporins (panel B). It is evident that such signatures
depend on the amino acid composition of a group of AMPs in the
APD. The amino acid sequence of a peptide, however, clearly plays
a role as well in determining peptide structure and activity
[6, 14]. Another important player is posttranslational modification
(e.g., amidation, glycosylation, halogenation, hydroxylation, and
cyclization) of peptide sequences, with 24 types of modifications
annotated in the current APD3 as of October 2020 [11, 15]. Typically, cationic AMPs target anionic bacterial membranes due to the
formation of the classic amphipathic helix structure [3–6]. However, such peptides can also attack other targets such as bacterial cell
walls and ribosomes. It is believed that the simultaneous attack of
more than one target renders it difficult for bacteria to develop
resistance to AMPs. Beyond bacterial killing and biofilm inhibition,
AMPs are found to have other functional roles, ranging from
pathogen toxin neutralization and wound healing to host immune
regulation [4, 5, 16]. A total of 24 types of AMP functions are
annotated in the APD3 [11, 13].
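The database parameters quoted in this section (peptide length, net charge, and hydrophobic content, Pho) can be computed directly from a sequence. The following is a minimal sketch, assuming that the hydrophobic residues are those labeled "Phobic" in Table 1 and that net charge is approximated by counting K and R as +1 and D and E as -1 (histidine, the termini, and any modifications are ignored). The function name and the charge rule are illustrative assumptions, not the APD implementation.

```python
# Hydrophobic residues follow the "Phobic" class in Table 1.
HYDROPHOBIC = set("IVLFCMAW")
# Assumption: K and R count as +1, D and E as -1; histidine and termini are ignored.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def peptide_parameters(sequence: str) -> dict:
    """Return length, net charge, and hydrophobic content (Pho) of a peptide."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    net_charge = sum(CHARGE.get(residue, 0) for residue in seq)
    pho = sum(residue in HYDROPHOBIC for residue in seq) / len(seq)
    return {"length": len(seq), "net_charge": net_charge, "pho": pho}


# Commonly cited sequence of human LL-37, used here only as an example input.
LL37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"
params = peptide_parameters(LL37)
print(params["length"], params["net_charge"], round(100 * params["pho"]))
```

For this input the sketch reports a length of 37, a net charge of +6, and a Pho of about 35%, which falls inside the 10% to 70% range that covers most APD entries.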
3 An Overview of Prediction Methods of Antimicrobial Peptides
The majority of natural AMPs were identified using the classic
isolation and characterization methods [3–5]. Such peptide identification procedures are laborious and time-consuming. One alternative method is to predict AMPs by computers based on the
current peptide knowledge and sequenced genomes of numerous
organisms [9, 17–19]. These prediction methods are grouped into
five classes based on the information considered in programming
[20]: (1) mature peptide (i.e., AMPs), (2) propeptide, (3) mature
peptide and propeptide, (4) processing enzyme, and (5) genomic
context (Fig. 2). Some AMPs such as cathelicidins possess a conserved pro-sequence domain prior to the mature peptide. Such a
conserved sequence pattern became one method for identifying
uncharacterized cathelicidins from sequenced genomes for mammals, fish, reptiles, birds, and amphibians (method 2). The human
cathelicidin was initially predicted as FALL-39 [21], which is
merely 1–2 residues longer than the mature forms isolated from
human neutrophils and the reproductive system (LL-37 and
ALL-38, respectively) [22, 23]. In the same vein, the discovery of
bacteriocins from bacteria has been expanded from highly conserved processing enzymes (method 4a) to transporters (method
4b) and the entire gene clusters (i.e., genomic context; method 5).
Computer programs such as BAGEL, antiSMASH, and BACIIα
have been established for bacteriocin identifications [24–26].
Fig. 2 Five information-content based methods for prediction of antimicrobial peptides [20]
Occasionally, both precursor and mature sequences (method
3) were considered in clustering AMPs probably due to the nature
of a particular data set then available [27]. The most widely
explored information for prediction is the mature peptide (method
1). Sequence patterns such as multiple disulfide bonds were utilized
for identifying defensin-like AMPs in plants, cattle, mice, and
humans [28–30]. A GXC γ-core motif has also been identified in
these peptides and utilized for AMP prediction [31].
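Pattern-based screening of mature sequences can be illustrated with a short scan. This is a minimal sketch that assumes the motif is reduced to the bare tripeptide G-X-C (X being any residue) and that a cysteine count serves as a crude proxy for disulfide-bond potential. Published γ-core definitions are more elaborate, and the example sequence is hypothetical.

```python
import re

# Assumption: the motif is simplified to G-X-C; a lookahead keeps overlapping hits.
GXC = re.compile(r"(?=(G[A-Z]C))")


def find_gxc(sequence: str) -> list:
    """Return (1-based position, tripeptide) for every G-X-C occurrence."""
    seq = sequence.upper()
    return [(match.start() + 1, match.group(1)) for match in GXC.finditer(seq)]


def max_disulfide_bonds(sequence: str) -> int:
    """Upper bound on disulfide bonds: two cysteines are needed per bond."""
    return sequence.upper().count("C") // 2


toy_peptide = "ACKGTCRRGYCAPC"  # hypothetical sequence, for illustration only
hits = find_gxc(toy_peptide)
bonds = max_disulfide_bonds(toy_peptide)
print(hits, bonds)
```

A hit from such a scan only flags a candidate. Whether the peptide is active still has to be established experimentally.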
The construction of databases for AMPs greatly facilitated the
development of computer-based design [32] and prediction methods. Table 2 provides a list of databases for AMPs [11, 18, 33–
49]. The APD and ANTIMIC were published simultaneously in the 2004 database issue of Nucleic Acids Research
[9, 50]. The APD, with a focus on structure and activity of mature
AMPs, was widely accepted and utilized by the AMP field [9]. Since
then, more databases have been established with varying scopes or
by entering additional details (Table 2). A systematic review on
such databases has been described elsewhere [51]. Because of the
model role of the APD, it is useful to describe its data scope and
evolution. In the first two versions [9, 10], the APD attempted to
cover all AMP sequences: experimentally determined, predicted,
and synthetic. This history can be seen from a small number of
synthetic and predicted entries remaining in the current APD
(72 synthetic peptides and 211 predicted peptides without activity
data). There are three types of activity data annotated in the APD:
(1) minimal inhibitory concentration (MIC); (2) diffusion distance; and (3) optical density decrease as evidence of inhibition.
For convenience, MIC values based on microdilution assays are
frequently measured and reported. Since predicted peptides might
not be true AMPs [11], it was decided to postpone the collection of
such peptides in the APD. Also, large numbers of synthetic
peptides derived from the same template tended to dominate data
filtering in the database, thereby skewing the results
away from natural peptides and toward artificial ones. As a consequence, the
APD also postponed the collection of synthetic peptides. Thus, the
third version of the APD (APD3) [11] uses the following criteria to
register AMPs: (1) natural peptides, (2) peptides with known
amino acid sequences, (3) peptides with known activity (MIC
<100 μM), and (4) peptides of less than 100 amino acids
[11]. The last condition was relaxed to 200 amino acids to incorporate important human antimicrobial proteins. This practice generates a widely utilized core data set for AMP search, prediction,
and design.
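The four registration criteria can be expressed as a simple filter. The sketch below assumes a hypothetical record layout (the field names are not taken from the APD) and assumes that the relaxed length limit means "fewer than 200 residues".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeptideRecord:
    """Hypothetical record layout; the field names are illustrative only."""
    name: str
    sequence: Optional[str]   # None if the amino acid sequence is unknown
    is_natural: bool
    mic_um: Optional[float]   # reported MIC in micromolar, None if no activity data


def meets_apd3_criteria(record: PeptideRecord, max_length: int = 200) -> bool:
    """Apply the four criteria listed in the text to one record."""
    if not record.is_natural:                          # (1) natural peptides
        return False
    if not record.sequence:                            # (2) known sequence
        return False
    if record.mic_um is None or record.mic_um >= 100:  # (3) MIC < 100 uM
        return False
    return len(record.sequence) < max_length           # (4) length limit


records = [
    PeptideRecord("natural_short", "GIGKFLKK", True, 12.5),
    PeptideRecord("synthetic", "KKLLKKLL", False, 4.0),
    PeptideRecord("no_activity", "GIGKFLKK", True, None),
    PeptideRecord("weak", "GIGKFLKK", True, 250.0),
    PeptideRecord("long_protein", "A" * 150, True, 10.0),
]
accepted = [r.name for r in records if meets_apd3_criteria(r)]
print(accepted)
```

Passing `max_length=100` reproduces the original, stricter length limit and excludes the 150-residue example.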
Based on mature peptides, the first computer-based prediction
was programmed in the APD in 2003 [9]. The program informs
users whether the input sequence is likely to be an AMP based on
some known AMP knowledge, such as positive charge and amphipathic nature.
Table 2
Web accessible databases dedicated to antimicrobial peptides^a
Databases and prediction algorithms | Link | Notes | Citing references
APD3 | http://aps.unmc.edu/AP/main.php | Antimicrobial peptide database, with curated, experimentally verified antimicrobial peptides from bacteria, archaea, protists, fungi, plants, and animals | [11]
CAMPR3 | http://www.camp3.bicnirrh.res.in/ | Collection of Antimicrobial peptides | [18]
DBAASP v3 | https://dbaasp.org | Database of antimicrobial activity and structure of peptides | [33]
Defensins knowledgebase | http://defensins.bii.a-star.edu.sg/ | Antimicrobial peptides from the defensin family | [34]
BaAMPs | http://www.baamps.it/ | Database of biofilm-active antimicrobial peptides | [35]
BACTIBASE | http://bactibase.hammamilab.org/about.php | Bacteriocin-type naturally occurring antimicrobial peptides | [36]
DADP | http://split4.pmfst.hr/dadp/ | Database of anuran (frog or toad) defense peptides | [37]
DRAMP | http://dramp.cpu-bioinfor.org | Database of AMPs including clinical trial data on peptides | [38]
Peptaibol | http://peptaibol.cryst.bbk.ac.uk/introduction.htm | Database of peptaibols, mainly antifungal peptides | [39]
LAMP | http://biotechlab.fudan.edu.cn/database/lamp/index.php | AMPs taken from other databases | [40]
YADAMP | http://www.yadamp.unisa.it/default.aspx | Yet another database of antimicrobial peptides | [41]
PhytAMP | http://phytamp.pfba-lab-tun.org/main.php | A database dedicated to plant AMPs | [42]
InverPep | https://ciencias.medellin.unal.edu.co/gruposdeinvestigacion/prospeccionydisenobiomoleculas/InverPep/public/home_en | AMPs from invertebrates from other databases | [43]
HIPdb | http://crdd.osdd.net/servers/hipdb | Manually curated database of experimentally validated HIV inhibitory peptides | [44]
Thiobase | https://db-mml.sjtu.edu.cn/THIOBASE/ | Sulfur-rich, highly modified heterocyclic peptide antibiotics | [45]
EnzyBase | http://biotechlab.fudan.edu.cn/database/EnzyBase/home.php | Lysins, bacteriocins, autolysins, and lysozymes | [46]
ParaPep | http://crdd.osdd.net/raghava/parapep/ | Antiparasitic peptides | [47]
dbAMP | Not accessible | AMPs | [48]
AntiTbPdb | https://webs.iiitd.edu.in/raghava/antitbpdb/ | Anti-TB peptides | [49]
^a Adapted and updated based on the APD Links [13, 20]
Later, it was improved based on the peptide
parameter space (net charge, hydrophobic content, and peptide
length) defined by the entire database [19]. If such parameters of
a new sequence are out of the scope, the program will inform the
users that the input sequence is less likely to be an AMP. The APD
also outputs five peptide sequences most similar to the user’s input.
Subsequently, Lata et al. first programmed an artificial neural
network (ANN), quantitative matrices (QM), and a support vector
machine (SVM) in 2007 based on the APD data set [17]. Since
then, there has been a growing interest in AMP prediction at both
the single-label and multi-label levels. The single-label prediction
will predict the likelihood of being antimicrobial, while multi-label
predictions were developed based on different functions of AMPs
annotated in the APD3 [11], such as chemotaxis, toxin neutralization, protease inhibition, and wound healing. The first multi-label
prediction [52] predicts antibacterial activity in the initial stage
followed by predictions of other types of activities, including antifungal, antiviral, anti-HIV, and anticancer activities. CAMP collected both synthetic and predicted peptides. Its prediction tool
[18, 53] enables three tasks. First, users can predict the antimicrobial activity of a peptide sequence by four different models. Second,
users can predict the antimicrobial region within a peptide
sequence. Third, users can generate a large combinatorial list of
sequences from a user-defined sequence and then predict the effect of
single residue substitutions on antimicrobial activity using the AMP
predictor. Table 3 lists some major machine-learning prediction
programs [48, 53–77].
4 Training Data Sets, Machine-Learning Models, and Algorithms for Classification and Prediction of Antimicrobial Peptides
Machine learning models are commonly used for classification and
prediction of AMPs. Nearly all machine-learning predictions of
AMPs are supervised. The quality of these models is determined
by a number of different factors. Among the most important contributors to the model performance are training sets consisting of
antimicrobial and non-antimicrobial peptides, features used to represent the peptides, classification schemes, and machine-learning
algorithms.
4.1 Training Sets for Predictions
4.1.1 Positive Training Set
Quality of the training set is critically important for the model
performance, since it is the only source of information the model
uses to learn. AMP sequences for the training set are usually
extracted from one or more AMP databases. The growing number
of AMP databases (some examples are listed in Table 2) represents a
wide range of approaches to data collection, data curation, and data
management.
Table 3
Machine learning prediction of antimicrobial peptides
Tool name | URL | Algorithms | Features | Year | References
AntiBP | http://crdd.osdd.net/raghava/antibp2 | SVM, QM, ANN | Single-label | 2007 | [17]
CAMP | http://www.bicnirrh.res.in/antimicrobial | SVM, RF, DA | Single-label | 2010 | [18, 53]
(not given) | http://amp.biosino.org/ | BLASTP, NNA | Single-label | 2011 | [54]
AMPA | http://tcoffee.crg.cat/apps/ampa | (not given) | AMP region scan | 2012 | [55]
ANFIS | NA | ANFIS | Single-label | 2012 | [56]
Peptide Locator | http://bioware.ucd.ie/ | BRNN | Single-label | 2013 | [57]
iAMP-2L | http://www.jci-bioinfo.cn/iAMP-2L | FKNN | Two-level, Multi-label | 2013 | [52]
DBAASP | https://dbaasp.org/prediction/general | Thresholds | (not given) | 2014 | [33]
SVM-LZ | NG (BioMed Research international) | SVM | Single-label | 2015 | [58]
ADAM | http://bioinformatics.cs.ntou.edu.tw/ADAM/ | SVM, HMM | Single-label | 2015 | [59]
MLAMP | http://www.jci-bioinfo.cn/MLAMP | RF-MLSMOTE | Multi-label | 2016 | [60]
iAMPpred | http://cabgrid.res.in:8080/amppred/ | SVM | Single-label | 2017 | [61]
AmPEP | http://cbbio.cis.umac.mo/software/AmPEP/ | RF | Single-label | 2018 | [62]
AMP scanner | http://www.ampscanner.com | DNN | Single-label, Large scale | 2018 | [63]
AntiMPmod | https://webs.iiitd.edu.in/raghava/antimpmod/ | SVM | Single-label, PTM/3D | 2018 | [64]
dbAMP | http://csb.cse.yzu.edu.tw/dbAMP/ | RF | Single-label | 2019 | [48]
AMAP | http://faculty.pieas.edu.pk/fayyaz/software.html#AMAP | SVM, XGBoost | Multi-label | 2019 | [65]
(not given) | NA | IDQD | Single-label | 2019 | [66]
AMPfun | http://fdblab.csie.ncu.edu.tw/AMPfun/index.html | CART | Multi-label | 2020 | [67]
AMP0 | http://ampzero.pythonanywhere.com | ZSL, FSL | Single-label, Species-specific | 2020 | [68]
MIV-RF | NA | RF | Single-label, Sequence | 2020 | [69]
Deep-AmPEP30 | https://cbbio.cis.um.edu.mo/AxPEP | CNN | Genome search | 2020 | [70]
ACEP | https://github.com/Fuhaoyi/ACEP | DNN | High-throughput predictions | 2020 | [71]
IAMPE | http://cbb1.ut.ac.ir/ | KNN, SVM, RF | Single-label | 2020 | [72]
Macrel | https://big-data-biology.org/software/macrel | RF | Genome search | 2020 | [73]
(not given) | https://github.com/mtyoumans/lstm_peptides | LSTM RNN | Single-label | 2020 | [74]
Ampir | https://github.com/legana/ampir | SVM | Genome wide | 2020 | [75]
amPEPpy | https://github.com/tlawrence3/amPEPpy | RF | Genome wide | 2020 | [76]
Ensemble-AMPPred | http://ncrna-pred.com/Hybrid_AMPPred.htm | Ensemble model | Single-label | 2021 | [77]
For the purpose of training set design, it is important
to take into account that AMP databases vary in size, sources of
information, amount and quality of annotations, and other parameters. In terms of size, the current versions range from over 3000 peptides in the APD [9–11] to 10,000 in CAMP [18, 53], 12,000 in
dbAMP [48], 16,000 in DBAASP [33], and 23,000 in LAMP2
[40]. Some of the larger databases (e.g., LAMP2 [40]) may contain
the entire content of the smaller ones by copying the peptide entries
from existing databases. At the same time, non-overlapping
components are frequently present, primarily among synthetic peptides and because of differing definitions of AMPs. Some
specialized databases have expanded the data set by including
other types of peptides, which do not necessarily fall into the
definition of classic AMPs [44, 49]. For instance, antiviral peptides
can also be designed by investigators in the laboratories based on
the viral machinery such as proteases. As a result, the distribution of
peptides by sequence length in databases can be different as well.
The APD contains mostly natural AMPs, which are templates for
making synthetic peptides. For example, there are hundreds of
LL-37–derived peptides. In the APD, 88% of the entries are shorter
than 50 amino acids, and only 80 peptides out of 3257 have a length
greater than 100 residues. Similarly, most peptides in the DBAASP
database are shorter than 50 residues. Only 20 entries in DBAASP
are longer than 100 residues, while CAMP contains 1850 such
sequences. The longest sequence in APD and DBAASP is less
than 190 residues compared to 1256 residues in CAMP.
The first training set for machine-learning model testing was
extracted from the APD [17]. Another data set used in AMP
prediction was derived from CAMP [18]. Because the majority
of natural AMPs in CAMP were taken from the APD, there is a
significant overlap between these two data sets. Some recent studies
generated a hybrid data set by merging the peptide sequences from
different databases [61, 62, 69, 70, 77]. The size of the positive
data set appears to influence prediction outcome [61]. Species-specific predictions of AMPs [68] were made based on
DBAASP, which annotates antimicrobial activity in more detail
[33]. For 3D structural data, the APD has direct links to the
Protein Data Bank (PDB) [78]. Hence, a list of training peptides
with 3D structures can also be generated without redundancy (i.e.,
multiple sets of structural coordinates are possible for the same
peptide determined by different methods, at different resolutions,
or under different conditions).
4.1.2 Negative Data Set
Ideally, the negative set should consist of peptides which were
tested experimentally and displayed no antimicrobial activity
against one or more relevant pathogens. Non-AMP sequences are
a natural byproduct of any wet lab screening for antimicrobial
peptides. However, negative results are rarely published, and as a
result, the large sets of validated non-AMP sequences are likely
sitting in the drawers of investigators and not available to the
public. Creating a database of non-AMP sequences and convincing
researchers to contribute data into this database would be a helpful
step in improving the quality of the training sets.
Bioinformaticians and computer scientists have taken an alternative approach to obtaining negative data sets. AntiBP [17]
generated the first negative data set from UniProt
[79]. The negative part of the training set is usually drawn at random from
sequences in a protein sequence database that are
not annotated as antimicrobial, secretory, toxins, etc. Sequences in
the negative set can be controlled by the level of sequence identity,
sequence composition, similarity to the sequences in the positive
set, structural, and other properties. Since the protein sequence
databases are very large (the October 2020 release of UniProt
database contains more than 200 million sequences) [79], the
supply of sequences for the negative sets is practically unlimited.
There are caveats with these data. The sequences in the negative set
may possess antimicrobial properties, although the probability of
this is relatively low. Also, antimicrobial activities of AMPs are very
sensitive to sequence variation [80]. Such features may not be
represented in the current negative data set. Training the models
on different combinations of a positive set with several independent
negative sets may provide insights into the scale of negative set
contamination by hitherto unknown antimicrobial peptides.
In many cases, it is advisable to use a balanced training set,
where the AMP and non-AMP sequences are equally represented.
AMP sequences can be selected from AMP databases (Table 2).
Normally, only a subset of the entire database (or several databases)
can be used to compile a positive part of the training set. Sequences
from the database are filtered by length, activity, sequence identity,
and other parameters. In most studies, the positive sets range from
several hundred to several thousand sequences, while the size of the
negative set from UniProt can be much larger. However, the data
sets for numerous species-specific predictions were much smaller
due to limited MIC data [68].
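The assembly of a balanced training set can be sketched as follows. This is a minimal illustration, assuming that candidate negatives arrive as (sequence, keywords) pairs from a protein database. The excluded annotation terms, the length limits, and all example sequences are hypothetical choices, not those of any published predictor.

```python
import random

# Assumption: illustrative annotation terms; real database keywords may differ.
EXCLUDED_TERMS = {"antimicrobial", "secreted", "toxin"}


def build_balanced_set(positives, annotated_proteins, min_len=5, max_len=100, seed=0):
    """Return (sequence, label) pairs with equal numbers of AMPs (1) and non-AMPs (0)."""
    pos = sorted({p.upper() for p in positives if min_len <= len(p) <= max_len})
    pos_set = set(pos)
    pool = sorted({
        seq.upper()
        for seq, keywords in annotated_proteins
        if min_len <= len(seq) <= max_len
        and not EXCLUDED_TERMS & {k.lower() for k in keywords}
        and seq.upper() not in pos_set          # never reuse a known AMP as a negative
    })
    if len(pool) < len(pos):
        raise ValueError("not enough negative candidates for a balanced set")
    neg = random.Random(seed).sample(pool, len(pos))  # one negative per positive
    return [(s, 1) for s in pos] + [(s, 0) for s in neg]


amps = ["GIGKFLKKAKKF", "KWKLFKKIGAVL"]  # hypothetical positives
proteins = [                               # hypothetical database entries
    ("MSTNPKPQRKTK", ["capsid"]),
    ("GIGKFLKKAKKF", ["antimicrobial"]),
    ("MKTAYIAKQRQI", []),
    ("MALWMRLLPLLA", ["secreted"]),
    ("MVLSPADKTNVK", ["oxygen transport"]),
]
training = build_balanced_set(amps, proteins)
print(training)
```

Repeating the draw with different seeds gives several independent negative sets, which is one way to probe the contamination issue raised above.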
4.2 Descriptors and Features
Many different features of peptides can be used to characterize their
antimicrobial activity and discriminate between antimicrobial and
non-antimicrobial peptides. Frequently, these features are based on
identities, physicochemical properties, structural properties, and
compositions of individual amino acid residues and their combinations [61, 81–83]. Physical and chemical properties of amino acids
which are most likely to improve machine-learning (ML) model
performance include hydrophobicity, electrostatic charge, and
polarity. Similarly important are structural properties such as helical
propensity and solvent accessibility. In many models, feature vectors include residue locations in the sequence, compositional characteristics, and sequence patterns. The overall number of features
can be very large; in those cases, feature selection can help to reduce
the size of the feature vector by removing features with relatively
low contributions to the model performance.
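A simple feature vector of the kind described here can be built from the amino acid composition plus one physicochemical average. The sketch below uses the Kyte-Doolittle hydropathy scale as an example property. The choice of scale and the 21-element layout are assumptions made for illustration, since published models use many different descriptor sets.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as one example of a residue property.
KD = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def feature_vector(sequence: str) -> list:
    """Return 20 composition fractions followed by the mean hydropathy."""
    seq = sequence.strip().upper()
    if not seq or any(residue not in KD for residue in seq):
        raise ValueError("sequence must contain only the 20 standard residues")
    n = len(seq)
    composition = [seq.count(aa) / n for aa in AMINO_ACIDS]
    mean_hydropathy = sum(KD[residue] for residue in seq) / n
    return composition + [mean_hydropathy]


vector = feature_vector("KKLL")  # hypothetical peptide
print(len(vector), vector[AMINO_ACIDS.index("K")], vector[-1])
```

Positional or pattern-based descriptors would be appended to the same list, after which feature selection can trim the vector.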
4.3 Machine-Learning Algorithms
A large number of different machine-learning algorithms (Table 3)
have been implemented in AMP classification and prediction models since the first papers reporting this approach were published in
2007 [17, 27, 84]. ML methods successfully used in AMP modeling include K-nearest neighbor [52, 72], hidden Markov models
(HMMER) [27], naïve Bayes [72], neural networks
(NN) (including their deep learning varieties) [63, 70, 71, 74,
85–87], support vector machines [17, 18, 58, 59, 61, 64, 65, 72,
75], random forests (RF) [18, 48, 60, 62, 69, 73, 76], zero-shot
learning (ZSL) [68], and many others (Table 3).
Support vector machine classification maps feature vectors
representing the peptides in the training set into a higher dimensional space. Then the algorithm constructs an optimal hyperplane
which separates two classes of peptides, AMPs and non-AMPs, with
the maximal margin of separation between the classes. This hyperplane corresponds to a decision boundary, generally nonlinear, in the original feature space. The
hyperplane divides the entire higher dimensional space into two
half-spaces, and each new peptide from the prediction set is going
to be located in one of these two half-spaces. This location will
determine the predicted class for new peptides.
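The maximal-margin construction described above is commonly written as the optimization problem below. The notation is introduced here for illustration and is not taken from the chapter: x_i is the feature vector of training peptide i, y_i is +1 for an AMP and -1 for a non-AMP, φ is the map into the higher dimensional space, and C > 0 penalizes margin violations through slack variables ξ_i. This is the widely used soft-margin form, and the strict maximal-margin case corresponds to all ξ_i = 0.

```latex
% Soft-margin SVM (primal form); the margin width is 2 / ||w||.
\[
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n .
\]
% Predicted class of a new peptide x: the side of the hyperplane on which phi(x) falls.
\[
\hat{y}(x) = \operatorname{sign}\bigl(w^{\top}\phi(x) + b\bigr)
\]
```

The sign of the second expression selects one of the two half-spaces, so +1 is read as AMP and -1 as non-AMP.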
Decision tree (DT) classifiers have the form of a rooted binary
tree. A divide-and-conquer approach is used during model training.
The tree is grown starting from the root, and at each node, an
input feature is selected that best separates the output classes.
Learned trees are frequently pruned to decrease overfitting. After
the tree is created using a training set, a new peptide can be sorted
down the tree based on the values of the input features on the
corresponding node, and the appropriate branch is followed to the
next node. The recursive process terminates once the peptide
reaches a leaf node, where the peptide class, AMP or non-AMP, is
identified. The random forest algorithm is an ensemble method
based on decision trees. It generates multiple bootstrapped data
sets. Each data set is used to train a classification tree, with a randomly selected
fixed-size subset of the available predictors considered for splitting at each
node, and predictions are made by majority vote over all trees.
Random forests help to avoid many pitfalls of the decision tree
algorithm, particularly overfitting.
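The bootstrap, random-predictor, and majority-vote steps can be sketched in a few lines. To keep the example short, each tree is limited to a single split (a decision stump) chosen by Gini impurity, which is a deliberate simplification of a full random forest. The toy data, with net charge and Pho as the two features, are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_stump(X, y, feature_ids):
    """Pick the (feature, threshold) split with the lowest weighted Gini impurity."""
    majority = Counter(y).most_common(1)[0][0]
    best, best_score = (None, None, majority, majority), gini(y)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_score = score
                best = (f, t, Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best


def train_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train stumps on bootstrap samples, each with a random subset of predictors."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]            # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_features)    # random predictor subset
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = []
    for f, t, left_label, right_label in forest:
        if f is None:
            votes.append(left_label)  # no useful split found: vote the majority class
        else:
            votes.append(left_label if row[f] <= t else right_label)
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data: [net charge, Pho]; label 1 = AMP, 0 = non-AMP.
X = [[5, 0.45], [6, 0.50], [4, 0.40], [7, 0.55],
     [-1, 0.30], [0, 0.35], [-2, 0.25], [1, 0.20]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
forest = train_forest(X, y)
print(predict(forest, [8, 0.60]), predict(forest, [-3, 0.10]))
```

Individual stumps may misplace a borderline peptide because each sees a different bootstrap sample. The vote across trees is what stabilizes the prediction.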
While most of the predictions aimed to discriminate AMP and
non-AMP (i.e., single-label), several labs have attempted a multi-label prediction based on the multifunctional data annotated in the
APD3 [11, 13]. The four multi-label predictions (iAMP-2L,
MLAMP, AMAP, and AMPfun) all conduct predictions on two
levels [52, 60, 65, 67]. Similar to the single-label prediction
described above, the first level of the multi-label prediction predicts
whether the peptide is an AMP or non-AMP. If it is, then the
program moves on to the second-level prediction to predict the
likelihood of other functions the peptide may have. These can
include antibacterial, antibiofilm, antiviral, anti-HIV, antifungal,
antiparasitic, antimalarial, anticancer, insecticidal, antioxidant, chemotactic, enzyme-inhibitory, and spermicidal activities. It appears
that AMAP is the most accurate of the four. It also predicts more
biological functions of AMPs at the second level.
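The two-level scheme shared by these tools can be sketched as a small control flow. The scoring functions below are placeholders standing in for trained models, and the 0.5 threshold is an arbitrary assumption. None of this reproduces iAMP-2L, MLAMP, AMAP, or AMPfun.

```python
def two_level_predict(sequence, amp_score, function_scores, threshold=0.5):
    """Level 1: AMP or non-AMP. Level 2 (AMPs only): one score per function."""
    p_amp = amp_score(sequence)
    if p_amp < threshold:
        return {"is_amp": False, "amp_score": p_amp, "functions": []}
    functions = [name for name, model in function_scores.items()
                 if model(sequence) >= threshold]
    return {"is_amp": True, "amp_score": p_amp, "functions": functions}


# Placeholder scorers (hypothetical): a real tool would call trained classifiers.
def toy_amp_score(seq):
    return min(1.0, 4 * (seq.count("K") + seq.count("R")) / len(seq))


toy_function_scores = {
    "antibacterial": lambda seq: 0.9,
    "antifungal": lambda seq: 0.8 if "C" in seq else 0.1,
}

amp_result = two_level_predict("KKLLKKLLKK", toy_amp_score, toy_function_scores)
non_amp_result = two_level_predict("DDEESSGG", toy_amp_score, toy_function_scores)
print(amp_result, non_amp_result)
```

Because the second level runs only on predicted AMPs, any false negative at the first level is never assigned a function.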
To evaluate the performance of an algorithm on a training set,
cross-validation (CV) and random split into two subsets are commonly used. Implementation of tenfold CV begins with a random
grouping of the training set peptides into ten equally sized subsets.
Stratification is applied to maintain class proportions of the full
training set in each of the subsets. At the next step, one of the
subsets is held out while the remaining nine subsets (90% of the
original training set) are combined into one set that is used to train
a model. The heldout subset (10% of the original training set) is
then treated as a test set, and the trained model predicts the class for
each peptide in the subset. Then the procedure is repeated for the
remaining nine combinations. The iterative procedure yields a single prediction for each of the peptides in the original training set,
which is then compared to the actual class. These comparisons
yield the numbers of true positive (TP), true negative
(TN), false positive (FP), and false negative (FN) predictions. Commonly used performance measures, such as sensitivity, specificity,
precision, balanced error rate, and the Matthews correlation coefficient, are all functions of these four numbers. Many published
ML models report CV accuracy values which are close to 100%.
The actual real-world performance of these models on predicting
novel antimicrobial peptides may be lower due in part to the
extremely complex AMP activity landscape.
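The stratified k-fold procedure and the measures derived from TP, TN, FP, and FN can be sketched as follows. The function names and the `train` and `predict` callables are illustrative assumptions. Labels are taken to be 1 for AMP and 0 for non-AMP, and undefined ratios are reported as 0.0.

```python
import math
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def cross_validate(X, y, train, predict, k=10, seed=0):
    """Return one held-out prediction per sample (train on k-1 folds, test on one)."""
    predictions = [None] * len(y)
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(y)) if i not in held]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            predictions[i] = predict(model, X[i])
    return predictions


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def confusion_metrics(y_true, y_pred):
    """Performance measures computed from the four confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": _ratio(tp, tp + fp),
        "accuracy": _ratio(tp + tn, len(y_true)),
        "balanced_error_rate": 1 - (sensitivity + specificity) / 2,
        "mcc": _ratio(tp * tn - fp * fn,
                      math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }


# Hypothetical labels and predictions for eight peptides.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = confusion_metrics(y_true, y_pred)
print(metrics)
```

With three of four AMPs and three of four non-AMPs classified correctly, sensitivity, specificity, and precision are all 0.75, the balanced error rate is 0.25, and the Matthews correlation coefficient is 0.5.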
5 Machine Learning Predictions of Special Antimicrobial Peptides
5.1 Utility and Main Drawbacks of AMP Prediction Algorithms
Overall, accurate prediction of the antimicrobial,
hemolytic, or cytotoxic activity of an arbitrary peptide sequence remains a
developing field. While advances in machine learning, positive and
negative data sets, and analytic approaches have been made, the
accuracy of predicting the properties of a new peptide sequence is
still low, too low to be of reliable use in a screening step, for
example. Improvements in peptide sorting and analysis, especially ones that account for the different surface properties of Gram-negative and Gram-positive bacteria, could yield significant
gains in accuracy and thereby advance the
field. This lack of reliability is the main drawback of AMP prediction
algorithms and the main hindrance in their use in high-throughput
design programs to generate new AMPs.
5.2 Antiviral Peptide Predictors and Data
The antiviral activity of antimicrobial peptides is of considerable
interest. In particular, antiviral peptides (AVPs) appear to have
activity against membrane-enveloped viruses, such as LL-37 against
influenza virus [88, 89]. Some peptides (e.g., LL-37 and
θ-defensins) have been found to have HIV inhibitory activities
[90]. AVPs have been shown to exert their
activities at various steps in the viral life cycle, including impeding
attachment to host cells, altering viral replication within cells, or
acting indirectly by recruiting other parts of the immune system to promote host defense [90]. The antimicrobial peptide LL-37 has been
shown to inhibit attachment and entry of the influenza virus [88, 89]. As an example of the indirect mode of antiviral
activity, the Rhesus theta-defensin has been shown to be indirectly
antiviral against SARS-CoV-1 [91], with the major effect being an
increase in the host defense that allows survival of the mice against
this infection. LL-37 is also active against Zika virus [92]. Recently,
several highly effective AMPs were designed that show significant
activity against Ebola virus (EBOV) infection of cells [93]. These
peptides were designed or “engineered” fragments of LL-37 peptide [7] and were found to strongly inhibit EBOV entry into cell
lines and human primary macrophages, but not viral replication [93].
Table 4
Prediction algorithm websites for antiviral peptides (AVPs)
Prediction algorithms | Link | Notes | References
AVPPred | http://crdd.osdd.net/servers/avppred/ | Webserver for collecting and detecting effective AVPs | [94]
AVPdb | http://crdd.osdd.net/servers/avpdb | A database of experimentally validated antiviral peptides | [95]
FIRM-AVP | https://msc-viz.emsl.pnnl.gov/AVPR | “Feature-informed reduced machine learning for antiviral peptide prediction” | [96]
This study represents an exciting advance in both the design
of active antiviral peptides and their application to important diseases such as Ebola.
Several websites [94–96] have been established to assist the
prediction of AVPs (Table 4). Using database analysis and a feature
reduction technique (recursive feature elimination algorithm, or
RFE), one group developed a software tool to predict antiviral
peptides, Feature-Informed Reduced Machine
Learning for Antiviral Peptide Prediction (FIRM-AVP) [96]. The
analysis assembled 649 features that correlated with antiviral activity and then reduced the number of features to
169 based on Pearson’s correlation coefficient and computed
MDGI (mean decrease of Gini index) values. The authors then applied the
RFE technique to order the features by importance and to identify
the most important features. Three features that were identified in
common between two different parts of the analysis include
“PseAAC (pseudo amino acid composition) feature for leucine
(L) amino acid,” “PseAAC feature for lysine (K) amino acid,” and
“Location oriented feature for α-helix” [96]. This suggests that
these features may have a strong contribution to the physicochemical features of an effective antiviral peptide. Overall, this is in
agreement with the general observation that antiviral peptides are
often alpha-helical and positively charged peptides [90].
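
The correlation-based reduction step can be illustrated with a minimal sketch that drops any feature highly correlated with one already kept. This is not the FIRM-AVP code: the feature names, the toy values, and the 0.9 threshold are assumptions made for illustration, and the MDGI and RFE ranking steps are omitted.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no correlation information
    return cov / sqrt(var_x * var_y)

def drop_redundant(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy feature table (hypothetical values): one list per feature, one entry per peptide.
toy_features = {
    "frac_K": [0.10, 0.20, 0.30, 0.40],
    "net_charge": [1.0, 2.0, 3.0, 4.0],  # perfectly correlated with frac_K
    "frac_L": [0.30, 0.10, 0.40, 0.20],
}
selected = drop_redundant(toy_features)
print(selected)
```

In this toy table, "net_charge" is removed because it duplicates the information in "frac_K", while the uncorrelated "frac_L" is retained.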
5.3 Antifungal Peptide Predictors and Data
Specific databases and prediction models [97, 98] have been developed for antifungal peptides (AFPs) (Table 5). Antifungal peptides
appear to be enriched in the amino acids cysteine (C),
glycine (G), histidine (H), lysine (K), arginine (R), and tyrosine
(Y) [98]. A similar set of frequently
occurring amino acids (L, C, alanine (A), G, K, and R) was obtained
when 1210 antifungal AMPs in the APD were statistically analyzed
[11]. Positional analysis suggests that R, valine (V), or K may predominate at the amino terminus of antifungal peptides, while C
and H predominate at the carboxyl terminus.

Table 5
Prediction algorithm websites for antifungal peptides (AFPs)

| Database | Link | Notes | References |
|---|---|---|---|
| PlantAFP | http://bioinformatics.cimap.res.in/sharma/PlantAFP/ | Plant-derived peptides | [97] |
| AntiFP | https://webs.iiitd.edu.in/raghava/antifp/algo.php | | [98] |

Table 6
Prediction algorithm websites for other specific and unique kinds of peptides

| Databases and prediction algorithms | Link | Notes | References |
|---|---|---|---|
| AIPpred | www.thegleelab.org/AIPpred | Anti-inflammatory peptides | [99] |
| PIP-EL | www.thegleelab.org/PIP-EL | Pro-inflammatory peptide | [100] |
| AntiTBpred | http://webs.iiitd.edu.in/raghava/antitbpred/ | Antitubercular peptides | [101] |

This is different from the most common amino acids (G, L, A, and
K) found in antibacterial helical peptides [10, 11].
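
The compositional and positional analyses described for antifungal peptides can be sketched as a simple tally. The four sequences below are invented for illustration only and are not real antifungal peptides; a real analysis would read sequences from a curated database.

```python
from collections import Counter

def terminal_residue_counts(sequences):
    """Count which residues occupy the first and last position of each peptide."""
    n_term = Counter(seq[0] for seq in sequences if seq)
    c_term = Counter(seq[-1] for seq in sequences if seq)
    return n_term, c_term

def composition_percent(sequences):
    """Pooled amino acid composition (percent) over all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}

# Toy sequences, invented for illustration only.
toy_set = ["RCGKYC", "KGHCRH", "VCKGYC", "RGGKCH"]

n_term, c_term = terminal_residue_counts(toy_set)
print("N-terminal residues:", n_term.most_common())
print("C-terminal residues:", c_term.most_common())

composition = composition_percent(toy_set)
for aa, pct in sorted(composition.items(), key=lambda kv: -kv[1]):
    print(f"{aa}: {pct:.1f}%")
```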
5.4 Specific and Unique Peptide Prediction Tools
Many other specialized prediction algorithms for peptides have
been developed in recent years [99–101]. While anti-inflammatory
and pro-inflammatory activities are closely linked to infection outcomes, the peptides that carry them may not be directly antimicrobial. However,
they may be of interest to antimicrobial peptide researchers, especially
since many antimicrobial peptides, such as LL-37, are known to
have host-directed effects in addition to antibacterial effects
[105]. Websites have been developed for predicting several such
specific activities, including anti-inflammatory peptides,
pro-inflammatory peptides, and antitubercular peptides (Table 6).
5.5 Tuberculosis
Tuberculosis (TB) continues to be a plague on humanity, infecting
more than ten million people each year worldwide, and is responsible for approximately two million annual deaths globally. The
emergence of multidrug-resistant and extensively drug-resistant
(XDR) strains of TB, especially in prisons and other enclosed conditions, is an extreme challenge to society and to the medical
community to develop new approaches to treat these infections.
Antimicrobial peptides may represent one new approach to treating
Mycobacterium infection [102–104], likely in combination with
other treatments. The AntiTBpred website has been developed to
help researchers parse through antimicrobial peptide sequences and
to try to identify candidates that might be useful against this
recalcitrant and challenging organism.

Table 7
AntiTBpred output for the activity of LL-37 against tuberculosis

| Prediction method | ID | Score | Prediction |
|---|---|---|---|
| AntiTB_MD SVM ensemble | LL37 | 0.78 | Anti-TB peptide |
| AntiTB_MD SVM ensemble | HBD2 | 0.30 | Non Anti-TB peptide |
| AntiTB_RD SVM ensemble | LL37 | 0.25 | Non Anti-TB peptide |
| AntiTB_RD SVM ensemble | HBD2 | 0.202 | Non Anti-TB peptide |
| AntiTB_MD Hybrid method | LL37 | 0.25 | Non Anti-TB peptide |
| AntiTB_MD Hybrid method | HBD2 | 0.053 | Non Anti-TB peptide |
| AntiTB_RD Hybrid method * | LL37 | 0.673 | Anti-TB peptide |
| AntiTB_RD Hybrid method * | HBD2 | 0.317 | Anti-TB peptide |

Using LL-37, the human cathelicidin, as an example, AntiTBpred analysis (Table 7) suggests that this peptide may or
may not be an antitubercular peptide, depending on the model. Studies have shown that
LL-37 is antibacterial against Mycobacterium tuberculosis (MTb) in vitro and in vivo and can reduce bacilli counts in a mouse model
[105]. Further studies have shown that LL-37 is required to control intracellular MTb replication [103–105]. The antimicrobial
peptide HBD2 has also been shown to have antibacterial activity
against MTb in vitro [106]. In the output example (Table 7), these
two peptide sequences were analyzed using all four models within
AntiTBpred. Only one of the four models correctly predicted (gray
highlights) that HBD2 was anti-TB, and it also predicted that
LL-37 would be anti-TB.
5.6 Antibiofilm Peptide Predictors and Data
Biofilm formation by bacteria is a major contributor to colonization, persistence, and difficulty in treatment of bacterial infections.
Chronic, nonhealing diabetic wounds on the lower extremities,
lung infections in cystic fibrosis patients, infections of hip replacements and
other orthopedic implants, and chronic bladder infections all have
bacterial biofilm as a major component of their etiology. In recent
years, as our understanding of bacterial biofilms has increased
[107–109], it has become clear that some antimicrobial peptides
can either prevent the attachment and formation of
biofilm or induce the dispersal of bacterial biofilms [110–117]. Several databases and websites [11, 35, 118–120] have
been developed to gather information on antibiofilm peptides
and to try to predict their activity (Table 8).
Although not strictly peptide-focused, a related tool, aBiofilm (https://bioinfo.imtech.res.in/manojk/abiofilm/) [121], may be of interest to antibiofilm peptide
researchers. This tool provides a database, an antibiofilm predictor,
and data-visualization tools.

Table 8
Prediction algorithm websites for Antibiofilm peptides

| Databases and prediction algorithms | Link | Notes | References |
|---|---|---|---|
| BaAMPs | http://www.baamps.it/ | Database of biofilm-active antimicrobial peptides | [35] |
| dPABBs | http://ab-openlab.csir.res.in/abp/antibiofilm/ | Predictor of antibiofilm activity of peptides and generates possible peptide variants and predicts their antibiofilm activity | [118] |
| BIPEP | http://cbb1.ut.ac.ir/BIPClassifier/Index | Uses NMR and physicochemical descriptors | [119] |
| BioFIN | http://metagenomics.iiserb.ac.in/biofin/ and http://metabiosys.iiserb.ac.in/biofin/ | | [120] |

6 Antimicrobial Prediction Outcome Comparison
6.1 Prediction Comparison on the Same Platform
The prediction accuracy for AMPs is determined by numerous
factors, ranging from data sets and the encoding of peptide sequence
information to algorithms. Which data set to use depends on the
aim of the prediction and personal knowledge. Representing
a peptide faithfully in a form that computers can process
is a challenging task in itself. This is further complicated
by the numerous types of chemical modifications annotated in the
APD3 [11]. An optimized prediction requires a sufficient definition
of both the types and numbers of peptide features. Such peptide
features range in number from a dozen to hundreds. The algorithms or models
may be used alone or in combination.
6.1.1 Data Sets
A reliable data set is critical to obtain useful predictions. Machine-learning predictions normally use a balanced positive:negative
data ratio of 1:1 to avoid a prediction biased toward the larger data
set. CAMP used a positive:negative ratio of 1:1.5 [18]. AmPEP
tested numerous ratios and achieved a higher accuracy when a 1:3
ratio was utilized [62]. Too high a ratio is undesirable because the prediction will tilt toward negative sequences, thereby reducing the overall performance of machine learning in predicting AMPs. Meher
and colleagues tested the effect of the number of positive peptides
and found that the more positive peptides, the better the prediction [61]. This makes sense because the prediction program is
better trained with more positive examples (synthetic + natural
AMPs). When more and more synthetic peptides are included,
however, the prediction accuracy toward natural AMPs may drop.
This is undesirable when the goal is to scan genomes to discover
novel antibiotics.
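
How a chosen positive:negative ratio translates into a training set can be sketched as follows. The function name, the placeholder identifiers, and the pool sizes are assumptions for illustration; none of the cited programs is claimed to build its data sets this way.

```python
import random

def build_training_set(positives, negative_pool, neg_per_pos=1.0, seed=0):
    """Sample negatives so that positives:negatives is 1:neg_per_pos."""
    n_neg = round(len(positives) * neg_per_pos)
    if n_neg > len(negative_pool):
        raise ValueError("negative pool too small for the requested ratio")
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    negatives = rng.sample(negative_pool, n_neg)
    data = [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]
    rng.shuffle(data)
    return data

# Placeholder identifiers stand in for real peptide sequences.
positives = [f"AMP_{i}" for i in range(100)]
negative_pool = [f"NONAMP_{i}" for i in range(1000)]

for ratio in (1.0, 1.5, 3.0):  # the 1:1, 1:1.5, and 1:3 ratios mentioned in the text
    data = build_training_set(positives, negative_pool, neg_per_pos=ratio)
    n_neg = sum(1 for _, label in data if label == 0)
    print(f"1:{ratio} -> {len(positives)} positives, {n_neg} negatives")
```

Sampling the negatives from a larger pool, rather than using the whole pool, is what keeps the ratio under the control of the researcher.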
6.1.2 Peptide Features
A thorough description of the peptide sequence would require
numerous features. The first prediction study recognized the need for a
more complete representation of peptide information: a higher
accuracy was achieved when the peptide features from both the N-
and C-termini were considered [17]. Wang et al. [54] utilized
270 sequence features to represent each AMP. These include
20 standard amino acid compositions (AAC) and 50 pseudo-amino acid compositions (PseAAC) that describe the peptide sequence based on
positional correlations between amino acids. Each PseAAC is also
linked with five features: polarity, secondary structure, molecular
volume, codon diversity, and electrostatic charge (50 × 5 = 250, giving 20 + 250 = 270 features in total). However, the peptide features do not all play the same role in prediction.
In pattern recognition, it is most important to identify the major
features significant for peptide classification. CAMP started with
257 features and found that 64 features were best for RF [18]. It is
possible to further reduce the peptide features required for prediction. Bhadra et al. were able to reduce the features from 105 to
23 without a loss of prediction accuracy [62]. Tripathi and Tripathi
utilized merely 15 peptide features to reach a comparable prediction accuracy, including the consideration of the sequence shuffling
effect [69]. It appears that only a dozen or so key peptide features are
needed to achieve a comparable prediction accuracy.
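
One simple way to encode a peptide so that information from both termini is represented can be sketched as below. The five-residue window and the three-block layout are assumptions chosen for illustration and do not reproduce the feature sets of the cited studies.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(segment):
    """Amino acid composition: fraction of each of the 20 standard residues."""
    n = len(segment)
    return [segment.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]

def encode(sequence, window=5):
    """Concatenate whole-sequence, N-terminal, and C-terminal compositions."""
    return aac(sequence) + aac(sequence[:window]) + aac(sequence[-window:])

ll37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"
features = encode(ll37)
print(len(features))  # 60 = 3 blocks of 20 composition values
```

Each block sums to 1, so a classifier sees the overall composition alongside the local composition at each end of the peptide.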
6.1.3 Algorithms/Models
Tripathi and Tripathi applied different algorithms (RF, J48, SVM,
and Naïve Bayes) to peptide prediction based on the same data set
and found that random forest performed best [69]. Yan et al. found that
deep learning (CNN) performed similarly to RF but better than
SVM [70]. Nevertheless, both SVM (8 studies) and RF (7 studies) are
popular in Table 3. To reduce overfitting, ensemble approaches
that combine multiple models have also been attempted
[77]. Lin and Xu [60] reported a higher accuracy for the more recent
multi-label prediction methods such as iAMP-2L and MLAMP
(92.2% and 94.7%) than for those implemented in CAMP (SVM,
RF, and DA at 57.8–77.5% accuracy) [18]. However, the high
accuracy reported for machine learning does not appear to match the outcomes of real tests (below). There is room for improvement in all the
existing programs.
6.2 Testing the Prediction Outcomes by Using Peptides Not Included in the Training Set
How each program performs in AMP prediction can be tested in
practice. We tested the AntiBP program by using newly discovered
natural AMPs, which were not included in the training set. Among
the 17 peptides with known activity, 71% were predicted correctly
[20]. Another test was conducted in 2015 using 10 new peptides
(APD ID: 2399-2408) [51]. AntiBP SVM predicted 70% correctly,
whereas the RF, SVM, ANN, and DA programs in CAMP [18]
obtained 60–80% correctness. iAMP-2L [52] achieved a similar
prediction accuracy of 80%. Bishop et al. [122] identified 568 novel peptides
from alligator plasma. Of the 45 predicted to be AMPs by CAMP
[18], eight peptides were chemically synthesized and subjected to
antibacterial assays. Five were experimentally proved to be antimicrobial (a prediction accuracy of 5/8 = 62.5%). Yan et al. [70]
developed Deep-AmPEP30 and predicted three antimicrobial
sequences from the genome of Candida glabrata; one peptide
proved active against the Gram-positive bacterium Bacillus subtilis
and the Gram-negative Vibrio parahaemolyticus. These tests underscore
the limitations of existing programs. Porto et al. [80] found that
the machine-learning programs worked well only for peptides
resembling the training data set and failed to predict
sequence-shuffled peptides [14], indicating an insufficient consideration of peptide sequence information.
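
Why composition-only descriptors cannot separate a peptide from its shuffled variant can be shown with a short sketch. The shuffling seed and the choice of dipeptide counts as the order-sensitive descriptor are assumptions for illustration, not the procedure used in the cited tests.

```python
import random
from collections import Counter

def composition(seq):
    """Residue counts; insensitive to residue order."""
    return Counter(seq)

def dipeptide_counts(seq):
    """Counts of adjacent residue pairs; sensitive to residue order."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

ll37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"
rng = random.Random(1)
shuffled = "".join(rng.sample(list(ll37), len(ll37)))  # same residues, new order

# Composition is identical for any shuffle, so composition-based features
# give the two sequences the same input vector.
print(composition(ll37) == composition(shuffled))
# Order-aware descriptors such as dipeptide counts can tell them apart.
print(dipeptide_counts(ll37) == dipeptide_counts(shuffled))
```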
6.3 Comparison with Existing AMP Knowledge
Every machine-learning algorithm is essentially a black box. It is not
surprising that there is no direct link between the computing
outcome and AMP biology. AmPEP compared various descriptors
that distinguish the AMPs from non-AMPs and identified charge as
the most important descriptor [62]. The iAMPpred program [61]
also found the importance of net charge followed by isoelectric
point of the peptides in the training set. The iAMP-2L program
reveals that amino acid composition accounts for 60% of the
weightings [52]. Taken together, the AMP charge and composition
are two major features for AMP differentiation. Overall, these
machine-learning findings agree with the research results of AMPs
that cationicity and hydrophobicity are the two most important
factors that determine peptide antimicrobial activity. Amino acid
composition is important in determining the peptide activity spectrum as well [9, 123, 124].
Some programs documented selected amino acids to be important predictors of AMPs. Based on the APD3 data set, the AMAP
study [65] identified amino acids C, K, V, and phenylalanine (F) for
AMP prediction, whereas aspartic acid (D), glutamic acid (E), L, Y,
proline (P), R, and asparagine (N) are indicators for non-AMPs.
Using a merged data set, iAMPpred identified amino acids K, P, C,
and isoleucine (I) [61]. Wang [54] found C, P, R, W, and H based
on both natural and patented AMPs in the CAMP database. In
another study, amino acids G, F, P, and W were identified [44]
based on the DBAASP data set [33]. It is evident that there is a low
level of consensus from different prediction studies. This may result
from differences in the training data sets, algorithms, and the
assessment of important features during prediction.
It may be useful to compare the above amino acids with the
frequently occurring amino acids (~10%) discovered from analyses
of the major classes of natural peptides in the APD3 [10]. K, L,
G, and A are frequently occurring (abundant) amino acids (~10% or
more) in 463 known helical AMPs. In contrast, amino acids C, G,
and R are abundant in natural AMPs with a known β-sheet structure (87 in the APD3) (Fig. 1a). For the “rich” families, His-rich
AMPs are clearly rich in H and G, while Pro-rich AMPs are rich in P
and R. Also, Trp-rich peptides are rich in W and R (Fig. 1b). When
combined, we have G, L, A, K, C, R, H, P, and W. Most of the
amino acids identified by machine learning correspond to part of the
frequently occurring amino acids of AMPs discovered in the APD3
[13]. Machine learning also identified the hydrophobic amino acids V, F, and
I. While F and I are abundant in helical AMPs from fish and
mammals, V is abundant in lactone and lactam types of bacteriocins
[13]. It is puzzling that neither L nor A was identified by any
machine-learning study. Leucine is, on average, rich in 121 amphibian
temporins (Fig. 1b) and important for peptide design [32]. Alanine
is particularly abundant in amphibian AMPs from South America
[13]. Increased conversations between AMP researchers and bioinformaticians
may improve the prediction outcomes in the future.
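
The comparison made in this section can be reproduced with a few set operations. The residue sets are transcribed from the text above; the dictionary keys are labels chosen here for readability.

```python
from collections import Counter

# Residues reported as AMP predictors in the studies cited above.
ml_studies = {
    "AMAP [65]": set("CKVF"),
    "iAMPpred [61]": set("KPCI"),
    "Wang [54]": set("CPRWH"),
    "DBAASP-based study [44]": set("GFPW"),
}
# Abundant residues in the natural AMP classes of the APD3 (combined list above).
apd_abundant = set("GLAKCRHPW")

votes = Counter(aa for residues in ml_studies.values() for aa in residues)
ml_union = set(votes)

consensus = sorted(aa for aa, n in votes.items() if n >= 2)
missed = sorted(apd_abundant - ml_union)
extra = sorted(ml_union - apd_abundant)

print("Flagged by two or more studies:", consensus)
print("Abundant in APD3 but never flagged:", missed)
print("Flagged but not in the abundant list:", extra)
```

The tally makes the low consensus explicit: no residue is shared by all four studies, L and A are never flagged, and V, F, and I are flagged without appearing in the combined abundant list.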
7 Beyond Antimicrobial Properties and Proposed Prediction Integration Toward Future Medicine
7.1 Antimicrobial Peptide Properties that Contribute to AMP Activity
As discussed above, the general properties of peptides that appear
to be positively correlated with AMP activity have been identified
from experience and usually include the following physicochemical
parameters: (1) peptide length, (2) amphipathicity, (3) hydrophobicity, and (4) cationicity. However, the translation of these general
principles into very specific physicochemical rules by which certain
sequences can be included or excluded or predicted to have antimicrobial activity or not has been the challenge of the last decades
since their discovery. As discussed above, there are many detailed
bioinformatic and computational approaches that seek to solve this
problem of AMP prediction (Table 3).
7.2 Important Antimicrobial Peptide Properties in Addition to AMP Activity
Besides AMP activity, additional properties contribute to making a peptide a
“successful” antimicrobial peptide. These
properties include toxicity
toward host cells, the ability to penetrate microbial or eukaryotic
membranes, susceptibility to host proteases, and “stickiness”
(the propensity to be bound to albumin or other high-abundance
proteins in the host), among others. Host-cell toxicity can include
hemolytic activity and cytotoxicity, or it can be observed in vivo
through toxicity trials. Cell permeability of the peptide can be a
critical factor if, for example, the target of the AMP is an intracellular bacterium.
“Stickiness” to high-abundance host proteins or high
susceptibility to host proteases can affect the in vivo availability of
the peptide and its half-life, aspects of pharmacodynamics (PD) and
pharmacokinetics (PK) that have significant implications for future
clinical success. Unfortunately, the PK/PD data for AMPs are
sparse, since most of the peptides have not been advanced to that
level [6]. Some of the major parameters for consideration and
possible inclusion in a computational approach are listed in
Table 9. Many tools for computing these properties are available
online, for example, in R (Peptides, https://rdrr.io/cran/Peptides/man/), ExPASy (expasy.org), and the calculation tool of
the APD3 [11].
LL-37 is a widely studied human cathelicidin peptide encoded
by the single CAMP gene. It is stored in and released from neutrophils and is expressed in other types of human cells as well.
Depending on the cells and physiological conditions, the precursor
of human cathelicidin may be cleaved into different mature peptides. LL-37 has been found to be antibacterial against many
pathogens, including resistant strains, persisters, and biofilms. It
belongs to the classic amphipathic helical family with a short tail at
the C-terminus (PDB: 2K6O) [7]. Table 9A shows the major physicochemical properties of LL-37 as computed by one of the
many websites described below. This peptide is short (37 aa),
amphipathic (>1), and cationic (net charge +6), with a high pI (>10)
and a low molecular weight (under 5 kDa). The ExPASy ProtParam
tool provides an instability index (23.34) and an aliphatic index (89.46).
The APD website calculates GRAVY (−0.724), Boman index
(2.99 kcal/mol), and Wimley-White whole-residue hydrophobicity
(12.83) for LL-37. Because LL-37 is well studied, we use it as
an example in our discussion of the online tools described below.
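
Several of these values can be recomputed directly from the LL-37 sequence. The sketch below uses the standard Kyte-Doolittle hydropathy scale for GRAVY and the usual aliphatic index formula; the net charge is a simple residue count that ignores histidine and the termini, which is an assumption of this example rather than the method of any particular website.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def net_charge(seq):
    """Integer net charge: (K + R) minus (D + E); His and termini are ignored."""
    return sum(seq.count(aa) for aa in "KR") - sum(seq.count(aa) for aa in "DE")

def gravy(seq):
    """Grand average of hydropathy (mean Kyte-Doolittle value)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile, and Leu."""
    def pct(aa):
        return 100.0 * seq.count(aa) / len(seq)
    return pct("A") + 2.9 * pct("V") + 3.9 * (pct("I") + pct("L"))

ll37 = "LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES"
print(len(ll37), net_charge(ll37), round(gravy(ll37), 3), round(aliphatic_index(ll37), 2))
# 37 6 -0.724 89.46
```

The length, net charge, and aliphatic index agree with the values quoted above, and the negative GRAVY reflects the many charged residues in the sequence.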
7.3 Host-Cell Toxicity and Hemolysis
Host-cell cytotoxicity and hemolysis are critical to the clinical
potential of any antimicrobial peptide. Thus, we propose that this
issue needs to be considered early, right after identification of
desired antimicrobial activity of any peptide as a potential strong
counter-selection criterion. Although sequence features such as
multiple lysines and high hydrophobicity are known to contribute
to host-cell cytotoxicity, it appears to remain challenging to
“design-out” host-directed toxicity of active peptides while retaining the desired antimicrobial activity of the sequence. The combined AMP selection and counter-selection procedure leads to a
short list of AMPs with high therapeutic indexes for experimental
validation.
There are multiple online programs available for the computational prediction of toxicity and hemolysis of antimicrobial peptides.
For example, Gupta et al. have published a method of in silico toxicity
prediction for peptides [125, 126]. This site is called ToxinPred and
has two algorithms available, ToxinPred SVM-SwissProt and ToxinPred QM-di-SwissProt. To illustrate the use of this website, we
submitted the sequence of LL-37, the human cathelicidin, to compare the prediction versus in-laboratory data (Table 9A, B). It can be
seen that experimentally the cytotoxicity of LL-37 is dose-dependent
and increases with increasing concentration of peptide (Table 9B).

Table 9
Hemolytic prediction of activity for LL-37 human cathelicidin peptide

(A) Predicted Toxicity of LL-37 on ToxinPred (validated via ExPASy ProtParam tool)

| Peptide sequence | SVM score | Prediction | Hydrophobicity | Hydropathicity | Amphipathicity | Hydrophilicity | Net charge | pI | Mol wt |
|---|---|---|---|---|---|---|---|---|---|
| LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES | 1.58 | Nontoxin | 0.34 | 0.72 | 1.06 | 0.62 | +6.0 | 10.61 | 4493.32 |

(B) Experimental cytotoxicity activity of human cathelicidin LL-37

| Peptide | Cell line | Assay | Result | References |
|---|---|---|---|---|
| LL-37 | A549 | MTT | Not cytotoxic up to 50 μg/mL | [127] |
| Scrambled LL-37 | A549 | MTT, Neutral red | Not cytotoxic up to 50 μg/mL | [127] |
| LL-37 | A431 squamous cell carcinoma cells | MTT | Statistically significant cytotoxicity (>10%) observed 20–50 μg/mL | [128] |
| LL-37 | pMSC | MTT | Cytotoxic at 20 μg/mL. Not toxic at 5 μg/mL | [130] |
| LL-37 | MA-104 | MTT | No toxicity up to 10 μg/mL | [129] |
| LL-37 | Thermally wounded human skin equivalents (HSE) | MTT | No cytotoxicity at up to 200 μg/model | [131] |

However, this subtlety of concentration of peptide is not captured by
the predictors, which predict a single result for some unknown
concentration of peptide. Thus, just as a stopped clock is correct
twice a day, the predictor is correct at some concentrations of LL-37
and incorrect at higher concentrations. This concentration-dependence of the real-life data needs to be integrated with computational predictors in the future, perhaps by recording the concentrations at which a peptide was found to be
“antibacterial” or “noncytotoxic” when it is added to the data set.
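
One way to keep the tested concentrations attached to a label is sketched below. The class and field names are hypothetical; the example values follow one Table 9B entry for LL-37 (not toxic at 5 μg/mL, cytotoxic at 20 μg/mL).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DoseLabel:
    """Experimental outcome stored together with the tested concentrations."""
    peptide: str
    nontoxic_up_to: float               # highest concentration (ug/mL) with no cytotoxicity
    toxic_from: Optional[float] = None  # lowest concentration (ug/mL) with cytotoxicity, if any

    def label_at(self, conc):
        """Return 'noncytotoxic', 'cytotoxic', or 'untested' at a given concentration."""
        if conc <= self.nontoxic_up_to:
            return "noncytotoxic"
        if self.toxic_from is not None and conc >= self.toxic_from:
            return "cytotoxic"
        return "untested"  # between or beyond the measured concentrations

record = DoseLabel("LL-37", nontoxic_up_to=5.0, toxic_from=20.0)
for conc in (2.0, 10.0, 50.0):
    print(conc, record.label_at(conc))
```

A data set built from such records lets a predictor be trained or evaluated at a stated concentration, and it makes explicit where no measurement exists.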
Hemolytic activity is the ability of a peptide to lyse red blood
cells. This assay is normally performed with a washed 2% solution of
red blood cells, following a standard protocol [132, 133]. Many
different defibrinated red blood cell types can be used, depending
on the intent of the experiment, such as sheep [132–134], horse
[135], chicken [136], or mouse [137, 138], which may be more
sensitive to peptide hemolysis than human red blood cells
[138]. Often it is desirable to use deidentified human blood to test
hemolytic activity, which can be obtained from companies like
BioIVT and used in these assays [138]. Computational predictors
of hemolytic activity can be used to compute an estimate of hemolytic activity. For example, HemoPred [139], HemoPI/Hemolytik
[140], and HAPPENN [141] are some of the websites currently
available (Table 10). HemoPred utilizes a random forest classifier
based on amino acid sequence, dipeptide composition, and physicochemical parameters [139]. HemoPI is based on comparing a data
set of highly hemolytic peptides to a random data set of peptides
from SwissProt [140]. Finally, the HAPPENN tool employs neural
networks based on classification of known peptides as hemolytic and
nonhemolytic to predict the hemolytic activity from a new peptide’s
primary sequence [141].
As an exercise, we ran the sequence of the LL-37 peptide
through the various hemolysis predictors (Table 11) and compared
the results to published laboratory generated data regarding hemolytic activity (Table 12).
From the literature, the following hemolysis data were obtained
for the LL-37 peptide (Table 12), as an example. This is not a
comprehensive meta-analysis, but shows data from several papers
that covered a wide range of peptide concentrations and
hemolytic results [142–148]. Of course, there is no indication from
these computational predictors of dose-dependence of the effect,
although “the dose makes the poison” in most cases with antimicrobial peptides, including LL-37. The prediction results vary from
one end of the hemolytic activity spectrum to the
other: one analysis result says “Not Hemolytic,” one result is
“Somewhat hemolytic,” and one result is “Hemolytic.” This small
analysis suggests that there is significant room for improvement in
the accuracy of these predictors compared to actual experimental
data generated in the laboratory (Table 13 and Fig. 3).

Table 10
Hemolytic predictor websites

| Name | Link | References |
|---|---|---|
| HemoPred | http://codes.bio/hemopred/ | [139] |
| HemoPI/Hemolytik | https://webs.iiitd.edu.in/raghava/hemopi/index.php or http://crdd.osdd.net/raghava/hemopi/ | [140] |
| HAPPENN | https://research.timmons.eu/happenn | [141] |

Table 11
Hemolytic prediction of activity for LL-37 human cathelicidin peptide

Test sequence: LL-37: LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES

| Program used | Predicted result | Notes |
|---|---|---|
| HemoPred | Hemolytic | |
| HemoPI | PROB score: 0.34 (SVM (HemoPI-1) based); 0.72 (SVM (HemoPI-2) based) (Hemolytic); 0.88 (SVM (HemoPI-3) based) (Hemolytic) | Note from website: PROB score is the normalized SVM score and ranges between 0 and 1, i.e., 1 very likely to be hemolytic, 0 very unlikely to be hemolytic |
| HAPPENN | PROB score: 0.089 (Not Hemolytic) | Note from website: PROB score is the normalized sigmoid score and ranges between 0 and 1. 0 is predicted to be most likely nonhemolytic, 1 is predicted to be most likely hemolytic |

Table 12
Summary of reported percent hemolysis results with different amounts of LL-37 peptide against human red blood cells

| Hemolysis of human red blood cells | References |
|---|---|
| 8% hemolysis at 20 μM | [144] |
| ~30% hemolysis at 20 μM | [147] |
| 4.47% hemolysis at 38.8 μM | [146] |
| ~10% hemolysis at 60 μM | [143] |
| 9% hemolysis at 100 μM | [148] |
| ~60% hemolysis at 100 μM | [142] |
| ~50% hemolysis at 200 μM | [145] |

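
The dose-dependence in the reported hemolysis data can be made explicit with an ordinary least-squares fit of the Table 12 values. This is a sketch under the assumption that the approximate ("~") percentages are used as listed; it gives a line close to the best-fit line reported for Fig. 3 (y = 0.2142x + 8.0017), with small differences expected from rounding.

```python
# (concentration in uM, percent hemolysis) pairs as listed in Table 12;
# values reported with "~" are used as given.
data = [(20, 8), (20, 30), (38.8, 4.47), (60, 10), (100, 9), (100, 60), (200, 50)]

def linear_fit(points):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = linear_fit(data)
print(f"y = {slope:.4f}x + {intercept:.4f}")
```

The positive slope summarizes what a single "hemolytic" or "not hemolytic" label cannot: the expected percent hemolysis rises with peptide concentration, although the scatter between laboratories at the same concentration is large.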

Table 13
Peptide parameters for integrated prediction

| Parameter of interaction | Commonly used parameters | Comments |
|---|---|---|
| Antibacterial activity | MIC >8 μg/mL is often considered “active” performed under CLSI guidelines using CA-MHB and designated concentrations of peptide. The peptide is defined as inactive in the APD with MIC >100 μg/mL or μM | Different methods and conditions for antimicrobial activity make it difficult to compare peptide activity. Does not account for peptide binding to serum proteins or being cleaved by serum factors in vivo. PK/PD data are lacking for AMPs, and they are not addressed by this metric |
| Host-cell cytotoxicity | Cytotoxicity at 100 μg/mL or less; TC50 should be <10–20% at the MIC, depending on the assay used | The relationship of this value in vitro with in vivo/whole body toxicity has not been established. Often the level of LL-37 is taken as a benchmark, since it is native to the human body |
| Hemolysis | Hemolysis at 100 μg/mL or HC50 should be <10–20% at MIC | The relationship of this value to in vivo/whole body toxicity has not been measured. Often the level of LL-37 is taken as a benchmark, since it is native to the human body |
| Host cell permeability | An important parameter if the target microorganism has an intracellular step to its infectious life-cycle | Assays to measure intracellular replication of bacteria in the presence of extracellular peptide are useful to assess this parameter [113] |
| Pathogen cell permeability | An important parameter if the target of the peptide at sub-MIC concentrations might be an intracellular component of the bacteria, such as target enzymes or DNA | Assays to measure intracellular bacterial targets such as enzymes or DNA in the presence of extracellular peptide are useful to assess this parameter [149–151] |
| Stickiness to other proteins (Boman index) | “This function computes the potential protein interaction index proposed by Boman [3] based in the amino acid sequence of a protein. The index is equal to the sum of the solubility values for all residues in a sequence, it might give an overall estimate of the potential of a peptide to bind to membranes or other proteins as receptors, to normalize it is divided by the number of residues. A protein has high binding potential if the index value is higher than 2.48” | Initially called protein-binding potential [3], Boman index was renamed and programmed in the APD for every peptide [9]. It is also available in the calculation and prediction interface of the APD for any other peptides. This parameter is also programmed in R at https://rdrr.io/cran/Peptides/man/boman.html |
| Propensity for host protease cleavage | Protease cleavage will reduce the activity and half-life of the peptide | Can be predicted using Expasy server PeptideCutter. https://web.expasy.org/peptide_cutter/ |

| Other negative effects | References | Comments |
|---|---|---|
| Carcinogenic effect | None | No reports were found on the carcinogenic effect of antimicrobial peptides. Work is being done to use AMPs to fight cancer [155, 156] |
| Antigenicity | None | It is very difficult to raise antibodies against antimicrobial peptides. This is accomplished if at all by coupling KLH to the peptide. To our knowledge, there have been no reports of spontaneous antibody production against naturally produced AMP, which is too small |
| Cell penetrating properties | [152] | Cell penetrating properties of peptides are probably a negative property on net, especially in seeking a bactericidal mechanism. Websites are available to select for CPPs; this could be a counter-selection or down-selection step in an AMP design protocol unless this property is used to target intracellular pathogens |

Fig. 3 Percent hemolysis results with different amounts of LL-37 peptide against
human red blood cells. The data from Table 12 were plotted. The best-fit line is
y = 0.2142x + 8.0017. The shaded gray area represents a 95% confidence
interval

7.4 Bacterial Cell-Penetrating Peptides
Another factor that may need to be considered in computational
prediction of AMP activity is the ability of the peptide to penetrate
the cell of the pathogen itself: bacterium, enveloped virus, fungal cell, etc.
While the main mechanism of action of AMPs is clearly membrane
targeting and disruption, there are multiple, well-defined examples
of intra-bacterial targets of AMPs that may contribute to their
physiological effect, especially at sub-MIC levels in vivo. These
can include targeting bacterial enzymes critical for bacterial survival, or direct interference of the AMP with the bacterial DNA.
One example of the association of AMPs with critical bacterial
enzymes is the identification of acyl carrier protein as a target of
LL-37, the human cathelicidin. This association was first
determined biochemically by binding the bacterial proteins to
immobilized peptide and identifying high-affinity binding proteins
[149]. Another example of intra-bacterial targets of AMPs is the
association of LL-37 directly with bacterial DNA within the cell,
leading to mutations of critical genes [150, 151]. This work
includes a compelling visualization of the AMP inside live
Pseudomonas bacteria, associated with the DNA. This property of
AMPs to enter the bacteria to exert some direct, non-membrane-acting
effect could be computationally assessed using cell-penetrating peptide (CPP) analysis, such as is done for other
well-known CPPs [152]. Unlike AMPs, CPPs for bacterial pathogens should have the property of being non-killing but membrane-penetrating, and comparison of these sets of peptide sequences may
reveal some interesting differences. If a membrane-targeting peptide
is desired to achieve bactericidal activity, it might be possible to use the
CPP algorithm for counter-selection, retaining only peptides that do
not have the cell-penetrating property.
7.5 Inclusion of
Additional Parameters
in Drug Development
It would be useful if these computational predictors could be used
in a combinatorial fashion to achieve the goals of the researcher in
designing new AMPs, as was done in the database filtering
technology approach [153, 154]. For example, perhaps one seeks a
short, helical antimicrobial peptide that is active against
Gram-negative bacteria and, in particular, has anti-biofilm activity
and low hemolytic activity. It would be useful to have separate analytical
tools linked together to generate the desired output. With the
ever-increasing number of modules available in R, and of web-based
prediction and analysis tools, this analysis could be carried out at
any scale, from a few sequences to high-throughput sequence analysis,
to design novel peptides. If the computational predictors could be made
more accurate, they could be used in drug-development projects upstream of
in vitro screening programs, for example, to increase hit efficacy.
Prescreening for hemolysis and cytotoxicity would
be very useful to reduce the number of hits that have poor in vivo
performance characteristics. In addition, high-throughput peptide
sequencing could enable the generation of high-quality training
sets and negative data sets.
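The idea of linking separate predictors can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Criterion` class, the `filter_peptides` function, and the two toy scoring functions are hypothetical stand-ins with no predictive validity. In practice each toy function would be replaced by a trained predictor (for example, for anti-biofilm or hemolytic activity).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# A predictor maps a peptide sequence to a score between 0 and 1.
Predictor = Callable[[str], float]


@dataclass(frozen=True)
class Criterion:
    """One design goal: a predictor plus the threshold a peptide must meet."""
    name: str
    predictor: Predictor
    threshold: float
    desired: bool = True  # False for properties to avoid (e.g., hemolysis)

    def passes(self, seq: str) -> bool:
        score = self.predictor(seq)
        return score >= self.threshold if self.desired else score < self.threshold


def filter_peptides(seqs: Iterable[str], criteria: List[Criterion],
                    max_length: int) -> List[str]:
    """Keep non-empty sequences that are short enough and meet every criterion."""
    return [s for s in seqs
            if 0 < len(s) <= max_length and all(c.passes(s) for c in criteria)]


# Toy stand-ins (NOT real predictors): simple residue fractions.
def toy_cationic_fraction(seq: str) -> float:
    return sum(seq.count(a) for a in "KR") / len(seq)


def toy_hydrophobic_fraction(seq: str) -> float:
    return sum(seq.count(a) for a in "AILMFVW") / len(seq)


criteria = [
    Criterion("activity (placeholder)", toy_cationic_fraction, 0.3),
    Criterion("hemolysis (placeholder)", toy_hydrophobic_fraction, 0.6,
              desired=False),
]
candidates = ["KKLLKKLLKK", "GGGGGGGGGG", "KKLLKKLLKKLLKKLLKKLLKKLLKK"]
selected = filter_peptides(candidates, criteria, max_length=20)
```

Because each criterion is independent, further goals (helicity, Gram-negative activity, cytotoxicity) could be added to the list without changing the filtering function. The thresholds shown are arbitrary.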
8 Current Achievements and Future Directions
8.1 Achievements
In summary, antimicrobial peptide prediction is in essence a peptide
classification problem. Different supervised learning algorithms
have been trained to predict AMPs (Table 3). The major achievements include the following:
1. Construction of AMP databases that facilitated machine
learning prediction. The APD database, first online in
2003 and updated regularly, provides a platform for understanding the structure-activity relationships of
natural AMPs.
2. Generation of hypothetically negative data sets based on
UniProt.
3. Successful encoding of peptide features for machine-learning
prediction.
4. Programming of various machine-learning algorithms with
more or less similar prediction outcomes.
5. Execution of both single and multi-label predictions as well as
ensemble predictions of AMPs.
6. Consideration of the impact of the peptide sequence in addition to amino acid composition.
7. Consideration, although rare, of posttranslational modifications and 3D structures of AMPs.
8. Species-specific prediction of AMPs.
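As a concrete illustration of peptide encoding (items 3 and 6 above), the following minimal Python sketch computes an amino acid composition vector and, to capture some sequence-order information, dipeptide counts. The function names and the example sequence are hypothetical and chosen only for illustration. This is one simple encoding, not the scheme used by any particular predictor discussed in this chapter.

```python
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> List[float]:
    """Fraction of each standard residue, in the order of AMINO_ACIDS."""
    counts = Counter(a for a in seq.upper() if a in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_counts(seq: str) -> Dict[str, int]:
    """Counts of adjacent residue pairs; unlike composition, order matters."""
    seq = seq.upper()
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    return dict(Counter(p for p in pairs
                        if all(a in AMINO_ACIDS for a in p)))


# Illustrative sequence only (not taken from any database).
features = amino_acid_composition("GLLKKLLG")
```

Two peptides with the same residues in a different order share the same composition vector but differ in their dipeptide counts, which is why sequence-aware encodings can add information beyond composition.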
8.2 Future Directions
Machine-learning prediction of AMPs remains a challenging task.
The success rate is still modest because numerous
factors are in play. We anticipate that the quality of AMP prediction
will improve with progress in the following areas:
1. More complete positive data set for AMPs from continued
peptide search and database update. There are two types of
positive data. First, a continued expansion of natural AMPs in
the APD will increase the accuracy of identifying natural AMP
sequences. Second, data merging from different databases is
anticipated to continue and a large data set with more and more
synthetic peptides may improve the prediction of artificial
sequences.
2. Experimentally validated negative data sets for AMPs. Our
ongoing collection of such peptides may reduce false positives
in ML predictions.
3. Ranking peptide activity data based on the same scale (e.g.,
MIC, diffusion distance, and E-test). This is a challenging task
due to limited activity analysis under various lab conditions. A
recommended guide for antimicrobial assays of AMPs may be
helpful.
4. Increased use of information about the target organism in
classification and analysis of AMPs (e.g., whether the target is Gram-positive or Gram-negative bacteria, or a specific pathogen).
5. Continued improvement of peptide encoding for rapid and
accurate computational identification.
6. Increased use of peptide information on chemical modifications and their relationship with activity.
7. More high-quality 3D structures and their application in
AMP prediction. This is yet another challenging task, as currently only ~13% of the AMPs in the
APD3 have known 3D structures and high-quality structures are not easy to obtain [11].
8. Development of more powerful machine-learning/artificial
intelligence algorithms or models to better handle sequence
and structural diversity and data imbalance of AMPs. Combined use of various ML models (i.e., ensemble) may improve
predictions.
9. Increased communication between AMP investigators and
machine learning/AI scientists.
10. Establishment of a pipeline that predicts the peptide properties
required of a medicine by considering antimicrobial activity,
cell toxicity in vitro and in vivo, and peptide bioavailability for
efficacy in vivo.
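The combined use of several models mentioned in item 8 can be sketched in a few lines of Python. This is a minimal illustration of two common voting rules, not the method of any specific published ensemble. The function names and the example probabilities are hypothetical.

```python
from typing import Sequence


def majority_vote(labels: Sequence[bool]) -> bool:
    """Hard voting: call the peptide an AMP if more than half of the models do."""
    if not labels:
        raise ValueError("need at least one prediction")
    return 2 * sum(labels) > len(labels)


def soft_vote(probabilities: Sequence[float], threshold: float = 0.5) -> bool:
    """Soft voting: average the model probabilities, then apply a threshold."""
    if not probabilities:
        raise ValueError("need at least one prediction")
    return sum(probabilities) / len(probabilities) >= threshold


# Made-up AMP probabilities from three hypothetical models for one peptide.
model_probabilities = [0.9, 0.4, 0.4]
hard = majority_vote([p >= 0.5 for p in model_probabilities])
soft = soft_vote(model_probabilities)
```

In this example the two rules disagree: only one of three models exceeds 0.5, so hard voting rejects the peptide, whereas the single confident model lifts the average above the threshold under soft voting. Which rule is preferable depends on how well calibrated the individual models are.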
Besides AMP prediction, another goal of the APD database is
to help design novel peptides to combat antibiotic-resistant pathogens [9]. Different methods have been demonstrated [32]. The
frequently occurring amino acids, such as glycine, leucine, and
lysine, are sufficient in designing peptides with antibacterial activity
comparable to human cathelicidin LL-37 [10, 13]. Interestingly, a
substitution of leucine in the database-designed peptide DFTamP1
with isoleucine or valine led to a decrease in activity or solubility [153],
underscoring the significance of nature’s choice of leucine as a
frequently occurring amino acid in AMPs [10]. Also, there is an
inverse correlation between peptide length and leucine content among
over 1000 amphibian peptides in the APD [157]. Our screening of
representative peptides from the APD led to the identification of
different sets of AMPs against methicillin-resistant Staphylococcus
aureus (MRSA) and HIV-1 [158, 159]. The grammar approach
emphasizes the unique sequences in the database and their combinations [14]. The database filtering technology (DFT) is an ab
initio approach, thereby providing another avenue [153]. The database-derived parameters are useful to make peptide mimics [160] or
to design even shorter peptides to decrease the production cost
[6]. Our expansion of the DFT from in silico filtering to in vitro
and in vivo filtering establishes a pipeline for peptide discovery
[154]. This idea can be harnessed to establish a pipeline of
machine-learning predictions to accelerate peptide discovery. When
quantitative MIC values are used to train an ML algorithm, it becomes
possible to rank peptide activity and identify the most potent
sequences [161]. Likewise, a subsequent counterselection can be
conducted by ranking peptide toxicity to host cells (Table 10) so
that less toxic peptides can be selected for experimental validation.
Ultimately, one may be able to generate an expert system that
automatically designs and produces personalized antimicrobials
with a designed activity spectrum and molecular target, so that patients
can be treated for an infection caused by a particular pathogen. The multiple functions of AMPs annotated in the APD3 imply other potential applications as well.
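The rank-then-counterselect idea described above can be written as a short Python sketch. It assumes that a predicted MIC and a predicted toxicity score are already available for each peptide. The `Candidate` class, the `select_candidates` function, the peptide names, and all numerical values are hypothetical and serve only to show the two-stage logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    predicted_mic: float       # lower value = more potent
    predicted_toxicity: float  # higher value = more toxic to host cells


def select_candidates(candidates: List[Candidate], potency_k: int,
                      final_n: int) -> List[Candidate]:
    """Stage 1: keep the potency_k most potent peptides (lowest predicted MIC).
    Stage 2: among those, keep the final_n least toxic."""
    most_potent = sorted(candidates, key=lambda c: c.predicted_mic)[:potency_k]
    return sorted(most_potent, key=lambda c: c.predicted_toxicity)[:final_n]


# Made-up values for four hypothetical peptides.
pool = [
    Candidate("pepA", predicted_mic=2.0, predicted_toxicity=0.9),
    Candidate("pepB", predicted_mic=4.0, predicted_toxicity=0.1),
    Candidate("pepC", predicted_mic=8.0, predicted_toxicity=0.2),
    Candidate("pepD", predicted_mic=64.0, predicted_toxicity=0.0),
]
chosen = select_candidates(pool, potency_k=3, final_n=2)
chosen_names = [c.name for c in chosen]
```

Here the most potent peptide is dropped in the second stage because of its high predicted toxicity, and the least toxic peptide never reaches that stage because it is not potent enough. The cutoffs `potency_k` and `final_n` would be set by the capacity of the experimental validation step.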
Acknowledgments
This study was supported by the Joint Warfighter Medical Research
Program (JWMRP) JW200188 (MVH), the NIH grant
R01GM138552, and the University of Nebraska Collaborative
Initiation Grant (GW). Thanks to Fahad Alsaab and Maxwell
Tabarrok for assistance with the hemolytic data.
References
1. Boucher HW, Talbot GH, Bradley JS,
Edwards JE, Gilbert D, Rice LB, Scheld M,
Spellberg B, Bartlett J (2009) Bad bugs, no
drugs: no ESKAPE! An update from the
Infectious Diseases Society of America. Clin
Infect Dis 48:1–12
2. O’Neill J (2016) Tackling drug-resistant
infections globally: final report and recommendations. The Review on Antimicrobial
Resistance, Wellcome Trust, HM Government
3. Boman HG (2003) Antibacterial peptides:
basic facts and emerging concepts. J Intern
Med 254:197–215
4. Mangoni ML, McDermott AM, Zasloff M
(2016) Antimicrobial peptides and wound
healing: biological and therapeutic considerations. Exp Dermatol 25:167–173
5. Hancock REW, Sahl HG (2006) Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nat Biotechnol 24:1551–1557
6. Lakshmaiah Narayana J, Mishra B,
Lushnikova T, Wu Q, Chhonker YS,
Zhang Y, Zarena D, Salnikov ES, Dang X,
Wang F, Murphy C, Foster KW, Gorantla S,
Bechinger B, Murry DJ, Wang G (2020) Two
distinct amphipathic peptide antibiotics with
systemic efficacy. Proc Natl Acad Sci U S A
117:19446–19454
7. Wang G, Narayana JL, Mishra B, Zhang Y,
Wang F, Wang C, Zarena D, Lushnikova T,
Wang X (2019) Design of Antimicrobial Peptides: Progress made with human cathelicidin
LL-37. Adv Exp Med Biol 1117:215–240
8. Browne K, Chakraborty S, Chen R, Willcox
MD, Black DS, Walsh WR, Kumar N (2020)
A new era of antibiotics: the clinical potential
of antimicrobial peptides. Int J Mol Sci 21:
7047
9. Wang Z, Wang G (2004) APD: the antimicrobial peptide database. Nucleic Acids Res 32:
D590–D592
10. Wang G, Li X, Wang Z (2009) The updated
antimicrobial peptide database and its application in peptide design. Nucleic Acids Res 37:
D933–D937
11. Wang G, Li X, Wang Z (2016) APD3: the
antimicrobial peptide database as a tool for
research and education. Nucleic Acids Res
44:D1087–D1093
12. Kreutzberger MA, Pokorny A, Almeida PF
(2017) Daptomycin-Phosphatidylglycerol
domains in lipid membranes. Langmuir 33:
13669–13679
13. Wang G (2020) The antimicrobial peptide
database provides a platform for decoding
the design principles of naturally occurring
antimicrobial peptides. Protein Sci 29(1):
8–18
14. Loose C, Jensen K, Rigoutsos I, Stephanopoulos G (2006) A linguistic model for the
rational design of antimicrobial peptides.
Nature 443(7113):867–869
15. Wang G (2012) Post-translational modifications of natural antimicrobial peptides and
strategies for peptide engineering. Curr Biotechnol 1:72–79
16. Wang G, Mishra B, Lau K, Lushnikova T,
Golla R, Wang X (2015) Antimicrobial peptides in 2014. Pharmaceuticals (Basel) 8:
123–150
17. Lata S, Sharma BK, Raghava GP (2007) Analysis and prediction of antibacterial peptides.
BMC Bioinformatics 8:263
18. Thomas S, Karnik S, Barai RS, Jayaraman VK,
Idicula-Thomas S (2010) CAMP: a useful
resource for research on antimicrobial peptides. Nucleic Acids Res 38:D774–D780
19. Wang G (2015) Improved methods for classification, prediction, and design of antimicrobial peptides. Methods Mol Biol 1268:43–66
20. Wang G (2010) Antimicrobial peptides: discovery, design and novel therapeutic strategies, 2nd edn (published in 2017). CABI, England
21. Gudmundsson GH, Agerberth B, Odeberg J,
Bergman T, Olsson B, Salcedo R (1996) The
human gene FALL39 and processing of the
cathelin precursor to the antibacterial peptide
LL-37 in granulocytes. Eur J Biochem 238:
325–332
22. Sørensen O, Arnljots K, Cowland JB, Bainton
DF, Borregaard N (1997) The human antibacterial cathelicidin, hCAP-18, is synthesized in myelocytes and metamyelocytes and
localized to specific granules in neutrophils.
Blood 90:2796–2803
23. Sørensen OE, Gram L, Johnsen AH,
Andersson E, Bangsbøll S, Tjabringa GS,
Hiemstra PS, Malm J, Egesten A, Borregaard
N (2003) Processing of seminal plasma
hCAP-18 to ALL-38 by gastricsin: a novel
mechanism of generating antimicrobial peptides in vagina. J Biol Chem 278(31):
28540–28546
24. de Jong A, van Heel AJ, Kok J, Kuipers OP
(2010) BAGEL2: mining for bacteriocins in
genomic data. Nucleic Acids Res 38:
W647–W651
25. Blin K, Kazempour D, Wohlleben W, Weber T
(2014) Improved lanthipeptide detection and
prediction for antiSMASH. PLoS One 9(2):
e89420
26. Yount NY, Weaver DC, de Anda J, Lee EY,
Lee MW, Wong GCL, Yeaman MR (2020)
Discovery of novel type II Bacteriocins using
a new high-dimensional Bioinformatic algorithm. Front Immunol 11:1873
27. Fjell CD, Hancock RE, Cherkasov A (2007)
AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics 23:1148–1155
28. Dos Santos-Silva CA, Zupin L, Oliveira-Lima M, Vilela LMB, Bezerra-Neto JP,
Ferreira-Neto JR, Ferreira JDC, de Oliveira-Silva RL, Pires CJ, Aburjaile FF, de Oliveira
MF, Kido EA, Crovella S, Benko-Iseppon AM
(2020) Plant antimicrobial peptides: state of
the art, in silico prediction and perspectives in
the omics era. Bioinform Biol Insights 14:
1177932220952739
29. Jia HP, Mills JN, Barahmand-Pour F,
Nishimura D, Mallampali RK, Wang G,
Wiles K, Tack BF, Bevins CL, McCray PB Jr
(1999) Molecular cloning and characterization of rat genes encoding homologues of
human beta-defensins. Infect Immun 67:
4827–4833
30. Wang CK, Kaas Q, Chiche L, Craik DJ (2008)
CyBase: a database of cyclic protein sequences
and structures, with applications in protein
discovery and engineering. Nucleic Acids Res
36:D206–D210
31. Yount NY, Andrés MT, Fierro JF, Yeaman
MR (2007) The gamma-core motif correlates
with antimicrobial activity in cysteine-containing kaliocin-1 originating from transferrins. Biochim Biophys Acta 1768(11):
2862–2872
32. Wang G (2013) Database-guided discovery of
potent peptides to combat HIV-1 or superbugs. Pharmaceuticals (Basel) 6(6):728–758
33. Pirtskhalava M et al (2021) DBAASP v3:
database of antimicrobial/cytotoxic activity
and structure of peptides as a resource for
development of new therapeutics. Nucleic
Acids Res 49:D288–D297
34. Seebah S et al (2007) Defensins knowledgebase: a manually curated database and information source focused on the defensins family
of antimicrobial peptides. Nucleic Acids Res
35:D265–D268
35. Di Luca M et al (2015) BaAMPs: the database
of biofilm-active antimicrobial peptides. Biofouling 31:193–199
36. Hammami R, Zouhir A, Le Lay C, Ben
Hamida J, Fliss I (2010) BACTIBASE second
release: a database and tool platform for bacteriocin characterization. BMC Microbiol 10:
22
37. Novković M, Simunić J, Bojović V, Tossi A,
Juretić D (2012) DADP: the database of
anuran defense peptides. Bioinformatics 28:
1406–1407
38. Kang X et al (2019) DRAMP 2.0, an updated
data repository of antimicrobial peptides. Sci
Data 6:148
39. Whitmore L, Wallace BA (2004) The Peptaibol database: a database for sequences and
structures of naturally occurring peptaibols.
Nucleic Acids Res 32:D593–D594
40. Zhao X, Wu H, Lu H, Li G, Huang Q
(2013) LAMP: a database linking antimicrobial peptides. PLoS One 8:e66557
41. Piotto SP, Sessa L, Concilio S, Iannelli P
(2012) YADAMP: yet another database of
antimicrobial peptides. Int J Antimicrob
Agents 39:346–351
42. Hammami R, Ben Hamida J, Vergoten G,
Fliss I (2009) PhytAMP: a database dedicated
to antimicrobial plant peptides. Nucleic Acids
Res 37(Database issue):D963–D968
43. Gómez EA, Giraldo P, Orduz S (2017) InverPep: a database of invertebrate antimicrobial
peptides. J Glob Antimicrob Resist 8:13–17
44. Qureshi A, Thakur N, Kumar M (2013)
HIPdb: a database of experimentally validated
HIV inhibiting peptides. PLoS One 8:e54908
45. Li J, Qu X, He X, Duan L, Wu G, Bi D,
Deng Z, Liu W, Ou HY (2012) ThioFinder:
a web-based tool for the identification of thiopeptide gene clusters in DNA sequences.
PLoS One 7(9):e45878
46. Wu H, Lu H, Huang J et al (2012) EnzyBase:
a novel database for enzybiotic studies. BMC
Microbiol 12(1):54
47. Mehta D, Anand P, Kumar V, Joshi A,
Mathur D, Singh S, Tuknait A,
Chaudhary K, Gautam SK, Gautam A, Varshney GC, Raghava GP (2014) ParaPep: a web
resource for experimentally validated antiparasitic peptide sequences and their structures.
Database 2014:bau051
48. Jhong JH, Chi YH, Li WC et al (2019)
dbAMP: an integrated resource for exploring
antimicrobial peptides with functional activities and physicochemical properties on transcriptome and proteome data. Nucleic Acids
Res 47:D285–D297
49. Usmani SS, Kumar R, Kumar V, Singh S,
Raghava GPS (2018) AntiTbPdb: a knowledgebase of anti-tubercular peptides. Database (Oxford) 2018:bay025
50. Brahmachary M, Krishnan SP, Koh JL, Khan
AM, Seah SH, Tan TW, Brusic V, Bajic VB
(2004) ANTIMIC: a database of antimicrobial sequences. Nucleic Acids Res 32(Database issue):D586–D589
51. Wang G (2015) Database resources dedicated
to antimicrobial peptides. In: Chen C, Yan X,
Jackson CR (eds) Antimicrobial resistance and
food safety. Academic Press, Cambridge,
Massachusetts, pp 365–384
52. Xiao X, Wang P, Lin WZ, Jia JH, Chou KC
(2013) iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides
and their functional types. Anal Biochem
436:168–177
53. Waghu FH, Barai RS, Gurung P, Idicula-Thomas S (2016) CAMPR3: a database on
sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res 44
(D1):D1094–D1097
54. Wang P, Hu L, Liu G, Jiang N, Chen X, Xu J,
Zheng W, Li L, Tan M, Chen Z, Song H, Cai
YD, Chou KC (2011) Prediction of antimicrobial peptides based on sequence alignment
and feature selection methods. PLoS One
6(4):e18476
55. Torrent M, Di Tommaso P, Pulido D, Nogués
MV, Notredame C, Boix E, Andreu D
(2012) AMPA: an automated web server for
prediction of protein antimicrobial regions.
Bioinformatics 28(1):130–131
56. Fernandes FC, Rigden DJ, Franco OL (2012)
Prediction of antimicrobial peptides based on
the adaptive neuro-fuzzy inference system
application. Biopolymers 98(4):280–287
57. Mooney C, Haslam NJ, Holton TA,
Pollastri G, Shields DC (2013) PeptideLocator: prediction of bioactive peptides in protein
sequences. Bioinformatics 29(9):1120–1126
58. Ng XY, Rosdi BA, Shahrudin S (2015) Prediction of antimicrobial peptides based on
sequence alignment and support vector
machine-pairwise algorithm utilizing
LZ-complexity. Biomed Res Int 2015:
212715
59. Lee HT, Lee CC, Yang JR et al (2015) A
large-scale structural classification of antimicrobial peptides. Biomed Res Int 2015:
475062
60. Lin W, Xu D (2016) Imbalanced multi-label
learning for identifying antimicrobial peptides
and their functional types. Bioinformatics
32(24):3745–3752
61. Meher PK, Sahu TK, Saini V, Rao AR (2017)
Predicting antimicrobial peptides with
improved accuracy by incorporating the compositional, physico-chemical and structural
features into Chou’s general PseAAC. Sci
Rep 7:42362
62. Bhadra P, Yan J, Li J, Fong S, Siu SWI (2018)
AmPEP: sequence-based prediction of
antimicrobial peptides using distribution patterns of amino acid properties and random
forest. Sci Rep 8:1697
63. Veltri D, Kamath U, Shehu A (2018) Deep
learning improves antimicrobial peptide recognition. Bioinformatics 34:2740–2747
64. Agrawal P, Raghava GPS (2018) Prediction of
antimicrobial potential of a chemically modified peptide from its tertiary structure. Front
Microbiol 9:2551
65. Gull S, Shamim N, Minhas F (2019) AMAP:
hierarchical multi-label prediction of biologically active and antimicrobial peptides. Comput Biol Med 107:172–181
66. Feng P, Wang Z, Yu X (2019) Predicting
antimicrobial peptides by using increment of
diversity with quadratic discriminant analysis
method. IEEE/ACM Trans Comput Biol
Bioinform 16:1309–1312
67. Chung CR, Kuo TR, Wu LC et al (2019)
Characterization and identification of antimicrobial peptides with different functional
activities. Brief Bioinform:bbz043.
https://doi.org/10.1093/bib/bbz043
68. Gull S, Minhas FUAA (2020) AMP0: species-specific prediction of anti-microbial peptides
using zero and few shot learning. IEEE/ACM
Trans Comput Biol Bioinform.
https://doi.org/10.1109/TCBB.2020.2999399
69. Tripathi V, Tripathi P (2020) Detecting antimicrobial peptides by exploring the mutual
information of their sequences. J Biomol
Struct Dyn 38:5037–5043
70. Yan J, Bhadra P, Li A, Sethiya P, Qin L, Tai
HK, Wong KH, Siu SWI (2020) Deep-AmPEP30: improve short antimicrobial peptides prediction with deep learning. Mol Ther
Nucleic Acids 20:882–894
71. Fu H, Cao Z, Li M et al (2020) ACEP:
improving antimicrobial peptides recognition
through automatic feature fusion and amino
acid embedding. BMC Genomics 21:597
72. Kavousi K, Bagheri M, Behrouzi S et al
(2020) IAMPE: NMR-assisted computational prediction of antimicrobial peptides. J
Chem Inf Model 60:4691–4701
73. Santos-Junior CD, Pan S, Zhao XM et al
(2020) Macrel: antimicrobial peptide screening in genomes and metagenomes. PeerJ 8:
e10555
74. Youmans M, Spainhour JCG, Qiu P (2020)
Classification of antibacterial peptides using
long short-term memory recurrent neural
networks. IEEE/ACM Trans Comput Biol
Bioinform 17:1134–1140
75. Fingerhut L, Miller DJ, Strugnell JM et al
(2020) Ampir: an R package for fast
genome-wide prediction of antimicrobial peptides. Bioinformatics 36:5262–5263
76. Lawrence TJ, Carper DL, Spangler MK et al
(2020) amPEPpy 1.0: A portable and accurate
antimicrobial peptide prediction tool. Bioinformatics 37(14):2058–2060.
https://doi.org/10.1093/bioinformatics/btaa917
77. Lertampaiporn S, Vorapreeda T, Hongsthong
A et al (2021) Ensemble-AMPPred: robust
AMP prediction and recognition using the
ensemble learning method with a new hybrid
feature for differentiating AMPs. Genes
(Basel) 12:137
78. Berman HM, Westbrook J, Feng Z,
Gilliland G, Bhat TN, Weissig H, Shindyalov
IN, Bourne PE (2000) The Protein Data
Bank. Nucleic Acids Res 28(1):235–242
79. MacDougall A et al (2020) UniRule: a unified
rule resource for automatic annotation in the
UniProt knowledgebase. Bioinformatics
36(17):4643–4648
80. Porto WF, Pires ÁS, Franco OL (2017) Antimicrobial activity predictors benchmarking
analysis using shuffled and designed synthetic
peptides. J Theor Biol 426:96–103
81. Othman M, Ratna S, Tewari A, et al. (2017)
Classification and prediction of antimicrobial
peptides using N-gram representation and
machine learning. Proceedings of the 8th
ACM international conference on bioinformatics, computational biology, and health
informatics. Boston, Massachusetts, USA:
Association for Computing Machinery, 605
82. Mooney C, Haslam NJ, Pollastri G et al
(2012) Towards the improved discovery and
design of functional peptides: common features of diverse classes permit generalized prediction of bioactivity. PLoS One 7:e45012
83. Burdukiewicz M, Sidorczuk K, Rafacz D et al
(2020) Proteomic screening for prediction
and design of antimicrobial peptides with
AmpGram. Int J Mol Sci 21:4310
84. Kaplan N, Morpurgo N, Linial M (2007)
Novel families of toxin-like peptides in insects
and mammals: a computational approach. J
Mol Biol 369:553–566
85. Muller AT, Kaymaz AC, Gabernet G et al
(2016) Sparse neural network models of antimicrobial peptide-activity relationships. Mol
Inform 35:606–614
86. Schneider P, Muller AT, Gabernet G et al
(2017) Hybrid network model for "deep
learning" of chemical data: application to
antimicrobial peptides. Mol Inform 36
87. Su X, Xu J, Yin Y et al (2019) Antimicrobial
peptide identification using multi-scale convolutional network. BMC Bioinformatics 20:
730
88. Tripathi S et al (2015) Antiviral activity of the
human cathelicidin, LL-37, and derived peptides on seasonal and pandemic influenza A
viruses. PLoS One 10:e0124706
89. Barlow PG et al (2011) Antiviral activity and
increased host defense against influenza infection elicited by the human cathelicidin LL-37.
PLoS One 6:e25333
90. Wang G (2012) Natural antimicrobial peptides as promising anti-HIV candidates. Curr
Top Pept Protein Res 13:93–110
91. Wohlford-Lenane CL et al (2009) Rhesus
theta-defensin prevents death in a mouse
model of severe acute respiratory syndrome
coronavirus pulmonary disease. J Virol 83:
11385–11390
92. He M, Zhang H, Li Y, Wang G, Tang B,
Zhao J, Huang Y, Zheng J (2018)
Cathelicidin-derived antimicrobial peptides
inhibit Zika virus through direct inactivation
and interferon pathway. Front Immunol 9:
722
93. Yu Y et al (2020) Engineered human cathelicidin antimicrobial peptides inhibit Ebola
virus infection. iScience 23:100999
94. Thakur N, Qureshi A, Kumar M (2012)
AVPpred: collection and prediction of highly
effective antiviral peptides. Nucleic Acids Res
40:W199–W204
95. Qureshi A, Thakur N, Tandon H, Kumar M
(2014) AVPdb: a database of experimentally
validated antiviral peptides targeting medically important viruses. Nucleic Acids Res
42:D1147–D1153
96. Chowdhury AS et al (2020) Better understanding and prediction of antiviral peptides
through primary and secondary structure feature importance. Sci Rep 10:19260
97. Tyagi A et al (2019) PlantAFP: a curated
database of plant-origin antifungal peptides.
Amino Acids 51:1561–1568
98. Agrawal P et al (2018) In silico approach for
prediction of antifungal peptides. Front
Microbiol 9:323
99. Manavalan B et al (2018) AIPpred: sequence-based prediction of anti-inflammatory peptides using random forest. Front Pharmacol
9:276
100. Manavalan B et al (2018) PIP-EL: a new
ensemble learning method for improved
Proinflammatory peptide predictions. Front
Immunol 9:1783
101. Usmani SS, Bhalla S, Raghava GPS (2018)
Prediction of Antitubercular peptides from
sequence information using ensemble classifier and hybrid features. Front Pharmacol 9:
954
102. Gupta K, Singh S, van Hoek ML (2015)
Short, synthetic cationic peptides have antibacterial activity against Mycobacterium smegmatis by forming pores in membrane and
synergizing with antibiotics. Antibiotics
(Basel) 4:358–378
103. Torres-Juarez F et al (2015) LL-37 immunomodulatory activity during Mycobacterium
tuberculosis infection in macrophages. Infect
Immun 83:4495–4503
104. Rao Muvva J et al (2019) Polarization of
human monocyte-derived cells with vitamin
D promotes control of Mycobacterium tuberculosis infection. Front Immunol 10:3157
105. Rivas-Santiago B et al (2013) Activity of
LL-37, CRAMP and antimicrobial peptide-derived compounds E2, E6 and CP26 against
Mycobacterium tuberculosis. Int J Antimicrob
Agents 41:143–148
106. Corrales-Garcia L et al (2013) Bacterial
expression and antibiotic activities of recombinant variants of human beta-defensins on
pathogenic bacteria and M. tuberculosis. Protein Expr Purif 89:33–43
107. Wong GC, O’Toole GA (2011) All
together now: integrating biofilm research
across disciplines. MRS Bull 36:339–342
108. O’Toole GA (2003) To build a biofilm. J
Bacteriol 185:2687–2689
109. O’Toole GA (2011) Microtiter dish biofilm
formation assay. J Vis Exp 47:2437
110. de la Fuente-Nunez C et al (2014) Broad-spectrum anti-biofilm peptide that targets a
cellular stress response. PLoS Pathog 10:
e1004152
111. de la Fuente-Nunez C et al (2012) Inhibition
of bacterial biofilm formation and swarming
motility by a small synthetic cationic peptide.
Antimicrob Agents Chemother 56:
2696–2704
112. Overhage J et al (2008) Human host defense
peptide LL-37 prevents bacterial biofilm formation. Infect Immun 76:4176–4182
113. Chung EMC et al (2017) Komodo dragon-inspired synthetic peptide DRGN-1 promotes
wound-healing of a mixed-biofilm infected
wound. NPJ Biofilms Microbiomes 3:9
114. Duplantier AJ, van Hoek ML (2013) The
human cathelicidin antimicrobial peptide
LL-37 as a potential treatment for Polymicrobial infected wounds. Front Immunol 4:143
115. Dean SN, Bishop BM, van Hoek ML (2011)
Susceptibility of Pseudomonas aeruginosa biofilm to alpha-helical peptides: D-enantiomer
of LL-37. Front Microbiol 2:128
116. Dean SN, Bishop BM, van Hoek ML (2011)
Natural and synthetic cathelicidin peptides
with anti-microbial and anti-biofilm activity
against Staphylococcus aureus. BMC Microbiol
11:114
117. Amer LS, Bishop BM, van Hoek ML (2010)
Antimicrobial and antibiofilm activity of
cathelicidins and short, synthetic peptides
against Francisella. Biochem Biophys Res
Commun 396:246–251
118. Sharma A et al (2016) dPABBs: a novel in
silico approach for predicting and designing
anti-biofilm peptides. Sci Rep 6:21839
119. Fallah F et al (2020) BIPEP: sequence-based
prediction of biofilm inhibitory peptides
using a combination of NMR and physicochemical descriptors. ACS Omega 5:
7290–7297
120. Gupta S et al (2016) Prediction of biofilm
inhibiting peptides: an in silico approach.
Front Microbiol 7:949
121. Rajput A, Thakur A, Sharma S, Kumar M
(2018) aBiofilm: a resource of anti-biofilm
agents and their potential implications in targeting antibiotic drug resistance. Nucleic
Acids Res 46:D894–D900
122. Bishop BM, Juba ML, Devine MC, Barksdale
SM, Rodriguez CA, Chung MC, Russo PS,
Vliet KA, Schnur JM, van Hoek ML (2015)
Bioprospecting the American alligator (Alligator mississippiensis) host defense peptidome. PLoS One 10:e0117394
123. Cherkasov A, Hilpert K, Jenssen H, Fjell CD,
Waldbrook M, Mullaly SC, Volkmer R, Hancock RE (2009) Use of artificial intelligence
in the design of small peptide antibiotics
effective against a broad spectrum of highly
antibiotic-resistant superbugs. ACS Chem
Biol 4(1):65–74
124. Wang X, Mishra B, Lushnikova T, Narayana
JL, Wang G (2018) Amino acid composition
determines peptide activity Spectrum and
hot-spot-based Design of Merecidin. Adv
Biosyst 2(5):1700259
125. Gupta S et al (2013) In silico approach for
predicting toxicity of peptides and proteins.
PLoS One 8:e73957
126. Gupta S et al (2015) Peptide toxicity prediction. Methods Mol Biol 1268:143–157
127. Gordon YJ et al (2005) Human cathelicidin
(LL-37), a multifunctional peptide, is
expressed by ocular surface epithelia and has
potent antibacterial and antiviral activity. Curr
Eye Res 30:385–394
128. Wang W et al (2017) Antimicrobial peptide
LL-37 promotes the viability and invasion of
skin squamous cell carcinoma by upregulating
YB-1. Exp Ther Med 14:499–506
129. Oliveira-Bravo M et al (2016) LL-37 boosts
immunosuppressive function of placenta-derived mesenchymal stromal cells. Stem
Cell Res Ther 7:189
130. Hosseini Z et al (2020) The human cathelicidin LL-37, a defensive peptide against rotavirus infection. Int J Pept Res Ther 26:911–919
131. Haisma EM et al (2014) LL-37-derived peptides eradicate multidrug-resistant Staphylococcus aureus from thermally wounded
human skin equivalents. Antimicrob Agents
Chemother 58:4411–4419
132. Barksdale SM, Hrifko EJ, van Hoek ML
(2017) Cathelicidin antimicrobial peptide
from Alligator mississippiensis has antibacterial activity against multi-drug resistant Acinetobacter baumanii and Klebsiella pneumoniae.
Dev Comp Immunol 70:135–144
133. Barksdale SM, Hrifko EJ, Chung EM, van
Hoek ML (2016) Peptides from American
alligator plasma are antimicrobial against
multi-drug resistant bacterial pathogens
including Acinetobacter baumannii. BMC
Microbiol 16:189
134. Hitt SJ, Bishop BM, van Hoek ML (2020)
Komodo-dragon cathelicidin-inspired peptides are antibacterial against carbapenem-resistant Klebsiella pneumoniae. J Med Microbiol 69:1262–1272
135. de Latour FA et al (2010) Antimicrobial activity of the Naja atra cathelicidin and related
small peptides. Biochem Biophys Res Commun 396:825–830
136. van Dijk A et al (2009) Identification of
chicken cathelicidin-2 core elements involved
in antibacterial and immunomodulatory
activities. Mol Immunol 46:2465–2473
137. Nizet V et al (2001) Innate antimicrobial peptide protects the skin from invasive bacterial
infection. Nature 414:454–457
138. Gao J et al (2020) Design of a sea Snake
Antimicrobial Peptide Derivative with therapeutic potential against drug-resistant bacterial infection. ACS Infect Dis 6:2451–2467
139. Win TS et al (2017) HemoPred: a web server
for predicting the hemolytic activity of peptides. Future Med Chem 9:275–291
140. Chaudhary K et al (2016) A web server and
Mobile app for computing hemolytic potency
of peptides. Sci Rep 6:22843
141. Timmons PB, Hewage CM (2020) HAPPENN is a novel tool for hemolytic activity
prediction for therapeutic peptides which
employs neural networks. Sci Rep 10:10869
142. Oren Z et al (1999) Structure and organization of the human antimicrobial peptide
LL-37 in phospholipid membranes: relevance
to the molecular basis for its non-cell-selective
activity. Biochem J 341(Pt 3):501–513
143. Ciornei CD, Sigurdardottir T,
Schmidtchen A, Bodelsson M (2005) Antimicrobial and chemoattractant activity, lipopolysaccharide neutralization, cytotoxicity, and
inhibition by serum of analogs of human
cathelicidin LL-37. Antimicrob Agents Chemother 49:2845–2850
144. Al-Adwani S et al (2020) Studies on citrullinated LL-37: detection in human airways,
antibacterial effects and biophysical properties. Sci Rep 10:2376
145. Rajasekaran G, Kim EY, Shin SY (2017)
LL-37-derived membrane-active FK-13 analogs possessing cell selectivity, anti-biofilm
activity and synergy with chloramphenicol
and anti-inflammatory activity. Biochim Biophys Acta Biomembr 1859:722–733
146. Luo Y et al (2017) The naturally occurring
host defense peptide, LL-37, and its
truncated mimetics KE-18 and KR-12 have
selected biocidal and Antibiofilm activities
against Candida albicans, Staphylococcus
aureus, and Escherichia coli in vitro. Front
Microbiol 8:544
147. Koro C et al (2016) Carbamylated LL-37 as a
modulator of the immune response. Innate
Immun 22:218–229
148. Murakami M et al (2004) Postsecretory processing generates multiple cathelicidins for
enhanced topical antimicrobial defense. J
Immunol 172:3070–3077
149. Chung MC, Dean SN, van Hoek ML (2015)
Acyl carrier protein is a bacterial cytoplasmic
target of cationic antimicrobial peptide
LL-37. Biochem J 470:243–253
150. Limoli DH et al (2014) Cationic antimicrobial peptides promote microbial mutagenesis
and pathoadaptation in chronic infections.
PLoS Pathog 10:e1004083
151. Limoli DH, Wozniak DJ (2014) Mutagenesis
by host antimicrobial peptides: insights into
microbial evolution during chronic infections. Microb Cell 1:247–249
37
152. Oikawa K et al (2018) Screening of a cellpenetrating peptide library in Escherichia
coli: relationship between cell penetration efficiency and cytotoxicity. ACS Omega 3:
16489–16499
153. Mishra B, Wang G (2012) Ab initio design of
potent anti-MRSA peptides based on database filtering technology. J Am Chem Soc
134(30):12426–12429
154. Mishra B, Lakshmaiah Narayana J,
Lushnikova T, Wang X, Wang G (2019)
Low cationicity is important for systemic
in vivo efficacy of database-derived peptides
against drug-resistant Gram-positive pathogens. Proc Natl Acad Sci U S A 116(27):
13517–13522
155. Beheshtirouy S, Mirzaei F, Eyvazi S, Tarhriz V
(2020) Recent advances on therapeutic peptides for breast cancer treatment. Curr Protein Pept Sci. https://doi.org/10.2174/
1389203721999201117123616
156. Marqus S, Pirogova E, Piva TJ (2017) Evaluation of the use of therapeutic peptides for
cancer treatment. J Biomed Sci 24:21
157. Wang G (2020) Bioinformatic analysis of
1000 amphibian antimicrobial peptides
uncovers multiple length-dependent correlations for peptide design and prediction. Antibiotics (Basel) 9(8):491
158. Wang G, Watson KM, Peterkofsky A, Buckheit RW Jr (2010) Identification of novel
human
immunodeficiency
virus
type
1-inhibitory peptides based on the antimicrobial peptide database. Antimicrob Agents
Chemother 54(3):1343–1346
159. Menousek J, Mishra B, Hanke ML, Heim CE,
Kielian T, Wang G (2012) Database screening
and in vivo efficacy of antimicrobial peptides
against methicillin-resistant Staphylococcus
aureus USA300. Int J Antimicrob Agents
39(5):402–406
160. Dong Y, Lushnikova T, Golla RM, Wang X,
Wang G (2017) Small molecule mimics of
DFTamP1, a database designed antistaphylococcal peptide. Bioorg Med Chem
25(3):864–869
161. Witten J, Witten Z (2019) Deep learning
regression model for antimicrobial peptide
design bioRxiv A preprint posted on July
12, 2019
Chapter 2
Tools for Characterizing Proteins: Circular Variance, Mutual Proximity, Chameleon Sequences, and Subsequence Propensities
Mihaly Mezei
Abstract
For the characterization of various aspects of protein structures, four useful concepts are discussed:
chameleon sequences, circular variance, mutual proximity, and a subsequence-based foldability score.
These concepts were used in estimating foldability of globular, intrinsically disordered and fold-switching
proteins, properties of protein–protein interfaces, quantifying sphericity, helping to improve protein–protein docking scores, and estimating the effect of mutations on stability. A conjecture about the Achilles’ heel
of proteins is presented as well.
Key words Circular variance, Mutual proximity, Chameleon sequence, Amino acid propensity, Foldability score
1 Introduction
The study of proteins is made difficult by the fact that their structure
emerges from a polymer sequence that, at first glance, appears to be
random. However, the chance that a sequence chosen randomly from the
huge space of possible amino acid (AA) sequences results in a well-folded
protein is virtually zero [1], which suggests that beyond the seeming
randomness there must be a plethora of information. On the other hand,
the existence of chameleon sequences [2] and fold-switching sequences [3]
points to the limits of what can be inferred from sequence alone. One
aim of this chapter is to discuss some of the work aimed at teasing out
of a sequence the information about its foldability.
The other difficulty of dealing with protein structures is their
irregular shape. While they are often able to form crystals, it is not
always the case, and, in any event, the regularity required by the
formation of periodic structures is achieved by filling the empty
space, mostly with water. Characterizing irregular shapes is not a
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_2,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
simple geometric problem, and the other aim of this chapter is to
show how the concept of circular variance (CV) can help in this
endeavor.
Most proteins function as part of a complex. The interface
between proteins thus has to be able to maintain the complex;
being able to characterize the interface is necessary for understanding its formation. One aspect of characterizing the interface is the
identification of contact pairs, usually done using distance thresholds. A further aim of this chapter is to show the usefulness of the
concept that defines contacts as mutually proximal pairs of atoms.
In addition, CV was also shown to be helpful for characterizing
geometric properties of the interfaces.
2 Materials and Methods
2.1 Data Sets Used
Most studies described in this chapter relied on the Protein Data
Bank (PDB) [4]. The PDB is the depository of protein structures,
carefully annotated for structural features (e.g., helix or sheet), as
well as measures of the reliability of the data (i.e., temperature
factors and resolution). Unresolved regions are also indicated.
The PDB provided a file called ss.txt that lists the sequences of
394869 protein chains (domains) of the structures in the PDB as of
2018. This set is referred to as the PDB set and was used for the
chameleon search and for the establishment of various statistics. For
the test of foldability predictions, a new set of 40351 chains was
obtained from structures deposited after 2018, i.e., not included in
the PDB set. This set is referred to as the new PDB set. Both the
PDB and the new PDB set were filtered to ensure that no pair in the
set has more than 50% sequence identity; the filtered sets are
referred to as the PDB50 and newPDB50 sets, containing
35667 and 4735 sequences, resp. For the study of protein–protein
interfaces [5], 1172 protein complexes were selected from the
PDB, referred to as the PDB-C set. For each complex, only the
pair with the largest number of contacts was used as defined by the
biological oligomer annotation. The sequences of intrinsically disordered proteins (IDPs) were obtained from the DisProt [6] data
set, the list of fold-switching (F-S) sequences was obtained from the
paper of Porter and Looger [3], and the data on mutations were
obtained from the data set compiled by Pucci et al. [7]. For the test
of the contact potential, protein complexes were selected from the
DOCKGROUND [8] and ZLAB [9] benchmark sets. Docked
protein complexes were generated by ClusPro [10] and
PatchDock [11].
2.2 Circular Variance
The circular variance (CV) [12] is a measure of the spread of a set of
angles. For a set of n angles {φi}, CV is defined as:
$$\mathrm{CV} = 1 - \frac{1}{n}\sqrt{\left(\sum_{i=1}^{n}\sin\varphi_i\right)^{2} + \left(\sum_{i=1}^{n}\cos\varphi_i\right)^{2}} \qquad (1)$$
Since cos φi and sin φi are the x and y components of a unit
vector $\vec{e}_i$, the definition of CV generalizes to three dimensions as:
$$\mathrm{CV} = 1 - \frac{1}{n}\left|\sum_{i=1}^{n}\vec{e}_i\right| \qquad (2)$$
For a set of vectors of arbitrary length $\vec{v}_i$, the weighted circular
variance CVw can be defined [13] as:
$$\mathrm{CV}_w = 1 - \left|\sum_{i=1}^{n}\vec{v}_i\right| \Big/ \sum_{i=1}^{n}\left|\vec{v}_i\right| \qquad (3)$$
The calculation of CV and CVw has been implemented into the
program Simulaid [14] available at the URL
https://mezeim01.u.hpc.mssm.edu/simulaid.
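As an illustration of Eqs. (1)–(3), the following minimal Python sketch
computes CV and CVw with the standard library only. It is not the Simulaid
implementation; the function names are hypothetical, angles are assumed to
be given in radians, and vectors are assumed to be nonzero 3-tuples.

```python
import math

def circular_variance_angles(angles):
    """Eq. (1): CV of a set of angles (radians, assumed)."""
    n = len(angles)
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return 1.0 - math.hypot(s, c) / n

def circular_variance(vectors):
    """Eq. (2): CV of 3D vectors, each normalized to unit length first."""
    total = [0.0, 0.0, 0.0]
    for v in vectors:
        length = math.sqrt(sum(x * x for x in v))
        for k in range(3):
            total[k] += v[k] / length
    return 1.0 - math.sqrt(sum(x * x for x in total)) / len(vectors)

def weighted_circular_variance(vectors):
    """Eq. (3): CVw, where each vector contributes with its own length."""
    total = [sum(v[k] for v in vectors) for k in range(3)]
    norm_sum = sum(math.sqrt(sum(x * x for x in v)) for v in vectors)
    return 1.0 - math.sqrt(sum(x * x for x in total)) / norm_sum
```

Vectors pointing in the same direction give a value near 0, while vectors
that cancel each other give a value near 1.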
2.3 Mutual Proximity
Mutual proximity is used to provide a parameter-free definition of
contacts between two proteins [15]. Figure 1 demonstrates the
relation between mutual proximity and contact: red arrows are
placed between the two mutually proximal atoms.
Note that the concept was also found to be useful in designing
Monte Carlo moves satisfying microscopic reversibility [16].
2.4 Definition of Surface and Interface Atoms and Their Properties
Surface atoms were defined by two criteria: (a) exposed accessible
surface fraction exceeds 3% and (b) the CV calculated with respect
to the rest of the protein atoms is less than 0.8. Since the crystal
structures did not contain hydrogen coordinates, only heavy atoms
were considered. The CV filter was needed as some proteins
contained large enough cavities that internal atoms had significant
accessible surface.
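The ruggedness indicator of a surface atom i, RG(i), is
defined as:
$$RG(i) = \sum_{k:\,|\vec{r}_k - \vec{r}_i| < R_N} \frac{\left|CV(i) - CV(k)\right|}{\left|N_n(i)\right|} \qquad (4)$$
where $\vec{r}_k$ is the position of a surface neighbor atom k of i, $R_N$ is
the neighbor distance cutoff, and Nn(i) is the number
of surface neighbors of atom i.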
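A minimal sketch of how the per-atom CV and the ruggedness indicator of
Eq. (4) could be evaluated is given below. It assumes that the CV of an atom
is obtained from Eq. (2) using the unit vectors pointing to all the other
atoms, that the list of surface atoms has already been determined, and that
the cutoff R_N is supplied by the user (its value is not specified here).
The function names are hypothetical and do not correspond to Simulaid
routines.

```python
import math

def atom_cv(coords, i):
    """CV of atom i from unit vectors pointing to all other atoms (Eq. 2).
    Assumes that no two atoms share the same coordinates."""
    xi, yi, zi = coords[i]
    sx = sy = sz = 0.0
    for j, (x, y, z) in enumerate(coords):
        if j == i:
            continue
        dx, dy, dz = x - xi, y - yi, z - zi
        d = math.sqrt(dx * dx + dy * dy + dz * dz)
        sx += dx / d
        sy += dy / d
        sz += dz / d
    return 1.0 - math.sqrt(sx * sx + sy * sy + sz * sz) / (len(coords) - 1)

def ruggedness(coords, cv, surface, i, r_n):
    """Eq. (4) for surface atom i; `surface` lists the surface atom indices,
    `cv` holds the precomputed CV of every atom, `r_n` is the cutoff R_N."""
    neighbors = [k for k in surface
                 if k != i and math.dist(coords[k], coords[i]) < r_n]
    if not neighbors:
        return 0.0  # convention chosen here for atoms without neighbors
    return sum(abs(cv[i] - cv[k]) for k in neighbors) / len(neighbors)
```

For three atoms on a line, the two end atoms have CV = 0 (all vectors point
the same way) and the middle atom has CV = 1, so the middle atom is maximally
rugged with respect to its two neighbors.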
The assignment of interface atoms relied on the mutual proximity
criterion: an atom is considered to be on the interface (a) if it
forms a contact (i.e., it is on the list of mutually proximal
atom pairs) or (b) if it is within 4 Å of a contact atom. The contact
propensity PRi,j of AA pair {i,j} is defined as:
Fig. 1 Illustration of the relation between mutual proximity and contact. Red
two-headed arrows are drawn between mutually proximal pairs of atoms
$$PR_{i,j} = \frac{N_{i,j} \Big/ \sum_{i,j=1}^{20} N_{i,j}}{P_i\,P_j\,\left(2 - \delta_{i,j}\right)} \qquad (5)$$
where Pi is the probability of AA i being on the surface, Ni,j is the
number of [i,j] pairs found in the data set, and δi,j is the Kronecker
delta.
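The contact and interface assignment can be sketched as follows, under the
assumption that two atoms, one from each protein, are mutually proximal when
each is the nearest atom of the other protein to its partner. It is further
assumed that the 4 Å criterion is applied to contact atoms of the same
protein. The brute-force nearest-neighbor search and the function names are
illustrative only.

```python
import math

def nearest(atom, others):
    """Index of the atom in `others` that is closest to `atom`."""
    return min(range(len(others)), key=lambda k: math.dist(atom, others[k]))

def mutually_proximal_pairs(a, b):
    """Pairs (i, j) such that b[j] is nearest to a[i] and a[i] is nearest to b[j]."""
    nearest_in_b = [nearest(p, b) for p in a]
    nearest_in_a = [nearest(q, a) for q in b]
    return [(i, j) for i, j in enumerate(nearest_in_b) if nearest_in_a[j] == i]

def interface_atoms(coords, contact_idx, cutoff=4.0):
    """Contact atoms plus all atoms of the same protein within `cutoff` of one."""
    contact = set(contact_idx)
    result = set(contact)
    for i, p in enumerate(coords):
        if any(math.dist(p, coords[c]) <= cutoff for c in contact):
            result.add(i)
    return sorted(result)
```

No distance threshold enters the definition of the contact pairs themselves;
the cutoff is used only to extend the interface around the contact atoms.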
2.5 Contact Score and Docking Score Correction
With the contact propensity of residue pairs {i,j}, PRi,j, known, a
contact score S and its normalized variant SN can be defined as:
$$S = \sum_{[i,j]} kT \ln PR_{i,j}; \qquad S_N = S/N_{ct} \qquad (6)$$
where the sum runs over the residue pairs [i,j] in contact and Nct is
the number of contacts in the complex.
A putative improvement to a docking score SD can be
defined as:
$$S_D' = S_D - w\,S \qquad (7)$$
where w is a scaling parameter. Analogous corrections can be made
with SN.
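As an illustration of Eqs. (6) and (7), the sketch below evaluates S, SN, and
the corrected docking score for a list of residue pairs in contact. The
propensity values used in the commented example are made-up placeholders, not
entries of Table 2; kT = 1 (i.e., scores in units of kT), the sign convention
of the correction, and the function names are assumptions.

```python
import math

def contact_scores(contacts, propensity, kT=1.0):
    """Eq. (6): return (S, SN) for a list of residue-type pairs in contact.
    `propensity` maps frozenset({i, j}) to PR_ij, so pair order is irrelevant."""
    s = sum(kT * math.log(propensity[frozenset(pair)]) for pair in contacts)
    return s, s / len(contacts)

def corrected_docking_score(s_d, s, w):
    """Eq. (7): docking score S_D corrected by the contact score S, weight w."""
    return s_d - w * s

# Hypothetical propensities, for illustration only
example_pr = {frozenset(("LEU", "ILE")): 2.0, frozenset(("LYS", "LYS")): 0.5}
s, s_n = contact_scores([("ILE", "LEU"), ("LEU", "ILE")], example_pr)
rescored = corrected_docking_score(-100.0, s, w=10.0)
```

With this convention, contacts with propensities above one lower the
corrected score, so the sketch is meaningful for docking scores where lower
is better.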
2.6 Foldability Scores
Based on the propensities of various p-tuples in folded proteins of
known structure, two kinds of scores were defined: SCp when the
statistics of p-tuples are adequate (i.e., p ≤ 5) and SSCp when the
statistics are sparse. For a sequence of length N + p − 1,
$$SC_p = \sum_{i=0}^{N} \ln\left(\frac{PN_i}{PR_i}\right) \Big/ N \qquad (8)$$
where PNi and PRi are the probabilities of finding the i-th p-tuple
in the PDB50 set and in the RANW set, resp. If the i-th p-tuple was
missing from the PDB50 set, PNi was set to 0.5/20^p.
Clearly, for SCp to be meaningful, there has to be ample statistics on
the p-tuples. For situations where this is not the case, the
simpler score SSCp was defined:
$$SSC_p = \left[\sum_{i=0}^{N} n_i\right] \Big/ N \qquad (9)$$
where ni is the number of occurrences of the p-tuple i in the
experimental set.
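The two scores can be sketched in a few lines of Python. The sketch assumes
that the probabilities PNi and PRi are estimated as relative p-tuple
frequencies in a folded and in a random reference set, that the sum runs over
the N p-tuples of the query sequence, and that a p-tuple missing from the
random set receives the same 0.5/20^p floor as one missing from the folded
set (the latter is an assumption made here to avoid division by zero). The
function names are hypothetical.

```python
import math
from collections import Counter

def tuple_counts(sequences, p):
    """Count all overlapping p-tuples in a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - p + 1):
            counts[seq[i:i + p]] += 1
    return counts

def sc_score(seq, p, folded_counts, random_counts):
    """Eq. (8): average log ratio of folded-set and random-set probabilities.
    Assumes len(seq) >= p."""
    n_folded = sum(folded_counts.values())
    n_random = sum(random_counts.values())
    floor = 0.5 / 20 ** p
    n = len(seq) - p + 1
    total = 0.0
    for i in range(n):
        t = seq[i:i + p]
        pn = folded_counts[t] / n_folded if folded_counts[t] else floor
        pr = random_counts[t] / n_random if random_counts[t] else floor  # assumed
        total += math.log(pn / pr)
    return total / n

def ssc_score(seq, p, folded_counts):
    """Eq. (9): average number of occurrences of the sequence's p-tuples."""
    n = len(seq) - p + 1
    return sum(folded_counts[seq[i:i + p]] for i in range(n)) / n
```

In practice `folded_counts` would be built once from the PDB50 set and
`random_counts` from the RANW set, and then reused for every query sequence.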
3 Applications
3.1 Chameleon Sequences
Chameleon sequences provide an important indication of the limits
of predicting the secondary structure of a protein from its
sequence. The number and length of known chameleon sequences
grew significantly as the number of known protein structures grew:
in 1998 the longest chameleons had seven residues and only three
such chameleons were found [17], while the recent search [2] found
chameleons up to length 11 (not considering some longer ones
found on the same protein in different structures) and many more of
the shorter ones. This suggests that the issue of chameleon
propensities can now be treated with reasonable precision since there
are enough data to reduce statistical fluctuations.
The recent work used all the sequences available at the time the
project was started. While this involved many similar sequences, use
of a reduced set was likely to result in missing chameleons. Thus, it
was important to develop an efficient algorithm since the complexity
of the brute-force approach is O(N^2). The algorithm used was
based on sorting at various steps, making its complexity
O(N log N) [2]; an implementation is available at the URL
https://mezeim01.u.hpc.mssm.edu/cham.
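The idea of replacing pairwise comparisons by sorting can be illustrated
with the following sketch. It is not the published implementation: it assumes
that a chameleon is a subsequence found entirely in helix (code H) in one
chain and entirely in strand (code E) in another, that secondary structure is
available as a string of one-letter codes aligned with the sequence, and that
a single window length is searched at a time. The function name is
hypothetical.

```python
from itertools import groupby

def find_chameleons(chains, length):
    """Return the subsequences of the given length that occur both as
    all-helix and as all-strand windows.
    `chains` is a list of (sequence, secondary_structure) string pairs."""
    records = []
    for seq, ss in chains:
        for i in range(len(seq) - length + 1):
            window = ss[i:i + length]
            if window == "H" * length or window == "E" * length:
                records.append((seq[i:i + length], window[0]))
    # Sorting brings identical subsequences together, so one linear pass
    # replaces the all-against-all comparison.
    records.sort()
    chameleons = []
    for subseq, group in groupby(records, key=lambda r: r[0]):
        if len({state for _, state in group}) == 2:
            chameleons.append(subseq)
    return chameleons
```

The cost is dominated by the sort, i.e., O(M log M) for M windows, instead of
the O(M^2) of comparing every window with every other one.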
Table 1 gives the chameleon propensities of each of the 20 AAs
(the % of residues in chameleons/% AA) for chameleons of length
5–8, the lengths with ample statistics. Two interesting observations
can be made: (1) the propensities vary significantly with length for
some of the AAs and (2) the residues with the highest chameleon
propensity (ALA, VAL, LEU) also have higher than average AA
propensities. This raises the question whether higher than average
chameleon propensities are simply a consequence of the high
AA propensities, since the probability of finding the same AA in two
different sequences is proportional to the square of its propensity. To
see whether this is the case, the Spearman rank correlation was also
calculated between the AA propensities and the chameleon
propensities of the chameleons of different lengths, also shown in
Table 1. The correlations are indeed positive, but quite small,
suggesting that there is more to chameleon propensities than just
the effect of the AA propensity.

Table 1
Chameleon propensities of chameleon sequences

                       Chameleon length
Residue                5         6         7         8         % AA
ALA                    1.0       1.5       1.6       1.6       8.0
CYS                    0.8       0.3       0.2       0.2       1.4
ASP                    0.6       0.5       0.5       0.4       5.6
GLU                    0.8       1.0       1.1       1.0       6.6
PHE                    1.4       0.9       0.8       0.8       3.9
GLY                    0.6       0.6       0.6       0.6       7.4
HIS                    0.7       0.3       0.3       0.8       2.7
ILE                    1.6       1.5       1.3       1.2       5.6
LYS                    0.8       0.8       0.9       1.1       5.9
LEU                    1.2       1.8       1.9       1.6       9.0
MET                    1.0       0.5       0.5       0.5       2.3
ASN                    0.7       0.4       0.4       0.6       4.2
PRO                    0.2       0.1       0.1       0.1       4.6
GLN                    0.9       0.7       0.6       0.9       3.8
ARG                    1.0       1.0       1.0       0.7       5.2
SER                    0.9       0.8       0.8       1.1       6.3
THR                    1.3       1.0       0.9       0.9       5.6
VAL                    1.8       2.1       2.0       2.0       7.0
TRP                    1.0       0.3       0.3       0.6       1.3
TYR                    1.6       0.9       0.8       0.6       3.4
# of chameleons        187,803   36,669    1822      79
Corr. with AA prop.    0.20      0.22      0.35      0.08
Furthermore, chameleon propensities were found not to be a
significant factor in fold-switching proteins [18]. The propensities of
chameleons in fold-switching proteins were only slightly larger:
15.6 vs. 13.7, 1.2 vs. 0.8, and 0.1 vs. 0.0, for lengths 5, 6, and 7, resp.
3.2 Shape and Smoothness of the Protein–Protein Interface
The atoms forming the interface in the complexes of the PDB-C set
were examined by asking two questions: (a) is the surface at the
interface more or less smooth than elsewhere and (b) is the interface
protruding relative to the rest of the surface or the opposite.
The smoothness (or lack thereof) was measured by the ruggedness
indicator RG defined by Eq. (4) above. The average RG values
of the interface atoms and of all surface atoms in the PDB-C set were
found to be 0.069 (s.d. = 0.007) and 0.066 (s.d. = 0.005), resp.
Since this difference is significant at the level p = 0.001, it can be
concluded that interaction surfaces are less smooth than the rest.
As for the shape of the interface, it can be characterized by
comparing the CV values of the atoms in the interface and of the
atoms elsewhere. Again using the PDB-C set, the average CV
values of atoms at the interface and of atoms at the surface were
found to be 0.454 (s.d. = 0.044) and 0.464 (s.d. = 0.029), resp.
While this difference is quite small, it is significant at the level
p = 0.001 according to the Student t test. Since a lower CV
corresponds to a more exposed atom, we can conclude
that the interaction surface is likely to be protruding, even if only
slightly.
3.3 Contact Statistics and Its Application
Correlated mutations have been proven to be an important tool in
analyzing protein–protein associations [19]. This suggests that the
residue pairs in contact are far from random, and there are specific
propensities of residue pairs to be in contact at protein–protein
interfaces. Indeed, the analysis of protein complexes in the
PDB-C set showed large variations in the propensities of AA pairs
to be in contact. Table 2 presents the propensities PRi,j of each AA
pair to be in contact. The large variations in these contact propensities indicate the importance of this data.
The contact propensities were used to help improve protein–protein
docking (PPD) results. When attempting to find the interface between
two proteins known to form a complex, PPD algorithms provide a number
of putative solutions, most of them looking quite reasonable. The
problem is that the PPD results are strongly dependent on small details
of the conformation. This is seen from the fact that when the monomers
are in the conformations that they adopt in the complex (i.e.,
redocking), the top-scoring conformation is the experimentally known
one in most cases. However, if the monomer conformations are obtained
independently of the complex structure, it is very rare that the
top-scoring conformation is the experimental one, suggesting
that there is room for improvement.
The first test of the contact potentials defined by Eq. (6) above
used 18 complexes where the ensemble of docked models
contained a model close to the crystal structure. The first question
was how many of the models have better contact scores than the
crystal structure and how many of them have better score than the
score of the model close to the crystal structure. If the answer is
Table 2
Contact pair propensities normalized by surface propensities
GLY ALA VAL LEU ILE PHE TRP TYR PRO MET SER THR CYS ASN HIS GLN ASP GLU LYS ARG
0.65 0.31 0.59 0.59 0.68 0.77 1.19 1.00 1.04 0.61
0.46 0.55 1.55 0.77 2.07 0.68 0.30 0.37 0.38 1.23
0.92 0.32 0.21 0.47 0.47 0.46 1.42 0.92 0.65 0.45
0.54 0.73 1.03 1.99 1.68 0.99 0.81 0.52 0.70 1.04
0.45 0.85 0.64 1.20 1.12 0.73 1.09 0.90 0.89 1.17
1.01 1.56 1.78 1.73 1.09 1.46 1.36 0.76 0.41 0.39
0.37 0.80 0.69 0.84 1.60 0.68 0.45 1.64 1.24 1.59
1.73 2.51 2.60 1.29 1.13 0.71 0.70 0.94 1.00 1.52
1.22 1.25 0.88 0.39 1.28 0.87 2.35 1.23 2.99 2.30
2.56 0.96 0.84 0.90 1.46 1.81 1.39 1.31 1.82 1.37
3.44 3.05 1.37 1.19 1.00 0.64 2.21 4.37 1.48 0.63
0.26 0.54 0.64 1.07 0.40 3.20 2.92 1.18 1.23 4.28
0.45 0.64 0.79 0.99 0.61 1.05 1.50 0.87 3.41 2.17
3.23 2.25 1.70 0.95 7.40 3.80 1.36 2.69 2.75 3.54
1.85 9.12 3.24 10.0 3.65 0.79 0.62 0.96 1.25 1.56
0.79 0.75 1.10 0.93 3.68 0.81 0.50 0.77 0.82 0.56
0.77 0.68 0.35 0.41 0.58 0.86 0.53 0.46 0.60 1.07
0.99 0.77 0.34 0.62 1.54 0.92 0.47 0.17 0.91 2.30
1.56 17.2 1.35 0.71 0.62 0.76 0.65 1.37 1.02 1.32
0.52 1.32 2.30 1.89 2.78 1.91 0.49 1.46 0.63 1.17
2.88 1.12 0.31 0.31 1.67 1.16 0.21 0.58 0.47 0.86
Table 3
Contact score comparison between redocked models generated by ClusPro and the crystal structure
and the native-like model

                    # of models beating    # of models beating    # of models beating    RMSD of
         # of       the crystal            the crystal            the native-like        the best
PDB ID   models     S score                SN score               S score                model
4ODS     30         0     0%               4     13%              4                      2.0
4ONL     19         4     21%              4     21%              4                      2.8
4POZ     30         1     3%               12    40%              0                      4.3
2QKO     24         4     16%              3     13%              15                     3.6
4QVF     7          3     42%              4     57%              3                      4.7
4UHP     16         5     31%              7     43%              0                      4.8
4X7S     25         0     0%               7     28%              0                      4.6
4YII     20         1     5%               4     20%              4                      6.0
4YON     30         10    33%              10    33%              0                      4.0
4Z95     25         0     0%               13    52%              13                     5.1
4GUZ     24         11    45%              7     29%              2                      3.4
4I4N     15         6     40%              8     53%              0                      4.7
4OFW     30         28    93%              27    90%              12                     8.6
4PGG     30         0     0%               0     0%               1                      6.0
4PVC     30         5     16%              5     16%              6                      5.8
4R1N     26         24    92%              22    84%              7                      5.6
4WOY     30         21    70%              30    100%             19                     49.6
4WUM     30         30    100%             29    96%              9                      4.5
zero for both, then this contact potential is fully capable of ranking
the docked poses. At the other end of the spectrum, if the answer is 50%
or more, this contact potential has no relevance. Table 3 shows
the contact score comparison between redocked models generated
by ClusPro and the crystal structure and the native-like model. The
average percentage of models beating the crystal structure is 33.7 and
43.7 for S and SN, resp., i.e., better than 50%, albeit not by much.
The fact that S performs better than SN suggests that the number of
contacts is an important contribution to the score. Also, the fact that
in four out of 18 complexes S gave the crystal structure the best score
suggests that the performance is better than the 33.7% average
indicates.
The crucial test is whether, using Eqs. (6) and (7), the contact
score can improve the ranking given by the docking servers.
Table 4 shows, using different weight factors w, the rescoring
results for unbound docking ensembles, i.e., where
the docking started from monomer structures obtained independently
of the complex structure. Note that for most unbound
docking ensembles there was no structure close to the experimental
complex structure, hence the small number of tests shown.
While the original rank of the experimental-like structure was not
always raised to one, in several cases it was improved. More
importantly, in most cases it did not worsen the rank.

Table 4
Rescoring results for unbound docking ensembles

                                     # of models beating the model with the best RMSD
         Best model                  using the correction factor w below
PDB ID   RMSD   Rank   Software      0.0   1.0   2.0   5.0   10.0   20.0   50.0
1DE4     5.5    8      ClusPro       7     7     7     7     7      7      8
1E6E     9.2    4      ClusPro       3     2     2     2     2      1      0
1E6J     6.0    17     ClusPro       16    17    16    16    14     14     14
1HIA     7.8    16     ClusPro       15    15    15    15    16     16     15
1HIA     8.5    3      PatchDock     2     0     0
1MAH     7.9    2      ClusPro       1     1     1     1     2      3      8
1MLC     5.4    19     ClusPro       18    19    19    23    24     23     22
1N8O     10.0   7      ClusPro       6     6     6     5     6      7      6
3MXW     2.2    6      ClusPro       5     4     3     1     1      1      1
3SIC     5.7    2      ClusPro       1     1     0     0     0      0      0      0     0
                                     100.0   200.0
                                     0       0
The calculation of the docking score adjustment of ClusPro
and PatchDock docked ensembles has been implemented in the
program Rescore. It is available at the URL
https://mezeim01.u.hpc.mssm.edu/rescore.
3.4 Characterization of Sphericity
When comparing regular geometric shapes like ellipsoids, it is a
simple matter to measure their deviation from spherical. For objects
that are irregular, like a globular protein, the answer is far from
simple. It turned out that CV could help in this issue as well. The
idea is to calculate the CV of each atom with respect to all the others
and compare the distribution of the CV values with the distribution
of CV values calculated from points uniformly distributed in a
sphere [20].
The first step is the calculation of the reference distribution of
CVs and CVws, i.e., their distribution for points within a sphere. To
that effect, 1,600,000 uniformly distributed points were generated
by a Monte Carlo procedure. However, direct comparison of the
reference distribution function and the distribution function of the
CV values of a protein is not likely to give a meaningful answer, since
most globular proteins (protein domains) have far fewer atoms
than the number of points used in the reference distribution. Thus,
unlike the reference distribution, the protein's distribution would be
either too noisy (if the bin sizes are small) or too crude (if the bin
sizes are large).
To see which properties of the distribution can be used to
characterize the extent of distortion from the spherical, the point
set used for the reference distribution was progressively scaled
along one direction to produce points within progressively more
elongated ellipsoids. Next, various properties of the distribution
were calculated and their ability to correlate properly with the
degree of distortion was tested. These properties included the
average (absolute or squared) differences between the density distributions, and between the cumulative probability distributions,
the differences between the sums of various CV powers, as well as
the various moments of the CV distributions. Comparisons were
made both using the CV and CVw values. Interestingly, the Pearson
correlations between the various measures and the difference
between the sphericities calculated from the volume and surface
of the elongated ellipsoids were generally much higher when using
the power sums or the moments than when looking directly at the
(discretized) distributions.
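The procedure can be illustrated on a small scale with the sketch below. It
assumes that the CV of each point is computed with Eq. (2) with respect to
all the other points and that a "power sum" is the average of a given power
of the CV values; the number of points (300 instead of 1,600,000), the seed,
the stretching factors, and all function names are arbitrary choices made
for illustration, not the settings of CVDISTR.

```python
import math
import random

def points_in_unit_sphere(n, rng):
    """Rejection sampling of n points uniformly distributed in the unit sphere."""
    points = []
    while len(points) < n:
        p = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
        if sum(x * x for x in p) <= 1.0:
            points.append(p)
    return points

def stretch(points, factor):
    """Scale the z coordinates, turning the sphere into an elongated ellipsoid."""
    return [(x, y, factor * z) for x, y, z in points]

def cv_values(points):
    """CV (Eq. 2) of every point with respect to all the other points."""
    values = []
    for i, (xi, yi, zi) in enumerate(points):
        sx = sy = sz = 0.0
        for j, (x, y, z) in enumerate(points):
            if j == i:
                continue
            dx, dy, dz = x - xi, y - yi, z - zi
            d = math.sqrt(dx * dx + dy * dy + dz * dz)
            sx += dx / d
            sy += dy / d
            sz += dz / d
        norm = math.sqrt(sx * sx + sy * sy + sz * sz)
        values.append(1.0 - norm / (len(points) - 1))
    return values

def mean_power(values, power):
    """Average of the given power of the values (assumed form of a power sum)."""
    return sum(v ** power for v in values) / len(values)

rng = random.Random(2022)  # fixed seed so that the sketch is reproducible
sphere = points_in_unit_sphere(300, rng)
reference = mean_power(cv_values(sphere), 2)
for factor in (1.0, 2.0, 4.0):
    measure = mean_power(cv_values(stretch(sphere, factor)), 2)
    print(factor, abs(measure - reference))
```

Because the CV depends only on directions, scaling all coordinates by the
same factor leaves it unchanged; only a change of shape, such as the
stretching along one axis, alters the distribution.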
The sphericity calculations were also performed on the set Hass
and Kohl used for their calculations of sphericity based on accessible surface [21]. The power and moment-based correlations again
were higher than the ones based on the explicit distributions. The
largest correlation (0.78) was obtained using the second power of
the CVw sums. This correlation indicates that the difference
between the two measures is larger than what one would expect from
the various approximations involved in either algorithm. Rather, it
shows that the sphericity of an irregular object is not a uniquely
defined concept.
The calculation of the CV-based sphericity measure has been
implemented in the program CVDISTR. It is available at the URL
https://mezeim01.u.hpc.mssm.edu/cvdistr.
3.5 Achilles’ Heel of Proteins
One demonstration of the usefulness of the CV for determining the
degree to which an atom is buried involved shaving off atoms from a
protein one by one: at each step, the atom that had the largest
exposed surface area was removed. It was found that the order in which
the atoms are removed by this shaving procedure correlates well with
their order by circular variance calculated with respect to the rest of
the atoms. This correlation suggests that shaving can also be done
meaningfully based on CV instead of exposed surface area.
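A CV-based shaving loop could look like the sketch below. It assumes that, at
each step, the atom with the lowest CV with respect to the remaining atoms
(i.e., the most exposed one) is removed and that the CVs are recomputed after
every removal; CVSHAVE may differ in these details. The brute-force
recalculation is cubic in the number of atoms and is meant only to show the
logic; the function names are hypothetical.

```python
import math

def atom_cv(coords, i):
    """CV of atom i from unit vectors pointing to all other atoms (Eq. 2)."""
    xi, yi, zi = coords[i]
    sx = sy = sz = 0.0
    for j, (x, y, z) in enumerate(coords):
        if j == i:
            continue
        dx, dy, dz = x - xi, y - yi, z - zi
        d = math.sqrt(dx * dx + dy * dy + dz * dz)
        sx += dx / d
        sy += dy / d
        sz += dz / d
    return 1.0 - math.sqrt(sx * sx + sy * sy + sz * sz) / (len(coords) - 1)

def shave_by_cv(coords):
    """Return atom indices in the order of removal (most exposed first).
    Stops when two atoms are left, since their CVs are both zero."""
    remaining = list(range(len(coords)))
    order = []
    while len(remaining) > 2:
        subset = [coords[i] for i in remaining]
        cvs = [atom_cv(subset, k) for k in range(len(subset))]
        k_min = min(range(len(cvs)), key=cvs.__getitem__)
        order.append(remaining.pop(k_min))
    return order
```

Recording, for each removed atom, whether it is a backbone atom and how far
its residue is from the chain ends would then be enough to flag the unusual
removals discussed in this section.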
During this shaving (either by the use of the exposed surface
area or by CV), most of the time side chain atoms or backbone
atoms at the chain end were removed; it was rare that a backbone
atom that was not near the chain end was the one removed. It is
thus proposed that these spots on the protein have special
properties, perhaps vulnerabilities. Hence the suggested name:
Achilles’ heel (AH) of a protein.

Table 5
Achilles’ heel residue propensities

Residue   % in PDB-C   % AH residues   % AH residues/% in PDB-C
GLY       6.91         11.62           1.68
ALA       6.84         7.28            1.06
VAL       6.72         3.54            0.53
LEU       8.77         4.46            0.51
ILE       5.28         2.30            0.44
SER       5.95         8.19            1.38
THR       5.92         5.80            0.98
ASP       5.58         8.99            1.61
GLU       6.76         10.09           1.49
ASN       4.28         6.38            1.49
GLN       4.44         4.51            1.01
LYS       6.30         9.32            1.48
HIS       2.40         1.64            0.68
ARG       4.83         3.56            0.74
PHE       3.78         1.43            0.38
TYR       4.13         1.62            0.39
TRP       2.06         1.13            0.55
CYS       1.87         0.85            0.45
MET       1.94         1.12            0.58
PRO       5.01         6.20            1.24
CV-based shavings were run on each protein domain in the
PDB-C set. The residue of a backbone atom that was shaved and
was at least 10 residues from the nearest chain end was considered
an AH residue. On average, 0.6% of the residues were found to
be AH residues. Table 5 shows the percent occurrence of the
20 AAs in the PDB-C set and among the AH residues, as well as
the propensity of an AA to be an AH residue, given as the ratio
AH%/PDB-C%. Remarkably, except for alanine, threonine, and glutamine,
this ratio differs significantly from one, suggesting that AH
propensity may indeed have a role in the function of the protein.
The secondary-structure element propensity of the AH residues has also been examined. Their propensities to be in helix,
sheet, and loop were found to be 16.7%, 10.0%, and 73.3%, resp.
This is not surprising since most of the time loops are on the
surface, sheets are in the interior, and helices have one side exposed.
Finally, the correlation of the AH propensity of an amino acid
with several other molecular properties was examined. The Pearson
correlation coefficient with AA hydrophobicity, using the
hydrophobicity scale of Rose et al. [22], turned out to be −0.89. This
means that amino acids with high AH propensity are likely to be
polar, which is indeed the case. This, in turn, may have a structural
consequence: the backbone of AH residues is likely to experience a
pull from the favorable solvation of the polar side chains, resulting
in their sticking out more than the nonpolar ones.
The Pearson correlation coefficient with the molecular volume,
using the values of Darby and Creighton [23], is −0.47 and with
the average length of the AH residue side chains is −0.20. While
these are rather weak correlations, they are in the direction
expected: larger/longer side chains tend to “protect” the
backbone.
The shaving by CV has been implemented in the program
CVSHAVE. It is available at the URL
https://mezeim01.u.hpc.mssm.edu/cvshave.
3.6 Foldability of a Sequence
For a long time, proteins were viewed in “black and white,” i.e.,
as either folded into a unique structure or disordered. This picture
was slowly refined by the discovery of intrinsically disordered
proteins and of proteins that can exist in different conformations:
at first just a few, involved in diseases, and later the
larger set of fold-switching proteins [3]. Given that only a
negligible fraction of possible sequences is capable of folding, the
question arises: what information can be obtained from a sequence
about its capability of folding?
The first observation is that the distribution of AAs in known
proteins is far from uniform. Furthermore, the specific propensities
show little variation when different types of organism are compared
[24] indicating that the deviation from the uniform distribution has
an important role in biology. Digging deeper, the propensities of
different AAs to be present in different secondary structure elements (SSE), i.e., helices or sheets, are different, leading to a whole
family of methods predicting the SSEs that a given sequence might
form. The question thus arose: can folding and non-folding
sequences be differentiated by the amount of SSEs predicted?
To answer this question, the Garnier-Osguthorpe-Robson
(GOR) method was used to predict the % of residues forming
helix or sheet in the PDB50 set, the IDP set, as well as two sets of
randomly generated sequences, one using the uniform distribution
and the other the amino acid propensities, referred to as RANU
and RANW, resp. For the PDB50, IDP, and RANW sets, the
average percentages, 62 ± 11, 59 ± 13, and 59 ± 7, resp., were
rather close, but the average percentage for the RANU set was
significantly less, 43 ± 8. Incidentally, the average of the
experimental percentage was 53 ± 13. Here ± represents one standard
deviation. The interesting part of this result is that there is a
significant difference between the RANW and RANU set results,
confirming the importance of the nonuniformity of AA propensities
for the ability of a sequence to fold. The slight difference
between the PDB50 and IDP and between the PDB50 and RANW
sets indicates that, indeed, there does exist a subtle difference
between these sequences. The importance of the nonuniformity
of AA propensities, using different arguments, has also been
observed by Mittal et al. [25].
Beyond simple propensities, the logical next question concerns the
propensities of various, progressively more complex combinations
of AAs. Here the difficulty is that increasing the complexity of the
combinations investigated increases the chance of obtaining useful
information, but the statistics become progressively less adequate for
producing reliable results. The increase in the size of the PDB helps
in improving the statistics, but the number of structures in the PDB
increases at a much lower rate than the size of the p-tuple space
increases as a function of p.
Recent work [26, 27] examined a large number of protein
sequences with known folded structures, using the PDB50 set,
for signals that would set these sequences apart from non-folding
ones. With increasing complexity, the signal strength increased.
Furthermore, even when the statistics became woefully inadequate,
some useful signals were still found.
Comparing the order of AAs in each neighboring pair, 26 of the
190 AA pairs showed asymmetry (the ratio of occurrence of the pair
in different order) >1.15 or <1/1.15 and a few pairs showed
significantly larger values: MET-HIS (0.56), PRO-GLU (1.62),
PRO-CYS (0.72), MET-THR (1.32), MET-TRP (0.76), and
PRO-HIS (0.77). Not surprisingly, prolines are prominent in
this list.
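A sketch of how such an order asymmetry could be computed from one-letter sequences follows. The choice of which order goes in the numerator (here the alphabetically first residue leading) is an assumption; the source does not state the convention behind the quoted ratios.

```python
from collections import Counter

def pair_asymmetry(sequences):
    """Ratio count(ab)/count(ba) for neighboring residues, reported once per
    unordered pair (a < b alphabetically). Pairs never seen in the reverse
    order are skipped to avoid division by zero."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a + b] += 1
    ratios = {}
    for pair, n_ab in counts.items():
        a, b = pair
        n_ba = counts.get(b + a, 0)
        if a < b and n_ba > 0:
            ratios[pair] = n_ab / n_ba
    return ratios

def asymmetric_pairs(ratios, threshold=1.15):
    """Pairs whose ratio is above threshold or below 1/threshold."""
    return {p: r for p, r in ratios.items()
            if r > threshold or r < 1 / threshold}
```

With 20 amino acids there are 190 unordered pairs of distinct residues, which is the denominator quoted in the text.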
The propensities of various AAs to be separated by ns residues
(irrespective of order) were also examined. These propensities were
normalized by the propensity at ns = 10. Some pairs showed
significant effects, not just for ns = 0 (i.e., neighbors). Here
HIS-HIS and CYS-CYS pairs showed the largest deviation
from one: 1.71 and 0.74, resp., for ns = 0, and even for ns = 2,
the ratios were 1.46 and 0.56, resp.; however, most of the pairs
showed no effect for ns > 0.
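The normalization by the value at a separation of 10 residues can be sketched as follows. The exact propensity definition used in the cited work is not given here, so this example assumes a plain frequency (matching position pairs divided by all position pairs at that separation).

```python
def separation_frequency(sequences, a, b, ns):
    """Frequency of residues a and b (either order) with ns residues between
    them; ns = 0 means direct neighbors."""
    offset = ns + 1
    hits = total = 0
    for seq in sequences:
        for i in range(len(seq) - offset):
            x, y = seq[i], seq[i + offset]
            total += 1
            if (x, y) == (a, b) or (x, y) == (b, a):
                hits += 1
    return hits / total if total else 0.0

def normalized_propensity(sequences, a, b, ns, ns_ref=10):
    """Frequency at separation ns divided by the frequency at ns_ref."""
    ref = separation_frequency(sequences, a, b, ns_ref)
    if ref == 0.0:
        return float("nan")  # reference separation never observed
    return separation_frequency(sequences, a, b, ns) / ref
```

A value near one means that the pair shows no preference at that separation relative to the reference distance.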
Looking at the SCp scores for p = 3, 4, and 5, significant
differences were obtained when using different types of input sets
[18, 26, 27]. Table 6 lists, for p = 3, 4, and 5, the average scores
and their S.D. for the PDB50, IDP, F-S, RANW, and RANU sets,
as well as the overlaps between the PDB50 set and all the other sets.
The coverage of the p-tuple space is also given in the table. For all
three p values, there was a clear progression in the average foldability
scores from the experimentally known folded set, PDB50, to
IDP, the intrinsically disordered set, followed by F-S, the
fold-switching set, and, significantly farther, the two random sets,
RANW and RANU, with the uniformly random set being the
farthest from the experimental set. The distance between the
averages was also reflected in the overlaps, which showed that the
nonrandom sets, while distinct, were still fairly close to each
other; a real difference was only seen between the random and
nonrandom sets.
Table 6
Foldability scores of various data sets
p                           3                4                5
PDB50   <SCp> ± S.D.        0.047 ± 0.062    0.094 ± 0.097    0.338 ± 0.169
IDP     <SCp> ± S.D.        0.044 ± 0.067    0.074 ± 0.100    0.203 ± 0.197
        Ov(PDB50,IDP)       0.91             0.88             0.68
F-S     <SCp> ± S.D.        0.028 ± 0.060    0.061 ± 0.093    0.323 ± 0.241
        Ov(PDB50,F-S)       0.82             0.79             0.71
RANW    <SCp> ± S.D.        0.043 ± 0.032    0.082 ± 0.048    0.205 ± 0.075
        Ov(PDB50,RANW)      0.30             0.19             0.324
RANU    <SCp> ± S.D.        0.051 ± 0.033    0.109 ± 0.052    0.044 ± 0.089
        Ov(PDB50,RANU)      0.27             0.14             0.009
p-tuple space coverage      100%             99.58%           69.92%
Given the significant separation between the PDB50 and random sets, the question arose: how reliably can the foldability score
predict if a given sequence will fold? For the test, the new PDB50
set was used; randomly generated sets were prepared using different
random number seeds from the ones used in generating and comparing score distributions.
Table 7 collects the folding predictions using SCp, p = 3, 4, and
5, on the new PDB50, RANW, and RANU sets. In view of the
smaller overlap between the PDB50 and RANW distributions for
p = 4 vs. p = 3, it is not surprising that SC4 performs better than
SC3. The performance of SC5, however, was uneven, even though
extending the tuple length is expected to yield better performance.
The likely reason for not meeting this expectation is that,
while the triplet and quadruplet spaces are essentially fully covered by
the PDB50 set, the coverage of the pentuplet space is significantly
lower. However, the surprisingly low rate of false negatives with
SC5 suggested combining SC4 and SC5 in the following
way: whenever SC4 predicts randomness but SC5 predicts
foldability, change the prediction to foldable; otherwise keep the
SC4 prediction. As also shown in Table 7, the combination
significantly improved the reliability of the folding predictions.
Table 7
Foldability predictions
Score                   Prediction    New PDB50    RANW     RANU
Triplet                 Folded        75.8%        9.4%     6.3%
                        Non-folded    24.3%        90.6%    93.7%
Quadruplet              Folded        79.3%        4.3%     4.0%
                        Non-folded    20.7%        95.7%    96.0%
Pentuplet               Folded        79.4%        0.5%     3.6%
                        Non-folded    20.6%        99.5%    96.4%
Quadruplet+Pentuplet    Folded        88.4%        4.2%     31.1%
                        Non-folded    11.6%        95.8%    68.9%
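The rule for combining the SC4 and SC5 predictions can be written in a few lines of Python. This is only a sketch: how each individual score is turned into a folded or random call is not specified here, so the two inputs are assumed to be already-made Boolean predictions.

```python
def combined_prediction(sc4_folded, sc5_folded):
    """Combine two Boolean predictions (True = foldable).

    The SC4 call is kept unless SC4 predicts randomness while SC5 predicts
    foldability, in which case the prediction is changed to foldable.
    """
    if not sc4_folded and sc5_folded:
        return True
    return sc4_folded

# Enumerate all four input combinations
for sc4 in (True, False):
    for sc5 in (True, False):
        print(sc4, sc5, combined_prediction(sc4, sc5))
```

Written this way, it is apparent that the rule is a logical OR of the two predictions: a sequence is called random only when both scores say so.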
A recent work used a modified score and refined the foldability
predictions by making the scores depend not only on the AAs in the
p-tuplets but also on the secondary structure predicted from the
sequence [28]. It performed slightly better than the combination
of SC4 and SC5 described above.
Given the success of the foldability scores SCp, the logical next
question is whether they can serve as a measure of stability as well.
The short answer, however, is: not really. The Pearson correlation
between the melting temperatures (Tm) and SCp is only 0.1,
0.19, and 0.14 for p = 3, 4, and 5, resp. Looking at the Spearman
(rank) correlations, the numbers are similarly low: 0.10, 0.18, and
0.19 for p = 3, 4, and 5, resp.
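For readers who wish to repeat this kind of check on their own data, the two correlation measures can be computed without external libraries. The sketch below is generic; it does not reproduce the numbers quoted above, which require the Tm data set.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```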
The discouragingly low correlations prompted another look at
longer p-tuples, despite the progressively worse coverage in the
PDB50 set (11% and 0.7% for p = 6 and 7, resp.), using the SSCp
score. The first question raised was whether the change in stability
upon a mutation can be predicted by the change in the foldability
score. Not surprisingly, the answer was again negative; the
correlations between the score change upon mutation and the
change in Tm were 0.04, 0.03, 0.11, 0.10, and 0.18 for p = 3,
4, 5, 6, and 7, resp.
The next question raised was whether “coarse-graining” the
quest helps, i.e., whether it is possible to predict just the sign of the
stability change upon a mutation. Here the answer was a qualified
yes. Table 8 shows the percent of mutations where the sign of the change
in the foldability score matched the sign of the change in Tm for p = 3, 4, 5,
6, and 7. Not surprisingly, for p = 3 and 4, the predictions
were too close to 50% to be of any use. However, for p = 6 and
7, the prediction accuracy was over 70%. Furthermore, for mutations
where the sign change predictions are the same for more than
one p, the accuracy can be further increased; the best pair performance
was found using p = 6 and 7 (72.1%), and the best three-way
combination was found using p = 5, 6, and 7 (73.7%). While this is
still rather low, it can be used to prioritize planned mutations in an
experimental project, since the success rate can be significantly
improved at negligible computational cost.
Table 8
Mutation sign prediction
p    Percent sign match
3    50.8%
4    57.4%
5    67.9%
6    71.4%
7    70.6%
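The sign-match statistic and the restriction to mutations on which several p values agree can be sketched as below. The data layout (one list of score changes per p, aligned with a list of Tm changes) and the treatment of a zero change as its own sign class are assumptions made for illustration.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def percent_sign_match(delta_score, delta_tm):
    """Percent of mutations whose score change has the sign of the Tm change."""
    matches = sum(sign(s) == sign(t) for s, t in zip(delta_score, delta_tm))
    return 100.0 * matches / len(delta_tm)

def consensus_subset(delta_scores_by_p):
    """Indices of mutations for which all listed p values predict the same sign."""
    per_mutation = zip(*delta_scores_by_p.values())
    return [i for i, deltas in enumerate(per_mutation)
            if len({sign(d) for d in deltas}) == 1]

# Hypothetical toy data, not taken from the chapter
d_tm = [1.0, -2.0, -0.5, -1.5]
d_sc = {6: [0.1, -0.2, 0.3, -0.1], 7: [0.2, 0.1, 0.4, -0.3]}
keep = consensus_subset(d_sc)
accuracy = percent_sign_match([d_sc[6][i] for i in keep],
                              [d_tm[i] for i in keep])
```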
The calculations of the foldability measures and estimates have
also been implemented in the program FOLD. It is available at the
URL https://mezeim01.u.hpc.mssm.edu/fold.
Another recent work also took advantage of the fact that the
missing coverage also contains information [29]. In this work, the
authors identified certain short sequences that occur only
in IDPs.
4 Notes and Protocols
All programs mentioned in this chapter can be downloaded free for
academic users. They are written in Fortran 77. The download
package includes the source code, documentation, and, wherever applicable, data files and/or sample inputs; compiled
executables are also available.
The programs are run from the command line. The general format
of the commands is:
<Program name> [directive argument] [directive argument] . . .
If only the program name is typed, a list of the possible directives,
their arguments, and, if available, their defaults is printed. For example,
calling the chameleon search program without any directives
prints
The possible command line options are:
-mn : minimum chameleon length default: 5
-mx : maximum chameleon length default: 24
-ro : file name root default: chameleon
-db : debug level default: 0
-lf : list of files/sequences default: ss.txt
-ff : file format (pdb|cif|ann) default: ann
-mp : residue mapping file
In general, the directives can be in any order. Some of the programs
also offer an interactive quiz to enter the run information when
just the program name is typed. All programs check the input
for consistency, limits, and file availability. If limits are violated,
the program tells which size parameters to change, if possible;
the documentation (.html file) indicates how data limits can be
changed, if necessary.
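Because every program follows the same "program name plus directive/argument pairs" convention, runs can be scripted. The helper below is a hypothetical convenience wrapper, not part of the distributed package; it only assembles the argument list from the directives listed above, which could then be passed to `subprocess.run` if the `cham` executable is installed.

```python
import shlex

def build_command(program, directives):
    """Turn {'mn': 5, ...} into ['cham', '-mn', '5', ...]."""
    argv = [program]
    for flag, value in directives.items():
        argv += [f"-{flag}", str(value)]
    return argv

argv = build_command(
    "cham",
    {"mn": 5, "mx": 10, "ro": "demo", "lf": "ss.txt", "ff": "ann"},
)
print(shlex.join(argv))
```

Since the directives can be given in any order, a plain dictionary is sufficient to hold them.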
A typical chameleon calculation would run as follows:
cham -mn 5 -mx 10 -ro demo -lf ss.txt -ff ann
Chameleon finder - written by Mihaly Mezei - Version 07/21/2018
The program needs a file listing the PDB/CIF files or PDBids to process
or a FASTA file with seqences and SS annotations
Memory use: over 1421 Mb
Opened list file ss.txt
Opened result file demo.res
Opened detail file demo.dtl
Opened debug file demo.log
...
Read/analyzed 139870 PDB ids and 394364 chains
Average chain length= 260.3 residues
Number of 10-residue chameleons found= 32
Number of 10-residue sequences used= 5380037
A typical part of the chameleon list in the output file demo.res is
below:
HX:1AMB A ir= 11 SH:5OQV F ir= 11 Nctot= 6 EVHHQKLVFF GLU VAL HIS HIS GLN LYS LEU VAL PHE PHE
Full list:1AMB A 11 h 1AMC A 11 h 1IYT A 11 h 1Z0Q A 11 h 2LMQ K 11 s
Full list:5OQV F 11 s
A typical rescoring calculation would run as follows:
rescore -sf 4GUZ.cp_sc -md cluspro.4GUZ.24 -mr model.000 -xf
4GUZ_AD.pdb -ds CP
Rescoring a set of docked P-P models using the contact
propensity matrix
Written by Mihaly Mezei - Version 08/11/2020
Opening file 4GUZ.cp_sc (ClusPro result file)
Opening file 4GUZ_AD.pdb (X-ray structure file)
Opening file cluspro.4GUZ.24_rescore.res
Result will be printed to file cluspro.4GUZ.24_rescore.res
Read 24 scores
# of ATOM records= 4482
Opening file cluspro.4GUZ.24/model.000.00.pdb (First docked
model)
...
Opening file cluspro.4GUZ.24/model.000.23.pdb (24th docked
model)
# of ATOM records= 5448
4GUZ Average model score= 3.138 Range: [ -3.115, 9.134]
4GUZ Average normalized model score= 0.075 Range: [ -0.074,
0.203]
4GUZ X-ray dimer scoresum= 3.87 # of contacts= 30 sc/n= 0.129
4GUZ # of models=24 nsc_xp=11 nscn_xp= 7 nCA:568 284 nCT= 30
RMSDmin= 3.36 (# 6) wfac= 0.0 # beating RMSDmin= 8
...
RMSDmin= 3.36 (# 6) wfac= 50.0 # beating RMSDmin= 3
im_min= 6 scmin= 6.34919596
SCM: 4.7 6.2 5.5 3.6 2.8 6.3 2.9 9.1 4.9 1.5
SCM: 1.2 4.9 1.9 0.2 5.1 -1.6 1.6 -3.1 4.5 2.6
SCM: -1.5 5.2 6.7 -0.2
Number of non-native structures with better contact score than
the native= 2
A typical sphericity calculation would run as follows:
cvdistr -cvcomplist pdb.list > cvdistr.res
where pdb.list is a list of PDB files for which sphericity parameters will be calculated. The results are redirected to the file
cvdistr.res. A typical output for one PDB file would be:
Compare UNNORMALIZED CV
ID=1a02N2 NN CVd= 0.03405 CVd_cum= 0.21472 CVdA= 0.68107 CVdA_cum= 2.12160
Difference (abs) of the powers of the CV distribution
0.141155E-01 0.174501E-01 0.126555E-01 0.817334E-02 0.525148E-02
0.347859E-02 0.239483E-02 0.170957E-02 0.125860E-02 0.950353E-03
0.732546E-03 0.574280E-03 0.456523E-03 0.367119E-03 0.298048E-03
0.243868E-03 0.200791E-03 0.166126E-03 0.137933E-03 0.114784E-03
Cumulative difference (abs) of the powers of the CV distribution
0.141155E-01 0.315656E-01 0.442211E-01 0.523944E-01 0.576459E-01
0.611245E-01 0.635193E-01 0.652289E-01 0.664875E-01 0.674379E-01
0.681704E-01 0.687447E-01 0.692012E-01 0.695683E-01 0.698664E-01
0.701103E-01 0.703110E-01 0.704772E-01 0.706151E-01 0.707299E-01
Difference (abs) of the moments of the CV distribution
0.365169E-08 0.923624E-02 0.165961E-03 0.725276E-03 0.154974E-03
0.967322E-04 0.374331E-04 0.183825E-04 0.825398E-05 0.382788E-05
0.172708E-05 0.737525E-06 0.259078E-06 0.862984E-07 0.240465E-07
0.390821E-07 0.472241E-07 0.434107E-07 0.327017E-07 0.286229E-07
Cumulative difference (abs) of the moments of the CV distribution
0.365169E-08 0.923625E-02 0.940221E-02 0.101275E-01 0.102825E-01
0.103792E-01 0.104166E-01 0.104350E-01 0.104433E-01 0.104471E-01
0.104488E-01 0.104496E-01 0.104498E-01 0.104499E-01 0.104499E-01
0.104500E-01 0.104500E-01 0.104501E-01 0.104501E-01 0.104501E-01
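The "Cumulative difference" blocks in this output are the running sums of the corresponding "Difference" blocks. A quick way to check this when post-processing the output is shown below, using the first four values of the powers block; the variable names are illustrative.

```python
from itertools import accumulate

# First four "Difference (abs) of the powers" values from the sample output
differences = [0.141155e-01, 0.174501e-01, 0.126555e-01, 0.817334e-02]

# Running sum, to be compared with the "Cumulative difference" block
cumulative = list(accumulate(differences))

# First four cumulative values printed by the program
printed = [0.141155e-01, 0.315656e-01, 0.442211e-01, 0.523944e-01]
consistent = all(abs(c - q) < 1e-6 for c, q in zip(cumulative, printed))
```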
A typical shaving calculation would run as follows:
cvshave -lf PDB_list -rd 10 -iv 0 -of PDB_d10.out
CV-based shaving of proteins for Achilles heel search
Written by Mihaly Mezei - version 08/15/2020
File PDB_list opened OK as unit 10
File PDB_d10.out opened OK as unit 30
CV cutoff= 15.0
Residue distance limit= 10
...
Processed 1179 PDB files; analyzed 4016 protein domains
Average % of BB break residues= 0.56
1 GLY aa%= 6.91 aa_ach%= 11.62 aa_ach%/aa%= 1.68
2 ALA aa%= 6.84 aa_ach%= 7.28 aa_ach%/aa%= 1.06
...
A typical foldability prediction calculation would run as
follows:
1. Filter the chains in ss.txt to no more than 50% sequence
identity.
fold -op FILT -da READ -in ss.txt -pm 50 -ou ss_nr50.out
2. Calculate the propensities of all quadruplets (file
quad_nr50.dat).
fold -op QUAD -da READ -in ss_nr50.txt -qd quad_nr50.dat -ou quad_nr50.out
3. Calculate the quadruplet-based foldability scores of the sequences
in ss_nr50.txt.
fold -op SC04 -da READ -in ss_nr50.txt -sf SSPR -qd quad_nr50.dat -ou quad_nr50_score.out
4. Predict the foldabilities of the sequences in ss_new_nr50.txt.
fold -op PRFO -da READ -in ss_new_nr50.txt -qd quad_nr50.dat -us QUAD -ou prfo_nr50_quad.out -lf prfo.list
A typical output segment for the foldability prediction:
6QMB:B <HX len>= 8.2 A sc= 0.0000 T sc= 0.0000 Q sc= 0.0247 P sc= 0.0000 foldprop= 17.98 ln(fp)= 2.8893 ibin= 78 GF
6QMM:A <HX len>= 9.2 A sc= 0.0000 T sc= 0.0000 Q sc= 0.1358 P sc= 0.0000 foldprop= 1.00 ln(fp)= 1.0000 ibin= 59 FD
6QPI:B <HX len>= 7.6 A sc= 0.0000 T sc= 0.0000 Q sc=-0.0757 P sc= 0.0000 foldprop= 0.14 ln(fp)= -1.9942 ibin= 30 GR
6QPP:A <HX len>= 5.7 A sc= 0.0000 T sc= 0.0000 Q sc=-0.0217 P sc= 0.0000 foldprop= 0.74 ln(fp)= -0.3070 ibin= 46 GR?
Here, the labels GF, FD, GR, and GR? stand for Guessed
Folded, Folded, Guessed Random, and Probably Random, resp.
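Prediction records of this kind are easy to post-process. The regular expression below is a sketch written against the sample records shown above, assuming each record occupies a single line; it is not an official description of the FOLD output format, and the field names are chosen here for illustration.

```python
import re

RECORD = re.compile(
    r"(?P<chain>\S+)\s+<HX len>=\s*(?P<hx_len>[-\d.]+)\s+"
    r"A sc=\s*(?P<a_sc>[-\d.]+)\s+T sc=\s*(?P<t_sc>[-\d.]+)\s+"
    r"Q sc=\s*(?P<q_sc>[-\d.]+)\s+P sc=\s*(?P<p_sc>[-\d.]+)\s+"
    r"foldprop=\s*(?P<foldprop>[-\d.]+)\s+ln\(fp\)=\s*(?P<ln_fp>[-\d.]+)\s+"
    r"ibin=\s*(?P<ibin>\d+)\s+(?P<label>\S+)"
)

LABELS = {"GF": "Guessed Folded", "FD": "Folded",
          "GR": "Guessed Random", "GR?": "Probably Random"}

def parse_record(line):
    """Return the fields of one prediction record, or None if it does not match."""
    m = RECORD.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    for key in ("hx_len", "a_sc", "t_sc", "q_sc", "p_sc", "foldprop", "ln_fp"):
        rec[key] = float(rec[key])
    rec["ibin"] = int(rec["ibin"])
    rec["meaning"] = LABELS.get(rec["label"], "unknown")
    return rec

sample = ("6QPI:B <HX len>= 7.6 A sc= 0.0000 T sc= 0.0000 Q sc=-0.0757 P "
          "sc= 0.0000 foldprop= 0.14 ln(fp)= -1.9942 ibin= 30 GR")
record = parse_record(sample)
```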
Acknowledgments
Conversations with Prof George Rose on the ideas described here
are gratefully acknowledged. This work was supported in part
through the computational resources and staff expertise provided
by Scientific Computing at the Icahn School of Medicine at Mount
Sinai.
References
1. Dokholyan N (2009) Protein designability and
engineering. In: Structural bioinformatics, 2nd
edn. Wiley-Blackwell, Hoboken, NJ
2. Mezei M (2018) Revisiting chameleon
sequences in the protein data Bank. Algorithms
11:114. https://doi.org/10.3390/a11080114
3. Porter LL, Looger LL (2018) Extant fold-switching proteins are widespread. Proc Natl
Acad Sci U S A 115:5968–5973. https://doi.org/10.1073/pnas.1800168115
4. Berman HM, Westbrook J, Feng Z,
Gilliland G, Bhat TN, Weissig H, Shindyalov
IN, Bourne PE (2000) The protein data Bank.
Nucleic Acids Res 28:235–242. https://doi.
org/10.1093/nar/28.1.235
5. Mezei M (2015) Statistical properties of
protein-protein interfaces. Algorithms 8:
92–99. https://doi.org/10.3390/a8020092
6. Piovesan D, Tabaro F, Marco IM, Quaglia NF,
Oldfield CJ, Aspromonte MC, Davey NE,
Davidović R, Dosztányi Z, Elofsson A,
Gasparini A, Hatos A, Kajava AV, Kalmar L,
Leonardi E, Lazar T, Macedo-Ribeiro S,
Macossay-Castillo M, Meszaros A,
Minervini G, Murvai N, Pujols J, Roche DB,
Salladini E, Schad E, Schramm A, Szabo B,
Tantos A, Tonello F, Tsirigos KD,
Veljković N, Ventura S, Vranken W,
Warholm P, Uversky VN, Dunker AK,
Longhi S, Silvio P, Tosatto CE (2016) DisProt
7.0: a major update of the database of disordered proteins. Nucleic Acids Res 45:
D219–D227. https://doi.org/10.1093/nar/
gkw1279
7. Pucci F, Bourgeas R, Rooman M (2016) High-quality thermodynamic data on the stability
changes of proteins upon single-site mutations.
J Phys Chem Ref Data 45. https://doi.org/10.1063/1.4947493
8. Kirys T, Ruvinsky AM, Singla D, Tuzikov AV,
Kundrotas PJ, Vakser IA (2015) Simulated
unbound structures for benchmarking of protein docking in the DOCKGROUND
resource. BMC Bioinformatics 16:243
9. Vreven T, Moal I, Vangone A, Pierce B,
Kastritis P, Torchala M, Chaleil R, Jimenez-Garcia B, Bates P, Fernandez-Recio J,
Bonvin A, Weng Z (2015) Updates to the
integrated protein-protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. J Mol Biol 427:
3031–3041
10. Comeau SR, Gatchell DW, Vajda S, Camacho
CJ (2004) ClusPro: a fully automated algorithm for protein-protein docking. Nucleic
Acids Res 32:W96–W99. https://doi.org/10.
1093/nar/gkh354
11. Schneidman-Duhovny D, Inbar Y, Nussinov R,
Wolfson HJ (2005) PatchDock and SymmDock: servers for rigid and symmetric docking.
Nucl Acids Res 33:W363–W367. https://doi.
org/10.1093/nar/gki481
12. Mardia KV, Jupp PE (2000) Directional statistics. John Wiley & Sons, Ltd, Chichester
13. Mezei M (2003) A new method for mapping
macromolecular topography. J Mol Graph
Model 21(5):463–472
14. Mezei M (2010) Simulaid: a simulation facilitator and analysis program. J Comput Chem
31(14):2658–2668. https://doi.org/10.
1002/jcc.21551
15. Mezei M, Zhou M-M (2007) Pspace: a program to plan the covering of a protein space.
Source Code Biol Med 2:6. https://doi.org/
10.1186/1751-0473-2-6
16. Mezei M (2003) Efficient Monte Carlo sampling for long molecular chains using local
moves, tested on a solvated lipid bilayer. J
Chem Phys 118:3874–3880. https://doi.
org/10.1063/1.1539839
17. Mezei M (1998) Chameleon sequences in the
PDB. Prot Engng 11:411–414. https://doi.
org/10.1093/protein/11.6.411
18. Mezei M (2020) Foldability and chameleon
propensity of fold-switching protein
sequences. Proteins 89:3–5. https://doi.org/10.1002/prot.25989
19. Göbel U, Sander C, Schneider R, Valencia A
(1994) Correlated mutations and residue contacts in proteins. Proteins 18:309–317.
https://doi.org/10.1002/prot.340180402
20. Mezei M (2015) Use of circular variance to
quantify the deviation of a macromolecule
from the spherical shape. J Math Chem 53:
2184–2189. https://doi.org/10.1007/
s10910-015-0540-4
21. Hass J, Koehl P (2014) How round is a protein?
Exploring protein structures for globularity
using conformal mapping. Front Mol Biosci
1:1–15. https://doi.org/10.3389/fmolb.2014.00026
22. Rose GD, Geselowitz AR, Lesser GJ, Lee RH,
Zehfus MH (1985) Hydrophobicity of amino
acid residues in globular proteins. Science 229:
834–838. https://doi.org/10.1126/science.
4023714
23. Darby NJ, Creighton TE (1993) Protein
structure. In: Focus. IRL Press, Oxford University Press, Oxford. https://doi.org/10.1016/0307-4412(95)90200-7
24. Gaur RK (2014) Amino acid frequency distribution among eukaryotic proteins. IIOAB J 5:
6–11
25. Mittal A, Jayaram B, Shenoy S, Bawa TS
(2010) A stoichiometry driven universal spatial
Organization of Backbones of folded proteins:
are there Chargaff’s rules for protein folding? J
Biomol Struct Dyn 28:133–142. https://doi.
org/10.1080/07391102.2010.10507349
26. Mezei M (2020) On predicting foldability of a
protein from its sequence. Proteins 88:
355–356. https://doi.org/10.1002/prot.
25811
27. Mezei M (2019) Exploiting sparse statistics for
a sequence-based prediction of the effect of
mutations. Algorithms 12:214. https://doi.
org/10.3390/a12100214
28. Kaushik R, Zhang KYJ (2020) A protein
sequence fitness function for identifying natural and nonnatural proteins. Proteins 88(10):
1271–1284. https://doi.org/10.1002/prot.
25900
29. Mittal A, Changani AM, Taparia S (2021)
Unique and exclusive peptide signatures
directly identify intrinsically disordered proteins from sequences without structural information. J Biomol Struct Dyn 39
(8):2885–2893. https://doi.org/10.1080/
07391102.2020.1756410
Chapter 3
Exploring the Peptide Potential of Genomes
Chris Papadopoulos, Nicolas Chevrollier, and Anne Lopes
Abstract
Recent studies attribute a central role to the noncoding genome in the emergence of novel genes. The
widespread transcription of noncoding regions and the pervasive translation of the resulting RNAs offer to
the organisms a vast reservoir of novel peptides. Although the majority of these peptides are anticipated as
deleterious or neutral, and thereby expected to be degraded right away or short-lived in evolutionary
history, some of them can confer an advantage to the organism. The latter can be further subjected to
natural selection and be established as novel genes. In any case, characterizing the structural properties of
these pervasively translated peptides is crucial to understand (1) their impact on the cell and (2) how some
of these peptides, derived from presumed noncoding regions, can give rise to structured and functional de
novo proteins. Therefore, we present a protocol that aims to explore the potential of a genome to produce
novel peptides. It consists in annotating all the open reading frames (ORFs) of a genome (i.e., coding and
noncoding ones) and characterizing the fold potential and other structural properties of their
corresponding potential peptides. Here, we apply our protocol to a small genome and show how to apply
it to very large genomes. Finally, we present a case study which aims to probe the fold potential of a set of
721 translated ORFs in mouse lncRNAs, identified with ribosome profiling experiments. Interestingly, we
show that the distribution of their fold potential is different from that of the nontranslated lncRNAs and
more generally from the other noncoding ORFs of the mouse.
Key words Noncoding DNA, Fold potential, De novo genes, Small ORF-encoded peptides, ORFtrack, ORFold
1 Introduction
Many studies attribute a central role to the noncoding genome in
novel gene birth and more generally in the emergence of genetic
novelty. As a matter of fact, thousands of small open reading frames
(ORFs) have been identified in noncoding regions of various genomes. Interestingly, the wide use of transcriptomics revealed highly pervasive transcription of noncoding regions, and an important
fraction of the resulting RNAs has been shown to be translated by
ribosome profiling experiments [1–4]. In addition, mass spectrometry experiments conducted on mammals, bacteria, or plants [5–
11] confirm the existence of these translation products in the cell,
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_3,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
with the identification of hundreds of peptides derived from noncoding regions. The fact that these noncanonical products are
short, are present in low abundance, and use alternative start
codons makes their identification difficult and suggests that their
number is largely underestimated. Interestingly, their sequences are
more conserved than those of noncoding sequences, suggesting
that they are subjected to purifying selection [5, 6] and they could
be functional. It has been proposed that these noncanonical translation products are consequently exposed to natural selection and,
thereby, provide the organism with the raw material for the emergence of genetic novelty. However, how noncoding sequences can
give rise to novel genes remains unclear. Particularly, noncoding
sequences are not expected to fold to a stable and specific structure
and have not been subjected to purifying selection in order not to
be deleterious for the cell. One can ask how these pervasively
translated products can (1) be tolerated by the cell and (2) give
rise to functional products, since most proteins achieve their function through a well-defined 3D structure. Indeed, noncoding
sequences display different sequence features from coding ones,
being shorter and characterized by different nucleotide compositions [5, 12]. They are rather expected to encode disordered,
misfolded, or aggregation-prone peptides, and we can hypothesize
that they would be rapidly degraded or short-lived in evolutionary
history. Nevertheless, it has been demonstrated that proteins from
random libraries could fold in silico or in vitro, some of them being
even beneficial in Escherichia coli [13–16]. All these results place the
foldability of noncoding ORFs at the center of novel gene birth and
strengthen the need to characterize the fold potential (including
the propensities for disorder, folded state, and aggregation), not
only of the experimentally observed de novo peptides but also of all
the amino acid sequences “encoded” by presumed noncoding
ORFs, which could give rise to novel peptides upon pervasive
translation.
Therefore, we present a protocol that enables in an automated
way (1) the extraction and annotation of all possible ORFs of a
genome and (2) the prediction of their fold potential along with
their propensities for disorder and aggregation. It relies on the
ORFmine package (unpublished but available at https://github.
com/i2bc/ORFmine) which aims to annotate a genome’s ORFs
and probe their fold potential and structural properties. ORFmine
consists of two independent programs, ORFtrack and ORFold.
ORFtrack works in a stand-alone fashion and is very flexible,
enabling different levels of annotation depending on the user
request. ORFold relies on three gold-standard programs, HCA
[17–20], Tango [21–23], and IUPred2A [24–26], which predict
respectively the fold potential, the aggregation, and the disorder
propensities of an amino acid sequence. Here, we consider as
foldable the amino acid sequences that are able to fold to a stable
3D structure or to a molten globule state, in which the specific
tertiary structure is lost but the secondary structures are intact. Our
protocol can be applied to any completely sequenced genome and
takes a few hours on a personal computer for a small genome
(bacteria, archaea, or fungi), although we recommend launching
the pipeline on a cluster for larger genomes (e.g., plant or mammal
genomes). Here we present a detailed application of our protocol to
the small genome of E. coli. Then we show how to apply our
protocol to very large genomes (Mus musculus). In the last part,
we present a case study based on a ribosome profiling experiment
performed on the mouse. In this example, we probe the fold
potential of 721 ORFs present in lncRNAs which are translated,
not conserved across species, and which show weak or no signature
of selective pressure (i.e., presumed as noncoding). We then show
how ORFold can be used to compare the fold potential of a subset
of ORFs of interest (e.g., translated ORFs present in lncRNAs)
with those of the coding and noncoding ORFs of the genome
they belong to. The latter protocol can be extended to any set of
sequences of interest, including, for example, peptides identified in
mass spectrometry experiments carried out in different conditions,
de novo peptides associated with specific diseases, or even designed
sequences.
2 Materials
2.1 ORFmine
ORFmine is a package that we developed in order to explore the
peptide potential of a noncoding genome, with the extraction and
annotation of all the possible ORFs present in noncoding regions.
The ORFmine package is not published yet, but is available at:
https://github.com/i2bc/ORFmine. It consists of two independent programs, ORFtrack and ORFold, that can be combined
together or used independently (Fig. 1). Used together, ORFtrack
and ORFold provide a global picture of the fold potential and the
structural properties of all the potential peptides of a genome.
Otherwise, ORFtrack can simply be used to extract and annotate
the ORFs of a genome, while ORFold can estimate the fold potential of any set of sequences without using genomic information.
2.1.1 ORFtrack
ORFtrack aims at extracting and annotating all the possible ORFs
of a genome according to a set of defined genomic features. It takes
as inputs a FASTA file containing all the chromosome or contig
sequences and its corresponding annotation GFF file (for more
details, see the GFF3 file format description at https://github.
com/The-Sequence-Ontology/Specifications/blob/master/
gff3.md). ORFtrack searches, in the six possible frames, for all
possible ORFs of at least 60 nucleotides bounded by STOP codons
(i.e., it does not search for start codons). In order to annotate each
resulting ORF (e.g., intergenic ORF, noncoding ORF that overlaps
a coding sequence, coding ORF), its localization is subsequently
compared to those of all genomic features annotated in the GFF file
(e.g., CDS, tRNA, rRNA, or any other feature defined by the user
in the third column of the GFF file) (Figs. 2 and 3). There are four
main categories of ORFs: (1) coding ORFs (c_CDS), which correspond
to ORFs that include a coding sequence (CDS) (i.e., in the
same frame as a CDS); they are generally longer than the CDS since
they are defined from STOP to STOP; (2) noncoding intergenic
ORFs (nc_intergenic), which do not overlap any genomic feature;
(3) noncoding ORFs which overlap a genomic feature on the same
strand (nc_ovp_same-x, with x standing for the corresponding
genomic feature); and (4) noncoding ORFs which overlap a genomic
feature on the opposite strand (nc_ovp_opp-x, with x standing
for the corresponding genomic feature) (Figs. 2 and 3). The user
has to keep in mind that ORFtrack provides an ORF-centered point
of view of the input genome and that ORFs do not correspond to
real biological objects but rather to the potential peptides that
could be produced upon pervasive translation with no information
on the localization of their first translated codon. For example, a
noncoding ORF overlapping a tRNA does not correspond to a
tRNA, which by definition has neither phase nor a corresponding
amino acid sequence, but to the corresponding peptide which
could be produced upon the pervasive translation of the tRNA
gene with no knowledge of the first translated codon.
Fig. 1 Pipeline of ORFmine. The inputs and outputs are represented with gray rectangles while the main scripts
are shown with red circles. The mandatory inputs necessary to the ORF annotation and the estimation of their
structural properties (e.g., fold potential and disorder and aggregation propensities), as well as their
corresponding outputs, are connected to their related scripts with black arrows. The classical pipeline of
ORFmine provides the user with a plot representing the distribution of the fold potential of the input ORFs (red
box). Optionally, a genome annotation file (GFF format) can be given to ORFold (dashed arrows). In this case,
ORFold produces new GFF files (one per studied structural property) where all input ORFs are associated with
the score of the corresponding property. The GFF files produced by ORFtrack and ORFold can be subsequently
uploaded to a genome viewer (black boxes) where ORFs will be colored according to their annotation (black
box on the left) or their structural properties (black box on the right)
Fig. 2 Decision tree of ORFtrack. ORFs are annotated according to four main categories: c_CDS for coding
ORFs (orange box), noncoding intergenic ORFs (gray box), and noncoding ORFs that overlap a genomic feature
on the same strand (blue box) or on the opposite strand (green box)
Fig. 3 Schematic representation of the six frames of a DNA section. The genomic features annotated in the
original GFF file are represented in the middle line. The ORFs of the six frames are colored with respect to their
ORFtrack annotation. The overlap between an ORF and a genomic feature is illustrated with a rectangle
colored according to the ORF annotation
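The STOP-to-STOP definition of an ORF can be made concrete with a short sketch. This is an illustration under stated assumptions, not ORFtrack's implementation: it assumes upper-case DNA, the three standard STOP codons, ORF lengths counted without the bounding STOP codons, and it ignores stretches at the sequence ends that are not bounded by a STOP on both sides.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def stop_to_stop_orfs(dna, min_len=60):
    """Yield (strand, frame, start, end) for stretches lying between two
    consecutive in-frame STOP codons in each of the six frames.

    Coordinates are 0-based, end-exclusive, and refer to the scanned strand
    (the reverse complement for the '-' strand).
    """
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            prev_stop_end = None  # position just after the previous STOP
            for i in range(frame, len(seq) - 2, 3):
                if seq[i:i + 3] in STOPS:
                    if prev_stop_end is not None and i - prev_stop_end >= min_len:
                        yield strand, frame, prev_stop_end, i
                    prev_stop_end = i + 3

# Hypothetical toy sequence: 20 GCC codons flanked by two STOP codons
toy = "TAA" + "GCC" * 20 + "TAG"
orfs = list(stop_to_stop_orfs(toy))
```

Because no start codon is required, every sufficiently long STOP-free stretch is reported, which is why the resulting ORFs are usually longer than the annotated CDS they contain.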
If a noncoding ORF overlaps more than one genomic feature,
ORFtrack applies the following priority rules:
1. The noncoding ORF overlaps a CDS and any other genomic
feature: it is annotated as a noncoding ORF overlapping a CDS
(same or opposite strand) (e.g., nc_ovp_(same/opp)-CDS).
2. The noncoding ORF overlaps a genomic feature on the same
strand and any other genomic feature on the other strand
(except CDS): it is annotated as a noncoding ORF overlapping
the feature on the same strand (e.g., nc_ovp_same-x).
3. The noncoding ORF overlaps two or more genomic features
located on the same strand (which can be either the strand of
the noncoding ORF or the opposite one): it is annotated as
overlapping the genomic feature that has the largest overlap
with it (e.g., nc_ovp_(same/opp)-x).
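The three priority rules above can be sketched as follows. This is a minimal illustration and not ORFtrack's source code: the Overlap class and the annotate_noncoding function are hypothetical names, and the tie-breaking used when several CDS overlap the same ORF (largest overlap wins) is an assumption that the rules do not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    feature_type: str   # e.g. "CDS", "tRNA", "lncRNA"
    same_strand: bool   # True if the feature lies on the strand of the ORF
    length: int         # number of overlapping nucleotides


def annotate_noncoding(overlaps):
    """Return an ORFtrack-style label for a noncoding ORF (hypothetical helper)."""
    if not overlaps:
        return "nc_intergenic"

    # Rule 1: a CDS overlap takes precedence over any other feature.
    candidates = [o for o in overlaps if o.feature_type == "CDS"]
    if not candidates:
        # Rule 2: same-strand features take precedence over opposite-strand ones.
        same = [o for o in overlaps if o.same_strand]
        candidates = same if same else overlaps

    # Rule 3: among the remaining candidates, keep the largest overlap.
    # Assumption: the same criterion is used to choose between several CDS.
    best = max(candidates, key=lambda o: o.length)
    side = "same" if best.same_strand else "opp"
    return f"nc_ovp_{side}-{best.feature_type}"
```

For instance, an ORF overlapping a tRNA on its own strand and a CDS on the opposite strand would be labeled nc_ovp_opp-CDS by this sketch, regardless of the overlap lengths.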
The program provides the user with a new GFF file containing
all the identified ORFs annotated according to the four categories
defined previously. ORFget (a tool provided with ORFtrack) generates a FASTA file containing the amino acid sequences of all
identified ORFs or a subset of ORFs selected with respect to their
annotation category (e.g., c_CDS, nc_intergenic, nc_ovp_same,
nc_ovp_opp) or to their complete annotation for a finer selection
(e.g., nc_ovp_same-lncRNA and nc_ovp_opp-lncRNA, if
the user seeks to investigate whether ORFs overlapping lncRNAs
display specific properties compared to other noncoding ORFs; see
Subheading 3.3 for an example). Finally, ORFget allows the user to
extract in a FASTA file the amino acid sequences of all annotated
proteins and to reconstruct all isoforms of multi-exonic genes if
they are annotated in the input GFF file.
2.1.2 ORFold
ORFold aims at estimating the fold potential of a set of amino acid
sequences using the HCA method [17–20]. In addition, it can
predict their disorder or aggregation propensities, with IUPred
and Tango, respectively [21–26]. Although HCA is very fast and
can handle all ORFs of a small genome in a few minutes, the
calculation of the disorder and aggregation propensities slows
down ORFold (around 3 h on a single CPU (2 GHz processor,
16 GB RAM) for all the ORFs of E. coli). Consequently, the user
can turn off the calculation of the disorder and aggregation propensities. ORFold takes as input a FASTA file containing the amino
acid sequences to treat. The output of ORFold is a table containing
the fold potential and/or the disorder and aggregation propensities
of each input sequence. Optionally, the user can provide ORFold
with the genome annotation GFF file of the input genome. In this
case, the fold potential and/or the disorder and aggregation propensities of each ORF will be added to the GFF file. The latter can
be uploaded subsequently on a genome viewer such as IGV [27],
enabling the visual inspection and manual analysis of the distribution of the fold potential and the other structural properties along
the genome. The program can handle several FASTA files at the
same time and will generate as many outputs as given FASTA files.
Finally, ORFold can also provide the user with plots representing
the distribution of the fold potential of the input sequences along
with those of a dataset of globular proteins used as reference, taken
from Mészáros et al. [24].
HCA
ORFold estimates the fold potential with the HCA (Hydrophobic
Cluster Analysis) approach [19, 28]. The HCA toolkit is available at
https://github.com/T-B-F/pyHCA. It splits an amino acid
sequence into hydrophobic clusters and linkers. The former gather
strong hydrophobic residues (V, I, L, F, M, Y, W) and cysteines,
while the latter correspond to stretches of residues which are
composed of at least four non-hydrophobic residues or a proline.
Hydrophobic clusters usually indicate one or several regular secondary structures connected by short loops, which constitute signatures of globular domains. Linkers correspond to loops or
disordered regions. The fold potential of a sequence is determined
by its composition in hydrophobic clusters and linkers and is
reflected by the HCA score. The latter ranges from −10 to +10,
with low HCA scores indicating sequences that are enriched in
linkers and expected to be disordered. High HCA scores correspond to sequences with a high density of hydrophobic clusters that
are likely to form aggregates in solution, though some of them may
be able to fold in lipidic environments. Sequences that are able to
fold in solution are usually characterized by intermediate HCA
scores, as shown with the HCA scores of the reference dataset of
globular proteins in Fig. 5.
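To make the notions of cluster and linker more concrete, the following sketch segments a sequence using only the two criteria stated above (a linker is a stretch of at least four non-hydrophobic residues, or a proline). It is a simplified one-dimensional approximation written for illustration: it is not the pyHCA implementation, it does not compute the HCA score, and the function names are hypothetical.

```python
HYDROPHOBIC = set("VILFMYWC")  # strong hydrophobic residues plus cysteine


def hydrophobic_clusters(seq, min_linker=4):
    """Return (start, end) pairs (end exclusive) of 1D hydrophobic clusters.

    Two hydrophobic residues belong to the same cluster when the gap
    between them is shorter than min_linker and contains no proline.
    """
    s = seq.upper()
    clusters = []
    start = last = None              # bounds of the cluster being built
    for i, aa in enumerate(s):
        if aa not in HYDROPHOBIC:
            continue
        if last is not None:
            gap = s[last + 1:i]
            if len(gap) < min_linker and "P" not in gap:
                last = i             # extend the current cluster
                continue
            clusters.append((start, last + 1))
        start = last = i             # open a new cluster
    if start is not None:
        clusters.append((start, last + 1))
    return clusters


def cluster_coverage(seq):
    """Fraction of residues lying inside a hydrophobic cluster."""
    if not seq:
        return 0.0
    covered = sum(end - start for start, end in hydrophobic_clusters(seq))
    return covered / len(seq)
```

In this approximation, a sequence with a high cluster coverage would sit toward the hydrophobic end of the scale and a sequence dominated by linkers toward the disordered end; the actual HCA score should be obtained from pyHCA or ORFold.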
Tango
ORFold calculates the aggregation propensity of a sequence with
Tango [21–23], which is available at http://tango.crg.es upon
request to the developers. Following the criteria proposed by
Linding et al. [21], a sequence segment is considered as
aggregation-prone if it is composed of at least five consecutive
residues predicted as populating a β-aggregated conformation
with a percentage occupancy greater than 5%. The aggregation
propensity of a sequence is then calculated as the fraction of residues predicted in an aggregation-prone segment.
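The calculation described above can be sketched as follows, assuming that the per-residue β-aggregation occupancies (in %) have already been obtained from Tango. The function name and the list-based input are hypothetical and do not reflect ORFold's internal code; the threshold is left as a parameter so that the same segment criterion can be applied to other per-residue scores.

```python
def prone_fraction(scores, threshold, min_len=5):
    """Fraction of residues found in runs of at least min_len
    consecutive residues whose score is strictly above threshold."""
    if not scores:
        return 0.0
    in_segment = 0   # residues belonging to a qualifying run
    run = 0          # length of the current run above threshold
    for s in scores:
        if s > threshold:
            run += 1
        else:
            if run >= min_len:
                in_segment += run
            run = 0
    if run >= min_len:   # close a run that reaches the end of the sequence
        in_segment += run
    return in_segment / len(scores)


# Illustrative values only: one run of five residues above 5% qualifies,
# the later run of two residues does not.
occupancy = [0, 0, 10, 10, 10, 10, 10, 0, 10, 10, 0, 0]
aggregation_propensity = prone_fraction(occupancy, threshold=5.0)
```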
IUPred
ORFold calculates the disorder propensity with IUPred [24–26,
29]. We use version 2A of IUPred [24, 25], which is available at
https://iupred2a.elte.hu upon request to the developers. Consistent with the criteria used for the definition of an aggregation-prone region, we consider as disordered a region composed of
at least five consecutive residues displaying a disorder probability
higher than 0.5. As for the aggregation propensity, the disorder propensity of a sequence is calculated as the
fraction of residues predicted in a disordered segment.
3 Methods
3.1 Classical Use: Probing the Fold Potential of a Complete Genome
Here we seek to probe the fold potential and the aggregation and
disorder propensities of all noncoding ORFs of E. coli str. K-12
substr. MG1655 (E. coli), regardless of whether they overlap a genomic feature. As a reference, we will also characterize these properties
for all CDS of E. coli.
3.1.1 FASTA and GFF Files Used in this Example
1. E_coli.fna (available at https://github.com/i2bc/ORFmine in
the “examples” directory).
2. E_coli.gff (available at https://github.com/i2bc/ORFmine in
the “examples” directory).
3.1.2 Annotation of the ORFs of E. coli with ORFtrack
The following ORFtrack instruction displays all the genomic features annotated in the E. coli genome:
> orftrack -fna E_coli.fna -gff E_coli.gff --show-types
Up to 12 different genomic features are annotated in the E. coli
genome, including CDS, tRNA, rRNA (see Note 1). We then
annotate all the possible ORFs of E. coli with the following
instruction:
> orftrack -fna E_coli.fna -gff E_coli.gff
The execution time on a single CPU (2 GHz processor, 16 GB
RAM) is 38 s. ORFtrack generates a new GFF file (mapping_orf_E_coli.gff) that contains 135097 annotated ORFs of which
130637 are annotated as noncoding. Table 1 shows the distribution of the output ORFs across the different annotation categories
with various levels of annotations. This information is available in
the summary file produced by ORFtrack (summary.log). Notice
that it is also possible to scan all the annotated ORFs by loading the
new GFF into a genome viewer.
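Besides the summary file, the category counts can be recomputed directly from the output GFF. The sketch below assumes that ORFtrack writes the annotation label (e.g., c_CDS, nc_intergenic, nc_ovp_same-tRNA) in the third (type) column of the GFF; this layout and the function name are assumptions made for illustration and should be checked against the actual file.

```python
from collections import Counter

CATEGORIES = ("c_CDS", "nc_intergenic", "nc_ovp_same", "nc_ovp_opp")


def count_categories(gff_lines):
    """Count ORFs per main category, reading the label from GFF column 3."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                      # skip malformed lines
        label = fields[2]
        for category in CATEGORIES:
            if label.startswith(category):
                counts[category] += 1
                break
        else:
            counts["other"] += 1          # label outside the four categories
    return counts
```

The function accepts any iterable of lines, so an open file handle on mapping_orf_E_coli.gff can be passed to it directly.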
Table 1
Counts of E. coli ORFs for each annotation category

Annotation category                                          Number of ORFs
Total ORFs                                                   135,097
  Coding (c_CDS)                                             4460
  Noncoding (nc_*)                                           130,637
    Noncoding intergenic (nc_intergenic)                     18,318
    Noncoding overlapping with a genomic feature (nc_ovp_*)  112,319
      On the same strand (nc_ovp_same-x)                     47,880
      On the opposite strand (nc_ovp_opp-x)                  64,439

With x standing for:      Same strand (nc_ovp_same-x)   Opposite strand (nc_ovp_opp-x)
CDS                       45,053                        62,354
Repeat region             1136                          545
Sequence feature          626                           566
r-RNA                     607                           528
nc-RNA                    140                           130
t-RNA                     119                           114
Pseudogene                119                           109
Mobile genomic element    77                            87
Origin of replication     3                             4
Recombination feature     0                             2

3.1.3 Extraction and Writing of the Noncoding ORFs and the CDS of E. coli
Extraction of Noncoding ORFs
In this example, we consider all the 130637 noncoding ORFs and
do not differentiate noncoding intergenic ORFs from those that
overlap a genomic feature. Therefore, we extract and write the
amino acid sequences of all noncoding ORFs (i.e., nc_intergenic,
nc_ovp_same, and nc_ovp_opp) with ORFget with the following
command line (see Note 2):
> orfget -fna E_coli.fna -gff mapping_orf_E_coli.gff -features_include nc -o E_coli_noncoding
ORFget generates a FASTA file with the resulting 130637
amino acid sequences.
Extraction of CDS
Finally, in order to compare the structural properties of CDS with
those of the potential peptides “encoded” in noncoding regions,
we extract and rebuild the amino acid sequences of each CDS of
E. coli according to the original annotation GFF file:
> orfget -fna E_coli.fna -gff E_coli.gff -features_include CDS -o E_coli_CDS
We obtain a FASTA file of 4316 protein sequences.
3.1.4 Characterization of the Fold Potential, and the Disorder and Aggregation Propensities of the ORFs and CDS of E. coli with ORFold
We aim to characterize the fold potential and the disorder and
aggregation propensities of the noncoding ORFs (intergenic and
overlapping ORFs) and CDS of E. coli. ORFold can handle the two
datasets at the same time with the following instruction:
> orfold -fna E_coli_noncoding.pfasta E_coli_CDS.pfasta -gff mapping_orf_E_coli.gff E_coli.gff -options HIT
The execution time on a single CPU is around 3 h. ORFold
generates two tables (one per dataset) containing, for each
sequence, its fold potential as well as its disorder and aggregation
propensities calculated by HCA, IUPred, and Tango, respectively.
In addition, ORFold writes the output values in a new GFF file that
can be uploaded into a genome viewer. The original GFF can be
uploaded as well, providing a reference with the exact localization
of the genomic features annotated in the original GFF. We recall
that ORFtrack identifies and annotates all the possible ORFs of a
genome, which do not correspond to real objects but rather to the
potential peptides that could be produced if their corresponding
DNA region is transcribed and the resulting RNA subsequently
translated.
Figure 4 shows the two DNA strands of a genomic section of
E. coli represented by the genome viewer IGV [27] after uploading
the original GFF (blue genes in the middle) and the new GFF
returned by ORFtrack (small ORFs in the panels 2 and 4).
Although the genome of E. coli is very compact, with few intergenic
regions, there is a high density of noncoding ORFs that overlap
with the coding genes of E. coli and that represent a high potential
of novel peptides in case of ribosomal frameshifting. Interestingly,
the distribution of the fold potential along the genome is not
homogeneous. We observe an island of noncoding ORFs with
high HCA values (ORFs in light and dark red in the middle of
the figure). These ORFs potentially encode peptides enriched in
hydrophobic residues that are likely to be foldable (light red ORFs)
or expected to form aggregates in solution (dark red ORFs). The
GFF returned by ORFold containing the Tango or IUPred values
can provide the user with complementary information (data not
shown). The genomic regions around the island of ORFs with high HCA
values are enriched in ORFs with intermediate HCA values
typical of foldable sequences (ORFs in light red and light blue).
Overall, it is interesting to note that the fold potential seems to be
quite conserved among the three frames of a strand, though it can
vary along the strand. This recalls the observation made by
Bartonek et al. [30], who showed that the hydrophobicity profiles
of protein sequences are preserved in the +1 and −1 frames through the
structure of the genetic code. Finally, the visual inspection of the
distribution of the fold potential of noncoding ORFs suggests that
there are a vast number of ORFs that potentially encode foldable
peptides (light blue and light red boxes corresponding to intermediate HCA values). Whether these peptides would fold to a specific
3D structure or to a molten globule is a crucial and very difficult
question that deserves further investigation.
Fig. 4 Screenshot of a genomic section of E. coli represented by IGV. Genomic features present in the original
GFF file (CDS in this example) are represented with blue boxes in the middle of the figure (panel 3). Panels
2 and 4 represent the noncoding ORFs identified by ORFtrack in the positive and negative strands,
respectively. They are colored according to their annotation category (gray, blue, and green for nc_intergenic,
nc_ovp_same, and nc_ovp_opp, respectively). Panels 1 and 5 represent the same ORFs colored with respect
to their HCA scores. ORFs with low HCA scores are colored in blue, whereas ORFs with high HCA scores are
colored in red. For more clarity, c_CDS that correspond to ORFs including a CDS in the same frame are not
shown, since the corresponding CDS are already represented with the blue boxes in the middle panel
Finally, we plot the distributions of the fold potential of the two
datasets with ORFplot. Notice that ORFplot can deal with several
inputs and will plot as many distributions as given tables.
> orfplot -tab E_coli_CDS.tab E_coli_noncoding.tab -names "E. coli CDS" "E. coli noncoding ORFs"
Fig. 5 Distribution of the HCA scores calculated for the CDS and the noncoding ORFs of E. coli (dark blue and
light blue curves, respectively). The HCA score distribution of the set of globular proteins is represented by the
gray histogram. Dotted black lines delineate the boundaries of the low, intermediate, and high HCA score bins
so that 95% of the globular proteins fall into the intermediate HCA score bin. Each distribution is compared
with that of the globular protein set with a Kolmogorov-Smirnov test. Asterisks on the plot denote level of
significance: *** < 0.001
Figure 5 shows the fold potential distributions of the noncoding ORFs and the CDS of E. coli as plotted by ORFplot. Furthermore, as a reference, ORFplot plots the distribution of the HCA
scores of a set of globular protein sequences taken from [24]. The
fold potential distribution of the CDS is clearly different from that
of the noncoding sequences (KS test, P = 9.9 × 10⁻¹⁸). The
CDS are enriched in intermediate HCA values typical of foldable
proteins, as shown by the HCA scores of the globular proteins.
Conversely, noncoding ORFs display a wide range of HCA values
reflecting foldable, disordered, or aggregation-prone potential peptides. Nevertheless, it is interesting to note that most of them
(~64%) exhibit HCA scores similar to those of globular proteins, revealing
an important potential of foldable peptides, in line with the observation made in Fig. 4.
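The fraction of sequences falling into the low, intermediate, and high HCA score bins can be estimated along the following lines. This is a sketch under stated assumptions: the bin boundaries are taken here as the 2.5th and 97.5th percentiles of the reference scores, so that 95% of the reference set is intermediate, whereas the exact way ORFplot places its boundaries is not described in this chapter. The function names and the data in the usage example are purely illustrative.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile (0 <= q <= 100) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def bin_fractions(scores, reference, central=95.0):
    """Fractions of scores below, within, and above the central
    interval that contains `central` percent of the reference scores."""
    ref = sorted(reference)
    tail = (100.0 - central) / 2.0        # assumption: symmetric tails
    low_cut = percentile(ref, tail)
    high_cut = percentile(ref, 100.0 - tail)
    n = len(scores)
    low = sum(s < low_cut for s in scores) / n
    high = sum(s > high_cut for s in scores) / n
    return {"low": low, "intermediate": 1.0 - low - high, "high": high}


# Synthetic numbers, not real HCA scores.
reference = list(range(0, 101))
example = bin_fractions([0, 1, 50, 60, 99, 100, 3, 97], reference)
```

With real data, the scores would be read from the HCA column of the tables written by ORFold, and the reference would be the globular protein set.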
3.2 Application to Large Genomes and Comparison with Other Species
The execution time and the size of the outputs increase with the
size of the input genome. This can become dramatic for very large
genomes such as those of mammals or plants. Even if the execution
time for ORFtrack and ORFget is acceptable, it becomes prohibitive for ORFold. Furthermore, the sizes of the outputs are very
large. In this section, we present alternatives to reduce the computational time and the size of the generated outputs.
3.2.1 FASTA and GFF Files Used in this Example
1. M_musculus.fna.
2. M_musculus.gff (downloadable at https://www.ncbi.nlm.nih.gov/genome/?term=mus+musculus).
3. E_coli.fna.
4. E_coli.gff (downloadable at https://www.ncbi.nlm.nih.gov/genome/?term=e+coli).
5. H_volcanii.fna.
6. H_volcanii.gff (downloadable at https://www.ncbi.nlm.nih.gov/genome/?term=haloferax+volcanii).
7. D_melanogaster.fna.
8. D_melanogaster.gff (downloadable at https://www.ncbi.nlm.nih.gov/genome/?term=drosophila+melanogaster).
3.2.2 Annotation of ORFs of M. musculus with ORFtrack
In order to reduce the execution time (around 64 h on a single
CPU), we recommend running ORFtrack on a cluster. The following command displays all the “seqid” values contained in the first
column of the input GFF file (usually chromosomes and contigs):
> orftrack -fna M_musculus.fna -gff M_musculus.gff --show-chr
The ORF annotation can therefore be distributed over multiple
CPUs (i.e., one job per “seqid”), substantially reducing the computational time. ORFtrack must then be launched as many times
as there are different “seqid” values in the original GFF. Here, ORFtrack is launched on the chromosome NC_000067.7 with the
following instruction:
> orftrack -fna M_musculus.fna -gff M_musculus.gff -chr NC_000067.7
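The one-job-per-seqid strategy can be prepared with a few lines of scripting. The sketch below reads the seqid values from the first column of the GFF and builds one ORFtrack command per seqid, using only the -fna, -gff, and -chr options shown above. The helper names are hypothetical, and how the commands are submitted (job scheduler, process pool) depends on the cluster and is left out.

```python
def seqids_from_gff(gff_lines):
    """Return the unique seqid values (GFF column 1) in order of appearance."""
    seen = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip comments and blank lines
        seen.setdefault(line.split("\t", 1)[0], None)
    return list(seen)


def orftrack_commands(fna, gff, seqids):
    """Build one ORFtrack argument list per seqid, using the -chr option."""
    return [["orftrack", "-fna", fna, "-gff", gff, "-chr", s] for s in seqids]
```

Each argument list can then be written to a job script or passed to subprocess.run on the node that handles the corresponding seqid.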
3.2.3 Extraction and Writing of the ORFs and CDS of M. musculus with ORFget
Definition of a Minimal Subset Size to Characterize the Fold Potential and Structural Properties of Noncoding ORFs
Extracting all annotated ORFs with ORFget takes around 3 h on a
single CPU and generates a 7.5 GB FASTA file containing up to
89 × 10⁶ noncoding ORFs. Characterizing their fold potential and
disorder and aggregation propensities with ORFold would take
about 6 months on a single CPU. Consequently, we recommend
running ORFold on a representative subset of noncoding ORFs.
Indeed, a subset of 20,000 ORFs is sufficient to estimate the fold
potential and the disorder and aggregation propensities of the
whole dataset of noncoding ORFs. The HCA score distribution
obtained with a subset of 20,000 randomly selected noncoding ORFs of
Drosophila melanogaster does not differ significantly from that of
its complete set of noncoding ORFs (Kolmogorov-Smirnov test). The same observations
are made for the IUPred and Tango score distributions and hold
also for other species such as Haloferax volcanii and E. coli. Consequently, in the next section, ORFold will be applied to a set of
20,000 randomly selected noncoding ORFs extracted from the
complete set of mouse noncoding ORFs.
Extraction and Writing of the Amino Acid Sequences of a Dataset of 20,000 Noncoding ORFs
The following instruction allows the extraction of a subset of
20,000 noncoding ORFs (see Note 3 for more advanced examples):
> orfget -fna M_musculus.fna -gff mapping_orf_M_musculus.gff -features_include nc -o M_musculus_noncoding -N 20000
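Whether a random subset is representative of a complete set of scores can be checked, for a genome small enough to be fully characterized, by comparing the two distributions with the two-sample Kolmogorov-Smirnov statistic. The sketch below is a plain-Python illustration with hypothetical function names. It returns only the statistic D; the associated p-value would have to be obtained from a statistics library (for example scipy.stats.ks_2samp).

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions that only change at data
    # points, so evaluating them at every observed value is sufficient.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def subset_ks(scores, n, seed=0):
    """KS statistic between a random subset of size n and the full set."""
    rng = random.Random(seed)             # fixed seed for reproducibility
    return ks_statistic(rng.sample(scores, n), scores)
```

A small D for a given subset size indicates that the subset reproduces the distribution of the complete set; the calculation can be repeated for HCA, IUPred, and Tango values.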
Then, in order to compare the fold potential and the disorder
and aggregation propensities of the noncoding ORFs of
M. musculus with those of the CDS, we reconstruct the amino
acid sequences of all the isoforms annotated in the original GFF file:
> orfget -fna M_musculus.fna -gff M_musculus.gff -features_include CDS -o M_musculus_CDS
3.2.4 Characterization of the Fold Potential and the Structural Properties of a Set of 20,000 Noncoding ORFs Along with those of M. musculus CDS
We execute ORFold on the small dataset of randomly selected
noncoding ORFs and the complete set of mouse isoforms:
> orfold -fna M_musculus_noncoding.pfasta M_musculus_CDS.pfasta -options HIT
ORFold provides us with two tables, containing the fold potential and the disorder and aggregation propensities of the 20,000
noncoding ORFs and the 92,473 mouse isoforms (around 40 h on
a single CPU).
3.2.5 Comparison of the Fold Potential of the Noncoding ORFs and the CDS Calculated for Different Species
ORFplot can handle multiple datasets at the same time. Following
the same protocol as the one used for the mouse, we also calculated
the fold potential of a subset of 20,000 noncoding ORFs and all
CDS of H. volcanii, E. coli, and D. melanogaster. We then present
the HCA score distributions of all datasets on the same graph.
> orfplot -tab E_coli_CDS.tab H_volcanii_CDS.tab D_melanogaster_CDS.tab M_musculus_CDS.tab -names "E. coli" "H. volcanii" "D. melanogaster" "M. musculus"
> orfplot -tab E_coli_noncoding.tab H_volcanii_noncoding.tab D_melanogaster_noncoding.tab M_musculus_noncoding.tab -names "E. coli" "H. volcanii" "D. melanogaster" "M. musculus"
Fig. 6 (a) Distribution of the HCA scores calculated for the CDS of E. coli, H. volcanii, D. melanogaster, and
M. musculus (dark blue, light blue, dark orange, and light orange curves, respectively). (b) Distribution of the
HCA scores calculated for the noncoding ORFs of E. coli, H. volcanii, D. melanogaster, and M. musculus (dark
blue, light blue, dark orange, and light orange curves, respectively). The HCA score distribution of the globular
proteins is presented with the gray histogram. Each distribution is compared with that of the globular
protein set with a Kolmogorov-Smirnov test. Asterisks on the plot denote the level of significance:
*** < 0.001
Figure 6 shows, for the four species, the HCA score distributions of the corresponding CDS (Fig. 6a) and noncoding ORFs
(Fig. 6b). Although the fold potential distributions of the CDS
display slight variations among the four species, the vast majority
(more than 85%) exhibit intermediate HCA scores typical of the
scores obtained for the globular proteins. This reflects that being
foldable is a trait that has been strongly selected during evolution.
However, the fold potential distribution of the noncoding ORFs
calculated for H. volcanii is clearly different from those of the other
species. Indeed, the other species are mostly characterized by noncoding ORFs that, similarly to CDS, encode peptides predicted as
foldable. Conversely, the noncoding ORFs of H. volcanii are
enriched in sequences with low HCA scores that are likely to
encode disordered peptides. Whether this enrichment in hydrophilic sequences comes from the fact that this species lives in
hypersaline environments is an exciting question that deserves
further investigation.
3.3 Probing the Fold Potential of a Set of Mouse Noncoding ORFs Shown to Be Pervasively Translated
Recently, Ruiz-Orera et al. [1] revealed with ribosome profiling
experiments the translation of 721 ORFs in mouse lncRNAs (i.e.,
translated lncRNA-ORFs). They are neither conserved across neighboring species nor subjected to selective pressure. The authors
propose them as intermediates between noncoding ORFs and de
novo genes [1]. This prompts us to ask whether their
corresponding peptides display specific structural properties compared to peptides encoded by ORFs in other lncRNAs (i.e., nontranslated lncRNA-ORFs). Therefore, in this section, we
characterize their respective HCA score distributions, along with
those of the CDS and the subset of 20,000 randomly selected
noncoding ORFs defined in Subheading 3.2. The amino acid
sequences of all translated products identified in Ruiz-Orera et al.
[1] (i.e., products coming from protein coding genes or noncoding
regions) can be downloaded at https://figshare.com/articles/dataset/Ruiz-Orera_et_al_2017_/4702375?file=10323906. We
extracted the sequences of the 721 translated lncRNA-ORFs by
searching the sequences containing either the “lncRNAa:translated:NC” or the “novel:translated:NC” pattern in their annotation. Then, 20,000 nontranslated lncRNA-ORFs were extracted
randomly from the GFF generated with ORFtrack in Subheading
3.2 with the following instruction:
> orfget -fna M_musculus.fna -gff mapping_orf_M_musculus.gff -features_include nc_ovp_same-lncRNA -o M_musculus_nc_ovp_same-lncRNA -N 20000
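The pattern-based selection of the translated lncRNA-ORFs described above can be sketched as a simple filter on FASTA headers. The two patterns are copied verbatim from the text; the function names and the plain substring matching are assumptions, since the exact layout of the headers in the downloaded file is not described here.

```python
PATTERNS = ("lncRNAa:translated:NC", "novel:translated:NC")


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)           # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def select_by_patterns(records, patterns=PATTERNS):
    """Keep the records whose header contains at least one of the patterns."""
    return [(h, s) for h, s in records if any(p in h for p in patterns)]
```

The selected records can then be written back to a FASTA file and used as the second ORFold input.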
The amino acid sequences of the 721 translated lncRNA-ORFs
and the 20,000 nontranslated lncRNA-ORFs can be directly given
as input to ORFold.
> orfold -fna M_musculus_nc_ovp_same-lncRNA.pfasta M_musculus_translated_721_orfs.pfasta -options H
We subsequently plot the fold potentials of the four sets of
ORFs with ORFplot:
> orfplot -tab M_musculus_CDS.tab M_musculus_noncoding.tab M_musculus_nc_ovp_same-lncRNA.tab M_musculus_translated_721_orfs.tab -names "CDS" "Noncoding ORFs" "Nontranslated lncRNA-ORFs" "Translated lncRNA-ORFs"
Figure 7 shows the HCA score distributions of the four sets of
ORFs. While the nontranslated lncRNA-ORFs display HCA
scores similar to those of noncoding ORFs (Kolmogorov-Smirnov test,
P = 0.46), the 721 translated lncRNA-ORFs exhibit a clearly
different HCA value distribution from the three other datasets
(Kolmogorov-Smirnov test, P = 5.9 × 10⁻⁶, 4.8 × 10⁻⁶, and
2.4 × 10⁻⁶ with nontranslated lncRNA-ORFs, noncoding ORFs,
and CDS, respectively). Although they are characterized by a
majority of intermediate HCA score sequences expected to be
foldable, they are clearly enriched in disorder-prone sequences,
recalling the observation made by Wilson et al. [31] that young
proteins are more disordered than old ones. That said, it is interesting to note that, similarly to the two other noncoding ORF categories, the translated lncRNA-ORFs exhibit a majority of
sequences that potentially encode peptides expected to be foldable.
Further investigations are needed to determine whether their
corresponding peptides fold to a well-defined and stable 3D structure or to a molten globule.
Fig. 7 Distribution of the HCA scores calculated for the CDS, the 20,000 noncoding ORFs, the 20,000
nontranslated lncRNA-ORFs, and the 721 translated lncRNA-ORFs of M. musculus (dark blue, light blue,
dark orange, and light orange curves, respectively). The HCA score distribution of the set of globular proteins is
presented with the gray histogram. Each distribution is compared with that of the globular proteins with a
Kolmogorov-Smirnov test. Asterisks on the plot denote the level of significance: *** < 0.001
4 Conclusion
Here, we presented three protocols that all aim at characterizing the
fold potential and the structural properties of different sets of
ORFs, including coding sequences, the ensemble or a representative subset of the noncoding ORFs of a genome, or a specific subset
of sequences of interest. ORFtrack is very fast, annotating a million
ORFs in a few hours. In addition, it allows the user to deal with
different levels of annotation and various combinations of selection
patterns, thereby facilitating the definition of many ORF categories. ORFold can handle many inputs and enables the simultaneous visualization of the fold potential calculated for different
datasets or the manual inspection of the fold potential or structural
properties of all annotated ORFs of a genome with a genome
viewer. In addition, ORFold can be used to probe the fold potential
and the structural properties of any set of amino acid sequences
without any genomic information including, for instance, designed
peptides or de novo peptides identified with mass spectrometry in
different tissues or conditions. Finally, ORFmine opens up new
applications in peptide discovery and characterization. In particular, recent studies have reported the existence of de novo peptides
associated with human diseases [11, 32–37]. ORFtrack can be used
to mine noncoding genomes for the identification of de novo
peptides which are usually difficult to identify with mass spectrometry experiments (for example, peptides resulting from the translation of RNAs associated with diseases). On the other hand, ORFold
provides valuable and complementary information with the characterization of their fold potential and structural properties.
5 Notes
1. Notice that the genomic features of a GFF3 file follow a specific
hierarchy. For example, the feature “gene” has children (e.g.,
CDS, exons, tRNAs, rRNAs). In addition, features of the same
level can overlap with each other (e.g., a CDS and its
corresponding exon). By default, the features “gene” and
“exon” are not considered. ORFs that match with the feature
“gene” will be annotated according to its children or related
features (mRNA, tRNA, etc.). For example, ORFs overlapping
tRNAs on the same strand necessarily overlap the parent genes
as well, but for a more precise annotation, ORFtrack will
annotate them as nc_ovp_same-tRNA instead of
nc_ovp_same-gene. Finally, an ORF that matches the feature
“CDS” usually matches the corresponding “exon” feature as
well. However, the “exon” feature is not considered, and the
ORF will be annotated as c_CDS if it is in the same frame as the
CDS, or as nc_ovp_(same/opp)-CDS if it is in another frame.
2. Notice that the following instruction leads to the same
result:
> orfget -fna E_coli.fna -gff mapping_orf_E_coli.gff -features_include nc_intergenic nc_ovp -o E_coli_noncoding
3. Notice that ORFget can extract a random subset of ORFs
belonging to a specific category (e.g., extraction of 20,000
noncoding ORFs overlapping lncRNAs on the same strand)
as follows:
> orfget -fna M_musculus.fna -gff mapping_orf_M_musculus.gff -features_include nc_ovp_same-lncRNA -o M_musculus_nc_ORF_ovp_same-lncRNA -N 20000
References
1. Ruiz-Orera J, Verdaguer-Grau P, Villanueva-Cañas J et al (2018) Translation of neutrally
evolving peptides provides a basis for de novo
gene evolution. Nat Ecol Evol 2:890–896
2. Chen J, Brunner A-D, Cogan JZ et al (2020)
Pervasive functional translation of noncanonical human open reading frames. Science 367:
1140–1146
3. Ingolia NT, Lareau LF, Weissman JS (2011)
Ribosome profiling of mouse embryonic stem
cells reveals the complexity and dynamics of
mammalian proteomes. Cell 147:789–802
4. Li J, Liu C (2019) Coding or noncoding, the
converging concepts of RNAs. Front Genet 10:
496
5. Slavoff SA, Mitchell AJ, Schwaid AG et al
(2013) Peptidomic discovery of short open
reading frame–encoded peptides in human
cells. Nat Chem Biol 9:59
6. Prabakaran S, Hemberg M, Chauhan R et al
(2014) Quantitative profiling of peptides from
RNAs classified as noncoding. Nat Commun 5:
5429
7. Samayoa J, Yildiz FH, Karplus K (2011) Identification of prokaryotic small proteins using a
comparative genomic approach. Bioinformatics 27:1765–1771
8. Hobbs EC, Fontaine F, Yin X, Storz G (2011)
An expanding universe of small proteins. Curr
Opin Microbiol 14:167–173
9. Eguen T, Straub D, Graeff M, Wenkel S (2015)
MicroProteins: small size–big impact. Trends
Plant Sci 20:477–482
10. Deng Y, Bamigbade AT, Hammad MA et al
(2018) Identification of small ORF-encoded
peptides in mouse serum. Biophys Rep 4:
39–49
11. Wang S, Mao C, Liu S (2019) Peptides
encoded by noncoding genes: challenges and
perspectives. Signal Transduct Target Ther 4:
1–12
12. Carvunis A-R, Rolland T, Wapinski I et al
(2012) Proto-genes and de novo gene birth.
Nature 487:370–374
13. Schaefer C, Schlessinger A, Rost B (2010) Protein secondary structure appears to be robust
under in silico evolution while protein disorder
appears not to be. Bioinformatics 26:625–631
14. Tretyachenko V, Vymětal J, Bednárová L et al
(2017) Random protein sequences can form
defined secondary structures and are well-tolerated in vivo. Sci Rep 7:1–9
15. Keefe AD, Szostak JW (2001) Functional proteins from a random-sequence library. Nature
410:715–718
16. Neme R, Amador C, Yildirim B et al (2017)
Random sequences are an abundant source of
bioactive RNAs or peptides. Nat Ecol Evol 1:
1–7
17. Faure G, Callebaut I (2013) Comprehensive
repertoire of foldable regions within whole
genomes. PLoS Comput Biol 9:e1003280
18. Faure G, Callebaut I (2013) Identification of
hidden relationships from the coupling of
hydrophobic cluster analysis and domain architecture information. Bioinformatics 29:
1726–1733
19. Bitard-Feildel T, Callebaut I (2018) HCAtk
and pyHCA: A toolkit and python API for the
hydrophobic cluster analysis of protein
sequences. bioRxiv 249995
20. Lamiable A, Bitard-Feildel T, Rebehmed J et al
(2019) A topology-based investigation of protein interaction sites using hydrophobic cluster
analysis. Biochimie 167:68–80
21. Linding R, Schymkowitz J, Rousseau F et al
(2004) A comparative study of the relationship
between protein structure and β-aggregation in
globular and intrinsically disordered proteins. J
Mol Biol 342:345–353
22. Fernandez-Escamilla A-M, Rousseau F,
Schymkowitz J, Serrano L (2004) Prediction
of sequence-dependent and mutational effects
on the aggregation of peptides and proteins.
Nat Biotechnol 22:1302–1306
23. Rousseau F, Schymkowitz J, Serrano L (2006)
Protein aggregation and amyloidosis: confusion of the kinds? Curr Opin Struct Biol 16:
118–126
24. Mészáros B, Erdős G, Dosztányi Z (2018)
IUPred2A: context-dependent prediction of
protein disorder as a function of redox state
and protein binding. Nucleic Acids Res 46:
W329–W337
25. Erdős G, Dosztányi Z (2020) Analyzing protein disorder with IUPred2A. Curr Protoc Bioinformatics 70:e99
26. Dosztányi Z (2018) Prediction of protein disorder based on IUPred. Protein Sci 27:
331–340
27. Robinson JT, Thorvaldsdóttir H, Winckler W
et al (2011) Integrative genomics viewer. Nat
Biotechnol 29:24–26
28. Bitard-Feildel T, Callebaut I (2017) Exploring
the dark foldable proteome by considering
hydrophobic amino acids topology. Sci Rep 7:
1–13
29. Mészáros B, Simon I, Dosztányi Z (2009) Prediction of protein binding regions in disordered proteins. PLoS Comput Biol 5:
e1000376
30. Bartonek L, Braun D, Zagrovic B (2020) Frameshifting preserves key physicochemical
properties of proteins. Proc Natl Acad Sci U S
A 117:5907–5912
31. Wilson BA, Foy SG, Neme R, Masel J (2017)
Young genes are highly disordered as predicted
by the preadaptation hypothesis of de novo
gene birth. Nat Ecol Evol 1:1–6
32. Yin X, Jing Y, Xu H (2019) Mining for missed
sORF-encoded peptides. Expert Rev Proteomics 16:257–266
33. Lawrence MS, Stojanov P, Polak P et al (2013)
Mutational heterogeneity in cancer and the
search for new cancer-associated genes. Nature
499:214–218
34. Yadav M, Jhunjhunwala S, Phung QT et al
(2014) Predicting immunogenic tumour
mutations by combining mass spectrometry
and exome sequencing. Nature 515:572–576
35. Sendoel A, Dunn JG, Rodriguez EH et al
(2017) Translation from unconventional 5′
start sites drives tumour initiation. Nature
541:494–499
36. Barbosa C, Peixeiro I, Romão L (2013) Gene
expression regulation by upstream open
reading frames and human disease. PLoS
Genet 9:e1003529
37. von Bohlen AE, Böhm J, Pop R et al (2017) A
mutation creating an upstream initiation codon
in the SOX9 5′ UTR causes acampomelic
campomelic dysplasia. Mol Genet Genomic
Med 5:261–268
Chapter 4
Computational Identification and Design of Complementary
β-Strand Sequences
Yoonjoo Choi
Abstract
The β-sheet is a regular secondary structure element which consists of linear segments called β-strands.
They are involved in many important biological processes, and some are known to be related to serious
diseases such as neurologic disorders and amyloidosis. The self-assembly of β-sheet peptides also has
practical applications in material sciences since they can be building blocks of repeated nanostructures.
Therefore, computational algorithms for identification of β-sheet formation can offer useful insight into the
mechanism of disease-prone protein segments and the construction of biocompatible nanomaterials.
Despite the recent advances in structure-based methods for the assessment of atomic interactions, identifying amyloidogenic peptides has proven to be extremely difficult since they are structurally very flexible.
Thus, an alternative strategy is required to describe β-sheet formation. It has been hypothesized and
observed that there are certain amino acid propensities between β-strand pairs. Based on this hypothesis,
a database search algorithm, B-SIDER, is developed for the identification and design of β-sheet forming
sequences. Given a target sequence, the algorithm identifies exact or partial matches from the structure
database and constructs a position-specific score matrix. The score matrix can be utilized to design novel
sequences that can form a β-sheet specifically with the target.
Key words Beta strand, Beta sheet, Complementary sequence, Amyloid, Computational design,
Amino acid propensity
1
Introduction
One of the major elements of protein structure is the β-sheet, which
consists of adjacent linear strands in a parallel or antiparallel
arrangement. A β-sheet is composed of two or more β-strands
connected by hydrogen bonds. Though the bonds are formed
between backbone atoms, there are certain statistical propensities
of residue pairs between β-strands [1, 2]. For example, the diphenylalanine (FF) motif can self-assemble through π-stacking interactions [3]. Aromatic amino acids tend to form
β-sheet pairings with adjacent valine or glycine [2]. Charge–charge
interactions between neighboring strands may also be important for
forming β-sheets [4].
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_4,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
β-strand forming peptides have practical applications in many
fields. The self-assembling nature of β-sheet forming amyloidogenic
peptides has profound applications in nanomedical sciences. Various peptides derived from amyloid-β (Aβ), which is an important
biomarker for Alzheimer’s disease [5], are utilized to construct
diverse nanostructures [6–9]. Their mechanical stability, thermal
robustness, and biocompatibility are significant advantages in
medical engineering applications such as tissue regeneration and tissue engineering
[10, 11]. On the other hand, serious diseases such as Alzheimer’s,
Parkinson’s, type II diabetes, and amyloidosis [12, 13] are related
to β-sheet aggregation, and there are therapeutic agents for such
aggregation-prone regions [14, 15]. Recent studies have shown that
amyloidogenic peptides can be used for β-sheet stacking [16, 17],
implying that complementary β-strand sequences can be utilized as
therapeutic peptides [15, 18].
The importance and applicability of β-sheet forming peptides
have attracted great attention, and a number of computational
algorithms for the identification of β-sheet forming regions have
been developed based on known β-strand pairing patterns [19–26]. However, the utilization of such patterns has critical limitations
because the patterns have little in common and thus are in general very
noisy [27–29]. Therefore, an alternative strategy for the recognition of meaningful patterns is required.
Here, we provide a step-by-step guide to the B-SIDER (β-Stacking Interaction DEsign for Reciprocity) [30] software,
which finds complementary β-strand sequences for a given peptide.
The computational method generates target-specific statistical patterns for the query from a well-curated structure database. Significant statistical patterns are amplified by overlapping partially
matched pairs (Fig. 1).
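The fragment-overlap idea can be illustrated with a few lines of Python. This is a minimal sketch and not the B-SIDER source code: the function name `enumerate_fragments` is hypothetical, and the minimum fragment length of 3 simply follows the default quoted in Fig. 1.

```python
def enumerate_fragments(target, min_frag=3):
    """Return (start, fragment) pairs for every contiguous fragment of
    `target` with a length between `min_frag` and len(target).

    Hypothetical helper for illustration only; not part of B-SIDER.
    """
    fragments = []
    # Longest fragments first, down to the minimum length.
    for length in range(len(target), min_frag - 1, -1):
        for start in range(len(target) - length + 1):
            fragments.append((start, target[start:start + length]))
    return fragments


frags = enumerate_fragments("YLLYY")
for start, frag in frags:
    # Indent each fragment by its offset so the overlaps are visible.
    print(" " * start + frag)
```

Because every fragment keeps its offset within the query, matches found for different fragments can later be stacked on the same target positions, which is how overlapping partial matches can reinforce one another.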
2
Materials
2.1 Software
B-SIDER is a Python-implemented database search algorithm. The
software can run on any standard desktop machine with the necessary software components.
1. Python 3.4 or higher (https://www.python.org).
2. The B-SIDER software:
(a) Make a root directory for the software (e.g., [/home/user/B-SIDER]). Henceforth, directories and file names are in
[italics]; customize as desired.
(b) Download B-SIDER (available at https://github.com/yoonjoolab/B-SIDER). The main script of B-SIDER is a
single executable Python file. The database required to
run B-SIDER and other scripts are in [B-SIDER/database]. A Jupyter notebook file is also included in the
root folder.
3. The database builder file [B-SIDER/database/B-SIDER_DB_builder.py] is written for PyMOL [31].
(a) PyMOL 1.8 or higher (https://pymol.org/2/#download).
Fig. 1 Overview of the B-SIDER protocol. (a) Identical sequences (YLLYY) form an antiparallel β-sheet (PDB ID:
4E0L). (b) B-SIDER divides the query sequence into smaller peptides (minimum length by default: 3) and finds exact
matches from a predefined database (see Fig. 2). Complementary sequences of the matches are extracted. (c)
A position-specific score matrix (PSSM) is constructed from the identified complementary sequences. In this
example, hydrophobic amino acids are likely to be complementary to the target sequence. (d) Based on the
PSSM, it can be estimated how likely a pair of identical target sequences can form a β-sheet. The rank is
evaluated against randomly generated sequences. In this case, YLLYY may be self-assembled as an
antiparallel β-sheet with a high probability
2.2 Database Files
The complementarity sequence database is an essential component
of B-SIDER. It is pre-built for users and placed in [B-SIDER/database/comp_seq_DB.db]. The background frequency file is optional
because default values are built in, but users may supply other values for their own
purposes. The files should follow the specific formats described in the
Notes.
1. comp_seq_DB.db: Complementarity sequence database, in
SQLite3 format (see Note 1).
2. background_frequency.csv: Amino acid background frequencies
[30, 32], in CSV format (see Note 2). The values are precalculated and hard-coded in B-SIDER.
2.3 Output Files
The primary output values of B-SIDER are the best-matched complementary sequence for the given target sequence and its complementarity score (the lower the better). Users can also extract
intermediate result files as output.
1. Position-specific score matrix (PSSM), in CSV (comma separated value) format (see Note 3).
2. Sequences used to construct the PSSM, in plain text format (see
Note 4).
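As an illustration of how the PSSM output relates to the best-matched complementary sequence, the sketch below reads a small score matrix and picks the lowest-scoring residue at each position. It rests on several assumptions: the CSV layout (a "Pos" row, a "Target" row, then one row per amino acid) is inferred from the table in Note 3, the total score is taken to be a simple sum over positions, and the numbers in `toy_csv` are invented. The function `best_sequence_from_pssm` is hypothetical and not part of B-SIDER.

```python
import csv
import io


def best_sequence_from_pssm(csv_text):
    """Pick the lowest-scoring amino acid per position (lower is better).

    Assumes rows labelled with a single-letter amino acid code hold the
    scores; other rows (e.g. "Pos", "Target") are skipped.
    """
    scores = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        label = row[0].strip()
        if len(label) == 1 and label.isalpha():
            scores[label] = [float(x) for x in row[1:]]

    n_pos = len(next(iter(scores.values())))
    best, total = "", 0.0
    for i in range(n_pos):
        aa = min(scores, key=lambda a: scores[a][i])
        best += aa
        total += scores[aa][i]
    return best, total


# Invented toy values, three positions and three amino acids only.
toy_csv = """Pos,1,2,3
Target,Y,L,L
A,0.5,0.2,0.9
V,-1.0,0.4,-0.3
Y,0.1,-0.8,0.0
"""
best, total = best_sequence_from_pssm(toy_csv)
print(best, round(total, 3))
```

A file written with the -s option could be passed in as text after reading it from disk, provided its layout matches the assumption above.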
3
Methods
The methods are illustrated with a case study application to a
β-sheet amyloid mimic [17]. In this example, two identical
sequences (YLLYY) form an antiparallel β-sheet. This self-assembling
behavior is frequently observed in disease-related amyloidogenic
peptides [33].
B-SIDER takes an amino acid sequence as input. The query is
then split into smaller linear fragments and compared to sequences
in a database. If there are any exact matches, their complementary
sequences are extracted. The final output is a PSSM score and the
best sequence based on the PSSM. The B-SIDER software can also
calculate the complementarity between two given sequences and compare it with the complementarity of randomly generated sequences, to
estimate how likely it is that the pair can form a β-sheet.
3.1 Database Construction
B-SIDER identifies highly probable complementary sequences
from a well-defined database. A full database is provided in the
B-SIDER software. If one wants to update the existing database,
or construct a new one from scratch, the PyMOL script [B-SIDER/
database/B-SIDER_DB_builder.py] can be utilized. In this section,
a crystal structure of S. enterica CheB methylesterase (PDB ID:
1CHD chain A) is used as an example. There are nine strands in the
1CHD:A structure: one pair forms an antiparallel β-sheet, and
seven consecutive strands form a parallel β-sheet (Fig. 2).
1. The script takes two arguments: the structure file, which is required, and the database file ([example.db] in the following example), which is optional.
$ pymol 1chdA.pdb -cq B-SIDER_DB_builder.py -- 1chdA.pdb example.db
2. If the argument for the database is missing, [comp_seq_DB.db]
is automatically generated.
$ pymol 1chdA.pdb -cq B-SIDER_DB_builder.py -- 1chdA.pdb
3. Users can create their own databases using either a new set of
protein structures or in-house scripts that follow the
structure of the database (see Note 1).
Fig. 2 B-SIDER database. (a) There are nine strands (numbered from 0 to 8)
which form two β-sheets in 1CHD:A (parallel β-sheet in blue and antiparallel in
red). (b) An example of a B-SIDER database. The PyMOL script identifies β-strands
and their complementary sequence information from the structure
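For readers building a database with their own scripts, the following sketch shows how a table with the pdb columns listed in Note 1 could be created and queried with Python's built-in sqlite3 module. It is an assumption-laden illustration: the column types, the toy rows, the function `find_partners`, and the idea that seq1 and seq2 are stored with equal length and paired index by index are all hypothetical, and the schema actually written by B-SIDER_DB_builder.py may differ.

```python
import sqlite3

# In-memory database with the six pdb columns described in Note 1.
# Column types are an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pdb (code TEXT, parallel INTEGER, strand1 INTEGER, "
    "strand2 INTEGER, seq1 TEXT, seq2 TEXT)"
)

# Invented rows; they do not come from any real structure.
toy_rows = [
    ("toy1.pdb", 0, 0, 1, "KYLLYT", "AVFVIS"),
    ("toy2.pdb", 1, 2, 3, "GYLLA", "TIVVE"),
    ("toy3.pdb", 0, 4, 5, "SLLYYE", "RVWVFT"),
]
conn.executemany("INSERT INTO pdb VALUES (?, ?, ?, ?, ?, ?)", toy_rows)


def find_partners(conn, fragment, parallel):
    """Return the partner segments paired with exact matches of `fragment`.

    Assumes residue i of seq1 pairs with residue i of seq2.
    """
    cur = conn.execute(
        "SELECT seq1, seq2 FROM pdb WHERE parallel = ? AND instr(seq1, ?) > 0",
        (int(parallel), fragment),
    )
    partners = []
    for seq1, seq2 in cur:
        start = seq1.find(fragment)
        partners.append(seq2[start:start + len(fragment)])
    return partners


hits = find_partners(conn, "YLL", parallel=False)
print(hits)
```

Keeping the parallel flag in the query mirrors the distinction between antiparallel ("0") and parallel ("1") pairs that the database records.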
3.2 Identification of Complementary Sequence and Construction of Position-Specific Score Matrix
Once a database is constructed, the only essential query input is a
target sequence. There are also several other options available (see
Note 5). Some basic commands are as follows:
1. To find complementary sequences of a query sequence (e.g.,
YLLYY), the basic command line is:
$ python B-SIDER.py -t YLLYY
2. The default complementarity direction is antiparallel. To search for parallel pairing instead, the direction must be specified explicitly (0 for antiparallel and
1 for parallel).
$ python B-SIDER.py -t YLLYY -p 1
3. All intermediate steps are printed to standard output. This output can be
suppressed by controlling the verbosity (0 for
no standard output and 1 for full output).
$ python B-SIDER.py -t YLLYY -v 0
4. If a new database file is used for the sequence, one can explicitly
specify the database. For example, if the new database is [example.db]:
$ python B-SIDER.py -t YLLYY -d example.db
5. All matched complementary sequences can be saved in a text
file.
$ python3 B-SIDER.py -t YLLYY -d example.db -o complementary_seq.txt
6. The position-specific score matrix can be saved in CSV format
(see Note 3).
$ python3 B-SIDER.py -t YLLYY -d example.db -o complementary_seq.txt -s score.csv
7. Other minor options are provided, but their use is not
recommended unless necessary (see Note 5).
8. B-SIDER can be loaded as a Python module.
(a) Load B-SIDER.
>>> import importlib
>>> bsider = importlib.import_module("B-SIDER")
(b) Initialize a B-SIDER class object (see Note 6). The class
initialization needs to specify a database. If not given, the
default database file [./database/comp_seq_DB.db] is
loaded.
>>> a = bsider.B_SIDER("./database/comp_seq_DB.db")
(c) Specify a target sequence, parallelity and sequence output
file and search for complementary sequences from the
database.
>>> t = "YLLYY" # target sequence
>>> p = False # parallelity: antiparallel
>>> n = 3 # minimum number of residues in a search fragment
>>> s = "ex_complementary_sequences.txt" # sequence output
>>> a.comp_seq_search(target=t, parallel=p, min_frag=n, output=s)
(d) Build a PSSM. If necessary, specify the background frequency file (see Note 2). The score matrix is printed as
standard output, but can be saved as a CSV file.
>>> b = "./database/background_frequency.csv"
>>> o = "ex_score_output.csv"
>>> a.build_score_matrix(background_frequency=b, output=o)
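To make the notion of a log-scaled position-specific score concrete, here is a minimal sketch of one way such a matrix could be computed from matched partner fragments. It is not the B-SIDER implementation: the negative log-odds formula, the pseudocount, the uniform background, the function `build_pssm`, and the toy `hits` list are all assumptions chosen so that lower scores mean stronger enrichment, in line with the "lower is better" convention of Subheading 2.3.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(hits, target_len, background=None, pseudocount=1.0):
    """Build a toy PSSM from (start, partner_fragment) pairs.

    score = -log(observed frequency / background frequency), so residues
    seen more often than expected at a position receive negative scores.
    """
    if background is None:
        # Uniform background, an assumption made for this sketch.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

    counts = [dict.fromkeys(AMINO_ACIDS, 0) for _ in range(target_len)]
    for start, fragment in hits:
        for offset, aa in enumerate(fragment):
            counts[start + offset][aa] += 1

    pssm = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: -math.log(((column[aa] + pseudocount) / total) / background[aa])
            for aa in AMINO_ACIDS
        })
    return pssm


# Invented partner fragments, each anchored at its offset along a 5-mer target.
hits = [(0, "VFVIV"), (0, "VFV"), (1, "FVI"), (2, "VIV")]
pssm = build_pssm(hits, target_len=5)
print(min(pssm[0], key=pssm[0].get))
```

The frequencies listed in Note 2 could be supplied through the `background` argument in place of the uniform default.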
3.3 Estimation of Complementarity Against Random Sequences
Given two sequences, B-SIDER can estimate how likely the two
sequences are to form a β-sheet together, compared to forming one
by associating with randomly generated sequences. According to
our previous study [30], the B-SIDER scores of amyloidogenic
sequences, which are prone to aggregation and β-sheet formation,
tend to rank within approximately the top 5% of the scores of randomly generated
sequences.
1. The basic execution only requires two sequences. The other
options (see Note 5) can also be specified if required.
$ python B-SIDER.py -t YLLYY -c YYLLY
2. By default, 10,000 randomly generated sequences are used for
comparison. The number of random sequences (e.g., 100,000)
can be specified as follows:
$ python B-SIDER.py -t YLLYY -c YYLLY -n 100000
3. If B-SIDER is loaded as a Python module, call the following method.
>>> c = "YYLLY"
>>> n = 100000
>>> a.compare_complementarity(comp_seq=c, randnum=n)
4. The quantile rank score of the two sequences is on average
1.94% (σ: 0.14), which indicates that the sequences are highly
likely to form a β-sheet (Fig. 1d).
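The comparison against random sequences can be sketched as a simple empirical quantile. The code below is an illustration under stated assumptions, not the B-SIDER routine: random sequences are drawn uniformly over the 20 amino acids, the score is a plain sum of per-position PSSM values, and `toy_pssm`, `score`, and `quantile_rank` are hypothetical names with invented values.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def score(seq, pssm):
    """Sum of per-position scores (lower is better)."""
    return sum(pssm[i][aa] for i, aa in enumerate(seq))


def quantile_rank(candidate, pssm, randnum=10000, seed=0):
    """Percentage of random sequences scoring at least as well as `candidate`."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    cand_score = score(candidate, pssm)
    as_good = 0
    for _ in range(randnum):
        rand_seq = "".join(rng.choice(AMINO_ACIDS) for _ in candidate)
        if score(rand_seq, pssm) <= cand_score:
            as_good += 1
    return 100.0 * as_good / randnum


# Toy PSSM that favours a few hydrophobic and aromatic residues everywhere.
toy_pssm = [
    {aa: (-1.0 if aa in "VILFYW" else 0.0) for aa in AMINO_ACIDS}
    for _ in range(5)
]
rank = quantile_rank("VIVFV", toy_pssm, randnum=2000)
print(f"quantile rank: {rank:.2f}%")
```

A small percentage means that few random sequences match the target as well as the candidate does, which is the sense in which a low quantile rank suggests likely pairing.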
4
Notes
1. The file for the complementary strand sequences consists of
two tables, pdb and strand, where
(a) pdb has six elements as follows:
• code: The name of the structure (e.g., 1chdA.pdb).
• parallel: Parallelity of two strands; “0” for antiparallel
and “1” for parallel β-sheet.
• strand1: The index of strand 1.
• strand2: The index of strand 2.
• seq1: The sequence of strand 1.
• seq2: The sequence of strand 2.
(b) strand has four elements: code as in the pdb table, chain for
the chain identifier, strand_num for the strand index, and
range for residue numbers separated by commas.
(c) The current sequence database is built from a precompiled PISCES [34] list (sequence identity <90% by
chain, resolution <3 Å, R-factor <0.3, sequence length
from 40 to 10,000). The original work additionally applied a
TM-score <0.7 filter [30, 35]. However, although the TM-score
calculation consumes a considerable amount of time, it
filters out extremely few sequences
(data not shown). This filter is therefore no longer used in
B-SIDER.
2. The background amino acid frequency file is in two-column
CSV format, with each row listing an amino acid (“AA”) and its
frequency (“freq”). The background frequency is calculated
from HOMSTRAD [32, 36]. The file is hard-coded in
B-SIDER and thus does not have to be specified if one uses
default values.
AA    Freq
A     0.0628
C     0.0193
E     0.0537
D     0.0709
G     0.0513
F     0.0315
I     0.0356
H     0.0224
K     0.0577
M     0.0155
L     0.0637
N     0.0523
Q     0.0343
P     0.0784
S     0.0694
R     0.0427
T     0.0642
W     0.0106
V     0.0499
Y     0.03
3. The position-specific score matrix contains the
log-scaled amino acid frequency for each position [30, 32].
Pos     1      2      3      4      5
Target  Y      L      L      Y      Y
A       0.173  0.103  0.058  0.301  0.380
C       0.129  0.325  0.151  0.135  0.049
D       0.295  1.301  1.472  0.932  0.993
...     ...    ...    ...    ...    ...
T       0.042  0.097  0.115  0.249  0.488
V       0.847  0.987  0.970  0.959  0.946
W       1.371  0.598  0.602  0.968  2.454
Y       0.644  0.795  0.620  0.892  0.699
4. The target sequence is divided into shorter linear fragments
ranging from the full length to a user-defined number of residues (default: 3; see Note 5i).
5. The following options are available for B-SIDER.
(a) -h or --help: Help message and exit.
(b) -v or --verbose [0 or 1]: Output verbosity. “1” for True
and “0” for False. The default is True.
(c) -t or --target [sequence]: Target sequence.
(d) -p or --parallel [0 or 1]: Parallelity. “1” for parallel sheet
and “0” for antiparallel. The default is “0” (antiparallel).
(e) -d or --database [database file]: Database file. If not given,
the script tries to use [./database/comp_seq_DB.db] (see
Note 1).
(f) -b or --background_freq [background file]: Background
frequency file in CSV. If not given, default values are
used (see Note 2).
(g) -s or --score_output [score output file]: Output file for the
position-specific score matrix in CSV (see Note 3). The
default is none, and if the verbosity is true, the score
matrix is printed as standard output.
(h) -o or --seq_output [sequence output file]: List of sequences
in text used to construct the position-specific score matrix.
The default is none, and if the verbosity is true, the
sequences are printed as standard output.
(i) -m or --min_frag [integer value >0]: The minimum number of residues in a fragment used for the search. The default is 3 (see Note 4).
(j) -c or --compare [sequence]: Sequence for comparison. If
present, the complementarity of this sequence for the
target is compared against a certain number of random
sequences.
(k) -n or --randnum [integer value >0]: Number of random
sequences to compare. The default is 10,000.
6. The B_SIDER class object has three main methods as follows:
(a) comp_seq_search(self, target, parallel, min_frag=3, output=None): Database search.
(b) build_score_matrix(self, output=None, background_frequency=None): Construction of a PSSM.
(c) compare_complementarity(self, comp_seq, randnum=10000): Estimation of complementarity against
random sequences.
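When B-SIDER is run repeatedly from a script, it can be convenient to assemble the command line programmatically. The helper below is a hypothetical sketch: `build_bsider_command` is not part of B-SIDER, it only uses the short flags listed in Note 5, and it merely builds the argument list without executing anything.

```python
def build_bsider_command(target, parallel=False, verbose=True, database=None,
                         seq_output=None, score_output=None,
                         compare=None, randnum=None, python="python"):
    """Assemble a B-SIDER command line as a list of arguments.

    Hypothetical helper; flag names are taken from Note 5.
    """
    cmd = [python, "B-SIDER.py", "-t", target,
           "-p", "1" if parallel else "0",
           "-v", "1" if verbose else "0"]
    optional = [("-d", database), ("-o", seq_output), ("-s", score_output),
                ("-c", compare), ("-n", randnum)]
    for flag, value in optional:
        # Only add flags that were explicitly requested.
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_bsider_command("YLLYY", compare="YYLLY", randnum=100000)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run` from the B-SIDER root directory, which avoids shell quoting issues and the missing-space mistakes that are easy to make when typing long commands by hand.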
5
Conclusion
Through an extensive benchmark study and experimental validation [30], we showed that B-SIDER is practically applicable to
the design of novel peptides. Future developments will address the
design of novel β-sheet proteins and nanostructure scaffolds.
Acknowledgments
This work was supported by the National Research Foundation of
Korea (NRF) grant funded by the Korea government (MSIT)
(No. 2018R1A5A2024181).
References
1. Mandel-Gutfreund Y, Zaremba SM, Gregoret
LM (2001) Contributions of residue pairing to
β-sheet formation: conservation and covariation of amino acid residue pairs on antiparallel
β-strands. J Mol Biol 305(5):1145–1159
2. Steward RE, Thornton JM (2002) Prediction
of strand pairing in antiparallel and parallel
β-sheets using information theory. Proteins 48
(2):178–191
3. Reches M, Gazit E (2003) Casting metal nanowires within discrete self-assembled peptide
nanotubes. Science 300(5619):625–627
4. Shammas SL, Knowles TP, Baldwin AJ, MacPhee CE, Welland ME, Dobson CM, Devlin
GL (2011) Perturbation of the stability of amyloid fibrils through alteration of electrostatic
interactions. Biophys J 100(11):2783–2791
5. Lansbury PT, Lashuel HA (2006) A century-old debate on protein aggregation and neurodegeneration enters the clinic. Nature 443
(7113):774–779
6. Ryu J, Park CB (2008) High-temperature self-assembly of peptides into vertically well-aligned
nanowires by aniline vapor. Adv Mater 20
(19):3754–3758
7. Smith AM, Williams RJ, Tang C, Coppo P,
Collins RF, Turner ML, Saiani A, Ulijn RV
(2008) Fmoc-diphenylalanine self assembles
to a hydrogel via a novel architecture based on
π–π interlocked β-sheets. Adv Mater 20
(1):37–41
8. Yan X, Cui Y, He Q, Wang K, Li J (2008)
Organogels based on self-assembly of diphenylalanine peptide and their application to immobilize quantum dots. Chem Mater 20
(4):1522–1526
9. Yan X, He Q, Wang K, Duan L, Cui Y, Li J
(2007) Transition of cationic dipeptide nanotubes into vesicles and oligonucleotide delivery. Angew Chem 119(14):2483–2486
10. Stupp SI (2010) Self-assembly and biomaterials. Nano Lett 10(12):4783–4786
11. Zhang S (2003) Fabrication of novel biomaterials through molecular self-assembly. Nat Biotechnol 21(10):1171–1178
12. Chiti F, Dobson CM (2017) Protein misfolding, amyloid formation, and human disease: a
summary of progress over the last decade.
Annu Rev Biochem 86:27–68
13. Richardson JS, Richardson DC (2002) Natural
β-sheet proteins use negative design to avoid
edge-to-edge aggregation. Proc Natl Acad Sci
99(5):2754–2759
14. Giorgetti S, Greco C, Tortora P, Aprile FA
(2018) Targeting amyloid aggregation: an
overview of strategies and mechanisms. Int J
Mol Sci 19(9):2677
15. Sormanni P, Aprile FA, Vendruscolo M (2015)
Rational design of antibodies targeting specific
epitopes within intrinsically disordered proteins. Proc Natl Acad Sci 112(32):9902–9907
16. Gallardo R, Ramakers M, De Smet F, Claes F,
Khodaparast L, Khodaparast L, Couceiro JR,
Langenberg T, Siemons M, Nyström S (2016)
De novo design of a biologically active amyloid.
Science 354(6313):aah4949
17. Liu C, Zhao M, Jiang L, Cheng P-N, Park J,
Sawaya MR, Pensalfini A, Gou D, Berk AJ,
Glabe CG (2012) Out-of-register β-sheets suggest a pathway to toxic amyloid aggregates.
Proc Natl Acad Sci 109(51):20913–20918
18. Kumar DKV, Choi SH, Washicosky KJ, Eimer
WA, Tucker S, Ghofrani J, Lefkowitz A,
McColl G, Goldstein LE, Tanzi RE (2016)
Amyloid-β peptide protects against microbial
infection in mouse and worm models of Alzheimer’s disease. Sci Transl Med 8
(340):340ra372
19. Bryan AW Jr, Menke M, Cowen LJ, Lindquist
SL, Berger B (2009) BETASCAN: probable
β-amyloids identified by pairwise probabilistic
analysis. PLoS Comput Biol 5(3):e1000333
20. Fernandez-Escamilla A-M, Rousseau F,
Schymkowitz J, Serrano L (2004) Prediction
of sequence-dependent and mutational effects
on the aggregation of peptides and proteins.
Nat Biotechnol 22(10):1302–1306
21. Kim C, Choi J, Lee SJ, Welsh WJ, Yoon S
(2009) NetCSSP: web application for predicting chameleon sequences and amyloid fibril
formation. Nucleic Acids Res 37
(suppl_2):W469–W473
22. Maurer-Stroh S, Debulpaep M, Kuemmerer N,
De La Paz ML, Martins IC, Reumers J, Morris
KL, Copland A, Serpell L, Serrano L (2010)
Exploring the sequence determinants of amyloid structure using position-specific scoring
matrices. Nat Methods 7(3):237–242
23. Tartaglia GG, Vendruscolo M (2008) The
Zyggregator method for predicting protein
aggregation propensities. Chem Soc Rev 37
(7):1395–1401
24. Trovato A, Chiti F, Maritan A, Seno F (2006)
Insight into the structure of amyloid fibrils
from the analysis of globular proteins. PLoS
Comput Biol 2(12):e170
25. Tsolis AC, Papandreou NC, Iconomidou VA,
Hamodrakas SJ (2013) A consensus method
for the prediction of ‘aggregation-prone’ peptides in globular proteins. PLoS One 8(1):
e54175
26. Zibaee S, Makin OS, Goedert M, Serpell LC
(2007) A simple algorithm locates β-strands in
the amyloid fibril core of α-synuclein, Aβ, and
tau using the amino acid sequence alone. Protein Sci 16(5):906–918
27. Bhattacharjee N, Biswas P (2010) Position-specific propensities of amino acids in the
β-strand. BMC Struct Biol 10(1):1–10
28. Fujiwara K, Toda H, Ikeguchi M (2012)
Dependence of α-helical and β-sheet amino
acid propensities on the overall protein fold
type. BMC Struct Biol 12(1):1–15
29. Hutchinson EG, Sessions RB, Thornton JM,
Woolfson DN (1998) Determinants of strand
register in antiparallel β-sheets of proteins. Protein Sci 7(11):2287–2300
30. Yu T-G, Kim H-S, Choi Y (2019) B-SIDER:
computational algorithm for the design of
complementary β-sheet sequences. J Chem
Inf Model 59(10):4504–4511
31. Schrödinger LLC The PyMOL molecular graphics system. Version 20
32. Choi Y, Deane CM (2010) FREAD revisited:
accurate loop structure prediction using a database search algorithm. Proteins 78
(6):1431–1440
33. Haass C, Selkoe DJ (2007) Soluble protein
oligomers in neurodegeneration: lessons from
the Alzheimer’s amyloid β-peptide. Nat Rev
Mol Cell Biol 8(2):101–112
34. Wang G, Dunbrack RL Jr (2003) PISCES: a
protein sequence culling server. Bioinformatics
19(12):1589–1591
35. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on
the TM-score. Nucleic Acids Res 33
(7):2302–2309
36. Mizuguchi K, Deane CM, Blundell TL, Overington JP (1998) HOMSTRAD: a database of
protein structure alignments for homologous
families. Protein Sci 7(11):2469–2471
Chapter 5
Dynamics of Amyloid Formation from Simplified
Representation to Atomistic Simulations
Phuong Hoang Nguyen, Pierre Tufféry, and Philippe Derreumaux
Abstract
Amyloid fibril formation is an intrinsic property of short peptides, non-disease proteins, and proteins
associated with neurodegenerative diseases. Aggregates of the Aβ and tau proteins, the α-synuclein protein,
and the prion protein are observed in the brain of Alzheimer’s, Parkinson’s, and prion disease patients,
respectively. Due to the transient short-range and long-range interactions of all species and their high
aggregation propensities, the conformational ensemble of these devastating proteins, the exception being
for the monomeric prion protein, remains elusive by standard structural biology methods in bulk solution
and in lipid membranes. To overcome these limitations, an increasing number of simulations using different
sampling methods and protein models have been performed. In this chapter, we first review our main
contributions to the field of amyloid protein simulations aimed at understanding the early aggregation steps
of short linear amyloid peptides, the conformational ensemble of the Aβ40/42 dimers in bulk solution, and
the stability of Aβ aggregates in lipid membrane models. Then we focus on our studies on the interactions of
amyloid peptides/inhibitors to prevent aggregation, and long amyloid sequences, including new results on
a monomeric tau construct.
Key words Amyloid, Aggregation, Simulations, Intrinsically disordered proteins, Bulk solution,
Membranes, Inhibitors, Aβ, Tau, α-synuclein
1
Introduction
Aβ, tau, α-synuclein, and prion protein misfolding and aggregation
in the central nervous system lead to three neurodegenerative diseases in the elderly: Alzheimer’s, Parkinson’s, and prion diseases [1–3]. The senile extracellular Alzheimer’s disease plaques between
neurons contain Aβ42 and Aβ40 peptides, Aβ42 being more toxic
than Aβ40, with the Aβ sequence consisting of a charged hydrophilic N-terminus (residues 1–16), the central hydrophobic core
(CHC, residues 17–21), a charged region (residues 22–29), and a
hydrophobic C-terminus (residues 30–40/42). The Aβ peptides
result from the proteolytic cleavage of the transmembrane amyloid
precursor protein (APP) by secretases [2]. The senile intracellular
Alzheimer’s disease filaments in neurons are made of hyperphosphorylated tau proteins of 441 residues consisting of a charged
N-terminal region (residues 1–207), a proline-rich region, a microtubule-binding domain of four repeats spanning residues 244–368, and a
C-terminal region [4]. The α-synuclein protein has a membrane-binding domain (residues 1–66), a hydrophobic region spanning
residues 67–96, and a highly charged C-terminal region (residues 97–140)
implicated in calcium binding and in protein–protein interactions at
the surface of synaptic vesicles [1, 5]. In contrast to the monomeric
disordered Aβ, tau, and α-synuclein structures, the monomeric
prion structure obtained by solution nuclear magnetic resonance (NMR)
on the full-length protein and the C-terminal domain of both mouse
and Syrian hamster prion reveals an unstructured N-terminal fragment (23–124) and a C-terminal globular domain (125–228) with
three α-helices and a short antiparallel β-sheet between helices
1 and 2 [6]. Despite low sequence identity, the four proteins
aggregate over time to amyloid fibrils with a common cross-β
structure [7–10].
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_5,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
Aggregation is a nonequilibrium process where the amyloid
proteins form fibrils with sigmoidal kinetics profiles in which the
proteins self-assemble (lag-phase) prior to fibril elongation (growth
phase), followed by a plateau where the fibrils and free monomers
are in equilibrium (saturation phase) [7, 11]. The lag phase time of
fibrillation and the fibril growth time vary with amino acid length,
mutation, temperature, pH, concentration, agitation, shear forces,
metal ions, crowding, and the presence of lipids such as cholesterol
[11–15]. Of note, in vitro aggregation of full-length tau requires
the addition of heparin due to the highly charged N- and
C-terminal regions and is impacted by the pattern of
hyperphosphorylation [16].
The molecular mechanisms involved in amyloid fibril formation
are well described by primary nucleation, secondary nucleation (fragmentation and
surface-catalyzed), and elongation. It is now
possible to determine the rate constants of each of these microscopic processes, which operate on multiple time scales, by fitting the
solutions of nonlinear chemical master equations to the experimental aggregation curves [11, 15]. Such knowledge is essential to
design molecular inhibitors that selectively target mature amyloid
fibrils or intermediate oligomeric species and interfere with the
desired microscopic aggregation step. Interestingly, the Aβ, tau,
and α-synuclein proteins have hydrophobic aggregation-prone
regions which by themselves form amyloid fibrils in vitro, e.g., the
CHC consisting of 17LVFFA21 and the C-terminus 37GGVIA42 in
Aβ, the PHF6* and PHF6 motifs (PHF6*: 275VQIINK280 and PHF6: 306VQIVYK311) in tau, and the NAC
region in α-synuclein [17–19]. These fragments have therefore
served as ideal models for understanding amyloid fibril formation by computational studies [20, 21].
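The interplay of primary nucleation, secondary nucleation, and elongation can be illustrated with a small numerical sketch. The code below integrates a commonly used moment-equation form of the kinetics, in which P is the fibril number concentration, M the fibril mass concentration, and m the free monomer concentration. It is an assumption-based illustration rather than the model fitted in the cited work: fragmentation is omitted, a simple forward-Euler scheme is used, and all rate constants and reaction orders are arbitrary values in arbitrary units.

```python
def simulate_aggregation(m_tot=1.0, k_n=1e-4, k_plus=1.0, k_2=1.0,
                         n_c=2, n_2=2, dt=0.01, t_end=30.0):
    """Forward-Euler integration of

        dP/dt = k_n * m**n_c + k_2 * m**n_2 * M   (primary + secondary nucleation)
        dM/dt = 2 * k_plus * m * P                (elongation at both fibril ends)

    with m = m_tot - M. All parameter values are hypothetical.
    """
    P, M = 0.0, 0.0
    times, masses = [0.0], [0.0]
    steps = int(round(t_end / dt))
    for step in range(1, steps + 1):
        m = max(m_tot - M, 0.0)          # free monomer left in solution
        dP = k_n * m**n_c + k_2 * m**n_2 * M
        dM = 2.0 * k_plus * m * P
        P += dP * dt
        M = min(M + dM * dt, m_tot)      # fibril mass cannot exceed total
        times.append(step * dt)
        masses.append(M)
    return times, masses


times, masses = simulate_aggregation()
# Time at which half of the monomer has been converted into fibrils.
half_time = next(t for t, M in zip(times, masses) if M >= 0.5)
print(f"half-time: {half_time:.2f} (arbitrary units)")
```

With these toy parameters the fibril mass stays close to zero at early times, then rises steeply and levels off, reproducing the lag, growth, and saturation phases of a sigmoidal aggregation curve.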
It has also been demonstrated that the early-formed aggregates
of Aβ, tau, and α-synuclein are toxic [2, 22, 23]. This finding has
motivated many computational studies aimed at disclosing the structures of the oligomeric species in various environments. This is
computationally challenging because of the size and heterogeneity of the
conformational ensemble to be explored and the need for accurate
force fields to model intrinsically disordered proteins [24, 25]. The
next sections review our main contributions to the field of amyloid
protein simulations. We also present new results from the application
of the PEP-FOLD framework to a monomeric tau construct.
2
Early Aggregation Steps of Short Linear Amyloid Peptides in Bulk Solution
To accelerate aggregation, all simulations use a millimolar (mM) peptide concentration, which is three and six orders of magnitude higher than
in vitro and in vivo conditions, respectively. Various oligomers of
the Aβ16–22, Aβ37–42, Aβ25–35, Aβ11–25, and KFFE peptides, of
GNNQQNY and NNQQ from the yeast protein Sup35, of NFGAIL
and SNNFGAILSS from islet amyloid polypeptide (IAPP), and of
NHVTLSQ from β2-microglobulin (Table 1) were
explored by coarse-grained (OPEP) and atomistic simulations, based
on molecular dynamics (MD), Monte Carlo (MC), the Activation-Relaxation Technique (ART), standard replica exchange MD
(REMD), and replica exchange with solute tempering (REST2)
[26–46]. OPEP, which consists of an all-atom representation of
the backbone, one bead per side chain (proline being the exception),
and an implicit solvent model, was parametrized on well-folded
peptides and proteins [47–52]. ART accelerates the search for
lowest-energy microstates by repeatedly finding activated mechanisms that connect minima via first-order saddle points and accepting or rejecting them with the Metropolis criterion [53–55].
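The accept/reject step mentioned above can be sketched in a few lines of Python. This is only the final Metropolis test applied to a proposed move between two minima; the activation and relaxation steps that generate the move in ART are not shown. The function name, the energy unit (kcal/mol), and the optional `rng` argument are assumptions made for this illustration.

```python
import math
import random


def metropolis_accept(e_old, e_new, temperature, k_b=0.0019872041, rng=random):
    """Return True if a move from energy e_old to e_new is accepted.

    Energies are in kcal/mol, temperature in K, k_b in kcal/(mol K).
    Downhill moves are always accepted; uphill moves are accepted with
    probability exp(-(e_new - e_old) / (k_b * temperature)).
    """
    delta = e_new - e_old
    if delta <= 0.0:
        return True
    return rng.random() < math.exp(-delta / (k_b * temperature))
```

For example, `metropolis_accept(-120.0, -118.5, 300.0)` accepts a 1.5 kcal/mol uphill move only occasionally, which allows the search to escape local minima without drifting to high energies.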
Independently of the protein (coarse-grained or atomistic) and
solvent (implicit or explicit) representation and the amino acid
sequence, self-assembly of up to 20 peptides starts with a hydrophobic collapse and the formation of disordered oligomers, which
evolve in time to transient β-rich aggregates [35–37]. These heterogeneous β-rich assemblies can have various sheet-to-sheet pairing angles ranging from parallel to perpendicular, and form open
and closed β-barrels [28–30, 37, 39, 42]. Of note, these topologies
have been observed in vitro by X-ray crystallography of macrocyclic
β-sheet mimics [56] and designed amyloid peptides [57]. For all
peptides, however, the free energy landscape is dominated by amorphous aggregates [34, 42, 43]. Our studies, like other computational
studies [58, 59], highlight the impact of the monomeric amyloid-competent state, which corresponds to an extended conformation
for short amyloid sequences, on (1) the kinetics of aggregation,
e.g., the association/dissociation times of all oligomers and the
conversion time between architectures, and (2) the thermodynamics of aggregation, such as the population of the substates
having different β-strand mismatches (mixed parallel/antiparallel
β-strands), β-sheet sizes, and topologies [27, 35]. The simulations
also emphasize a complex free energy landscape with β-rich oligomers having in-register and out-of-register conformations and the
existence of many kinetic traps [26–31, 40, 42], as has been
observed experimentally in the late steps of aggregation [60] or
computationally by Markov chain models in the lock phase of
oligomers [61].
Table 1
Summary of the different amyloid systems studied
Applications (a) | Sampling | Force field | References
Aggregation of short linear peptides | ART (b) | OPEP | [26–32]
Aggregation of short linear peptides | MD (c) | OPEP, all-atom | [33–37]
Aggregation of short linear peptides | REMD (d) | All-atom | [38–40]
Aggregation of short linear peptides | REMD (d) | OPEP | [41–46]
Aggregation of short linear peptides | REST2 (e) | All-atom | [42]
Nucleation sizes of Aβ16–22 and Aβ37–42 | REMC | On-lattice, OPEP | [71–73]
Impact of hydrodynamics on early aggregation steps | LBMD | OPEP | [78, 79]
Aβ40 WT, Aβ42 WT dimers | REMD | All-atom | [87, 88]
Aβ40 WT, Aβ42 WT, D23N monomers and dimers | ART, REMD | OPEP | [89, 90]
Aβ16–35 monomer and dimer | REMD | OPEP | [92]
Aβ42 (S8C) dimers | REMD | All-atom | [93]
Aβ42 H6R, D7N dimers | MD | All-atom | [94–96]
Aβ42 A2T, A2V dimers | REMD | All-atom | [97, 99]
Aβ1–28 WT and A2V monomers | REMD | All-atom | [98]
Trimeric Aβ11–42 structures in DPPC membrane | REMD | All-atom | [115]
Tetrameric Aβ40, Aβ42 WT, D23N, A2T β-barrels and Aβ helix bundles in DPPC lipid membrane | REMD | All-atom | [112, 113, 116]
Aβ29–42 dimer with membranes containing omega-3 and omega-6 fatty acids | MD | All-atom | [118]
Aβ1–28/NQ-Trp | REMD | All-atom | [124, 125]
Aβ16–22 oligomers/N-methylated inhibitor | MD | OPEP | [120]
Aβ17–42 trimers/small molecule inhibitors | REMD | All-atom | [121]
Aβ40/42 dimers/small molecule inhibitors | MD, REMD | All-atom | [122, 123, 127]
Aβ16–22, Aβ25–35 oligomers/carbon nanotubes | MD, REMD | All-atom | [134, 135]
Prion (127–164) monomer | MC | OPEP | [142]
Prion WT monomer | MD | All-atom | [143, 144, 147]
Prion WT monomer | REMD | All-atom | [145]
Prion (125–228) dimer | MD | OPEP | [148]
Dimeric tau (306–378) construct and phosphorylation | REMD, MD | All-atom | [149]
α-synuclein monomer | PEP-FOLD | | [25]
Monomeric tau (306–378) | PEP-FOLD | OPEP | This work
(a) Simulations in bulk solution unless specified
(b) Aβ16–22 dimer and trimer, Aβ11–25 tetramer, NFGAIL dodecamer, KFFE dimer up to heptamer, and 3-mers, 12-mers, and 20-mers of GNNQQNY
(c) All-atom for Aβ16–22 dimer; OPEP for 4-mers up to 16-mers of β2m(83–89), Aβ(16–22) 8-mers, and IAPP(20–29) trimer
(d) All-atom for Aβ16–22 dimer and trimer, 7-mers of β2m(83–89), and 16-mers of Aβ37–42; OPEP for ccβ peptide, 2-mers and 3-mers of Aβ16–22, 6-mers of Aβ25–35, 20-mers of NNQQ and 3-mers
(e) 12-mers and 20-mers of GNNQQNY
It is not yet understood which modern force field is the
most appropriate for describing the aggregation pathways and the
equilibrium ensemble of intrinsically disordered proteins [33, 62–66]. Similarly, the nucleus size (N*) for primary nucleation,
corresponding to the highest free energy aggregate from which
fibrils can grow, remains elusive, with simulations predicting N*
between 5 and 40 [21, 67–74]. Yet, this knowledge is important
as the nucleation seed size determines amyloid clearance and establishes
a barrier to prion appearance in yeast [75]. Another important
issue, when performing coarse-grained simulations with an implicit
solvent representation, is that the exchange of momenta between the
peptide particles and the solvent and the solvent-mediated correlations are ignored. In this context, we coupled Lattice-Boltzmann
MD simulation, which naturally includes hydrodynamic interactions, to the CG OPEP model [76, 77] and performed simulations
of Aβ16–22 peptides in bulk solution [78–80]. Two main results
stand out. For a system of 100 peptides, hydrodynamic interactions increase the oligomer size of the first two clusters and the
exchange of peptides compared to the results of standard Langevin
dynamics [78]. For a system of 1000 peptides at a concentration of
60 mM, we can follow the formation and growth of a large elongated oligomer and its slow β-sheet structuring resulting from the
fusion and dissociation of small disordered aggregates. Many
mechanisms are observed, from elongation to surface-catalyzed
effects, and in particular the lateral growth mechanism on the
surface of prefibrillar states, which is sustained by long-range
hydrodynamic correlations and allows the formation of large
branched structures consisting of 600 peptides, spanning a few
tens of nanometers and hosting annular pores of dimensions
3–5 nm [79, 80]. This simulation, at a quasi-atomistic representation, of a system of unprecedented size (previous simulations explored at most 125 peptides with very few
degrees of freedom) illustrates the critical contribution of secondary nucleation to amyloid aggregation kinetics. It is interesting that
amyloid plaques with annular pores have been evidenced by electron microscopy and antibodies in the brain of Alzheimer’s disease
patients [81]. Of note, our early phase of aggregation would
approach the time scale of seconds at nM (in vivo) concentrations.
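The link between peptide number, concentration, and simulation box size quoted above can be checked with a short calculation. The sketch below assumes a cubic periodic box, which is an assumption of this example (the box geometry is not specified in the text), and the function name is hypothetical.

```python
AVOGADRO = 6.02214076e23  # particles per mole


def box_edge_nm(n_peptides, conc_molar):
    """Edge length (nm) of a cubic box holding n_peptides at conc_molar (mol/L)."""
    volume_litres = n_peptides / (conc_molar * AVOGADRO)
    volume_nm3 = volume_litres * 1e24  # 1 L = 1 dm^3 = 1e24 nm^3
    return volume_nm3 ** (1.0 / 3.0)


edge_60mM = box_edge_nm(1000, 60e-3)  # the 1000-peptide system at 60 mM
edge_1nM = box_edge_nm(1000, 1e-9)    # same peptide count at a nanomolar concentration
```

Under these assumptions, 1000 peptides at 60 mM correspond to a box edge of roughly 30 nm, whereas the same number of peptides at 1 nM would require a box several hundred times larger in each dimension, which illustrates why simulations use millimolar concentrations.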
3
Phuong Hoang Nguyen et al.
Dimeric Aβ40/42 States in Bulk Solution
The wild-type (WT) Aβ40/42 dimers, known to be structurally
heterogeneous by experiments, are of high interest as they are the
smallest species to lead to tau hyperphosphorylation and neuritic
degeneration, and their levels have been found to increase sharply
and correlate with plaque load in brain tissue of Alzheimer’s disease
patients [82]. Experiments have shown that the English (H6R)
familial Alzheimer’s disease mutation, the Tottori (D7N) mutation,
the Flemish (A21G) mutation, and the Iowa (D23N) mutation
speed up the fibril formation process of both Aβ40 and Aβ42
peptides in vitro and increase toxicity to cells [21]. In contrast,
the engineered disulfide-bond-locked double (S8C) mutant has
been shown to form an exclusive homogeneous and neurotoxic
dimer [83]. Mutations at position 2 have a dramatic impact on AD
risk: A2V is causative, whereas A2T is protective. The A2V mutation
enhances aggregation kinetics, while the A2T mutation only retards
amyloid fibril formation by increasing the lag-phase time. Interestingly, the
mixture of WT and A2V also retards fibril formation and protects
against Alzheimer’s disease [84–86].
The Aβ40 and Aβ42 alloforms were subjected to atomistic
REMD in explicit solvent [87, 88]. Using several force fields, the
equilibrium configurations are found to be disordered, with collision cross-sections, hydrodynamic radii, and SAXS profiles of the
ensembles independent of the force field. Intramolecular β-hairpins
spanning residues 17–21 and 30–36 are, however, observed with a
population varying from 1.5 to 13% depending on the force field.
The ensemble of both alloforms is stabilized by nonspecific interactions with many hydrophobic residues exposed to solvent,
thereby explaining their propensity to be toxic at the dimeric
level. Simulations also reveal that the Aβ42 dimer has a higher
propensity than the Aβ40 dimer to form β-strands at the CHC
and at the C-terminus (residues 30–40), consistent with other
computational studies [89, 90]. The dimers have no defined interfaces, and the random organization with transient secondary structures is substantially preferred over two chains with β-hairpin and
fibril-like conformations [88]. The formation of, and the transition
between, the two latter β-rich dimers have also been discussed by
other studies [91, 92].
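For readers less familiar with REMD, the sketch below shows the standard Metropolis test used to swap configurations between two replicas at different temperatures, together with a geometric temperature ladder. It is a generic illustration rather than the protocol of refs [87, 88]: the function names, the energy unit (kcal/mol), and the temperature range in the example are assumptions.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def exchange_probability(e_i, e_j, t_i, t_j):
    """Probability of swapping replicas i and j (energies e, temperatures t).

    Uses P = min(1, exp[(beta_i - beta_j) * (e_i - e_j)]).
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_j - e_i)
    return 1.0 if delta <= 0.0 else math.exp(-delta)


def geometric_ladder(t_min, t_max, n_replicas):
    """Temperatures spaced by a constant ratio between t_min and t_max."""
    ratio = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return [t_min * ratio**k for k in range(n_replicas)]


# Hypothetical example: eight replicas between 300 K and 450 K.
ladder = geometric_ladder(300.0, 450.0, 8)
p_swap = exchange_probability(-100.0, -90.0, ladder[0], ladder[1])
```

A swap is always accepted when the hotter replica happens to hold the lower-energy configuration, which is how low-energy structures found at high temperature migrate down the ladder.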
Atomistic simulations show that the dimers of S8C and WT Aβ42
have the same secondary structures and collision cross-sections.
Upon S8C mutation, the lifetime of the intramolecular three-stranded sheet spanning residues 17–21, 30–36, and 39–41 is
increased by a factor of 3. This single common structural feature
shared by both species, which does not exist in the Aβ40 WT
dimers, is likely to contribute to Aβ42 toxicity [93]. The H6R,
D7N, A21G, and D23N mutations change the population of
β-strand at the CHC region and at the C-terminus and impact the
population of intramolecular and intermolecular salt bridges
involving E22, D23, and K28 by reducing the formation time of
the loop region for a fibril-like conformation. We also find, consistent with experimental studies, that these mutations produce different effects on the Aβ dimer depending on whether they occur in
Aβ40 or Aβ42 [90, 94–96].
Comparing atomistic REMD simulations of WT, WT-A2V, and
A2V-A2V Aβ40 dimers, we find that upon single mutation, the
intrinsic disorder and the intermolecular potential energies are
reduced, and the population of intramolecular three-stranded
β-sheets is increased [97]. A reduced intrinsic disorder upon A2V
mutation was already discussed for the Aβ1–28 monomer [98]. Analyzing REMD simulations of the WT-A2T Aβ40 dimer [99], we provide evidence that the delay in the lag-phase time of fibrillation
results from an increase of intrapeptide stability over interpeptide
stability in the heterozygous dimers. This finding was further corroborated by other simulations of A2T and A2V Aβ42 dimers
[100, 101]. An investigation of the local structure and dynamics of
hydration also found that the survival probability of ordered
water molecules decays more rapidly for the Aβ N-terminus AD-causative
(A2V, H6R, D7N, and D7H) mutants than for the A2T
protective mutant [102].
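A survival probability of the kind mentioned above can be estimated from a trajectory as the average fraction of water molecules that stay continuously in a hydration shell for a given lag time. The sketch below is a generic estimator under stated assumptions: shell membership per frame is taken as already computed, the criterion used in ref [102] to define ordered water is not reproduced, and the function name and input layout are hypothetical.

```python
def survival_probability(shell_frames, max_lag):
    """Average fraction of shell waters that remain continuously for each lag.

    shell_frames: list of sets, one per frame, holding the IDs of the water
                  molecules found in the hydration shell in that frame.
    Returns a list s where s[lag] is averaged over all available time origins.
    """
    n_frames = len(shell_frames)
    result = []
    for lag in range(max_lag + 1):
        ratios = []
        for start in range(n_frames - lag):
            initial = shell_frames[start]
            if not initial:
                continue  # no waters to follow from this time origin
            survivors = set(initial)
            for frame in shell_frames[start + 1:start + lag + 1]:
                survivors &= frame  # keep only waters present in every frame
            ratios.append(len(survivors) / len(initial))
        result.append(sum(ratios) / len(ratios) if ratios else float("nan"))
    return result


# Toy trajectory of four frames with water IDs 1 to 4.
toy_frames = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2}, {1}]
toy_survival = survival_probability(toy_frames, max_lag=2)
```

A faster decay of this curve for one peptide variant than for another indicates that water molecules leave the hydration shell more quickly around that variant.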
4
Oligomeric States of Aβ in Lipid Membrane Models
Interactions of amyloid oligomers with cell membranes are believed
to contribute to toxicity. On the one hand, the membrane can
trigger amyloid fibril formation at a lower peptide concentration
[103, 104]. On the other hand, accumulation of oligomers on the
membrane surface can impart unequal stress on the bilayer, extract lipids
and contribute to the formation of stable phospholipid/oligomer complexes, or create pores transporting Ca2+ ions through the
membrane [105–110]. Due to the structural heterogeneity of the oligomers, many pores of various inner diameters made of different
numbers of Aβ subunits have been modeled based on atomic force
microscopy images and MD simulations [110].
Based on size exclusion chromatography, transmission electron
microscopy, circular dichroism, and NMR information on the
structural environment experienced by the Q15, N27, and M35
side chains [111], we recently modeled a tetrameric β-barrel consisting of two distinct β-hairpins, with an asymmetric arrangement
of eight antiparallel β-strands and an inner pore diameter of 0.7 nm
for both alloforms of Aβ [112]. Using extensive atomistic REMD
simulations, we found that this barrel exists transiently for Aβ42
and not for Aβ40 within a dipalmitoylphosphatidylcholine (DPPC)
lipid bilayer membrane, and this may explain the higher toxicity
of Aβ42 compared with its Aβ40 counterpart [112]. The same simulations indicate that the lower and higher induced toxicity of the A2T
and D23N mutants, respectively, cannot be correlated to their tetrameric
β-barrel pore-forming probabilities, at least in a DPPC membrane
environment [113], but the addition of cholesterol to the membrane
composition could make a difference [114]. As the structures of
amyloid oligomers inserted into a membrane remain elusive, we
explored the stability of Aβ11–42 trimers with parallel (U-shape
fibril) and antiparallel (β-hairpin) β-sheet structures in a DPPC
membrane. Our REMD simulations strongly suggest that these
two assemblies represent minimal seeds or nuclei for the formation
of amyloid fibrils, a variety of β-barrel pores, and various aggregates
for Aβ sequences [115]. Apart from β-barrel pores, α-helical pores
are also possible [107, 116]. The equilibrium structures are, however, in all
cases dependent on the lipid composition [117]. For
instance, it is known that the omega-3 polyunsaturated fatty acid
slows the progression of Alzheimer’s disease, while its omega-6
counterpart is linked to increased risk. We showed by MD simulations that variation in the abundance of 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC), omega-3, and omega-6
modulates the conformational ensemble of the Aβ29–42
dimer [118].
5
Study of Inhibitors/Aβ Amyloid Oligomers
Understanding the mechanistic determinants of Aβ amyloid-inhibitor interactions is continuously pursued to develop more efficient drugs [70]. A plethora of inhibitors aimed at targeting
oligomers, secondary nucleation, or fibril elongation have been
tested in bulk solution and in vivo [119]. We performed numerous
simulations to study the detailed interactions of Aβ fragments and of
Aβ40 and Aβ42 oligomers with a large set of inhibitors including
N-methylated peptide, EGCG, curcumin, resveratrol,
1,4-naphthoquinon-2-yl-L-tryptophan (NQ-Trp), astaxanthin,
and betanin [120–127]. We have shown, like other computational
studies [128–130], that there are many binding sites with small
occupancies and contact surfaces. Even with a binding pocket,
there are multiple binding modes, demonstrating the transient
character of Aβ oligomer/inhibitor interactions in bulk solution.
This is consistent with the absence of specific interactions as
evidenced by NMR and the low affinity of drugs for Aβ monomers
and small oligomers [131].
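The notion of binding-site occupancy used above can be illustrated with a simple per-residue contact analysis: for each residue, count the fraction of trajectory frames in which any inhibitor atom lies within a distance cutoff. The sketch below is a minimal example under stated assumptions; the function name, the use of one representative atom per residue, the input layout, and the 0.45 nm cutoff are hypothetical and do not reproduce the analysis of refs [120–127].

```python
import math


def contact_occupancy(residue_coords, ligand_coords, cutoff=0.45):
    """Fraction of frames in which each residue contacts the ligand.

    residue_coords: list over frames of lists of (x, y, z) tuples in nm,
                    one representative atom per residue.
    ligand_coords:  list over frames of lists of (x, y, z) ligand atoms in nm.
    A contact is counted when any ligand atom lies within `cutoff` nm.
    """
    n_frames = len(residue_coords)
    counts = [0] * len(residue_coords[0])
    for res_frame, lig_frame in zip(residue_coords, ligand_coords):
        for i, res_xyz in enumerate(res_frame):
            if any(math.dist(res_xyz, atom) <= cutoff for atom in lig_frame):
                counts[i] += 1
    return [count / n_frames for count in counts]


# Toy data: two residues, one ligand atom that hops between them.
toy_residues = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]] * 2
toy_ligand = [[(0.3, 0.0, 0.0)], [(1.2, 0.0, 0.0)]]
toy_occupancy = contact_occupancy(toy_residues, toy_ligand)
```

Many residues with low, similar occupancies and no dominant site would be the signature of the transient, multi-site binding described in the text.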
Besides small molecules, carbon nanoparticles such as fullerene
and carbon nanotubes can also impede the fibrillation of Aβ and
β2m proteins [132, 133]. In this context, we have shown by
atomistic REMD simulations that carbon nanotubes impact both
the primary and secondary nucleation mechanisms by destabilizing
β-rich oligomers and fibril assemblies for both the Aβ16–22 and
Aβ25–35 peptides [134, 135].
Several reasons have been put forward to explain the repeated
failures of small drugs and antibodies targeting Aβ oligomers
[131]. As the lipid membrane can catalyze the formation of toxic
amyloid intermediates, the interplay between amyloid oligomers,
inhibitors, and the lipid membrane should also be considered more
systematically [136, 137]. Similarly, clinical trials might consider a
synergy of multiple drugs targeting Aβ and tau along with the use
of laser or bubble cavitation techniques to destabilize amyloid
aggregates [138–141].
6
Long Amyloid Constructs
The conformational ensemble of several monomeric prion fragments has been explored by coarse-grained OPEP Monte Carlo
[48] and by atomistic MD and REMD simulations. The CG simulations on PrP(143–158), starting from random states, showed that
this fragment forms a helix by itself, consistent with experiments
[142]. In contrast, the fragment PrP(128–164) was found to
code either for an α/β topology, as found in the NMR structure
of recombinant full-length PrP, or for a β-hairpin spanning residues
142–167, as found by NMR for PrP(142–167) [142]. Using atomistic MD, we showed that the helix H1 (residues 143–158) is rather
stable upon P102L, M129V, and G131V mutation and upon deletion of
the residues coding for the first or the second β-strand [143, 144].
Using REMD simulations, we found an intermediate state characterized by a significant detachment of helix H1 from the PrP core
[145], consistent with other studies [146], which forms very easily
for the E211D mutation in standard MD simulations [147]. This
is of interest as mutation at position 211 drives a switch between
Creutzfeldt-Jakob disease (CJD) and Gerstmann-Sträussler-Scheinker syndrome [147]. Apart from the implication of H1 detachment
in the early steps of aggregation, we found that the CJD-causing
T183A mutation accelerates the conversion of the helix H2 into
β-sheets in the monomer and dimer of PrP [148].
In comparison to the Aβ protein, the number of computational
studies on the α-synuclein monomer and long tau constructs is very
small. We recently performed atomistic REMD simulations on the
tau R3–R4 domain (residues 306–378) starting from the fibril
topology [149]. Note that cryo-electron microscopy of full-length tau fibrils in the brain of an individual with Alzheimer’s
disease reveals an R3–R4 domain with a C-shaped topology of
eight β-sheets, while the rest of the protein is flexible [9]. We
found that the WT R3–R4 dimer populates elongated, U-shaped,
V-shaped, and globular topologies rather than C-shaped forms.
MD simulations revealed that upon phosphorylation of Ser356
(pSer356) there is a substantial decrease of intermediates near the
fibril-like conformers, compared to its WT counterpart [149]. This
result explains why WT K18 develops seeding activity more rapidly
than pSer356 K18 [150].
Recently, we applied the chain-growth PEP-FOLD approach,
developed for ab initio structure prediction of well-folded proteins
[151–155] and protein–peptide complexes [156], to the monomeric α-synuclein protein [25]. As expected from experiments,
the α-synuclein monomer was found to be highly dynamic
(Rg distribution varying between 1.5 nm and 2.5 nm) and without
any well-defined structure (65% of turn/coil). We observed, however, a high propensity of broken helices in the N-terminus, consistent with α-synuclein’s membrane binding properties [25].
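The radius of gyration (Rg) and the end-to-end distance used to characterize these disordered ensembles are simple functions of the atomic coordinates. The sketch below computes both for a single conformation, assuming equal weights for all sites (for instance one Cα atom per residue); mass weighting and the choice of atoms actually used in the cited analyses are not specified here, and the function names are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) positions."""
    n = len(coords)
    center = [sum(xyz[k] for xyz in coords) / n for k in range(3)]
    mean_sq = sum(math.dist(xyz, center) ** 2 for xyz in coords) / n
    return math.sqrt(mean_sq)


def end_to_end_distance(coords):
    """Distance between the first and last positions of the chain."""
    return math.dist(coords[0], coords[-1])


# Toy chain of three sites on a line, spaced by 1 length unit.
toy_chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_rg = radius_of_gyration(toy_chain)
toy_d = end_to_end_distance(toy_chain)
```

Applying these two functions to every conformation of an ensemble yields the distributions from which ranges such as those quoted in the text are read.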
Using a total of 500 PEP-FOLD simulations, we explored here
the conformations of the R3–R4 tau monomer, the R3–R4 domain
being known to be the core of the amyloid fibril. As seen on the free
energy landscape, there are no dominant structures, with the Rg distribution varying between 1.2 and 2.0 nm and the Cα end-to-end
distance between 0.5 and 5.0 nm (Fig. 1). Yet, β-strand signals are
detected by the structural alphabet profile along the sequence, and
regions with a β-strand propensity of more than 0.2 encompass
residues 306–315, 317–321, 329–332, 335–345, 349–354,
361–364, and 367–378 (Fig. 2). It is interesting that these regions
fit well the β-strands observed in the fibril, which encompass residues 306–311, 313–322, 327–331, 336–341, 343–347, 349–354,
356–363, and 368–378 [9]. Most structures are, however, devoid
of β-sheets and are random coil with some helical content (Fig. 1a–c),
and a few structures have the propensity to form a double
U-shape free of intramolecular H-bonds (Fig. 1d). Overall, we do
not find any evidence of a fibril-like conformation encoded in the
monomeric ensemble.
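The agreement between the predicted β-strand regions and the fibril β-strands can be quantified directly from the residue ranges listed above. The following Python sketch counts shared residues; the two overlap measures (coverage of the fibril strands and the Jaccard index) are choices made for this illustration and are not part of the original analysis.

```python
def residues(regions):
    """Expand inclusive (start, end) ranges into a set of residue numbers."""
    return {r for start, end in regions for r in range(start, end + 1)}


# Residue ranges as listed in the text.
predicted = [(306, 315), (317, 321), (329, 332), (335, 345),
             (349, 354), (361, 364), (367, 378)]
fibril = [(306, 311), (313, 322), (327, 331), (336, 341),
          (343, 347), (349, 354), (356, 363), (368, 378)]

pred_set = residues(predicted)
fib_set = residues(fibril)
shared = pred_set & fib_set

coverage = len(shared) / len(fib_set)            # fibril residues recovered
jaccard = len(shared) / len(pred_set | fib_set)  # overlap relative to the union
```

With these ranges, 46 of the 57 fibril β-strand residues (about 81%) fall inside a predicted region, which supports the statement that the two sets of regions fit well.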
7
Conclusions
We have reviewed our contributions to the field of amyloid simulations. Conformational ensembles of amyloid monomers and oligomers have been addressed in settings ranging from bulk solution and membrane
environments to complexes with different classes of inhibitors.
Improved coarse-grained and atomistic force fields should tell us
more about the link between oligomers and toxicity. This requires,
however, mimicking in vivo conditions such as crowding and lipids
and varying aging-related parameters such as the fluid flow
in the brain extracellular space [157, 158].
Fig. 1 PEP-FOLD free energy landscape (in kcal/mol) of the tau (306–378)
monomer projected on the radius of gyration and the end-to-end distance. The
N-terminus is denoted by a sphere, and the C-terminus by a square
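A free energy landscape of this kind is commonly obtained by histogramming the sampled conformations along the two coordinates and converting populations to free energies with F = -kB T ln(P / Pmax). The sketch below illustrates that conversion only; how the PEP-FOLD landscape of Fig. 1 was actually built is not described here, so the function name, the bin width, and the temperature are assumptions.

```python
import math
from collections import Counter

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def free_energy_surface(rg_values, d_values, bin_width=0.1, temperature=300.0):
    """Map 2D histogram bins to free energies (kcal/mol).

    rg_values, d_values: per-conformation radius of gyration and end-to-end
    distance (nm). Bins are indexed by (floor(rg / w), floor(d / w)). The most
    populated bin is the zero of free energy; unvisited bins are absent.
    """
    counts = Counter(
        (math.floor(rg / bin_width), math.floor(d / bin_width))
        for rg, d in zip(rg_values, d_values)
    )
    max_count = max(counts.values())
    return {
        bin_index: -K_B * temperature * math.log(n / max_count)
        for bin_index, n in counts.items()
    }


# Toy ensemble: four compact conformations and one extended conformation.
toy_surface = free_energy_surface([1.25, 1.25, 1.25, 1.25, 1.55],
                                  [2.05, 2.05, 2.05, 2.05, 3.05])
```

In the toy ensemble, the bin visited once lies kB T ln 4 above the bin visited four times, so a landscape without a single deep minimum corresponds to populations spread over many bins.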
Acknowledgments
We acknowledge support by the “Initiative d’Excellence” program
from the French State (Grant “DYNAMO”, ANR-11-LABX-0011-01, and “CACSICE”, ANR-11-EQPX-0008).
Notes The authors declare no competing financial interest.
Fig. 2 Structural alphabet profile predicted for tau (306–378). Green tones correspond to extended conformations, red tones to helical conformations, and blue tones to coil regions. Each column
corresponds to a fragment of four amino acids (the first column corresponds to VQIV, the last column
to KLTF)
References
1. Goldberg MS, Lansbury PT Jr (2000) Is there
a cause-and-effect relationship between alpha-synuclein fibrillization and Parkinson’s disease? Nat Cell Biol 2:E115–E119
2. Hardy J, Selkoe DJ (2002) The amyloid
hypothesis of Alzheimer’s disease: progress
and problems on the road to therapeutics.
Science 297:353–356
3. Scheckel C, Aguzzi A (2018) Prions, prionoids and protein misfolding disorders. Nat
Rev Genet 19:405–418
4. Barthélemy NR, Li Y, Joseph-Mathurin N,
Gordon BA, Hassenstab J, Benzinger TLS,
Buckles V, Fagan AM, Perrin RJ, Goate AM
et al (2020) Dominantly Inherited Alzheimer
Network. A soluble phosphorylated tau signature links tau, amyloid and the evolution of
stages of dominantly inherited Alzheimer’s
disease. Nat Med 26:398–407
5. Auluck PK, Caraveo G, Lindquist S (2010)
Alpha-Synuclein: membrane interactions and
toxicity in Parkinson’s disease. Annu Rev Cell
Dev Biol 26:211–233
6. Riek R, Hornemann S, Wider G, Billeter M,
Glockshuber R, Wüthrich K (1996) NMR
structure of the mouse prion protein domain
PrP(121-231). Nature 382:180–182
7. Dobson CM (1999) Protein misfolding, evolution and disease. Trends Biochem Sci
24:329–332
8. Lu JX, Qiang W, Yau WM, Schwieters CD,
Meredith SC, Tycko R (2013) Molecular
structure of β-amyloid fibrils in Alzheimer’s
disease brain tissue. Cell 154:1257–1268
9. Fitzpatrick AWP, Falcon B, He S, Murzin AG,
Murshudov G, Garringer HJ, Crowther RA,
Ghetti B, Goedert M, Scheres SHW (2017)
Cryo-EM structures of tau filaments from
Alzheimer’s disease. Nature 547:185–190
10. Guerrero-Ferreira R, Taylor NM, Mona D,
Ringler P, Lauer ME, Riek R, Britschgi M,
Stahlberg H (2018) Cryo-EM structure of
alpha-synuclein fibrils. elife 7:e36402
11. Meisl G, Michaels TCT, Linse S, Knowles TPJ
(2018) Kinetic analysis of amyloid formation.
Methods Mol Biol 1779:181–196
12. Buttstedt A, Wostradowski T, Ihling C,
Hause G, Sinz A, Schwarz E (2013) Different
morphology of amyloid fibrils originating
from agitated and non-agitated conditions.
Amyloid 2:86–92
13. Luo XD, Kong FL, Dang HB, Chen J, Liang
Y (2016) Macromolecular crowding favors
the fibrillization of β2-microglobulin by accelerating the nucleation step and inhibiting
fibril disassembly. Biochim Biophys Acta
1864:1609–1619
14. Xu W, Zhang C, Derreumaux P, Gräslund A,
Morozova-Roche L, Mu Y (2011) Intrinsic
determinants of Aβ(12-24) pH-dependent
self-assembly revealed by combined computational and experimental studies. PLoS One 6:
e24329
15. Habchi J, Chia S, Galvagnion C, Michaels
TCT, Bellaiche MMJ, Ruggeri FS,
Sanguanini M, Idini I, Kumita JR, Sparr E
et al (2018) Cholesterol catalyses Aβ42 aggregation through a heterogeneous nucleation
pathway in the presence of lipid membranes.
Nat Chem 10:673–683
16. Sibille N, Sillen A, Leroy A, Wieruszeski JM,
Mulloy B, Landrieu I, Lippens G (2006)
Structural impact of heparin binding to full-length Tau as studied by NMR spectroscopy.
Biochemistry 45:12560–12572
17. Balbach JJ, Ishii Y, Antzutkin ON, Leapman
RD, Rizzo NW, Dyda F, Reed J, Tycko R
(2000) Amyloid fibril formation by A beta
16-22, a seven-residue fragment of the Alzheimer’s beta-amyloid peptide, and structural
characterization by solid state NMR. Biochemistry 39:13748–13759
18. Inouye H, Sharma D, Goux WJ, Kirschner
DA (2006) Structure of core domain of
fibril-forming PHF/Tau fragments. Biophys
J 90:1774–1789
19. Bodles AM, Guthrie DJ, Greer B, Irvine GB
(2001) Identification of the region of
non-Abeta component (NAC) of Alzheimer’s
disease amyloid responsible for its aggregation
and toxicity. J Neurochem 78:384–395
20. Mousseau N, Derreumaux P (2005) Exploring the early steps of amyloid peptide aggregation by computers. Acc Chem Res
38:885–891
21. Nasica-Labouze J, Nguyen PH, Sterpone F,
Berthoumieu O, Buchete NV, Côté S, De
Simone A, Doig AJ, Faller P, Garcia A et al
(2015) Amyloid beta protein and Alzheimer’s
disease: when computer simulations complement experimental studies. Chem Rev
115:3518–3563
22. Lasagna-Reeves CA, Castillo-Carranza DL,
Guerrero-Muñoz MJ, Jackson GR, Kayed R
(2010) Preparation and characterization of
neurotoxic tau oligomers. Biochemistry
49:10039–10041
23. Fusco G, Chen SW, Williamson PTF,
Cascella R, Perni M, Jarvis JA, Cecchi C,
Vendruscolo M, Chiti F, Cremades N et al
(2017) Structural basis of membrane disruption and cellular toxicity by α-synuclein oligomers. Science 358:1440–1443
24. Graen T, Klement R, Grupi A, Haas E, Grubmüller H (2018) Transient secondary and tertiary structure formation kinetics in the
intrinsically disordered state of α-synuclein
from atomistic simulations. ChemPhysChem
19:2507–2511
25. Nguyen PH, Derreumaux P (2020) Structures of the intrinsically disordered Aβ, tau
and α-synuclein proteins in aqueous solution
from computer simulations. Biophys Chem
264:106421
26. Santini S, Wei G, Mousseau N, Derreumaux P
(2004) Pathway complexity of Alzheimer’s
beta-amyloid Abeta16-22 peptide assembly.
Structure 12:1245–1255
27. Santini S, Mousseau N, Derreumaux P (2004)
In silico assembly of Alzheimer’s Abeta16-22
peptide into beta-sheets. J Am Chem Soc
126:11509–11516
28. Wei G, Mousseau N, Derreumaux P (2004)
Sampling the self-assembly pathways of KFFE
hexamers. Biophys J 87:3648–3656
29. Melquiond A, Boucher G, Mousseau N, Derreumaux P (2005) Following the aggregation
of amyloid-forming peptides by computer
simulations. J Chem Phys 122:174904
30. Melquiond A, Mousseau N, Derreumaux P
(2006) Structures of soluble amyloid oligomers from computer simulations. Proteins
65:180–191
31. Boucher G, Mousseau N, Derreumaux P
(2006) Aggregating the amyloid Abeta(11-25) peptide into a four-stranded beta-sheet structure. Proteins 65:877–888
32. Melquiond A, Gelly JC, Mousseau N, Derreumaux P (2007) Probing amyloid fibril formation of the NFGAIL peptide by computer
simulations. J Chem Phys 126:065101
33. Man VH, He X, Derreumaux P, Ji B, Xie XQ,
Nguyen PH, Wang J (2019) Effects of
all-atom molecular mechanics force fields on
amyloid peptide assembly: the case of Aβ16-22
dimer. J Chem Theory Comput 15:1440–1452
34. Mo Y, Lu Y, Wei G, Derreumaux P (2009)
Structural diversity of the soluble trimers of
the human amylin(20-29) peptide revealed by
molecular dynamics simulations. J Chem Phys
130:125101
35. Lu Y, Derreumaux P, Guo Z, Mousseau N,
Wei G (2009) Thermodynamics and dynamics
of amyloid peptide oligomerization are
sequence dependent. Proteins 75:954–963
36. Wei G, Song W, Derreumaux P, Mousseau N
(2008) Self-assembly of amyloid-forming
peptides by molecular dynamics simulations.
Front Biosci 13:5681–5692
37. Song W, Wei G, Mousseau N, Derreumaux P
(2008) Self-assembly of the beta2-microglobulin NHVTLSQ peptide using a
coarse-grained protein model reveals a beta-barrel species. J Phys Chem B 112:4410–4418
38. Nguyen PH, Li MS, Derreumaux P (2011)
Effects of all-atom force fields on amyloid
oligomerization: replica exchange molecular
dynamics simulations of the Aβ(16-22)
dimer and trimer. Phys Chem Chem Phys
13:9778–9788
39. De Simone A, Derreumaux P (2010) Low
molecular weight oligomers of amyloid peptides display beta-barrel conformations: a replica exchange molecular dynamics study in
explicit solvent. J Chem Phys 132:165103
40. Nguyen PH, Derreumaux P (2013) Conformational ensemble and polymorphism of the
all-atom Alzheimer’s Aβ(37-42) amyloid peptide oligomers. J Phys Chem B 117:5831–5840
41. Spill YG, Pasquali S, Derreumaux P (2011)
Impact of thermostats on folding and aggregation properties of peptides using the optimized potential for efficient structure
prediction coarse-grained model. J Chem
Theory Comput 7:1502–1510
42. Nasica-Labouze J, Meli M, Derreumaux P,
Colombo G, Mousseau N (2011) A multiscale approach to characterize the early aggregation steps of the amyloid-forming peptide
GNNQQNY from the yeast prion sup-35.
PLoS Comput Biol 7:e1002051
43. Lu Y, Wei G, Derreumaux P (2012) Structural, thermodynamical, and dynamical properties of oligomers formed by the amyloid
NNQQ peptide: insights from coarse-grained
simulations. J Chem Phys 137:025101
44. Nguyen PH, Okamoto Y, Derreumaux P
(2013) Communication: simulated tempering with fast on-the-fly weight determination.
J Chem Phys 138:061102
45. Chebaro Y, Pasquali S, Derreumaux P (2012)
The coarse-grained OPEP force field for
non-amyloid and amyloid proteins. J Phys
Chem B 116:8741–8752
46. Wei G, Mousseau N, Derreumaux P (2007)
Computational simulations of the early steps
of protein aggregation. Prion 1:3–8
47. Derreumaux P (1999) From polypeptide
sequences to structures using Monte Carlo
simulations and an optimized potential. J
Chem Phys 5:2301–2310
48. Derreumaux P (2001) Generating ensemble
averages for small proteins from extended
conformations by Monte Carlo simulations.
Phys Rev Lett 1:206–209
49. Maupetit J, Tuffery P, Derreumaux P (2007)
A coarse-grained protein force field for folding and structure prediction. Proteins
69:394–408
50. Sterpone F, Nguyen PH, Kalimeri M, Derreumaux P (2013) Importance of the ion-pair
interactions in the OPEP coarse-grained
force field: parametrization and validation. J
Chem Theory Comput 9:4574–4584
51. Sterpone F, Melchionna S, Tuffery P,
Pasquali S, Mousseau N, Cragnolini T,
Chebaro Y, St-Pierre JF, Kalimeri M, Barducci
A et al (2014) The OPEP protein model:
from single molecules, amyloid formation,
crowding and hydrodynamics to DNA/RNA
systems. Chem Soc Rev 43:4871–4893
52. Kalimeri M, Derreumaux P, Sterpone F
(2015) Are coarse-grained models apt to
detect protein thermal stability? The case of
OPEP force field. J Non-Cryst Solids
407:494–501
53. Mousseau N, Derreumaux P, Barkema GT,
Malek R (2001) Sampling activated mechanisms in proteins with the activation-relaxation
technique. J Mol Graph Model 19:78–86
54. Wei GH, Derreumaux P, Mousseau N (2003)
Sampling the complex energy landscape of a
simple beta-hairpin. J Chem Phys 13:6403–6406
55. Mousseau N, Derreumaux P (2008) Exploring energy landscapes of protein folding and
aggregation. Front Biosci 13:4495–4516
56. Kreutzer AG, Nowick JS (2018) Elucidating
the structures of amyloid oligomers with macrocyclic β-hairpin peptides: insights into Alzheimer’s disease and other amyloid diseases.
Acc Chem Res 51:706–718
57. Laganowsky A, Liu C, Sawaya MR, Whitelegge JP, Park J, Zhao M, Pensalfini A, Soriaga
AB, Landau M, Teng PK et al (2012) Atomic
view of a toxic amyloid small oligomer. Science 335:1228–1231
58. Matthes D, Gapsys V, Brennecke JT, de Groot
BL (2016) An atomistic view of amyloidogenic self-assembly: structure and dynamics
of heterogeneous conformational states in
the pre-nucleation phase. Sci Rep 6:33156
59. Levine ZA, Shea JE (2017) Simulations of
disordered proteins and systems with conformational heterogeneity. Curr Opin Struct Biol
43:95–103
60. Decatur SM (2006) Elucidation of residue-level structure and dynamics of polypeptides
via isotope-edited infrared spectroscopy. Acc
Chem Res 39:169–175
61. Jia Z, Schmit JD, Chen J (2020) Amyloid
assembly is dominated by misregistered
kinetic traps on an unbiased energy landscape.
Proc Natl Acad Sci U S A 117:10322–10328
62. Samantray S, Yin F, Kav B, Strodel B (2020)
Different force fields give rise to different
amyloid aggregation pathways in molecular
dynamics simulations. J Chem Inf Model
60:6462–6475
63. Robustelli P, Piana S, Shaw DE (2018) Developing a molecular dynamics force field for
both folded and disordered protein states.
Proc Natl Acad Sci U S A 115:E4758–E4766
64. Rahman MU, Rehman AU, Liu H, Chen HF
(2020) Comparison and evaluation of force
fields for intrinsically disordered proteins. J
Chem Inf Model 60:4912–4923
65. Huang J, Rauscher S, Nawrocki G, Ran T,
Feig M, de Groot BL, Grubmüller H, MacKerell AD Jr (2017) CHARMM36m: an
improved force field for folded and intrinsically disordered proteins. Nat Methods
14:71–73
66. Carballo-Pacheco M, Ismail AE, Strodel B
(2018) On the applicability of force fields to
study the aggregation of amyloidogenic peptides using molecular dynamics simulations. J
Chem Theory Comput 14:6063–6075
67. Baftizadeh F, Pietrucci F, Biarnés X, Laio A
(2013) Nucleation process of a fibril precursor
in the C-terminal segment of amyloid-β. Phys
Rev Lett 110:168103
68. Šarić A, Michaels TCT, Zaccone A, Knowles
TPJ, Frenkel D (2016) Kinetics of spontaneous filament nucleation via oligomers: insights
from theory and simulation. J Chem Phys
145:211926
69. Lee CT, Terentjev EM (2017) Mechanisms
and rates of nucleation of amyloid fibrils. J
Chem Phys 147:105103
70. Nguyen P, Derreumaux P (2014) Understanding amyloid fibril nucleation and Aβ oligomer/drug interactions from computer
simulations. Acc Chem Res 47:603–611
71. Tran TT, Nguyen PH, Derreumaux P (2016)
Lattice model for amyloid peptides: OPEP
force field parametrization and applications
to the nucleus size of Alzheimer’s peptides. J
Chem Phys 144:205103
72. Chiricotto M, Tran TT, Nguyen PH,
Melchionna S, Sterpone F, Derreumaux P
(2017) Coarse-grained and all-atom simulations towards the early and late steps of amyloid fibril formation. Isr J Chem 57:564–573
73. Sterpone F, Doutreligne S, Tran TT,
Melchionna S, Baaden M, Nguyen PH, Derreumaux P (2018) Multiscale simulations of
biological systems using the OPEP coarse-grained model. Biochem Biophys Res Commun 498:296–304
74. Szała-Mendyk B, Molski A (2020) Clustering
and fibril formation during GNNQQNY
aggregation: a molecular dynamics study. Biomol Ther 10:1362
75. Villali J, Dark J, Brechtel TM, Pei F, Sindi SS,
Serio TR (2020) Nucleation seed size determines amyloid clearance and establishes a
barrier to prion appearance in yeast. Nat
Struct Mol Biol 27:540–549
76. Sterpone F, Derreumaux P, Melchionna S
(2015) Protein simulations in fluids: coupling
the OPEP coarse-grained force field with
hydrodynamics. J Chem Theory Comput
11:1843–1853
77. Chiricotto M, Sterpone F, Derreumaux P,
Melchionna S (2016) Multiscale simulation
of molecular processes in cellular environments. Philos Trans A Math Phys Eng Sci
374:20160225
78. Chiricotto M, Melchionna S, Derreumaux P,
Sterpone F (2016) Hydrodynamic effects on
β-amyloid (16-22) peptide aggregation. J
Chem Phys 145:035102
79. Chiricotto M, Melchionna S, Derreumaux P,
Sterpone F (2019) Multiscale aggregation of
the amyloid Aβ16-22 peptide: from disordered
coagulation and lateral branching to amorphous prefibrils. J Phys Chem Lett
10:1594–1599
80. Nguyen PH, Sterpone F, Derreumaux P
(2020) Aggregation of disease-related peptides. Prog Mol Biol Transl Sci 170:435–460
81. Lasagna-Reeves CA, Glabe CG, Kayed R
(2011) Amyloid-β annular protofibrils evade
fibrillar fate in Alzheimer disease brain. J Biol
Chem 286:22122–22130
82. Lesné SE, Sherman MA, Grant M,
Kuskowski M, Schneider JA, Bennett DA,
Ashe KH (2013) Brain amyloid-β oligomers
in ageing and Alzheimer’s disease. Brain
136:1383–1398
83. Müller-Schiffmann A, Andreyeva A, Horn
AH, Gottmann K, Korth C, Sticht H (2011)
Molecular engineering of a secreted, highly
homogeneous, and neurotoxic Aβ dimer.
ACS Chem Neurosci 2:242–248
84. Rousseau F, Schymkowitz J, De Strooper B
(2014) The Alzheimer disease protective
mutation A2T modulates kinetic and thermodynamic properties of amyloid-β (Aβ) aggregation. J Biol Chem 289:30977–30989
85. Zheng X, Liu D, Roychaudhuri R, Teplow
DB, Bowers MT (2015) Amyloid β-protein
assembly: differential effects of the protective
A2T mutation and recessive A2V familial Alzheimer’s disease mutation. ACS Chem Neurosci 6:1732–1740
86. Messa M, Colombo L, del Favero E, Cantù L,
Stoilova T, Cagnotto A, Rossi A, Morbin M,
Di Fede G, Tagliavini F et al (2014) The
peculiar role of the A2V mutation in amyloid-β (Aβ) 1-42 molecular assembly. J Biol
Chem 289:24143–24152
87. Tarus B, Tran TT, Nasica-Labouze J,
Sterpone F, Nguyen PH, Derreumaux P
(2015) Structures of the Alzheimer’s wildtype Aβ1-40 dimer from atomistic simulations. J Phys Chem B 119:10478–10487
88. Man VH, Nguyen PH, Derreumaux P (2017)
High-resolution structures of the amyloid-β
1-42 dimers from the comparison of four
atomistic force fields. J Phys Chem B
121:5977–5987
89. Côté S, Derreumaux P, Mousseau N (2011)
Distinct morphologies for amyloid beta protein monomer: Aβ1-40, Aβ1-42, and Aβ1-40
(D23N). J Chem Theory Comput
7:2584–2592
90. Côté S, Laghaei R, Derreumaux P, Mousseau
N (2012) Distinct dimerization for various
alloforms of the amyloid-beta protein: Aβ
(1-40), Aβ(1-42), and Aβ(1-40)(D23N). J
Phys Chem B 116:4043–4055
91. Cao Y, Jiang X, Han W (2017) Self-assembly
pathways of β-sheet-rich amyloid-β(1-40)
dimers: markov state model analysis on millisecond hybrid-resolution simulations. J Chem
Theory Comput 13:5731–5744
92. Chebaro Y, Mousseau N, Derreumaux P
(2009) Structures and thermodynamics of
Alzheimer’s amyloid-beta Abeta(16-35)
monomer and dimer by replica exchange
molecular dynamics simulations: implication
for full-length Abeta fibrillation. J Phys
Chem B 113:7668–7675
93. Man VH, Nguyen PH, Derreumaux P (2017)
Conformational ensembles of the wild-type
and S8C Aβ1-42 Dimers. J Phys Chem B
121:2434–2442
94. Viet MH, Nguyen PH, Derreumaux P, Li MS
(2014) Effect of the English familial disease
mutation (H6R) on the monomers and
dimers of Aβ40 and Aβ42. ACS Chem Neurosci 5:646–657
95. Viet MH, Nguyen PH, Ngo ST, Li MS, Derreumaux P (2013) Effect of the Tottori familial disease mutation (D7N) on the monomers
and dimers of Aβ40 and Aβ42. ACS Chem
Neurosci 4:1446–1457
96. Huet A, Derreumaux P (2006) Impact of the
mutation A21G (Flemish variant) on Alzheimer’s beta-amyloid dimers by molecular
dynamics simulations. Biophys J
91:3829–3840
97. Nguyen PH, Sterpone F, Campanera JM,
Nasica-Labouze J, Derreumaux P (2016)
Impact of the A2V mutation on the heterozygous and homozygous Aβ1-40 dimer structures from atomistic simulations. ACS Chem
Neurosci 7:823–832
98. Nguyen PH, Tarus B, Derreumaux P (2014)
Familial Alzheimer A2V mutation reduces
the intrinsic disorder and completely changes
the free energy landscape of the Aβ1-28
monomer. J Phys Chem B 118:501–510
99. Nguyen PH, Sterpone F, Pouplana R,
Derreumaux P, Campanera JM (2016)
Dimerization mechanism of Alzheimer Aβ40
peptides: the high content of intrapeptide-stabilized conformations in A2V and A2T heterozygous dimers retards amyloid fibril formation. J Phys Chem B 120:12111–12126
100. Das P, Chacko AR, Belfort G (2017) Alzheimer’s protective cross-interaction between
wild-type and A2T variants alters Aβ42 dimer
structure. ACS Chem Neurosci 8:606–618
101. Li H, Nam Y, Salimi A, Lee JY (2020) Impact
of A2V mutation and histidine tautomerism
on Aβ42 monomer structures from atomistic
simulations. J Chem Inf Model
60:3587–3592
102. Aggarwal L, Biswas P (2020) Effect of Alzheimer’s disease causative and protective
mutations on the hydration environment of
amyloid-β. J Phys Chem B 124:2311–2322
103. Banerjee S, Hashemi M, Lv Z, Maity S,
Rochet JC, Lyubchenko YL (2017) A novel
pathway for amyloids self-assembly in aggregates at nanomolar concentration mediated
by the interaction with surfaces. Sci Rep
7:45592
104. Alvarez AB, Caruso B, Rodríguez PEA, Petersen SB, Fidelio GD (2020) Aβ-amyloid fibrils
are self-triggered by the interfacial lipid environment and low peptide content. Langmuir
36:8056–8065
105. Farrugia MY, Caruana M, Ghio S,
Camilleri A, Farrugia C, Cauchi RJ,
Cappelli S, Chiti F, Vassallo N (2020) Toxic
oligomers of the amyloidogenic HypF-N protein form pores in mitochondrial membranes.
Sci Rep 10:17733
106. Ghio S, Camilleri A, Caruana M, Ruf VC,
Schmidt F, Leonov A, Ryazanov S,
Griesinger C, Cauchi RJ, Kamp F et al
(2019) Cardiolipin promotes pore-forming
activity of alpha-synuclein oligomers in mitochondrial membranes. ACS Chem Neurosci
10:3815–3829
107. Österlund N, Moons R, Ilag LL, Sobott F,
Gräslund A (2019) Native ion mobility-mass
spectrometry reveals the formation of β-barrel
shaped amyloid-β hexamers in a membrane-mimicking environment. J Am Chem Soc
141:10440–10450
108. Ait-Bouziad N, Lv G, Mahul-Mellier AL,
Xiao S, Zorludemir G, Eliezer D, Walz T,
Lashuel HA (2017) Discovery and characterization of stable and toxic Tau/phospholipid
oligomeric complexes. Nat Commun 8:1678
109. Jang H, Arce FT, Ramachandran S, Kagan
BL, Lal R, Nussinov R (2014) Disordered
amyloidogenic peptides may insert into the
membrane and assemble into common cyclic
structural motifs. Chem Soc Rev
43:6750–6764
110. Connelly L, Jang H, Arce FT, Capone R,
Kotler SA, Ramachandran S, Kagan BL,
Nussinov R, Lal R (2012) Atomic force
microscopy and MD simulations reveal pore-like structures of all-D-enantiomer of Alzheimer’s β-amyloid peptide: relevance to the ion
channel mechanism of AD pathology. J Phys
Chem B 116:1728–1735
111. Serra-Batiste M, Ninot-Pedrosa M,
Bayoumi M, Gairí M, Maglia G, Carulla N
(2016) Aβ42 assembles into specific β-barrel
pore-forming oligomers in membrane-mimicking environments. Proc Natl Acad Sci
U S A 113:10866–10871
112. Nguyen PH, Campanera JM, Ngo ST,
Loquet A, Derreumaux P (2019) Tetrameric
Aβ40 and Aβ42 β-barrel structures by extensive atomistic simulations. I. In a bilayer mimicking a neuronal membrane. J Phys Chem B
123:3643–3648
113. Ngo ST, Nguyen PH, Derreumaux P (2020)
Impact of A2T and D23N mutations on tetrameric Aβ42 barrel within a dipalmitoylphosphatidylcholine lipid bilayer membrane by
replica exchange molecular dynamics. J Phys
Chem B 124:1175–1182
114. Di Scala C, Yahi N, Boutemeur S, Flores A,
Rodriguez L, Chahinian H, Fantini J (2016)
Common molecular mechanism of amyloid
pore formation by Alzheimer’s β-amyloid
peptide and α-synuclein. Sci Rep 6:28781
115. Ngo ST, Nguyen PH, Derreumaux P (2020)
Stability of Aβ11-40 trimers with parallel and
antiparallel β-sheet organizations in a
membrane-mimicking environment by replica
exchange molecular dynamics simulation. J
Phys Chem B 124:617–626
116. Ngo ST, Derreumaux P, Vu VV (2019) Probable transmembrane amyloid α-helix bundles
capable of conducting Ca2+ ions. J Phys
Chem B 123:2645–2653
117. Sahoo A, Matysiak S (2019) Computational
insights into lipid assisted peptide misfolding
and aggregation in neurodegeneration. Phys
Chem Chem Phys 21:22679–22694
118. Lu Y, Shi XF, Nguyen PH, Sterpone F, Salsbury FR Jr, Derreumaux P (2019) Amyloid-β
(29-42) dimeric conformations in membranes
rich in omega-3 and omega-6 polyunsaturated fatty acids. J Phys Chem B
123:2687–2696
119. Doig AJ, Derreumaux P (2015) Inhibition of
protein aggregation and amyloid formation
by small molecules. Curr Opin Struct Biol
30:50–56
120. Chebaro Y, Derreumaux P (2009) Targeting
the early steps of Abeta16-22 protofibril disassembly by N-methylated inhibitors: a
numerical study. Proteins 75:442–452
121. Chebaro Y, Jiang P, Zang T, Mu Y, Nguyen
PH, Mousseau N, Derreumaux P (2012)
Structures of Aβ17-42 trimers in isolation
and with five small-molecule drugs using a
hierarchical computational procedure. J Phys
Chem B 116:8412–8422
122. Zhang T, Zhang J, Derreumaux P, Mu Y
(2013) Molecular mechanism of the inhibition of EGCG on the Alzheimer Aβ(1-42)
dimer. J Phys Chem B 117:3993–4002
123. Zhang T, Xu W, Mu Y, Derreumaux P (2014)
Atomic and dynamic insights into the beneficial effect of the 1,4-naphthoquinon-2-yl-L-tryptophan inhibitor on Alzheimer’s Aβ1-42
dimer in terms of aggregation and toxicity.
ACS Chem Neurosci 5:148–159
124. Tarus B, Nguyen PH, Berthoumieu O,
Faller P, Doig AJ, Derreumaux P (2015)
Molecular structure of the NQTrp inhibitor
with the Alzheimer Aβ1-28 monomer. Eur J
Med Chem 91:43–50
125. Berthoumieu O, Nguyen PH, Castillo-Frias
MP, Ferre S, Tarus B, Nasica-Labouze J,
Noël S, Saurel O, Rampon C, Doig AJ et al
(2015) Combined experimental and simulation studies suggest a revised mode of action
of the anti-Alzheimer disease drug NQ-Trp.
Chemistry 21:12657–12666
126. Minh Hung H, Nguyen MT, Tran PT,
Truong VK, Chapman J, Quynh Anh LH,
Derreumaux P, Vu VV, Ngo ST (2020)
Impact of the astaxanthin, betanin, and
EGCG compounds on small oligomers of
amyloid Aβ40 peptide. J Chem Inf Model
60:1399–1408
127. Nguyen PH, Del Castillo-Frias MP,
Berthoumieux O, Faller P, Doig AJ, Derreumaux P (2018) Amyloid-β/drug interactions
from computer simulations and cell-based
assays. J Alzheimers Dis 64:S659–S672
128. Liang C, Savinov SN, Fejzo J, Eyles SJ, Chen
J (2019) Modulation of amyloid-β42 conformation by small molecules through nonspecific binding. J Chem Theory Comput
15:5169–5174
129. Tran L (2018) Understanding the binding
mechanism of amyloid-β inhibitors from
molecular simulations. Curr Pharm Des
24:3341–3346
130. Zhu M, De Simone A, Schenk D, Toth G,
Dobson CM, Vendruscolo M (2013) Identification of small-molecule binding pockets in
the soluble monomeric form of the Aβ42
peptide. J Chem Phys 139:035101
131. Doig AJ, Del Castillo-Frias MP,
Berthoumieu O, Tarus B, Nasica-Labouze J,
Sterpone F, Nguyen PH, Hooper NM,
Faller P, Derreumaux P (2017) Why is
research on amyloid-β failing to give new
drugs for Alzheimer’s disease? ACS Chem
Neurosci 8:1435–1437
132. Kim JE, Lee M (2003) Fullerene inhibits
beta-amyloid peptide aggregation. Biochem
Biophys Res Commun 303:576–579
133. Linse S, Cabaleiro-Lago C, Xue WF, Lynch I,
Lindman S, Thulin E, Radford SE, Dawson
KA (2007) Nucleation of protein fibrillation
by nanoparticles. Proc Natl Acad Sci U S A
104:8691–8696
134. Fu Z, Luo Y, Derreumaux P, Wei G (2009)
Induced beta-barrel formation of the Alzheimer’s Abeta25-35 oligomers on carbon nanotube surfaces: implication for amyloid fibril
inhibition. Biophys J 97:1795–1803
135. Li H, Luo Y, Derreumaux P, Wei G (2011)
Carbon nanotube inhibits the formation of
β-sheet-rich oligomers of the Alzheimer’s
amyloid-β(16-22) peptide. Biophys J
101:2267–2276
136. Limbocker R, Mannini B, Ruggeri FS,
Cascella R, Xu CK, Perni M, Chia S, Chen
SW, Habchi J, Bigi A et al (2020) Trodusquemine displaces protein misfolded oligomers
from cell membranes and abrogates their
cytotoxicity through a generic mechanism.
Commun Biol 3:435
137. Cox SJ, Lam B, Prasad A, Marietta HA,
Stander NV, Joel JG, Sahoo BR, Guo F, Stoddard AK, Ivanova MI et al (2020) High-throughput screening at the membrane interface reveals inhibitors of amyloid-β. Biochemistry 59:2249–2258
138. Man VH, Derreumaux P, Li MS, Roland C,
Sagui C, Nguyen PH (2015) Picosecond dissociation of amyloid fibrils with infrared laser:
a nonequilibrium simulation study. J Chem
Phys 143:155101
139. Man VH, Derreumaux P, Nguyen PH (2016)
Nonequilibrium all-atom molecular dynamics
simulation of the bubble cavitation and application to dissociate amyloid fibrils. J Chem
Phys 145:174113
140. Kawasaki T, Man VH, Sugimoto Y,
Sugiyama N, Yamamoto H, Tsukiyama K,
Wang J, Derreumaux P, Nguyen PH (2020)
Infrared laser-induced amyloid fibril dissociation: a joint experimental/theoretical study
on the GNNQQNY peptide. J Phys Chem B
124:6266–6277
141. Man VH, Wang J, Derreumaux P, Nguyen
PH (2021) Nonequilibrium molecular
dynamics simulations of infrared laser-induced dissociation of a tetrameric Aβ42
β-barrel in a neuronal membrane model. Chem
Phys Lipids 234:105030
142. Derreumaux P (2001) Evidence that the
127-164 region of prion proteins has two
equi-energetic conformations with beta or
alpha features. Biophys J 81:1657–1665
143. Santini S, Derreumaux P (2004) Helix H1 of
the prion protein is rather stable against environmental perturbations: molecular dynamics
of mutation and deletion variants of PrP
(90-231). Cell Mol Life Sci 61:951–960
144. Santini S, Claude JB, Audic S, Derreumaux P
(2003) Impact of the tail and mutations
G131V and M129V on prion protein flexibility. Proteins 51:258–265
145. De Simone A, Zagari A, Derreumaux P
(2007) Structural and hydration properties
of the partially unfolded states of the prion
protein. Biophys J 93:1284–1292
146. Wille H, Dorosh L, Amidian S, Schmitt-Ulms G, Stepanova M (2019) Combining
molecular dynamics simulations and experimental analyses in protein misfolding. Adv
Protein Chem Struct Biol 118:33–110
147. Peoc’h K, Levavasseur E, Delmont E, De
Simone A, Laffont-Proust I, Privat N,
Chebaro Y, Chapuis C, Bedoucha P, Brandel
JP et al (2012) Substitutions at residue 211 in
the prion protein drive a switch between CJD
and GSS syndrome, a new mechanism governing inherited neurodegenerative disorders.
Hum Mol Genet 21:5417–5428
148. Chebaro Y, Derreumaux P (2009) The conversion of helix H2 to beta-sheet is accelerated in the monomer and dimer of the prion
protein upon T183A mutation. J Phys Chem
B 113:6942–6948
149. Derreumaux P, Man VH, Wang J, Nguyen
PH (2020) Tau R3-R4 domain dimer of the
wild type and phosphorylated ser356
sequences. I. In solution by atomistic simulations. J Phys Chem B 124:2975–2983
150. Haj-Yahya M, Gopinath P, Rajasekhar K,
Mirbaha H, Diamond MI, Lashuel HA
(2020) Site-specific hyperphosphorylation
inhibits, rather than promotes, tau
fibrillization, seeding capacity, and its microtubule binding. Angew Chem Int Ed Engl
59:4059–4067
151. Maupetit J, Derreumaux P, Tuffery P (2009)
PEP-FOLD: an online resource for de novo
peptide structure prediction. Nucleic Acids
Res 37:W498–W503
152. Maupetit J, Derreumaux P, Tufféry P (2010)
A fast method for large-scale de novo peptide
and miniprotein structure prediction. J Comput Chem 31:726–738
153. Thévenet P, Shen Y, Maupetit J, Guyon F,
Derreumaux P, Tufféry P (2012)
PEP-FOLD: an updated de novo structure
prediction server for both linear and disulfide
bonded cyclic peptides. Nucleic Acids Res 40:
W288–W293
154. Shen Y, Maupetit J, Derreumaux P, Tufféry P
(2014) Improved PEP-FOLD approach for
peptide and miniprotein structure prediction.
J Chem Theory Comput 10:4745–4758
155. Sutherland GA, Grayson KJ, Adams NBP et al
(2018) Probing the quality control
mechanism of the Escherichia coli twin-arginine translocase with folding variants of a
de novo-designed heme protein. J Biol Chem
293:6672–6681
156. Lamiable A, Thévenet P, Rey J, Vavrusa M,
Derreumaux P, Tufféry P (2016)
PEP-FOLD3: faster de novo structure prediction for linear peptides in solution and in
complex. Nucleic Acids Res 44:W449–W454
157. Ngo ST, Nguyen PH, Derreumaux P (2021)
Cholesterol molecules alter the energy landscape of small Aβ 1-42 oligomers. J Phys
Chem B 125(9):2299–2307. https://doi.org/10.1021/acs.jpcb.1c00036
158. Ramamoorthy A, Sahoo BR, Zheng J,
Chiricotto M, Straub JE, Dominguez L,
Shea J-E, Dokholyan NV, De Simone A et al
(2021) Amyloid oligomers: a joint experimental/computational perspective on Alzheimer’s disease, Parkinson’s disease, type II
diabetes and amyotrophic lateral sclerosis.
Chem Rev 121(4):2545–2647. https://doi.org/10.1021/acs.chemrev.0c01122
Chapter 6
Predicting Membrane-Active Peptide Dynamics in Fluidic
Lipid Membranes
Charles H. Chen, Karen Pepper, Jakob P. Ulmschneider,
Martin B. Ulmschneider, and Timothy K. Lu
Abstract
Understanding the interactions between peptides and lipid membranes could not only accelerate the
development of antimicrobial peptides as treatments for infections but also be applied to finding targeted
therapies for cancer and other diseases. However, designing biophysical experiments to study molecular
interactions between flexible peptides and fluidic lipid membranes has been an ongoing challenge. Recently,
with hardware advances, algorithm improvements, and more accurate parameterizations (i.e., force fields),
all-atom molecular dynamics (MD) simulations have been used as a “computational microscope” to
investigate the molecular interactions and mechanisms of membrane-active peptides in cell membranes
(Chen et al., Curr Opin Struct Biol 61:160–166, 2020; Ulmschneider and Ulmschneider, Acc Chem Res
51(5):1106–1116, 2018; Dror et al., Annu Rev Biophys 41:429–452, 2012). In this chapter, we describe
how to utilize MD simulations to predict and study peptide dynamics and how to validate the simulations
by circular dichroism, intrinsic fluorescent probe, membrane leakage assay, electrical impedance, and
isothermal titration calorimetry. Experimentally validated MD simulations open a new route towards
peptide design starting from sequence and structure and leading to desirable functions.
Key words Protein design, Molecular dynamics simulations, Membrane-active peptides, Protein
folding, Pore formation
1
Introduction
Membrane-active peptides (MAPs) are a ubiquitous part of the
innate immune defense system and also play a prominent role in
protein misfolding diseases, such as Alzheimer’s disease and Parkinson’s disease. Antimicrobial peptides (AMPs), a large subgroup
of MAPs, are typically amphiphilic peptides that selectively target
and kill bacteria at low micromolar concentrations, often without
harming mammalian cells [4–6]. To date, more than 3000 AMPs
have been reported and characterized, seven of which have been
approved as antibacterial agents by the U.S. Food and Drug
Administration (FDA) [7]. These 3000 AMPs vary widely in size,
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_6,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
sequence, secondary structure, and physicochemical properties
(e.g., hydrophobic content and net charge), and no common
sequence motif has been discovered to date. This lack of a known
sequence–function relationship is unusual for proteins. Pore-forming
AMPs bind to the cell membrane and spontaneously assemble in the lipid bilayer into channel-
or pore-like structures, though not all are cytolytic; however, we do
not yet understand the root causes of membrane-disruption activity. For example, a small number of amino acid mutations in melittin, a powerful helical AMP, can significantly change pore stability
[8], as well as other functional properties, such as antimicrobial
activity [9, 10] and cell selectivity [10]. Wiedman et al. showed that
just a few amino acid substitutions in Melp5, a gain-of-function
melittin variant, can yield peptides that disrupt liposomes
only in acidic conditions [11, 12]; even
the potency against cell membranes and the membrane pore size
are affected by these minor changes in amino acid sequence. In
nature, minor changes of a MAP sequence can promote membrane
disruption and cause protein-misfolding diseases. Typical examples
are neurodegenerative peptides, such as amyloid-beta, alpha-synuclein, and TDP-43 C-terminal fragments, in which minor mutations can result in protein misfolding and correlate with
neurodegenerative diseases [13–18]. Peptide length is also an
important factor, because hydrophobic mismatch determines whether a peptide can span the cell membrane [19–21]. Ulrich et al. reported several rationally designed
helical peptides built from repeated KIAGKIA motifs, with
lengths between 14 and 28 amino acids, and they showed that
peptide length affects a peptide’s ability to damage cell
membranes [20] and to penetrate into the membrane [21]. They
found that longer peptides were more likely to damage cell membranes than shorter peptides with similar amino acid content.
Although these studies have provided us with an improved
understanding of the correlation between peptide sequence and
pore formation, the pore-forming mechanisms and multimeric
functional structures in the membrane still remain largely undetermined. Several studies have addressed the difficulty of predicting
the form of ensembles of transient channel structures and capturing
highly dynamic peptide–peptide and peptide–lipid interactions in a
fluid lipid bilayer [2, 7, 22]. Multimeric MAP pore models and
pore-forming mechanisms have been proposed on the basis of molecular dynamics
(MD) simulations [23–25]. However, these
studies generally assume an initial pore configuration, which, given
the near-infinite number of possible structures, is unlikely to yield a
firm basis for functional optimization. We have demonstrated the
feasibility of using unbiased long-timescale MD simulations to
predict the functional pore structures of AMPs
[26, 27]. Advanced MD simulations can provide insight, at the
level of atomic detail, into the diverse dynamic structures formed
in complex fluid cell membranes.
2
Materials
2.1 Molecular Dynamics (MD) Simulations
MD simulations can be used as a computational microscope [3] to
study the atomic details of peptide folding and interactions with
other molecules, e.g., peptides, lipids, and water.
1. GROningen MAchine for Chemical Simulations (GROMACS)
program can be downloaded through the link http://manual.gromacs.org/documentation/ [28].
2. HIPPO simulation package can be downloaded through the
link https://www.biowerkzeug.com.
3. Visual Molecular Dynamics (VMD) molecular visualization
program can be downloaded through the link http://www.ks.uiuc.edu/Research/vmd/ [29].
4. CHARMM-GUI web-based graphical user interface can be
accessed through the link http://www.charmm-gui.org/ [30]
to build the simulation model.
2.2 Combinatorial Peptide Libraries
A combinatorial peptide library can be adapted to various kinds of
experiments for screening and offers a useful approach to study
sequence–function relationships.
1. Preparing beads: Tentagel® NH2 macrobeads (280–320 μm
particle size) (Rapp-Polymere; MB300002) are solvated in
methanol and incubated overnight.
2. Reaction vials: Peptide synthesis vessels, solid phase, T-bore
PTFE stopcocks 10 mL (Chemglass Life Science; CG-186401).
3. Photolinker: Fmoc-photolabile linker (Advanced Chem Tech;
RT1095).
4. Preparing cocktail for deprotection of the side chains: 88% vol
TFA, 5% vol phenol (preheated on a 40 °C hot plate), 5% vol
pure water, and 2% vol triisopropylsilane (using needle) are
mixed in a glass vial. The mixture is allowed to cool at
20 °C for 15 min.
5. Treating peptides before dissolving for stock solution: The
synthesized peptides on the resin are dissolved in the predissolving buffer (50% vol hexafluoroisopropanol [HFIP] and
50% vol water) and treated with UV light for 3 h until dry.
This step is performed in the hood because HFIP is highly
volatile.
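As a hedged illustration of the arithmetic behind the side-chain deprotection cocktail in item 4, the sketch below converts the stated volume percentages into component volumes. The 10 mL batch size and all function and variable names are assumptions made for this example; they are not part of the protocol.

```python
# Volume fractions (% vol) taken from item 4 of Subheading 2.2.
COCKTAIL_PERCENT = {
    "TFA": 88,
    "phenol": 5,
    "water": 5,
    "triisopropylsilane": 2,
}


def component_volumes(total_ml, percentages=COCKTAIL_PERCENT):
    """Return the volume (mL) of each component for a batch of total_ml."""
    if sum(percentages.values()) != 100:
        raise ValueError("volume percentages must sum to 100")
    return {name: total_ml * pct / 100 for name, pct in percentages.items()}


# Hypothetical batch size; choose whatever the synthesis scale requires.
volumes = component_volumes(10.0)
for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} mL")
```

The percentage check guards against a recipe that has been edited so that it no longer sums to 100%.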
2.3 Circular Dichroism (CD) Spectroscopy
CD spectroscopy can be used to characterize the secondary structure of the peptide in aqueous conditions with or without lipid
vesicles.
1. Cuvette: Macro quartz rectangular cuvette 1 mm (Fisher Scientific; 14958110).
2. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 is
prepared using 0.292 g sodium phosphate, monobasic (MW:
137.99) and 0.773 g sodium phosphate, dibasic (MW: 268.07)
in 500 mL Milli-Q® water. The pH is adjusted using 1 M
hydrochloric acid. The buffer is filtered using Stericup-GV
Sterile Vacuum Filtration System.
2.4 Oriented Circular Dichroism (OCD) Spectroscopy
OCD spectroscopy can be used to characterize the transmembrane
activity of the peptide in lipid bilayers.
1. Quartz glass plate: Circular quartz glass high performance
plates 200 nm to 2500 nm with 20-mm diameter and
1.25 mm thickness (Hellma Analytics; 202-QS) are used.
2.5 Liposome Fluorescent Leakage Assay
Liposome fluorescent leakage assays can be used to determine the
extent of membrane lysis or poration by the peptide. This assay can
be utilized as a screening platform to evaluate membrane-active
peptides.
1. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 with
100 mM potassium chloride is prepared using 0.292 g sodium
phosphate, monobasic (MW: 137.99), 0.773 g sodium phosphate, dibasic (MW: 268.07), and 3.728 g potassium chloride
in 500 mL Milli-Q® water. The pH is adjusted using 1 M
hydrochloric acid. The buffer is filtered using Stericup-GV
Sterile Vacuum Filtration System.
2. Fluorescent buffer: 0.061 g HEPES (4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid), 0.059 g sodium chloride,
0.268 g ANTS (8-aminonaphthalene-1,3,6-trisulfonic acid,
disodium salt), and 0.950 g DPX (p-xylene-bis-pyridinium
bromide) are mixed in 10 mM sodium phosphate pH 7 buffer
with 100 mM potassium chloride.
3. Chromatography column: DWK Life Sciences Kimble™
Kontes™ FlexColumn™ Economy Columns (Fisher Scientific;
K4204010750) are used.
2.6 Tryptophan Fluorescence Quenching Assay
Tryptophan fluorescence quenching assay can be used to study
peptide–lipid interactions and evaluate peptide-binding specificity.
1. 96-well microplate: Greiner UV-Star® 96 flat-bottom well
plates made of clear cyclic olefin copolymer (COC) (Sigma-Aldrich; M3812-40EA) are used.
2. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 with
100 mM potassium chloride is prepared using 0.292 g sodium
phosphate, monobasic (MW: 137.99), 0.773 g sodium phosphate, dibasic (MW: 268.07), and 3.728 g potassium chloride
in 500 mL Milli-Q® water. The pH is adjusted using 1 M
hydrochloric acid. The buffer is filtered using Stericup-GV
Sterile Vacuum Filtration System.
2.7 Electrical Impedance Spectroscopy
Electrical impedance spectroscopy can be used to monitor the lipid
membrane (i.e., resistance and conductance) and study peptide–
lipid interactions.
1. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 with
100 mM potassium chloride is prepared using 0.292 g sodium
phosphate, monobasic (MW: 137.99), 0.773 g sodium phosphate, dibasic (MW: 268.07), and 3.728 g potassium chloride
in 500 mL Milli-Q® water. The pH is adjusted using 1 M
hydrochloric acid. The buffer is filtered using the Stericup-GV Sterile Vacuum Filtration System.
2. Silicon plate: Polished n-type silicon wafers (<111>,
ρ = 0.001–0.005 Ω cm) (Silicon Quest International, San
Jose, CA) are used.
2.8 Isothermal Titration Calorimetry (ITC)
ITC can be used to characterize the thermodynamic parameters of
peptide–lipid interactions, e.g., binding stoichiometry, binding
enthalpy, and binding constant.
1. Aqueous buffer: 10 mM sodium phosphate buffer pH 7 with
100 mM potassium chloride is prepared using 0.292 g sodium
phosphate, monobasic (MW: 137.99), 0.773 g sodium phosphate, dibasic (MW: 268.07), and 3.728 g potassium chloride
in 500 mL Milli-Q® water. The pH is adjusted using 1 M
hydrochloric acid. The buffer is filtered using Stericup-GV
Sterile Vacuum Filtration System. The buffer is degassed
under vacuum over 30 min.
3 Methods
Peptide assembly and oligomerization in the cell membrane play a
critical role in many biological processes [31–35]. These peptides
adsorb, fold, cross, and form a functional structure or aggregate in
lipid bilayers, processes that often involve transient structures and
mechanisms. The difficulty of observing these transient structures
experimentally [2, 26, 33, 36, 37] limits our understanding of how
peptides interact with lipids in biological systems, especially as most
of the experiments do not directly provide molecular details of the
transitions but show their equilibrium states [21, 38–42]. Thus, the
details of the molecular mechanisms underpinning activity and the
chemical interactions driving them remain unclear. Atomic-detail
MD simulations, fueled by hardware advances, algorithm improvements, and more accurate force fields, are becoming an increasingly
powerful way to study these dynamic and transient events [2, 43–
46]. This method allows us to study atomic details, movements,
kinetics, chemical interactions, and assemblies of peptides in cell
membranes or model lipid bilayers [1, 2, 26, 36, 47, 48].
MD calculates the physical movement of atoms by applying
Newton’s laws of motion at the atomic level. An atom can be
represented by a point of mass m and charge q.
$$m \, \frac{\partial^{2} \vec{r}}{\partial t^{2}} = \vec{F}(\vec{r}, \vec{v}, t) \tag{1}$$
The force (F) on each atom is a function of its coordinate (r),
velocity (v), and time (t); m is the mass of the atom. The overall
potentials and parameters are determined by the force field. Each
atom can be attached to other atoms either via springs with force
constants ki (covalent bonds) or via electromagnetism. Gravitation
is neglected in MD simulations of biomolecules as it is significantly
smaller than the corresponding electrostatic forces. There are two
types of interaction between any two atoms: bonded (covalent
bonds) and nonbonded (electromagnetism). Therefore, the overall
potential energy (Vtotal) function is given by:
$$V_{\mathrm{total}} = V_{\mathrm{bonded}} + V_{\mathrm{nonbonded}} \tag{2}$$
The bonded potential energy can be presented as:
$$V_{\mathrm{bonded}} = V_{\mathrm{bonds}} + V_{\mathrm{angles}} + V_{\mathrm{improper\ dihedrals}} + V_{\mathrm{torsion\ angles}} \tag{3}$$
The nonbonded potential energy can be represented by a combination of the Lennard-Jones potential and the Coulomb
potential:
$$V_{\mathrm{nonbonded}} = V_{\mathrm{LJ}} + V_{\mathrm{Coulomb}} \tag{4}$$
Therefore, all atoms in the system are simulated by integrating
Newton's equation of motion (Eq. 1), which can be done using the
Verlet algorithm. This finite-difference scheme
is widely used in MD simulations because it is stable, time-reversible,
and energy-conserving.
$$m \, \frac{\vec{r}(t + \Delta t) + \vec{r}(t - \Delta t) - 2\,\vec{r}(t)}{\Delta t^{2}} = \vec{F}(t) \tag{5}$$
3.1 Molecular Dynamics (MD) Simulations
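As an illustration of the Verlet recurrence in Eq. (5), the following minimal Python sketch solves it for r(t + Δt) and steps a particle forward in time. The function name `verlet_trajectory` and its arguments are hypothetical, the force is assumed to depend on position only, and the first step is bootstrapped with a Taylor expansion because the recurrence needs two previous positions. It is not the integrator implementation of any MD package.

```python
import numpy as np


def verlet_trajectory(force, r0, v0, mass, dt, n_steps):
    """Integrate m * d2r/dt2 = F(r) with the position-Verlet recurrence (Eq. 5).

    force   : callable returning the force for a given position (assumed
              position-dependent only; a simplification of F(r, v, t))
    r0, v0  : initial position and velocity (scalars or arrays)
    """
    r = np.empty((n_steps + 1,) + np.shape(r0), dtype=float)
    r[0] = r0
    # Bootstrap r(dt): Eq. (5) needs r(t) and r(t - dt), which is unknown at t = 0.
    r[1] = r[0] + np.asarray(v0, dtype=float) * dt + 0.5 * force(r[0]) / mass * dt**2
    for i in range(1, n_steps):
        # Eq. (5) rearranged: r(t + dt) = 2 r(t) - r(t - dt) + F(t) / m * dt^2
        r[i + 1] = 2.0 * r[i] - r[i - 1] + force(r[i]) / mass * dt**2
    return r
```

For a harmonic spring (F = −k r) the trajectory should follow a cosine closely, which is a convenient sanity check before applying the same update to a full force field.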
In the following example, we applied unbiased all-atom MD simulations (see Note 1) using the GROMACS [28] and Hippo BETA
simulation packages (http://www.biowerkzeug.com) and the VMD
molecular visualization program (http://www.ks.uiuc.edu/Research/vmd/)
[29]. The coordinates and structures of extended
peptides were generated using Hippo BETA [49]. These initial
structures were relaxed in the isothermal–isobaric (NPT) ensemble
using atomic detail Monte Carlo (MC) simulations, a computational algorithm relying on repeated random sampling to achieve
energy minimization, for 200 MC steps. The relaxation step allows
the initial structures to reach a low-energy configuration and prevents unfavorable interactions between atoms. Water was treated implicitly using a generalized Born implicit solvent (GBIS) [50]. The
combination of MC simulations with GBIS greatly accelerates
structural relaxation and equilibrates angles, torsions, and distances
between atoms. After relaxation, the peptides were placed in
all-atom peptide/lipid/water systems containing model membranes (i.e., lipid bilayers) with 100 mM potassium and chloride
ions using CHARMM-GUI (http://www.charmm-gui.org/), a
web-based graphical user interface to generate the input files for
the simulation setup [30]. Protein folding simulations were equilibrated for 10 ns to relax the system, applying position restraints to
the peptide. For pore-forming simulations, single peptides were
allowed to fold in the lipid bilayer for ~600 ns; subsequently, the
systems were replicated fourfold in the membrane plane
(i.e., 2 × 2 along the x- and y-axes). MD simulations were performed with
GROMACS 5.0.4 using the CHARMM36 force field [28, 51] in
conjunction with the TIP3P water model [52]. Electrostatic interactions were computed using particle mesh Ewald (PME) [53], and
a cutoff of 10 Å was used for Van der Waals interactions
[54]. Bonds involving hydrogen atoms were constrained using
the LINCS algorithm [55]. The integration time-step was 2 fs,
and neighbor lists were updated every five steps. All simulations
were performed in the NPT ensemble, without any restraints or
biasing potentials. Water and the peptide were each coupled separately to a heat bath (i.e., simulation temperature) with a time
constant τ_T = 0.5 ps using velocity-rescale temperature coupling
[56]. Atmospheric pressure of 1 bar was maintained using weak
semi-isotropic pressure coupling with compressibility κ_z = κ_xy = 4.6
× 10⁻⁵ bar⁻¹ and time constant τ_P = 1 ps.
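The run settings quoted above map onto GROMACS .mdp options roughly as in the following sketch, which renders them as text from Python. Several points are assumptions rather than statements from the protocol: the helper name `production_mdp` is hypothetical, the thermostat group names are placeholders that must match the index groups of the actual system, "weak" pressure coupling is read as the Berendsen barostat, and the simulation temperature is left as an argument because it is not given in the text. The output is a fragment, not a complete .mdp file.

```python
def production_mdp(ref_temperature_k, tc_groups=("Protein", "Non-Protein")):
    """Render a GROMACS .mdp fragment for the settings described in the text.

    ref_temperature_k : simulation temperature in K (not stated in the text)
    tc_groups         : thermostat groups (placeholder names, an assumption)
    """
    n = len(tc_groups)
    settings = {
        "integrator": "md",
        "dt": "0.002",                       # 2 fs time step, in ps
        "nstlist": "5",                      # neighbor list update interval
        "coulombtype": "PME",                # particle mesh Ewald electrostatics
        "rvdw": "1.0",                       # 10 Å van der Waals cutoff, in nm
        "constraints": "h-bonds",            # bonds involving hydrogen atoms
        "constraint-algorithm": "lincs",
        "tcoupl": "v-rescale",               # velocity rescale thermostat
        "tc-grps": " ".join(tc_groups),
        "tau-t": " ".join(["0.5"] * n),      # ps, one value per group
        "ref-t": " ".join([f"{ref_temperature_k:g}"] * n),
        "pcoupl": "berendsen",               # assumption: "weak" coupling
        "pcoupltype": "semiisotropic",
        "tau-p": "1.0",                      # ps
        "compressibility": "4.6e-5 4.6e-5",  # xy and z, in bar^-1
        "ref-p": "1.0 1.0",                  # xy and z, in bar
    }
    return "\n".join(f"{key:<22}= {value}" for key, value in settings.items()) + "\n"
```

Writing the returned string to a file and passing it to `gmx grompp` together with the CHARMM-GUI outputs would be the usual next step; cutoffs and output-control options not mentioned in the text still have to be added.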
In order to reveal the most highly populated pore assemblies
during the simulations, a complete list of all oligomers was constructed for each trajectory frame. A transmembrane (TM) pore
assembly, which is an oligomer of the order n (number of peptides),
is considered any set of n TM peptides that are in mutual contact,
defined as having heavy atoms (nitrogen, carbon, or oxygen) with a
minimum distance of <3.5 Å between them. This definition frequently overcounts the oligomeric state due to numerous transient
surface-bound (S-state) peptides that are only loosely attached to
the transmembrane-inserted peptides that make up the core of the
oligomer. These S-state peptides on the membrane frequently
change position or enter and leave the stable part of the TM pore
assembly. To focus the analysis on true longer-lived TM pores, the
tilt angle τ of the peptides with a cut-off criterion of 65° was
introduced. Any peptide with τ ≥ 65° was considered to be in the
S-state (i.e., the peptide stayed at the membrane interface and did
not span the membrane) and removed from the oligomeric analysis.
This strategy greatly reduced the background noise (i.e., it eliminated S-state peptides near the TM pore assembly) in the oligomeric clustering algorithm by focusing on the true long-lived pore
structures. Population plots of the occupation percentage of oligomer n multiplied by its number of peptides (n) were then constructed. These plots reveal how many peptides were concentrated
in which oligomeric state during the simulation time.
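The oligomer analysis described above can be sketched as follows for a single trajectory frame. This is a minimal illustration under stated assumptions, not the authors' script: the function names are hypothetical, peptides are passed as arrays of heavy-atom coordinates in Å, periodic boundary conditions are ignored, "mutual contact" is interpreted as membership of the same connected cluster of contacts, and the population is computed as the fraction of TM peptides found in n-mers.

```python
from collections import Counter

import numpy as np


def min_distance(a, b):
    """Smallest distance between any atom of peptide a and any atom of peptide b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min()


def tm_oligomers(peptides, tilts_deg, contact_cutoff=3.5, tilt_cutoff=65.0):
    """Return the sorted sizes of TM oligomers in one frame.

    peptides  : list of (n_atoms, 3) arrays of heavy-atom coordinates (Å)
    tilts_deg : tilt angle of each peptide; values >= tilt_cutoff are
                treated as S-state and excluded from the clustering
    """
    tm = [i for i, tilt in enumerate(tilts_deg) if tilt < tilt_cutoff]
    parent = {i: i for i in tm}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for k, i in enumerate(tm):
        for j in tm[k + 1:]:
            if min_distance(peptides[i], peptides[j]) < contact_cutoff:
                parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in tm)
    return sorted(sizes.values())


def peptide_population(size_lists):
    """Fraction of TM peptides found in n-mers, accumulated over frames."""
    weighted = Counter()
    for sizes in size_lists:
        for n in sizes:
            weighted[n] += n  # each n-mer contributes n peptides
    total = sum(weighted.values())
    return {n: w / total for n, w in sorted(weighted.items())} if total else {}
```

Calling `tm_oligomers` on every frame and passing the resulting lists to `peptide_population` gives the values for a population plot of the kind described in the text.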
3.2 Combinatorial Peptide Libraries
Combinatorial peptide libraries (see Note 2) can be synthesized on
Tentagel® NH2 macrobeads with 280–320 μm particle size
(~65,550 beads/g) using Fmoc solid-phase peptide synthesis.
The sequences are made using a split and pool method [57]. Briefly,
the batch is split into several portions after the first reaction, and
each of the portions is synthesized with different amino acids. The
completed portions are pooled, mixed, and then split again into
several portions. This cycle is repeated until the combinatorial
peptide library has been synthesized. Each macrobead is attached
to only one peptide sequence, via a photolinker attached between
peptide and bead. The molecular weight and peptide sequence of
the peptide library are verified by matrix-assisted laser desorption
ionization time-of-flight (MALDI-TOF) mass spectrometry and
Edman sequencing. Edman sequencing, first developed by Pehr
Edman in 1950 [58], can be divided into four steps: (1) coupling—the amino group at the N-terminal end of the peptide is
coupled to phenyl isothiocyanate; (2) cleavage—the first peptide
bond is cleaved in strong acid (trifluoroacetic acid; TFA) resulting
in smaller peptide fragments and cyclized anilinothiazolinone
(ATZ) amino acid; (3) conversion—the ATZ amino acid is separated from the peptide fragment by organic extraction with ethyl
acetate and converted to phenylthiohydantoin (PTH) amino acid in
25% TFA (v/v in ddH2O); and (4) analysis using MALDI-TOF
mass spectrometry.
The beads are solvated in a minimum amount of methanol,
spread as a dispersed single layer on a glass plate, and dried under
air. The photolinker between peptide and bead is cleaved by exposure to 5 h of low-power UV light on a dry bead. The UV-cleaved
beads are transferred to 96-well microplates as one bead per well.
Peptides on the UV-cleaved beads are each dissolved in HFIP/
water (1:1 ratio) and exposed to low-power ultraviolet (UV) light
(dual wavelengths of 365 nm and 405 nm) for another 3 h until dry. The peptides are then dissolved in water or DMSO, quantified by tryptophan absorbance
(the molar extinction coefficient of tryptophan at 280 nm is
5690 M⁻¹ cm⁻¹) using a NanoDrop spectrophotometer, and stored at −20 °C.
Fig. 1 Secondary structure of peptide. CD spectra of three different peptides with varied secondary structures:
random coil (black; HSP1 peptide in aqueous buffer), alpha helix (red; LDKA peptide in lipids), and beta strand
(blue; TDP-43 D1 peptide in lipids at high temperature, which misfolds)
Sample size of the screening can be estimated by using simple
random sampling with >80% coverage of the sequences (P):
$$P = 1 - \left(1 - \frac{1}{N}\right)^{n} \tag{6}$$
where N denotes the total number of sequences in the peptide library, and
n indicates the sample size for screening.
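To make the sampling estimate concrete, the short sketch below evaluates the coverage of Eq. (6), read here as P = 1 − (1 − 1/N)^n, and inverts it to find the smallest number of beads that reaches a target coverage. The function names and the example library size are hypothetical.

```python
import math


def coverage(n_sampled, library_size):
    """Expected fraction of library sequences seen after n_sampled random picks."""
    return 1.0 - (1.0 - 1.0 / library_size) ** n_sampled


def beads_needed(library_size, target_coverage=0.8):
    """Smallest n with coverage(n) >= target_coverage, from inverting Eq. (6)."""
    if not 0.0 < target_coverage < 1.0:
        raise ValueError("target_coverage must lie strictly between 0 and 1")
    if library_size == 1:
        return 1
    n = math.log(1.0 - target_coverage) / math.log(1.0 - 1.0 / library_size)
    return math.ceil(n)


if __name__ == "__main__":
    size = 1000  # hypothetical library size
    n = beads_needed(size, 0.8)
    print(n, round(coverage(n, size), 4))
```

For large libraries the required sample size is roughly 1.6 times the library size at 80% coverage, because −ln(0.2) ≈ 1.61.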
3.3 Circular Dichroism (CD) Spectroscopy
CD spectroscopy uses circularly polarized light to investigate optically active chiral molecules and records the differential absorption
of left- and right-handed (circularly polarized) light (Δε = ε_L − ε_R),
which can yield an estimation of the secondary structure (e.g.,
helix, beta-strand, and random coil) of proteins in native environments [59]. Different secondary structures result in varied CD
spectra (Fig. 1). Alpha helices have negative bands at 222 nm and
208 nm and a positive band at 193 nm, and beta strands have a
negative band at 218 nm and a positive band at 195 nm [60].
As an example of the use of CD spectroscopy (see Note 3),
peptide solutions (50 μM) in 10 mM phosphate buffer (pH 7.0)
were co-incubated with 800 μM large unilamellar vesicles (LUVs)
in identical buffer. LUVs were made by lipid extrusion [61]. Briefly,
lipids were dissolved in chloroform then mixed and dried under
nitrogen gas in a glass vial, and the remaining chloroform was
removed under vacuum overnight. Then lipids were resuspended
in 10 mM sodium phosphate buffer (pH 7) with 100 mM
potassium chloride. LUVs were generated by extruding the lipid
suspension 10 times through 0.1 μm nucleopore polycarbonate
filters to give LUVs of 100 nm diameter. CD spectra were recorded
using synchrotron radiation circular dichroism (SRCD) spectroscopy with the CD beam lines on ASTRID2 at Aarhus University in
Denmark and ANKA at Karlsruhe Institute of Technology in
Germany. Spectra were recorded from 270 to 170 nm with a step
size (λ) of 0.5 nm, a bandwidth of 0.5 nm, and a dwell time of 2 s.
The averaged baseline was subtracted from each spectrum, and the spectra were then
averaged over three repeat scans. The averaged spectra were normalized to molar ellipticity [θ] per residue, which is a common
measurement unit for estimating secondary structure in proteins,
peptides, and polymers. The ellipticity θ is defined through the
ratio of the minor to major axis of the polarization ellipse: tan θ = (E_L − E_R)/
(E_L + E_R), where θ is the ellipticity reported by the instrument, and E_L and
E_R are the magnitudes of the electric field vectors of the left- and
right-circularly polarized light, respectively. The raw data were
analyzed using DichroWeb (http://dichroweb.cryst.bbk.ac.uk/)
[59, 62, 63], a web-based algorithm based on analysis techniques
using reference datasets derived from characterized peptides/proteins with known structures.
3.4 Oriented Circular Dichroism (OCD) Spectroscopy
Membrane-active peptides can spontaneously span the cell membrane through peptide–peptide interactions, membrane defects, or
water permeation. As they traverse the membrane, peptides assume
one of two common states: an S-state or a TM state. S-state and TM
peptides in lipid membranes can be identified by OCD spectroscopy (see Note 4). As an example, we studied a small
membrane-active AMP from Hyla punctata in lipid bilayers
[47]. Peptide (20 μg) was dissolved in chloroform and added
to lipid(s) in chloroform at a specific molar ratio, e.g., peptide:
lipid (P:L) = 1:10. The mixtures were dried by a low flow of
nitrogen gas followed by high vacuum overnight. The dried sample
was resuspended in 40 μL of pure HFIP, which is an organic solvent
that can make a consistent thin lipid film on a glass surface. 20 μL of
the peptide/lipid mixture in HFIP was dripped and spread on a
glass plate. The other 20 μL was used for the replicate. After
vacuum drying to remove the HFIP, 2 μL ddH2O (sterile-filtered)
was added to the glass plate to saturate the lipid film, and the plate
was placed in a chamber containing a saturated solution of potassium sulfate (120 g/L at 25 °C). Oriented bilayers formed after
equilibrating the lipid sample in the chamber at 25 °C overnight.
Spectra were recorded from 270 to 160 nm with a step size (λ) of
0.5 nm, a bandwidth of 0.5 nm, and a dwell time of 2 s, and
averaged over eight rotation angles, obtained by rotating the sample
around the beam axis through 360°. Each spectrum was averaged over
three repeat scans. The averaged baseline was subtracted from the
spectra, which were then normalized to molar ellipticity [θ] using the equation:
$$[\theta] = \frac{100\,\theta}{C\,l} \tag{7}$$
where θ is the ellipticity, C is the molar peptide concentration, and l is the pathlength of the film. OCD procedures on the
ANKA beamline at Karlsruhe Institute of Technology gave similar
results, even though the cuvette settings were different in that
ANKA had a hydration chamber and an automatic rotation system
for the rotational angles.
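The averaging and normalization steps can be sketched as below, reading Eq. (7) as [θ] = 100 θ/(C l). The units are an assumption, since the text does not state them: with θ in degrees, C in mol/L, and l in cm, the result is in deg cm² dmol⁻¹. The function name and the array layout (one row per rotation angle, raw signal in millidegrees) are hypothetical.

```python
import numpy as np


def ocd_molar_ellipticity(spectra_mdeg, baseline_mdeg, conc_molar, path_cm):
    """Average rotation-angle spectra, subtract the baseline, apply Eq. (7).

    spectra_mdeg  : array of shape (n_angles, n_wavelengths), in millidegrees
    baseline_mdeg : array of shape (n_wavelengths,), in millidegrees
    conc_molar    : peptide concentration in mol/L (assumed unit)
    path_cm       : path length of the film in cm (assumed unit)
    """
    spectra = np.asarray(spectra_mdeg, dtype=float)
    baseline = np.asarray(baseline_mdeg, dtype=float)
    # Mean over rotation angles, baseline correction, then mdeg -> deg.
    theta_deg = (spectra.mean(axis=0) - baseline) / 1000.0
    return 100.0 * theta_deg / (conc_molar * path_cm)
```

Dividing the result by the number of residues gives a per-residue value comparable to the normalization used for solution CD spectra.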
3.5 Liposome Fluorescent Leakage Assay
The liposome fluorescent leakage assay is a common biophysical
method for studying interactions between peptides and lipid membranes (see Note 5). The fluorescent dyes can be self-quenched or
quenched by another compound at a threshold concentration when
the donor–acceptor distance is within 15 Å, which is called Dexter
electron transfer. At that proximity, the electron of the quencher
(donor) is transferred to the lowest unoccupied molecular orbital
(LUMO) of the excited fluorescent dye (acceptor), and one electron from the acceptor moves to the highest occupied molecular
orbital (HOMO) of the donor, thus stopping the fluorescence.
When the peptides induce pore formation or membrane disruption,
the release of fluorescent dye from the liposomes changes the
fluorescence intensity, which can be recorded and used to calculate the peptide-induced leakage fraction from the vesicles.
Fluorescent dyes of different sizes can be used to estimate the
peptide-induced pore size in the membrane. Two examples are
shown below: the ANTS/DPX leakage assay (small fluorescent dye;
MW = 427) and the macromolecule release assay (large fluorescent
dye; MW ≥ 1000; the size depends on the dextran).
3.5.1 ANTS/DPX Leakage Assay
Lipids in chloroform were mixed and dried under nitrogen gas in a
glass vial, and the remaining organic solvent was removed under
vacuum overnight. Then lipids were resuspended in phosphate buffer at
pH 7 (10 mM sodium phosphate with 100 mM potassium chloride) containing 5 mM 8-aminonaphthalene-1,3,6-trisulfonic acid (ANTS) and 12.5 mM
p-xylene-bis(N-pyridinium bromide) (DPX). The dyes were entrapped in LUVs extruded to 0.1 μm diameter.
Gel filtration chromatography on Sephadex G-100
(GE Healthcare Life Sciences Inc) was used to remove external free ANTS/DPX from LUVs with entrapped contents. LUVs were
diluted to 0.5 mM and used to measure the leakage activity by
addition of aliquots of peptide. Leakage was measured using fluorescence emission spectra after 3 h incubation. The spectra were
recorded using excitation and emission wavelengths of 350 nm and
510 nm, respectively, for ANTS/DPX with a BioTek Synergy H1
Hybrid Multi-Mode Reader. The nonionic surfactant Triton
X-100 (Triton; 10% vol) was used as the positive control to measure the
maximum leakage of the vesicle. The leakage fraction can be calculated using the equation:
$$\%\,\mathrm{leakage} = \frac{I_{\mathrm{peptide}} - I_{0}}{I_{\mathrm{Triton}} - I_{0}} \tag{8}$$
where I_peptide, I_0, and I_Triton are the fluorescence intensities of the
peptide-treated, untreated, and Triton-treated vesicles, respectively.
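The normalization in Eq. (8), read as (I_peptide − I_0)/(I_Triton − I_0), is simple enough to sketch in a few lines. The function name is hypothetical, and the result is returned as a fraction between 0 and 1 for well-behaved data; multiply by 100 for a percentage.

```python
def leakage_fraction(i_peptide, i_untreated, i_triton):
    """Normalized leakage: 0 for untreated vesicles, 1 for Triton-lysed vesicles."""
    span = i_triton - i_untreated
    if span == 0:
        raise ValueError("Triton and untreated controls have identical intensity")
    return (i_peptide - i_untreated) / span
```

Because numerator and denominator change sign together, the same function also covers assays in which leakage lowers the signal, such as a FRET-quenched donor, where the controls simply satisfy I_Triton < I_0.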
3.5.2 Macromolecule Release Assay
Dextrans of several sizes were prepared and coupled with both
5-carboxytetramethylrhodamine (TAMRA) and biotin as
TAMRA-biotin–dextran (TBD) conjugates. The conjugated TBD
was entrapped in LUVs as described above. External free TBD
conjugate was removed by incubation with an immobilized streptavidin agarose resin, which has high affinity for the biotin in the
conjugate. The resin was spun down, and the TBD-containing
vesicles in the supernatant were transferred to a new glass vial.
Streptavidin labeled with an Alexa-488 fluorophore was added
during the leakage experiment with the peptide, as previously
described [8, 12]. The sample was incubated for 3 h before measuring Alexa-488 fluorescence. A control without added peptide
served as the 0% leakage signal, and the addition of 10% vol detergent Triton was used to determine 100% leakage as positive control.
The TBD conjugate released from the vesicle will bind to the
streptavidin-Alexa-488 fluorophore and cause Förster resonance
energy transfer (FRET) between Alexa-488 (donor) and released
TAMRA (acceptor) in TBD. FRET
differs from the Dexter mechanism in that no electron is exchanged. Instead, the charge
fluctuations in donor and acceptor affect each other over a
distance, and energy is transferred from the electronic
excited state of the donor to the acceptor through nonradiative
dipole–dipole coupling. The fluorescence of the AlexaFluor
488 was measured using excitation and emission wavelengths of
490 nm and 525 nm, respectively, by a BioTek Synergy H1 Hybrid
Multi-Mode Reader. The normalized leakage fraction can be calculated using the equation:
$$\%\,\mathrm{leakage} = \frac{I_{0} - I_{\mathrm{peptide}}}{I_{0} - I_{\mathrm{Triton}}} \tag{9}$$
where I_peptide, I_0, and I_Triton are the fluorescence intensities of the
peptide-treated, untreated, and Triton-treated vesicles, respectively.
3.6 Tryptophan Fluorescence Quenching Assay
Tryptophan fluorescence quenching is a biophysical method used
for determining the degree of burial of a tryptophan side chain
[64]. It is usually applied to quantify conformational changes in
protein folding and the strength of peptide binding to membranes
(see Note 6). The intrinsic fluorescence of aromatic tryptophan
residues is highly sensitive to their environment and is quenched in
nonpolar environments, e.g., in lipid bilayers, in hydrophobic protein cores, or buried in the interface of a binding partner. The
emission spectrum of the quenched fluorescence will be shifted
toward lower wavelengths (blue shift) upon increasing hydrophobicity of the local environment. This process is dynamic and
reversible.
Tryptophan residues emit fluorescence in aqueous solution at a
wavelength of 350 nm, but when tryptophan is fully buried in a
lipid membrane (which acts as a quencher), the emission undergoes
a blueshift to ~320 nm. The spectrum of tryptophan fluorescence
emission between 300 and 350 nm was recorded following excitation at 280 nm wavelength to monitor the interactions between
peptides and lipids. Peptides (50 μM) and extruded LUVs
(600 μM) were prepared in 10 mM phosphate buffer (pH 7.0).
The solutions were incubated and measured after 60 min. Excitation was fixed at 280 nm (slit 9 nm), and emission was collected
from 300 to 450 nm (slit 9 nm). The spectra were recorded using a
Synergy H1 Hybrid Multi-Mode Reader and a Cytation™ 5 Cell
Imaging Multi-Mode Reader from BioTek and were averaged over
three scans. The scattering of the fluorescence spectrum was normalized [65] to evaluate the maximum wavelength shift
[40, 66]. The negative control consisted of free peptide in buffer.
Additional LUVs were titrated with a fixed concentration of the
peptide until the blueshift reached an equilibrated state, i.e., until
the tryptophan was fully buried in the hydrophobic core of the
membrane.
The membrane partitioning can be calculated according to
White et al. [65] and Rodnin et al. [67] using the following
equation:
$$I = 1 + (I_{\infty} - 1)\,\frac{K_x [L]}{[W] + K_x [L]} \tag{10}$$
where I is the relative emission intensity of tryptophan, [L] is the lipid
concentration, I_∞ is the emission intensity of tryptophan at infinite
lipid saturation, [W] is the concentration of water (55.3 M), and K_x
is the mole fraction partitioning coefficient:
$$K_x = \frac{[P_{\mathrm{bil}}]/[L]}{[P_{\mathrm{water}}]/[W]} \tag{11}$$
where [P_bil] and [P_water] are the bulk concentrations of peptide
associated with the lipid membrane and in water, respectively. The
calculated K_x can be used to determine the membrane partitioning free
energy (ΔG):
$$\Delta G = -RT \ln(K_x) \tag{12}$$
where R is the gas constant (1.987 × 10⁻³ kcal/(mol·K)) and T is the
temperature in Kelvin.
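A minimal numerical sketch of this analysis is given below. It reads Eq. (10) as I = 1 + (I∞ − 1) K_x[L]/([W] + K_x[L]) and Eq. (12) as ΔG = −RT ln K_x, with R taken as 1.987 × 10⁻³ kcal/(mol·K). The function names are hypothetical, and the fit is a deliberately simple grid search over K_x (with I∞ solved by linear least squares at each grid point) rather than the procedure of any particular analysis package.

```python
import numpy as np

WATER_MOLAR = 55.3   # [W] in mol/L, as given in the text
R_KCAL = 1.987e-3    # gas constant in kcal/(mol K)


def bound_fraction(lipid_molar, kx):
    """Kx[L] / ([W] + Kx[L]), the lipid-dependent term of Eq. (10)."""
    lipid = np.asarray(lipid_molar, dtype=float)
    return kx * lipid / (WATER_MOLAR + kx * lipid)


def intensity(lipid_molar, kx, i_inf):
    """Relative tryptophan emission intensity predicted by Eq. (10)."""
    return 1.0 + (i_inf - 1.0) * bound_fraction(lipid_molar, kx)


def fit_kx(lipid_molar, rel_intensity, kx_grid=None):
    """Grid search over Kx; returns the best (Kx, I_inf) pair."""
    if kx_grid is None:
        kx_grid = np.logspace(2, 8, 601)  # assumed search range
    y = np.asarray(rel_intensity, dtype=float) - 1.0
    best = None
    for kx in kx_grid:
        f = bound_fraction(lipid_molar, kx)
        amp = np.dot(f, y) / np.dot(f, f)      # least-squares (I_inf - 1)
        sse = np.sum((y - amp * f) ** 2)
        if best is None or sse < best[0]:
            best = (sse, kx, 1.0 + amp)
    return best[1], best[2]


def partition_free_energy(kx, temperature_k):
    """Membrane partitioning free energy in kcal/mol, Eq. (12)."""
    return -R_KCAL * temperature_k * np.log(kx)
```

A finer grid or a nonlinear least-squares routine can replace the grid search once the approximate magnitude of K_x is known.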
However, this technique may not be equally accurate for all
peptide structures. Although most of the peptides have tryptophan
fluorescence peaks at ~348 nm (indicative of monomeric peptides
or low multimeric soluble aggregates), some peptides can fold into
a helix and form multimeric aggregates in the aqueous phase or at
higher concentration that bury the tryptophan in the hydrophobic
core. This folding results in blue-shifted spectra and small spectral
widths and affects the accuracy of the measurements.
3.7 Electrical Impedance Spectroscopy
Electrical impedance spectroscopy measures the resistance and conductance of a lipid bilayer coated on a silicon plate (see Note 7). This
method, which involves a three-electrode setup with a silver/silver
chloride reference electrode and a platinum counter electrode, can
be used to monitor the status of a lipid bilayer over time, e.g.,
membrane lysis and membrane poration [8]. The supported bilayer
preparation and the measurement of the impedance were modified
following techniques first established by the Hristova and Searson
Labs [8, 68]. As an example, the top leaflet of the bilayers contained
100% POPC (1-palmitoyl-2-oleoyl-glycero-3-phosphocholine),
and the bottom leaflet consisted of 18.5% wt PEG (polyethylene
glycol; average Mn = 2000) and 81.5% wt POPC. The bilayers
were prepared on a silicon plate with (111) orientation (the Miller
indices denote the orientation of atomic planes in
the crystal lattice), and the bilayers were deposited by the Langmuir–Blodgett (LB) method. The LB method is used to compress a lipid–
polymer (POPC-PEG) monolayer on the water surface and deposit
the monolayer on the silicon plate. The plate is then transferred to
the three-electrode setup and connected to the electrical impedance
spectroscope. Impedance was measured over a frequency range of
10⁵ to 1 Hz with a 20 mV root-mean-squared (RMS) AC perturbation and at a potential of 0 V with respect to the reference
electrode. Spectra were recorded at 2-min intervals in the first
hour and at 1-h intervals subsequently. The experiments were
performed in the dark to prevent photo effects in the silicon. The
results were fitted to an equivalent circuit model to determine the
values of resistance and capacitance of the semiconductor–liquid
interface (Rp: resistance of the semiconductor–liquid interface, and
Cp: capacitance of the semiconductor–liquid interface) and the
bilayer membrane (Rm: resistance of the lipid bilayer membrane,
and Cm: capacitance of the lipid bilayer membrane). The analysis
was conducted using Electrochemical Impedance Spectroscopy
Software (Gamry Instruments Inc., Pennsylvania, USA). The values
were used to determine the normalized membrane resistance (R_m/
R_0, which reflects the permeation of the membrane) and the
change of the capacitance (C_m − C_0, which is correlated to the
membrane thickness). The membrane resistance, which
measures how strongly the membrane impedes the flow of electric current
across it, is a physical property of the membrane and can
be used to study membrane poration and membrane lysis. The change
of the capacitance provides the relative thickness of the membrane,
as given by the equation:
$$\frac{C}{A} = \frac{\varepsilon}{d} \tag{13}$$
where C is the capacitance, A is the surface area, ε is the permittivity, and d is the thickness of the membrane. The change of membrane conductance was calculated from the difference of the inverse
bilayer resistance at 0 and 17 h after addition of the peptide. The
conductance per mole of the peptide was calculated from the
experimental peptide-to-lipid value. The lipid concentrations on
the silicon plate were quantified using the change of lipid concentrations in the buffer by a colorimetric Stewart assay with ammonium ferrothiocyanate [69].
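Two of the derived quantities above follow directly from the fitted circuit values and can be sketched as below. The function names are hypothetical. The thickness estimate assumes that the area A and the permittivity ε stay constant, so that Eq. (13), read as C/A = ε/d, gives d proportional to 1/C.

```python
def relative_thickness(c_initial, c_final):
    """d_final / d_initial from Eq. (13), assuming constant area and permittivity."""
    if c_final <= 0 or c_initial <= 0:
        raise ValueError("capacitances must be positive")
    return c_initial / c_final


def conductance_change(r_initial_ohm, r_final_ohm):
    """Change in membrane conductance (siemens): 1/R_final - 1/R_initial."""
    return 1.0 / r_final_ohm - 1.0 / r_initial_ohm
```

Here `r_initial_ohm` and `r_final_ohm` would be the fitted bilayer resistances at 0 and 17 h after peptide addition; dividing the result by the amount of peptide gives the conductance per mole described in the text.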
3.8 Isothermal Titration Calorimetry
Isothermal titration calorimetry (ITC) measurements with a
MicroCal VP-ITC microcalorimeter (Microcal, Inc.) were performed to determine the thermodynamic parameters of the peptide–lipid interactions [70, 71] (see Note 8). The lipid vesicle
suspension (16.23 mM) was titrated into a peptide (38.1 μM)
solution in the sample cell. All the samples, which had been
degassed under vacuum for over 30 min, were prepared in 10 mM
phosphate buffer (pH 7.0). The lipid suspension was added in 6 μL
aliquots into the reaction cell (volume = 1.46 mL) containing the
38.1 μM peptide solution, with an injection duration of 12 s. The
equilibration time between each titration step was 15 min. The first
titration of 6 μL was disregarded so that premixing of the two
solutions during the equilibration time would not bias the
results. The stirring speed of the injection syringe was
307 revolutions per minute (rpm). The thermodynamic parameters
were calculated using the standard ITC software, which utilizes a
stoichiometric model of binding. Membrane partitioning, however, is not stoichiometric [65]; consequently, the actual errors in
free energy determination might be larger due to cross correlation
of binding stoichiometry (n) and dissociation constant (Kd) fitting
parameters.
4 Notes
1. Molecular dynamics (MD) simulations: MD simulations provide atomic details of how peptides fold in water and interact
with lipid membranes [1, 2]. Many rare events and thermodynamic parameters can be studied and validated by this method,
e.g., peptide binding and folding at the membrane interface
and peptide aggregation and assembly within the lipid membrane. Disordered aggregates have been observed with several
small peptides. The peptide-induced water permeation and ion
flux can be monitored throughout the simulation [17, 27,
47]. MD simulations allow us to determine the critical amino
acids that bind and interact with lipids, peptides, and other
compounds [26, 27, 36, 43, 47, 72–74]. The lifetime of the
peptide assembly (e.g., functional channel-like structure) can
be measured using the Arrhenius equation by performing the
simulations at different temperatures [26]. Simulations performed at higher temperatures can increase sampling kinetics
and allow us to study rare events, such as peptide folding,
bilayer partitioning, and pore assembly, without the need for
advanced sampling techniques [75, 76]. However, this technique is suitable only for thermostable peptides and lipid membranes; therefore, complementary experiments are needed for
verification. Most of the analysis can be conducted using the
GROMACS package, VMD software, and basic programming
(e.g., Python).
2. Combinatorial peptide libraries vs MD simulations: Our previous study has shown that the hydrophobic moment can be
correlated with peptide binding for the zwitterionic lipid
bilayer and anionic lipid bilayer [22, 47]. The peptides were
evaluated using the liposome fluorescent leakage assay with
fixed peptide and lipid concentrations (0.5 μM peptide concentration against 0.5 mM lipid concentration; peptide to lipid
ratio is 1:1000). This assay allowed us to determine whether
the peptides have the ability to porate or lyse the membrane.
The selected peptides were then studied in MD simulations
with two different lipid bilayers. The peptides can either be
placed on one side or on both sides of the bilayer, and the
model can be built using the GROMACS package, the
CHARMM-GUI web-based interface, and VMD software.
After several microseconds of simulation, the interactions
between peptides and lipid membranes can vary. The
snapshots can be captured using VMD software, and the trajectories can be refined and analyzed by using the GROMACS
package (e.g., gmx trjconv and gmx mindist).
3. Circular dichroism (CD) spectroscopy vs MD simulations: The
simulated secondary structure of the peptide can be averaged
and compared with the experimental secondary structure from
the CD spectroscopy to validate the accuracy [72]. In CD
spectroscopy, secondary structure can be characterized in aqueous conditions, at varied lipid concentrations, and for different
lipid types. The fractional content of alpha helix and beta sheet
derived from CD spectroscopy can be analyzed and quantified
using DichroWeb [59, 62, 63] and compared with the averaged secondary structure from the simulations. The simulated
secondary structure can be analyzed using VMD software
(Extensions → Analysis → Timeline → Calculate → Calc. Sec.
Struct.) and the GROMACS package (e.g., gmx helix).
4. Oriented circular dichroism (OCD) spectroscopy vs MD simulations: Some membrane-active peptides that can insert into the
membrane and traverse it have a tilt angle, depending on the
peptide length and membrane thickness. As noted above (Subheading 3.4), OCD spectroscopy offers an averaged fraction of
the TM peptides and S-state peptides [76]. The results of OCD
spectroscopy can be compared with the MD simulations as an
indication of how likely the peptides are to get to the TM state.
The simulations can be analyzed using the GROMACS package (e.g., gmx helixorient).
5. Liposome fluorescent leakage assay vs MD simulations: Some
membrane-active peptides can pierce the cell membrane. Liposome fluorescent leakage assays are an easy and quick approach
to measure the resulting pore size using fluorescent dyes [8, 12,
27], such as small fluorescent dyes (ANTS and DPX;
MW = 422–427) or macromolecule dyes (TAMRA-labeled
dextran and AF488-labeled streptavidin; MW = 3k or 10k).
The size of the peptide-induced pore can be characterized in
the simulations by measuring the size of the aggregate and
water flux using the GROMACS package (e.g., gmx hole [77]).
6. Tryptophan fluorescence quenching assay vs MD simulations: The
tryptophan fluorescence quenching assay can be used as a
platform to evaluate peptide binding to different types of lipids
[22, 67]. This technique can be limited by the hydrophobicity
of the peptides and their aggregation and folding in aqueous
buffer, and binding may not involve a two-state transition.
Nevertheless, the tryptophan fluorescence quenching assay is
useful for screening peptide libraries consisting of 10² to 10⁴
peptides to identify membrane-binding peptides for specific
applications. The assay is done with microwell plates read by a
microplate reader. The simulations can be utilized to study how
peptides interact with the lipid bilayer and bind onto the membrane interface using VMD software and the GROMACS package (e.g., gmx mindist).
7. Electrical impedance spectroscopy vs MD simulations: Electrical
impedance spectroscopy is a tool with which to monitor the
resistance and capacitance of the lipid bilayer on a silicon plate
[8, 68, 78], which can be correlated to the membrane permeability and membrane thickness, respectively. Electrical impedance spectroscopy can reveal whether membrane poration is a
transient event or an equilibrium state and whether a particular
peptide promotes membrane poration or lyses the membrane.
For membrane poration, the resistance decreases while the
capacitance remains constant after the peptide is added into
the chamber. For membrane lysis, resistance also decreases but
the capacitance increases, because the peptide can peel off the
membrane from the silicon plate and reduce the membrane
thickness. Electrical impedance spectroscopy can be compared
with the peptide assembly simulations using the GROMACS
package (e.g., gmx trjconv and gmx traj).
8. Isothermal titration calorimetry (ITC) vs MD simulations: ITC
can measure the thermodynamic parameters of peptide–lipid
interactions [22, 47], e.g., binding stoichiometry, binding
enthalpy, and binding constant. Similar to the tryptophan fluorescence quenching assay, the measured values may not accurately represent a two-state transition and can be difficult to
analyze, and the errors of the measured quantity may be larger
due to cross correlation of the binding stoichiometry and
binding-constant fitting parameters. However, ITC can still
be useful to determine whether the peptide has selectivity for
certain lipid types, and ITC measurements can be checked
against quantities obtained from peptide-folding simulations (e.g., helical fraction).
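The qualitative rule given in item 7 for electrical impedance spectroscopy (resistance drops with constant capacitance for poration; resistance drops with rising capacitance for lysis) can be written as a short decision function. This is only a minimal sketch: the function name, the labels, and the 10% relative tolerance are assumptions made for illustration and are not part of the original protocol.

```python
def classify_eis_response(r_before, r_after, c_before, c_after, rel_tol=0.10):
    """Label an EIS response using the qualitative rule of item 7.

    rel_tol is a hypothetical threshold (fraction of the initial value)
    below which a change is treated as 'constant'.
    """
    if r_before <= 0 or c_before <= 0:
        raise ValueError("initial resistance and capacitance must be positive")

    r_change = (r_after - r_before) / r_before
    c_change = (c_after - c_before) / c_before

    # No appreciable drop in resistance: the bilayer is not permeabilized.
    if r_change >= -rel_tol:
        return "no clear permeabilization"
    # Resistance drops and capacitance rises: membrane thinning or removal.
    if c_change > rel_tol:
        return "lysis"
    # Resistance drops while capacitance stays constant: pores in an intact bilayer.
    if abs(c_change) <= rel_tol:
        return "poration"
    # Resistance drops and capacitance falls: not covered by the rule in the text.
    return "unclassified"


if __name__ == "__main__":
    print(classify_eis_response(100.0, 10.0, 1.0, 1.02))
    print(classify_eis_response(100.0, 10.0, 1.0, 1.50))
```

In practice the tolerance would have to be chosen from the noise level of the instrument, and the time course (transient versus sustained drop) still needs to be inspected separately.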
5 Conclusions and Future Perspective
Recent developments in all-atom MD simulations of polypeptides
have provided insights into the molecular details of their mechanism of action and the pathways utilized for membrane interaction,
making MD simulations essential tools to complement biophysical
and in vitro experiments [2, 23–26, 43, 45, 48, 76, 79–82]. In
addition, they are useful for in silico protein design for drug discovery. Several examples have applied MD simulations to design
peptides for pharmaceutical applications, e.g., as peptide chaperones for stabilizing the human butyrylcholinesterase [73] and more
potent antimicrobial peptides [48].
Although MD simulations are a powerful tool, experimental
techniques are required to validate their accuracy. Different proteins and environmental conditions may require comparisons of
several forcefields [43, 44, 83–85], with a special focus on protein–lipid interactions [86] and more realistic multicomponent
membrane compositions [87]. The bottlenecks of simulation timescales and simulation box size also limit our understanding. Simulation events that are observed may not have reached their
equilibrium state. For example, amyloid peptides may induce fibril
formation or induce other complicated pathways that take much
longer than the available simulation time. The size of the simulation box can also yield less accurate results than would be obtained
at a more realistic scale.
Here, we have shown several experimental techniques that can
be used in conjunction with MD simulations to validate the results.
There are also many other experimental techniques that have not
been mentioned in this chapter, e.g., fluorescent-labeled peptide
for membrane partitioning [67], nuclear magnetic resonance
[25, 81], and neutron diffraction [81].
Ultimately, MD simulations provide a wide range of applications for structure prediction, mechanism and pathway elucidation,
protein design, and drug discovery. The results of MD simulations
compare well with those of other computational techniques, e.g.,
machine learning [88], high-throughput screening [89], and
molecular docking [90]. The advancement of computing hardware
and algorithms will extend the timescales, allow for building larger
and more realistic simulation systems, and in the near future
increase our understanding of complex biological functions.
Acknowledgments
This work was supported by the National Institute of Allergy and
Infectious Diseases of the National Institutes of Health (NIH)
under Award Number U19AI142780. The content is solely the
responsibility of the authors and does not necessarily represent the
official views of the NIH. The authors thank Kalina Hristova at
Johns Hopkins University, William Wimley at Tulane University,
Alexey Ladokhin at University of Kansas Medical Center, Jochen
Bürck at Karlsruhe Institute of Technology, Nykola Jones at Aarhus
University, Katherine Tripp at Johns Hopkins University, Gregory
Wiedman at Seton Hall University, Sarah Kim at Duke University,
Evan Troendle at King’s College London, and Yukun Wang at Yale
University for valuable discussions about the experimental setups
and simulations.
References
1. Chen CH et al (2020) Understanding and
modelling the interactions of peptides with
membranes: from partitioning to self-assembly.
Curr Opin Struct Biol 61:160–166
2. Ulmschneider JP, Ulmschneider MB (2018)
Molecular dynamics simulations are redefining
our view of peptides interacting with biological
membranes. Acc Chem Res 51(5):1106–1116
3. Dror RO et al (2012) Biomolecular simulation:
a computational microscope for molecular
biology. Annu Rev Biophys 41:429–452
4. Zasloff M (1987) Magainins, a class of antimicrobial peptides from Xenopus skin: isolation,
characterization of two active forms, and partial cDNA sequence of a precursor. Proc Natl
Acad Sci U S A 84(15):5449–5453
5. Lehrer RI et al (1989) Interaction of human
defensins with Escherichia coli. Mechanism of
bactericidal activity. J Clin Invest 84
(2):553–561
6. Yeaman MR, Yount NY (2003) Mechanisms of
antimicrobial peptide action and resistance.
Pharmacol Rev 55(1):27–55
7. Chen CH, Lu TK (2020) Development and
challenges of antimicrobial peptides for therapeutic applications. Antibiotics 9(1):24
8. Wiedman G et al (2014) Highly efficient
macromolecule-sized poration of lipid bilayers
by a synthetically evolved peptide. J Am Chem
Soc 136(12):4724–4731
9. Krauson AJ, He J, Wimley WC (2012) Gain-of-function analogues of the pore-forming
peptide melittin selected by orthogonal high-throughput screening. J Am Chem Soc 134(30):12732–12741
10. Krauson AJ et al (2015) Conformational fine-tuning of pore-forming peptide potency and
selectivity. J Am Chem Soc 137(51):16144–16152
11. Wiedman G, Wimley WC, Hristova K (2015)
Testing the limits of rational design by engineering pH sensitivity into membrane-active
peptides. Biochim Biophys Acta 1848
(4):951–957
12. Wiedman G et al (2017) pH-triggered, macromolecule-sized poration of lipid bilayers by
synthetically evolved peptides. J Am Chem
Soc 139(2):937–945
13. Sreedharan J et al (2008) TDP-43 mutations in
familial and sporadic amyotrophic lateral sclerosis. Science 319(5870):1668–1672
14. Chen AK et al (2010) Induction of amyloid
fibrils by the C-terminal fragments of TDP-43
in amyotrophic lateral sclerosis. J Am Chem
Soc 132(4):1186–1187
15. Liu GC et al (2013) Delineating the
membrane-disrupting and seeding properties
of the TDP-43 amyloidogenic core. Chem
Commun 49(95):11212–11214
16. Sun CS et al (2014) The influence of pathological mutations and proline substitutions in
TDP-43 glycine-rich peptides on its amyloid
properties and cellular toxicity. PLoS One 9
(8):e103644
17. Chen CH et al (2016) Mechanisms of membrane pore formation by amyloidogenic peptides in amyotrophic lateral sclerosis.
Chemistry 22(29):9958–9961
18. Laos V et al (2019) Characterizing TDP-43(307–319) oligomeric assembly: mechanistic and
structural implications involved in the etiology
of amyotrophic lateral sclerosis. ACS Chem
Neurosci 10(9):4112–4123
19. Gagnon MC et al (2017) Influence of the
length and charge on the activity of α-helical
amphipathic antimicrobial peptides. Biochemistry 56(11):1680–1695
20. Grau-Campistany A et al (2015) Hydrophobic
mismatch demonstrated for membranolytic
peptides, and their use as molecular rulers to
measure bilayer thickness in native cells. Sci
Rep 5:9388
21. Grau-Campistany A et al (2016) Extending the
hydrophobic mismatch concept to amphiphilic
membranolytic peptides. J Phys Chem Lett 7
(7):1116–1120
22. Chen CH et al (2020) Rational tuning of a
membrane-perforating antimicrobial peptide
to selectively target membranes of different
lipid composition. bioRxiv:2020.11.01.364091
23. Leveritt JM, Pino-Angeles A, Lazaridis T
(2015) The structure of a melittin-stabilized
pore. Biophys J 108(10):2424–2426
24. Perrin BS, Pastor RW (2016) Simulations of
membrane-disrupting peptides I: alamethicin
pore stability and spontaneous insertion. Biophys J 111(6):1248–1257
25. Perrin BS et al (2016) Simulations of
membrane-disrupting peptides II: AMP Piscidin 1 favors surface defects over pores. Biophys
J 111(6):1258–1266
26. Wang Y et al (2016) Spontaneous formation of
structurally diverse membrane channel architectures from a single antimicrobial peptide.
Nat Commun 7:13535
27. Chen C et al (2019) Simulation-guided rational de novo design of a small pore-forming
antimicrobial peptide. J Am Chem Soc 141
(12):4839–4848
28. Pronk S et al (2013) GROMACS 4.5: a high-throughput and highly parallel open source
molecular simulation toolkit. Bioinformatics
29(7):845–854
29. Humphrey W, Dalke A, Schulten K (1996)
VMD: visual molecular dynamics. J Mol
Graph 14(1):33–38, 27-8
30. Lee J et al (2016) CHARMM-GUI input generator for NAMD, GROMACS, AMBER,
OpenMM, and CHARMM/OpenMM simulations using the CHARMM36 additive force
field. J Chem Theory Comput 12(1):405–413
31. Quist A et al (2005) Amyloid ion channels: a
common structural link for protein-misfolding
disease. Proc Natl Acad Sci U S A 102
(30):10427–10432
32. Li J et al (2017) Membrane active antimicrobial peptides: translating mechanistic insights
to design. Front Neurosci 11:73
33. Guha S et al (2019) Mechanistic landscape of
membrane-permeabilizing peptides. Chem
Rev 119(9):6040–6085
34. Sani MA, Separovic F (2016) How membrane-active peptides get into lipid membranes. Acc
Chem Res 49(6):1130–1138
35. Mangoni ML, McDermott AM, Zasloff M
(2016) Antimicrobial peptides and wound
healing: biological and therapeutic considerations. Exp Dermatol 25(3):167–173
36. Ulmschneider JP (2017) Charged antimicrobial peptides can translocate across membranes
without forming channel-like pores. Biophys J
113(1):73–81
37. Wimley WC, Hristova K (2011) Antimicrobial
peptides: successes, challenges and unanswered
questions. J Membr Biol 239(1–2):27–34
38. Kreutzberger MA, Pokorny A, Almeida PF
(2017) Daptomycin-phosphatidylglycerol domains in lipid membranes. Langmuir 33(47):13669–13679
39. Lee MT et al (2018) Comparison of the effects
of daptomycin on bacterial and model membranes. Biochemistry 57(38):5629–5639
40. Kim SY et al (2019) Mechanism of action of
peptides that cause the pH-triggered macromolecular poration of lipid bilayers. J Am
Chem Soc 141(16):6706–6718
41. Kurgan KW et al (2019) Retention of native
quaternary structure in racemic melittin crystals. J Am Chem Soc 141(19):7704–7708
42. Keener JE et al (2019) Chemical additives
enable native mass spectrometry measurement
of membrane protein oligomeric state within
intact nanodiscs. J Am Chem Soc 141
(2):1054–1061
43. Wang Y et al (2014) How reliable are molecular
dynamics simulations of membrane active antimicrobial peptides? Biochim Biophys Acta
1838(9):2280–2288
44. Huang J, MacKerell AD (2018) Force field
development and simulations of intrinsically
disordered proteins. Curr Opin Struct Biol
48:40–48
45. Venable RM, Krämer A, Pastor RW (2019)
Molecular dynamics simulations of membrane
permeability. Chem Rev 119(9):5954–5997
46. Pan AC et al (2019) Atomic-level characterization of protein-protein association. Proc Natl
Acad Sci U S A 116(10):4244–4249
47. Chen CH, Ulmschneider JP, Ulmschneider
MB (2020) Mechanisms of a small
membrane-active antimicrobial peptide from
Hyla punctata. Aust J Chem 73(3):236–245
48. Chen CH et al (2019) Simulation-guided
rational de novo design of a small pore-forming
antimicrobial peptide. J Am Chem Soc 141
(12):4839–4848
49. Ulmschneider JP, Ulmschneider MB, Di Nola
A (2006) Monte Carlo vs molecular dynamics
for all-atom polypeptide folding simulations. J
Phys Chem B 110(33):16733–16742
50. Ulmschneider JP, Jorgensen WL (2004) Polypeptide folding using Monte Carlo sampling,
concerted rotation, and continuum solvation. J
Am Chem Soc 126(6):1849–1857
51. Huang J, MacKerell AD (2013) CHARMM36
all-atom additive protein force field: validation
based on comparison to NMR data. J Comput
Chem 34(25):2135–2145
52. Jorgensen WL, Chandrasekhar J, Madura JD
(1983) Comparison of simple potential functions for simulating liquid water. J Chem Phys
79:926–935
53. Essmann U, Perera L, Berkowitz ML (1995) A
smooth particle mesh Ewald method. J Chem
Phys 103(19):8577–8593
54. Huang K, García AE (2014) Effects of truncating van der Waals interactions in lipid bilayer
simulations. J Chem Phys 141(10):105101
55. Hess B, Bekker H, Berendsen HJC, Fraaije
JGEM (1997) LINCS: a linear constraint
solver for molecular simulations. J Comput
Chem 18(12):1463–1472
56. Mor A, Ziv G, Levy Y (2008) Simulations of
proteins with inhomogeneous degrees of freedom: the effect of thermostats. J Comput
Chem 29(12):1992–1998
57. Lam KS et al (1991) A new type of synthetic
peptide library for identifying ligand-binding
activity. Nature 354(6348):82–84
58. Edman P (1949) A method for the determination of amino acid sequence in peptides. Arch
Biochem 22(3):475
59. Whitmore L, Wallace BA (2008) Protein secondary structure analyses from circular dichroism spectroscopy: methods and reference
databases. Biopolymers 89(5):392–400
60. Greenfield NJ (2006) Using circular dichroism
spectra to estimate protein secondary structure.
Nat Protoc 1(6):2876–2890
61. Hope MJ et al (1985) Production of large unilamellar vesicles by a rapid extrusion procedure: characterization of size distribution,
trapped volume and ability to maintain a membrane potential. Biochim Biophys Acta 812
(1):55–65
62. Whitmore L, Wallace BA (2004) DICHROWEB, an online server for protein secondary
structure analyses from circular dichroism spectroscopic data. Nucleic Acids Res 32(Web
Server issue):W668–W673
63. Lobley A, Whitmore L, Wallace BA (2002)
DICHROWEB: an interactive website for the
analysis of protein secondary structure from
circular dichroism spectra. Bioinformatics 18
(1):211–212
64. Akbar SM, Sreeramulu K, Sharma HC (2016)
Tryptophan fluorescence quenching as a binding assay to monitor protein conformation
changes in the membrane of intact mitochondria. J Bioenerg Biomembr 48(3):241–247
65. White SH et al (1998) Protein folding in membranes: determining energetics of peptide-bilayer interactions. Methods Enzymol
295:62–87
66. Ladokhin AS, Jayasinghe S, White SH (2000)
How to measure and analyze tryptophan fluorescence in membranes properly, and why
bother? Anal Biochem 285(2):235–245
67. Rodnin MV et al (2020) Experimental and
computational characterization of oxidized
and reduced protegrin pores in lipid bilayers. J
Membr Biol 253(3):287–298
68. Lin J et al (2008) Impedance spectroscopy of
bilayer membranes on single crystal silicon.
Biointerphases 3(2):FA33
69. Stewart JC (1980) Colorimetric determination
of phospholipids with ammonium ferrothiocyanate. Anal Biochem 104(1):10–14
70. Breukink E et al (2000) Binding of Nisin Z to
bilayer vesicles as determined with isothermal titration calorimetry. Biochemistry 39(33):10247–10254
71. Abraham T et al (2005) Isothermal titration
calorimetry studies of the binding of a rationally designed analogue of the antimicrobial
peptide gramicidin S to phospholipid bilayer
membranes. Biochemistry 44(6):2103–2112
72. Chen CH et al (2014) Absorption and folding
of melittin onto lipid bilayer membranes via
unbiased atomic detail microsecond molecular
dynamics simulation. Biochim Biophys Acta
1838(9):2243–2249
73. Wang Q et al (2018) Proline-rich chaperones
are compared computationally and experimentally for their abilities to facilitate recombinant
butyrylcholinesterase tetramerization in CHO
cells. Biotechnol J 13(3):e1700479
74. Ulmschneider MB et al (2015) Peptide folding
in translocon-like pores. J Membr Biol 248
(3):407–417
75. Ulmschneider MB et al (2010) Mechanism and
kinetics of peptide partitioning into membranes from all-atom simulations of thermostable peptides. J Am Chem Soc 132
(10):3452–3460
76. Ulmschneider MB et al (2014) Spontaneous
transmembrane helix insertion thermodynamically mimics translocon-guided insertion. Nat
Commun 5:4863
77. Smart OS, Goodfellow JM, Wallace BA (1993)
The pore dimensions of gramicidin A. Biophys
J 65(6):2455–2460
78. Wiedman G et al (2013) The electrical
response of bilayers to the bee venom toxin
melittin: evidence for transient bilayer permeabilization. Biochim Biophys Acta 1828
(5):1357–1364
79. Upadhyay SK et al (2015) Insights from microsecond atomistic simulations of melittin in thin
lipid bilayers. J Membr Biol 248(3):497–503
80. Pino-Angeles A, Lazaridis T (2018) Effects of
peptide charge, orientation, and concentration
on melittin transmembrane pores. Biophys J
114(12):2865–2874
81. Mihailescu M et al (2019) Structure and function in antimicrobial piscidins: histidine position, directionality of membrane insertion, and
pH-dependent permeabilization. J Am Chem
Soc 141(25):9837–9853
82. Westerfield J et al (2019) Ions modulate key
interactions between pHLIP and lipid membranes. Biophys J 117(5):920–929
83. Robustelli P, Piana S, Shaw DE (2018) Developing a molecular dynamics force field for both
folded and disordered protein states. Proc Natl
Acad Sci U S A 115(21):E4758–E4766
84. Poger D, Caron B, Mark AE (2016) Validating
lipid force fields against experimental data:
progress, challenges and perspectives. Biochim
Biophys Acta 1858(7, Part B):1556–1565
85. van Gunsteren WF et al (2018) Validation of
molecular simulation: an overview of issues.
Angew Chem Int Ed Engl 57(4):884–902
86. Corradi V et al (2019) Emerging diversity in
lipid-protein interactions. Chem Rev 119
(9):5775–5848
87. Marrink SJ et al (2019) Computational modeling of realistic cell membranes. Chem Rev 119
(9):6184–6226
88. Häse F et al (2019) How machine learning can
assist the interpretation of. Chem Sci 10
(8):2298–2307
89. Doerr S et al (2016) HTMD: high-throughput
molecular dynamics for molecular discovery. J
Chem Theory Comput 12(4):1845–1852
90. Salmaso V, Moro S (2018) Bridging molecular
docking to molecular dynamics in exploring
ligand-protein recognition process: an overview. Front Pharmacol 9:923
Chapter 7
Coarse-Grain Simulations of Membrane-Adsorbed Helical
Peptides
Manuel N. Melo
Abstract
The amphipathic α-helix is a common motif for peptide adsorption to membranes. Many physiologically
relevant events involving membrane-adsorbed peptides occur over time and size scales readily accessible to
coarse-grain molecular dynamics simulations. This methodological suitability, however, comes with a
number of pitfalls. Here, I exemplify a multi-step adsorption equilibration procedure on the antimicrobial
peptide Magainin 2. It involves careful control of peptide freedom to promote optimal membrane adsorption before other interactions are allowed. This shortens preparation times prior to production simulations
while avoiding divergence into unrealistic or artifactual configurations.
Key words Peptide, Alpha-helix, Amphipathicity, Molecular dynamics, Coarse grain, Membrane
adsorption, Equilibration
1 Introduction
Amphipathicity in proteins has long been recognized as a driving
feature for adsorption to lipid membranes; one that takes advantage
of the membrane’s own amphipathic environment at the interface
between the lipids’ aliphatic tails and their polar headgroups [1]. In
this context, amphipathic α-helices are a common adsorption motif
[2], in which amino acid residues are organized around a helix in a
way that segregates apolar side chains from polar/charged ones,
usually along the helical diameter.
In proteins, the structural role of amphipathicity in α-helices is
not restricted to membrane adsorption: Schiffer and Edmundson
first proposed the amphipathicity-highlighting representation now
commonly known as “Edmundson wheel” (Fig. 1) for the visualization of soluble protein features [4]. Besides wheel representations, metrics such as the hydrophobic moment [5] or hydrophobic
angle [6] can be used in the quantification and visualization of
amphipathicity. In bioactive peptides, however, amphipathicity is a
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_7,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
Fig. 1 (a): Edmundson wheel representation of the sequence and α-helical structure of Magainin 2, highlighting
the polar/apolar segregation of residues along its surface (drawn with the aid of the NetWheels tool [3]);
residues are color-coded red for charged anionic, blue for charged cationic, green for polar uncharged, and
white for apolar. The same color code is used throughout the structural representations in this chapter. (b):
The two types of position restraints employed in this chapter for peptide equilibration; full line: harmonic
position restraint (from Eq. 1); dashed line: flat-bottom harmonic position restraint (from Eq. 2). (c): The Martini
α-helical CG structure for Magainin 2, where for simplicity only backbone particles are shown; the black arrow
represents the hydrophobic moment. Also in (c) are represented the degrees of freedom affected by the
restraints in (b), with red arrows indicating restrained translations and green arrows indicating free translations/rotations. (d): Initial system setup for single Magainin 2 peptides on a POPC membrane (at a global 1:312
peptide-to-lipid ratio). (e): Initial system setup for multiple (28) Magainin 2 peptides on a POPC membrane (at a
global 1:48 peptide-to-lipid ratio)
hallmark of membrane interaction. Prominent examples can be
found, among others, in antimicrobial peptides [7], in cell-penetrating peptides [8, 9] and in membrane-remodeling
peptides [10].
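As an illustration of how amphipathicity can be quantified, the sketch below computes a mean helical hydrophobic moment as the magnitude of the vector sum of per-residue hydrophobicities placed 100° apart around the helix axis. Several assumptions are made here that are not stated in the text: the Eisenberg consensus hydrophobicity scale is used (values as commonly tabulated; they should be checked against the primary source), the helical periodicity is taken as 100° per residue, and the Magainin 2 sequence is entered by hand. Other scales or periodicities can be substituted.

```python
import cmath
import math

# Eisenberg consensus hydrophobicity scale (assumed for this sketch).
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}

# Magainin 2 sequence (23 residues), entered here as an assumption.
MAGAININ_2 = "GIGKFLHSAKKFGKAFVGEIMNS"


def hydrophobic_moment(sequence, angle_deg=100.0, scale=EISENBERG):
    """Mean hydrophobic moment per residue for an ideal helix.

    Each residue contributes its hydrophobicity as a vector in the plane
    perpendicular to the helix axis, rotated by angle_deg per residue.
    """
    delta = math.radians(angle_deg)
    total = sum(scale[aa] * cmath.exp(1j * n * delta)
                for n, aa in enumerate(sequence))
    return abs(total) / len(sequence)


if __name__ == "__main__":
    print(f"Magainin 2 mean hydrophobic moment: {hydrophobic_moment(MAGAININ_2):.3f}")
    # A homopolymer spanning exactly five helical turns has no net moment.
    print(f"Poly-Leu (18 residues): {hydrophobic_moment('L' * 18):.3f}")
```

The poly-leucine control is a quick sanity check: 18 residues at 100° per residue sample the helical circle evenly, so the contributions cancel.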
Over the past two decades, the membrane activity of bioactive
peptides gradually came into the scope of molecular dynamics
(MD) simulations. This was boosted both by the continued
increase in computational power and by the development of
coarse-grain (CG) MD models. Biomolecular CG models, such as
the popular Martini framework [11, 12], simplify the structural
representation of molecules, exchanging fine detail for large simulation speedup. This allows the extension of simulation times into
the microsecond to millisecond range, and system sizes up to
hundreds of nanometers [13]. CG MD is also useful as an accelerated step prior to conversion to full detail for subsequent simulation at atomistic resolution [14].
1.1 Simulation Pitfalls
This chapter focuses on the setup of CG MD systems of membraneadsorbed peptides. The process of membrane adsorption by amphipathic peptides can, in principle, be followed by MD (see the caveat
for CG in point 1 below). However, phenomena of interest usually
occur after peptides are adsorbed. To obviate this potentially time-consuming step, I describe methods to directly prepare systems
with already-adsorbed peptides. These take into account a number
of considerations:
1. Amphipathic helical peptides often fold as a response to an
amphipathic environment and are unstructured otherwise
[1]. This is one of the reasons why simulating the adsorption
process from the aqueous phase can be time consuming.
Another limitation is that CG models do not always have the
ability to reproduce folding dynamics—Martini, for instance,
restrains protein structure to a given input throughout the
simulation [15, 16], and a Martini helix will always be a helix.
A corollary is that any peptide configuration simulated as an
amphipathic α-helix away from the membrane’s hydrophilic/
hydrophobic interface is likely to be unrealistic.
2. Besides the adsorption distance in the previous point, another
aspect for stability is helix orientation with respect to the membrane. Peptides will likely orient so that their hydrophobic
residues face the membrane core [2]. Simulation of configurations that are not stably equilibrated in this respect is, again,
unrealistic.
3. Simulated systems may have multiple adsorbed peptides, or
other membrane components such as transmembrane proteins.
Peptide interactions with one another or with additional components must only be allowed after proper equilibration
according to points 1 and 2.
Ignoring the above points can have a range of consequences.
When simulating isolated peptides in membranes, a poorly equilibrated adsorption can result in peptide diffusion back to the aqueous phase (see the illustrative example with the Magainin
2 antimicrobial peptide in Fig. 2) or in adsorption at
non-representative depths/orientations. While the former entails
a waste of computational resources, the latter, if undetected, will
yield erroneous measurements of membrane interaction
parameters.
The most serious consequences of inadequate equilibration,
however, occur when simulated systems contain multiple adsorbed
peptides—incidentally, a condition of many of the peptides’ proposed mechanisms [17]. Premature peptide–peptide contacts,
before hydrophilic/hydrophobic interactions with the membrane
are satisfied, will lead to artificial oligomerization propensities.
Namely, exposed peptidic hydrophobic patches will tend to bind
and form aggregates that will then have reduced drive for deeper
membrane interaction. When large numbers of mis-equilibrated
peptides are allowed to interact in a simulation, disordered peptide
aggregates may accumulate atop the membrane and actively
perturb it in an unrealistic fashion (compare the interaction features
Fig. 2 (a) and (b): z-distance between the peptides’ center-of-geometry and the membrane’s top leaflet PO4
layer, for 12 independent single-peptide systems; in (a), the 12 systems were simulated unrestrained, and
traces for peptides that ultimately leave the membrane are highlighted in red; in (b), a flat-bottom position
restraint in z was applied to the peptide backbone particles, with onset (rfb) at 3.0 nm from the membrane’s
center (roughly 1 nm above the PO4 layer) and force constant 500 kJ mol⁻¹ nm⁻² (the average position of the
onset is represented in (b) by the dotted line). (c): Helix orientation relative to the z-axis for the systems
simulated with z-restraints, expressed as the dihedral angle between the plane containing the hydrophobic
moment and the helix axis (see Fig. 1) and the plane containing the helix axis and the z-axis; adsorbed
peptides converge on orientations where their hydrophobic moment points roughly away from the + z direction
and into the membrane core. (d): Final structure of one of the simulations without restraints where the peptide
left the membrane. (e): Representative final structure of the peptides simulated with flat-bottom restraining
potentials preventing their desorption back into the aqueous phase; the restraining potential shape relative to
the membrane is illustrated by the dashed line. Overall, the use of restraints in z promotes a quick and
consistent membrane adsorption, even if the restraining potential, with onset outside of the membrane, does
not actively affect the adsorbed state
of high densities of Magainin 2 in Fig. 3e–h and in refs. 20 and 18).
In reality, by contrast, peptides that are water-soluble are less likely
to have structures with spatially segregated hydrophobic residues
before complete adsorption and are therefore also unlikely to
aggregate with one another at that stage.
Fig. 3 Adsorption equilibration for simulations under different restraints. (a) and (b): z-distance to the PO4 layer
and hydrophobic moment angle with the z-axis, as in Fig. 2a–c, for 28 Magainin 2 peptides simulated
simultaneously, with only flat-bottom restraints in z; (c) and (d) are the same measurements for an analogous
system, but with the first and last backbone particles of each peptide also pinned in the xy-plane by harmonic
potentials of force constant 500 kJ mol⁻¹ nm⁻². (e): Snapshot after 30 ns of a 28-peptide system simulated
without any restraints, where peptides quickly and stably aggregate with one another, away from the
membrane interface. (f) and (g) are the final snapshots corresponding to the simulations in panels (a)/(b)
and (c)/(d), respectively. In (g), yellow arrows indicate the xy-pinned backbone particles for one of the
peptides; this xy restraining effectively prevents peptide lateral diffusion, yet panel (d) shows that peptides
retain their ability to orient relative to the membrane. In (f), without lateral restraints, part of the peptides is
able to correctly associate into dimers [18, 19] but proper membrane adsorption is delayed (compare (a) and
(b) with (c) and (d)) and at least 5 peptides form an artifactual aggregate protruding into the aqueous phase
(yellow arrow). Finally, (h) shows that after the restraints in (g) are lifted, the system quickly (600 ns)
progresses towards a realistic distribution of dimers and monomers, all membrane-adsorbed
1.2 Equilibration Strategy
To properly equilibrate membrane-adsorbed peptides according to
the above requirements, specific restraints to their freedom must be
imposed while they converge to stable adsorption depths and
orientations [18, 21]. This is akin to the usual practice of restraining atomistic protein backbone motion when initially equilibrating
a system after solvation: as much as possible, introduced instability
should be resolved by the faster degrees of freedom, rather than
being allowed to drive the system into states from which convergence back to representative configurations may be too slow.
In this chapter I use the adsorption of the antimicrobial peptide
Magainin 2 onto a palmitoyl-oleoyl-phosphatidylcholine (POPC)
bilayer (Fig. 1) to exemplify in practice how restraints along the
z axis can be used to keep peptides in the membrane vicinity,
promoting the equilibration of adsorption depth (Fig. 2). For
systems with multiple peptides I employ further restraints, pinning
peptide termini in the xy plane, which lets adsorption depth and
orientation equilibrate while preventing lateral diffusion and
untimely peptide–peptide contacts (Fig. 3). See Fig. 1b and c for a
depiction of the employed restraints and their effect on the peptides’ degrees of freedom.
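To make the two kinds of restraint concrete, the following sketch generates lines for a GROMACS `[ position_restraints ]` topology section: a flat-bottom restraint in z (function type 2 with layer geometry 5) on every backbone particle, and optionally a harmonic restraint (function type 1) in x and y only on the first and last backbone particles. The force constants and the 3.0 nm onset are the values quoted in the captions of Figs. 2 and 3; the function name and the particle indices are hypothetical. For the onset to be measured from the membrane center, the reference coordinates passed to `gmx grompp` would have to place the restrained particles at that z position, and whether both restraint types can be listed for the same particle should be checked for the GROMACS version in use.

```python
def restraint_lines(backbone_atoms, r_fb=3.0, k_z=500.0, k_xy=500.0,
                    pin_termini=False):
    """Return lines for a [ position_restraints ] section (illustrative only).

    backbone_atoms: molecule-local indices of the peptide backbone particles.
    r_fb: flat-bottom half-width in nm; k_z, k_xy: force constants
    in kJ mol^-1 nm^-2.
    """
    lines = ["[ position_restraints ]", "; atom  funct  parameters"]

    # Flat-bottom restraint: funct 2, geometry 5 (layer normal to z), r_fb, k.
    for atom in backbone_atoms:
        lines.append(f"{atom:6d}  2  5  {r_fb:.2f}  {k_z:.1f}")

    # Harmonic restraint: funct 1 with kx, ky, kz; kz = 0 leaves z free.
    if pin_termini:
        termini = dict.fromkeys((backbone_atoms[0], backbone_atoms[-1]))
        for atom in termini:
            lines.append(f"{atom:6d}  1  {k_xy:.1f}  {k_xy:.1f}  0.0")

    return lines


if __name__ == "__main__":
    # Hypothetical backbone particle indices for a short peptide.
    print("\n".join(restraint_lines([1, 3, 5], pin_termini=True)))
```

Generating the section programmatically keeps the single-peptide case (z restraints only) and the multi-peptide case (z restraints plus xy pinning) consistent with one another.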
2 Software and Models
2.1 Forcefield
The protocols in this chapter were tested with the Martini 2.2
forcefield, but instructions should hold unaltered for most Martini
protein implementations [15, 16]. See Note 1 for applicability to
other forcefields.
2.2 Simulation Package
System preparation and simulation are exemplified with the GROMACS 2020.5 [22] simulation package, but the procedure is compatible with GROMACS versions 5.0 or higher (when flat-bottom
position restraints were introduced). See Note 2 for other compatibility considerations.
2.3 System Construction
CG structures and topologies are constructed using the martinize2 tool [23] from α-helical atomistic structures (in the examples in Figs. 1, 2 and 3, the starting Magainin 2 atomistic structure
was first constructed as an ideal helix using Avogadro v1.2.0 [24]).
Previous versions of martinize2 or of martinize [25] can also
be used.
Membranes are constructed with the insane.py script [26]
but any other source of flat, equilibrated Martini membranes is
acceptable (such as those generated by the CHARMM-GUI tool
[27]).
Peptide–membrane juxtaposition is done using the MDAnalysis v1.0.0 Python package [28] together with tools from the GROMACS suite.
2.4 Restraining Potentials
Two types of restraining potentials are needed. The first is a simple
harmonic potential V that restrains particle position r along a given
dimension to a reference position r0, with force constant k, according to Eq. 1:
$$V = k\,(r - r_0)^2 \tag{1}$$
The second restraining potential is a piecewise extension of the
harmonic potential, in that the potential only starts increasing at a
distance rfb from r0. The potential is flat at zero between r0 − rfb
and r0 + rfb, hence the name “flat-bottom potential”:
$$V = \begin{cases} k\,(r - r_0 - r_{fb})^2, & r > r_0 + r_{fb} \\ k\,(r - r_0 + r_{fb})^2, & r < r_0 - r_{fb} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
CG Simulations of Membrane-Adsorbed Helical Peptides
The shape of the two potentials can be compared in
Fig. 1b. Either potential can be independently applied to each of
the x, y, and z dimensions. For a membrane with normal aligned
with z, the procedures in this chapter use flat-bottom restraining
potentials along z and harmonic restraining potentials on x and y.
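The two restraining potentials can be written out directly as functions, which may help when checking how far a particle must stray before it feels a force. The following is only a minimal sketch that follows Eqs. 1 and 2 as printed; the function names are illustrative, and a given simulation package may define the force constant with a different prefactor.

```python
def harmonic(r, r0, k):
    """Eq. 1: harmonic restraint energy along one dimension."""
    return k * (r - r0) ** 2


def flat_bottom(r, r0, r_fb, k):
    """Eq. 2: zero within r_fb of r0, harmonic beyond that distance."""
    if r > r0 + r_fb:
        return k * (r - r0 - r_fb) ** 2
    if r < r0 - r_fb:
        return k * (r - r0 + r_fb) ** 2
    return 0.0


# Illustrative values: membrane center at z = 0 nm, r_fb = 3.0 nm, k = 500
inside = flat_bottom(2.5, 0.0, 3.0, 500.0)   # within the flat region
outside = flat_bottom(3.5, 0.0, 3.0, 500.0)  # 0.5 nm beyond the flat region
```

With these inputs a particle 2.5 nm from the reference feels no restraint, whereas one at 3.5 nm is penalized exactly as a harmonic restraint displaced by 0.5 nm would penalize it.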
2.5 Equilibration Monitoring
Evolution of equilibration in Figs. 2 and 3 was monitored visually,
using VMD v1.9.3 [29], and quantitatively, using custom tools
written in Python using the MDAnalysis, NumPy v1.19 [30], and
Matplotlib v3.3.3 [31] packages. Two metrics were followed:
• Each peptide’s center-of-geometry position in z relative to the
top leaflet’s PO4 layer (assuming peptides are being adsorbed
onto the top leaflet).
• The alignment with the z-axis of a reference vector for each
peptide helix. Alignment can be the simple angle of the reference
vector with +z or, as in Figs. 2 and 3, it can be the dihedral
torsional angle around the helical axis between the reference
vector and +z. Figures 1, 2 and 3 use the hydrophobic moment
as reference vector (as implemented in the
3D-HM tool [32]); this highlights hydrophobic orientation
towards the membrane core during equilibration, but see Note
4 for simpler metrics.
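The two metrics above reduce to short calculations once coordinates are in hand. The sketch below is not the custom tooling used for Figs. 2 and 3; it is a dependency-free illustration with hypothetical function names, assuming the z-coordinates of the peptide beads and of the top-leaflet PO4 beads have already been extracted for one frame, and using the simple angle with +z rather than the dihedral torsion.

```python
import math


def depth_relative_to_po4(peptide_z, po4_z):
    """Peptide center-of-geometry z minus the mean z of the PO4 layer."""
    peptide_cog_z = sum(peptide_z) / len(peptide_z)
    po4_level = sum(po4_z) / len(po4_z)
    return peptide_cog_z - po4_level


def angle_with_z(vec):
    """Simple angle, in degrees, between a reference vector and +z."""
    norm = math.sqrt(sum(c * c for c in vec))
    return math.degrees(math.acos(vec[2] / norm))


# Hypothetical single-frame values (nm)
depth = depth_relative_to_po4([5.0, 5.4], [4.0, 4.2, 3.8])
tilt = angle_with_z((1.0, 0.0, 0.0))
```

A positive depth means the peptide sits above the PO4 layer; tracking both numbers over time and looking for a plateau is one way to judge convergence.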
3 Methods
These steps assume that typical Martini run parameters [33, 34] for
energy minimization, pressure and temperature equilibration, and
production are used, but these can be adapted if other forcefields
are employed.
Pressure coupling should be done semi-isotropically (in xy and
z separately) unless the specific application demands otherwise.
When employing pressure coupling together with position
restraints GROMACS requests that you decide how to scale the
restraint reference points (r0) with pressure scaling. For the
restraints used here it is advisable to set the refcoord_scaling = com run parameter.
These instructions also assume that the peptides will be added
to only one of the membrane leaflets. Nonetheless, the steps are
readily extensible to equilibrating adsorption on both leaflets simultaneously by simply adding two layers of peptides. The involved
restraints do not require any adjustment in that case and only the
equilibration monitoring must be adapted to reverse distance/
angle signs for part of the peptides.
3.1 Common Preparation
1. Create a membrane of suitable size and composition using
insane.py. Energy-minimize it, equilibrate pressure and
temperature, and then equilibrate lipid mixing, if needed, for
a suitable amount of time (which will depend on membrane
size and composition).
2. Obtain the CG topology and structure for your helical peptide
using martinize2.
3. Ensure that the peptide lies with its helical axis parallel to the
membrane surface. The gmx editconf GROMACS command
can do this using the -princ flag and subsequently the -rotate flag, if needed. Alternatively, MDAnalysis can be
used to orient the molecule programmatically in Python.
3.2 Single-Peptide Adsorption
1. Modify the peptide topology to add a flat-bottom position
restraint to all backbone particles. Make this restraint operate
along the z-axis to confine the particles to a horizontal slab, with
a force constant of 500 kJ mol⁻¹ nm⁻². The flat-bottom distance rfb should be 3.0 nm: the reference point will later be set
to the membrane center, so this potential will leave a clearance
of about 1 nm on either side of the membrane (assuming a
typical membrane thickness of about 2 nm per leaflet; adjust if
working with membranes of significantly different thickness).
For a GROMACS topology, the position restraint directive will
look like this:
[ position_restraints ]
; atom  funct  geometry  r_fb  k
1  2  5  3.0  500
...
2. Optionally, modify the topology to enclose the position_restraints block in a GROMACS preprocessing #ifdef
MACRO_NAME/#endif directive (the actual name for MACRO_NAME can be chosen by the user). This enables easy restraint
control in run parameter (.mdp) files using the define
keyword.
3. Add the peptide’s topology to the membrane system description. Juxtapose the peptide’s structure coordinates with those
of the membrane system, making sure that the peptide backbone is placed close to, but above the phosphate layer
(or below, if adding peptides to the bottom leaflet); any needed
vertical displacement can be done prior to juxtaposition using
gmx editconf, MDAnalysis, or even interactively, using the
structure modification capabilities of VMD. The juxtaposition
itself can be done using the MDAnalysis.Merge functionality,
or by concatenating structure files by hand (with due care to
keep file format integrity).
4. Generate a reference structure for GROMACS by setting r0 for
each position restraint. This is done by creating a copy of the
juxtaposed structure file where every backbone particle is
placed at the z-level of the membrane center. This is most easily
accomplished with MDAnalysis:
import MDAnalysis as mda
u = mda.Universe('juxtaposed.gro')
# adjust for the appropriate lipid selection
membrane = u.select_atoms('resname POPC')
# adjust backbone name if not using Martini
bb = u.select_atoms('name BB')
# z-coordinate of the membrane center
membrane_zcog = membrane.center_of_geometry()[2]
# move every backbone particle to the membrane-center z-level (x and y unchanged)
pos = bb.positions
pos[:,2] = membrane_zcog
bb.positions = pos
u.atoms.write('reference.gro')
5. You can now energy-minimize and equilibrate the system. The
Martini CG forcefield is usually robust to the blunt coordinate
juxtaposition strategy used here (see Note 3). Flat-bottom
restraints should be active until adsorption and orientation
converge, and then switched off for production runs. This
equilibration may be carried out simultaneously with pressure/temperature equilibration.
6. Equilibration monitoring in the previous step can be done
as in Figs. 2 and 3, by measuring peptide–PO4 distances and
helix orientations. Distances can be measured using the gmx
traj GROMACS command but the gmx helixorient command, unfortunately, cannot process CG structures. MDAnalysis can be used to measure both distance and helix orientation
(see Note 4).
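For peptides with many backbone beads, typing the restraint lines of steps 1 and 2 by hand is error-prone, so they can be generated instead. The sketch below is one possible way to do so and is not part of the original protocol: the function name, the macro name POSRES_FB_Z, and the example bead indices are all hypothetical, while the column layout follows the directive shown in step 1.

```python
def flat_bottom_z_block(bb_indices, r_fb=3.0, k=500.0, macro="POSRES_FB_Z"):
    """Build an #ifdef-guarded [ position_restraints ] block.

    Each line uses function type 2 (flat-bottom) and geometry 5 (layer in z),
    as in the directive of step 1. `macro` is a user-chosen, hypothetical name.
    """
    lines = [
        f"#ifdef {macro}",
        "[ position_restraints ]",
        "; atom  funct  geometry  r_fb  k",
    ]
    for idx in bb_indices:
        lines.append(f"{idx:6d}  2  5  {r_fb:.1f}  {k:g}")
    lines.append("#endif")
    return "\n".join(lines)


# Hypothetical backbone bead indices within the peptide molecule type
block = flat_bottom_z_block([1, 3, 5])
print(block)
```

The resulting text is pasted at the end of the peptide's molecule definition, and the restraints are then switched on in a run parameter file with `define = -DPOSRES_FB_Z` (or whichever macro name was chosen).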
3.3 Multiple Peptide Adsorption
1. Modify the peptide topology to add a flat-bottom position
restraint to all backbone particles, as in Subheading 3.2, step
1. Add a second set of position restraints on the first and last
backbone beads of the peptide (see Note 5 for other possibilities when peptide density is low). These restraints should be of
the harmonic type, and act only in the x and y dimensions, with
the same force constant as their flat-bottom counterparts:
[ position_restraints ]
; atom  funct  kx  ky  kz
1   1  500  500  0
51  1  500  500  0
2. Optionally, you can split the restraints in the topology into
separate [ position_restraints ] blocks, each under its
own GROMACS preprocessing #ifdef/#endif directive.
This enables independent restraint control.
3. Multiply the peptide structure in x and y, using the gmx genconf GROMACS tool to achieve the desired number of peptides. To control inter-peptide spacing use either the -dist
flag or adjust the empty space around the template peptide
structure using gmx editconf with the -d flag. See Note 5
on how to reduce bias in peptide distribution at this step.
4. Add the peptide’s topology to the membrane system description. Juxtapose the peptides’ structure coordinates with those
of the membrane system, following the same considerations as
in Subheading 3.2, step 3. See Note 6 if using large membranes
where buckling prevents proper peptide placement or if the
membrane buckling amplitude is larger than the z-restraints’
flat-bottom region.
5. Generate a reference structure for GROMACS as in Subheading 3.2, step 4. The above code snippet is still valid in this
context but if generating the reference by other means note
that while in Subheading 3.2, step 4 only the z-coordinate of
the beads in the reference file mattered, when also restraining in
x and y those coordinates are no longer arbitrary and must be
kept unchanged (so that peptide termini are pinned to their
initial xy position).
6. As in Subheading 3.2, step 5, you can now energy-minimize
and equilibrate the system, with restraints active until adsorption and orientation converge (monitor convergence as in
Subheading 3.2, step 6). Afterwards, include an additional
equilibration period in your unrestrained production run to
allow for unbiased peptide redistribution. Depending on the
system, this may take from hundreds of nanoseconds to many
microseconds.
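The requirement in step 5, that the reference file carry the membrane-center z for backbone beads while leaving x and y untouched, can be stated compactly in code. The following is a dependency-free sketch of that rule only, for readers building the reference file without MDAnalysis; the function name and the sample coordinates are hypothetical.

```python
def make_reference(coords, is_backbone, z_center):
    """Return reference coordinates for the position restraints.

    Backbone beads get z = z_center (the flat-bottom reference level);
    x and y are always kept, so that pinned termini stay at their
    initial xy positions. Other particles are copied unchanged.
    """
    reference = []
    for (x, y, z), backbone in zip(coords, is_backbone):
        reference.append((x, y, z_center if backbone else z))
    return reference


# Hypothetical coordinates (nm): two backbone beads and one side-chain bead
coords = [(1.0, 2.0, 7.5), (1.3, 2.1, 7.6), (1.2, 2.4, 7.9)]
ref = make_reference(coords, [True, True, False], z_center=5.0)
```

Only the z column of the backbone beads changes; any tool that also resets x or y would silently move the points to which the termini are pinned.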
4 Notes
1. The procedure in this chapter is, in principle, generic and
applicable to different forcefields, including atomistic ones.
However, for models that allow folding/unfolding [35] or
that have breakable/soft elastic networks [36], care must be
taken that peptides remain in their desired adsorption structures; otherwise, peptides may misfold and become trapped in
less representative membrane-interacting configurations.
2. While harmonic position restraints are ubiquitous across simulation software, the flat-bottom restraints used here are much less
common. When unavailable, soft harmonic position restraints
in z, centered on the expected peptide adsorption depth, can be
used as a substitute. These can also be used when membranes
have a non-flat geometry for which no readily usable flat-bottom potential exists. This alternative has the disadvantage
of imposing a nonzero restraint force even when peptides are
close to their adsorption equilibrium depths.
3. Energy minimization after direct juxtaposition of peptide coordinates may not converge if particles overlap. While
energy minimization with Martini is typically robust to close
contacts, large systems or juxtapositions too deep into the
membrane may be too unstable. Solutions are to try (i) shallower juxtapositions, (ii) removal of solvent molecules in the
immediate vicinity of peptides, or (iii) minimization using soft-core Lennard-Jones potentials (which do not have a singularity at
zero distance; GROMACS allows the use of such potentials
when free-energy mode is activated in the run parameters). If
this procedure is extended to atomistic systems, solution (ii) is
likely the only viable route to a stable system construction.
4. The use of the hydrophobic moment as reference vector for
measuring helix orientation is needlessly complex, since it involves
defining it for the initial atomistic structure and then expressing it as a function of CG particle positions; it was only used
here so that the orientation of hydrophobic residues towards
the membrane core could also be visualized in Figs. 2c, 3b, and
3d. Any vector with a significant component orthogonal to the
helix axis can be used to gauge orientation (for instance, an
i → i + 1 backbone–backbone vector). Likewise, the measure of
the simple angle (rather than the dihedral torsion angle) with
+z is also sufficient to monitor convergence. These two simplifications can be easily implemented using MDAnalysis with the
following snippet (exemplified for the single-peptide case):
import MDAnalysis as mda
import numpy as np
# adjust for the appropriate topology/trajectory files
u = mda.Universe('topology.tpr', 'trajectory.xtc')
# adjust backbone name if not using Martini
bbs = u.select_atoms('name BB')
# index of a backbone bead near the middle of the helix
mid_bb = len(bbs)//2
angles = []
for frame in u.trajectory:
    # i -> i + 1 backbone-backbone vector at the helix center
    vec = bbs.positions[mid_bb+1] - bbs.positions[mid_bb]
    norm = np.linalg.norm(vec)
    # angle (in radians) between that vector and +z
    angles.append(np.arccos(vec[2]/norm))
5. It is good practice, when multiplying a structure, to assign
random rotations in xy to each copy so as to minimize initial
structure bias. gmx genconf can do this using the -rot and -maxrot
flags but note that at high peptide densities there may
be no other option than to set peptides parallel to one another.
If space does allow for random rotation, then xy-restraining
should be applied not at the termini, but on a single residue at
the helix center, to allow rotation in place around the z-axis also
during equilibration.
6. If the target membrane is large enough to spontaneously
buckle with significant amplitude, a solution is to employ z-axis flat-bottom position restraints also on the lipids, thus
damping the buckling. Such restraints can be applied to the
lipids’ glycerol moieties, restraining them to be within 2.0 nm
of the membrane center. This should only be done during
adsorption equilibration, to allow proper action of the peptide
restraints.
References
1. Sankaram MB, Marsh D (1993) Protein-lipid interactions with peripheral membrane proteins. In: Watts A (ed) Protein-lipid interactions, new comprehensive biochemistry, vol 25. Elsevier, chap 6, pp 127–162, https://doi.org/10.1016/S0167-7306(08)60235-5
2. Hristova K, Wimley WC, Mishra VK, Anantharamiah GM, Segrest JP, White SH (1999) An amphipathic α-helix at a membrane interface: a structural study using a novel X-ray diffraction method. J Molecular Biol 290(1):99–117. https://doi.org/10.1006/jmbi.1999.2840
3. Mól AR, Castro MS, Fontes W (2018) NetWheels: a web application to create high quality peptide helical wheel and net projections. bioRxiv https://doi.org/10.1101/416347
4. Schiffer M, Edmundson AB (1967) Use of helical wheels to represent the structures of proteins and to identify segments with helical potential. Biophys J 7(2):121–135. https://doi.org/10.1016/S0006-3495(67)86579-2
5. Eisenberg D, Weiss RM, Terwilliger TC (1982) The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature 299(5881):371–374. https://doi.org/10.1038/299371a0
6. Wieprecht T, Dathe M, Epand RM, Beyermann M, Krause E, Maloy WL, MacDonald DL, Bienert M (1997) Influence of the angle subtended by the positively charged helix face on the membrane activity of amphipathic, antibacterial peptides. Biochemistry 36(42):12869–12880. https://doi.org/10.1021/bi971398n
7. Tossi A, Sandri L, Giangaspero A (2000) Amphipathic, alpha-helical antimicrobial peptides. Biopolymers 55(1):4–30. https://doi.org/10.1002/1097-0282(2000)55:1<4::AID-BIP30>3.0.CO;2-M
8. Zaro JL, Shen WC (2015) Cationic and amphipathic cell-penetrating peptides (CPPs): Their structures and in vivo studies in drug delivery. Front Chem Sci Eng 9(4):407–427. https://doi.org/10.1007/s11705-015-1538-y
9. Henriques ST, Melo MN, Castanho MARB (2006) Cell-penetrating peptides and antimicrobial peptides: how different are they? Biochem J 399(1):1–7. https://doi.org/10.1042/BJ20061100
10. Drin G, Antonny B (2010) Amphipathic helices and membrane curvature. FEBS Lett 584(9):1840–1847. https://doi.org/10.1016/j.febslet.2009.10.022
11. Marrink SJ, Risselada HJ, Yefimov S, Tieleman DP, de Vries AH (2007) The MARTINI force field: coarse grained model for biomolecular simulations. J Phys Chem B 111(27):7812–7824. https://doi.org/10.1021/jp071097f
12. Bruininks BMH, Souza PCT, Marrink SJ (2019) A practical view of the martini force field, Springer, New York, pp 105–127. https://doi.org/10.1007/978-1-4939-9608-7_5
13. Pezeshkian W, König M, Wassenaar TA, Marrink SJ (2020) Backmapping triangulated surfaces to coarse-grained membrane models. Nature Commun 11(1). https://doi.org/10.1038/s41467-020-16094-y
14. Rzepiela AJ, Sengupta D, Goga N, Marrink SJ (2009) Membrane poration by antimicrobial peptides combining atomistic and coarse-grained descriptions. Faraday Discussions 144:431–443. https://doi.org/10.1039/b901615e
15. Monticelli L, Kandasamy SK, Periole X, Larson RG, Tieleman DP, Marrink SJJ (2008) The MARTINI coarse-grained force field: extension to proteins. J Chem Theory Comput 4(5):819–834. https://doi.org/10.1021/ct700324x
16. Periole X, Cavalli M, Marrink SJ, Ceruso MA (2009) Combining an elastic network with a coarse-grained molecular force field: structure, dynamics, and intermolecular recognition. J Chem Theory Comput 5(9):2531–2543. https://doi.org/10.1021/ct9002114
17. Melo MN, Ferre R, Castanho MARB (2009) Antimicrobial peptides: linking partition, activity and high membrane-bound concentrations. Nat Rev Microbiol 7(3):245–50. https://doi.org/10.1038/nrmicro2095
18. Su J, Marrink SJ, Melo MN (2020) Localization preference of antimicrobial peptides on liquid-disordered membrane domains. Front Cell Develop Biol 8. https://doi.org/10.3389/fcell.2020.00350
19. Mukai Y, Matsushita Y, Niidome T, Hatekeyama T, Aoyagi H (2002) Parallel and antiparallel dimers of magainin 2: their interaction with phospholipid membrane and antibacterial activity. J Peptide Sci 8(10):570–577. https://doi.org/10.1002/psc.416
20. Woo HJ, Wallqvist A (2011) Spontaneous buckling of lipid bilayer and vesicle budding induced by antimicrobial peptide magainin 2: a coarse-grained simulation study. J Phys Chem B 115(25):8122–8129. https://doi.org/10.1021/jp2023023
21. Su J, Thomas AS, Grabietz T, Landgraf C, Volkmer R, Marrink SJ, Williams C, Melo MN (2018) The N-terminal amphipathic helix of Pex11p self-interacts to induce membrane remodelling during peroxisome fission. Biochimica et Biophysica Acta (BBA) - Biomembranes 1860(6):1292–1300. https://doi.org/10.1016/j.bbamem.2018.02.029
22. Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E (2015) GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2:19–25. https://doi.org/10.1016/j.softx.2015.06.001
23. Kroon PC (2021) Martinize2 and Vermouth. https://github.com/marrink-lab/vermouth-martinize. Accessed 28 Jan 2021
24. Hanwell MD, Curtis DE, Lonie DC, Vandermeersch T, Zurek E, Hutchison GR (2012) Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. J Cheminform 4(1):17. https://doi.org/10.1186/1758-2946-4-17
25. Martinize (2017) http://cgmartini.nl/index.php/tools2/proteins-and-bilayers/204-martinize. Accessed 28 Jan 2021
26. Wassenaar TA, Ingólfsson HI, Böckmann RA, Tieleman DP, Marrink SJ (2015) Computational lipidomics with insane: a versatile tool for generating custom membranes for molecular simulations. J Chem Theory Comput 11(5):2144–2155. https://doi.org/10.1021/acs.jctc.5b00209
27. Lee J, Hitzenberger M, Rieger M, Kern NR, Zacharias M, Im W (2020) CHARMM-GUI supports the Amber force fields. J Chem Phys 153(3). https://doi.org/10.1063/5.0012280
28. Gowers RJ, Linke M, Barnoud J, Reddy TJE, Melo MN, Seyler SL, Domański J, Dotson DL, Buchoux S, Kenney IM, Beckstein O (2016) MDAnalysis: a Python package for the rapid analysis of molecular dynamics simulations. In: Benthall S, Rostrup S (eds) Proceedings of the 15th Python in science conference, SciPy, pp 98–105. https://doi.org/10.25080/Majora-629e541a-00e
29. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Molecular Graph 14(1):33–38. https://doi.org/10.1016/0263-7855(96)00018-5
30. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2. 2006.10256
31. Hunter JD (2007) Matplotlib: A 2D graphics environment. Comput Sci Eng 9(3):90–95. https://doi.org/10.1109/MCSE.2007.55
32. Reißer S, Strandberg E, Steinbrecher T, Ulrich AS (2014) 3D hydrophobic moment vectors as a tool to characterize the surface polarity of amphiphilic peptides. Biophys J 106(11):2385–2394. https://doi.org/10.1016/j.bpj.2014.04.020
33. De Jong DH, Baoukina S, Ingólfsson HI, Marrink SJ (2016) Martini straight: boosting performance using a shorter cutoff and GPUs. Comput Phys Commun 199:1–7. https://doi.org/10.1016/j.cpc.2015.09.014
34. Martini run input parameters (2017). http://cgmartini.nl/index.php/force-field-parameters/input-parameters. Accessed 28 Jan 2021
35. Darré L, Machado MR, Brandner AF, González HC, Ferreira S, Pantano S (2015) SIRAH: a structurally unbiased coarse-grained force field for proteins with aqueous solvation and long-range electrostatics. J Chem Theory Comput 11(2):723–739. https://doi.org/10.1021/ct5007746
36. Poma AB, Cieplak M, Theodorakis PE (2017) Combining the MARTINI and Structure-Based Coarse-Grained Approaches for the Molecular Dynamics Studies of Conformational Transitions in Proteins. J Chem Theory Comput 13(3):1366–1374. https://doi.org/10.1021/acs.jctc.6b00986
Chapter 8
Peptide Dynamics and Metadynamics: Leveraging
Enhanced Sampling Molecular Dynamics to Robustly Model
Long-Timescale Transitions
Joseph Clayton, Lokesh Baweja, and Jeff Wereszczynski
Abstract
Molecular dynamics simulations can in theory reveal the thermodynamics and kinetics of peptide conformational transitions at atomic-level resolution. However, even with modern computing power, they are
limited in the timescales they can sample, which is especially problematic for peptides that are fully or
partially disordered. Here, we discuss how the enhanced sampling methods accelerated molecular dynamics
(aMD) and metadynamics can be leveraged in a complementary fashion to quickly explore conformational
space and then robustly quantify the underlying free energy landscape. We apply these methods to two
peptides that have an intrinsically disordered nature, the histone H3 and H4 N-terminal tails, and use
metadynamics to compute the free energy landscape along collective variables discerned from aMD
simulations. Results show that these peptides are largely disordered, with a slight preference for α-helical
structures.
Key words Peptide dynamics, Accelerated molecular dynamics, Collective variables, Metadynamics
1 Introduction
Molecular dynamics (MD) simulations have become an invaluable
tool in the study of biomolecular structure, function, and dynamics
[1, 2]. Through the development of carefully optimized force fields
[3, 4], they model biologically relevant motions by integrating
Newton’s equations of motion in complex heterogeneous systems.
This can lead to powerful insights into the atomic-level descriptions
of biologically relevant systems, producing models of biomolecular
mechanisms across vast time and length scales, providing novel
insights into experimental results, and aiding the design of new
experiments. Although there has been a significant rise in computational power over the past decade, which in no small part is due to
the development of GPU programming [5, 6] and special purpose
machines [7, 8], MD simulations are still typically on the
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_8,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
microseconds or shorter timescale—primarily due to the small
timestep required to capture motions between bonded atoms.
Because of this, long-timescale events are often impossible to sample, and
biological equilibrium is often impossible to reach, through
conventional molecular dynamics (cMD) simulations, especially
for highly flexible systems such as peptides.
To circumvent this limitation, enhanced sampling methods
have been developed to efficiently extend MD to characterize the
thermodynamic properties of hard-to-sample systems. In this chapter, we describe two complementary approaches for using enhanced
sampling methods to determine the conformations of peptides in
silico, along with their respective solution populations and free
energies. First, we describe accelerated molecular dynamics
(aMD) simulations. In aMD, the energy landscape of a system is
flattened through the addition of a “boost” potential [9]. This
lowers energy barriers and promotes the rapid sampling of novel
conformational states [10–14]; however, it can be difficult to
robustly reweight the results to compute the free energy of states
in the physically relevant unaccelerated system. Second, we describe
the method of metadynamics, in which energy is adaptively added
to the system along a low number of predefined reaction coordinates [15–17]. In metadynamics, more care must be taken than in
aMD to ensure the reaction coordinates are properly chosen,
making these simulations more difficult to properly set
up. However, metadynamics allows for the robust calculation of
the system’s underlying free energy landscape [18–20]. Given the
strengths and weaknesses of both approaches, it is natural to use
aMD and metadynamics in a hierarchical protocol. Here, we illustrate
how this can be achieved for two intrinsically disordered proteins
(IDPs): the N-terminal tails of the H3 and H4 histones [21–23]. aMD simulations are used to sample phase space quickly, gain a qualitative understanding
of these systems, and define appropriate reaction coordinates, whereas metadynamics calculations are
used to refine these results and quantitatively compute the underlying free energy landscape. Both methods are implemented in a
wide array of popular MD packages, and here we have used NAMD
v 2.14 [6] for all calculations.
2 Finding Peptide Conformations Using Accelerated Molecular Dynamics (aMD)
2.1 Theory
Sampling in cMD simulations is limited in part by the presence of
high energy barriers between conformations. To overcome this,
aMD aims to increase the rate of sampling rare motions and configurations by altering a system’s potential energy landscape
through the addition of a “boost” potential [5, 9, 10, 24]:
Simulating Peptide Dynamics with aMD and Metadynamics
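[Figure 1: plot of energy E(x) against x, marking the threshold energy E_thresh and the altered potentials for α = 0.5, α = 2.0, and α = 10.0]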
Fig. 1 A demonstration of the alterations done by accelerated MD to a given
potential, dictated by Eq. 1. The altered potentials (dashed lines) all have the
same threshold energy (dotted line), but have different tuning parameters. The
threshold energy dictates which portion of the potential is affected, while the
tuning parameter controls the alteration; as the tuning parameter approaches
zero, the altered region becomes constant and approaches the threshold energy
$$V^{*}(r) = V(r) + V_{boost}(r)$$
$$V_{boost}(r) = \begin{cases} 0, & V(r) > E_{thresh} \\ \dfrac{\left(E_{thresh} - V(r)\right)^2}{\alpha + E_{thresh} - V(r)}, & \text{else} \end{cases} \tag{1}$$
The altered potential reduces the depth of energy wells and
flattens the overall landscape, making energy barriers easier to cross
and rare conformations accessible on a shorter timescale. The
functional form of the boost potential has two parameters: Ethresh,
which is the threshold energy below which the boost potential is
applied, and α, a tuning parameter that dictates how deep the
energy wells are in the accelerated landscape. Figure 1 demonstrates
how the aMD potential landscape is altered by the choice of α;
smaller values of α produce landscapes with shallower wells and
lower barriers, but only for regions below the threshold energy.
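The effect of α described above is easy to verify numerically. The following is a minimal sketch of Eq. 1 for a single energy value, with illustrative function names and arbitrary energies; it is not how NAMD applies the boost internally, and it assumes α > 0.

```python
def boost(v, e_thresh, alpha):
    """Boost potential of Eq. 1 for an unaccelerated energy v (alpha > 0)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if v > e_thresh:
        return 0.0
    diff = e_thresh - v
    return diff ** 2 / (alpha + diff)


def accelerated(v, e_thresh, alpha):
    """Accelerated energy: the original energy plus the boost."""
    return v + boost(v, e_thresh, alpha)


# Arbitrary example: a well 10 energy units below the threshold
for alpha in (0.5, 2.0, 10.0):
    print(alpha, accelerated(-10.0, 0.0, alpha))
```

For the same well, the accelerated energy moves closer to the threshold as α decreases, which is the flattening shown by the dashed curves of Fig. 1; energies above the threshold are left untouched.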
Practically, running an aMD simulation requires choosing not
only the parameters Ethresh and α but also the degrees of freedom to
which aMD should be applied. This is especially important since,
whereas Fig. 1 demonstrates aMD on a simple one-dimensional landscape, even simple biomolecular systems have thousands to millions
of degrees of freedom. Although any combination of potential
energy terms could be used, in general, there are three typical
implementations of aMD:
1. aMD may be applied to the total dihedral energy. Given their
importance in driving biomolecular structures, accelerating the
motions of all dihedrals is a natural choice for improving the
sampling of peptide conformations.
2. aMD may be applied to the entire potential energy surface.
Since the total potential energy is dominated by electrostatic
and, in explicit solvent simulations, solvent–solvent interactions, accelerating along the total potential landscape may
help speed sampling between states where charge–charge interactions provide stabilizing forces and it may increase the diffusive properties of the system.
3. As a hybrid of these two approaches, in the “dual boost”
method there are two aMD potentials applied: one to the
dihedral potential energy and the other to the total potential
energy. This combines the advantages of both approaches:
increased conformational sampling of backbones and sidechains, increased breaking and formation of hydrogen bonds
and other electrostatically driven interactions, and reduced
solvent viscosity.
In general, the dual boost approach is the one we recommend
most new users try for their system of interest. In theory, other
choices of the aMD potential form may be made, such as applying
aMD to only selected dihedrals of importance in a protein’s binding
site; however, the act of choosing the appropriate degrees of
freedom to accelerate increases the complexity of implementing
aMD, which takes away from the algorithm’s simplicity [13, 25]
(see Note 1).
In addition to choosing the aMD implementation, one must
also select a set of aMD parameters. In general, we have found that
there is a wide range of possible aMD parameters for any given
system, but a good place to start is to set the tuning parameter(s)
according to the size of the system and the threshold energy
(or energies) according to the average for the potential in a short
cMD simulation. Here, in a dual boost implementation, we chose
the following:
$$\begin{aligned}
\alpha_{D} &= (1/5) \times (3.5\ \text{kcal/mol}) \times (\text{num. of residues}) \\
E_{thresh,D} &= \alpha_{D} + (\text{avg. dihed. potential}) \\
\alpha_{total} &= (1/5) \times (1\ \text{kcal/mol}) \times (\text{num. of atoms}) \\
E_{thresh,total} &= \alpha_{total} + (\text{avg. total potential})
\end{aligned}$$
Here αD and αtotal are the tuning parameters for the dihedral
and total energy terms, and Ethresh,D and Ethresh,total the respective
threshold energies. If running only a dihedral or total boost simulation, then only the respective α and Ethresh should be used. In some
cases, we have found that these parameters may be insufficient to
achieve the desired level of sampling, in which case one may try
increasing each Ethresh parameter by an additional value of α. However, care must be taken not to “over-accelerate” and create
unwanted distortions in the system of interest, such as melting of
stable secondary structure elements.
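The starting-point recipe above amounts to four lines of arithmetic, sketched here as a helper function. The function name and the input values in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow; energies are in kcal/mol, and the averages would come from a short cMD run.

```python
def amd_parameters(n_residues, n_atoms, avg_dihedral, avg_total):
    """Starting dual-boost aMD parameters (kcal/mol) from system size
    and average cMD energies, following the recipe in the text."""
    alpha_dihedral = (1 / 5) * 3.5 * n_residues
    alpha_total = (1 / 5) * 1.0 * n_atoms
    return {
        "alpha_dihedral": alpha_dihedral,
        "ethresh_dihedral": alpha_dihedral + avg_dihedral,
        "alpha_total": alpha_total,
        "ethresh_total": alpha_total + avg_total,
    }


# Hypothetical peptide: 20 residues, 300 atoms, average dihedral energy of
# 100 kcal/mol and average total potential energy of -500 kcal/mol
params = amd_parameters(20, 300, 100.0, -500.0)
```

For this hypothetical input the tuning parameters come out near 14 and 60 kcal/mol and the thresholds near 114 and -440 kcal/mol. If sampling proves insufficient, the suggestion in the text corresponds to adding one further α to each threshold.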
2.2 Example: Exploring the Conformational Space of the Histone H3 and H4 N-Terminal Tails
For the purpose of this example, we took the N-terminal tails of
histone-3 (H3) and histone-4 (H4) from the nucleosome core
particle and modeled them as individual peptides of 42 and
23 residues, respectively, using the AMBER19SB force field [4]. Both have been
shown to have disordered states in solution and provide excellent
examples of difficult-to-sample IDPs based on their lengths and
cationic nature. To further enhance sampling, we used a
generalized Born implicit solvent model [26, 27], which speeds
sampling by drastically reducing both the system sizes and the
friction within each system.
To find suitable tuning parameters and threshold energies, we
first obtained the average dihedral and total potential energies by
running a short 500 ps cMD simulation of each system. We then set
the tuning parameter and energy thresholds using the protocol
detailed above. Here, the average dihedral and total potential energies
for H3 were 199 and −1361 kcal/mol; using the suggestion above,
the tuning parameters were set to 30.8 and 136.2 kcal/mol, and
the thresholds were set to 230 and −1225 kcal/mol for the dihedral and total potential boosts, respectively. Using NAMD (see Notes
2 and 3), we simulated 600 ns of aMD for each system with both
the dihedral and potential energies boosted using the above recommended values, as well as cMD simulations for comparison. For
each of these simulations, the root-mean-squared-deviation
(RMSD) matrix of each frame compared to the rest of the simulation is shown in Fig. 2. In the aMD simulations, the RMSD varies
quickly between frames, with differences as high as 12 and 7 Å in
the case of the H3 and H4 tails, respectively. In contrast, the cMD simulations
show reduced sampling for both peptides, with systems remaining
in long-lived/stable conformations for significantly longer periods
of time as seen by the reduction in the bright-colored lines and
appearance of “boxes” of low RMSD values.
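An RMSD matrix of the kind shown in Fig. 2 can be computed as sketched below. This is an illustrative implementation under stated assumptions, not the analysis script used for the figure: it assumes the trajectory is already available as a NumPy array of shape (frames, atoms, 3), uses the Kabsch superposition to remove overall rotation and translation, and tests the idea on synthetic coordinates. The function names are hypothetical.

```python
import numpy as np


def kabsch_rmsd(a, b):
    """RMSD between two (n_atoms, 3) coordinate sets after optimal superposition."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # The singular values of the covariance matrix give the best achievable overlap.
    u, s, vt = np.linalg.svd(a.T @ b)
    # Flip the smallest singular value if the optimal transform would be a reflection.
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        s[-1] = -s[-1]
    msd = (np.sum(a * a) + np.sum(b * b) - 2.0 * s.sum()) / len(a)
    return np.sqrt(max(msd, 0.0))


def rmsd_matrix(frames):
    """Symmetric matrix of pairwise RMSDs between all frames."""
    n = len(frames)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i, j] = matrix[j, i] = kabsch_rmsd(frames[i], frames[j])
    return matrix


# Synthetic three-frame "trajectory": a reference, a rotated and shifted copy,
# and an unrelated random structure.
rng = np.random.default_rng(0)
reference = rng.normal(size=(10, 3))
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta), np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])
frames = np.array([reference,
                   reference @ rotation.T + 5.0,
                   rng.normal(size=(10, 3))])
matrix = rmsd_matrix(frames)
print(matrix)
```

In such a matrix, long-lived conformations appear as square blocks of low values along the diagonal, which is the "boxes" pattern described for the cMD simulations.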
Since the altered potential lowers energy barriers, aMD simulations can show states not easily sampled in cMD simulations—thus
revealing motions that occur on timescales longer than the simulation. These states and motions can frequently be defined through a
set of collective variables. There are multiple ways to use aMD to
determine collective variables, including dimension reduction analysis like principal component analysis (PCA) [28] and leveraging
previous experimental and computational results (known conformational changes [29–31], for example). Visualization is a good
first step; here we chose to use VMD, as it has a graphical user
interface (GUI) plugin that allows the user to define a collective
variable and visualize how the quantity evolves over the course of
the trajectory [32] (see Note 4). This GUI uses the Colvars module
[33], the NAMD implementation of collective variables, thus any
variable defined by the GUI can be easily used in a new simulation.
Joseph Clayton et al.
Fig. 2 The root-mean-square deviation (RMSD) matrix for the H3 and H4 aMD simulations (left), with the cMD
simulations (right) for comparison
From our observations, we found that both the H3 and H4 tails
sampled a range of compact and extended conformations with
helical regions (Fig. 3). Both had an average helicity near 0.5, and
H3 sampled end-to-end distances up to 60 Å, whereas H4 only
sampled extensions up to 45 Å due to its shorter length. Based on
this, we defined two collective variables: the distance between the
backbone atoms of the terminal residues and the alpha Colvars
component which estimates the overall helicity (see Notes 5 and
6). In general, these are natural collective variables for sampling
peptides that have a helical propensity [34]. These were utilized in
the next section, in which the underlying free energy landscape was
rigorously quantified with metadynamics calculations.
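The end-to-end distance variable can be illustrated with a few lines of code. This is a simplified sketch: it uses the geometric centroid of each atom group, whereas a collective variable module may use mass-weighted centers, and the coordinates and 0-based indices below are invented for the example. The helicity variable is more involved and is not reproduced here.

```python
import math


def centroid(points):
    """Geometric center of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def end_to_end_distance(frame, group1, group2):
    """Distance between the centroids of two atom groups in a single frame.

    frame  : list of (x, y, z) coordinates, one per atom
    group1 : 0-based atom indices of the first terminal residue's backbone
    group2 : 0-based atom indices of the last terminal residue's backbone
    """
    c1 = centroid([frame[i] for i in group1])
    c2 = centroid([frame[i] for i in group2])
    return math.dist(c1, c2)


# Invented four-atom frame (coordinates in Å).
frame = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 4.0), (1.0, 3.0, 4.0)]
distance = end_to_end_distance(frame, group1=[0, 1], group2=[2, 3])
print(distance)
```

Evaluating this function over every frame and histogramming the result would give a sampling profile of the kind summarized in Fig. 3.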
Fig. 3 Sampling along the two selected collective variables (helicity and end-to-end
distance) for the aMD simulations. A sample structure from each system
(left) shows both peptides can form short helices separated by unstructured
loops. Panels (H3 and H4): normalized bin count plotted against helicity and against end-to-end distance (Å)
3 Quantifying Free Energy Landscapes with Metadynamics
3.1 Theory
In theory, converged potentials of mean force (PMFs) can be
calculated by performing a Boltzmann inversion of the sampling
in cMD simulations. However, the computational effort required
to do this for even small systems is typically intractable, since the simulations
spend the majority of their time in local minima and fail to
sample transitions and new free energy minima. Metadynamics helps to overcome this issue by adding a history-dependent
bias along a small number of collective variables [15–17]. This bias
consists of Gaussian deposits that are periodically added at time
intervals of τ:
$$
V_{\mathrm{bias}}(x,t) = w \sum_{\substack{t'=\tau,\,2\tau,\,\ldots \\ t' \le t}} \exp\left(-\frac{\left(x - x(t')\right)^{2}}{2\,\delta x^{2}}\right) \qquad (2)
$$
where x is a point in collective variable space, x(t′) is the value of the collective
variable at deposition time t′, and w and δx are the height and width parameters of
the Gaussian deposits, respectively. As the system samples a local
free energy minimum, the introduced bias slowly grows until it “fills”
the minimum, allowing the system to escape and sample other states.
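To make Eq. 2 concrete, the bias at any point can be evaluated from the list of past deposit centers. The following is a minimal one-dimensional sketch with invented numbers; the function name `metadynamics_bias` is hypothetical and is not part of NAMD or Colvars.

```python
import math


def metadynamics_bias(x, centers, height, width):
    """Sum of Gaussian deposits (Eq. 2) evaluated at x.

    centers : values of the collective variable at each past deposition time
    height  : Gaussian height w
    width   : Gaussian width (delta x)
    """
    return height * sum(
        math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers
    )


# Three deposits made while the system sat near x = 0 (invented values).
centers = [0.0, 0.0, 0.1]
bias_at_well = metadynamics_bias(0.0, centers, height=0.5, width=0.1)
bias_far_away = metadynamics_bias(2.0, centers, height=0.5, width=0.1)
print(bias_at_well, bias_far_away)
```

The bias is large where the system has lingered and essentially zero in regions it has not yet visited, which is what pushes the system out of a filled minimum.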
To show how the bias grows, an example of a one-dimensional
metadynamics simulation is shown in Fig. 4; the system initially
remains in the global minimum, causing the bias to increase in that
region over time. The bias eventually compensates for the minimum, allowing the system to easily sample the second minimum;
eventually the bias compensates for both minima, making the effective landscape flat. Note that the flattening of the landscape is akin
to aMD; unlike aMD, however, the bias from metadynamics is time
dependent and is not uniform over the course of the simulation.
Fig. 4 An illustration of metadynamics calculations along a single variable with
two stable states separated by a barrier. The simulation initially starts in one of
these states, and as it progresses the periodic deposits to the bias “fill” the
potential well (top row), allowing the simulation to cross the barrier and sample
the second state. As the simulation progresses and samples the second state,
the bias fully compensates the underlying free energy surface and provides an
estimation of the surface along the variable (bottom row). Panels: Initial, Intermediate, Final, and Estimated; each plots G(x) against the collective variable (CV) x
Since the bias aims to sample and match the underlying landscape, the negative of the bias will estimate the shape of the system’s
free energy landscape [35–37]:
$$
\lim_{t \to \infty} V_{\mathrm{bias}}(x,t) = -G(x) + C
$$
Once the underlying energy surface has been balanced by the
bias, any additional Gaussian deposits introduce error into the
estimate and cause it to fluctuate around the true landscape. To
prevent this fluctuation, Barducci et al. developed a “well-tempered” version where the height of the Gaussian deposits decreases
as they are deposited in the same region [38]:
$$
w'(x,t) = w \exp\left(-\frac{V_{\mathrm{bias}}(x,t)}{\Delta T}\right), \qquad
G(x) = -\frac{T + \Delta T}{\Delta T}\, V_{\mathrm{bias}}(x,t) \qquad (3)
$$
Here a new parameter, ΔT, determines how quickly deposits
decrease in height as a minimum is filled. Since simulations cannot
be run indefinitely, this parameter also introduces a maximum
threshold to Vbias; this threshold can be used to limit the collective
variable sampling to only biologically relevant regions [38] (see
Note 7).
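The two relations in Eq. 3 can be sketched numerically as follows. Several assumptions are made here and should be checked against the software in use: the Boltzmann constant is written explicitly so that ΔT can be given in kelvin (Eq. 3 as printed treats ΔT as an energy), a simulation temperature of 300 K is assumed, and the values 0.5 kcal/mol and 2000 K are chosen to mirror the hill height and bias temperature used later in the chapter. The function names are hypothetical.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def tempered_hill_height(initial_height, bias_here, delta_t):
    """Height of the next deposit at a point where the bias is already `bias_here`."""
    return initial_height * math.exp(-bias_here / (KB * delta_t))


def pmf_from_bias(bias, temperature, delta_t):
    """Free energy estimate from the accumulated well-tempered bias (up to a constant)."""
    return -(temperature + delta_t) / delta_t * bias


first_hill = tempered_hill_height(0.5, bias_here=0.0, delta_t=2000.0)
later_hill = tempered_hill_height(0.5, bias_here=10.0, delta_t=2000.0)
pmf_value = pmf_from_bias(2.0, temperature=300.0, delta_t=2000.0)
print(first_hill, later_hill, pmf_value)
```

The first deposit has the full height, later deposits in an already-biased region shrink exponentially, and the accumulated bias is rescaled by (T + ΔT)/ΔT to recover the free energy.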
In NAMD, collective variables can be defined by activating the
Colvars module, which takes a configuration file as input. This
configuration file consists of blocks that define the variables and
biases; an example Colvars configuration file is shown in Fig. 5,
where our two collective variables (helicity and end-to-end distance) and a metadynamics protocol are defined for the H3 tail
system. For this system, we created a grid with resolutions of 0.025 and 2.5 Å
for the helicity and distance, respectively, which results in
approximately 40 bins along each of the helicity and distance coordinates;
NAMD uses this spacing to determine the width of the Gaussian
deposits and the resolution of the resulting energy landscape estimate. Increasing the resolution (i.e., decreasing the grid width) will
give more detailed estimates; however, the bias will evolve more
slowly and thus the landscape estimate will require more simulation
time to converge.
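The trade-off between grid resolution and cost can be checked with simple arithmetic. The sketch below assumes the boundaries listed in the example configuration of Fig. 5 (0 to 1 for helicity, 0 to 120 Å for distance); the helper `n_bins` is hypothetical.

```python
def n_bins(lower, upper, width):
    """Number of grid bins between two boundaries for a given bin width."""
    return round((upper - lower) / width)


helicity_bins = n_bins(0.0, 1.0, 0.025)
distance_bins = n_bins(0.0, 120.0, 2.5)
grid_points = helicity_bins * distance_bins

# Halving both widths quadruples the number of grid points to be filled.
finer_grid_points = n_bins(0.0, 1.0, 0.0125) * n_bins(0.0, 120.0, 1.25)
print(helicity_bins, distance_bins, grid_points, finer_grid_points)
```

With these boundaries the grid has 40 helicity bins and 48 distance bins; doubling the resolution in both variables gives four times as many points, which is one way to see why finer grids take longer to converge.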
3.2 Example: Using Metadynamics to Quantify Free Energy Landscapes of the Histone H3 and H4 N-Terminal Tails
To examine the thermodynamics of the H3 and H4 tails, both
peptides were solvated using the OPC water model [39] in a 150 mM
NaCl environment. We then performed a single 2 μs well-tempered
metadynamics simulation for each model, utilizing the two collective variables based on our aMD results and the metadynamics
parameters discussed above. In each of these simulations, there
was significant sampling in the end-to-end distance and helicity coordinate
spaces as the peptides rapidly transitioned between diverse configurations (see Fig. 6 for details).
There are multiple methods for assessing convergence in metadynamics simulations. Here, we take advantage of the property
inherent in well-tempered simulations that the hill heights
decrease as a region of phase space is repeatedly sampled. The
heights of the Gaussian deposits were monitored over time
(Fig. 6), and while they started at the initial value of 0.5 kcal/mol
in both systems, the heights of new deposits approached zero around
1.2 μs in the case of the H3 tail and 1.5 μs for the H4 tail. This
indicates that additional sampling will have minimal effect on the
computed PMF, as is shown by the cumulative hill heights
converging around these times as well. Indeed, we observed little
difference in the PMFs computed after 1.2 μs in H3 and 1.5 μs in
H4 tails. Plotting the hill height as a function of time is also
instructive for highlighting when systems sample a new region of
collective variable space, as sudden increases in the hill height (such
as in H3 around 0.7 μs) are indicative of sampling regions without
previously deposited hills.
colvarsTrajFrequency 500
colvarsRestartFrequency 1000

colvar {
    name heli
    width 0.025
    lowerBoundary 0.0
    upperBoundary 1.0
    alpha {
        residueRange 2-43
    }
}

colvar {
    name dist
    width 2.5
    lowerBoundary 0.0
    upperBoundary 120.0
    distance {
        group1 {
            # Selection: "resid 2 and backbone"
            atomNumbers 7 9 15 17
        }
        group2 {
            # Selection: "resid 43 and backbone"
            atomNumbers 652 654 674 675
        }
    }
}

metadynamics {
    name meta-H3
    colvars heli dist
    hillWeight 0.5
    newHillFrequency 500
    dumpFreeEnergyFile yes
    writeHillsTrajectory on
    hillWidth 1.0
    wellTempered on
    biasTemperature 2000
}
Fig. 5 An example of a Colvars configuration file, generated from the Colvars
Dashboard VMD plugin. The file consists of two types of code blocks: a type that
defines a variable and one that defines a bias or protocol. Here two blocks define
the helicity and distance parameters, while the final block defines a
metadynamics protocol. Note that the first two lines are not set in a block;
these are two global parameters in the Colvars module that set how often output
files are written
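The hill-height convergence check described above can be sketched as follows. This is a minimal illustration, assuming the deposit times and heights have already been read into two lists (the layout of the hills output file is not assumed here); the function name, the tolerance, and the synthetic data are all hypothetical.

```python
from itertools import accumulate


def convergence_time(times, heights, tolerance):
    """Earliest time after which every deposited hill stays below `tolerance`.

    Returns None if the most recent hills have not dropped below the tolerance.
    A late spike in hill height (a newly visited region) resets the estimate.
    """
    converged_at = None
    for t, h in zip(times, heights):
        if h < tolerance:
            if converged_at is None:
                converged_at = t
        else:
            converged_at = None
    return converged_at


# Synthetic hill heights (kcal/mol) at a few deposition times (microseconds).
times = [0.0, 0.5, 1.0, 1.5, 2.0]
heights = [0.5, 0.01, 0.3, 0.01, 0.005]
cumulative = list(accumulate(heights))
estimate = convergence_time(times, heights, tolerance=0.05)
print(estimate, cumulative)
```

In this synthetic series the spike at 1.0 resets the estimate, so the reported convergence time is 1.5 rather than 0.5; the cumulative sum levels off over the same interval.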
Fig. 6 Sampling and convergence of metadynamics calculations. Both the H3
and H4 peptides rapidly transition between helical and nonhelical, as well as
extended vs compact states. The hill heights as a function of time can be used to
gauge convergence, as the heights of hills added in well-sampled landscapes
will approach zero, and the cumulative hill heights will level off. Panels (H3 and H4): hill height, cumulative hill height, helicity, and end-to-end distance (Å), each plotted against simulation time (μs)
The final free energy landscapes are shown in Fig. 7; both
landscapes are largely flat and indicate that these peptides exist in a
variety of conformations in solution. The global minimum for both
systems lies at an end-to-end distance of 4 Å, which corresponds to interactions between the N- and C-terminal residues;
however, both landscapes cover a wide range
of helicities and end-to-end distances, and neither has large free
energy barriers dividing metastable states. From this, we can conclude that both peptides are capable of sampling a wide range of
conformations and are not locked into discrete states. While the
sampling in these simulations is similar to that in the implicit solvent aMD
simulations above (Fig. 3), those simulations suggested a stronger
preference for helical structures. For example, the H4 tail strongly
sampled helicities between 0.4 and 0.6 in the aMD simulations, but
the metadynamics PMFs reveal a range of accessible states between
0.2 and 0.8 with little preference for a particular value. This discrepancy is likely due to both the more robust reweighting mechanisms used in metadynamics and the different solvent models; in this
case, the explicit solvent simulation with the OPC water model is known to
match experiments for the H4 tail [40]. Nevertheless, an implicit
model can be useful for finding long-timescale motions as the
simplified model is less computationally intensive.
Fig. 7 Potentials of mean force (PMFs) for H3 and H4 tails as computed from
well-tempered metadynamics calculations. Both landscapes are overall
relatively flat, with broad energy wells indicating that conformations covering
a wide range of helicities and end-to-end distances are easily accessible in
solution. All energies are in kcal/mol
4 Conclusions
Here, we described a hierarchical approach for characterizing the
conformational space of peptides. Initially, accelerated MD simulations are used to quickly scan the accessible peptide states. Given
that aMD already disturbs the potential energy landscape, we
elected to use an implicit solvent model in this stage as the goal
was to qualitatively describe the energetically accessible peptide
conformations. These aMD simulations were then manually
inspected to determine appropriate collective variables, which
were then used in well-tempered metadynamics simulations with
explicit solvent. Although explicit solvent systems run at a fraction
of the speed of implicit solvent simulations, these were able to
accurately quantify the underlying free energy landscapes of each
system. Both aMD and metadynamics have their respective
strengths and weaknesses, and combining both of them can lead
to the efficient and rigorous characterization of peptide states.
5 Notes
1. Several variations of aMD exist, including selective aMD [25],
windowed aMD [41], replica exchange aMD [42], rotatable
dihedral aMD [43], and Gaussian aMD [40, 44]. Among these
methods are ways to approximate the underlying free energy by
reweighting frames directly from the aMD trajectories [45];
here, we elected to use aMD with implicit solvent only to
quickly sample different peptide configurations, then to
robustly compute the free energy surface with metadynamics
in explicit solvent.
2. Here we chose to use NAMD, but many molecular dynamics
engines have aMD and free energy estimation methods implemented, including umbrella sampling [46, 47], steered molecular dynamics [48, 49], adaptive biasing force [50], and
adaptively biased molecular dynamics [51]. These methods
work similarly to metadynamics, as they each incorporate a bias
into the system to sample and estimate the free energy along a
collective variable space. Each varies in how the bias is applied
and how the free energy is calculated, but all have been well
studied and developed [52, 53]; if metadynamics is not implemented in the desired engine, these can be suitable alternatives.
3. Here we boosted both the dihedral and total potential terms;
however, the implementation in NAMD applies a boost to the
dihedral and (total − dihedral) potentials. The reason behind
this is to avoid boosting the dihedral potential twice, as the
total potential includes the dihedral energy. Our recommended
method of finding optimal aMD parameters thus should be
adjusted, such that the energy threshold for the total potential
uses the average total potential energy from a short cMD
simulation without the dihedral term. This discrepancy is not
present in AMBER, as it will boost the total potential and add
an additional boost to the dihedral term while in dual
boost mode.
4. While other visualization packages exist, VMD has dashboard
plugins that work well with the Colvars module and
PLUMED, another collective variable module used in GROMACS [54] (as well as NAMD v2.12 and later). These plugins
provide several useful tools, including defining variables from
VMD’s atom selection, plotting the evolution of a variable over
the trajectory, and plotting one variable against a second. The
user can thus create and visualize a collective variable, compare
variables to each other, and output a configuration file once a
suitable set has been found.
5. Finding good collective variables can be difficult and can
depend on the question at hand. Here we used implicit solvent
and dual boost aMD to enhance sampling; however, both of
these techniques can lead to artifacts. A good rule of thumb is
to ensure known secondary structure, if any, is conserved; if this
is not conserved, reduce the aMD bias by lowering the threshold energies in multiples of α. Here we used both methods in
order to quickly determine extreme motions, which may or
may not be relevant to biological processes of the histone tails.
6. Here we focused on using two collective variables, but in
theory one can use any number of variables in metadynamics.
However, increasing the number of variables exponentially
increases the sampling space, which in turn increases the simulation effort needed to reach convergence, and in practice it is often not
feasible to use more than three dimensions. The aim should
always be to find the minimum number of variables needed to
describe the quantity in question. The correlation between two
collective variables can be estimated by plotting one variable
against another; such a pairwise plot can be easily made using
the Colvars dashboard.
7. This chapter discussed the original and well-tempered metadynamics methods; however, other variants exist, including
multiple-walker metadynamics [55], ensemble-biased metadynamics [56], and a combination of
metadynamics with the adaptive biasing force algorithm
[57]. The multiple walker variant is a popular method, as it
allows the user to run parallel simulations; since the communications between simulations are minimal, this method can
efficiently scale on clusters of loosely coupled nodes. One could
follow our example here and use aMD to seed different initial
conditions for multiple walkers, thus sampling different regions
of collective variable space simultaneously.
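The adjustment described in Note 3 can be sketched as below. This reflects our reading of the note rather than a documented recipe: when the engine boosts the dihedral and (total − dihedral) potentials separately, the threshold for the second boost is built from the average of the total potential with the dihedral term removed. The function name and the input numbers are hypothetical.

```python
def non_dihedral_boost_parameters(avg_total, avg_dihedral, n_atoms):
    """Tuning parameter and threshold for a boost applied to (total - dihedral).

    avg_total and avg_dihedral are averages (kcal/mol) from a short cMD run.
    """
    alpha_total = 1.0 * n_atoms / 5.0           # (1/5)(1 kcal/mol)(number of atoms)
    avg_without_dihedral = avg_total - avg_dihedral
    return alpha_total, avg_without_dihedral + alpha_total


# Made-up averages for a 300-atom peptide.
alpha, threshold = non_dihedral_boost_parameters(
    avg_total=-500.0, avg_dihedral=100.0, n_atoms=300
)
print(alpha, threshold)
```

For these invented values the threshold is −540 kcal/mol, compared with −440 kcal/mol if the dihedral energy were left in the average.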
Acknowledgments
This work in the Wereszczynski group was supported by the
National Science Foundation [MCB-1716099] and the National
Institutes of Health [1R35GM119647].
References
1. Karplus M, McCammon JA (2002) Molecular
dynamics simulations of biomolecules. Nat
Struct Biol 9:646–652. https://doi.org/10.
1038/nsb0902-646
2. Hollingsworth SA, Dror RO (2018) Molecular
dynamics simulation for all. Neuron 99:
1129–1143. https://doi.org/10.1016/j.neu
ron.2018.08.011
3. Huang J, Rauscher S, Nawrocki G et al (2017)
CHARMM36m: an improved force field for
folded and intrinsically disordered proteins.
Nat Methods 14:71–73. https://doi.org/10.
1038/nmeth.4067
4. Tian C, Kasavajhala K, Belfon KAA et al (2020)
ff19SB: amino-acid-specific protein backbone
parameters trained against quantum mechanics
energy surfaces in solution. J Chem Theory
Comput 16:528–552. https://doi.org/10.
1021/acs.jctc.9b00591
5. Salomon-Ferrer R, Götz AW, Poole D et al
(2013) Routine microsecond molecular
dynamics simulations with AMBER on GPUs.
2. Explicit solvent particle mesh Ewald. J Chem
Theory Comput 9:3878–3888. https://doi.
org/10.1021/ct400314y
6. Phillips JC, Hardy DJ, Maia JDC et al (2020)
Scalable molecular dynamics on CPU and GPU
architectures with NAMD. J Chem Phys 153:
044130.
https://doi.org/10.1063/5.
0014475
7. Shaw DE, Deneroff MM, Dror RO et al
(2008) Anton, a special-purpose machine for
molecular dynamics simulation. Commun
ACM 51:91–97. https://doi.org/10.1145/
1364782.1364802
8. Ohmura I, Morimoto G, Ohno Y et al (2014)
MDGRAPE-4: a special-purpose computer
system for molecular dynamics simulations.
Phil Trans R Soc A 372:20130387. https://
doi.org/10.1098/rsta.2013.0387
9. Hamelberg D, Mongan J, McCammon JA
(2004) Accelerated molecular dynamics: a
promising and efficient simulation method for
biomolecules. J Chem Phys 120:
11919–11929. https://doi.org/10.1063/1.
1755656
10. Hamelberg D, de Oliveira CAF, McCammon
JA (2007) Sampling of slow diffusive conformational transitions with accelerated molecular
dynamics. J Chem Phys 127:155102. https://
doi.org/10.1063/1.2789432
11. Grant BJ, Gorfe AA, McCammon JA (2009)
Ras conformational switching: simulating
nucleotide-dependent conformational transitions with accelerated molecular dynamics.
PLoS Comput Biol 5:e1000325. https://doi.
org/10.1371/journal.pcbi.1000325
12. de Oliveira CAF, Grant BJ, Zhou M, McCammon JA (2011) Large-scale conformational
changes of Trypanosoma cruzi proline racemase predicted by accelerated molecular
dynamics simulation. PLoS Comput Biol 7:
e1002178. https://doi.org/10.1371/journal.
pcbi.1002178
13. Doshi U, Hamelberg D (2015) Towards fast,
rigorous and efficient conformational sampling
of biomolecules: advances in accelerated
molecular dynamics. Biochim Biophys Acta
Gen Subj 1850:878–888. https://doi.org/
10.1016/j.bbagen.2014.08.003
14. Kamenik AS, Lessel U, Fuchs JE et al (2018)
Peptidic macrocycles—conformational sampling and thermodynamic characterization. J
Chem Inf Model 58:982–992. https://doi.
org/10.1021/acs.jcim.8b00097
15. Laio A, Gervasio FL (2008) Metadynamics: a
method to simulate rare events and reconstruct
the free energy in biophysics, chemistry and
material science. Rep Prog Phys 71:126601.
https://doi.org/10.1088/0034-4885/71/
12/126601
16. Barducci A, Bonomi M, Parrinello M (2011)
Metadynamics. WIREs Comput Mol Sci 1:
826–843. https://doi.org/10.1002/wcms.31
17. Bussi G, Laio A (2020) Using metadynamics to
explore complex free-energy landscapes. Nat
Rev Phys 2:200–212. https://doi.org/10.
1038/s42254-020-0153-0
18. Bochicchio D, Panizon E, Ferrando R et al
(2015) Calculating the free energy of transfer
of small solutes into a model lipid membrane:
comparison between metadynamics and
umbrella sampling. J Chem Phys 143:
144108.
https://doi.org/10.1063/1.
4932159
19. Capelli R, Bochicchio A, Piccini G et al (2019)
Chasing the full free energy landscape of neuroreceptor/ligand unbinding by metadynamics
simulations. J Chem Theory Comput 15:
3354–3361.
https://doi.org/10.1021/acs.
jctc.9b00118
20. Tanida Y, Matsuura A (2020) Alchemical free
energy calculations via metadynamics: application to the theophylline-RNA aptamer complex. J Comput Chem 41:1804–1819.
https://doi.org/10.1002/jcc.26221
21. Potoyan DA, Papoian GA (2011) Energy landscape analyses of disordered histone tails reveal
special organization of their conformational
dynamics. J Am Chem Soc 133:7405–7415.
https://doi.org/10.1021/ja1111964
22. Iwasaki W, Miya Y, Horikoshi N et al (2013)
Contribution of histone N-terminal tails to the
structure and stability of nucleosomes. FEBS
Open Bio 3:363–369. https://doi.org/10.
1016/j.fob.2013.08.007
23. Erler J, Zhang R, Petridis L et al (2014) The
role of histone tails in the nucleosome: a
computational study. Biophys J 107:
2911–2922. https://doi.org/10.1016/j.bpj.
2014.10.065
24. Wang Y, Harrison CB, Schulten K, McCammon JA (2011) Implementation of accelerated
molecular dynamics in NAMD. Comput Sci
Disc 4:015002. https://doi.org/10.1088/
1749-4699/4/1/015002
25. Wereszczynski J, McCammon JA (2010) Using
selectively applied accelerated molecular
dynamics to enhance free energy calculations.
J Chem Theory Comput 6:3285–3292.
https://doi.org/10.1021/ct100322t
26. Onufriev A, Bashford D, Case DA (2000)
Modification of the generalized born model
suitable for macromolecules. J Phys Chem B
104:3712–3720. https://doi.org/10.1021/
jp994072s
27. Onufriev A, Bashford D, Case DA (2004)
Exploring protein native states and large-scale
conformational changes with a modified
generalized born model. Proteins 55:
383–394.
https://doi.org/10.1002/prot.
20033
28. Wereszczynski J, McCammon JA (2012)
Nucleotide-dependent mechanism of Get3 as
elucidated from free energy calculations. Proc
Natl Acad Sci 109:7759–7764. https://doi.
org/10.1073/pnas.1117441109
29. Bešker N, Gervasio FL (2012) Using metadynamics and path collective variables to study
ligand binding and induced conformational
transitions. In: Baron R (ed) Computational
drug discovery and design. Springer,
New York, NY, pp 501–513
30. Matsunaga Y, Komuro Y, Kobayashi C et al
(2016) Dimensionality of collective variables
for describing conformational changes of a
multi-domain protein. J Phys Chem Lett 7:
1446–1451.
https://doi.org/10.1021/acs.
jpclett.6b00317
31. Ahalawat N, Mondal J (2018) Assessment and
optimization of collective variables for protein
conformational landscape: GB1 β-hairpin as a
case study. J Chem Phys 149:094101. https://
doi.org/10.1063/1.5041073
32. Humphrey W, Dalke A, Schulten K
(1996) VMD: visual molecular dynamics. J
Mol Graph 14:33–38. https://doi.org/10.
1016/0263-7855(96)00018-5
33. Fiorin G, Klein ML, Hénin J (2013) Using
collective variables to drive molecular dynamics
simulations. Mol Phys 111:3345–3362.
https://doi.org/10.1080/00268976.2013.
813594
34. Hazel A, Chipot C, Gumbart JC (2014) Thermodynamics of Deca-alanine folding in water. J
Chem Theory Comput 10:2836–2844.
https://doi.org/10.1021/ct5002076
35. Laio A, Rodriguez-Fortea A, Gervasio FL et al
(2005) Assessing the accuracy of metadynamics.
J Phys Chem B 109:6714–6721. https://
doi.org/10.1021/jp045424k
36. Bussi G, Laio A, Parrinello M (2006) Equilibrium free energies from nonequilibrium metadynamics. Phys Rev Lett 96:090601. https://
doi.org/10.1103/PhysRevLett.96.090601
37. Crespo Y, Marinelli F, Pietrucci F, Laio A
(2010) Metadynamics convergence law in a
multidimensional system. Phys Rev E 81:
055701.
https://doi.org/10.1103/
PhysRevE.81.055701
38. Barducci A, Bussi G, Parrinello M (2008) Welltempered metadynamics: a smoothly converging and tunable free-energy method. Phys Rev
Lett 100:020603. https://doi.org/10.1103/
PhysRevLett.100.020603
39. Izadi S, Anandakrishnan R, Onufriev AV
(2014) Building water models: a different
approach. J Phys Chem Lett 5:3863–3871.
https://doi.org/10.1021/jz501780a
40. Shabane PS, Izadi S, Onufriev AV (2019) General purpose water model can improve atomistic simulations of intrinsically disordered
proteins. J Chem Theory Comput 15:
2620–2634.
https://doi.org/10.1021/acs.
jctc.8b01123
41. Sinko W, de Oliveira CAF, Pierce LCT,
McCammon JA (2012) Protecting high energy
barriers: a new equation to regulate boost
energy in accelerated molecular dynamics
simulations. J Chem Theory Comput 8:
17–23. https://doi.org/10.1021/ct200615k
42. Fajer M, Hamelberg D, McCammon JA
(2008) Replica-exchange accelerated molecular dynamics (REXAMD) applied to thermodynamic integration. J Chem Theory Comput
4:1565–1569.
https://doi.org/10.1021/
ct800250m
43. Doshi U, Hamelberg D (2012) Improved statistical sampling and accuracy with accelerated
molecular dynamics on rotatable torsions. J
Chem Theory Comput 8:4004–4012.
https://doi.org/10.1021/ct3004194
44. Miao Y, Feher VA, McCammon JA (2015)
Gaussian accelerated molecular dynamics:
unconstrained enhanced sampling and free
energy calculation. J Chem Theory Comput
11:3584–3595.
https://doi.org/10.1021/
acs.jctc.5b00436
45. Miao Y, Sinko W, Pierce L et al (2014)
Improved reweighting of accelerated molecular
dynamics simulations for free energy calculation. J Chem Theory Comput 10:2677–2689.
https://doi.org/10.1021/ct500090q
46. Kumar S, Rosenberg JM, Bouzida D et al
(1992) The weighted histogram analysis
method for free-energy calculations on
biomolecules. I. The method. J Comput
Chem 13:1011–1021. https://doi.org/10.
1002/jcc.540130812
47. Kumar S, Rosenberg JM, Bouzida D et al
(1995) Multidimensional free-energy calculations using the weighted histogram analysis
method. J Comput Chem 16:1339–1350.
https://doi.org/10.1002/jcc.540161104
48. Park S, Khalili-Araghi F, Tajkhorshid E, Schulten K (2003) Free energy calculation from
steered molecular dynamics simulations using
Jarzynski’s equality. J Chem Phys 119:
3559–3566.
https://doi.org/10.1063/1.
1590311
49. Jarzynski C (1997) Nonequilibrium equality
for free energy differences. Phys Rev Lett 78:
2690–2693. https://doi.org/10.1103/Phy
sRevLett.78.2690
50. Darve E, Rodrı́guez-Gómez D, Pohorille A
(2008) Adaptive biasing force method for scalar and vector free energy calculations. J Chem
Phys 128:144120. https://doi.org/10.1063/
1.2829861
51. Babin V, Roland C, Sagui C (2008) Adaptively
biased molecular dynamics for free energy calculations. J Chem Phys 128:134101. https://
doi.org/10.1063/1.2844595
52. Wereszczynski J, McCammon JA (2012) Statistical mechanics and molecular dynamics in
evaluating thermodynamic properties of biomolecular recognition. Q Rev Biophys 45:
1–25.
https://doi.org/10.1017/
S0033583511000096
53. Chipot C (2014) Frontiers in free-energy calculations of biological systems. WIREs Comput Mol
Sci 4:71–89. https://doi.org/10.1002/wcms.
1157
54. Abraham MJ, Murtola T, Schulz R et al (2015)
GROMACS: high performance molecular
simulations through multi-level parallelism
from laptops to supercomputers. SoftwareX
1–2:19–25. https://doi.org/10.1016/j.softx.
2015.06.001
55. Raiteri P, Laio A, Gervasio FL et al (2006)
Efficient reconstruction of complex free energy
landscapes by multiple walkers metadynamics. J
Phys Chem B 110:3533–3539. https://doi.
org/10.1021/jp054359r
56. Marinelli F, Faraldo-Gómez JD (2015)
Ensemble-biased metadynamics: a molecular
simulation method to sample experimental distributions. Biophys J 108:2779–2782. https://
doi.org/10.1016/j.bpj.2015.05.024
57. Fu H, Shao X, Cai W, Chipot C (2019) Taming
rugged free energy landscapes using an average
force. Acc Chem Res 52:3254–3264. https://
doi.org/10.1021/acs.accounts.9b00473
Chapter 9
Metadynamics Simulations to Study the Structural
Ensembles and Binding Processes of Intrinsically
Disordered Proteins
Rui Zhou and Mojie Duan
Abstract
The structures of intrinsically disordered proteins (IDPs) are highly dynamic. It is hard to characterize the
structures of these proteins experimentally. Molecular dynamics (MD) simulation is a powerful tool in the
understanding of protein dynamic structures and function. This chapter describes the application of
metadynamics-based enhanced sampling methods in the study of how phosphorylation regulates the
structure of the kinase-inducible domain (KID). The structural properties of free pKID and KID were
obtained by the parallel tempering metadynamics combined with well-tempered ensemble (PTMetaD-WTE)
method, and the binding free energy surfaces of pKID/KID and KIX were characterized by bias-exchange
metadynamics (BE-MetaD) simulations.
Key words Structure ensemble, Intrinsically disordered protein, Binding processes, Molecular
dynamics simulations, Metadynamics, Kinase-inducible domain
1 Introduction
In this chapter, we focus on how to use metadynamics
simulations to study intrinsically disordered proteins [1]. The
kinase-inducible domain (KID) is used as an example [2, 3]. KID is a
phosphorylation-inducible domain: phosphorylation of Ser133
of KID stimulates gene expression, which depends on the interaction
between KID and the KIX domain of the transcriptional coactivator CREB-binding protein (CBP) [4]. The phosphorylated
KID (pKID) undergoes a disordered-to-helical structure transition
upon binding to KIX. The bound structure of pKID is formed by
two α-helices, i.e., αA (from residue 120 to 129) and αB (from
residue 132 to 144) [5]. The phosphorylation on Ser133 is critical
for the binding between pKID and KIX and increases the affinity by
almost two orders of magnitude [6, 7].
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_9,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
The conformational spaces of free KID and pKID were sampled by parallel-tempering metadynamics combined with the well-tempered ensemble (PTMetaD-WTE) [8], and the binding process between pKID and KIX was studied by bias-exchange metadynamics (BE-MetaD) [9]. Metadynamics [10, 11] was developed to compute the free energy profile along predefined reaction coordinates (RCs), also called collective variables (CVs). The evolution of the system is driven by an external bias potential that is updated with a given period. However, metadynamics suffers from two major problems: (a) it is hard to evaluate the convergence of the free energy surface and to decide when to stop a simulation; (b) it is not trivial to select appropriate CVs for describing complex processes. Here, two advanced metadynamics techniques, PTMetaD-WTE and BE-MetaD, were employed to overcome these problems. In PTMetaD-WTE, multiple replicas are simulated at different temperatures and periodic exchanges between replicas are attempted. In this way, the approach allows the system to overcome high free energy barriers. BE-MetaD allows exchanges of conformations between replicas biased along different CVs and therefore dramatically increases the sampling efficiency.
The results show that both pKID and KID are disordered, with
some transient helical structures [1]. However, more hydrophobic
interactions are formed in pKID. Our results revealed that the
binding of the intrinsically disordered pKID follows a flexible conformational selection mechanism.
2 Theory
2.1 Metadynamics
In metadynamics, a history-dependent external bias potential is
added to the energy function of the system. This potential can be
written as a sum of Gaussians deposited as a function of the CVs to
prevent the system from visiting conformations similar to those that
have already been sampled. The bias potential is as follows:
$$V(S,t) = \int_0^{t} dt'\, \omega \exp\left(-\sum_{i=1}^{m} \frac{\left[S_i(R) - S_i(R(t'))\right]^2}{2\sigma_i^2}\right) \tag{1}$$
where ω and σ_i are the height and width of the Gaussian bias, respectively, m is the number of CVs, and S_i(R) is the value of the ith CV for coordinates R.
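In practice the integral in Eq. 1 is evaluated as a sum over Gaussians deposited at discrete times. The following is a minimal Python sketch of that discrete sum, not the PLUMED implementation; the function name `metad_bias` and the example CV values are illustrative assumptions.

```python
import math

def metad_bias(s, centers, sigmas, height):
    """Discrete form of Eq. 1: sum of Gaussians centred on earlier CV values.

    s       : current CV values, one entry per CV
    centers : CV values at which Gaussians were deposited (one tuple per deposit)
    sigmas  : Gaussian width sigma_i for each CV
    height  : Gaussian height omega
    """
    total = 0.0
    for c in centers:
        # Exponent is the sum over CVs of (S_i - S_i(t'))^2 / (2 sigma_i^2)
        exponent = sum((si - ci) ** 2 / (2.0 * sg ** 2)
                       for si, ci, sg in zip(s, c, sigmas))
        total += height * math.exp(-exponent)
    return total

# Hypothetical example: two Gaussians deposited in a two-CV space
centers = [(0.0, 0.0), (0.5, 0.0)]
v_at_center = metad_bias((0.0, 0.0), centers, sigmas=(0.2, 0.2), height=1.0)
v_far = metad_bias((5.0, 5.0), centers, sigmas=(0.2, 0.2), height=1.0)
```

The bias is largest where the system has already been and vanishes far from visited regions, which is what pushes the simulation toward unexplored CV values.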
2.2 PTMetaD-WTE
In temperature-based replica-exchange molecular dynamics (tREMD), multiple replicas run simultaneously at different temperatures, and exchanges between adjacent replicas are attempted and accepted according to the Metropolis criterion. Exchanging a low-temperature replica with a high-temperature replica makes it possible to overcome energetic barriers. The acceptance probability for an exchange involving replicas a and b is:
$$\min\left\{1,\ \exp\left[(\beta_b-\beta_a)\,\bigl(U(R_b)-U(R_a)\bigr) + \beta_a\,\bigl(V_a(S(R_a)) - V_a(S(R_b))\bigr) + \beta_b\,\bigl(V_b(S(R_b)) - V_b(S(R_a))\bigr)\right]\right\} \tag{2}$$
where β = 1/(kBT), kB is the Boltzmann constant, and T is the temperature of a given replica. U(R) is the potential energy of the system and V is the bias potential. If the exchange is accepted, the coordinates of replicas a and b are swapped.
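The acceptance test of Eq. 2 can be written as a small function. This is a hedged sketch rather than the GROMACS/PLUMED code: the function name, argument names, and the example energies are assumptions, and energies are taken to be in kJ/mol.

```python
import math

KB = 0.0083144626  # Boltzmann constant in kJ/(mol K)

def pt_metad_acceptance(T_a, T_b, U_a, U_b, Va_Ra, Va_Rb, Vb_Rb, Vb_Ra):
    """Acceptance probability of Eq. 2 for swapping replicas a and b.

    U_a, U_b     : potential energies U(R_a), U(R_b)
    Va_Ra, Va_Rb : bias of replica a evaluated at S(R_a) and S(R_b)
    Vb_Rb, Vb_Ra : bias of replica b evaluated at S(R_b) and S(R_a)
    """
    beta_a = 1.0 / (KB * T_a)
    beta_b = 1.0 / (KB * T_b)
    delta = ((beta_b - beta_a) * (U_b - U_a)
             + beta_a * (Va_Ra - Va_Rb)
             + beta_b * (Vb_Rb - Vb_Ra))
    # min{1, exp(delta)}; skip exp() for delta >= 0 to avoid overflow
    return 1.0 if delta >= 0.0 else math.exp(delta)

# Hypothetical energies, no bias: the hotter replica b has the higher energy
p_down = pt_metad_acceptance(300.0, 313.0, -1000.0, -990.0, 0.0, 0.0, 0.0, 0.0)
# Same temperatures, energies reversed: the swap is always accepted
p_up = pt_metad_acceptance(300.0, 313.0, -990.0, -1000.0, 0.0, 0.0, 0.0, 0.0)
```

With the bias terms set to zero the expression reduces to the usual parallel-tempering criterion, which is a quick sanity check on the signs.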
2.3 BE-MetaD
Like PTMetaD, bias-exchange metadynamics (BE-MetaD) exchanges configurations between replicas. However, whereas the replicas in PTMetaD correspond to the system at different temperatures, the replicas in BE-MetaD are biased along different reaction coordinates at the same temperature. With this strategy, the method can consider a larger number of CVs simultaneously and can reach equilibrium efficiently.
3 MD Settings
3.1 System Settings
The initial structures of free pKID and KID were built from the experimental structure of the pKID-KIX complex (PDB ID: 1KDX [2]). For KID, the phosphorylated Ser133 was mutated back to serine. pKID and KID were capped with acetyl (ACE) and amine (NH2) groups at the N- and C-termini, respectively. The box size was set to 53 × 53 × 63 Å³ for free pKID and 55 × 55 × 64 Å³ for KID. TIP3P [12] water molecules were added to solvate the systems. Sodium and chloride ions were added to neutralize the systems, and the final ion concentration was set to 100 mM. The amber99SB-ILDN force field [13] was employed. Unbiased molecular dynamics simulations were performed to equilibrate the conformations of free pKID and KIX in aqueous solution. Isotropic pressure coupling was used. The Particle-Mesh Ewald method [14] was employed to calculate long-range electrostatics with a real-space cutoff of 10 Å. The temperature was kept at 300 K with the V-rescale method [15], and the pressure was controlled by the Parrinello-Rahman barostat [16].
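As a rough check on the salt concentration, the number of ion pairs that corresponds to 100 mM can be estimated from the box volume. The sketch below is only an order-of-magnitude estimate: it uses the full box volume (ignoring the volume taken up by the peptide) and does not count the extra counterions needed for neutralization. The function name is an assumption.

```python
AVOGADRO = 6.02214076e23  # particles per mole

def ion_pairs(box_angstrom, conc_molar):
    """Estimate how many salt ion pairs give `conc_molar` in a rectangular box."""
    x, y, z = box_angstrom
    volume_litres = x * y * z * 1e-27  # 1 cubic angstrom = 1e-27 L
    return conc_molar * volume_litres * AVOGADRO

pkid_pairs = ion_pairs((53, 53, 63), 0.100)  # free pKID box
kid_pairs = ion_pairs((55, 55, 64), 0.100)   # free KID box
print(round(pkid_pairs), round(kid_pairs))
```

For the two box sizes quoted above this gives roughly 11 and 12 ion pairs, so the ionic strength in boxes of this size is set by only about a dozen ions of each kind.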
3.2 PTMetaD-WTE
The initial structures of free pKID/KID for the PTMetaD-WTE simulations were obtained from unbiased molecular dynamics simulations at high temperature. Twelve replicas were simulated at the following temperatures: 288 K, 300 K, 313 K, 327 K, 342 K, 359 K, 377 K, 398 K, 421 K, 446 K, 475 K, and 508 K. The PTMetaD-WTE simulations were implemented in a two-step scheme. First, parallel tempering simulations in the well-tempered ensemble (PT-WTE) were performed, with the bias acting on the potential energy. The bias factor was set to 30. The initial height of the bias was 1.0 kJ/mol and the width was 300 kJ/mol (Eq. 1). Exchanges of configurations between adjacent replicas were attempted every 150 fs. After 30 ns of simulation for each replica, the height of the bias had decreased to a value close to 0 and the exchange acceptance probability between adjacent replicas was about 0.3 (Eq. 2). The potential energy underwent large fluctuations and was exchanged between neighboring replicas (Fig. 1). The average potential energy in the PT-WTE simulation remained close to the canonical value but had large fluctuations.
Fig. 1 The system energy in the replicas under different temperatures as a function of simulation time. (a) pKID; (b) KID
Next, all replicas were simulated with a static bias on the potential energy, constructed in the PT-WTE step. A history-dependent bias was added to two collective variables to enhance the sampling of the structure of the αA and αB regions of KID, i.e., the α-score for residues 120–129 and the α-score for residues 134–144 [17]. These CVs are defined as follows:
$$\alpha\text{-score} = \sum_i \frac{1 - \left(r_i/0.08\right)^{8}}{1 - \left(r_i/0.08\right)^{12}} \tag{3}$$
where r_i is the deviation (in nm) of the ith six-residue segment from an ideal α-helix. For the metadynamics bias on these CVs, the bias factor γ was set to 16, the initial height of the bias was 1.0 kJ/mol, and the width was 0.2 rad. The MetaD bias was deposited every 500 steps, where each step was 1.5 fs. Exchanges of configurations between neighboring replicas were attempted every 750 fs.
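Each term of Eq. 3 is a rational switching function that is close to 1 for a helical segment and decays to 0 as the segment deviates from the ideal helix. The sketch below evaluates it for given deviations; it is an illustration with assumed function names, not PLUMED's ALPHARMSD code, and the per-segment deviations must be supplied by the caller.

```python
def rational_switch(r, r0=0.08, nn=8, mm=12):
    """One term of Eq. 3: (1 - (r/r0)^nn) / (1 - (r/r0)^mm)."""
    x = r / r0
    if abs(x - 1.0) < 1e-9:
        # The ratio is 0/0 at r = r0; its limit there is nn/mm
        return nn / mm
    return (1.0 - x ** nn) / (1.0 - x ** mm)

def alpha_score(deviations):
    """Sum the switching function over all segments (deviations in nm)."""
    return sum(rational_switch(r) for r in deviations)

# Hypothetical deviations: three perfectly helical segments
score_helical = alpha_score([0.0, 0.0, 0.0])
# Hypothetical deviations: three strongly non-helical segments
score_coil = alpha_score([0.4, 0.4, 0.4])
```

The score therefore counts, approximately, how many segments are helical, which is why it works as a CV for the αA and αB regions.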
3.3 BE-MetaD
BE-MetaD combines replica exchange with metadynamics: configurations are exchanged between replicas, each of which is biased along a different collective variable. In this work, the initial structures for the bias-exchange metadynamics (BE-MetaD) simulations were built from the experimental structure of the complex (PDB ID: 1KDX). As in the regular simulations, sodium and chloride ions were added to neutralize the systems, and the final ion concentration was set to 100 mM. The pKID+KIX and KID+KIX systems were solvated with 10,092 and 9746 water molecules, respectively. The box sizes were 74 × 74 × 73 Å³ and 74 × 74 × 72 Å³ for pKID+KIX and KID+KIX, respectively. Four biased replicas were run, one for each of the four CVs. Exchanges between the replicas were attempted every 4 ps. Each replica was simulated for 450 ns, giving a total of 1.8 μs for each system. The simulations were considered to have reached equilibrium when the systems had covered the CV space (Fig. 2).
Fig. 2 The collective variables as a function of simulation time in BE-MetaD simulations. (a) The CV1 values of pKID. (b) The CV1 values of KID
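Because all BE-MetaD replicas run at the same temperature, the exchange test can be read as a special case of Eq. 2. The short derivation below assumes that the same Metropolis criterion is applied with β_a = β_b = β; the chapter does not write this form out explicitly.

```latex
\begin{aligned}
\beta_a = \beta_b = \beta \;\Rightarrow\;& (\beta_b - \beta_a)\,\bigl(U(R_b) - U(R_a)\bigr) = 0, \\
P_{a \leftrightarrow b} =\;& \min\Bigl\{1,\ \exp\Bigl[\beta\bigl(V_a(S(R_a)) + V_b(S(R_b)) - V_a(S(R_b)) - V_b(S(R_a))\bigr)\Bigr]\Bigr\}
\end{aligned}
```

In this form the potential energy U drops out, and only the bias potentials of the two replicas, each evaluated on both configurations, decide whether a swap is accepted.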
The Gaussian bias was applied to four CVs: the α-score for residues 120–129 in pKID or KID (CV1), the α-score for residues 134–144 in pKID or KID (CV2), the COM distance between pKID/KID and KIX (CV3), and the number of native contacts between pKID/KID and KIX (CV4). The α-score CVs describe the folding of pKID, CV3 describes the binding process, and CV4 describes the progress of binding between pKID/KID and KIX. The COM distance in CV3 was limited to less than 4.0 nm with a harmonic restraining potential during the simulation, to focus sampling on the relevant regions of configurational space. The harmonic potential had the following form:
$$V_M = \begin{cases} \dfrac{1}{2}\, k\, (S - S_0)^2, & \text{if } S > S_0 \\ 0, & \text{if } S \le S_0 \end{cases} \tag{4}$$
where S corresponds to the COM distance between pKID/KID and KIX. S0 was 3.0 nm. The force constant k was 500 kJ/(mol·nm²). CV4 was calculated as a sum of switching functions:
$$Q = \sum_{ij} \frac{1}{1 + \exp\left[\beta\left(r_{ij} - \lambda\, r_{ij}^{0}\right)\right]} \tag{5}$$
where r_ij is the distance between heavy atoms i in pKID/KID and j in KIX, and the sum runs over atom pairs that are closer than 0.45 nm in the experimental structure. We used r⁰_ij = 0.45, λ = 1.8, and β = 50 nm⁻¹ [18]. The Gaussian height ω was set to 2.0 kJ/mol for all CVs, and the Gaussian width was 0.2 for CV1 to CV3 and 10 for CV4. The bias factor was set to 32 in all replicas. The Gaussian bias was deposited every 5 ps.
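Eqs. 4 and 5 are simple enough to check numerically. The following sketch evaluates both with the parameter values quoted above; the function names are assumptions, distances are taken to be in nm, and the list of pair distances is a hypothetical input rather than data from the simulations.

```python
import math

def upper_wall(s, s0=3.0, k=500.0):
    """Eq. 4: harmonic restraint that acts only when s exceeds s0."""
    return 0.5 * k * (s - s0) ** 2 if s > s0 else 0.0

def native_contacts(distances, r0=0.45, lam=1.8, beta=50.0):
    """Eq. 5: sum of switching functions over native atom pairs."""
    q = 0.0
    for r in distances:
        x = beta * (r - lam * r0)
        if x > 700.0:
            # exp(x) would overflow; the term is zero to machine precision
            continue
        q += 1.0 / (1.0 + math.exp(x))
    return q

wall_inside = upper_wall(2.5)    # below S0: no restraint
wall_outside = upper_wall(3.5)   # 0.5 nm beyond S0
q_formed = native_contacts([0.40])   # pair well inside lam * r0 = 0.81 nm
q_midpoint = native_contacts([0.81]) # pair exactly at lam * r0
q_broken = native_contacts([2.00])   # pair far apart
```

A contact therefore counts as formed up to about 0.81 nm (λ r⁰), and the restraint costs 62.5 kJ/mol when the COM distance is 0.5 nm beyond S0.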
4 Implementation
4.1 PT-WTE Metadynamics
1. Preprocessing of the protein.
Software: Gromacs2018 [19] gmx.
Module usage: pdb2gmx, editconf, solvate, genion.
2. Energy minimization.
Software: Gromacs2018 gmx.
Module usage: grompp, mdrun.
Command:
gmx grompp -f minim.mdp -c pKID_solv_ions.gro -p pKID.top -o em.tpr
gmx mdrun -s em.tpr -deffnm em
3. NVT equilibration.
Software: Gromacs2018 gmx.
Module usage: grompp, mdrun.
Command:
gmx grompp -f tem_nvt0.mdp -c em.gro -p pKID.top -n index.ndx -o pKID-$TEMP-anne-nvt.tpr
gmx mdrun -s pKID-$TEMP-anne-nvt.tpr -deffnm pKID-$TEMP-anne-nvt
4. NPT equilibration.
Software: Gromacs2018 gmx.
Module usage: grompp, mdrun.
Command:
gmx grompp -f npt.mdp -c pKID-$TEMP-anne-nvt.gro -t pKID-$TEMP-anne-nvt.cpt -p pKID.top -n index.ndx -o pKID-$TEMP-npt.tpr
gmx mdrun -s pKID-$TEMP-npt.tpr -deffnm pKID-$TEMP-npt -gpu_id 01 -nt 12
5. Parallel tempering simulation.
Software: Gromacs2018 with plumed-2.4 [20, 21] patched.
Module usage: grompp, mdrun.
Command:
gmx grompp -f remd.mdp -c pKID-$TEMP-npt.gro -t pKID-$TEMP-npt.cpt -p pKID.top -o pKID-PT.tpr
mpirun -np 12 -hostfile hosts mdrun_mpi -s pKID-PT-charm5ns -deffnm pKID-PT-5ns -plumed plumed_PT.dat -multi 12 -replex 500 -gpu_id 0011
6. Parallel tempering with well-tempered ensemble.
Software: Gromacs2018 with plumed-2.4 patched.
Module usage: grompp, mdrun.
Command:
gmx grompp -f remd.mdp -c pKID-PT-5ns0.gro -t pKID-PT-5ns0.cpt -p pKID.top -o pKID-PTWTE-30ns0.tpr
mpirun -np 12 -hostfile hosts mdrun_mpi -s pKID-PTWTE-30ns -plumed plumed_PTWTE -deffnm pKID-PTWTE-30ns- -multi 12 -replex 100 -gpu_id 0011
7. PT-WTE metadynamics.
Software: Gromacs2018 with plumed-2.4 patched.
Module usage: grompp, mdrun.
Command:
gmx grompp -f remd.mdp -c pKID-PTWTE-30ns.gro -t pKID-PTWTE-30ns-.cpt -p pKID.top -o pKID-PTMetaDWTE-300ns0.tpr
mpirun -np 12 -hostfile hosts mdrun_mpi -s pKID-PTMetaDWTE-300ns -plumed plumed_PTMetaDWTE -deffnm pKID-PTMetaDWTE-300ns -multi 12 -replex $steps -gpu_id 0011
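The $TEMP placeholder in the equilibration commands has to be expanded once per replica temperature. One way to do this is sketched below in Python; the script only builds the command strings for the NVT step and does not run GROMACS. The function name is an assumption, and the sketch reuses the shared tem_nvt0.mdp file name as printed above, since the chapter does not say how per-temperature .mdp files are named.

```python
import shlex

# Replica temperatures (K) listed in Subheading 3.2
TEMPERATURES = [288, 300, 313, 327, 342, 359, 377, 398, 421, 446, 475, 508]

def nvt_commands(temp):
    """Return the grompp and mdrun command lines of step 3 for one temperature."""
    name = f"pKID-{temp}-anne-nvt"
    grompp = ["gmx", "grompp", "-f", "tem_nvt0.mdp", "-c", "em.gro",
              "-p", "pKID.top", "-n", "index.ndx", "-o", f"{name}.tpr"]
    mdrun = ["gmx", "mdrun", "-s", f"{name}.tpr", "-deffnm", name]
    return shlex.join(grompp), shlex.join(mdrun)

all_commands = [nvt_commands(t) for t in TEMPERATURES]
for grompp_cmd, mdrun_cmd in all_commands:
    print(grompp_cmd)
    print(mdrun_cmd)
```

Printing the commands first, instead of executing them directly, makes it easy to inspect the expanded file names before any simulation is launched.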
Plumed files:
######### plumed file for PT-WTE #########
MOLINFO STRUCTURE=pKID-exp.pdb
# set up the two CVs and total energy
ALPHARMSD RESIDUES=2-11 TYPE=DRMSD LESS_THAN={RATIONAL R_0=0.08 NN=8 MM=12} LABEL=CV1
ALPHARMSD RESIDUES=16-26 TYPE=DRMSD LESS_THAN={RATIONAL R_0=0.08 NN=8 MM=12} LABEL=CV2
ene: ENERGY
# Activate metadynamics in ene
# Well-tempered metadynamics is activated
#
wte: METAD ARG=ene PACE=500 HEIGHT=$HEIGHT SIGMA=$SIGMA FILE=HILLS_PTWTE_ BIASFACTOR=20 TEMP=$TEMP
# monitor the three variables and the metadynamics bias potential
PRINT STRIDE=1000 ARG=CV1.lessthan,CV2.lessthan,ene,wte.bias FILE=COLVAR_PTWTE_PT-WTE
############################################################
######### PT-WTE Metadynamics #########
RESTART
MOLINFO STRUCTURE=pKID-exp.pdb
# set up two CVs and total energy
ALPHARMSD RESIDUES=2-11 TYPE=DRMSD LESS_THAN={RATIONAL R_0=0.08 NN=8 MM=12} LABEL=CV1
ALPHARMSD RESIDUES=16-26 TYPE=DRMSD LESS_THAN={RATIONAL R_0=0.08 NN=8 MM=12} LABEL=CV2
ene: ENERGY
# Activate metadynamics in ene
# Well-tempered metadynamics is activated
# PACE is set very large so that no new Gaussians are added and the PT-WTE bias stays static
wte: METAD ARG=ene PACE=999999999 HEIGHT=$HEIGHT SIGMA=$SIGMA FILE=HILLS_PTWTE_ BIASFACTOR=20 TEMP=$TEMP
# activate metadynamics on the two CVs, depositing a Gaussian every 500 time steps
metad: METAD ARG=CV1.lessthan,CV2.lessthan PACE=500 HEIGHT=$HEIGHT SIGMA=$S1,$S2 FILE=HILLS_PTMetaDWTE BIASFACTOR=16 TEMP=$TEMP
# monitor the three variables and the metadynamics bias potential
PRINT STRIDE=1000 ARG=CV1.lessthan,CV2.lessthan,ene,wte.bias,metad.bias FILE=COLVAR_PTMetaDWTE
############################################################
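The $HEIGHT, $SIGMA, and $TEMP placeholders in the PLUMED files above must be replaced by numbers before the files can be used. A minimal Python sketch of that substitution for the PT-WTE bias line is given below, using the height (1.0 kJ/mol) and width (300 kJ/mol) quoted in Subheading 3.2. The function name is an assumption, and how the resulting files are named and assigned to each replica is not specified in the chapter and is left out here.

```python
from string import Template

# PT-WTE bias line from the PLUMED file above, with its $-placeholders intact
WTE_LINE = Template(
    "wte: METAD ARG=ene PACE=500 HEIGHT=$HEIGHT SIGMA=$SIGMA "
    "FILE=HILLS_PTWTE_ BIASFACTOR=20 TEMP=$TEMP"
)

# Replica temperatures (K) listed in Subheading 3.2
TEMPERATURES = [288, 300, 313, 327, 342, 359, 377, 398, 421, 446, 475, 508]

def render_wte_line(temp, height=1.0, sigma=300.0):
    """Fill the placeholders for one replica; raises KeyError if one is missing."""
    return WTE_LINE.substitute(HEIGHT=height, SIGMA=sigma, TEMP=temp)

lines = [render_wte_line(t) for t in TEMPERATURES]
for line in lines:
    print(line)
```

Using `substitute` rather than `safe_substitute` means that a forgotten placeholder fails loudly instead of leaving a literal `$` in the input file.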
4.2 BE-MetaD
1. The preprocessing, energy minimization, and equilibration
steps were similar to the PT-WTE metadynamics.
2. BE-MetaD.
Software: Gromacs2018 with plumed-2.4 patched.
Module: grompp, mdrun.
Command:
gmx grompp -f mdrun.mdp -cpt npt_$replica -p pKID-KIX.top -o mdrun_$replica.tpr
mpirun -np 4 mdrun -s mdrun_ -plumed plumed_be.dat -deffnm mdrun_extend_ -gpu_id 01 -nt 4
############## plumed.1 file for BE-MetaD ##############
# include the definition of CVs
INCLUDE FILE=plumed-common.dat
be: METAD ARG=CV1.lessthan HEIGHT=$HEIGHT SIGMA=$SIGMA PACE=2500 BIASFACTOR=32 GRID_MIN=$CV_MIN GRID_MAX=$CV_MAX GRID_BIN=100 FILE=HILLS
PRINT ARG=CV1.lessthan,CV2.lessthan,CV3,CV4,be.bias,duwall.bias STRIDE=2500 FILE=COLVAR
##############################################################################
5 Concluding Notes
1. Both pKID and KID are disordered with some transient helical
structures.
2. More hydrophobic interactions are formed in the phosphorylated KID, which promote the formation of the special hydrophobic residue cluster (HRC).
3. The binding of the intrinsically disordered pKID
follows a flexible conformational selection mechanism.
References
1. Liu N, Guo Y, Ning S, Duan M (2020) Phosphorylation regulates the binding of intrinsically disordered proteins via a flexible
conformation selection mechanism. Commun
Chem 3:123
2. Radhakrishnan I et al (1997) Solution structure of the KIX domain of CBP bound to the
transactivation domain of CREB: a model for
activator:coactivator interactions. Cell
91(6):741–752
3. Sugase K, Dyson HJ, Wright PE (2007) Mechanism of coupled folding and binding of an
intrinsically disordered protein. Nature
447(7147):1021–1025
4. Zor T et al (2002) Roles of phosphorylation
and helix propensity in the binding of the KIX
domain of CREB-binding protein by constitutive (c-Myb) and inducible (CREB) activators.
J Biol Chem 277(44):42241–42248
5. Radhakrishnan I et al (1998) Conformational
preferences in the Ser(133)-phosphorylated
and non-phosphorylated forms of the kinase
inducible transactivation domain of CREB.
FEBS Lett 430(3):317–322
6. Dahal L, Shammas SL, Clarke J (2017) Phosphorylation of the IDP KID modulates affinity
for KIX by increasing the lifetime of the complex. Biophys J 113(12):2706–2712
7. Zor T et al (2002) Roles of phosphorylation
and helix propensity in the binding of the KIX
domain of CREB-binding protein by constitutive (c-Myb) and inducible (CREB) activators.
J Biol Chem 277(44):42241–42248
8. Prakash MK, Barducci A, Parrinello M (2011)
Replica temperatures for uniform exchange
and efficient roundtrip times in explicit solvent
parallel tempering simulations. J Chem Theory
Comput 7(7):2025–2027
9. Piana S, Laio A (2007) A bias-exchange
approach to protein folding. J Phys Chem B
111(17):4553–4559
10. Laio A, Parrinello M (2002) Escaping free-energy minima. Proc Natl Acad Sci U S A
99(20):12562–12566
11. Valsson O, Tiwary P, Parrinello M (2016)
Enhancing important fluctuations: rare events
and metadynamics from a conceptual viewpoint. Annu Rev Phys Chem 67:159–184
12. Jorgensen WL, Chandrasekhar J, Madura JD
et al (1983) Comparison of simple potential
functions for simulating liquid water. J Chem
Phys 79:926–935
13. Lindorff-Larsen K et al (2010) Improved side-chain torsion potentials for the Amber ff99SB
protein force field. Proteins 78(8):1950–1958
14. Essmann U, Perera L, Berkowitz ML (1995) A
smooth particle mesh Ewald method. J Chem
Phys 103:8577
15. Bussi G, Donadio D, Parrinello M (2007)
Canonical sampling through velocity rescaling.
J Chem Phys 126(1):014101
16. Parrinello M, Rahman A (1980) Crystal structure and pair potentials: a molecular-dynamics
study. Phys Rev Lett 45:1196
17. Pietrucci F, Laio A (2009) A collective variable
for the efficient exploration of protein beta-sheet structures: application to SH3 and GB1.
J Chem Theory Comput 5(9):2197–2201
18. Best RB, Hummer G, Eaton WA (2013) Native
contacts determine protein folding mechanisms in atomistic simulations. Proc Natl Acad
Sci U S A 110(44):17874–17879
19. Berendsen HJC, van der Spoel D, van Drunen R (1995)
GROMACS: a message-passing parallel molecular dynamics implementation. Comput Phys
Commun 91:43–56
20. Tribello GA et al (2014) PLUMED2: new
feathers for an old bird. Comput Phys Commun 185:604
21. Bussi G, Tribello GA (2019) Analyzing and
biasing simulations with PLUMED. Methods
Mol Biol 2022:529–578
Chapter 10
Computational and Experimental Protocols to Study
Cyclo-dihistidine Self- and Co-assembly: Minimalistic
Bio-assemblies with Enhanced Fluorescence and Drug
Encapsulation Properties
Asuka A. Orr, Yu Chen, Ehud Gazit, and Phanourios Tamamis
Abstract
Our published studies on the self- and co-assembly of cyclo-HH peptides demonstrated their capacity to
coordinate with Zn(II), their enhanced photoluminescence and their ability to self-encapsulate epirubicin, a
chemotherapy drug. Here, we provide a detailed description of computational and experimental methodology for the study of cyclo-HH self- and co-assembling mechanisms, photoluminescence, and drug
encapsulation properties. We outline the experimental protocols, which involve fluorescence spectroscopy,
transmission electron microscopy, and atomic force microscopy protocols, as well as the computational
protocols, which involve structural and energetic analysis of the assembled nanostructures. We suggest that
the computational and experimental methods presented here can be generalizable, and thus can be applied
in the investigation of self- and co-assembly systems involving other short peptides, encapsulating compounds and binding to ions, beyond the particular ones presented here.
Key words Molecular dynamics, Nanostructure, Biomaterials, Drug delivery, Electron microscopy,
Charmm program, Generalized Born, Association free energy
1 Introduction
Supramolecular self-assembly of biomolecules into nanostructures with diverse hierarchical architectures is essential to physiological functions across all kingdoms of life [1]. As the keys to fundamental working principles of biology, proteins and peptides are endowed with the propensity to form complex architectures uniquely suited for specialized functions. These multiple, well-defined, supramolecular self-assemblies, with different sequences, shapes, and functions, enable living systems to respond to internal and external stimuli and engage in complex behavior [1, 2]. Peptides and peptide derivatives have received significant attention as potential nanotechnology building blocks due to their flexibility, variability in molecular design, and ease of large-scale synthesis [2].
Asuka A. Orr and Yu Chen contributed equally to this work.
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_10,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
In the past few years, many short peptides and their synthetic
analogues have been assembled into structures according to the
minimalist approach originally described by DeGrado and coworkers [3]. We and others have taken a reductionist approach to form
numerous self-assembled peptide structures. One of the prominent
examples is diphenylalanine (FF) [4], a self-assembling dipeptide
sequence initially identified as the minimal core recognition motif
of amyloid β-protein, the amyloidogenic polypeptide associated
with Alzheimer’s disease [5]. FF peptide self-assemblies have been
shown to display intriguing features, including biosensing, energy
storage, superhydrophobic surfaces, and photoluminescence [4–10]. Furthermore, recent studies revealed that cyclo-dipeptides
with 2,5-diketopiperazine backbone configurations, derived from
dehydration condensation of linear dipeptides, self-assemble into
oligomeric nanostructures [11]. In particular, inspired by the molecular structure of BFPms1 [12], we successfully constructed a short
fluorescent peptide core encapsulated by the peptide scaffold building module to implement the concept of “self-assembly locking
strategy” [13, 14]. We reported the demonstration of a bright
fluorescent peptide with quantum yields of up to 70% for green
fluorescence, exemplifying the potential of such structures to serve
as bioinspired, organic, supramolecular alternatives to complement
their state-of-the-art inorganic counterparts [14]. Importantly, our
studies also aimed to provide fundamental insights into the underlying molecular self-assembling mechanisms and modulation of the
photoluminescence properties of these materials, which remain
difficult to elucidate. Such insights can be of utmost importance for
further utilization and future exploitation of the assemblies, toward
novel biological materials with advanced functional applications,
including but not limited to cancer drug delivery.
In this chapter, we provide a detailed description of selected
computational and experimental methods for the study of cyclo-HH peptide self- and co-assembly mechanisms, photoluminescence, and drug encapsulation properties, included in our recently
published papers [13, 14]. We first give an overview of the experimental protocols used to study the co-assembly of cyclo-HH with
different ions and molecules to ultimately produce a fluorescent
drug nanocarrier, and we describe three key experimental protocols
(fluorescence spectroscopy, transmission electron microscopy, and
atomic force microscopy) for studying the morphology and fluorescence of the formed nanostructures. We then give an overview of
the computational protocol, based on molecular dynamics
(MD) simulations, and the structural and energetic analysis of the
early stages of cyclo-HH self-assembly and co-assembly. The protocol was used to study cyclo-HH self-assembly in the presence and absence of Zn(II) ions [13], and the co-assembly of cyclo-HH with Zn(II) ions in the presence or absence of nitrate ions, and with Zn(II) ions, nitrate ions, and the chemotherapy drug epirubicin (EPI) [14]. We particularly focus here on the theoretical developments used to study these systems. We consider that the computational and experimental methods presented here in detail are
generalizable, and thus can be applied to the self- and
co-assembly of systems involving other short peptides, encapsulating compounds and binding to ions, beyond cyclo-HH.
2 Materials
2.1 Peptide
The peptide, cyclic(L-histidine-D-histidine) (cyclo-HH), was purchased from GL Biochem (Shanghai, China) and had a degree of purity higher than 95%. Zinc nitrate (Zn(NO3)2), dimethylformamide (DMF), dimethyl sulfoxide (DMSO), and isopropanol were purchased from Sigma-Aldrich (Rehovot, Israel). Epirubicin hydrochloride was purchased from Glentham Life Sciences. All materials were used as received without further purification. Water was processed using a Millipore purification system (Darmstadt, Germany) with a minimum resistivity of 18.2 MΩ·cm.
2.2 Simulation and Analysis Software
In the computational methods described, the CHARMM [15] program (http://charmm.chemistry.harvard.edu) was used to perform MD simulations and additional energetic calculations. The analysis was primarily performed using FORTRAN programs and other programs listed below. Visual inspection of simulations was performed with VMD [16].
3 Methods
3.1 Experimental Methods
3.1.1 Co-assembly
1. To study assembly, cyclo-HH and co-assembling molecules or ions are mixed under controlled experimental conditions, resulting in the formation of nanostructures. For the assembly of cyclo-HH in the presence of Zn(II) and nitrate ions, we first prepared a fresh stock solution of cyclo-HH. 5.48 mg of lyophilized cyclo-HH peptide powder was dissolved in 5% (v/v) DMF/isopropanol mixed solvent in a 2 mL scintillation vial at a concentration of 0.02 mmol mL⁻¹. Then, 2.97 mg of the metal salt Zn(NO3)2 was added to the peptide solution under vigorous sonication for 5 min. The vial was heated at 80 °C for 1 h with a ramping rate of 1 °C/min and was cooled down to room temperature overnight. The color of the solution subsequently changes to light yellow. For co-assembly of cyclo-HH in the presence of Zn(II) ions, nitrate ions, and EPI,
lyophilized cyclo-HH peptide powder was dissolved in 4% (v/v) DMSO/isopropanol at a concentration of 5.48 mg mL⁻¹. Following that, 0.25 mg EPI and 2.97 mg of the metal salt Zn(NO3)2 were added under vigorous sonication, followed by one-hour incubation in an 80 °C water bath (with a temperature ramp of 1 °C/min) and subsequent cooling to ambient temperature. The elevated temperature is intended to accelerate self-assembly, followed by cooling to equilibrium at room temperature. The obtained red suspension was then centrifuged at 12,557 × g for 20 min, and the precipitates were washed three times with Milli-Q water to remove any excess EPI and Zn(NO3)2.
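The peptide mass and molar concentration quoted in this step can be cross-checked with a short calculation. The sketch below assumes the molecular formula C12H14N6O2 for cyclo-HH (two histidine residues joined in a diketopiperazine ring) and standard atomic masses; neither is stated in the protocol, and the function names are illustrative.

```python
# Standard atomic masses in g/mol
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

# Assumed formula of cyclo-HH: C12H14N6O2
CYCLO_HH = {"C": 12, "H": 14, "N": 6, "O": 2}

def molar_mass(formula):
    """Molar mass in g/mol from an element -> count mapping."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())

def millimoles(mass_mg, formula):
    """Amount of substance in mmol (mg divided by g/mol gives mmol)."""
    return mass_mg / molar_mass(formula)

mw_cyclo_hh = molar_mass(CYCLO_HH)
amount_mmol = millimoles(5.48, CYCLO_HH)
print(f"{mw_cyclo_hh:.2f} g/mol, {amount_mmol:.4f} mmol")
```

Under these assumptions, 5.48 mg corresponds to about 0.02 mmol, so the two concentrations quoted above (0.02 mmol mL⁻¹ and 5.48 mg mL⁻¹) describe the same peptide concentration.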
3.1.2 Fluorescence Spectroscopy
2. Fluorescence spectroscopy is a key measurement for assessing the optical properties of peptide assemblies. A 600 μL sample of the assembled material formed by cyclo-HH in the presence of Zn(II) and nitrate ions (cyclo-HH-Zn(NO3)2) was pipetted into a 1.0 cm path-length quartz cuvette (Hellma Analytics, item no.: HL108-F-10-40, light path: 10 × 4 mm), and the spectrum was collected using a FluoroMax-4 spectrofluorometer (Horiba Jobin Yvon, Kyoto, Japan) at ambient temperature. The excitation and emission wavelengths were set at 300–500 nm and 300–700 nm, respectively, with a slit of 2 nm (see Note 1). The resulting excitation–emission matrix contour profile shows that the cyclo-HH-Zn(NO3)2 material has bright fluorescence properties (Fig. 1).
3.1.3 Transmission Electron Microscopy and Atomic Force Microscopy
3. Transmission electron microscopy (TEM) and atomic force microscopy (AFM) were employed to identify the morphologies of the nanostructures newly formed by the co-assembled cyclo-HH. To study the material formed by the co-assembly of cyclo-HH in the presence of Zn(II) and nitrate ions by AFM, we first attached a mica sheet to a microscope slide using a nonconductive double-sided adhesive tab (as shown in Fig. 2). Then, the mica sheet (see Note 2) (Highest Grade V1 Mica Discs, 12 mm, item no.: 50-12, Ted Pella, Inc) was rinsed with water and gently purge-dried with nitrogen (99.99%). 5 μL of cyclo-HH-Zn(NO3)2 sample solution was dropped onto the freshly cleaved mica surface and dried by N2 purge. A topographic image was recorded with a Dimension Icon AFM (Bruker) in tapping mode at ambient temperature, with a 512 × 512 pixel resolution and a scanning speed of 1.0 Hz. Nanoscope Analysis software was used for data collection and analysis. For TEM sample preparation, cyclo-HH-Zn(NO3)2 was first sonicated for 10 min. Then, 10 μL of cyclo-HH-Zn(NO3)2 sample solution was gently dropped onto a glow-discharged copper grid coated with a thin carbon film (Formvar Carbon Film 400 mesh, Copper, item no.: FCF400-CU, Electron Microscopy Sciences). After 2 min,
the excess solution was removed with a filter paper. TEM images were viewed using an FEI Tecnai F20 electron microscope operating at 80 kV. AFM and TEM images show that the morphology of cyclo-HH-Zn(NO3)2 consists of nanoparticles of about 30 nm (Fig. 3).
Fig. 1 Typical excitation–emission matrix contour profile of cyclo-HH-Zn(NO3)2
Fig. 2 Mica sheet attached to microscope slide
3.2 Computational Methods
In the following steps, we describe the methodology followed to simulate and analyze the produced trajectories to obtain insights into the short peptide self-assembly properties. The short peptide simulation systems highlighted here correspond to cyclo-HH self- and co-assembly in different environments. Specifically, we considered the self-assembly of cyclo-HH in the presence and absence of Zn(II) ions in methanol [13] and the co-assembly of cyclo-HH with Zn(II) ions in the presence or absence of nitrate ions, and with Zn(II) ions, nitrate ions, and EPI, in isopropanol [14]. An overview of the computational methodology is presented in Fig. 4.
Fig. 3 AFM (left) and TEM (right) images of the morphology of cyclo-HH-Zn(NO3)2
Fig. 4 Schematic of the overall computational methodology to study cyclo-HH self-assembly
3.2.1 MD Simulation Setup and Execution
1. The 3D structures of the molecules under investigation are
constructed to match the experimental systems. The 3D structures for compounds can be obtained from existing structure
databases, such as the ZINC Database [17] or PubChem [18],
or can be manually built through programs such as Marvin
Sketch [19]. For the studies of cyclo-HH co-assembly
[13, 14], the 3D structures of all molecules were manually
built through Marvin Sketch to ensure the correct protonation
state observed in the experimentally resolved crystal structures.
2. Molecular force fields are chosen to describe the molecules’
and solvents’ interactions. In our studies, we used polarizable
force fields and nonpolarizable force fields in separate studies of
cyclo-HH self- and co-assembly [13, 14]. In the studies involving EPI and nitrate ions, polarizable force fields were not used, as parameters and topologies for both nitrate ions and EPI are
not readily available (see Note 3). To ensure that a force field
can adequately describe the co-assembly system under investigation, computationally derived results can be compared to
experimental results or to additional computationally derived
results of higher accuracy. For the simulations of cyclo-HH
co-assembly using nonpolarizable force fields, elementary
structures of cyclo-HH with Zn(II) ions were in line with
both experimentally derived crystal structures and elementary
structures derived from simulations using the Drude polarizable force field [13, 14, 20].
3. Different initial conformations of the molecules under investigation are generated through short infinite dilution simulations. These different initial conformations of the molecules
will be used as starting points for the finite dilution simulations
described in the following steps (see Note 4). To generate
different configurations of the molecules under investigation,
infinite dilution simulations were performed for each molecule,
independently. In the infinite dilution simulations, two bonded
atoms are aligned and fixed to remove the translation and
rotation of the molecule. Translations and rotations are introduced to copies of the molecule when they are initially placed
on a grid to build the starting structure of the finite dilution
simulations (see step 4). In the simulations of cyclo-HH self- and co-assembly, the monomer molecules (cyclo-HH and EPI)
were independently simulated for 10 ns with structures
extracted every 10 ps to generate 1000 possible configurations
of each molecule [13, 14].
4. The initial structures of the finite dilution simulations are generated by placing the molecules in random configurations and
orientations on a grid. The configurations are randomly
selected from the pool of 1000 possible configurations (see
step 3) for each molecule. The molecules are placed on a grid
such that they are equally spaced; the initial distance between
each molecule’s nearest atoms is within the cutoff of nonbonded interactions to facilitate the formation of an initial
aggregate [21]. In this way, each molecule can initially “interact” with another molecule within the simulations. The number of molecules and ions within the simulation system should
be sufficiently large to enhance statistical analysis of interactions
formed [21].
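The grid construction described in steps 3 and 4 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the program used in references [13] and [14]: each configuration in the pool is assumed to be a list of (x, y, z) tuples, the grid is assumed to be cubic, and the names `random_rotation`, `place_on_grid`, `n_per_side`, and `spacing` are hypothetical.

```python
import math
import random

def random_rotation(rng):
    """Return a random 3x3 rotation matrix built from a random unit quaternion."""
    # Normalizing four Gaussian samples gives a uniformly distributed unit quaternion.
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def place_on_grid(pool, n_per_side, spacing, seed=0):
    """Place n_per_side**3 copies on a cubic grid (hypothetical layout).

    Each copy uses a configuration drawn at random from `pool` and is rotated
    at random about its geometric center before being moved to its grid site.
    """
    rng = random.Random(seed)
    placed = []
    for ix in range(n_per_side):
        for iy in range(n_per_side):
            for iz in range(n_per_side):
                conf = rng.choice(pool)
                n = len(conf)
                center = [sum(atom[d] for atom in conf) / n for d in range(3)]
                rot = random_rotation(rng)
                site = (ix * spacing, iy * spacing, iz * spacing)
                copy = []
                for atom in conf:
                    rel = [atom[d] - center[d] for d in range(3)]
                    copy.append(tuple(
                        sum(rot[r][c] * rel[c] for c in range(3)) + site[r]
                        for r in range(3)))
                placed.append(copy)
    return placed
```

In practice, `spacing` would be chosen so that the nearest atoms of neighboring molecules fall within the nonbonded cutoff, as described in step 4.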
5. The initial configuration of the grid of molecules and ions under
investigation is solvated in solvent boxes. The solvent molecules
used to build the solvent box should be in line with experiments
(methanol and isopropanol for references [13] and [14], respectively). The solvent box is periodically replicated through periodic boundary conditions, and the molecules and ions within
the simulations are free to move within the replicated solvent
box. The size of the solvent box could be set to a value that is not
very large, to facilitate the interaction of the molecules and ions
and enhance the sampling and formation of clusters within the
simulations [21] (see Note 5). Subsequently, the charge of the
simulation systems is neutralized by introducing counterions
through Monte Carlo simulations [22, 23]. In the simulations
of cyclo-HH self- and co-assembly, the size of the grid of molecules and the solvent boxes was selected to increase the
simulated concentration of the co-assembling molecules, compared to experiments, to facilitate self-assembly [13, 14]. Simulations were performed at 300 K, in line with the room
temperature used in the experiments for the systems to cool
and equilibrate. Simulation input files followed the general workflow
provided by CHARMM-GUI [22, 23], adjusted
for the current systems under investigation.
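The effect of the box size on the simulated concentration in step 5 can be estimated with a short calculation. The following Python sketch assumes a cubic box; the function name and the example numbers in the usage note are hypothetical and are not taken from references [13] and [14].

```python
AVOGADRO = 6.02214076e23          # molecules per mole
LITERS_PER_CUBIC_ANGSTROM = 1e-27  # 1 A^3 = 1e-30 m^3 = 1e-27 L

def molar_concentration(n_molecules, box_edge_angstrom):
    """Molar concentration of n_molecules in a cubic box of the given edge (in Angstrom)."""
    volume_liters = box_edge_angstrom ** 3 * LITERS_PER_CUBIC_ANGSTROM
    return n_molecules / (AVOGADRO * volume_liters)
```

For instance, 64 molecules in a cubic box with a 100 Å edge (illustrative values only) correspond to roughly 0.1 M; halving the box edge raises the concentration eightfold.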
6. Prior to the production simulation runs, the simulation systems
are first energetically minimized and equilibrated. An energy
minimization is performed on the starting structures. The simulation systems are subsequently subjected to a position-restrained
equilibration, which aims to avoid any unnecessary and sudden
structural distortion when initiating the MD simulation production stage. In this step, all heavy atoms are restrained to their
starting positions, allowing the solvent molecules and ions to
equilibrate around the assembling molecules.
7. Finally, the simulation systems are investigated using multi-nanosecond MD simulations with all restraints imposed in
the previous steps released. The duration of the MD simulations can be tailored to the systems under investigation (see
Note 6). In the production stage, simulation snapshots are
saved throughout the duration of the simulations for
subsequent structural and energetic analysis (Fig. 4, see steps
9–31). For the simulations of cyclo-HH self- and co-assembly,
100 ns MD simulations were performed; this duration proved
to be sufficiently long to observe the formation and reformation of aggregates within the simulation and the convergence
of structural and energetic properties [13, 14]. Additional
details on the simulation methodology are provided in references [13] and [14].
8. During the construction of the starting structures of each
simulation system and the subsequent MD simulations, it is
recommended to inspect the simulations visually and to check the
corresponding output files to ensure that the starting structures were built appropriately and that the simulations are
progressing properly.
3.2.2 Structural Analysis
9. The simulation trajectories provide structures of aggregates
formed by the self-assembling molecules. These aggregates
can be characterized in terms of the specific interactions
formed within the aggregates or the overall structural properties (e.g., compactness, solvent accessibility) by postprocessing the simulation trajectories in structural analysis
programs (Fig. 4). In simulations of cyclo-HH self- and
co-assembly, the structural analysis programs focused on the
formation of specific interactions within the formed aggregates as well as the overall geometric properties of the aggregates (composition, compactness, location of molecules
within the aggregates) [13, 14].
10. Pair-wise interactions, based on atom-to-atom distances,
between co-assembling molecules in the simulations can be
recorded and characterized by post-processing the simulation
trajectories in structural analysis programs. The interactions
that are chosen to be tracked and their corresponding distance
cutoffs can be guided by experimental results (e.g., crystal
structures) and/or visual inspection of the MD simulation
trajectories (see Note 7).
11. Information on which molecules or ions are interacting can be
tabulated in suitably defined matrices, which can be defined in
FORTRAN programs. The raw data from the trajectories are
used to populate matrices of the form g(axis, entity, atom, resid, i)
containing the coordinates of each atom. The index “axis”
runs from 1 to 3 and corresponds to the x-, y-, and z-axis,
respectively. The index “entity” runs from 1 to k, the total
number of molecule or ion types (two for cyclo-HH and Zn(II)
ions; three for cyclo-HH, Zn(II) ions, and nitrate ions;
and four for cyclo-HH, Zn(II) ions, nitrate ions, and EPI).
The index “atom” runs from 1 to a, the total number of
atoms in “entity.” The index “resid” runs from 1 to j(entity),
the total number of molecule or ion copies of “entity”; and
index “i” runs from 1 to S, the total number of snapshots to
be analyzed in the trajectory.
12. To determine if two molecules or ions are bonded and what
type of interaction they are bonded through, the distance
between each of their atoms per simulation snapshot is first
calculated from matrix g. A nested FORTRAN loop exhaustively calculates the distances between atoms in each simulation snapshot and stores the distances in a temporary variable,
“dist.” If “dist” is less than the defined distance cutoff (3.5 Å
for the simulations of cyclo-HH self- and co-assembly) for a
given pair of atoms belonging to different individual molecules or ions, then the two atoms (and their corresponding
molecules) are considered to be bonded. Once a pair of atoms
are within the distance cutoff, the program compares the
interacting atoms to a list of interaction type definitions, and
information on how the atoms are bonded is stored in matrix
flag(entity1, resid1, entity2, resid2, type,i). The indices
“entity1” and “entity2” run from 1 to k, the total number
of molecule or ion types; the indices “resid1” and “resid2”
run from 1 to j(entity1) or j(entity2), the total number of
molecule or ion copies of “entity1” or “entity2”; the index
“type” runs from 1 to T, the total number of possible interaction types in the library of user-defined interactions; and index
“i” runs from 1 to S, the total number of snapshots to be
analyzed in the trajectory. If an atom of entity1 and resid1 is
bonded to another atom of entity2 and resid2 through interaction type T1, where T1 is a number between 1 and T (see
Note 8), then flag(entity1, resid1, entity2, resid2, T1, i) will
be populated with a 1; otherwise, it will be populated with a
0. After all possible bonded atoms between each pair of
molecules or ions are identified in all analyzed simulation
snapshots, the data are printed in an output text file for
further analysis with the first, second, third, fourth, fifth,
and sixth columns populated with i, entity1, resid1, entity2,
resid2, and flag, respectively. In this way, each row contains
information on how two molecules or ions interact and at
what snapshot in the simulation the interaction is formed. For
the study of cyclo-HH self- and co-assembly, this file is named
“pairs.dat.”
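The pair detection described in steps 11 and 12 can be sketched as follows. The original analysis was written in FORTRAN; this Python sketch is only an illustration under stated assumptions. Each snapshot is assumed to be a dictionary mapping (entity, resid) to a list of (atom name, coordinates), the interaction library is assumed to be a dictionary keyed by pairs of atom names, and the names `find_pairs` and `interaction_types` are hypothetical.

```python
import math
from itertools import combinations

CUTOFF = 3.5  # Angstrom, the distance cutoff quoted in the text

def find_pairs(snapshots, interaction_types, cutoff=CUTOFF):
    """Return sorted rows (i, entity1, resid1, entity2, resid2, type).

    A row is produced for every pair of atoms from two different molecules or
    ions that lie closer than `cutoff` and match a user-defined interaction type.
    """
    rows = set()
    for i, frame in enumerate(snapshots, start=1):
        molecules = sorted(frame.items())
        for (mol_a, atoms_a), (mol_b, atoms_b) in combinations(molecules, 2):
            for name_a, xyz_a in atoms_a:
                for name_b, xyz_b in atoms_b:
                    if math.dist(xyz_a, xyz_b) >= cutoff:
                        continue
                    # Look the atom pair up in the (assumed) interaction library.
                    itype = interaction_types.get(frozenset((name_a, name_b)))
                    if itype is not None:
                        rows.add((i, *mol_a, *mol_b, itype))
    return sorted(rows)
```

Each returned row mirrors one line of the output file described above and could be written to a text file for the clustering stage.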
13. The output of the previous step, “pairs.dat,” can be used to
group interacting molecules and ions into clusters such that a
number of s molecules or ions (cyclo-HH molecules, Zn
(II) ions, nitrate ions, or EPI) are defined to form a cluster
when a molecule or ion of any entity is in the vicinity of at least
one other. The clustering can be viewed as a two-stage process
in which temporary clusters of increasing size are detected in
the first stage (Fig. 5a), and a final list of clusters is output in
the second stage, with redundancies or the presence of smaller
clusters within larger clusters removed (Fig. 5b).
Fig. 5 Schematic of how clusters are detected in programs. (a) Temporary clusters are detected and expanded
through comparisons to a list of interacting pairs in “pairs.dat.” (b) Redundant temporary clusters are removed
in “cluster.dat” such that repeated clusters or smaller clusters that are simultaneously present within larger
clusters are removed. Clusters larger than two molecules or ions are colored according to the cluster
they belong to. Colored lines indicate matches between molecules or ions within temporary clusters and
bonded pairs of molecules or ions listed in “pairs.dat.” Larger clusters were observed in the simulations
[13, 14] and are omitted for clarity
14. In the first stage, for each simulation snapshot, clusters of
increasing size are identified with individual molecules or
ions added one at a time (Fig. 5a). In a given snapshot,
temporary clusters of two molecules or ions are compared to
a list of bonded pairs of molecules or ions. Both the temporary clusters of two and the list of bonded pairs of molecules
or ions correspond to pairs.dat. If any one of the molecules or
ions present in the list of bonded pairs is present in the
temporary cluster of two, then the molecule or ion is added to
the cluster and the cluster size is expanded to a temporary
cluster of three. Subsequently, the temporary clusters of three
molecules or ions are compared to the same list of bonded
pairs of molecules or ions. If any one of the molecules or ions
present in the list of bonded pairs is present in the cluster of
three, then the molecule or ion is added to the cluster and the
cluster size is expanded to a temporary cluster of four. This
process is repeated until no larger clusters are detected.
Through this process, temporary clusters of smaller sizes can
also be detected within larger clusters and the same temporary
cluster may be listed repeatedly, with the molecules or ions in
different orders (Fig. 5a, clusters in green, blue, and red).
15. In the second stage, after the largest temporary cluster is
detected, the presence of duplicate clusters and the presence
of smaller clusters within larger clusters are removed
(Fig. 5b). For example, if resid 17 of entity 2 is present in a
cluster of 5 at snapshot 210 (Fig. 5b), then it is no longer
considered to be part of a cluster of 4, 3, or 2 (Fig. 5a). The
remaining data are printed in an output text file for further
structural and energetic analysis with the first, second, and
third columns populated with i, s, and the molecules or ions
belonging to the cluster. In this way, each row corresponds to
an isolated cluster within the specified simulation snapshot i,
and the listed molecules or ions belong to the same isolated
cluster. A cluster of size s can be composed of several combinations of entities. For example, a cluster composed of
7(cyclo-HH) + 3(EPI) + 4(Zn(II)) + 6(nitrate) and another
cluster composed of 6(cyclo-HH) + 1(EPI) + 5(Zn(II)) +
8(nitrate) both have a cluster size of 20. For the study of cyclo-HH self- and co-assembly, this output file is named “cluster.dat” (Fig. 5b). Processing “pairs.dat”
prior to the detection of clusters, or processing “cluster.dat”
through Unix commands and FORTRAN programs, can
focus the analysis on clusters containing desired interactions
or compositions (see Note 9).
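The outcome of the two-stage clustering in steps 13 to 15 can be illustrated with a short sketch. Note that the sketch below does not reproduce the two-stage procedure itself: it swaps in a connected-components search over the bonded pairs, which yields the final, non-redundant clusters directly under the assumption that a cluster is any group of molecules or ions linked through bonded pairs. The function name and the row layout of the input are assumptions based on the description of “pairs.dat” above.

```python
from collections import defaultdict

def find_clusters(pair_rows):
    """Group bonded molecules or ions into clusters, snapshot by snapshot.

    pair_rows: iterable of (i, entity1, resid1, entity2, resid2, type).
    Returns rows (i, s, members), where members is a sorted tuple of
    (entity, resid) and s is the cluster size.
    """
    neighbors = defaultdict(lambda: defaultdict(set))
    for i, e1, r1, e2, r2, _type in pair_rows:
        neighbors[i][(e1, r1)].add((e2, r2))
        neighbors[i][(e2, r2)].add((e1, r1))

    clusters = []
    for i in sorted(neighbors):
        graph = neighbors[i]
        seen = set()
        for start in sorted(graph):
            if start in seen:
                continue
            # Grow the cluster one member at a time until nothing new is bonded.
            members, frontier = {start}, [start]
            while frontier:
                current = frontier.pop()
                for other in graph[current]:
                    if other not in members:
                        members.add(other)
                        frontier.append(other)
            seen |= members
            clusters.append((i, len(members), tuple(sorted(members))))
    return clusters
```

Because each molecule or ion is assigned to exactly one connected group per snapshot, no duplicate clusters or sub-clusters need to be removed afterwards.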
16. The clusters of molecules and ions detected by the structural
analysis programs are extracted from the simulation trajectories and further analyzed. For the simulations of cyclo-HH self- and co-assembly, each cluster was independently
extracted and analyzed as described in steps 17–20.
17. The percent solvent exposure of a molecule or ion within a
cluster can provide insights into the geometric properties of
the cluster and the location of each molecule or ion within the
cluster. The solvent accessible surface area (SASA) provides a
metric for how “buried” a molecule or ion is within a cluster;
the larger the SASA of a molecule or ion, the more exposed it
is and the more likely it is to be at the surface of the cluster;
the smaller the SASA of a molecule or ion, the more “buried”
it is and the more likely it is to be encapsulated in the interior
of the cluster. The percent solvent exposure of a molecule or
ion can be measured by the SASA of the molecule or ion
divided by the total molecular surface area (TSA) of the
same molecule or ion. For such calculations, it is important
to select a probe radius in accordance with the solvent used in
the simulations (see Note 10).
For the simulations of cyclo-HH co-assembly in isopropanol, the percent solvent exposure was calculated for each
molecule and ion within a given cluster, as defined by “cluster.
dat,” to determine the existence of exterior and interior layers
and the composition within the layers in the observed clusters. As isopropanol was used as the solvent in the simulations,
the probe radius used in the calculations was set to 2.2 Å
[24, 25]. Subsequently, the running average percent exposure
of each molecule or ion within a given cluster was calculated,
beginning with the most buried entity (lowest percent exposure) and moving outwards. Molecules or ions with a running
average solvent exposure equal to or less than a specific percentage (chosen to be 45% [14]) were considered to be in the
interior layer of the clusters, whereas molecules or ions with a
running average solvent exposure greater than this percentage
were considered to be at the exterior of the clusters. The
probe radius and the percent cutoff can be tuned to ensure
that the cutoff adequately identifies and differentiates the
molecules or ions at the interior and the exterior of the cluster.
Subsequently, the percent population of cyclo-HH, Zn
(II) ions, and nitrate ions or cyclo-HH, Zn(II) ions, nitrate
ions, and EPI within the interior and exterior layers of the
clusters was calculated. The analysis showed that EPI and Zn
(II) ions were predominantly located in the interior of the
clusters, nitrate ions were predominantly located at the exterior of the clusters, and cyclo-HH was located in both the
interior and exterior layers of the clusters [14].
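The layer assignment in step 17 can be sketched as follows. This minimal Python sketch assumes that the SASA and TSA values have already been computed with an external tool; the function names and the labels used in the usage note are hypothetical.

```python
def percent_exposure(sasa, tsa):
    """Percent solvent exposure: SASA divided by the total surface area, times 100."""
    return 100.0 * sasa / tsa

def assign_layers(exposures, cutoff=45.0):
    """Assign each molecule or ion of one cluster to the interior or exterior layer.

    exposures: dict mapping a label to its percent solvent exposure.
    Members are visited from most buried to most exposed, and the running
    average exposure is compared with `cutoff` (45% in reference [14]).
    """
    ordered = sorted(exposures.items(), key=lambda item: item[1])
    layers = {}
    running_sum = 0.0
    for rank, (label, value) in enumerate(ordered, start=1):
        running_sum += value
        running_average = running_sum / rank
        layers[label] = "interior" if running_average <= cutoff else "exterior"
    return layers
```

With illustrative exposures of 10, 20, 60, 95, and 100%, the running averages are 10, 15, 30, 46.25, and 57%, so the first three members are assigned to the interior and the last two to the exterior.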
18. The radius of gyration of specific molecules or ions within the
formed aggregates of the simulation systems can provide
insights into their compactness within the aggregates or their
location with respect to other molecules or ions of the aggregates. The radius of gyration (Rg) is the square root of the
mean squared deviation of the N atomic positions (rk) from their geometric center (r̄),
and can be calculated using trajectory analysis tools such as
Wordom [26, 27]:
$$R_g = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(r_k - \bar{r}\right)^2} \qquad (1)$$
When comparing the radius of gyration of specific molecules or ions across different simulation systems, it is important
to ensure a “fair” comparison. In particular, in the calculations
comparing the compactness of Zn(II) ions in clusters across
different simulation systems, the comparison was performed
between clusters containing the same number of cyclo-HH molecules, and
the number of Zn(II) ions within clusters of the same size
was similar across the two systems. Thus, the difference in
radius of gyration was not due to a lower number of Zn(II)
ions present in the clusters of one system than in the other.
In the simulations of cyclo-HH co-assembly with Zn
(II) ions in the absence and presence of nitrate ions, the radius
of gyration calculations suggested that Zn(II) ions are more
densely packed within cyclo-HH assemblies formed with
nitrate ions [14].
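For readers who wish to check values obtained from a trajectory analysis tool, Eq. 1 can be evaluated directly. The following Python sketch assumes unweighted atomic positions given as (x, y, z) tuples; the function name is hypothetical and the sketch is not a replacement for Wordom.

```python
import math

def radius_of_gyration(coords):
    """Root mean squared distance of the positions from their geometric center (Eq. 1)."""
    n = len(coords)
    center = [sum(p[d] for p in coords) / n for d in range(3)]
    mean_squared = sum(
        sum((p[d] - center[d]) ** 2 for d in range(3)) for p in coords
    ) / n
    return math.sqrt(mean_squared)
```

Applied to the Zn(II) positions of a cluster, a smaller value indicates that the ions are packed more closely around their common center.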
19. For co-assembled clusters containing different entities, the
radius of gyration of one entity within a cluster can be compared to the radius of gyration of another within the same
cluster to indicate if one entity is encapsulated by the other.
To compare the radius of gyration of different entities within a
given cluster, the radius of gyration of one entity (e.g., EPI)
should be subtracted from the radius of gyration of another
(e.g., cyclo-HH) within the same cluster. It is also important to
ensure a “fair comparison” when comparing the radius of
gyration of specific molecules or ions within different clusters
of the same simulation system. Restricting the analysis to clusters of a given percent composition of each entity and calculating the difference in radius of gyration between the entities of
interest per cluster (rather than comparing the average radius
of gyration for each entity across all clusters) can enable a “fair
comparison.” For the simulations of cyclo-HH in the presence
of Zn(II) ions, nitrate ions, and EPI, the calculations were
performed for aggregates containing at least 10 molecules,
with a composition ranging from 30% cyclo-HH and 70%
EPI to 70% cyclo-HH and 30% EPI [14]. The percent composition criterion was introduced to ensure that each cluster had a
sufficient number of cyclo-HH and EPI [14]. These calculations suggested that cyclo-HH was encapsulating EPI
(Fig. 6) [14].
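The per-cluster comparison in step 19 can be sketched as follows. This minimal Python sketch assumes that the radius of gyration of each entity has already been computed for every cluster; the dictionary keys and function names are hypothetical, while the thresholds (at least 10 molecules, 30 to 70% cyclo-HH) follow the criteria quoted above.

```python
def passes_composition(n_cyclo_hh, n_epi, min_molecules=10, low=0.30, high=0.70):
    """True if the cluster has enough molecules and a balanced cyclo-HH/EPI composition."""
    total = n_cyclo_hh + n_epi
    if total < min_molecules:
        return False
    fraction_cyclo_hh = n_cyclo_hh / total
    return low <= fraction_cyclo_hh <= high

def rg_differences(clusters):
    """Rg(cyclo-HH) minus Rg(EPI) for every cluster that passes the composition filter.

    clusters: list of dicts with the (assumed) keys
    'n_cyclo_hh', 'n_epi', 'rg_cyclo_hh', and 'rg_epi'.
    """
    return [
        c["rg_cyclo_hh"] - c["rg_epi"]
        for c in clusters
        if passes_composition(c["n_cyclo_hh"], c["n_epi"])
    ]
```

A consistently positive difference would be in line with cyclo-HH lying farther from the cluster center than EPI, that is, with cyclo-HH encapsulating EPI.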
Fig. 6 Molecular graphics image of EPI (red) and Zn(II) ions (yellow) encapsulated
by cyclo-HH (blue) and nitrate ions (green)
20. Time evolution analysis, tracking geometric properties with
respect to simulation time, can provide insights into the process by which the molecules and ions of the clusters
co-assemble. Using the data in “pairs.dat,” each interaction
type can be plotted with respect to simulation time to observe
the order in which the interactions are formed. Using data
from “cluster.dat,” the composition of the clusters formed
within the simulations can also be plotted with respect to
simulation time, to observe the order in which the molecules or
ions aggregate into clusters. The analysis can be tuned to
include only molecules or ions that eventually form large clusters specified by “cluster.dat.” For the simulations comparing
the self-assembly of cyclo-HH in the absence and presence of
Zn(II) in methanol, we tracked interactions between pairs of
cyclo-HH to uncover the order in which interactions were
formed between the pairs to ultimately form ordered elementary structures of bonded cyclo-HH pairs [13]. For the simulations of cyclo-HH with Zn(II) ions in the presence or absence
of nitrate ions, and the simulations of cyclo-HH, Zn(II) ions,
nitrate ions and EPI, we investigated the time evolution of
clusters composed of individual molecules or ions that eventually form large clusters [14]. In this case, due to the presence of
additional components, the definition of a “large” cluster
should balance statistical significance with cluster complexity.
For example, a larger cluster may be sufficiently complex, but
only occur one time within the simulation. Likewise, a small
cluster may occur many times within a simulation allowing for
sufficient statistical significance, but may not be complex
enough to describe a cluster. For the studies involving cyclo-HH with Zn(II) in the presence or absence of nitrate ions, we
focused on clusters containing at least 10 cyclo-HH. For the
simulations involving cyclo-HH, Zn(II) ions, nitrate ions, and
EPI, we focused on clusters that eventually lead to clusters
containing at least 10 molecules (either cyclo-HH or EPI)
with a composition ranging from 30% cyclo-HH and 70%
EPI to 70% cyclo-HH and 30% EPI. The additional percent
composition criterion for the simulations of cyclo-HH, Zn
(II) ions, nitrate ions, and EPI ensured that all clusters analyzed
had a sufficient number of cyclo-HH and EPI molecules to
observe both interior and exterior layers of the clusters.
The plotted data showed that the interior cluster (composed predominantly of EPI, Zn(II) ions, and cyclo-HH)
forms first, followed by exterior molecules and ions of the
cluster (composed predominantly of cyclo-HH and nitrate
ions) wrapping around the preformed interior [14].
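The time evolution analysis of step 20 can be prepared from the rows of “pairs.dat” with a few lines of code. The Python sketch below assumes the six-column row layout described in step 12; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def interaction_counts_by_snapshot(pair_rows):
    """Count how many times each interaction type occurs in each snapshot."""
    counts = defaultdict(Counter)
    for i, _e1, _r1, _e2, _r2, itype in pair_rows:
        counts[i][itype] += 1
    return {i: dict(counts[i]) for i in sorted(counts)}

def first_appearance(pair_rows):
    """Earliest snapshot index at which each interaction type is observed."""
    first = {}
    for i, _e1, _r1, _e2, _r2, itype in pair_rows:
        if itype not in first or i < first[itype]:
            first[itype] = i
    return first
```

Plotting the per-snapshot counts against simulation time, or sorting the interaction types by their first appearance, gives a simple view of the order in which interactions form.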
3.2.3 Energetic Analysis
21. Association free energy calculations can provide valuable
insights into the mechanism and driving forces leading to
the co-assembly and stabilization of clusters formed by
cyclo-HH (Fig. 4). The energy calculations can also serve to
complement the structural analysis and ensure that the conclusions derived from the two analyses are mutually consistent. The MM-GBSA approximation [28, 29] provides a
relatively fast and effective means to evaluate the association
free energy of the clusters in “thought” energy calculations
examining different potential pathways of co-assembly
[13, 14].
22. In these association free energy calculations, the Generalized
Born with a simple Switching (GBSW) implicit-solvent model
[30] was used to account for the solvent. In the implicit-solvent model, the dielectric constant can be tuned in accordance with the solvent used within the simulations and experiments. For the studies involving methanol, we used a
dielectric constant of 33.5 to account for the dielectric environment of methanol [13, 31]; for the studies involving isopropanol, we used a dielectric constant of 18.4 to account for
the dielectric environment of isopropanol [14, 32].
23. Nonpolar solvation effects are important in these calculations and can cause inaccuracies in the calculated
energy values if not accounted for carefully. Such contributions
can be calculated through a surface tension coefficient multiplying the solvent accessible surface area, and the surface
tension coefficient can be obtained by fitting the experimental
hydration energies [33]. We suggest that if comprehensive
studies of the appropriate surface tension coefficient in balance with the GBSW-implicit solvent are lacking for the solvent under investigation, nonpolar solvation effects may be
omitted to avoid any bias or inaccuracy driven by an arbitrarily
chosen value. Nevertheless, additional calculations can be
performed using the surface tension coefficient’s default
value for GBSW, which corresponds to 0.03 kcal mol⁻¹ Å⁻²,
determined for water solvation [30], to ensure that the overall
trends remain the same. While the inclusion of nonpolar
solvation effects using the default surface tension coefficient
value may not be sufficiently accurate for other solvents, such
a calculation could be used to verify that its inclusion does not
change the overall trends, but only affects the resulting
absolute values. In the simulations of cyclo-HH self- and
co-assembly, energy calculations were performed with the
nonpolar solvation effects omitted, as a consensus surface
tension coefficient for isopropanol had not been previously
reported [13, 14]. However, we confirmed that the overall
trends remained the same using the default value for the
GBSW surface tension coefficient.
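The size of the nonpolar term discussed in step 23 can be estimated with a simple calculation. The Python sketch below assumes that the nonpolar contribution to the association free energy is the surface tension coefficient multiplied by the change in solvent accessible surface area upon association; this difference form and the function name are assumptions made for illustration, and the SASA values in the usage note are hypothetical.

```python
GAMMA_DEFAULT = 0.03  # kcal mol^-1 A^-2, the GBSW default (water) quoted in the text

def nonpolar_association_term(sasa_cluster, sasa_isolated, gamma=GAMMA_DEFAULT):
    """Surface tension coefficient times the SASA change upon association.

    sasa_cluster: SASA of the assembled cluster (A^2).
    sasa_isolated: SASA values of the isolated components (A^2).
    """
    return gamma * (sasa_cluster - sum(sasa_isolated))
```

For example, burying 200 Å² of surface upon association (illustrative values of 900 Å² for the cluster against 1100 Å² for the isolated components) would contribute about −6 kcal/mol with the default coefficient.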
24. Clusters detected in the simulations are isolated from the
simulation systems and undergo a series of “thought” energy
calculations. To perform the calculations for all detected clusters independently, a FORTRAN program reads each line of
“cluster.dat.” For each line, the program writes a CHARMM
[15] script file for each “thought” energy calculation, executes the CHARMM [15] script, extracts energetic data from
the CHARMM [15] output, and calculates the association
free energies normalized by the size of the cluster. The FORTRAN program executes CHARMM [15] and extracts data
from the CHARMM [15] output through system calls combined with Unix commands. The calculated normalized energies for each “thought” energy calculation are printed into
separate data files for further analysis. Any conclusions derived
from such “thought” energy calculations should
be cross-validated against the structural analysis described in the
sections above (see Note 11).
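The driver loop described in step 24 can be outlined as follows. The original driver was a FORTRAN program that wrote and executed CHARMM scripts through system calls; the Python sketch below only illustrates the control flow. The layout of the member fields in “cluster.dat” is not specified in the text, so the `entity:resid` token format is an assumption, and `energy_fn` is a hypothetical callback standing in for the generation, execution, and parsing of one CHARMM calculation.

```python
def parse_cluster_line(line):
    """Parse one line of the form 'i s entity:resid entity:resid ...' (assumed layout)."""
    fields = line.split()
    snapshot, size = int(fields[0]), int(fields[1])
    members = [tuple(int(x) for x in token.split(":")) for token in fields[2:]]
    if len(members) != size:
        raise ValueError("cluster size does not match the number of listed members")
    return snapshot, size, members

def normalized_energies(lines, energy_fn):
    """Run one energy calculation per cluster and normalize it by the cluster size.

    energy_fn(snapshot, members) is a user-supplied placeholder that returns
    the association free energy of the cluster.
    """
    results = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        snapshot, size, members = parse_cluster_line(line)
        delta_g = energy_fn(snapshot, members)
        results.append((snapshot, size, delta_g / size))
    return results
```

One such loop would be run for each “thought” energy calculation, with the normalized energies written to separate data files.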
25. In the series of “thought” energy calculations, the isolated
cluster is subjected to different conditions. Figure 7 shows
possible “thought” free energy calculations that can be used
to examine different possible pathways of cyclo-HH co-assembly. In these energy calculations, different initial states of
the molecules and ions within the clusters are explored. For
example, all molecules and ions comprising a cluster can be
assumed to initially be completely isolated and immersed in
the surrounding solvent for one set of energy calculations. In
another set of energy calculations, a portion of the molecules
and ions composing a cluster can be assumed to be preformed
initially prior to co-assembly. Insights gained from the energy
calculations presented in Fig. 7 can lead to the formulation of
additional “thought” energy calculations exploring intermediate states that lead to the final formation of the cluster, as
presented in Fig. 8.
26. To represent the free energy for isolated individual molecules
or ions, which are part of a cluster, to spontaneously self-assemble into the cluster, the MM-GBSA association free
energy is calculated through Eq. 2. This corresponds to
pathways B, C, E, I of Fig. 7.
$$\Delta G(s) = E_{\mathrm{cluster}} - \sum_{i=1}^{s} E_i \qquad (2)$$
Fig. 7 Schematic of “thought” energy calculations performed to examine possible pathways of self-assembly
for (a–d) cyclo-HH in the presence of Zn(II) ions, (e–h) cyclo-HH in the presence of Zn(II) and nitrate ions, and
(i–l) cyclo-HH in the presence of Zn(II) ions, nitrate ions, and EPI in references [13] and [14]
The energy of the cluster, Ecluster, represents the energy of
the constituent molecules and ions, with all other molecules
or ions deleted. The energy of each of the isolated individual
molecules or ions of the cluster, Ei, is calculated by assuming
that each molecule or ion, i, has the same conformation as
within the cluster, but is isolated and fully immersed in solution,
with all other molecules or ions within the cluster deleted.
Fig. 8 Schematic of potential pathways related to the most energetically favorable pathway of cyclo-HH-Zn(NO3)2-EPI
cluster formation according to Fig. 7. The free energy to associate the interior cluster from the
individual EPI and Zn(II) ions was unfavorable according to energy calculations (indicated by the red “X”).
However, the free energy to associate the interior layer is favorable if the Zn(II) ions are in a peptide-like
environment prior to association, and the interior layer is in a peptide-like environment prior to the formation of
the cluster
27. To represent the free energy for isolated individual molecules
or ions, which are part of a cluster, to aggregate onto a
preformed portion of the same final cluster, the MM-GBSA
association free energy is calculated through Eq. 3. This corresponds to pathways F, G, J, K of Fig. 7.
$$\Delta G(s) = E_{\mathrm{cluster}} - E_{\mathrm{preformed}} - \sum_{i=1}^{s} E_i \qquad (3)$$
The energy of the cluster, Ecluster, represents the energy of
the constituent molecules and ions, with all other molecules
or ions deleted. The energy of the preformed portion, Epreformed, represents the energy of the molecules and ions constituting the
preformed portion of the cluster, with all other molecules or
ions deleted. The energy of each of the isolated individual molecules
or ions of the cluster, Ei, is calculated by assuming that each
molecule or ion, i, has the same conformation as within the
cluster, but is isolated and fully immersed in solution, with
all other molecules or ions within the cluster deleted.
28. To represent the free energy for two preformed portions of
the cluster to aggregate with each other, the MM-GBSA
association free energy is calculated through Eq. 4. This corresponds to pathways A, H, L of Fig. 7.
$$\Delta G(s) = E_{\mathrm{cluster}} - E_{\mathrm{preformed\,1}} - E_{\mathrm{preformed\,2}} \qquad (4)$$
The energy of the cluster, Ecluster, represents the energy of
the constituent molecules and ions, with all other molecules
or ions deleted. Epreformed 1 represents the energy of the molecules and ions constituting the
first preformed portion of the cluster, with all other molecules
or ions deleted. Likewise, Epreformed 2 represents the energy of the molecules and ions constituting the
second preformed portion of the cluster, with all other molecules or ions deleted.
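Once the individual energy terms have been extracted, Eqs. 2 to 4 reduce to simple arithmetic. The Python sketch below assumes that all energies are already available as numbers in the same units; the function names and the pathway labels are hypothetical.

```python
def dg_from_individuals(e_cluster, e_individuals):
    """Eq. 2: isolated individual molecules or ions assemble into the cluster."""
    return e_cluster - sum(e_individuals)

def dg_onto_preformed(e_cluster, e_preformed, e_individuals):
    """Eq. 3: isolated individuals aggregate onto a preformed portion of the cluster."""
    return e_cluster - e_preformed - sum(e_individuals)

def dg_two_preformed(e_cluster, e_preformed_1, e_preformed_2):
    """Eq. 4: two preformed portions of the cluster aggregate with each other."""
    return e_cluster - e_preformed_1 - e_preformed_2

def most_favorable(pathways):
    """Return the label of the pathway with the lowest association free energy."""
    return min(pathways, key=pathways.get)
```

Comparing the values obtained for the different pathways of Fig. 7, after normalization by cluster size and averaging over clusters, indicates which hypothetical pathway is energetically preferred.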
29. In the aforementioned energy calculations, the cluster, preformed portions of the cluster, and/or individual molecules
or ions of the cluster are assumed to be fully immersed in pure
solvent (methanol and isopropanol in references [13] and
[14], respectively) through the deletion of molecules or ions
not involved in the energy calculation. Additional energy
calculations can be performed to examine hypothetical pathways in which the interior of the cluster, the exterior of the
cluster, and/or individual molecules or ions of the cluster
are in a peptide-like environment, such as Fig. 7d.
In such hypothetical calculations, the nonpolar component of the association free energies is calculated in the same
way as described in Eqs. 2–4. The polar component of the
association free energies is calculated by setting to zero the
charge of all molecules or ions of the cluster, except for the
molecules or ions involved in the energy calculation. For
example, to examine the contribution of molecules or ions
co-assembling in a peptide-like environment, the association
free energy would be calculated through Eq. 2, except that
the polar component of Ei for a given molecule or ion is
calculated by setting the charge of all other molecules and
ions within the cluster to zero (Fig. 7d). In this way, the
calculations represent the energy in a peptide-like dielectric
environment, rather than a pure solvent dielectric environment (see Note 12).
30. To provide insights into the role of Zn(II) ions in clusters
formed by cyclo-HH in the presence of Zn(II), association
“thought” free energy calculations were performed under
several assumptions: (a) preformed assemblies of cyclo-HH
and preformed assemblies of Zn(II) join to form the final
cluster (Fig. 7a), (b) individual cyclo-HH molecules and Zn
(II) ions spontaneously assemble to form the final cluster
(Fig. 7b), (c) individual cyclo-HH molecules spontaneously
assemble to form the final cluster with Zn(II) ions not contributing energetically (Fig. 7c), and (d) individual cyclo-HH
molecules and Zn(II) ions spontaneously assemble to form
the final cluster, with Zn(II) ions being within the dielectric
environment of the final cluster (Fig. 7d).
These calculations revealed that cyclo-HH co-assembles
with Zn(II) ions through an “environment switching mechanism” by which Zn(II) ions are first pulled from the dielectric
environment of the surrounding methanol solvent by coordinating with individual or pairs of cyclo-HH, followed by the
assembly of the coordinated Zn(II) ions and cyclo-HH into
the final clusters [13].
31. To understand the mechanism by which cyclo-HH co-assembles with Zn(II) and nitrate ions, as well as the mechanism by
which cyclo-HH co-assembles with Zn(II) ions, nitrate ions,
and EPI, association “thought” free energy calculations were
also performed under several assumptions: (a) the individual
molecules and ions forming the cluster are initially completely
immersed in pure isopropanol and spontaneously self-assemble into a cluster (Fig. 7e, i), (b) the interior layer
assembly is preformed in pure isopropanol and individual
molecules and ions of the exterior layer, completely immersed
in pure isopropanol, subsequently aggregate onto the preformed interior layer to form a cluster (Fig. 7f, j), (c) the
exterior layer assembly is preformed in pure isopropanol and
individual molecules and ions forming the interior layer,
completely immersed in pure isopropanol, subsequently
aggregate on the preformed exterior layer to form a cluster
(Fig. 7g, k), and (d) the interior layer and the exterior layer
assemblies are individually preformed, initially not interacting
with each other and completely immersed in pure isopropanol, then subsequently aggregate to form a cluster (Fig. 7h, l).
These calculations suggested that the most energetically
favored pathway is when the interior nucleus assembles first,
followed by individual exterior components wrapping around
the interior nucleus to form the clusters [14]. Additional
energy calculations examining the most favorable pathway
according to Fig. 7 were performed to gain more insights
into the formation of clusters within the simulations of
cyclo-HH in the presence of Zn(II) ions, nitrate ions, and
EPI (Fig. 8). These calculations were in line with structural
calculations (see Note 11) and suggested that Zn(II) ions are
pulled from the isopropanol environment into a more
peptide-like environment by individual molecules or pairs of
cyclo-HH, enabling the self-encapsulation of EPI, which further facilitates the co-assembly of individual cyclo-HH and
nitrate ions wrapping around the preformed EPI-Zn(II)
interior [14].
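The charge-zeroing scheme used for the polar component in steps 29 to 31 can be illustrated with a minimal Python sketch. This is not the workflow used in the protocol: the data layout, the function name `charges_for_polar_term`, and the toy charges are assumptions introduced only for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical layout: each atom is (molecule_id, partial_charge).
Atom = Tuple[str, float]


def charges_for_polar_term(atoms: List[Atom], active: Iterable[str]) -> List[float]:
    """Return per-atom charges with every molecule outside `active` neutralized.

    Neutralized atoms stay in the list, so in a later implicit-solvent
    calculation they would still occupy volume; only their charges are zero.
    """
    keep = set(active)
    return [q if mol in keep else 0.0 for mol, q in atoms]


# Toy cluster with made-up charges: two cyclo-HH-like molecules and one Zn(II) ion.
cluster = [("HH1", -0.5), ("HH1", 0.5), ("HH2", -0.4), ("HH2", 0.4), ("ZN1", 2.0)]

# Charges seen when the polar term of the Zn(II) ion is evaluated inside the
# uncharged, "peptide-like" cluster (as in Fig. 7d).
zn_only = charges_for_polar_term(cluster, active=["ZN1"])
print(zn_only)  # [0.0, 0.0, 0.0, 0.0, 2.0]
```

Keeping the neutralized atoms in place, rather than deleting them, is what distinguishes this scenario from evaluating the ion alone in pure solvent.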
4 Notes
1. If the fluorescence response is low (<1 × 10⁵ CPS), the slit
value can be increased. If the material has a high fluorescence
response (>1.7 × 10⁷ CPS), then the slit value can be reduced
accordingly in order to avoid damage to the detector.
2. The substrate mica sheet should be first peeled off with tape to
expose a fresh surface.
3. The accessibility of polarizable force fields has been increased
since the debut of FFParam [34], through which users can
generate and optimize polarizable force fields compatible with
the Drude force field [20]. If polarizable force fields are available for all components of the simulation system or the user has
access to CGenFF [35, 36], FFParam [34], and Gaussian [37]
or Psi4 [38], then the use of polarizable force fields would be
recommended as in reference [13].
4. The use of multiple, replicate simulations with different initial
configurations and conditions can be advantageous to check
for reproducibility of computational results across all runs.
Additionally, replicate simulations can allow for the analysis of
statistical errors or to reveal any “trapping” (failure to explore
important configurations outside of an energetic well) within
the simulations [39].
5. If, within the simulations, the molecules and ions spend a large
portion of the simulation in remote parts of the solvent box
without interacting with other molecules or ions within the
simulation, then the size of the solvent box may be decreased as
a means to facilitate and accelerate the self-assembly process.
6. Convergence of MD simulations can be checked throughout
all the structural and energetic analysis. Plotting geometric
properties such as radius of gyration of the entire system,
number of clusters formed, or the types of interactions formed
as a function of time can provide a visual indication of whether
the plotted values become steady as the simulations progress.
Likewise, plotting the running average energy of the system
can also indicate whether longer simulation times are needed.
7. The interactions tracked in the simulations of cyclo-HH self- and co-assembly were guided by experimentally resolved crystal
structures and visual inspection of the simulation trajectories.
This helped to ensure that elements of the crystal structure
were reproduced in the simulations.
8. Each interaction type number corresponds to a specific interaction between atoms belonging to bonded molecules or ions.
9. To examine clusters containing a particular interaction type or
set of interaction types, “pairs.dat” can be processed to only
include the interaction types of interest prior to the detection
of clusters. In this way, the molecules or ions in the detected
clusters will be “connected” through the interactions of interest. This can be particularly useful when searching for ordered
structures within the simulations. To examine clusters with a particular composition of molecules or ions, for example, 50% cyclo-HH and 50% Zn(II) ions, “cluster.dat” can
be processed to isolate the clusters with the desired
composition.
10. The choice of the probe radius can affect the calculated surface
area. It is recommended that the user consults the literature to
set the probe radius for the solvent under investigation.
11. The conclusions derived from the results of structural analysis
can be verified with the results of the energetic analysis, and
vice versa. If the results do not align, other potential thermodynamic pathways could be evaluated in the energetic analysis,
and any user-defined criteria of the structural analysis could be
compared with visual inspection of the simulation trajectories.
12. The energy calculations represent “bounds of actual scenarios,” since in reality, the molecules and ions of the cluster
cannot be fully immersed in pure solvent (methanol and isopropanol in the references [13] and [14], respectively) or be in
the dielectric environment of the formed cluster. However,
such calculations can provide insights into the pathways of
co-assembly and be cross-validated with the results of the
structural analysis.
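The convergence and reproducibility checks of Notes 4 and 6 can be sketched in a few lines of Python. The function names, the plateau criterion (`window`, `tol`), and the toy numbers below are assumptions chosen for illustration and are not part of the original protocol.

```python
from math import sqrt
from statistics import stdev
from typing import List, Sequence


def running_average(values: Sequence[float]) -> List[float]:
    """Cumulative mean after each frame; a plateau suggests convergence."""
    averages, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        averages.append(total / count)
    return averages


def has_plateaued(values: Sequence[float], window: int, tol: float) -> bool:
    """True if the running average varies by less than `tol` over the last `window` frames."""
    tail = running_average(values)[-window:]
    return max(tail) - min(tail) < tol


def standard_error(replica_means: Sequence[float]) -> float:
    """Standard error of the mean across independent replicate simulations."""
    return stdev(replica_means) / sqrt(len(replica_means))


# Toy per-frame energies (arbitrary units, made up for the example).
energies = [10.0, 8.0, 6.0, 4.0]
print(running_average(energies))               # [10.0, 9.0, 8.0, 7.0]
print(has_plateaued(energies, window=2, tol=0.5))  # False: still drifting
```

A running average that keeps drifting, as in the toy series, would argue for longer simulations, while a large standard error across replicas would point to trapping or insufficient sampling.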
Acknowledgments
A.A.O. acknowledges the Texas A&M University Graduate Diversity Fellowship from the Texas A&M University Graduate and
Professional School. All MD simulations and computational analysis were conducted using the Ada supercomputing cluster at the
Texas A&M High Performance Research Computing Facility, and
additional facilities at Texas A&M University. E.G. acknowledges
support in part from the European Research Council under the
European Union Horizon 2020 research and innovation program
(No. 694426). E.G. also acknowledges support from NSF-BSF
Joint Funding Research Grants (No. 2020752). Y.C. gratefully
acknowledges the Center for Nanoscience and Nanotechnology
of Tel Aviv University for financial support. P.T. acknowledges support from the National Science Foundation (Award Number
2104558; NSF-BSF: Computational and Experimental Design of
Novel Peptide Nanocarriers for Cancer Drugs).
References
1. Wang H, Feng Z, Xu B (2017) Bioinspired
assembly of small molecules in cell milieu.
Chem Soc Rev 46:2421–2436
2. Wei G, Su Z, Reynolds NP, Arosio P, Hamley
IW, Gazit E, Mezzenga R (2017) Self-assembling peptide and protein amyloids:
from structure to tailored function in nanotechnology. Chem Soc Rev 46:4661–4708
3. DeGrado WF, Wasserman ZR, Lear JD (1989)
Protein design, a minimalist approach. Science
243:622–628
4. Reches M, Gazit E (2003) Casting metal nanowires within discrete self-assembled peptide
nanotubes. Science 300:625–627
5. Gazit E (2007) Self assembly of short aromatic
peptides into amyloid fibrils and related nanostructures. Prion 1:32–35
6. Yemini M, Reches M, Gazit E, Rishpon J
(2005) Peptide nanotube-modified electrodes
for enzyme-biosensor applications. Anal Chem
77:5155–5159
7. Handelman A, Kuritz N, Natan A, Rosenman
G (2016) Reconstructive phase transition in
ultrashort peptide nanostructures and induced
visible photoluminescence. Langmuir
32:2847–2862
8. Guo C, Arnon ZA, Qi R, Zhang Q, Adler-Abramovich L, Gazit E, Wei G (2016) Expanding the nanoarchitectural diversity through
aromatic di- and tri-peptide coassembly:
nanostructures and molecular mechanisms.
ACS Nano 10:8316–8324
9. Nikitin T, Kopyl S, Shur VY, Kopelevich YV,
Kholkin AL (2016) Low-temperature photoluminescence in self-assembled diphenylalanine
microtubes. Phys Lett A 380:1658–1662
10. Guo C, Luo Y, Zhou R, Wei G (2014) Triphenylalanine peptides self-assemble into nanospheres and nanorods that are different from
the nanovesicles and nanotubes formed by
diphenylalanine peptides. Nanoscale
6:2800–2811
11. Tao K, Fan Z, Sun L, Makam P, Tian Z,
Ruegsegger M, Shaham-Niv S, Hansford D,
Aizen R, Pan Z, Galster S, Ma J, Yuan F,
Si M, Qu S, Zhang M, Gazit E, Li J (2018)
Quantum confined peptide assemblies with
tunable visible to near-infrared spectral range.
Nat Commun 9:3217
12. Barondeau DP, Kassmann CJ, Tainer JA, Getzoff ED (2002) Structural chemistry of a green
fluorescent protein Zn biosensor. J Am Chem
Soc 124:3522–3524
13. Tao K, Chen Y, Orr AA, Tian Z, Makam P,
Gilead S, Si M, Rencus-Lazar S, Qu S,
Zhang M, Tamamis P, Gazit E (2020)
Enhanced fluorescence for bioassembly by
environment-switching doping of metal ions.
Adv Funct Mater 30:1909614
14. Chen Y, Orr AA, Tao K, Wang Z, Ruggiero A,
Shimon LJW, Schnaider L, Goodall A, Rencus-Lazar S, Gilead S, Slutsky I, Tamamis P, Tan Z,
Gazit E (2020) High-efficiency fluorescence
through bioinspired supramolecular self-assembly. ACS Nano 14:2798–2807
15. Brooks BR, Brooks CL, Mackerell AD,
Nilsson L, Petrella RJ, Roux B, Won Y,
Archontis G, Bartels C, Boresch S, Caflisch A,
Caves L, Cui Q, Dinner AR, Feig M, Fischer S,
Gao J, Hodoscek M, Im W, Kuczera K,
Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor RW, Post CB, Pu JZ, Schaefer M, Tidor B,
Venable RM, Woodcock HL, Wu X, Yang W,
York DM, Karplus M (2009) CHARMM: the
biomolecular simulation program. J Comput
Chem 30:1545–1614
16. Humphrey W, Dalke A, Schulten K (1996)
VMD: visual molecular dynamics. J Mol
Graph 14:33–38
17. Sterling T, Irwin JJ (2015) ZINC 15 – ligand
discovery for everyone. J Chem Inf Model
55:2324–2337
18. Kim S, Chen J, Cheng T, Gindulyte A, He J,
He S, Li Q, Shoemaker BA, Thiessen PA, Yu B,
Zaslavsky L, Zhang J, Bolton EE (2019) PubChem 2019 update: improved access to chemical data. Nucleic Acids Res 47:D1102–D1109
19. ChemAxon (n.d.) ChemAxon MarvinSketch.
Version 17.1.30. http://www.chemaxon.com.
Accessed 18 Dec 2020
20. Lin F-Y, Huang J, Pandey P, Rupakheti C, Li J,
Roux BT, MacKerell AD (2020) Further optimization and validation of the classical Drude
polarizable protein force field. J Chem Theory
Comput 16:3221–3239
21. Tamamis P, Kasotakis E, Archontis G, Mitraki
A (2014) Combination of theoretical and
experimental approaches for the design and
study of fibril-forming peptides. Methods Mol
Biol 1216:53–70
22. Lee J, Cheng X, Swails JM, Yeom MS, Eastman
PK, Lemkul JA, Wei S, Buckner J, Jeong JC,
Qi Y, Jo S, Pande VS, Case DA, Brooks CL,
MacKerell AD, Klauda JB, Im W (2016)
CHARMM-GUI Input Generator for
NAMD, GROMACS, AMBER, OpenMM,
and CHARMM/OpenMM Simulations
Using the CHARMM36 Additive Force Field.
J Chem Theory Comput 12:405–413
23. Jo S, Kim T, Iyer VG, Im W (2008)
CHARMM-GUI: a web-based graphical user
interface for CHARMM. J Comput Chem
29:1859–1865
24. Mayer SW (1963) A molecular parameter relationship between surface tension and liquid
compressibility. J Phys Chem 67:2160–2164
25. Tang KE, Bloomfield VA (2000) Excluded volume in solvation: sensitivity of scaled-particle
theory to solvent size and density. Biophys J
79:2222–2234
26. Seeber M, Cecchini M, Rao F, Settanni G,
Caflisch A (2007) Wordom: a program for efficient analysis of molecular dynamics simulations. Bioinformatics 23(19):2625–2627
27. Seeber M, Felline A, Raimondi F, Muff S,
Friedman R, Rao F, Caflisch A, Fanelli F
(2011) Wordom: a user-friendly program for
the analysis of molecular structures, trajectories, and free energy surfaces. J Comput
Chem 32:1183–1194
28. Gohlke H, Case DA (2004) Converging free
energy estimates: MM-PB(GB)SA studies on
the protein-protein complex Ras-Raf. J Comput Chem 25:238–250
29. Hayes JM, Archontis G (2012) MM-GB(PB)SA
calculations of protein-ligand binding
free energies. Molecular dynamics - studies of
synthetic and biological macromolecules
30. Im W, Lee MS, Brooks CL (2003) Generalized
born model with a simple smoothing function.
J Comput Chem 24:1691–1702
31. Wohlfarth C (2015) Static dielectric constants
of pure liquids and binary liquid mixtures: supplement to volume IV/17
32. Khimenko MT, Litinskaya VV, Khomenko GP
(1982) Effect of concentration on the polarizability of isopropyl alcohol in dimethyl sulfoxide. Zh Fiz Khim 56:867–870
33. Zhang J, Zhang H, Wu T, Wang Q, van der
Spoel D (2017) Comparison of implicit and
explicit solvent models for the calculation of
solvation free energy in organic solvents. J
Chem Theory Comput 13:1034–1043
34. Kumar A, Yoluk O, MacKerell AD (2020)
FFParam: Standalone package for CHARMM
additive and Drude polarizable force field
parametrization of small molecules. J Comput
Chem 41:958–970
35. Vanommeslaeghe K, MacKerell AD (2012)
Automation of the CHARMM General Force
Field (CGenFF) I: bond perception and atom
typing. J Chem Inf Model 52:3144–3154
36. Vanommeslaeghe K, Raman EP, MacKerell AD
(2012) Automation of the CHARMM General
Force Field (CGenFF) II: assignment of
bonded parameters and partial atomic charges.
J Chem Inf Model 52:3155–3168
37. Frisch MJ, Trucks GW, Schlegel HB, Scuseria
GE, Robb MA, Cheeseman JR et al (2016)
Gaussian 03. Gaussian, Inc., Wallingford, CT
38. Parrish RM, Burns LA, Smith DGA, Simmonett AC, DePrince AE, Hohenstein EG,
Bozkaya U, Sokolov AY, Di Remigio R,
Richard RM, Gonthier JF, James AM, McAlexander HR, Kumar A, Saitow M, Wang X,
Pritchard BP, Verma P, Schaefer HF,
Patkowski K, King RA, Valeev EF, Evangelista
FA, Turney JM, Crawford TD, Sherrill CD
(2017) Psi4 1.1: an open-source electronic
structure program emphasizing automation,
advanced libraries, and interoperability. J
Chem Theory Comput 13:3185–3197
39. Grossfield A, Zuckerman DM (2009) Quantifying uncertainty and sampling quality in biomolecular simulations. Annu Rep Comput
Chem 5:23–48
Chapter 11
Computational Tools and Strategies to Develop
Peptide-Based Inhibitors of Protein-Protein Interactions
Maxence Delaunay and Tâp Ha-Duong
Abstract
Protein-protein interactions play crucial and subtle roles in many biological processes and modifications of
their fine mechanisms generally result in severe diseases. Peptide derivatives are very promising therapeutic
agents for modulating protein-protein associations with sizes and specificities between those of small
compounds and antibodies. For the same reasons, rational design of peptide-based inhibitors naturally
borrows and combines computational methods from both protein-ligand and protein-protein research
fields. In this chapter, we aim to provide an overview of computational tools and approaches used for
identifying and optimizing peptides that target protein-protein interfaces with high affinity and specificity.
We hope that this review will help to implement appropriate in silico strategies for peptide-based drug
design that builds on available information for the systems of interest.
Key words Sequence-based peptide design, Peptide conformation-based methods, Protein-peptide
interface characterization, Peptide hit identification and optimization
1 Introduction
Association and dissociation of proteins are molecular events at the
basis of many crucial cellular processes. Therefore, perturbation of
protein interaction networks generally leads to severe human diseases such as cancer or degenerative diseases. Infectious diseases
also involve interactions between host and pathogen proteins
[1]. Accordingly, protein-protein interactions (PPIs) have become
the target of an increasing number of modulator molecules with
therapeutic perspectives but also as chemical biology tools to study
protein interactions [2]. Notably, one advantage of targeting PPIs
compared with single proteins is to reduce the probability of drug
resistance. Indeed, because protein-protein interfaces are highly complementary, a mutation in one protein would require a second, complementary mutation in its partner to preserve their association,
which is very unlikely [3].
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_11,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
Despite the generally large protein surfaces involved in PPIs,
several small molecules were successfully developed for inhibiting
protein-protein associations [4–6]. Nonetheless, similarly to small
molecules which competitively inhibit enzymes by mimicking
endogenous substrates, peptide derivatives which mimic peptide
segments at protein-protein interfaces should be highly specific
lead compounds. Moreover, many protein-protein associations,
particularly in signaling pathways, are characterized by low affinities
[7]. This offers a lot of room for developing peptidic binders with
higher affinity. For these reasons, peptides are very promising starting points for deriving potent and selective PPI inhibitors [8, 9].
However, peptides exhibit well-known in vivo issues which
have to be circumscribed in drug development projects: They are
easily degraded by proteolytic enzymes, they generally have poor
membrane permeability, and they potentially induce unwanted
immune responses [10]. These drawbacks can be limited by reducing their peptidic nature with various approaches such as cyclization, N-methylation, or incorporation of non-natural amino acids.
These peptide derivatives are expected to keep a potent activity on
targeted protein-protein interactions but with improved pharmacokinetic properties. Nonetheless, the daunting task in designing
peptide derivatives to modulate protein-protein associations
remains to find their optimal sequence of natural or non-natural
amino acids.
To assist medicinal chemists and chemical biologists in this task, many computational approaches have been developed over the past decades. We attempt here to categorize the
various in silico strategies that use these computational tools to
design peptide-based molecules modulating PPIs. Similarly to the
design of small molecules inhibiting enzymes or receptors, strategies to develop therapeutic peptide derivatives can be classified into
ligand-based or structure-based approaches. In the first case, a
peptide segment with interesting bioactivity has been identified
but its targeted protein remains unknown. Peptide derivatives
with improved properties will then be searched by similarity with
the initial peptide segment. In the second case, the structure of the
protein-protein complex is partially or entirely known and the
strategies will consist in finding peptide derivatives having an optimal structural and chemical complementarity with the targeted protein surfaces. Accordingly, the present chapter will be organized
into two parts, the peptide-based and target-based strategies.
In Silico Design of Peptide-Based PPI Inhibitors
2 Peptide-Based Strategies
In ligand-based virtual screening, the search for small molecules
similar to a query compound consists in comparing molecular
descriptors which encode chemical and structural information
into numbers. For example, molecular fingerprints encode in bit
vectors the presence in molecules of particular substructures or
fragments [11], and pharmacophore models capture in three-dimensional arrangements the chemical features of a ligand that
are necessary for its bioactivity [12]. Regarding peptides, chemical
information is encoded in their sequence and conformational information mainly lies in their secondary structures. Accordingly, we
subdivided this section into sequence-based and conformation-based strategies.
2.1 Sequence-Based Approaches
Various properties of a peptide can be inferred by searching for
homologous sequences in relevant databases. This can be routinely
performed by using sequence alignment methods, such as the basic
local alignment search tool (BLAST) [13]. For example, BLAST
searches in the database of essential genes (DEG) [14] or the protein
subcellular localization database (PSORTdb) [15] can rapidly provide useful information regarding the biological function of query
protein sequences to identify novel targets [16]. Likewise, many
databases dedicated to peptides, such as the database of antimicrobial
activity and structure of peptides (DBAAS) [17] or the database of
bioactive peptides (BIOPEP-UWM™) [18], can be mined using
BLAST to obtain information about the biological and therapeutic
activities of query peptide sequences.
However, similarity searches based on purely sequential representation of proteins or peptides can fail in case of low coverage or
low quality sequence alignments. Thus, alternative descriptors of
protein and peptide chemical information can be used, such as the
amino acid composition (AAC) which is simply the occurrence
frequencies of the 20 native amino acids in their sequence, the
dipeptide composition, or the pseudo amino acid composition
(PseAAC) concept which additionally includes some sequence-order information via correlation factors of various chemical properties between (i,i+1), (i,i+2), and (i,i+3) pairs of amino acids
[19]. By comparing these sequence descriptors with those of proteins and peptides in relevant databases, it is again possible to
predict some of the functional, physical, chemical, and structural
features of a query sequence [20].
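As an illustration of the two simplest descriptors mentioned above, the following Python sketch computes the amino acid composition and the dipeptide composition of a sequence. The function names are ours, the input is assumed to contain only the 20 standard one-letter codes (and at least two residues for the dipeptide case), and the PseAAC correlation factors are deliberately not implemented.

```python
from collections import Counter
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(seq: str) -> Dict[str, float]:
    """Occurrence frequency of each of the 20 standard amino acids in `seq`."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq: str) -> Dict[str, float]:
    """Occurrence frequency of each of the 400 ordered (i, i+1) residue pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {a + b: pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)}


# Toy sequence, chosen only to make the numbers easy to check by hand.
aac = amino_acid_composition("HHAA")
dpc = dipeptide_composition("HHAA")
print(aac["H"], aac["A"])  # 0.5 0.5
print(len(dpc))            # 400
```

Both descriptors map sequences of any length onto fixed-size numerical vectors (20 and 400 entries), which is what makes them convenient inputs for similarity searches and machine learning.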
The common thread of sequence-based predictions of peptide
bioactivity is first to build a training set of peptides experimentally validated as having or lacking the desired property, then to choose
peptide chemical descriptors, and finally to train a machine learning
algorithm to determine the chemical features which
separate the desired from the non-desired peptides (Fig. 1). In a second
stage, the trained machine learning algorithm is applied to
unknown data sets of peptides to predict those with desired and
non-desired properties according to the most discriminating features. In the subsections below, we highlight some studies which
applied these sequence-based approaches to predict peptide
bioactivity.
Fig. 1 Schematic description of sequence-based prediction of peptide therapeutic property through machine
learning algorithms
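The two-stage workflow of Fig. 1 can be sketched with a deliberately simple stand-in for the machine learning step. The example below uses amino acid composition as the descriptor and a nearest-centroid classifier written in plain Python; the classifier choice, the function names, and the toy sequences (which are made up and are not real bioactive peptides) are all assumptions for illustration, not the methods of the cited studies.

```python
from collections import Counter
from math import dist
from typing import Dict, List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_vector(seq: str) -> List[float]:
    """Amino acid composition as a 20-dimensional descriptor vector."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def train_centroids(data: Sequence[Tuple[str, str]]) -> Dict[str, List[float]]:
    """Stage 1: average the descriptor vectors of each labeled class."""
    grouped: Dict[str, List[List[float]]] = {}
    for seq, label in data:
        grouped.setdefault(label, []).append(aac_vector(seq))
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}


def predict(centroids: Dict[str, List[float]], seq: str) -> str:
    """Stage 2: assign an unseen peptide to the class with the closest centroid."""
    x = aac_vector(seq)
    return min(centroids, key=lambda label: dist(x, centroids[label]))


# Toy training set with hypothetical labels.
training = [("KKLLKKLL", "desired"), ("KLKLKKLK", "desired"),
            ("DDEEGGSS", "non-desired"), ("EDGSEDGS", "non-desired")]
model = train_centroids(training)
print(predict(model, "KKKLLLKK"))  # desired
print(predict(model, "GSEDDEGS"))  # non-desired
```

In practice, the descriptor and the learning algorithm would be chosen and validated on held-out data, and the quality of the labeled training set largely determines how far such a model can be trusted.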
2.1.1 Prediction of Peptide Therapeutic Property
Many studies applied this general framework to identify peptides
against almost all main pathology classes. Among them, prediction
of anticancer peptides has attracted great interest from several
research groups [21–23]. By building specialized data sets of anti-angiogenic peptides, other predictors were developed for identifying more specific peptides inhibiting angiogenesis as a promising
cancer treatment [24, 25]. A second major class of diseases for
which sequence-based searches for therapeutic peptides have been
reported is infectious diseases. According to the property covered
by the training data sets, several models were developed to predict
antimicrobial [26], antibacterial [27], or antiviral [28] peptides.
To avoid peptide-induced unwanted immune responses, or,
conversely, to design peptide-based vaccines, it can be interesting
to anticipate peptide immunogenicity. To this end,
several studies have combined data sets of binders and non-binders to
the major histocompatibility complex (MHC) proteins with various
machine learning algorithms to predict whether a
peptide is likely to be a T-cell epitope presented on the cell surface
[29, 30]. It is worth noting that the most recent peptide training data
sets were built from the Immune Epitope Database (IEDB) [31],
which also includes biological data about peptides involved in
inflammatory disorders. Thus, by extracting data sets of peptides
which trigger the secretion of inflammatory cytokines, it is possible
to develop predictors of peptides with pro- or anti-inflammatory
properties [32, 33].
Finally, it should be mentioned that these sequence-based
methods can be applied to investigate virtually any peptide property, provided that high-quality training data sets can be built with
validated peptides having and not having the desired property
[34]. They have been widely used, for example, to predict peptide
capability to cross membranes and penetrate into cells [35–38]. Another peptide property which can be investigated by
sequence-based approaches is their capability to bind proteins.
The studies which used such methods to predict protein-peptide
interactions are reviewed in the next subsection.
2.1.2 Prediction of Protein-Peptide Interactions
Three different levels of detail about protein-peptide interactions
can be obtained using sequence-based approaches [39]. The first
one is to identify proteins that bind a query sequence. In that case,
it is first necessary to build training data sets of peptides which bind
proteins and other ones which do not. Then, machine learning
algorithms are trained to discriminate binders from non-binders
by using relevant peptide chemical descriptors. It should be noted
that many variants of this approach were originally developed for
large-scale predictions of protein-protein interactions with a view to deciphering interactomes of various organisms [40–42]. They were also applied to predict interactions between
human and bacteria or virus proteins by using relevant data sets of
host-pathogen protein-protein interactions [43–45].
Besides these genome-scale predictions, several sequence-based
studies have been conducted in order to specifically identify peptide
segments that mediate protein-protein associations. These peptides
can be classified into two types, the short linear motifs (SLiMs)
which generally have fewer than ten residues, a few very conserved
ones, and no particular secondary structures, and the molecular
recognition features (MoRFs) which have a longer sequence and
generally undergo a disorder-to-order transition upon binding.
Accordingly, several sequence-based algorithms were specifically
developed for mining SLiMs [46–48] and other ones for detecting
MoRFs [49–51] in various data sets of protein-protein complexes.
It is worth noting that SLiMs generally bind to specific recognition modules such as SH3 or PDZ domains. Therefore, many
specialized peptide motif predictors were trained and developed
on specific data sets of these prevalent domains [46, 47, 52–54].
The second level of information that can be investigated with
sequence-based approaches is the identification of amino acid residues at protein-peptide interfaces. Just like homology modeling
which aims at predicting protein tertiary structures from sequences
and data sets of experimentally resolved structures, comparative
studies can also be used for determining protein-protein interfaces
by searching homologs of query sequences in data sets of known
protein-protein complexes [55, 56]. Nonetheless, instead of
searching homologous proteins by using sequence alignment
tools, such as BLAST, most of the sequence-based predictions of
interface residues employ numerical vectors encoding key chemical
features of protein sequences and various machine learning algorithms. The latter are generally trained on data sets of interface
residues and non-interface ones which were built from known
protein-peptide complexes. Commonly, interface residues are
defined as those with a solvent accessible surface area (SASA)
which decreases by more than 1 Å² upon binding or as those
which are distant from a protein partner by less than a threshold
parameter.
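The two interface definitions just described can be written down as a short Python sketch. The data layout (a mapping from residue identifier to atom coordinates or to SASA values), the function names, and the 5 Å distance cutoff used in the example are assumptions for illustration; only the 1 Å² SASA criterion comes from the text.

```python
from math import dist
from typing import Dict, List, Set, Tuple

Coord = Tuple[float, float, float]
# Hypothetical layout: residue identifier -> list of atom coordinates (in Å).
Residues = Dict[str, List[Coord]]


def interface_by_distance(protein: Residues, peptide: Residues, cutoff: float) -> Set[str]:
    """Protein residues with at least one atom closer than `cutoff` Å to any peptide atom."""
    peptide_atoms = [xyz for atoms in peptide.values() for xyz in atoms]
    return {res for res, atoms in protein.items()
            if any(dist(a, p) < cutoff for a in atoms for p in peptide_atoms)}


def interface_by_sasa(sasa_free: Dict[str, float], sasa_bound: Dict[str, float],
                      min_loss: float = 1.0) -> Set[str]:
    """Residues whose SASA decreases by more than `min_loss` Å² upon binding."""
    return {res for res in sasa_free if sasa_free[res] - sasa_bound[res] > min_loss}


# Toy coordinates and SASA values, made up for the example.
protein = {"A10": [(0.0, 0.0, 0.0)], "A11": [(10.0, 0.0, 0.0)]}
peptide = {"P1": [(3.0, 0.0, 0.0)]}
print(interface_by_distance(protein, peptide, cutoff=5.0))  # {'A10'}
print(interface_by_sasa({"A10": 50.0, "A11": 30.0},
                        {"A10": 20.0, "A11": 29.5}))          # {'A10'}
```

The two criteria need not return the same residue set, so the definition used to label training data should be reported alongside any predictor built on it.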
It is worth noting that most sequence-based machine learning
approaches were applied to predict residues of a protein which are
likely to be involved in binding any other protein partner [57–59]. Nonetheless, several other studies tackled the problem of
predicting the interface residues on both proteins of specific complexes [60–62], providing precious detailed information on
protein-protein binding modes, especially for transient complexes
and those mediated by SLiMs and MoRFs. Regarding specifically
protein-peptide interfaces, very few studies based on sequences
only were reported in the literature. We found only two
sequence-based predictors of protein-peptide binding sites, namely
SPRINT [63] and SVMpep [64], both using support vector
machines (SVM) for classification of binding and non-binding
residues. It can be noted that, among the input physicochemical
features of amino acid residues, SVMpep includes intrinsic disorder
information predicted by the IUPred web server [65], which seems
to improve the prediction accuracy.
Finally, the third level of information that can be inferred from
sequence-based approaches is the protein-peptide binding affinity. Although machine learning classifiers were developed to
discriminate protein-protein interactions with low or high affinity
[66, 67], quantitative predictions of binding free energies (ΔG)
from sequences are generally performed with machine learning
regression methods. Using training data sets of experimental binding free energies of known protein-protein complexes and various
sequence descriptors, the correlation coefficients obtained between predicted and
experimental affinities are very diverse, ranging from 0.3 to 0.8 with
an average value around 0.6, depending on the selected sequence
features and external data sets used for testing [67–71]. However,
these approaches seem to perform better in predicting changes in
binding free energies (ΔΔG) upon mutations on one of the two
partner sequences [52, 68].
Indeed, thanks to data sets of experimental ΔΔG of mutations
at the interface of protein-protein complexes [72, 73], different
machine learning regression methods yielded correlations between
predicted and experimental changes in binding free energies in the
narrower range of 0.7–0.9 on various tested data sets [52, 68, 74–76]. It should be noted that most of these studies trained their
machine learning algorithms with descriptors extracted from three-dimensional structures of protein-protein complexes. Only two
purely sequence-based predictors have so far been reported in the literature [77, 78]. Moreover, it is important to mention that when
predictors are blind-tested on completely independent data sets of
protein-protein complexes, correlations between predicted
and experimental ΔΔG upon mutations drop significantly to a
range of 0.3–0.6 [74–78], indicating that there is still room for
improving these predictors. Lastly, although these sequence-based
approaches were mainly developed for protein-protein complexes,
some of them have been applied to protein-peptide interactions,
including PDZ-peptide associations [52, 68, 71] or complexes of
MDM2 with p53 MoRF [74, 75]. These studies pave the way for
the investigation and design of peptide sequences with optimal
binding free energies for target proteins.
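The quantities discussed in this subsection can be made concrete with a short Python sketch. It relies on the standard thermodynamic relation ΔG = RT ln(Kd) with a 1 M standard state, takes ΔΔG as mutant minus wild type (a sign convention assumed here, since conventions differ between studies), and assumes that the reported correlations are Pearson coefficients, which the text does not specify. All numbers are toy values.

```python
from math import log, sqrt
from typing import Sequence

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol·K)


def binding_free_energy(kd_molar: float, temperature: float = 298.15) -> float:
    """ΔG = RT ln(Kd / 1 M) in kcal/mol; more negative means tighter binding."""
    return R_KCAL * temperature * log(kd_molar)


def ddg(dg_mutant: float, dg_wild_type: float) -> float:
    """ΔΔG = ΔG(mutant) - ΔG(wild type); positive values indicate weaker binding."""
    return dg_mutant - dg_wild_type


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between predicted and experimental values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy case: a mutation weakening a 1 µM complex to 10 µM.
dg_wt = binding_free_energy(1e-6)
dg_mut = binding_free_energy(1e-5)
print(round(dg_wt, 2), round(ddg(dg_mut, dg_wt), 2))  # -8.19 1.36
```

A tenfold change in Kd thus corresponds to about 1.4 kcal/mol at room temperature, which gives a sense of the accuracy a ΔΔG predictor must reach to rank point mutations reliably.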
2.2 Conformation-Based Approaches
It is now recognized that the sequence of a protein determines its three-dimensional conformational ensemble which, in turn, confers its
biological activity. Therefore, many developments of peptide derivatives have been based on or oriented toward the structural properties of identified bioactive peptides. We highlight here two main
in silico conformation-based approaches to discover or design new
peptide derivatives, the peptide pharmacophore screening and the
stabilization of secondary structure mimics (Fig. 2).
Fig. 2 Schematic description of a peptide pharmacophore screening method (left) and use of molecular simulation
for predicting pre-organized conformations of a constrained peptide (right)
2.2.1 Peptide Pharmacophore-Based Screening
Among the ligand-based approaches in drug discovery, pharmacophore
virtual screening is an efficient and popular computational tool
which can harness the knowledge of peptide conformations. Indeed,
when a peptide segment is known to bind a target, the residue side
chains that are important for binding (hot spots) lend themselves
naturally to the generation of 3D-pharmacophore models. These, in
turn, are used to screen compound libraries and identify new binders
with a similar three-dimensional pharmacophoric arrangement. It
should be mentioned that, although such drug developments are
centered around a known peptide ligand, they often require the
knowledge of its three-dimensional structure when bound to its
target, so that these approaches are not purely ligand-based. This
method was employed to discover inhibitors of several
protein-protein complexes [79–81], notably ones involved in
host-pathogen interactions [82–84]. Nonetheless, it should be noted
that, after defining the peptide-based pharmacophore models, these
studies often screened libraries of commercially available small
compounds, which generally leads to hits that are far from being
peptides.
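The matching step of such a screening can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular screening program: a pharmacophore is assumed to be a list of typed 3D points, and a compound matches if a subset of its features has the same types and the same pairwise distances within a tolerance. The feature names, coordinates, and the 1 Å tolerance are hypothetical.

```python
import math
from itertools import permutations

def matches_pharmacophore(model, candidate, tolerance=1.0):
    """Return True if `candidate` contains a feature subset matching `model`.

    model, candidate: lists of (feature_type, (x, y, z)) tuples.
    A match assigns each model feature to a distinct candidate feature of
    the same type so that all pairwise distances agree within `tolerance`
    (same length unit as the coordinates, here assumed to be Å).
    """
    n = len(model)
    for assignment in permutations(range(len(candidate)), n):
        # Feature types must agree for every assigned pair.
        if any(model[i][0] != candidate[j][0]
               for i, j in enumerate(assignment)):
            continue
        ok = True
        for a in range(n):
            for b in range(a + 1, n):
                d_model = math.dist(model[a][1], model[b][1])
                d_cand = math.dist(candidate[assignment[a]][1],
                                   candidate[assignment[b]][1])
                if abs(d_model - d_cand) > tolerance:
                    ok = False
                    break
            if not ok:
                break
        if ok:
            return True
    return False

# Hypothetical three-point model derived from peptide hot spots
model = [("hydrophobic", (0, 0, 0)), ("donor", (4, 0, 0)),
         ("aromatic", (0, 3, 0))]
# Same arrangement translated in space, plus one unrelated feature
hit = [("acceptor", (9, 9, 9)), ("aromatic", (10, 13, 5)),
       ("hydrophobic", (10, 10, 5)), ("donor", (14, 10, 5))]
# Donor placed too far away from the other two features
miss = [("aromatic", (10, 13, 5)), ("hydrophobic", (10, 10, 5)),
        ("donor", (20, 10, 5))]
```

The brute-force search over assignments is only practical for a handful of features, and a purely distance-based comparison cannot distinguish mirror-image arrangements.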
2.2.2 Constrained Secondary Structure Mimics
Since many protein-protein interactions are mediated by peptide
segments which are structured into α-helices, β-strands, or turns, a
promising drug design strategy is to stabilize or constrain the
unbound peptide in these common secondary structures to
minimize the entropy cost of binding and improve the affinity
[85]. This can be achieved with two main approaches: peptide
cyclization or backbone stiffening. The first approach
includes α-helix stapling, which consists in linking the side
chains of two residues located on the same side of an α-helix with,
for example, hydrocarbon, lactam, or triazole staples [86, 87], and
β-sheet closure, which consists in linking the two proximate
residues at the extremities of a pair of β-strands, using hairpin
loops or β-turn mimics [88, 89]. On the other hand, the backbone
stiffening approach generally consists in inserting a chemical modification into the peptide backbone, such as disubstitution of the α-carbon [90] or substitution of the amide nitrogen [91, 92], in
order to restrict its accessible conformational space.
In both of these strategies, a particularly helpful computational
tool to assist the design of constrained peptides is
molecular dynamics (MD) simulation. This technique numerically
solves Newton's equations of motion for a system of particles
whose interactions are described by empirical potential functions
usually referred to as force fields. When their timescales are sufficiently long, MD simulations can efficiently sample peptide
conformational ensembles and correctly predict their propensity
to form secondary structures [93–95]. It should be noted that
enhanced sampling techniques, such as replica exchange molecular
dynamics [96] or metadynamics simulations [97], can also be used
to generate more exhaustive conformational ensembles, especially
for constrained cyclic peptides. Hence, more and more peptide
derivative developments include MD studies to anticipate the
impact of chemical modifications on the stabilization of secondary
structures, as briefly presented in the following paragraphs.
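To make the idea of numerically solving Newton's equations concrete, here is a minimal sketch of the velocity Verlet scheme, one commonly used MD integrator, applied to a single particle in one dimension. The harmonic force is only a stand-in for a real force field, and the force constant, mass, and time step are arbitrary assumptions in reduced units; production MD packages are far more elaborate.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate m * d2x/dt2 = force(x) for one particle in 1D.

    Returns the list of (position, velocity) pairs, including the start.
    """
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory

K_SPRING = 1.0  # hypothetical force constant (reduced units)
MASS = 1.0      # hypothetical mass (reduced units)

def harmonic_force(x):
    """Force derived from the potential 0.5 * K_SPRING * x**2."""
    return -K_SPRING * x

def total_energy(x, v):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * MASS * v * v + 0.5 * K_SPRING * x * x

traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                       mass=MASS, dt=0.01, n_steps=1000)
```

Comparing total_energy at the first and last points of traj is a quick sanity check: with a sufficiently small time step the total energy should stay nearly constant.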
Regarding stapled helices, molecular simulations generally confirmed that they have a more restricted conformational space than
their non-stapled counterparts, but that they still retain a high degree of
conformational flexibility [98–100]. Importantly, these studies
demonstrated that staples do not necessarily increase the helical
propensity (or helicity) of stapled peptides, which seems to result
from a fine balance between peptide sequences and position,
length, and chemical nature of the staple [100–102]. Also, MD
studies of stapled helices in free and bound states emphasized the
point that high helicity of stapled peptides does not necessarily
correlate with high binding affinity [98, 100, 101]. This could be
due to the peptides’ need for sufficient flexibility to adjust their
structure in the partner binding site and/or to the fact that staples
participate in and therefore modulate their binding [103].
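A simple way to quantify helicity from a simulation is sketched below: the fraction of residues whose backbone dihedral angles fall in a helical region of the Ramachandran map, averaged over frames. This is only one possible definition, and the angular windows used here are illustrative assumptions; published analyses often rely on secondary structure assignment programs instead.

```python
def is_helical(phi, psi, phi_range=(-100.0, -30.0), psi_range=(-80.0, -5.0)):
    """True if (phi, psi), in degrees, lies in the assumed helical window."""
    return (phi_range[0] <= phi <= phi_range[1]
            and psi_range[0] <= psi <= psi_range[1])

def helicity(frames):
    """Fraction of helical residues, pooled over all frames.

    frames: list of frames; each frame is a list of (phi, psi) pairs in
    degrees, one pair per residue.
    """
    total = 0
    helical = 0
    for frame in frames:
        for phi, psi in frame:
            total += 1
            helical += is_helical(phi, psi)
    if total == 0:
        raise ValueError("no dihedral angles supplied")
    return helical / total

# Two hypothetical frames of a three-residue peptide
frames = [
    [(-60.0, -45.0), (-65.0, -40.0), (-120.0, 130.0)],
    [(-58.0, -47.0), (-130.0, 140.0), (-125.0, 135.0)],
]
fraction_helical = helicity(frames)
```

Applying the same function to trajectories of the stapled and non-stapled peptide gives a direct comparison of their helical propensities.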
Enhanced-sampling molecular simulations of cyclic peptides mimicking
turns or β-structures also showed that cyclization certainly reduces
the heterogeneity of their conformations, but still allows a significant amount of flexibility [104–107]. Notably, cyclic backbones
can still sample multiple conformational states, from compact to
elongated structures, and a major question raised in these studies is
whether, among them, there is a pre-organized one close to a
bioactive conformation [105, 108–111]. If such a bioactive
conformation can be identified within the peptide conformational
ensemble, then additional chemical modifications of the peptide
backbone, such as α-carbon disubstitution or N-methylation, can
be introduced to shift the conformational equilibrium in its
favor. Here again, enhanced MD simulations can help to rationalize
and optimize the impact of these modifications on the conformational space of peptide derivatives [92, 110, 112, 113]. Altogether, these
peptide conformation-based studies can guide chemists away from
less promising modulators and thereby limit the cost of long synthesis
campaigns.
3 Target-Based Strategies
Target-based approaches for designing peptides that modulate
protein-protein interactions require structural information about the studied complexes. In many cases, these data are
difficult to obtain experimentally due to technical limitations but
also due to the low affinity and/or the transient character of many
protein-protein associations [114]. In that context, several computational tools have been developed over the last decades to gain
better insight into the structural determinants of these interactions.
In this section, we first describe different in silico approaches to
investigate the interface of protein-peptide complexes by collecting
data about cavities and hot spots or by performing protein-peptide
docking. Next, we show how these tools and information can
help the rational design of regulatory peptides by finding minimal
recognition motifs or by virtual screening of peptide libraries. We
also discuss the optimization methods used to enhance the affinity and
specificity of these compounds.
3.1 Structural Characterization of Protein-Peptide Interfaces
When a protein-peptide binding mode is unknown but the tertiary
structure of the unbound protein is resolved, it is possible to
anticipate the ligand binding sites on the protein by predicting its
cavities and/or the few amino acids that predominantly contribute
to the binding free energy (hot spots). It is also possible to model
the three-dimensional structure of the complex with protein-peptide
docking techniques (Fig. 3).
3.1.1 Cavity Detection
In classical structure-based drug design, an exploration of druggable cavities on a protein surface is generally performed prior to
chemical library virtual screening or fragment-based design
approaches [115]. This can be done with the web servers CASTp
[116] or FPOCKET [117], for example. However, as far as we
know, these algorithms have mostly been applied to detect protein cavities for small ligands but not peptide binding sites, which are
wider and more difficult to identify.
Fig. 3 Computational approaches that can provide structural information about a protein-peptide interface.
Computational alanine-scanning is one method to identify hot spots from the three-dimensional structure of
protein-peptide complexes
We found in the literature only one study which investigated the
binding pocket of a protein involved in protein-peptide interactions [118]. Using accelerated molecular dynamics simulations and
a pocket identification method called VISM-CFA [119], the
authors characterized the dynamic behavior of the Bad peptide
binding site on the Bcl-xL protein. They showed that the binding
pocket of the unbound protein is often in a non-druggable closed
state with a volume below 100 Å³. Nevertheless, they could also
identify minor conformations of apo Bcl-xL (10%) with a more
open binding pocket which could accommodate the Bad peptide or
small ligands [118]. This study reminds us that detection of druggable pockets on an unbound protein should preferentially be
performed on its conformational ensemble rather than on a single
structure.
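The point about ensembles can be illustrated with a short sketch. Given one pocket volume per conformation, it reports the fraction of conformations in which the pocket is open, using the 100 Å³ limit quoted above as the threshold. The volumes listed are invented for the example and the function is hypothetical; it is not part of VISM-CFA or any other pocket detection program.

```python
def open_pocket_fraction(volumes, threshold=100.0):
    """Fraction of conformations whose pocket volume exceeds `threshold`.

    volumes: pocket volumes in Å^3, one value per conformation.
    """
    if not volumes:
        raise ValueError("empty conformational ensemble")
    return sum(v > threshold for v in volumes) / len(volumes)

# Hypothetical per-frame pocket volumes (Å^3) from an MD ensemble
volumes = [62.0, 80.5, 45.1, 95.0, 130.2, 71.3, 88.8, 55.0, 99.9, 76.4]

fraction_open = open_pocket_fraction(volumes)
single_structure_is_open = volumes[0] > 100.0
```

In this toy ensemble, an analysis restricted to the first structure would classify the pocket as closed, whereas the ensemble reveals a minor open population.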
3.1.2 Hot Spot Identification
As mentioned in the sequence-based section, hot spots of protein-protein complexes can be determined with machine learning algorithms trained on data sets of interface and non-interface residues.
The sequence descriptors used in those cases are generally intrinsic
properties (polarity, hydrophilicity, hydrophobicity, etc.) of protein
amino acids. However, the accuracy of these predictors can be
greatly improved by including structural properties such as residue
solvent accessible surface areas or inter-residue distances in known
tertiary and quaternary protein structures [120]. Thus, several
high-performance hot spot predictors using protein three-dimensional structures have been developed and successfully
applied to many protein-protein complexes [121–125].
Alternative physics-based or energy-based methods were also
developed to identify hot spots on protein surfaces. Most of them
use fragment-based approaches which aim at determining the preferential binding sites of small organic probes on known structures
of proteins. This can be achieved by running docking calculations
of small compounds into target cavities with classical protein-ligand
docking programs, such as GOLD [126] or AutoDock Vina [127], as
demonstrated in the study by Wang et al. of human activin receptor
hot spots [128]. Another possibility to explore fragment binding
sites on proteins is to use molecular simulations, such as the grand
canonical ensemble Monte Carlo simulations employed by Kulp III
et al. to identify hot spots on various proteins, including lysozyme,
RecA, HIV protease, dihydrofolate reductase, elastase, MDM2,
and peptide deformylase [129, 130]. Importantly, these studies
indicate that hot spots are more correctly predicted by locating
high affinity binding sites for organic fragments which are also
low affinity binding sites for water molecules.
Experimentally, protein hot spots can be identified by
alanine-scanning mutagenesis [131]. In the same spirit,
they can be predicted by computational alanine-scanning
(CAS). Starting from a known quaternary structure of a protein-peptide
complex, the technique consists in estimating the binding free
energy change (ΔΔG) upon mutation of interface residues
into alanine. Mutations that significantly impair the protein-peptide binding energy identify the hot spots. This general scheme
has been implemented in several molecular modeling software
packages, such as Rosetta (Flex_ddG) [132] or BUDE (BudeAlaScan) [133]. The main difference between these programs lies in the
methods used to compute binding free energies, which can be fast
empirical energy functions, MM/PBSA calculations, or thermodynamic integration [134]. Thanks to its speed and low cost,
computational alanine-scanning has been applied to identify hot spots
of many protein-protein interactions [135–139], including, recently,
the binding of the SARS-CoV-2 spike glycoprotein to the host ACE2
receptor.
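The logic of a computational alanine scan can be sketched as below. The binding energy function is passed in as a black box, since the programs differ precisely in how they compute it. The toy additive energy model, the residue positions, and the 1 kcal/mol hot spot threshold are all assumptions made for the example; none of this reproduces the Rosetta or BUDE interfaces.

```python
def alanine_scan(interface_residues, binding_energy, threshold=1.0):
    """Estimate ΔΔG = ΔG(mutant) - ΔG(wild type) for each Ala mutation.

    interface_residues: dict mapping position -> one-letter amino acid.
    binding_energy: callable taking such a dict, returning ΔG (kcal/mol).
    Returns (ddg_by_position, sorted list of hot spot positions).
    """
    dg_wild_type = binding_energy(interface_residues)
    ddg = {}
    for position, amino_acid in interface_residues.items():
        if amino_acid == "A":        # already alanine, nothing to mutate
            continue
        mutant = dict(interface_residues)
        mutant[position] = "A"
        ddg[position] = binding_energy(mutant) - dg_wild_type
    hot_spots = sorted(p for p, value in ddg.items() if value >= threshold)
    return ddg, hot_spots

# Toy additive energy model (kcal/mol per residue, hypothetical values)
TOY_CONTRIBUTION = {"W": -3.0, "F": -2.5, "L": -1.5,
                    "K": -0.4, "S": -0.3, "A": -0.2}

def toy_binding_energy(residues):
    """Sum of per-residue contributions; stands in for a real method."""
    return sum(TOY_CONTRIBUTION[aa] for aa in residues.values())

interface = {3: "F", 6: "K", 7: "W", 10: "L", 11: "S"}
ddg, hot_spots = alanine_scan(interface, toy_binding_energy)
```

With a real scoring method substituted for toy_binding_energy, the same loop applies unchanged; only the cost per mutation differs.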
3.1.3 Protein-Peptide Docking
Knowledge of the three-dimensional structure of a targeted
protein-protein complex is invaluable for structure-based design of PPI inhibitors. When only the tertiary structures of
the two unbound partners are known, protein-protein or protein-peptide docking are the main computational tools to generate
structural models of their binding mode. The first protein-protein
docking programs commonly considered proteins as rigid bodies.
They generally consist of two or three steps. First, the shape
complementarity between the two protein structures is optimized
[140, 141]. Then, the obtained quaternary structures are re-scored
by taking into account physical criteria such as electrostatic and van der
Waals interactions or desolvation energies [142, 143]. Frequently,
these two steps are performed simultaneously. Generally, they are
followed by a third step consisting of molecular dynamics simulations to allow local relaxation of the protein-protein interface.
Naturally, rigid docking methods are not appropriate for highly
flexible proteins, especially for those which bind their partner
through peptide segments such as SLiMs or MoRFs. In these
cases, protein-peptide docking programs should be preferred
since they take into account the peptide flexibility at an early
stage. As for hot spot predictions, one can distinguish knowledge-based from physics-based protein-peptide docking. In knowledge-based approaches, also called template-based docking, the protein
structure and peptide sequences are first used to search for homologous protein-peptide complexes in databases of experimentally
resolved quaternary structures. Then, similarly to homology modeling, protein structure alignment and peptide sequence alignment
are used to generate models of the protein-peptide binding mode.
In this type of docking, the peptide backbone flexibility is accounted
for by the different homologous peptide structures found
in the database. Most often, model building is followed by an
energy-based optimization to allow further structural flexibility,
such as in GalaxyPepDock [144], HDOCK [145], or
InterPep2 [146].
The physics-based methods for flexible peptide docking can be
subdivided into three different approaches: ensemble docking, ab
initio docking, and fragment-based docking. In ensemble docking,
the unbound peptide conformations are pre-sampled and the representative structures are rigidly docked into the protein. PepATTRACT [147], MdockPep [148], PIPER-FlexPepDock [149], or
HPEPDOCK [150] can be classified as ensemble docking methods. In ab initio approaches, the peptide conformations are sampled
on-the-fly during the docking process, using mainly molecular
simulations as in FlexPepDock [151], AnchorDock [152], or
CABS-dock [153]. In fragment-based methods, the peptide is cut
into shorter fragments which are docked onto the protein. Then, the best binding modes of the individual fragments are linked
to generate the binding mode of the initial peptide. DINC [154]
and IDP-LZerD [155] belong to this type of protein-peptide
docking.
3.2 Identification and Optimization of Peptide Hits
In drug design, hit identification is the process of finding
compounds which bind a target and modify its activity. In this
subsection, we describe computational methods to identify peptide
hits modulating protein-protein interactions. Peptide hits can be
derived mainly from minimal recognition motifs at structurally
known protein-protein interfaces or by (structure-based) virtual
screening of peptide libraries (Fig. 4).
Fig. 4 Computational approaches used to identify a peptide hit and to optimize its sequence for higher affinity
and selectivity
3.2.1 Derivation of Minimal Recognition Motifs
At many protein-protein interfaces, even those involving globular
proteins, a short peptide segment predominantly contributes to the
binding energy and is required to stabilize the complex
[156]. Finding this hot segment, also called minimal recognition
motif or self-inhibitory peptide, is often a good starting point for
developing potent protein-protein inhibitors [157]. When the quaternary structure of a complex is known, several computational
tools can help researchers to derive these minimal recognition
motifs.
The first approach consists in identifying the hot spots of a
complex and then in extracting the shortest peptide segment
which contains as many hot spots as possible [158, 159]. Generally,
the binding energies of these hot segments with their targets are
subsequently estimated by using docking calculations or molecular
simulations and compared to the initial protein-protein interactions
to support the minimal recognition motif design [156, 158–
160]. Another example of this approach was reported in two
different studies of the same target, the Hsp90 dimer. The identification by computational alanine-scanning of four hot spots on the
protein C-terminal α-helix served as a basis for designing several
peptide-based inhibitors of Hsp90 [161, 162].
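One simple reading of "the shortest segment containing as many hot spots as possible" is sketched below, under the assumption that the user fixes a maximum segment length. Among all segments within that length, the function keeps the one covering the most hot spots, preferring the shorter and then the earlier one in case of ties. The residue numbers are hypothetical, and the cited studies may have applied different criteria.

```python
def shortest_hot_segment(hot_spots, max_length):
    """Return (first, last) residue numbers of the selected segment.

    hot_spots: residue numbers of predicted hot spots on one chain.
    max_length: maximum number of residues allowed in the segment.
    """
    positions = sorted(set(hot_spots))
    if not positions or max_length < 1:
        raise ValueError("need at least one hot spot and max_length >= 1")
    best_key, best_segment = None, None
    for i, start in enumerate(positions):
        # Extend the segment to the last hot spot still within max_length.
        j = i
        while (j + 1 < len(positions)
               and positions[j + 1] - start + 1 <= max_length):
            j += 1
        end = positions[j]
        # More hot spots first, then shorter segment, then earlier start.
        key = (j - i + 1, -(end - start + 1), -start)
        if best_key is None or key > best_key:
            best_key, best_segment = key, (start, end)
    return best_segment

# Hypothetical hot spot positions on one partner of a complex
segment = shortest_hot_segment([12, 15, 16, 19, 40], max_length=8)
```

Because candidate segments always start and end on a hot spot, no residue is included at either terminus unless it contributes to the count.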
In the previous approach, the first and last residues of the
minimal recognition motif still have to be chosen by the researchers, and validation of these choices by computing binding energies
can be quite tedious. Thus, a systematic method called Rosetta
Peptiderive has been developed to automatically identify hot segments from the three-dimensional structure of a given protein-protein complex [163]. In this algorithm, a sliding window of
user-defined size runs along one protein sequence and, at each
position, isolates a peptide segment whose binding energy with
the protein partner is computed using the Rosetta energy function
[156]. Peptides which contribute the most to the protein-protein
interaction are selected as hot segments. Peptiderive was made
available to the scientific community through a web server [163]
and allowed several groups to rapidly design, from the identified self-inhibitory peptides, inhibitors of various protein-protein
interactions [160, 164–166]. It should be noted that, if the hot
segments found have the appropriate geometry, then Peptiderive
can automatically derive cyclic peptides by mutating their terminal
residues into cysteine and linking them by a disulfide bond [163].
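The sliding-window idea described above can be sketched in a few lines. This is a schematic illustration only and not the Peptiderive implementation: the energy function is a hypothetical callable standing in for an interface energy calculation, and the sequence and per-residue weights are arbitrary.

```python
def hot_segments(sequence, window, segment_energy, top=3):
    """Score every window of `sequence` and return the best `top` segments.

    segment_energy: callable (start_index, segment) -> energy, where a
    lower energy means a larger contribution to binding.
    Returns a list of (energy, start_index, segment), best first.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and len(sequence)")
    scored = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        scored.append((segment_energy(start, segment), start, segment))
    scored.sort()
    return scored[:top]

# Toy per-residue weights (hypothetical, arbitrary units)
TOY_WEIGHT = {"W": -3.0, "F": -2.5, "L": -1.5}

def toy_segment_energy(start, segment):
    """Additive stand-in for an interface energy; ignores `start`."""
    return sum(TOY_WEIGHT.get(aa, 0.0) for aa in segment)

sequence = "GSAKFSDLWKLLPEGSA"   # arbitrary example sequence
ranked = hot_segments(sequence, window=8, segment_energy=toy_segment_energy)
best_energy, best_start, best_segment = ranked[0]
```

In a real application, segment_energy would evaluate the isolated segment in the context of the partner protein structure, which is where nearly all of the computing time is spent.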
3.2.2 Structure-Based Virtual Screening
When the tertiary structure of a protein is known, one major
strategy for drug discovery consists in docking millions of compounds from various chemical libraries into identified target cavities. Structure-based virtual screening has been applied to search
for inhibitors of various protein-protein interactions, but mainly
within libraries of small organic molecules [167–169]. Probably
because docking peptides requires more computational resources
than docking small compounds, few papers have reported the discovery of
protein-protein inhibitors by structure-based screening of
peptide libraries.
Nevertheless, with the continuous increase of computing
power, recent studies using peptide screening have been reported
in the literature. In these studies, libraries of natural peptides
extracted from food were docked into angiotensin-converting
enzymes [170, 171] or xanthine oxidase [172] to identify potent
peptide-based inhibitors of these proteins. Nonetheless, it should
be mentioned that the libraries used were mainly composed of very
short tri- or tetrapeptides, limiting the possibility of discovering peptides long enough to competitively inhibit large protein-protein
interfaces.
In this respect, it is worth mentioning that several computational tools can boost virtual screening of peptides by facilitating
the generation of libraries of various peptides. The Robetta server,
for example, can be used to easily generate libraries of helical, loop,
or extended peptides [173]. Another example is the program
CycloPs which can simply generate large and diverse libraries of
cyclic peptides from natural and commercially available non-natural
amino acids [174]. However, as far as we know, no structure-based
virtual screening of CycloPs libraries has been reported in the
literature so far. This could again be due to the computationally
demanding calculations required for reliably docking several
thousands of peptides with more than five residues.
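The combinatorial growth behind this computational cost is easy to make explicit. The sketch below enumerates all linear peptides of a given length built from the 20 natural amino acids and counts them. It is only meant to show the size of the sequence space and does not reflect how Robetta or CycloPs build their libraries.

```python
from itertools import product

# The 20 natural amino acids in one-letter code
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def library_size(length, alphabet_size=len(AMINO_ACIDS)):
    """Number of distinct linear sequences of the given length."""
    return alphabet_size ** length

def enumerate_peptides(length, alphabet=AMINO_ACIDS):
    """Lazily yield every linear peptide sequence of the given length."""
    for letters in product(alphabet, repeat=length):
        yield "".join(letters)

tripeptides = list(enumerate_peptides(3))
sizes = {n: library_size(n) for n in range(3, 7)}
```

An exhaustive tripeptide library contains 8000 sequences, whereas six residues already give 64 million, before any non-natural amino acid or cyclization is considered.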
3.2.3 Improving Peptide Affinity and Selectivity by Sequence Optimization
After having found a peptide hit, it is generally worthwhile to
increase its affinity for its target to improve its inhibition potency.
Moreover, selectivity of therapeutic compounds is an important
requirement in drug development to lower the risk of
off-targeting. In the case of peptide-based inhibitors, computational tools can help to improve the affinity and selectivity of
identified peptide hits by optimizing their sequence. The guiding
principle of this hit-to-lead process is similar to that used in protein
redesign to improve stability [175], since the physical forces
that drive protein folding also drive protein-protein and protein-peptide binding.
In favorable cases where the protein-peptide quaternary structure is known, redesign techniques generally consist in exploring
the sequence space of the fixed-backbone peptide and finding the
sequences which minimize an energy score. This can be the binding free
energy variation (ΔΔG) relative to the initial peptide sequence for
affinity improvement, or the difference between binding free energies of the same sequence for two different protein partners for
selectivity enhancement. In essence, these approaches are similar to
the computational alanine-scanning technique, except that each
residue of the redesigned peptide can be mutated into all possible
amino acids.
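The two scores described above can be written out explicitly. The notation is introduced here for illustration only: s_0 is the initial peptide sequence, s a candidate sequence, P the target protein, and P' a competing partner to be avoided.

```latex
\begin{align}
  \Delta\Delta G_{\mathrm{aff}}(s) &=
    \Delta G_{\mathrm{bind}}(s, P) - \Delta G_{\mathrm{bind}}(s_0, P) \\
  \Delta\Delta G_{\mathrm{sel}}(s) &=
    \Delta G_{\mathrm{bind}}(s, P) - \Delta G_{\mathrm{bind}}(s, P')
\end{align}
```

With this sign convention, a negative affinity score means that s binds P more tightly than the initial sequence, and a negative selectivity score means that s prefers P over P'.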
Rosetta [176] is probably the most widely used software to design or
redesign proteins, peptides, and their associations, but several other
programs can be used to perform these tasks, including K* [177],
ORBIT [178], Proteus [179], or dTERMen [180]. These programs exploit different algorithms to explore the protein and peptide sequence space, such as minimization methods, genetic
algorithms, or Monte Carlo sampling. They also differ in their
energy functions which combine to varying degrees ingredients of
physics-based all-atom force fields, implicit solvation models, and
knowledge-based potentials derived from protein complex structures [181, 182]. Interestingly, dTERMen uses a scoring function
derived from statistical potentials between tertiary structural motifs
(TERMs) frequently observed in protein three-dimensional structures [183]. Since these TERMs have characteristic sequence preferences [184], the structure-based interactions are converted into
sequence-based scoring functions which are extremely fast to evaluate, making it possible to exhaustively explore the sequence spaces of long
peptides and proteins [180].
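As an illustration of Monte Carlo sampling in sequence space, the sketch below performs Metropolis moves that mutate one position at a time on a fixed backbone. It is a generic textbook scheme and not the algorithm of any of the programs cited above. The energy function is a hypothetical stand-in (here, the number of mismatches to an arbitrary reference sequence), and the temperature factor kT, step count, and seed are arbitrary choices.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def monte_carlo_design(start, energy, n_steps=2000, kT=0.6, seed=0):
    """Metropolis Monte Carlo search over sequences of fixed length.

    energy: callable sequence -> score, lower is better.
    Returns (best_sequence, best_energy) seen during the run.
    """
    rng = random.Random(seed)
    current = list(start)
    e_current = energy("".join(current))
    best, e_best = "".join(current), e_current
    for _ in range(n_steps):
        position = rng.randrange(len(current))
        old = current[position]
        current[position] = rng.choice(AMINO_ACIDS)   # trial point mutation
        e_new = energy("".join(current))
        delta = e_new - e_current
        # Accept downhill moves always, uphill moves with Boltzmann weight.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            e_current = e_new
            if e_current < e_best:
                best, e_best = "".join(current), e_current
        else:
            current[position] = old                    # reject the move
    return best, e_best

REFERENCE = "LWKLL"   # arbitrary sequence defining the toy energy minimum

def toy_energy(sequence):
    """Number of positions differing from REFERENCE (toy score)."""
    return sum(a != b for a, b in zip(sequence, REFERENCE))

best_sequence, best_energy = monte_carlo_design("AAAAA", toy_energy)
```

Replacing toy_energy by an affinity or selectivity score turns the same loop into a simple sequence optimizer; the acceptance of occasional uphill moves is what lets the search escape local minima.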
Many applications of computational protein-peptide interface
redesign have been reported in the literature and subsequent experimental validations of their predictions highlight the reliability of
these approaches to improve the affinity and selectivity of peptides
for their target proteins. Among the success stories, highly selective
peptides were computationally designed against bZIP proteins
[185, 186], PDZ domains [177, 187, 188], amyloid fibrils [189],
the cytokine TNFα [190, 191], and several anti-apoptotic proteins
of the Bcl-2 family [180, 192, 193]. Interestingly, two of the
previously cited studies redesigned peptide inhibitors with
D-amino acids [189, 191], paving the way for the development
of peptide-based drugs with high affinity, selectivity, and metabolic
stability.
4 Conclusions
In this review, we classified the computational tools and strategies
for designing peptide-based inhibitors of PPIs into the two conventional ligand-based and structure-based categories. Nevertheless, the border between the two classes is becoming more and more
porous, and several peptide developments have combined both
approaches. For example, sequence-based predictions of hot spots
at protein-peptide interfaces by machine learning algorithms are
more accurate when molecular descriptors include structural information, such as solvent accessible surface areas or inter-residue
distances. Hybrid approaches will probably become more frequent
in the near future.
In both categories, the main challenge remains to determine
the optimal peptide sequences which bind a protein target with the
best affinity and selectivity. This requires computing binding free
energies, and their relation to sequences, structures, and dynamics,
as accurately as possible. Notably, since peptides
generally have more degrees of freedom than small organic
compounds, this objective calls for correctly characterizing their
conformational ensemble to quantitatively estimate the entropy
cost of association, especially for peptides which undergo a
disorder-to-order transition upon binding.
Lastly, from the perspective of drug design, it remains crucial to
reduce the peptidic nature of the identified peptide hits in order to increase their stability against proteolytic enzymes (without decreasing
their affinity and selectivity). This can be achieved by introducing
non-natural amino acids, such as D-amino acids, peptoids, or
chemically modified side chains, in the early stages of the
development of PPI peptide-based inhibitors. Their membrane
permeability is also an important property which is worth investigating as early as possible in order to maximize the chances of
success in clinical trials.
References
1. Ryan DP, Matthews JM (2005) Protein-protein interactions in human disease. Curr
Opin Struct Biol 15:441–446
2. Milroy L-G, Grossmann TN, Hennig S,
Brunsveld L, Ottmann C (2014) Modulators
of protein–protein interactions. Chem Rev
114:4695–4748
3. Archakov AI, Govorun VM, Dubanov AV,
Ivanov YD, Veselovsky AV, Lewi P, Janssen P
(2003) Protein-protein interactions as a target
for drugs in proteomics. Proteomics 3:
380–391
4. Sheng C, Dong G, Miao Z, Zhang W, Wang
W (2015) State-of-the-art strategies for targeting protein–protein interactions by small-molecule inhibitors. Chem Soc Rev 44:
8238–8259
5. Modell AE, Blosser SL, Arora PS (2016) Systematic targeting of protein–protein interactions. Trends Pharmacolog Sci 37:702–713
6. Wichapong K, Poelman H, Ercig B,
Hrdinova J, Liu X, Lutgens E, Nicolaes GA
(2019) Rational modulator design by exploitation of protein–protein complex structures.
Future Med Chem 11:1015–1033
7. Yugandhar K, Gromiha MM (2016) Analysis
of protein-protein interaction networks based
on binding affinity. Current Protein Peptide
Sci 17:72–81
8. Nevola L, Giralt E (2015) Modulating protein–protein interactions: the potential of
peptides. Chem Commun 51:3302–3315
9. Cunningham AD, Qvit N, Mochly-Rosen D
(2017) Peptides and peptidomimetics as regulators of protein–protein interactions. Current Opin Struct Biol 44:59–66
10. Fosgerau K, Hoffmann T (2015) Peptide
therapeutics: current status and future directions. Drug Discovery Today 20:122–128
11. Cereto-Massagué A, Ojeda MJ, Valls C,
Mulero M, Garcia-Vallvé S, Pujadas G
(2015) Molecular fingerprint similarity search
in virtual screening. Methods 71:58–63
12. Kaserer T, Beck K, Akram M, Odermatt A,
Schuster D (2015) Pharmacophore models
and pharmacophore-based virtual screening:
concepts and applications exemplified on
hydroxysteroid dehydrogenases. Molecules
20:22799–22832
13. Altschul SF, Gish W, Miller W, Myers EW,
Lipman DJ (1990) Basic local alignment
search tool. J Mol Biol 215:403–410
14. Zhang R, Ou H-Y, Zhang C-T (2004) DEG:
a database of essential genes. Nucleic Acids
Res 32:D271–D272
15. Rey S, Acab M, Gardy JL, Laird MR,
deFays K, Lambert C, Brinkman FSL (2005)
PSORTdb: a protein subcellular localization
database for bacteria. Nucleic Acids Res 33:
D164–D168
16. Gawade P, Ghosh P (2018) Genomics driven
approach for identification of novel therapeutic targets in Salmonella enterica. Gene 668:
211–220
17. Pirtskhalava M, Gabrielian A, Cruz P, Griggs
HL, Squires RB, Hurt DE, Grigolava M,
Chubinidze M, Gogoladze G, Vishnepolsky B,
Alekseev V, Rosenthal A, Tartakovsky M (2016) DBAASP v.2: an enhanced
database of structure and antimicrobial/cytotoxic activity of natural and synthetic peptides.
Nucleic Acids Res 44:D1104–D1112
18. Minkiewicz P, Iwaniak A, Darewicz M (2019)
BIOPEP-UWM database of bioactive peptides: current opportunities. Int J Mol Sci
20:5978
19. Chou K-C (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct Funct Genet 43:
246–255
20. Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ
(2011) Update of PROFEAT: a web server
for computing structural and physicochemical
features of proteins and peptides from amino
acid sequence. Nucleic Acids Res 39:
W385–W390
21. Chen W, Ding H, Feng P, Lin H, Chou K-C
(2016) iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 7:
16895–16909
22. Xu L, Liang G, Wang L, Liao C (2018) A
Novel hybrid sequence-based model for identifying anticancer peptides. Genes 9:158
23. Wei L, Zhou C, Chen H, Song J, Su R (2018)
ACPred-FL: a sequence-based predictor
using effective feature representation to
improve the prediction of anti-cancer peptides. Bioinformatics 34:4007–4016
24. Blanco JL, Porto-Pazos AB, Pazos A,
Fernandez-Lozano C (2018) Prediction of
high anti-angiogenic activity peptides in silico
using a generalized linear model and feature
selection. Sci Rep 8:15688
25. Laengsri V, Nantasenamat C, Schaduangrat N,
Nuchnoi P, Prachayasittikul V, Shoombuatong W (2019)
TargetAntiAngio: a sequence-based tool for
the prediction and analysis of anti-angiogenic
peptides. Int J Mol Sci 20:2950
26. Bhadra P, Yan J, Li J, Fong S, Siu SWI (2018)
AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns
of amino acid properties and random forest.
Sci Rep 8:1697
27. Khosravian M, Kazemi Faramarzi F, Mohammad Beigi M, Behbahani M, Mohabatkar H
(2013) Predicting antibacterial peptides by
the concept of Chou’s Pseudo-amino acid
composition and machine learning methods.
Protein Peptide Lett 20:180–186
28. Schaduangrat N, Nantasenamat C,
Prachayasittikul V, Shoombuatong W (2019)
Meta-iAVP: a sequence-based meta-predictor
for improving the prediction of antiviral peptides using effective feature representation.
Int J Mol Sci 20:5743
29. Tung C-W, Ziehm M, Kämper A,
Kohlbacher O, Ho S-Y (2011) POPISK:
T-cell reactivity prediction using support vector machines and string kernels. BMC Bioinf
12:446
30. Jorgensen KW, Rasmussen M, Buus S, Nielsen M (2014) NetMHCstab - predicting stability of peptide-MHC-I complexes; impacts
for cytotoxic T lymphocyte epitope discovery.
Immunology 141:18–26
31. Vita R, Mahajan S, Overton JA, Dhanda SK,
Martini S, Cantrell JR, Wheeler DK, Sette A,
Peters B (2019) The immune epitope database (IEDB): 2018 update. Nucleic Acids Res
47:D339–D343
32. Gupta S, Mittal P, Madhu MK, Sharma VK
(2017) IL17eScan: a tool for the identification of peptides inducing IL-17 response.
Front Immunol 8:1430
33. Manavalan B, Shin TH, Kim MO, Lee G
(2018) AIPpred: sequence-based prediction
of anti-inflammatory peptides using random
forest. Front Pharmacol 9:276
34. Wei L, Zhou C, Su R, Zou Q (2019) PEPredSuite: improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinf 35:4272–4280
35. Tang H, Su Z-D, Wei H-H, Chen W, Lin
H (2016) Prediction of cell-penetrating peptides with feature selection techniques. Biochem Biophys Res Commun 477:150–154
36. Wei L, Xing P, Su R, Shi G, Ma ZS, Zou Q
(2017) CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides
and their uptake efficiency. J Proteome Res
16:2044–2053
37. Pandey P, Patel V, George NV, Mallajosyula
SS (2018) KELM-CPPpred: Kernel extreme
learning machine based prediction model for
cell-penetrating peptides. J Proteome Res 17:
3214–3222
38. Arif M, Ahmad S, Ali F, Fang G, Li M, Yu, D-J
(2020) TargetCPP: accurate prediction of
cell-penetrating peptides from optimized
multi-scale features using gradient boost decision tree. J Comput Aided Mol Des 34:
841–856
39. Chen M, Ju C JT, Zhou G, Chen X, Zhang T,
Chang K-W, Zaniolo C, Wang W (2019)
Multifaceted protein–protein interaction prediction based on Siamese residual RCNN.
Bioinformatics 35:i305–i314
40. Hashemifar S, Neyshabur B, Khan AA, Xu J
(2018) Predicting protein-protein interactions through sequence-based deep learning.
Bioinformatics 34:i802–i810
41. Tran L, Hamp T, Rost B (2018) ProfPPIdb:
Pairs of physical protein-protein interactions
predicted for entire proteomes. PLOS One
13:e0199988
42. Romero-Molina
S,
Ruiz-Blanco
YB,
Harms M, Münch J, Sanchez-Garcia E
(2019) PPI-detect: a support vector machine
model for sequence-based prediction of
protein-protein interactions: PPI-Detect: a
support vector machine model for sequencebased prediction of protein-protein interactions. J Comput Chem 40:1233–1242
43. Eid F-E, ElHefnawi M, Heath LS (2016)
DeNovo: virus-host sequence-based protein–protein interaction prediction. Bioinf
32:1144–1150
44. Lian X, Yang S, Li H, Fu C, Zhang Z (2019)
Machine-learning-based predictor of humanbacteria protein-protein interactions by incorporating comprehensive host-network properties. J Proteome Res 18:2195–2205
45. Kösesoy I, Gök M, Öz C (2019) A new
sequence based encoding for prediction of
224
Maxence Delaunay and Tâp Ha-Duong
host–pathogen protein interactions. Comput
Biol Chem 78:170–177
46. Tan S-H, Hugo W, Sung, W-K, Ng S-K
(2006) A correlated motif approach for
finding short linear motifs from protein interaction networks. BMC Bioinf 7:502
47. Leung HC-M, Siu M-H, Yiu S-M, Chin
FY-L, Sung KW-K (2009) Clustering-based
approach for predicting motif pairs from protein interaction data. J Bioinf Comput Biol
07:701–716
48. Hugo W, Ng S-K, Sung W-K (2011)
D-SLIMMER: domain-SLiM interaction
motifs miner for sequence based proteinprotein interaction data. J Proteome Res 10:
5285–5295
49. Disfani FM, Hsu W-L, Mizianty MJ, Oldfield
CJ, Xue B, Dunker AK, Uversky VN, Kurgan
L (2012) MoRFpred, a computational tool
for sequence-based prediction and characterization of short disorder-to-order transitioning
binding
regions
in
proteins.
Bioinformatics 28:i75–i83
50. Malhis N, Gsponer J (2015) Computational
identification of MoRFs in protein sequences.
Bioinformatics 31:1738–1744
51. He H, Zhao J, Sun G (2019) Computational
prediction of MoRFs based on protein
sequences and minimax probability machine.
BMC Bioinf 20:529
52. Chen JR, Chang BH, Allen JE, Stiffler MA,
MacBeath G (2008) Predicting PDZ
domain–peptide interactions from primary
sequences. Nat Biotechnol 26:1041–1045
53. Reimand J, Hui S, Jain S, Law B, Bader GD
(2012) Domain-mediated protein interaction
prediction: from genome to network. FEBS
Lett 586:2751–2763
54. Sarkar D, Jana T, Saha S (2018) LMDIPred: a
web-server for prediction of linear peptide
sequences binding to SH3, WW and PDZ
domains. PLOS One 13:e0200430
55. Xue LC, Dobbs D, Honavar V (2011)
HomPPI: a class of sequence homology
based protein-protein interface prediction
methods. BMC Bioinf 12:244
56. Garcia-Garcia J, Valls-Comamala V, Guney E,
Andreu D, Muñoz FJ, Fernandez-Fuentes N,
Oliva B (2017) iFrag: a protein–protein interface prediction server based on sequence fragments. J Mol Biol 429:382–389
57. Dhole K, Singh G, Pai PP, Mondal S (2014)
Sequence-based prediction of protein–protein
interaction sites with L1-logreg classifier. J
Theoret Biol 348:47–54
58. Jia J, Liu Z, Xiao X, Liu B, Chou, K-C (2016)
iPPBS-Opt: a sequence-based ensemble
classifier for identifying protein-protein binding sites by optimizing imbalanced training
datasets. Molecules 21:95
59. Hou Q, De Geest PFG, Griffioen CJ, Abeln S,
Heringa J, Feenstra KA (2019) SeRenDIP:
SEquential REmasteriNg to DerIve profiles
for fast and accurate predictions of PPI interface positions. Bioinformatics 35:4794–4796
60. Afsar Minhas FuA, Geiss BJ, Ben-Hur A
(2014) PAIRpred: partner-specific prediction
of interacting residues from sequence and
structure:
interface
prediction
using
PAIRpred. Proteins: Struct Funct Bioinf 82:
1142–1155
61. Meyer MJ, Beltrán JF, Liang S, Fragoza R,
Rumack A, Liang J, Wei X, Yu H (2018)
Interactome INSIDER: a structural interactome browser for genomic studies. Nat Methods 15:107–114
62. Sanchez-Garcia R, Sorzano COS, Carazo JM,
Segura J (2019) BIPSPI: a method for the
prediction of partner-specific protein-protein
interfaces. Bioinf 35:470–477
63. Taherzadeh G, Yang Y, Zhang T, Liew AW-C,
Zhou Y (2016) Sequence-based prediction of
protein-peptide binding sites using support
vector machine. J Comput Chem 37:
1223–1229
64. Zhao Z, Peng Z, Yang J (2018) Improving
sequence-based prediction of protein–peptide
binding residues by introducing intrinsic disorder and a consensus method. J Chem Inf
Model 58:1459–1468
65. Dosztányi Z, Csizmok V, Tompa P, Simon I
(2005) IUPred: web server for the prediction
of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21:3433–3434
66. Yugandhar K, Gromiha MM (2014) Feature
selection and classification of protein–protein
complexes based on their binding affinities
using machine learning approaches. Proteins:
Struct Funct Bioinf 82:2088–2096
67. Srinivasulu Y, Wang, J-R, Hsu K-T, Tsai M-J,
Charoenkwan P, Huang W-L, Huang H-L,
Ho S-Y (2015) Characterizing informative
sequence descriptors and predicting binding
affinities of heterodimeric protein complexes.
BMC Bioinf 16:S14
68. Shao X, Tan CSH, Voss C, Li SSC, Deng N,
Bader GD (2011) A regression framework
incorporating quantitative and negative interaction data improves quantitative prediction
of PDZ domain–peptide interaction from primary sequence. Bioinformatics 27:383–390
69. Moal IH, Agius R, Bates PA (2011) Protein–protein binding affinity prediction on a
In Silico Design of Peptide-Based PPI Inhibitors
diverse set of structures. Bioinformatics 27:
3002–3009
70. Luo J, Guo Y, Zhong Y, Ma D, Li W, Li M
(2014) A functional feature analysis on diverse
protein–protein interactions: application for
the prediction of binding affinity. J Comput
Aided Mol Design 28:619–629.
71. Kamisetty H, Ghosh B, Langmead CJ, BaileyKellogg C (2015) Learning sequence determinants of protein:protein interaction specificity with sparse graphical models. J Comput
Biol 22:474–486
72. Jemimah S, Yugandhar K, Michael Gromiha
M (2017) PROXiMATE: a database of
mutant protein–protein complex thermodynamics and kinetics. Bioinf 33:2787–2788
73. Jankauskaitė
J,
Jiménez-Garcı́a
B,
Dapkūnas J, Fernández-Recio J, Moal IH
(2019) SKEMPI 2.0: an updated benchmark
of changes in protein–protein binding energy,
kinetics and thermodynamics upon mutation.
Bioinformatics 35:462–469
74. Geng C, Vangone A, Folkers GE, Xue LC,
Bonvin AMJJ (2019) iSEE: Interface structure, evolution, and energy-based machine
learning predictor of binding affinity changes
upon mutations. Proteins: Struct Funct Bioinf
87:110–119
75. Rodrigues CHM, Myung Y, Pires DEV,
Ascher DB (2019) mCSM-PPI2: predicting
the effects of mutations on protein–protein
interactions. Nucleic Acids Res 47:
W338–W344
76. Zhang N, Chen Y, Lu H, Zhao F, Alvarez RV,
Goncearenco A, Panchenko AR, Li M (2020)
MutaBind2: predicting the impacts of single
and multiple mutations on protein-protein
interactions. iScience 23:100939
77. Jemimah S, Sekijima M, Gromiha MM (2019)
ProAffiMuSeq: sequence-based method to
predict the binding free energy change of
protein–protein complexes upon mutation
using functional classification. Bioinformatics
36:1725–1730
78. Li G, Pahari S, Krishna Murthy A, Liang S,
Fragoza R, Yu H, Alexov E (2020) SAAMBESEQ: a sequence-based method for predicting
mutation effect on protein-protein binding
affinity. Bioinformatics 37:btaa761
79. Massa SM, Xie Y, Longo FM (2003) Alzheimer’s therapeutics. J Mol Neurosci 20:
323–326
80. Parthasarathi L, Casey F, Stein A, Aloy P,
Shields DC (2008) Approved drug mimics
of short peptide ligands from protein interaction motifs. J Chem Inf Model 48:
1943–1948
225
81. Fayaz SM, Rajanikant GK (2015) Modelling
the molecular mechanism of protein–protein
interactions and their inhibition: CypD–p53
case study. Mol Diversity 19:931–943
82. Caporuscio F, Tafi A, González E, Manetti F,
Esté JA, Botta, M (2009) A dynamic targetbased pharmacophoric model mapping the
CD4 binding site on HIV-1 gp120 to identify
new inhibitors of gp120–CD4 protein–protein interactions. Bioorganic Med Chem Lett
19:6087–6091
83. Hall PR, Leitão A, Ye C, Kilpatrick K,
Hjelle B, Oprea TI, Larson RS (2010) Small
molecule inhibitors of hantavirus infection.
Bioorganic Med Chem Lett 20:7085–7091
84. Pihan E, Delgadillo RF, Tonkin ML,
Pugnière M, Lebrun M, Boulanger MJ, Douguet D (2015) Computational and biophysical approaches to protein–protein interaction
inhibition of Plasmodium falciparum AMA1/
RON2 complex. J Comput Aided Mol Design
29:525–539
85. Jesus Perez de Vega M, Martin-Martinez M,
Gonzalez-Muniz R (2007) Modulation of
protein-protein interactions by stabilizing/
mimicking protein secondary structure elements. Current Topics Med Chem 7:33–62
86. Klein M (2017) Stabilized helical peptides:
overview of the technologies and its impact
on drug discovery. Expert Opin Drug Disc
12:1117–1125
87. Guarracino DA, Riordan JA, Barreto GM,
Oldfield AL, Kouba CM, Agrinsoni D
(2019) Macrocyclic control in Helix
Mimetics. Chem Rev 119:9915–9949
88. Khakshoor O, Nowick JS (2008) Artificial βsheets: chemical models of β-sheets. Current
Opin Chem Biol 12:722–729
89. Laxio Arenas J, Kaffy J, Ongeri S (2019) Peptides and peptidomimetics as inhibitors of
protein–protein interactions involving βsheet secondary structures. Current Opin
Chem Biol 52:157–167
90. Tanaka M (2007) Design and synthesis of
chiral α,α-disubstituted amino acids and conformational study of their oligopeptides.
Chem Pharmaceut Bull 55:349–358
91. Chatterjee J, Rechenmacher F, Kessler H
(2013) N-Methylation of peptides and proteins: an important element for modulating
biological functions. Angew Chem Int Edition 52:254–269
92. Sarnowski MP, Pedretty KP, Giddings N,
Woodcock HL, Del Valle JR (2018) Synthesis
and β-sheet propensity of constrained
N-amino peptides. Bioorganic Med Chem
26:1162–1166
226
Maxence Delaunay and Tâp Ha-Duong
93. Matthes D, Groot BLd (2009) Secondary
structure propensities in peptide folding
simulations: a systematic comparison of
molecular mechanics interaction schemes.
Biophys J 97:599–608
94. Rauscher S, Gapsys V, Gajda MJ,
Zweckstetter M, de Groot BL, Grubmüller
H (2015) Structural ensembles of intrinsically
disordered proteins depend strongly on force
field: a comparison to experiment. J Chem
Theory Comput 11:5513–5524
95. Chan-Yao-Chong M, Deville C, Pinet L, van
Heijenoort C, Durand D, Ha-Duong T
(2019) Structural characterization of
N-WASP domain V using MD simulations
with NMR and SAXS data. Biophys J 116:
1216–1227
96. Sugita Y, Okamoto Y (1999) Replicaexchange molecular dynamics method for
protein folding. Chem Phys Lett 314:
141–151
97. Laio A and Parrinello M (2002). Escaping
free-energy minima. Proc Natl Acad Sci 99:
12562–12566
98. Joseph TL, Lane DP, Verma CS (2012) Stapled BH3 peptides against MCL-1: mechanism and design using atomistic simulations.
PLOS One 7:e43985
99. Damas JM, Filipe LC, Campos SR, Lousa D,
Victor BL, Baptista AM, Soares CM (2013)
Predicting the thermodynamics and kinetics
of Helix formation in a cyclic peptide model.
J Chem Theory Comput 9:5148–5157
100. Cornillie SP, Bruno BJ, Lim CS, Cheatham
TE (2018) Computational modeling of stapled peptides toward a treatment strategy for
CML and broader implications in the design
of lengthy peptide therapeutics. J Phys Chem
B 122:3864–3875
101. Lama
D,
Quah
ST,
Verma
CS,
Lakshminarayanan R, Beuerman RW, Lane
DP, Brown CJ (2013) Rational optimization
of conformational effects induced by hydrocarbon staples in peptides and their binding
interfaces. Sci Rep 3:3451
102. Zhu J, Wei S, Huang L, Zhao Q, Zhu H,
Zhang A (2020) Molecular modeling and
rational design of hydrocarbon-stapled/
halogenated helical peptides targeting CETP
self-binding site: Therapeutic implication for
atherosclerosis. J Mol Graph Modell 94:
107455
103. Tan YS, Lane DP, Verma CS (2016) Stapled
peptide design: principles and roles of computation. Drug Discovery Today 21:1642–1653
104. Spitaleri A, Ghitti M, Mari S, Alberici L,
Traversari C, Rizzardi G-P, Musco G (2011)
Use of metadynamics in the design of
isoDGR-based αvβ3 antagonists to fine-tune
the conformational ensemble. Ang Chem Int
Edition 50:1832–1836
105. Yedvabny E, Nerenberg PS, So C, HeadGordon T (2015) Disordered structural
ensembles of vasopressin and oxytocin and
their mutants. J Phys Chem B 119:896–905
106. Yu H, Lin, Y-S (2015) Toward structure prediction of cyclic peptides. Phys Chem Chem
Phys 17:4210–4219
107. McHugh SM, Rogers JR, Solomon SA, Yu H,
Lin Y-S (2016) Computational methods to
design cyclic peptides. Current Opin Chem
Biol 34:95–102
108. Quartararo JS, Eshelman MR, Peraro L,
Yu H, Baleja JD, Lin Y-S, Kritzer JA (2014)
A bicyclic peptide scaffold promotes phosphotyrosine mimicry and cellular uptake.
Bioorganic Med Chem 22:6387–6391
109. Razavi AM, Wuest WM, Voelz VA (2014)
Computational screening and selection of
cyclic peptide hairpin mimetics by molecular
simulation and kinetic network models. J
Chem Inf Model 54:1425–1432
110. Wakefield AE, Wuest WM, Voelz VA (2015)
Molecular simulation of conformational
pre-organization in cyclic RGD peptides. J
Chem Inf Model 55:806–813
111. Est CB, Mangrolia P, Murphy RM (2019)
ROSETTA-informed design of structurally
stabilized cyclic anti-amyloid peptides. Protein Eng Design Select 32:47–57
112. Paissoni C, Ghitti M, Belvisi L, Spitaleri A,
Musco G (2015) Metadynamics simulations
rationalise the conformational effects induced
by N-methylation of RGD cyclic hexapeptides. Chem A Europ J 21:14165–14170
113. Slough DP, Yu H, McHugh SM, Lin Y-S
(2017)
Toward
accurately
modeling
N-methylated cyclic peptides. Phys Chem
Chem Phys 19:5377–5388
114. Lensink MF, Velankar S, Wodak SJ (2017)
Modeling protein-protein and proteinpeptide complexes: CAPRI 6th edition: modeling protein-protein and protein-peptide
complexes. Proteins Struct Funct Bioinf 85:
359–377
115. Gowthaman R, Miller SA, Rogers S,
Khowsathit J, Lan L, Bai N, Johnson DK,
Liu C, Xu L, Anbanandam A, Aubé J,
Roy A, Karanicolas J (2016) DARC: mapping
surface topography by ray-casting for effective
virtual screening at protein interaction sites. J
Med Chem 59:4152–4170
116. Binkowski TA, Naghibzadeh S, Liang J
(2003) CASTp: computed atlas of surface
In Silico Design of Peptide-Based PPI Inhibitors
topography of proteins. Nucleic Acids Res 31:
3352–3355
117. Le Guilloux V, Schmidtke P, Tuffery P (2009)
Fpocket: an open source platform for ligand
pocket detection. BMC Bioinf 10:168
118. Guo Z, Thorarensen A, Che J, Xing L (2016)
Target the more druggable protein states in a
highly dynamic protein–protein interaction
system. J Chem Inf Model 56:35–45
119. Guo Z, Li B, Dzubiella J, Cheng L-T,
McCammon JA, Che J (2013) Evaluation of
hydration free energy by level-set variational
implicit-solvent model with coulomb-field
approximation. J Chem Theory Comput 9:
1778–1787
120. Liu S, Liu C, Deng L (2018) Machine
learning approaches for protein–protein interaction hot spot prediction: progress and comparative assessment. Molecules 23:2535
121. Tuncbag N, Gursoy A, Keskin O (2009)
Identification of computational hot spots in
protein interfaces: combining solvent accessibility and inter-residue potentials improves
the accuracy. Bioinf 25:1513–1520
122. Xia J-F, Zhao X-M, Song J, Huang D-S
(2010) APIS: accurate prediction of hot
spots in protein interfaces by combining protrusion index with solvent accessibility. BMC
Bioinf 11:174
123. Wang L, Liu Z-P, Zhang X-S, Chen L (2012)
Prediction of hot spots in protein interfaces
using a random forest model with hybrid features. Protein Eng Design Select 25:119–126
124. Deng L, Guan J, Wei X, Yi Y, Zhang QC,
Zhou S (2013) Boosting prediction performance of protein-protein interaction hot
spots by using structural neighborhood properties. J Comput Biol 20:878–891
125. Qiao Y, Xiong Y, Gao H, Zhu X, Chen P
(2018) Protein-protein interface hot spots
prediction based on a hybrid feature selection
strategy. BMC Bioinf 19:14
126. Jones G, Willett P, Glen RC, Leach AR, Taylor R (1997) Development and validation of a
genetic algorithm for flexible docking11Edited by F. E. Cohen. J Mol Biol 267:
727–748
127. Trott O, Olson AJ (2010) AutoDock Vina:
improving the speed and accuracy of docking
with a new scoring function, efficient optimization, and multithreading. J Comput Chem
31:455–461
128. Wang L, Hou Y, Quan H, Xu W, Bao Y, Li Y,
Fu Y, Zou S (2013) A compound-based
computational approach for the accurate
determination of hot spots. Protein Sci 22:
1060–1070
227
129. Kulp JL, Kulp JL, Pompliano DL, Guarnieri F
(2011) Diverse fragment clustering and water
exclusion identify protein hot spots. J Amer
Chem Soc 133:10740–10743
130. Kulp JL, Cloudsdale IS, Kulp JL, Guarnieri F
(2017) Hot-spot identification on a broad
class of proteins and RNA suggest unifying
principles of molecular recognition. PLOS
One 12:e0183327
131. Cunningham BC, Wells JA (1989) Highresolution
epitope
mapping
of
hGH-receptor interactions by alaninescanning
mutagenesis.
Science
244:
1081–1085
132. Barlow KA, Ó Conchúir S, Thompson S,
Suresh P, Lucas JE, Heinonen M, Kortemme
T (2018) Flex ddG: Rosetta ensemble-based
estimation of changes in protein-protein
binding affinity upon mutation. J Phys
Chem B 122:5389–5399
133. Ibarra AA, Bartlett GJ, Hegedüs Z, Dutt S,
Hobor F, Horner KA, Hetherington K,
Spence K, Nelson A, Edwards TA, Woolfson
DN, Sessions RB, Wilson AJ (2019) Predicting and experimentally validating hot-spot
residues at protein–protein interfaces. ACS
Chem Biol 14:2252–2263
134. Martins SA, Perez M AS, Moreira IS, Sousa
SF, Ramos MJ, Fernandes PA (2013)
Computational alanine scanning mutagenesis:
MM-PBSA vs TI. J Chem Theory Comput 9:
1311–1319
135. Yang XQ, Liu JY, Li XC, Chen MH, Zhang
YL (2014) Key amino acid associated with
acephate detoxification by cydia pomonella
carboxylesterase based on molecular dynamics
with alanine scanning and site-directed mutagenesis. J Chem Inf Model 54:1356–1370
136. Dapiaggi F, Pieraccini S, Sironi M (2015) In
silico study of VP35 inhibitors: from computational alanine scanning to essential dynamics. Mol BioSyst 11:2152–2157
137. He L, Bao J, Yang Y, Dong S, Zhang L, Qi Y,
Zhang JZH (2019) Study of SHMT2 inhibitors and their binding mechanism by computational alanine scanning. J Chem Inf Model
59:3871–3878
138. Laurini E, Marson D, Aulic S, Fermeglia M,
Pricl S (2020) Computational Alanine scanning and structural analysis of the SARS-CoV2 Spike protein/angiotensin-converting
enzyme 2 complex. ACS Nano 14:
11821–11830
139. Zhao J, Yin B, Sun H, Pang L, Chen J (2020)
Identifying hot spots of inhibitor-CDK2
bindings by computational alanine scanning.
Chem Phys Lett 747:137329
228
Maxence Delaunay and Tâp Ha-Duong
140. Chen R, Li L, Weng Z (2003) ZDOCK: an
initial-stage protein-docking algorithm. Proteins 52:80–87
141. Baspinar A, Cukuroglu E, Nussinov R,
Keskin O, Gursoy A (2014) PRISM: a web
server and repository for prediction of protein–protein interactions and modeling their
3D complexes. Nucleic Acids Res 42:
W285–W289
142. Cheng TM-K, Blundell TL, Fernandez-Recio
J (2007) pyDock: electrostatics and desolvation for effective scoring of rigid-body protein-protein docking. Proteins 68:503–515
143. Degryse B, Fernandez-Recio J, Citro V,
Blasi F, Cubellis MV (2008) In silico docking
of urokinase plasminogen activator and integrins. BMC Bioinf 9:S8
144. Lee H, Heo L, Lee MS, Seok C (2015) GalaxyPepDock: a protein-peptide docking tool
based on interaction similarity and energy
optimization. Nucleic Acids Res 43:
W431–435
145. Yan Y, Wen Z, Wang X, Huang S-Y (2017)
Addressing recent docking challenges: a
hybrid strategy to integrate template-based
and free protein-protein docking. Proteins
Struct Funct Bioinf 85:497–512
146. Johansson-Åkhe I, Mirabello C, Wallner B
(2020) InterPep2: global peptide–protein
docking using interaction surface templates.
Bioinformatics 36:2458–2465
147. Schindler C, de Vries S, Zacharias M (2015)
Fully blind peptide-protein docking with
pepATTRACT. Structure 23:1507–1515
148. Yan C, Xu X, Zou X (2016) Fully blind docking at the atomic level for protein-peptide
complex structure prediction. Structure 24:
1842–1853
149. Alam N, Goldstein O, Xia B, Porter KA,
Kozakov D, Schueler-Furman O (2017)
High-resolution global peptide-protein docking using fragments-based PIPER-FlexPepDock. PLOS Comput Biol 13:e1005905
150. Zhou P, Jin B, Li H, Huang S-Y (2018)
HPEPDOCK: a web server for blind peptide–protein docking based on a hierarchical
algorithm.
Nucleic
Acids
Res
46:
W443–W450
151. Raveh B, London N, Schueler-Furman O
(2010) Sub-angstrom modeling of complexes
between flexible peptides and globular proteins. Proteins 78:2029–2040
152. Ben-Shimon A, Niv MY (2015). AnchorDock: blind and flexible anchor-driven peptide docking. Structure 23:929–940
153. Kurcinski M, Jamroz M, Blaszczyk M,
Kolinski A, Kmiecik S (2015) CABS-dock
web server for the flexible docking of peptides
to proteins without prior knowledge of the
binding site. Nucleic Acids Res 43:
W419–W424
154. Antunes DA, Moll M, Devaurs D, Jackson
KR, Lizée G, Kavraki LE (2017) DINC 2.0:
a new protein-peptide docking webserver
using an incremental approach. Cancer Res
77:e55–e57
155. Peterson LX, Roy A, Christoffer C, Terashi G,
Kihara D (2017) Modeling disordered protein interactions from biophysical principles.
PLOS Comput Biol 13:e1005485
156. London N, Raveh B, Movshovitz-Attias D,
Schueler-Furman O (2010) Can selfinhibitory peptides be derived from the interfaces of globular protein-protein interactions?
Proteins 78:3140–3149
157. London N, Raveh B, Schueler-Furman O
(2013) Druggable protein-protein interactions? from hot spots to hot segments. Current Opin Chem Biol 17:952–959
158. Nomme J, Takizawa Y, Martinez SF, Renodon-Cornière A, Fleury F, Weigel P, Yamamoto K-i, Kurumizaka H, Takahashi M
(2008) Inhibition of filament formation of
human Rad51 protein by a small peptide
derived from the BRC-motif of the BRCA2
protein. Genes Cells 13:471–481
159. Nomme J, Renodon-Cornière A, Asanomi Y,
Sakaguchi K, Stasiak AZ, Stasiak A, Norden B,
Tran V, Takahashi M (2010) Design of potent
inhibitors of human RAD51 recombinase
based on BRC motifs of BRCA2 protein:
modeling and experimental validation of a
chimera peptide. J Med Chem 53:5782–5791
160. Jafary F, Ganjalikhany MR, Moradi A,
Hemati M, Jafari S (2019) Novel peptide
inhibitors for lactate dehydrogenase a
(LDHA): a survey to inhibit ldha activity via
disruption of protein-protein interaction. Sci
Rep 9:4686
161. Gavenonis J, Jonas NE, Kritzer JA (2014)
Potential C-terminal-domain inhibitors of
heat shock protein 90 derived from a
C-terminal peptide helix. Bioorganic Med
Chem 22:3989–3993
162. Bopp B, Ciglia E, Ouald-Chaib A, Groth G,
Gohlke H, Jose J (2016) Design and
biological testing of peptidic dimerization
inhibitors of human Hsp90 that target the
C-terminal domain. Biochim et Biophys Acta
1860:1043–1055
163. Sedan Y, Marcu O, Lyskov S, SchuelerFurman O (2016) Peptiderive server: derive
peptide inhibitors from protein–protein interactions. Nucleic Acids Res 44:W536–W541
In Silico Design of Peptide-Based PPI Inhibitors
164. Horita S, Nomura Y, Sato Y, Shimamura T,
Iwata S, Nomura N (2016) High-resolution
crystal structure of the therapeutic antibody
pembrolizumab bound to the human PD-1.
Sci Rep 6:35297
165. Li D, Song H, Mei H, Fang E, Wang X,
Yang F, Li H, Chen Y, Huang K, Zheng L,
Tong Q (2018) Armadillo repeat containing
12 promotes neuroblastoma progression
through interaction with retinoblastoma
binding protein 4. Nat Commun 9:2829
166. Tarsia C, Danielli A, Florini F, Cinelli P,
Ciurli S, Zambelli B (2018) Targeting Helicobacter pylori urease activity and maturation: in-cell high-throughput approach for
drug discovery. Bioch et Biophys Acta 1862:
2245–2253
167. Geppert T, Bauer S, Hiss JA, Conrad E,
Reutlinger M, Schneider P, Weisel M,
Pfeiffer B, Altmann K-H, Waibler Z, Schneider G (2012) Immunosuppressive small molecule discovered by structure-based virtual
screening for inhibitors of protein–protein
interactions. Angew Chem Int Edition 51:
258–261
168. Johnson DK, Karanicolas J (2016) Ultrahigh-throughput structure-based virtual
screening for small-molecule inhibitors of
protein–protein interactions. J Chem Inf
Model 56:399–411
169. Koes DR, Dömling A, Camacho CJ (2018)
AnchorQuery: rapid online virtual screening
for small-molecule protein–protein interaction inhibitors. Protein Sci 27:229–232
170. Wu H, Liu Y, Guo M, Xie J, Jiang X (2014) A
virtual screening method for inhibitory peptides of angiotensin i–converting enzyme J
Food Sci 79:C1635–C1642
171. Yu Z, Fan Y, Zhao W, Ding L, Li J, Liu J
(2018)
Novel
angiotensin-converting
enzyme inhibitory peptides derived from
oncorhynchus mykiss nebulin: virtual screening and in silico molecular docking study. J
Food Sci 83:2375–2383
172. Yu Z, Kan R, Wu S, Guo H, Zhao W, Ding L,
Zheng F, and Liu, J. (2020). Xanthine oxidase inhibitory peptides derived from tuna
protein: virtual screening, inhibitory activity,
and molecular mechanisms. J Sci Food Agric
173. Kim DE, Chivian D, Baker D (2004) Protein
structure prediction and analysis using the
Robetta server. Nucleic Acids Res 32:
W526–W531
174. Duffy FJ, Verniere M, Devocelle M,
Bernard E, Shields DC, Chubb AJ (2011)
CycloPs: generating virtual libraries of
cyclized and constrained peptides including
229
nonnatural amino acids. J Chem Inf Model
51:829–836
175. Huang P-S, Boyken SE, Baker D (2016) The
coming of age of de novo protein design.
Nature 537:320–327
176. Kortemme T, Joachimiak LA, Bullock AN,
Schuler AD, Stoddard BL, Baker D (2004)
Computational redesign of protein-protein
interaction specificity. Nat Struct Mol Biol
11:371–379
177. Roberts KE, Cushing PR, Boisguerin P, Madden DR, Donald BR (2012) Computational
design of a PDZ domain peptide inhibitor
that rescues CFTR activity. PLOS Comput
Biol 8:e1002477
178. Sharabi O, Shirian J, Shifman J (2013) Predicting affinity- and specificity-enhancing
mutations at protein–protein interfaces. Biochem Soc Trans 41:1166–1169
179. Simonson T, Gaillard T, Mignon D, Schmidt
am Busch M, Lopes A, Amara N,
Polydorides S, Sedano A, Druart K, Archontis
G (2013) Computational protein design: the
Proteus software and selected applications. J
Comput Chem 34:2472–2484
180. Frappier V, Jenson JM, Zhou J, Grigoryan G,
Keating AE (2019) Tertiary structural motif
sequence statistics enable facile prediction and
design of peptides that bind anti-apoptotic
Bfl-1 and Mcl-1. Structure 27:606–617.e5
181. Poole AM, Ranganathan R (2006)
Knowledge-based potentials in protein
design. Current Opin Struct Biol 16:
508–513
182. Boas FE, Harbury PB (2007) Potential
energy functions for protein design. Current
Opin Struct Biol 17:199–204
183. Mackenzie CO, Zhou J, Grigoryan G (2016)
Tertiary alphabet for the observable protein
structural universe. Proc Natl Acad Sci 113:
E7438–E7447
184. Zheng F, Zhang J, Grigoryan G (2015) Tertiary structural propensities reveal fundamental
Sequence/structure
relationships.
Structure 23:961–971
185. Grigoryan G, Reinke AW, Keating AE (2009)
Design of protein-interaction specificity gives
selective bZIP-binding peptides. Nature 458:
859–864
186. Chen TS, Reinke AW, Keating AE (2011)
Design of peptide inhibitors that bind the
bZIP Domain of Epstein–barr virus protein
BZLF1 J Mol Biol 408:304–320
187. Smith CA, Kortemme T (2010) Structurebased prediction of the peptide sequence
space recognized by natural and synthetic
PDZ domains. J Mol Biol 402:460–474
230
Maxence Delaunay and Tâp Ha-Duong
188. Zheng F, Jewell H, Fitzpatrick J, Zhang J,
Mierke DF, Grigoryan G (2015) Computational design of selective peptides to discriminate between similar PDZ domains in an
oncogenic pathway. J Mol Biol 427:491–510
189. Sievers SA, Karanicolas J, Chang HW,
Zhao A, Jiang L, Zirafi O, Stevens JT,
Münch J, Baker D, Eisenberg D (2011)
Structure-based design of non-natural
amino-acid inhibitors of amyloid fibril formation. Nature 475:96–100
190. Zhang C, Shen Q, Tang B, Lai L (2013)
Computational design of helical peptides targeting TNFα. Angew Chem Int Edition 52:
11059–11062
191. Yang W, Zhang Q, Zhang C, Guo A, Wang Y,
You H, Zhang X, Lai L (2019)
Computational design and optimization of
novel d-peptide TNFα inhibitors. FEBS Lett
593:1292–1302
192. Foight GW, Ryan JA, Gullá SV, Letai A, Keating AE (2014) Designed BH3 peptides with
high affinity and specificity for targeting
Mcl-1 in cells. ACS Chem Biol 9:1962–1968
193. Berger S, Procko E, Margineantu D, Lee EF,
Shen BW, Zelter A, Silva D-A, Chawla K,
Herold MJ, Garnier J-M, Johnson R, MacCoss MJ, Lessene G, Davis TN, Stayton PS,
Stoddard BL, Fairlie WD, Hockenbery DM,
Baker D (2016) Computationally designed
high specificity inhibitors delineate the roles
of BCL2 family proteins in cancer. eLife 5:
e20352
Chapter 12
Rapid Rational Design of Cyclic Peptides Mimicking
Protein–Protein Interfaces
Brianda L. Santini and Martin Zacharias
Abstract
The cPEPmatch approach is a rapid computational methodology for the rational design of cyclic peptides to
target desired regions of protein–protein interfaces. The method selects cyclic peptides that structurally
match backbone structures of short segments at a protein–protein interface. In a second step, the cyclic
peptides act as templates for designed binders by adapting the amino acid side chains to the side chains
found in the target complex. A link to access the different tools that comprise the cPEPmatch method and a
detailed step-by-step guide are provided. We outline the protocol by following its application to the trypsin
protease in complex with the bovine pancreatic trypsin inhibitor (BPTI). An extension of our original approach is also
presented, where we give a detailed description of the usage of the cPEPmatch methodology focusing on
identifying hot regions of protein–protein interfaces prior to the matching. This extension allows one to
reduce the number of putative cyclic peptides evaluated and to specifically design only those that compete
with the strongest protein–protein binding regions. It is illustrated by an application to an MHC class I
protein complex.
Key words Protein–protein interactions, Protein interaction inhibition, Protein binding modulation,
Peptidomimetics, Cyclic peptide design, Drug design with cyclic peptides, Rational cyclic peptide
binders
1
Introduction
Protein–protein interactions (PPIs) play a critical role in nearly all
cellular processes, such as signaling, regulation, metabolism, and
proliferation, making them promising drug targets with broad-spectrum
therapeutic interest [1]. Hence, modulating PPIs is of
great clinical relevance, and considerable effort has been put into
targeting protein–protein interfaces by rational drug design
to interfere with or even disrupt these interactions. Typical PPI interfaces
tend to be large, flat, and mainly hydrophobic, and only a few
interface residues are crucial for protein–protein binding [2–5].
These residues, often referred to as hot spots, are major determinants of affinity and specificity [6, 7].
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_12,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
It has also been observed that such hot spots interact cooperatively and tend not to be uniformly spread across the interface, but
are grouped within tightly packed regions, known as hot loops [6, 8,
9]. Most notably, hot loop regions show secondary structural
features, such as α-helices, β-strands, and turns [10, 11], which
has established them as strong candidates for peptidomimetic
drug design approaches. The key to peptidomimetic strategies is
to graft suitable side chains onto stable backbone scaffolds. In the
case of PPIs, a detailed analysis of hot spots at the native interface
can serve as a guide for the rational design of the mimetic.
An important step that determines the affinity and specificity of a
peptidomimetic drug is the appropriate selection of the scaffold,
which is influenced by the location of the hot spots and by structural
details [12]. We have recently proposed a cyclic peptide matching
approach (cPEPmatch), a straightforward in silico process for the
rapid search and optimization of cyclic peptides as putative PPI
inhibitors that uses known cyclic peptide structures as scaffolds
[13]. Compared to linear peptides, cyclic peptides offer two major
advantages: often-improved cell membrane permeability (important for
use as drug molecules) and a relatively rigid three-dimensional
backbone structure, which lowers the entropic cost of binding in a
defined conformation. A standard linear peptide has two rotatable
bonds per amino acid along the main chain (N–Cα and Cα–C), allowing
for many combinations of backbone conformational substates. By
contrast, the backbone flexibility of a cyclic peptide is drastically
reduced, although not totally eliminated, so that the structure can
still make small conformational adjustments upon binding [14]. A
severe drawback, however, is that it can be difficult to reliably
predict the correct three-dimensional structure of arbitrary cyclic
peptides.
The general idea of our cPEPmatch method [13] is to start with a
library of existing crystal structures of cyclic peptides and use them
as templates for the construction of cyclic peptides that closely
mimic interface sections of a partner protein in PPI complexes. This
strategy avoids the de novo prediction of a putative cyclic peptide
structure that could fit the interface. We start the cPEPmatch process
by characterizing backbone motifs of epitopes found at the PPI
interface and comparing them to backbone motifs in a database of
cyclic peptides with known structure. If a backbone match is found,
the cyclic peptide structure is superimposed on the epitope, and the
corresponding amino acids of the cyclic peptide are substituted by
those of the binding epitope to closely replicate the interface. In
our preliminary proof-of-principle study, the automated approach was
tested on 154 protein–protein complexes; for the majority (~71%), we
identified cyclic peptide motifs that resulted in stable bound
complexes during MD refinement [13]. It was also possible to
Cyclic Peptide Protein Interface Binders
predict the structure of cyclic peptide binders that were previously
found in experiments to strongly bind to proteins at a known
interface. In this first work, however, key hot regions and their
specific hot spots were not identified prior to the application of
cPEPmatch; instead, the method was applied to the whole interface.
Chapter Overview: Here, the main goal is to outline the step-by-step
application of the cPEPmatch approach using examples. First, we
describe the application to the trypsin protease in complex with the
inhibitor protein BPTI. In an extension of our original approach, we
then include the identification of the regions in the PPI that
contribute most to the interaction (hot regions) prior to the matching
process, in order to identify putative cyclic peptides that
potentially compete with such regions for binding. This limits the
number of complexes that need to be evaluated using molecular dynamics
(MD) simulations in combination with MM/GBSA (Molecular Mechanics
Generalized Born Surface Area) calculations. Subsequently, we give a
detailed description of the usage and analysis of the cPEPmatch
methodology focusing on the hot loop sequences.
2 Materials
The cPEPmatch method requires a known protein–protein complex
structure as input and a choice of which partner is the target
(receptor) for cyclic peptide template selection. For the present
examples, the 4dg4.pdb file (trypsin receptor, chain A, in complex
with trypsin inhibitor protein, chain B) and the MHC class I complex
(5fa3.pdb) can be downloaded from the Protein Data Bank
(www.rcsb.org). For all energy minimization and MD simulations, we use
the Amber18 package [15], which can be obtained from
https://ambermd.org/ (requires an academic license). The code for
performing the cPEPmatch calculations can be downloaded freely from
https://www.groups.ph.tum.de/t38/downloads (for academic use). It runs
on any PC with a Linux operating system and includes installation
instructions. Another important input for the cPEPmatch approach is a
collection of known cyclic peptide structures that can be superimposed
on structural motifs found at the interface of PPIs. We have expanded
the library presented previously [13] to a set of 72 cyclic peptides
that vary in cyclization type and size. The cyclic peptide structures
have been chosen to represent common elements of protein–protein
interaction hot spots, such as α-helices, β-strands, turns, and loops,
and can be downloaded from www.groups.ph.tum.de/t38/downloads/. We
intend to keep improving the database as more cyclic peptide crystal
structures are solved. Users can also add to the data set file or
create a new cyclic peptide data set (see Note 1).
3 Methods
Our cyclic peptide matching approach is divided into three main steps
(apart from the initial library construction described in Subheading 2
and Note 1). They are outlined below for targeting the interface of
the complex between trypsin inhibitor protein (BPTI) and trypsin
(pdb4dg4, see Fig. 1). For this case, a cyclic peptide with high
binding affinity for trypsin (the sunflower peptide) that can
effectively compete with BPTI is known, and its structure in complex
with trypsin (pdb1sfi) can be used for comparison with the cPEPmatch
result.
3.1 PPI Motif Characterization and Matching
The first step is to characterize the PPI interface in terms of
distances between four consecutive backbone Cα (CA) atoms and to match
them with the corresponding distances in the data set of cyclic
peptides. The program int_analyse identifies all neighboring protein
residues at the interface of the input PPI complex (pdb4dg4) and
subsequently characterizes their backbone atom motifs. Besides the
input complex structure (PDB file), it requires the name of the cyclic
peptide database, the distance cutoff for atoms counted as belonging
to the interface, and a threshold for the mean distance deviation that
defines a match between a backbone segment at the interface and a
given cyclic peptide backbone. In previous work, we found an interface
cutoff of 7 Å sufficient to identify all relevant cyclic peptide
matches [13]. We set the threshold for the mean distance deviation to
0.5 Å to accept only sufficiently precise matches. Hence, the
int_analyse command is given as:
./int_analyse 4dg4.pdb cPEPdatabase.dat 7.0 0.5
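The matching criterion can be illustrated with a minimal Python sketch. It is not the int_analyse implementation. The choice of three distances per four-atom window (i to i+2, i to i+3, and i+1 to i+3) is an assumption inferred from the database rows shown in Note 1, the mean absolute difference is only one plausible reading of the "mean distance deviation", and the interface triple below is invented for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Motif = Tuple[float, float, float]


def motif_distances(ca: Sequence[Point]) -> List[Motif]:
    """One distance triple per window of four consecutive C-alpha atoms.

    Assumed triple: d(i, i+2), d(i, i+3), d(i+1, i+3).
    """
    motifs = []
    for i in range(len(ca) - 3):
        a, b, c, d = ca[i:i + 4]
        motifs.append((math.dist(a, c), math.dist(a, d), math.dist(b, d)))
    return motifs


def mean_deviation(m1: Motif, m2: Motif) -> float:
    """Mean absolute difference between two distance triples (Angstrom)."""
    return sum(abs(x - y) for x, y in zip(m1, m2)) / len(m1)


def is_match(interface: Motif, template: Motif, threshold: float = 0.5) -> bool:
    """Accept the pair when the mean deviation is within the threshold."""
    return mean_deviation(interface, template) <= threshold


# Template triple copied from the 3avb example in Note 1 (second motif).
template_3avb = (6.95, 8.55, 5.34)
# Hypothetical interface triple, invented for illustration only.
interface_motif = (7.10, 8.30, 5.50)
print(is_match(interface_motif, template_3avb))
```

With a threshold of 0.5 Å, the hypothetical interface triple is accepted because each of its three distances differs from the template by at most 0.25 Å.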
Fig. 1 Trypsin protein (yellow surface) in complex with (a) trypsin inhibitor protein (pdb4dg4, cyan carbons)
after 10 ns of MD simulation, (b) best matched cyclic peptide (based on pdb3avb, pink carbons) mimicking the
trypsin inhibitor protein after 10 ns MD simulation, and (c) sunflower cyclic peptide (pdb1sfi, silver carbons)
shown for comparison
The output will be a match_list.dat file that contains a list of the
matched cyclic peptides in the order that they were found. For each
match, there will be a single row like:
3avb | 748 752 760 768 8 16 25 33
This can be read as follows: the Cα atoms 8, 16, 25, and 33 of the
cyclic peptide with PDB code 3avb structurally match the consecutive
Cα atoms with numbers 748, 752, 760, and 768 in the target 4dg4.pdb
file.
3.2 Superposition of Cyclic Peptides on Interface Segments
The next step is performed using the pose_motif tool, and it superimposes the coordinates of the identified cyclic peptide on the
matching backbone interface motif segment at the target PPI. It
requires the name of the complex structure, the cyclic peptide
structure, and the backbone atom numbers as input, e.g., for our
example:
./pose_motif 4dg4.pdb 3avb.pdb 748 752 760 768 8 16 25 33
This step returns a coordinate file of the cyclic peptide and a
Fit-RMSD value that measures how closely the backbone motif at the
interface is reproduced by the corresponding backbone structure in the
cyclic peptide. The process can be repeated for the other matches
listed in the match_list.dat file (a script is provided in the
download to automate this process). After this step, a visual analysis
can be useful to decide which of the superimposed structures from the
match list have a good steric fit. Criteria to consider include proper
size, secondary structure similarity, and alignment. Only selected
putative matches should be processed further for sequence adaptation
and analysis.
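Repeating the superposition for every entry of match_list.dat can be sketched in Python as follows. This is not the script shipped with cPEPmatch. It assumes only the row layout and the pose_motif argument order shown above, and that each template is stored as <code>.pdb in the working directory (a hypothetical naming convention).

```python
from typing import List


def pose_motif_commands(match_list_text: str, complex_pdb: str) -> List[str]:
    """Turn rows such as '3avb | 748 752 760 768 8 16 25 33' into commands."""
    commands = []
    for row in match_list_text.splitlines():
        if "|" not in row:
            continue  # skip blank or unexpected lines
        code, numbers = row.split("|", 1)
        atoms = numbers.split()
        if len(atoms) != 8 or not all(a.isdigit() for a in atoms):
            raise ValueError(f"unexpected match row: {row!r}")
        commands.append(
            f"./pose_motif {complex_pdb} {code.strip()}.pdb {' '.join(atoms)}"
        )
    return commands


example = "3avb | 748 752 760 768 8 16 25 33\n"
for command in pose_motif_commands(example, "4dg4.pdb"):
    print(command)
```

The generated command strings can be inspected first and then passed to a shell, which keeps the visual selection step described above under the user's control.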
3.3 Sequence Adaptation
The final step of cPEPmatch before the evaluation analysis is to adapt
the sequence of the cyclic peptide to mimic the interface of the PPI
as closely as possible. The standard procedure we proposed [13] is to
replace the side chains in the selected cyclic peptide by the side
chains found in the original PPI complex. This standard replacement is
included in the pose_motif tool, which provides an out.pdb file with
the receptor protein coordinates (trypsin) and the cyclic peptide
coordinates, including the side chains copied from the interface of
the target complex.
4 Evaluation of Matched Complexes
4.1 System Preparation
For the evaluation and scoring of protein–cyclic peptide complexes, we
use the Amber18 package (see Subheading 2). Structures are prepared
for energy minimization (EM) and MD simulations using the tleap module
of Amber18 following standard procedures (see also the Amber18
manual). Note that for the simulation of disulfide-bonded cyclic
peptides, the input PDB file for tleap must use the residue name CYX
instead of the regular CYS for the cysteine residues participating in
disulfide bonds. The pdb4amber tool of the package can make this
change automatically. Additional steps are required for the
preparation of head-to-tail or similarly cyclized peptides (see
Note 2).
Protein parameters are taken from the ff14SB force field [16]. The
complexes are neutralized by the addition of Na+ or Cl− ions and
solvated in an orthorhombic box of explicit TIP3P water molecules [17]
with a minimum distance of 10 Å to the box boundaries. First, all
simulation systems are energy minimized with the steepest descent
method for 2000 steps using the Amber18 sander module. All subsequent
MD simulations can be performed with the pmemd.cuda module, which
allows faster simulations than the sander program. Initially, the
systems are heated to 310 K in three stages (100 K, 200 K, and 310 K).
Each stage is simulated for 100 ps with positional restraints on all
non-hydrogen atoms with respect to the starting conformation.
Subsequently, the positional restraints are gradually reduced from the
initial 25 to 0.5 kcal·mol−1·Å−2 in five consecutive simulations of
100 ps each at 310 K and a constant pressure of 1 bar. The
equilibrated structures serve as input for the unrestrained production
runs of each system. Data-gathering simulations are carried out for
10 ns, and coordinates are written out every 500 steps. A time step of
2 fs is used, and all bonds involving hydrogen atoms are constrained
to their optimal length using SHAKE [18].
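The timing figures of this protocol translate into step and frame counts as sketched below in Python. The numbers are taken from the text. The variable names are illustrative only, and no Amber input is generated here.

```python
TIME_STEP_FS = 2.0          # integration time step (from the protocol)
WRITE_EVERY_STEPS = 500     # coordinate output interval (from the protocol)


def steps_for(duration_ps: float, dt_fs: float = TIME_STEP_FS) -> int:
    """Number of MD steps needed to cover duration_ps picoseconds."""
    return round(duration_ps * 1000.0 / dt_fs)


heating_stages_k = [100, 200, 310]        # three heating stages, 100 ps each
restraint_release_stages = 5              # five 100 ps runs, 25 -> 0.5 kcal/mol/A^2
stage_steps = steps_for(100)              # steps per 100 ps stage
equilibration_ps = 100 * (len(heating_stages_k) + restraint_release_stages)

production_steps = steps_for(10_000)      # 10 ns production run
frames = production_steps // WRITE_EVERY_STEPS
ps_per_frame = WRITE_EVERY_STEPS * TIME_STEP_FS / 1000.0

print(stage_steps, equilibration_ps, production_steps, frames, ps_per_frame)
```

Each 100 ps stage therefore corresponds to 50,000 steps, and the 10 ns production run yields 10,000 stored frames spaced 1 ps apart, which is the pool from which snapshots are drawn for the energy analysis.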
4.2 Trajectory Analysis to Score the Matches
Stable binding of the cyclic peptide can be assessed first by visual
analysis of the MD trajectory. We then use the MM/GBSA (Molecular
Mechanics Generalized Born Surface Area) method to analyze the mean
interaction energy, following the well-established single trajectory
method [19] as implemented in the MMPBSA.py module of Amber18.
Calculations are carried out using 500 snapshots taken from the last
5 ns of the production MD simulation, employing the modified GB model
(igb = 5) with mbondi2 radii and α, β, and γ values of 1.0, 0.8, and
4.85, respectively. Dielectric constants for the solvent and the
solute are 80 and 5, respectively. As output, the approach gives the
mean interaction energy between the cyclic peptide and the protein
partner.
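A minimal Python sketch of how the frame selection could be written into an MMPBSA.py input file is given below. It assumes 10,000 stored frames (10 ns written every 1 ps) and uses only the startframe, endframe, interval, igb, and idecomp keywords. The radii set and the dielectric constants quoted above are not set by this sketch and must be handled separately, and the helper name mmpbsa_input is hypothetical.

```python
def mmpbsa_input(total_frames: int, n_snapshots: int,
                 fraction: float = 0.5, decompose: bool = False) -> str:
    """Build a minimal MMPBSA.py input using the last `fraction` of frames."""
    window = int(total_frames * fraction)       # frames in the analysis window
    startframe = total_frames - window + 1      # first frame of the window
    interval = window // n_snapshots            # stride giving n_snapshots
    lines = [
        "MM/GBSA on the final part of the production run",
        "&general",
        f"  startframe={startframe}, endframe={total_frames}, interval={interval},",
        "/",
        "&gb",
        "  igb=5,",
        "/",
    ]
    if decompose:
        # Per-residue decomposition, as used in Subheading 5.2.1
        # (idecomp=1 is an assumed choice among the per-residue schemes).
        lines += ["&decomp", "  idecomp=1,", "/"]
    return "\n".join(lines) + "\n"


print(mmpbsa_input(total_frames=10_000, n_snapshots=500))
```

For 10,000 frames this selects every 10th frame from frame 5001 onward, which gives the 500 snapshots from the last 5 ns described in the text.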
For our 4dg4.pdb (trypsin/BPTI) example, two cyclic peptide matches
are evaluated using the above procedure. The best-scoring cyclic
peptide very closely resembles a turn motif at the trypsin/BPTI
interface and is also in very close agreement with the structure of
the known sunflower cyclic peptide inhibitor (root-mean-square
deviation (RMSD) at the interface < 0.5 Å, see Fig. 1).
5 Focusing on Hot Spot Regions at the Protein–Protein Interface
5.1 Application to the MHC PPI
MHC class I complexes bind small antigenic peptides and present them
to the immune system at the cell surface, controlling which fragments
of a pathogen or cancer antigen are presented to cytotoxic T cells for
immune recognition [20]. Peptide binding is strongly coupled to the
association of the heavy chain with its β2-microglobulin (β2m)
partner, and dissociation of β2m also leads to loss of peptide binding
[21]. Hence, cyclic peptides designed to interfere with this
interaction could potentially be used to control and modulate the
immune response (including many undesired autoimmune reactions). For
our application, we use the PDB entry 5fa3 (a human MHC class I
molecule). In the present example, our goal is to target the heavy
chain by replacing/mimicking the β2m partner with a cyclic peptide. In
Fig. 2, the two main contact regions between both partners (chains A
and C in the original 5fa3.pdb) are indicated.
Fig. 2 Contact regions 1 and 2 between the heavy chain (yellow cartoon, with
the marked subdomains α1, α2, and α3) and the β2m partner (cyan) of the major
histocompatibility complex (MHC) class I complex (pdb5fa3). The bound
antigenic peptide (located in the binding cavity formed by the α1 and α2
subdomains) is shown in orange. The dotted boxes mark the two main
interface regions between the heavy chain and the β2m partner targeted by the
cPEPmatch approach
5.2 Identification of Hot Loops for cPEPmatch
Since there is more than one contact region between chains A and C in
5fa3.pdb, and each includes multiple contact motifs, a direct
application of cPEPmatch resulted in a large number of matches. In
order to reduce the number of potential complexes that need to be
evaluated and to design only the strongest competitors, we used an
extension of our original approach. In this extension, we first
perform a short MD simulation and MM/GBSA analysis to predict the
interface segments that contribute most to the interaction in the
heavy chain/β2m complex (hot loops). We then focus the search for
cyclic peptide binders on these hot loop interface regions. The setup
and MD simulation of the MHC complex are performed in exactly the same
way as described in Subheading 4 for scoring the cyclic
peptide/protein complexes.
5.2.1 Trajectory Analysis and Binding Hot Spot Identification
Residues that contribute most to the interaction in the MHC class I
complex are identified using the MM/GBSA method as described in
Subheading 4.2, but with the option to include a per-residue
interaction energy decomposition (ΔGres), according to the single
trajectory method as implemented in the MMPBSA.py module of Amber18
[19, 22]. As output, one obtains the mean interaction energy
contribution of each residue along the sequence (Fig. 3). In our case,
we use all residues belonging to the β2m partner, because the designed
cyclic peptide should superimpose on a backbone segment of β2m. We
define hot spot residues potentially important for binding as those
with a total ΔGres < −kBT = −0.6 kcal·mol−1 (kB: Boltzmann constant;
T: temperature, 300 K) and with the majority of their interaction
energy contribution due to side chain interactions. Hot loops are
segments of 8 to 10 residues that include at least four hot spot
residues. For the present example, two hot segments can be clearly
identified (Fig. 3). Loop 1 comprises residues 278 to 287 and contains
four hot spot residues: Lys279, Gln281, Tyr283, and Arg285. Loop 2
comprises residues 326 to 335 and contains five hot spot residues:
Asp326, Leu327, Phe329, Trp333, and Phe335. The side chain and
backbone contributions of the residues in both loops are listed in
Table 1. The coordinates of both loops were extracted from the last
frame of the MD simulation and used as input for cPEPmatch. It should
be noted that alternative methods to identify important interaction
regions can also be used at this step.
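The two selection rules can be expressed as a short Python sketch. The per-residue values below are invented for illustration and are not the values of Table 1. Reading "majority of the contribution due to side chain interactions" as a side chain share above 50% of the total is an assumption, and the window enumeration only lists candidate segments, from which the final loops are still chosen by inspection.

```python
from typing import Dict, Iterable, List, Tuple

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def hot_spots(decomp: Dict[int, Tuple[float, float]],
              temperature: float = 300.0) -> List[int]:
    """decomp maps residue number -> (side chain, backbone) in kcal/mol."""
    threshold = -KB_KCAL_PER_MOL_K * temperature   # about -0.6 at 300 K
    selected = []
    for resid, (side, back) in sorted(decomp.items()):
        total = side + back
        # Favorable beyond -kBT and dominated by the side chain term.
        if total < threshold and side / total > 0.5:
            selected.append(resid)
    return selected


def candidate_loops(residues: Iterable[int], spots: List[int],
                    min_len: int = 8, max_len: int = 10,
                    min_spots: int = 4) -> List[Tuple[int, int]]:
    """All (first, last) windows of 8-10 residues with >= 4 hot spots."""
    residues = list(residues)
    spot_set = set(spots)
    first_res, last_res = min(residues), max(residues)
    loops = []
    for length in range(min_len, max_len + 1):
        for first in range(first_res, last_res - length + 2):
            last = first + length - 1
            count = sum(1 for r in range(first, last + 1) if r in spot_set)
            if count >= min_spots:
                loops.append((first, last))
    return loops


# Hypothetical decomposition (side chain, backbone), illustration only.
toy = {278: (0.1, -0.1), 279: (-2.5, 0.1), 280: (0.1, -0.3), 281: (-3.5, 0.3),
       282: (0.1, -0.2), 283: (-7.0, 0.1), 284: (0.0, -1.1), 285: (-3.8, 0.8),
       286: (-0.2, -0.1), 287: (0.5, 0.4)}
spots = hot_spots(toy)
print(spots, candidate_loops(toy.keys(), spots))
```

At 300 K the product kBT evaluates to about 0.596 kcal·mol−1, which is the origin of the rounded 0.6 kcal·mol−1 cutoff.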
5.3 Application of cPEPmatch to Hot Loop Regions
The application of the cPEPmatch approach follows the same procedure
as described in Subheading 3. However, instead of the original MHC
class I complex file, we use as input a modified complex file that
contains the heavy chain coordinates (receptor) and just one of the
identified hot loop structures (one complex PDB file per loop). With
the command
./int_analyse receptor-and-loop.pdb database.dat 7.0 0.5
we obtain a list of matches stored in a match_list.dat file. The
matches are again used to construct complexes of the cyclic peptide
with the MHC class I heavy chain, and the mean interaction energy is
calculated following the protocol of Subheading 4.
Fig. 3 Hot segment selection for each contact region in the MHC class I heavy chain interface to the β2m
subunit. (a) Per-residue contributions to the effective interaction energy as calculated by MM/GBSA decomposition. The chosen hot loops along the β2m sequence are shown inside blue rectangles. (b) Loop 1 interface
with labeled hot spots. (c) Same as (b) but for loop 2. All subunits are represented as cartoons (α1, α2, α3
segments of the heavy chain: yellow; β2m: cyan)
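One way to prepare the reduced input file used in this subheading is sketched below in Python. It assumes that residues are numbered consecutively through the complex, as the loop residue numbers 278 to 287 suggest, and it selects records by residue number only. The receptor range in the usage comment is a placeholder rather than a value taken from 5fa3, and any formatting that int_analyse needs to tell the two partners apart is not covered here.

```python
from typing import Iterable, List


def residue_number(line: str) -> int:
    """Residue sequence number stored in PDB columns 23-26."""
    return int(line[22:26])


def receptor_and_loop(pdb_lines: Iterable[str],
                      receptor: range, loop: range) -> List[str]:
    """Keep ATOM/HETATM records of the receptor and of one hot loop."""
    kept = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # TER, CONECT, and other records are dropped in this sketch
        number = residue_number(line)
        if number in receptor or number in loop:
            kept.append(line)
    return kept


# Hypothetical usage; file names and the receptor range are placeholders.
# with open("complex_last_frame.pdb") as handle:
#     lines = receptor_and_loop(handle, range(1, 276), range(278, 288))
# with open("receptor-and-loop.pdb", "w") as handle:
#     handle.writelines(lines)
```

Running the function once per hot loop gives the two separate complex files described above.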
5.4 Selected Results for Targeting the MHC α Chain
Three stable cyclic peptide matches were found for each hot loop
(Table 2). Although it was not possible to include every single hot
spot in each structure, as many as possible were introduced by
mutation in each match. Figure 4 shows one representative match for
each loop after 10 ns of MD simulation. In both cases, we observe that
the mutated hot spots have orientations similar to those found at the
β2m interface with the heavy chain (which they are mimicking). Also,
the decomposition of the binding free energy shows similar behavior
for all of the matches. In total, six putative cyclic peptides have
been found to target the α chain of the MHC class I complex. Previous
work [13] indicated that for known complexes of peptides binding to
proteins, calculated interaction energies of similar magnitude are
obtained (< −30 kcal·mol−1). Hence, some of the suggested cyclic
peptides may indeed show stable binding to the target structure.

Table 1
MM/GBSA free energy decomposition for the two chosen hot loops in the β2m subunit. Three free
energy values (in kcal·mol−1) are shown for each residue: total (ΔGres), side chain (ΔGres-ss), and backbone
(ΔGres-bb) contributions

Loop 1                                      Loop 2
Residue  ΔGres  ΔGres-ss  ΔGres-bb          Residue  ΔGres  ΔGres-ss  ΔGres-bb
PRO 278  0.02   0.06      0.08              ASP 326  9.11   9.03      0.08
LYS 279  2.37   2.49      0.12              LEU 327  1.88   2.38      0.49
ILE 280  0.20   0.10      0.30              SER 328  0.92   0.42      0.51
GLN 281  3.26   3.60      0.35              PHE 329  5.35   5.54      0.20
VAL 282  0.10   0.14      0.24              SER 330  0.70   0.18      0.51
TYR 283  6.89   6.99      0.10              LYS 331  0.64   0.28      0.36
SER 284  1.05   0.04      1.10              ASP 332  0.87   0.70      0.17
ARG 285  3.00   3.77      0.77              TRP 333  8.35   7.28      1.07
HIE 286  0.32   0.18      0.14              SER 334  0.06   0.01      0.05
PRO 287  1.16   0.80      0.36              PHE 335  2.83   2.77      0.06

Table 2
Best matches found for both hot loops identified for the MHC class I system

Loop  Match (a)  Substitutions in the cyclic peptide       F-RMSD (Å)  ΔGinteraction (kcal·mol−1)
1     1ebp       9-VAL, 10-TYR, 11-SER, 12-ARG             0.24        25.17
1     5eoc       2-LYS, 3-ILE, 4-GLN, 5-VAL, 6-TYR         0.04        15.75
1     4w4z       3-LYS, 5-GLN, 6-VAL, 7-TYR                0.09        24.62
2     3zwz       8-ASP, 9-LEU, 10-SER, 11-PHE              0.16        39.53
2     3avb       1-ASP, 2-TRP, 3-SER, 4-PHE                0.06        20.88
2     1ebp       14-SER, 15-ASP, 16-LEU                    0.14        51.63
(a) Indicates the PDB entry of the matching cyclic peptide
Fig. 4 Representative matches and modeled structures of protein-cyclic-peptide complexes for the MHC class
I heavy chain (yellow) targeting the interaction with β2m. (a) Cyclic peptide match pdb5eoc mimicking Loop
1. (b) Cyclic peptide match pdb3zwz mimicking Loop 2. In both cases, the heavy chain (yellow) and matched
cyclic peptides (pink) are shown as cartoon. The labeled residues (sticks) correspond to the β2m residues in
the native MHC class I complex that are replaced in the complex with the cyclic peptides

6 Concluding Notes
• We recently reported the cPEPmatch approach for the rational
  design of cyclic peptides that target protein–protein interfaces.
  Hundreds of PPIs can be screened within a few seconds for cyclic
  peptides that match backbone structures at the PPI interface,
  and even with a relatively small set of cyclic peptide templates, we
  have shown that it is possible to identify putative stable bound
  cyclic peptide–protein complexes [13].
• An advantage of our cPEPmatch method compared to experimental
  studies is that we base the construction of a desired cyclic
  peptide on known stable (high resolution) cyclic template
  structures, avoiding the uncertainty about how well a selected
  cyclization of a given motif will resemble a desired binding region.
• A key to finding a stable binder is the adaptation of the cyclic
  peptide sequence to closely mimic the essential protein–protein
  interface interactions.
• We described an extension of our original cPEPmatch approach
  to target PPIs that have multiple and/or large binding sites. It
  consists of a short MD simulation and MM/GBSA analysis for
  the identification of hot loops in the PPI prior to the matching
  process, in order to identify putative cyclic peptides that target
  such regions.
• Hot loop identification reduces the number of putative cyclic
  peptides to be evaluated and restricts the design to those cyclic
  peptides that specifically compete with the strongest
  protein–protein binding regions.
• The cPEPmatch hot loop extension was applied to target the
  heavy chain of an MHC class I example, and six putative cyclic
  peptide binders are suggested.
7 Notes
1. Extension of the cyclo-peptide database.
Additional cyclic peptides can be added to the database by
using our backbo FORTRAN tool. This program calculates
distance matrices in sets of four consecutive Cα atoms by
iterating through every set of four residues. The output is a
set of motif values, which specifies the measured distances and
corresponding amino acid positions.
Backbo can be run directly from a UNIX terminal. An
example for the 8-residue 3avb cyclic peptide is shown below.
Run the command:
$ backbo -i 3avb.pdb >> cPEPdatabase.dat
This appends output of the following form to an existing data
file:
start 3avb
1 5.64 8.90 6.95 2 8 16 25
2 6.95 8.55 5.34 8 16 25 33
3 5.34 4.81 5.60 16 25 33 41
4 5.60 5.80 6.23 25 33 41 49
5 6.23 9.01 5.38 33 41 49 57
The 3avb cyclic peptide has five motifs, one per output row.
The first number in each row is the motif index, the next three
numbers are the distance values of that motif, and the last four
numbers specify the corresponding Cα atom numbers. The motif
sets of all cyclic peptides should be stored in a single database
file (cPEPdatabase.dat in the command above; database.dat in
Subheading 5.3), which is read by the int_analyse tool during
the matching process.
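The database layout shown in this note can be read back with a few lines of Python, for example to check a newly appended entry. The sketch assumes exactly the layout of the 3avb example (a "start <code>" line followed by rows of one index, three distances, and four atom numbers). The function name parse_database is hypothetical and not part of the cPEPmatch distribution.

```python
from typing import Dict, List, Tuple

MotifEntry = Tuple[Tuple[float, ...], Tuple[int, ...]]


def parse_database(text: str) -> Dict[str, List[MotifEntry]]:
    """Map each PDB code to its list of (distances, C-alpha atom numbers)."""
    database: Dict[str, List[MotifEntry]] = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "start":
            current = fields[1]
            database[current] = []
        elif current is not None and len(fields) == 8:
            distances = tuple(float(x) for x in fields[1:4])
            atoms = tuple(int(x) for x in fields[4:8])
            database[current].append((distances, atoms))
    return database


example = """start 3avb
1 5.64 8.90 6.95 2 8 16 25
2 6.95 8.55 5.34 8 16 25 33
3 5.34 4.81 5.60 16 25 33 41
4 5.60 5.80 6.23 25 33 41 49
5 6.23 9.01 5.38 33 41 49 57
"""
entries = parse_database(example)["3avb"]
print(len(entries), entries[1])
```

The second entry carries the atom numbers 8, 16, 25, and 33, which are the template atoms reported in the match row of Subheading 3.1.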
2. Special steps have to be added when dealing with the preparation of head-to-tail or similarly cyclized peptides:
There are three steps to take. (1) Modification of the
AMBER force field “leaprc.ff14SB” parameter file to eliminate
the mapping of terminal residues, which allows the cyclic bond
to be formed. First, a copy of the file, which should be found in
“$AMBERHOME/dat/leap/cmd/,” must be saved into the current
working directory under a new name, e.g., “leaprc.cPep.” Then,
the section that maps terminal residues to their N- or
C-terminal variants must be removed from the copy. It
looks as follows:
addPdbResMap {
{ 0 "HYP" "NHYP" } { 1 "HYP" "CHYP" }
{ 0 "ALA" "NALA" } { 1 "ALA" "CALA" }
{ 0 "ARG" "NARG" } { 1 "ARG" "CARG" }
{ 0 "ASN" "NASN" } { 1 "ASN" "CASN" }
{ 0 "ASP" "NASP" } { 1 "ASP" "CASP" }
{ 0 "CYS" "NCYS" } { 1 "CYS" "CCYS" }
{ 0 "CYX" "NCYX" } { 1 "CYX" "CCYX" }
{ 0 "GLN" "NGLN" } { 1 "GLN" "CGLN" }
{ 0 "GLU" "NGLU" } { 1 "GLU" "CGLU" }
{ 0 "GLY" "NGLY" } { 1 "GLY" "CGLY" }
{ 0 "HID" "NHID" } { 1 "HID" "CHID" }
{ 0 "HIE" "NHIE" } { 1 "HIE" "CHIE" }
{ 0 "HIP" "NHIP" } { 1 "HIP" "CHIP" }
{ 0 "ILE" "NILE" } { 1 "ILE" "CILE" }
{ 0 "LEU" "NLEU" } { 1 "LEU" "CLEU" }
{ 0 "LYS" "NLYS" } { 1 "LYS" "CLYS" }
{ 0 "MET" "NMET" } { 1 "MET" "CMET" }
{ 0 "PHE" "NPHE" } { 1 "PHE" "CPHE" }
{ 0 "PRO" "NPRO" } { 1 "PRO" "CPRO" }
}
(2) Manual editing of the input PDB file: removal of the
terminal OXT atom and addition, at the end of the file, of a
“CONECT” record linking the C-terminal carbonyl carbon to
the N-terminal nitrogen.
(3) Subsequently, the standard tleap preparation protocol
is performed using the modified PDB file as input and sourcing
the modified “leaprc.cPep” parameter file.
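Step (2) can also be scripted instead of edited by hand. The Python sketch below assumes that the input contains only the cyclic peptide as a single chain in standard fixed-column PDB format. It does not handle extra N-terminal hydrogens or alternate locations, and the function name cyclize_pdb is hypothetical.

```python
from typing import List


def cyclize_pdb(lines: List[str]) -> List[str]:
    """Drop the OXT atom and append a CONECT record that links the C atom
    of the last residue to the N atom of the first residue."""
    atoms = [l for l in lines if l.startswith(("ATOM", "HETATM"))]
    atoms = [l for l in atoms if l[12:16].strip() != "OXT"]
    first_res = atoms[0][22:26]    # residue number field of the first residue
    last_res = atoms[-1][22:26]    # residue number field of the last residue
    n_serial = next(int(l[6:11]) for l in atoms
                    if l[22:26] == first_res and l[12:16].strip() == "N")
    c_serial = next(int(l[6:11]) for l in atoms
                    if l[22:26] == last_res and l[12:16].strip() == "C")
    out = [l if l.endswith("\n") else l + "\n" for l in atoms]
    # CONECT record: atom serial numbers in two 5-character fields.
    out.append(f"CONECT{c_serial:5d}{n_serial:5d}\n")
    return out


# Hypothetical usage; file names are placeholders.
# with open("peptide.pdb") as handle:
#     edited = cyclize_pdb(handle.readlines())
# with open("peptide_cyclic.pdb", "w") as handle:
#     handle.writelines(edited)
```

The edited file should still be inspected before it is passed to tleap, since only ATOM and HETATM records are retained and the original atom serial numbers are left unchanged.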
Acknowledgments
This research was conducted within the Max Planck School Matter
to Life supported by the German Federal Ministry of Education
and Research (BMBF) in collaboration with the Max Planck Society. We also acknowledge the Leibniz Supercomputing Centre
(LRZ) for providing supercomputer resources through grant
pr27za.
References
1. Fontaine F, Overman J, François M (2015) Pharmacological manipulation of transcription factor protein-protein interactions: opportunities and obstacles. Cell Regen 4:2
2. Bahadur RP, Zacharias M (2008) The interface of protein-protein complexes: analysis of contacts and prediction of interactions. Cell Mol Life Sci 65:1059–1072. https://doi.org/10.1007/s00018-007-7451-x
3. Murray JK, Gellman SH (2007) Targeting protein–protein interactions: lessons from p53/MDM2. Biopolymers 88:657–686. https://doi.org/10.1002/bip.20741
4. Corbi-Verge C, Kim PM (2016) Motif mediated protein-protein interactions as drug targets. Cell Commun Signal 14:8
5. Conte LL, Chothia C, Janin J (1999) The atomic structure of protein-protein recognition sites. J Mol Biol 285:2177–2198. https://doi.org/10.1006/jmbi.1998.2439
6. Keskin O, Ma B, Nussinov R (2005) Hot regions in protein-protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol 345:1281–1294. https://doi.org/10.1016/j.jmb.2004.10.077
7. Metz A, Pfleger C, Kopitz H et al (2012) Hot spots and transient pockets: predicting the determinants of small-molecule binding to a protein-protein interface. J Chem Inf Model 52:120–133. https://doi.org/10.1021/ci200322s
8. Wells JA, McClendon CL (2007) Reaching for high-hanging fruit in drug discovery at protein-protein interfaces. Nature 450:1001–1009. https://doi.org/10.1038/nature06526
9. Arkin MR, Tang Y, Wells JA (2014) Small-molecule inhibitors of protein-protein interactions: progressing toward the reality. Chem Biol 21:1102–1114. https://doi.org/10.1016/j.chembiol.2014.09.001
10. Qiu Y, Li X, He X et al (2020) Computational methods-guided design of modulators targeting protein-protein interactions (PPIs). Eur J Med Chem 207:112764. https://doi.org/10.1016/j.ejmech.2020.112764
11. Scott DE, Bayly AR, Abell C, Skidmore J (2016) Small molecules, big targets: drug discovery faces the protein-protein interaction challenge. Nat Rev Drug Discov 15:533–550. https://doi.org/10.1038/nrd.2016.29
12. Andrei SA, de Vink P, Sijbesma E et al (2018) Rationally designed semisynthetic natural product analogues for stabilization of 14-3-3 protein-protein interactions. Angew Chemie 130:13658–13662. https://doi.org/10.1002/ange.201806584
13. Santini BL, Zacharias M (2020) Rapid in silico design of potential cyclic peptide binders targeting protein-protein interfaces. Front Chem 8:2134. https://doi.org/10.3389/fchem.2020.573259
14. Duffy FJ, Devocelle M, Shields DC (2015) Computational approaches to developing short cyclic peptide modulators of protein–protein interactions. Methods Mol Biol 1268:241–271. https://doi.org/10.1007/978-1-4939-2285-7_11
15. Case DA, Belfon K, Ben-Shalom IY, Brozell SR, Cerutti DS, Cheatham TE III, Cruzeiro VWD, Darden TA, Duke RE, Giambasu G, Gilson MK, Gohlke H, Goetz AW, Harris R, Izadi S, Izmailov SA, Kasavajhala K, Kovalenko A, Krasny R, York DM, Kollman PA (2018) AMBER 2018. University of California, San Francisco
16. Maier JA, Martinez C, Kasavajhala K et al (2015) ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J Chem Theory Comput 11:3696–3713. https://doi.org/10.1021/acs.jctc.5b00255
17. Jorgensen WL, Chandrasekhar J, Madura JD et al (1983) Comparison of simple potential functions for simulating liquid water. J Chem Phys 79:926–935. https://doi.org/10.1063/1.445869
18. Ryckaert JP, Ciccotti G, Berendsen HJC (1977) Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J Comput Phys 23:327–341. https://doi.org/10.1016/0021-9991(77)90098-5
19. Wang C, Greene D, Xiao L et al (2018) Recent developments and applications of the MMPBSA method. Front Mol Biosci 4:201–215
20. Maenaka K, Jones Y (1999) MHC superfamily structure and the immune system. Curr Opin Struct Biol 9:745–753
21. Montealegre S, Venugopalan V, Fritzsche S et al (2015) Dissociation of β2-microglobulin determines the surface quality control of major histocompatibility complex class I molecules. FASEB J 29:2780–2788. https://doi.org/10.1096/fj.14-268094
22. Gohlke H, Kiel C, Case DA (2003) Insights into protein-protein binding by binding free energy calculation and free energy decomposition for the Ras-Raf and Ras-RalGDS complexes. J Mol Biol 330:891–913. https://doi.org/10.1016/S0022-2836(03)00610-7
Chapter 13
Structural Prediction of Peptide–MHC Binding Modes
Marta A. S. Perez, Michel A. Cuendet, Ute F. Röhrig, Olivier Michielin,
and Vincent Zoete
Abstract
The immune system is constantly protecting its host from the invasion of pathogens and the development of
cancer cells. The specific CD8+ T-cell immune response against virus-infected cells and tumor cells is based
on the T-cell receptor recognition of antigenic peptides bound to class I major histocompatibility complexes (MHC) at the surface of antigen presenting cells. Consequently, the peptide binding specificities of
the highly polymorphic MHC have important implications for the design of vaccines, for the treatment of
autoimmune diseases, and for personalized cancer immunotherapy. Evidence-based machine-learning
approaches have been successfully used for the prediction of peptide binders and are currently being
developed for the prediction of peptide immunogenicity. However, understanding and modeling the
structural details of peptide/MHC binding is crucial for a better understanding of the molecular mechanisms triggering the immunological processes, estimating peptide/MHC affinity using universal physics-based approaches, and driving the design of novel peptide ligands. Unfortunately, due to the large diversity
of MHC allotypes and possible peptides, the growing number of 3D structures of peptide/MHC (pMHC)
complexes in the Protein Data Bank only covers a small fraction of the possibilities. Consequently, there is a
growing need for rapid and efficient approaches to predict 3D structures of pMHC complexes. Here, we
review the key characteristics of the 3D structure of pMHC complexes before listing databases and other
sources of information on pMHC structures and MHC specificities. Finally, we discuss some of the most
prominent pMHC docking software.
Key words Immune system, Major histocompatibility complex, T-cell receptor, Peptide antigen,
Peptide docking, Docking algorithms, Molecular mechanics, Ligand binding, Databases
1
Introduction
The immune system is constantly defending the host against the
invasion of a wide range of infectious pathogens such as viruses,
bacteria, and fungi, but also against the emergence of cancer cells.
Several groups of molecules and cells are in charge of fighting
infections and maintaining a healthy organism. The cellular
Marta A.S. Perez and Michel A. Cuendet contributed equally to this work.
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_13,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
immune response is based on the natural proteolytic degradation of
proteins within cells, producing peptides that are subsequently
displayed at the cell surface by the major histocompatibility complex (MHC) molecule [1, 2]. Foreign peptides arising from virus
infection or abnormal peptides originating from a malignant transformation can be recognized in complex with an MHC (i.e., a
pMHC complex) by a T-cell receptor (TCR). This recognition
constitutes a key step in the regulation of T-cell activation and
further immune responses against virus-infected cells or tumor
cells. There are two major classes of MHC molecules. MHC class
I molecules are produced by almost all cells in the body, while
MHC class II are found in antigen presenting cells of the immune
system such as dendritic cells [3, 4].
With the rise of immunotherapy treatments against cancer [5–12]
and the objective to address autoimmune diseases [13], many
approaches, most of them based on sequence data and machine-learning algorithms, have been developed to predict peptide ligands
of different MHC allotypes or the immunogenicity of pMHC
complexes [14, 15]. Simultaneously, molecular modeling
approaches able to predict the binding of a peptide to a given
MHC molecule and to determine the corresponding binding
mode have been reemerging as a subject of strong interest
[16]. Indeed, this information could provide additional insights
into the mechanism of peptide binding to MHC, open the door to
structure-based estimation of peptide/MHC affinity [17], uncover
unknown structural drivers of T-cell activation, and guide the
design of novel peptide ligands [18–21]. More than 600 pMHC
class I three-dimensional (3D) structures are currently available in
the Protein Data Bank. However, this data covers only some tens of
the 19,000 known human MHC alleles [18] and a tiny fraction of
the peptides that can be produced from the human proteome via
proteasomal processing [19]. Despite important progress, experimental methods for protein–ligand structure determination are too
time-consuming and expensive to address the missing information
or to be routinely used in the context of personalized immunotherapy of cancer, for instance. Consequently, there is a need for rapid
and efficient approaches to predict 3D structures of pMHC complexes [17]. However, many challenges have to be overcome to
computationally dock a peptide into an MHC molecule, due to the
size and the flexibility of the peptide ligands. To address these
challenges, many suitable approaches have been developed based
on the structural particularities of the pMHC complex, as well as
available experimental data. Here, after summarizing the structural
characteristics of pMHC class I and II complexes, we briefly review
the numerous sources of information regarding the TCRpMHC
system in general and the pMHC complex in particular, which can
be of interest for developing and evaluating a pMHC docking
software. Finally, we review some of these docking approaches,
focusing mainly on docking to MHC class I.
2
The Structure of the pMHC Complex
2.1 The pMHC Class I
Structure
MHC class I molecules are molecular heterodimer complexes composed of a ~44 kDa membrane-anchored polymorphic heavy chain
and a 12 kDa invariant soluble β2-microglobulin (β2m) (Fig. 1)
[1, 2, 20]. From the N- to the C-terminus, the heavy chain comprises
three extracellular domains, α1, α2 and α3, a transmembrane segment, and a cytoplasmic tail. β2m is noncovalently linked to the α3
domain of the heavy chain. Both β2m and α3 show an
immunoglobulin-like fold. The α3 domain also interacts with the
CD8 co-receptor of T-cells when present [21, 24]. Although β2m
and α3 are not in direct contact with the peptide, β2m interacts with
α1 and α2, stabilizing the heavy chain and enhancing peptide binding [25–27].
Importantly, the α1 and α2 domains of the heavy chain together
form a β-sheet of eight β-strands, four contributed by α1
and four by α2. The β-sheet plane is roofed by two helices, one
originating from α1 and the second one from α2. These two
α-helices overhanging the β-sheet floor form a polymorphic
peptide-binding groove (Fig. 2a). The most variable residues
among the different allotypes point inside this groove as well as in
the direction of the TCR, conferring unique peptide and
TCR-binding selectivity to each MHC molecule. Most of the variable residues are situated in the center of the cleft. In contrast,
the most conserved residues are present at both ends of the groove,
where they form walls that determine the length of the peptide
(generally restricted to 8 to 11 residues, with a majority of
9-mers) and the position of its N- and C-termini (Fig. 2b). Notably, these conserved residues hold the peptide in place by delimiting
the ends of the binding groove and by fixing the peptide’s N- and
C-termini through a conserved network of hydrogen bonds
(Fig. 3). Of note, some exceptions to this rule can be
found [28, 29].
Characteristically for MHC class I molecules, the polymorphic
residues in the peptide-binding groove generate pockets that can
strongly accommodate preferred amino acid side chains of the
peptide. The nature of the preferred peptide residues depends on
the MHC sequence. The interactions between the peptide and
these MHC pockets anchor the peptide in the peptide-binding
cleft of the MHC. Six major pockets, labeled A to F, can be
identified in the MHC groove (Fig. 4) [30]. Two deep pockets
are particularly important. The first one is pocket B, which generally—but not always—holds the second residue of the peptide. The
second one is pocket F, which accommodates the side chain of the
C-terminal residue of the peptide. The peptide residues in these
positions, called the anchor residues, play an important role in
peptide/MHC interaction. Secondary anchor residues, generally
Fig. 1 Experimental structure of an MHC HLA-A*02:01 in complex with the
nonapeptide ALGIGILTV (PDB ID 1JHT [22]). The MHC heavy chain is colored in
light brown and the β2m molecule in orange. Secondary structure elements are
shown in ribbon representation, while the solvent accessible surface is displayed
in transparent, colored according to the underlying protein. The peptide is shown
in ball and stick representation, colored according to the atom types. (All figures
were generated with UCSF Chimera [23])
located in the center of the peptide, bind more weakly to other
pockets of the MHC and play a less important role for the recognition strength of a given peptide by a given MHC, but they can
substantially influence the conformation of the peptide in the MHC
groove.
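The anchoring rules described above can be illustrated with a minimal Python sketch. It extracts the two primary anchor residues of a class I peptide (position 2, generally held by pocket B, and the C-terminal residue, held by pocket F) and scans a protein sequence for 8- to 11-mers compatible with a motif. The function names and the EXAMPLE_MOTIF table are assumptions introduced for illustration only; the motif is a deliberately simplified stand-in, not a validated specificity profile of any allotype.

```python
# Illustrative sketch only: all names and the motif below are hypothetical.

CLASS_I_LENGTHS = range(8, 12)  # 8 to 11 residues, as stated in the text

# Simplified, assumed anchor preferences for one allotype (illustration only).
EXAMPLE_MOTIF = {"P2": set("LM"), "C_TERM": set("VLI")}


def anchor_residues(peptide):
    """Return the primary anchor residues: position 2 and the C-terminus."""
    if len(peptide) not in CLASS_I_LENGTHS:
        raise ValueError("class I peptides are generally 8 to 11 residues long")
    return {"P2": peptide[1], "C_TERM": peptide[-1]}


def matches_motif(peptide, motif=EXAMPLE_MOTIF):
    """True if every anchor residue is among the residues allowed by the motif."""
    anchors = anchor_residues(peptide)
    return all(anchors[pos] in allowed for pos, allowed in motif.items())


def candidate_peptides(protein, motif=EXAMPLE_MOTIF):
    """List (1-based start, peptide) for all 8- to 11-mers matching the motif."""
    hits = []
    for length in CLASS_I_LENGTHS:
        for start in range(len(protein) - length + 1):
            window = protein[start:start + length]
            if matches_motif(window, motif):
                hits.append((start + 1, window))
    return hits
```

Applied to the nonapeptide of Fig. 1, anchor_residues("ALGIGILTV") returns L at position 2 and V at the C-terminus. Such a filter ignores secondary anchors and the exceptions mentioned above, so it can serve at most as a coarse preselection step.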
The complementarity between peptide anchor residues and
MHC anchoring pockets, together with the more limited role of
secondary anchor residues, is critical for the specificity of MHC
class I. The pockets generated by the polymorphic residues in the
MHC groove complement a small number of specific amino acids at
given positions in the binding peptide, thus determining which
peptides can bind to a specific MHC allotype. In summary, peptides
bound to MHC class I proteins have allele-specific sequence motifs
characterized by strong preferences for a few amino acids at given
positions of the peptide and large permissiveness in the other positions [31, 32]. Of note, an analysis of experimentally determined
MHC class I-presented 9-mers showed that they exhibit a localization bias to helical fragments in the source proteins, which could be
explained by the fact that, prior to loading on the HLA complexes,
the peptides must be cleaved or trimmed at the N- and C-termini to
Fig. 2 (a) The β-sheet forming the floor of the peptide binding cleft is composed of four β-strands from domain
α1 and four β-strands from domain α2. Two helices, one from α1 and the other one from α2, are flanking the
peptide binding cleft. (b) Surface of the MHC class I molecule, showing the walls that limit the extension of the
peptide within the MHC cleft. (Figure made using the experimental structure PDB ID 1JHT [22])
Fig. 3 Hydrogen bond network between the N- and C-termini of the peptide and
conserved residues of the MHC. Hydrogen bonds are displayed as green dotted
lines. (a) Overall view, (b) zoom on the N-terminus, and (c) zoom on the
C-terminus of the peptide. (Figure made using the experimental structure PDB
ID 1JHT [22])
Fig. 4 Major binding pockets of the MHC, labeled from A to F and displayed in different colors. Figure made
using the experimental structure PDB ID 1JHT [22]
be available, but at the same time must be stable enough to be
displayed by MHC. Therefore, the higher resistance of helices to
proteolysis could explain the higher frequency of helical regions
among HLA-I binding molecules [33].
As mentioned above, peptides bound to MHC class I molecules are generally anchored in the B and F pockets of the MHC,
while their N- and C-termini residues are fixed through numerous
hydrogen bonds with conserved MHC residues. Most of the conformational diversity of the peptide binding mode thus results from
the central residues, which can protrude from the surface and
potentially interact with TCRs. Given that the extremities of the
peptide are fixed in the MHC, the longer the peptide, the more
bulged its central conformation is [34]. Analysis of pMHC class I
experimental structures suggests that the peptide backbone shows a
conserved conformation despite the diversity of amino acid
sequences [35, 36].
The MHC class I groove also shows a limited plasticity upon
binding diverse peptides [37]. Fagerberg et al. [38] compared
21 experimental structures of pMHC HLA-A*02:01, spanning
different peptides. They found an average root mean square deviation of only 0.4 and 0.9 Å for the backbone and all heavy atoms of
the MHC residues, respectively, reflecting the inherent flexibility
and crystal packing effect of proteins independently of the nature of
the bound peptide. A given MHC class I thus shows only limited
induced fit upon binding of different peptide amino acid sequences.
Two exceptions were, however, observed in the binding groove:
Fig. 5 Superimposition of 4 pMHC HLA-A*02:01 experimental 3D structures with resolutions ranging from
1.30 to 1.60 Å: 3D25 [39], 3MRG [40], 5C0G [41], and 2V2X [42]. Some conformationally variable residues are
shown in stick representation
the β-sheet residues Arg97 and Tyr116, which showed average
global displacements of 1.2 Å and 1.6 Å of their heavy atoms,
respectively. Of note, due to their spatial proximity, the conformations of these two residues are correlated and exhibit only two
possible combinations depending on the peptide sequence and
properties. Importantly, other residues exhibit conformational variations as a function of the crystal structure: Glu19, Glu58, Arg65,
Arg75, Lys146, Glu154, and Gln155 (Fig. 5). However, due to
their position in the MHC, their flexibility is more likely to impact
TCR binding than the peptide epitope position. Despite these
restrictions, the accessible conformational space of pMHC class I
remains large owing to the length of the peptide and its number of
degrees of freedom (DoF), making structural predictions of
binding modes particularly challenging, as for any peptide docking.
However, the abovementioned conformational rules can be used to
design efficient sampling strategies for the docking of peptides in
the binding groove of an MHC molecule.
2.2 The pMHC Class
II Structure
MHC class II molecules are also heterodimers consisting of two
noncovalently associated polymorphic subunits: the 34 kDa α chain
and the 29 kDa β chain (Fig. 6). Both α and β chains are anchored
to the membrane and contain two extracellular domains (α1 and α2,
as well as β1 and β2, respectively) and an intracellular domain. The
structure of the MHC class II molecules is very similar to that of the
Fig. 6 MHC class II, HLA-DRA1 and HLA-DRB1, in complex with the 15-mer alpha-enolase peptide 326–340;
PDB ID 5NI9 [43] (a) Extracellular domains α1 and α2 (in rosy brown), as well as β1 and β2 (in brown).
Secondary structure elements are shown in ribbon representation, while the solvent accessible surface is
displayed in transparent, colored according to the underlying protein. The alpha-enolase peptide is displayed
in ball and stick representation, colored according to the atom types. (b) Surface of the MHC class II molecule
showing the absence of walls delimiting the peptide binding groove, which allows the accommodation of a
large peptide extending from the N- and C-termini
MHC class I molecules [3, 36, 41, 42], with the α1 and β1 domains
forming a peptide-binding groove following the model of the α1
and α2 domains of MHC class I (Fig. 6b). The other domains,
α2 and β2, play a structural role similar to that of α3 and β2m of
MHC class I, respectively (Fig. 6a). The CD4 co-receptor of
T-cells, when present, interacts with both the β2 and α2 domains
of MHC class II [44]. While the α1 and β1 domains of MHC class
II, which form the peptide binding groove, are highly polymorphic, the α2 and β2 segments are highly conserved among the
allotypes of a particular class II gene. The peptide-binding site of
MHC II molecules is formed by the N-terminal α1 and β1 domains
and, similarly to MHC class I, is composed of an 8-stranded β-sheet
defining the floor of the binding site and two helices creating its
borders. The most striking difference between the MHC class I and II
binding sites is the presence of binding groove walls in MHC class I, which
limit the size of the peptides that MHC class I can bind, and which
are absent in MHC class II. Consequently, MHC class II can
accommodate much longer peptides, composed of 13 to
25 amino acids [45], which extend at both the N- and the
C-termini compared to class I binding peptides [34, 46]. As a
consequence, due to the larger number of degrees of freedom of
the peptides in the binding groove, docking into MHC class II is
generally much more challenging than docking into MHC class I
[17]. In contrast to the MHC class I groove, which has only two
major anchoring pockets, the groove of MHC class II can present
three or four major anchoring pockets to accommodate the primary peptide anchor residues.
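The differences between the two classes that matter most for docking can be collected in a small data structure. The following Python sketch records, for each class, the domains forming the groove, whether the groove ends are closed by walls, the typical peptide length range, and the number of major anchoring pockets, all taken from the description above. The class name GrooveProfile and the helper compatible_classes are hypothetical and introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrooveProfile:
    """Coarse summary of a peptide-binding groove (illustrative only)."""
    groove_domains: tuple        # domains forming the groove
    closed_ends: bool            # walls delimiting both groove ends
    min_len: int                 # typical shortest peptide (residues)
    max_len: int                 # typical longest peptide (residues)
    major_anchor_pockets: tuple  # possible numbers of major anchoring pockets


CLASS_I = GrooveProfile(("α1", "α2"), True, 8, 11, (2,))
CLASS_II = GrooveProfile(("α1", "β1"), False, 13, 25, (3, 4))


def compatible_classes(peptide):
    """Return the MHC classes whose typical length range fits the peptide."""
    profiles = (("I", CLASS_I), ("II", CLASS_II))
    return [name for name, p in profiles
            if p.min_len <= len(peptide) <= p.max_len]
```

The length ranges are typical values rather than strict limits, so such a check can only flag peptides of unusual length; it does not predict binding.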
3
Databases and Resources
The efficiency of a computational docking method is frequently
assessed based on its ability to reproduce the native bound conformation as determined experimentally with X-ray crystallography.
Therefore, structural databases of pMHC complexes hold important information to benchmark docking methods. Databases containing sequence-based information on pMHC binding are also
extremely relevant as i) docking methods are able to predict a
pMHC structure by using only the peptide sequence as an input
and ii) the output of a docking method suitable to model thousands
of pMHC complexes can be used as a basis to discriminate between
peptide binders and nonbinders, for example, through a scoring
function. In this context, sequence-based tools to qualitatively or
quantitatively predict pMHC binding affinities can be used as a
source of comparison to assess a docking tool. Beyond pMHC
binding, the most relevant endpoint for many applications is the
ability of antigenic peptides to elicit a T-cell response. To this end,
structural data on TCRpMHC complexes and other sources of
information on TCR and pMHC interactions provide essential
input. The points outlined above illustrate the key importance of
the availability of information on pMHC and TCRpMHC to guide
further research in the field. Therefore, we describe below available
resources such as datasets, databases, and software tools, giving an
overview of relevant material to benchmark or develop new docking software and pMHC prediction methods (Table 1).
3.1 Resources
Related to pMHC (and
TCRpMHC) Structures
The Protein Data Bank (PDB [47, 48]; https://www.ebi.ac.uk/pdbe [70–73]) is the worldwide archive of structural data of
biological macromolecules. Established in 1971, it contains today
more than 1092 structures of pMHC class I and more than
212 structures of pMHC class II (as of 25.11.2020). Approximately 650 of these structures are peptide-HLA and more than
one-third of these relate to the same MHC allotype
(HLA-A*02:01). PDB data are freely and publicly available for
download without restrictions. Each entry contains summary information about the structure and experiment, atomic coordinates,
and in most cases a reference to a corresponding scientific publication. Individually and in bulk, PDB structures can be downloaded
and/or analyzed and visualized online, for example, using tools at
PDBe. However, creating datasets from the PDB to benchmark
docking tools requires data curation and analysis, for instance, to
remove low-resolution structures, mutated proteins, and proteins
with missing residues, or to restrict the analysis to human molecules
or to a particular allotype, to name just a few. To address this need,
several databases provide curated and analyzed sets of pMHC
structures (and TCRpMHC structures) from the PDB.

Table 1
Resources (databases, datasets, and some software tools) available for an efficient benchmark/development of pMHC docking tools

| Category | Type | Resource | Description |
| --- | --- | --- | --- |
| Resources for pMHC (and TCR-pMHC) structures | Databases | PDB [47, 48] | Single worldwide archive of structural data of biological macromolecules |
| | | CrossTope [49] | Curated repository of 3D structures of peptide–MHC class I |
| | | MPID-T [50] | Manually curated repository of pMHC and TCR-p-MHC |
| | | MPID-T2 [51] | Repository of pMHC and TCR-p-MHC with emphasis on structural characterization |
| | | IMGT/3Dstructure-DB [52] | Repository of 3D structures annotated according to the IMGT-ONTOLOGY |
| | | TCR3D [53] | TCRpMHC structures reporting main axes and angles in the complex |
| | | STCRDab [54] | Annotated TCRpMHC from PDB |
| | | ATLAS [55] | Wild-type and mutant pMHC together with measured affinities |
| Resources for pMHC sequences | Datasets | Calis et al. [56] | Immunogenic and nonimmunogenic peptides |
| | | Chowell et al. [57] | Immunogenic and nonimmunogenic peptides |
| | Database | IEDB [58] | A compendium of T and B cell epitope assays |
| | Tools | IEDB-analysis resource [59] | Provides a substantial set of updated and novel features for epitope prediction and analysis |
| | | NetMHC [60, 61], NetMHCpan [62], NetMHCcons [63], MHCflurry [64], MHCSeqNet [65], MHCAttnNet [66] | Examples of predictive tools for peptide–MHC qualitative and quantitative binding affinities |
| Resources for MHC sequences | Database | IPD-MHC [67] | Centralized repository for curated MHC sequences of different species |
| | | IPD-IMGT/HLA [18] | Specialist database for sequences of the human MHC |
| Other resources with information on pMHC and TCR | Database | VDJdb [68] | TCRs with known antigen specificity |
| | | McPAS-TCR [69] | TCRs with known antigen, pathogen, and pathology association |
| | | 10x Genomics | Linking highly multiplexed antigen recognition to immune repertoire and phenotype |
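The curation criteria named in the text for building a PDB-derived benchmark set (resolution, mutations, missing residues, organism, and allotype) amount to a simple record filter. The Python sketch below is one possible way to express them. The Entry fields, the function name curate, the 2.5 Å default cutoff, and the demo records are all assumptions made for illustration; the records are invented placeholders, not real PDB entries.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Minimal, hypothetical description of one pMHC structure."""
    entry_id: str
    resolution: float      # in Å
    organism: str
    allotype: str
    has_mutations: bool
    missing_residues: int


def curate(entries, max_resolution=2.5, organism="Homo sapiens", allotype=None):
    """Keep entries that pass every curation criterion."""
    kept = []
    for e in entries:
        if e.resolution > max_resolution:
            continue  # low-resolution structure
        if e.has_mutations or e.missing_residues > 0:
            continue  # mutated protein or incomplete model
        if organism is not None and e.organism != organism:
            continue  # restrict to one species
        if allotype is not None and e.allotype != allotype:
            continue  # restrict to one allotype
        kept.append(e)
    return kept


# Invented placeholder records, for demonstration only.
DEMO = [
    Entry("demo1", 1.8, "Homo sapiens", "HLA-A*02:01", False, 0),
    Entry("demo2", 3.2, "Homo sapiens", "HLA-A*02:01", False, 0),
    Entry("demo3", 1.9, "Mus musculus", "H-2Kb", False, 0),
    Entry("demo4", 2.0, "Homo sapiens", "HLA-B*27:05", True, 0),
    Entry("demo5", 2.1, "Homo sapiens", "HLA-B*07:02", False, 0),
]

human_set = curate(DEMO)
a2_set = curate(DEMO, allotype="HLA-A*02:01")
```

In practice, the metadata would have to be extracted from the structure files or from one of the curated databases described below, and the thresholds adapted to the purpose of the benchmark.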
CrossTope [49] (http://crosstope.com) is a highly curated
repository of 3D structures of peptide-MHC class I. The complexes
hosted in this database were obtained from the PDB and from in
silico modeling. The database contains 182 nonredundant complexes from two human and two murine alleles. From the CrossTope web server, the user can download pMHC class I coordinate
files as well as topological and charge distribution map images from
their T-cell receptor-interacting surface.
The peptide/MHC interaction database version T, MPID-T
[50], is a manually curated database containing experimentally
determined structures of 187 pMHC complexes and
16 TCRpMHC complexes taken from the PDB. Each structure is
manually verified, classified, and analyzed for intermolecular interactions (a) between the MHC and its corresponding bound peptide
and (b) between the TCR and its bound pMHC complex, when
TCR structural information is available. The MPID-T database
retrieval system has precomputed interaction parameters that
include solvent accessibility, hydrogen bonds, gap volume, and
gap index.
The MHC–peptide interaction database-T version 2, MPID-T2 [51], contains pMHC and TCRpMHC complexes with emphasis on structural characterization. As of November 2020, MPID-T2
contains 415 entries from five MHC sources (282 human,
127 murine, 3 rat, 2 chicken, and 1 monkey), spanning 56 alleles.
MPID-T2 covers 353 pMHC and 62 TCRpMHC structures.
Overall, 327 entries are nonredundant (279 MHC class I and
48 MHC class II). Nonclassical structures and complexes with
nonstandard residues are also included in this version.
Of note, and as far as we know, CrossTope, MPID-T, and
MPID-T2 only include structural data available before their
publication date.
IMGT/3Dstructure-DB [52] contains information on the
sequences and 3D structures of TCR, pMHC, and related proteins
of the immune system from human and other vertebrate species.
Experimental 3D data are taken from the PDB and expertly annotated information is provided according to the IMGT criteria, using
IMGT/DomainGapAlign, and based on the IMGT-ONTOLOGY
concepts and axioms. IMGT/3Dstructure-DB provides standardized identification (IMGT keywords), a standardized nomenclature (IMGT gene and allele names), a standardized description
(IMGT labels), and a standardized numbering (IMGT unique
numbering).
The T-cell receptor structural repertoire database, TCR3D
[53], is a comprehensive, curated collection of T-cell receptor
structures from the PDB, analyzed for structure, sequence, and
antigen recognition, as well as TCR germline gene sequences
from http://www.imgt.org and TCR sequencing data from various
studies. Users can interactively view TCR structures, search
sequences of interest against known structures and sequences, and
download curated datasets of structurally characterized TCR. This
database is updated on a weekly basis and can serve as a centralized
resource for the community studying T-cell receptors and their
recognition. Users can download a curated dataset of more than
167 nonredundant TCRpMHC class I and more than 58 nonredundant TCRpMHC class II complexes. Through the database,
users can also access updated information regarding the PDB code,
description, release date, and resolution of MHC class I and II
structures.
The Structural T-cell Receptor Database, STCRDab [54], is an
online resource that automatically collects and curates TCR structural data from the Protein Data Bank. For each entry, the database
provides annotations, such as the α/β or γ/δ chain pairings, MHC
details, and, when available, antigen binding affinities. In addition,
the orientation between the variable domains and the canonical
forms of the complementarity-determining region loops is also
provided. Users can select, view, and download individual or bulk
sets of structures based on these criteria. When available, STCRDab
also finds antibody structures that are similar to TCRs, helping
users to explore the relationship between TCRs and antibodies.
STCRDab is linked with TCRBuilder [74], a structural TCR modeling tool that returns a model or an ensemble of models covering
the potential conformations of the binding site from a paired
αβTCR sequence.
The Altered TCR Ligand Affinities and Structures database,
ATLAS [55], is a manually curated repository containing the binding affinities for wild-type and mutant TCR and their antigens,
peptide-MHC. The database links experimentally measured binding affinities with the corresponding 3D structures for
TCR-pMHC complexes. ATLAS contains a dataset of TCRpMHC
structures with the following curations: renaming of chains, truncation of chains to binding interface, and removal of water molecules. For ATLAS entries lacking full experimental 3D structures,
models were generated from template structures using the Rosetta
protein modeling suite [75]. The latest update of the website was
done in 2017 [55].
Of note, TCRpMHC structures cannot be directly used for
comparison of the docking poses of the pMHC alone as the latter
can adopt different conformations when bound or not to the TCR
[76, 77].
3.2 Resources
Related to pMHC
Sequences
The Immune Epitope Database [58] (IEDB) is an up-to-date
resource that captures experiments which identify and characterize
epitopes and epitope-specific immune receptors along with various
other details such as host organism, immune exposures, and
induced immune responses. Note that, while most of the components of IEDB can be found separately in other resources, no other
database contains them all.
A companion site, IEDB-Analysis Resource (IEDB-AR), provides a substantial set of updated and novel features for epitope
prediction and analysis [59]. New epitope prediction and analysis
tools are regularly added in the IEDB-AR with features useful to
advance epitope-based therapeutics and vaccine development.
IEDB-AR includes, among others, a tool to predict peptides that
are naturally processed by the MHC class I pathway and bind to
MHC class I molecules, MHCI-NP [78], and a tool to predict
naturally processed MHC class II ligands, MHCII-NP [79]. The
tools available in IEDB-AR are summarized in Dhanda et al. [59].
The IEDB represents a huge body of knowledge regarding
which peptide epitopes are presented by which MHC molecules.
Peptidomes of various MHC molecules can be utilized to build
highly accurate predictors of MHC binding. Predictive tools for
qualitative and quantitative pMHC binding affinities include
NetMHC [60, 61], NetMHCpan [62], NetMHCcons [63],
MHCflurry [64], the IEDB tools [59], MHCSeqNet [65], and
MHCAttnNet [66]. Most of these tools use machine-learning-based
techniques, require a large amount of training data, and can
be rather weak predictors for MHC alleles for which data are
scarce. NetMHCpan [62], however, is a pan-specific artificial neural
network method trained on both binding affinity and eluted ligand data; it
leverages the information from both data types and seeks to alleviate the problem of data scarcity. NetMHCpan, MHCSeqNet [65],
and the recently published MHCAttnNet [66] aim to predict
MHC–peptide binding for unseen alleles. Docking methods for
screening MHC binding peptides can be tested using IEDB, and
their efficiency can be compared with one or several of the abovementioned prediction tools.
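One possible way to carry out such a comparison, assuming that each method assigns one score per peptide and that experimental binder/nonbinder labels are available, is to compute the area under the ROC curve (AUC) for each method on the same peptide set. The AUC is chosen here only as an illustration and is not prescribed by the tools cited above. The function name roc_auc and the lower_is_better flag are hypothetical; the flag accounts for the fact that docking scores are usually energy-like, whereas other predictors may return values for which higher means stronger binding.

```python
def roc_auc(binder_scores, nonbinder_scores, lower_is_better=True):
    """Probability that a random binder is ranked ahead of a random nonbinder.

    Ties count as one half. A value of 0.5 corresponds to random ranking
    and 1.0 to perfect separation of binders from nonbinders.
    """
    if not binder_scores or not nonbinder_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for b in binder_scores:
        for n in nonbinder_scores:
            if b == n:
                wins += 0.5
            elif (b < n) == lower_is_better:
                wins += 1.0
    return wins / (len(binder_scores) * len(nonbinder_scores))
```

The pairwise loop scales with the product of the two list sizes, which is acceptable for a few thousand peptides; a rank-based formulation would be preferable for larger sets.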
In the context of precision medicine, searching for epitopes
that are not presented by a patient’s MHCs, even if they are related
to pathogens of interest, makes little sense as it is unlikely that they can
elicit a strong immune response. In such cases, smaller datasets of
allele-specific or disease-specific peptide-MHCs can also be relevant
[56, 57].
We would also like to mention databases that offer additional
information such as the IPD-MHC [67] database that provides a
centralized repository for curated MHC sequences from a number
of different species or the IPD-IMGT/HLA database that provides
a specialized database for sequences of the human MHC [18].
4
Computational Approaches for Peptide–MHC Binding Mode Prediction
4.1 Docking
Ligand-protein docking approaches aim to computationally predict
the most probable position, orientation, and conformation of a
small drug-like molecule at the surface of a targeted protein [32–38]. Although intensively studied for decades, ligand-protein docking remains largely an unsolved problem [80–83]. Generally
speaking, docking software can be decomposed into two components: a sampling algorithm in charge of generating possible geometries of the ligand at the protein surface (i.e., the binding modes)
and a scoring function whose purpose is to rank the binding modes
according to their probability to correspond to the experimental
true binding mode (also called the native binding mode). Since the
native binding mode corresponds in principle to the one with the
lowest binding free energy for a given ligand-protein pair, the
scoring function of a docking software is often trained to achieve
two objectives: selecting the native binding mode among all possible binding modes and estimating the binding free energy of the
ligand for the target. It thus allows the comparison of different
ligands in terms of affinity and opens the door to structure-based
virtual screening and drug design.
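The decomposition into a sampling algorithm and a scoring function can be made concrete with a deliberately trivial Python sketch. Here the "ligand" is a point in a two-dimensional box, the sampler draws random positions, and the score is the squared distance to a hypothetical native site. Everything in this toy (NATIVE_SITE, sample_pose, score, dock) is an assumption introduced for illustration and bears no relation to the sampling engines or scoring functions of actual docking programs.

```python
import random

# Hypothetical native position of a point-like "ligand" in a 2D toy system.
NATIVE_SITE = (1.0, -2.0)


def sample_pose(rng):
    """Sampling step: draw a random position inside a 10 x 10 search box."""
    return (rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0))


def score(pose):
    """Scoring step: toy energy that is lowest at the native site."""
    return (pose[0] - NATIVE_SITE[0]) ** 2 + (pose[1] - NATIVE_SITE[1]) ** 2


def dock(n_poses=1000, seed=0):
    """Generate candidate poses, then rank them from best to worst score."""
    rng = random.Random(seed)
    poses = [sample_pose(rng) for _ in range(n_poses)]
    return sorted(poses, key=score)


ranked = dock()
best = ranked[0]  # predicted binding mode: the top-ranked pose
```

The two steps are independent: the sampler could be replaced by a systematic or knowledge-based search, and the toy score by a force field or an empirical function, without changing the overall loop. A prediction fails either when no sampled pose is close to the native one or when the scoring function ranks a wrong pose first.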
Two different approaches can be used to assess a docking
algorithm under development and to benchmark published docking software, namely redocking and cross-docking. Redocking consists in docking a ligand into the protein 3D structure that was
experimentally determined in complex with that same ligand. On
the contrary, in cross-docking experiments, the ligand is docked
into a protein conformation that was experimentally determined in
complex with another ligand or in its apo form. Obviously, the first
exercise is easier since the protein conformation displays the
induced fit necessary to bind the ligand of interest. Although this
exercise is very different from the typical use of a docking software,
it allows estimating different factors important for its efficiency,
such as its ability to correctly sample the conformational space of
the ligand and to find binding modes close to the native one
(knowing that the protein is in its optimal conformation).
Cross-docking is more similar to the typical usage of a docking
software, where the induced fit of the protein corresponding to a
given ligand is unknown. Successful cross-docking might necessitate the sampling of the conformational space of the protein in
addition to the one of the ligand. As such, results of cross-docking
benchmarks are generally considered more relevant to assess the
overall efficiency of docking software. The ability of a docking
algorithm to predict the binding modes of a set of ligand–protein
complexes for which the native binding mode is known thanks to
available experimental structures can be quantified by several
metrics. The most employed one remains the root mean square
deviation (RMSD) of heavy atom positions between the binding
mode calculated by the docking algorithm and the native binding
mode. The RMSD, which is related to a distance between two
ligand positions, is generally given in Å. To allow easy comparison
of docking tools, it is generally assumed that a docking run is
successful if this RMSD is lower than 2 Å [81]. The efficiency of
ligand–protein docking software decreases with increasing number
of degrees of freedom of the ligand, especially when they exceed
about 10, because of the complexity and the size of the conformational space to explore [84, 85].
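The RMSD criterion described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the predicted and native ligand coordinates are listed in the same atom order and expressed in the same receptor frame (no superposition of the ligand is performed); the function names and the toy coordinates are hypothetical.

```python
import math

def rmsd(pose, native):
    """RMSD (in Å) between two equally ordered lists of (x, y, z) coordinates."""
    if len(pose) != len(native) or not pose:
        raise ValueError("pose and native must be non-empty and of equal length")
    squared = sum(
        (px - nx) ** 2 + (py - ny) ** 2 + (pz - nz) ** 2
        for (px, py, pz), (nx, ny, nz) in zip(pose, native)
    )
    return math.sqrt(squared / len(pose))

def is_successful(pose, native, threshold=2.0):
    """Apply the usual success criterion: RMSD below the threshold (2 Å by default)."""
    return rmsd(pose, native) < threshold

# Toy example: every atom of the pose is shifted by 1 Å along y.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
pose = [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0), (3.0, 1.0, 0.0)]
print(rmsd(pose, native))           # 1.0
print(is_successful(pose, native))  # True
```

The threshold is exposed as a parameter because stricter cutoffs are often applied to backbone or Cα atoms in the peptide–MHC studies reviewed in this chapter.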
Due to the size of the peptides that bind to MHC grooves, even
in the case of class I MHC (8 to 11 residues), typical small molecule
docking codes are generally inefficient at docking such ligands.
However, as we will see below, peptide–MHC docking algorithms
are comparable to small-molecule docking programs in many
respects, including some of the sampling engines and scoring functions. The knowledge of the nature of interactions between a
peptide and an MHC protein (see section “The structure of the
pMHC complex”), notably for MHC class I, can be used to facilitate the docking of peptides, despite the fact that their number of
degrees of freedom makes them intractable by standard docking
approaches. Following the nomenclature proposed by Antunes
et al. [17], we can distinguish approaches that rely on a constrained
backbone, on constrained termini, or on incremental peptide
reconstruction. All approaches benefit from the fact that
MHC molecules exhibit little induced fit as a function of the
nature of the peptide [38] and that the overall position and N/C-terminus orientation of the peptide in the MHC groove are well
known.
Constrained backbone approaches employ sampling strategies
based on the fact that peptides with the same number of residues
and binding to the same MHC allotype show a limited number of
backbone conformations in experimental structures
[43, 47]. These approaches generally start by constructing conformations of the peptide to dock, bound or unbound to the MHC,
based on experimentally determined backbone conformations of
other peptides of the same size. Constrained termini approaches
use the preserved networks of hydrogen bonds that exist between
conserved residues of the MHC and the N- and C-termini of the
peptide to restrain the corresponding peptide atoms in these positions during the docking. In these conditions, docking a peptide
into an MHC boils down to a loop closure problem. Approaches
based on incremental peptide reconstruction try to limit the effect
of the numerous internal degrees of freedom in the peptide by
reconstructing it within the MHC groove in consecutive steps, in
a way that only a small number of these degrees of freedom are
considered at once. In the following paragraph, we describe some
of the main peptide–MHC docking approaches belonging to these
different categories, focusing on docking to MHC class I (Table 2).
Table 2
Summary of the docking approaches reviewed in this chapter

Approach: pDock [86]
Strategy: Constrained backbone
Sampling algorithm: Single docking and refinement using ICM and a Monte Carlo algorithm
Scoring approach: Internal energy of the peptide and peptide–MHC interaction energy, plus a solvation energy term. ECEPP/3 force field
Availability: N/A
Predictive ability: Tested on a nonredundant set of 186 pMHC complexes (149 of MHC class I and 37 of MHC class II). A predicted binding mode within 1.0 Å Cα RMSD for 83% of the class I and 95% of the class II complexes. Average Cα RMSD about 0.6 Å over the entire dataset

Approach: DockTope [87]
Strategy: Constrained backbone
Sampling algorithm: Generation of the peptide conformation based on a preselected template (one per MHC allotype), followed by two Autodock Vina docking rounds (with rigid MHC and rigid peptide backbone) separated by an energy minimization performed with GROMACS (flexible MHC and peptide)
Scoring approach: Autodock Vina score used to filter the calculated poses. Best result selected as the one closest in RMSD to all other calculated binding modes
Availability: Freely available as a web server [86]
Predictive ability: Tested on 135 pMHC complexes of class I, covering the 5 MHC allotypes. Average RMSD between 0.4 and 1.1 Å for the Cα atoms as a function of the MHC allotypes (from 1.7 to 2.5 Å for all-atom RMSD), with an average of 0.9 Å over all allotypes (2.0 Å for all-atom RMSD)

Approach: FlexPepDock [88]
Strategy: Constrained backbone
Sampling algorithm: FlexPepDock refinement protocol applied to a coarse-grained binding mode generated by sequence threading on backbone experimental conformation and positioning of the peptide in the MHC groove respecting the anchor residues
Scoring approach: Rosetta full-atom energy function [109]
Availability: Freely available as a web server http://piperfpd.furmanlab.cs.huji.ac.il And as a standalone program https://www.rosettacommons.org
Predictive ability: Tested on 30 experimental structures of pMHC class I complexes. When starting the docking from a peptide template bound to the same MHC allotype, 84% success rate at 1 Å backbone RMSD and 52% at 2 Å all-atom RMSD among the top-five best binding modes. When starting from a peptide template bound to a different MHC allotype, 60% success rate at 1 Å backbone RMSD and 55% success rate at 2 Å all-atom RMSD among the top-five best binding modes

Approach: GradDock [89]
Strategy: Constrained termini
Sampling algorithm: Generation of conformations for the unbound peptide based on constrained termini and loop closure algorithm, followed by binding simulation using a steered insertion of the peptide into the MHC groove
Scoring approach: Reparameterized Rosetta score
Availability: Freely available for academic purposes. Program available at [49]
Predictive ability: Tested by self-redocking on 107 nonredundant pMHC class I, covering 82 class I MHCs and 8 to 10-mer peptides, as well as on cross-docking of 70 complexes. RMSDs around 1.2 Å and 2.5 Å for backbone and all-atoms, respectively, in both self- and cross-docking tests

Approach: Park et al. [90]
Strategy: Constrained termini
Sampling algorithm: Homology modeling, followed by all-atom MD simulated annealing and MD simulation
Scoring approach: MODELLER score, or abundance of the binding mode after simulated annealing, or detection of conformational transition
Availability: N/A
Predictive ability: Tested on 17 HLA-A*02:01 pMHC complexes. Average all-atom RMSD of 1.6 Å after simulated annealing

Approach: Yanover et al. [91]
Strategy: Constrained termini
Sampling algorithm: Generation of conformations for the peptide backbone in the MHC groove based on constrained anchor residues and loop closure algorithm, followed by Monte Carlo refinement
Scoring approach: Modified Rosetta all-atom scoring function
Availability: N/A
Predictive ability: Tested on 29 MHC class I (11 HLA-A and 18 HLA-B). Docking and sequence-optimizing of thousands of peptides provided computed PFMs very close to the experimental PFMs

Approach: Bordner et al. [93]
Strategy: Constrained termini
Sampling algorithm: ICM + Monte Carlo, with restraints on the atoms of the N- and C-termini
Scoring approach: ECEPP/3 force field
Availability: ICM is available under paid license at [51]
Predictive ability: Tested by cross-docking of 14 peptides into HLA-A*02:01 and 9 peptides into H-2-Kb, as well as docking peptides into homology models for five different MHC allotypes. Average backbone RMSD of 1.1 Å and 0.7 Å for cross-docking on HLA-A*02:01 and H-2-Kb, respectively. Average backbone RMSD of 1.1 Å for docking on homology models

Approach: APE-Gen [92]
Strategy: Constrained termini
Sampling algorithm: Generation of conformations for the peptide backbone in the MHC groove based on constrained anchor residues and loop closure algorithm, followed by energy minimization
Scoring approach: SMINA force field
Availability: APE-Gen is open-source and freely available at https://github.com/KavrakiLab/APE-Gen. It is also available within the HLA-Arena platform which is accessible here: [50]
Predictive ability: Tested on 535 pMHC complexes of class I, with 8 to 11-mers peptides. Average RMSD of 0.9 Å for the Cα atoms and 2.0 Å for all atoms between the native binding mode and the closest sampled peptide conformation

Approach: Fagerberg et al. [38]
Strategy: Constrained termini
Sampling algorithm: All-atom MD simulated annealing
Scoring approach: CHARMM forcefield including the GB-MV2 implicit solvent model
Availability: N/A
Predictive ability: Tested by the redocking of 14 HLA-A*02:01 pMHC and 27 non-HLA-A*02:01 pMHC. For HLA-A*02:01 and selection of output by cluster size, success rate of 86% for backbone RMSD lower than 1.0 Å and 71% for heavy atom RMSD lower than 1.5 Å. For non-HLA-A*02:01, success rates of 70 and 59% for backbone and heavy atoms RMSD, respectively. For selection by the mean effective energy, success rates of 100% and 93% for HLA-A*02:01 pMHC, and of 74 and 67% for non-HLA-A*02:01 pMHC

Approach: DINC [94, 95] and DINC 2.0 [96]
Strategy: Incremental peptide reconstruction
Sampling algorithm: Reconstruction of the ligand by incrementally docking larger and larger overlapping peptides fragments
Scoring approach: Autodock 4 or Autodock Vina scoring functions
Availability: Freely available as a web server [52]
Predictive ability: Tested via the redocking of 25 pMHC complexes, spanning 10 different MHC class I and peptides ranging from 8 to 10-mers. Averaged Cα and all-atoms RMSD of 1.0 and 1.9 Å, respectively

Approach: DynaPred [97]
Strategy: Incremental peptide reconstruction
Sampling algorithm: Reconstruction of the ligand by reconnection of amino acid conformations obtained by MD simulations and energy minimization
Scoring approach: OPLSAA/L force field
Availability: N/A
Predictive ability: Tested by cross-docking of 20 complexes of 9-mer peptides in MHC HLA-A*02:01. Average backbone RMSD of 1.5 Å

Due to the difficulty of docking long peptides, and to the
particular flexibility of the side chains, the success rate of peptide–
MHC docking software is generally quantified using not only the
RMSD calculated on all heavy atoms of the peptide but also the
RMSD calculated only on the backbone atoms or even on the Cα
atoms. The backbone RMSD indicates whether the docking
software succeeded in reproducing the backbone conformation
of the native binding mode, even though the positioning of the side
chains may not be correct.
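The three RMSD variants mentioned above differ only in the atoms that enter the sum. The following sketch makes this explicit, assuming PDB-style atom names (N, CA, C, O for the backbone) and a simplified rule that hydrogen names start with "H"; the data layout and the toy residue are hypothetical.

```python
import math

BACKBONE_NAMES = {"N", "CA", "C", "O"}  # assumed PDB-style backbone atom names

def subset_rmsd(pose, native, keep):
    """RMSD (in Å) over the atoms of `native` whose name satisfies `keep`.

    Both structures map (residue_number, atom_name) -> (x, y, z).
    """
    keys = [key for key in native if keep(key[1])]
    if not keys:
        raise ValueError("no atom selected")
    squared = sum(math.dist(pose[key], native[key]) ** 2 for key in keys)
    return math.sqrt(squared / len(keys))

def is_heavy(name):
    return not name.startswith("H")  # simplification: hydrogens named H*

def is_backbone(name):
    return name in BACKBONE_NAMES

def is_c_alpha(name):
    return name == "CA"

# Toy residue: the backbone is reproduced exactly, the side chain (CB) is 2 Å off.
native = {
    (1, "N"): (0.0, 0.0, 0.0),
    (1, "CA"): (1.5, 0.0, 0.0),
    (1, "C"): (2.0, 1.4, 0.0),
    (1, "O"): (3.2, 1.6, 0.0),
    (1, "CB"): (2.0, -1.4, 0.0),
}
pose = dict(native)
pose[(1, "CB")] = (2.0, -1.4, 2.0)

print(subset_rmsd(pose, native, is_heavy))     # about 0.89
print(subset_rmsd(pose, native, is_backbone))  # 0.0
print(subset_rmsd(pose, native, is_c_alpha))   # 0.0
```

In this toy case the backbone and Cα RMSD report a perfect prediction while the heavy-atom RMSD reveals the misplaced side chain, which is why the two kinds of values are usually reported side by side.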
4.2 Constrained Backbone
Docking of peptides to MHC class I or class II using pDock [86]
requires some preparation steps, followed by a single-step docking
and refinement based on the Internal Coordinate Mechanics (ICM)
algorithm [98, 99]. First, the peptide and the MHC are prepared
for docking by adding missing residues, side chains, and polar
hydrogen atoms. A docking grid is positioned to ensure that the
peptide ligand will be situated in the vicinity of the MHC binding
site. The authors claim that high-quality homology models of the
MHC can be used with pDock, although the assessment was only
performed through redocking to existing X-ray structures. The
peptide is positioned based on existing X-ray structures. Next, the
ICM docking algorithm is used to perform a flexible docking of the
peptide into the MHC binding groove. During this docking, torsion angle values of the ligand side chains are sampled using a
Monte Carlo procedure. The energy function used during this
procedure is the sum of the internal energy of the peptide and the
interaction energy between the peptide and the MHC, including
the internal van der Waals interaction, the hydrophobic potential
between the peptide and the MHC, the hydrogen bonding energy,
the configurational/conformational entropy, and a surface-based
solvation energy, based on the ECEPP/3 force field [100]. Loose
restraints are imposed on the position of the peptide to keep it close
to the starting conformation during the docking. Finally, all peptide
and MHC residues (in the vicinity of the peptide) are refined to
eliminate or minimize peptide–MHC atom clashes, again using
ICM and a Monte Carlo procedure.
pDock was tested on a nonredundant set of 186 pMHC complexes (149 MHC class I and 37 MHC class II) with 3D structures
determined by X-ray crystallography. A predicted binding mode
within 1.0 Å RMSD from the native binding mode, calculated on
the Cα atoms of the nonameric core of the peptide, was obtained
for 83% of the class I complexes and 95% of the class II complexes.
The average Cα RMSD between the redocked and experimental
poses was about 0.6 Å over the entire dataset.
DockTope [87] is based on the so-called D1-EM-D2 approach
[36] (see below for the definition of D1, EM, and D2) for the
modeling of pMHC class I previously published by Antunes et al.
This technique divides peptide–MHC docking into four steps.
First, capitalizing on the known conservation of the backbone
conformation of peptides binding to the same MHC [36], the
input peptide sequence is transformed into a three-dimensional
structure. This is performed by threading its sequence on the
constrained backbone of a peptide-epitope 3D pattern preselected
by the authors. DockTope provides five such patterns (PDB IDs
1LK2 [101], 2V2W [42], 2A83 [102], 1WBX [103], and 1WBY
[103]) covering four MHC allotypes, and thus allowing the docking of 8-mers into H-2-Kb, 9-mers into HLA-A*02:01,
HLA-B*27:05 and H-2-Db, and 10-mers into H-2-Db. This
threading is followed by an energy minimization to mildly relax
the conformation of the peptide, following a protocol identical to
the one described in the third step, below.
Second, starting from the 3D conformation generated in the
first step, an initial molecular docking (D1) is performed using the
Autodock Vina program [104]. Before docking, the system is
prepared using Autodock Tools [105]. This preparation consists
in adding all hydrogens to the MHC macromolecule to calculate
the Gasteiger charges of each protein atom, before removing the
nonpolar hydrogens. The peptide ligand is set up using the same
protocol. The grid box defining the Vina search space is configured
to allow the sampling of the peptide poses inside the MHC cleft. Of
note, the ϕ and ψ backbone torsional angles of the peptide are
excluded from the degrees of freedom, such that only the peptide
side-chain conformations are optimized during the docking.
Twenty independent docking runs are performed, each one
providing a best-predicted binding mode with a corresponding
calculated binding energy. The best-predicted binding mode
among these 20 runs is obtained by (a) removing all binding
modes with binding energies lower than the average binding
energy of the 20 calculated modes and (b) selecting the binding
mode with the lowest average RMSD to all other remaining binding modes.
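The two-stage selection among the 20 docking runs can be sketched as follows. This is only an illustration under stated assumptions: the pairwise RMSD matrix is taken as precomputed, and because the text does not specify whether "lower" refers to the signed energy or to the binding strength, the discard rule is passed in as a parameter. The rule used in the example (discard modes with a less favorable, i.e., higher, energy than the average) and all numerical values are hypothetical.

```python
def select_consensus_pose(energies, rmsd_matrix, discard):
    """Return the index of the retained pose with the lowest mean RMSD to the others.

    energies:    one binding energy per docking run
    rmsd_matrix: rmsd_matrix[i][j] is the RMSD between poses i and j
    discard:     function (energy, mean_energy) -> True if the pose is removed
    """
    mean_energy = sum(energies) / len(energies)
    kept = [i for i, e in enumerate(energies) if not discard(e, mean_energy)]
    if not kept:
        raise ValueError("the energy filter removed every pose")
    if len(kept) == 1:
        return kept[0]

    def mean_rmsd_to_others(i):
        others = [j for j in kept if j != i]
        return sum(rmsd_matrix[i][j] for j in others) / len(others)

    return min(kept, key=mean_rmsd_to_others)

# Hypothetical example with four runs instead of twenty.
energies = [-9.0, -8.5, -8.8, -6.0]
rmsd_matrix = [
    [0.0, 1.0, 0.5, 3.0],
    [1.0, 0.0, 0.6, 3.0],
    [0.5, 0.6, 0.0, 3.0],
    [3.0, 3.0, 3.0, 0.0],
]
# Assumed convention: more negative means more favorable, so weaker poses are dropped.
best = select_consensus_pose(energies, rmsd_matrix, lambda e, mean: e > mean)
print(best)  # 2
```

Selecting the pose closest to all the others favors binding modes that are reproduced across independent runs rather than a single low-energy outlier.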
Third, starting from the calculated binding mode generated by
D1, an energy minimization (EM) is performed with the steepest
descent algorithm, using the GROMACS package [106] and the
GROMOS 53A5 force field [107], to correct possible steric clashes
between the docked peptide and the MHC. Interestingly, the
pMHC system is embedded in a box filled with explicit water
molecules and with a 0.15 mol/l NaCl concentration during the
minimization to take the solvent effect into account.
Fourth, a second docking (D2) is performed in order to refine
the structure, because the MHC side chain conformations have
been modified during the energy minimization step in the presence of
the peptide. This docking step follows the same procedure as D1
for the sampling and scoring as well as for the selection of the final
binding mode.
DockTope was tested on 135 pMHC class I complexes, covering the five MHC allotypes and the corresponding peptide lengths
mentioned above. Given that the MHC structures used for the
docking were systematically taken from the five preselected PDB
files listed above, this assessment was de facto a cross-docking
experiment. The averaged RMSD between the predicted and native
binding modes ranged from 0.4 to 1.1 Å for the Cα atoms as a
function of the MHC allotypes (from 1.7 to 2.5 Å for all-atom
RMSD), with an average of 0.9 Å over all allotypes (2.0 Å for
all-atom RMSD). Of note, DockTope is freely available as a web
server.
Liu et al. tested the Rosetta FlexPepDock [88] refinement
protocol in the context of peptide–MHC docking [108]. This
protocol can be used when an approximate model of the peptide–protein interaction is already available. It uses a Monte Carlo
energy minimization to iteratively optimize the peptide backbone
and its rigid-body orientation while sampling the side chain flexibility of the peptide and the protein receptor. Applying a refinement
protocol first requires generating coarse-grained models of the
peptide binding mode in the MHC groove. The authors used two
approaches for this: threading the target sequence onto experimentally
determined backbone positions of peptides bound either to the
same MHC allotype or to different MHC allotypes. The conformers obtained this way were then oriented manually in the
peptide binding groove so as to position the anchor residues (positions 2 and 9 of 9-mers) into the respective B and F MHC pockets
(Fig. 7). The resulting pMHC coarse-grained binding modes were
then used as input for the FlexPepDock refinement.
Fig. 7 Anchor residues Leu2 and Val9 of the ALGIGILTV peptide in the HLA-A*02:01 peptide groove (PDB ID
1JHT [22]). Peptide Leu2 is situated in pocket B of MHC, while Val9 residue is in pocket F. Their surface is
displayed and colored in magenta. The position of the MHC N and C-termini walls is also indicated
This double protocol was used to test the influence of the
origin of the backbone conformation on the quality of the prediction. 1000 independent FlexPepDock refinement calculations were
performed for each peptide to efficiently sample the conformational
space. The resulting binding modes were ranked based on the
Rosetta full-atom energy function [109]. The approach was tested
on 30 experimental structures of pMHC class I complexes. When
starting the docking from a peptide template bound to the same
MHC allotype, the authors found that 84% of the complexes were
docked with a backbone RMSD from the native binding mode
lower than 1 Å if they considered the five best binding modes. In
those conditions, the success rate at 2 Å all-atom RMSD among the
five best binding modes was 52%. When starting from a peptide
template bound to a different MHC allotype, the success rate at 1 Å
backbone RMSD among the five best binding modes decreased to
60%. However, in this case, the success rate at 2 Å all-atom RMSD
among the five best binding modes remained similar, at 55%. Of note,
FlexPepDock is freely accessible as a web server.
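Success rates "among the five best binding modes" count a complex as solved if any of its top-ranked poses falls under the RMSD threshold. A minimal sketch of this bookkeeping is given below; the function name and the RMSD values are hypothetical, and each inner list is assumed to be sorted from the best-scored pose to the worst.

```python
def top_n_success_rate(ranked_rmsds, n=5, threshold=1.0):
    """Fraction of complexes with at least one of the n best-scored poses under threshold.

    ranked_rmsds: one list per complex, holding the RMSD to the native binding
                  mode of each pose, ordered from best score to worst score.
    """
    if not ranked_rmsds:
        raise ValueError("no complex provided")
    hits = sum(
        1 for rmsds in ranked_rmsds if any(r < threshold for r in rmsds[:n])
    )
    return hits / len(ranked_rmsds)

# Hypothetical backbone RMSD values (Å) for three complexes.
ranked_rmsds = [
    [1.4, 0.8, 2.0],                 # second-ranked pose is under 1 Å
    [2.5, 2.2, 1.9, 1.7, 1.6, 0.5],  # a good pose exists but is ranked sixth
    [0.3],
]
print(top_n_success_rate(ranked_rmsds, n=5, threshold=1.0))  # 2 of 3 complexes
print(top_n_success_rate(ranked_rmsds, n=1, threshold=1.0))  # 1 of 3 complexes
```

The second complex illustrates why top-N rates depend on scoring as well as sampling: a near-native pose was generated but was not ranked high enough to count.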
4.3 Constrained Termini
In GradDock, Kyeong et al. [89] decompose the peptide-MHC
docking procedure into three main steps.
First, three-dimensional conformations are generated for the
unbound peptide. Exploiting the high conservation of the N- and
C-termini conformation of the peptides presented by MHC thanks
to sequence-independent hydrogen bonds (Fig. 3), GradDock
generates the unbound peptide, with only backbone atoms, by
growing and joining half-peptides from the two fixed termini
taken from a selected experimental pMHC structure (PDB ID
1DUZ [110]). These half-peptides are produced using random ϕ
and ψ angle values. After removing the ϕ/ψ combinations with
lowest probabilities, the remaining half-peptides are randomly
paired and assembled using the cyclic coordinate descent algorithm
originally developed for loop closure [111].
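The principle of cyclic coordinate descent can be illustrated on a toy problem. The sketch below is not the GradDock implementation: it closes a planar chain of rigid links onto a target point, whereas peptide loop closure operates on ϕ/ψ torsions in three dimensions. Link lengths, starting angles, and the target are hypothetical.

```python
import math

def chain_points(angles, lengths):
    """Joint positions of a planar chain anchored at the origin (relative angles)."""
    x = y = heading = 0.0
    points = [(x, y)]
    for angle, length in zip(angles, lengths):
        heading += angle
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points

def ccd_close(angles, lengths, target, tol=1e-3, max_sweeps=500):
    """Adjust one joint at a time so that the chain end approaches the target."""
    angles = list(angles)
    for _ in range(max_sweeps):
        for i in reversed(range(len(angles))):
            points = chain_points(angles, lengths)
            pivot, end = points[i], points[-1]
            to_end = math.atan2(end[1] - pivot[1], end[0] - pivot[0])
            to_target = math.atan2(target[1] - pivot[1], target[0] - pivot[0])
            # Rotating joint i by this amount aligns pivot->end with pivot->target.
            angles[i] += to_target - to_end
        if math.dist(chain_points(angles, lengths)[-1], target) < tol:
            break
    residual = math.dist(chain_points(angles, lengths)[-1], target)
    return angles, residual

lengths = [1.0, 1.0, 1.0, 1.0]
start = [0.3, 0.3, 0.3, 0.3]
closed, residual = ccd_close(start, lengths, target=(2.0, 1.0))
print(round(residual, 3))
```

Each update is the exact one-dimensional minimizer of the end-to-target distance for the joint being moved, so the distance never increases from one step to the next; this monotonic behavior is what makes the method attractive for loop closure.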
Second, the unbound peptide conformations are inserted into
the MHC-I molecule through a so-called binding simulation; starting 20 Å above the MHC-I groove, the unbound peptide is pushed
into the latter following the binding axis. During this steered
insertion, the peptide is moved by the gradient descent algorithm,
where the gradient is iteratively calculated and added to the physical
forces. The GROMOS 54A7 force field parameters are used to
calculate the nonbonded interactions [112], while the bond
lengths, angles, and proper and improper dihedral angles are maintained by harmonic restraints. At the bound position, side chains of
the terminal residues are optimized using a Monte Carlo approach
applied to the torsion angles. Then, the peptide is submitted to a
gradient descent energy minimization before applying a topological
correction consisting in refining the position of the backbone
atoms of the bound peptides using Ramachandran probability
maps [113]. The peptide is ultimately fully hydrogenated using
REDUCE [114]. During this binding simulation, the structure of
the MHC molecule is held fixed and treated as an AutoDock-style
grid [105].
Third, the resulting candidate poses from step two are ranked
by the GradDock algorithm to provide the final prediction. Of note,
Kyeong et al. reparameterized the Rosetta scoring terms using a
linear programming approach. All Rosetta score terms were calculated for the native and calculated binding modes of several pMHC
systems, before being normalized. Each of these energy terms was
attributed a weight. Following the hypothesis that the crystal structure is in the minimum energy state, the energy of each calculated
binding mode of a given peptide constitutes an energy inequality
against the corresponding native binding mode. The authors determined the optimal weights of the Rosetta score terms by solving the
resulting linear program. The new ranking functions were validated by
cross-validation using self-docking results as well as cross-docking.
GradDock was tested by redocking on a set of 107 nonredundant pMHC class I systems, covering 82 class I MHCs and 8- to
10-mer peptides, and further challenged on cross-docking of
70 complexes. GradDock was found to provide robust cross-docking predictions, with a predictive ability similar to that of
redocking, i.e., RMSDs around 1.2 Å and 2.5 Å for backbone
and all-atoms, respectively. Of note, although GradDock provides
calculated binding modes with an averaged RMSD to the native
binding modes similar to the standard Rosetta score-based
approach [91], it was found to provide good predictions for three
times more targets in cross-docking.
Another approach using constrained termini was provided by
the work of Park et al. [90]. Their method makes use of all-atom
molecular dynamics (MD) and simulated annealing (SA) simulations. Their protocol starts by preparing an initial peptide–MHC structure using homology modeling, with MODELLER [115], Rosetta [75], or PRIME [116]. However, only the
MODELLER structure was used to initiate the SA protocol in their
study. Each SA cycle consists in heating the system from 300 K to
1500 K during 80 ps, followed by an equilibration at 1500 K for
another 80 ps and finally a cooling to 300 K in 800 ps. The MD
simulations during the SA cycles were performed using Langevin
dynamics, calculated using the AMBER9 program [117] and
AMBER force field [118]. 100 SA cycles were performed during
which the MHC atoms were restrained to their initial position, and
the distances of the four hydrogen bonds between the MHC and
the N- and C-termini of the peptide were also maintained. The
most frequent conformation among the 100 generated ones was
selected as predicted binding mode.
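The temperature profile of one SA cycle can be written as a simple function of time. The sketch below assumes linear heating and cooling ramps, which the description above does not specify; the function name and default arguments are illustrative.

```python
def sa_temperature(t_ps, t_low=300.0, t_high=1500.0,
                   heat_ps=80.0, hold_ps=80.0, cool_ps=800.0):
    """Target temperature (K) at time t_ps (ps) for repeated SA cycles.

    Assumption: heating and cooling follow linear ramps.
    """
    cycle_ps = heat_ps + hold_ps + cool_ps
    t = t_ps % cycle_ps  # position inside the current cycle
    if t < heat_ps:
        return t_low + (t_high - t_low) * t / heat_ps
    if t < heat_ps + hold_ps:
        return t_high
    return t_high - (t_high - t_low) * (t - heat_ps - hold_ps) / cool_ps

for t in (0.0, 40.0, 100.0, 560.0, 960.0):
    print(t, sa_temperature(t))
# 0.0 300.0, 40.0 900.0, 100.0 1500.0, 560.0 900.0, 960.0 300.0

# Each cycle lasts 80 + 80 + 800 = 960 ps, so 100 cycles amount to 96 ns.
print(100 * (80 + 80 + 800) / 1000.0, "ns")
```

With the durations given above, the cooling phase accounts for more than 80% of each cycle, which leaves most of the simulation time for the peptide to settle into a low-energy conformation.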
The approach was tested on 17 pMHC complexes, all
HLA-A*02:01. For each one, the experimental structure
corresponding to PDB ID 2V2W [42] was used as a template for
the homology modeling step. The authors found that homology
modeling already provided calculated binding modes with an averaged peptide all-atom RMSD from the native binding mode of only
1.5 Å for MODELLER, and 3.1 Å for Rosetta and PRIME. The SA
protocol, which started from the MODELLER-generated conformations, did not improve the predictions, with an all-atom RMSD
of 1.6 Å. The authors decided to change the criteria of selection of
the final binding mode. For this, starting from the binding mode
generated by the SA protocol, a 10 ns MD simulation was performed at 283 K, again using AMBER and the same restraints on
the peptide and MHC. Each trajectory was analyzed to find conformational transitions. Finally, the authors selected the most probable conformational state as the one resulting the most frequently
from conformational transitions. They found that this new
MD-based protocol could correct the worst MODELLER prediction, obtained for the complex with PDB ID 2V2X [42], decreasing the peptide all-atom RMSD from 2.4 Å after homology
modeling to 1.4 Å after simulated annealing followed by MD
simulation. Tested on three other complexes for which MODELLER provided a successful prediction, the MD-based protocol
provided results similar to those of homology modeling. In addition,
the authors used their method to perform a blind docking of three
peptides on HLA-A*02:01. Their predicted binding modes were
validated by X-ray crystallography.
A constrained termini approach, using the Rosetta scoring
function, was also used by Yanover et al. [91] to calculate
position-specific frequency matrices (PFM) for several MHC
alleles. In contrast to the other approaches mentioned here, their
method was developed to also explore the sequence of the peptides
during docking. Their docking approach proceeds in two stages.
First, a low-resolution backbone model for the peptide bound
to the MHC is obtained by fragment assembly. The backbone of
the peptide is built outward from the two canonical anchor positions by assembling three-residue fragments from proteins of
known structure and similar local sequence [119]. The peptide is
randomly cut in two fragments between the two anchor positions
to perform an independent sampling of the two halves. The cyclic
coordinate descent (CCD) loop closure algorithm [120, 121] is
finally applied to provide peptide conformations. Several cut/closure cycles are performed to increase the sampling. In addition, the
orientation of the anchor positions is sampled by replacement with
orientations derived from a peptide–MHC complex of known
structure. In this low-resolution stage, both the peptide and the
MHC are represented only by their backbone, and their energy is
calculated using a knowledge-based scoring function.
Second, a high-resolution modeling step is performed by adding all side chain atoms to the low-resolution backbone model.
Peptide conformations obtained this way are refined using a
Monte Carlo optimization procedure. During this stage, the peptide sequence is considered as a degree of freedom and is optimized
by Monte Carlo moves. Preferred peptides and binding modes are
selected based on the binding energy. The potential energy function used in the second stage is the Rosetta all-atom potential [122]
modified to incorporate a short-ranged electrostatics term and an
implicit solvation model.
Docking and sequence-optimizing of thousands of peptides
using this approach provided computed PFMs very close to the
experimental PFMs for 29 MHC class I (11 HLA-A and
18 HLA-B).
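A position-specific frequency matrix simply records, for each peptide position, how often each amino acid is observed in a set of peptides of equal length. The sketch below shows the computation on a made-up list of 9-mers; only the first sequence (ALGIGILTV) appears in this chapter, and the other three are hypothetical variants introduced for illustration.

```python
from collections import Counter

def position_frequency_matrix(peptides):
    """Return one {amino_acid: frequency} dictionary per peptide position."""
    if not peptides:
        raise ValueError("no peptide provided")
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    matrix = []
    for position in range(length):
        counts = Counter(p[position] for p in peptides)
        matrix.append({aa: n / len(peptides) for aa, n in counts.items()})
    return matrix

# Hypothetical set of selected 9-mers (only the first one is taken from the text).
peptides = ["ALGIGILTV", "KLGIGILTV", "AMGIGILTV", "ALGIGILTL"]
pfm = position_frequency_matrix(peptides)
print(pfm[1])  # anchor position 2: {'L': 0.75, 'M': 0.25}
print(pfm[8])  # anchor position 9: {'V': 0.75, 'L': 0.25}
```

Comparing such a computed matrix with one derived from experimentally identified binders, position by position, is one way to quantify how well a docking and sequence-optimization protocol recovers the binding motif of an allele.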
The ICM docking approach was also tested in the context of a
constrained termini approach by Bordner et al. [93] using a flexible
all-atom model of the complete peptide. The authors used their
biased-probability Monte Carlo [123] conformational search
method implemented in the ICM program [98, 99] to sample the
conformational space of the peptide into a grid potential derived
from an X-ray structure of the MHC molecule. Based on the fact
that the MHC side chain conformations from all available X-ray
crystal structures of HLA-A*02:01 MHC were found to cluster
into only two groups, one representative structure from each
group, PDB ID 1JF1 [22] and 1I7U [124], was used for docking.
A quadratic restraint energy was applied between atoms on the
peptide to be docked and the corresponding atoms of the N- and
C-termini of the peptide in the original pMHC structure to
account for the conserved position of the peptide termini. The
50 lowest energy conformations from the grid docking simulations
were ranked using the energy of an all-atom model of the complex,
using the ECEPP/3 force field [100], after local minimization of
the peptide and nearby MHC residues. The lowest energy conformation was chosen as the final docking solution.
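A quadratic restraint of this kind penalizes the squared distance between each restrained peptide atom and its reference position. The sketch below assumes the form E = k Σ |r_i − r_i,ref|², with a single force constant k and no additional prefactor; the exact functional form, the force constant, and the coordinates are assumptions made for illustration.

```python
import math

def restraint_energy(coords, reference, force_constant=1.0):
    """Quadratic restraint energy, assuming E = k * sum(|r_i - r_ref_i|^2)."""
    if len(coords) != len(reference):
        raise ValueError("coords and reference must have the same length")
    return force_constant * sum(
        math.dist(c, r) ** 2 for c, r in zip(coords, reference)
    )

# Hypothetical terminal atoms displaced by 0.5 Å and 1.0 Å from their references.
reference = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
coords = [(0.5, 0.0, 0.0), (10.0, 1.0, 0.0)]
print(restraint_energy(coords, reference))  # 0.25 + 1.0 = 1.25
```

Because the penalty grows with the square of the displacement, small deviations of the termini are tolerated while large excursions out of the conserved positions are strongly disfavored.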
The approach was tested by cross-docking of 14 peptides into
HLA-A*02:01 and 9 peptides into H-2-Kb, as well as docking
peptides into homology models for five different MHC allotypes.
For the cross-docking on HLA-A*02:01 and H-2-Kb, the authors
obtained an averaged backbone RMSD to the native binding mode
of 1.1 Å and 0.7 Å, respectively. For the docking on homology
models, the authors obtained an averaged backbone RMSD to the
native binding mode of 1.1 Å.
Recently, the team of Kavraki and coworkers, which provided
several important contributions to the field of peptide–MHC docking notably with DINC [94–96] and HLA-Arena (see below),
proposed the APE-Gen [92] (Anchored Peptide-MHC Ensemble
Generator) approach to generate ensembles of bound conformations of pMHC complexes. The development of this approach
followed the observation that pMHC complexes are dynamic systems and that taking their flexibility into account can significantly
enhance functional interpretations [125]. The objective of
APE-Gen is therefore to generate an ensemble of conformations
of the pMHC system, as opposed to producing only the most
probable one as done in docking. Consequently, APE-Gen is not
a docking method strictly speaking. As usual, the approach is
decomposed into several steps.
First, the MHC structure can be provided as a PDB file if an
experimental structure is available. Otherwise, the MHC structure
is obtained by homology modeling using MODELLER [115] and
selecting the best solution according to the DOPE [126] score.
Then, APE-Gen places the termini atoms of the peptide backbone
in the MHC groove using a template pMHC conformation (which
can be different from the template used by MODELLER above) so
as to capitalize on the conserved position of the N- and C-termini
of the peptides in the pMHC complexes.
Second, 100 backbone conformations are generated for the
peptide by loop modeling based on the random coordinate descent
[127] (RCD) algorithm, which is a modification of the previously
mentioned CCD approach [120, 121]. Third, for each of the
backbone conformations generated above, the side chains are
added with PDBFixer from OpenMM [128], and the resulting
full peptide and MHC side chains are energy minimized using the
SMINA force field.
APE-Gen was tested on 535 pMHC complexes of class I, with
peptide length ranging from 8 to 11-mers, and for which an
experimental X-ray structure is available. For each of these test
cases, APE-Gen was run 10 times. The averaged RMSD between
the native binding mode and the closest sampled peptide conformation, over the entire test sets, is 0.9 Å for the Cα atoms and 2.0 Å
for all atoms, showing that, although APE-Gen is not a docking
program, it can generate peptide conformations close to the native
one. APE-Gen is open source and freely available as a standalone
program. It is also available within HLA-Arena, a recently developed platform for structural modeling and analysis of pMHC
complexes [129].
In their work, Fagerberg et al. [38] designed an MD conformational sampling protocol based on SA cycles with near logarithmic cooling. Here, the pMHC system is described using the
CHARMM force field [130], and calculations are performed
using the CHARMM Molecular Mechanics package [131]. Starting
from the native binding mode (for redocking runs) or a near-native
binding mode (for cross-docking), the system is submitted to
successive cycles of SA. An SA cycle starts by assigning random
velocities at 100 K to the peptide atoms. The system is heated over
3 ps to a temperature of 1300 K using Langevin dynamics before
being equilibrated at this temperature for another 3 ps and
subsequently being cooled for 25 ps. The SA cycle is terminated by
a minimization of the system, and the final conformation is stored
for additional analysis before being used as a starting point for the
next SA cycle. The SA cycle is repeated until a collection of 1000
peptide conformers is obtained. During this process, the solvent
effect is approximated by using a distance-dependent dielectric
constant (ε = 4r). The MHC molecule is kept rigid during the
entire sampling but no constraints are applied to the peptide.
Thus, the approach is not entirely agnostic about the
particularities of typical peptide–MHC binding, since sampling starts
from a native or near-native binding mode, but no constraint is applied to the backbone conformation or to the N- and
C-termini. However, the rigidity of the MHC during the SA might
fix de facto the N- and C-termini in their original position. The
authors demonstrated that increasing the temperature up to
1300 K very rapidly erased most of the memory of the starting
native state during the SA: some of the conformers generated early
in the process lost most of the native contacts, and minor pockets of
the MHC were unfilled. Consequently, this SA protocol is expected
to erase the structural memory of the native binding mode and to
only keep the memory of the global orientation of the peptide in
the groove as well as some of the interactions in the anchor binding
pockets. At the end of the SA cycles, the 1000 minimized peptide
conformations are clustered based on their relative RMSD. Their
effective energy, including the CHARMM energy and a GB-MV2
[132, 133] implicit estimation of the solvent effect, is calculated.
The final binding mode is the center of a cluster of binding poses,
itself selected based on the size of the cluster or on the mean
effective energy of its members.
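The temperature profile of one SA cycle and the distance-dependent dielectric can be sketched as follows. The durations and the 100 K and 1300 K temperatures come from the description above. Everything else is an assumption made for illustration only: the linear heating ramp, the functional form of the near-logarithmic cooling, the final temperature `t_end_k`, the time constant `tau_ps`, and the Coulomb conversion factor (a commonly used value in kcal Å/(mol e²)). This is not the CHARMM input used by the authors.

```python
import math

HEAT_PS = 3.0      # heating time (from the text)
EQUIL_PS = 3.0     # equilibration time at high temperature (from the text)
COOL_PS = 25.0     # cooling time (from the text)
T_START_K = 100.0  # temperature of the initial random velocities (from the text)
T_HOT_K = 1300.0   # high temperature (from the text)
COULOMB_K = 332.0637  # kcal*Angstrom/(mol*e^2); assumed conversion factor


def sa_temperature(t_ps, t_end_k=100.0, tau_ps=1.0):
    """Target temperature (K) at time t_ps within one SA cycle.

    t_end_k and tau_ps are hypothetical parameters; the cooling law
    T = T_hot / (1 + a*ln(1 + t/tau)) is one possible near-logarithmic form.
    """
    if t_ps < 0.0 or t_ps > HEAT_PS + EQUIL_PS + COOL_PS:
        raise ValueError("time outside the SA cycle")
    if t_ps <= HEAT_PS:  # assumed linear heating ramp
        return T_START_K + (T_HOT_K - T_START_K) * t_ps / HEAT_PS
    if t_ps <= HEAT_PS + EQUIL_PS:  # plateau at the high temperature
        return T_HOT_K
    t_cool = t_ps - HEAT_PS - EQUIL_PS
    # a is chosen so that the schedule reaches t_end_k after COOL_PS
    a = (T_HOT_K / t_end_k - 1.0) / math.log(1.0 + COOL_PS / tau_ps)
    return T_HOT_K / (1.0 + a * math.log(1.0 + t_cool / tau_ps))


def coulomb_rdie(q_i, q_j, r, eps_factor=4.0):
    """Coulomb energy (kcal/mol) with dielectric eps = eps_factor * r.

    With eps = 4r, the energy decays as 1/r^2 instead of 1/r.
    """
    return COULOMB_K * q_i * q_j / (eps_factor * r * r)
```

With ε = 4r the electrostatic interaction falls off as 1/r², which crudely mimics solvent screening without explicit water molecules.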
The approach was challenged by the redocking of
14 HLA-A*02:01 pMHC and 27 non-HLA-A*02:01 pMHC complexes.
When the final result is selected based on the cluster size, 86% of
the calculated binding modes for HLA-A*02:01 pMHC have an
RMSD to the native binding mode lower than 1.0 Å for the
backbone atoms and 71% have an RMSD lower than 1.5 Å for all
heavy atoms. For non-HLA-A*02:01 pMHC, these success rates
become 70% and 59%, respectively. When the mean effective energy
is used instead of the cluster size to select the output, the success rates
increase to 100% and 93% for HLA-A*02:01 pMHC and to 74% and
67% for non-HLA-A*02:01 pMHC.
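The two selection criteria, and the success rate used to compare them, can be illustrated with the short sketch below. It assumes that clustering has already been performed and that each cluster stores the identifier of its central pose together with the effective energies of its members. The `Cluster` class and the function names are hypothetical and do not reproduce the authors' code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Cluster:
    center_id: int   # identifier of the pose at the cluster center
    energies: list   # effective energies of all cluster members


def select_cluster(clusters, criterion="size"):
    """Pick the cluster whose center is reported as the final binding mode."""
    if criterion == "size":
        # most populated cluster
        return max(clusters, key=lambda c: len(c.energies))
    if criterion == "energy":
        # cluster with the lowest mean effective energy
        return min(clusters, key=lambda c: mean(c.energies))
    raise ValueError("criterion must be 'size' or 'energy'")


def success_rate(rmsds, threshold):
    """Percentage of predictions with an RMSD strictly below the threshold."""
    return 100.0 * sum(r < threshold for r in rmsds) / len(rmsds)
```

The two criteria can disagree: a small cluster of low-energy poses is chosen by the energy criterion, whereas a large cluster of higher-energy poses is chosen by the size criterion.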
4.4 Incremental Peptide Reconstruction
The predictive ability of standard small-molecule/protein docking
software drops if the number of internal degrees of freedom (DoF)
of the ligand is larger than 10. However, the number of DoF of 8- to
11-mer peptides far exceeds this limit, making standard docking software inadequate for the peptide-MHC system. Based on
this observation, Antunes et al. [94] proposed a new algorithm,
derived from their initial DINC docking software [95]. Their
approach addresses this constraint by limiting the number of
peptide-related DoF to 6 at every step of the docking process
through an incremental reconstruction of the peptide, rather than
by docking the entire peptide and exploring all its degrees of
freedom at once.
Briefly, DINC (Docking Incrementally) starts by docking only
a small fragment of the peptide. This initial root fragment is chosen
so as to maximize the potential for hydrogen bonds (i.e., by counting the number of hydrogen bond donors and acceptors), while
limiting the number of DoF to 6. This fragment is docked using a
standard docking software, and the 10 best binding modes according to the calculated binding free energy are selected for fragment
expansion. These docked fragment poses are expanded by adding
atoms following the same heuristic as above (i.e., optimizing the
number of hydrogen bond donors and acceptors). The expanded
fragment, in the different selected binding modes, is used as input
for a second round of dockings in which a new set of 6 DoF is
considered flexible regardless of the fragment size. These new DoF
are selected to involve some of the newly added atoms and some of
the atoms that were already present in the previous fragment. This
process of docking and expansion is repeated until the peptide has
been entirely reconstructed and docked. Of note, DINC is a
meta-docking approach: its code is only responsible for selecting
the initial fragment, expanding it incrementally, and choosing
the DoF used in each docking run. In the discussed version of the
algorithm, the docking of the fragments itself is delegated to the
AutoDock 4 software [105] used with standard parameters.
Notably, DINC was developed to dock peptides in general, not
specifically for peptide-MHC docking. Consequently,
its algorithm does not use the characteristics of the pMHC system,
such as the conservation of the position of the N- and C-termini, or
the limited number of backbone conformations observed for peptides in the grooves of MHC class I.
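The control flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not the DINC implementation: DoF are represented as indices ordered along the direction of fragment growth, the number of DoF added per expansion (`n_new_per_round`) is a hypothetical parameter, the hydrogen-bond heuristic for choosing atoms is omitted, and the docking engine is abstracted as a user-supplied callable `dock` that returns (score, pose) pairs, with lower scores being better.

```python
MAX_FLEXIBLE_DOF = 6   # flexible DoF per docking round (from the text)
N_KEPT_POSES = 10      # best poses kept for expansion (from the text)


def plan_rounds(n_dof, n_new_per_round=3):
    """Return (fragment_size, flexible_dof_indices) for each docking round.

    After the root fragment, each round frees the newly added DoF plus the
    most recently added older ones, never more than MAX_FLEXIBLE_DOF.
    """
    if n_dof < 1 or not 0 < n_new_per_round <= MAX_FLEXIBLE_DOF:
        raise ValueError("invalid number of DoF")
    size = min(MAX_FLEXIBLE_DOF, n_dof)
    rounds = [(size, list(range(size)))]          # root fragment
    while size < n_dof:
        size = min(size + n_new_per_round, n_dof)  # fragment expansion
        first_flexible = max(0, size - MAX_FLEXIBLE_DOF)
        rounds.append((size, list(range(first_flexible, size))))
    return rounds


def incremental_dock(n_dof, dock, n_new_per_round=3):
    """Run the dock/expand loop; `dock` is a hypothetical docking callable.

    dock(fragment_size, flexible_dofs, start_pose) -> list of (score, pose)
    """
    poses = [None]  # no starting pose for the root fragment
    for fragment_size, flexible in plan_rounds(n_dof, n_new_per_round):
        scored = []
        for start_pose in poses:
            scored.extend(dock(fragment_size, flexible, start_pose))
        scored.sort(key=lambda item: item[0])      # best (lowest) score first
        poses = [pose for _, pose in scored[:N_KEPT_POSES]]
    return poses
```

For a ligand with 12 DoF and 3 DoF added per round, `plan_rounds(12)` yields three rounds with flexible DoF 0 to 5, 3 to 8, and 6 to 11, so each round mixes newly added and previously docked DoF while the search space stays six-dimensional.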
The approach was tested via the redocking of 25 pMHC complexes [94], spanning 10 different MHC class I molecules and peptides ranging from 8- to 10-mers. The average Cα and all-atom RMSD values
between the calculated and native binding modes were 1.0 and
1.9 Å, respectively. This result is particularly impressive given that
the DINC algorithm does not capitalize on any specific constraints
resulting from the known features of pMHC complexes.
Recently, a new version of DINC was released. DINC 2.0 was
modified to use AutoDock Vina and to be more efficient on larger
peptides. It was made accessible as a freely available docking web
server [96].
An alternative to incremental peptide reconstruction is
offered by DynaPred [97]. Briefly, DynaPred performs MD simulations to approximate the binding free energy of each peptide residue inside the binding pockets of the MHC cleft. The structural
information obtained by these simulations is used to construct the
3D structure of the pMHC complexes. To stabilize the peptide
conformations, single residues are extended to peptide-trimers and
dimers by adding glycine residues at both sides. The final docked
poses are obtained by connecting the residue conformations from
the simulation runs and performing a short energy minimization.
The approach was tested by cross-docking on 20 complexes of
9-mer peptides in MHC HLA-A*02:01. The authors obtained an
average backbone RMSD of 1.5 Å.
Of note, the incremental peptide reconstruction strategy is
used in other peptide docking algorithms [35, 134], but these
will not be detailed here since they were not intensively tested on
pMHC complexes.
5 Conclusion
Progress in experimental approaches has brought a wealth of
information regarding MHC specificities for peptides and TCR
specificities for pMHC epitopes. Given the importance of this
information for the treatment of cancer and autoimmune diseases,
these data have been curated and collected in several freely accessible databases. In turn, this has allowed the development of several
rapid and efficient machine-learning or deep-learning approaches
to predict these specificities, which are now widely used in immunoinformatics. The number of available pMHC and TCR-pMHC 3D
structures is also continuously growing. However, it remains
negligible compared to the huge diversity of pMHC that results
from the number of possible peptides and MHC allotypes. To
address this limitation, several pMHC docking algorithms have
been developed and benchmarked, some of them very recently.
Docking peptides, even limited to 8 to 11 amino acids in length,
is particularly challenging in view of the flexibility of such ligands
and their large number of degrees of freedom. However, pMHC
docking codes, notably for class I MHC, can rely on different data-driven assumptions regarding the overall orientation of the peptide
in the MHC groove, the position of the N- and C-termini of the
peptide and the general conformation of its backbone. Thanks to
this knowledge, several approaches have been developed and can be
categorized according to the nomenclature proposed by Antunes
et al. [17] as constrained backbone, constrained termini, and incremental peptide reconstruction approaches. Some of these
approaches demonstrated a satisfactory predictive ability in redocking and cross-docking of some pMHC complexes. However, there
is still room for improvement in terms of speed and availability. Of
note, the predictive ability of the approaches is being tested on an
increasingly large number of diverse pMHC complexes. However,
the number of such complexes in the test sets remains small in view
of the real diversity in terms of MHC allotypes and peptide
sequence and length. We can expect that the increasing amount
of experimental data available will nurture the design of new
pMHC docking software and the enhancement of existing ones.
Acknowledgments
This work was supported by the University of Lausanne—Department of Oncology UNIL-CHUV, the Ludwig Institute for Cancer
Research—Lausanne Branch, the SIB Swiss Institute of Bioinformatics, SNSF grants to VZ (#205321_192019, CRSII5_193749
and CRSK-3_190400) and OM (#31003A_176168), and funds
from Research for Life to OM.
References
1. Hansen TH, Bouvier M (2009) MHC class I
antigen presentation: learning from viral evasion strategies. Nat Rev Immunol 9:503–513.
https://doi.org/10.1038/nri2575
2. Hewitt EW (2003) The MHC class I antigen
presentation pathway: strategies for viral
immune evasion. Immunology 110:163–169.
https://doi.org/10.1046/j.1365-2567.2003.01738.x
3. Jones EY, Fugger L, Strominger JL, Siebold
C (2006) MHC class II proteins and disease: a
structural perspective. Nat Rev Immunol
6:271–282. https://doi.org/10.1038/nri1805
4. Roche PA, Furuta K (2015) The ins and outs
of MHC class II-mediated antigen processing
and presentation. Nat Rev Immunol
15:203–216. https://doi.org/10.1038/nri3818
5. Schumacher TN, Schreiber RD (2015)
Neoantigens in cancer immunotherapy. Science 348:69–74. https://doi.org/10.1126/
science.aaa4971
6. Tran E, Turcotte S, Gros A et al (2014) Cancer immunotherapy based on mutation-specific CD4+ T cells in a patient with epithelial cancer. Science 344:641–645.
https://doi.org/10.1126/science.1251102
7. Sahin U, Türeci Ö (2018) Personalized vaccines for cancer immunotherapy. Science
359:1355–1360. https://doi.org/10.1126/
science.aar7112
8. Wirth TC, Kühnel F (2017) Neoantigen
targeting-dawn of a new era in cancer
immunotherapy? Front Immunol 8:1848.
https://doi.org/10.3389/fimmu.2017.
01848
9. Tran E, Robbins PF, Rosenberg SA (2017)
“Final common pathway” of human cancer
immunotherapy: targeting random somatic
mutations. Nat Immunol 18:255–262.
https://doi.org/10.1038/ni.3682
10. Lizée G, Overwijk WW, Radvanyi L et al
(2013) Harnessing the power of the immune
system to target cancer. Annu Rev Med
64:71–90.
https://doi.org/10.1146/
annurev-med-112311-083918
11. Galluzzi L, Chan TA, Kroemer G et al (2018)
The hallmarks of successful anticancer immunotherapy. Sci Transl Med 10:eaat7807.
https://doi.org/10.1126/scitranslmed.
aat7807
12. Comber JD, Philip R (2014) MHC class I
antigen presentation and implications for
developing a new generation of therapeutic
vaccines. Ther Adv Vaccines 2:77–89.
https://doi.org/10.1177/
2051013614525375
13. Yin Y, Li Y, Mariuzza RA (2012) Structural
basis for self-recognition by autoimmune
T-cell receptors. Immunol Rev 250:32–48.
https://doi.org/10.1111/imr.12002
14. Gfeller D, Bassani-Sternberg M, Schmidt J,
Luescher IF (2016) Current tools for predicting cancer-specific T cell immunity. Onco Targets Ther 5:e1177691. https://doi.org/10.
1080/2162402X.2016.1177691
15. Mösch A, Raffegerst S, Weis M et al (2019)
Machine learning for cancer immunotherapies
based on epitope recognition by T cell receptors. Front Genet 10:1141. https://doi.org/
10.3389/fgene.2019.01141
16. Adams JJ, Narayanan S, Birnbaum ME et al
(2016) Structural interplay between germline
interactions and adaptive recognition determines the bandwidth of TCR-peptide-MHC
cross-reactivity. Nat Immunol 17:87–94.
https://doi.org/10.1038/ni.3310
17. Antunes DA, Abella JR, Devaurs D et al
(2018) Structure-based methods for binding
mode and binding affinity prediction for
peptide-MHC complexes. Curr Top Med
Chem 18:2239–2255. https://doi.org/10.
2174/1568026619666181224101744
18. Robinson J, Barker DJ, Georgiou X et al
(2020) IPD-IMGT/HLA Database. Nucleic
Acids Res 48:D948–D955. https://doi.org/
10.1093/nar/gkz950
19. Gfeller D, Bassani-Sternberg M (2018) Predicting antigen presentation-what could we
learn from a million peptides? Front Immunol
9:1716. https://doi.org/10.3389/fimmu.
2018.01716
20. Klein J, Sato A (2000) The HLA system. First
of two parts. N Engl J Med 343:702–709.
https://doi.org/10.1056/
NEJM200009073431006
21. Gao GF, Rao Z, Bell JI (2002) Molecular
coordination of alphabeta T-cell receptors
and coreceptors CD8 and CD4 in their recognition of peptide-MHC ligands. Trends
Immunol 23:408–413. https://doi.org/10.
1016/s1471-4906(02)02282-2
22. Sliz P, Michielin O, Cerottini JC et al (2001)
Crystal structures of two closely related but
antigenically distinct HLA-A2/melanocytemelanoma tumor-antigen peptide complexes.
J Immunol 167:3276–3284
23. Pettersen EF, Goddard TD, Huang CC et al
(2004) UCSF chimera--a visualization system
for exploratory research and analysis. J Comput Chem 25:1605–1612. https://doi.org/
10.1002/jcc.20084
24. Gao GF, Tormo J, Gerth UC et al (1997)
Crystal structure of the complex between
human CD8alpha(alpha) and HLA-A2.
Nature 387:630–634. https://doi.org/10.
1038/42523
25. Wang H, Capps GG, Robinson BE, Zúñiga
MC (1994) Ab initio association with beta
2-microglobulin during biosynthesis of the
H-2Ld class I major histocompatibility complex heavy chain promotes proper disulfide
bond formation and stable peptide binding.
J Biol Chem 269:22276–22281. https://doi.
org/10.1016/S0021-9258(17)31787-8
26. Shields MJ, Kubota R, Hodgson W et al
(1998) The effect of human beta2-microglobulin on major histocompatibility
complex I peptide loading and the engineering of a high affinity variant. Implications for
peptide-based vaccines. J Biol Chem 273:28010–28018.
https://doi.org/10.1074/jbc.273.43.28010
27. Uger RA, Chan SM, Barber BH (1999) Covalent linkage to beta2-microglobulin enhances
the MHC stability and antigenicity of suboptimal CTL epitopes. J Immunol
162:6024–6028
28. Collins EJ, Garboczi DN, Wiley DC (1994)
Three-dimensional structure of a peptide
extending from one end of a class I MHC
binding site. Nature 371:626–629. https://
doi.org/10.1038/371626a0
29. Guillaume P, Picaud S, Baumgaertner P et al
(2018) The C-terminal extension landscape
of naturally presented HLA-I ligands. Proc
Natl Acad Sci U S A 115:5083–5088.
https://doi.org/10.1073/pnas.
1717277115
30. Matsui M, Hioe CE, Frelinger JA (1993)
Roles of the six peptide-binding pockets of
the HLA-A2 molecule in allorecognition by
human cytotoxic T-cell clones. Proc Natl
Acad Sci U S A 90:674–678. https://doi.
org/10.1073/pnas.90.2.674
31. Deres K, Beck W, Faath S et al (1993)
MHC/peptide binding studies indicate hierarchy of anchor residues. Cell Immunol
151:158–167. https://doi.org/10.1006/cimm.1993.1228
32. Bassani-Sternberg M, Chong C, Guillaume P
et al (2017) Deciphering HLA-I motifs across
HLA peptidomes improves neo-antigen predictions and identifies allostery regulating
HLA specificity. PLoS Comput Biol 13:
e1005725. https://doi.org/10.1371/jour
nal.pcbi.1005725
33. Perez MAS, Bassani-Sternberg M, Coukos G
et al (2019) Analysis of secondary structure
biases in naturally presented HLA-I ligands.
Front Immunol 10:823. https://doi.org/10.
3389/fimmu.2019.02731
34. Liu J, Gao GF (2011) Major histocompatibility complex: interaction with peptides. eLS.
https://doi.org/10.1002/9780470015902.
a0000922.pub2
35. Sezerman U, Vajda S, DeLisi C (1996) Free
energy mapping of class I MHC molecules
and structural determination of bound peptides. Protein Sci 5:1272–1281. https://doi.
org/10.1002/pro.5560050706
36. Antunes DA, Vieira GF, Rigo MM et al
(2010) Structural allele-specific patterns
adopted by epitopes in the MHC-I cleft and
reconstruction of MHC:peptide complexes to
cross-reactivity assessment. PLoS One 5:
e10353. https://doi.org/10.1371/journal.
pone.0010353
37. Schueler-Furman O, Elber R, Margalit H
(1998) Knowledge-based structure prediction of MHC class I bound peptides: a study
of 23 complexes. Fold Des 3:549–564.
https://doi.org/10.1016/S1359-0278(98)
00070-4
38. Fagerberg T, Cerottini J-C, Michielin O
(2006) Structural prediction of peptides
bound to MHC class I. J Mol Biol
356:521–546. https://doi.org/10.1016/j.jmb.2005.11.059
39. Nicholls S, Piper KP, Mohammed F et al
(2009) Secondary anchor polymorphism in
the HA-1 minor histocompatibility antigen
critically affects MHC stability and TCR recognition. Proc Natl Acad Sci U S A
106:3889–3894. https://doi.org/10.1073/
pnas.0900411106
40. Reiser J-B, Legoux F, Gras S et al (2014)
Analysis of relationships between peptide/MHC
structural features and naive T cell frequency
in humans. J Immunol 193:5816–5826.
https://doi.org/10.4049/jimmunol.1303084
41. Cole DK, Bulek AM, Dolton G et al (2016)
Hotspot autoimmune T cell receptor binding
underlies pathogen and insulin peptide crossreactivity. J Clin Invest 126:2191–2204.
https://doi.org/10.1172/JCI85679
42. Lee JK, Stewart-Jones G, Dong T et al (2004)
T cell cross-reactivity and conformational
changes during TCR engagement. J Exp
Med 200:1455–1466. https://doi.org/10.
1084/jem.20041251
43. Pieper J, Dubnovitsky A, Gerstner C et al
(2018) Memory T cells specific to citrullinated α-enolase are enriched in the rheumatic
joint. J Autoimmun 92:47–56. https://doi.
org/10.1016/j.jaut.2018.04.004
44. Wang JH, Meijers R, Xiong Y et al (2001)
Crystal structure of the human CD4
N-terminal two-domain fragment complexed
to a class II MHC molecule. Proc Natl Acad
Sci U S A 98:10799–10804. https://doi.org/
10.1073/pnas.191124098
45. Chicz RM, Urban RG, Lane WS et al (1992)
Predominant naturally processed peptides
bound to HLA-DR1 are derived from
MHC-related molecules and are heterogeneous in size. Nature 358:764–768. https://
doi.org/10.1038/358764a0
46. Achour A (2001) Major histocompatibility
complex: interaction with peptides. eLS.
https://doi.org/10.1038/npg.els.0000922
47. Burley SK, Berman HM, Kleywegt GJ et al
(2017) Protein data Bank (PDB): the single
global macromolecular structure archive.
Methods Mol Biol 1607:627–641. https://
doi.org/10.1007/978-1-4939-7000-1_26
48. Berman HM (2000) The Protein Data Bank.
Nucleic Acids Res 28:235–242. https://doi.
org/10.1093/nar/28.1.235
49. Sinigaglia M, Antunes DA, Rigo MM et al
(2013) CrossTope: a curate repository of 3D
structures of immunogenic peptide: MHC
complexes. Database 2013:bat002. https://
doi.org/10.1093/database/bat002
50. Tong JC, Kong L, Tan TW, Ranganathan S
(2006) MPID-T: database for sequencestructure-function information on T-cell
receptor/peptide/MHC interactions. Appl
Bioinforma 5:111–114. https://doi.org/10.
2165/00822942-200605020-00005
51. Khan JM, Cheruku HR, Tong JC, Ranganathan S (2011) MPID-T2: a database for
sequence-structure-function analyses of
pMHC and TR/pMHC structures. Bioinformatics 27:1192–1193.
https://doi.org/10.1093/bioinformatics/btr104
52. Kaas Q, Ruiz M, Lefranc M-P (2004) IMGT/
3Dstructure-DB and IMGT/StructuralQuery, a database and a tool for immunoglobulin, T cell receptor and MHC structural data.
Nucleic Acids Res 32:D208–D210. https://
doi.org/10.1093/nar/gkh042
53. Gowthaman R, Pierce BG (2019) TCR3d:
the T cell receptor structural repertoire database. Bioinformatics 35:5323–5325. https://
doi.org/10.1093/bioinformatics/btz517
54. Leem J, de Oliveira SHP, Krawczyk K, Deane
CM (2018) STCRDab: the structural T-cell
receptor database. Nucleic Acids Res 46:D406–D412.
https://doi.org/10.1093/nar/gkx971
55. Borrman T, Cimons J, Cosiano M et al (2017)
ATLAS: a database linking binding affinities
with structures for wild-type and mutant
TCR-pMHC complexes. Proteins 85:908–916.
https://doi.org/10.1002/prot.25260
56. Calis JJA, Maybeno M, Greenbaum JA et al
(2013) Properties of MHC class I presented
peptides that enhance immunogenicity. PLoS
Comput Biol 9:e1003266. https://doi.org/
10.1371/journal.pcbi.1003266
57. Chowell D, Krishna S, Becker PD et al (2015)
TCR contact residue hydrophobicity is a hallmark of immunogenic CD8+ T cell epitopes.
PNAS 112:E1754–E1762. https://doi.org/
10.1073/pnas.1500973112
58. Vita R, Overton JA, Greenbaum JA et al
(2015) The immune epitope database
(IEDB) 3.0. Nucleic Acids Res 43:D405–D412.
https://doi.org/10.1093/nar/gku938
59. Dhanda SK, Mahajan S, Paul S et al (2019)
IEDB-AR: immune epitope database-analysis
resource in 2019. Nucleic Acids Res 47:W502–W506.
https://doi.org/10.1093/nar/gkz452
60. Nielsen M, Lundegaard C, Worning P et al
(2003) Reliable prediction of T-cell epitopes
using neural networks with novel sequence
representations. Protein Sci 12:1007–1017.
https://doi.org/10.1110/ps.0239403
61. Andreatta M, Nielsen M (2016) Gapped
sequence alignment using artificial neural networks: application to the MHC class I system.
Bioinformatics 32:511–517. https://doi.
org/10.1093/bioinformatics/btv639
62. Jurtz V, Paul S, Andreatta M et al (2017)
NetMHCpan-4.0: improved peptide-MHC
class I interaction predictions integrating
eluted ligand and peptide binding affinity
data. J Immunol 199:3360–3368. https://
doi.org/10.4049/jimmunol.1700893
63. Karosiene E, Lundegaard C, Lund O, Nielsen
M (2012) NetMHCcons: a consensus
method for the major histocompatibility complex class I predictions. Immunogenetics
64:177–186. https://doi.org/10.1007/s00251-011-0579-8
64. O’Donnell TJ, Rubinsteyn A, Bonsack M et al
(2018) MHCflurry: open-source class I MHC
binding affinity prediction. Cell Syst
7:129–132.e4. https://doi.org/10.1016/j.
cels.2018.05.014
65. Phloyphisut P, Pornputtapong N, Sriswasdi S,
Chuangsuwanich E (2019) MHCSeqNet: a
deep neural network model for universal
MHC binding prediction. BMC Bioinformatics 20:270–210. https://doi.org/10.1186/
s12859-019-2892-4
66. Venkatesh G, Grover A, Srinivasaraghavan G,
Rao S (2020) MHCAttnNet: predicting
MHC-peptide bindings for MHC alleles classes I and II using an attention-based deep
neural model. Bioinformatics 36:i399–i406.
https://doi.org/10.1093/bioinformatics/
btaa479
67. Maccari G, Robinson J, Ballingall K et al
(2017) IPD-MHC 2.0: an improved interspecies database for the study of the major
histocompatibility complex. Nucleic Acids
Res 45:D860–D864. https://doi.org/10.
1093/nar/gkw1050
68. Shugay M, Bagaev DV, Zvyagin IV et al
(2018) VDJdb: a curated database of T-cell
receptor sequences with known antigen specificity. Nucleic Acids Res 46:D419–D427.
https://doi.org/10.1093/nar/gkx760
69. Tickotsky N, Sagiv T, Prilusky J et al (2017)
McPAS-TCR: a manually curated catalogue of
pathology-associated T cell receptor
sequences. Bioinformatics 33:2924–2929.
https://doi.org/10.1093/bioinformatics/btx286
70. Armstrong DR, Berrisford JM, Conroy MJ
et al (2020) PDBe: improved findability of
macromolecular structure data in the PDB.
Nucleic Acids Res 48:D335–D343. https://
doi.org/10.1093/nar/gkz990
71. Velankar S, Alhroub Y, Best C et al (2012)
PDBe: Protein Data Bank in Europe. Nucleic
Acids Res 40:D445–D452. https://doi.org/
10.1093/nar/gkr998
72. Gutmanas A, Alhroub Y, Battle GM et al
(2014) PDBe: Protein Data Bank in Europe.
Nucleic Acids Res 42:D285–D291. https://
doi.org/10.1093/nar/gkt1180
73. Velankar S, Best C, Beuth B et al (2010)
PDBe: Protein Data Bank in Europe. Nucleic
Acids Res 38:D308–D317. https://doi.org/
10.1093/nar/gkp916
74. Wong WK, Marks C, Leem J et al (2020)
TCRBuilder: multi-state T-cell receptor structure
prediction. Bioinformatics 36:3580–3581.
https://doi.org/10.1093/bioinformatics/btaa194
75. Raman S, Vernon R, Thompson J et al (2009)
Structure prediction for CASP8 with all-atom
refinement using Rosetta. Proteins 77 Suppl 9:89–99.
https://doi.org/10.1002/prot.22540
76. Mazza C, Auphan-Anezin N, Gregoire C et al
(2007) How much can a T-cell antigen receptor adapt to structurally distinct antigenic
peptides? EMBO J 26:1972–1983. https://
doi.org/10.1038/sj.emboj.7601605
77. Buckle AM, Borg NA (2018) Integrating
experiment and theory to understand
TCR-pMHC dynamics. Front Immunol
9:2898. https://doi.org/10.3389/fimmu.
2018.02898
78. Giguère S, Drouin A, Lacoste A et al (2013)
MHC-NP: predicting peptides naturally processed by the MHC. J Immunol Methods
400-401:30–36. https://doi.org/10.1016/
j.jim.2013.10.003
79. Paul S, Karosiene E, Dhanda SK et al (2018)
Determination of a predictive cleavage motif
for eluted major histocompatibility complex
class II ligands. Front Immunol 9:1795.
https://doi.org/10.3389/fimmu.2018.
01795
80. Wang Z, Sun H, Yao X et al (2016) Comprehensive evaluation of ten docking programs
on a diverse set of protein-ligand complexes:
the prediction accuracy of sampling power
and scoring power. Phys Chem Chem Phys
18:12964–12975.
https://doi.org/10.
1039/c6cp01555g
81. Gathiaka S, Liu S, Chiu M et al (2016) D3R
grand challenge 2015: evaluation of proteinligand pose and affinity predictions. J Comput
Aided Mol Des 30:651–668. https://doi.
org/10.1007/s10822-016-9946-8
82. Mey ASJS, Juárez-Jiménez J, Hennessy A,
Michel J (2016) Blinded predictions of binding modes and energies of HSP90-α ligands
for the 2015 D3R grand challenge. Bioorg
Med Chem 24:4890–4899. https://doi.
org/10.1016/j.bmc.2016.07.044
83. Xu X, Ma Z, Duan R, Zou X (2019) Predicting protein-ligand binding modes for CELPP
and GC3: workflows and insight. J Comput
Aided Mol Des 33:367–374. https://doi.
org/10.1007/s10822-019-00185-0
84. Pagadala NS, Syed K, Tuszynski J (2017)
Software for molecular docking: a review. Biophys Rev 9:91–102. https://doi.org/10.
1007/s12551-016-0247-1
85. Kontoyianni M, McClellan LM, Sokol GS
(2004) Evaluation of docking performance:
comparative data on docking algorithms. J
Med Chem 47:558–565. https://doi.org/
10.1021/jm0302997
86. Khan JM, Ranganathan S (2010) pDOCK: a
new technique for rapid and accurate docking
of peptide ligands to major histocompatibility
complexes. Immunome Res 6 Suppl 1:S2.
https://doi.org/10.1186/1745-7580-6S1-S2
87. Rigo MM, Antunes DA, de Freitas MV et al
(2015) DockTope: a web-based tool for automated pMHC-I modelling. Sci Rep 5:18413.
https://doi.org/10.1038/srep18413
88. London N, Raveh B, Cohen E et al (2011)
Rosetta FlexPepDock web server--high resolution modeling of peptide-protein interactions. Nucleic Acids Res 39:W249–W253.
https://doi.org/10.1093/nar/gkr431
89. Kyeong H-H, Choi Y, Kim H-S (2018) GradDock: rapid simulation and tailored ranking
functions for peptide-MHC class I docking.
Bioinformatics 34:469–476. https://doi.
org/10.1093/bioinformatics/btx589
90. Park M-S, Park SY, Miller KR et al (2013)
Accurate structure prediction of peptideMHC complexes for identifying highly immunogenic antigens. Mol Immunol 56:81–90.
https://doi.org/10.1016/j.molimm.2013.
04.011
91. Yanover C, Bradley P (2011) Large-scale
characterization of peptide-MHC binding
landscapes with structural simulations. PNAS
108:6981–6986. https://doi.org/10.1073/
pnas.1018165108
92. Abella JR, Antunes DA, Clementi C, Kavraki
LE (2019) APE-gen: a fast method for generating ensembles of bound peptide-MHC
conformations. Molecules 24:881. https://
doi.org/10.3390/molecules24050881
93. Bordner AJ, Abagyan R (2006) Ab initio prediction of peptide-MHC binding geometry
for diverse class I MHC allotypes. Proteins
63:512–526.
https://doi.org/10.1002/
prot.20831
94. Antunes DA, Devaurs D, Moll M et al (2018)
General prediction of peptide-MHC binding
modes using incremental docking: a proof of
concept. Sci Rep 8:4327. https://doi.org/
10.1038/s41598-018-22173-4
95. Dhanik A, McMurray JS, Kavraki LE (2013)
DINC: a new AutoDock-based protocol for
docking large ligands. BMC Struct Biol 13
(Suppl 1):S11–S14. https://doi.org/10.
1186/1472-6807-13-S1-S11
96. Antunes DA, Moll M, Devaurs D et al (2017)
DINC 2.0: a new protein-peptide docking
webserver using an incremental approach.
Cancer Res 77:e55–e57. https://doi.org/
10.1158/0008-5472.CAN-17-0511
97. Antes I, Siu SWI, Lengauer T (2006)
DynaPred: a structure and sequence based
method for the prediction of MHC class I
binding peptide sequences and conformations. Bioinformatics 22:e16–e24. https://
doi.org/10.1093/bioinformatics/btl216
98. Abagyan R, Totrov M, Kuznetsov D (1994)
ICM - a new method for protein modeling and
design - applications to docking and structure
prediction from the distorted native conformation. J Comput Chem 15:488–506.
https://doi.org/10.1002/jcc.540150503
99. Abagyan RA, Totrov M (1999) Ab initio folding of peptides by the optimal-bias
Monte Carlo minimization procedure. J
Comput Phys 151:402–421.
https://doi.org/10.1006/jcph.1999.6233
100. Nemethy G, Gibson KD, Palmer KA et al
(1992) Energy parameters in polypeptides.
10. Improved geometrical parameters and
nonbonded interactions for use in the
ECEPP/3 algorithm, with application to
proline-containing peptides. J Phys Chem
96:6472–6484. https://doi.org/10.1021/j100194a068
101. Rudolph MG, Shen LQ, Lamontagne SA et al
(2004) A peptide that antagonizes
TCR-mediated reactions with both syngeneic
and allogeneic agonists: functional and structural aspects. J Immunol 172:2994–3002.
https://doi.org/10.4049/jimmunol.172.5.
2994
102. Rückert C, Fiorillo MT, Loll B et al (2006)
Conformational dimorphism of self-peptides
and molecular mimicry in a disease-associated
HLA-B27 subtype. J Biol Chem 281:2306–2316.
https://doi.org/10.1074/jbc.M508528200
103. Meijers R, Lai C-C, Yang Y et al (2005) Crystal structures of murine MHC class I H-2 D
(b) and K(b) molecules in complex with CTL
epitopes from influenza A virus: implications
for TCR repertoire selection and immunodominance. J Mol Biol 345:1099–1110.
https://doi.org/10.1016/j.jmb.2004.11.
023
104. Trott O, Olson AJ (2010) AutoDock Vina:
improving the speed and accuracy of docking
with a new scoring function, efficient optimization, and multithreading. J Comput Chem
31:455–461. https://doi.org/10.1002/jcc.
21334
105. Morris GM, Huey R, Lindstrom W et al
(2009) AutoDock4 and AutoDockTools4:
automated docking with selective receptor
flexibility. J Comput Chem 30:2785–2791.
https://doi.org/10.1002/jcc.21256
106. Abraham MJ, Murtola T, Schulz R et al
(2015) GROMACS: high performance
molecular simulations through multi-level
parallelism from laptops to supercomputers.
SoftwareX 1-2:19–25
107. Oostenbrink C, Villa A, Mark AE, van Gunsteren WF (2004) A biomolecular force field
based on the free enthalpy of hydration and
solvation: the GROMOS force-field parameter sets 53A5 and 53A6. J Comput Chem
25:1656–1676. https://doi.org/10.1002/
jcc.20090
108. Liu T, Pan X, Chao L et al (2014) Subangstrom accuracy in pHLA-I modeling by
Rosetta FlexPepDock refinement protocol. J
Chem Inf Model 54:2233–2242. https://
doi.org/10.1021/ci500393h
109. Rohl CA, Strauss CEM, Misura KMS, Baker
D (2004) Protein structure prediction using
Rosetta. Methods Enzymol 383:66–93.
https://doi.org/10.1016/S0076-6879(04)
83004-0
110. Khan AR, Baker BM, Ghosh P et al (2000)
The structure and stability of an
HLA-A*0201/octameric tax peptide complex with an empty conserved peptide-N-terminal
binding site. J Immunol 164:6398–6405.
https://doi.org/10.4049/jimmunol.164.12.6398
111. Canutescu AA, Dunbrack RL (2003) Cyclic
coordinate descent: a robotics algorithm for
protein loop closure. Protein Sci 12:963–972.
https://doi.org/10.1110/ps.0242703
112. Schmid N, Eichenberger AP, Choutko A et al
(2011) Definition and testing of the GROMOS force-field versions 54A7 and 54B7.
Eur Biophys J 40:843–856. https://doi.
org/10.1007/s00249-011-0700-9
113. Ting D, Wang G, Shapovalov M et al (2010)
Neighbor-dependent Ramachandran probability distributions of amino acids developed
from a hierarchical Dirichlet process model.
PLoS Comput Biol 6:e1000763. https://doi.
org/10.1371/journal.pcbi.1000763
114. Word JM, Lovell SC, Richardson JS, Richardson DC (1999) Asparagine and glutamine:
using hydrogen atom contacts in the choice
of side-chain amide orientation. J Mol Biol
285:1735–1747. https://doi.org/10.1006/
jmbi.1998.2401
115. Eswar N, Eramian D, Webb B et al (2008)
Protein structure modeling with MODELLER. In: Biomolecular simulations. Humana
Press, Totowa, NJ, pp 145–159
116. McRobb FM, Capuano B, Crosby IT et al
(2010) Homology modeling and docking
evaluation of aminergic G protein-coupled
receptors. J Chem Inf Model 50:626–637.
https://doi.org/10.1021/ci900444q
117. Case DA, Cheatham TE, Darden T et al
(2005) The Amber biomolecular simulation
programs. J Comput Chem 26:1668–1688.
https://doi.org/10.1002/jcc.20290
118. Wang J, Cieplak P, Kollman PA (2000) How
well does a restrained electrostatic potential
(RESP) model perform in calculating conformational energies of organic and biological
molecules? J Comput Chem 21:1049–1074.
https://doi.org/10.1002/1096-987X(
200009)21:12<1049::AID-JCC3>3.0.
CO;2-F
119. London N, Movshovitz-Attias D, SchuelerFurman O (2010) The structural basis of
282
Marta A. S. Perez et al.
peptide-protein binding strategies. Structure
18:188–199. https://doi.org/10.1016/j.str.
2009.11.012
120. Wang C, Bradley P, Baker D (2007) Proteinprotein docking with backbone flexibility. J
Mol Biol 373:503–519. https://doi.org/10.
1016/j.jmb.2007.07.050
121. Rohl CA, Strauss CEM, Chivian D, Baker D
(2004) Modeling structurally variable regions
in homologous proteins with rosetta. Proteins
55:656–677.
https://doi.org/10.1002/
prot.10629
122. Kuhlman B, Dantas G, Ireton GC et al (2003)
Design of a novel globular protein fold with
atomic-level
accuracy.
Science
302:1364–1368. https://doi.org/10.1126/
science.1089427
123. ABAGYAN R, Totrov M (1994) Biased probability Monte Carlo conformational searches
and electrostatic calculations for peptides and
proteins. J Mol Biol 235:983–1002. https://
doi.org/10.1006/jmbi.1994.1052
124. Buslepp J, Zhao R, Donnini D et al (2001) T
cell activity correlates with oligomeric
peptide-major histocompatibility complex
binding on T cell surface. J Biol Chem
276:47320–47328.
https://doi.org/10.
1074/jbc.M109231200
125. Fodor J, Riley BT, Borg NA, Buckle AM
(2018) Previously hidden dynamics at the
TCR-peptide-MHC Interface revealed. J
Immunol 200:4134–4145. https://doi.org/
10.4049/jimmunol.1800315
126. Shen M-Y, Sali A (2006) Statistical potential
for assessment and prediction of protein
structures. Protein Sci 15:2507–2524.
https://doi.org/10.1110/ps.062416606
127. Chys P, Chacón P (2013) Random coordinate
descent with spinor-matrices and geometric
filters for efficient loop closure. J Chem
Theory Comput 9:1821–1829. https://doi.
org/10.1021/ct300977f
128. Eastman P, Swails J, Chodera JD et al (2017)
OpenMM 7: rapid development of high performance algorithms for molecular dynamics.
PLoS Comput Biol 13:e1005659. https://
doi.org/10.1371/journal.pcbi.1005659
129. Antunes DA, Abella JR, Hall-Swan S et al
(2020) HLA-arena: a customizable environment for the structural modeling and analysis
of peptide-HLA complexes for cancer immunotherapy. JCO Clin Cancer Inform
4:623–636. https://doi.org/10.1200/CCI.
19.00123
130. Mackerell AD, Bashford D, Bellott M et al
(1998) All-atom empirical potential for
molecular modeling and dynamics studies of
proteins. J Phys Chem B 102:3586–3616.
https://doi.org/10.1021/jp973084f
131. Brooks BR, Brooks CL, Mackerell AD et al
(2009) CHARMM: the biomolecular simulation
program.
J
Comput
Chem
30:1545–1614. https://doi.org/10.1002/
jcc.21287
132. Lee MS, Salsbury FR, Brooks CL (2002)
Novel generalized born methods. J Chem
Phys 116:10606. https://doi.org/10.1063/
1.1480013
133. Lee MS, Feig M, Salsbury FR, Brooks CL
(2003) New analytic approximation to the
standard molecular volume definition and its
application to generalized born calculations. J
Comput Chem 24:1348–1356. https://doi.
org/10.1002/jcc.10272
134. Desmet J, Wilson IA, Joniau M et al (1997)
Computation of the binding of fully flexible
peptides to proteins with flexible side chains.
FASEB J 11:164–172. https://doi.org/10.
1096/fasebj.11.2.9039959
Chapter 14
Molecular Simulation of Stapled Peptides
Victor Ovchinnikov, Aravinda Munasinghe, and Martin Karplus
Abstract
Constrained peptides represent a relatively new class of biologic therapeutics, which have the potential to
overcome several limitations of small-molecule drugs and of designed antibodies. Because of their modest
size, the rational design of such peptides is becoming increasingly amenable to computer simulation; multi-microsecond molecular dynamics (MD) simulations are now routinely possible on consumer-grade graphics
processing units (GPUs). Here, we describe the procedures for performing and analyzing MD simulations of
hydrocarbon-stapled peptides using the CHARMM energy function, in isolation and in complex with a
binding partner, to investigate their conformational properties and to compute changes in their binding
affinity upon mutation.
Key words Molecular dynamics, Peptide design, MMGBSA, Binding free energy, MDM2, p53
1 Introduction
Peptide therapeutics continue to attract pharmaceutical interest as
important alternatives to small-molecule and protein-based treatments [1]. Peptide-based drugs generally have favorable pharmacokinetic profiles and can achieve high specificity and selectivity via
optimization of their amino acid sequence. The main obstacles to
the use of peptides as therapeutics are (1) a potentially high
sequence-dependent propensity for aggregation, (2) often lower
cellular penetration compared to small-molecule therapeutics,
(3) immunogenicity, especially for longer sequences, (4) susceptibility to cellular proteases, and (5) conformational heterogeneity
due to the presence of several rotatable bonds per residue. Obstacles (1)–(3) can be partially overcome by optimizing peptide hydrophobicity, solubility, and amino acid sequence length. However,
overcoming obstacles (4)–(5) generally requires chemical modification. One possibility, which is considered here, is hydrocarbon-based (or other, e.g. urea-based [2]) “stapling,” i.e. introduction of
a cross-link between different parts of the peptide [3]. The stapling
serves to reduce the number of thermally accessible peptide
conformations, which decreases the configurational entropic penalty of binding and therefore improves the binding free energy,
provided that the designed conformation has a high affinity for
the target. Furthermore, the stapling can impart resistance to proteases if the staple sterically hinders protease access or if the stapling
reduces the population of conformations that are susceptible to
proteolysis.
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_14,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
In view of the relatively small size and low conformational
heterogeneity of most stapled peptides and of the ongoing increases
in the speed of molecular dynamics (MD) simulation, meaningful
computational analyses of stapled peptides in isolation and in complex with a protein target are now possible using general-purpose
computing hardware (rather than supercomputing clusters) and
software that is freely available for nonprofit use [4–7].
Consequently, the present protocol, which describes a step-by-step method to simulate hydrocarbon-stapled peptides using MD
simulation and to analyze the simulation data, appears to us especially timely. As examples, we consider two stapled peptide systems:
(1) stapled deca-alanine in solvent and (2) stapled inhibitor peptides bound to the oncoprotein MDM2.
2 Materials
In this section, we list the computer hardware and software used to
perform and analyze the simulations described here (Table 1). All of
the software is available freely to academic or nonprofit users.
(A freely available version of the program CHARMM [5, 8] with
a reduced set of features is sufficient for this protocol.) For-profit
users are encouraged to contact the relevant software publishers to
investigate available licensing options. We note that some of the
software listed here can be replaced with other software with similar
functionality. For example, some features of CHARMM used to
prepare simulation files can be replaced using the psfgen
tool distributed with the program VMD [6], which is widely used
by the NAMD2 [14] community. However, input scripts to psfgen
are not provided with this protocol.
We also note that the simulations described here use the
CHARMM36 energy function [15, 16]. Although the protocol
could be modified to use other energy functions or MD simulation
software (e.g. Amber [17], OPLS [18], and GROMOS [19]), the
modifications are extensive and are not described here, as our focus
is on functionality, not on generality. Users interested in alternative energy functions are encouraged to read the relevant computational literature on stapled peptide simulations [2, 20, 21].
In addition to listing the software required to perform the
simulations and analyses in this protocol, we list all of the input
scripts provided with this protocol in a supplement. These scripts
could be useful starting points for customizing the simulation and
analysis procedures and for applying the methods to different proteins and/or peptides.

Table 1
Computational resources used in this protocol

Computer hardware
AMD Ryzen 3900X CPU
Nvidia GTX 2080 Ti GPU
32 GB DDR4-3200 RAM

| Specialized software | Version | Used for | Availability |
|---|---|---|---|
| CHARMM | 44 | Preparation, MD simulation, and analysis | https://charmm.chemistry.harvard.edu [8] |
| CHARMM topology, parameters | 36 | MD simulation and energy analysis | https://mackerell.umaryland.edu/charmm_ff.shtml [9] |
| VMD | 1.9.3 | Analysis and visualization | www.ks.uiuc.edu/Research/vmd/ [10] |
| OpenMM | a770038f125588ab0cced8cd67f04e5083f24e0d^a | MD simulation and analysis | http://openmm.org [11] |
| MDTraj | 19bf371eb4ab2792ff88c2ad80272e4e54327c82^a | Analysis | http://mdtraj.org [12] |

| General-purpose software | Version | Used for | Availability |
|---|---|---|---|
| GNU Compiler suite | 8 | Compiling CHARMM and OpenMM | Arch Linux distribution |
| Python | 3.8 | MD simulation and analysis | Arch Linux distribution |
| Cuda Toolkit | 11 | MD simulations and analysis | Arch Linux distribution |
| GNU Octave [13] | 6.1 | Data plotting and analysis | Arch Linux distribution |
| Arch Linux rolling release | Kernel 5.8.7 | Operating system | Arch Linux distribution |
| Bash shell | 5.1.0 | Driver/shell scripts | Arch Linux distribution |

Essential scripts

| Name | Purpose | Software | Location (relative to root) |
|---|---|---|---|
| buildvac.inp | Prepare structure in vacuum | CHARMM | util/charmm |
| solvate.inp | Add explicit water to structure | CHARMM | util/charmm |
| watercube.str | Subroutine for solvate.inp | CHARMM | util/charmm |
| mdcube.inp | MD in exp. solvent via CHARMM/OpenMM | CHARMM | util/charmm |
| mdene.inp | Compute energy from MD trajectory | CHARMM | util/charmm |
| radius_gbsw.str | Subroutine for mdene.inp | CHARMM | util/charmm |
| read-smallbox.str | Subroutine for solvate.inp | CHARMM | util/toppar |
| staples-toppar.str | Energy function parameters for staples | CHARMM and OpenMM | util/toppar |
| toppar-waterions.str | Energy function parameters for water | CHARMM and OpenMM | util/toppar |
| align.vmd | Compute RMSD^b from trajectory | VMD | util/vmd |
| rmwat.vmd | Remove explicit water from trajectory | VMD | util/vmd |
| rgyr.vmd | Compute Rgyr^c from trajectory | VMD | util/vmd |
| mdvac.py | MD in vacuum or implicit solvent | Python/OpenMM | util/python |
| mdcube.py | MD in explicit solvent | Python/OpenMM | util/python |
| mdene.py | Energy analysis using OpenMM/MDTraj | Python | util/python |
| mdsasa.py | SASA^d analysis using MDTraj | Python | util/python |
| mddssp.py | Secondary structure analysis using MDTraj | Python | util/python |
| show.m | Plot RMSD or Rgyr | Octave | ala10 |
| dssp.m | Plot peptide helicity | Octave | ala10 |
| dgnp.m, dgnps.m^e | Nonpolar solvation energy difference | Octave | 3V3B |
| dgobc2.m, dgobc2s.m^e | Solvation energy difference | Octave | 3V3B |
| dggbsw.m, dggbsws.m^e | Solvation energy difference | Octave | 3V3B |
| acorr.m | Compute sample correlation length | Octave | util/matlab |
| getchain | Extract a single chain from a PDB file | Bash shell | util |
| addseg | Add a segment name to a PDB file | Bash shell | util |
| toppar.str | Read energy function parameters | CHARMM | util |
| rst2xsc | Convert box size from CHARMM rst file to xsc | bash | util |
| prepare | Prepare simulation files | bash | ala10, 3V3B/complexes, 3V3B/peptides |
| run | Perform MD simulations | bash | ala10, 3V3B/complexes, 3V3B/peptides |
| post | Analyze MD simulations | bash | ala10, 3V3B/complexes, 3V3B/peptides |

^a Git commit ID
^b Root-mean-square deviation
^c Radius of gyration
^d Solvent-accessible surface area
^e Modified for single-trajectory analysis
A compressed archive containing the input and analysis scripts,
along with partial analysis data, is provided as Supporting Material
accompanying this chapter.
3 Methods
In the following, we discuss two examples of stapled peptide simulations: (1) a stapled deca-alanine (Ala10) and (2) a complex
between stapled peptides derived from the p53 protein and the
oncoprotein MDM2 from the PDB entry 3V3B [22]. The two
structures are shown in Fig. 1. These examples should serve as
starting points for simulating stapled peptides in more complex
environments, such as within lipid membranes [23].
Fig. 1 Stapled peptide systems simulated in this protocol: (a) Ala10 with the i,i+4 staple [3]; (b) Ala10 with the i,i+7 staple; (c)–(e) MDM2 protein in complex with three peptides. The peptide backbones are drawn as red
ribbons; to show detail, peptide residues are also drawn in stick representation (hydrogens are omitted); the
MDM2 protein in (c)–(e) is shown as a green ribbon. Several residues that are at the interface with MDM2
are indicated. Peptide chemical structures are given in Tables 2 and 3
3.1 Preliminaries
We assume that the user has access to a Linux computer with a
modern graphical processing unit (GPU) capable of performing
computations via CUDA or OpenCL and that the necessary software packages have been installed (see Table 1). For installation
procedures, the users should visit the web pages listed in the
table. We further assume that the user has basic familiarity with
the general principles of molecular dynamics (MD) simulations,
which will not be discussed in this protocol. The user is referred
to textbooks on the subject [24–26].
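Before running the Python-based steps, it may be convenient to confirm that the Python packages listed in Table 1 can be found by the interpreter. The following is a minimal sketch and is not part of the scripts distributed with this protocol. The module names `openmm`, `simtk.openmm` (the namespace used by older OpenMM builds), and `mdtraj` are assumptions about how the packages were installed, and treating Python 3.8 as a minimum version is likewise an assumption.

```python
import importlib.util
import sys


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised, e.g., when the parent package of a dotted name is absent
            spec = None
        if spec is None:
            missing.append(name)
    return missing


def check_environment():
    """Report which assumed prerequisites are visible to this interpreter."""
    report = {"python_ok": sys.version_info >= (3, 8)}
    # Either OpenMM namespace is acceptable (assumed module names)
    report["openmm_ok"] = len(find_missing(["openmm", "simtk.openmm"])) < 2
    report["mdtraj_ok"] = not find_missing(["mdtraj"])
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'found' if ok else 'MISSING'}")
```

The check only locates the modules; it does not verify the package versions or that a CUDA or OpenCL platform is usable.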
Each example below consists of a preparation stage, an MD
simulation stage, and a post-processing stage. Each of the stages is
itself composed of steps that involve running the software as
described below. For the user’s convenience, most of the steps are
organized into three shell scripts, “prepare,” “run,” and “post,”
which can be run from the command prompt. However, it is
recommended that the user examine the contents of all the scripts
to understand the details of the procedures. In the description of
the steps below, we provide the line number and name of the shell
script associated with the step in the format #[line_number],
[script_name].

Table 2
Stapled Ala10 peptides. Ac and NMet denote acetylation and N-methylation tags, respectively

| Peptide | Sequence & Staple Structure |
|---|---|
| i,i+4 | Ac-A1-A-A-S5-A-A-A-S5-A-A10-NMet |
| i,i+7 | Ac-A1-R8-A-A-A-A-A-A-S5-A10-NMet |

S5 and R8 denote the staple residues defined in step 1 of Subheading 3.2; the hydrocarbon cross-link joining each pair is not drawn here.
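The #[line_number], [script_name] references can be looked up quickly with a small helper. The sketch below is illustrative only; the function name `script_lines` is hypothetical, and the script path is assumed to be given relative to the root of the unpacked archive.

```python
from pathlib import Path


def script_lines(script_path, line_number, context=0):
    """Return (number, text) pairs around a 1-based line of a script.

    `context` is the number of neighboring lines to include on each side.
    """
    lines = Path(script_path).read_text().splitlines()
    if not 1 <= line_number <= len(lines):
        raise ValueError(
            f"{script_path} has {len(lines)} lines; line {line_number} is out of range"
        )
    start = max(1, line_number - context)
    stop = min(len(lines), line_number + context)
    return [(n, lines[n - 1]) for n in range(start, stop + 1)]


if __name__ == "__main__":
    import sys

    # Example call: python show_line.py ala10/prepare 24
    for number, text in script_lines(sys.argv[1], int(sys.argv[2]), context=2):
        print(f"{number:5d}  {text}")
```

For a reference such as "#24, ala10/prepare", the call `script_lines("ala10/prepare", 24, context=2)` would return that line together with two lines on either side.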
3.2 Deca-alanine (Ala10)
In this example, we set up and perform MD simulations of deca-alanine cross-linked using two different hydrocarbon staples, i, i + 4
and i, i + 7 [3], shown in Table 2. This is a simple example designed
to illustrate the basics of system preparation and simulation.
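The i, i + 4 and i, i + 7 labels fix the spacing between the two staple residues. As a minimal sketch (not part of the provided scripts), the compact sequence notation used in step 1 below, e.g. Ala3-S5-Ala3-S5-Ala2, can be expanded into a residue list and checked for the expected spacing. The function names are hypothetical, and the parser assumes that only S5 and R8 occur as staple residues.

```python
import re

# Staple residue names introduced in step 1 of this section
STAPLE_RESIDUES = {"S5", "R8"}


def expand(compact):
    """Expand e.g. 'Ala3-S5-Ala3-S5-Ala2' into a list of residue names."""
    residues = []
    for token in compact.split("-"):
        if token in STAPLE_RESIDUES:
            residues.append(token)
            continue
        match = re.fullmatch(r"([A-Za-z]+?)(\d*)", token)
        if match is None:
            raise ValueError(f"cannot parse token {token!r}")
        name, count = match.group(1), match.group(2)
        residues.extend([name] * (int(count) if count else 1))
    return residues


def staple_spacing(residues):
    """Return the 1-based positions of the two staple residues and their spacing."""
    positions = [i + 1 for i, r in enumerate(residues) if r in STAPLE_RESIDUES]
    if len(positions) != 2:
        raise ValueError("expected exactly two staple residues")
    return positions, positions[1] - positions[0]


if __name__ == "__main__":
    for compact in ("Ala3-S5-Ala3-S5-Ala2", "Ala-R8-Ala6-S5-Ala"):
        residues = expand(compact)
        positions, spacing = staple_spacing(residues)
        print(compact, len(residues), positions, f"i,i+{spacing}")
```

Both sequences expand to ten residues, with the staple residues at positions 4 and 8 (i, i + 4) and at positions 2 and 9 (i, i + 7), respectively.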
1. Decide on the peptide sequence. In this example, we start with
Ala10 and modify two residues in the sequence to introduce a
staple. Following the nomenclature by Verdine and Hilinski
[3], we have prepared a special CHARMM topology and
parameter file (see Note 1) to describe two nonstandard
amino acid residues that correspond to the legs of the staple,
R8 and S5, where the letter describes the chirality at the α
carbon of the residue, and the digit is the number of carbons
in the hydrocarbon chain, excluding the α carbon.
The corresponding sequences are Ala3-S5-Ala3-S5-Ala2
and Ala-R8-Ala6-S5-Ala.
2. Generate peptide structure files in vacuum. The sequences
from the previous step are used as input to the CHARMM
script buildvac.inp (#24, ala10/prepare) along with a flag
(“qstaple”) that determines whether stapling is to be performed. If qstaple=1, the CHARMM script will connect the
staple legs. The script is set up for the two types of staples
considered here. In this example, the deca-alanine coordinates
are generated in α-helical geometry by setting the backbone
dihedral angles (ϕ,ψ) to (−57°, −47°). The peptide N- and
C-termini are acetylated and methylated, respectively.
3. If a simulation in explicit solvent is desired, immerse the structure in a cubic box of explicit TIP3 water, using the CHARMM
script solvate.inp (#26, ala10/prepare). At this point, the structures are ready for MD simulation (see Note 2).
4. Decide whether to perform an MD simulation in implicit or
explicit solvent. For the user’s convenience, we include two
examples of explicit solvent simulation at constant pressure and
temperature (NPT ensemble) [24]; one uses the CHARMM/
OpenMM interface via the CHARMM script “mdcube.inp”
(#28, ala10/run), and the other uses the Python interface to
OpenMM via “mdcube.py” (#33, ala10/run). The two MD
simulations use slightly different simulation parameters, and
the Python/OpenMM simulation uses hydrogen mass repartitioning, which allows the use of a 4 fs time step; this is also
possible with CHARMM/OpenMM but is not illustrated here
(see Notes 3 and 4).
An implicit solvent MD simulation is illustrated in the
Python script “mdvac.py,” which uses the OBC2 Generalized
Born implicit solvent model [27].
In this example, the MD simulations are performed for
200 ns (CHARMM/OpenMM) or 400 ns (Python/
OpenMM). For research purposes, multi-microsecond simulations are typically used [2], although simulation convergence is
dependent on the biological system under study. An example of
calculating statistical errors from simulation data is given as part
of the MDM2/peptide test case.
5. After the MD simulation is complete, the recorded coordinates
from the simulation trajectory are analyzed to compute various
properties of interest. Since the present protocol concerns the
simulations of stapled peptides, the analysis here quantifies the
conformational differences between stapled and unstapled versions of the two peptides.
If one is not interested in the properties of solvent around
the peptide, explicit solvent can be removed from the simulation trajectories. This is accomplished by the VMD script
“rmwat.vmd” (#40, #59, ala10/post).
To quantify the conformational differences between the
peptides, we compute the root-mean-square deviation
(RMSD) of the peptide structure from the starting α-helical
configuration using “align.vmd” (#28, #49, #67, ala10/post),
the radius of gyration (Rgyr) using “rgyr.vmd” (#28, #49, #67,
ala10/post), and perform secondary structure analysis using
the software MDTraj via the Python script “mddssp.py” (#35,
#54, #73, ala10/post) to quantify peptide helicity.
6. The analysis results can be plotted using the GNU Octave
scripts “ala10/show.m” (RMSD and Rgyr) and “ala10/dssp.m” (secondary structure). We note that the Octave scripts are
also compatible with Matlab [28]. The results are shown in
Fig. 2, which shows that stapling generally decreases the RMSD
to the helical structure (Fig. 2a), the peptide radius of gyration
(Fig. 2b), and the probability of extended coil conformation
(Fig. 2c). These results are expected in view of the constraints
provided by the staples. We note that the conformational
ensemble of deca-alanine in solution is generally not α-helical,
but composed of partially unstructured coils [29].

Table 3
Stapled peptides bound to the MDM2 oncoprotein simulated in this study. Inhibition constants Ki are
taken from Chang et al. [30]: † corresponds to peptide ATSP1800, ‡ corresponds to peptide
ATSP3900, and ∗ corresponds to peptide ATSP7342

| Peptide | Sequence & Staple Structure | Ki (nM) |
|---|---|---|
| 1 | Ac-Q17-T-F-X-N-L-W-R-L-L-X-Q-N29-NMet | 25.9† |
| 2 | Ac-L17-T-F-X-H-Y-W-A-Q-L-X-S-A29-NMet | 1.0‡ |
| 3 | Ac-L17-T-A-X-E-Y-W-A-Q-L-X-S-A29-NMet | 536∗ |

X marks the two staple residues, whose hydrocarbon cross-link is not drawn here.
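The quantities plotted in Fig. 2 are simple functions of the trajectory coordinates and of the per-residue secondary structure assignments. The protocol computes them with VMD and MDTraj; the sketch below only illustrates the definitions in plain Python. The function names are hypothetical, the radius of gyration is unweighted unless masses are supplied, and counting the DSSP codes H, G, and I as helix is an assumption (percent coil is taken as 100 minus percent helix, as in the caption of Fig. 2).

```python
import math


def radius_of_gyration(coords, masses=None):
    """Radius of gyration of a set of (x, y, z) coordinates.

    With masses=None all atoms are weighted equally.
    """
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    center = [
        sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)
    ]
    squared = sum(
        m * sum((c[k] - center[k]) ** 2 for k in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(squared / total)


def percent_coil(dssp_frames, helix_codes=("H", "G", "I")):
    """Percent coil, defined as 100 minus percent helix, over all frames.

    `dssp_frames` is an iterable of per-frame sequences of one-letter
    DSSP codes, one code per residue.
    """
    total = 0
    helix = 0
    for frame in dssp_frames:
        for code in frame:
            total += 1
            helix += code in helix_codes
    if total == 0:
        raise ValueError("no secondary structure assignments given")
    return 100.0 - 100.0 * helix / total


if __name__ == "__main__":
    # Two atoms 2 length units apart: each is 1 unit from the center
    print(radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]))
    print(percent_coil(["HHHHCC", "CCCCCC"]))
```

Averaging these values over the frames of the stapled and unstapled trajectories would give the kind of comparison summarized in Fig. 2b and c.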
3.3 MDM2/Peptide Complex
In this example, we set up and perform MD simulations of three
peptides that bind the oncoprotein MDM2 [30]. The peptides
were designed to disrupt competitively the p53/MDM2 protein–
protein interaction; they use a partial sequence of p53 [22].
The stapled peptide sequences are given in Table 3. Using MD
simulation trajectories, we compute approximate binding affinity
changes upon mutating peptide 1 to peptides 2 and 3 [31] and
compare the results with experimental data [30].
Many details of the preparation procedure are the same as in the
previous case, and the descriptions here are therefore shortened. An
important difference is the availability of a high-resolution crystal
structure for one of the complexes [22], which is the basis for all
simulations discussed below.
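For the comparison with experiment, inhibition constants can be converted to approximate binding free energy differences. The sketch below assumes the common relation ΔΔG(1→i) ≈ RT ln(Ki,i/Ki,1) and a temperature of 298.15 K; neither the exact conversion nor the temperature is specified in this protocol, so both are labeled assumptions here. The Ki values are those of Table 3.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

# Inhibition constants from Table 3, in nM
KI_NM = {1: 25.9, 2: 1.0, 3: 536.0}


def ddg_from_ki(ki_reference, ki_mutant, temperature=298.15):
    """Approximate ddG (kcal/mol) for the mutation reference -> mutant.

    Assumes dG = RT ln(Ki / c0); the standard concentration c0 cancels
    in the difference, so only the ratio of the two Ki values matters.
    """
    return R_KCAL * temperature * math.log(ki_mutant / ki_reference)


if __name__ == "__main__":
    for mutant in (2, 3):
        value = ddg_from_ki(KI_NM[1], KI_NM[mutant])
        print(f"ddG(1->{mutant}) = {value:+.2f} kcal/mol")
```

Under these assumptions, the experimental values are roughly −1.9 kcal/mol for the mutation 1→2 and +1.8 kcal/mol for 1→3.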
Fig. 2 Comparison of simulation statistics between stapled and unstapled peptides to show conformational
differences: (a) RMSD from the starting (α-helical) conformation; (b) radius of gyration; and (c) percent coil
(defined as 100 minus percent helix, as computed by the DSSP algorithm [7]); the colored bars correspond to
simulations with connected staple legs, and the transparent bars correspond to peptides with unstapled legs.
Error bars represent one standard deviation; they are indicated in only one direction for clarity. PyOMM corresponds to the Python/OpenMM interface, and ChOMM corresponds to the CHARMM/OpenMM interface
1. Download the structure 3V3B from the Protein Data Bank
(PDB) (#12, 3V3B/complexes/prepare). The protein MDM2
(chain A) has one missing N-terminal residue, and the peptide
(chain D) has three missing N-terminal residues. The missing
residues appear not to be involved in the stability of the
complex and are therefore omitted from the simulation (see
Note 5).
2. Ensure that the staple atoms in the PDB structure are named
consistently with the topology file “util/toppar/staples-toppar.str,” and store the stapled peptide coordinates in a new
file SAH7.pdb (#22–#27, 3V3B/complexes/prepare).
3. Extract chain A (MDM2) from the PDB file, and set the segment
name to MDM2 (#44, 3V3B/complexes/prepare).
4. Specify the desired peptide sequences to test for binding (#50,
3V3B/complexes/prepare). The sequences here are mutants
of the original PDB sequence; missing coordinates for mismatched residues will be generated by CHARMM.
5. For each mutant peptide sequence, generate simulation structure files that can be used for MD simulations in vacuum (#74,
3V3B/complexes/prepare).
6. If MD simulations in explicit water are desired, add solvent to
the structure (#75, 3V3B/complexes/prepare). Our experience indicates that simulations in explicit solvent generally give superior
results to those in implicit solvent, in particular as regards
protein structure stability.
7. For each peptide sequence, perform an MD simulation in
implicit or explicit solvent (#32, #37, #42, 3V3B/complexes/
run). For the purpose of this protocol, as in the case of Ala10,
we performed 400 ns-long simulations in explicit solvent;
however, longer simulations are strongly recommended to
improve statistical sampling (discussed below).
8. Repeat the above steps to simulate the stapled peptides in
isolation from MDM2; these substantially identical steps are
performed in the shell scripts “3V3B/peptides/prepare” and
“3V3B/peptides/run.”
9. If needed, repeat the above steps to simulate the MDM2
protein in isolation from the peptides (3V3B/protein/prepare
and 3V3B/protein/run). This step is not required if only the
changes in the binding affinity upon mutation are desired
because of cancellation of terms (see Eqs. 1 and 2 below).
10. Free energy differences due to mutations will be computed
using the Molecular Mechanics Generalized Born Surface
Area (MMGBSA) approach [31]. In the MMGBSA analysis,
the free energy of protein/peptide binding is approximated
using the equation

\[
\begin{aligned}
\Delta G_{\mathrm{bind}} ={}& \left(\bar{E}^{\mathrm{vdW}}_{\mathrm{cmplx}} + \bar{E}^{\mathrm{elec}}_{\mathrm{cmplx}} + \bar{E}^{\mathrm{GBorn}}_{\mathrm{cmplx}} + \gamma\,\overline{SA}_{\mathrm{cmplx}}\right) \\
&- \left(\bar{E}^{\mathrm{vdW}}_{\mathrm{prot}} + \bar{E}^{\mathrm{elec}}_{\mathrm{prot}} + \bar{E}^{\mathrm{GBorn}}_{\mathrm{prot}} + \gamma\,\overline{SA}_{\mathrm{prot}}\right) \\
&- \left(\bar{E}^{\mathrm{vdW}}_{\mathrm{pept}} + \bar{E}^{\mathrm{elec}}_{\mathrm{pept}} + \bar{E}^{\mathrm{GBorn}}_{\mathrm{pept}} + \gamma\,\overline{SA}_{\mathrm{pept}}\right).
\end{aligned}
\tag{1}
\]

ΔGbind is the difference between the values of the energy
components of the protein and peptide in complex and that of
the separated protein and peptide. The components are
nonbonded (van der Waals and electrostatic) interaction energies and the polar and nonpolar solvation energies, represented
by $\bar{E}^{\mathrm{GBorn}}$ and $\gamma\,\overline{SA}$, respectively (the overbar denotes trajectory
averaging, and $\overline{SA}$ is the averaged Solvent-Accessible Surface
Area [SASA]). The nonpolar solvation energy calculation was
performed using the standard water probe radius of 1.4 Å to
compute $\overline{SA}$, and γ was set to 0.00542 kcal/mol/Å² [32].
The contribution from the protein and peptide configurational and rototranslational entropy changes is neglected in
Eq. 1, because (i) it is difficult to compute accurately and
precisely [33] and (ii) its differences are expected to be small
for mutations that do not perturb the structures significantly;
such mutations are considered here. If desired, the user can
include entropy differences using harmonic or quasiharmonic
analysis [34].
Binding free energy differences of sequence mutation 1 → i
were computed as

\[
\Delta\Delta G_{i} = \Delta G^{i}_{\mathrm{bind}} - \Delta G^{1}_{\mathrm{bind}}.
\tag{2}
\]

Finally, we note that the energy terms corresponding to the
protein and peptide in separation from each other can be
computed from the trajectory of the bound complex, by alternately deleting the peptide or protein from the trajectory,
respectively, and repeating trajectory analysis. This variant of
the method is called the “single-trajectory” method; it is theoretically incorrect because the conformations of the separated
protein and peptide are drawn from incorrect thermodynamic
ensembles. However, the method is often used because it is less
computationally demanding, since separate MD simulations of
system components are not needed [2]. Both the single- and
the multi-trajectory methods are illustrated here.
11. Explicit solvent should be removed from MD trajectories prior
to solvation energy analysis (#44, 3V3B/complexes/post). MD
simulation trajectories of the protein/peptide complexes in
explicit solvent may have periodic wrapping artifacts, whereby
the peptide or protein alternately appears on opposing sides of
the periodic box. These artifacts need to be corrected prior to
solvation energy analysis. This step is performed using the
“pbctools” package in VMD (#46–64, 3V3B/complexes/
post). Simulations of the protein or a peptide in isolation do
not require this step (see Note 6).
12. Compute the nonbonded interaction energies and the polar
solvation energies from the different trajectories using an
appropriate solvation model. In this protocol, we provide two
alternatives: the OBC2 solvation model [27], available from
OpenMM through either the CHARMM or the Python interface (“mdene.py,” #69, 3V3B/complexes/post), or the
GBSW [35] solvation model in CHARMM/OpenMM
(“mdene.inp,” #72, 3V3B/complexes/post). Other
Generalized Born (GB) models are available in CHARMM,
which could potentially yield more accurate results. However,
they do not currently have GPU acceleration. The user is
encouraged to consult the CHARMM documentation for
details of their use; see Note 7.
13. Compute the SASA from each trajectory from which explicit
solvent has been removed using MDTraj software via the
Python script “mdsasa.py” (#75, 3V3B/complexes/post).
The SASA differences are used to provide the nonpolar contribution to the solvation energy in Eq. 1. The Octave script
“mkdgnp.m” computes the nonpolar solvation energy
differences.
14. Compute the averages in Eq. 1 and the standard error of the
mean (SEM). The calculation of SEM is complicated by the
fact that the trajectory snapshots are correlated, which requires
additional analysis to estimate the number of uncorrelated
samples in the time series. In this protocol, we explicitly set
the size of correlated trajectory blocks to twice the correlation
time scale [36],

\[
t_{\mathrm{block}} = 2 \int_{t_0 = 0}^{t_1 \to \infty} C(t)\,dt,
\tag{3}
\]

where C(t) is the auto-correlation function of the nonbonded
energies, computed using the Fourier transform in Octave via
the script acorr.m (this script is called automatically from the
parent scripts mkdgobc2.m and mkdggbsw.m, depending on
which GB model is used for analysis). In Eq. 3, t1 is taken to be
the time at which the autocorrelation falls below 0.01 (see Note 8).
Typical correlation times computed in this way were
around 10 ns, corresponding to about 40 uncorrelated samples
in a 400 ns trajectory. This relatively small number corresponds
to uncertainties in ΔΔG that exceed 2 kcal/mol (Fig. 3),
making clear the necessity of long MD simulations to reduce
statistical errors.
15. The ΔΔG results are compared to experimental values in
Fig. 3, which is created by the Octave scripts mkdgobc2.m
and mkdggbsw.m. The experimental binding free energies
were approximated from inhibition constants (Ki) reported
by Chang et al. [30]. The second mutant peptide has a cyclobutane amino acid (Cba) at position 26, whereas our sequence
has leucine. However, other mutants in the experimental dataset of Chang et al. [30] suggest that the binding free energy
difference between having these two amino acids at this position is small compared with the differences observed in this
protocol.
Both GB models, OBC2 [27] and GBSW [35], gave
results that are qualitatively consistent with the experimental
values, but OBC2 produced values that are in better agreement
with experiment (Fig. 3). This difference underscores the
importance of trying several GB models to gain more confidence in the simulation results. Furthermore, because of the
large statistical uncertainties, the values reported here should
be considered qualitative illustrations of the protocol. Much
longer simulations would be needed to obtain quantitative
results.
Fig. 3 Binding free energy differences of peptide mutations: (a) OBC2 model, separate MD trajectories; (b)
GBSW model, separate MD trajectories; (c) OBC2 model, single MD trajectory; and (d) GBSW model, single MD
trajectory. Note that the energy scale is different for the subplots. Experimental values are taken from Chang
et al. [30] (see Table 3). The results were obtained from the same explicit solvent simulations performed using
the Python interface to OpenMM; the differences are in the choice of GB model used to compute free energies
and whether the single or multiple trajectory method was used (see text). Each panel plots ΔΔG (kcal/mol) for ΔΔG1→2 and ΔΔG1→3; the legend entries are “Experiment” and “Exp. MD; ΔΔG from OBC2” or “Exp. MD; ΔΔG from GBSW”
Fig. 4 Conformational properties of MDM2/peptide simulations in explicit water: (a) RMSD of MDM2/peptide
complexes; (b) RMSD of peptides without MDM2; (c) percent coil for peptides without MDM2; structure of
peptide #1 (d) at t = 0 ns; and (e) at t = 390 ns
16. If desired, conformational properties of the simulation structures can be computed, as done for deca-alanine (e.g. RMSD
and secondary structure analysis, #51 and #62 in 3V3B/peptides/post, respectively). For example, in Fig. 4a, we show that
the MDM2/peptide trajectories are stable for the duration of
the simulations. The peptides in solvent partially unwind at the
termini, e.g. Fig. 4e, which explains the higher RMSD shown in
Fig. 4b. However, they maintain greater helicity than the deca-alanine stapled peptides (Fig. 4c vs. Fig. 2c).
4
Notes
1. The parameter files used here were prepared manually by analogy with existing CHARMM parameters for lysine, 1- and
2-butenes, and propene. An alternative (automatic) method
for CHARMM-compatible parameter generation involves submitting the desired chemical structure to the CGENFF [37]
website, www.paramchem.org (accessed December 25, 2020).
More sophisticated parametrization approaches, involving fitting to quantum mechanical calculations, may be needed for
chemical modifications other than hydrocarbon staples or to
improve parameter accuracy [38]. We also note that other
energy functions and simulation software can be used for MD
simulations [17–19] and for stapled peptides, in particular [2].
2. In this protocol, for simplicity, we do not add ions to the
explicit solvent, as is usually done to ensure that the simulation
system is electrostatically neutral. In this case, the long-range
electrostatic solvers used by MD programs (e.g. particle mesh
Ewald) impose “tin-foil” boundary conditions (see e.g. Ref. 39
for a discussion of electrostatics in free energy simulations).
Although charge neutralization is a common practice, its omission is not expected to influence the present results significantly
because computation of interaction energies is performed
using an implicit solvation model without long-range electrostatic forces.
3. Using 4 fs integration steps in CHARMM is possible after
explicit hydrogen mass repartitioning (HMR), i.e. the mass of
each hydrogen atom is increased to at most 4 a.m.u., and the
mass of the corresponding parent heavy atom is decreased to
preserve the total mass (see scalar.doc in the CHARMM documentation). Note that HMR requires rigid bonds (e.g. using
the SHAKE method [40]).
4. Other simulation parameters were as follows. The cutoff for
van der Waals (vdW) and near-space electrostatic calculations
was 10 Å. The vdW interactions were smoothly attenuated to
zero in the range of 8.5–10 Å using the CHARMM VSWITCH
function (CHARMM/OpenMM) or OpenMM switching
function (Python/OpenMM). Long-range electrostatics were
treated using fourth-order PME with default OpenMM parameters. The Langevin dynamics integrator was used to maintain the temperature at 298 K using a friction constant of 0.1/
ps, and the Monte Carlo barostat from OpenMM was used to
maintain the pressure at 1 atm. All bonds involving hydrogen
were treated as rigid.
5. In structures with missing internal (or otherwise important)
residues, the user may need to use modeling software such as
Modeller [41] or Rosetta [42] as an additional
preparation step.
6. In the MD simulations performed here, the connectivity of
each protein chain is preserved when coordinates are wrapped
across boundaries, which implies that only inter-protein (but
not intra-protein) distances can be affected by wrapping. Thus,
in simulations of single amino acid chains (e.g. protein or
peptide in solvent), wrapping does not change the energy
computed from the GB analysis.
7. Generalized Born (GB) models are developed to provide fast
pairwise approximations to the solution of the Poisson–Boltzmann (PB) equation. Some users may prefer to use a Poisson–
Boltzmann solver [5, 43–45] in place of GB. However, just as
there are important tunable parameters that enter into GB
models (such as GB radii), PB solvers also have parameters,
such as the type of solver used, resolution, atomic radii, dielectric constant, or Stern layer thickness [5, 43, 45].
For this protocol, we set the nonbonded cutoff distance for
the GB energy analysis to 20 Å. However, larger cutoffs may be
desirable (which will increase the computational cost of the
analysis).
8. Other methods of correlated trajectory analysis exist, such as
block bootstrap [46], or checking for the correct asymptotic
behavior of the standard error of the mean (SEM) as a function
of the number of independent samples, e.g. finding a range of
block sizes for which the SEM decreases as the reciprocal
square root of the number of blocks [25].
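The block-size check described in Note 8 can be sketched in a few lines. The example below is a minimal illustration and not part of the protocol scripts: the function names and the synthetic correlated series (a stand-in for an energy time series from MD) are assumptions. For block sizes longer than the correlation time, the block averages behave as independent samples and the estimated SEM stops growing with block size.

```python
import math
import random


def block_sem(samples, block_size):
    """SEM of the mean estimated from non-overlapping block averages."""
    n_blocks = len(samples) // block_size
    if n_blocks < 2:
        raise ValueError("need at least two blocks")
    means = [
        sum(samples[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(n_blocks)
    ]
    grand_mean = sum(means) / n_blocks
    variance = sum((m - grand_mean) ** 2 for m in means) / (n_blocks - 1)
    return math.sqrt(variance / n_blocks)


def ar1_series(n, phi, seed=0):
    """Hypothetical correlated series x[t] = phi * x[t-1] + Gaussian noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


if __name__ == "__main__":
    series = ar1_series(20000, phi=0.9)
    # A block size of 1 treats every frame as independent and underestimates the SEM.
    for block_size in (1, 10, 100, 1000):
        print(block_size, round(block_sem(series, block_size), 4))
```

In practice the series would be the per-frame GB energies, and the SEM would be reported from the range of block sizes where the estimate is stable.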
References
1. Fosgerau K, Hoffmann T (2015) Peptide therapeutics: current status and future directions.
Drug Discov Today 20(1):122–128. https://
doi.org/10.1016/j.drudis.2014.10.003
2. Cornillie S, Bruno B, Lim C, Cheatham T
(2018) Computational modeling of stapled
peptides toward a treatment strategy for CML
and broader implications in the design of
lengthy peptide therapeutics. J Phys Chem B 122(14):3864–3875. https://doi.org/10.1021/acs.jpcb.8b01014
3. Verdine GL, Hilinski GJ (2012) Stapled peptides for intracellular drug targets, vol 503, 1st
edn. Elsevier Inc., Amsterdam. https://doi.org/10.1016/B978-0-12-396962-0.00001-X
4. Friedrichs M, Eastman P, Vaidyanathan V,
Houston M, Legrand S, Beberg A, Ensign D,
Bruns C, Pande V (2009) Accelerating molecular dynamic simulation on graphics processing
units. J Comput Chem 30:864–872
5. Brooks B, Brooks III C, Mackerell Jr A,
Nilsson L, Petrella R, Roux B, Won Y,
Archontis G, Bartels C, Boresch S, et al
(2009) CHARMM: the biomolecular simulation program. J Comput Chem 30:
1545–1614. PMC2810661
6. Humphrey W, Dalke A, Schulten K (1996)
VMD - visual molecular dynamics. J Mol Graphics 14:33–38
7. McGibbon R, Beauchamp K, Harrigan M,
Klein C, Swails J, Hernández C, Schwantes C,
Wang L, Lane T, Pande V (2015) MDTraj: a
modern open library for the analysis of molecular dynamics trajectories. Biophys J 109(8):
1528–1532. https://doi.org/10.1016/j.
bpj.2015.08.015
8. (2020) CHARMM development project.
https://charmm.chemistry.harvard.edu.
Accessed 25 Dec 2020
9. (2020) CHARMM force field. https://
mackerell.umaryland.edu/charmm_ff.shtml.
Accessed 25 Dec 2020
10. (2020) Visual molecular dynamics. https://
www.ks.uiuc.edu/Research/vmd/. Accessed
25 Dec 2020
11. (2020) OpenMM. http://openmm.org.
Accessed 25 Dec 2020
12. (2020) MDTraj. http://mdtraj.org.
Accessed 25 Dec 2020
13. Eaton JW, Bateman D, Hauberg S, Wehbring
R (2015) GNU octave version 4.0.0 manual: a
high-level interactive language for numerical
computations. http://www.gnu.org/software/octave/doc/interpreter
14. Phillips J, Braun R, Wang W, Gumbart J,
Tajkhorshid E, Villa E, Chipot C, Skeel R,
Kale L, Schulten K (2005) Scalable molecular
dynamics with NAMD. J Comput Chem 26:
1781–1802
15. Best R, Zhu X, Shim J, Lopes P, Mittal J,
Feig M, MacKerell Jr A (2012) Optimization
of the additive CHARMM all-atom protein
force field targeting improved sampling of the
backbone ϕ, ψ and side-chain χ 1 and χ 2 dihedral
angles. J Chem Theor Comput 8:3257–3273
16. Guvench O, Mallajosyula S, Raman E,
Hatcher E, Vanommeslaeghe K, Foster T,
Jamison F, Mackerell A (2011) CHARMM
additive all-atom force field for carbohydrate
derivatives and its utility in polysaccharide and
carbohydrate-protein modeling. J Chem Theor
Comput 7(10):3162–3180. https://doi.
org/10.1021/ct200328p
17. Pearlman DA, Case DA, Caldwell JW, Ross
WS, Cheatham TE, DeBolt S, Ferguson D,
Seibel G, Kollman P (1995) Amber, a package
of computer programs for applying molecular
mechanics, normal mode analysis, molecular
dynamics and free energy calculations to simulate the structural and energetic properties of
molecules. Comput Phys Commun 91(1):
1–41. https://doi.org/10.1016/0010-4655(95)00041-D. http://www.sciencedirect.com/science/article/pii/001046559500041D
18. Shivakumar D, Harder E, Damm W, Friesner
RA, Sherman W (2012) Improving the prediction of absolute solvation free energies using
the next generation OPLS force field. J Chem
Theory Comput 8(8):2553–2558. https://
doi.org/10.1021/ct300203w
19. Hess B, Kutzner C, van der Spoel D, Lindahl E
(2008) GROMACS 4: algorithms for highly
efficient, load-balanced, and scalable molecular
simulation. J Chem Theor Comput 4(3):
435–447. https://doi.org/10.1021/ct700301q. http://pubs.acs.org/doi/pdf/10.1021/ct700301q
20. Brown CJ, Quah ST, Jong J, Goh AM, Chiam
PC, Khoo KH, Choong ML, Lee Ma,
Yurlova L, Zolghadr K, Joseph TL, Verma CS,
Lane DP (2013) Stapled peptides with
improved potency and specificity that activate
P53. ACS Chem Biol 8(3):506–512. https://
doi.org/10.1021/cb3005148
21. Morrone J, Perez A, Deng Q, Ha S,
Holloway M, Sawyer T, Sherborne B,
Brown F, Dill K (2017) Molecular simulations
identify binding poses and approximate affinities of stapled α-helical peptides to MDM2
and MDMX. J Chem Theor Comput 13(2):
863–869. https://doi.org/10.1021/acs.jctc.
6b00978
22. Baek S, Kutchukian PS, Verdine GL, Huber R,
Holak Ta, Lee KW, Popowicz GM (2012)
Structure of the stapled P53 peptide bound to
MDM2. J Am Chem Soc 134(1):103–106.
https://doi.org/10.1021/ja2090367
23. Ovchinnikov V, Stone TA, Deber C, Karplus M
(2018) Structure of the EmrE multidrug transporter and its use for inhibitor peptide design.
Proc Natl Acad Sci USA 115(34):E7942
24. Frenkel D, Smit B (2001) Understanding
molecular simulation: from algorithms to
applications, 2nd edn. Academic, San Diego
25. Allen MP, Tildesley DJ (1989) Computer simulation of liquids. Clarendon Press,
New York, NY
26. Rapaport DC (1996) The art of molecular
dynamics simulation. Cambridge University
Press, New York, NY
27. Onufriev A, Bashford D, Case D (2004)
Exploring protein native states and large-scale
conformational changes with a modified
Generalized Born model. Proteins 55(2):
383–394. https://doi.org/10.1002/prot.
20033
28. MATLAB (2010) Version 7.10.0 (R2010a).
The MathWorks Inc., Natick, MA
29. Hazel A, Chipot C, Gumbart J (2014) Thermodynamics of deca-alanine folding in water. J
Chem Theor Comput 10(7):2836–2844.
https://doi.org/10.1021/ct5002076
30. Chang Y, Graves B, Guerlavais V, Tovar C,
Packman K, To K, Olson K, Kesavan K,
Gangurde P, Mukherjee A, Baker T, Darlak K,
Elkin C, Filipovic Z, Qureshi F, Cai H, Berry P,
Feyfant E, Shi X, Horstick J, Annis D,
Manning A, Fotouhi N, Nash H, Vassilev L,
Sawyer T (2013) Stapled α-helical peptide drug
development: a potent dual inhibitor of
MDM2 and MDMX for p53-dependent cancer
therapy. Proc Natl Acad Sci USA 110(36):
E3445–3454. https://doi.org/10.1073/
pnas.1303002110
31. Brice A, Dominy B (2011) Analyzing the
robustness of the MM/PBSA free energy calculation method: application to DNA conformational transitions. J Comput Chem 32(2):
1431–1440
32. Srinivasan J, Cheatham TE, Cieplak P, Kollman
PA, Case DA (1998) Continuum solvent studies of the stability of DNA, RNA, and phosphoramidate–DNA helices. J Am Chem Soc
120(37):9401–9409
33. Ovchinnikov V, Cecchini M, Karplus M (2013)
A simplified confinement method (SCM) for
calculating absolute free energies and free
energy and entropy differences. J Phys Chem
B 117:750–762. https://doi.org/10.1021/
jp3080578. PMC3569517
34. Brooks B, Janežič D, Karplus M (1995) Harmonic analysis of large systems. I. Methodology. J Comput Chem
16:1522–1542
35. Im W, Feig M, Brooks III C (2003) An implicit
membrane generalized Born theory for the
study of structure, stability, and interactions of
membrane proteins. Biophys J 85:2900–2918
36. Shirts M (2012) Best practices in free energy
calculations for drug design. Methods Mol Biol
819:425–467. https://doi.org/10.1007/
978-1-61779-465-0_26
37. Vanommeslaeghe K, Hatcher E, Acharya C,
Kundu S, Zhong S, Shim J, Darwin E,
Guvench O, Lopes P, Vorobyev I, MacKerell
Jr A (2009) CHARMM general force field: a
force field for drug-like molecules compatible
with the CHARMM all-atom additive
biological force fields. J Comput Chem 31:
671–690
38. Mayne CG, Saam J, Schulten K, Tajkhorshid E,
Gumbart JC (2013) Rapid parameterization of
small molecules using the force field toolkit. J
Comput Chem 34(32):2757–2770. https://
doi.org/10.1002/jcc.23422
39. Simonson T, Roux B (2016) Concepts and
protocols for electrostatic free energies. Mol
Simul 42(13):1090–1101. https://doi.
org/10.1080/08927022.2015.1121544
40. Ryckaert JP, Ciccotti G, Berendsen H (1977)
Numerical integration of the Cartesian equations of motion of a system with constraints:
molecular dynamics of n-alkanes. J Comput
Phys 23:327–341
41. Eswar N, Webb B, Marti-Renom M,
Madhusudhan M, Eramian D, Shen M,
Pieper U, Sali A (2006) Comparative protein
structure modeling using modeller. Curr Prot
Bioinf 54: 5.6.1–5.6.37. https://doi.org/10.1002/0471250953.bi0506s15
42. Leaver-Fay A, Tyka M, Lewis S, Lange O,
Thompson J, Jacak R, Kaufman K, Renfrew P,
Smith C, Sheffler W, Davis I, Cooper S,
Treuille A, Mandell D, Richter F, Ban Y,
Fleishman S, Corn J, Kim D, Lyskov S,
Berrondo M, Mentzer S, Popović Z,
Havranek J, Karanicolas J, Das R, Meiler J,
Kortemme T, Gray J, Kuhlman B, Baker D,
Bradley P (2011) Rosetta3: an object-oriented
software suite for the simulation and design of
macromolecules. Methods Enzymol 487:
545–574. https://doi.org/10.1016/B978-0-12-381270-4.00019-6
43. Li L, Li C, Sarkar S, Zhang J, Witham S,
Zhang Z, Wang L, Smith N, Petukh M, Alexov
E (2012) DelPhi: a comprehensive suite for
DelPhi software and associated resources.
BMC Biophysics 5:9. https://doi.org/10.1186/2046-1682-5-9
44. Roux B (1997) Influence of the membrane
potential on the free energy of an intrinsic protein. Biophys J 73:2980–2989
45. Baker N, Sept D, Joseph S, Holst M, McCammon J (2001) Electrostatics of nanosystems:
application to microtubules and the ribosome.
Proc Natl Acad Sci USA 98:10037–10041
46. Zoubir AM, Boashash B (1998) The bootstrap
and its application in signal processing. IEEE
Signal Process Mag 15(1):56–76. https://doi.
org/10.1109/79.647043
Chapter 15
Free Energy-Based Computational Methods for the Study
of Protein-Peptide Binding Equilibria
Emilio Gallicchio
Abstract
This chapter discusses the theory and application of physics-based free energy methods to estimate protein-peptide binding free energies. It presents a statistical mechanics formulation of molecular binding, which is
then specialized into three methodologies: (1) alchemical absolute binding free energy estimation with
implicit solvation, (2) alchemical relative binding free energy estimation with explicit solvation, and
(3) potential of mean force binding free energy estimation. Case studies of protein-peptide binding
applications taken from the recent literature are discussed for each method.
Key words Free energy, Binding free energy, Equilibrium binding constant, Alchemical perturbation,
Potential of mean force, Protein-peptide binding modeling, Molecular dynamics, Molecular recognition, Statistical mechanics
1
Introduction
Peptide and peptide-derived molecules are widely used to target
protein-protein interactions for medicinal purposes and basic
biological research. In-silico models play an increasingly significant
role in the study of protein-peptide interactions. As excellently
reviewed elsewhere, [1–3] computational methods for studying
protein-peptide interactions have evolved on somewhat separate
tracks from those used for small molecule-protein interactions.
These differences are partly due to the greater flexibility and size
of peptides and their tendency to interact with proteins through
many relatively weak interactions. Nevertheless, because the same
fundamental physical forces regulate all molecular recognition phenomena, it is helpful to relate computational models under a standard set of principles.
This chapter is devoted to a class of physics-based free energy
methods considered the most accurate and detailed for modeling
the thermodynamics of molecular binding equilibria. These
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_15,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
methods model the interactions between molecules as well as their
motion at the atomic level. We derive each method discussed from a
well-established statistical mechanics theory of non-covalent
molecular association. The chapter attempts to demystify the theory and the seemingly arcane formulas and computational procedures used in the field and point out the specific features of the
methods that make them more or less suitable for studying protein-peptide interactions.
There is an implicit acknowledgment here that an understanding of these methodologies and how to select and apply them
appropriately cannot be accomplished fully without referring to
the underlying theory. The treatment employed here requires
only a basic familiarity with concepts of statistics (probability distributions, averages, marginalization) and of classical statistical
thermodynamics (classical partition functions and their manipulations, and their relationship with the free energy). After presenting
the theory and methods, we then illustrate their applications by
discussing three case studies. We hope that this format will help
convey the characteristics and relationships between the various
methodologies and the fundamental principles on which they are
based.
2
Statistical Mechanics Formulation
In this section, we derive and discuss a statistical mechanics theory
of molecular binding. The concepts and the formulas expressed
here will be used later to rationalize the specific computational
methods and practices used in the case studies reviewed in
Subheading 3.
We attempt to use unambiguous notation throughout, but
sometimes we adopt a simplified notation to unclutter the equations. For example, in intermediate formulas, we often omit limits
of integration and Jacobian factors for curvilinear coordinates when
they do not affect the form and interpretation of the final result. In
some places, we use function arguments to distinguish two functions. For example, we might denote the ligand and receptor’s
potential energy functions with the same symbol U, as U(xL) and
U(xR), even though they are different mathematical functions.
2.1 The Standard
Free Energy of Binding
We will consider here the reversible non-covalent binding equilibrium between receptor molecules R and ligand molecules L to form
a complex RL in an ideal solution:
$$ R + L \rightleftharpoons RL, \tag{1} $$
with the dimensionless equilibrium constant
$$ K_b = \left. \frac{[RL]/C^\circ}{([R]/C^\circ)([L]/C^\circ)} \right|_{\mathrm{eq}}, \tag{2} $$
where [...] are concentrations, $C^\circ$ is the standard state concentration (conventionally set as 1 M or 1 molecule/1668 Å³), and the
“eq” subscript states that all concentrations are evaluated at equilibrium. The molar standard Gibbs binding free energy, which is
the main objective of the computational models of binding discussed here, is defined as
$$ \Delta G_b^\circ = -k_B T \ln K_b, \tag{3} $$
where $k_B$ is Boltzmann’s constant and $T$ is the absolute temperature in kelvin (in the following we will assume constant temperature
and pressure conditions).
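As a quick numerical illustration of Eq. 3, the sketch below converts a dissociation constant into a standard binding free energy. The function names and the 1 µM example are assumptions chosen for illustration; the only relations used are Eq. 3 and the identity K_b = C°/K_d, which follows from Eq. 2 when K_d is expressed in mol/L and C° = 1 M.

```python
import math

KB_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def k_b_from_kd(kd_molar, c_standard=1.0):
    """Dimensionless binding constant from a dissociation constant in mol/L."""
    if kd_molar <= 0.0:
        raise ValueError("K_d must be positive")
    return c_standard / kd_molar


def standard_binding_free_energy(k_b, temperature=300.0):
    """Eq. 3: standard binding free energy in kcal/mol from the dimensionless K_b."""
    if k_b <= 0.0:
        raise ValueError("K_b must be positive")
    return -KB_KCAL * temperature * math.log(k_b)


if __name__ == "__main__":
    # Hypothetical ligand with K_d = 1 micromolar at 300 K.
    dg = standard_binding_free_energy(k_b_from_kd(1.0e-6))
    print(f"dG = {dg:.2f} kcal/mol")
```

Each tenfold change in the binding constant shifts the free energy by k_B T ln 10, about 1.4 kcal/mol at 300 K, which is a useful yardstick when comparing computed and measured values.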
Implicit in this quasi-chemical description of the binding equilibrium is the idea that the separated species in solution R and L, as
well as the complex RL, are defined in some way. In an experimental
setting, the apparatus used to measure equilibrium concentrations
provides a working definition of the species. The nature of the
experimental reporter used to monitor the formation of the complex is of particular relevance [4]. The change of a spectroscopic
signal, as in NMR and UV/VIS fluorescence assays, [5] likely
probes a set of conformations of the complex in which specific
groups of the receptor and the ligand are in contact. Hence, different spectroscopic reporters would, in general, yield different estimates of the standard free energy of binding [6]. Spectroscopic
reporters stand in contrast to experimental reporters, such as those
in calorimetric, surface plasmon resonance (SPR), amplified luminescent proximity (AlphaScreen), and equilibrium dialysis binding
assays, that probe unspecific molecular association [4, 6–9]. Here,
we focus mainly on computational models that define the complex
using structural means–typically specific distances and angles
between groups of atoms [10]–and are therefore more suitable to
describe measurements of binding constants with specific spectroscopic experimental reporters.
In practice, the association between a peptide ligand and a
protein receptor is also often monitored by indirect biochemical
means, such as enzymatic inhibition [11] or pull-down assays, [12]
that are only indirectly related to the equilibrium binding constant
of the ligand-receptor complex. The computational models’ ability
to reproduce or explain this type of data is expected to be semiquantitative at best, as it would be a correlation between experimental binding constants and activity data.
While ambiguities in relating molecular computer simulations
to experimental biophysical data of molecular binding exist for any
molecular complex, the issue is explicitly discussed here because it is
expected to be particularly widespread for the study of the interactions involving peptides, which are generally more flexible than
most small-molecule drug compounds and engage protein receptors over a large binding surface in a variety of binding modes. It is
useful to keep these issues in mind when designing a computational
model and the answers that one can reasonably extract from
it. Computational modeling can be a valuable tool when used
judiciously by exploiting its strengths while managing its unavoidable limitations.
2.2 Statistical
Mechanics Theory of
Non-covalent
Molecular Binding
Under the assumptions above, Gilson et al. [13] derived a statistical
mechanics expression for the binding constant (Eq. 2) which, with
a few reasonable approximations (discussed below), can be written
as [14]:
$$ K_b = \frac{C^\circ}{8\pi^2} \frac{z_{RL}}{z_R z_L}, \tag{4} $$
where $z_i$ is the intramolecular configurational partition function of
one molecule of species $i$ in solution.
A full derivation of Eq. 4 is beyond the scope of this chapter.
However, it is briefly outlined here to introduce the notation.
Equation 4 is derived by writing the molar standard binding free
energy as the difference of the standard chemical potentials of the
complex and those of the receptor and ligand
$$ \Delta G_b^\circ = \mu_{RL}^\circ - \mu_R^\circ - \mu_L^\circ \tag{5} $$
and employing the McMillan-Mayer expression for the standard
chemical potential of a solute in an ideal solution [15]
$$ \mu_i^\circ = -k_B T \ln \frac{\phi_i}{\Lambda_i^3 C^\circ}, \tag{6} $$
where ϕi is the internal canonical molecular partition function of
solute i in solution and Λi is the thermal De Broglie wavelength of
the center of mass of the solute. The internal molecular partition
function includes only the internal degrees of freedom of the solute
obtained after separating the translational degrees of freedom of
the molecular center of mass. Furthermore, the solute’s internal
canonical molecular partition function in solution is understood in
the context of the concept of the solvent potential of mean force,
[16] in which the solvent degrees of freedom are averaged out.
While a quantum-mechanical treatment is required in general,
adopting a classical expression for the molecular partition function
is appropriate for the present discussion limited to non-covalent
molecular association equilibria, which do not involve the formation or breaking of chemical bonds. The internal canonical molecular partition function is written as
$$ \phi_i = \frac{8\pi^2 z_i}{\prod_j \lambda_j^3}, \tag{7} $$
where the denominator comes from the integration over the
momenta,¹ the factor of $8\pi^2$ comes from the integration over the
orientational degrees of freedom of the solute,² and $z_i$ is the vibrational molecular configurational partition function
$$ z_i = \int dx_i \, e^{-\beta \Psi_i(x_i)}, \tag{8} $$
where $\beta = 1/(k_B T)$ is the inverse temperature, $x_i$ denotes the
collection of the vibrational degrees of freedom of solute $i$, and $\Psi_i$
is the solvent-averaged potential of mean force of a specific configuration of the solute in solution.³ In the present notation,
$$ e^{-\beta \Psi_i(x_i)} = \frac{1}{Z_N} \int dr_v^N \, e^{-\beta U_i(r_v^N, x_i)}, \tag{9} $$
where $r_v^N$ denotes the collection of degrees of freedom of $N$ solvent
molecules, $U_i(r_v^N, x_i)$ is the potential energy of the mixture of
$N$ solvent molecules and one solute molecule $i$, and $Z_N$ is the
configurational partition function of the pure solvent,⁴ expressed
as the integral in Eq. 9 but without the solute.
Equation 4 is obtained by inserting Eq. 6 for each species,
using Eqs. 7–9, into Eq. 5, noticing that the kinetic energy factors
cancel out, and finally inverting Eq. 3.
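The algebra behind this statement can be sketched as follows. This is a reconstruction from Eqs. 3 and 5–7 rather than the derivation given in the cited references; the cancellation of the kinetic factors in the last step is taken from the text and is not shown explicitly.

```latex
\begin{align*}
K_b &= e^{-\beta \Delta G_b^\circ}
     = e^{-\beta\left(\mu_{RL}^\circ - \mu_R^\circ - \mu_L^\circ\right)}
     && \text{(Eqs. 3 and 5)} \\
    &= \frac{\phi_{RL}/(\Lambda_{RL}^3 C^\circ)}
            {\left[\phi_R/(\Lambda_R^3 C^\circ)\right]\left[\phi_L/(\Lambda_L^3 C^\circ)\right]}
     = C^\circ \, \frac{\phi_{RL}}{\phi_R \, \phi_L} \,
       \frac{\Lambda_R^3 \Lambda_L^3}{\Lambda_{RL}^3}
     && \text{(Eq. 6)} \\
    &= C^\circ \, \frac{8\pi^2 z_{RL}}{(8\pi^2 z_R)(8\pi^2 z_L)}
       \times (\text{kinetic factors})
     && \text{(Eq. 7)} \\
    &= \frac{C^\circ}{8\pi^2} \, \frac{z_{RL}}{z_R z_L}
     && \text{(kinetic factors cancel; Eq. 4)}
\end{align*}
```

One orientational factor of $8\pi^2$ survives in the denominator because the two separated species each contribute one, while the complex contributes only one.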
The definition of the intramolecular configuration partition
function of the complex, zRL, receives special consideration in this
theory [13]. In the complex, the translational and orientational
degrees of freedom of the ligand are represented by the internal
degrees of freedom of the complex that specify the position and
orientation of the ligand with respect to a coordinate system
attached to the receptor [10]. Furthermore, the integration along
these coordinates is limited to some specified range of configurational space that encodes our structural definition of what constitutes a valid configuration of a ligand “bound” to the receptor
(see the discussion in Subheading 2.1). The structural definition of
the bound complex is a necessary and somewhat arbitrary input of
the theory [4, 7, 13, 14]. Without it, the free energy of the bound
complex relative to the unbound state is undefined and, consequently, the standard binding free energy and the binding constant
would also be undefined in this theory. It is customary to represent
the bound region of the complex by an indicator function $I(\zeta_L)$,
where $\zeta_L$ represents the collection of the six coordinates⁵ that
¹ The details are omitted since the contributions from momenta cancel out in this classical treatment.
² We assume that the orientational degrees of freedom can be separated from the vibrational degrees of freedom without significant loss of accuracy. This is generally an excellent approximation at moderate temperatures.
³ It should be noted that the solvent potential of mean force formalism does not introduce assumptions or approximations beyond the ones already adopted. In this context, it is only a convenient notational aid. We will discuss later implicit solvation models, which approximate the solvent potential of mean force.
⁴ The notation can be easily extended to solvent mixtures including ions and co-solvents.
⁵ Three translations and three orientations for a non-linear ligand.
specify the position and orientation of the ligand relative to the
receptor [10].⁶ The indicator function is set to 1 if the position and
orientation of the ligand are such that receptor and ligand are considered bound and zero otherwise, so that $z_{RL}$ can be written as
$$ z_{RL} = \int dx_R \, dx_L \, d\zeta_L \, I(\zeta_L) \, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L)}. \tag{10} $$
2.3 The Binding Free
Energy Formula
Since the direct evaluation of partition functions is not generally
feasible, Eq. 4 is not amenable to direct computation. One strategy
is to transform it into an average over the conformational ensemble
in which receptor and ligand are uncoupled. To do so, we reorganize the integration variables in the numerator so that they match
exactly those in the denominator. First, define
$$ \int d\zeta_L \, I(\zeta_L) = V_{\mathrm{site}} \, \Omega_{\mathrm{site}} \tag{11} $$
which measures the spatial ($V_{\mathrm{site}}$) and angular ($\Omega_{\mathrm{site}}$) extent of the
bound state of the complex when receptor and ligand are
uncoupled.⁷ Then, multiply and divide Eq. 4 by Eq. 11, keeping
the integral form in the denominator and the integrated form in the
numerator. The result is
$$ K_b = C^\circ V_{\mathrm{site}} \frac{\Omega_{\mathrm{site}}}{8\pi^2} \left\langle e^{-\beta u} \right\rangle_0, \tag{12} $$
where
$$ \left\langle e^{-\beta u} \right\rangle_0 = \int dx_R \, dx_L \, d\zeta_L \, e^{-\beta u(x_R, x_L, \zeta_L)} \, \rho_0(x_R, x_L, \zeta_L) \tag{13} $$
is the ensemble average of the Boltzmann weight of the effective
binding energy, $u$, defined as the difference between the effective potential
energy of the complex in the specified configuration and that of
the separated receptor and ligand without changing their internal
configurations,
$$ u(x_R, x_L, \zeta_L) = \Psi_{RL}(x_R, x_L, \zeta_L) - \Psi_R(x_R) - \Psi_L(x_L), \tag{14} $$
with the normalized probability density function
$$ \rho_0(x_R, x_L, \zeta_L) = \frac{I(\zeta_L) \, e^{-\beta \Psi_R(x_R)} \, e^{-\beta \Psi_L(x_L)}}{\int dx_R \, dx_L \, d\zeta_L \, I(\zeta_L) \, e^{-\beta \Psi_R(x_R)} \, e^{-\beta \Psi_L(x_L)}}, \tag{15} $$
⁶ The specific choice of the $\zeta_L$ coordinates is arbitrary as long as they do not couple, directly or indirectly, the intramolecular coordinates of the receptor or the ligand.
⁷ Equation 11 is colloquially referred to as the volume of the receptor binding site. The notation used here suggests that translational and orientational components are not coupled in the definition of $I(\zeta_L)$. The present treatment is still valid if this is not the case, except that the value of the integral of the indicator function is then not written as the product of spatial and orientational components. Finally, $\Omega_{\mathrm{site}} = 8\pi^2$ if the definition of the bound complex does not involve orientational coordinates, that is, when only the position of the ligand is used to judge whether it is bound to the receptor.
which corresponds to an unphysical state of the complex in which
the ligand is bound to the receptor (the density is zero unless
$I(\zeta_L) = 1$) but does not interact with it (the potential function
lacks receptor-ligand coupling terms). We will hereafter refer to this
state as the decoupled state of the complex. Conversely, the coupled
state of the complex is the physical state in which the bound ligand
and the receptor interact through the $\Psi_{RL}(x_R, x_L, \zeta_L)$ potential
function.
Inserting Eq. 12 into Eq. 3 yields the following expression for
the standard free energy of binding
$$ \Delta G_b^\circ = \Delta G_{b,\mathrm{id}}^\circ + \Delta G_b, \tag{16} $$
where
$$ \Delta G_{b,\mathrm{id}}^\circ = -k_B T \ln C^\circ V_{\mathrm{site}} - k_B T \ln \frac{\Omega_{\mathrm{site}}}{8\pi^2} \tag{17} $$
is the ideal component of the standard free energy of binding
corresponding to the reversible work for transferring a ligand
from an ideal solution at concentration $C^\circ$ to the binding site
region in the absence of ligand-receptor interactions, and
$$ \Delta G_b = -k_B T \ln \left\langle e^{-\beta u} \right\rangle_0 \tag{18} $$
is the excess component of the standard free energy of binding,
corresponding to the reversible work for turning on the receptor-ligand interactions while the ligand is sequestered within the binding site region of the receptor. The goal of the computational
models discussed in this chapter is the estimation of the excess
free energy of binding. The ideal component is generally computed
analytically by integration of the expression that defines the indicator function of the bound complex.
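The analytic evaluation of the ideal term (Eq. 17) can be illustrated with a short sketch. It assumes a hypothetical spherical binding-site definition, in which the ligand counts as bound when a reference atom lies within a given radius of the site center; the function names and the 5 Å radius are illustrative, and the standard-state value of 1 molecule/1668 Å³ is the one quoted in Subheading 2.1.

```python
import math

KB_KCAL = 0.0019872041          # Boltzmann constant in kcal/(mol K)
C_STANDARD = 1.0 / 1668.0       # molecules per cubic angstrom (value quoted in the text)
FULL_ORIENTATION = 8.0 * math.pi ** 2


def sphere_volume(radius):
    """Volume (cubic angstroms) of a spherical binding-site region; an assumed site shape."""
    return 4.0 / 3.0 * math.pi * radius ** 3


def ideal_binding_free_energy(v_site, omega_site=FULL_ORIENTATION, temperature=300.0):
    """Eq. 17: ideal component of the standard binding free energy in kcal/mol."""
    kt = KB_KCAL * temperature
    translational = -kt * math.log(C_STANDARD * v_site)
    orientational = -kt * math.log(omega_site / FULL_ORIENTATION)
    return translational + orientational


if __name__ == "__main__":
    v = sphere_volume(5.0)  # hypothetical 5 angstrom site radius
    print(f"V_site = {v:.1f} A^3, dG_id = {ideal_binding_free_energy(v):.2f} kcal/mol")
```

With no orientational restriction the second term vanishes, as noted in footnote 7; a site volume smaller than 1/C° gives a positive ideal term, reflecting the cost of confining the ligand.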
Equation 18 provides, in principle, a computational route to
evaluate the binding free energy. The process is often called alchemical because it is unrealizable in nature. Nevertheless, it produces
estimates that can be compared to experimental measurements. It
instructs us to (1) obtain a sample of Boltzmann-distributed conformations of the complex in the uncoupled state (by molecular
dynamics, typically), (2) evaluate the binding energy function
$u$ (Eq. 14) for each sample by turning on, without conformational
rearrangements, the coupling between ligand and receptor, and
finally (3) find the average of the Boltzmann weight $\exp(-\beta u)$.
While straightforward, this process is numerically ill-conditioned,
and it fails for all but the simplest systems. This problem arises
because atoms of the ligand and the receptor are very likely to
clash when uncoupled. Consequently, the binding energy $u$ is
large and positive, and $\exp(-\beta u)$ is negligibly small for the vast
majority of samples. Effectively, the sampling process generates
mostly zeros, and the average is dominated by the very rare cases
Emilio Gallicchio
Fig. 1 The probability density p0(u) (right, green curve) of the binding energy for the alchemical uncoupled state (λ = 0) of the complex between 3-iodotoluene and the L99A mutant of T4-lysozyme (left) at 300 K [17]. The red curve is p1(u), the probability density of the binding energy in the coupled ensemble (λ = 1), which is proportional to the integrand exp(−βu) p0(u) in Eq. 19 [18]. Note that the y-axis and the positive x-axis are on a logarithmic scale.
when, by chance, ligand and receptor do not clash and are primed
to form favorable interactions even in the absence of such
interactions.
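The three-step recipe can be sketched in a few lines, assuming the binding energies u of the uncoupled-state samples are already available. The sample values below are made up purely to show how a single favorable configuration dominates the average; the function name is illustrative.

```python
import math

def excess_free_energy(u_samples, kT):
    """Eq. 18: -kT ln <exp(-u/kT)>_0 from binding energies of uncoupled-state samples."""
    exponents = [-u / kT for u in u_samples]
    # Log-sum-exp trick: factor out the largest exponent for numerical stability
    m = max(exponents)
    log_avg = m + math.log(sum(math.exp(e - m) for e in exponents)) - math.log(len(exponents))
    return -kT * log_avg

kT = 0.596  # kcal/mol, close to 300 K
# Made-up sample: four clashing configurations and one rare favorable one
u = [1.0e3, 5.0e4, 2.0e2, 8.0e5, -6.0]
dg = excess_free_energy(u, kT)
```

In this toy sample the estimate is set entirely by the one configuration with u = −6 kcal/mol; the clashing configurations contribute only through the sample count in the denominator.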
To appreciate more quantitatively the severity of this numerical
problem, let us rewrite the ensemble average in Eq. 18 as a statistical average
$$\langle e^{-\beta u} \rangle_0 = \int_{-\infty}^{+\infty} du\, e^{-\beta u}\, p_0(u), \tag{19}$$
where p0(u) is the probability density distribution of the binding
energy in the uncoupled state. As shown, for example, in Fig. 1 for
the complex between 3-iodotoluene and the L99A mutant of
T4-lysozyme, [17] p0(u) (in green) is greatest for large and positive
values of the binding energy. For this system, the probability of
finding a conformation for which the integrand of Eq. 19 is significant (the red curve) is six or more orders of magnitude smaller than
the probability of occurrence of conformations with atomic clashes.
It would take a prohibitively large number of independent samples
of the decoupled ensemble to obtain a sufficiently large subset at
favorable binding energies to estimate the binding free energy with
any precision. Effectively, the binding free energy is dominated by
the low binding energy tail of p0(u), which is difficult to estimate
precisely and which is greatly amplified by the exponential term in
Eq. 19 (see footnote 8).
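To put rough numbers on the tail-dominance argument, the sketch below assumes, purely for illustration, that p0(u) is Gaussian. The distribution in Fig. 1 is not Gaussian, so this is a toy model rather than a description of the lysozyme system. For a Gaussian with mean μ and standard deviation σ, the average in Eq. 19 has a closed form, and the integrand exp(−βu) p0(u) is itself a Gaussian centered at μ − βσ².

```python
import math

def gaussian_exponential_average(mu, sigma, kT):
    """Closed-form results for Eq. 19 when p0(u) is Gaussian (illustrative assumption)."""
    beta = 1.0 / kT
    dg = mu - 0.5 * beta * sigma**2   # -kT ln <exp(-beta u)>_0
    u_star = mu - beta * sigma**2     # center of the integrand exp(-beta u) p0(u)
    # Fraction of uncoupled-state samples with u below u_star
    frac = 0.5 * math.erfc(beta * sigma / math.sqrt(2.0))
    return dg, u_star, frac

# Hypothetical parameters in kcal/mol
dg, u_star, frac = gaussian_exponential_average(mu=20.0, sigma=5.0, kT=0.596)
```

Even for this modest width, the region that carries half of the integral lies many standard deviations below the mean, so the fraction of samples that reach it is vanishingly small.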
The 3-iodotoluene/T4-lysozyme complex illustrated in Fig. 1
is a rather simple system. The severity of the numerical problem is
[Footnote 8] It is tempting to try to avoid the clashes caused by coupling by reversing the process, that is, by decoupling. However, an equilibrium thermodynamic process like this one must be reversible, so its direction is irrelevant.
Free Energy-Based Computational Methods
far greater for ligand peptides, which are significantly larger and
more flexible than a small molecule. Random placement of a peptide molecule in the protein receptor site will almost inevitably
result in conformations with atomic clashes that do not contribute
significantly to the binding free energy. Moreover, peptides can
assume a large variety of conformations when decoupled from the
receptor, with only a small fraction of them compatible with binding, thereby further reducing the probability of generating useful
bound conformations.
In practice, various strategies ranging from stratification (break
up the binding process by introducing appropriate intermediate
states) to importance sampling (preferential sampling of bound
states) have been devised to overcome the numerical problems in
alchemical free energy averages. Some of these strategies will be
discussed in the case studies later in this chapter. While often very
useful, applying these advanced strategies to protein-peptide complexes remains very challenging, as reflected in the paucity of successful alchemical absolute binding free energy calculations for
protein-peptide complexes reported in the literature.
2.3.1 The Double-Decoupling Method
Equation 18 is not directly applicable to the calculation of binding
free energies unless the solvent potential of mean force, $\Psi_i(x_i)$, or a
suitable implicit solvent approximation for it, is available for the
ligand, the receptor, and their complex. The solvent potential of
mean force is required for conformational sampling and the evaluation of effective binding energies for each sample using Eqs. 14 and 9.
The alternative is to employ an explicit representation of the
solvent. The relevant partition functions include integrating the
solutes’ internal degrees of freedom and the degrees of freedom
of the solvent molecules. The result is a binding free energy formulation known as double-decoupling [13] involving two exponential
averages of the same form as Eq. 18, one for coupling the ligand
from vacuum to the solvated receptor and another for coupling the
ligand to the pure solvent. These two processes, the second of
which is related to the solvation of the ligand, are part of a thermodynamic cycle that brings the ligand from the solvent bulk to the
solvated receptor through an intermediate state in which the ligand
is in vacuum (Fig. 2).
The double-decoupling method is regarded as the leading computational model for calculating protein-small molecule binding free energies. However, because of the size of peptides, it is not generally applicable to them. It is presented here because it forms the basis
for the relative binding free energy method employed in the case
study of Subheading 3.2. To see why double-decoupling is not
readily applicable to peptides, consider, for example, the first leg
in Fig. 2, which is the inverse of the coupling of the peptide to the
solvated receptor. For the same reasons outlined above concerning
Eq. 16, it would be very challenging to compute the free energy of
Fig. 2 Schematic illustration of the thermodynamic cycle of the double-decoupling method for the calculation of the binding free energy between a molecular receptor (orange doughnut) and a ligand (black circle). The dashed circle within the receptor represents the binding site region. The blue boxes represent the solvent. The bound and unbound end states are transformed to a common intermediate state in which the ligand is in vacuum (white). The excess binding free energy is the difference of the free energy changes of the two legs, ΔGb = ΔG2 − ΔG1.
this process because, in addition to the many atomic clashes with
the receptor atoms, the uncoupled peptide will also clash with
solvent molecules that would be present in the binding site. Similar
challenges would exist for the hydration leg.
The double-decoupling formula is derived from the statistical
mechanics theory outlined in Subheading 2.2 by first inserting the
definition of the solvent potential of mean force (Eq. 9) in each of
the configurational partition functions in Eq. 4 and then multiplying and dividing by the configurational partition function of the
ligand in vacuum Z0,L to obtain
$$K_b = \frac{C^\circ}{8\pi^2} \frac{Z_{N,RL}}{Z_{N,R}\, Z_{0,L}} \frac{Z_N\, Z_{0,L}}{Z_{N,L}}, \tag{20}$$
where ZN,i is the configurational partition function of a system with
N solvent molecules with one molecule of species i whose position
and orientation, like in Subheading 2.2, is fixed. So, for example,
$$Z_{N,RL} = \int dx_R\, dx_L\, d\zeta_L\, I(\zeta_L)\, dr_v^N\, e^{-\beta U(x_R, x_L, \zeta_L, r_v^N)}, \tag{21}$$
where $U(x_R, x_L, \zeta_L, r_v^N)$ is the potential energy function of a system with N solvent molecules containing the receptor-ligand complex RL in the configuration specified by the internal degrees
of freedom $x_R$, $x_L$, and $\zeta_L$. $Z_{0,L}$ represents the configurational
partition function of the ligand in vacuum.
The reciprocal of the last term in Eq. 20 can be written as
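$$\frac{Z_{N,L}}{Z_N\, Z_{0,L}} = \frac{\int dx_L\, dr_v^N\, e^{-\beta U(x_L, r_v^N)}}{\int dx_L\, dr_v^N\, e^{-\beta U(x_L)}\, e^{-\beta U(r_v^N)}} = \langle e^{-\beta u_L} \rangle_{N+L} = e^{\beta \Delta G_2}, \tag{22}$$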
where $u_L = U(x_L, r_v^N) - U(x_L) - U(r_v^N)$ is the instantaneous change in potential energy for bringing the ligand from vacuum to solution and ⟨…⟩N+L indicates the ensemble average over pure solvent and the ligand in vacuum. As indicated in Eq. 22, this term is related to the solvation free energy of the ligand (see footnote 9), that is, to the opposite of the process of leg 2 in Fig. 2.
The ratio of partition functions corresponding to the complex
in Eq. 20 is converted to an average by multiplying and dividing by
Vsite Ωsite as done earlier to derive Eq. 12
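$$\frac{Z_{N,RL}}{Z_{N,R}\, Z_{0,L}} = V_{\mathrm{site}}\, \Omega_{\mathrm{site}}\, \langle e^{-\beta u_{RL}} \rangle_{N,R+L} = V_{\mathrm{site}}\, \Omega_{\mathrm{site}}\, e^{\beta \Delta G_1}, \tag{23}$$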
where $u_{RL} = U(x_R, x_L, \zeta_L, r_v^N) - U(x_R, r_v^N) - U(x_L)$ is the instantaneous change in potential energy for bringing the ligand from vacuum to a position and orientation ζL relative to the receptor in a solution containing the receptor, and ⟨…⟩N,R+L, similarly to Eq. 18, indicates the ensemble average over the uncoupled ensemble in which the ligand is bound to the receptor (I(ζL) = 1) but interacts with neither the receptor nor the solvent. As
indicated in Eq. 23 this ensemble average gives the free energy of
the inverse of leg 1 in Fig. 2. Combining Eqs. 22, 23, 20, 16, 17,
and 3 we finally arrive at the double-decoupling expression for the
excess binding free energy:
$$\Delta G_b = \Delta G_2 - \Delta G_1 \tag{24}$$
as illustrated in Fig. 2.
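As a hedged illustration of how the two legs combine with the ideal term (Eqs. 16, 17, and 24), the sketch below assumes the decoupling free energies ΔG1 and ΔG2 have already been computed. The function name and the numerical inputs are hypothetical.

```python
import math

KB = 0.0019872041      # Boltzmann constant in kcal/(mol K)
C0 = 1.0 / 1660.539    # 1 M standard concentration, molecules per cubic angstrom

def standard_binding_free_energy(dg1, dg2, v_site, omega_site, temperature=300.0):
    """Combine Eqs. 16, 17, and 24; energies in kcal/mol, volumes in cubic angstroms."""
    kT = KB * temperature
    dg_ideal = (-kT * math.log(C0 * v_site)
                - kT * math.log(omega_site / (8.0 * math.pi**2)))
    dg_excess = dg2 - dg1   # Eq. 24: leg 2 (from solvent) minus leg 1 (from receptor)
    return dg_ideal + dg_excess

# Hypothetical decoupling free energies for the two legs
dg_bind = standard_binding_free_energy(dg1=20.0, dg2=12.0,
                                       v_site=1660.539, omega_site=8.0 * math.pi**2)
```

A ligand that is harder to decouple from the receptor than from the solvent (ΔG1 larger than ΔG2) yields a favorable, negative excess binding free energy.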
Note that the free energy formula for each leg has the same form, an exponential average (Eq. 23) of the alchemical potential energy change, as the direct binding free energy formula we derived in Subheading 2.3. Thus, similar considerations apply to each leg of double-decoupling. In each case, the formula calls for samples of configurations of either the ligand in solution or the ligand in the solvated receptor, drawn from the respective decoupled ensembles. It then calls for averaging, over the set of samples, the Boltzmann weight of the potential energy change for turning on the coupling between the ligand and the environment without
[Footnote 9] Specifically, the free energy of transferring a solute from a fixed position and orientation in vacuum to a fixed position and orientation in solution, a quantity also known as the solvation free energy in the Ben-Naim standard state [19, 20].
conformational rearrangements. Here too, each leg’s averaging
process is expected to be numerically ill-conditioned (see, for example, Fig. 1) and not generally applicable directly in molecular simulations. Some numerical approaches to this problem are illustrated
in the Case Studies section of this chapter.
2.4 The Potential of Mean Force Method
In this section we derive a non-alchemical formulation of the statistical mechanics expression (Eq. 4), which leads to the potential of mean force formula for the binding constant.
Using the definition of the internal configurational partition function of the complex in Eq. 10 and the analogous ones for the receptor and ligand, Eq. 4 is written as
$$K_b = \frac{C^\circ}{8\pi^2} \frac{\int dx_R\, dx_L\, d\zeta_L\, I(\zeta_L)\, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L)}}{\int dx_R\, dx_L\, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L^*)}}, \tag{25}$$
where we have written the product zRzL of the separated receptor and ligand as the partition function of a single system in which the ligand is placed in an arbitrary position ζL* sufficiently removed from the receptor so that it does not interact with it. Equation 25 is then written as
$$K_b = \frac{C^\circ}{8\pi^2} \int_{\mathrm{site}} d\zeta_L\, e^{-\beta \Delta F(\zeta_L)}, \tag{26}$$
where the integration is over the binding site region, where I(ζL) ≠ 0, and the potential of mean force (PMF) function is defined as
$$e^{-\beta \Delta F(\zeta_L)} = \frac{\int dx_R\, dx_L\, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L)}}{\int dx_R\, dx_L\, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L^*)}}, \tag{27}$$
where ΔF(ζL) is the value of the PMF at ζL relative to its value at a reference position ζL* far away from the receptor. With this definition the PMF is zero at any point far away from the receptor.
The PMF as defined is related to the probability density p(ζL) of finding the ligand in the orientation and position ζL relative to the receptor:
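$$p(\zeta_L) = \frac{\int dx_R\, dx_L\, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta_L)}}{\int dx_R\, dx_L\, d\zeta'_L\, e^{-\beta \Psi_{RL}(x_R, x_L, \zeta'_L)}} = \langle \delta(\zeta'_L - \zeta_L) \rangle \tag{28}$$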
so that
$$\Delta F(\zeta_L) = -k_B T \ln \frac{p(\zeta_L)}{p(\zeta_L^*)}. \tag{29}$$
The potential of mean force expression (Eq. 26) formally instructs us to map out the probability density (Eq. 28) of observing the ligand around the receptor in orientation and position ζL, including far away from the receptor and within the binding site region, and then to integrate it within the binding site region to obtain the binding constant using Eq. 26.
Some comments are in order. First, the PMF function can be obtained in the solvent potential of mean force formulation, as suggested by Eq. 27, or by using an explicit representation of the solvent, by inserting the definitions of the effective potential energy Ψ and of the solvent potential of mean force (Eq. 9) into Eq. 27:
$$e^{-\beta \Delta F(\zeta_L)} = \frac{\int dx_R\, dx_L\, dr_v^N\, e^{-\beta U(x_R, x_L, r_v^N, \zeta_L)}}{\int dx_R\, dx_L\, dr_v^N\, e^{-\beta U(x_R, x_L, r_v^N, \zeta_L^*)}}. \tag{30}$$
It is evident therefore that the PMF is obtained by monitoring the
probability of occurrence of the ligand at ζL whether an implicit or
explicit description of the solvent is used.
Second, the potential of mean force formula for the binding constant (Eq. 26) does not require knowledge of the probability density p(ζL) everywhere around the receptor. It requires it only within the binding site region and at one arbitrary point ζL* far away from the receptor in the solvent bulk to compute ΔF(ζL) from Eq. 29. The
latter is a fundamental point. It is not sufficient to study the
distribution of placements of the ligand in the binding site to
compute the binding free energy. We also require the probability
of finding the ligand in the binding site relative to finding it
somewhere in the solvent bulk. In practice, the PMF is obtained
in a volume that includes both the binding site and positions far
away from the receptor to connect the two regions in a statistical
sense [21–23].
Finally, the PMF is rarely obtained over all six degrees of freedom of ζL (three positions and three orientations). In practice, the PMF is collected only along some of the dimensions by averaging over the others. The averaging procedure is formally described by marginalization of p(ζL). For example, to obtain the probability of the position rL of the ligand regardless of its orientation we integrate p(ζL) = p(rL, θ1, ψ1, ψ2) over the three Euler angles θ1, ψ1, and ψ2:
$$p(r_L) = \int d(\cos\theta_1)\, d\psi_1\, d\psi_2\, p(r_L, \theta_1, \psi_1, \psi_2). \tag{31}$$
In the bulk, the ligand distribution does not depend on the orientation and we get
$$p(r_L^*) = \int d(\cos\theta_1)\, d\psi_1\, d\psi_2\, p(r_L^*, \theta_1, \psi_1, \psi_2) = 8\pi^2\, p(\zeta_L^*). \tag{32}$$
Next, integrate Eq. 26 over θ1, ψ1, and ψ2, assuming that the binding site definition does not depend on orientations, and expressing exp[−βΔF(ζL)] as p(ζL)/p(ζL*), where ζL* is a reference point in the bulk, to obtain
$$K_b = \frac{C^\circ}{8\pi^2} \int_{\mathrm{site}} dr_L\, \frac{p(r_L)}{p(\zeta_L^*)} = C^\circ \int_{\mathrm{site}} dr_L\, e^{-\beta \Delta F(r_L)}, \tag{33}$$
where
$$\Delta F(r_L) = -k_B T \ln \frac{p(r_L)}{p(r_L^*)} \tag{34}$$
and we have used Eqs. 31 and 32. The implementation of Eq. 33
requires the PMF with respect to the position of the ligand regardless of its orientation.
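A minimal sketch of Eq. 33 follows, assuming the positional PMF has already been tabulated on a regular grid of voxels covering the binding site and referenced to zero in the bulk. The grid, voxel volume, and function names are hypothetical, and the integral is approximated by a simple Riemann sum.

```python
import math

KB = 0.0019872041      # Boltzmann constant in kcal/(mol K)
C0 = 1.0 / 1660.539    # 1 M standard concentration, molecules per cubic angstrom

def binding_constant_from_pmf(pmf_values, voxel_volume, temperature=300.0):
    """Eq. 33 as a Riemann sum: K_b = C0 * sum over site voxels of exp(-dF/kT) * dV."""
    kT = KB * temperature
    return C0 * voxel_volume * sum(math.exp(-f / kT) for f in pmf_values)

def standard_free_energy(k_b, temperature=300.0):
    """Standard binding free energy from the dimensionless binding constant."""
    return -KB * temperature * math.log(k_b)

# Hypothetical PMF: 1000 voxels of 1.660539 cubic angstroms, all at -5 kcal/mol
pmf = [-5.0] * 1000
k_b = binding_constant_from_pmf(pmf, voxel_volume=1.660539)
dg0 = standard_free_energy(k_b)
```

With a site volume equal to the volume per molecule at 1 M, the ideal contribution vanishes and the standard free energy reduces to the PMF depth, which makes this a convenient sanity check.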
3 Case Studies of Applications of Free Energy Methods to Protein-Peptide Binding Free Energy Estimation
In this section, we review some applications of the free energy
methods derived from the statistical mechanics theory of
non-covalent molecular binding introduced in Subheading 2.2 to
the study of protein-peptide binding phenomena. We will focus in
particular on theoretical and methodological aspects that will be
introduced and discussed as needed. The following case studies are
far from an exhaustive representation of the literature in the field.
They have been selected primarily to illustrate the application of the
theory and methods presented in Subheading 2. We also do not
attempt to review each work exhaustively.
3.1 Binding of Cyclic Peptides to HIV Integrase with the Single-Decoupling Method and Implicit Solvation
As part of the infection cycle, HIV inserts its genome into a human
chromosome. The HIV integrase (IN) enzyme responsible for this
process is recruited to the nuclear chromatin by the human lens
epithelium-derived growth factor (LEDGF) transcriptional coactivator [24]. There have been significant attempts [8, 25–27] to
develop therapies against HIV based on disrupting the interaction
of LEDGF with HIV IN, which occurs at the so-called LEDGF
binding domain of integrase (Fig. 3). The study of the interaction
of LEDGF and LEDGF-derived synthetic peptides with HIV-IN
has provided useful insights for competitive inhibitors’ design
[28, 29]. As an example, Fig. 3 illustrates the crystal structure of
the LEDGF binding domain of the HIV IN dimer complexed with
a cyclic peptide [29].
Building upon an earlier successful application of alchemical
binding free energy calculations of small-molecule inhibitors targeting the LEDGF/HIV IN interaction [30], Kilburg and Gallicchio [31] modeled the binding free energies between HIV IN and five of the thirteen cyclic peptides assayed by Rhodes et al. [29].
The alchemical binding free energy study by Kilburg and Gallicchio
recapitulated the trends observed in the experimental assays and
Fig. 3 The 3AVN crystal structure of the dimer of the LEDGF binding domain of
HIV integrase (multi-color ribbons) bound to SHKIDNLD cyclic peptides (red tube)
[29]
identified the specific structural and energetic signatures responsible for favorable binding. Conversely, the calculations provided
explanations for the lack of binding observed for two sequences
for which structural information is not available.
The study by Kilburg and Gallicchio [31] remains one of a few
examples of the successful application of alchemical free energy
methods to the computation of the absolute binding free energies
of protein-peptide complexes. This was made possible by employing an implementation of Eq. 16 which was first reported under the
name of Binding Energy Distribution Analysis Method (BEDAM),
[18, 32] as part of the IMPACT molecular simulation program
[33]. The latest implementation as a plugin of the OpenMM
molecular dynamics library [34] has been named the Single-Decoupling Method (SDM) [16] (see footnote 10), a name chosen to better
place it in the same theoretical context as the Double-Decoupling
Method (DDM) [13] discussed in Subheading 2.3.1. In the following, we will use the latter name to refer to both implementations. SDM has been used in two studies involving protein-peptide
binding to date [31, 35].
The implementation of Eq. 16 requires the averaging of the
Boltzmann weight of the effective binding energy in Eq. 14, which
in turn requires the specification of the intramolecular potential
energy and the solvent potential of mean force for each configuration x i of the molecular species involved. The former is available
from a molecular mechanics force field (OPLS-AA [36] in the
applications discussed here) while the solvent potential of mean
[Footnote 10] github.com/rajatkrpal/openmm_sdm_plugin.
force is approximated by an implicit solvent model [16]. SDM
employs the Analytical Generalized Born plus Non-Polar
(AGBNP) implicit solvent model [37, 38] which is now maintained
as an OpenMM plugin [39] (see footnote 11).
3.1.1 Alchemical Pathways and Stratification
We use this case study to illustrate the very general concept of an
alchemical pathway and the idea of performing conformational
sampling along the pathway to improve the convergence characteristics of the basic binding free energy formula (Eq. 18). This
technique, commonly known in the field as stratification, is used
in many free energy estimation problems [40].
As discussed in Subheading 2.3, Eq. 18 is not directly applicable in numerical simulations because, fundamentally, the coupled
and uncoupled ensembles preferentially visit distinct regions of
conformational space (see Fig. 1, for example). The free energy,
however, is a thermodynamic state function, and it should be
possible to compute it as the sum of free energy changes over a
series of intermediate states, each sufficiently similar to its neighbors that free energy estimation formulas such as Eq. 18, applied between neighboring states, are numerically well-behaved [41, 42] (see footnote 12). The intermediate, so-called alchemical, states are generally implemented by means of an alchemical progress parameter λ that tunes the system’s potential energy function such that λ = 0 corresponds to the initial state and λ = 1 corresponds to the final state. A simple, but not necessarily the most efficient [17, 43], choice is a linear interpolating function of the form
$$U_\lambda(x) = U_0(x) + \lambda u(x), \tag{35}$$
where U0(x) is the potential energy function that describes the initial state and u(x) = U1(x) − U0(x), where U1(x) is the potential function of the final state, is the perturbation potential. The progress parameter λ and the specific parameterization of the alchemical potential are said to define an alchemical path that connects, in a thermodynamic sense, the initial and final states.
The specific alchemical potential energy function adopted by
Kilburg and Gallicchio [31] to study peptide binding is, in the
notation of Subheading 2.3,
$$\Psi_\lambda(x_R, x_L, \zeta_L) = \Psi(x_R) + \Psi(x_L) + \lambda u(x_R, x_L, \zeta_L), \tag{36}$$
where the sum of the first two terms on the r.h.s. is the potential energy function of the decoupled ensemble (corresponding to U0(x) in Eq. 35) and the binding energy function u is defined by Eq. 14 (see footnote 13). It is
[Footnote 11] github.com/egallicc/openmm_agbnp_plugin.
[Footnote 12] This concept has since evolved into rigorous statistical interpretations and numerical algorithms, some of which are discussed later in this section.
[Footnote 13] To improve convergence, Kilburg and Gallicchio actually used a soft-core form of the binding energy function [17, 44]. Soft-core functions are critical aspects of alchemical binding free energy calculations.
straightforward to see that Ψλ at λ = 1 is the potential energy function of the coupled state. An alchemical binding free energy profile, ΔG(λ), along the thermodynamic path is defined, which corresponds to the free energy of the intermediate alchemical state at λ relative to the uncoupled state (λ = 0) [18]:
$$\Delta G(\lambda) = -k_B T \ln \langle e^{-\beta \lambda u} \rangle_0 \tag{37}$$
which is Eq. 18 with u replaced by λu, the perturbation energy at the alchemical state at λ. By definition, the excess free energy of binding (Eq. 18) is the difference between the end points of the alchemical binding free energy profile:
$$\Delta G_b = \Delta G(\lambda = 1) - \Delta G(\lambda = 0). \tag{38}$$
In Kilburg and Gallicchio’s study, the alchemical path was subdivided into 26 intermediate states, mostly linearly spaced between 0 and 1, except in the region near λ = 0, which required more closely spaced points. Conformational sampling was conducted at each λ-state by molecular dynamics (MD) (see footnote 14) using the alchemical potential energy function (Eq. 36). The binding energy function (Eq. 14) and its gradients were evaluated at each MD time step by first evaluating the potential energy of the complex ΨRL(xR, xL, ζL) and then displacing the peptide in the implicit solvent medium to a large distance from the protein receptor to evaluate the potential energy ΨR(xR) + ΨL(xL) without protein-peptide interactions (see footnote 15). Samples of the decoupled energy Ψ0 = ΨR(xR) + ΨL(xL) and of the binding energy u were saved at each alchemical state at regular intervals. As discussed in Subheading 3.1.3, these are the inputs for the estimation of the binding free energy profile and of the excess binding free energy through Eq. 38.
3.1.2 Replica-Exchange Conformational Sampling
Stratification implies that an alchemical binding free energy calculation is commonly carried out as a collection of molecular simulations, each with a different alchemical potential energy function (Eq. 35) at a series of values of the alchemical progress parameter λ. The accuracy of alchemical free energy calculations depends heavily on the quality of the conformational sampling at each λ-state. In this context, the challenge is to generate a diverse set of configurations distributed according to the Boltzmann distribution for the given temperature and potential energy function. It is not sufficient, as in molecular docking, to propose a set of low-energy configurations. The configurations should also
[Footnote 14] Specifically by replica-exchange molecular dynamics in temperature and λ space, as described in Subheading 3.1.2.
[Footnote 15] The ligand displacement approach to compute the alchemical potential energy was made necessary by the many-body nature of the implicit solvation model. As briefly discussed in Subheading 3.2, with pairwise decomposable potentials it is more common that λ is integrated into the calculation of individual interatomic interaction energies.
appear according to their probability of occurrence. Conformational sampling in alchemical simulations is carried out by Monte
Carlo and, more often, Molecular Dynamics (MD). MD conformational sampling is limited by the slow time-scales of biomolecules’
motion, and a host of advanced conformational sampling algorithms have been devised to overcome it [45]. Kilburg and Gallicchio employed two-dimensional replica-exchange conformational
sampling in temperature and alchemical spaces [31, 46].
It is useful to consider separately the problem of sampling
intermolecular degrees of freedom (the position and orientation
of the ligand relative to the receptor, denoted by ζL above) from the
sampling of intramolecular degrees of freedom (the individual
conformations of the peptide and the receptor, denoted by xL and xR). The first problem is related to the simulation algorithm’s ability to explore all relevant binding modes of the protein receptor complex for fixed receptor and peptide conformations. Missing the most stable binding mode would, of course, lead to an underestimate of the binding affinity. The sampling of intermolecular degrees of freedom is straightforward near the decoupled state (λ ≃ 0), where protein-peptide interactions are weak and the peptide can translate and rotate nearly freely within the binding site volume. In contrast, because of receptor-peptide interactions, rotations and translations are severely hindered near the coupled state (λ ≃ 1), where the peptide visits alternative binding modes only very rarely. Therefore, one solution to this problem is to let the MD thread evolve the system in λ space as well as in conformational space. In this way, new binding modes are formed when λ is small and, if they are sufficiently stable, they are retained when the MD thread visits more strongly coupled states at λ ≃ 1. Conversely, an MD thread in a metastable binding mode at λ ≃ 1 has an opportunity to acquire a smaller λ and convert to another binding mode. Of course, the excursions in λ space have to be such that a canonical ensemble of conformations is generated at each alchemical state.
The replica-exchange algorithm achieves this by evolving as
many MD threads as there are alchemical states. At any one point
in time, each MD thread j is assigned the λ value of a unique
alchemical state j. The collection of threads, called replicas, forms
an ensemble of independent canonical systems with the joint
canonical statistical weight function
$$\rho_{RE}(x_1, \ldots, x_n \,|\, \lambda_1, \ldots, \lambda_n) = \exp\left[-\beta \sum_{j=1}^{n} \Psi_{\lambda_j}(x_j)\right], \tag{39}$$
where Ψλ(x) is the alchemical potential energy function 36, xj
denotes the configuration of replica j, and λj is the value of λ
assigned to it. The joint distribution is sampled by alternating
updates of coordinates xj at a fixed assignment of λ values, which
is accomplished independently for each replica by conventional
constant temperature MD, with updates of the λ assignments.
The latter is performed by proposing permutations of λ assignments {λ1, …, λn} → {λ′1, …, λ′n} at fixed configurations xj and accepting or rejecting the move using the Metropolis Monte Carlo algorithm based on the ratio of the values of the proposed and original weight functions
$$\frac{\rho_{RE}(x_1, \ldots, x_n \,|\, \lambda'_1, \ldots, \lambda'_n)}{\rho_{RE}(x_1, \ldots, x_n \,|\, \lambda_1, \ldots, \lambda_n)}. \tag{40}$$
There are many variations of replica-exchange differing in the
nature of the replicas, the scheme of permutations of state assignments, and the computational implementation [47]. Schemes, such
as the one illustrated above, that modify the parameters of the
potential energy function are known in the field as Hamiltonian
replica-exchange algorithms [48]. Kilburg and Gallicchio used the Gibbs Independent Sampling Algorithm [17] for Hamiltonian reassignments and an asynchronous implementation [46] of replica-exchange that allows running the collection of replica simulations on heterogeneous and potentially unreliable computational resources such as computational grids [49].
Hamiltonian replica-exchange addresses the sampling of intermolecular degrees of freedom. However, because λ couples
receptor-peptide interactions, it has only an indirect influence on
the rate at which intramolecular degrees of freedom are sampled.
Peptides are very flexible and often change conformation upon binding. They often interact with the protein over an extended surface and cause substantial induced-fit reorganization of the receptor. Conformational rearrangements of peptides occur very slowly at room temperature, especially for the cyclic peptides investigated in this study. The temperature replica-exchange algorithm,
one of the first versions of replica-exchange proposed, [50] is very
useful for accelerating the sampling of the conformational space of
peptides and proteins [51, 52] and is applicable to free energy
calculations [53]. Kilburg and Gallicchio adopted a
two-dimensional replica-exchange scheme in which both the λ
and temperature assignments undergo permutations. The joint
canonical weight is generalized as
$$\rho_{RE}[x_1, \ldots, x_n \,|\, (\beta,\lambda)_1, \ldots, (\beta,\lambda)_n] = \exp\left[-\sum_{j=1}^{n} \beta_j \Psi_{\lambda_j}(x_j)\right], \tag{41}$$
where βj and λj are the inverse temperature and λ assigned to replica j, and (β, λ) is one of the n pair combinations of a set of inverse temperatures and alchemical states. Kilburg and Gallicchio employed 8 temperatures between 300 and 379 K and 26 alchemical states for a total of 208 replicas for each protein-peptide complex. The multi-dimensional replica-exchange algorithm made it possible to explore simultaneously multiple conformations of the peptide and multiple binding modes of each conformation.
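The two-dimensional scheme can be sketched by building the grid of (β, λ) states and generalizing the pairwise exchange test to the weight of Eq. 41. Only the counts (8 temperatures, 26 λ values) and the temperature end points come from the text. The geometric temperature ladder and the uniform λ spacing below are placeholder assumptions, as are the function names.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def reduced_energy(beta, lam, psi0, u):
    """Dimensionless energy beta * (Psi0 + lambda * u) of a configuration in a state."""
    return beta * (psi0 + lam * u)

def swap_acceptance_2d(state_a, state_b, conf_a, conf_b):
    """Metropolis probability for two replicas to trade (beta, lambda) states.

    state = (beta, lambda); conf = (decoupled energy Psi0, binding energy u).
    """
    before = reduced_energy(*state_a, *conf_a) + reduced_energy(*state_b, *conf_b)
    after = reduced_energy(*state_b, *conf_a) + reduced_energy(*state_a, *conf_b)
    log_ratio = before - after
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)

# Placeholder schedules: geometric temperatures from 300 to 379 K, uniform lambdas
temperatures = [300.0 * (379.0 / 300.0) ** (i / 7) for i in range(8)]
lambdas = [i / 25 for i in range(26)]
states = [(1.0 / (KB * t), lam) for t in temperatures for lam in lambdas]
```

When the two states share the same temperature, the decoupled energies cancel and the test reduces to the λ-only exchange criterion.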
3.1.3 Multi-State Free Energy Estimation
While Eq. 18 is formally correct, it is not an optimal free energy
estimator. Optimal here refers to a free energy estimator’s ability to
return a free energy estimate with the smallest bias relative to the
true free energy (accuracy) and smallest variance (precision) with a
given finite set of samples. Kilburg and Gallicchio employed the
Unbinned Weighted Histogram Analysis Method (UWHAM) estimator [44] which is considered an optimal free energy estimator
when no information about the system is known other than the samples
from the molecular simulations. The statistical and mathematical
origins of the method [44, 54] are beyond the scope of this chapter.
The main idea is to arrive at an estimate of the free energy ΔG(λ)
(Eq. 37) at λ by using the data collected at all λ-states. UWHAM
can be interpreted as an extension of the familiar Weighted Histogram Analysis Method (WHAM), [55] applied to Eq. 19 for the
maximum likelihood estimation of the distribution of binding
energies in the uncoupled ensemble p0(u) from the corresponding
distributions along the alchemical path pλ(u).
In this case, Kilburg and Gallicchio collected data as a function
of temperature as well as λ on a grid of 208 states. UWHAM
provides, in this case, optimal estimates of the dimensionless free
energy factor for each state defined, up to an additive constant (see footnote 16), as
$$F_r = \ln z_{RL}(\beta_r, \lambda_r), \tag{42}$$
where βr and λr are the values of the inverse temperature and of the
alchemical progress parameter of state r and zRL(β, λ) is defined by
Eq. 10. Given the free energy factors, the free energy profile as a function of temperature and λ is given by Eq. 37, or (see footnote 17)
$$\Delta G(\beta_r, \lambda_r) = -k_B T\, F_r. \tag{43}$$
The dimensionless free energy factors minimize the convex objective function [44] (see footnote 18)
$$\frac{1}{N} \sum_{s=1}^{N} \ln \left[ \sum_{r=1}^{n} \frac{N_r}{N}\, e^{-F_r}\, e^{-v_{rs}} \right] + \sum_{r=1}^{n} \frac{N_r}{N} F_r, \tag{44}$$
where N is the total number of samples collected at any of the n states, Nr is the number of samples collected at state r, and
$$v_{rs} = \beta_r \left[ \Psi_{0,s} + \lambda_r u_s \right] \tag{45}$$
16
Note that, because zRL is not dimensionless, the ambiguity of the additive constant is also related to the
arbitrariness of the units chosen to evaluate the logarithm.
17
Because the free energy estimates are known up to a temperature-dependent additive factor, differences
between free energies at different temperatures are generally meaningless. However, differences along λ at
different temperatures can be compared. For example, the binding free energy at one temperature, ΔGb(β) = ΔG(β, λ = 1) − ΔG(β, λ = 0), can be compared to the binding free energy estimate at a different temperature to, for
example, estimate the binding entropy.
18
The convexity property guarantees that there is a unique minimum.
is the dimensionless energy of sample s in state r, where Ψ0,s and us
are, respectively, the values of the decoupled potential energy and of
the binding energy of the sample collected during the replica-exchange alchemical simulations. The UWHAM optimizer implemented in the statistical program R was used to obtain the dimensionless free energy factors (cran.r-project.org/web/packages/UWHAM) (footnote 19).
Note that setting the gradient of the UWHAM objective function to zero leads to the self-consistent equations
$$f_r^{-1} = \sum_{s=1}^{N} \frac{e^{-v_{rs}}}{\sum_{r'=1}^{n} N_{r'}\, f_{r'}\, e^{-v_{r's}}}, \tag{46}$$
where $f_r = e^{-F_r}$. Eq. 46 is the basis of the equivalent Multi-state
Bennett Acceptance Ratio (MBAR) method to obtain the free
energy factors [57]. The UWHAM formulation of multi-state
reweighting has been found to be more generalizable than
MBAR’s [56]. For example, it has been recently employed to
impose global restraints on the free energy solutions [58].
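To make the self-consistent equations concrete, the following minimal Python sketch iterates them by fixed-point substitution for a small table of dimensionless energies. It is an illustration under stated assumptions, not the UWHAM R package or FastMBAR: the function names `logsumexp` and `solve_log_z`, the nested-list data layout, and the toy energies are all hypothetical. The code works directly with the logarithms of the normalizing constants, ln Z_r, which are determined only up to an additive constant (chosen here so that ln Z_0 = 0).

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def solve_log_z(v, counts, tol=1e-12, max_iter=10000):
    """Fixed-point iteration of the multi-state reweighting equations.

    v[r][s]   dimensionless energy of pooled sample s evaluated in state r
    counts[r] number of samples that were generated in state r
    Returns ln Z_r for each state, shifted so that ln Z_0 = 0.
    """
    n_states, n_samples = len(v), len(v[0])
    log_z = [0.0] * n_states
    for _ in range(max_iter):
        # ln of the mixture denominator: sum over r' of N_r' exp(-v[r'][s]) / Z_r'
        log_den = [
            logsumexp([math.log(counts[r]) - log_z[r] - v[r][s]
                       for r in range(n_states)])
            for s in range(n_samples)
        ]
        # Updated ln Z_r: ln of the sum over samples of exp(-v[r][s]) / denominator
        new = [
            logsumexp([-v[r][s] - log_den[s] for s in range(n_samples)])
            for r in range(n_states)
        ]
        ref = new[0]
        new = [x - ref for x in new]  # remove the arbitrary additive constant
        if max(abs(a - b) for a, b in zip(new, log_z)) < tol:
            return new
        log_z = new
    return log_z

# Toy data (made up): state 1 is state 0 shifted by a constant energy of 2,
# so ln(Z_1/Z_0) should come out as -2.
v0 = [0.0, 0.5, 1.0, 1.5]
v1 = [e + 2.0 for e in v0]
log_z = solve_log_z([v0, v1], counts=[2, 2])
```

Production implementations typically minimize the convex objective function with a Newton-type optimizer rather than relying on plain fixed-point iteration, which can converge slowly when neighboring states overlap poorly.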
3.2 Effect of Mutations on the Binding Affinity of Peptides to PDZ Protein Domains
PDZ protein domains are widespread protein-protein interaction
modules. They specifically recognize the 4 to 8 amino acids at the
C-terminus of proteins. Peptides and peptide derivatives
that mimic these binding motifs are investigated as potential therapeutics for many diseases [59]. Panel et al. [60] studied the binding
free energies between the TIAM1 PDZ domain and a series of
peptides derived from its syndecan-1 and caspr4 protein targets
(Fig. 4) using an alchemical relative binding free energy computational method generally known in the field as Free Energy Perturbation (FEP) [62, 63]. The study’s goal was to validate the
methodology for protein-peptide binding and obtain physical and
structural insights into the recognition mechanisms that allow PDZ
domains to target specific sequences.
3.2.1 Theory of Relative Binding Free Energy Calculations
The dataset considered by Panel et al. [60] included the TIAM1
PDZ domain bound to the wild-type peptides and a series of single
and double mutants. As discussed in Subheading 2.3.1, peptides
are generally too large and complex to be studied by double-decoupling absolute binding free energy calculations with explicit
solvation. Instead, the study employed a relative FEP method that
yields the difference between a peptide’s binding free energies
relative to a reference peptide. The approach is based on the thermodynamic cycle illustrated in Fig. 5. The reference peptide L1 is
alchemically transformed into a mutant L2 when bound to the
receptor and solvated in water. The difference in the free energies
19
Ding, Vilseck, and Brooks [56] developed a GPU implementation of UWHAM called FastMBAR (github.com/xqding/FastMBAR).
Fig. 4 The 4GVD crystal structure of the complex between the TIAM1 PDZ
domain (multi-color ribbons) and the pTKQEEFYA peptide (red tube) [61]
Fig. 5 The thermodynamic cycle used in the relative free energy perturbation
method. The vertical transformations correspond to the association equilibrium
between the receptor R and one of two ligands L1 and L2. The horizontal legs
correspond to the alchemical transformation of one ligand into the other alone in
solution (top) or in the complex (bottom)
ΔGbound and ΔGsolv of these two processes yields the difference in
the binding free energy of the two complexes. Therefore, the
method allows probing the effect of different mutations on the
binding affinity between the peptide and the receptor.
The statistical mechanics formula at the basis of this approach
can be derived, for example, from Eq. 20 by considering the
expression of the ratio of the binding constants Kb(2) and
Kb(1) for the RL2 and RL1 complexes, respectively. When taking
the ratio, the constant factors and the partition functions of the
solvent, of the receptor in the solvent, and of the ligands in vacuum
cancel, yielding
$$\frac{K_b(2)}{K_b(1)} = \frac{Z_{N,RL_2}\, Z_{N,L_1}}{Z_{N,RL_1}\, Z_{N,L_2}} = e^{-\beta \left[ \Delta G_{\mathrm{bound}} - \Delta G_{\mathrm{solv}} \right]}, \tag{47}$$
where the ratio of partition functions involving the receptor corresponds to the free energy difference ΔGbound between the complex
with ligand L2 in the solvent and the same system but with ligand L2
replaced by L1. Similarly, the ratio of partition functions of the
ligands in solution corresponds to the free energy difference
ΔGsolv (footnote 20). Finally, using Eq. 2, we obtain
$$\Delta\Delta G_b := \Delta G_b(2) - \Delta G_b(1) = \Delta G_{\mathrm{bound}} - \Delta G_{\mathrm{solv}}, \tag{48}$$
which is the key formula of the relative binding FEP method.
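As a small worked example of the thermodynamic cycle, the Python sketch below combines the two alchemical legs into a relative binding free energy and converts it into a ratio of binding constants. The function names and the numerical values of the two legs are made up for illustration and are not taken from the study of Panel et al.; energies are assumed to be in kcal/mol.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def relative_binding_free_energy(dg_bound, dg_solv):
    """Relative binding free energy: ddG_b = dG_bound - dG_solv."""
    return dg_bound - dg_solv

def binding_constant_ratio(ddg_b, temperature=300.0):
    """Ratio K_b(2)/K_b(1) = exp(-ddG_b / (k_B T))."""
    return math.exp(-ddg_b / (KB * temperature))

# Hypothetical alchemical legs for mutating L1 into L2 (kcal/mol)
dg_bound = -3.2   # transformation carried out in the complex
dg_solv = -1.7    # transformation carried out in solution
ddg = relative_binding_free_energy(dg_bound, dg_solv)
ratio = binding_constant_ratio(ddg)
```

With these invented numbers the mutant binds more strongly by 1.5 kcal/mol, which at 300 K corresponds to a binding constant roughly 12 times larger than that of the reference peptide.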
Let us now turn to the evaluation of ΔGbound and ΔGsolv by
alchemical computer simulations. As usual, the strategy is to compute ratios of partition functions as ensemble averages. However,
for example, the expression
$$\frac{Z_{N,L_2}}{Z_{N,L_1}} = \frac{\int dx_{L_2}\, dr_v^N\, e^{-\beta U(x_{L_2},\, r_v^N)}}{\int dx_{L_1}\, dr_v^N\, e^{-\beta U(x_{L_1},\, r_v^N)}} \tag{49}$$
cannot be directly turned into the form of an ensemble average
because, in general, the number and kind of the internal degrees of
freedom of the two ligands differ. Panel et al. [60] adopted the
so-called dual-topology strategy to address this issue (footnote 21), in which
the simulation is conducted with a hybrid peptide in which the
wild-type, say, and mutated amino acid side chains are both represented at the same time (Fig. 6). The alchemical potential energy
function is constructed so that the environment (the water solution
or the solvated receptor) interacts with the atoms of both forms of
the sidechain with a strength that depends on the alchemical charging parameter λ. Similarly, the intramolecular potential energy
function is designed so that the atoms of the protein backbone
interact by bond stretching, bond angle, torsional, and 1,4
non-bonded interactions with both forms of the sidechain. The
atoms of the two forms of the sidechain being mutated never
interact directly with each other.
Formally, the dual-topology approach is derived from Eq. 47
by multiplying and dividing each term by an appropriate partition
function that introduces the additional degrees of freedom to turn
each peptide into the hybrid peptide with both forms of the sidechain. For example, if the $Z_{N,L_1}$ term represents the peptide with the
phenylalanine (PHE) sidechain in solution (Fig. 6, red), multiplying and combining it with
$$Z_{\mathrm{ILE}} = \int d\zeta_{\mathrm{ILE}}\, dx_{\mathrm{ILE}}\, e^{-\beta U(\zeta_{\mathrm{ILE}})}\, e^{-\beta U(x_{\mathrm{ILE}})}, \tag{50}$$
20
Comparing the free energies of systems with different atomic composition and number of degrees of freedom is
arguably physically meaningless at this level of theory. However, note that the overall ratio of partition functions in
Eq. 47 is physically well defined. It represents the free energy difference between two systems, the first composed
of two solutions one containing the complex with L2 and the other containing L1, and the second in which L2 and
L1 have swapped places. Evidently, the free energy difference ΔGbound − ΔGsolv, which is the target of the theory, is
physically well defined even though the individual components may not be.
21
There is an analogous single-topology strategy [64] which we do not discuss here.
Fig. 6 Representation of the dual-topology alchemical mutation of a
phenylalanine (PHE, red) to isoleucine (ILE, green) of the TKQEEFYA peptide
considered by Panel et al. [60]. The illustration shows the peptide in solution. A
similar transformation is applied to the peptide bound to the PDZ domain
where, as in Eq. 10, ζILE represents the six external coordinates that
specify the position and orientation of the added isoleucine (ILE)
sidechain relative to the peptide backbone, U(ζILE) (footnote 22) represents the
potential energy terms that anchor the ILE sidechain to the peptide backbone (footnote 23), xILE represents the other internal degrees of
freedom of the ILE sidechain, and U(xILE) represents the intramolecular potential energy function that couples the atoms of ILE
together (footnote 24), transforms it into the partition function, which we will
denote by $Z_{N,L_{1(2)}}$, of the hybrid peptide in the PHE state in which
the ILE sidechain is “turned off.” By this we mean that the ILE
sidechain interacts only with the backbone through the U(ζILE)
potential and does not otherwise interact with the environment.
The same procedure applied to the partition function $Z_{N,RL_1}$ of the complex of the original peptide bound to the receptor, in the
denominator of Eq. 47, yields the partition function $Z_{N,RL_{1(2)}}$ of the
hybrid peptide in the PHE state bound to the receptor. Similarly,
multiplying and dividing by the term ZPHE, analogous to Eq. 50, to
install a PHE sidechain onto the peptide with the ILE sidechain
yields the partition functions $Z_{N,L_{(1)2}}$ and $Z_{N,RL_{(1)2}}$ for the hybrid
peptides in solution and bound to the receptor in their ILE states.
22
As explained there, this function acquires in the next section an “SD” superscript.
23
Other attachment modalities, including to the β carbon, are possible.
24
As further discussed later, here we have explicitly singled out the ζ degrees of freedom that couple the added
sidechain to the backbone to emphasize that they must be appropriately chosen, using, for example, the scheme
described by Boresch and Karplus [10], to avoid introducing spurious indirect interactions between backbone
atoms that would affect the conformational distribution of the original peptide [65].
With these preparations, Eq. 47 is finally rewritten as
$$\frac{K_b(2)}{K_b(1)} = \frac{Z_{N,RL_{(1)2}}\, Z_{N,L_{1(2)}}}{Z_{N,RL_{1(2)}}\, Z_{N,L_{(1)2}}} = e^{-\beta \Delta G_{\mathrm{bound}}}\, e^{+\beta \Delta G_{\mathrm{solv}}}, \tag{51}$$
where
$$\Delta G_{\mathrm{bound}} = -k_B T \ln \frac{Z_{N,RL_{(1)2}}}{Z_{N,RL_{1(2)}}} = -k_B T \ln \left\langle e^{-\beta u_2} \right\rangle_1. \tag{52}$$
Here, u2 is the change in potential energy of the system for a given
configuration of the solvated complex with the hybrid peptide due
to, in this example, turning off the PHE sidechain and turning on the
ILE sidechain, and ⟨…⟩1 represents the average over the ensemble
in which the PHE sidechain is on and the ILE sidechain is off. An
analogous ensemble average gives ΔGsolv for the transformation of
PHE into ILE in solution.
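The exponential average of Eq. 52 can be evaluated from a list of sampled energy changes as in the following Python sketch. The function name `exp_average_free_energy` and the toy samples are assumptions made for illustration; the logarithm of the average is computed with the largest exponent factored out to limit overflow and underflow.

```python
import math

def exp_average_free_energy(du, kT):
    """Free energy from exponential averaging: dG = -kT ln < exp(-du/kT) >.

    du  list of energy changes u2 sampled in the reference (state 1) ensemble
    kT  thermal energy in the same units as du
    """
    x = [-d / kT for d in du]
    m = max(x)  # factor out the largest exponent for numerical stability
    log_mean = m + math.log(sum(math.exp(xi - m) for xi in x) / len(x))
    return -kT * log_mean

# Made-up samples of the perturbation energy, in units where kT = 1
samples = [0.0, 2.0]
dg = exp_average_free_energy(samples, kT=1.0)
```

Because the average is dominated by the samples with the most negative energy changes, the estimate is never larger than the plain mean of the samples and can converge slowly when the two end states overlap poorly, which is one reason for introducing intermediate alchemical states.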
3.2.2 Alchemical Transformations for Relative Binding Free Energies
As discussed in Subheadings 2.3 and 3.1.1, the free energies ΔGsolv
and ΔGbound for mutating one sidechain into another are calculated
in practice using a hybrid alchemical potential energy function
Uλ(x) parametrized by a progress parameter λ. Panel et al. [60]
used the NAMD molecular simulation package [66], which implements the alchemical potential [65]
$$U_\lambda(x) = U_{L_{12}}(x) + (1 - \lambda)\, U_{L_1}(x, 1 - \lambda) + \lambda\, U_{L_2}(x, \lambda), \tag{53}$$
where x is the collection of all of the degrees of freedom of the dual-topology peptide system and $U_{L_{12}}(x)$ contains the potential energy
terms that do not depend on λ,
$$U_{L_{12}}(x) = U_0(x) + U^{SD}_{L_1}(\zeta_1) + U^{SD}_{L_2}(\zeta_2), \tag{54}$$
where U0(x) is the unperturbed component of the potential energy
(including the intramolecular potential energy terms of the dual-topology sidechains not affected by the transformation, but excluding interactions between the two sidechains), the terms
$U^{SD}_{L_i}(\zeta_i)$ represent the auxiliary restraints used in the dual-topology
scheme to anchor each sidechain to the backbone (see Eq. 50), and
$$U_{L_i}(x, \lambda) = U^{NB}_{L_i}(x, \lambda) + U^{SS}_{L_i}(x) + U^{SD}_{L_i}(x), \tag{55}$$
where $U^{NB}_{L_i}$ denotes non-bonded interactions between the sidechain atoms and the environment, $U^{SS}_{L_i}$ denotes the bonded interactions (1–2,
1–3, and 1–4) among backbone atoms with sidechain i,
and $U^{SD}_{L_i}$ is the corresponding term for bonded interactions
between the backbone atoms and the sidechain (footnote 25). As illustrated
by Eq. 55, the non-bonded component has an explicit λ dependence due to the use of separation-shifted soft-core pair potentials
25
The S symbol stands for the single-topology region (the backbone in this case), and D stands for dual-topology
region (the two sidechains) [65, 67].
[65, 67] to describe the non-bonded interactions between the
dual-topology sidechains and the rest of the system.
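The structure of the hybrid potential of Eq. 53 can be illustrated with a toy Python sketch in which each sidechain interacts with the environment through a single Lennard-Jones pair. This is only a schematic under stated assumptions: the soft-core expression below is one commonly used separation-shifted form and is not claimed to be the exact NAMD expression, and the function names, the parameters `epsilon`, `sigma`, and `delta`, and the pair distances are all hypothetical.

```python
def soft_core_lj(r, lam, epsilon=0.2, sigma=3.5, delta=5.0):
    """Lennard-Jones pair energy with a separation-shifted soft core.

    The squared distance is shifted by delta * (1 - lam), so the energy stays
    finite at r = 0 unless the interaction is fully coupled (lam = 1), where
    the plain Lennard-Jones form is recovered.
    """
    r2 = r * r + delta * (1.0 - lam)
    s6 = (sigma * sigma / r2) ** 3
    return 4.0 * epsilon * (s6 * s6 - s6)

def hybrid_energy(lam, u_common, r1, r2):
    """Toy dual-topology energy.

    U_lambda = U_common + (1 - lam) * U_L1(1 - lam) + lam * U_L2(lam)
    u_common  lambda-independent part of the energy
    r1, r2    environment distances of sidechain 1 and sidechain 2
    """
    return (u_common
            + (1.0 - lam) * soft_core_lj(r1, 1.0 - lam)
            + lam * soft_core_lj(r2, lam))

# At lam = 0 only sidechain 1 is coupled, even if sidechain 2 overlaps the environment
u_start = hybrid_energy(0.0, u_common=1.0, r1=4.0, r2=0.0)
# At lam = 1 only sidechain 2 is coupled
u_end = hybrid_energy(1.0, u_common=1.0, r1=0.0, r2=4.0)
```

The soft core is what allows the decoupled sidechain to overlap with solvent or receptor atoms near the end states without producing the singular energies that a linearly scaled Lennard-Jones potential would give.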
It is straightforward to see that Eq. 53 evaluated at
λ = 0 describes the L1(2) state of the dual-topology peptide with
sidechain 2 turned off and that, conversely, λ = 1 describes the L(1)2 state. Panel et al. [60] simulated 11 alchemical states from λ = 0 to
λ = 1. The change in free energy from λr to λr+1 was evaluated using
the Bennett Acceptance Ratio (BAR) method, which is MBAR
(Eq. 46) for two states and where, in this case,
$$v_{rs} = \beta U_{\lambda_r}(x_s) \tag{56}$$
is the dimensionless alchemical potential energy at λr of the conformational
sample xs collected at either λr or λr+1 (footnote 26).
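For two neighboring λ states, the BAR estimate can be obtained by solving a single implicit equation, as in the Python sketch below. It assumes the standard form of the BAR equation written in terms of dimensionless energy differences and solves it by bisection; the function names `fermi` and `bar` and the toy data are hypothetical, and this is not the NAMD analysis code.

```python
import math

def fermi(x):
    """Fermi function 1 / (1 + exp(x)), written to avoid overflow."""
    if x > 0.0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))

def bar(w_forward, w_reverse, tol=1e-10):
    """Dimensionless free energy difference between states r and r+1.

    w_forward  v_{r+1,s} - v_{r,s} for the samples collected at state r
    w_reverse  v_{r,s} - v_{r+1,s} for the samples collected at state r+1
    Returns df = -ln(Z_{r+1}/Z_r); multiply by kT for energy units.
    """
    n0, n1 = len(w_forward), len(w_reverse)
    m = math.log(n0 / n1)

    def g(df):
        # Monotonically increasing in df; its root is the BAR estimate
        lhs = sum(fermi(m + w - df) for w in w_forward)
        rhs = sum(fermi(-m + w + df) for w in w_reverse)
        return lhs - rhs

    lo = min(min(w_forward), -max(w_reverse)) - 1.0
    hi = max(max(w_forward), -min(w_reverse)) + 1.0
    while g(lo) > 0.0:   # widen the bracket if needed
        lo -= 10.0
    while g(hi) < 0.0:
        hi += 10.0
    while hi - lo > tol:  # bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Made-up energy differences, symmetric about a free energy difference of 2
df = bar([1.0, 3.0], [-1.0, -3.0])
```

Summing the estimates over the consecutive pairs of states then gives the total free energy change between λ = 0 and λ = 1.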
3.3 Potential of Mean Force Study of the Binding of the MEEVD Peptide to the TPR2A Receptor
The heat shock organizing protein (Hop) binds specifically to the
heat shock protein Hsp90 through its tetratricopeptide repeat
(TPR) domain TPR2A. TPR modules are widespread protein
domains responsible for the specific recognition patterns of many
proteins. Due to their molecular recognition characteristics, engineered TPR domains are seen as potential alternatives to antibody-derived biological medicines. Lapelosa [22] studied the binding of
the MEEVD peptide from Hsp90 to the TPR2A domain of Hop
(Fig. 7) using the potential of mean force methodology outlined in
Subheading 2.4. The work yielded an estimate of the standard free
energy of binding between TPR2A and MEEVD in good agreement with experimental measurements. It provided structural
insights into the entry and exit mechanism of the peptide from
the receptor binding site.
3.3.1 Calculation of the Standard Binding Free Energy
Lapelosa [22] computed a 1-dimensional radial potential of mean
force (PMF), ΔF(r), along the center of mass separation r between
the receptor and the peptide (Fig. 7) using the Adaptive Biasing
Force (ABF) method described in the next section. The PMF was
then employed to compute the free energy of binding. The expression of the binding constant in terms of the radial PMF is derived
from Eqs. 33 and 34 by expressing the integral in terms of spherical
polar coordinates (r, θ, ϕ), where r is the distance between the
centers of mass of the receptor and the peptide, θ is the angle
between the line connecting the centers of mass and the axis
connecting the C-α atoms of two chosen residues of the receptor,
and ϕ is an azimuthal angle (which can be considered arbitrary
because neither the conical sampling region nor the binding site
region depends on it). Following a procedure similar to the one
that yielded Eq. 33 from Eq. 26, we carry out the integration in
Eq. 33 over the θ and ϕ coordinates to obtain
26
The numerator and the denominator of Eq. 46 are often combined to cast the formula in terms of energy
differences $v_{r's} - v_{rs}$.
Fig. 7 Illustration of the calculation of the binding free energy of the complex
between the Hop TPR2A domain (multi-colored ribbon) with the MEEVD peptide
(red and pink tubes). The MEEVD peptide is shown in its position in the crystal
structure (PDB id: 1ELR [68], pink tube) and in a representative position and
orientation (red tube) within the simulation cone (the yellow shaded region). The
potential of mean force is collected along the distance (black arrow) between the
center of mass of the receptor and the center of mass of the peptide while the
peptide is kept within the cone. The arc across the cone delineates the binding
site region r < rb
$$K_b = C^\circ \int_0^{r_b} dr\, r^2\, \frac{\bar{p}(r)}{p(r^*, \theta^*, \phi^*)},$$
where $p(r^*, \theta^*, \phi^*) = p(\mathbf{r}_L^*)$, and
$$\bar{p}(r) = \int_{\cos\theta_0}^{1} d(\cos\theta) \int_0^{2\pi} d\phi\; p(r, \theta, \phi) \tag{57}$$
is the polar angle-averaged probability density of the ligand position in the conical region. Considering the value of the radial
probability density at a distance r* in the conical region far away
from the receptor and integrated over the polar angles,
$$\bar{p}(r^*) = 2\pi (1 - \cos\theta_0)\, p(r^*, \theta^*, \phi^*), \tag{58}$$
we finally obtain (footnote 27)
$$K_b = 2\pi (1 - \cos\theta_0)\, C^\circ \int_0^{r_b} dr\, r^2\, e^{-\beta \Delta F(r)}, \tag{59}$$
where rb = 20 Å is the limiting radial distance of the binding site
region and θ0 = 60° is the angle of aperture of the cone, and
27
Probably because of a typo, the 2π factor is missing in the corresponding expression (Eq. 2) of the paper by
Lapelosa [22].
$$\Delta F(r) = -k_B T \ln \frac{\bar{p}(r)}{\bar{p}(r^*)} \tag{60}$$
is the radial PMF relative to the bulk distance r* = 30 Å.
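Given a tabulated radial PMF, the integral of Eq. 59 is straightforward to evaluate numerically. The Python sketch below uses the trapezoid rule and then converts the binding constant into a standard binding free energy through ΔG° = −kB T ln Kb. It is a minimal illustration: the function names, the Gaussian-well PMF, and the temperature are invented, θ0 is taken as the maximum polar angle appearing in the integration limits, and the standard concentration of 1 M is expressed as one molecule per 1660.5 Å³.

```python
import math

KB = 0.0019872041   # Boltzmann constant, kcal/(mol K)
C0 = 1.0 / 1660.5   # 1 M standard concentration, molecules per cubic angstrom

def binding_constant(r, pmf, r_b=20.0, theta0_deg=60.0, temperature=300.0):
    """K_b = 2 pi (1 - cos theta0) C0 * integral_0^r_b r^2 exp(-pmf/kT) dr.

    r    increasing grid of distances (angstrom)
    pmf  radial PMF on that grid (kcal/mol), zero in the bulk
    """
    kT = KB * temperature
    integral = 0.0
    for i in range(len(r) - 1):
        if r[i + 1] > r_b + 1e-9:   # stop at the edge of the binding site region
            break
        left = r[i] ** 2 * math.exp(-pmf[i] / kT)
        right = r[i + 1] ** 2 * math.exp(-pmf[i + 1] / kT)
        integral += 0.5 * (left + right) * (r[i + 1] - r[i])
    solid_angle = 2.0 * math.pi * (1.0 - math.cos(math.radians(theta0_deg)))
    return solid_angle * C0 * integral

def standard_binding_free_energy(k_b, temperature=300.0):
    """Standard binding free energy, dG0 = -kT ln K_b, in kcal/mol."""
    return -KB * temperature * math.log(k_b)

# Made-up PMF: a 6 kcal/mol Gaussian well centered at 8 angstrom, flat in the bulk
r = [0.01 * i for i in range(3001)]   # 0 to 30 angstrom
pmf = [-6.0 * math.exp(-((x - 8.0) ** 2) / 4.0) for x in r]
k_b = binding_constant(r, pmf)
dg0 = standard_binding_free_energy(k_b)
```

For a completely flat PMF the integral reduces to the volume of the conical binding site region multiplied by C°, which provides a convenient sanity check of the numerical integration.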
3.3.2 Calculation of the Potential of Mean Force Using the Adaptive Biasing Force Method
The peptide’s radial PMF, ΔF(r), was evaluated using the Adaptive
Biasing Force (ABF) method [69]. ABF serves the dual purpose of
accelerating the sampling of the peptide positions relative to the
receptor and providing an estimate of the PMF. ABF introduces a
fictitious biasing force fb(r) along the radial direction such that the
observed distribution of distances with the addition of the biasing
force, pobs(r), is flat within the sampling region (in this case the
region within the cone illustrated in Fig. 7 with a θ0 = 60° angle of
aperture and up to r < r* = 30 Å).
A derivation of ABF is beyond the scope of this chapter. To motivate it, however, first note that differentiation of Eq. 30 leads to
the conclusion that the gradient of the PMF with respect to ζL is the
average gradient of the system potential energy function,
$$\frac{\partial \Delta F(\zeta_L)}{\partial \zeta_L} = \left\langle \frac{\partial U}{\partial \zeta_L} \right\rangle_{\zeta_L}, \tag{61}$$
where U is the potential energy function of the solvated system and
h. . .iζL represents an ensemble average at fixed ζL. In other words,
the negative of the gradient of the PMF is the system force averaged
over the degrees of freedom of the system other than those along
which the PMF is defined, thereby justifying the name potential of
mean force for ΔF(ζ L). The same conclusion applies to forms of the
PMF averaged over some coordinates such as ligand orientations
(Eq. 29), including the 1-dimensional radial PMF, ΔF(r), considered in the work of Lapelosa (footnote 28).
Also, note that the PMF along a coordinate is proportional to
the logarithm of the probability distribution for that coordinate
(Eq. 29). Thus, a flat distribution indicates that the overall force,
that is, the mean force plus the biasing force along the coordinate, is zero
or, equivalently, that the added biasing force is equal and opposite
to the mean force. This implies that the potential of mean force can
be obtained by integrating the biasing force that flattens the radial
distribution. The additional benefit of having a flat distribution is
that the dynamics along the chosen coordinate are more likely to be
diffusive and not impeded by free energy barriers. Indeed, several
independent binding/unbinding events have been reported in the
study by Lapelosa [22].
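The bookkeeping behind this idea can be sketched in a few lines of Python: instantaneous forces along the coordinate are accumulated in bins, the biasing force is the negative of the running average, and the PMF follows by integrating the negative mean force. The class below is a deliberately simplified illustration and not the ABF implementation used in the study: the class and method names are hypothetical, the synthetic forces come from a harmonic well, and the Jacobian terms of the radial coordinate and the gradual ramping of the bias used by practical implementations are ignored.

```python
class MeanForceAccumulator:
    """Binned running average of the force along a 1-D coordinate."""

    def __init__(self, r_min, r_max, n_bins):
        self.r_min = r_min
        self.width = (r_max - r_min) / n_bins
        self.sums = [0.0] * n_bins
        self.counts = [0] * n_bins

    def _bin(self, r):
        i = int((r - self.r_min) / self.width)
        return min(max(i, 0), len(self.sums) - 1)

    def add_sample(self, r, force):
        """Record the instantaneous force observed at coordinate r."""
        i = self._bin(r)
        self.sums[i] += force
        self.counts[i] += 1

    def mean_force(self, i):
        return self.sums[i] / self.counts[i] if self.counts[i] else 0.0

    def biasing_force(self, r):
        """Bias that cancels the current estimate of the mean force."""
        return -self.mean_force(self._bin(r))

    def pmf(self):
        """Integrate dF/dr = -<force> over bin centers; zero at the last bin."""
        n = len(self.sums)
        centers = [self.r_min + (i + 0.5) * self.width for i in range(n)]
        values = [0.0]
        for i in range(1, n):
            avg = 0.5 * (self.mean_force(i) + self.mean_force(i - 1))
            values.append(values[-1] - avg * self.width)
        return centers, [v - values[-1] for v in values]

# Synthetic forces from a harmonic well with force constant 3 centered at r = 2
acc = MeanForceAccumulator(0.0, 4.0, 8)
for i in range(8):
    c = 0.25 + 0.5 * i
    acc.add_sample(c, -3.0 * (c - 2.0))
centers, pmf = acc.pmf()
```

For these synthetic data the recovered PMF is the expected parabola, with its minimum in the bins adjacent to r = 2 and a value of zero at the outermost bin, which plays the role of the bulk reference.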
28
In this case the radial force is interpreted in terms of the force of a central potential, and Eq. 61 has additional
terms due to the Jacobian of the radial coordinate [69].
4 Conclusion
This chapter has shown how a statistical mechanics formulation of
the non-covalent molecular association from first principles gives
rise to different computational methods to estimate the binding
free energies of protein-peptide complexes. The three case studies
illustrate the application of each method to particular molecular
complexes and how they are tailored to achieve specific goals. It is
much more challenging to apply rigorous binding free energy
estimation methods to protein-peptide complexes than to
small-molecule binding. We hope that this chapter illustrates how
a good appreciation of the underlying theories and their computational implementations helps understand the practices connected
with each approach and its strengths and limitations.
Acknowledgement
E.G. acknowledges support from the National Science Foundation
(NSF CAREER 1750511).
References
1. Kastritis PL, Bonvin AM (2013) On the binding affinity of macromolecular interactions:
daring to ask why proteins interact. J Roy Soc
Interface 10(79):20120835
2. Kilburg D, Gallicchio E (2016) Recent
advances in computational models for the
study of protein-peptide interactions. Adv Prot
Chem Struct Biol 105:27–57
3. D’Annessa I, Di Leva FS, La Teana A,
Novellino E, Limongelli V, Di Marino D
(2020) Bioinformatics and biosimulations as
toolbox for peptides and peptidomimetics
design: where are we? Front Molecular
Biosci, 7
4. Mihailescu M, Gilson MK (2004) On the theory of noncovalent binding. Biophys J 87:
23–36
5. Gibb CL, Gibb BC (2013) Binding of cyclic
carboxylates to octa-acid deep-cavity cavitand.
J Comp Aided Mol Des 28:1–7
6. Judy E, Kishore N (2020) Discrepancies in
thermodynamic information obtained from
calorimetry and spectroscopy in ligand binding
reactions: Implications on correct analysis in
systems of biological importance. Bullet
Chem Soc Jpn 94
7. Simonson T (2016) The physical basis of
ligand binding. In: In silico drug discovery
and design, pp 3–43
8. Tsiang M, Jones GS, Hung M, Mukund S,
Han B, Liu X, Babaoglu K, Lansdon E,
Chen X, Todd J, Cai T, Pagratis N,
Sakowicz R, Geleziunas R (2009) Affinities
between the binding partners of the HIV-1
integrase dimer-lens epithelium-derived
growth factor (IN dimer-LEDGF) complex. J
Biol Chem 284(48):33580–33599
9. Ranganathan A, Heine P, Rudling A,
Pluckthun A, Kummer L, Carlsson J (2017)
Ligand discovery for a peptide-binding GPCR
by structure-based screening of fragment-and
lead-like chemical libraries. ACS Chem Biol
12(3):735–745
10. Boresch S, Tettinger F, Leitgeb M, Karplus M
(2003) Absolute binding free energies: a quantitative approach for their calculation. J Phys
Chem B 107:9535–9551
11. Marcotrigiano J, Gingras A-C, Sonenberg N,
Burley SK (1999) Cap-dependent translation
initiation in eukaryotes is regulated by a molecular mimic of eIF4G. Mol Cell 3(6):707–716
12. Wysocka J (2006) Identifying novel proteins
recognizing histone modifications using peptide pull-down assay. Methods 40(4):339–343
13. Gilson MK, Given JA, Bush BL, McCammon
JA (1997) The statistical-thermodynamic basis
for computation of binding affinities: a critical
review. Biophys J 72:1047–1069
14. Gallicchio E, Levy RM (2011) Recent theoretical and computational advances for modeling
protein-ligand binding affinities. Adv Prot
Chem Struct Biol 85:27–80
15. Hill TL (1986) An introduction to statistical
thermodynamics. Dover, New York
16. Roux B, Simonson T (1999) Implicit solvent
models. Biophys Chem 78:1–20
17. Pal RK, Gallicchio E (2019) Perturbation
potentials to overcome order/disorder transitions in alchemical binding free energy calculations. J Chem Phys 151(12):124116
18. Gallicchio E, Lapelosa M, Levy RM (2010)
Binding energy distribution analysis method
(BEDAM) for estimation of protein-ligand
binding affinities. J Chem Theory Comput 6:
2961–2977
19. Ben Naim A (1974) Water and aqueous solutions. Plenum, New York
20. Gallicchio E, Kubo MM, Levy RM (1998)
Entropy-enthalpy compensation in solvation
and ligand binding revisited. J Am Chem Soc
120:4526–27
21. Limongelli V, Bonomi M, Parrinello M (2013)
Funnel metadynamics as accurate binding free-energy method. Proc Natl Acad Sci 110(16):
6358–6363
22. Lapelosa M (2017) Free energy of binding and
mechanism of interaction for the MEEVD-TPR2A
peptide–protein complex. J Chem Theory
Comput 13(9):4514–4523
23. Cruz J, Wickstrom L, Yang D, Gallicchio E,
Deng N (2020) Combining alchemical transformation with a physical pathway to accelerate
absolute binding free energy calculations of
charged ligands to enclosed binding sites. J
Chem Theory Comput 16(4):2803–2813
24. Cherepanov P, Maertens G, Proost P,
Devreese B, Van Beeumen J, Engelborghs Y,
De Clercq E, Debyser Z (2003) HIV-1 integrase forms stable tetramers and associates with
LEDGF/p75 protein in human cells. J Biol
Chem 278(1):372–381
25. Peat TS, Rhodes DI, Vandegraaff N, Le G,
Smith JA, Clark LJ, Jones ED, Coates JA,
Thienthong N, Newman J, et al (2012) Small
molecule inhibitors of the LEDGF site of
human immunodeficiency virus integrase identified by fragment screening and structure
based design. PloS One 7:e40147
26. Fader LD, Malenfant E, Parisien M, Carson R,
Bilodeau F, Landry S, Pesant M, Brochu C,
Morin S, Chabot C, et al (2014). Discovery
of BI 224436, a noncatalytic site integrase
inhibitor (NCINI) of HIV-1. ACS Med
Chem Lett 5(4):422–427
27. Zhang F-H, Debnath B, Xu Z-L, Yang L-M,
Song L-R, Zheng Y-T, Neamati N, Long Y-Q
(2017) Discovery of novel
3-hydroxypicolinamides as selective inhibitors
of HIV-1 integrase-LEDGF/p75 interaction.
Eur J Med Chem 125:1051–1063
28. Cherepanov P, Ambrosio AL, Rahman S,
Ellenberger T, Engelman A (2005) Structural
basis for the recognition between HIV-1 integrase and transcriptional coactivator p75. Proc
Natl Acad Sci 102(48):17308–17313
29. Rhodes DI, Peat TS, Vandegraaff N,
Jeevarajah D, Newman J, Martyn J, Coates
JAV, Ede NJ, Rea P, Deadman JJ (2011) Crystal structures of novel allosteric peptide inhibitors of HIV integrase identify new interactions
at the LEDGF binding site. ChemBioChem
12(15):2311–2315
30. Gallicchio E, Deng N, He P, Perryman AL,
Santiago DN, Forli S, Olson AJ, Levy RM
(2014) Virtual screening of integrase inhibitors
by large scale binding free energy calculations:
the SAMPL4 challenge. J Comp Aided Mol
Des 28:475–490
31. Kilburg D, Gallicchio E (2018) Assessment of a
single decoupling alchemical approach for the
calculation of the absolute binding free energies of protein-peptide complexes. Front
Molecular Biosci 5:22
32. Lapelosa M, Gallicchio E, Levy RM (2012)
Conformational transitions and convergence
of absolute binding free energy calculations. J
Chem Theory Comput 8:47–60
33. Banks J, Beard J, Cao Y, Cho A, Damm W,
Farid R, Felts A, Halgren T, Mainz D,
Maple J, Murphy R, Philipp D, Repasky M,
Zhang L, Berne B, Friesner R, Gallicchio E,
Levy R (2005) Integrated modeling program,
applied chemical theory (IMPACT). J Comp
Chem 26:1752–1780
34. Eastman P, Swails J, Chodera JD, McGibbon
RT, Zhao Y, Beauchamp KA, Wang LP, Simmonett AC, Harrigan MP, Stern CD, et al
(2017) OpenMM 7: rapid development of
high performance algorithms for molecular
dynamics. PLoS Comp Bio 13(7):e1005659
35. Di Marino D, D’Annessa I, Tancredi H,
Bagni C, Gallicchio E (2015) A unique binding
mode of the eukaryotic translation initiation
factor 4E for guiding the design of novel peptide inhibitors. Prot Sci 24:1370–1382
36. Kaminski GA, Friesner RA, Tirado-Rives J, Jorgensen WL (2001) Evaluation and reparameterization of the OPLS-AA force field for
proteins via comparison with accurate quantum
chemical calculations on peptides. J Phys Chem
B 105:6474–6487
37. Gallicchio E, Levy R (2004) AGBNP: an analytic implicit solvent model suitable for molecular dynamics simulations and high-resolution
modeling. J Comput Chem 25:479–499
38. Gallicchio E, Paris K, Levy RM (2009) The
AGBNP2 implicit solvation model. J Chem
Theory Comput 5:2544–2564
39. Zhang B, Kilburg D, Eastman P, Pande VS,
Gallicchio E (2017) Efficient gaussian density
formulation of volume and surface areas of
macromolecules on graphical processing units.
J Comp Chem 38:740–752
40. Chipot C, Pohorille A (eds) (2007) Free energy
calculations: theory and applications in chemistry and biology. Springer series in chemical
physics. Springer, Berlin
41. Zwanzig RW (1954) High-temperature equation
of state by a perturbation
method. I. Nonpolar gases. J Chem Phys
22(8):1420–1426
42. Jorgensen WL, Thomas LL (2008) Perspective
on free-energy perturbation calculations for
chemical equilibria. J Chem Theory Comput
4:869–876
43. Khuttan S, Azimi S, Wu J, Gallicchio E (2021)
Alchemical transformations for concerted
hydration free energy estimation with explicit
solvation. J Chem Phys 154:054103
44. Tan Z, Gallicchio E, Lapelosa M, Levy RM
(2012) Theory of binless multi-state free
energy estimation with applications to
protein-ligand binding. J Chem Phys 136:
144102
45. Gallicchio E, Levy RM (2011) Advances in all
atom sampling methods for modeling protein-ligand binding affinities. Curr Opin Struct Biol
21:161–166
46. Gallicchio E, Xia J, Flynn WF, Zhang B,
Samlalsingh S, Mentes A, Levy RM (2015)
Asynchronous replica exchange software for
grid and heterogeneous computing. Comput
Phys Commun 196:236–246
47. Chodera J, Shirts M (2011) Replica exchange
and expanded ensemble simulations as Gibbs
sampling: simple improvements for enhanced
mixing. J Chem Phys 135:194110
48. Sugita Y, Kitao A, Okamoto Y (2000) Multidimensional replica-exchange method for free-energy calculations. J Chem Phys 113:
6042–6051
49. Xia J, Flynn W, Gallicchio E, Uplinger K, Armstrong JD, Forli S, Olson AJ, Levy RM (2019)
Massive-scale binding free energy simulations
of HIV integrase complexes using asynchronous replica exchange framework implemented
on the IBM WCG distributed network. J Chem
Inf Model 59(4):1382–1397
50. Sugita Y, Okamoto Y (1999) Replica-exchange
molecular dynamics method for protein folding. Chem Phys Lett 314:141–151
51. Felts AK, Harano Y, Gallicchio E, Levy RM
(2004) Free energy surfaces of beta-hairpin
and alpha-helical peptides generated by replica
exchange molecular dynamics with the
AGBNP implicit solvent model. Proteins:
Struct Funct Bioinf 56:310–321
52. Andrec M, Felts AK, Gallicchio E, Levy RM
(2005) Protein folding pathways from replica
exchange simulations and a kinetic network
model. Proc Natl Acad Sci USA 102:
6801–6806
53. Rick SW (2006) Increasing the efficiency of
free energy calculations using parallel tempering and histogram reweighting. J Chem Theory Comput 2:939–946
54. Tan Z (2004) On a likelihood approach for
Monte Carlo integration. J Am Stat Assoc 99:
1027–1036
55. Gallicchio E, Andrec M, Felts AK, Levy RM
(2005) Temperature weighted histogram analysis method, replica exchange, and transition
paths. J Phys Chem B 109:6722–6731
56. Ding X, Vilseck JZ, Brooks III CL (2019) Fast
solver for large scale multistate Bennett acceptance ratio equations. J Chem Theory Comput
15(2):799–802
57. Shirts MR, Chodera JD (2008) Statistically
optimal analysis of samples from multiple equilibrium states. J Chem Phys 129:124105
58. Giese TJ, York DM (2021) Variational method
for networkwide analysis of relative ligand
binding free energies with loop closure and
experimental constraints. J Chem Theory
Comput 17:1326–1336
59. Subbaiah VK, Kranjec C, Thomas M, Banks L
(2011) PDZ domains: the building blocks regulating tumorigenesis. Biochem J 439(2):
195–205
60. Panel N, Villa F, Fuentes EJ, Simonson T
(2018) Accurate PDZ/peptide binding specificity with additive and polarizable free energy
simulations. Biophys J 114(5):1091–1102
61. Liu X, Shepherd TR, Murray AM, Xu Z,
Fuentes EJ (2013) The structure of the tiam1
PDZ domain/phospho-syndecan1 complex
reveals a ligand conformation that modulates
protein dynamics. Structure 21(3):342–354
62. Clark AJ, Gindin T, Zhang B, Wang L, Abel R,
Murret CS, Xu F, Bao A, Lu NJ, Zhou T,
Kwong PD, Shapiro L, Honig B, Friesner RA
(2017) Free energy perturbation calculation of
relative binding free energy between broadly
neutralizing antibodies and the GP120 glycoprotein of HIV-1. J Mol Biol 429(7):930–947
63. Clark AJ, Negron C, Hauser K, Sun M,
Wang L, Abel R, Friesner RA (2019) Relative
binding affinity prediction of charge-changing
sequence mutations with FEP in protein–protein interfaces. J Mol Biol 431(7):1481–1493
64. Mey ASJS, Allen BK, Macdonald HEB, Chodera JD, Hahn DF, Kuhn M, Michel J, Mobley
DL, Naden LN, Prasad S, Rizzi A, Scheen J,
Shirts MR, Tresadern G, Xu H (2020) Best
practices for alchemical free energy calculations
[article v1.0]. Living J Comput Mol Sci 2(1):
18378
65. Jiang W, Chipot C, Roux B (2019) Computing
relative binding affinity of ligands to receptor:
an effective hybrid single-dual-topology free-energy perturbation approach in NAMD. J
Chem Inf Model 59(9):3794–3802
66. Phillips JC, Braun R, Wang W, Gumbart J,
Tajkhorshid E, Villa E, Chipot C, Skeel RD,
Kale L, Schulten K (2005) Scalable molecular
dynamics with NAMD. J Comp Chem 26(16):
1781–1802
67. Steinbrecher T, Mobley DL, Case DA (2007)
Nonlinear scaling schemes for Lennard-Jones
interactions in free energy calculations. J Chem
Phys 127:214108
68. Scheufler C, Brinker A, Bourenkov G,
Pegoraro S, Moroder L, Bartunik H, Hartl
FU, Moarefi I (2000) Structure of TPR
domain–peptide complexes: critical elements
in the assembly of the Hsp70–Hsp90 multichaperone machine. Cell 101(2):199–210
69. Comer J, Gumbart JC, Hénin J, Lelièvre T,
Pohorille A, Chipot C (2015) The adaptive
biasing force method: everything you always
wanted to know but were afraid to ask. J Phys
Chem B 119(3):1129–1151
Chapter 16
Computational Evolution Protocol for Peptide Design
Rodrigo Ochoa, Miguel A. Soler, Ivan Gladich, Anna Battisti,
Nikola Minovski, Alex Rodriguez, Sara Fortuna, Pilar Cossio,
and Alessandro Laio
Abstract
Computational peptide design is useful for therapeutics, diagnostics, and vaccine development. To select
the most promising peptide candidates, the key is describing accurately the peptide–target interactions at
the molecular level. We here review a computational peptide design protocol whose key feature is the use of
all-atom explicit solvent molecular dynamics for describing the different peptide–target complexes explored
during the optimization. We describe the milestones behind the development of this protocol, which is now
implemented in an open-source code called PARCE. We provide a basic tutorial to run the code for an
antibody fragment design example. Finally, we describe three additional applications of the method to
design peptides for different targets, illustrating the broad scope of the proposed approach.
Key words Peptide design, In silico antibody maturation, Molecular dynamics, Consensus scoring
functions, Sensor technology, Evolutionary algorithm, Antibody design, Affinity optimization
1 Introduction
The design of synthetic peptides is widely considered to have
enormous potential for biomedical applications, both in the emerging
field of nanomedicine [1–3] and in medicinal chemistry
[4]. Their versatility enables their use as alternatives to antibodies
in targeted drug delivery and biomarker detection [5, 6]. Indeed,
like antibodies, they can be mounted on detection devices or on
nanoparticles to form ordered capturing arrays [7–10]. They can
display pharmacological activity [11–15] and can be employed as
modulators of protein/protein interactions [16, 17], with lower
adverse effects and a higher binding specificity with respect to
traditional drugs [18]. All these applications rely on the ability
to identify suitable hits.
The state of the art of peptide design is firmly rooted in
biotechnology. Phage display library screening is used to assess
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_16,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
interactions between different types of macromolecules, including
peptides [13, 19]. With this technique, it is possible to massively
screen potential peptide binders. If a binding partner is known, a
suitable sequence corresponding to the minimal sub-domain
responsible for binding can be extracted from the partner itself
[17]. However, these approaches require specialized infrastructure
and are expensive. A more cost-effective alternative, which has
emerged in recent years, is computational design. Thanks to enormous
advances in computer power and to a better understanding of the
chemical properties of natural amino acids, it is now possible
to rationally design peptides or proteins with a high probability of
being active in vitro and in vivo [20].
An advantage of computational techniques is that they can
describe the binding mechanisms at the atomistic level, allowing
for a rational supervision of their properties. For instance, they
enable controlling at the molecular level the binding site on the
target protein, enhancing the selectivity properties of the designed
binders [21]. However, all these benefits do not come for free.
Computational design of peptides and proteins requires the efficient exploration of the sequence space, the accurate description of
the bound (and unbound) conformations, and an accurate prediction of the peptide–target binding affinities (Fig. 1).
Fig. 1 The challenges in computational peptide design: exploring efficiently the sequence space, the bound
conformations, and predicting the binding affinity of the protein–peptide complexes
Different strategies have been developed to face the challenges
associated with computational peptide design. For example, the
design can be performed using an in silico panning method with
structural information [22], using a genetic algorithm [23] for
sequence optimization. In this approach, conformational optimization and binding energy estimation are performed by a docking
program. In [24], the authors use a Gaussian Network Model [25]
for identifying the binding site and from that an approximate
position of the peptide backbone. Then, they systematically
attempt docking 400 dipeptides at the positions determined by
this procedure in order to maximize the interaction energy, simultaneously checking the quality of the peptide conformation by
characterizing the ϕ/ψ propensities of the dipeptides.
These approaches can be classified as template-based protocols.
They are computationally very efficient but require prior knowledge of the structure of a template. Conversely, de novo methods are
computationally more expensive but can also be used when a
template is not available (Fig. 1). This is the case of the pepspec
protocol [26] included in the Rosetta software suite [27]. The
pepspec tool follows a strategy of “anchor and grow” flexible
backbone docking by starting from one key residue and optimizing
from this point peptide sequences and structures [28]. Another
example is the VitAl approach [29], which generates the peptides
by sequentially docking a pair of residues and selecting the best fit
by scoring the binding energies with AutoDock. PepComposer
[30] retrieves patches from a database similar to the query and
peptide fragments that interact with these patches. It then merges
these fragments into an initial proposal that is further optimized
using a set of iterative mutations and controlled backbone movements. Our design algorithm, called PARCE (Protocol for Amino
acid Refinement through Computational Evolution), belongs to
this second group.
PARCE, like most other de novo computational approaches,
generates successive single-point mutations on the peptide or protein binder sequence. Each mutation is then accepted or rejected by
analyzing the behavior of the complex using explicit solvent molecular dynamics trajectories. This makes the approach much more
computationally expensive than the design schemes based on docking, but at the same time it enables describing the conformational
changes induced by the binding with a level of accuracy which is
only limited by the quality of the force field used in the simulation.
Chapter Overview In the following sections, we describe in detail
the design protocol implemented in PARCE [31]. We then provide a
detailed tutorial and manual-like example for the design. Afterward, three additional applications of the protocol are presented.
Finally, we discuss the open problems that require further development
for the code and the field.
2 PARCE: Protocol for Amino Acid Refinement Through Computational Evolution
The evolution of PARCE until its current version can be summarized in the timeline of Fig. 2. The original idea of this design
approach was proposed by Laio and collaborators in 2012 [32],
as an in silico mutagenesis platform for the optimization of amino
acid-based binders. The approach can be used to design not only
peptides but also antibody fragments, or other proteins whose
amino acid sequence requires accurate engineering for applications
in biosensing, biomedicine, and bioengineering. The protocol
explores the sequence space of peptides bound to proteins or
small molecules, using a Monte Carlo approach that integrates
various simulation and prediction techniques [33].
The method, already in its original formulation [32], was based
on a sequence of single-point mutations. One of the first applications was the design of peptides capable of binding with high
affinity to an organic molecule in a denaturing solvent
[34]. The idea of performing MD in explicit solvent was originally
motivated by exactly this need, but it was later adopted in
general, including when the design is performed in water. In Ref. 34, the
quality of the complexes was estimated by computing the average
value over the trajectory of a single suitable scoring function (Vina
[35] in this case). This approach was refined and specialized to the
design of protein binders in Ref. 36. The process is repeated many
times, with the aim of evolving the original sequence toward novel
sequences with better predicted affinities for their targets
[21, 32, 34, 36–38]. In 2019, Ref. 39 introduced another
key idea: the suitability of an attempted mutation is assessed
by a consensus mechanism using a set of binding
scoring functions [40]. This makes the results much less dependent
on the accuracy and the quality of a single scoring function. The
approach has been successfully used to design peptides and protein
fragments capable of binding to protein targets [21, 36, 39].
Fig. 2 The PARCE timeline describing the main development milestones and the publications supporting the
progress of the peptide design protocol (milestone years shown: 2012, 2015, 2017, 2019, and 2021)
The PARCE code is distributed as open-source software
(https://github.com/PARCE-project/PARCE-1) and enables
designing peptides or proteins capable of binding with higher
affinity to a generic target, as long as this can be accurately
described by a classical force field. The method, in its current
formulation, combines several computational biophysics and bioinformatics tools in order to achieve a balance between accuracy and computational efficiency. In general, a design run yields
several peptide candidates, whose number can be increased
by performing several statistically independent runs (if enough
computational resources are available). This increases the pool of
sequences for further filtering and validation and is an advantage
over more deterministic or brute-force alternatives. Moreover,
since the code is open source, it can be adapted to
the needs of a given research project. A graphical representation of the protocol and the required dependencies is shown in Fig. 3. In the
following, we explain its main steps.
2.1 Mutation Protocol
The core of the algorithm is an iterative sequence optimization. At
every iteration step, a single-point mutation on the peptide
sequence is generated by selecting at random a position along the
peptide chain and by replacing the selected residue by a random
amino acid (i.e., a mutation). A key element of the algorithm is
generating a reliable structure of the mutant. If the mutated side
chain is placed incorrectly, it will likely make severe steric clashes
with other side chains or with the target. To heal these clashes, it
would be necessary to perform very long MD equilibrations, which
are not affordable. In PARCE, the configuration of the mutated
amino acid can be generated with either Scwrl4 [41]
or FASPR [42]. These programs were selected based on a
study that assessed whether a mutation protocol is able to predict amino
acid rotamers similar to those that would be generated in a long
MD run [43]. After performing the mutation, a first minimization
of just the predicted side chain is performed. In order to avoid
clashes between the mutated amino acid and the surrounding water
molecules, a second minimization of the new amino acid and the
water molecules within 2 Å of it is carried out. Finally, a
minimization of the full system is performed, followed by an NVT
equilibration of typically 100 ps. Other standard parameters
employed in the equilibration are described in the next section.
After the minimization and equilibration steps, the new
system is sampled by performing an MD simulation.
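The random choice of position and replacement residue described at the start of this section can be sketched as follows. This is a minimal illustration and not the PARCE source code: the function name `propose_mutation`, the 1-based position convention, and the default amino acid alphabet are assumptions made for the example, and the structural modelling of the mutant (Scwrl4 or FASPR, followed by the minimizations) is not covered.

```python
import random

# One-letter codes of the 20 natural amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def propose_mutation(sequence, positions, allowed=AMINO_ACIDS, rng=random):
    """Propose one single-point mutation (hypothetical helper).

    sequence  : binder sequence in one-letter code
    positions : 1-based positions that are eligible for mutation
    allowed   : amino acids that may be used as replacements
    Returns (position, old_residue, new_residue, mutated_sequence).
    """
    pos = rng.choice(list(positions))
    old = sequence[pos - 1]
    # Exclude the current residue so that the proposal is a real mutation.
    candidates = [aa for aa in allowed if aa != old]
    new = rng.choice(candidates)
    mutated = sequence[:pos - 1] + new + sequence[pos:]
    return pos, old, new, mutated
```

Restricting `positions` or `allowed` reproduces the kind of constraints mentioned later in the chapter, for example keeping terminal cysteines fixed or excluding glycine from the replacement list.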
Fig. 3 Schematic representation of the PARCE pipeline. It includes four main phases: a single-point mutation,
the conformational sampling of the new protein–binder complex, the scoring of the conformations of the new
complex, and the acceptance or rejection of the mutation. The protocol is iterated to improve the binder
sequence. Several open-source dependencies are required to run the protocol
2.2 Conformational Sampling with Molecular Dynamics
For each mutation, an MD simulation is run to sample the conformations of the complex. This step can be seen as the “fingerprint” of this design approach, the feature that makes it different
from most other design strategies. The specific force field can be
chosen based on the experience of the user, and the MD setup can
be adapted to the physico-chemical conditions of the environment
in which the binding should happen. For example, for a design of
peptides capable of binding a protein in aqueous solution at ambient
conditions, one can use the Amber99SB-ILDN protein force field
[44], the TIP3P water model [45], a modified Berendsen thermostat
[46], and a Parrinello–Rahman barostat [47]. In general, the complex formed by the peptide (or protein) binder and its target is solvated in a
cubic box with periodic boundaries at a distance of at least 8 Å from
any atom of the complex. By default, Na+ and Cl− counterions are
included in the solvent to make the box neutral, but the
concentration and the ion type can be easily changed to take into
account a specific ionic strength. In general, the electrostatic interactions are calculated by using the Particle Mesh Ewald (PME)
method, with 1.0 nm short-range electrostatic and van der Waals
cutoffs [48], and the equations of motion are solved with the leap-frog integrator [49], using a timestep of 2 fs.
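For readers who want to see how such a setup translates into simulation input, the sketch below collects the parameters quoted above in a Python dictionary and renders them in the key = value layout of a GROMACS .mdp file. It assumes GROMACS as the MD engine (Table 1 lists a gmxrc_path parameter for Gromacs) and assumes that the "modified Berendsen thermostat" corresponds to the GROMACS option `V-rescale`. The dictionary is an illustrative fragment, not the parameter file shipped with PARCE, and a complete input would need further settings (reference temperature and pressure, coupling groups, run length).

```python
# Illustrative fragment only: keys follow GROMACS .mdp naming, values mirror
# the setup described in the text. This is NOT the PARCE default input file.
MD_PARAMS = {
    "integrator": "md",             # leap-frog integrator
    "dt": 0.002,                    # timestep in ps (2 fs)
    "coulombtype": "PME",           # Particle Mesh Ewald electrostatics
    "rcoulomb": 1.0,                # short-range electrostatic cutoff (nm)
    "rvdw": 1.0,                    # van der Waals cutoff (nm)
    "tcoupl": "V-rescale",          # assumed "modified Berendsen" thermostat
    "pcoupl": "Parrinello-Rahman",  # barostat for constant-pressure runs
    "pbc": "xyz",                   # periodic boundaries in all directions
}


def render_mdp(params):
    """Render a parameter dictionary as .mdp-style 'key = value' lines."""
    return "\n".join(f"{key:<12} = {value}" for key, value in params.items()) + "\n"


if __name__ == "__main__":
    print(render_mdp(MD_PARAMS), end="")
```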
2.3 Scoring and Mutation-Acceptance Strategies
After performing the mutated peptide–protein (or peptide–ligand)
conformational sampling, the trajectory is scored with a chosen set
of scoring functions used for protein–protein, protein–peptide, or
protein–ligand affinity predictions. The mutation can be accepted
or rejected based on three different strategies, outlined in the
following.
2.3.1 Monte Carlo Optimization
The simplest optimization strategy is based on Monte Carlo
and on the use of a single scoring function for estimating the
binding affinity. At each step of the mutation cycle, the peptide
chain is randomly mutated selecting one amino acid from the
sequence and replacing it with a different amino acid. The protocol
offers the possibility to select the amino acid positions in the
peptide chain that are eligible for mutations, as well as the list of
possible amino acids selected for the replacement. For example, in
Ref. 34, the design is performed on cyclic peptides: the terminal
CYS positions were never mutated in order to conserve the cyclic
geometry, while GLY was removed from the amino acid list used for
the replacement, avoiding undesired mobility in the new peptide
chain.
After each mutation step, meaningful conformations of the
mutated peptide/target complex in explicit solvent are generated
by finite temperature MD, employing the methodology described
in the previous section. The binding affinity of the mutated peptide
toward the target ligand is then estimated using a single scoring
function. In Ref. 34, a cluster analysis is performed over the last
part of the trajectory (the last 1 ns of a 5 ns NPT production run) to
extract statistically relevant conformations of the peptide–ligand
complex. Poorly populated clusters were discarded (clusters with
fewer than 15 conformations), while for the central structure of the
remaining clusters, the peptide–ligand affinities were scored
employing the Vina scoring function [35]. In the PARCE implementation, the cluster analysis is no longer performed, and the
binding affinity is simply estimated as the average value of the
scoring function over the whole MD trajectory, neglecting only its
first part (whose length can be set by the user).
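The trajectory average with an initial part discarded can be illustrated with a few lines of Python. The function name `average_score` and the `skip_fraction` argument are hypothetical and introduced only for this example; in PARCE the portion of the trajectory that is used is controlled through the configuration file (see the half_flag parameter in Table 1).

```python
def average_score(frame_scores, skip_fraction=0.5):
    """Average per-frame scores, ignoring the first part of the trajectory.

    frame_scores  : one score per saved MD snapshot, in time order
    skip_fraction : fraction of the initial snapshots to discard
                    (hypothetical parameter used for illustration)
    """
    if not 0.0 <= skip_fraction < 1.0:
        raise ValueError("skip_fraction must be in [0, 1)")
    start = int(len(frame_scores) * skip_fraction)
    kept = frame_scores[start:]
    if not kept:
        raise ValueError("no snapshots left to average")
    return sum(kept) / len(kept)
```

For instance, `average_score([0.0, 0.0, -4.0, -6.0], skip_fraction=0.5)` averages only the last two snapshots.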
The new peptide sequence at step k is accepted or rejected
based on the Metropolis criterion, with a probability
$$\min\left(1, \exp\left[-(E_k - E_{k-1})/T_e\right]\right), \qquad (1)$$
where $E_{k-1}$ is the estimated binding affinity before the mutation,
$E_k$ is the binding affinity after the mutation, and $T_e$ is an effective
temperature that controls the acceptance rate. Since lower scores correspond to stronger predicted binding, mutations that lower the score are always accepted. If the sequence is
accepted, a new mutation cycle is started from the mutated
sequence; otherwise, the former sequence from step k − 1 is used
as the starting point for a new mutation attempt. The mutation cycle
described above is iterated up to a desired number of mutations.
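The Metropolis test of Eq. 1 can be written in a few lines. The sketch below is a generic implementation of the criterion under the convention that lower scores mean stronger predicted binding. The function name and the optional `rng` argument are illustrative choices, not part of PARCE.

```python
import math
import random


def metropolis_accept(e_new, e_old, t_eff, rng=random):
    """Return True if a mutation is accepted under the Metropolis criterion.

    e_new : average score after the mutation (E_k)
    e_old : average score before the mutation (E_{k-1})
    t_eff : effective temperature T_e, in the same units as the scores
    """
    if t_eff <= 0:
        raise ValueError("t_eff must be positive")
    if e_new <= e_old:
        # Mutations that do not worsen the score are always accepted.
        return True
    # Worse scores are accepted with probability exp(-(E_k - E_{k-1}) / T_e).
    probability = math.exp(-(e_new - e_old) / t_eff)
    return rng.random() < probability
```

A larger `t_eff` makes the search more permissive toward unfavorable mutations, whereas a small value turns the procedure into an almost purely greedy optimization.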
2.3.2 Replica Exchange Optimization
The exploration of the sequence space can be enhanced using a
replica exchange scheme, in which independent
mutation cycles are run simultaneously at several different effective temperatures. At the
end of each step, a swap between two randomly selected replicas
(e.g., r and r′) is attempted. The swap is accepted according to a
parallel tempering scheme, with probability
$$\min\left(1, \exp\left[(E_r - E_{r'})\left(\frac{1}{T_r} - \frac{1}{T_{r'}}\right)\right]\right), \qquad (2)$$
where $E_r$ and $E_{r'}$ are the peptide/target binding energies in replicas r
and r′, and $T_r$ and $T_{r'}$ are the corresponding effective temperatures. If the swap is
accepted, the replica indexes are swapped. The replica exchange
scheme is not currently implemented in PARCE.
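Since the replica exchange scheme is not part of the current PARCE code, the following sketch only illustrates the standard parallel tempering swap test of Eq. 2. The function name and arguments are hypothetical.

```python
import math
import random


def swap_accept(e_r, e_s, t_r, t_s, rng=random):
    """Parallel tempering swap test between replicas r and s (Eq. 2).

    e_r, e_s : current binding scores of the two replicas
    t_r, t_s : their effective temperatures (both positive)
    """
    if t_r <= 0 or t_s <= 0:
        raise ValueError("temperatures must be positive")
    delta = (e_r - e_s) * (1.0 / t_r - 1.0 / t_s)
    if delta >= 0:
        # Moving the lower score to the lower temperature is always accepted.
        return True
    return rng.random() < math.exp(delta)
```

With this sign convention, a swap that moves the better (lower) score to the colder replica has `delta >= 0` and is always accepted, while the opposite swap is accepted only with probability `exp(delta)`.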
2.3.3 Consensus Optimization
The two optimization approaches described above attempt to optimize the binding affinity estimated by a single scoring function. If,
for example, one estimates the binding affinity with Vina, the
evaluation is based on counting the number and type of peptide–
ligand contacts and assigning to each of them an energy value,
assuming that the complex is fully solvated in an aqueous environment [35]. For this reason, the binding affinities are not necessarily
meaningful in non-aqueous environments, and the scoring was used
with the sole intent of screening the most viable peptide–ligand
complexes observed during the MD trajectory.
These limitations motivated us to improve the scoring scheme
of the protocol with a consensus optimization scheme. In this strategy, the mutation is accepted following a consensus-based approach
using N scoring functions. If a particular number n of scoring
functions agrees on an improvement of the binding affinity of the
mutated peptide B, with respect to the one prior to the mutation,
i.e., peptide A, then the final consensus will accept the attempted
mutation [39]. Formally, the consensus score C is defined as
$$C = \sum_{k=1}^{N} c_k, \qquad (3)$$
where $c_k$ for the scoring function k is
$$c_k = \begin{cases} 1, & S_k^B - S_k^A < 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$
where $S_k^I$ is the value of the average score for peptide I given by scoring function k. It should be
noted that all employed scoring functions are defined as binding
energies, so that lower values mean higher binding affinities. The
criterion to evaluate whether a consensus among the scoring functions is
achieved is based on the comparison of C to a predefined threshold
T (with a value between 1 and N). If C ≥ T, the mutated sequence
is accepted. The scores are estimated as an average over all the
snapshots of the trajectory.
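The consensus rule of Eqs. 3 and 4 amounts to counting how many scoring functions report a lower average score after the mutation. A minimal sketch is given below. The function name, the dictionary layout, and the numerical values in the usage note are made up for illustration.

```python
def consensus_accept(scores_before, scores_after, threshold):
    """Consensus acceptance test (Eqs. 3 and 4).

    scores_before : dict mapping scoring-function name -> average score of peptide A
    scores_after  : dict with the same keys, average scores of mutated peptide B
    threshold     : minimum number T of scoring functions that must improve
    Returns (C, accepted).
    """
    if scores_before.keys() != scores_after.keys():
        raise ValueError("both peptides must be scored with the same functions")
    if not 1 <= threshold <= len(scores_before):
        raise ValueError("threshold must lie between 1 and N")
    # c_k = 1 when the score decreases (lower score = stronger binding).
    consensus = sum(
        1 for name in scores_before
        if scores_after[name] - scores_before[name] < 0
    )
    return consensus, consensus >= threshold
```

For example, with `scores_before = {"vina": -7.0, "haddock": -90.0, "bluues": -20.0}` and `scores_after = {"vina": -7.5, "haddock": -95.0, "bluues": -19.0}`, two of the three functions improve, so the mutation is accepted for a threshold of 2 and rejected for a threshold of 3.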
The next section presents a tutorial on installing and running
PARCE and an example guide on designing a nanobody paratope
region bound to a protein fragment.
3 Tutorial
3.1 Installing and Running PARCE
PARCE can be downloaded from https://github.com/PARCE-project/PARCE-1 and installed under any Linux operating system.
A README file with instructions is included in the repository. The
code was initially optimized for Debian and Ubuntu server
distributions. We note that all the dependencies required to run
PARCE are open-source software, but some of them, such as
Scwrl4 [41], require academic licenses. In such cases, it is recommended to install these packages following the developer’s documentation and to provide their paths to the code. To guarantee that
the additional tools and dependencies are functioning, a set of tests
is provided in the repository. A Docker container is also available in
case the user wants to skip the installation of third-party tools.
After installing PARCE, one has to set up the configuration file
that contains instructions to start the system and launch the protocol. It describes the path and the characteristics of the input files, as
well as the necessary parameters to run the design protocol. An
explanation of the input parameters is provided in Table 1.
Before running the protocol, we recommend equilibrating the initial complex
with an MD simulation. Then, the protocol is run with the command:
python3 run_protocol.py [-h] -c CONFIG_FILE
The results of the design protocol are summarized in the output file
mutation_report.txt, which contains details for each mutation step, such as the type of mutation, the average scores, the binder
sequence, and whether the mutation was accepted. The mutation is
defined by the syntax [old amino acid][binder chain][position][new amino acid]. An example of a mutation is
AB2P, which means that an alanine located at position
2 of chain B is replaced by a proline.
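When post-processing the report, it can be convenient to split such mutation labels into their components. The parser below is a small sketch based only on the syntax stated above. It assumes one-letter amino acid codes and a single-letter chain identifier, and the names `Mutation` and `parse_mutation` are hypothetical.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    old: str       # original amino acid (one-letter code)
    chain: str     # binder chain identifier (assumed to be a single letter)
    position: int  # residue position in the chain
    new: str       # replacement amino acid (one-letter code)


# [old amino acid][binder chain][position][new amino acid], e.g. "AB2P"
_MUTATION_PATTERN = re.compile(r"^([A-Z])([A-Za-z])(\d+)([A-Z])$")


def parse_mutation(label):
    """Split a mutation label such as 'AB2P' into its four components."""
    match = _MUTATION_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"not a valid mutation label: {label!r}")
    old, chain, position, new = match.groups()
    return Mutation(old, chain, int(position), new)
```

For the example given in the text, `parse_mutation("AB2P")` yields an alanine (`old="A"`) in chain `B` at position `2` replaced by a proline (`new="P"`).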
Table 1
Parameters provided by the user in the configuration file
| Parameter | Explanation |
| --- | --- |
| Folder | Name of the folder that contains all the input and output files of the protocol |
| src_route | Route of the PARCE folder where the src folder is located |
| Mode | The design mode, which has three possible options, including start and restart modes |
| peptide_reference | The sequence of the peptide or protein fragment that will be modified |
| pdbID | Name of the structure that is used as input |
| Chain | Chain id of the peptide/protein in the structural complex |
| sim_time | Time in nanoseconds that will be used to sample the complex after each mutation |
| num_mutations | Number of mutations that will be attempted |
| residues_mod | Positions of the residues that are to be modified |
| md_route | Path to the folder containing the input files used during the previous MD sampling of the system |
| md_original | Name of the system file located in the folder containing the previous MD sampling |
| score_list | List of the scoring functions that will be used to calculate the consensus |
| half_flag | Flag that controls which part of the trajectory is used to obtain the average score |
| Threshold | Threshold used for the consensus scoring |
| mutation_method | Protocol used to perform the single-point mutations |
| scwrl_path | Path to Scwrl4, in case it is not installed in a PATH folder |
| gmxrc_path | Path to GMXRC for Gromacs |
In addition, the report file includes attempts that failed because of
minimization or equilibration problems in MD. To overcome these
issues, the protocol automatically attempts further mutations
using the last accepted structure. If the simulation keeps failing
after a certain number of attempts (defined by the user in the
input file with the keyword try_mutations), a new mutation
is attempted starting from the complex that was accepted before
the current one. If the problem persists for more than
try_mutations attempts, the design run is stopped. If the protocol is
successful, the number of attempted mutations is set by the
keyword num_mutations.
PARCE has an MIT license that allows for the distribution of
the code and its improvement through new functionalities, for
example, for adding new scoring functions. We note that the
computational resources required for running PARCE are determined by the complexity of the system, since the design is based on
running MD. HPC versions of the code are available upon request.
Fig. 4 The structure of the VHH antibody fragment. The peptides to be optimized
correspond to the complementarity-determining regions (highlighted in red). The
framework sequence (yellow) is not mutated throughout the whole process
3.2 Tutorial Example: The Optimization of Anti-HER2 Antibody Fragments
The human epidermal growth factor receptor 2 (HER2) is a transmembrane protein whose overexpression is associated with specific
classes of breast cancer and is thus a widely recognized biomarker
employed for monitoring cancer progression, as well as a key pharmacological target for cancer therapy [50, 51].
For this particular example, the goal is to design a novel antibody fragment of camelid origin (or VHH, Fig. 4) capable of
detecting HER2 in a patient’s biological fluids [8, 10]. The idea
is to optimize a peptide, or a set of peptides, already embedded into
an existing protein to recognize the target. In particular, we aim to
design peptide fragments that are part of the antibody binding
domain, also known as complementarity-determining regions
(CDRs). This process, called antibody maturation, is usually performed in vivo, by animal immunization. Using an in silico process
reduces the use of animals for binder discovery. A further advantage
of the computational design is that it enables choosing a priori the
binding site on the target protein. The selection of the binding site
(or epitope) to be targeted is of paramount importance for the
development of new nanodevices, for targeted therapies, and for
drug design [52].
An example of VHH optimization performed by PARCE is
described in Ref. 39. Here, we show how to set up the design and
how it is possible to employ a different set of scoring functions to
obtain an ex novo designed antibody fragment for an arbitrarily
selected binding site on HER2.
3.2.1 Design Methodology
To get started with any design, it is necessary to have a reasonable
starting model of the initial complex. This can be done using either
a crystallographic complex or a conformation obtained by docking
the binder to the target. If there are no experimental 3D structures
available, these can be constructed by homology modelling. We
also note that, when working with a new system and before
starting the optimization, all scoring functions should
be benchmarked on that particular system [40, 53].
As a second step, it is necessary to identify the residues that
should be mutated. In the case of an antibody fragment, these can
be the residues belonging to either one or two or all three CDRs
(highlighted in Fig. 4). Only this selected region will be optimized,
leaving the sequence of the rest of the protein unchanged throughout the whole process. The input files for the design are the starting
topologies for the MD and the configuration file. Examples of these
files, aiming to reproduce the results of Ref. 39, can be found in the
folder design_input/protein_protein. The CONFIG_FILE
contains the input parameters shown in Table 1.
The individual peptide residues to be optimized should be
explicitly listed in the CONFIG_FILE. For instance, for the optimization of a single antibody fragment loop, the config_vhh.txt
would read
residues_mod: 54,55,56,57,58,59,60,61
Instead, to optimize all the VHH residues highlighted in Fig. 4,
namely residues 29–35 corresponding to the first CDR, 55–61
corresponding to the second CDR, and 101–109 corresponding
to the third CDR, one should write
residues_mod: 29,30,31,32,33,34,35,55,56,57,58,59,60,61,101,102,103,104,105,106,107,108,109
simply listing all residues even if they belong to different
regions of the system.
While the example of the optimization of a single CDR can be
found in Ref. 39, here we show how the same process can lead to
the optimization of all three VHH CDRs.
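Because every residue has to be listed explicitly, typing long residues_mod entries by hand is error-prone. The helper below is a convenience sketch, not part of PARCE, that expands inclusive residue ranges into the comma-separated format shown above. The function name `residues_mod_line` is hypothetical.

```python
def residues_mod_line(ranges):
    """Build a 'residues_mod' configuration line from inclusive ranges.

    ranges : iterable of (first, last) residue numbers, e.g. one pair per CDR
    """
    residues = []
    for first, last in ranges:
        if last < first:
            raise ValueError(f"invalid range: {first}-{last}")
        residues.extend(range(first, last + 1))
    return "residues_mod: " + ",".join(str(r) for r in residues)


if __name__ == "__main__":
    # The three CDR ranges quoted in the text for the VHH example.
    print(residues_mod_line([(29, 35), (55, 61), (101, 109)]))
```

The resulting line can be pasted into the configuration file in place of a hand-written list.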
3.2.2 Design Optimization Results
A typical optimization path is reported in Fig. 5a, where each score
is a proxy measure of the binding affinity between the two components of the system, in this case, the whole VHH and its target. It is
important to note that, even if only the selected residues are
mutated, the score is calculated over the entire complex. An optimization is considered concluded when all scoring functions reach a
plateau.
Fig. 5 Design of an antibody fragment (VHH) bound to the HER2 terminal domain. (a) Evolution of the six
scoring functions during the design. The dots in the curve represent the mutations that were accepted. The
scoring functions used are BMF-Bluues [75, 76] (gray), Rosetta [27] (magenta), PiePisa [73] (orange), Haddock
[70] (blue), Bach6 [77, 78] (mauve), and Bluues [76] (cyan). (b) Ranking of the configurations: both the single
scoring function normalized ranks ($\hat{r}_k^i$, stars) and the global normalized rank $R_i$ (black line) for each peptide i.
In the insets, starting and final configurations of the VHH/HER2 complexes. Color code: HER2 (gray), VHH
framework (yellow), starting residues (red), and optimized residues (green)
For a collective view of the whole optimization process, a rank-based analysis can be used. First, one computes the rank $r_k^i$ associated with each sequence i according to the score obtained with a
single scoring function k. This rank can then be normalized as
$$\hat{r}_k^i = \frac{r_k^i}{N}, \qquad (5)$$
where N is the total number of accepted mutations obtained in the
runs. From the collection of $\hat{r}_k^i$ (indicated by stars in Fig. 5b), a
global ranking score $R_i$ for each sequence is defined (black dots in
Fig. 5b) as
$$R_i = \frac{1}{N_s} \sum_{k=1}^{N_s} \hat{r}_k^i, \qquad i = 1, \ldots, N, \qquad (6)$$
where $N_s$ is the number of scoring functions. If the ranks of a
certain sequence i are consistently low for all the scoring functions,
then $R_i$ is small.
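The rank-based analysis of Eqs. 5 and 6 can be reproduced with a short script once the average scores of the accepted sequences are available. The sketch below assumes that rank 1 is assigned to the lowest (most favorable) score and that ties are broken by order of appearance. Both conventions, as well as the function name `global_ranks`, are assumptions made for this example.

```python
def global_ranks(scores):
    """Compute the global normalized rank R_i of each sequence (Eqs. 5 and 6).

    scores : dict mapping scoring-function name -> list of average scores,
             one per accepted sequence, all lists in the same sequence order
    Returns a list with one R_i value per sequence.
    """
    names = list(scores)
    n_seq = len(scores[names[0]])
    if any(len(scores[name]) != n_seq for name in names):
        raise ValueError("all scoring functions must score the same sequences")
    totals = [0.0] * n_seq
    for name in names:
        values = scores[name]
        # Rank 1 goes to the lowest score (assumed to be the best binder).
        order = sorted(range(n_seq), key=lambda i: values[i])
        for rank, i in enumerate(order, start=1):
            totals[i] += rank / n_seq  # normalized rank, Eq. 5
    # Average over the N_s scoring functions, Eq. 6.
    return [total / len(names) for total in totals]
```

Sequences can then be sorted by increasing `R_i` to select candidates for further validation.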
In the particular case illustrated in Fig. 5, Ri decreases when
more mutations are performed, as expected. By comparing the
initial and the final configurations of the system, the former with
sequence associated with max ðRi Þ and the latter with min ðRi Þ
(insets in Fig. 5b), it is possible to see how the initial VHH evolves
into a final VHH by changing its orientation to maximize its contacts with the target, defining a larger contact area between the two.
The optimized VHHs, or rather a selection of the lowest-ranking
sequences, will then need to undergo extensive MD simulations
and stability tests [54]. VHHs passing all the computational tests
will then be ready to be expressed in bacterial cells [55].
In the next section, several additional examples of peptide
design are presented.
4 Additional Peptide Design Examples
4.1 Drug-Binding Peptide Design in Different Environments
Reference 34 was the first work to perform peptide design in
explicit solvent with our scheme. It reports the design of high-affinity cyclic peptides for irinotecan (CPT-11). CPT-11 is a
chemotherapy drug, and its choice was motivated by the need to
engineer sensors for therapeutic drug monitoring in a denaturing
solvent (e.g., methanol), which were afterward validated
experimentally [37].
Compared to the original protocol of 2012 [32], three important innovations were introduced: (1) the conformational search
for viable peptide–ligand conformations during the mutation cycle
was carried out by finite-temperature molecular dynamics and not
by flexible docking in vacuum, (2) cyclic peptides were adopted for
the design, and (3) the design was performed with the peptide–
ligand complexes fully solvated in a simulation box with an explicit
atomistic description of the solvent molecules [34]. Computationally intensive design in explicit solvent was made possible by
the advent of GPU-based computing, which started to be efficiently implemented in commonly used MD packages in those
years.
The protocol adopted was essentially the same as that employed in the
current version of PARCE and described in Subheading 2, using in
particular Replica Exchange optimization with a single scoring
function (Vina [35]). Two independent designs were performed,
one in water and one in methanol. The procedure started from a
deca-alanine cyclized by a disulfide bridge between two terminal
cysteines. CPT-11 was initially inserted within the cyclic peptide,
and one randomly selected amino acid of the peptide chain was
mutated at each step. The terminal cysteines were not selected for
the mutation in order to conserve the cyclic geometry. After each
mutation, MD simulations of 5 ns were performed for the peptide–
ligand complex fully solvated in water (or methanol), and relevant
peptide–ligand structures were selected from the last part of the
MD trajectory by cluster analysis (see Subheading 2.3.1).
For the selected structures, the peptide–ligand affinities were
estimated using the Vina scoring function, and the mutation was
accepted or rejected according to a Metropolis criterion. To further
enhance the exploration of the sequence space, a replica exchange
scheme with 5 effective temperatures was employed, as described in
Subheading 2.3.2 (Fig. 6a). After 400 mutations, the best seven
peptide–ligand complexes, in terms of binding energies, were
selected and their stability further assessed by longer (at least
100 ns) MD simulations in explicit solvent at different temperatures (i.e., from 300 K up to 450 K).
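The two acceptance tests involved in this scheme can be sketched as follows. This is a minimal illustration and not the PARCE implementation: it assumes that a lower score means stronger binding, that the effective temperatures are expressed in the same units as the score, and that swaps between replicas follow the standard replica exchange criterion. The function names are hypothetical.

```python
import math
import random

def metropolis_accept(score_old, score_new, temperature, rng=random):
    """Metropolis test for a single mutation at one effective temperature."""
    if score_new <= score_old:
        return True  # improvements are always accepted
    # worse scores are accepted with a Boltzmann-like probability
    return rng.random() < math.exp(-(score_new - score_old) / temperature)

def swap_accept(score_i, temp_i, score_j, temp_j, rng=random):
    """Standard replica exchange test for swapping the sequences of two replicas."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (score_i - score_j)
    return delta >= 0 or rng.random() < math.exp(delta)
```

With these definitions, a well-scoring sequence found at a high effective temperature is always passed down to a colder replica, whereas the reverse swap is accepted only occasionally.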
The designed peptides revealed a solvent specificity, namely
peptides designed in aqueous environment do not necessarily
bind the ligand in a different solvent, such as methanol. This is
Computational Evolution Protocol for Peptide Design
Fig. 6 Design of peptides for CPT-11 in aqueous and methanol solutions. (a) The Vina scoring as a function of
the mutation steps for the design in water. The five different colors report the binding affinities observed at the
five effective temperatures employed during the procedure (see main text for details). (b) The best seven
peptides (A-G), in terms of binding affinity toward CPT-11, from the design in water. Black dots are the binding
energies as predicted during the mutation cycle, while blue squares are those obtained after 100 ns MD at 300 K in water. The
green diamonds display the binding affinity of the A-G peptides after 100 ns MD in methanol. (c) as panel (b)
for the seven best peptides (α–η) designed in methanol. (d) The experimental dissociation constant, KD, in
methanol solution vs. the computationally predicted binding energies. The green dots show results for two
peptides designed in methanol, black dots for three peptides designed in vacuum using the flexible docking
approach [32]. The experimental values were taken from Ref. 37. In the green inset, the peptide backbone
around CPT-11 at 0, 25, 75, and 100 ns MD simulations for one of the peptides designed in methanol. Panels
(a)–(c) adapted from Ref. 34, copyright 2015 American Chemical Society
evident in Fig. 6b: once peptides designed in water are solvated in
methanol, the binding becomes weaker and, occasionally, some
peptides can even detach from the ligand [34]. This solvent specificity is a consequence of the explicit description of the solvation
environment during the design. Peptides created in water are,
indeed, richer in aromatic residues than those designed in methanol: once peptides designed in water are immersed in methanol, the
aromatic side chains are more easily exposed to the solvent, which competes with their binding to the ligand.
The high affinity of the designed peptides in methanol has been
confirmed experimentally in a follow-up paper using surface plasmon resonance and fluorescence spectroscopy [37]. The peptides
displayed micromolar affinity toward CPT-11 in
methanol solution, and MD simulations showed that their peptide–drug
complexes are more stable in solvent than those of peptides designed in vacuum
using flexible docking (Fig. 6d). Interestingly, the designed peptides were selective toward the target and unable to bind SN-38, an
active metabolite of CPT-11 lacking the carbamate and
piperidyl-piperidine groups [37].
A similar procedure was also adopted to design peptides that
bind chlorogenic acid (CGA), a compound present in coffee
blends, in water solution [56]. Electrochemical measurements
and circular dichroism and fluorescence spectroscopy confirmed the
high affinity of the designed peptides, which showed remarkable selectivity
toward CGA over other related phenolic compounds [56].
4.2 Peptides for Protein Recognition
The protocol introduced in Ref. 34 was subsequently employed for
the design of peptides for protein recognition [9, 10, 21, 38, 54,
57]. While the approach still relied on a single scoring function, namely Vina [35], it allowed for an unprecedented versatility
in the choice of the binding site. In particular, after having successfully designed linear peptides for a well-defined protein pocket with
the docking-based code [32, 33], the new approach introduced in
Ref. 34 allowed designing ligands for surface-exposed binding sites
(Fig. 7), which are generally regarded as “undruggable.”
Fig. 7 Design of peptides for B2M in vacuum. (a) The Vina score as a function of the mutation steps. The three
different colors report the binding affinities observed at the three effective temperatures employed during the
procedure (see main text for details). Yellow crosses indicate peptides that underwent computational and
experimental screening. The best peptide/protein complex is shown in (b). (c–d) Top and side view of the two
computationally designed peptides discussed in the text. Adapted from Ref. 21
To show the versatility of PARCE in terms of choice of the
target binding site, we selected two sites on opposite sides of a
globular protein that does not possess pockets: beta-2-microglobulin (B2M). Due to the large system size, the design was performed in vacuo, followed by a screening in explicit solvent using
MD simulations and a final experimental validation.
The first binding site chosen for the design was a surface-exposed site, which is known to interact with the human histocompatibility antigen [54, 57]. Among the generated peptides, five
were experimentally tested giving dose–response surface plasmon
resonance (SPR) signals with dissociation constants in the micromolar range. The result was confirmed by means of isothermal
titration calorimetry and nuclear magnetic resonance, showing
that the approach is capable of designing binders for an arbitrarily
selected binding site. We then identified another site on the opposite side of B2M and attempted to generate a second peptide
(Fig. 7a). Once again SPR confirmed the dissociation constant to
be in the micromolar range. Competition experiments further
confirmed the two peptides to bind to non-overlapping binding
sites, thus confirming the theoretical predictions (Fig. 7b and c).
We further showed that this design approach can be exploited
for bottom-up design of smart nanodevices. Indeed, the peptides
designed in these projects were employed as sensing elements to
build a self-assembled nanochip capable of capturing a target protein by means of preselected binding sites [21], allowing for the
immobilization of the chosen protein in a predefined
orientation [54].
4.3 MHC Class II Peptide-Binder Design
The major histocompatibility complex (MHC) class II is a complex
of encoding proteins responsible for regulating the immune system
in humans [58] through the interaction with antigen proteins and
peptide subunits. Different experimental and computational strategies have been implemented to predict affinities of peptides
toward relevant alleles within the population [59], as a strategy
for the development of more efficient vaccines [60]. The field
known as immunoinformatics has provided an extensive set of
tools, mainly sequence-based strategies, for predicting the affinity
between a peptide and MHC class I or class II molecules
[61]. However, structural information is also crucial to rationally
study peptides bound to the MHC class II binding interface, which
is characterized by a large groove located between the
solvent-exposed α and β structural subunits [62] (Fig. 8). Specific
interactions created between some protein pockets and core amino
acids of the peptides contribute to the molecular affinity [63]. The
latter has been correlated with immunogenic properties, as well as
Fig. 8 Summary of the scoring strategy used in the design protocol. (a) The structure of MHC class II in
complex with the peptide at step 0, and examples of a rejected mutation at the 20th step (colored in red) and
of an accepted mutation at the 50th step (colored in green). (b) Representation of the accepted mutations
(green circles) and the rejected mutations (red circles), with the rejected and accepted examples indicated by dashed lines
other events during the MHC editing process [64]. This motivates
the use of structure- and dynamics-based approaches such as
PARCE to engineer peptides with better affinities for this molecular
receptor.
In this example, the starting complex for the design was the
MHC class II allele DRB1*01:01 bound to a peptide of 14 amino
acids that is part of an influenza virus antigen (YPKYVKQNTLKLAT)
(Fig. 8). This sequence has a reported bioactivity of
IC50 = 130 nM in a curated dataset of peptide binders against
multiple MHC class II alleles [65]. As reference, we used the crystal
structure 1DLH [66] from the Protein Data Bank (PDB) [67] that
has a missing tyrosine at the N-terminal flanking region. The
missing amino acid was modelled with the Rosetta Remodel functionality [68], using the full protein–peptide complex as a template.
The side chains of the complex were relaxed using Rosetta with the
protein backbone fixed. The refined protein–peptide structure was
equilibrated by an MD simulation of 100 nanoseconds (ns),
preceded by minimization and NVT/NPT equilibration, using GROMACS v5.1 [69]. Despite the linear conformation of the peptide in
the bound state, the complex remains stable during the simulation,
mostly due to the hydrogen bonds between the receptor and
peptide backbone atoms. The final snapshot of the MD was used
as the starting conformation for the design.
We then applied the PARCE protocol explained in Subheading
2. Specifically, we configured the protocol to mutate randomly any
amino acid of the peptide. We iterated the mutation process and
sampled each mutated protein–peptide complex for 5 ns, at a
temperature of 350 K. A high temperature was chosen to allow a
more efficient exploration of the conformational space. All the
protein atoms located at a distance greater than 12 Å from the
peptide were restrained to keep the system stable at the selected
temperature. The design was performed by using the consensus
scheme described in Subheading 2.3.3, using six scoring functions
that were previously benchmarked for this specific system [53]:
Haddock [70], Vina [35], a combination of DFIRE and GOAP
(DFIRE-GOAP) [71, 72], Pisa [73], FireDock [74], and
BMF-BLUUES [75, 76]. The threshold parameter
T (Subheading 2.3.3) was set equal to 3, following Ref. 39. This
means that if 3 or more scoring functions predict better scores for
the new mutation, the mutation is accepted. During the design,
100 mutations were attempted, with an acceptance ratio of around
20–30%. The evolution of the scores with accepted and rejected
mutations is shown in Fig. 8. As expected, the mutations minimize
the majority of the scoring functions.
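The consensus acceptance rule with threshold T = 3 can be sketched in a few lines. This is only an illustration under the assumption that a lower value is better for every scoring function. The function name, the dictionary keys, and the numerical values are hypothetical and do not come from PARCE.

```python
def consensus_accept(old_scores, new_scores, threshold=3):
    """Accept a mutation if at least `threshold` scoring functions improve.

    old_scores, new_scores: dicts mapping a scoring-function name to its value
    before and after the mutation. Assumption: lower values are better.
    """
    improved = sum(1 for name in old_scores
                   if new_scores[name] < old_scores[name])
    return improved >= threshold

# Hypothetical values for the six scoring functions of this example
old = {"haddock": -80.0, "vina": -7.5, "dfire_goap": -200.0,
       "pisa": -10.0, "firedock": -40.0, "bmf_bluues": -15.0}
new = {"haddock": -85.0, "vina": -7.9, "dfire_goap": -195.0,
       "pisa": -11.0, "firedock": -38.0, "bmf_bluues": -14.0}

accepted = consensus_accept(old, new, threshold=3)
```

Here three of the six scores improve, so the mutation is accepted with T = 3 but would be rejected with a stricter threshold of 4.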
This specific example illustrates the usefulness of the consensus
strategy with respect to the standard design strategy, which is based
on a Monte Carlo optimization of a single scoring function [34]
(see Subheading 4.1). Indeed, each scoring function is typically
affected by errors, but very often the errors of different scoring
functions are uncorrelated and might compensate. The consensus
criterion allows the different empirical and physics-based terms of the scoring functions to complement one another [39]. As shown in Fig. 9, the
scoring functions decrease, on average, over
the course of the attempted mutations. Nonetheless, owing to the
stochastic nature of the search and the definition of the consensus score,
individual scoring functions can also increase.
We note that to explore the sequence space, it might be beneficial to run multiple replicas of the protocol starting from the same
initial complex but with different random seeds. In this case, the
peptides from the different runs can be combined following the
same re-ranking procedure using the average rank from all the
scoring functions calculated from MD simulations (similarly to
that described in Subheading 3.2.2).
Fig. 9 Evolution of the scoring functions for the design of peptides bound to the MHC class II. We used six
scoring functions to calculate the consensus. The dots in the curve represent the mutations that were
accepted. The scoring functions used were (a) BMF-Bluues [75, 76], (b) Vina [35], (c) Firedock [74], (d)
Haddock [70], (e) DFIRE-GOAP [71, 72], and (f) Pisa [73]
5 Concluding Notes and Perspectives
The exponential increase of computational resources enables the
use of novel strategies to complement, assist, or even replace traditional experimental methods for designing and screening novel
peptide ligands for applications ranging from biomarker detection
and drug delivery to drug design and vaccine development.
This chapter presented the methods that our team has developed to address this exciting challenge. We developed, implemented, tested, and validated a modular algorithm for the ex novo
optimization of amino-acid-based binders, named PARCE [31]. It
enables the optimization of the peptide sequence to maximize its
(predicted) binding affinity toward a molecular target.
The protocol, initially introduced as an evolutionary algorithm
based on iterative docking [32, 33], evolved into a comprehensive
open-source design protocol, embedding a number of functionalities that have been tested and improved during the years, thus
enhancing its outreach. Indeed, the key to PARCE’s success lies
in its modularity: it has been designed so that when novel, more
accurate approaches become available, they can be easily embedded into
the existing code.
The explicit description of the solvation environment during
the design procedure was crucial for selecting successful candidates
that are solvent-specific and target-selective. This procedure and
the ongoing improvement of force fields for MD have opened the
possibility of exploring new design conditions (e.g., binding in
nonstandard solvents and under extreme pressures and temperatures), which may be hard to access with other computational
approaches. Another determinant for successful designs was the
inclusion of multiple scoring functions, in the form of a consensus
criterion. This enabled, for example, the in silico unsupervised
maturation of an antibody fragment [39]. All these improvements
are expected to push forward the limits of peptide design by reaching affinities analogous to those reached by nature.
PARCE is only limited by the accuracy of the predictors it relies
upon, which can be updated as new techniques become available.
We are thus looking forward to novel advances in structure prediction and free energy evaluations. However, we note that accurate
predictors typically require more computational resources. In the
case of PARCE, MD simulations involve costs that are orders of
magnitude higher than those of docking methodologies. Nevertheless, we
believe that using MD helps improve the quality of the design.
We foresee that the continuous growth in computing power will
shift the trade-off between computational cost and accuracy more
and more toward accuracy.
Acknowledgements
R.O. and P.C. were supported by MinCiencias, Ruta N, University
of Antioquia, Colombia, and the Max Planck Society, Germany.
N.M. was supported by the Alternatives Research & Development
Foundation (Annual Open Grant, PI: S.F., M.A.S.). S.F. would like
to acknowledge the Italian Association for Cancer Research (AIRC)
through the grant “My First AIRC grant,” Rif.18510, and the
CINECA Awards N. HP10B3JT25, 2020, for the availability of
high performance computing resources and support.
Conflict of Interest The authors declare that they have no competing interest.
References
1. Kim BY, Rutka JT, Chan WC (2010) Nanomedicine. N Engl J Med 363(25):2434–2443
2. Zhang X-X, Eden HS, Chen X (2012) Peptides
in cancer nanomedicine: drug carriers, targeting ligands and protease substrates. J Controll
Release 159(1):2–13
3. Chung EJ (2016) Targeting and therapeutic
peptides in nanomedicine for atherosclerosis.
Exp Biol Med 241(9):891–898
4. Brayden DJ, Hill T, Fairlie D, Maher S, Mrsny
R (2020) Systemic delivery of peptides by the
oral route: formulation and medicinal chemistry approaches. Adv Drug Deliv Rev
5. Kurrikoff K, Aphkhazava D, Langel Ü (2019)
The future of peptides in cancer treatment.
Curr Opin Pharmacol 47:27–32
6. Deutscher S (2019) Phage display to detect
and identify autoantibodies in disease. N Engl J
Med 381(1):89–91
7. Cretich M, Damin F, Pirri G, Chiari M (2006)
Protein and peptide arrays: recent trends and
new directions. Biomol Eng 23(2–3):77–88
8. Ambrosetti E, Paoletti P, Bosco A, Parisse P,
Scaini D, Tagliabue E, De Marco A, Casalis L
(2017) Quantification of circulating cancer
biomarkers via sensitive topographic measurements on single binder nanoarrays. ACS
Omega 2(6):2618–2629
9. Adedeji AF, Ambrosetti E, Casalis L, Castronovo M (2018a) Spatially resolved peptide-DNA nanoassemblages for biomarker detection: a synergy of DNA-directed immobilization
and nanografting. In: DNA nanotechnology. Springer, New York, pp
151–162
10. Adedeji AF, Ambrosetti E, Casalis L, Castronovo M (2018b) Spatially resolved peptide-DNA nanoassemblages for biomarker detection: a synergy of DNA-directed immobilization
and nanografting. In: DNA nanotechnology.
Springer, New York, pp 151–162
11. Ciemny M, Kurcinski M, Kamel K, Kolinski A,
Alam N, Schueler-Furman O, Kmiecik S
(2018) Protein–peptide docking: opportunities and challenges. Drug Discov Today
23(8):1530–1537
12. Diller DJ, Swanson J, Bayden AS, Jarosinski M,
Audie J (2015) Rational, computer-enabled
peptide drug design: principles, methods,
applications and future directions. Future
Med Chem 7(16):2173–2193
13. Fosgerau K, Hoffmann T (2015) Peptide therapeutics: current status and future directions.
Drug Discovery Today 20(1):122–128
14. La Manna S, Di Natale C, Florio D, Marasco D
(2018) Peptides as therapeutic agents for
inflammatory-related diseases. Int J Mol Sci
19(9):2714
15. Lee AC-L, Harris JL, Khanna KK, Hong J-H
(2019) A comprehensive review on current
advances in peptide drug development and
design. Int J Mol Sci 20(10):2383
16. Sillerud LO, Larson RS (2005) Design and
structure of peptide and peptidomimetic
antagonists of protein-protein interaction.
Curr Protein Peptide Sci 6(2):151–169
17. Russo A, Aiello C, Grieco P, Marasco D (2016)
Targeting “undruggable” proteins: design of
synthetic cyclopeptides. Curr Med Chem
23(8):748–762
18. Vlieghe P, Lisowski V, Martinez J, Khrestchatisky M (2010) Synthetic therapeutic peptides:
science and market. Drug Discov Today
15(1–2):40–56
19. Leurs U, Lohse B, Ming S, Cole PA, Clausen
RP, Kristensen JL, Rand KD (2014) Dissecting
the binding mode of low affinity phage display
peptide ligands to protein targets by
hydrogen/deuterium exchange coupled to
mass spectrometry. Anal Chem 86(23):
11734–11741
20. Fjell CD, Hiss JA, Hancock REW, Schneider G
(2011) Designing antimicrobial peptides: form
follows function. Nat Rev Drug Discov 2
(Mic):31–45
21. Adedeji Olulana AF, Soler MA, Lotteri M,
Vondracek H, Casalis L, Marasco D,
Castronovo M, Fortuna S (2021) Computational evolution of beta-2-microglobulin binding peptides for nanopatterned surface sensors.
Int J Mol Sci 22(2):812
22. Yagi Y, Terada K, Noma T, Ikebukuro K, Sode
K (2007) In silico panning for a
non-competitive peptide inhibitor. BMC
Bioinform 8(1):11
23. Mitchell M (1998) An introduction to genetic
algorithms. MIT Press, Cambridge
24. Besray Unal E, Gursoy A, Erman B (2010)
Vital: Viterbi algorithm for de novo peptide
design. PLoS One 5(6):e10926
25. Haliloglu T, Seyrek E, Erman B (2008) Prediction of binding sites in receptor-ligand complexes with the Gaussian network model. Phys
Rev Lett 100(22):228102
26. Raman S, Vernon R, Thompson J, Tyka M,
Sadreyev R, Pei J, Kim D, Kellogg E,
DiMaio F, Lange O, Kinch L, Sheffler W, Kim
B-H, Das R, Grishin NV, Baker D (2009)
Structure prediction for CASP8 with all-atom
refinement using Rosetta. Proteins Struct
Funct Bioinform 77(S9):89–99
27. Alford RF, Leaver-Fay A, Gonzales L, Dolan
EL, Gray JJ (2017) A cyber-linked undergraduate research experience in computational biomolecular structure prediction and design.
PLOS Comput Biol 13(12):e1005837
28. King CA, Bradley P (2010) Structure-based
prediction of protein-peptide specificity in
Rosetta. Proteins Struct Funct Bioinform
78(16):3437–3449
29. Unal EB, Gursoy A, Erman B (2010) Vital:
Viterbi algorithm for de novo peptide design.
PLOS One 5(6):1–15
30. Obarska-Kosinska A, Iacoangeli A, Lepore R,
Tramontano A (2016) PepComposer: computational design of peptides binding to a given
protein surface. Nucleic Acids Res 44(W1):
W522–W528
31. Ochoa R, Soler M, Laio A, Cossio P (2020)
PARCE: protocol for amino acid refinement
through computational evolution. Comput
Phys Commun 260:107716
32. Hong Enriquez RP, Pavan S, Benedetti F,
Tossi A, Savoini A, Berti F, Laio A (2012)
Designing short peptides with high affinity for
organic molecules: a combined docking,
molecular dynamics, and Monte Carlo
approach. J Chem Theor Comput 8(3):
1121–1128
33. Russo A, Scognamiglio PL, Enriquez RPH,
Santambrogio C, Grandori R, Marasco D,
Giordano A, Scoles G, Fortuna S (2015) In
silico generation of peptides by replica
exchange Monte Carlo: docking-based optimization of maltose-binding-protein ligands.
PLoS One 10(8):1–16
34. Gladich I, Rodriguez A, Hong Enriquez RP,
Guida F, Berti F, Laio A (2015) Designing
high-affinity peptides for organic molecules by
explicit solvent molecular dynamics. J Phys
Chem B 119(41):12963–12969
35. Trott O, Olson AJ (2010) AutoDock Vina:
improving the speed and accuracy of docking
with a new scoring function, efficient optimization, and multithreading. J Comput Chem
31(2):455–461
36. Soler MA, Rodriguez A, Russo A, Adedeji AF,
Dongmo Foumthuim CJ, Cantarutti C,
Ambrosetti E, Casalis L, Corazza A, Scoles G,
Marasco D, Laio A, Fortuna S (2017) Computational design of cyclic peptides for the customized oriented immobilization of globular
proteins. Phys Chem Chem Phys 19(4):
2740–2748
37. Guida F, Battisti A, Gladich I, Buzzo M,
Marangon E, Giodini L, Toffoli G, Laio A,
Berti F (2017) Peptide biosensors for anticancer drugs: design in silico to work in denaturizing environment. Biosens Bioelectron 100:
298–303
38. Chi LA, Vargas MC (2020) In silico design of
peptides as potential ligands to resistin. J Mol
Model 26:1–14
39. Soler MA, Medagli B, Semrau MS, Storici P,
Bajc G, de Marco A, Laio A, Fortuna S (2019)
A consensus protocol for the in silico optimisation of antibody fragments. Chem Commun
55(93):14043–14046
40. Soler MA, Fortuna S, de Marco A, Laio A
(2018) Binding affinity prediction of
nanobody-protein complexes by scoring of
molecular dynamics trajectories. Phys Chem
Chem Phys 20(5):3438–3444
41. Peterson LX, Kang X, Kihara D (2014) Assessment of protein side-chain conformation prediction methods in different residue
environments. Proteins Struct Funct Bioinform 82(9):1971–1984
42. Huang X, Pearce R, Zhang Y (2020) FASPR:
an open-source tool for fast and accurate protein side-chain packing. Bioinformatics 36:
3758–3765
43. Ochoa R, Soler MA, Laio A, Cossio P (2018)
Assessing the capability of in silico mutation
protocols for predicting the finite temperature
conformation of amino acids. Phys Chem
Chem Phys 20(40):25901–25909
44. Lindorff-Larsen K, Piana S, Palmo K,
Maragakis P, Klepeis JL, Dror RO, Shaw DE
(2010) Improved side-chain torsion potentials
for the Amber ff99SB protein force field. Proteins Struct Funct Bioinform 78(8):
1950–1958
45. Jorgensen WL, Chandrasekhar J, Madura JD,
Impey RW, Klein ML (1983) Comparison of
simple potential functions for simulating liquid
water. J Chem Phys 79(2):926–935
46. Bussi G, Donadio D, Parrinello M (2007)
Canonical sampling through velocity rescaling.
J Chem Phys 126(1):014101
47. Parrinello M, Rahman A (1980) Crystal structure and pair potentials: a molecular dynamics
study. Phys Rev Lett 45(14):1196–1199
48. Di Pierro M, Elber R, Leimkuhler B (2015) A
stochastic algorithm for the isobaric-isothermal
ensemble with Ewald summations for all long
range forces. J Chem Theor Comput 11(12):
5624–5637
49. Janežič D, Merzel F (1995) An efficient symplectic integration algorithm for molecular
dynamics simulations. J Chem Inf Comput Sci
35(2):321–326
50. Hicks DG, Kulkarni S (2008) HER2+ breast
cancer: review of biologic relevance and optimal use of diagnostic tools. Am J Clin Pathol
129(2):263–273
51. Oh D-Y, Bang Y-J (2020) HER2-targeted
therapies-a role beyond breast cancer. Nat Rev
Clin Oncol 17(1):33–48
52. Sawant MS, Streu CN, Wu L, Tessier PM
(2020) Toward drug-like multispecific antibodies by design. Int J Mol Sci 21(20):7496
53. Ochoa R, Laio A, Cossio P (2019) Predicting
the affinity of peptides to major histocompatibility complex class II by scoring molecular
dynamics simulations. J Chem Inf Model
59(8):3464–3473
54. Soler MA, De Marco A, Fortuna S (2016)
Molecular dynamics simulations and docking
enable to explore the biophysical factors
controlling the yields of engineered nanobodies. Sci Rep 6:34869
55. Medagli B, Soler MA, de Zorzi R, Fortuna S
(2021) Antibody affinity maturation using
computational methods: from an initial hit to
small scale expression of optimised binders. In:
Computer-aided antibody design. Springer, in
press
56. Del Carlo M, Capoferri D, Gladich I, Guida F,
Forzato C, Navarini L, Compagnone D,
Laio A, Berti F (2016) In silico design of
short peptides as sensing elements for phenolic
compounds. ACS Sensors 1(3):279–286
57. Soler M, Fortuna S, Scoles G (2015) Computational design of peptides as probes for the
recognition of protein biomarkers. In: 10th
European-biophysical-societies-association
(EBSA) European biophysics congress, vol 44.
Springer, New York, pp 149–149
58. Negroni MP, Stern LJ (2018) The N-terminal
region of photocleavable peptides that bind
HLA-DR1 determines the kinetics of fragment
release. PLoS One 13(7):e0199704
59. Peters B, Nielsen M, Sette A (2020) T cell
epitope predictions. Ann Rev Immunol 38(1):
123–145
60. Purcell AW, McCluskey J, Rossjohn J (2007)
More than one reason to rethink the use of
peptides in vaccine design. Nat Rev Drug Discov 6(5):404–414
61. Wang P, Sidney J, Dow C, Mothé B, Sette A,
Peters B (2008) A systematic assessment of
MHC class II peptide binding predictions and
evaluation of a consensus approach. PLoS
Comput Biol 4(4):e1000048
62. Bjorkman PJ (2015) Not second class: the first
class II MHC crystal structure. J Immunol
194(1):3–4
63. Unanue ER, Turk V, Neefjes J (2016) Variations in MHC class II antigen processing and
presentation in health and disease. Annu Rev
Immunol 34(1):265–297
64. Weaver JM, Sant AJ (2009) Understanding the
focused CD4 T cell response to antigen and
pathogenic organisms. Immunol Res 45(2–3):
123–143
65. Wang P, Sidney J, Kim Y, Sette A, Lund O,
Nielsen M, Peters B (2010) Peptide binding
predictions for HLA DR, DP and DQ molecules. BMC Bioinform 11(1):568
66. Stern LJ, Brown JH, Jardetzky TS, Gorga JC,
Urban RG, Strominger JL, Wiley DC (1994)
Crystal structure of the human class II MHC
protein HLA-DR1 complexed with an influenza virus peptide. Nature 368(6468):
215–221
67. Berman HM, Westbrook J, Feng Z,
Gilliland G, Bhat TN, Weissig H, Shindyalov
IN, Bourne PE (2000) The protein data bank.
Nucleic Acids Res 28(1):235–242
68. Huang P-S, Ban Y-EA, Richter F, Andre I,
Vernon R, Schief WR, Baker D (2011) RosettaRemodel: a generalized framework for flexible backbone protein design. PLoS One 6(8):
e24109
69. Hess B, Kutzner C, van der Spoel D, Lindahl E
(2008) GROMACS 4: algorithms for highly
efficient, load balanced, and scalable molecular
simulations. J Chem Theor Comput 4:
435–447
70. Dominguez C, Boelens R, Bonvin AMJJ
(2003) HADDOCK: a protein–protein docking approach based on biochemical or biophysical information. J Am Chem Soc 125(7):
1731–1737
71. Yang Y, Zhou Y (2008) Specific interactions for
Ab Initio folding of protein terminal regions
with secondary structures. Proteins Struct
Funct Genet 72(2):793–803
72. Zhou H, Skolnick J (2011) GOAP: a
generalized orientation-dependent, all-atom
statistical potential for protein structure prediction. Biophys J 101(8):2043–2052
73. Krissinel E, Henrick K (2007) Inference of
macromolecular assemblies from crystalline
state. J Mol Biol 372(3):774–797
74. Andrusier N, Nussinov R, Wolfson HJ (2007)
FireDock: fast interaction refinement in molecular docking. Proteins Struct Funct Bioinform
69(1):139–159
75. Berrera M, Molinari H, Fogolari F (2003)
Amino acid empirical contact energy definitions for fold recognition in the space of contact maps. BMC Bioinform 4(1):8
76. Fogolari F, Corazza A, Yarra V, Jalaru A,
Viglino P, Esposito G (2012) Bluues: a program for the analysis of the electrostatic properties of proteins based on generalized Born
radii. BMC Bioinform 13(Suppl 4):S18
77. Cossio P, Granata D, Laio A, Seno F, Trovato A
(2012) A simple and efficient statistical potential for scoring ensembles of protein structures.
Sci Rep 2:1–8
78. Sarti E, Zamuner S, Cossio P, Laio A, Seno F,
Trovato A (2013) Bachscore. A tool for evaluating efficiently and reliably the quality of large
sets of protein structures. Comput Phys Commun 184(12):2860–2865
Chapter 17
Computational Design of Miniprotein Binders
Younes Bouchiba, Manon Ruffini, Thomas Schiex, and Sophie Barbe
Abstract
Miniprotein binders hold a great interest as a class of drugs that bridges the gap between monoclonal
antibodies and small molecule drugs. Like monoclonal antibodies, they can be designed to bind to
therapeutic targets with high affinity, but they are more stable and easier to produce and to administer.
In this chapter, we present a generic structure-based computational approach for miniprotein inhibitor
design. Specifically, we describe step-by-step the implementation of the approach for the design of miniprotein binders against the SARS-CoV-2 coronavirus, using available structural data on the SARS-CoV-2 spike receptor binding domain (RBD) in interaction with its native target, the human receptor ACE2.
Since structural data are increasingly available for many protein–protein interaction systems, this method
might be applied to the design of miniprotein binders against numerous therapeutic targets. The computational pipeline exploits provable and deterministic artificial intelligence-based protein design methods, with
some recent additions in terms of binding energy estimation, multistate design and diverse library
generation.
Key words Computational protein design, Miniprotein binders, Multistate protein design, Binding
affinity, Protein–protein interaction, SARS-CoV-2.
1 Introduction
Miniprotein binders can have antibody-like affinity and functionality with the advantages of stability and amenability to synthesis over
monoclonal antibodies [1–3]. They can also avoid weaknesses of
larger scaffolds such as poor tissue or cell penetration, protease and
reduction sensitivity. Miniprotein binders can also have several
advantages over small molecule drugs, notably the capacity to
block protein–protein interactions when a deep binding pocket is
missing at the interface. Therefore, they have the potential to span
the gap between monoclonal antibodies and small molecule drugs
and thus greatly impact therapies and diagnoses. The ability to
design stable miniprotein binders with tight affinity for a given
target is thus of great interest for a wide range of medical applications [2, 4]. We present here a computational pipeline for the
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_17,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
design of miniprotein binders based on our advanced methods in
the structure-based computational protein design (CPD) field.
CPD has become a valuable approach in protein engineering to
identify protein sequences that can adopt a given 3D fold and
possess desired properties, from the exploration of combinatorial
sequence spaces astronomically larger than those that can be tested
experimentally [5–8]. The most usual description of the CPD
problem relies on a pairwise decomposable energy function, a discretized description of the amino acid conformational space based
on a library of frequent side chain conformations (i.e., rotamers)
and a single fixed backbone conformation. Under such assumptions, referred to as single state design (SSD), the problem of
searching for a sequence with the global minimum energy conformation (GMEC) is known to be NP-hard [9]. Because of this, most
CPD approaches rely on stochastic optimization algorithms [10–
12], mainly Monte Carlo simulated annealing [13]
(as implemented in Rosetta [14]), which provide only asymptotic
convergence guarantees. Although they have the advantage of
providing a solution at any time, they neither guarantee finding
the GMEC in finite time nor a bounded energetic distance to the
optimal solution. To try to circumvent this limitation, multiple
independent runs are performed (each with a predefined number
of steps) in order to cover, as well as possible, a rugged energy
landscape. However, the accuracy of metaheuristic methods may
drastically decrease as problem size increases [15, 16]. In contrast,
provable methods [17], historically based on the Dead-End-Elimination (DEE) theorem and the A* algorithm [18], provably identify the optimal minimum energy design in finite time.
Unfortunately, they are rapidly outstripped by the size of the search
space and often do not provide any solution in reasonable time [19–
21]. The capacity to efficiently identify sequences with optimal
conformations is thus challenging, especially in de novo design,
where the full sequence is designed.
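To make the single state design objective concrete, the following is a minimal Python sketch of a GMEC search under a pairwise decomposable energy, using exhaustive enumeration on a toy instance. It is not the provable algorithm discussed in this chapter: brute force scales exponentially with the number of positions and only illustrates what is being minimized. The function name, the choice labels, and all energy values are hypothetical.

```python
from itertools import product


def gmec_brute_force(unary, pairwise):
    """Exhaustively find the minimum-energy assignment.

    unary[i][r]              : self energy of choice r at position i
    pairwise[(i, j)][(r, s)] : interaction energy of choice r at i with s at j
    Each choice stands for one (amino acid, rotamer) pair.
    """
    positions = sorted(unary)
    best_energy, best_assignment = float("inf"), None
    for combo in product(*(sorted(unary[i]) for i in positions)):
        assignment = dict(zip(positions, combo))
        energy = sum(unary[i][assignment[i]] for i in positions)
        for (i, j), table in pairwise.items():
            energy += table[(assignment[i], assignment[j])]
        if energy < best_energy:
            best_energy, best_assignment = energy, assignment
    return best_energy, best_assignment


# Hypothetical toy instance: two positions, two choices each.
unary = {
    1: {"LEU_r1": -1.0, "ALA_r1": -0.5},
    2: {"PHE_r1": -2.0, "VAL_r1": -1.5},
}
pairwise = {
    (1, 2): {
        ("LEU_r1", "PHE_r1"): 3.0,   # unfavorable contact
        ("LEU_r1", "VAL_r1"): -1.0,
        ("ALA_r1", "PHE_r1"): -0.5,
        ("ALA_r1", "VAL_r1"): 0.0,
    }
}

energy, assignment = gmec_brute_force(unary, pairwise)
print(energy, assignment)
```

With n positions and d choices per position the loop visits d^n combinations, which is why exhaustive search is not usable beyond toy sizes.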
Relying on state-of-the-art “automated reasoning” artificial
intelligence algorithms, more specifically cost function networks
(CFNs) [22], we developed a provable CPD method [16, 19–21]
that speeds up searching by several orders of magnitude compared
to previous provable methods. It thus enables, with provable guarantees, on standard hardware, the identification of the GMEC as
well as suboptimal solutions at a given energy gap to the GMEC,
from a combinatorial search space of size that is far beyond what has
been solved with previous provable methods.
The remarkable efficiency of CFN algorithms to handle vast
search spaces was then exploited to push back some limits in the
formulation of the CPD problem. In particular, the fixed backbone
approximation in CPD may lead to the rejection of sequences that
would be accepted if a slightly different backbone conformation
was allowed. Furthermore, protein backbone flexibility plays a key
role in protein properties and intermolecular interactions, and this
can be even more crucial for peptides and miniproteins. To address
this, positive multistate protein design (MSD) with backbone
ensembles is an attractive approach [23–26]. In MSD, sequence
design relies on the energy contributions of several protein backbone conformations. Combinations of rotamers that would be
removed using a single fixed backbone conformation could be
accepted when several backbone conformations are used to inform
sequence selection. Two main MSD approaches based on distinct
criteria (or fitness) can be considered for sequence design. The first
one seeks to design a sequence for any of the considered backbone
states. The Boltzmann-weighted average of the energies (defined as
the sum of optimal energies, weighted by their Boltzmann probabilities) in each state may be an attractive criterion [27]. Because
this gives an exponential advantage to the backbone with lowest
energy, the computation of the fitness has been approximated by
the minimum optimal energy [24, 28, 29], defining what is called
“multistate analysis” (MSA) or minMSD. The second approach
aims to design a sequence that simultaneously fits several conformational states mimicking protein flexibility required to ensure
targeted function. In this case, it is usual to optimize the average
of the optimal energies over all states, a problem denoted Σ-MSD
[26]. We recently introduced efficient reductions of positive MSD
problems to CFNs, with the two criteria [30]. These CFN-based
methods, implemented in POMPd, have emerged as efficient MSD
approaches, outperforming state-of-the-art provable methods to
identify the guaranteed optimal solution and exhaustively enumerate suboptimal sequences. They can solve, in reasonable time, MSD
problems with several backbone conformations defining search
sizes unreachable up to now except with stochastic MSD
approaches with no guarantees of quality [30].
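The three fitness criteria mentioned above can be compared on a single sequence with a short Python sketch. It assumes that the optimal energy of the sequence on each backbone state has already been computed; the function names, the value of kT, and the example energies are assumptions made for illustration and are not taken from POMPd.

```python
import math

KT = 0.593  # kcal/mol, roughly room temperature (assumed value)


def min_msd(state_energies):
    """minMSD / MSA fitness: the best (lowest) optimal energy over the states."""
    return min(state_energies)


def sigma_msd(state_energies):
    """Sigma-MSD fitness: the average of the optimal energies over all states."""
    return sum(state_energies) / len(state_energies)


def boltzmann_average(state_energies, kt=KT):
    """Optimal energies weighted by their Boltzmann probabilities."""
    e_min = min(state_energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / kt) for e in state_energies]
    partition = sum(weights)
    return sum(w * e for w, e in zip(weights, state_energies)) / partition


# Hypothetical optimal energies of one sequence on three backbone states.
state_energies = [-120.0, -118.5, -119.2]

fitness_min = min_msd(state_energies)
fitness_sigma = sigma_msd(state_energies)
fitness_boltzmann = boltzmann_average(state_energies)
print(fitness_min, fitness_sigma, fitness_boltzmann)
```

Because the Boltzmann weights favor the lowest-energy state exponentially, the weighted average always lies between the minimum and the plain average, which is the rationale given above for approximating it by the minimum.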
Beyond the identification of the GMEC and the enumeration
of suboptimal sequences within an energy threshold of the GMEC,
we also extended our CPD approaches to generate libraries of
sequences which are both diverse and of low energies [31]. Indeed,
producing diverse sequences with a known energy distance to the
GMEC can be useful to alleviate the effect of the approximations
that exist in CPD models. Based on an incremental CFN approach
using sequence diversity constraints that lower-bound the Hamming distance between sequences, the developed method enables
the efficient identification of sequences that satisfy guarantees on
both sequence diversity and energy quality.
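The diversity constraint can be illustrated with a small Python sketch that measures Hamming distances and builds a diverse subset greedily from an already scored list of sequences. This greedy filter is only an illustration of the constraint: unlike the incremental CFN approach described above, it gives no guarantee on energy quality. All names, sequences, and energies are hypothetical.

```python
def hamming(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def satisfies_diversity(library, min_distance):
    """True if every pair of sequences differs at min_distance positions or more."""
    return all(
        hamming(a, b) >= min_distance
        for idx, a in enumerate(library)
        for b in library[idx + 1:]
    )


def greedy_diverse_subset(scored, min_distance, size):
    """Pick low-energy sequences that respect the distance lower bound.

    scored: list of (sequence, energy) pairs. Greedy, hence not optimal.
    """
    selected = []
    for sequence, _energy in sorted(scored, key=lambda pair: pair[1]):
        if all(hamming(sequence, kept) >= min_distance for kept in selected):
            selected.append(sequence)
        if len(selected) == size:
            break
    return selected


# Hypothetical scored sequences (energies in kcal/mol).
scored = [("ACDEF", -10.0), ("ACDEY", -9.8), ("AWWEF", -9.5), ("GGGGG", -7.0)]
library = greedy_diverse_subset(scored, min_distance=2, size=3)
print(library, satisfies_diversity(library, 2))
```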
In addition to all these, our CFN-based method, EasyE, can
efficiently estimate protein–protein binding energies of a large
number of mutants [32]. The binding energy is estimated by the
difference in energy of the bound and unbound proteins in their
globally optimal rotameric side chain conformations. Compared to
state-of-the-art computational methods, EasyE shows better
correlation coefficients between predicted and experimental values.
EasyE is thus highly useful to rank mutant sequences of libraries
according to their binding energy.
The CFN-based CPD methods are generic and can be applied
to the design of different types of proteins. They recently enabled
the engineering of a highly stable artificial self-assembling symmetrical eight-bladed β-propeller [33], optimized enzymes in terms of
activity and thermostability, as well as new, highly stable nanobody
scaffolds (data not published). Based on these methods, in this
chapter, we present a pipeline for miniprotein binder design. We
provide a step-by-step guide to the pipeline and illustrate it with an
application to the design of miniprotein binders against the SARS-CoV-2 coronavirus, which emerged in late 2019 and has since
caused a global pandemic [34]. SARS-CoV-2 initiates its entry into
host cells by binding to the human angiotensin-converting
enzyme2 (ACE2) via the receptor binding domain (RBD) of its
spike protein [35, 36]. Miniprotein binders that can block SARS-CoV-2 RBD from binding to ACE2 may potentially prevent the
virus from entering human cells and serve as an effective antiviral
drug [37]. Numerous works aim to develop such antiviral drugs
[38]. As miniproteins are more suitable for disrupting protein–
protein interactions than small molecules, by specifically binding
to the interface binding region, some studies have been recently
focused on the design of such anti-SARS-CoV-2 binders, based on
the analysis of the 3D structure of the ACE2-SARS-CoV-2 RBD
complex [39–41]. Compared to monoclonal antibodies, miniprotein binders can also have the advantage of reduced
immunogenicity.
This chapter reports a complete computational approach to
design such miniprotein binders, starting from the de novo building of a 3D scaffold. The overall and versatile pipeline based on
CFN technology can be used for different targets. It includes
several successive steps (Fig. 1). From the construction of a 3D
scaffold of the miniprotein, a multistate design of the core region is
performed using several backbone states generated with a backrub-like backbone simulation. An initial binding mode between the
miniprotein and the target protein is then built by superimposing
some structural motifs of the miniprotein on the protein–protein
complex to be inhibited. This initial binding mode is used as
reference during an automated docking step. For several selected
binding modes, a multistate design of the binding interface (and
residues outside the core region) is performed with several backbone states (also generated using a backrub-like backbone simulation), generating a library of diverse and low energy sequences. The
last step of the pipeline consists in ranking the library sequences
according to the binding energy to the target and the energy of the
minibinder. Finally, candidate sequences can be selected for experimental testing. The detailed description of the implementation of
[Fig. 1 flowchart. Box labels: 3D Scaffold Construction, Backrub Simulation, 3D Model, MultiState Core Design, Input Model, Docking, Minimization, Binding Modes, Backrub Simulation, MultiState Binder Design, Sequence library, Binding Energy Estimation, Ranking, Selected Candidates. Selection criteria: min(Ebackrub), (ΔEcomplex, Eminiprot).]
Fig. 1 Computational design pipeline. The ellipses, rectangles, and diamonds
indicate data types, operations, and criteria of selection, respectively. The main
steps of the pipeline include: (1) the construction of the 3D scaffold domain;
(2) multistate design of the core region with several backbone states generated
using a backrub-like backbone simulation; (3) docking of the miniprotein on the
target and selection of binding modes; (4) multistate design of the binding
interface (and residues out of the core region) and generation of a sequence
library; (5) ranking of the sequences according to the binding energy ΔEbinding
and the energy of the minibinder Eminiprotein
the computational pipeline is given below and exemplified with the
design of a triple alpha helix anti-SARS-CoV-2 miniprotein. The
choice of a triple helix bundle domain results from the analysis of
the structural motifs involved in the interaction of ACE2 with the
SARS-CoV-2 RBD [42].
2 Materials
The computational miniprotein design process relies on POMPd
[30] and EasyE [32], implemented using the toulbar2 prover [22]
and PyRosetta version 4 for energy computation (based on the
Rosetta beta_nov16 energy function [43]). The overall computational pipeline also involves Rosetta version 3.11 [14] for energy
relaxation, docking, and backrub-type backbone sampling.
AMBER18 and AmberTools v19.12 [44] are used for molecular
modelling and analyses. Python 3 is required for computations
(we used the 3.6.9 version).
Analysis and graphics were performed using R version 3.4.4.
Molecular models were visualized and aligned using PyMOL
(Schrödinger, LLC). WebLogo3 (http://weblogo.threeplusone.
com/create.cgi) was used for the generation of sequence logos.
The hardware used consisted of a single workstation with Intel
Xeon E5-2650 2.3 GHz CPUs, running the Ubuntu LTS distribution 18.04.
3 Methods
3.1 Building of Miniprotein Scaffold and Core Design
This section explains the different steps, ranging from the building
of the 3D triple alpha helix scaffold of the miniprotein up to its core
design using MSD with backbone conformations generated from
backrub-like backbone simulation that recapitulates natural protein
conformational variability [45] (Fig. 2).
3.1.1 Miniprotein 3D Scaffold Construction
1. Access the CCBuilder 2.0 web server (coiledcoils.chm.bris.ac.uk/ccbuilder2/builder).
2. Using the advanced options menu, set the oligomeric state to three
and tick the Anti Parallel box for the second chain.
3. Provide an input sequence. The input sequence is dependent
on knowledge of the targeted interaction. For our system, we
used STIEEQAKTFLDKFNHEA, GDKWSAFLKEQSTLAQMY, and AQNLQNLTVKLQLQALQ, originating,
respectively, from ACE2 alpha helix regions 19–36, 66–83,
and 89–101, involved in the binding interface with SARS-CoV-2 spike RBD (PDB ID: 6LZG).
[Fig. 2 flowchart. Steps and tools: 3D Scaffold Construction (CCBuilder), Backbone Relaxation (AMBER minimization, RosettaRelax), Backrub Simulation (RosettaBackrub), selection by min(Ebackrub), MultiState Core Design (POMPd).]
Fig. 2 Triple alpha helix miniprotein backbone building and its core design. The triple alpha helix bundle
backbone was constructed using CCBuilder (a coiled-coil modeling server) [46]. The 3D backbone scaffold
was generated using as sequence, the amino acid types of ACE2 alpha helix regions involved in the binding
with SARS-CoV-2 RBD, from visualization of the ACE2/SARS-CoV-2 X-ray structure (PDB ID: 6LZG). To connect
the helices, two serine residues were added to form each linker, and the 3D model was then minimized using
the ff14SB amber force field of AMBER18 [47]. The full 3D miniprotein (57 amino acid residues) was then
relaxed using RosettaRelax [48] with the beta_nov16 energy function [43]. Backbone conformations were
sampled using the Rosetta backrub-like backbone simulation method. After clustering, three backbone
conformations were selected, and MSD was performed for designing the core region of the miniprotein.
The beta_nov16 energy function was also used for MSD with POMPd [30]
4. Add linkers to connect the helices and minimize the miniprotein. In our case, linkers formed by two serines were added to
connect the helices, using the Build functionality of PyMOL
(Schrödinger, LLC), and the system was minimized (5000 steps
of steepest descent and 5000 steps of conjugate gradient),
using the ff14SB force field and the sander module of
AMBER18 [44] (see Note 1).
5. Relax the 3D miniprotein model using RosettaRelax with the
beta_nov16 energy function and the command:
nohup relax.linuxgccrelease -s MyMiniProtein.pdb -nstruct 10 -keep_input_protonation_state -overwrite -beta_nov16 &
Check that the relaxed structure has not deviated too much
from the initial conformation (backbone RMSD <0.5 Å) and
pick the lowest energy relaxed model. If the deviation is larger, consider rerunning the relaxation with
the following options:
-relax:constrain_relax_to_start_coords -relax:coord_cst_stdev 0.5 -relax:coord_constrain_sidechains
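The backbone RMSD check of step 5 can be sketched in plain Python as follows. This is an illustrative helper, not part of Rosetta or AMBER: it reads fixed-column PDB ATOM records and compares N, CA, C, and O atoms without any superposition, which is an assumption. The value obtained this way is an upper bound on the superposed RMSD, so a result below 0.5 Å is sufficient to accept the model. The small formatter at the end only builds synthetic records for the demonstration.

```python
import math

BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def backbone_coords(pdb_lines):
    """Map (chain, residue number, atom name) to xyz for backbone atoms."""
    coords = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()
        if name in BACKBONE_ATOMS:
            key = (line[21], line[22:27].strip(), name)
            coords[key] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return coords


def backbone_rmsd(reference_lines, relaxed_lines):
    """RMSD over shared backbone atoms, without superposition."""
    ref = backbone_coords(reference_lines)
    mob = backbone_coords(relaxed_lines)
    common = sorted(set(ref) & set(mob))
    if not common:
        raise ValueError("no common backbone atoms")
    squared = sum(
        sum((a - b) ** 2 for a, b in zip(ref[key], mob[key])) for key in common
    )
    return math.sqrt(squared / len(common))


def atom_line(serial, name, resseq, x, y, z):
    """Format a fixed-column PDB ATOM record (used only for the demo below)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, "SER", "A", resseq, x, y, z)


names = ["N", "CA", "C", "O"]
reference = [atom_line(i + 1, n, 1, float(i), 0.0, 0.0) for i, n in enumerate(names)]
relaxed = [atom_line(i + 1, n, 1, float(i) + 0.3, 0.0, 0.0) for i, n in enumerate(names)]
rmsd = backbone_rmsd(reference, relaxed)
print(f"{rmsd:.3f}", rmsd < 0.5)
```

On real files, the two arguments can be open file handles, for example the input model and one of the relaxed models written by the relax command.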
3.1.2 Miniprotein Core Design
The amino acids of the core of the 3D miniprotein model previously built and relaxed (MyMiniProtein.pdb) are designed using
the MSD method (POMPd), with the Σ-MSD criterion. The objective
is to stabilize the miniprotein core before designing amino acids for
the binding with the target.
1. Generate backbone conformations for MSD using the RosettaBackrub method [45] with the command:
nohup mpirun -np 10 backrub.mpi.linuxgccrelease -in:file:s MyMiniProtein.pdb -beta_nov16 -nstruct 100 -backrub:ntrials 10000 -keep_input_protonation_state &
Cluster the generated conformations using the K-means
clustering algorithm over the backbone RMSD by requesting
N clusters, as implemented using the cpptraj module [49] of
AMBER18 [44] (see Note 2).
Select the lowest energy conformation of each cluster and
name each corresponding PDB format file as MyMiniProtein1.pdb, MyMiniProtein2.pdb, etc. In our example,
three backbone conformational states were selected and considered as input for MSD, in addition to the initial relaxed
conformation (MyMiniProtein.pdb).
2. Select amino acids defining the core of the miniprotein, for
example, by a visual inspection of the 3D model using PyMOL.
3. Create a so-called Resfile that specifies the amino acid residues
to be designed (the core amino acid residues previously
defined) and the amino acid side chains to consider as merely
flexible in MSD (all other residues of the miniprotein). An
example of Resfile is described in Note 3. An identical Resfile
is required for each backbone considered in MSD (named MyMiniProtein.resfile, MyMiniProtein1.resfile, MyMiniProtein2.resfile, etc.).
4. Clone the POMPd git repository, copy the 3D structural
models (MyMiniProtein.pdb, MyMiniProtein1.pdb,
etc.) and the Resfiles (MyMiniProtein.resfile, MyMiniProtein1.resfile, etc.) into the MSD/positive/ directory, compile the toulbar2 solver, and submit MSD as follows:
cd /PATH/TO/YOUR/DIRECTORY/
git clone https://forgemia.inra.fr/thomas.schiex/pompd
cp MyMiniProtein*.pdb MyMiniProtein*.resfile pompd/positive/
cd pompd/positive/
make toulbar2
cd pompd/positive
nohup make positive.gmec &
The execution can be traced back in the nohup.out file.
The designed sequence is saved in the positive.seq
output file, where it is repeated once for each state. The script
below extracts one copy of the sequence and maps it on the
backbone states by side chain placement.
size=`cat MyMiniProtein.nat | awk '{printf $1}' | wc -m`
seq=`cut -c1-$size positive.seq`
for i in *pdb
do
python3 ./exe/tb2cpd.py --doscp --scpseq $seq -i $i.pdb
done
The 3D miniprotein model corresponding to the designed
sequence mapped on the initial relaxed backbone state (saved as
MyMiniProteinCD.pdb) is used for the following step. In
our example, this miniprotein is named Gyro.
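The input preparation of steps 1 and 3 above (one representative per cluster, then one identical Resfile per backbone state) can be sketched in Python. The sketch assumes that the clustering output is available as (file name, cluster index, energy) tuples and that the Resfile uses the standard Rosetta commands NATAA (native amino acid, flexible side chain) and ALLAA (all amino acids allowed); the authors' own Resfile example is the one given in Note 3 and may differ. File names of backrub models, core positions, and energies are hypothetical.

```python
def pick_cluster_representatives(models):
    """Return the lowest-energy model name of each cluster.

    models: iterable of (pdb_name, cluster_id, energy) tuples.
    """
    best = {}
    for name, cluster, energy in models:
        if cluster not in best or energy < best[cluster][1]:
            best[cluster] = (name, energy)
    return [best[cluster][0] for cluster in sorted(best)]


def resfile_text(core_positions, chain="A"):
    """Design the listed core positions, keep all other residues flexible."""
    lines = ["NATAA", "start"]  # header default applies to unlisted residues
    for position in sorted(core_positions):
        lines.append(f"{position} {chain} ALLAA")
    return "\n".join(lines) + "\n"


# Hypothetical clustering output: (file name, cluster index, total energy).
models = [
    ("backrub_0001.pdb", 0, -150.2),
    ("backrub_0002.pdb", 1, -149.0),
    ("backrub_0003.pdb", 0, -151.7),
    ("backrub_0004.pdb", 2, -148.3),
    ("backrub_0005.pdb", 1, -149.9),
]
representatives = pick_cluster_representatives(models)

# Names following the chapter's convention, initial relaxed state first.
state_names = ["MyMiniProtein.pdb"] + [
    f"MyMiniProtein{k}.pdb" for k in range(1, len(representatives) + 1)
]
resfile_names = [name.replace(".pdb", ".resfile") for name in state_names]
resfile = resfile_text([5, 9, 12])  # hypothetical core positions
print(representatives, resfile_names)
print(resfile)
```

Copying each representative to its new name and writing the same `resfile` text under every name in `resfile_names` then yields the file set expected by the MSD step.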
3.2 Docking on the Protein Target
This section describes the exploration and selection of binding
modes of the miniprotein (MyMiniProteinCD) on the protein
target using RosettaDock [50]. Available structural data on the
protein–protein interaction (Complex_ref.pdb) to be blocked
by the miniprotein can be used as reference for the docking. An
initial binding mode between the miniprotein and the target protein can be built by superimposing the miniprotein on the similar
structural motifs (Binding_native_residues) of the native
binding protein. The resulting 3D model can then be used as a
starting binding mode for the docking. In our case, the structure of
the SARS-CoV-2 RBD in complex with ACE2 (PDB ID: 6LZG)
[42] was used, and the miniprotein was aligned to the structural
motifs (regions 19–36, 66–83, and 89–101) of ACE2 involved in
the binding to SARS-CoV-2 RBD (Fig. 3).
1. Using PyMOL, fetch Complex_ref.pdb from the Protein
Data Bank (rcsb.org) [51], load the MyMiniProteinCD.pdb,
select the Binding_native_residues, and align the MyMiniProteinCD.pdb to the Binding_native_residues.
PyMOL>fetch Complex_ref.pdb
PyMOL>load MyMiniProteinCD.pdb
PyMOL>select Complex_ref and resid [Binding_native_residues]
PyMOL>align MyMiniProteinCD, sele, cycles=0
Save the complex between the target and the miniprotein
and name it MyMiniProtein_Target.pdb (in our example,
the complex of the RBD target (residues 333 to 527 in 6LZG.
pdb) with the miniprotein).
2. Relax the complex (MyMiniProtein_Target.pdb) using the
same command and procedure as described in Subheading
3.1.1.
Name the output structural model: MyMiniProtein_Target_relax.pdb.
3. Run the docking simulation using the following command:
nohup mpirun -np 2 docking_protocol.mpi.linuxgccrelease -s MyMiniProtein_Target_relax.pdb -ex1 -ex2aro -partners B_A -dock_pert 2 5 -docking:sc_min -nstruct 100 -use_input_sc -beta_nov16
4. Minimize each binding model.
minimize.default.linuxgccrelease -s *.pdb -run:min_type lbfgs_armijo_nonmonotone -run:min_tolerance 0.001 -out:suffix .min -beta_nov16
Select several binding models and save them as MyMiniProtein_Target_pose1.pdb, MyMiniProtein_Target_pose2.pdb, etc.
In our example, six distinct docking poses (Fig. 3) were
selected among the 100 generated. The distribution of the
100 protein binding modes according to their energy score is
shown in Fig. 3. As the interaction surface is subsequently
[Fig. 3 panels. (A) Miniprotein (Gyro), native binding protein (ACE2), and target (SARS-CoV-2 RBD); miniprotein-target complex relaxation. (B) Docking and minimization: histogram of counts versus minimized docking pose energy [kcal/mol], with the six selected poses labeled 1 to 6.]
Fig. 3 Miniprotein/SARS-CoV-2 RBD Docking. (a) The miniprotein model named Gyro (colored in green) was
superposed to ACE2 (colored in cyan), based on corresponding binding interface regions in the complex
between ACE2 and SARS-CoV-2 RBD (in green). The resulting model between the miniprotein and SARS-CoV2 RBD was relaxed using the Rosetta beta_nov16 score function. The relaxed complex was then used as
starting conformation for docking. Each generated docking pose was minimized using the beta_nov16 score
function. (b) The distribution of the 100 docking poses according to their energy is shown. Six models were
selected, giving priority to those that can give rise to the largest interaction surface
redesigned, the selection of the protein binding modes was
based on the analysis of shape complementarity of surfaces
between the two proteins, rather than specifically on the best
energy scores.
3.3 Design of Diverse Sequence Libraries
This section concerns the redesign of the binding interface between
the miniprotein and the protein target for each of the binding poses
previously selected. MSD [30] is performed with an ensemble of
backbone conformational states of the complex, sampled using
RosettaBackrub [52], and a library of sequences that are both diverse and
low in energy is generated.
For each selected docking pose, the following protocol is
applied:
1. Perform backrub-like backbone simulations and select conformational states using the procedure of Subheading 3.1.2.
Save the selected states as MyMiniProtein_Target_pose[n]_1.pdb,
MyMiniProtein_Target_pose[n]_2.pdb, etc.
The selected states and the initial conformation of the
binding pose (MyMiniProtein_Target_pose[n].pdb) are
used as input in MSD. In our case, two conformational states of
the Gyro/SARS-CoV-2 RBD complex were selected and considered as input for MSD, in addition to the initial relaxed
conformation of the complex.
2. Create the Resfile that specifies the amino acid residues to be
designed (all the amino acid residues except the core residues,
as defined in Subheading 3.1.2, which are only considered
as flexible) (see Note 3). A copy of this file must be saved as
many times as there are states considered in MSD (MyMiniProtein_Target_pose[n].resfile, MyMiniProtein_Target_pose[n]_1.resfile, etc.).
3. Create a file that specifies the sequence diversity constraint
(based on the Hamming distance between sequences) for the
generation of the library and name it with the same prefix as the
.pdb file and the .resfile using the .div suffix. An example
would be MyMiniProtein_Target_Pose[n].div. It is formatted as JSON and should look as in the following example:
{
"regions": ["1-57 A"],
"diversity": 6,
"solutions": 10
}
This example indicates that the considered residues are the
residues 1 to 57 of the chain A and that 10 sequences are
requested with a diversity constraint of at least six mutations
between any two sequences.
Then launch the library generation with:
make MyMiniProtein_Target_Pose[n].opt+lib
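The diversity specification file of step 3 can also be produced programmatically, which is convenient when several libraries with different diversity constraints are generated for each pose. The sketch below uses only the three keys shown in the example above ("regions", "diversity", "solutions"); the helper name and the validation rules are assumptions, not part of POMPd.

```python
import json


def diversity_spec(regions, diversity, solutions):
    """Build the content of a .div file as a Python dictionary."""
    if diversity < 1 or solutions < 1:
        raise ValueError("diversity and solutions must be positive integers")
    return {
        "regions": list(regions),
        "diversity": int(diversity),
        "solutions": int(solutions),
    }


# One specification per requested Hamming distance lower bound.
specs = {d: diversity_spec(["1-57 A"], d, 10) for d in (1, 3, 6)}
div_text = json.dumps(specs[6], indent=2)
print(div_text)
```

Writing `div_text` to a file that carries the same prefix as the corresponding .pdb and .resfile files, with the .div suffix, gives the input described in step 3.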
The output files:
• positive.seq: Contains all optimal sequences of the library with the total energy of each sequence over all the conformational states considered in MSD. Note that the sequences are repeated once for each input state provided. Please refer to Subheading 3.1.2 for commands to extract single sequences.
• MyMiniProtein_Target_Pose[n].opt+[m], for each of the m sequences of the library: Corresponds to the 3D model (in PDB format) of the sequence mapped on the MyMiniProtein_Target_pose[n] backbone state.
For our miniprotein design, Gyro, we previously retained six
binding poses from docking simulations, and for each of them, we
generated three diverse sequence libraries using MSD with diversity
constraints of 1, 3, and 6 mutations. We requested 10 sequences
per library, leading to a total of 180 designed sequences. We considered three conformational states of the Gyro/SARS-CoV-2 RBD complex for performing MSD for each docking pose.
Designs took roughly 15 min for three states of a Gyro/SARS-CoV-2 RBD complex including 30 mutable positions and the
222 other residues considered as flexible.
The total resulting library comprises 159 unique sequences out
of 180. The average energies and amino acid variability of designed
sequences are shown in Fig. 4. In the library, the amino acid
residues of the first two helices are more variable than those of the
C-terminal helix. The first two helices are mainly involved in the
binding of RBD in the native binding mode of ACE2 (PDB ID:
6LZG), which is consistent with the entropy profile in these
regions, because of the presence of a more chemically diverse
neighborhood than around the C-terminal helix, which is mainly
exposed to the solvent. As shown in Fig. 4, the ranking of the poses is
essentially the same as after docking, except for a swap between
poses 3 and 6.
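The two library statistics used in this analysis, the number of unique sequences and the per-position Shannon entropy, can be computed with a few lines of Python. This is a minimal sketch: the logarithm base (2, giving bits) is an assumption, since the text does not state which base was used, and the four-residue sequences are hypothetical.

```python
import math
from collections import Counter


def positional_entropy(sequences):
    """Shannon entropy (in bits) of the amino acid distribution at each position."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    total = len(sequences)
    entropies = []
    for column in zip(*sequences):
        counts = Counter(column)
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


# Hypothetical merged library containing one duplicate.
library = ["ACDE", "ACDF", "AGDE", "ACDE"]
unique_sequences = sorted(set(library))
entropies = positional_entropy(library)
print(len(unique_sequences), [round(h, 3) for h in entropies])
```

A fully conserved position has zero entropy, while positions where the design freely alternates between residue types score higher.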
3.4 Designed Miniproteins Ranking and Analysis
For each designed miniprotein, the binding energy, ΔEbinding, with
the protein target is estimated using EasyE [32] in order to rank the
sequences of the library according to their affinity for the target
protein.
EasyE can be downloaded and installed using the commands:
git clone https://git.renater.fr/anonscm/git/easy-jayz/easy-jayz.git
cd easy-jayz/exes/
sh toulbar2-install.sh
Fig. 4 Designed miniproteins. (a) Energy and sequence entropy per docking pose considered. The average
energy of designed miniproteins in complex with the spike RBD over the conformational space used for MSD is
presented for every design, together with the Shannon entropy associated with the sequences produced for
each docking pose. (b) All sequences were merged for WebLogo construction portraying the sequence space
mapped for RBD binding. The Shannon entropy of all sequences (n = 180) is mapped on the scaffold of the
initial relaxed conformation of Gyro
Note that its usage requires specific python libraries, as listed in
the Quick_start.pdf file.
An example of usage is provided in the Example/ folder and
described in the Quick_start.pdf instruction file in the downloaded git archive.
As input files, EasyE requires:
• A .pdb file corresponding to the 3D structural model of the miniprotein scaffold with the initial sequence, used as input in previous MSD (see Subheading 3.3): MyMiniProtein_Target_Pose[n].pdb.
• A .seqE file that specifies, for each designed sequence, the changes of amino acid types before and after design: mut.seqE. This file should contain all the desired sequences to be mapped on a given scaffold for binding energy computations. A simple procedure to generate such a file is detailed in Note 4.
ΔEbinding calculation can be performed using the MyMiniProtein_Target_Pose[n].pdb and the mut.seqE file in the
easy-jayz/ directory with the following command:
./exes/EasyE.py --pdb ./MyMiniProtein_Target_Pose[n].pdb --seq ./mut.seqE --partner A_B --score beta_nov16 --v 1 --min 1 --rec 1 --forced_out 1 --lig 1
Here is a short description of each argument used:
• --partner: The interaction chains following the nomenclature [Designed partner]_[Binding target].
• --score: Score function used, here beta_nov16 [43].
• --v: Verbose output.
• --min: Performs minimization with 0.5 kcal/mol of standard deviation harmonic restraints and 0.001 tolerance threshold.
• --rec: Forces the computation of receptor energy.
• --lig: Forces the computation of the ligand energy.
• --forced_out: Forces the output of energies.
Two output files are produced:
• MyMiniProtein_Target_Pose[n].DeltaE: Contains the values of binding energy, ΔEbinding.
• MyMiniProtein_Target_Pose[n].E: Contains the energy values of bound (Eminiprot/target) and unbound states (Eminiprot and Etarget).
For ranking, we use ΔEbinding as a metric for the binding affinity
and Eminiprot to check the miniprotein stability. A miniprotein of
choice would have a high affinity for its intended target and yet be
sufficiently stable in its unbound form. We compared the scores of
each of the states used in MSD, mapped with each designed
sequence of the library for the post-analysis of our design results,
and considered the state with the minimum ΔEbinding as the state
defining the potential affinity of the sequence. With this criterion,
Pose 1 appears as the best scaffold for binding design. While it has
only a slightly lower binding energy than Pose 4, its unbound form
is far more stable as indicated by its lower Eminiprot.
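The ranking logic described above can be sketched in Python. It assumes that ΔEbinding is the energy of the bound state minus the sum of the energies of the two unbound partners, that the three energies are available for every state of every designed sequence, and that Eminiprot is reported for the state giving the minimum ΔEbinding. The data layout, names, and numbers are hypothetical and do not reproduce the EasyE output format.

```python
def binding_energy(e_complex, e_miniprot, e_target):
    """Energy of the bound state minus the energies of the unbound partners."""
    return e_complex - (e_miniprot + e_target)


def rank_designs(designs):
    """Rank sequences by their best (minimum) binding energy over the states.

    designs: {sequence_name: [(e_complex, e_miniprot, e_target), ...]},
    with one tuple per backbone state used in MSD.
    Returns (name, best binding energy, miniprotein energy in that state),
    sorted by binding energy, then by miniprotein energy.
    """
    rows = []
    for name, states in designs.items():
        best_state = min(states, key=lambda state: binding_energy(*state))
        rows.append((name, binding_energy(*best_state), best_state[1]))
    return sorted(rows, key=lambda row: (row[1], row[2]))


# Hypothetical energies in kcal/mol for two designed sequences.
designs = {
    "seqA": [(-700.0, -150.0, -480.0), (-705.0, -152.0, -480.0)],
    "seqB": [(-690.0, -160.0, -470.0)],
}
ranking = rank_designs(designs)
for name, delta_e, e_miniprot in ranking:
    print(name, delta_e, e_miniprot)
```

In this toy case seqA ranks first on binding energy, while seqB has the more favorable unbound energy, which is the kind of trade-off that the comparison of ΔEbinding and Eminiprot is meant to expose.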
Interestingly, the entropy profile for the designed libraries
based on the best pose scaffold (Pose 1) shows low sequence
entropy at the N-ter of the scaffold and higher entropy at the
C-ter helix, which points toward the solvent (Fig. 5). However,
the lowest-energy design according to the ΔEbinding ranking
criterion carries seven mutations at this site, most
of which are hydrophobic (6W, 9I, 12F, 13W, 16V, and 17I) or
Fig. 5 Stability and affinity assessment for designed miniproteins. (a) The values of the best ΔEbinding (first
panel) and Eminiprot (second panel) are given in kcal/mol. The optimal design for each criterion (ΔEbinding or
Eminiprot) is presented below. (b) Sequence entropy profiles for the best designs are mapped on Gyro’s scaffold,
and the WebLogo representing amino acid diversity is shown beside. The 3D model of the lowest ΔEbinding
design, in interaction with the RBD, is presented with interface residues highlighted in lines representation.
The mutated residues are shown in sticks representation and are written in red in the sequence on the right.
The design space is shown with a light blue background in the sequence
negatively charged residues at the N-ter (2D). The high proportion
of hydrophobic packing at the RBD binding site seems to favorably
accommodate the RBD binding site surface residues, which are
mainly hydrophobic (Fig. 5).
4 Conclusion
We described a generic method for miniprotein inhibitor design,
illustrating its feasibility using structural data on the SARS-CoV-2 RBD in interaction with its native target, the human receptor
ACE2. Structural data being increasingly accessible for many
molecular binding systems, this method might be applied to a
variety of protein complexes. Compared to common heuristic
methods, our method is reliable and convenient to
use; it combines optimality guarantees with computational efficiency. As an indication, generating 10 sequences with a Hamming
diversity constraint of six mutations between all produced
sequences, using three conformational states as input, 33 mutable
positions, and 222 flexible positions took around 15 min on standard hardware. The main advantage of the diversity constraintbased library generation is that it better accounts for the fact that
energy functions are still approximate; it more efficiently spans the
mutation profiles for a given protein scaffold, yielding a more varied
list of mutants to be tested, enhancing the probability for a functional mutant to be identified [31]. Common methods might get
trapped around the minimal energy sequence and hence lack the
ability to generate diverse yet low energy sequences. We compared
the best miniprotein obtained with this protocol to recently proposed minibinders of a comparable size, binding to the same target
[40]. The LCB1 and LCB3 minibinders show an EasyE-estimated
binding energy of respectively 75.81 kcal/mol and 55.77 kcal/
mol. The best solution produced by our protocol reached an estimated ΔEbinding of 131.9 kcal/mol. This makes an experimental
assay of the proposed miniprotein attractive, to confirm its practical
functionality.
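The Hamming diversity constraint mentioned above can be illustrated with a short check on a candidate library. This is only a minimal sketch: the function names and the toy sequences are hypothetical, and the sketch verifies the constraint after the fact rather than reproducing how the design software enforces it during optimization.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def satisfies_diversity(seqs, min_dist: int) -> bool:
    """True if every pair of sequences differs at >= min_dist positions."""
    return all(hamming(a, b) >= min_dist for a, b in combinations(seqs, 2))


# Toy library (hypothetical sequences, not designs from this chapter).
library = ["AAAAAAAA", "AAWWWWWW", "WWWWWWAA"]
print(satisfies_diversity(library, 4))
print(satisfies_diversity(library, 6))
```

In this toy library the second and third sequences differ at only four positions, so the library satisfies a constraint of four mutations but not one of six.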
5 Notes
1. Initial models were prepared and energy-minimized with
AMBER18 [44] using the ff14SB force field [47] with a 9 Å
cutoff. The commands below can be entered either directly in
the tleap terminal or in a file leap.in, which is then launched
with the command tleap -f leap.in.
The output files can then be used for minimization. An input file
is provided below, along with the command to run the
minimization with sander [44].
Leap input file:
source oldff/leaprc.ff14SB
source leaprc.gaff
source leaprc.water.tip3p
mol = loadpdb MyMiniProtein.pdb
saveAmberParm mol MyMiniProtein.prmtop MyMiniProtein.inpcrd
quit
Sander input file:
MINIMIZATION
&cntrl
imin=1, ntx=1, irest=0, ntpr=50, ntf=1, ntb=0,
cut=9.0, nsnb=10, ntr=1, maxcyc=10000, ncyc=5000, ntmin=1,
&end
END
Then run the following command:
sander -O -i min.in -p MyMiniProtein.prmtop -c MyMiniProtein.inpcrd -r MyMiniProtein_min.rst -o mini.out -ref MyMiniProtein.inpcrd
2. Backrub simulation clustering can be achieved through
K-means clustering, as implemented in the cpptraj module of
AmberTools19 [49]. Here is an example script for clustering
the .pdb output of a RosettaBackrub simulation. This
method ensures access to conformationally diverse states. The
procedure could be simplified by taking the lowest-energy
conformations in your backrub ensemble.
cpptraj_cluster.in
parm MyProtein.pdb
trajin *pdb
average crdset MyAvg
run
rms ref MyAvg :1-57&@C,CA,N,O
cluster c1 \
kmeans clusters 3 randompoint maxit 500 \
rms :1-57&@CA,C,N,O \
out cluster_cnumvtime.dat \
summary cluster_summary.dat \
info cluster_info.dat
run
3. Resfiles stipulate which residues are mutable, flexible, or rigid during design, using Rosetta syntax. Mutable
residues are allowed to adopt any rotamer of any specified
residue type. All residue types can be allowed with the ALLAA keyword,
while specific types can be given using PIKAA. If no resfile is
provided, all residues will be considered as flexible only. You
can force a residue to be flexible using the NATAA keyword.
Rigid residues are ignored in the energy optimization and are
specified by the NATRO keyword. Note that rigid residues will
not appear in the output sequences (.seq extension). This can
be cumbersome, for example, when using the substitution_names_easyE.py script provided for Subheading 3.4.
A simple trick to get the native sequence without its rigid
residues is to perform a side chain packing of your protein
with all residues, except the rigid ones, set as flexible. These
sequences can then be used for design comparison. For more
detailed resfile syntax, please refer to the Rosetta
documentation
(https://www.rosettacommons.org/docs/latest/rosetta_basics/file_types/resfiles).
The first lines of an example resfile used in this chapter are
presented below:
NATAA USE_INPUT_SC
start
1 A ALLAA USE_INPUT_SC
2 A ALLAA USE_INPUT_SC
3 A ALLAA USE_INPUT_SC
5 A ALLAA USE_INPUT_SC
6 A ALLAA USE_INPUT_SC
9 A ALLAA USE_INPUT_SC
...
4. EasyE relies on a specific sequence format for mutation specification. The file follows a [Residue Number]_[Mutation type]
syntax. Multiple mutations must be separated with an underscore. A python script is provided in the easy-jayz/exes/
folder for converting your designed sequences into the proper
format: substitution_name.py. This script can be used to convert a toulbar2 output sequence (the MyMiniProtein_Target_Pose[n].nat native sequences, together with positive.seq for MSD or
MyMiniProtein_Target_Pose[n].seq for SSD) to an EasyE readable file. For each design
condition (pose, diversity constraint, conformational state),
use:
awk '{print $1}' MyProtein_Target_min.nat > ref.seq
size=$(awk '{printf $1}' MyProtein_Target_min.nat | wc -m)
awk '{print $1}' positive.seq | cut -c1-$size > mut.seq
./exes/substitution_name_easyE.py --mut mut.seq --ref ref.seq > mut.seqE
The output file mut.seqE contains the designed sequence
in the proper format, i.e., as mutations to the native sequence.
Please note that rigid residues will be absent from the output
sequence, as they are ignored in the energy optimization.
Hence, we recommend referring to Note 3 for more detailed
instructions on how designs with rigid residues can be
handled.
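The resfile layout shown in Note 3 can also be written programmatically. The sketch below is a hypothetical helper, not part of the distributed scripts: the function name and arguments are assumptions, and it only uses the keywords quoted in Note 3 (NATAA as default, ALLAA for mutable positions, NATRO for rigid positions, with USE_INPUT_SC as in the example resfile).

```python
def build_resfile(mutable, rigid=(), chain="A"):
    """Return resfile text: NATAA default, ALLAA for mutable, NATRO for rigid."""
    mutable, rigid = set(mutable), set(rigid)
    overlap = mutable & rigid
    if overlap:
        raise ValueError(f"positions both mutable and rigid: {sorted(overlap)}")
    # Default behaviour for unlisted residues, then the per-residue section.
    lines = ["NATAA USE_INPUT_SC", "start"]
    for pos in sorted(mutable):
        lines.append(f"{pos} {chain} ALLAA USE_INPUT_SC")
    for pos in sorted(rigid):
        lines.append(f"{pos} {chain} NATRO")
    return "\n".join(lines) + "\n"


# Mutable positions taken from the first lines of the example in Note 3.
resfile_text = build_resfile([1, 2, 3, 5, 6, 9])
print(resfile_text)
```

Writing the returned text to a file gives the same first lines as the example resfile; rigid positions, if any, would be appended as NATRO entries.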
Acknowledgments
The authors thank the French ANR for financial support through
the grant ANR-19-PI3A-0004. We also thank the Computing
mesocenter of Région Midi-Pyrénées (CALMIP, Toulouse, France)
for providing access to the HPC resources.
References
1. Vazquez-Lombardi R, Phan TG, Zimmermann C et al (2015) Challenges and opportunities for non-antibody scaffold drugs. Drug Discov Today 20:1271–1283. https://doi.org/10.1016/j.drudis.2015.09.004
2. Crook ZR, Nairn NW, Olson JM (2020) Miniproteins as a powerful modality in drug development. Trends Biochem Sci 45:332–346. https://doi.org/10.1016/j.tibs.2019.12.008
3. Gebauer M, Skerra A (2020) Engineered protein scaffolds as next-generation therapeutics. Annu Rev Pharmacol Toxicol 60:391–415. https://doi.org/10.1146/annurev-pharmtox-010818-021118
4. Chevalier A, Silva D-A, Rocklin GJ et al (2017) Massively parallel de novo protein design for targeted therapeutics. Nature 550:74–79. https://doi.org/10.1038/nature23912
5. Mignon D, Druart K, Michael E et al (2020) Physics-based computational protein design: an update. J Phys Chem A 124:10637–10648. https://doi.org/10.1021/acs.jpca.0c07605
6. Setiawan D, Brender J, Zhang Y (2018) Recent advances in automated protein design and its future challenges. Expert Opin Drug Discov 13:587–604. https://doi.org/10.1080/17460441.2018.1465922
7. Samish I (2017) Computational protein design. Humana Press
8. Kuhlman B, Bradley P (2019) Advances in protein structure prediction and design. Nat Rev Mol Cell Biol 20:681–697. https://doi.org/10.1038/s41580-019-0163-x
9. Pierce NA, Winfree E (2002) Protein Design is NP-hard. Protein Eng Des Sel 15:779–782. https://doi.org/10.1093/protein/15.10.779
10. Kuhlman B, Dantas G, Ireton GC et al (2003) Design of a novel globular protein fold with atomic-level accuracy. Science 302:1364–1368. https://doi.org/10.1126/science.1089427
11. Villa F, Panel N, Chen X, Simonson T (2018) Adaptive landscape flattening in amino acid sequence space for the computational design of protein:peptide binding. J Chem Phys 149:072302. https://doi.org/10.1063/1.5022249
12. Mignon D, Simonson T (2016) Comparing three stochastic search algorithms for computational protein design: Monte Carlo, replica exchange Monte Carlo, and a multistart, steepest-descent heuristic. J Comput Chem 37:1781–1793. https://doi.org/10.1002/jcc.24393
13. Kuhlman B, Baker D (2000) Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci U S A 97:10383–10388
14. Leaver-Fay A, Tyka M, Lewis SM et al (2011) Rosetta3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487:545–574. https://doi.org/10.1016/B978-0-12-381270-4.00019-6
15. Voigt CA, Gordon DB, Mayo SL (2000) Trading accuracy for speed: a quantitative comparison of search algorithms in protein sequence design. J Mol Biol 299:789–803. https://doi.org/10.1006/jmbi.2000.3758
16. Simoncini D, Allouche D, de Givry S et al (2015) Guaranteed discrete energy optimization on large protein design problems. J Chem Theory Comput 11:5980–5989. https://doi.org/10.1021/acs.jctc.5b00594
17. Hallen MA, Donald BR (2019) Protein design by provable algorithms. Commun ACM 62:76–84. https://doi.org/10.1145/3338124
18. Leach AR, Lemon AP (1998) Exploring the conformational space of protein side chains using dead-end elimination and the A* algorithm. Proteins 33:227–239. https://doi.org/10.1002/(SICI)1097-0134(19981101)33:2<227::AID-PROT7>3.0.CO;2-F
19. Traoré S, Allouche D, André I et al (2013) A new framework for computational protein design through cost function network optimization. Bioinformatics 29:2129–2136. https://doi.org/10.1093/bioinformatics/btt374
20. Traoré S, Allouche D, André I et al (2017) Deterministic search methods for computational protein design. Methods Mol Biol 1529:107–123. https://doi.org/10.1007/978-1-4939-6637-0_4
21. Allouche D, André I, Barbe S et al (2014) Computational protein design as an optimization problem. Artif Intell 212:59–79. https://doi.org/10.1016/j.artint.2014.03.005
22. Hurley B, O’Sullivan B, Allouche D et al (2016) Multi-language evaluation of exact solvers in graphical model discrete optimization. Constraints 21:413–434. https://doi.org/10.1007/s10601-016-9245-y
23. Druart K, Bigot J, Audit E, Simonson T (2016) A hybrid Monte Carlo scheme for multibackbone protein design. J Chem Theory Comput 12:6035–6048. https://doi.org/10.1021/acs.jctc.6b00421
24. Davey JA, Chica RA (2012) Multistate approaches in computational protein design. Protein Sci 21:1241–1252. https://doi.org/10.1002/pro.2128
25. Davey JA, Chica RA (2014) Improving the accuracy of protein stability predictions with multistate design using a variety of backbone ensembles. Proteins 82:771–784. https://doi.org/10.1002/prot.24457
26. Sauer MF, Sevy AM, Crowe JE Jr, Meiler J (2020) Multi-state design of flexible proteins predicts sequences optimal for conformational change. PLoS Comput Biol 16:e1007339. https://doi.org/10.1371/journal.pcbi.1007339
27. Davey JA, Damry AM, Euler CK et al (2015) Prediction of stable globular proteins using negative design with non-native backbone ensembles. Structure 23:2011–2021. https://doi.org/10.1016/j.str.2015.07.021
28. Davey JA, Chica RA (2017) Multistate computational protein design with backbone ensembles. Methods Mol Biol 1529:161–179. https://doi.org/10.1007/978-1-4939-6637-0_7
29. Karimi M, Shen Y (2018) iCFN: an efficient exact algorithm for multistate protein design. Bioinformatics 34:i811–i820. https://doi.org/10.1093/bioinformatics/bty564
30. Vucinic J, Simoncini D, Ruffini M et al (2020) Positive multistate protein design. Bioinformatics 36:122–130. https://doi.org/10.1093/bioinformatics/btz497
31. Ruffini M, Vucinic J, de Givry S et al (2019) Guaranteed diversity quality for the weighted CSP. In: 2019 IEEE 31st international conference on tools with artificial intelligence (ICTAI), pp 18–25
32. Viricel C, de Givry S, Schiex T, Barbe S (2018) Cost function network-based design of protein-protein interactions: predicting changes in binding affinity. Bioinformatics 34:2581–2589. https://doi.org/10.1093/bioinformatics/bty092
33. Noguchi H, Addy C, Simoncini D et al (2019) Computational design of symmetrical eight-bladed β-propeller proteins. IUCrJ 6:46–55. https://doi.org/10.1107/S205225251801480X
34. Hui DS, Azhar EI, Madani TA et al (2020) The continuing 2019-nCoV epidemic threat of novel coronaviruses to global health - the latest 2019 novel coronavirus outbreak in Wuhan, China. Int J Infect Dis 91:264–266. https://doi.org/10.1016/j.ijid.2020.01.009
35. Zhou P, Yang X-L, Wang X-G et al (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579:270–273. https://doi.org/10.1038/s41586-020-2012-7
36. Hoffmann M, Kleine-Weber H, Schroeder S et al (2020) SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell 181:271–280.e8. https://doi.org/10.1016/j.cell.2020.02.052
37. Shyr ZA, Gorshkov K, Chen CZ, Zheng W (2020) Drug discovery strategies for SARS-CoV-2. J Pharmacol Exp Ther 375:127–138. https://doi.org/10.1124/jpet.120.000123
38. Pomplun S (2021) Targeting the SARS-CoV-2-spike protein: from antibodies to miniproteins and peptides. RSC Med Chem 12(2):197–202. https://doi.org/10.1039/D0MD00385A
39. Linsky TW, Vergara R, Codina N et al (2020) De novo design of potent and resilient hACE2 decoys to neutralize SARS-CoV-2. Science 370:1208–1214. https://doi.org/10.1126/science.abe0075
40. Cao L, Goreshnik I, Coventry B et al (2020) De novo design of picomolar SARS-CoV-2 miniprotein inhibitors. Science 370:426–431. https://doi.org/10.1126/science.abd9909
41. Han Y, Král P (2020) Computational design of ACE2-based peptide inhibitors of SARS-CoV-2. ACS Nano 14:5143–5147. https://doi.org/10.1021/acsnano.0c02857
42. Wang Q, Zhang Y, Wu L et al (2020) Structural and functional basis of SARS-CoV-2 entry by using human ACE2. Cell 181:894–904.e9. https://doi.org/10.1016/j.cell.2020.03.045
43. Alford RF, Leaver-Fay A, Jeliazkov JR et al (2017) The Rosetta all-atom energy function for macromolecular modeling and design. J Chem Theory Comput 13:3031–3048. https://doi.org/10.1021/acs.jctc.7b00125
44. Case DA, Ben-Shalom IY, Brozell SR et al (2018) AMBER. University of California, San Francisco
45. Smith CA, Kortemme T (2008) Backrub-like backbone simulation recapitulates natural protein conformational variability and improves mutant side-chain prediction. J Mol Biol 380:742–756. https://doi.org/10.1016/j.jmb.2008.05.023
46. Wood CW, Woolfson DN (2018) CCBuilder 2.0: powerful and accessible coiled-coil modeling. Protein Sci 27:103–111. https://doi.org/10.1002/pro.3279
47. Maier JA, Martinez C, Kasavajhala K et al (2015) ff14SB: improving the accuracy of protein side chain and backbone parameters from ff99SB. J Chem Theory Comput 11:3696–3713. https://doi.org/10.1021/acs.jctc.5b00255
48. Conway P, Tyka MD, DiMaio F et al (2014) Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci 23:47–55. https://doi.org/10.1002/pro.2389
49. Roe DR, Cheatham TE (2013) PTRAJ and CPPTRAJ: software for processing and analysis of molecular dynamics trajectory data. J Chem Theory Comput 9:3084–3095. https://doi.org/10.1021/ct400341p
50. Gray JJ, Moughon S, Wang C et al (2003) Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 331:281–299. https://doi.org/10.1016/s0022-2836(03)00670-3
51. Berman HM, Westbrook J, Feng Z et al (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
52. Davis IW, Arendall WB, Richardson DC, Richardson JS (2006) The backrub motion: how protein backbone shrugs when a sidechain dances. Structure 14:265–274. https://doi.org/10.1016/j.str.2005.10.007
Chapter 18
Computational Design of Peptides with Improved
Recognition of the Focal Adhesion Kinase FAT Domain
Eleni Michael, Savvas Polydorides, and Georgios Archontis
Abstract
We describe a two-stage computational protein design (CPD) methodology for the design of peptides
binding to the FAT domain of the protein focal adhesion kinase. The first stage involves high-throughput
CPD calculations with the Proteus software. The energies of the folded state are described by a physics-based energy function and those of the unfolded peptides by a knowledge-based model that reproduces amino-acid compositions consistent with a helicity scale. The obtained sequences are filtered in terms of the affinity
and the stability of the complex. In the second stage, design sequences are further evaluated by all-atom
molecular dynamics simulations and binding free energy calculations with a molecular mechanics/implicit
solvent free energy function.
Key words Computational peptide design, Computational protein design, Proteus program,
Molecular mechanics, Monte Carlo
1 Introduction
The protein focal adhesion kinase (FAK) is required for the efficient
assembly and disassembly of focal adhesions and controls many
biological processes including cell adhesion, growth and survival,
embryonic development, and wound healing [1–3]. Due to its
increased expression in many cancers [3–5], it constitutes a
promising cancer therapy target [6–8]. Numerous efforts to inhibit
FAK have targeted its interactions with other proteins [3, 9–13]. An important target is the complex between FAK and the
protein Paxillin, whose formation is a necessary and sufficient condition for FAK localization at focal adhesions. In the present chapter we employ high-throughput computational protein design
(CPD) calculations and all-atom simulation/free energy analysis
calculations to optimize peptides that recognize the kinase focal
adhesion targeting (FAT) domain. The obtained sequences could
serve as a starting point for the design of peptidomimetic inhibitors
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_18,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
of the FAK:Paxillin complex. Such inhibitors are promising against
protein-protein interactions, which typically involve areas with sizes
of 1000–4000 Å², much larger than the contact areas of small
druglike molecules (a few hundred Å² [14, 15]).
High-throughput CPD calculations assess a large number
(10⁹) of sequences and conformations in a few hours, using a
desktop computer. Key ingredients for the efficiency and accuracy
of such calculations are the energy function and the sequence/
conformation sampling method. We describe the structure of the
folded states (FAT complexes and unbound peptides) by a fixed
backbone/discrete rotamer approximation and its interactions by a
physics-based energy function that includes molecular mechanics
terms and describes solvent effects by a Generalized Born term, a
Lazaridis-Karplus term, and a dispersion-interaction term
(MM/GBDILK approximation) [16, 17]. We describe implicitly
the unfolded state of the unbound peptide by a knowledge-based
model, adjusted to yield peptide sequences consistent with a helicity scale [18]. We sample the sequence/conformation space by a
Monte Carlo (MC) procedure that produces sequences weighted
according to their relative stabilities. We filter the designed
sequences with respect to the stabilities and affinities of the FAK
complex and evaluate selected sequences via all-atom simulations
and binding free energy calculations with the same MM/GBDILK
free energy function [16, 17].
The CPD calculations are conducted with the Proteus 3.0 CPD
software [19, 20], which is freely available for academic and government scientists from the website https://proteus.polytechnique.fr.
The distribution contains source code, binaries
for Intel processors, detailed documentation, and tutorials. Scientists in industry can consult the above address to obtain instructions. The all-atom simulations are conducted with the NAMD
program [21].
Description of the System
The focal adhesion kinase FAT
domain folds into a four-helix bundle [22–24]. As shown in
Fig. 1, two sites, respectively, at the helix 1/helix 4 interface (site
14) and at the helix 2/helix 3 interface (site 23), interact with
conserved segments on Paxillin with the consensus sequence
LDXLLXXL, known as LD motifs (reviewed in [25]). Selected
LD motifs of the Paxillin-family proteins are listed in Table 1. In
the FAK:Paxillin complex, motifs LD2 and LD4 interact, respectively, with sites 14 and 23, but short peptides with the Paxillin
LD2 or LD4 sequences can bind at both sites [26–29]. The affinities of LD-motif containing peptides for LD-binding domains are
on the order of μM [17].
Fig. 1 Structure of the focal adhesion kinase FAT complex with two LD-motif
peptides, bound at the interfaces of helices 1 and 4 (site 14) and helices 2 and
3 (site 23). The signature residues of the LDXLLXXL motif (positions 0 to +7) are indicated
Table 1
Sequence alignment of natural LD motifs of the Paxillin-family proteins Paxillin, Leupaxin, and HIC-5

Residue position  -3  -2  -1   0  +1  +2  +3  +4  +5  +6  +7  +8
Paxillin LD1       M   D   D   L   D   A   L   L   A   D   L   E
Paxillin LD2       L   S   E   L   D   R   L   L   L   E   L   N
Paxillin LD3       R   P   S   V   E   S   L   L   D   E   L   E
Paxillin LD4       T   R   E   L   D   E   L   M   A   S   L   S
Paxillin LD5       G   S   Q   L   D   S   M   L   G   S   L   Q
Leupaxin LD1       M   E   E   L   D   A   L   L   E   E   L   E
Leupaxin LD4       A   A   Q   L   D   E   L   M   A   H   L   T
Leupaxin LD5       K   A   S   L   D   S   M   L   G   G   L   E
Hic5 LD1           M   E   D   L   D   A   L   L   S   D   L   E
Hic5 LD2           L   C   E   L   D   R   L   L   Q   E   L   N
Hic5 LD4           T   L   E   L   D   R   L   M   A   S   L   S
Hic5 LD5           K   G   S   L   D   T   M   L   G   L   L   Q
Consensus motif                L   D   X   L   L   X   X   L
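As a quick illustration of how the alignment in Table 1 relates to the LDXLLXXL consensus, the sketch below tests a few of the listed 12-residue segments against the motif. The function name and the regular-expression encoding are our own illustrative choices; the only inputs taken from the table are the sequences and the convention that each segment starts at position -3.

```python
import re

# Consensus LDXLLXXL spans positions 0 to +7; X stands for any residue.
CONSENSUS = re.compile(r"LD.LL..L")
OFFSET = 3  # string index of position 0 in a segment starting at position -3


def matches_consensus(segment: str) -> bool:
    """True if positions 0..+7 of the segment match the strict consensus."""
    core = segment[OFFSET:OFFSET + 8]
    return CONSENSUS.fullmatch(core) is not None


# Sequences copied from Table 1.
ld_motifs = {
    "Paxillin LD1": "MDDLDALLADLE",
    "Paxillin LD2": "LSELDRLLLELN",
    "Paxillin LD3": "RPSVESLLDELE",
    "Paxillin LD4": "TRELDELMASLS",
}
for name, seq in ld_motifs.items():
    print(name, matches_consensus(seq))
```

Paxillin LD1 and LD2 satisfy the strict pattern, whereas LD3 (V and E at positions 0 and +1) and LD4 (M at position +4) deviate from it, so a strict match is only a coarse filter for LD-like sequences.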
In what follows, we first describe briefly the employed CPD
methodology. We present the energy function and structural model
for the folded state and the knowledge-based model for the
unfolded peptide. We then outline a step-by-step procedure to
identify sequences that optimize the peptide-protein affinities or
the complex stabilities. A final step involves testing selected
sequences by all-atom MD simulations in explicit solvent and binding free energy calculations with the MM/GBDILK model.
2 Methodology
The design methodology is summarized in the flowchart of Fig. 2.
Below, we describe the various steps in more detail.
Fig. 2 Flowchart of the various steps used in the peptide design:
1. Determine the target of sequence and conformational optimization
2. Describe the folded state (complex and unbound peptide): energy function and structural model
3. Compute the Interaction Energy Matrix (IEM)
4. Determine the knowledge-based model for the unfolded peptide. Stability design of the unbound peptide and complex
5. [...] chemical similarity)
6. Study selected complexes by all-atom MD and binding free energy analysis
2.1 Target of Optimization
In a standard CPD methodology [19, 20], the molecule is partitioned into three groups: (1) “fixed” residues (usually the backbone
and all glycines and prolines), kept in the same conformation;
(2) “inactive” residues that explore a small set of conformations,
usually from a discrete rotamer database [30]; (3) “active” residues
that change both chemical type and conformation and correspond
to the design target. The structural relaxation of the fixed backbone
and the discrete sidechain rotamers is treated implicitly via the
choice of the protein dielectric constant [31].
2.2 Description of the Folded State
2.2.1 Energy Function
The energy of a particular sequence (S)/rotamer {Ri} combination
of the folded complex and the folded, unbound peptide is described
by a physics-based free energy function with molecular mechanics
(MM) and solvation free energy terms:
$$E_X = \underbrace{E_X^{\mathrm{bonded}} + E_X^{\mathrm{vdw}} + E_X^{\mathrm{Coulomb}}}_{E^{\mathrm{MM}}} + \underbrace{E_X^{\mathrm{GB}} + E_X^{\mathrm{DI}} + E_X^{\mathrm{LK}}}_{E^{\mathrm{solv}}} \qquad (1)$$
Here, X denotes the complex (C) or the unbound peptide (P). The
MM energy terms correspond to bonded, van der Waals (vdw), and
Coulombic interactions and are modeled by the Amber ff99SB
force field [32]. The generalized Born (GB) term $E_X^{\mathrm{GB}}$ models the
interaction of the solute charges with the solvent polarization and
corresponds to the Hawkins-Cramer-Truhlar approximation [33–35].
The term $E_X^{\mathrm{DI}}$ models solute-water attractive dispersion interactions [16, 36, 37]. The Lazaridis-Karplus term $E_X^{\mathrm{LK}}$ models the
tendency of various groups to be exposed to solvent [38]. The
combination of GB with the DI and LK terms (MM/GBDILK
model) was parameterized in [16]. Our recent MD simulations and
free energy analysis have shown that this model reproduces well the
relative affinities of FAK complexes with LD-motif containing peptides [17]. The protein and solvent dielectric constants are set to
6.8 and 80, respectively [16, 17].
Basic ingredients of the GB term are the atomic solvation radii,
which approximate the atomic distances from the solvent interface.
These radii depend on the entire protein geometry, rendering the
GB term a many-body function. The pretabulation of energies in an
interaction energy matrix (IEM) prior to the design (see below)
requires the use of a pairwise-approximation energy function. For
the GB term, we achieve this by the “Fluctuating Dielectric Boundary Method,” described in Note 1 and refs. 39, 40.
2.2.2 Structural Model
In “fixed backbone” methods, such as the one used here, an optimization is performed for one or a few specific backbone conformations,
usually taken from an experimental structure (e.g., an x-ray structure) or an MD trajectory. The backbone choice determines the
atomic coordinates and the interactions of the entire system and
may affect the design. Multi-backbone methods have also been
developed that iterate between sequence and backbone conformation optimizations [41], sample sequence/backbone conformations simultaneously [42, 43] or generate backbone motions via
MC [42] or MD [17, 44].
The design system employed here consists of FAK residues
916–1049 (in Homo sapiens numbering) and two 12-residue peptides bound at the LD-motif recognition sites 14 (interface of FAK
helices 1 and 4) and 23 (interface of FAK helices 2 and 3) (see
Fig. 1). Our previous all-atom MD simulations of FAK complexes
with peptides containing the paxillin LD2 or LD4 motifs showed
that site 14 has a stronger affinity for LD2 and site 23 has a similar
affinity for either motif [17]. The structural model employed here is
taken from a simulation of the complex with two LD2-motif peptides (see Note 2).
2.3 Evaluation of the Interaction Energy Matrix (IEM)
Prior to the construction of the IEM matrix, a set of positions are
designated as active or inactive. At this stage, an extended range of
chemical types and rotamer conformations can be defined (e.g., the
natural aminoacids and the conformations of a rotamer database
[30], augmented by conformations seen in an experimental structure). During design, the range of active and inactive residues and
the list of chemical types/rotamer conformations sampled can be
restricted to a subset of the choices included in the IEM construction. Extended interaction sites at protein-protein interfaces can
be handled by performing sequential design simulations targeting
selected subgroups of positions.
To perform the IEM computation, each active residue is
replaced by a “giant residue” that includes all possible sidechain
types attached to its backbone Cα atom. For all available chemical
types and conformations at each active or inactive position I, we
compute interaction energies of I with itself and the fixed portion
of the system (diagonal IEM elements), and with every other active
or inactive residue J (off-diagonal terms). The calculation is performed separately for each residue position and is trivially parallelizable. The construction of the giant molecule and the computation
of the IEM elements are performed by suitable input files and shell
scripts. For more details we refer to the Proteus 3.0 manual [45].
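To make the role of the IEM concrete, the following sketch shows how the energy of one sequence/rotamer assignment could be assembled from pretabulated diagonal and off-diagonal elements. It is a simplified illustration under our own assumptions: the dictionary layout, the state labels, and the numerical values are hypothetical and do not reflect the file formats or data structures used by Proteus.

```python
def total_energy(assignment, diag, pair):
    """Sum diagonal terms and pairwise terms for one sequence/rotamer assignment.

    assignment: {position: state}, a state being a (type, rotamer) label
    diag:       {position: {state: energy}}, self + fixed-part interactions
    pair:       {(i, j): {(state_i, state_j): energy}} with i < j
    """
    positions = sorted(assignment)
    energy = sum(diag[i][assignment[i]] for i in positions)
    for a, i in enumerate(positions):
        for j in positions[a + 1:]:
            energy += pair[(i, j)][(assignment[i], assignment[j])]
    return energy


# Toy matrix with two positions (hypothetical values, in kcal/mol).
diag = {1: {"LEU:r1": -1.0, "ALA:r1": -0.5}, 2: {"ASP:r1": -2.0}}
pair = {(1, 2): {("LEU:r1", "ASP:r1"): -0.3, ("ALA:r1", "ASP:r1"): 0.1}}

e_leu = total_energy({1: "LEU:r1", 2: "ASP:r1"}, diag, pair)
e_ala = total_energy({1: "ALA:r1", 2: "ASP:r1"}, diag, pair)
print(e_leu, e_ala)
```

Because every term is looked up rather than recomputed, an MC move that changes one position only requires re-reading the row of pair terms involving that position, which is what makes the pairwise-decomposable energy function attractive for high-throughput sampling.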
To alleviate steric clashes due to the fixed backbone/discrete
rotamer approximation, during the IEM construction we conduct
short energy minimizations (see Note 3). With this protocol we
avoid performing on-the-fly minimizations during MC design and
sample states from a Boltzmann ensemble [19, 46].
2.4 Stability Design
The physical meaning of the MC in sequence space and its statistical
ensemble is explained in [19]. Briefly, we consider a solution that
contains equal concentrations of all possible sequences of a
molecule X, distributed into folded (F) and unfolded (U) states
according to their relative stabilities. A sequence change S1 → S2
corresponds to a process where one folded molecule S1 unfolds and
one unfolded molecule S2 folds [19]. The corresponding energy
change in the Metropolis MC criterion is
$$\Delta E_X^{S_1 \to S_2} = \left[ E_X^F(S_2) - E_X^F(S_1) \right] - \left[ E_X^U(S_2) - E_X^U(S_1) \right] \qquad (2)$$
Application of Eq. 2 requires defining the folded and unfolded
states. We discuss separately the stability calculations of the
unbound peptide and the complex.
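The energy change of Eq. 2 enters a standard Metropolis test. The sketch below is a minimal illustration, not the protMC implementation: the function names, the example energies, and the choice of room temperature are assumptions made for the example.

```python
import math
import random

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def delta_e(ef_s1, ef_s2, eu_s1, eu_s2):
    """Eq. 2: folded-state energy change minus unfolded-state energy change."""
    return (ef_s2 - ef_s1) - (eu_s2 - eu_s1)


def metropolis_accept(de, kt, rng=random.random):
    """Accept downhill moves always, uphill moves with probability exp(-de/kt)."""
    if de <= 0.0:
        return True
    return rng() < math.exp(-de / kt)


kt = KB * 298.15  # assumed temperature of 298.15 K
# Hypothetical folded (ef) and unfolded (eu) energies for sequences S1 and S2.
de = delta_e(ef_s1=-10.0, ef_s2=-12.0, eu_s1=1.0, eu_s2=0.5)
print(de, metropolis_accept(de, kt))
```

In this toy case the mutation lowers the folded-state energy by 2.0 kcal/mol and the unfolded-state energy by 0.5 kcal/mol, so the net change is -1.5 kcal/mol and the move is always accepted.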
2.4.1 The Unbound Peptide
We consider the helical conformation as the “folded” state of the
unbound peptide, since structural and secondary-structure prediction studies show that LD-motif peptides can also form α-helices in
solution. We model the unfolded unbound peptide P as a collection
of independent aminoacids [19]. In this model, an aminoacid with
chemical type t is associated with a characteristic “reference energy”
$E_t^{\mathrm{ref}}$. The total energy of a particular unfolded peptide sequence is
the sum of the reference energies of its constituent aminoacids:
$$E_P^U(S) = \sum_{t \in \mathrm{aa}} n_t(S)\, E_t^{\mathrm{ref}} \qquad (3)$$
In Eq. 3, nt(S) is the number of aminoacids of type t in sequence
S and the sum is over all aminoacid types.
We compute the aminoacid-dependent reference energies $E_t^{\mathrm{ref}}$
via a maximum-likelihood formalism employed previously in
whole-protein redesign [19, 47]. In our case, we adjust the reference energies to ensure that the designed sequences are consistent
with the AGADIR helix-propensity scale of Muñoz and Serrano
[18]. To achieve this, we compute the average aminoacid frequencies $\langle f_t \rangle$ during a stability design of the unbound peptide and
compare them with the aminoacid frequencies in the helicity scale, $f_t^{\mathrm{hel}}$.
We adjust the values $E_t^{\mathrm{ref}}$ via the following linear update rule [47]:
$$E_t^{\mathrm{ref}}(\nu + 1) = E_t^{\mathrm{ref}}(\nu) + \delta E \left[ f_t^{\mathrm{hel}} - \langle f_t \rangle(\nu) \right] \qquad (4)$$
where $E_t^{\mathrm{ref}}(\nu)$ and $\langle f_t \rangle(\nu)$ are the reference energies and the running
frequencies at MC iteration ν; the quantity δE is an empirical
constant with dimensions of energy, set to 0.5 kcal/mol in the
present work. The reference energies are updated until the design
frequencies converge to the helicity frequencies, $\langle f_t \rangle \to f_t^{\mathrm{hel}}$, $\forall t \in \mathrm{aa}$.
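Equations 3 and 4 translate directly into a few lines of code. The sketch below is illustrative only: the function names, the two-letter toy alphabet, and the frequency values are hypothetical, and a real calculation would obtain the observed frequencies from an MC stability design rather than from fixed numbers.

```python
from collections import Counter


def unfolded_energy(seq, e_ref):
    """Eq. 3: sum of reference energies over the residues of the sequence."""
    counts = Counter(seq)
    return sum(n * e_ref[t] for t, n in counts.items())


def update_reference_energies(e_ref, f_target, f_observed, delta_e=0.5):
    """Eq. 4: shift each reference energy by delta_e * (target - observed)."""
    return {
        t: e_ref[t] + delta_e * (f_target[t] - f_observed.get(t, 0.0))
        for t in e_ref
    }


# Toy two-type alphabet (hypothetical energies in kcal/mol and frequencies).
e_ref = {"A": 0.0, "L": 1.0}
print(unfolded_energy("ALLA", e_ref))

f_hel = {"A": 0.6, "L": 0.4}   # target frequencies from the helicity scale
f_obs = {"A": 0.4, "L": 0.6}   # running frequencies from the design
new_ref = update_reference_energies(e_ref, f_hel, f_obs)
print(new_ref)
```

A type that is under-represented relative to the target has its reference energy raised, which makes its unfolded state less favorable and therefore increases its population in the next design iteration.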
During design, we treat all peptide sidechains as active. The
design protocol of the unbound peptide is detailed in Note 4. We
improve sampling via a Replica-Exchange MC (REMC) method,
where multiple copies of the same system (replicas) are simulated at
a range of temperatures and are allowed to exchange temperatures
at periodic intervals. The procedure is performed by the C module
protMC.
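The exchange step of a replica-exchange simulation can be sketched as follows. The acceptance rule shown is the standard temperature-swap criterion; the chapter does not spell out the criterion used in protMC, so this is an assumption, and the function name and numerical values are hypothetical.

```python
import math


def swap_probability(e_i, e_j, kt_i, kt_j):
    """Standard replica-exchange acceptance probability for swapping
    the temperatures of replicas i and j with current energies e_i, e_j."""
    delta = (1.0 / kt_i - 1.0 / kt_j) * (e_i - e_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


# Replica i is the colder one (kt in kcal/mol, energies in kcal/mol).
p_favorable = swap_probability(e_i=-5.0, e_j=-10.0, kt_i=0.6, kt_j=1.0)
p_unfavorable = swap_probability(e_i=-10.0, e_j=-5.0, kt_i=0.6, kt_j=1.0)
print(p_favorable, p_unfavorable)
```

When the hotter replica has found a lower energy than the colder one, the swap is always accepted; otherwise it is accepted with a probability that decreases with the energy gap and the temperature difference.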
2.4.2 The Complex
In this case, the folded state is the structure of the complex. The
unfolded state consists of the unbound, unfolded peptide P and the
unbound, unfolded protein. The energy change in the MC
Metropolis criterion due to a peptide sequence mutation S1 → S2
in the complex C is
$$\Delta E_C^{S_1 \to S_2} = \left[ E_C^F(S_2) - E_C^F(S_1) \right] - \left[ E_P^U(S_2) - E_P^U(S_1) \right] \qquad (5)$$
The energies $E_C^F(S_1)$ and $E_C^F(S_2)$ of the protein complexes with
peptide sequences S1 and S2 are evaluated via Eq. 1; the energies
$E_P^U(S_1)$ and $E_P^U(S_2)$ of the unbound, unfolded peptides are evaluated via Eq. 3. The energy of the unbound, unfolded protein is also
approximated as a sum of reference energies, dependent on the
aminoacid type. Since the protein composition is the same in the
two complexes, the unfolded protein energy does not contribute to
Eq. 5.
Peptide sidechains are treated as active and protein sidechains
within 8 Å of any peptide atom as “inactive,” excluding prolines
and glycines. Other sidechain and backbone atoms are kept fixed.
The design protocol is detailed in Note 5.
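The acceptance test for a sequence mutation can be sketched in Python as below. The function names and the numerical energies are hypothetical; the default kT of 0.592 kcal/mol is the room-temperature value quoted in Note 4. The sketch only shows how the folded and unfolded energy differences enter the Metropolis criterion and does not reproduce the protMC code.

```python
import math
import random


def mutation_energy_change(e_fold_s1, e_fold_s2, e_unf_s1, e_unf_s2):
    """Eq. 5: folded-state change minus unfolded-peptide change (kcal/mol)."""
    return (e_fold_s2 - e_fold_s1) - (e_unf_s2 - e_unf_s1)


def metropolis_accept(delta_e, kt=0.592, rng=random):
    """Accept downhill moves always, uphill moves with probability exp(-dE/kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kt)


# Hypothetical energies for a mutation S1 -> S2 (kcal/mol).
delta = mutation_energy_change(-100.0, -103.0, -10.0, -11.0)
accepted = metropolis_accept(delta)
```

Because the unfolded protein energy is the same before and after the mutation, it never appears in the function arguments.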
2.5 Filtering of Sequences
The stability design calculations produce a large number of peptide
sequences in the environment of the complex and the unbound
peptide. To identify a set of promising sequences for further analysis, we apply various filtering criteria.
2.5.1 Binding Affinity
A main selection criterion is the binding affinity. Using the populations of a sequence S in the simulations of the complex ($p_C(S)$) and
the unbound peptide ($p_P(S)$), we first compute the stabilities of the
various sequences S relative to a reference sequence R in the
complex and the unbound peptide:

$$\Delta\Delta G^{stab}_C(S) = -k_B T \ln \frac{p_C(S)}{p_C(R)}, \qquad \Delta\Delta G^{stab}_P(S) = -k_B T \ln \frac{p_P(S)}{p_P(R)} \qquad (6)$$
The peptide sequence relative stabilities are generally different in
the complex and unbound state, due to the protein-peptide interactions. Subtracting the above equations, we obtain an estimate of
the relative binding free energy of sequence S in terms of the
sequence populations:
$$\Delta\Delta G^{bind}(S) = -k_B T \ln \frac{p_C(S)}{p_C(R)} + k_B T \ln \frac{p_P(S)}{p_P(R)} \qquad (7)$$
Extraction of the populations and calculation of the free energies are
performed with a Python script.
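The script itself is not reproduced in this chapter. A minimal sketch of the calculation, assuming that each simulation yields a list of sampled sequences (one entry per MC snapshot), could look as follows. All names and the toy counts are hypothetical.

```python
import math
from collections import Counter

KT = 0.592  # kcal/mol, room-temperature value quoted in Note 4


def relative_stability(counts, seq, ref, kt=KT):
    """Eq. 6: -kT ln[p(S)/p(R)]; the normalization cancels, so counts suffice."""
    return -kt * math.log(counts[seq] / counts[ref])


def relative_binding(counts_complex, counts_peptide, seq, ref, kt=KT):
    """Eq. 7: stability in the complex minus stability of the unbound peptide."""
    return (relative_stability(counts_complex, seq, ref, kt)
            - relative_stability(counts_peptide, seq, ref, kt))


# Hypothetical sampled sequences from the two simulations.
complex_run = ["LKTLDE"] * 600 + ["LSELDR"] * 200
peptide_run = ["LKTLDE"] * 300 + ["LSELDR"] * 300
counts_c, counts_p = Counter(complex_run), Counter(peptide_run)

ddg_bind = relative_binding(counts_c, counts_p, "LKTLDE", "LSELDR")
```

In this toy example the sequence is three times more populated than the reference in the complex and equally populated in the unbound peptide, so the estimate is negative (favorable). Sequences with very low counts would give noisy logarithms and are better discarded beforehand.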
Computational Optimization of Peptide Affinity for FAK
The above procedure can identify sequences with good stabilities and binding affinities. For a subset of the most promising
sequences, we perform additional rotamer optimization simulations of the unbound helical peptide and the complex, and compute
the corresponding average solution energies $\langle E^H_P(S) \rangle_{cf}$ and
$\langle E_C(S) \rangle_{cf}$, where $\langle \; \rangle_{cf}$ denotes the conformational average. We
evaluate the binding affinities relative to a reference sequence via
the expression:

$$\Delta\Delta G^{bind}(S) = [\langle E_C(S) \rangle_{cf} - \langle E^H_P(S) \rangle_{cf}] - [\langle E_C(R) \rangle_{cf} - \langle E^H_P(R) \rangle_{cf}] \qquad (8)$$
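Equation 8 amounts to a double difference of conformational averages. A short Python sketch, with hypothetical function names and made-up energies standing in for the MC output, is:

```python
from statistics import mean


def mean_binding_energy(e_complex, e_peptide):
    """Average energy of the complex minus that of the unbound helical peptide."""
    return mean(e_complex) - mean(e_peptide)


def relative_affinity(seq_complex, seq_peptide, ref_complex, ref_peptide):
    """Eq. 8: binding energy of sequence S relative to reference R."""
    return (mean_binding_energy(seq_complex, seq_peptide)
            - mean_binding_energy(ref_complex, ref_peptide))


# Hypothetical energies (kcal/mol) sampled over rotamer conformations.
ddg = relative_affinity([-50.0, -52.0], [-20.0, -22.0],
                        [-48.0, -48.0], [-20.0, -20.0])
```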
2.5.2 Additional Criteria
We further filter the sequences, using as criteria the stability of the complex and of the unbound peptide and the chemical
similarity of the sequences. More details are presented in
the Results.
2.6 All-Atom MD Simulations of Selected Complexes
Selected complexes predicted with good stabilities or binding affinities are subjected to all-atom MD simulations in explicit solvent.
Details of the simulation protocol are in Note 6. For each complex,
we extract coordinate sets at 10-ps intervals and compute the
solution free energies of the complex and the unbound molecules
via Eq. 1. We adopt the “single-trajectory” approximation, in
which the protein and peptide have identical conformations in the
complex and unbound states. In this approximation, bonded and
intramolecular vdw and Coulomb energies are identical in the
complex and unbound protein and do not contribute to stability.
The resulting binding affinities are

$$\Delta G_{bind} = (E^{vdW}_C + \Delta E^{DI} + \Delta E^{LK}) + (E^{Coul}_C + \Delta E^{GB}) \equiv \Delta G^{np}_{bind} + \Delta G^{p}_{bind} \qquad (9)$$
The last two terms correspond to nonpolar (np) and polar (p)
contributions to the binding free energies.
Structural relaxation accompanying complex formation is
ignored in the above approximation. Contributions to binding
free energy due to this relaxation could be estimated by conducting
independent simulations of the complex and unbound molecules.
These contributions could be associated with large uncertainties
and may not improve the results [48].
3 Results
3.1 Optimization of the Unfolded Peptide Reference Energies
We optimized the reference energies $E_t^{ref}$ of the unfolded peptide
via stability design calculations of a 12-residue peptide, terminated
by acetylated (ACE) and N-methylated (CT3) blocking groups.
The logos of Fig. 3a show the design choices at each position.
Fig. 3 Logo representation of the design sequences for (a) the unbound peptide; (b) the peptide bound at site
14 (top panel) or site 23 (bottom panel). The consensus LD motif LDXLLXXL is in the residue index range 0–7.
Amino acid types are shown in one-letter representation. The letter sizes are proportional to the aminoacid
probabilities. The letter color is associated with the aminoacid physical properties
The size of each letter reflects the corresponding aminoacid frequency. The various frequencies are fairly uniform (4–11%) in
accord with the Muñoz and Serrano (MS) scale [18]. Thus, the
computed reference energies almost flatten the sequence landscape
of the unbound peptide.
3.2 Preliminary Unrestricted Stability Design of the Complex
Using these reference energies, we performed a preliminary stability design test in the FAT complex with two 12-residue peptides
bound at sites 14 and 23. All 24 peptide sidechains were treated as
active. With an average 10 rotamers per aminoacid type, the resulting sequence/conformation space has a size of $(18 \times 10)^{24} \times 10^{124} = 10^{178}$ states, rendering the accurate computation
of sequence populations intractable. Nevertheless, the design is an
instructive test of the energy function and the computational protocol. The simulations employed 8 replicas in the range 88 K–1510 K.
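The order of magnitude quoted above can be checked with a few lines of Python. The numbers 18, 10 and 24 and the factor $10^{124}$ are taken from the text as given; the variable names are ours, and the origin of the second factor is not restated here.

```python
import math

n_types, n_rotamers, n_positions = 18, 10, 24

# log10 of (18 x 10)^24, the peptide sequence/rotamer combinations.
log10_peptide = n_positions * math.log10(n_types * n_rotamers)

# Exponent of the additional factor 10^124 quoted in the text.
log10_other = 124

log10_total = log10_peptide + log10_other
```

The peptide term alone contributes about 54 orders of magnitude, and the total rounds to the exponent 178 given in the text.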
The logos of Fig. 3b show the sequences of the room-temperature replica. The upper and lower plots correspond, respectively, to peptides bound at sites 14 and 23. Comparison with
Fig. 3a shows that the intermolecular interactions with the FAT
domain introduce strong sequence preferences at various positions.
Despite the enormous size of the sequence space, the design calculation correctly predicts hydrophobic residues at positions −3,
0, +3, +4, and +7 at both binding sites, in accord with the consensus LD motif LDXLLXXL (positions 0 to +7). Leucine appears most frequently in
nine out of these ten peptide positions, and is the sole choice at positions
0, +4 at site 14 and +3, +7 at site 23. These results suggest that
hydrophobic residues at these positions make key contacts with
nonpolar protein residues that stabilize the complex.
The design also places the signature aspartic acid (D) of the LD
motif as the most probable choice at position +1, for the peptide
bound at site 14; a glutamic acid (E) is also predicted as the fifth-most probable choice. At site 23, a larger range of aminoacid types
is accepted at position +1. At the remaining positions −2, −1, +2,
+5, +6, +8, a large variety of chemical types is inserted. Overall, the
test design reproduces key features of the LD-motif containing
sequences, despite the enormous space of possible sequence
combinations.
3.3 Focused Stability Design
To improve sampling, we restricted the list sampled by active
residues to chemical types observed at the same positions in
Paxillin-family proteins; the choices are summarized in Table 2.
Even with these restrictions, the total number of possible sequences
is very large ($\sim 10^7$). To facilitate design, we considered separately
the segments (−3) to (+2) (segment A) and (+2) to (+8) (segment
B). With the chemical types of Table 2 the two segments form
Table 2
Aminoacid types sampled during design

Residue position    Aminoacid type                   Total number of types
-3                  Leu, Thr                         2
-2                  Polar or charged                 10
-1                  Polar or negatively charged      8
0                   Leu, Val                         2
+1                  Polar or negatively charged      8
+2                  Polar or charged                 10
+3                  Leu                              1
+4                  Leu, Met                         2
+5                  All except positively charged    16
+6                  Polar or negatively charged      8
+7                  Leu                              1
+8                  Ser, Asn, Glu                    3
25600 and 7680 sequences, respectively. We designed each segment independently, maintaining the other segment in the paxillin
LD2 sequence. At the end, we created 12-residue sequences by
combining designed segments A and B with the same residue type
at the common position +2.
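The segment sizes follow directly from the numbers of allowed types per position in Table 2. The short Python check below (variable and function names are ours) multiplies these numbers over each segment and over the full peptide.

```python
from math import prod

# Number of allowed aminoacid types at each position, from Table 2.
n_types = {-3: 2, -2: 10, -1: 8, 0: 2, 1: 8, 2: 10,
           3: 1, 4: 2, 5: 16, 6: 8, 7: 1, 8: 3}


def n_sequences(first, last):
    """Number of sequences for positions first..last (inclusive)."""
    return prod(n_types[p] for p in range(first, last + 1))


seg_a = n_sequences(-3, 2)   # segment A, positions -3 to +2
seg_b = n_sequences(2, 8)    # segment B, positions +2 to +8
full = n_sequences(-3, 8)    # all 12 positions
```

This gives 25,600 sequences for segment A, 7680 for segment B and 19,660,800 for the full peptide, in line with the figures quoted in the text. Position +2 is counted in both segments, which is why the product of the two segment sizes exceeds the full count by a factor of 10.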
Choice of Designed Sequences Based on Affinity and Stability
The stability calculations considered the complex with peptides
bound at either of the two sites 14 and 23. Details of the design are
in Note 5. For segment A, we obtained 19,660 distinct sequences
at site 14 and 22,629 sequences at site 23. For segment B, we
obtained 4782 sequences at site 14 and 7461 sequences at site
23. In the case of the unbound peptide, we obtained 24,040
sequences for segment A and 6380 sequences for segment B.
In order to choose a set of sequences with good affinity and
stability, we applied a set of filtering criteria.
• The binding free energies are estimated from the sequence probabilities via Eq. 7. Low-probability sequences should be discarded,
as their energies have large uncertainties. In our case, we conducted $8 \times 10^8$-step MC runs. If all 26,000 segment-A sequences
were equally favored, each would appear about 30,000 times in the
design solutions. We discarded sequences that appear less frequently
by a factor of 10 (less than 3000 times).
• We selected sequences with better binding free energies than the
corresponding paxillin LD2 segment and better stabilities than the
corresponding native complex. We augmented this list with
sequences having a small positive stability relative to LD2 and
containing a D or E residue at position 1 (for segment A).
• To create 12-residue sequences, we combined segments A and B
containing the same type at the common position +2. For example,
segments LRTLDE (positions −3 to +2) and ELLATLE (positions +2 to +8) were joined to form
LRTLDELLATLE (positions −3 to +8). This procedure produced 9762/15,271
sequences at sites 14/23.
• We subjected the resulting complexes and unbound peptides to
rotamer optimization via MC simulations of $10^6$ steps at room
temperature and computed the binding affinities of the various
complexes, relative to the native LD2 complexes via Eq. 8.
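Two of the filtering steps above lend themselves to a short Python sketch: the count threshold used to discard poorly sampled sequences, and the joining of segments A and B through their shared residue at position +2. The function names are hypothetical; the second B segment in the example is made up for illustration, while the first pair is the example given in the text.

```python
from collections import defaultdict


def count_threshold(n_steps, n_sequences, factor=10):
    """Minimum count: the uniform expectation divided by `factor`."""
    return n_steps / n_sequences / factor


def join_segments(segments_a, segments_b):
    """Join A (positions -3..+2) and B (+2..+8) sharing the residue at +2."""
    by_first = defaultdict(list)
    for b in segments_b:
        by_first[b[0]].append(b)
    # Drop the duplicated +2 residue of B when concatenating.
    return [a + b[1:] for a in segments_a for b in by_first[a[-1]]]


threshold = count_threshold(8e8, 26000)
joined = join_segments(["LRTLDE", "LKTLDR"], ["ELLATLE", "KLLEELE"])
```

With the run length and sequence number quoted in the text, the threshold evaluates to roughly 3000 counts. In the joining example only the pair sharing E at position +2 is combined, giving LRTLDELLATLE.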
Profiles of sequences with better binding affinities than the
native complex are shown in Fig. 4.
At site 14, positions 0 and +4 are occupied almost exclusively
by leucine. At site 23, a valine is preferred over leucine at position 0;
methionine is almost as probable as leucine at position +4. A valine
residue is encountered at position 0 in paxillin motif LD3 and a
Fig. 4 Sequence logos obtained from refined design of the peptide in site 1/4 (above) and 2/3 (below). The
profiles are Boltzmann-weighted by the sequence binding free energies
methionine at position +4 in paxillin, leupaxin, and Hic-5 motif LD4
(Table 1). Position +1 prefers negatively charged sidechains, at
both sites, in accordance with the features of the “LD” motif. A
leucine is favored over threonine at position −3, as in the paxillin
LD2 motif. This is in accord with a recent MD study [17] of paxillin
LD2:FAK and LD4:FAK complexes, where L−3 formed improved
nonpolar interactions relative to T−3 at both binding sites. Paxillin
motifs LD2 and LD4 contain a negative (E) residue at position −1.
In the simulations of ref. 17, the E sidechain made several unfavorable long-range polar interactions with the protein. The design
substitutes E by polar residues at this position. Polar residues are
also encountered in Paxillin LD5, Leupaxin LD4, and Hic-5 LD5
motifs (Table 1).
In our earlier MD simulations, LD2 motif residue E+6 was not
engaged in stable hydrogen bonds at site 14 and formed a persistent
hydrogen bond with K1002 at site 23 [17]. In agreement with this,
the design predicts mainly polar aminoacids for position +6 at site
14 and mainly negative aminoacids at site 23. The C-terminal
position +8 is occupied by N and S residues in Paxillin motifs
LD2 and LD4; Leupaxin LD4 motif contains a T residue, and
several other LD motifs contain a negatively charged E residue.
The design favors an E residue at position +8, for site 14. The E
sidechain might form favorable hydrogen-bonding interactions
with a nearby histidine (H1025).
3.4 Clustering Based on Sequence Similarity
The design procedure identified several thousand sequences with
potentially improved binding affinity for FAK with respect to the
native LD2 motif. A next step is to choose a manageable number
(e.g., around ten) of sequences for further theoretical analysis and
experimental testing. For this purpose, we partitioned the solutions
into clusters of sequences with similar physicochemical properties
Site 14
Sequence        ΔΔG [kcal/mol]
LKTLDELLAYLE    -2.37
LKTLDELLLSLE    -2.32
LKYLDELLAHLE    -2.25
LKYLDELLLYLE    -2.18
LRTLDDLLLYLE    -2.15
LRTLDDLLAHLE    -2.14
..
LKTLDRLLAELE    -1.49
LYTLDKLLLELE    -1.48
LSTLDRLLMELE    -1.48
..

Site 23
Sequence        ΔΔG [kcal/mol]
LRQVEKLLEELE    -5.40
LQQVEKLLEELE    -5.39
LTYVQKLMEELE    -5.04
LKEVQKLMEELE    -5.01
LRQVEKLLFDLE    -4.98
LQQVEKLLYDLE    -4.91
..
LRYLEQLLAELE    -3.15
LRYLERLLDELE    -3.06
LTYLERLLDELE    -2.92
..

Fig. 5 Selected top affinity sequences at sites 14 and 23 for some of the clusters identified by sequence
similarity analysis. Affinities are relative to the LD2 motif
(see Note 7). We identified 655 sequence clusters at site 14 and
707 clusters at site 23. Figure 5 shows the top affinity sequences for
some of the clusters, ranked by their binding affinity relative to the
LD2 sequence.
3.5 All-Atom MD Simulations of Selected Complexes
The efficiency of the CPD calculations is based on approximations
such as the selection of one or a few structural models for the IEM
matrix, the fixed backbone and the discretization of the rotamer
conformations. Checking the design solutions by explicit-solvent
MD simulations is an important additional step that can validate predicted binders or
eliminate false positives. Below we present some examples of both
cases. Prior to the simulations, we optimized the rotamer conformations with the protX program [45]. The solvation and MD
simulation setup was prepared with the CHARMM-GUI interface
[49] and the simulations were conducted with the NAMD program
[21]. The binding affinities, shown in Table 3, were computed via
Eq. 9. The decomposition of peptide-protein interaction energies
relative to the LD2 complex (Fig. 6) provides insights into contributions from specific residues.
According to the design, sequences LKTLDELLAYLE and
LRQVEKLLEELE have the highest relative binding affinity for sites
14 and 23, respectively (Fig. 5). In the MD analysis, both
sequences retain improved affinities by 3.1–3.4 kcal/mol, relative
to the native LD2 complex. A dimer with these sequences
connected by a suitable linker might be a successful inhibitor of
the FAK:Paxillin complex. In the case of the first sequence,
improved binding at site 14 is mainly due to the nonpolar free
energy component (Table 3). The substitution N+8E (with respect
Table 3
Binding affinities of selected designed sequences for the focal adhesion kinase FAT domain,
evaluated by all-atom MD simulations of the corresponding complexes and post-processing analysis
in the MM-GBDILK approximation (Eq. 9)
Peptide
Binding
sequence
site
Binding affinity terms
ΔG np
bind
ΔE Coul
C
LSELDRLLLELN 14
(LD2)
25.1
(1.6)
4.3 (0.6) 1.0
46.2
(0.7)
(2.2)
11.9
5.9
3.3
28.4
(0.2)
(0.2)
(0.5)
(1.9)
23
25.7
(1.0)
17.9
(1.1)
17.7
46.7
(0.7)
(1.4)
14.7
6.6
0.3
25.4
(0.5)
(0.3)
(1.1)
(0.9)
14
28.5
(0.3)
1.0
(2.7)
4.2
50.7
(2.6)
(0.6)
12.6
6.3
3.2
31.7
(0.4)
(0.1)
( 0.1)
( 0.3)
LRQVEKLLEELE 23
28.8
(2.3)
21.3
(6.0)
21.1
49.7
(4.9)
(3.8)
14.3
6.7
0.1
28.7
(0.3)
(0.1)
( 7.8)
( 3.8)
LKTLDRLLAELE 14
24.9
(1.0)
2.8
(0.3)
5.4
45.2
(0.2)
(1.2)
11.6
6.0
2.6
27.5
(0.3)
(0.0)
(0.5)
(1.5)
LRYLEQLLAELE
21.5
(5.3)
26.9
(1.6)
25.9
39.5
(2.1)
(7.3)
13.4
5.6
1.0
20.5
(0.6)
(0.7)
(0.6)
(5.9)
LKTLDELLAYLE
23
ΔEGB
ΔE vdW
C
ΔG pbind
ΔGbind
ΔELK
ΔEDI
Fig. 6 Decomposition of binding free energies for sequences LKTLDELLAYLE at site 14 (left) and LRQVEKLLEELE
at site 23 (right). The values are computed relative to the native LD2 sequence (LSELDRLLLELN). Error bars
correspond to standard deviations. The labels “nX/Y” denote chemical types encountered at position n in the
LD2 (X) and designed (Y) sequence
to the LD2 motif) improves nonpolar contacts with proximal residues and electrostatic interactions with site 14 lysines; the substitution E−1T alleviates unfavorable polar contacts with a nearby
tyrosine (Y925). Notably, additional contributions to affinity are
due to the conserved leucines L−3, L0, and L+4. For sequence LRQVEKLLEELE,
the stronger relative affinity for site 23 is mainly
due to the polar interactions of the substitution E/Q−1 and L−3.
The substitution L/V0 reduces unfavorable polar contributions,
while the lysine sidechain is electrostatically preferred over the
arginine at position +2. Additionally E+6 makes slightly better
polar and nonpolar interactions with respect to the native LD2
motif.
Two sequences predicted by the MD analysis to bind more
weakly than the LD2 sequence are also included in Table 3.
Sequence LKTLDRLLAELE binds at site 14 with a relative affinity
of −1.5 kcal/mol (design) or +0.2 kcal/mol (MD). Similarly,
sequence LRYLEQLLAELE has a relative affinity for site 23 of
−3.2 kcal/mol (design) and +4.2 kcal/mol (MD).
4 Notes
1. In the “Fluctuating Boundary Method,” the GB interaction
term between two residues R and R′ is expressed as a polynomial of the residue solvation radii, with coefficients precomputed and stored [39, 40]. The solvation radii are updated
during the MC simulation and the GB term is computed
efficiently from the polynomial expression. A simpler method
is the “Native Environment Approximation,” where the solvation radius of a particular residue is computed with the rest of
the molecule in the wildtype sequence and native
structure [50].
2. A total of 1000 MD trajectory frames were extracted at 20-ps
intervals, aligned with respect to the backbone non-hydrogen
coordinates and clustered with Wordom [51]. Frames were in
the same cluster if their peptide and proximal protein Cα atoms
(within 5 Å of the peptide) had an RMSD less than 1.8 Å. For
the most populated cluster (62% of frames), we chose the
snapshot with the highest binding affinity in the
MM/GBDILK approximation (Eq. 9) [16, 17] as the structural model for the construction of the IEM of the complex
and unbound peptide.
3. In the computation of diagonal terms, we alleviate steric repulsions and improve sidechain orientations by 15 steps of
Powell’s conjugate gradient minimization. The sidechain is
retained at its rotamer via dihedral-angle harmonic restraints
with a 200 kcal/mol/rad$^2$ constant and a tolerance range of
5° around the initial rotamer angle; only atoms beyond Cβ are
allowed to move. Rotamers that retain high energies after
minimization can be excluded from the sampling space. Interactions between two sidechains I and J are computed if the
minimum distance $l_{min}$ between any atoms is smaller than 12 Å.
If $l_{min} \le 3$ Å, the orientations of the two sidechains are
optimized by a 15-step minimization; the interactions of the
two sidechains and the fixed part are taken into account and the
sidechains are retained in their specific rotamer orientations by
dihedral harmonic restraints.
4. We conducted iterative replica-exchange Monte Carlo
(REMC) simulations, using eight replicas with temperatures
between 88 K and 1510 K (thermal energies ranging from
0.175 to 3 kcal/mol). Swaps between neighboring replicas
were attempted every 50,000 steps. Each cycle had a length
of $5 \times 10^6$ MC steps per replica. At the end of a cycle, the
average aminoacid composition was computed from the
sequences generated from the room-temperature replica
($k_BT = 0.592$ kcal/mol) and the reference energies were
updated via Eq. 4.
5. REMC simulations were conducted as in the previous Note.
Each had a length of $8 \times 10^8$ MC steps, with swaps every 4000
steps. At each step, one or two positions were randomly chosen
and their rotamers and/or chemical types were modified with
the following frequencies: one-position rotamer changes 57%,
chemical type changes 11%; two-position rotamer changes
23%, rotamer/chemical type changes 6%, type changes 3%.
The modifications were accepted or rejected based on the
Metropolis criterion.
6. Prior to the simulations, the “active” and “inactive” sidechain
rotamers were optimized in the presence of the backbone and
remaining fixed sidechains with the protX program [45]. The
solvation and simulation setup was performed with the
CHARMM-GUI interface [49] and the simulations were conducted with the NAMD program [21]. The structures were
solvated in a truncated octahedral water box, creating a hydration layer of minimum thickness of 15 Å. The solvated complexes were minimized by 500 conjugate gradient and
500 adopted-basis Newton-Raphson steps, with harmonic
restraints of 1.0 kcal/mol/Å$^2$ on backbone and 0.1 kcal/mol/Å$^2$ on sidechain non-hydrogen atoms. Each complex
was equilibrated by a 0.5-ns simulation in the NVT ensemble
at 300 K, with backbone harmonic restraints and periodic
velocity reassignment. During production, the system temperature was maintained around 300 K by Langevin dynamics with
a friction coefficient of 5 ps$^{-1}$. The pressure was kept constant
at 1 atm using a Nose-Hoover Langevin piston with a period of
200 fs. Electrostatic interactions were computed every 2 steps
by the Particle Mesh Ewald method. A distance cutoff of 12 Å was
set for all non-bonded interactions. Each production run had a
time-step of 2 fs and a total duration of 10 ns.
7. This can be achieved with the aid of a similarity substitution
matrix [52, 53], which takes into account aminoacid
physicochemical properties. For example, substitutions which
preserve the sidechain size and charge, polar or nonpolar character can be classified as conservative; the corresponding
sequences can be grouped in the same cluster. Within each
cluster, the sequence with best binding affinity can be chosen
as the representative member of the cluster.
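The grouping described in Note 7 can be illustrated with a deliberately simplified Python sketch. Instead of a similarity substitution matrix [52, 53], it assigns each aminoacid to one of six physicochemical classes and places two sequences in the same cluster when their classes agree at every position. The class definitions (including the placement of His with the aromatic types) and the function names are assumptions made for this illustration; the three sequences and affinities are taken from Fig. 5 (site 14).

```python
# Hypothetical physicochemical classes (not the matrices of refs. 52, 53).
CLASSES = {
    "nonpolar": "AVLIMC", "aromatic": "FWYH", "polar": "STNQ",
    "positive": "KR", "negative": "DE", "special": "GP",
}
AA_CLASS = {aa: name for name, members in CLASSES.items() for aa in members}


def signature(seq):
    """Class of each residue, position by position."""
    return tuple(AA_CLASS[aa] for aa in seq)


def cluster(affinities):
    """Map each cluster representative to the sorted members of its cluster."""
    clusters = {}
    for seq, ddg in affinities.items():
        clusters.setdefault(signature(seq), []).append((ddg, seq))
    # Representative: member with the best (most negative) relative affinity.
    return {min(members)[1]: sorted(s for _, s in members)
            for members in clusters.values()}


# Relative affinities (kcal/mol) of three site-14 sequences listed in Fig. 5.
example = {"LKTLDELLAYLE": -2.37, "LRTLDDLLAHLE": -2.14, "LKTLDRLLAELE": -1.49}
reps = cluster(example)
```

Under these assumed classes the first two sequences differ only by conservative substitutions (K/R, E/D, Y/H) and fall in one cluster represented by LKTLDELLAYLE, whereas the third sequence forms its own cluster. A matrix-based distance with a cutoff would give a softer notion of similarity than this exact-match scheme.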
Acknowledgements
This work was co-funded by the European Regional Development
Fund and the Republic of Cyprus through the Research and Innovation Foundation (Project: INFRASTRUCTURES/1216/0060). EM was supported by a graduate student fellowship from
the University of Cyprus.
References
1. Schaller MD (2010) Cellular functions of FAK
kinases: insight into molecular mechanisms and
novel functions. J Cell Sci 123(7):1007–1013
2. Walkiewicz KW, Girault J, Arold ST (2015)
How to awaken your nanomachines: site-specific activation of focal adhesion kinases
through ligand interactions. Prog Biophys
Mol Bio 119(1):60–71
3. Naser R, Aldehaiman A, Dı́az-Galicia E, Arold
ST (2018) Endogenous control mechanisms of
FAK and PYK2 and their relevance to cancer
development. Cancers 10(6):196
4. Sulzmaier FJ, Jean C, Schlaepfer DD (2014)
FAK in cancer: mechanistic findings and clinical applications. Nat Rev Cancer 14(9):
598–610
5. Shen T, Guo Q (2018) Role of Pyk2 in human
cancers. Med Sci Monitor 24:8172–8182
6. Liu S, Chen L, Xu Y (2018) Significance of
PYK2 level as a prognosis predictor in patients
with colon adenocarcinoma after surgical resection. Oncotargets Ther 11:7625–7634
7. Quiroga MN, Dı́az MR, Moreno J, Aguilar
RG, Ibarra ML, Sánchez PP, Cabrero IA,
Gómez GV, Zavaleta LR, Aranda DA, Gómez
FS (2019) Increased expression of FAK isoforms as potential cancer biomarkers in ovarian
cancer. Oncol Lett 17:4779–4786
8. Pan M-R, Wu C-C, Kan J-Y, Li Q-L, Chang
S-J, Wu C-C, Li C-L, Ou-Yang F, Hou M-F,
Yip H-K, Luo C-W (2019) Impact of FAK
expression on the cytotoxic effects of CIK therapy in triple-negative breast cancer. Cancers
12(1):94
9. Golubovskaya VM, Ho B, Zheng M, Magis A,
Ostrov D, Morrison C, Cance WG (2013)
Disruption of focal adhesion kinase and p53
interaction with small molecule compound r2
reactivated p53 and blocked tumor growth.
BMC Cancer 13(1):342
10. Golubovskaya VM, Palma NL, Zheng M,
Ho B, Magis A, Ostrov D, Cance WG (2013)
A small-molecule inhibitor, 5′-O-tritylthymidine, targets FAK and Mdm-2 interaction,
and blocks breast and colon tumorigenesis
in vivo. Anti-Cancer Agent Me 13(4):532–545
11. Ucar DA, Magis AT, He D, Lawrence NJ, Sebti
SM, Kurenova E, Zajac-Kaye M, Zhang J,
Hochwald SN (2013) Inhibiting the interaction of cMET and IGF-1r with FAK effectively
reduces growth of pancreatic cancer cells
in vitro and in vivo. Anti-Cancer Agent Me
13(4):595–602
12. Ucar DA, Kurenova E, Garrett TJ, Cance WG,
Nyberg C, Cox A, Massoll N, Ostrov DA,
Lawrence N, Sebti SM, Zajac-Kaye M, Hochwald SN (2014) Disruption of the protein
interaction between FAK and IGF-1R inhibits
melanoma tumor growth. Cell Cycle 11(17):
3250–3259
13. Lv P-C, Jiang A-Q, Zhang W-M, Zhu H-L
(2018) FAK inhibitors in cancer, a patent
review. Expert Opinion Therapeutic Patents
28(2):139–145. PMID: 29210300
14. Alvarado C, Stahl E, Koessel K, Rivera A,
Cherry BR, Pulavarti SVSRK, Szyperski T,
Cance W, Marlowe T (2019) Development of
a fragment-based screening assay for the focal
adhesion targeting domain using SPR and
NMR. Molecules 24(18):3352
15. Mabonga L, Kappo AP (2020) Peptidomimetics: a synthetic tool for inhibiting protein–protein interactions in cancer. Int J Peptide Res
Therapeutics 26(1):225–241
16. Michael E, Polydorides S, Simonson T, Archontis G (2017) Simple models for nonpolar
solvation: parameterization and testing. J
Comp Chem 38(29):2509–2519
17. Michael E, Polydorides S, Promponas V,
Skourides P, Archontis G (2021) Recognition
of LD motifs by the focal adhesion targeting
domains of focal adhesion kinase and proline-rich tyrosine kinase 2beta: insights from molecular dynamics simulations. Proteins 89(29):
29–52
18. Muñoz V, Serrano L (1995) Elucidating the
folding problem of helical peptides using
empirical parameters. ii. helix macrodipole
effects and rational modification of the helical
content of natural peptides. J Mol Biol 245(3):
275–296
19. Mignon D, Druart K, Michael E, Opuu V,
Polydorides S, Villa F, Gaillard T, Panel N,
Archontis G, Simonson T (2020) Physics-based computational protein design: an
update. J Phys Chem A 2020:10637–10648
20. Simonson T, Gaillard T, Mignon D, am Busch
MS, Lopes A, Amara N, Polydorides S,
Sedano A, Druart K, Archontis G (2013)
Computational protein design: the proteus
software and selected applications. J Comput
Chem 34(28):2472–2484
21. Phillips JC, Hardy DJ, Maia JDC, Stone JE,
Ribeiro JV, Bernardi RC, Buch R, Fiorin G,
Henin J, Jiang W, McGreevy R, Melo MCR,
Radak BK, Skeel RD, Singharoy A, Wang Y,
Roux B, Aksimentiev A, Luthey-Schulten Z,
Kale LV, Schulten K, Chipot C, Tajkhorshid E
(2020) Scalable molecular dynamics on CPU
and GPU architectures with NAMD. J Chem
Phys 153:044130
22. Hayashi I, Vuori K, Liddington RC (2002)
The focal adhesion targeting (FAT) region of
focal adhesion kinase is a four-helix bundle that
binds paxillin. Nat Struct Mol Biol 9(2):
101–106
23. Arold ST, Hoellerer MK, Noble MEM (2002)
The structural basis of localization and signaling by the focal adhesion targeting domain.
Structure 10(3):319–327
24. Lulo J, Yuzawa S, Schlessinger J (2009) Crystal
structures of free and ligand-bound focal adhesion targeting domain of Pyk2. Biochem Bioph
Res Co 383(3):347–352
25. Alam T, Alazmi M, Gao X, Arold ST (2014)
How to find a leucine in a haystack? structure,
ligand recognition and regulation of
leucine–aspartic acid (LD) motifs. Biochem J
460(3):317–329
26. Liu G, Guibao CD, Zheng J (2002) Structural
insight into the mechanisms of targeting and
signaling of focal adhesion kinase. Mol Cell
Biol 22(8):2751–2760
27. Hoellerer MK, Noble MEM, Labesse G,
Campbell ID, Werner JM, Arold ST (2003)
Molecular recognition of paxillin LD motifs
by the focal adhesion targeting domain. Structure 11(10):1207–1217
28. Gao G, Prutzman KC, King ML, Scheswohl
DM, DeRose EF, London RE, Schaller MD,
Campbell SL (2004) NMR solution structure
of the focal adhesion targeting domain of focal
adhesion kinase in complex with a paxillin LD
peptide: evidence for a two-site binding model.
J Biolog Chem 279(9):8441–8451
29. Bertolucci CM, Guibao CD, Zheng J (2005)
Structural features of the focal adhesion kinase-paxillin complex give insight into the dynamics
of focal adhesion assembly. Prot Sci 14(3):
644–652
30. Tuffery P, Etchebest C, Hazout S, Lavery R
(1991) A new approach to the rapid determination of protein side chain conformations. J
Biomol Struct Dyn 8(6):1267–1289
31. Simonson T (2013) What is the dielectric constant of a protein when its backbone is fixed?
JCTC 9:4603–4608
32. Cornell W, Cieplak P, Bayly C, Gould I,
Merz K, Ferguson D, Spellmeyer D, Fox T,
Caldwell J, Kollman P (1995) A second generation force field for the simulation of proteins,
nucleic acids, and organic molecules. J Am
Chem Soc 117:S179–S197
33. Still WC, Tempczyk A, Hawley RC, Hendrickson T (1990) Semianalytical treatment of solvation for molecular mechanics and dynamics. J
Am Chem Soc 112(16):6127–6129
34. Hawkins GD, Cramer CJ, Truhlar DG (1995)
Pairwise solute descreening of solute charges
from a dielectric medium. Chem Phys Lett
246(1–2):122–129
35. Schaefer M, Karplus M (1996) A comprehensive analytical treatment of continuum electrostatics. J Phys Chem 100(5):1578–1599
36. Weeks JD, Chandler D, Andersen HC (1971)
Role of repulsive forces in determining the
equilibrium structure of simple liquids. J
Chem Phys 54(12):5237–5247
37. Aguilar B, Shadrach R, Onufriev AV (2010)
Reducing the secondary structure bias in the
generalized born model via r6 effective radii. J
Chem Theory Comp 6(12):3613–3630
38. Lazaridis T, Karplus M (1999) Effective energy
function for proteins in solution. Proteins
35(2):133–152
39. Archontis G, Simonson T (2005) A residue-pairwise generalized born scheme suitable for
protein design calculations. J Phys Chem B
109(47):22667–22673
40. Villa F, Mignon D, Polydorides S, Simonson T
(2017) Comparing pairwise-additive and
many-body generalized born models for acid/
base calculations and protein design. J Comput
Chem 38(28):2396–2410
41. Kuhlman B, Dantas G, Ireton GC, Varani G,
Stoddard BL, Baker D (2003) Design of a
novel globular protein fold with atomic-level
accuracy. Science 302:1364–1368
42. Ollikainen N, de Jong RM, Kortemme T
(2015) Coupling protein side-chain and backbone flexibility improves the re-design of
protein-ligand specificity. PLOS Comput Biol
11(9):1–22
43. Druart K, Bigot J, Audit E, Simonson T
(2016) A hybrid Monte Carlo scheme for multi-backbone protein design. J Chem Theory
Comp 12:6035–6048
44. Hayes RL, Armacost KA, Vilseck JZ, Brooks
III CL (2017) Adaptive landscape flattening
accelerates sampling of alchemical space in
multisite λ dynamics. J Phys Chem B
121(15):3626–3635
45. Simonson T (2020) PROTEUS 3.0 Manual.
https://proteus.polytechnique.fr
46. Mignon D, Simonson T (2016) Comparing
three stochastic search algorithms for computational protein design: Monte Carlo, replica
exchange Monte Carlo, and a multistart,
steepest-descent heuristic. J Comput Chem
37(19):1781–1793
47. Mignon D, Panel N, Chen X, Fuentes EJ,
Simonson T (2017) Computational design of
the Tiam1 PDZ domain and its ligand binding.
J Chem Theory Comput 13(5):2271–2289
48. Genheden S, Ryde U (2015) The MM/PBSA
and MM/GBSA methods to estimate ligand-binding affinities. Expert Opin Drug Dis
10(5):449–461
49. Jo S, Kim T, Iyer VG, Im W (2008)
CHARMM-GUI: a web-based graphical user
interface for CHARMM. J Comp Chem
29(11):1859–1865
50. Polydorides S, Simonson T (2013) Monte
Carlo simulations of proteins at constant pH
with generalized Born solvent, flexible sidechains, and an effective dielectric boundary. J
Comput Chem 34:2742–2756
51. Seeber M, Cecchini M, Rao F, Settanni G,
Caflisch A (2007) Wordom: a program for efficient analysis of molecular dynamics simulations. Bioinf 23:2625–2627
52. Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science
185(4154):862–864
53. Sneath P (1966) Relations between chemical
structure and biological activity in peptides. J
Theoret Biol 12(2):157–195
Chapter 19
Knowledge-Based Unfolded State Model for Protein Design
Vaitea Opuu, David Mignon, and Thomas Simonson
Abstract
The design of proteins and miniproteins is an important challenge. Designed variants should be stable,
meaning the folded/unfolded free energy difference should be large enough. Thus, the unfolded state plays
a central role. An extended peptide model is often used, where side chains interact with solvent and nearby
backbone, but not each other. The unfolded energy is then a function of sequence composition only and
can be empirically parametrized. If the space of sequences is explored with a Monte Carlo procedure,
protein variants will be sampled according to a well-defined Boltzmann probability distribution. We can
then choose unfolded model parameters to maximize the probability of sampling native-like sequences.
This leads to a well-defined maximum likelihood framework. We present an iterative algorithm that follows
the likelihood gradient. The method is presented in the context of our Proteus software, as a detailed
downloadable tutorial. The unfolded model is combined with a folded model that uses molecular mechanics and a Generalized Born solvent. It was optimized for three PDZ domains and then used to redesign
them. The sequences sampled are native-like and similar to a recent PDZ design study that was experimentally validated.
Key words Monte Carlo, Proteus software, Molecular mechanics, Implicit solvent, Machine learning,
Maximum likelihood, PDZ domain
1 Introduction
Computational protein design (CPD) is an exciting field that has
had many successes. One important application is to design or
redesign whole proteins and miniproteins [1–9]. A new fold was
produced [2], SARS-CoV-2 miniprotein inhibitors were obtained
recently [8], and multiprotein assemblies as large as 43 nm have
been designed [10, 11]. For these applications, the unfolded state
of the protein plays a central role. Indeed, protein variants are
chosen according to their stability, which is the free energy difference between the folded and unfolded states. One possible
unfolded model is an extended peptide structure [12, 13]. Several
chapters in this volume describe the dynamics of extended peptides,
which are complex and expensive to explore thoroughly
Thomas Simonson (ed.), Computational Peptide Science: Methods and Protocols,
Methods in Molecular Biology, vol. 2405, https://doi.org/10.1007/978-1-0716-1855-4_19,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022
Fig. 1 Extended peptide unfolded model: each residue interacts with nearby
backbone and solvent, leading to the picture on the right. The unfolded energy
has become a function of the sequence only
[14–17]. Another route is to posit a statistical distribution of
residue–residue interactions, such as would exist in a fluctuating
moderately compact polymer [18, 19], like a Gaussian chain
[20]. This route gives valuable insights, but has not been applied
to, and may not be accurate enough for, whole protein design.
Rather, for whole protein design, an empirical knowledge-based unfolded model is normally used [21–24]. This model
makes two main assumptions. First, in the unfolded state, the side
chains do not interact with each other, but only with solvent and
nearby backbone. With this assumption, the energy of the unfolded
state depends on the sequence but not on any specific 3D structure,
as depicted in Fig. 1. Second, we assume that a good way to
parametrize the unfolded model is to choose parameters so that
the CPD calculation reproduces the experimental amino acid composition, for example, that of a specific protein family of interest
[22–24]. This can be achieved by an iterative trial-and-error
approach: run a design calculation, compute the resulting amino
acid frequencies, tweak parameters to push the individual frequencies up or down as appropriate, and start over. This idea can be cast
as a likelihood maximization problem, leading to well-defined
machine learning algorithms, such as the one described below.
The model does not require a precise 3D structure for the unfolded
state.
To search the vast space of protein variants with stability as a
guiding principle is a hard problem. Considerable work has gone
into developing effective search algorithms [25–29]. One important approach is to use Monte Carlo (MC) simulations. At each MC
step, a trial mutation R → R′ is introduced at a particular position
(or a set of positions) in the folded protein. The resulting stability
change is computed, by taking energy differences between R and R′
in their folded and unfolded states. This is shown in Fig. 2. We see
that when a trial mutation is introduced for the folded state, we
need to consider the reverse mutation in the unfolded state. This
double mutation is equivalent to a double change in protein conformations: we effectively unfold R and refold R′. The
corresponding energy change is indeed the stability change due to
the mutation. Because of this equivalence between a trial mutation
Fig. 2 A Monte Carlo mutation move. A point mutation is introduced in the folded
protein (above); the reverse mutation is introduced in the unfolded protein
(below). The double mutation is equivalent to unfolding the initial variant R and
refolding the new one R′
and a double conformational change, it is easy to show [24, 30]
that the MC simulation mimics a precise statistical ensemble: a
collection of all protein variants, present at equal concentrations,
distributed between their folded and unfolded states according to
their stabilities. Thus, the MC simulation mimics a large, equimolar, combinatorial library of sequences, such as would be produced
experimentally [31, 32].
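The mutation move can be sketched in a few lines. This is a minimal illustration, not the Proteus implementation: it assumes a symmetric proposal and a standard Metropolis acceptance test, and the function names, the value of kT, and the toy energies are all hypothetical.

```python
import math
import random

KT = 0.6  # thermal energy in kcal/mol, roughly room temperature (assumed value)

def stability_change(e_fold_old, e_fold_new, e_unf_old, e_unf_new):
    """Folding energy change for the trial mutation R -> R'.

    The forward mutation is applied in the folded state and the reverse
    mutation in the unfolded state:
    ddG = [E_f(R') - E_f(R)] - [E_u(R') - E_u(R)].
    """
    return (e_fold_new - e_fold_old) - (e_unf_new - e_unf_old)

def metropolis_accept(ddg, kt=KT, rng=random):
    """Accept the trial mutation with probability min(1, exp(-ddG/kT))."""
    if ddg <= 0.0:
        return True
    return rng.random() < math.exp(-ddg / kt)

# Toy numbers (hypothetical): the mutation costs 1.0 in the folded state,
# but the new type is 1.5 higher in unfolded energy, so the fold is stabilized.
ddg = stability_change(e_fold_old=-10.0, e_fold_new=-9.0,
                       e_unf_old=-2.0, e_unf_new=-0.5)
accepted = metropolis_accept(ddg)
```

With these toy numbers the stability change is negative, so the move is always accepted. A destabilizing move would be accepted only with the corresponding Boltzmann probability.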
When we explore this statistical ensemble with MC, we are
constantly picking variants to unfold and others to refold. If the
MC move probabilities are correctly chosen, a long simulation will
produce a trajectory where the populations of folded protein variants follow a well-defined distribution: a Boltzmann distribution
controlled by the folding free energy [30, 33, 34]. It is this well-defined probability distribution that allows us to formulate the
unfolded state parametrization as a likelihood maximization. This
leads to practical algorithms.
In the next section, “Theory,” we briefly recall the unfolded
model and the maximum likelihood formalism. The model contains
one or two unfolded energy parameters per amino acid type t, say
$E^u_t$. We also outline the computational procedure. In the “Materials” section, we describe the energy function, the MC procedure,
and the experimental sequences used. After that, in the “Methods”
section, we present the procedure as a tutorial, in the context of our
Proteus software [22, 24]. Proteus uses a molecular mechanics
energy function with a Generalized Born (GB) solvent model for
the folded state. However, the methodology and practical steps are
general and could be implemented with other energy functions and
CPD programs, like Rosetta [35, 36] or Osprey [37]. In the
tutorial, we consider three PDZ proteins, and we adjust the $E^u_t$
parameters in an iterative way, during a series of MC simulations.
The goal is that a long MC simulation should give sequences with
the same amino acid composition as a reference set of natural PDZ
proteins. In the “Illustrative Application” section, we briefly characterize the designed sequences, to illustrate the quality of the
model. We focus on the so-called native sequence recovery: to what
extent are the designed sequences similar to natural sequences from
the PDZ family. This is an established benchmark for CPD models.
Finally, the “Notes” section provides some caveats. We end with a
conclusion.
2 Theory
2.1 Extended Peptide Model
For the unfolded state, a widespread model is based on a fully
extended, fully solvated peptide [13, 38]. This model is a useful
first step before introducing the knowledge-based unfolded model.
Indeed, with a fully extended peptide, it is natural to assume that
each amino acid interacts with solvent and nearby backbone, but
not the other amino acids. For a peptide whose sequence is denoted
S, the unfolded state energy then has the form:
$$E^u(S) = \sum_{i \in S} E^u(t_i). \tag{1}$$
The sum is over all amino acids; $t_i$ represents the side chain type of
amino acid i. The type-dependent “unfolded energies” $E^u(t) \equiv E^u_t$
can be computed from the 3D structure of an extended peptide.
This model is important because it expresses the unfolded energy as
a function only of the protein sequence composition. A natural next
step is to use empirical values for the $E^u_t$, instead of ones based on a
peptide structure. This leads to the knowledge-based model,
described next.
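Equation (1) amounts to a sum of per-type lookups. The short sketch below illustrates this and shows that two sequences with the same composition get the same unfolded energy. The energy values and function names are hypothetical placeholders, not fitted Proteus parameters.

```python
from collections import Counter

# Hypothetical per-type unfolded energies in kcal/mol (placeholders only).
E_UNFOLDED = {"ALA": 0.0, "ASP": -1.2, "GLU": -1.0, "LEU": 0.8}

def unfolded_energy(sequence, e_u=E_UNFOLDED):
    """Eq. (1): sum the type-dependent unfolded energy over all residues."""
    return sum(e_u[t] for t in sequence)

def unfolded_energy_from_composition(sequence, e_u=E_UNFOLDED):
    """Same quantity, computed from the amino acid counts alone."""
    counts = Counter(sequence)
    return sum(n * e_u[t] for t, n in counts.items())

# Two toy sequences with identical composition but different order.
seq_a = ["ALA", "ASP", "LEU", "ASP"]
seq_b = ["ASP", "ASP", "ALA", "LEU"]
energy_a = unfolded_energy(seq_a)
energy_b = unfolded_energy(seq_b)
```

Because only the counts matter, the unfolded energy can be updated after a point mutation by subtracting the old type's value and adding the new one.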
2.2 Knowledge-Based Unfolded State Model
For whole protein design, the folding energy is essential and an
extended peptide model is not accurate enough. Instead, the $E^u_t$
can be chosen empirically, to reproduce the amino acid composition of a given set of natural proteins. This was done in most
successful whole protein designs [1–9]. The $E^u_t$ can be thought of
as effective chemical potentials. MC generates an ensemble where
the population of each sequence follows a Boltzmann distribution
[30]. This makes it possible to choose unfolded energies that
maximize the probability of the natural sequences. We recall briefly
the method [23, 24].
Let $\mathcal{S}$ be a set of N natural “target” sequences S, compatible
with one or more folds of interest. The Boltzmann probability
of S is
$$p(S) = \frac{1}{Z} \exp\left( -\beta \Delta G_f(S) \right), \tag{2}$$
where $\Delta G_f(S) = G_f(S) - E^u(S)$ is the folding free energy of S, $G_f(S)$
is the free energy of the folded form, $\beta = 1/kT$ is the inverse
temperature, and Z is a normalizing constant (the partition function). We denote by ℒ the probability of the entire sequence set,
which depends on the $E^u_t$; we refer to ℒ as their likelihood [39]. To
maximize ℒ, its derivatives with respect to the $E^u_t$ should be zero.
After some algebra, we obtain [23]
$$\frac{\partial}{\partial E^u_t} \ln \mathcal{L} = \beta \sum_{S \in \mathcal{S}} \left( n_S(t) - \langle n(t) \rangle \right) = \beta \left( N(t) - N \langle n(t) \rangle \right), \tag{3}$$
where $n_S(t)$ is the number of amino acids of type t in sequence S, $\langle n(t) \rangle$ is the mean number
per sequence sampled during the simulations, and N(t) is the number in the
whole dataset $\mathcal{S}$ [23]. Therefore,
$$\mathcal{L} \ \text{maximum} \ \Rightarrow \ \frac{N(t)}{N} = \langle n(t) \rangle, \quad \forall t \in \mathrm{aa}. \tag{4}$$
Thus, to maximize ℒ, we choose $\{E^u_t\}$ such that a long simulation
gives the same amino acid frequencies as the target database. Second derivatives also have a simple expression:
$$\frac{\partial^2 \ln \mathcal{L}}{\partial E^u_t \, \partial E^u_w} = -N \beta^2 \left( \langle n(t) n(w) \rangle - \langle n(t) \rangle \langle n(w) \rangle \right). \tag{5}$$
With the first and second derivatives in hand, various gradient
search methods can be used.
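The algebra behind the first derivative can be made explicit. The sketch below assumes that the target sequences are treated as independent, so that the likelihood is the product of their Boltzmann probabilities, and that the folded free energy does not depend on the unfolded parameters. The primed index S′, which runs over all sequences accessible to the simulation, is notation introduced here for illustration.

```latex
\begin{align*}
\ln \mathcal{L} &= \sum_{S \in \mathcal{S}} \ln p(S)
  = -\beta \sum_{S \in \mathcal{S}} \left[ G_f(S) - E^u(S) \right] - N \ln Z, \\
Z &= \sum_{S'} \exp\left( -\beta \left[ G_f(S') - E^u(S') \right] \right), \\
\frac{\partial E^u(S)}{\partial E^u_t} &= n_S(t)
  \quad \text{(from Eq. 1)}, \\
\frac{\partial \ln Z}{\partial E^u_t}
  &= \frac{1}{Z} \sum_{S'} \beta \, n_{S'}(t) \exp\left( -\beta \Delta G_f(S') \right)
  = \beta \langle n(t) \rangle, \\
\frac{\partial \ln \mathcal{L}}{\partial E^u_t}
  &= \beta \sum_{S \in \mathcal{S}} n_S(t) - N \beta \langle n(t) \rangle
  = \beta \left( N(t) - N \langle n(t) \rangle \right).
\end{align*}
```

Differentiating the Boltzmann average of n(t) once more with respect to a second parameter brings down another factor of β and produces the covariance of the two populations, which is the origin of the second derivative expression.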
2.3 Gradient Search Method
We use an iterative method to approach the maximum likelihood
$\{E^u_t\}$ values. At iteration n, let $\{E^u_t(n)\}$ be the current parameter
guess. We begin by running a simulation with these parameters. We
then update the parameters by moving along the gradient of ℒ,
using the update rule [39]:
$$E^u_t(n+1) = E^u_t(n) + \alpha \frac{\partial}{\partial E^u_t} \ln \mathcal{L} = E^u_t(n) + \delta E \left( n_t^{\mathrm{exp}} - \langle n(t) \rangle_n \right). \tag{6}$$
Here, α is a constant, $n_t^{\mathrm{exp}} = N(t)/N$ is the mean population of
amino acid type t in the target database, $\langle \cdot \rangle_n$ indicates an average
over a simulation done with the current unfolded energies $\{E^u_t(n)\}$,
and δE is an empirical constant with the dimension of an energy,
referred to as the update amplitude. This update procedure is
repeated until convergence.
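The update rule can be tried out on a toy problem. Comparing the gradient expression with the update rule shows that the update amplitude corresponds to δE = αβN. In the sketch below, the MC simulation is replaced by an exactly solvable stand-in: positions are assumed independent, so the mean populations follow from a closed-form Boltzmann average. This is an assumption made purely for illustration; the per-type folded energies, the targets, kT, and δE are all hypothetical values.

```python
import math

KT = 0.6        # thermal energy in kcal/mol (assumed value)
LENGTH = 10     # number of mutating positions in the toy model
DELTA_E = 0.05  # update amplitude in kcal/mol (assumed value)

# Hypothetical per-type folded free energies for the independent-site toy model.
G_FOLDED = {"A": 0.0, "D": 0.5, "L": -0.7}
# Hypothetical target mean populations per sequence (they sum to LENGTH).
N_EXP = {"A": 5.0, "D": 3.0, "L": 2.0}

def mean_populations(e_u):
    """Stand-in for an MC run: exact <n(t)> when positions are independent."""
    weights = {t: math.exp(-(G_FOLDED[t] - e_u[t]) / KT) for t in e_u}
    z = sum(weights.values())
    return {t: LENGTH * w / z for t, w in weights.items()}

def optimize(n_iter=300):
    """Repeat: simulate with the current parameters, then apply the update rule."""
    e_u = {t: 0.0 for t in G_FOLDED}
    for _ in range(n_iter):
        n_sim = mean_populations(e_u)
        for t in e_u:
            # Raise E_u(t) if type t is undersampled, lower it if oversampled.
            e_u[t] += DELTA_E * (N_EXP[t] - n_sim[t])
    return e_u

e_u_fit = optimize()
n_fit = mean_populations(e_u_fit)
```

In a real calculation the averages carry statistical noise, so the update amplitude and the simulation length both affect how smoothly the parameters converge.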
3 Materials
3.1 Energy Function for the Folded State
The energy function for the folded state has the form
$$E^f = E_{\mathrm{intra}} + E_{\mathrm{GB}} + E_{\mathrm{nonpolar}}. \tag{7}$$
The first term is the protein internal energy. In our work, it is taken
from the Amber ff99SB force field [40]. The other two are solvent
contributions. The Generalized Born (GB) term $E_{\mathrm{GB}}$ captures the
main electrostatic effects [41], while $E_{\mathrm{nonpolar}}$ represents dispersion
and hydrophobic effects through a Lazaridis–Karplus
(LK) term [42].
The GB term involves atomic solvation radii $b_i$ that approximate the distance from atom i to the protein surface and depend on
all coordinates. With a “Native Environment Approximation”
(NEA) [43, 44], each $b_i$ is computed ahead of time, with the rest
of the system in its native sequence and conformation. This
removes the many-body character of the GB solvent and leads to
a pairwise-additive energy. When a pairwise-additive energy is combined with a fixed backbone and a discrete rotamer library, residue
interaction energies can be computed ahead of time and stored in a
lookup table or “energy matrix” [1]. We also developed an exact
method where the $b_i$ are computed on the fly, during MC, yet an
energy matrix can still be used, with the GB energies represented by
a lookup table of lookup tables [44, 45]. This method is referred to
as the “Fluctuating Dielectric Boundary” method or FDB.
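To illustrate why a pairwise-additive energy allows precalculation, the sketch below evaluates a folded-state energy purely by table lookups. The dictionary layout, the state labels, and the numbers are hypothetical and do not reflect the Proteus file formats.

```python
from itertools import combinations

# Hypothetical toy energy matrix: two positions, two (type, rotamer) states each.
# Diagonal terms: one energy per (position, state).
DIAGONAL = {
    (0, ("ALA", 0)): -1.0,
    (0, ("LEU", 1)): -0.5,
    (1, ("ASP", 0)): -2.0,
    (1, ("GLU", 2)): -1.5,
}
# Off-diagonal terms: one energy per pair of (position, state) entries,
# keyed with the lower position first.
PAIRWISE = {
    ((0, ("ALA", 0)), (1, ("ASP", 0))): 0.3,
    ((0, ("ALA", 0)), (1, ("GLU", 2))): 0.1,
    ((0, ("LEU", 1)), (1, ("ASP", 0))): -0.2,
    ((0, ("LEU", 1)), (1, ("GLU", 2))): 0.4,
}

def folded_energy(states):
    """Sum diagonal and pairwise lookups for one sequence/rotamer assignment.

    states maps each position to its (type, rotamer) state. No 3D calculation
    is needed once the tables have been filled in.
    """
    keys = sorted(states.items())  # sorted by position
    energy = sum(DIAGONAL[k] for k in keys)
    energy += sum(PAIRWISE[(a, b)] for a, b in combinations(keys, 2))
    return energy

energy_ala_asp = folded_energy({0: ("ALA", 0), 1: ("ASP", 0)})
energy_leu_glu = folded_energy({0: ("LEU", 1), 1: ("GLU", 2)})
```

After a move at a single position, only the diagonal term and the pair terms involving that position change, which is what makes each MC step cheap.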
3.2 Monte Carlo Exploration
MC simulations with Proteus use moves where either rotamers,
amino acid types, or both are changed at one or two positions.
Mutating positions are user-defined and depend on the problem.
Sampling can be enhanced by Replica Exchange Monte Carlo
(REMC), where several MC simulations are run in parallel at different temperatures [30]. With a precalculated energy matrix, one
billion REMC steps can be run in a few hours on a desktop
machine.
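As an illustration of the replica exchange idea, the sketch below computes the standard parallel tempering swap probability between two walkers. It is the textbook criterion, given here under the assumption that each walker samples a Boltzmann distribution at its own temperature; it is not taken from the Proteus source, and the function name and numbers are hypothetical.

```python
import math

def swap_probability(energy_i, energy_j, kt_i, kt_j):
    """Probability of exchanging configurations between two walkers.

    energy_i and energy_j are the current energies of the walkers running at
    thermal energies kt_i and kt_j. The acceptance probability is
    min(1, exp[(beta_i - beta_j) * (E_i - E_j)]).
    """
    delta = (1.0 / kt_i - 1.0 / kt_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)

# The cold walker (kT = 0.6) holds the higher energy: the swap is always accepted.
p_favorable = swap_probability(-5.0, -8.0, kt_i=0.6, kt_j=1.2)
# The cold walker already holds the lower energy: the swap is accepted only sometimes.
p_unfavorable = swap_probability(-8.0, -5.0, kt_i=0.6, kt_j=1.2)
```

Swaps of this kind let low-energy sequences found at high temperature migrate to the low-temperature walker, where the ensemble of interest is sampled.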
3.3 Experimental Sequences and Alignment
The tutorial below will consider three PDZ domains: 1G9O, 1R6J,
and 2BYG. To define the target amino acid frequencies, we collected homologous sequences for each domain by searching the
non-redundant (NR) database with NCBI/Blast [46], the Blosum62 scoring matrix, and the PDB sequence as query. We retained
homologs with sequence identities vs. the query above 60%. We
used the HMMER algorithm [47] and the Superfamily tool [48] to
identify and eliminate any Blast hits that did not belong to the same
protein family as the query, leaving a total of 199 homologues.
Finally, we aligned each query and its homologues with Clustal
Omega [49].
Experimental amino acid frequencies were averaged over the
alignment, with separate averaging for buried and exposed positions. Burial was determined by the fractional burial observed for
each position in the 3D structures of the test proteins.
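The averaging can be sketched as follows. This is a minimal illustration that assumes a simple buried/exposed flag per alignment column and ignores gaps; the function name, the toy alignment, and the burial flags are hypothetical, and the actual tutorial scripts may proceed differently.

```python
from collections import Counter

def class_frequencies(alignment, buried_mask):
    """Return (buried, exposed) amino acid frequencies from an alignment.

    alignment: list of equal-length aligned sequences, with '-' for gaps.
    buried_mask: one boolean per alignment column (True = buried position).
    """
    counts = {True: Counter(), False: Counter()}
    for seq in alignment:
        for residue, buried in zip(seq, buried_mask):
            if residue != "-":
                counts[buried][residue] += 1
    freqs = []
    for key in (True, False):
        total = sum(counts[key].values())
        freqs.append({aa: n / total for aa, n in counts[key].items()} if total else {})
    return tuple(freqs)

# Toy alignment (hypothetical): the first column is buried, the others exposed.
toy_alignment = ["LKDE", "LRD-", "VKEE"]
toy_buried = [True, False, False, False]
buried_freq, exposed_freq = class_frequencies(toy_alignment, toy_buried)
```

Multiplying each frequency by the number of buried or exposed mutating positions would give target mean populations per sequence, the quantity that enters the update rule.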
3.4 Amino Acid Group Constraints
To improve convergence, we usually apply constraints during the
early iterations [23]. The amino acid types are grouped into classes,
whose parameters are linked. Thus, Asp and Glu could form a class.
The difference between their $E^u_t$ values is computed from molecular
mechanics (the energy function in Eq. 7) and kept fixed. During this
early phase, the Asp and Glu values are thus interdependent and
chosen to reproduce the total Asp+Glu frequency in the target
dataset. Later, the class constraints are released, and the parameters
for the individual types (Asp and Glu) evolve independently. A
detailed optimization schedule is shown further on.
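One way to picture a class-constrained update is sketched below. It assumes that all members of a class are shifted by the same amount, driven by the summed class frequency, so that differences within a class stay fixed. The function name, the class definitions, and the numbers are hypothetical, and the actual Proteus scripts may implement the constraint differently.

```python
def class_constrained_update(e_u, classes, n_exp, n_sim, delta_e):
    """Apply one update step with linked parameters inside each class.

    e_u: current unfolded energies per amino acid type.
    classes: list of tuples of types whose parameters move together.
    n_exp, n_sim: target and simulated populations per type.
    """
    updated = dict(e_u)
    for members in classes:
        target = sum(n_exp[t] for t in members)
        sampled = sum(n_sim[t] for t in members)
        shift = delta_e * (target - sampled)
        for t in members:
            # Same shift for every member: within-class differences are preserved.
            updated[t] = e_u[t] + shift
    return updated

# Toy values (hypothetical): Asp and Glu form a class, Leu is on its own.
e_u = {"ASP": -1.2, "GLU": -1.0, "LEU": 0.8}
classes = [("ASP", "GLU"), ("LEU",)]
n_exp = {"ASP": 0.06, "GLU": 0.08, "LEU": 0.10}
n_sim = {"ASP": 0.02, "GLU": 0.05, "LEU": 0.12}
new_e_u = class_constrained_update(e_u, classes, n_exp, n_sim, delta_e=0.5)
```

Releasing the constraints later corresponds to passing one single-member class per amino acid type, which recovers the unconstrained update.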
3.5 Proteus Software Files and Documentation
Proteus 3.0 is freely available from https://proteus.polytechnique.fr to academic and government scientists. Scientists
from companies should contact the corresponding author. The
distribution includes source code, binaries for Intel processors,
extensive test cases, and detailed documentation. Files to run the
tutorial below are included.
4 Methods: A Proteus Tutorial
4.1 Overview
The present tutorial documents the procedure to parametrize the
unfolded state model in the context of CPD. It uses a set of three
PDZ domains: NHERF, Syntenin, and DLG2, referred to by their
PDB codes: 1G9O, 1R6J, and 2BYG. We assume these systems
have already been set up for Proteus. In particular, the 3D structural models are in place, and the three energy matrices have been
computed, following a procedure documented elsewhere
[24, 50]. The structural models (setup.pdb files) include information on the solvent accessibility of the protein residues. In addition,
a sequence alignment has been constructed, by searching a
sequence database using the three domains as queries [24]. During
the tutorial, we will
• compute the target frequencies using the proteins from the sequence alignment as a reference,
• compute initial guesses for the unfolded energy parameters,
• optimize the parameters iteratively, using Proteus.
The main data files and scripts are listed in Table 1. The top
directory is referred to as TOP.
The main computational steps (controlled by iteration.sh) will
typically be run in parallel on a small collection of laboratory
computers running Unix. One of them will serve as a “master”
node, hosting all the input data and software and coordinating the
calculations. At each iteration, for each test protein, one or more
MC simulations are run with the current parameter guess. In each
simulation, all or part of the protein residues are allowed to mutate.
In the tutorial, two simulations are done per protein, each with half
of the residues mutating (every other residue). Relevant information is provided by the user in a file TOP/machine.info on the
master node (see also Note 1):
Table 1 Main tutorial files
The test proteins: 1G9O, 1R6J, 2BYG
TOP/1G9O/
    matrix.bb      Diagonal elements of the energy matrix
    matrix.pw      Off-diagonal elements of the energy matrix
    setup.pdb      3D structure, with explicit residue burial information
1R6J/, 2BYG/       idem
TOP/lib/
    all_seq.aln    Sequence alignment of natural homologs, Clustal format
Scripts and environment variables
TOP/
project.info
Download