Homology modelling of proteins

Definition: Prediction of the three-dimensional structure of a target protein from its amino acid sequence, using a homologous (template) protein for which an X-ray or NMR structure is available.

Synonyms: Comparative modelling and knowledge-based modelling.

Protein Structure Modelling

Three approaches to structure prediction:
a. Ab initio prediction (no known homology with any sequence of known structure). Given only the sequence, predict the 3D structure from "first principles", based on energetic or statistical principles.
b. Sequence-structure threading. Given the sequence and a set of folds observed in the PDB, see whether the sequence could adopt one of the known folds.
c. Homology modelling. Given a sequence with homology (> 25% sequence identity) to a known structure in the PDB, use the known structure as a template to create a 3D model from the sequence.

Various ways of homology modelling
- One structure as the main template (the approach I will illustrate here).
- Fragment-based modelling: a protein structure can be built from a combination of segments from other proteins. The program Composer depends on the assembly of rigid fragments.

Ab initio modelling

There are two components to ab initio prediction:
- devising a scoring (i.e. energy) function that can distinguish correct structures from incorrect ones;
- a search method to explore the conformational space.
In many methods the two components are coupled, such that a search function drives, and is driven by, the scoring function to find native-like structures.

BUT it is difficult to formulate an adequate scoring function, and solving the problem requires formidable computational effort, BECAUSE a fully descriptive energy function must consider interactions between all pairs of atoms in the polypeptide chain. The number of such pairs grows with the square of the number of atoms, and the conformational space to be searched grows exponentially with the number of amino acids in the protein.
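To give a feel for the size of the pairwise problem, here is a minimal sketch that counts atom pairs. For n atoms there are n(n-1)/2 distinct pairs. The figure of 8 heavy atoms per residue is a rough illustrative assumption, not a value from these notes, and the function name is hypothetical.

```python
def atom_pairs(n_atoms):
    """Number of distinct atom pairs among n_atoms atoms: n(n-1)/2."""
    return n_atoms * (n_atoms - 1) // 2


# Assumed average number of heavy atoms per residue (illustrative only).
ATOMS_PER_RESIDUE = 8

if __name__ == "__main__":
    for n_residues in (50, 100, 200, 400):
        n_atoms = n_residues * ATOMS_PER_RESIDUE
        print(n_residues, "residues:", n_atoms, "atoms,",
              atom_pairs(n_atoms), "atom pairs")
```

Doubling the chain length roughly quadruples the number of pairs that must be evaluated for every conformation tried, which is one reason why practical energy functions use cut-offs and other simplifications.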
A full model must also take into account the vitally important interactions between the protein's atoms and the environment, which give rise to the so-called "hydrophobic effect". For practical reasons, simplifying assumptions must be made. (You can predict a structure using ab initio techniques at http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.html)

Why does homology (comparative/knowledge-based) modelling work?

Proteins have a limited number of folds: the structure of a new protein can resemble a known fold even with no apparent sequence similarity.

Why a model?

- A model is desirable when neither X-ray crystallography nor NMR can determine the structure of a protein, in time or at all.
- Many structure-function relationships can be deduced from a reasonable model. Indeed, sometimes a modelled structure can be used for successful drug design.
- The 3D structure of a protein can tell us much more about how individual residues interact to form a functional entity. For example, residues that are far apart in the 1D sequence can be very close together in the folded protein.
- Models can be quite accurate: they form a rational basis for explaining experimental observations and help in redesigning proteins to improve their function.
- Models can be used as starting points in the determination of protein structure by NMR or X-ray crystallography.

Post-genomics: structural genomics

The potential benefits of having a structural model have led to the concept that the structures of all gene products should either be solved experimentally or modelled. To model so many proteins, the technique of producing accurate alignments and building three-dimensional models from the alignments has to be fully automated. Sanchez and Šali have automatically modelled a large fraction of the yeast genome using their program MODELLER (see later section). But the process has been limited to those ORF (Open Reading Frame) sequences from yeast that had relatively high homology to a three-dimensional template structure.
Automating model building at lower sequence homology is a step that still needs to be addressed, and considerable effort is being put into this type of research.

History

The first homology modelling studies were done using wire and plastic models of bonds and atoms as early as the 1960s. The models were constructed by taking the coordinates of a known protein structure and modifying them by hand for those amino acids that did not match the structure. In 1969 David Phillips, Browne and co-workers published the first paper on homology modelling. They modelled α-lactalbumin based on the structure of hen egg-white lysozyme. The sequence identity between these two proteins was 39%. In addition, both proteins contained an identical pattern of cysteines, suggesting a similar arrangement of disulphide bonds. When the structure of α-lactalbumin was solved by X-ray crystallography, it was compared to the model and analysed. The model was essentially correct apart from the C-terminal ends, which diverge in the structure in any case.

Method

The figure below illustrates the major steps of obtaining a structure from a sequence.

[Figure: flow chart]
Protein sequence
-> Database searches
-> Sequence alignment and secondary structure prediction
-> Good structure homologue?
   No: improve the alignment using secondary structure prediction
   Yes: homology modelling
-> Minimisation
-> Check model
-> Three-dimensional structure

Steps in molecular modelling:
1. Identification of structures that will form the template for the target structure (model).
2. Alignment: the most important step. Alignment of low-homology sequences can be improved using secondary structure prediction (align-model-realign-remodel).
3. Transfer of the coordinates of structurally conserved regions (SCRs) from the template(s) to the target:
   - many-fragment method
   - single structure.
4. Modelling variable regions (loops):
   - Insertions: search of a high-resolution fragment database.
   - Deletions: local minimisation is often sufficient.
5.
Modelling side chains (practically a virtual step).
6. Minimisation:
   - Local: especially loop-hinge regions.
   - Global.
7. Molecular dynamics: to study regional flexibility.
8. Checking the correctness of the model.
   Correctness of the overall fold:
   - Bad: non-polar side chains exposed to the solvent.
   - Bad: buried ionisable groups.
   - Conformational energy calculations: incorrect folds have high solvation energy.
   - Luthy's method.
   Stereochemical properties (PROCHECK):
   - Bond angles
   - Bond lengths

Modelling using the restraint-based method
- Distance restraints (Havel & Snow 1991)
- Structural feature restraints (Sali & Blundell 1993)

Modelling of Loops

For a 5-residue insertion, the database is searched for 9-residue fragments: the 5 inserted residues plus 2 anchor-point residues on each side. The chosen fragment is then annealed onto the anchor points.

Modelling of Side Chains

Side chains adopt distinct conformations that depend on the backbone structure. This observation gave rise to the ROTAMER libraries that are used in modelling procedures.
- Identical residue: the same side-chain conformer is taken from the template.
- Partial similarity: most of the side chain is built on the template.
- Substitution: the side chain is built from the rotamer library and energetics.

Minimisation
- LOCAL: Minimise a fragment, usually a loop and its anchor regions, as these often have bad geometries. First minimise without the influence of the surrounding structure, then take the surrounding structure into account.
- GLOBAL: Minimise the whole protein (and H2O), mainly to relieve short contacts and to rectify bad geometry such as bond angles, peptide planarity etc.
- DYNAMICS: Often local, to study the movement of a particular loop and/or improve its geometry.
Problems with minimisation are local minima (the "egg box") and approximations.

[Figure: the local-minima problem of minimisation; axis: Energy]

Accuracy

Generally the accuracy of a model depends on the initial sequence alignment and on the percentage identity between target and template. Most errors occur in the loop or variable regions of the model.
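The local-minima ("egg box") problem noted under Minimisation can be illustrated with a toy example. The sketch below is not a molecular force field: the one-dimensional energy function, the step size and the starting points are all hypothetical choices made only to show that a plain downhill minimiser stops in whichever well it starts in.

```python
import math


def energy(x):
    # Hypothetical 1D "egg box" energy: a broad bowl with ripples on top.
    return 0.05 * x * x + math.cos(3.0 * x)


def gradient(x):
    # Analytical derivative of energy(x).
    return 0.1 * x - 3.0 * math.sin(3.0 * x)


def steepest_descent(x, step=0.01, iterations=5000):
    # Repeatedly move a small step downhill; no mechanism to climb barriers.
    for _ in range(iterations):
        x -= step * gradient(x)
    return x


if __name__ == "__main__":
    for start in (1.0, 5.0):
        x_min = steepest_descent(start)
        print("start", start, "-> x =", round(x_min, 3),
              "energy =", round(energy(x_min), 3))
```

Both runs converge, but the run started far from the deepest well ends in a higher-energy minimum. This is why a poor starting geometry cannot be rescued by minimisation alone, and why dynamics or annealing are used to cross barriers.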
Check structural integrity of model
- Check the correctness of the overall fold: look at the distribution of polar (charged) and hydrophobic residues on the surface and inside the protein. Buried charges must have an interaction partner.
- Detect local errors.
- Check stereochemical parameters such as bond lengths, bond angles and short contacts (Ramachandran plot, PROCHECK).

Automatic modelling
- Swiss-Model: free; Web and local. http://www.expasy.ch/swissmod/
- ESyPred3D: free; Web. http://www.fundp.ac.be/urbm/bioinfo/esypred/
- WhatIf: commercial ($$); local.
- Modeller: Unix machines; quite difficult to learn.

How does Swiss-Model work? An introduction

For a complete reference, look at the web site documentation.

Step 1: Swiss-Model first does a database search for homologous proteins. It then superposes all the structures it finds.

Step 2: It generates a multiple alignment of the sequence to be modelled with all the homologous structures.

Step 3: It generates a 3D framework for the target protein sequence. Atoms that occupy a similar spatial position and are aligned to the target sequence are used to compute the averaged atomic positions of the framework from which the target will be built. Side chains with incorrect geometries are removed.

Step 4: Building of insertions or loops. Swiss-Model uses two techniques. The first is the fragment database search described earlier. The second works from first principles, in other words it searches conformational space to build loops, using:
- 7 allowed combinations of φ/ψ backbone angles;
- adequate space allocation for the loop;
- space allocation for each α-carbon.
Both methods exclude loops in conflict with the structure.

Step 5: Side-chain building. This also uses a library of allowed side-chain rotamers. First the distorted but otherwise complete side chains are corrected. Then the incomplete side chains are built with a probabilistic approach using the rotamers.
A van der Waals exclusion test and dihedral angle constraints can be used to select the "best" side-chain conformation.

Step 6: Minimisation.

Step 7: The correctness of the structure is checked by analysing the conformational space of each residue energetically. It is also checked by comparing the packing density of the model with what is expected.

Automatic v Manual

- Is our target protein homologous enough for an automatic procedure? First a template has to be found in a sequence search. Otherwise you can use PDB-viewer with your own sequence and structure.
- Even with some manual input, will we get a good enough structure? Sometimes; at other times exhaustive manual modelling is needed.
- We need to decide what the model is going to be used for. Is it to look at, e.g., mutations, or do we want to dock ligands?
- Do we really want to use an averaged template structure? A single divergent structure can distort an averaged template.

Note: Modelling is not the end of the "experiment"; it is the means for further theoretical studies.
- It gives us a 3D representation of a sequence alignment with the gaps filled in.
- It can be used further in structure-based ligand design if the model is accurate enough.
- It can suggest residues to mutate, and these mutations can be studied further both theoretically and biochemically.
- It can be used to understand the function of the protein better.

Further Reading & References

General: Protein Structure Prediction: A Practical Approach. Ed. Michael J. E. Sternberg. IRL Press, 1996. ISBN 0-19-963496-3.

Browne, W.J. et al. 1969. J. Mol. Biol. 42, 65.
Greer, J. 1981. J. Mol. Biol. 153, 1027.
Havel, T.F. & Snow, M.E. 1991. J. Mol. Biol. 217, 1.
Sali, A. & Blundell, T.L. 1993. J. Mol. Biol. 234, 779.
Finan, P., Koga, H., Zvelebil, M.J., Waterfield, M.D. & Kellie, S. 1996. J. Mol. Biol. 261, 173.