Practical: Comparative modelling of Gdh using MODELLER

Coordinator: Muhammed Sayed
June 2004

Initial requirements:

Before you start running MODELLER, you will need one or more template structures (PDB format) and a sequence alignment of the target against the template sequence. The alignment should be in PIR format.

An example of a PIR format alignment:

>P1;target
sequence:target:@:@: 76 :@:target: :-1.00:-1.00
MIVFVRFNSSHGFPVEVDSDTSIFQLKEVVAKRQGVPADQLRVIFAGKELRNDW
TVQNCDLDQQSIVHIVQRPWRK-*
>P1;1aar
structureX:1aar:@:@: 76 :@:1aar : :-1.00:-1.00
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLS
DYNIQKESTLHLVLR-LRGG*

A PIR format file is easily confused with other formats. It has the following features:

1. Each new protein begins with TWO header lines.
2. The first header line begins with >P1; and is immediately followed by a code, normally a four letter code, that allows programs to find the corresponding PDB files.
3. The second header line begins with a record type, either "structure" or "sequence"; it is the last header line.
4. Next comes your sequence, in one letter codes, with gaps indicated by dashes.
5. The sequence ends with an asterisk (*).

Part 1: Search for homologues (templates) and obtain structure-based alignment of target against template sequences

Introduction and background:

You have a sequence and you want to do homology modelling. The first thing is to find some 3-D structures which have a similar sequence (a sequence identity of 30% or above will do, but a lower identity might be OK if there are strongly conserved features such as a conserved cysteine pattern). Homologue structures and a structure-based alignment can be obtained in a variety of ways, but for the purposes of this practical we'll be using the FUGUE server.

NB: You will need to align the target sequence so that no gaps or insertions fall within conserved protein secondary structure (helices and strands), i.e. you want your gaps to be placed in the loop regions.

What does FUGUE do?
FUGUE is a program for recognizing distant homologues by sequence-structure comparison. It utilizes environment-specific substitution tables and structure-dependent gap penalties, where scores for amino acid matching and insertions/deletions are evaluated depending on the local environment of each amino acid residue in a known structure. Given a query sequence (or a sequence alignment), FUGUE scans a database of structural profiles, calculates the sequence-structure compatibility scores and produces a list of potential homologues and alignments.

Exercise:

1. Submit the target sequence (gdh) to the FUGUE server at http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
2. Analyze the FUGUE output and identify potential templates for comparative modelling. Hits with Z-scores above 6.0 are normally highly significant.
3. Have a look at the "Joy" alignment produced by FUGUE for each 'homologue'. Joy displays 3D structural information in a sequence alignment and helps one understand the conservation of amino acids in their specific local environments.
4. Copy the alignment (target against the best template structure) from the PIR format option in FUGUE and save it as gdh.ali. For now, we'll use only one template, the one with the highest Z-score.

Part 2: Prepare input files for MODELLER

Create a new directory where MODELLER will be run and download or create the following files:

1) Protein Data Bank atom files for templates - code.atm

Coordinates for the template structures. The choice of templates is based on the FUGUE results. Each atom file is named code.atm, where code is a short protein code, preferably the PDB code. The code must be used as that protein's identifier throughout the modelling.

2) Alignment file - gdh.ali

MODELLER needs an alignment file in PIR format, just like the gdh.ali created earlier from the FUGUE alignment. However, it needs additional information, which it gets from the comment line. Edit gdh.ali as necessary so that it complies with the PIR format in the example below.

>P1;gdh
sequence:gdh:@:@: @ :@:gdh: :-1.00:-1.00
MIVFVRFNSSHGFPVEVDSDTSIFQLKEVVAKRQGVPADQLRVIFAGKELRNDW
TVQNCDLDQQSIVHIVQRPWRK-*
>P1;1aar
structureX:1aar:@:@: 76 :@:1aar : :-1.00:-1.00
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLS
DYNIQKESTLHLVLR-LRGG*

The first line has the four letter name of the PDB file after the semicolon. The second line contains the PDB name, the number of amino acids, and the name of the protein. Enter the data in the same format as above.

3) Script file - gdh.top

The script file contains commands for MODELLER, written in the TOP language. MODELLER will take the alignment (gdh.ali) that you give it and model your sequence against that alignment. All the parameters for this run are set up in a file ending with ".top". A sample script file called gdh.top is given below. Cut and paste this script into a file called gdh.top and edit as necessary.

# PRIMER: STEP 5
#
# This script should produce two models, gdh.B999901 and gdh.B999902.
#

INCLUDE                             # Include the predefined TOP routines

SET ALNFILE  = 'gdh.ali'            # alignment filename
SET KNOWNS   = 'Xxxx'               # codes of the templates, normally pdb code
SET SEQUENCE = 'gdh'                # code of the target
SET ATOM_FILES_DIRECTORY = '.'      # directories for input atom files
SET STARTING_MODEL= 1               # index of the first model
SET ENDING_MODEL  = 2               # index of the last model
                                    # (determines how many models to calculate)
SET DEVIATION     = 4.0             # has to be > 0 if more than 1 model

CALL ROUTINE = 'model'              # do homology modelling

Set "ALNFILE" to your alignment file, in this case gdh.ali.

Set "KNOWNS" to the list of four letter codes of the PDB files that you are modelling your sequence against (with only a space between each).

Set "SEQUENCE" to the 3 to 4 letter name that you want the sequence you're modelling to be called (gdh, for example).

"STARTING_MODEL" and "ENDING_MODEL" designate how many minimization runs are done. In the above example two runs will be performed.
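
Before moving on to Part 3, it can help to sanity-check gdh.ali against the PIR features described earlier. The following is a minimal sketch in Python, not part of MODELLER: the function names `parse_pir` and `check_alignment` and the toy alignment `TOY` are hypothetical, made up for illustration. It checks that each entry starts with a `>P1;` line, that the second header line begins with "sequence" or "structure", and that the sequence ends with `*`. It also checks that all aligned sequences (gaps included) have the same length, which is an extra check assumed here because any pairwise alignment requires it.

```python
def parse_pir(text):
    """Parse PIR-format text into a list of (code, header, aligned_sequence) tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = []
    i = 0
    while i < len(lines):
        # First header line: '>P1;' immediately followed by the protein code.
        if not lines[i].startswith(">P1;"):
            raise ValueError(f"expected a '>P1;' header line, got: {lines[i]!r}")
        code = lines[i][4:].strip()
        if i + 1 >= len(lines):
            raise ValueError(f"entry {code!r} has no second header line")
        # Second header line: must begin with 'sequence' or 'structure...'.
        header = lines[i + 1]
        if not header.startswith(("sequence", "structure")):
            raise ValueError(f"bad second header line for {code!r}: {header!r}")
        i += 2
        # Collect sequence lines until the next entry starts.
        chunks = []
        while i < len(lines) and not lines[i].startswith(">P1;"):
            chunks.append(lines[i])
            i += 1
        seq = "".join(chunks).replace(" ", "")
        if not seq.endswith("*"):
            raise ValueError(f"sequence for {code!r} does not end with '*'")
        entries.append((code, header, seq[:-1]))  # drop the terminating '*'
    return entries


def check_alignment(entries):
    """Return {code: aligned length}; raise ValueError if the lengths differ."""
    lengths = {code: len(seq) for code, _, seq in entries}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"aligned lengths differ: {lengths}")
    return lengths


# Toy alignment for illustration only (hypothetical codes and sequences).
TOY = """\
>P1;tgt
sequence:tgt:@:@: @ :@:tgt: :-1.00:-1.00
MIVF-VRF*
>P1;tmpl
structureX:tmpl:@:@: 8 :@:tmpl: :-1.00:-1.00
MQIFVKT-*
"""

if __name__ == "__main__":
    print(check_alignment(parse_pir(TOY)))
```

To check your own file, pass the contents of gdh.ali to `parse_pir` instead of `TOY`, for example `parse_pir(open("gdh.ali").read())`. A `ValueError` points at the entry that needs editing.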

Part 3: Run MODELLER and evaluate quality of final models

Exercise:

1) Run MODELLER and examine the log file for the best model

Now you're ready to run MODELLER. At the prompt, type "mod gdh.top", where gdh is the name that you gave your top file in Part 2. If run correctly, MODELLER will run for a while as it models the sequence and then does the minimization steps. The initial output is your 3 or 4 letter name for your protein followed by ".ini", and the actual models are followed by "B9999??", where ?? is the number of the minimization. MODELLER will build between 1 and 15 models, depending on the number you asked for in the .top file.

Which model do you choose? The one with the lowest energy. A table is produced for each model. The energy value is the number associated with the term Value of Ln (Molecular pdf). The model with the combination of the lowest energy and the lowest number of restraint violations is the one you want.

2) Evaluate the model

Examine your best models by producing and assessing the Ramachandran plot. This can be done using the RAMPAGE server (http://raven.bioc.cam.ac.uk/rampage.php). Simply upload your model coordinates and press submit. Residues in the uploaded PDB file that fall into the "allowed" and "outlier" regions are listed, and a picture of the Ramachandran plot is displayed.

Note: If the models are of poor quality, go back to the alignment, see if it can be improved, and rerun MODELLER. Repeat this process until you are happy with the quality of the models.
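
Returning to step 1 of this exercise, the rule for choosing a model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values are typed in by hand from the log file, the numbers shown are hypothetical rather than real MODELLER output, and the names `ModelScore` and `rank_models` are made up here. The sketch takes one possible reading of "combination": it ranks by lowest energy first and uses the number of restraint violations to break ties.

```python
from typing import NamedTuple


class ModelScore(NamedTuple):
    name: str         # model file name, e.g. "gdh.B999901"
    objective: float  # energy value read by hand from the log file
    violations: int   # number of restraint violations read by hand from the log


def rank_models(scores):
    """Sort models so the lowest energy comes first; ties go to fewer violations."""
    return sorted(scores, key=lambda s: (s.objective, s.violations))


# Hypothetical numbers for illustration only; replace with values from your log.
scores = [
    ModelScore("gdh.B999901", 1250.0, 14),
    ModelScore("gdh.B999902", 1180.5, 9),
]

best = rank_models(scores)[0]
print(best.name)
```

If the lowest-energy model has noticeably more restraint violations than another model, the two criteria disagree, and it is worth inspecting both models rather than trusting the ranking alone.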
Try using more than one template during the modelling process.

Useful links

FUGUE: A fold recognition method using structural environment-specific substitution tables and structure-dependent gap penalties
http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html

Joy: Protein structure and alignment analysis
http://www-cryst.bioc.cam.ac.uk/cgi-bin/joy.cgi

MODELLER: Program for Comparative Protein Structure Modelling
http://www.salilab.org/modeller/modeller.html

RAMPAGE: Structural validation by assessment of the Ramachandran plot
http://raven.bioc.cam.ac.uk/rampage.php