READING PDB FILES
Claire Shoemake

Definitions
• Protein is used interchangeably with receptor.
• The implication is that the drug target (receptor) being considered is protein in nature.
• Ligand: the small molecule bound to the protein. This could be an endogenous molecule or a drug.
• Protein:ligand complex: the small molecule bound to its receptor. Normally the small molecule modulates receptor function (agonist/antagonist).

What is a Protein Data Bank (PDB) File?
• It is a textual file format describing the three-dimensional structures of molecules held in the Protein Data Bank. http://bip.weizmann.ac.il/oca-bin/ocamain
• Most of the information in that database pertains to proteins, and the pdb format accordingly provides for rich description and annotation of protein properties. However, proteins are often crystallized in association with other species such as water, ions, nucleic acids and drug molecules, which can therefore be described in the pdb format as well.
• The pdb file used as an example in this lecture is 1UZF (http://bip.weizmann.ac.il/oca-bin/send-pdb?id=1uzf), which describes the Angiotensin Converting Enzyme (ACE) bound to the ACE-inhibiting drug captopril.

Protein Classification PDB ID
1. Gives information regarding the content of the file.
2. Indicates that the protein is human; in this case, human testicular ACE.
3. Indicates the nature of the tissue culture that is used to express, or grow, the protein described in this file.
4. Indicates the analytical technique (X-ray or NMR) that was used by the authors to resolve the protein crystal structure. In this case the crystal being considered is testicular ACE complexed to captopril.

X-ray Crystallography
http://en.wikipedia.org/wiki/X-ray_crystallography

X-ray crystallography is a method of determining the arrangement of atoms within a crystal, in which a beam of X-rays strikes a crystal and diffracts into many specific directions.
From the angles and intensities of these diffracted beams, a crystallographer can produce a three-dimensional picture of the density of electrons within the crystal. From this electron density, the mean positions of the atoms in the crystal can be determined, as well as their chemical bonds and various other information.

Since many materials can form crystals (such as salts, metals, minerals, semiconductors, as well as various inorganic, organic and biological molecules), X-ray crystallography has been fundamental in the development of many scientific fields. In its first decades of use, this method determined the size of atoms, the lengths and types of chemical bonds, and the atomic-scale differences among various materials, especially minerals and alloys.

The method also revealed the structure and functioning of many biological molecules, including vitamins, drugs, proteins and nucleic acids such as DNA. X-ray crystallography is also a chief method supporting the design of pharmaceuticals against diseases.

In an X-ray diffraction measurement, a crystal is mounted on a goniometer and gradually rotated while being bombarded with X-rays, producing a diffraction pattern of regularly spaced spots known as reflections. The two-dimensional images taken at different rotations are converted into a three-dimensional model of the density of electrons within the crystal using the mathematical method of Fourier transforms, combined with chemical data known for the sample. Poor resolution (fuzziness) or even errors may result if the crystals are too small, or not uniform enough in their internal makeup.

5. Crystallographic team. These are also the authors of the paper that must be published in a peer-reviewed journal prior to deposition acceptance by the Protein Data Bank.
6. Details of the journal publication submitted by the crystallographic team. It is of vital importance to obtain a copy of this publication when attempting drug design projects.
The publication contains further information that may not be included in the pdb file.

It is necessary to choose the best possible crystallographic structure prior to embarking on a drug design project. This is because this structure serves as a starting point and template on which all successive steps are dependent. One critical factor in crystallographic data selection is its resolution. Resolution is the smallest distance at which atoms may be reliably distinguished. The higher the resolution, that is, the smaller this distance, the better the crystallographic structure. Resolutions ranging from 2 to 3.5 Å are considered acceptable starting points for drug design projects. This particular crystal structure was resolved at 2.0 Å.

About 85% of the models (entries) in the Protein Data Bank were determined by X-ray crystallography. (Most of the remaining 15% were determined by solution nuclear magnetic resonance.) Analysis of X-ray diffraction patterns from protein crystals produces an electron density map, into which an atomic model of the protein is fitted. Major errors sometimes occur when fitting models into low-resolution electron density maps. The value of free R is the best clue as to whether major errors may be present in a published model.

Obtaining diffraction-quality crystals of proteins remains very difficult, despite many recent advances. Of the new protein sequences targeted for X-ray crystallography, only about one in twenty is solved.

Free R is a statistical quantity introduced in 1992 by Axel T. Brünger to assess the quality of a model from X-ray crystallographic data. It is calculated in the same manner as the R value, but from a subset of the data that is set aside for the calculation of free R and not used in the refinement of the model.
It is a more reliable tool for assessing the model than the R value because it is not self-referential; that is, as an estimate of errors, free R is free of any bias that may have been introduced during refinement. As a rule of thumb, free R should not exceed the R value by more than 0.05; that is, if the R value is 0.20, free R should not significantly exceed 0.25. Free R values exceeding 0.40 raise serious doubts about the model.

The R Value
• The R value is used to assess progress in the refinement of a model from X-ray crystallographic data, and can be used as one factor in evaluating the quality of a model. R is a measure of error between the observed intensities from the diffraction pattern and the predicted intensities that are calculated from the model. R values of 0.20 or less are taken as evidence that the model is reliable.
• As a rule of thumb, models with R values substantially exceeding (resolution/10) should be treated with caution. Thus, if the resolution of a model is 2.5 Å, that model's R value should not exceed 0.25. Completely erroneous models (e.g. random models) give R values of 0.40 to 0.60.
• However, R values themselves must be treated with caution. Unlike free R, acceptable R values can be achieved despite serious errors in the model (Kleywegt, GJ, AT Brünger. 1996. Checking your imagination: applications of the free R value. Structure 4:897-904).

It is incumbent on the authors to submit experimental details to the Protein Data Bank. This allows their experimental conditions to be re-created, and their results to be reproduced.

The related entries section of the pdb file is valuable since it points the researcher to further structural information that may be available about the protein, or receptor, of interest. In this case, three further depositions, with pdb IDs 1O86 (ACE + lisinopril), 1O8A (the unbound form of ACE) and 1UZE (ACE + enalaprilat), are available.
It is of interest from a drug design point of view to visualise and compare these depositions in order to identify whether or not the tertiary structure of ACE is in any way ligand dependent.

The primary amino acid sequence, i.e. the linear sequence of the unfolded protein (in this case the testicular ACE enzyme), is listed in this section of the pdb file. At this point of the file it is also possible to deduce that the protein is a monomer. This may be seen from the fact that the third column of the file always contains the letter A. This means that there is only one chain, labelled A, implying the monomeric status of the protein.

The term heteroatom is used in pdb files to designate all atoms that do not form part of the protein, i.e. all atoms that do not form part of the primary structure of the protein. This part of the pdb file indicates all the heteroatoms (excluding water molecules) that form part of the protein (ACE):ligand (captopril) complex. The areas highlighted in blue are searchable, and lead to windows in which the structures of the heteroatoms may be found. In this case the presence of the Zn atom indicates that ACE is a metalloprotease; MCO is the code given by the authors for captopril. HOH indicates water.

Helices and sheets constitute the secondary structure of a protein, or more clearly the nature of the folding that occurs along segments of the protein. This section of the pdb file yields information regarding the secondary structure of the protein being described. The areas highlighted in blue are searchable; parts of these are shown above. In this case, the entry shows which amino acids form helix 1 of ACE.

The coordinate section of the pdb file describes the coordinates of the atoms that are part of the protein. For example, the first ATOM line on the left describes the backbone (alpha-amino) nitrogen atom of the first residue of peptide chain A, which is an aspartate residue.
The first three floating point numbers are its x, y and z coordinates, in units of Ångströms. The next three columns are the occupancy, temperature factor and element name, respectively. The red rectangles delineate individual amino acids. The atoms making up any one amino acid share the same residue number in the coordinate file (the number that follows the chain identifier). Thus, in this case, there are the coordinates of the first 6 amino acids in the primary amino acid sequence, specifically aspartate, glutamine, alanine, glutamine, alanine and serine.

The temperature factor or B-factor can be thought of as a measure of how much an atom oscillates or vibrates around the position specified in the model. Atoms at side-chain termini are expected to exhibit more freedom of movement than main-chain atoms, and this movement amounts to spreading each atom over a small region of space.

Occupancy is one of several parameters included in refinement. The occupancy n_j of atom j is a measure of the fraction of molecules in the crystal in which atom j actually occupies the position specified in the model. If all molecules in the crystal are precisely identical, then occupancies for all atoms are 1.00.

This part of the pdb file shows the last amino acid in the primary amino acid sequence of the protein. Its end is indicated by the TER entry encircled above. The pdb file then continues to describe the first in the series of heteroatoms included in this entry, that is, of those atoms which are not part of the protein molecule. The first is NAG, or N-acetylglucosamine. As indicated previously, two NAG molecules were crystallised in this protein:ligand complex.

The coordinates for the metal ion (Zn) and the bound ligand molecule (captopril), designated as previously indicated by the code identifier MCO, are shown above.

The CONECT records list, for each atom in the chemical component, how many and which other atoms that atom is bonded to. The list of CONECT records is concluded with an END record.
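As a rough illustration of the coordinate records described above, the following Python sketch slices ATOM and HETATM lines at the fixed character positions defined by the PDB format, and skips other records such as TER and END. The names AtomRecord, parse_atom_line, parse_atoms and SAMPLE_LINES are hypothetical, and the sample lines are made up for illustration; they are not taken from 1UZF.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    record: str       # "ATOM" or "HETATM"
    serial: int       # atom serial number
    name: str         # atom name, e.g. "N", "CA"
    res_name: str     # residue name, e.g. "ASP", "HOH"
    chain: str        # chain identifier, e.g. "A"
    res_seq: int      # residue sequence number
    x: float          # coordinates in Angstroms
    y: float
    z: float
    occupancy: float
    b_factor: float   # temperature factor
    element: str


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one ATOM/HETATM line at the fixed columns of the PDB format."""
    return AtomRecord(
        record=line[0:6].strip(),
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )


def parse_atoms(lines):
    """Keep only coordinate records; TER, CONECT, END etc. are ignored."""
    return [parse_atom_line(l) for l in lines if l.startswith(("ATOM", "HETATM"))]


# Made-up sample records (illustrative values only, not from 1UZF).
SAMPLE_LINES = [
    "ATOM      1  N   ASP A   1      10.000  20.000  30.000  1.00 25.00           N",
    "ATOM      2  CA  ASP A   1      11.200  20.500  30.100  1.00 24.10           C",
    "TER",
    "HETATM    3  O   HOH A 900      12.000  21.000  31.000  1.00 30.00           O",
    "END",
]

if __name__ == "__main__":
    for atom in parse_atoms(SAMPLE_LINES):
        print(atom.record, atom.res_name, atom.chain, atom.res_seq,
              atom.name, (atom.x, atom.y, atom.z), atom.occupancy, atom.b_factor)
```

Slicing by character position rather than splitting on whitespace is a deliberate choice here, since adjacent fields in a pdb file can run together without a separating space. For real work, an established parser such as the one in Biopython's Bio.PDB module is likely a better option than this sketch.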
Ligand Protein Contacts (LPC)
http://bip.weizmann.ac.il/oca-bin/lpc?PDB_ID=1uzf

Most pdb files contain ligand:protein contact information. This is of vital importance from a drug design point of view:
• A clear idea of the amino acids which line the ligand binding pocket is obtained.
• Critical binding interactions between the ligand and the receptor may be identified.
• Unstable contacts may also be identified and improved upon in the context of the design project.

In this case, the table above lists the amino acids of ACE which make contact with captopril. The bond length, the contact surface area and the nature of the bond are also indicated. The table above left is a glossary which explains the terms used in the table above.

Hydrogen bonds play an important role in binding ligands to the ligand binding pocket of a receptor. They are different from hydrophobic or van der Waals interactions. These latter are more numerous and are considered to be largely responsible for ligand stabilisation within a binding pocket. Hydrogen bonds, on the other hand, are associated with selectivity. This means that a ligand and its cognate receptor recognise each other on the basis of the hydrogen bonds they are capable of forging between them. This is very important from a drug design point of view, where selectivity is of paramount importance.

Pdb files conveniently list the hydrogen bonds forged between protein and ligand in a separate table in the LPC section of the file. In the above table, the first section on the extreme left describes the ligand atoms which are involved in hydrogen bond contacts with the protein amino acid side chains. In the first entry, for example, oxygen atom no. 1 (in the pdb entry) is forging a hydrogen bond with the hydroxyl group of tyrosine 520 of ACE. The protein atom section consequently describes the receptor atoms which forge hydrogen bond contacts with the ligand atoms.
The hydrogen bond in the first entry is 2.7 Å long and has a contact surface area of 19.4 Å². The classification section (Class in the table above) is discussed later on.

This table lists each atomic contact between the protein and the ligand. It is similar to that for the hydrogen bond interactions on the previous slide. It differs in that it does not single out hydrogen bond interactions, but includes all bond types. It also indicates the unstable interactions in red. Drug designers will often try to optimise these unstable contacts in order to create drug molecules that reside within a ligand binding pocket with improved stability.

These are the reference tables included in the LPC section of a pdb file. They indicate the nature of the interactions forged between protein and ligand (listed under the Class section), and in the case of the table on the right, there is also information regarding which types of interactions will give rise to stable or unstable contacts between the protein and the ligand.

This data may be viewed graphically using specialised software such as VMD.
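The distance column of the contact tables described above can be illustrated with a small Python sketch that lists ligand:protein atom pairs lying within a chosen cutoff. This is only a simplified stand-in and not the method used by the LPC software, which also reports contact surface areas and interaction classes. The function name contacts_within, the 3.5 Å default cutoff, the atom labels and all coordinates are assumptions made for illustration; the coordinates are invented so that the first pair is 2.7 Å apart, matching the bond length quoted above.

```python
import math


def contacts_within(ligand_atoms, protein_atoms, cutoff=3.5):
    """Return (ligand label, protein label, distance in Angstroms) for every
    ligand/protein atom pair whose separation is at most `cutoff`."""
    found = []
    for lig_label, lig_xyz in ligand_atoms:
        for prot_label, prot_xyz in protein_atoms:
            distance = math.dist(lig_xyz, prot_xyz)  # Euclidean distance
            if distance <= cutoff:
                found.append((lig_label, prot_label, round(distance, 2)))
    # Shortest contacts first
    return sorted(found, key=lambda contact: contact[2])


# Invented coordinates (not from 1UZF), chosen to give a 2.7 Angstrom pair.
LIGAND = [("MCO O1", (0.0, 0.0, 0.0))]
PROTEIN = [
    ("TYR 520 OH", (2.7, 0.0, 0.0)),
    ("hypothetical distant atom", (0.0, 4.0, 3.0)),  # 5.0 Angstroms away
]

if __name__ == "__main__":
    for lig, prot, dist in contacts_within(LIGAND, PROTEIN):
        print(f"{lig} to {prot}: {dist} Angstroms")
```

A purely distance-based filter of this kind cannot tell a hydrogen bond from a van der Waals contact; doing so would also require the atom types and geometry, which is what the Class column in the LPC tables summarises.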