The Construction, Refinement, and Assessment of Atomic Models

7.1 The interpretation of electron density maps and the role of resolution
7.2 The Refinement of Atomic Models
7.2.1 What is “Refinement”
7.2.2 Stereochemical restraints as additional observations
7.2.3 Reducing the complexity of the model at low resolution
7.2.4 Determination of the best model parameters
7.2.5 Model building and refinement as an iterative process
7.2.6 Measures of agreement between model and observations
7.3 Ordered solvent structure, and scattering by the bulk solvent
7.4 Structure Validation: Judging the Quality and Utility of a Model

Wednesday, 1 April 15

The resolution limit

The resolution limit is related to the maximum scattering angle 2θmax of the observed data.

[Figure: incident X-rays strike the crystal, and the scattered X-rays reach the detector at an angle 2θ to the incident beam.]

The quantity 1/|s| = λ/(2 sin θ) is termed the resolution.
The resolution limit = λ/(2 sin θmax)

For example, with λ = 1.0 Å and 2θmax = 30°, the resolution limit is 1.0/(2 sin 15°) ≈ 1.93 Å.

The importance of the resolution limit is that it determines which features can be resolved in electron density maps. This is best illustrated by example ...

The importance of resolution

Resolution determines what kind of model it will be possible to build, and how good that model can ultimately be. We will not discuss model building in great detail, as that’s best learned in a practical setting.

Note: With the molecular replacement method, you’ll already have a pretty good starting model. With the isomorphous replacement and anomalous scattering methods, you’ll have only an electron density map, and the model will need to be built from scratch. If you’re working at high resolution, with good experimental phases, this process has been effectively automated.

[Figure from Blow (2002).]

The importance of resolution ... a second example

[Figure panel: 4 Å. From Cantor and Schimmel (1980).]

N.B: Since the projection of the structure is centro-symmetric, there are only two choices for the phase of each spot.
[Figure panels: 2 Å and 1 Å.]

The Interpretation of Electron Density Maps: Model Building

We’ll focus on proteins. Successful interpretation of experimentally-phased electron density maps typically proceeds through:

A) Recognition of secondary structural elements (α-helices and β-strands). This helps establish the directionality and topology of the polypeptide chain.
B) The assignment of the amino acid sequence, through recognition of characteristic side chain density.

You may have auxiliary information which will assist in model building (e.g. if SeMet substitution/MAD phasing has been used, you will know the location of the Met residues, since the selenium atoms are located during phasing).

Identification of α-helices

Carbonyl oxygens point toward the C-terminal end of the helix. Look for carbonyl bumps. These generally become visible at resolutions better than about 3 Å.

[Figure: helix density with residues numbered 1 to 10 and the C-terminus labelled. Adapted from a slide prepared by Mike Sawaya, UCLA.]

Cβ atoms point toward the N-terminal end of the helix ... like the branches of a Christmas tree.

[Figure: helix density with the Cβ positions and the N-terminus labelled. Adapted from a slide prepared by Mike Sawaya, UCLA.]

Helices viewed from two different perspectives, related by a 90° rotation: viewed down the helical axis, and viewed perpendicular to the helical axis.

[Figure adapted from a slide prepared by Mike Sawaya, UCLA.]

The hole through the center of a helix is a most distinctive feature of α-helix density. Often it is easier to recognize helical density when viewed down the helix axis due to the distinctive hole through the center of the helix.
This is generally visible at resolutions better than about 3 Å.

[Figure: helix density viewed down the helical axis and viewed perpendicular to the helical axis. Adapted from a slide prepared by Mike Sawaya, UCLA.]

Identification of β-strands

β-strands viewed from different perspectives, related by a 90° rotation.

From one perspective, side chains of successive residues alternate in and out of the plane of the page. From the other, side chains of successive residues alternate up and down. Often it is easiest to recognize a β-strand by this distinctive zig-zag pattern.

[Figure: a β-strand with residues numbered 1 to 10 from N-terminus to C-terminus, shown in both views. Adapted from a slide prepared by Mike Sawaya, UCLA.]

Be sure to view both perspectives when modeling a β-strand. When viewed using the zig-zag perspective, both orientations of the strand appear to fit the electron density OK. But, viewed perpendicular to the zig-zag perspective, it becomes clear that only one direction of the strand fits the carbonyl bumps in the electron density.

[Figure: correct and incorrect strand orientations. Adapted from a slide prepared by Mike Sawaya, UCLA.]

Strands generally don’t occur in isolation, but form part of a β-sheet. In maps calculated with error-ridden phases, electron density in β-sheets has a tendency to connect across the strands, which makes building sheet structures more challenging than building helices.

http://www-structmed.cimr.cam.ac.uk/Course/Fitting/fittingtalk.html

Sequence assignment

While most amino acids have distinctive shapes, some are isosteric. When in doubt, consider the protein environment.

[Figure adapted from a slide prepared by Mike Sawaya, UCLA.]

Choice of asymmetric unit

As was noted in Lecture 2, the selection of the asymmetric unit is arbitrary.
When there is non-crystallographic symmetry (multiple copies of the molecule in the asymmetric unit) it is possible to make choices for the asymmetric unit that obscure biologically relevant interactions. Here’s an example →

[Figure 2: Four independent protein protomers in the asymmetric unit of the structure 1woc as presented in the PDB (a) and after regrouping (b), when it becomes apparent that this structure consists of two similar dimers.]

When you build a model de novo, and there is non-crystallographic symmetry present, it pays to look carefully at what’s in your asymmetric unit, and consider if there are more sensible choices.

Dauter, Z. (2013). Placement of molecules in (not out of) the cell. Acta Crystallogr D 69, 2–4

Refinement of models

•Once you build a model, it needs to be refined against the experimental data.
•Refinement is the process of arriving at the “best” model parameters given the experimental observations.
•This is an optimization problem and involves some fairly heavy duty mathematics and statistics.
•Still - we need some insight into this mysterious process.

Let’s start by making sure we’re clear on the nature of the model, the nature of the observations, and the connection between them.

In protein crystallography …

The model is generally a collection of atoms, each defined by
1.
An atom type (which defines the appropriate X-ray scattering factors)
2. A position (coordinates x, y, z)
3. A B factor, providing an estimate of an atom’s vibration about the mean position
4. An occupancy (between 0 and 1)

These numbers form the principal parameters of the model. In refinement we seek to adjust x, y, z, B and (sometimes) occ to achieve better agreement with ...

In protein crystallography …

The primary observations, which are
1. The experimentally measured intensities I(hkl), which we convert into structure factor amplitudes |F(hkl)|
2. The experimentally measured phases α(hkl), if we have them.

The model is connected to the observations by the structure factor equation

$$F(h,k,l) = \sum_{j=1}^{n} f_j \exp\left[2\pi i\,(h x_j + k y_j + l z_j)\right]$$

A big problem with refinement

One difficulty with refinement is that we often lack sufficient experimental observations to meaningfully adjust the parameters of the model. To explain …

•In a medium sized protein there are about 2500 atoms.
•With four parameters for each atom (x, y, z, B) there will be 10000 parameters in the model.
•For such a protein, a diffraction data set to 2.5 Å resolution might contain 15000 measured intensities, i.e. only 1.5 observations per parameter.

Because of the complicated and non-linear mathematical relationship between the structure factors and the model, refinement will not produce sensible results unless we have many more observations than parameters. Either we must simplify the model, or increase the number of observations. Often we increase the number of “observations”, by incorporating known information about protein structure …

Stereochemical restraints as additional observations

[Figure from Tronrud (2004) Acta Cryst D60, 2156-2168.] Stereochemical restraints in a dipeptide. This figure shows the bonds, bond angles and torsion angles for the dipeptide Ala-Ser.
Black lines indicate bonds, red arcs indicate bond angles and blue arcs indicate torsion angles. The values of the bond lengths and bond angles are, to the precision required for most macromolecular-refinement problems, independent of the environment of the molecule and can be estimated reliably from small-molecule crystal structures. The values of most torsion angles are influenced by their environment and, although small-molecule structures can provide limits on the values of these angles, they cannot be determined uniquely without information specific to this crystal.

Hopefully you can “see” how this works

[Figure adapted from Cantor and Schimmel (1980).]
2.0 Å map. Atoms are not well defined. Additional stereochemical information is needed to refine atomic positions.
1.0 Å map. Atoms are well defined. Atomic positions can be refined without stereochemical restraints.

Simplifying the model: Reducing the number of model parameters

Even incorporating stereochemical restraints, refinement of an atomic model can still be problematic - especially as the resolution becomes worse than 3.0 Å. Crystallographers must change the way the model is parameterized to make refinement possible. Here are two procedures in common use ...

•Exploit non-crystallographic symmetry: If there are several copies of a molecule in the asymmetric unit they can be restrained to be similar, or constrained to be identical. This dramatically reduces the number of model parameters.
•Employ rigid body refinement: At quite low resolution (worse than about 4 Å) it becomes next to impossible to refine individual atomic positions. Yet it is still possible to meaningfully refine the position and orientation of larger structural units (helices, strands, entire protein domains), which may yield useful biological information.

What do we mean by the best model parameters?

So we have observations, we have a model, and we have an equation that connects them.
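To make the connection between model and observations concrete, here is a minimal Python sketch of the structure factor equation given earlier. It is an illustration under stated assumptions, not how a refinement program does it: the names `structure_factor` and `amplitude_and_phase` are made up for this example, coordinates are taken to be fractional, and each f_j is treated as a constant (real scattering factors fall off with resolution, and B factor and occupancy are ignored here).

```python
import cmath
import math

def structure_factor(hkl, atoms):
    """F(hkl) = sum_j f_j * exp[2*pi*i*(h*x_j + k*y_j + l*z_j)].

    atoms is a list of (f_j, (x_j, y_j, z_j)) with fractional coordinates.
    Assumption: f_j is a constant per atom (no resolution dependence,
    no B factor, no occupancy).
    """
    h, k, l = hkl
    total = 0j
    for f_j, (x, y, z) in atoms:
        total += f_j * cmath.exp(2j * math.pi * (h * x + k * y + l * z))
    return total

def amplitude_and_phase(F):
    """Split a complex structure factor into |F| and its phase in degrees."""
    return abs(F), math.degrees(cmath.phase(F))

# Toy model: two identical atoms, half a cell apart along x.
atoms = [(6.0, (0.0, 0.0, 0.0)), (6.0, (0.5, 0.0, 0.0))]
amp_100, _ = amplitude_and_phase(structure_factor((1, 0, 0), atoms))
amp_200, _ = amplitude_and_phase(structure_factor((2, 0, 0), atoms))
```

In this toy model the two atoms scatter exactly out of phase for the (1 0 0) reflection and in phase for (2 0 0), so the first amplitude is essentially zero while the second is the sum of the two scattering factors.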
How do we “refine” the model to arrive at the best set of model parameters? Most refinement packages now employ a branch of statistical theory termed Maximum Likelihood. The best model is the one most consistent with the observations.

•Consistency is measured statistically, by the probability that the observations would be made, given the current model. This is termed the Likelihood.
•If the model is changed to make the observations more probable, the model gets better and the Likelihood goes up.
•To calculate the relevant probabilities, we need to consider the errors ... both in the observations and in the model itself.
•The way errors in a real space model get propagated into Fourier space is not intuitive, so let’s take a look ...

The atomic structure factor, when position and scattering are uncertain

To get a feeling for how this plays out let’s consider what the probability distribution looks like for the structure factor of a single atom, when we assume that there are Gaussian errors in its position (specified by X, Y, Z), and in its scattering (specified by the element, the occupancy, and the B-factor).

Errors for an atomic structure factor. (a) An atom has variation in position (indicated by purple arrow) and in scattering (indicated in green concentric circles). (b) The variation in the atom's position and scattering are Gaussian. (c) The atom at its mean position with its mean scattering has a structure factor Fatom (shown with a black vector). Variation in the atom's position corresponds to variation in the phase of Fatom (shown with a purple arrow) and variation in the scattering corresponds to variation in the length of Fatom (shown with a green arrow). (d) The distributions of the structure factors owing to variation in the atom's position and scattering combine to give a boomerang-shaped structure-factor distribution (indicated with black contours).
Since the distribution of structure factors is symmetric about Fatom, the average structure factor is shorter than Fatom (by a fraction d, where 0 < d < 1) but in the same direction as Fatom (dFatom).

From McCoy. Liking likelihood. Acta Crystallogr D Biol Crystallogr (2004) vol. 60 (Pt 12 Pt 1) pp. 2169-83

Note this key point: The structure factor calculated from the most probable model is not the most probable value for the structure factor !!! Ouch.

Mathematical optimization methods

Incorporating these kinds of probability distributions, one can build up a likelihood function. That function can then be maximized, by shifting the model parameters. This is a very difficult optimization problem, and we will not consider the details of how it’s done. But that’s what’s going on under the hood of your refinement program.

The whole process is a generalization of the method of non-linear least squares - which you’ve almost certainly used to fit simple non-linear functions to experimental data in other contexts.

Model Building and Refinement as an iterative process

Refinement procedures do not eliminate all the errors in a model. Some problems are too severe for refinement to fix. For example a side chain initially built in a wrong conformation will rarely be corrected, since that would involve a concerted movement of all the side chain atoms. The refinement process becomes trapped in a local minimum.

If you do something like this, refinement will not help you !! [Figure from Rupp (2010).]

Local minima ... [Figure from Rupp (2010).]

So refinement needs to be interspersed with manual inspection and correction of the model. A crystallographer can see patterns in the electron density maps indicating the need for large shifts, outside the range of the refinement procedure (e.g. flipping a side chain into an alternate conformation).
As the model (and derived phases) improve, the electron density maps become clearer, allowing troublesome regions to be reinterpreted.

Measures of agreement

Regardless of the details of the refinement process, we need some statistics to measure the agreement between model and observations. Those in general use:

•Agreement of structure factor amplitudes is assessed with the R-factor.
•Agreement of phases is assessed with the Mean Phase Error (= Mean Phase Residual).
•Agreement of geometry is assessed with Root Mean Square Deviations (RMSDs).

The R-factor and the free R-factor

The usual statistic for calculating agreement between the observed and calculated structure factor amplitudes is the R-factor

$$R = \frac{\sum_{hkl} \bigl|\,|F_{obs}(hkl)| - |F_{calc}(hkl)|\,\bigr|}{\sum_{hkl} |F_{obs}(hkl)|}$$

Decreases in the R-factor imply improved agreement between the model and the data. Typical R-factors for fully refined protein structures are 0.10 - 0.25. Completely wrong structures (all the atoms in incorrect places) will generally yield R-factors > 0.50.

It is now standard practice in protein crystallography to omit a small (~5%) and randomly-selected fraction of the data from all refinement procedures. These reflections, and only these, are used to calculate the free R-factor (Rfree). The remainder - the data that are used in refinement - are used to calculate the working R-factor (Rwork).

[Figure: reflections partitioned into those used to calculate Rwork and those used to calculate Rfree. Adapted from Rupp (2010).]

The free R-factor was introduced and popularized by Axel Brunger, in the early 1990s.

Monitoring the free R-factor, calculated from observations that the refinement procedure does not “know about”, helps prevent spurious adjustments to the model that decrease the working R-factor but that, by other criteria, do not objectively improve the model. In statistics, this is termed cross-validation.
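The R-factor and the working/free split described above can be sketched in a few lines of Python. This is only an illustration: the function names (`r_factor`, `assign_free_flags`, `r_work_and_r_free`) are hypothetical, the amplitudes are assumed to be already on a common scale, and both the working and free sets are assumed to be non-empty.

```python
import random

def r_factor(f_obs, f_calc):
    """R = sum | |Fobs| - |Fcalc| | / sum |Fobs| over the supplied reflections."""
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator

def assign_free_flags(n_reflections, fraction=0.05, seed=0):
    """Randomly flag roughly `fraction` of the reflections as the free set."""
    rng = random.Random(seed)
    return [rng.random() < fraction for _ in range(n_reflections)]

def r_work_and_r_free(f_obs, f_calc, free_flags):
    """Compute R separately over the working and the free reflections."""
    work = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if not free]
    free = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if free]
    r_work = r_factor([fo for fo, _ in work], [fc for _, fc in work])
    r_free = r_factor([fo for fo, _ in free], [fc for _, fc in free])
    return r_work, r_free
```

The free flags must be assigned once, before any refinement, and then left alone. If the free reflections are ever used to adjust the model, Rfree stops being an independent check.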
Ideally, R and Rfree should be close to one another (differing by less than 0.05).

[Figure from Rupp (2010).]

Agreement between observed and calculated phases is generally reported as the mean phase error (or mean phase residual):

$$\text{mean phase error} = \frac{1}{n_{hkl}} \sum_{hkl} \left|\alpha_{OBS}(hkl) - \alpha_{CALC}(hkl)\right|$$

(Mean phase errors of 90° are obtained when comparing random sets of non-centrosymmetric phases.)

Agreement between model geometry and ideal geometry is generally assessed with a Root Mean Square Deviation. For any geometric parameter p:

$$\text{RMSD} = \sqrt{\frac{1}{n} \sum_{n} \left(p_{MODEL} - p_{IDEAL}\right)^{2}}$$

where the sum runs over the n instances of the parameter in the model. For a well-refined protein structure, RMSD on bond lengths will generally be < 0.02 Å, and RMSD on bond angles < 2°.

Modeling of ordered solvent

In addition to the protein, models of X-ray scattering from biological crystals must also include the contributions from the solvent which, if you recall, occupies around 50% of the volume of a typical protein crystal.

It’s common for some solvent molecules to be localized on the surface of a protein through hydrogen bonding.

[Figure: an example of ordered water molecules, localized on the surface of the protein through hydrogen bonding.]

These we can give an X, Y, Z and B, just like the protein atoms. Strictly we should refine the occupancy of the water molecules. However B-factor and occupancy are so highly correlated that this is impossible at all but the highest resolutions. Hence these water molecules are treated as “fully present” and their B-factor incorporates the effects of variable occupancy.

The majority of the solvent is not ordered in this fashion, but still contributes appreciably to the X-ray scattering.

Modeling the scattering contribution of the disordered (bulk) solvent.
If our model of X-ray scattering from the crystal neglects the “bulk solvent”, which fills the channels and interstices between protein molecules in the crystal, and we calculate structure factor amplitudes from the protein and ordered solvent alone, there will be large discrepancies between the observed and calculated amplitudes at low resolution.

Scattering from the bulk solvent was recognized even before the first protein structures had been solved. By manipulating the electron density of the solvent, Lawrence Bragg and Max Perutz were able to observe systematic changes in the intensity of the low order diffraction data, and infer the approximate dimensions of Haemoglobin. They had hoped this would help them resolve the phase problem (it didn’t).

Perutz’s data ...

From these early experimental observations we can deduce that scattering from the bulk solvent reduces the amplitude of the low resolution protein diffraction data. Hence neglecting scattering from the bulk solvent (i.e. placing the protein molecules “in a vacuum”) will lead to calculated structure factor amplitudes that are generally much too large.

Modeling scattering from the bulk solvent.

First let’s partition the total electron density in the cell into two mutually exclusive parts - one part due to the protein and the other part due to the bulk solvent

$$\rho_{TOTAL}(\mathbf{r}) = \rho_{PROTEIN}(\mathbf{r}) + \rho_{SOLVENT}(\mathbf{r})$$

Or equivalently, in “reciprocal space”, we can write

$$F_{TOTAL}(\mathbf{h}) = F_{PROTEIN}(\mathbf{h}) + F_{SOLVENT}(\mathbf{h})$$

(This works because the Fourier transform is a linear operator.)

We can calculate the structure factors from the protein FPROTEIN(hkl) readily enough. But how do we calculate FSOLVENT(hkl), the contribution from the bulk solvent? One way to proceed is to use a real space modeling method, as introduced by Simon Phillips.

Real Space Method for modelling scattering from the bulk solvent.
In this procedure
a) A grid is created, covering the unit cell.
b) The boundary between protein and solvent is defined.
c) All grid points in the solvent region are assigned a value of 1, while all those in the protein region are assigned a value of 0.

The resulting binary function is termed the solvent mask.

If we Fourier transform the solvent mask we get FMASK(hkl), the structure factors of the mask. FSOLVENT(hkl), the structure factors of the solvent, can then be readily calculated ...

$$F_{SOLVENT}(\mathbf{h}) = K_{SOL}\,\exp\!\left(-B_{SOL}\,\frac{\sin^{2}\theta}{\lambda^{2}}\right) F_{MASK}(\mathbf{h})$$

where the parameter Ksol is the mean electron density of the solvent. The resolution-dependent exponential multiplier (involving the parameter Bsol) is introduced in order to blur the sharp boundary between the protein and solvent regions. The two parameters Ksol and Bsol can be adjusted during refinement of the model.

Fraud and the bulk solvent correction

In the past few years several fraudulent structures have been published in which the investigators invented the structure and manufactured the X-ray diffraction data !! However, they didn’t do a very good job. One of the incriminating pieces of evidence, which led to their undoing, was the failure to correctly add the scattering contribution of the bulk solvent.

Rupp, B. Detection and analysis of unusual features in the structural model and structure-factor data of a birch pollen allergen. Acta Crystallogr F 68, 366–376 (2012).

[Figure panels: Real data, Made up data.] Bulk-solvent contribution analysis for 1fm4 and 3k78. The left panels depict the expected, nearly textbook-like behavior of a normal crystal structure like 1fm4.
The top row shows the resolution-dependent behavior of R when the bulk-solvent correction is included (solid lines) and when it is not included (dashed lines) in the R-value calculation. 1fm4 shows the expected increase of the low-resolution R values in the absence of bulk-solvent correction, indicating that bulk-solvent scattering contributions are present in the observed data. Such is not the case for 3k78. Bottom row: the presence of bulk-solvent contributions also causes the low-resolution calculated structure factors (dashed line) to be higher than the observed ones (solid), which are appropriately attenuated by the disordered bulk solvent scattering contributions in 1fm4. There is no difference between F(obs) and F(calc) for 3k78, again indicating the absence of bulk-solvent scattering in the structure-factor data. (Figure 6 of Rupp, Acta Cryst. (2012). F68, 366–376.)

Annotations on the figure:
•Real data: without correcting for scattering of bulk solvent, mean |Fobs| and |Fcalc| diverge markedly at low resolution. |Fcalc| is much too big at low resolution.
•Made up data: without correcting for scattering of bulk solvent, mean |Fobs| and |Fcalc| agree perfectly at low resolution. This makes no physical sense.

Model Validation

We’ll assume that you are both competent and diligent, and build and refine good models. But what about assessing a report in the literature? In assessing the validity of a model, and the conclusions that are drawn from it, you should consider ...

1. The resolution.
2. The general quality of the experimental data (e.g. merging R-factors, various phasing statistics)
3. The agreement between model and experimental data (e.g. R-factor, free R-factor, mean phase errors, RMSDs)

Even doing this it can sometimes be tough to tell if there are systemic problems with a structure. Also “global” statistics can easily obscure “local” problems (e.g.
if the side chain you are really interested in is modeled incorrectly, this will hardly be reflected in the global R-factor). There’s no substitute for looking at the electron density maps.

A variety of other procedures exist for detecting inconsistencies, and likely trouble spots in models. Some rely on atom packing statistics, others on hydrogen bonding patterns. Here we will consider just one old, but fundamental tool ... the Ramachandran plot.

The Ramachandran plot

The Ramachandran plot is a way to visualize the torsion angles phi (ϕ) and psi (ψ) of the polypeptide backbone. Together these two angles effectively define the backbone conformation.

G.N. Ramachandran: see Subramanian and Subramanian. Nat Struct Biol (2001) vol. 8 (6) pp. 489-91
http://wiki.cmbi.ru.nl

The main-chain torsion angles phi and psi are generally not restrained in refinement. However the distribution of these angles in the Ramachandran plot is quite restricted, both in theory and in practice. The Ramachandran plot is therefore a useful indicator of the quality of a structure.

Here’s what the Ramachandran plot from a “good” model might look like (Okay ... this is one of mine). A Ramachandran plot with lots of “disallowed” values for phi and psi is a sure sign that something is awry.

[Figure: Ramachandran plot with the beta-strand and alpha-helical backbone conformations labelled.] Red regions are highly favored. Brown and yellow regions are much less favored. The rest is highly disfavored.
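The two angles plotted in a Ramachandran diagram can be computed directly from backbone coordinates. The Python sketch below assumes the usual definitions (phi is the torsion about N-Cα defined by C of the previous residue, N, Cα, C; psi is the torsion about Cα-C defined by N, Cα, C, N of the next residue). The function names `dihedral` and `phi_psi` are illustrative and not taken from any particular package, and coordinates are plain (x, y, z) tuples in Å.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond (cis = 0, trans = 180)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the two outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

def phi_psi(c_prev, n, ca, c, n_next):
    """Backbone torsions for one residue from five backbone atom positions."""
    phi = dihedral(c_prev, n, ca, c)
    psi = dihedral(n, ca, c, n_next)
    return phi, psi
```

Applying `phi_psi` to every residue that has both neighbours gives the set of points shown in the plot. Deciding which regions count as favored or disallowed requires reference distributions that this sketch does not attempt to encode.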