The Construction, Refinement, and
Assessment of Atomic Models
7.1 The interpretation of electron density maps and the role of resolution.
7.2 The Refinement of Atomic Models
7.2.1 What is “Refinement”
7.2.2 Stereochemical restraints as additional observations
7.2.3 Reducing the complexity of the model at low resolution
7.2.4 Determination of the best model parameters
7.2.5 Model Building and refinement as an iterative process.
7.2.6 Measures of agreement between model and observations
7.3 Ordered solvent structure, and scattering by the bulk solvent.
7.4 Structure Validation: Judging the Quality and Utility of a Model
Wednesday, 1 April 15
The resolution limit
The resolution limit is related to the maximum scattering angle 2θmax of the
observed data.
[Figure: incident X-rays strike the crystal; the scattered X-rays reach the detector at an angle 2θ to the incident beam.]
The quantity 1/|s| = λ/2sinθ is termed the resolution
The resolution limit = λ / 2sinθmax
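As a worked example of the relation above, the following sketch computes the resolution limit from the wavelength and the maximum scattering angle. The function name and the inputs (a Cu Kα wavelength of 1.5418 Å and 2θmax = 45°) are illustrative assumptions, not values taken from the lecture.

```python
import math

def resolution_limit(wavelength, two_theta_max_deg):
    """Return lambda / (2 sin(theta_max)) in the units of the wavelength."""
    theta_max = math.radians(two_theta_max_deg / 2.0)  # theta is half the scattering angle
    return wavelength / (2.0 * math.sin(theta_max))

# Assumed example: Cu K-alpha radiation, data recorded out to 2*theta = 45 degrees
d_min = resolution_limit(1.5418, 45.0)
print(f"resolution limit = {d_min:.2f} Å")
```

Collecting data to a larger 2θmax, or using a shorter wavelength, gives a numerically smaller (better) resolution limit.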
The importance of the resolution limit is that it determines which features
can be resolved in electron density maps. This is best illustrated by
example …
The importance of
resolution
Resolution determines what
kind of model it will be possible
to build, and how good that
model can ultimately be.
We will not discuss model building in
great detail, as that’s best learned in a
practical setting.
Note:
With the molecular replacement
method, you’ll already have a pretty good
starting model.
With the isomorphous replacement and
anomalous scattering methods, you’ll
have only an electron density map. The
model will need to be built from scratch.
If you’re working at high resolution, with
good experimental phases, this process
has been effectively automated.
From Blow (2002).
The importance
of resolution... A
second example
From Cantor and Schimmel (1980).
4Å
N.B: Since the projection of the
structure is centro-symmetric, there are
only two choices for the phase of each
spot.
2Å
1Å
The Interpretation of Electron Density Maps:
Model Building
We’ll focus on Proteins. Successful interpretation of experimentally-phased
electron density maps typically proceeds through:
A) Recognition of secondary structural elements (α-helices and β-strands). This helps establish the directionality and topology of the polypeptide
chain.
B) The assignment of the amino acid sequence, through recognition of
characteristic side chain density.
You may have auxiliary information which will assist in model building
(e.g. if SeMet substitution / MAD phasing has been used, you will know the location of the Met
residues, since the selenium atoms are located during phasing).
Identification of α-helices
Carbonyl oxygens point toward the C-terminal end of the helix
Look for carbonyl bumps. These generally become visible at resolutions better than 3 Å.
[Figure: α-helix density with residues numbered 1 to 10; the C-terminus is at residue 10.]
Adapted from a slide prepared by Mike Sawaya, UCLA.
Identification of α-helices
Cβ point toward the N-terminal end of the helix ... like the
branches of a Christmas tree.
[Figure: α-helix density with each Cβ position labelled; the N-terminal end is marked.]
Adapted from a slide prepared by Mike Sawaya, UCLA.
Identification of α-helices
Helices viewed from two different perspectives
[Figure: the same helix in two views related by a 90° rotation, one viewed down the helical axis and one viewed perpendicular to the helical axis.]
Adapted from a slide prepared by Mike Sawaya, UCLA.
Identification of α-helices
The hole through the center of a helix is a most distinctive feature of α-helix density.
Often it is easier to recognize helical density when
viewed down the helix axis, because of the distinctive hole
through the center of the helix. This is generally visible
at resolutions better than 3 Å.
[Figure panels: viewed down the helical axis; viewed perpendicular to the helical axis.]
Adapted from a slide prepared by Mike Sawaya, UCLA.
Identification of β-strands
β-strands viewed from different perspectives
From this perspective, side chains of successive residues alternate in and out of the plane of the page.
[Figure: a β-strand with residues numbered 1 to 10 from the N-terminus to the C-terminus, shown in two views related by a 90° rotation.]
From this perspective, side chains of successive residues alternate up and down. Often it is easiest to
recognize a β-strand by this distinctive zig-zag pattern.
Adapted from a slide prepared by Mike Sawaya, UCLA.
Identification of β-strands
Be sure to view both perspectives when modeling a β-strand.
When viewed using the zig-zag perspective, both orientations of the strand appear to fit the electron
density OK.
Correct
Incorrect
But, viewed perpendicular to the zig-zag perspective, it becomes clear that only one direction of the strand
fits the carbonyl bumps in the electron density.
Adapted from a slide prepared by Mike Sawaya, UCLA.
Identification of β-strands
Strands generally don’t occur in isolation, but form part of a β-sheet. In maps
calculated with error-ridden phases, electron density in β-sheets has a
tendency to connect across the strands, which makes building sheet
structures more challenging than building helices.
http://www-structmed.cimr.cam.ac.uk/Course/Fitting/fittingtalk.html
Sequence assignment
While most amino acids have distinctive shapes, some are isosteric.
When in doubt, consider the protein environment.
Adapted from a slide prepared by Mike Sawaya, UCLA.
Choice of asymmetric unit
As was noted in Lecture 2, the selection
of the asymmetric unit is arbitrary.
When there is non-crystallographic
symmetry (multiple copies of the
molecule in the asymmetric unit) it is
possible to make choices for the
asymmetric unit that obscure
biologically relevant interactions.
Here’s an example →
When you build a model de novo, and
there is non-crystallographic symmetry
present, it pays to look carefully at
what’s in your asymmetric unit, and
consider if there are more sensible
choices.
Figure 2
Four independent protein protomers in the asymmetric unit of the structure 1woc
as presented in the PDB (a) and after regrouping (b), when it becomes apparent
that this structure consists of two similar dimers.
Dauter, Z. (2013). Placement of molecules in (not out of) the cell. Acta
Crystallogr D 69, 2–4
Refinement of models
•Once you build a model, it needs to be refined against the experimental data.
•Refinement is the process of arriving at the “best” model parameters given the experimental observations.
•This is an optimization problem and involves some fairly heavy-duty mathematics and statistics.
•Still, we need some insight into this mysterious process. Let’s start by making sure we’re clear on the nature of the model, the nature of the observations, and the connection between them.
In protein crystallography …
The model is generally a collection of atoms, each defined by
1. An atom type (which defines the appropriate X-ray scattering factors)
2. A position (coordinates x, y, z)
3. A B factor, providing an estimate of an atom’s vibration about the mean position
4. An occupancy (between 0 and 1)
These numbers form the principal parameters of the model. In refinement we
seek to adjust x, y, z, B and (sometimes) occupancy to achieve better agreement with ...
In protein crystallography …
The primary observations, which are
1. The experimentally measured Intensities I(hkl), which we convert into
Structure Factor Amplitudes |F(hkl)|
2. The experimentally measured phases α(hkl), if we have them.
The model is connected to the observations by the
structure factor equation
\[ F(h,k,l) = \sum_{j=1}^{n} f_j \exp\left[ 2\pi i \,(h x_j + k y_j + l z_j) \right] \]
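The structure factor equation can be evaluated directly for a small set of atoms. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, coordinates are fractional, and each scattering factor f_j is treated as a single constant (in practice it depends on resolution, B factor and occupancy).

```python
import numpy as np

def structure_factor(hkl, f, xyz_frac):
    """Sum f_j * exp[2*pi*i*(h*x_j + k*y_j + l*z_j)] over all atoms j."""
    hkl = np.asarray(hkl, dtype=float)
    f = np.asarray(f, dtype=float)
    xyz_frac = np.asarray(xyz_frac, dtype=float)
    phase_angles = 2.0 * np.pi * (xyz_frac @ hkl)  # one phase angle per atom
    return np.sum(f * np.exp(1j * phase_angles))

# Assumed toy model: two equal atoms related by a centre of symmetry
f = [6.0, 6.0]
xyz_frac = [[0.1, 0.2, 0.3], [-0.1, -0.2, -0.3]]
F = structure_factor((1, 0, 0), f, xyz_frac)
print("amplitude:", abs(F))
print("phase (degrees):", np.degrees(np.angle(F)))
```

Because the toy model is centrosymmetric, the imaginary part cancels and the phase can only be 0° or 180°, which is the situation noted earlier for the centrosymmetric projection.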
A big problem with refinement
One difficulty with refinement is that we often lack sufficient experimental
observations to meaningfully adjust the parameters of the model.
To explain …
•In a medium-sized protein there are about 2500 atoms.
•With four parameters for each atom (x, y, z, B) there will be 10000 parameters in the model.
•For such a protein, a diffraction data set to 2.5 Å resolution might contain 15000 measured intensities.
Because of the complicated and non-linear mathematical relationship between the structure factors
and the model, refinement will not produce sensible results unless we have many
more observations than parameters.
Either we must simplify the model, or increase the number of observations.
Often we increase the number of “observations”, by incorporating known information about
protein structure …
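The arithmetic behind this argument is simple enough to write down. The sketch below uses only the numbers quoted on this slide; the function name is an assumption made for illustration.

```python
def observations_per_parameter(n_atoms, n_reflections, params_per_atom=4):
    """Ratio of measured intensities to refinable parameters (x, y, z, B per atom)."""
    return n_reflections / (n_atoms * params_per_atom)

ratio = observations_per_parameter(n_atoms=2500, n_reflections=15000)
print(ratio)  # 1.5
```

A ratio of 1.5 is far from "many more observations than parameters", which is why extra information has to be brought in.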
Stereochemical restraints as additional
observations
Tronrud (2004) Acta Cryst D60, 2156-2168
Stereochemical restraints in a dipeptide. This figure shows the bonds, bond angles and torsion angles
for the dipeptide Ala-Ser. Black lines indicate bonds, red arcs indicate bond angles and blue arcs
indicate torsion angles. The values of the bond lengths and bond angles are, to the precision required
for most macromolecular-refinement problems, independent of the environment of the molecule
and can be estimated reliably from small-molecule crystal structures. The values of most torsion
angles are influenced by their environment and, although small-molecule structures can provide
limits on the values of these angles, they cannot be determined uniquely without information specific
to this crystal.
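One common way to treat a restraint as an extra "observation" is to add a weighted squared deviation from the target value to the function being optimized. The sketch below shows this for a single bond length; the function name, the coordinates, the target length and the standard deviation are all illustrative assumptions, not values from Tronrud (2004).

```python
import numpy as np

def bond_restraint_residual(xyz_a, xyz_b, ideal_length, sigma):
    """Squared, sigma-weighted deviation of a model bond length from its target."""
    xyz_a = np.asarray(xyz_a, dtype=float)
    xyz_b = np.asarray(xyz_b, dtype=float)
    model_length = np.linalg.norm(xyz_a - xyz_b)
    return ((model_length - ideal_length) / sigma) ** 2

# Hypothetical bond modelled at 1.37 Å with an assumed target of 1.33 Å, sigma 0.02 Å
penalty = bond_restraint_residual([0.0, 0.0, 0.0], [1.37, 0.0, 0.0],
                                  ideal_length=1.33, sigma=0.02)
print(penalty)
```

Each restrained bond, angle or plane contributes one such term, so the restraints act like additional observations alongside the diffraction data.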
Hopefully you can “see” how this works
2.0 Å map. Atoms are not well defined.
Need additional stereochemical information
to refine atomic positions
1.0 Å map. Atoms are well defined. Can
refine atomic positions without
stereochemical restraints
Adapted from Cantor and Schimmel (1980).
Simplifying the model: Reducing the number
of model parameters
Even incorporating stereochemical restraints, refinement of an atomic model
can still be problematic, especially as the resolution becomes worse than 3.0 Å.
Crystallographers must change the way the model is parameterized to make
refinement possible. Here are two procedures in common use ...
•Exploit non-crystallographic symmetry: If there are several copies of
a molecule in the asymmetric unit they can be restrained to be similar,
or constrained to be identical. This dramatically reduces the number of
model parameters.
•Employ rigid body refinement: At quite low resolution (worse than about 4 Å) it
becomes next to impossible to refine individual atomic positions. Yet it is still
possible to meaningfully refine the position and orientation of larger structural
units (helices, strands, entire protein domains), which may yield useful biological
information.
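To make the rigid body idea concrete, the sketch below moves a group of atoms as a single unit. It is a simplified illustration, not the parameterization used by any particular program: a general rigid body has six parameters (three rotations, three translations), whereas this hypothetical function uses one rotation angle about z plus a translation for brevity. The coordinates are invented.

```python
import numpy as np

def rigid_body_transform(xyz, angle_deg, shift):
    """Rotate a group of atoms about z through its centroid, then translate it."""
    xyz = np.asarray(xyz, dtype=float)
    a = np.radians(angle_deg)
    rot = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
    centroid = xyz.mean(axis=0)
    return (xyz - centroid) @ rot.T + centroid + np.asarray(shift, dtype=float)

def pair_distances(xyz):
    """All interatomic distances within the group."""
    return np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)

# Invented four-atom fragment (Å)
group = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.5, 1.5]])
moved = rigid_body_transform(group, angle_deg=30.0, shift=[2.0, -1.0, 0.5])

# Internal geometry is untouched; only position and orientation change
print(np.allclose(pair_distances(group), pair_distances(moved)))
```

However many atoms the group contains, the number of refined parameters stays fixed, which is what makes the approach usable when the data are sparse.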
What do we mean by the best model
parameters?
So we have observations, we have a model, and we have an equation that
connects them. How do we “refine” the model to arrive at the best set of model
parameters?
Most refinement packages now employ a branch of statistical theory termed
Maximum Likelihood. The best model is the one most
consistent with the observations.
•Consistency is measured statistically, by the probability that the observations would be made, given the current model. This is termed the Likelihood.
•If the model is changed to make the observations more probable, the model gets better and the Likelihood goes up.
•To calculate the relevant probabilities, we need to consider the errors ... both in the observations and in the model itself.
•The way errors in a real space model get propagated into Fourier space is not intuitive, so let’s take a look ...
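Before turning to error propagation, the notion of likelihood in the bullets above can be illustrated with a toy calculation. The sketch assumes independent Gaussian errors with a known standard deviation, which is a deliberate simplification: refinement programs use probability distributions appropriate to structure factors. All numbers and names are invented.

```python
import math

def gaussian_log_likelihood(observed, predicted, sigma):
    """Log of the probability of the observations, given the model's predictions."""
    norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((o - p) / sigma) ** 2 - norm
               for o, p in zip(observed, predicted))

observed = [10.2, 4.9, 7.1]       # invented "measurements"
poor_model = [12.0, 3.0, 9.0]     # predictions from a poor model
better_model = [10.0, 5.0, 7.0]   # predictions from an improved model

ll_poor = gaussian_log_likelihood(observed, poor_model, sigma=0.5)
ll_better = gaussian_log_likelihood(observed, better_model, sigma=0.5)
print(ll_poor, ll_better)
```

The improved model makes the observations more probable, so its log-likelihood is higher.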
The atomic structure factor, when position
and scattering are uncertain
To get a feeling for how this plays out let’s consider what the probability distribution looks like for
the structure factor of a single atom, when we assume that there are Gaussian errors in its position
(specified by X,Y,Z), and in its scattering (specified by the element, the occupancy, and the B-factor)
Errors for an atomic structure factor. (a) An atom has variation
in position (indicated by purple arrow) and in scattering
(indicated in green concentric circles). (b) The variation in the
atom's position and scattering are Gaussian. (c) The atom at
its mean position with its mean scattering has a structure factor
Fatom (shown with a black vector). Variation in the atom's
position corresponds to variation in the phase of Fatom (shown
with a purple arrow) and variation in the scattering corresponds
to variation in the length of Fatom (shown with a green arrow).
(d) The distributions of the structure factors owing to variation
in the atom's position and scattering combine to give a
boomerang-shaped structure-factor distribution (indicated with
black contours). Since the distribution of structure factors is
symmetric about Fatom, the average structure factor is shorter
than Fatom (by a fraction d, where 0 < d < 1) but in the same
direction as Fatom (dFatom).
From McCoy. Liking likelihood. Acta Crystallogr D Biol Crystallogr (2004) vol. 60 (Pt 12 Pt 1)
pp. 2169-83
Note this key point: The structure factor calculated from the most probable model is
not the most probable value for the structure factor !!! Ouch.
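This key point can be checked numerically. The sketch below is a one-dimensional simulation under stated assumptions: only the position of the atom is uncertain (Gaussian, in fractional units), the scattering is held fixed, and the values of h, the mean position and the standard deviation are invented. For this case the expected shrinkage factor is d = exp(-2π²h²σ²).

```python
import numpy as np

rng = np.random.default_rng(0)

h = 5              # reflection index (1D)
x_mean = 0.12      # mean fractional coordinate (assumed)
sigma_x = 0.02     # positional standard deviation (assumed)
f = 1.0            # scattering held fixed in this sketch

# Structure factor of the atom at its mean position
F_atom = f * np.exp(2j * np.pi * h * x_mean)

# Average structure factor over many Gaussian-distributed positions
x_samples = x_mean + rng.normal(0.0, sigma_x, size=200_000)
F_mean = np.mean(f * np.exp(2j * np.pi * h * x_samples))

d_simulated = abs(F_mean) / abs(F_atom)
d_expected = np.exp(-2.0 * np.pi**2 * h**2 * sigma_x**2)
print(d_simulated, d_expected)
```

The averaged structure factor keeps the direction of Fatom but is shorter, in line with the dFatom result described in the caption.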
Mathematical optimization methods
Incorporating these kinds of probability distributions, one can build up a
likelihood function. That function can then be maximized, by shifting the
model parameters. This is a very difficult optimization problem, and we will
not consider the details of how it’s done. But that’s what’s going on under
the hood of your refinement program.
The whole process is a generalization of the method of non-linear least
squares - which you’ve almost certainly used to fit simple non-linear
functions to experimental data in other contexts.
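For readers who want a reminder of what non-linear least squares looks like, here is a minimal Gauss-Newton sketch fitting y = a·exp(-b·x). It is a toy example with invented, noise-free data and says nothing about how a refinement program is implemented; the function name and starting values are assumptions.

```python
import numpy as np

def gauss_newton_exp(x, y, a0, b0, n_iter=20):
    """Fit y = a * exp(-b * x) by repeatedly solving a linearized least-squares problem."""
    a, b = a0, b0
    for _ in range(n_iter):
        decay = np.exp(-b * x)
        residual = y - a * decay
        # Derivatives of the model with respect to a and b
        jacobian = np.column_stack([decay, -a * x * decay])
        shift, *_ = np.linalg.lstsq(jacobian, residual, rcond=None)
        a, b = a + shift[0], b + shift[1]
    return a, b

x = np.linspace(0.0, 4.0, 30)
y = 2.0 * np.exp(-0.7 * x)          # invented data with a = 2.0, b = 0.7
a_fit, b_fit = gauss_newton_exp(x, y, a0=1.5, b0=0.5)
print(a_fit, b_fit)
```

Each cycle computes shifts to the parameters from the current residuals and derivatives, which is the same basic pattern as a cycle of refinement.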
Model Building and Refinement as an
iterative process
Refinement procedures do not eliminate all the errors in a model.
Some problems are too severe for refinement to fix. For example a
side chain initially built in a wrong conformation will rarely be
corrected, since that would involve a concerted movement of all the
side chain atoms. The refinement process becomes trapped in a local
minimum.
If you do something like
this, refinement will not help
you !!
From Rupp (2010)
Model Building and Refinement as an
iterative process
Local minima ...
From Rupp (2010)
So refinement needs to be interspersed with manual inspection and correction
of the model. A crystallographer can see patterns in the electron density maps
indicating the need for large shifts, outside the range of the refinement procedure
(e.g. flipping a side chain into an alternate conformation). As the model (and
derived phases) improve, the electron density maps become clearer, allowing
troublesome regions to be reinterpreted.
Measures of agreement
Regardless of the details of the refinement process, we need some statistics to
measure the agreement between model and observations. Those in general use:
•Agreement of Structure Factor Amplitudes is assessed with the R-factor.
•Agreement of Phases is assessed with the Mean Phase Error (= Mean Phase Residual).
•Agreement of Geometry is assessed with Root Mean Square Deviations (RMSDs).
The R-factor and the free R-factor
The usual statistic for calculating agreement between the observed and calculated
Structure Factor Amplitudes is the R-factor
\[ R = \frac{\sum_{hkl} \big|\, |F_{obs}(hkl)| - |F_{calc}(hkl)| \,\big|}{\sum_{hkl} |F_{obs}(hkl)|} \]
Decreases in the R-factor imply improved agreement between
the model and the data.
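The R-factor is straightforward to compute once observed and calculated amplitudes are on the same scale. The sketch below follows the formula as written on this slide and assumes any scale factor has already been applied; the amplitudes are invented.

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """R = sum| |Fobs| - |Fcalc| | / sum|Fobs| over all reflections supplied."""
    f_obs = np.abs(np.asarray(f_obs, dtype=float))
    f_calc = np.abs(np.asarray(f_calc, dtype=float))
    return np.sum(np.abs(f_obs - f_calc)) / np.sum(f_obs)

# Invented amplitudes for three reflections
r = r_factor([100.0, 50.0, 25.0], [90.0, 55.0, 30.0])
print(round(r, 3))
```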
Typical R-Factors for fully refined protein structures are 0.10 - 0.25. Completely
wrong structures (all the atoms in incorrect places), will generally yield R-factors
> 0.50
The R-factor and the free R-factor
It is now standard practice in protein crystallography to omit a small (~5%)
and randomly-selected fraction of the data from all refinement procedures.
These reflections, and only these, are used to calculate the free R-factor
(Rfree). The remainder - the data that are used in refinement - are used to
calculate the working R-factor (Rwork)
→Used to calculate Rwork
→Used to calculate Rfree
Adapted From Rupp (2010)
The free R-factor was introduced and popularized by Axel Brunger in the early
1990s.
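Setting aside the free reflections amounts to flagging a random subset once and keeping those flags fixed for the rest of the project. The sketch below is one assumed way to do this with the standard library; the function name, the seed and the number of reflections are illustrative.

```python
import random

def assign_free_flags(n_reflections, free_fraction=0.05, seed=0):
    """Return a list of booleans: True marks a reflection reserved for Rfree."""
    rng = random.Random(seed)                       # fixed seed so the set is reproducible
    n_free = round(n_reflections * free_fraction)
    free_indices = set(rng.sample(range(n_reflections), n_free))
    return [i in free_indices for i in range(n_reflections)]

flags = assign_free_flags(15000)
n_free = sum(flags)
n_work = len(flags) - n_free
print(n_free, n_work)
```

Reflections flagged True would be used only for Rfree, and the rest for refinement and Rwork.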
The R-factor and the free R-factor
Monitoring the free R-factor, calculated from observations that the
refinement procedure does not “know about”, helps prevent spurious
adjustments to the model that decrease the working R-factor but that, by other
criteria, do not objectively improve the model. In statistics, this is termed cross-validation. Ideally, Rwork and Rfree should be close to one another (differing by
less than 0.05).
From Rupp (2010)
Agreement between observed and calculated phases is generally reported
as the mean phase error (or mean phase residual):
\[ \text{mean phase error} = \frac{\sum_{hkl} \left| \alpha_{OBS}(hkl) - \alpha_{CALC}(hkl) \right|}{n_{hkl}} \]
(Mean phase errors of 90° are obtained when comparing random sets of non-centrosymmetric phases)
Agreement between model geometry and ideal geometry is generally
assessed with a Root Mean Square Deviation. For any geometric
parameter p:
\[ \text{RMSD} = \sqrt{ \frac{1}{n} \sum_{n} \left( p_{MODEL} - p_{IDEAL} \right)^2 } \]
For a well-refined protein structure, RMSD on bond lengths will generally be < 0.02 Å , and RMSD on
bond angles < 2°
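Both statistics are easy to compute. In the sketch below the phase differences are wrapped into the range -180° to +180° before averaging; that wrapping is an assumption on my part, chosen because it is what makes random phases give a mean error of 90°. The function names and numbers are invented.

```python
import numpy as np

def mean_phase_error(alpha_obs_deg, alpha_calc_deg):
    """Mean absolute phase difference in degrees, taking the shorter way round the circle."""
    diff = np.asarray(alpha_obs_deg, dtype=float) - np.asarray(alpha_calc_deg, dtype=float)
    wrapped = (diff + 180.0) % 360.0 - 180.0
    return np.mean(np.abs(wrapped))

def rmsd(p_model, p_ideal):
    """Root mean square deviation between model and ideal values of a parameter."""
    delta = np.asarray(p_model, dtype=float) - np.asarray(p_ideal, dtype=float)
    return np.sqrt(np.mean(delta ** 2))

phase_err = mean_phase_error([10.0, 350.0], [350.0, 10.0])
bond_rmsd = rmsd([1.53, 1.33], [1.52, 1.33])   # invented bond lengths in Å
print(phase_err, bond_rmsd)
```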
Modeling of ordered solvent
In addition to the protein, models of X-ray scattering from biological crystals
must also include the contributions from the solvent which, if you recall, occupies around 50% of the volume of a typical protein crystal. It’s common for
some solvent molecules to be localized on the surface of a protein through
hydrogen bonding.
Here’s an example of
ordered water
molecules, localized on
the surface of the
protein through
hydrogen bonding.
Modeling of ordered solvent
These we can give an x, y, z and B, just
like the protein atoms. Strictly we
should refine the occupancy of the
water molecules. However B-factor
and occupancy are so highly
correlated that this is impossible at all
but the highest resolutions. Hence
these water molecules are treated as
“fully present” and their B-factor
incorporates the effects of variable
occupancy.
The majority of the solvent is not ordered in this fashion, but still contributes
appreciably to the X-ray scattering.
Modeling the scattering contribution of the
disordered (bulk) solvent.
If our model of X-ray scattering from the crystal neglects the “bulk solvent”, which fills the channels and
interstices between protein molecules in the crystal, and we calculate structure factor amplitudes from
the protein and ordered solvent alone, there will be large discrepancies between the observed and
calculated amplitudes at low resolution. Scattering from the bulk solvent was recognized even before the
first protein structures had been solved. By manipulating the electron density of the solvent, Lawrence
Bragg and Max Perutz were able to observe systematic changes in the intensity of the low order
diffraction data, and infer the approximate dimensions of haemoglobin. They had hoped this would help
them resolve the phase problem (it didn’t).
Perutz’s data ...
From these early experimental observations we
can deduce that scattering from the bulk solvent
reduces the amplitude of the low resolution
protein diffraction data.
Hence neglecting scattering from the bulk
solvent (i.e. placing the protein molecules “in a
vacuum”) will lead to calculated structure factor
amplitudes that are generally much too large at low resolution.
Modeling scattering from the bulk solvent.
First let’s partition the total electron density in the cell into two mutually exclusive parts - one
part due to the protein and the other part due to the bulk solvent
\[ \rho_{TOTAL}(\mathbf{r}) = \rho_{PROTEIN}(\mathbf{r}) + \rho_{SOLVENT}(\mathbf{r}) \]
Or equivalently, in “reciprocal space”, we can write
\[ F_{TOTAL}(\mathbf{h}) = F_{PROTEIN}(\mathbf{h}) + F_{SOLVENT}(\mathbf{h}) \]
(This works because the Fourier transform is a linear operator)
We can calculate the structure factors from the protein FPROTEIN(hkl) readily enough. But how do
we calculate FSOLVENT(hkl), the contribution from the bulk solvent? One way to proceed is to use
a real space modeling method, as introduced by Simon Phillips.
Real Space Method for modelling scattering
from the bulk solvent.
In this procedure
a) A grid is created, covering the unit cell.
b) The boundary between protein and solvent is defined.
c) All grid points in the solvent region are assigned a value of 1, while all those in the protein region are
assigned a value of 0. The resulting binary function is termed the solvent mask.
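Steps a) to c) can be sketched in a few lines. The example below makes several simplifying assumptions that are mine, not the lecture's: a cubic unit cell, fractional atomic coordinates, and a protein/solvent boundary defined by a single fixed radius around each atom. The function name and all numbers are illustrative.

```python
import numpy as np

def solvent_mask(atoms_frac, cell_edge, n_grid, radius):
    """Binary mask on an n_grid^3 grid: 1 = solvent, 0 = protein."""
    # a) grid covering the unit cell, in fractional coordinates
    axis = np.arange(n_grid) / n_grid
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1)

    mask = np.ones((n_grid, n_grid, n_grid), dtype=int)
    for atom in np.asarray(atoms_frac, dtype=float):
        delta = grid - atom
        delta -= np.round(delta)                    # use the nearest periodic image
        dist = np.linalg.norm(delta, axis=-1) * cell_edge
        # b) and c): points within `radius` of an atom belong to the protein
        mask[dist < radius] = 0
    return mask

# One invented atom at the centre of a 20 Å cubic cell
mask = solvent_mask([[0.5, 0.5, 0.5]], cell_edge=20.0, n_grid=20, radius=3.0)
print(mask[10, 10, 10], mask[0, 0, 0], mask.mean())
```

The mean of the mask is the solvent fraction of the cell for this toy model.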
Real Space Method for modelling scattering
from the bulk solvent.
If we Fourier transform the solvent mask we get FMASK(hkl) - the structure factors
of the mask
FSOLVENT(hkl), the structure factors of the solvent can then be readily calculated...
\[ F_{SOLVENT}(\mathbf{h}) = K_{sol} \exp\left( -B_{sol}\, \frac{\sin^2\theta}{\lambda^2} \right) F_{MASK}(\mathbf{h}) \]
where the parameter Ksol is the mean electron density of the solvent. The
bracketed resolution dependent multiplier (involving the parameter Bsol) is
introduced in order to blur the sharp boundary between the protein and solvent
regions. The two parameters Ksol and Bsol can be adjusted during refinement of the
model.
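The calculation described on this slide can be sketched with a discrete Fourier transform. The assumptions here are mine: a cubic cell (so that sin²θ/λ² = (h² + k² + l²)/(4a²)), an inverse FFT scaled by the cell volume to match the positive-exponent convention of the structure factor equation, and invented values for Ksol and Bsol.

```python
import numpy as np

def bulk_solvent_structure_factors(mask, cell_edge, k_sol, b_sol):
    """Ksol * exp(-Bsol * sin^2(theta)/lambda^2) * F_mask on the FFT index grid."""
    n = mask.shape[0]
    volume = cell_edge ** 3
    # ifftn uses exp(+2*pi*i ...) and divides by the number of grid points,
    # so multiplying by the volume approximates the integral over the cell
    f_mask = np.fft.ifftn(mask) * volume

    idx = np.fft.fftfreq(n, d=1.0 / n)              # integer indices 0, 1, ..., -1
    h, k, l = np.meshgrid(idx, idx, idx, indexing="ij")
    sin_sq_over_lambda_sq = (h**2 + k**2 + l**2) / (4.0 * cell_edge**2)

    return k_sol * np.exp(-b_sol * sin_sq_over_lambda_sq) * f_mask

# Sanity check with a cell that is entirely solvent (mask = 1 everywhere)
all_solvent = np.ones((8, 8, 8))
F_sol = bulk_solvent_structure_factors(all_solvent, cell_edge=10.0, k_sol=0.35, b_sol=46.0)
print(abs(F_sol[0, 0, 0]), abs(F_sol[1, 0, 0]))
```

For a featureless mask only F(000) is non-zero and equals Ksol times the cell volume; a real mask contributes mainly at low resolution because of the exponential damping.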
Fraud and The Bulk Solvent correction
In the past few years several fraudulent structures have been published in which the investigators
invented the structure and manufactured the X-ray diffraction data !! However, they didn’t do a very
good job. One of the incriminating pieces of evidence, which led to their undoing, was the failure to
correctly add the scattering contribution of the bulk solvent.
Rupp, B. Detection and analysis of unusual features in the structural model and structure-factor data of a birch pollen allergen. Acta Crystallogr F 68, 366–376 (2012).
[Figure 6 from Rupp, Acta Cryst. (2012). F68, 366–376. Left panels: real data (1fm4). Right panels: made up data (3k78).]
Bulk-solvent contribution analysis for 1fm4 and 3k78. The left panels depict the expected, nearly textbook-like behavior of a normal crystal structure like 1fm4. The top row shows the resolution-dependent behavior of Rfree when the bulk-solvent correction is included (solid lines) and when it is not included (dashed lines) in the R-value calculation. 1fm4 shows the expected increase of the low resolution R values in the absence of bulk-solvent correction, indicating that bulk-solvent scattering contributions are present in the observed data. Such is not the case for 3k78. Bottom row: the presence of bulk-solvent contributions also causes the low-resolution calculated structure factors (dashed line) to be higher than the observed ones (solid), which are appropriately attenuated by the disordered bulk solvent scattering contributions in 1fm4. There is no difference between F(obs) and F(calc) for 3k78, again indicating the absence of bulk-solvent scattering in the structure-factor data.
Real data: Without correcting for scattering of bulk solvent, mean |Fobs| and mean |Fcalc| diverge markedly at low resolution. |Fcalc| is very much too big at low resolution.
Made up data: Without correcting for scattering of bulk solvent, mean |Fobs| and mean |Fcalc| agree perfectly at low resolution. This makes no physical sense.
Model Validation
We’ll assume that you are both competent and diligent, and build and refine good models. But what
about assessing a report in the literature?
In assessing the validity of a model, and the conclusions that are drawn from it, you should consider ...
1. The resolution.
2. The general quality of the experimental data
(e.g. Merging R-Factors, various phasing statistics)
3. The agreement between model and experimental data
(e.g. R-factor, Free R-factor, Mean phase errors, RMSDs)
Even doing this it can sometimes be tough to tell if there are systemic problems with a structure. Also
“global” statistics can easily obscure “local” problems (e.g. if the side chain you are really interested in is
modeled incorrectly, this will hardly be reflected in the global R-Factor).
There’s no substitute for looking at the electron density maps. A variety of other procedures exist for
detecting inconsistencies, and likely trouble spots in models. Some rely on atom packing statistics, others
on hydrogen bonding patterns.
Here we will consider just one old, but fundamental tool ... the Ramachandran plot.
The Ramachandran plot
The Ramachandran plot is a way to visualize the torsion angles,
phi (ϕ) and psi (ψ) of the polypeptide backbone. Together these
two angles effectively define the backbone conformation.
G.N. Ramachandran
see Subramanian and Subramanian. Nat
Struct Biol (2001) vol. 8 (6) pp. 489-91
http://wiki.cmbi.ru.nl
The main-chain torsion angles phi and psi are generally not restrained in
refinement. However the distribution of these angles in the Ramachandran plot is
quite restricted, both in theory and in practice. The Ramachandran plot is
therefore a useful indicator of the quality of a structure.
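Each point on the Ramachandran plot comes from two torsion angle calculations per residue: phi is defined by the atoms C(i-1), N(i), Cα(i), C(i) and psi by N(i), Cα(i), C(i), N(i+1). The sketch below computes a torsion angle from four positions; the function name and the test coordinates are illustrative assumptions.

```python
import numpy as np

def torsion_angle(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in the range -180 to +180."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of the outer bonds perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

# Invented coordinates: a right-angle twist and an eclipsed arrangement
angle = torsion_angle([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1])
eclipsed = torsion_angle([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1])
print(angle, eclipsed)
```

Applying the function to the backbone atoms of every residue gives the (phi, psi) pairs that are plotted.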
The Ramachandran plot
Here’s what the Ramachandran plot from a “good” model
might look like (Okay ... this is one of mine). A Ramachandran
plot with lots of “disallowed” values for Phi and Psi is a sure
sign that something is awry.
beta-strand backbone
conformation
alpha-helical backbone
conformation
Red regions are highly
favored.
Brown and yellow regions
are much less favored.
The rest is highly
disfavored.