Hybrid Computational Modelling and Single-Molecule Imaging of DNA Structure

Robert D. M. Gray
CoMPLEX, University College London
Supervisors: Dr. Bart Hoogenboom and Dr. Maya Topf
3571 words
19 January 2015
1 Introduction
1.1 DNA Structure in Gene Expression
DNA encodes, in the form of genes, the genetic information for the development and biological functioning of all living organisms. The coherent use of this information
by gene expression is clearly crucial to such functioning and is regulated at a number of
levels. An understanding of the complex mechanisms that regulate gene expression is a
“grand challenge for biophysics and epigenetics”.1
Transcriptional regulation is only one of many levels of control of gene expression but
it is the first in the path from DNA to functioning protein and as such is of “paramount
importance”.2 In general it is coordinated by a large array of regulatory proteins which
control the transcriptional ability of RNA polymerases by recognising and binding to
certain stretches of DNA. The binding affinity of these proteins is particularly important
as a parameter for transcriptional control.
This binding and recognition may typically be thought to depend on simple sequence,
but it is not valid to consider DNA as a one-dimensional monotonous code. DNA can
have complicated and varied 3D or tertiary structure, in analogy to that of proteins.
These structures and by extension the physical properties of DNA that govern their
formation are of great importance in transcriptional regulation in a number of ways.
For example, the smallest known genome, that of Mycoplasma genitalium, does not
appear to contain sufficient chemical information in its sequence to explain its global
control, and it has been suggested that the physical properties of its DNA store the
necessary information.3
More generally in eukaryotes DNA is packed into chromatin, which strongly affects
transcriptional regulation by control of the RNA polymerase’s access to genes. As such,
certainly the way in which DNA is packaged and how it folds provides an important
layer of transcriptional regulation.2 As well as affecting the ability of proteins to bind to
DNA in this way, the small-scale physical properties of the DNA polymer can directly
influence the protein binding affinity.4,5
Figure 1: Model of a DNA minicircle in three states.8
Finally, it is thought that the mechanical properties of the DNA double helix depend
on sequence itself. The extent to which this is true has been debated,1 but certainly
there is a complex relationship between the DNA sequence, its small-scale mechanical
properties, and the binding affinity of the proteins involved in transcriptional regulation.
A full understanding of the mechanical properties of double-helix DNA is thus important to an understanding of transcriptional regulation. It may be surprising that this
understanding is elusive, given the long-standing knowledge of the molecular structure.
This may be because such knowledge is based on ensemble measurements, whereas the
single-molecule resolution necessary for consideration of the mechanical properties has
only recently become possible.
1.2 DNA Minicircles
One model system which is used to investigate these mechanical properties is that of
DNA minicircles. These are simply small rings of double-stranded DNA, hundreds of
base pairs long. They can exhibit a large variety of structural formations such as kinks,
denaturation bubbles and wrinkled conformations,6 and can mimic nucleosomal conformations,7 which makes them a useful model system.
They also demonstrate DNA supercoiling, a conformation where DNA winds around
itself forming coiled shapes. DNA in nature is generally in a supercoiled state, and the
maintenance of such states can be important for transcription, again demonstrating the
importance of DNA tertiary structure.2 Figure 1 shows a model of a DNA minicircle in
three different states, the rightmost of which is supercoiled.
The minicircles we are studying are manufactured to be 339 base pairs long, which
means they typically have a radius of around 20nm. To consider the structures of
DNA minicircles I have been provided with data in the format of Protein Data Bank (PDB)
files produced by the Computational Biophysics Group at the University of Leeds.8 A
PDB file contains the information needed to describe the structure of a molecule, including
the position and type of each atom. Those I am using are models of the structure of
minicircles.
Figure 2: Two atomic force micrographs of DNA minicircles adsorbed on a mica substrate
in NiCl2 solution. Pictures like these were my raw data.
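The quoted radius of around 20nm can be checked with a short calculation. The sketch below assumes a perfectly circular, relaxed minicircle and a B-DNA rise of 0.34nm per base pair; that value is a standard textbook figure, not one stated in this report, and the function name is purely illustrative.

```python
import math

RISE_PER_BP_NM = 0.34  # assumed B-DNA rise per base pair (not stated in the report)


def minicircle_radius_nm(n_bp, rise_nm=RISE_PER_BP_NM):
    """Radius of an idealised circular minicircle of n_bp base pairs."""
    contour_length_nm = n_bp * rise_nm        # total length of the double helix
    return contour_length_nm / (2.0 * math.pi)  # circumference = 2 * pi * radius


radius_339 = minicircle_radius_nm(339)
print(f"Estimated radius: {radius_339:.1f} nm")
```

Under these assumptions the estimate comes out a little over 18nm, in line with the rounded figure above.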
Real minicircles have varied conformations, but from these models I can get a good idea of
what the typical structure of a DNA minicircle should be. I want to develop an idea
of the actual arrangement of this structure for an individual minicircle. This would be
novel and very useful in investigating the way tertiary structure affects
gene expression, as discussed in the first section.
1.3 AFM
To do this we need a method of accurately imaging individual DNA minicircles. There
are a number of methods for this, one of which is atomic force microscopy (AFM).
AFM works by “feeling” a surface with a very fine probe or tip. At the most basic
level, the tip interacts with molecules on a surface which causes it to deflect. This
deflection is measured and in this way an image of the surface is built up. The position of
the tip is precisely controlled by means of piezoelectric crystals and is typically detected
with a laser. Altogether this can allow resolution on sub-nanometer scales.9
As well as this excellent resolution, some advantages of AFM are that it can be applied in liquid environments and images individual molecules rather than averaging over
many as techniques such as X-ray crystallography do. These attributes are all necessary
for consideration of the small-scale mechanics of molecules such as DNA minicircles.
The diameter of the DNA double helix is approximately 2nm and the features of the
helix are correspondingly smaller. However, resolution at this level on DNA with AFM
has been shown.10 I have been provided with AFM data of minicircles for my work and,
although it is difficult to resolve the helix structure in these, there is sufficient resolution
to begin an analysis of their structure. Examples of AFM data of minicircles are shown
in Figure 2. These data consist of values of the height z at each point on the x-y scan.
My project was then to use this AFM data along with known minicircle sequence and
computational structures to try and develop a program which could begin to ascertain
the structure of individual minicircles.
Determining the structure would be extremely interesting in the context of transcriptional regulation as described above. The sequence of the minicircles is known so
with information about the structure we could study exactly how sequence and tertiary
structure are related. We could also look at how protein binding is affected by the
structure.
2 Methods
I used Python to write a program to make a guess at the structure of minicircles based
on AFM data. To visualise PDB files I used UCSF Chimera,11
and many of my figures were made using Wolfram Mathematica.
2.1 Simulated AFM Images of DNA
Firstly, I needed a script to create simulated AFM images so that I could apply this
to a chosen structure and compare with the real AFM image. I chose to apply this to
a structure in the form of hard spheres characterised by their position in x, y, z space
and their radii. In my model of AFM, the tip is also a hard object and when it comes
into contact with the spheres (the positions overlap) I consider contact to be made.
In real AFM of course there are numerous complex interactions of various range, but
this is a useful approximation. I also considered the AFM tip to be a sphere, again a
simplification but a logical first approximation.
My AFM scanning function raster scans across a chosen region with a
chosen step size. At each point in x-y space it finds the corresponding z which makes up
the scan. The simplest version of this was to lower my spherical “tip” with a given
radius R from a starting height $z_{\mathrm{start}}$ until “contact” was made with the sample atoms
in the form of spheres. That is, until the inequality $|\mathbf{x}_{\mathrm{tip}} - \mathbf{x}_i| \geq R + r_i$, which must hold for all atoms i, was no
longer satisfied, if this happened at all. The function therefore takes as arguments the
start and end x and y points and the step size as well as R.
This was effective but a little slow, as for each step in z it needs to evaluate the
inequality for each atom i. I sped things up a little by restricting the check to the atoms
in a local region around the tip. This was then able to produce simulated AFM scans of data in
the form of spheres.
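The hard-sphere contact model can be sketched as follows. This is not the script used in the project: instead of lowering the tip step by step, it uses the closed-form height at which a spherical tip first touches each sample sphere, which gives the same contact height under the same hard-sphere assumption. The function names, the tuple layout and the flat substrate at `z_floor` are all assumptions made for illustration.

```python
import math


def tip_height(x, y, spheres, tip_radius, z_floor=0.0):
    """Height of the tip apex when a spherical tip at (x, y) first makes contact.

    spheres is a list of (xi, yi, zi, ri) tuples. If no sphere is within reach,
    the tip is assumed to rest on a flat substrate at z_floor.
    """
    z_apex = z_floor
    for xi, yi, zi, ri in spheres:
        reach = tip_radius + ri                      # centre separation at contact
        d2 = (x - xi) ** 2 + (y - yi) ** 2           # squared lateral distance
        if d2 < reach * reach:
            # Tip-centre height at which |x_tip - x_i| = R + r_i
            z_centre = zi + math.sqrt(reach * reach - d2)
            z_apex = max(z_apex, z_centre - tip_radius)
    return z_apex


def raster_scan(spheres, x_start, x_end, y_start, y_end, step, tip_radius):
    """Return the scan as a list of rows (constant y), each a list of heights."""
    nx = int(round((x_end - x_start) / step)) + 1
    ny = int(round((y_end - y_start) / step)) + 1
    return [
        [tip_height(x_start + i * step, y_start + j * step, spheres, tip_radius)
         for i in range(nx)]
        for j in range(ny)
    ]


# A single sphere of radius 1 resting on the substrate, scanned with a tip of radius 1
demo_scan = raster_scan([(0.0, 0.0, 1.0, 1.0)], -2.0, 2.0, -2.0, 2.0, 1.0, 1.0)
```

Because each sphere only affects points within a lateral distance of R + r_i, the same localisation idea described above applies: spheres outside that reach can be skipped before any square root is taken.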
Figure 3: A visualisation (a), scans of the atomistic DNA model with tip radius 1Å (b)
and 5Å (c), and of the coarse-grained model with tip radius 5Å (d). All units are in Å.
2.2 Incorporating Real Structures
Now that I was able to form simulated AFM images, I wanted to apply this to real structures.
To do this I used the Biopython package, which allowed me to import PDB files into
Python. I extracted the atomic positions and converted them into the form that
my AFM simulation script accepts, namely spheres. For the radii of the atoms I used van der
Waals radii from WebElements. These are established from contact distances between
non-bonded atoms so are the relevant distances to use.
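A minimal sketch of this import step is given below. It uses Biopython's `PDBParser` to read the atoms and a small lookup table of van der Waals radii. The radii shown are the commonly quoted Bondi values and are an assumption here; they may differ slightly from the WebElements values used in the project. The function names, the fallback radius and the structure label are also illustrative.

```python
# Van der Waals radii in Å (Bondi values, assumed; not necessarily those used in the project)
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "P": 1.80}
DEFAULT_RADIUS = 1.70  # hypothetical fallback for any element not listed above


def atoms_to_spheres(atoms):
    """Convert (element, x, y, z) records into (x, y, z, radius) spheres."""
    return [
        (x, y, z, VDW_RADII.get(element.strip().upper(), DEFAULT_RADIUS))
        for element, x, y, z in atoms
    ]


def load_pdb_spheres(path):
    """Read every atom in a PDB file and return it as a hard sphere."""
    from Bio.PDB import PDBParser  # imported here so the helper above has no dependencies

    structure = PDBParser(QUIET=True).get_structure("minicircle", path)
    records = []
    for atom in structure.get_atoms():
        x, y, z = (float(c) for c in atom.coord)
        records.append((atom.element, x, y, z))
    return atoms_to_spheres(records)
```

The list returned by `load_pdb_spheres` has the same layout as the sphere list assumed by a hard-sphere scanning function, so the two steps can be chained directly.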
Figure 3a shows the structure of a small piece of DNA in PDB format visualised
with UCSF Chimera. Figures 3b and 3c show images of the same structure imported
into Python and run through my AFM simulator as described above with two different tip
radii. The effect of tip radius can be seen in that there is higher resolution with a smaller
radius and structures appear larger with a larger radius. This is also true of real
AFM.
2.3 Coarse Graining
Moving on, I could then incorporate the structures of DNA minicircles. I used Biopython
to import the PDB files into Python in the same way. However, these are quite large
(339 base pairs, 21564 atoms), so examining them on an atomistic level would have been
computationally expensive. I therefore developed a simple coarse-graining algorithm
to simplify things. Following Potoyan et al.12 I used a very simple coarse-grained model of
DNA, replacing each nucleotide with three beads corresponding to phosphate, sugar and
base groups.
For my purposes I again modelled each bead as a sphere, with radius taken from the
minimum of the Lennard-Jones potential used by Potoyan for the A and T bases, which
was 2.9Å. The radii should be different for different groups, but the model I based this
on was far more complicated and did not provide simple radii for the other cases. Given
the crudeness of this sphere method I did not think that small variations in what I took
as radius would make any difference to the result. I placed each bead at the centre of
mass of the corresponding atoms. Figure 3d shows a scan of a small piece of coarse
grained DNA. Few features are lost compared to the atomistic model, even with very
low tip radius. Using this method I was then able to generate simulated AFM scans of
whole minicircles.
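The coarse-graining step can be illustrated with the sketch below, which groups the atoms of one nucleotide into phosphate, sugar and base beads and places each bead at the centre of mass of its atoms. The rule used to assign atoms to groups (by PDB atom name) is an assumption for illustration and may not match the assignment used in the project; the single bead radius of 2.9Å is the value quoted above.

```python
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "P": 30.974}
BEAD_RADIUS = 2.9  # Å, the single radius used for every bead in the text

# Assumed grouping rule: phosphorus and its two non-bridging oxygens form the
# phosphate bead, primed atoms belong to the sugar, everything else to the base.
PHOSPHATE_NAMES = {"P", "OP1", "OP2", "O1P", "O2P"}


def bead_type(atom_name):
    name = atom_name.strip()
    if name in PHOSPHATE_NAMES:
        return "phosphate"
    if "'" in name or "*" in name:
        return "sugar"
    return "base"


def coarse_grain_nucleotide(atoms):
    """Map one nucleotide to beads.

    atoms is a list of (name, element, x, y, z). Returns a dict
    {bead_type: (x, y, z, radius)} with each bead at its centre of mass.
    """
    sums = {}
    for name, element, x, y, z in atoms:
        m = ATOMIC_MASS[element]
        acc = sums.setdefault(bead_type(name), [0.0, 0.0, 0.0, 0.0])
        acc[0] += m * x
        acc[1] += m * y
        acc[2] += m * z
        acc[3] += m
    return {
        kind: (sx / m, sy / m, sz / m, BEAD_RADIUS)
        for kind, (sx, sy, sz, m) in sums.items()
    }
```

Applied to every nucleotide in turn, this reduces the 21564 atoms of the minicircle to three beads per nucleotide.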
2.4 Tracing Algorithm
In order to fit some sort of structure to an image of a minicircle, I wanted to quantify
the general shape in some way. Therefore we decided to try and use a tracing algorithm
to “draw” the shape of the minicircle, so that I could then lay down a structure around
this shape. Borrowing from Mazur and Maaloum13 who base their algorithm on Wiggins
et al.,14 I wrote an algorithm as follows. A series of points are produced which follow
the minicircle and joining these together with links forms the trace.
Two initial points, a0 and a1, are placed manually on the minicircle, forming the first
link of the trace. These are the initial start point and end point. At each stage, the end
point becomes the next start point. A prediction of the direction of the next end point
is made by moving forward in the direction of the previous link. This is then corrected
and a new direction is formed according to
\[ \mathbf{X}_{\mathrm{new}} = \int_{\mathrm{segment}} \mathrm{d}s \, Z(\mathbf{x}) \, (\mathbf{x} - \mathbf{a}_0) \qquad (1) \]
where these vectors are in 2D only, Z(x) is the height, a0 is the start point of the link
and the integral is over a segment perpendicular to the link, joined at its centre point by
the link at the end point a1. X_new then forms a new prediction for the end point. What
is formed is a z-weighted average of the perpendicular segment, so the new direction will
tend to go along the highest point, tracing the minicircle.
This process is iterated three times and the third time the next end point is chosen
along X_new. The height of the scan is only known at certain points in x-y space so to
evaluate this integral I used an interpolation function in Python so that Z(x) can be
found at any point.
The length of the segment that is integrated and the size of the steps can be chosen.
In tracing DNA, Mazur and Maaloum suggest an integration length of 10nm, so I used the
same. I varied the size of the steps, although it was typically a few nm.
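One step of the tracing procedure can be sketched as below. The height is passed in as a callable `height(x, y)`, standing in for the interpolation function mentioned above, and the integral in equation (1) is approximated by a simple sum over evenly spaced samples. Placing the corrected end point at a fixed step length along X_new, the number of samples and all function names are assumptions made for illustration. The sketch also assumes non-negative heights with a background close to zero.

```python
import math


def refine_end_point(height, a0, a1, step, seg_length=10.0, n_samples=101):
    """Correct the predicted end point a1 once, following equation (1)."""
    lx, ly = a1[0] - a0[0], a1[1] - a0[1]
    norm = math.hypot(lx, ly)
    lx, ly = lx / norm, ly / norm
    px, py = -ly, lx                      # unit vector perpendicular to the link
    ds = seg_length / (n_samples - 1)
    sum_x = sum_y = 0.0
    for k in range(n_samples):
        s = -0.5 * seg_length + k * ds    # position along the segment centred on a1
        x, y = a1[0] + s * px, a1[1] + s * py
        z = height(x, y)
        sum_x += z * (x - a0[0]) * ds     # height-weighted sum of (x - a0)
        sum_y += z * (y - a0[1]) * ds
    new_norm = math.hypot(sum_x, sum_y)
    # New end point: one step length from a0 along the direction of X_new
    return (a0[0] + step * sum_x / new_norm, a0[1] + step * sum_y / new_norm)


def next_point(height, a_prev, a0, step, n_iter=3, **kwargs):
    """Predict along the previous link (a_prev -> a0), then correct n_iter times."""
    dx, dy = a0[0] - a_prev[0], a0[1] - a_prev[1]
    norm = math.hypot(dx, dy)
    a1 = (a0[0] + step * dx / norm, a0[1] + step * dy / norm)
    for _ in range(n_iter):
        a1 = refine_end_point(height, a0, a1, step, **kwargs)
    return a1


# Toy example: a straight ridge along the x-axis with a Gaussian cross-section
ridge = lambda x, y: math.exp(-0.5 * y * y)
traced = next_point(ridge, (-3.0, -1.0), (0.0, 0.0), step=3.0)
```

In the toy example the previous link points away from the ridge, and the correction pulls the new end point back onto the line of greatest height while keeping it one step length from the start point.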
Running this algorithm on a minicircle image, real or simulated, produces a series of points at the ends
of the links. A line through these points should trace out the middle of the minicircle. An example of this is
shown in Figure 4.
Figure 4: A plot of a real AFM scan of a minicircle and the output of my tracing algorithm.
2.5 Forming a Structure
My tracing algorithm produces a series of points which mark the position of the minicircle. The next step
was to use these to suggest a suitable structure. This should be consistent with my knowledge of the minicircles,
and the result of using my AFM simulator on it should be consistent with the AFM data. To do this I placed
together the structures of short segments of coarse-grained DNA taken from the PDB file of the minicircle to form a structure resembling that of a minicircle, which should be similar to the real structure.
I made a short segment of model DNA, basing it on the data from the PDB file of
the minicircle. I used various lengths for this segment, between 3 and 17 base pairs. I
then manipulated copies of this in space to build up the structure I described. To do
this it was first necessary to be able to produce the segment at arbitrary position and
orientation. The way I did this is described in Appendix A.
I then placed these segments along the points of the trace. I ran the tracing algorithm
with a distance between points suitable for the length of segment that I was using, that
is, some fraction of the length of the segment. I also ran the tracing algorithm so
that it traced around the minicircle many times and used the points at the end of the
trace, to try to avoid bias from the choice of starting points.
Figure 5: A comparison between real AFM scans of minicircles, (a) and (c), and simulated AFM scans of my structures built to match the real data, (b) and (d).
Figure 6: Side and top-down views, (a) and (b), of one of my assembled minicircle structures. The
red and orange lines are artefacts of Chimera.
The contour length of the trace will not be equal to the total contour length of the
segments. Consequently, there may be gaps of varying size between the segments when
I form them into a structure. As this is only a first estimate at the structure I did
not think this was that important, but such imperfections could be removed in a more
developed model.
With suitable choices of the distance between points and the length of the segment
my program produced structures that, when scanned with my AFM simulator, at least
superficially resembled the original data. Some examples of this are shown in Figure 5.
These structures were made with segments which were 17 base pairs long and correspondingly placed about 6.4nm apart. That is, the distance between traced points was
3.2nm and I placed them every two points. The tip radius for the scan was 1.5nm.
Finally, to visualise the structures I was assembling I wrote a function to convert
the coarse-grained structures back to PDB format and to export this for viewing in UCSF
Chimera. An example of what an assembled structure looks like when viewed in this
way is shown in Figure 6.
2.6 Comparing Scans
Finally, I made an attempt to use my results to estimate the tip radius of the actual
AFM scan. I can make multiple simulated scans of my structure with different radii
and so by comparing them to the original data I could estimate what the real tip radius
might be. It is therefore necessary to have a method of comparing scans. I used the
principle of least squares fitting to calculate the residuals between a simulated scan and
the real scan. This is given by
\[ \mathrm{resid}(R) = \sum_i \left| z_i^{\mathrm{real}} - z_i^{\mathrm{sim}}(R) \right|^2 \qquad (2) \]
over all points i in the scan. For each of the minicircles imaged in Figure 5 I attempted
to minimise the value of the residuals with respect to radius. That is, I looked for the simulated
tip radius which produced the lowest residuals, which could then be close to the real tip
radius. For both, the function had a minimum at around 7Å, and Figure 7 shows the
residuals at various radii around this local minimum. The two minima seem consistent
with each other, although 7Å seems small for an AFM tip.
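The comparison in equation (2) and the search over tip radii can be sketched as follows. Scans are taken to be lists of rows on the same x-y grid, and `simulate` is a hypothetical callable that returns the simulated scan of a fixed structure for a given tip radius. The search is a plain scan over candidate radii, which is an assumption rather than the minimisation routine used in the project.

```python
def residual(z_real, z_sim):
    """Sum of squared height differences between two scans on the same grid."""
    return sum(
        (zr - zs) ** 2
        for row_real, row_sim in zip(z_real, z_sim)
        for zr, zs in zip(row_real, row_sim)
    )


def best_tip_radius(z_real, simulate, radii):
    """Return (radius, residual) for the candidate radius with the lowest residual."""
    scores = [(residual(z_real, simulate(R)), R) for R in radii]
    best_score, best_R = min(scores)
    return best_R, best_score


# Toy check: a fake "simulator" whose output depends smoothly on the radius
toy_simulate = lambda R: [[R, 2.0 * R], [0.0, R]]
toy_real = toy_simulate(7.0)
toy_best = best_tip_radius(toy_real, toy_simulate, [5.0, 6.0, 7.0, 8.0, 9.0])
```

Dividing each residual by its value at a reference radius gives the relative residuals of the kind plotted in Figure 7.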
3 Conclusions
On completion of this project my program can do the following. It can import
AFM images of DNA minicircles and, using PDB data of a minicircle, construct a
first guess at their underlying DNA structure following the steps outlined above.
I have combined everything into one function to make it as straightforward as possible.
I can also export this structure in PDB format to view easily.
The limitations of this are fairly clear in that it only produces a very rough guess at a
structure. Viewing this in Chimera as in Figure 6, it is evidently not that realistic, with
the stretches not really joining up properly. The effects of my method of producing the
structure can be seen in that the stretches are all inclined at an angle. This is because
the stretches are aligned with the lines connecting points on the trace from the first bead
to the last. The line connecting these beads is not parallel to the axis of the stretch,
hence the inclined appearance. Despite this the program makes a reasonable first guess
at a structure.
Figure 7: Values of the residuals for a range of tip radii for the minicircle structures
shown in Figure 5. resid(R) is given relative to its value at 9.1Å. Axes: resid(R) against R (Å); series: Minicircle (a) and Minicircle (c).
My method of comparing scans allows estimation of the real AFM tip radius and can
allow the fitness of structures in matching to the real data to be compared.
4 Discussion
My guesses of structures are not complete but they are a good start as Figure 5 clearly
resembles a real minicircle. Importantly, it would now be possible to modify this slightly
to improve the fit with the data, for example by “jiggling” the pieces of DNA that form
my structure around and trying to improve the fit.
This could be done by Monte Carlo simulation. Similar to methods established in
work on protein structures,15 I would make small adjustments to the position and orientation of the structure, then accept these adjustments with a probability corresponding
to how they change its scoring function. This scoring function needs to be a measurement
of how well the structure matches the data, such as my residuals function.
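A refinement of this kind might look like the sketch below. It uses the Metropolis acceptance rule, which is one common way of turning a change in score into an acceptance probability; the report does not specify the rule, so this choice, the `temperature` parameter and the function names are all assumptions. The state can be any description of the structure, and `score` and `propose` are supplied by the caller.

```python
import math
import random


def metropolis_refine(state, score, propose, n_steps=1000, temperature=1.0, seed=0):
    """Randomly perturb a state, keeping track of the best-scoring one seen.

    score(state) returns a number to be minimised (for example the residuals).
    propose(state, rng) returns a slightly modified copy of the state.
    """
    rng = random.Random(seed)
    current, current_score = state, score(state)
    best, best_score = current, current_score
    for _ in range(n_steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with probability exp(-delta / T)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


# Toy example: the "structure" is a 2D point and the score is its squared
# distance from a target position.
toy_score = lambda s: (s[0] - 1.0) ** 2 + (s[1] + 2.0) ** 2
toy_propose = lambda s, rng: (s[0] + rng.uniform(-0.1, 0.1), s[1] + rng.uniform(-0.1, 0.1))
refined, refined_score = metropolis_refine((0.0, 0.0), toy_score, toy_propose,
                                           n_steps=2000, temperature=0.01)
```

For the minicircle problem the state would hold the position and orientation of each DNA segment, and each call to `score` would involve a fresh simulated scan, so the cost of the scan sets how many steps are practical.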
However, the unrealistic estimate of the tip radius suggests that the residuals method
also has limitations. It is possible this comes down to the approximations made in the
production of simulated scans, such as approximating the interactions as simple contact,
coarse graining or modelling the AFM tip as a sphere. But I think there is scope within
this framework to greatly improve the guesses of structures, without too much difficulty.
There are other possibilities for a scoring function which could be more effective,
such as use of the derivatives of the scan rather than just the absolute height. In a more
advanced approach, energetic constraints could be considered where the scoring function also
depends on the relative positions of the DNA, with energetically unfavourable arrangements correspondingly penalised. In general, minimisation methods in conjunction with
a molecular mechanics forcefield could be used to improve the structure.
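As an illustration of the derivative-based idea, the sketch below scores two scans by comparing their forward finite differences along x and y instead of their absolute heights. This is only one possible form of such a score and was not implemented in the project; the function names and the use of forward differences are assumptions.

```python
def finite_differences(scan, step=1.0):
    """Forward differences of a 2D scan (list of rows) along x and along y."""
    dx = [[(row[i + 1] - row[i]) / step for i in range(len(row) - 1)] for row in scan]
    dy = [
        [(scan[j + 1][i] - scan[j][i]) / step for i in range(len(scan[j]))]
        for j in range(len(scan) - 1)
    ]
    return dx, dy


def gradient_residual(z_real, z_sim, step=1.0):
    """Sum of squared differences between the gradients of two scans."""
    total = 0.0
    real_parts = finite_differences(z_real, step)
    sim_parts = finite_differences(z_sim, step)
    for d_real, d_sim in zip(real_parts, sim_parts):
        for row_real, row_sim in zip(d_real, d_sim):
            total += sum((a - b) ** 2 for a, b in zip(row_real, row_sim))
    return total
```

One property of this form is that it is unaffected by a constant height offset between the real and simulated scans, which the plain height residual is not.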
If this were implemented effectively, increasingly accurate structures could be determined. Another possible development is to use AFM data which we can obtain where
single-stranded DNA binding to the minicircles is visible. This single-stranded DNA
binds at certain known points in the sequence, so this would provide a basepoint for
accurate fitting of the sequence itself, something I have not considered. The relationship
between sequence and DNA tertiary structure, such as if certain sequences are more
flexible, could then be probed.
Finally, the introduction of DNA binding proteins would allow the relationship between protein binding affinity, sequence and DNA tertiary structure to be directly measured. Experiments of this type would be extremely interesting and would allow real
investigation into the questions outlined in the first section.
5 Acknowledgements
I acknowledge the work of Alice Pyne in producing the AFM data on which this project
was based.
References
1 V. Ortiz and J. J. de Pablo, “Molecular origins of DNA flexibility: Sequence effects on conformational and mechanical properties,” Phys. Rev. Lett., vol. 106, p. 238107, Jun 2011.
2 B. Alberts, Molecular Biology of the Cell. New York: Garland Science, 4th ed., 2002.
3 C. J. Dorman, “Regulation of transcription by DNA supercoiling in Mycoplasma genitalium: global control in the smallest known self-replicating genome,” Molecular Microbiology, vol. 81, no. 2, pp. 302–304, 2011.
4 M. R. Gartenberg and D. M. Crothers, “DNA sequence determinants of CAP-induced bending and protein binding affinity,” Nature, vol. 333, no. 6176, pp. 824–829, 1988.
5 R. Rohs, X. Jin, S. M. West, R. Joshi, B. Honig, and R. S. Mann, “Origins of specificity in protein-DNA recognition,” Annual Review of Biochemistry, vol. 79, p. 233, 2010.
6 J. S. Mitchell, C. A. Laughton, and S. A. Harris, “Atomistic simulations reveal bubbles, kinks and wrinkles in supercoiled DNA,” Nucleic Acids Research, 2011.
7 T. A. Lionberger, D. Demurtas, G. Witz, J. Dorier, T. Lillian, E. Meyhöfer, and A. Stasiak, “Cooperative kinking at distant sites in mechanically stressed DNA,” Nucleic Acids Research, vol. 39, no. 22, pp. 9820–9832, 2011.
8 Computational Biophysics Group, University of Leeds, http://www.comp-bio.physics.leeds.ac.uk/, January 2015.
9 V. J. Morris, A. R. Kirby, and A. P. Gunning, Atomic Force Microscopy for Biologists, vol. 57. World Scientific, 1999.
10 A. Pyne, R. Thompson, C. Leung, D. Roy, and B. W. Hoogenboom, “Single-molecule reconstruction of oligonucleotide secondary structure by atomic force microscopy,” Small, 2014.
11 E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. Meng, and T. E. Ferrin, “UCSF Chimera—a visualization system for exploratory research and analysis,” Journal of Computational Chemistry, vol. 25, no. 13, pp. 1605–1612, 2004.
12 D. A. Potoyan, A. Savelyev, and G. A. Papoian, “Recent successes in coarse-grained modeling of DNA,” Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 3, no. 1, pp. 69–83, 2013.
13 A. K. Mazur and M. Maaloum, “Atomic force microscopy study of DNA flexibility on short length scales: smooth bending versus kinking,” Nucleic Acids Research, p. gku1192, 2014.
14 P. A. Wiggins, T. Van Der Heijden, F. Moreno-Herrero, A. Spakowitz, R. Phillips, J. Widom, C. Dekker, and P. C. Nelson, “High flexibility of DNA on short length scales probed by atomic force microscopy,” Nature Nanotechnology, vol. 1, no. 2, pp. 137–141, 2006.
15 A. Rossi, M. A. Marti-Renom, and A. Sali, “Localization of binding sites in protein structures by optimization of a composite scoring function,” Protein Science, vol. 15, no. 10, pp. 2366–2380, 2006.
A Coordinate Transformation
To be able to write the positions of the atoms or beads in a length of DNA at arbitrary
position and orientation required transforming the coordinates. This was straightforward
enough, but an explanation may be useful in trying to understand my code.
I transformed the x, y, z coordinates of each bead into a u, v, w system where the
u-axis was defined along the axis of the segment, that is the vector connecting the first
and last beads. I then defined the other axes (arbitrarily) as v̂ = û × ẑ and ŵ = û × v̂.
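A bead at point xi in x, y, z space thus corresponded simply to
\[ \mathbf{u}_i = \begin{pmatrix} \mathbf{x}_i \cdot \hat{\mathbf{u}} \\ \mathbf{x}_i \cdot \hat{\mathbf{v}} \\ \mathbf{x}_i \cdot \hat{\mathbf{w}} \end{pmatrix} \qquad (3) \]
in u, v, w space.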
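The purpose of this is that whenever I want to lay down a new segment, I can
define a new u-axis along the desired axis of the segment, and write down each bead in
this new u, v, w space. The transformation is then reversed and I recover the position of
each bead in the new orientation xi.
This is done, in analogy to above, according to
\[ \mathbf{x}_i = \begin{pmatrix} \mathbf{u}_i \cdot \hat{\mathbf{x}}' \\ \mathbf{u}_i \cdot \hat{\mathbf{y}}' \\ \mathbf{u}_i \cdot \hat{\mathbf{z}}' \end{pmatrix} \qquad (4) \]
where $\hat{\mathbf{x}}'$, $\hat{\mathbf{y}}'$, $\hat{\mathbf{z}}'$ are the x, y, z unit vectors in the new u, v, w space and are given by
\[ \hat{\mathbf{x}}' = \begin{pmatrix} \hat{\mathbf{u}} \cdot \hat{\mathbf{x}} \\ \hat{\mathbf{v}} \cdot \hat{\mathbf{x}} \\ \hat{\mathbf{w}} \cdot \hat{\mathbf{x}} \end{pmatrix}, \quad \hat{\mathbf{y}}' = \begin{pmatrix} \hat{\mathbf{u}} \cdot \hat{\mathbf{y}} \\ \hat{\mathbf{v}} \cdot \hat{\mathbf{y}} \\ \hat{\mathbf{w}} \cdot \hat{\mathbf{y}} \end{pmatrix}, \quad \hat{\mathbf{z}}' = \begin{pmatrix} \hat{\mathbf{u}} \cdot \hat{\mathbf{z}} \\ \hat{\mathbf{v}} \cdot \hat{\mathbf{z}} \\ \hat{\mathbf{w}} \cdot \hat{\mathbf{z}} \end{pmatrix} \qquad (5) \]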
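The transformation described in this appendix can be sketched in code as follows. The frame is built from the segment axis as in the text, with the cross products normalised so that the basis is orthonormal; this normalisation, the handling of position by subtracting the first bead and adding a chosen origin, and all function names are assumptions made for illustration. The sketch also assumes that neither axis is parallel to z.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def unit(a):
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def segment_frame(axis):
    """Orthonormal (u, v, w) with u along axis, v along u x z and w = u x v."""
    u = unit(axis)
    v = unit(cross(u, (0.0, 0.0, 1.0)))
    w = cross(u, v)
    return u, v, w


def to_uvw(points, frame):
    """Equation (3): components of each point along u, v and w."""
    u, v, w = frame
    return [(dot(p, u), dot(p, v), dot(p, w)) for p in points]


def from_uvw(coords, frame):
    """Equations (4) and (5): rebuild x, y, z positions from u, v, w components."""
    u, v, w = frame
    return [tuple(c[0] * u[k] + c[1] * v[k] + c[2] * w[k] for k in range(3)) for c in coords]


def place_segment(points, new_axis, origin):
    """Lay a copy of a segment along new_axis with its first bead at origin."""
    relative = [sub(p, points[0]) for p in points]
    local = to_uvw(relative, segment_frame(sub(points[-1], points[0])))
    moved = from_uvw(local, segment_frame(new_axis))
    return [tuple(m[k] + origin[k] for k in range(3)) for m in moved]
```

Because both frames are orthonormal, the distances between beads within a segment are unchanged by the move; only the position and orientation of the segment as a whole differ.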