Hybrid Computational Modelling and Single-Molecule Imaging of DNA Structure

Robert D. M. Gray
CoMPLEX, University College London
Supervisors: Dr. Bart Hoogenboom and Dr. Maya Topf
3571 words
19 January 2015

1 Introduction

1.1 DNA Structure in Gene Expression

DNA encodes, in the form of genes, the genetic information for the development and biological functioning of all living organisms. The coherent use of this information through gene expression is crucial to that functioning and is regulated at a number of levels. An understanding of the complex mechanisms that regulate gene expression is a “grand challenge for biophysics and epigenetics”.[1]

Transcriptional regulation is only one of many levels of control of gene expression, but it is the first on the path from DNA to functioning protein and as such is of “paramount importance”.[2] In general it is coordinated by a large array of regulatory proteins, which control the transcriptional ability of RNA polymerases by recognising and binding to certain stretches of DNA. The binding affinity of these proteins is a particularly important parameter for transcriptional control. This binding and recognition is typically thought of as depending on sequence alone, but it is not valid to treat DNA as a monotonous one-dimensional code. DNA can have complicated and varied 3D, or tertiary, structure, in analogy to that of proteins. These structures, and by extension the physical properties of DNA that govern their formation, are of great importance in transcriptional regulation in a number of ways.
For example, the smallest known genome, that of Mycoplasma genitalium, does not appear to contain sufficient chemical information in its sequence to explain its global control, and it has been suggested that the physical properties of its DNA store the necessary information.[3] More generally, in eukaryotes DNA is packed into chromatin, which strongly affects transcriptional regulation by controlling the RNA polymerase’s access to genes. The way in which DNA is packaged and folded therefore provides an important layer of transcriptional regulation.[2] As well as affecting the ability of proteins to bind to DNA in this way, the small-scale physical properties of the DNA polymer can directly influence the protein binding affinity.[4,5] Finally, it is thought that the mechanical properties of the DNA double helix themselves depend on sequence. The extent to which this is true has been debated,[1] but there is certainly a complex relationship between the DNA sequence, its small-scale mechanical properties, and the binding affinity of the proteins involved in transcriptional regulation.

Figure 1: Model of a DNA minicircle in three states.[8]

A full understanding of the mechanical properties of double-helix DNA is thus important to an understanding of transcriptional regulation. It may be surprising that this understanding is elusive, given the long-standing knowledge of the molecular structure. This may be because such knowledge is based on ensemble measurements, whereas the single-molecule resolution needed to study the mechanical properties has only recently become possible.

1.2 DNA Minicircles

One model system used to investigate these mechanical properties is the DNA minicircle. Minicircles are small rings of double-stranded DNA, hundreds of base pairs long.
They can exhibit a large variety of structural features such as kinks, denaturation bubbles and wrinkled conformations,[6] and can mimic nucleosomal conformations,[7] which makes them a useful model system. They also demonstrate DNA supercoiling, a conformation in which DNA winds around itself, forming coiled shapes. DNA in nature is generally in a supercoiled state, and the maintenance of such states can be important for transcription, again demonstrating the importance of DNA tertiary structure.[2] Figure 1 shows a model of a DNA minicircle in three different states, the rightmost of which is supercoiled. The minicircles we are studying are manufactured to be 339 base pairs long, which means they typically have a radius of around 20 nm.

To consider the structures of DNA minicircles I have been provided with data in the form of Protein Data Bank (PDB) files produced by the Computational Biophysics Group at the University of Leeds.[8] A PDB file contains the information needed to describe the structure of a molecule, including the position and type of each atom. The files I am using are models of the structure of minicircles. Real minicircles have varied conformations, but from these models I can get a good idea of what the typical structure of a DNA minicircle should be. I want to develop an idea of the actual arrangement of this structure for an individual minicircle. This would be novel and very useful in investigating the way tertiary structure affects gene expression, as discussed in the first section.

Figure 2: Two atomic force micrographs of DNA minicircles adsorbed on a mica substrate in NiCl2 solution. Pictures like these were my raw data.

1.3 AFM

To do this we need a method of accurately imaging individual DNA minicircles. There are a number of methods for this, one of which is atomic force microscopy (AFM). AFM works by “feeling” a surface with a very fine probe or tip. At the most basic level, the tip interacts with molecules on a surface, which causes it to deflect.
This deflection is measured, and in this way an image of the surface is built up. The position of the tip is precisely controlled by means of piezoelectric crystals, and its deflection is typically detected with a laser. Altogether this can allow resolution on sub-nanometre scales.[9] As well as this excellent resolution, AFM has the advantages that it can be applied in liquid environments and that it images individual molecules rather than averaging over many, as techniques such as X-ray crystallography do. These attributes are all necessary for studying the small-scale mechanics of molecules such as DNA minicircles.

The diameter of the DNA double helix is approximately 2 nm and the features of the helix are correspondingly smaller. However, resolution at this level on DNA with AFM has been shown.[10] I have been provided with AFM data of minicircles for my work, and although it is difficult to resolve the helix structure in these, there is sufficient resolution to begin an analysis of their structure. Examples of AFM data of minicircles are shown in Figure 2. The data consist of values of the height z at each point on the x-y scan.

My project was then to use this AFM data, along with the known minicircle sequence and computational structures, to try to develop a program which could begin to ascertain the structure of individual minicircles. Determining the structure would be extremely interesting in the context of transcriptional regulation, as described above. The sequence of the minicircles is known, so with information about the structure we could study exactly how sequence and tertiary structure are related. We could also look at how protein binding is affected by the structure.

2 Methods

I used Python to write a program that makes a guess at the structure of minicircles based on AFM data. To visualise PDB files I used UCSF Chimera,[11] and many of my figures were made using Wolfram Mathematica.
2.1 Simulated AFM Images of DNA

Firstly, I needed a script to create simulated AFM images, so that I could apply it to a chosen structure and compare the result with the real AFM image. I chose to apply this to a structure in the form of hard spheres, characterised by their positions in x, y, z space and their radii. In my model of AFM the tip is also a hard object, and I consider contact to be made when it touches the spheres (that is, when the positions overlap). In real AFM there are of course numerous complex interactions of various ranges, but this is a useful approximation. I also took the AFM tip to be a sphere, again a simplification but a logical first approximation.

My AFM scanning function raster scans across a chosen region with a chosen step size. At each point in x-y space it finds the corresponding z, and these values make up the scan. The simplest version of this was to lower my spherical “tip”, with a given radius $R$, from a starting height $z_{\mathrm{start}}$ until “contact” was made with the sample atoms, represented as spheres of radius $r_i$. That is, until the inequality $|\mathbf{x}_{\mathrm{tip}} - \mathbf{x}_i| \ge R + r_i$, which holds for all atoms $i$ while the tip is clear of the sample, was no longer satisfied (if this happened at all). The function therefore takes as arguments the start and end x and y points and the step size, as well as $R$. This was effective but a little slow, as for each step in z it needs to evaluate the inequality for each atom $i$. I sped it up a little by restricting the check to atoms in the local region of the tip. The function was then able to produce simulated AFM scans of data in the form of spheres.

Figure 3: A visualisation (a), scans of the atomistic DNA model with tip radius 1 Å (b) and 5 Å (c), and of the coarse grained model with tip radius 5 Å (d). All units are in Å.

2.2 Incorporating Real Structures

Now that I was able to form simulated AFM images, I wanted to apply this to real structures. To do this I used the Biopython package, which allowed me to import PDB files into Python.
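The hard-sphere scan of Section 2.1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the project: the function names `tip_height` and `scan`, the `floor` parameter standing in for the substrate, and the choice to report the height of the lowest point of the tip are all assumptions. Instead of lowering the tip in small steps of z, the sketch solves the contact condition $|\mathbf{x}_{\mathrm{tip}} - \mathbf{x}_i| = R + r_i$ for z in closed form, which should give the same contact height without the inner loop over z.

```python
import math

def tip_height(x, y, spheres, tip_radius, floor=0.0):
    """Height of the lowest point of a spherical tip resting on hard spheres.

    spheres is a list of (xi, yi, zi, ri) tuples. floor is a hypothetical
    flat substrate height, returned where the tip touches no sphere.
    """
    best = floor
    for xi, yi, zi, ri in spheres:
        reach = tip_radius + ri                  # centre-to-centre distance at contact
        d2 = (x - xi) ** 2 + (y - yi) ** 2       # squared lateral distance
        if d2 >= reach ** 2:
            continue                             # tip passes beside this sphere
        # Tip centre height at which |x_tip - x_i| = R + r_i
        centre_z = zi + math.sqrt(reach ** 2 - d2)
        best = max(best, centre_z - tip_radius)  # convert to lowest point of the tip
    return best

def scan(spheres, x_start, x_end, y_start, y_end, step, tip_radius):
    """Raster scan: returns rows of heights, one row per y value."""
    nx = int(round((x_end - x_start) / step)) + 1
    ny = int(round((y_end - y_start) / step)) + 1
    return [[tip_height(x_start + i * step, y_start + j * step, spheres, tip_radius)
             for i in range(nx)]
            for j in range(ny)]

# One sphere of radius 1 resting on the substrate, scanned with a tip of radius 1
image = scan([(0.0, 0.0, 1.0, 1.0)], -3.0, 3.0, 0.0, 0.0, 1.5, 1.0)
```

In this sketch a larger `tip_radius` makes each sphere appear wider in the image, because contact begins at a lateral distance of $R + r_i$ from the sphere centre.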
I extracted the atomic positions and put them into the form that my AFM simulation script works with, namely spheres. For the radii of the atoms I used van der Waals radii from WebElements. These are established from contact distances between non-bonded atoms, so they are the relevant distances to use. Figure 3a shows the structure of a small piece of DNA in PDB format, visualised with UCSF Chimera. Figures 3b and 3c show images of the same structure imported into Python and run through my AFM simulator as described above, with two different tip radii. The effect of tip radius can be seen in that resolution is higher with a smaller radius and structures appear larger with a larger radius. This is also true of real AFM.

2.3 Coarse Graining

Moving on, I could then incorporate the structures of DNA minicircles. I used Biopython to import the PDB files into Python in the same way. However, these are quite large (339 base pairs, 21564 atoms), so examining them at an atomistic level would have been computationally problematic. I therefore developed a simple coarse graining algorithm. Following Potoyan,[12] I used a very simple coarse grained model of DNA, replacing each nucleotide with three beads corresponding to the phosphate, sugar and base groups. For my purposes I again modelled each bead as a sphere, with the radius taken from the minimum of the Lennard-Jones potential used by Potoyan for the A and T bases, which was 2.9 Å. The radii should differ between groups, but the model I based this on was far more complicated and did not provide simple radii for the other cases. Given the crudeness of this sphere method, I did not think that small variations in the radius would make any difference to the result. I placed each bead at the centre of mass of the corresponding atoms. Figure 3d shows a scan of a small piece of coarse grained DNA. Few features are lost compared to the atomistic model, even with very low tip radius.
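The coarse graining step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the rule that assigns PDB atom names to the phosphate, sugar and base groups (`group_of`, `PHOSPHATE_NAMES`) is a guess at a reasonable grouping, and the input format, a list of `(atom_name, element, (x, y, z))` tuples per nucleotide, is hypothetical. Only the 2.9 Å bead radius and the centre-of-mass placement come from the text.

```python
# Standard atomic masses in u
MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "P": 30.974}

# Assumed grouping: the phosphorus and its two non-bridging oxygens
PHOSPHATE_NAMES = {"P", "OP1", "OP2", "O1P", "O2P"}

def group_of(atom_name):
    """Assign a PDB atom name to a bead (illustrative rule, an assumption)."""
    if atom_name in PHOSPHATE_NAMES:
        return "phosphate"
    if atom_name.endswith("'") or atom_name.endswith("*"):
        return "sugar"      # sugar atoms carry a prime, e.g. C1', O4'
    return "base"

def centre_of_mass(atoms):
    """Mass-weighted mean position of a list of (name, element, pos) atoms."""
    total = sum(MASS[el] for _, el, _ in atoms)
    return tuple(sum(MASS[el] * pos[k] for _, el, pos in atoms) / total
                 for k in range(3))

def coarse_grain(nucleotide, bead_radius=2.9):
    """Replace one nucleotide by up to three beads (x, y, z, radius), in Å.

    nucleotide is a list of (atom_name, element, (x, y, z)) tuples.
    """
    groups = {"phosphate": [], "sugar": [], "base": []}
    for atom in nucleotide:
        groups[group_of(atom[0])].append(atom)
    return {name: centre_of_mass(atoms) + (bead_radius,)
            for name, atoms in groups.items() if atoms}
```

Each bead comes out as an `(x, y, z, radius)` tuple, which is the sphere form that a hard-sphere AFM simulator can scan directly.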
Using this method I was then able to generate simulated AFM scans of whole minicircles.

2.4 Tracing Algorithm

In order to fit a structure to an image of a minicircle, I wanted to quantify its general shape in some way. We therefore decided to use a tracing algorithm to “draw” the shape of the minicircle, so that I could then lay down a structure around this shape. Borrowing from Mazur and Maaloum,[13] who base their algorithm on Wiggins et al.,[14] I wrote an algorithm as follows.

A series of points is produced which follows the minicircle, and joining these together with links forms the trace. Two initial points, $\mathbf{a}_0$ and $\mathbf{a}_1$, are placed manually on the minicircle, forming the first link of the trace. These are the initial start point and end point. At each stage, the end point becomes the next start point. A prediction of the next end point is made by moving forward in the direction of the previous link. This is then corrected, and a new direction is formed according to

\[ \mathbf{X}_{\mathrm{new}} = \int_{\mathrm{segment}} \mathrm{d}s \, Z(\mathbf{x}) \, (\mathbf{x} - \mathbf{a}_0) \tag{1} \]

where the vectors are in 2D only, $Z(\mathbf{x})$ is the height, $\mathbf{a}_0$ is the start point of the link, and the integral is taken over a segment perpendicular to the link and centred on its end point $\mathbf{a}_1$. $\mathbf{X}_{\mathrm{new}}$ then forms a new prediction for the end point. What is formed is a z-weighted average over the perpendicular segment, so the new direction will tend to follow the highest point, tracing the minicircle.

This process is iterated three times, and the third time the next end point is chosen along $\mathbf{X}_{\mathrm{new}}$. The height of the scan is only known at certain points in x-y space, so to evaluate this integral I used an interpolation function in Python so that $Z(\mathbf{x})$ can be found at any point. The length of the segment that is integrated over and the size of the steps can both be chosen. For tracing DNA, Mazur and Maaloum suggest an integration length of 10 nm, so I used the same; I varied the size of the steps, although it was typically a few nm.
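One reading of the tracing step, written as a short Python sketch. The names `next_point` and `trace` are hypothetical, the `height` argument is any callable returning the height at a point (standing in for the interpolation function mentioned above), and the integral in Eq. (1) is approximated by a midpoint sum. Details such as the number of samples along the segment are assumptions rather than values from the original program.

```python
import math

def next_point(height, start, direction, step, seg_len=10.0, n_samples=101, n_iter=3):
    """One tracing step: returns (end_point, unit_direction).

    height(x, y) gives the image height at any point; start is the current
    start point and direction is the unit vector of the previous link.
    """
    dx, dy = direction
    for _ in range(n_iter):
        px, py = start[0] + step * dx, start[1] + step * dy  # predicted end point
        nx, ny = -dy, dx                                      # unit normal to the link
        ds = seg_len / n_samples
        sx = sy = 0.0
        for k in range(n_samples):
            t = -seg_len / 2 + (k + 0.5) * ds                 # midpoint rule along segment
            x, y = px + t * nx, py + t * ny
            z = height(x, y)
            sx += z * (x - start[0]) * ds                     # Eq. (1), x component
            sy += z * (y - start[1]) * ds                     # Eq. (1), y component
        norm = math.hypot(sx, sy)
        if norm == 0.0:
            break                                             # flat image: keep last direction
        dx, dy = sx / norm, sy / norm
    return (start[0] + step * dx, start[1] + step * dy), (dx, dy)

def trace(height, a0, a1, step, n_points):
    """Extend the manually placed link a0 -> a1 by n_points further points."""
    length = math.hypot(a1[0] - a0[0], a1[1] - a0[1])
    direction = ((a1[0] - a0[0]) / length, (a1[1] - a0[1]) / length)
    points = [a0, a1]
    for _ in range(n_points):
        end, direction = next_point(height, points[-1], direction, step)
        points.append(end)
    return points

# Synthetic test image: a straight ridge along y = 0 (lengths in nm)
ridge = lambda x, y: math.exp(-y * y / 2.0)
points = trace(ridge, (-3.0, 1.0), (0.0, 1.0), step=3.0, n_points=5)
```

In this synthetic example the first link is deliberately placed 1 nm off the ridge; the height-weighted correction should pull the following points back onto the line y = 0.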
Running this algorithm on a minicircle image, real or simulated, produces a series of points at the ends of the links. A line through these points should trace out the middle of the minicircle. An example of this is shown in Figure 4.

Figure 4: A plot of a real AFM scan of a minicircle and the output of my tracing algorithm.

2.5 Forming a Structure

My tracing algorithm produces a series of points which mark the position of the minicircle. The next step was to use these to suggest a suitable structure. This should be consistent with my knowledge of the minicircles, and the result of running my AFM simulator on it should be consistent with the AFM data. To do this I placed together the structures of short segments of coarse-grained DNA, taken from the PDB file of the minicircle, to form a structure resembling that of a minicircle, which should be similar to the real structure.

I made a short segment of model DNA, basing it on the data from the PDB file of the minicircle. I used various lengths for this segment, between 3 and 17 base pairs. I then manipulated copies of this segment in space to build up the structure I described. To do this it was first necessary to be able to produce the segment at an arbitrary position and orientation. The way I did this is described in Appendix A. I then placed these segments along the points of the trace. I ran the tracing algorithm with a distance between points suited to the length of segment that I was using, that is, some fraction of the length of the segment. I also ran the tracing algorithm so that it traced around the minicircle many times, and used the points from the end of this trace, to try to avoid bias from the choice of starting points.

Figure 5: A comparison between real AFM scans of minicircles, (a) and (c), and simulated AFM scans of my structures built to match the real data, (b) and (d).
Figure 6: Side (a) and top-down (b) views of one of my assembled minicircle structures. The red and orange lines are artefacts of Chimera.

The contour length of the trace will not be equal to the total contour length of the segments. Consequently, there may be gaps of varying size between the segments when I form them into a structure. As this is only a first estimate of the structure I did not think this was very important, but such imperfections could be removed in a more developed model.

With suitable choices of the distance between points and the length of the segment, my program produced structures that, when scanned with my AFM simulator, at least superficially resembled the original data. Some examples of this are shown in Figure 5. These structures were made with segments which were 17 base pairs long and correspondingly placed about 6.4 nm apart. That is, the distance between traced points was 3.2 nm and I placed a segment at every second point. The tip radius for the scan was 1.5 nm.

Finally, to visualise the structures I was assembling, I wrote a function to convert the coarse grained structures back to PDB format and export them for viewing in UCSF Chimera. An example of what an assembled structure looks like when viewed in this way is shown in Figure 6.

2.6 Comparing Scans

Finally, I made an attempt to use my results to estimate the tip radius of the actual AFM scan. I can make multiple simulated scans of my structure with different radii, so by comparing them to the original data I could estimate what the real tip radius might be. It is therefore necessary to have a method of comparing scans. I used the principle of least squares fitting to calculate the residuals between a simulated scan and the real scan. This is given by

\[ \mathrm{resid}(R) = \sum_i \left| z_i^{\mathrm{real}} - z_i^{\mathrm{sim}}(R) \right|^2 \tag{2} \]

over all points $i$ in the scan. For each of the minicircles imaged in Figure 5 I attempted to minimise the value of the residuals with respect to radius.
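The residual of Eq. (2) and a search over tip radii might look as follows. This is a sketch under assumptions: the text does not say how the minimisation was carried out, so a simple grid search over candidate radii is assumed here, and `simulate` is a hypothetical callable that returns the simulated scan of the fitted structure for a given tip radius, sampled on the same x-y grid as the real scan.

```python
def residual(real, sim):
    """Sum of squared height differences between two scans, Eq. (2).

    Both scans are lists of rows sampled on the same x-y grid.
    """
    return sum((zr - zs) ** 2
               for row_r, row_s in zip(real, sim)
               for zr, zs in zip(row_r, row_s))

def best_radius(real, simulate, radii):
    """Return (radius, residual) for the candidate radius with lowest residual.

    simulate(R) is a hypothetical callable giving the simulated scan
    of the fitted structure for tip radius R.
    """
    scored = [(residual(real, simulate(R)), R) for R in radii]
    best_resid, best_R = min(scored)
    return best_R, best_resid
```

Dividing each residual by its value at a fixed reference radius gives a relative curve that is easier to compare between minicircles.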
That is, I looked for the simulated-scan tip radius that produced the lowest residuals, on the basis that it could be close to the real tip radius. For both minicircles the function was at a minimum at around 7 Å, and Figure 7 shows the residuals at various radii around this local minimum. The two minima seem consistent with each other, although 7 Å seems small for an AFM tip.

Figure 7: Values of the residuals for a range of tip radii for the minicircle structures shown in Figure 5, plotted as resid(R) against R (Å) for Minicircle (a) and Minicircle (c). resid(R) is given relative to its value at 9.1 Å.

3 Conclusions

On completion of this project my program is effective at the following. I can import AFM images of DNA minicircles and, using PDB data of a minicircle, construct a first guess at the underlying DNA structure by following the steps outlined above. I have combined everything into one function to make it as straightforward as possible. I can also export this structure in PDB format so that it can be viewed easily.

The limitations of this are fairly clear, in that it only produces a very rough guess at a structure. Viewed in Chimera, as in Figure 6, it is evidently not that realistic, with the stretches not really joining up properly. The effects of my method of producing the structure can be seen in that the stretches are all inclined at an angle. This is because the stretches are aligned with the lines connecting points on the trace using the line from the first bead to the last. The line connecting these beads is not parallel to the axis of the stretch, hence the inclined appearance.

Despite this, the program makes a reasonable first guess at a structure. My method of comparing scans allows estimation of the real AFM tip radius, and allows the fitness of different structures in matching the real data to be compared.

4 Discussion

My guesses of structures are not complete, but they are a good start, as the simulated scans in Figure 5 clearly resemble those of real minicircles.
Importantly, it would now be possible to modify these structures slightly to improve the fit with the data, for example by “jiggling” the pieces of DNA that form my structure. This could be done by Monte Carlo simulation. Similarly to methods established in work on protein structures,[15] I would make small adjustments to the position and orientation of the pieces of the structure, then accept these adjustments with a probability corresponding to how they change its scoring function. This scoring function needs to be a measure of how well the structure matches the data, such as my residuals function.

However, the unrealistic estimate of the tip radius suggests that the residuals method also has limitations. It is possible that this comes down to the approximations made in the production of simulated scans, such as approximating the interactions as simple contact, coarse graining, or modelling the AFM tip as a sphere. But I think there is scope within this framework to greatly improve the guesses of structures without too much difficulty. There are other possibilities for a scoring function which could be more effective, such as use of the derivatives of the scan rather than just the absolute height. In a more advanced approach, energetic constraints could be considered, where the scoring function also depends on the relative positions of the pieces of DNA, with energetically unfavourable arrangements correspondingly penalised. In general, minimisation methods in conjunction with a molecular mechanics force field could be used to improve the structure. If this were implemented effectively, increasingly accurate structures could be determined.

Another possible development is to use AFM data, which we can obtain, in which single-stranded DNA bound to the minicircles is visible. This single-stranded DNA binds at certain known points in the sequence, so this would provide a reference point for accurate fitting of the sequence itself, something I have not considered.
The relationship between sequence and DNA tertiary structure, such as whether certain sequences are more flexible, could then be probed. Finally, the introduction of DNA binding proteins would allow the relationship between protein binding affinity, sequence and DNA tertiary structure to be directly measured. Experiments of this type would be extremely interesting and would allow real investigation into the questions outlined in the first section.

5 Acknowledgements

I acknowledge the work of Alice Pyne in producing the AFM data on which this project was based.

References

[1] V. Ortiz and J. J. de Pablo, “Molecular origins of DNA flexibility: Sequence effects on conformational and mechanical properties,” Phys. Rev. Lett., vol. 106, p. 238107, Jun 2011.
[2] B. Alberts, Molecular Biology of the Cell. New York: Garland Science, 4th ed., 2002.
[3] C. J. Dorman, “Regulation of transcription by DNA supercoiling in Mycoplasma genitalium: global control in the smallest known self-replicating genome,” Molecular Microbiology, vol. 81, no. 2, pp. 302–304, 2011.
[4] M. R. Gartenberg and D. M. Crothers, “DNA sequence determinants of CAP-induced bending and protein binding affinity,” Nature, vol. 333, no. 6176, pp. 824–829, 1988.
[5] R. Rohs, X. Jin, S. M. West, R. Joshi, B. Honig, and R. S. Mann, “Origins of specificity in protein-DNA recognition,” Annual Review of Biochemistry, vol. 79, p. 233, 2010.
[6] J. S. Mitchell, C. A. Laughton, and S. A. Harris, “Atomistic simulations reveal bubbles, kinks and wrinkles in supercoiled DNA,” Nucleic Acids Research, 2011.
[7] T. A. Lionberger, D. Demurtas, G. Witz, J. Dorier, T. Lillian, E. Meyhöfer, and A. Stasiak, “Cooperative kinking at distant sites in mechanically stressed DNA,” Nucleic Acids Research, vol. 39, no. 22, pp. 9820–9832, 2011.
[8] Computational Biophysics Group, University of Leeds, “http://www.comp-bio.physics.leeds.ac.uk/,” January 2015.
[9] V. J. Morris, A. R. Kirby, and A. P. Gunning, Atomic force microscopy for biologists, vol. 57.
World Scientific, 1999.
[10] A. Pyne, R. Thompson, C. Leung, D. Roy, and B. W. Hoogenboom, “Single-molecule reconstruction of oligonucleotide secondary structure by atomic force microscopy,” Small, 2014.
[11] E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. Meng, and T. E. Ferrin, “UCSF Chimera—a visualization system for exploratory research and analysis,” Journal of Computational Chemistry, vol. 25, no. 13, pp. 1605–1612, 2004.
[12] D. A. Potoyan, A. Savelyev, and G. A. Papoian, “Recent successes in coarse-grained modeling of DNA,” Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 3, no. 1, pp. 69–83, 2013.
[13] A. K. Mazur and M. Maaloum, “Atomic force microscopy study of DNA flexibility on short length scales: smooth bending versus kinking,” Nucleic Acids Research, p. gku1192, 2014.
[14] P. A. Wiggins, T. Van Der Heijden, F. Moreno-Herrero, A. Spakowitz, R. Phillips, J. Widom, C. Dekker, and P. C. Nelson, “High flexibility of DNA on short length scales probed by atomic force microscopy,” Nature Nanotechnology, vol. 1, no. 2, pp. 137–141, 2006.
[15] A. Rossi, M. A. Marti-Renom, and A. Sali, “Localization of binding sites in protein structures by optimization of a composite scoring function,” Protein Science, vol. 15, no. 10, pp. 2366–2380, 2006.

A Coordinate Transformation

Writing the positions of the atoms or beads in a length of DNA at an arbitrary position and orientation required a coordinate transformation. This was straightforward enough, but an explanation may be useful in trying to understand my code. I transformed the x, y, z coordinates of each bead into a u, v, w system where the u-axis was defined along the axis of the segment, that is, the vector connecting the first and last beads. I then defined the other axes (arbitrarily) as $\hat{\mathbf{v}} = \hat{\mathbf{u}} \times \hat{\mathbf{z}}$ and $\hat{\mathbf{w}} = \hat{\mathbf{u}} \times \hat{\mathbf{v}}$ (with $\hat{\mathbf{v}}$ normalised to unit length, since $\hat{\mathbf{u}} \times \hat{\mathbf{z}}$ is a unit vector only when $\hat{\mathbf{u}}$ is perpendicular to $\hat{\mathbf{z}}$). A bead at point $\mathbf{x}_i$ in x, y, z space thus corresponded simply to

\[ \mathbf{u}_i = \begin{pmatrix} \mathbf{x}_i \cdot \hat{\mathbf{u}} \\ \mathbf{x}_i \cdot \hat{\mathbf{v}} \\ \mathbf{x}_i \cdot \hat{\mathbf{w}} \end{pmatrix} \tag{3} \]

in u, v, w space.
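The purpose of this is that whenever I want to lay down a new segment, I can define a new u-axis along the desired axis of the segment and write down each bead in this new u, v, w space. The transformation is then reversed and I recover the position $\mathbf{x}_i$ of each bead in the new orientation. This is done, in analogy to the above, according to

\[ \mathbf{x}_i = \begin{pmatrix} \mathbf{u}_i \cdot \hat{\mathbf{x}}' \\ \mathbf{u}_i \cdot \hat{\mathbf{y}}' \\ \mathbf{u}_i \cdot \hat{\mathbf{z}}' \end{pmatrix} \tag{4} \]

where $\hat{\mathbf{x}}'$, $\hat{\mathbf{y}}'$, $\hat{\mathbf{z}}'$ are the x, y, z unit vectors in the new u, v, w space and are given by

\[ \hat{\mathbf{x}}' = \begin{pmatrix} \hat{\mathbf{u}} \cdot \hat{\mathbf{x}} \\ \hat{\mathbf{v}} \cdot \hat{\mathbf{x}} \\ \hat{\mathbf{w}} \cdot \hat{\mathbf{x}} \end{pmatrix}, \quad \hat{\mathbf{y}}' = \begin{pmatrix} \hat{\mathbf{u}} \cdot \hat{\mathbf{y}} \\ \hat{\mathbf{v}} \cdot \hat{\mathbf{y}} \\ \hat{\mathbf{w}} \cdot \hat{\mathbf{y}} \end{pmatrix}, \quad \hat{\mathbf{z}}' = \begin{pmatrix} \hat{\mathbf{u}} \cdot \hat{\mathbf{z}} \\ \hat{\mathbf{v}} \cdot \hat{\mathbf{z}} \\ \hat{\mathbf{w}} \cdot \hat{\mathbf{z}} \end{pmatrix} \tag{5} \]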
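The transformation of this appendix can be sketched in plain Python as below. The function names (`frame`, `to_local`, `from_local`) are hypothetical, and the `origin` argument, which shifts the re-oriented segment to a chosen position, is an added assumption that the equations above do not cover. The sketch normalises the cross product explicitly and, like the definition it follows, breaks down when the segment axis is parallel to the z-axis.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(c / n for c in a)

def frame(axis):
    """Unit vectors (u, v, w) for a segment axis: v = u x z, w = u x v.

    Raises ZeroDivisionError if axis is parallel to z, where u x z vanishes.
    """
    u = unit(axis)
    v = unit(cross(u, (0.0, 0.0, 1.0)))
    w = cross(u, v)
    return u, v, w

def to_local(points, axis):
    """Eq. (3): project each point onto the u, v, w axes."""
    u, v, w = frame(axis)
    return [(dot(p, u), dot(p, v), dot(p, w)) for p in points]

def from_local(local, new_axis, origin=(0.0, 0.0, 0.0)):
    """Eqs. (4) and (5): rebuild x, y, z coordinates for a new segment axis."""
    u, v, w = frame(new_axis)
    return [tuple(origin[k] + a * u[k] + b * v[k] + c * w[k] for k in range(3))
            for a, b, c in local]

# Re-orient a segment lying along x so that it lies along y
beads = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (4.0, 0.0, 0.0)]
moved = from_local(to_local(beads, (1.0, 0.0, 0.0)), (0.0, 1.0, 0.0))
```

In this example the segment is carried from the x-axis to the y-axis by a rotation about z, so the bead at (1, 2, 3) should end up at (-2, 1, 3), with distances between beads unchanged.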