DNA, RNA and Protein Structure Prediction

Eero Pennala 63203L

Foreword

This is an exercise work for the course S-114.500 Basics for Biosystems of the Cell. First we go through the basics of DNA, RNA and proteins, and the structure prediction concerning them. After that, some freeware programs related to these topics are introduced. There are more freeware programs that could have been tested, but most of them require a special license from the author. Even though I e-mailed a couple of institutes, I did not get a response or a license. Many programs are also available only for Linux systems, and I did not have a chance to try them. Some of the free programs exist only as a server, where you submit your input file and get a result back. This is easy in itself, but the user does not get an answer to the question "How?", which I as an engineer would appreciate.

Foreword
Basics
    DNA
        Alpha-helix
    RNA
        RNA folding
        Predicting RNA secondary structure
    Predicting Protein Structure
Programs
    DNA and RNA Structure Prediction
        Circles
        RNA Shapes
    Protein Structure Prediction
        WHAT IF
    Visualizing Proteins
        Protein Explorer
        DeepView Swiss-PdbViewer

Basics

DNA

The structure of deoxyribonucleic acid (DNA) was discovered by James Watson and Francis Crick in 1953. DNA was determined to be a right-handed double helix based on X-ray crystallographic data provided to Watson and Crick by Maurice Wilkins and Rosalind Franklin. DNA is composed of repeating subunits called nucleotides. Each nucleotide is in turn composed of a phosphate group, a sugar, and a nitrogenous base. Four different bases are commonly found in DNA: adenine (A), guanine (G), cytosine (C), and thymine (T). In their common structural configurations, A and T form two hydrogen bonds while C and G form three hydrogen bonds. Because of the specificity of base pairing, the two strands of DNA are said to be complementary. This characteristic makes DNA unique and capable of transmitting genetic information.
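The complementarity described above can be illustrated with a few lines of Python. This is only a minimal sketch: the function names and the example sequence are made up for illustration, and the hydrogen bond counts are simply the two (A-T) and three (G-C) bonds quoted in the text.

```python
# Watson-Crick partners and the hydrogen bond counts quoted in the text.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def hydrogen_bond_count(strand: str) -> int:
    """Total hydrogen bonds when the strand is fully paired with its complement."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


if __name__ == "__main__":
    seq = "ATGCGT"  # made-up example sequence
    print(reverse_complement(seq))
    print(hydrogen_bond_count(seq))
```

Because each base has exactly one partner, applying reverse_complement twice returns the original strand, which is the sense in which one strand carries all the information needed to rebuild the other.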
Alpha-helix

The amino acids in an α-helix are arranged in a right-handed helical structure with a pitch of 5.4 Å per turn. Each amino acid corresponds to a 100° turn in the helix (so there are 3.6 residues per turn) and a translation of 1.5 Å along the helical axis, which gives 3.6 × 1.5 Å = 5.4 Å per full turn. Most importantly, the N-H group of an amino acid forms a hydrogen bond with the C=O group of the amino acid four residues earlier. This pattern is repeated along the chain, and this hydrogen bonding defines an α-helix.

RNA

Ribonucleic acid (RNA) is a nucleic acid polymer consisting of nucleotide monomers. RNA nucleotides contain ribose rings and uracil, unlike deoxyribonucleic acid (DNA), which contains deoxyribose and thymine. RNA is transcribed (synthesized) from DNA by enzymes called RNA polymerases and further processed by other enzymes. RNA serves as the template for translation of genes into proteins (mRNA), transfers amino acids to the ribosome to form proteins (tRNA), and, as part of the ribosome, translates the transcript into proteins (rRNA).

Picture 1: RNA structural motifs [11]

RNA folding

RNA is transcribed in cells as single strands of (ribose) nucleic acids. However, these sequences are not simply long strands of nucleotides. Rather, intra-strand base pairing produces structures such as the ones shown in picture 1. In RNA, guanine and cytosine pair (GC) by forming three hydrogen bonds, and adenine and uracil pair (AU) by forming two hydrogen bonds; additionally, guanine and uracil can form a weaker "wobble" base pair (GU), which has two hydrogen bonds in a non-Watson-Crick geometry.

The stability of a particular secondary structure is a function of several constraints:

1. The number of GC versus AU and GU base pairs. (Stronger base pairs form more stable structures.)
2. The number of base pairs in a stem region. (Longer stems result in more bonds.)
3. The number of bases in a hairpin loop region. (Formation of loops with more than 10 or fewer than 5 bases requires more energy.)
4. The number of unpaired bases, whether in interior loops or bulges. (Unpaired bases decrease the stability of the structure.)
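The quantities in the list above are easy to count once a structure is written down. The sketch below assumes the common dot-bracket notation (matching parentheses for paired bases, dots for unpaired bases), which is not used elsewhere in this text; the function names and the example hairpin are hypothetical.

```python
from collections import Counter


def base_pairs(structure: str) -> list[tuple[int, int]]:
    """Return (i, j) index pairs for matching brackets in dot-bracket notation."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def pair_summary(sequence: str, structure: str) -> dict[str, int]:
    """Count GC, AU and GU pairs and unpaired bases in a folded sequence."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    names = {"CG": "GC", "AU": "AU", "GU": "GU"}
    counts = Counter()
    for i, j in base_pairs(structure):
        key = "".join(sorted(sequence[i] + sequence[j]))
        counts[names.get(key, "other")] += 1
    counts["unpaired"] = structure.count(".")
    return dict(counts)


if __name__ == "__main__":
    # Made-up hairpin: a four-pair stem closing a four-base loop.
    print(pair_summary("GGAUGAAAGUCC", "((((....))))"))
```

The counts alone do not give a stability value; they only make visible the ingredients (pair types, stem length, unpaired bases) that the constraints refer to.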
The stability of a secondary structure is quantified as the amount of free energy released or used by forming base pairs. Positive free energy requires work to form a configuration; negative free energies release stored work. Free energies are additive, so one can determine the total free energy of a secondary structure by adding all the component free energies (units are kilocalories per mole). The more negative the free energy of a structure, the more likely the formation of that structure is, because more stored energy is released. This fact is used to predict the secondary structure of a particular sequence. Discovering a base pair configuration with the minimum possible free energy is the goal of most secondary structure prediction algorithms.

To compute the minimum free energy of a sequence, empirical energy parameters are used. These parameters summarize the free energy change (positive or negative) associated with all possible pairing configurations, including base pair stacks and internal base pairs, internal, bulge and hairpin loops, and various motifs which are known to occur with great frequency.

Four major classes of RNA exist, and can be found in most organisms:

1. mRNA - messenger RNA, a sequence which codes for the formation of one or more proteins.
2. tRNA - transfer RNA, small (~80 bases) sequences which bring amino acids to the ribosome, where they are used to translate mRNA into amino acid sequences.
3. rRNA - ribosomal RNA, sequences which form ribosomes (along with ribosomal proteins).
4. viral RNA

It is important to note that most RNA folding algorithms predict only secondary, rather than tertiary, structure. The three-dimensional shape of the molecule is important to molecular function, but is harder to predict. This is because tertiary structure is known from crystallography for only a few RNAs, most notably tRNA. Secondary structure is usually considered a sufficient approximation until more is known about the tertiary structure of RNA.
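To make concrete what a secondary structure prediction algorithm computes, here is a minimal sketch of base pair maximization (often called the Nussinov algorithm). It is a deliberately simpler stand-in, not the free energy minimization described above: every allowed pair (GC, AU, GU) scores 1 and no empirical energy parameters are used. The minimum hairpin loop size of 3 and the example sequence are assumptions made for illustration.

```python
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases inside a hairpin loop


def max_base_pairs(seq: str) -> tuple[int, str]:
    """Return the maximum number of nested base pairs and one such structure."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - MIN_LOOP):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Trace back one optimal structure in dot-bracket notation.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - MIN_LOOP):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return best[0][n - 1], "".join(structure)


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))  # made-up example sequence
```

A real folding program has the same dynamic programming skeleton but replaces the score of 1 per pair with additive free energy contributions for stacks and loops, and minimizes instead of maximizes. Like most such algorithms, this sketch only produces nested pairs and therefore cannot find pseudoknots.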
Predicting RNA secondary structure

The number of possible secondary structures (S) of n bases with k base pairs is given as: [equation missing in the source]

A number of strategies for predicting secondary structure have been developed. A taxonomy of folding algorithms could be summarized in the following way:

Deterministic
    Minimum free energy
    Kinetic folding*
    5'-3' folding*
    Partition function
Stochastic
    Simulated annealing*

* algorithm can predict pseudo-knots

Even though we can find the minimum free energy structure of a sequence in computationally tractable time, there may be more than one structure with the optimum free energy. There may also be many structures within 5% to 10% of the minimum free energy, and these may be topologically very different. A minimum energy folding algorithm will return only one secondary structure, though there are many candidates for the natural structure. To address this, some software packages will display a number of suboptimal folds.

Inferring which structure is truly representative of the natural structure requires additional information. Phylogenetic information is often used to constrain the search by identifying highly conserved motifs. Some programs allow the user to specify constraints on the secondary structure, by specifying paired, single-stranded, or nonpairable regions, or by actively participating in the folding process.

Of course, existing folding algorithms rest on a number of limiting assumptions. The issues they simplify or ignore include the kinetics of folding during transcription, the difficulty of predicting pseudo-knots, the role of chaperone proteins in folding, and the importance of modified bases. Some algorithms attempt to incorporate these considerations. At best, RNA folding algorithms are first-order approximations used to infer the natural structure of a known sequence.
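The size of the search space mentioned at the start of this section can be explored numerically. The sketch below counts secondary structures of n bases with exactly k base pairs using a simple recurrence. It is not claimed to be the formula the text refers to, and it rests on assumptions that are mine: any base may pair with any other (the sequence is ignored), pairs are nested (no pseudo-knots), and a hairpin loop must contain at least MIN_LOOP unpaired bases.

```python
from functools import lru_cache

MIN_LOOP = 1  # assumed minimum number of unpaired bases enclosed by a pair


@lru_cache(maxsize=None)
def count_structures(n: int, k: int) -> int:
    """Number of nested secondary structures on n bases with exactly k pairs."""
    if k < 0:
        return 0
    if k == 0:
        return 1  # the single structure with every base unpaired
    if n < 2 * k + MIN_LOOP:
        return 0  # not enough bases for k pairs plus one hairpin loop
    # Case 1: the last base is unpaired.
    total = count_structures(n - 1, k)
    # Case 2: the last base (position n) pairs with position j (1-based).
    for j in range(1, n - MIN_LOOP):
        inside = n - j - 1  # bases enclosed by the pair (j, n)
        for a in range(k):  # a pairs to the left of j, k - 1 - a pairs inside
            total += count_structures(j - 1, a) * count_structures(inside, k - 1 - a)
    return total


def count_all_structures(n: int) -> int:
    """Number of structures on n bases with any number of pairs."""
    return sum(count_structures(n, k) for k in range(n // 2 + 1))


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(n, count_all_structures(n))
```

Even with these generous simplifications the count grows exponentially with n, which is why folding programs rely on dynamic programming rather than enumerating candidate structures one by one.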
Predicting Protein Structure

A number of factors exist that make protein structure prediction a very difficult task, including:

• The number of possible structures that proteins may possess is extremely large.
• The physical basis of protein structural stability is not fully understood.
• The tertiary structure of a native protein may not be readily formed without the aid of trans-acting factors. For example, proteins known as chaperones are required for some proteins to properly fold; other proteins cannot fold properly without modifications such as glycosylation.
• A particular sequence may be able to assume multiple conformations depending on its environment, and the biologically active conformation may not be the most thermodynamically favorable.
• Direct simulation of protein folding via methods such as molecular dynamics is not generally tractable for both practical and theoretical reasons.

Programs

DNA and RNA Structure Prediction

Circles [2]

Circles is an experimental Windows program for inferring RNA secondary structure using the comparative method. The user can compute a maximum weight matching (MWM) and export one or more secondary structures in standard formats. The program displays the sequences in a sequence window (Picture 2).

Picture 2: Alignment of mitochondrial 12S rRNA sequences in animals.

Viewing the result

The MWM can be computed from two different sources of information, mutual information and helix plot scores. There are various options determining how the scores are computed, and if both mutual information and helix plot scores are used, the relative weights given to each source can be specified. The program displays a circle plot of the pairings for the sequence. You can toggle between two different styles of drawing the pairings, and choose whether the bases are displayed or not.

In a circle plot, the RNA sequence is depicted as a circle and the base pairs as lines or chords connecting pairs of bases.
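The mutual information score mentioned above measures how strongly two alignment columns covary, which is the signal the comparative method looks for. How Circles computes and weights its scores is not described here, so the following is only a sketch of the textbook definition; the function name and the example columns are made up.

```python
from collections import Counter
from math import log2


def mutual_information(col_i: str, col_j: str) -> float:
    """Mutual information (in bits) between two columns of an alignment.

    Each string holds the bases found at one alignment position, one
    character per sequence, in the same sequence order for both columns.
    """
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


if __name__ == "__main__":
    # Perfectly covarying columns: every change in one is matched in the other.
    print(mutual_information("GCAU", "CGUA"))
    # Fully conserved columns carry no covariation signal.
    print(mutual_information("GGGG", "CCCC"))
```

Two positions that always change together score high, while conserved positions score zero even if they are paired, which is presumably one reason a second source of evidence such as helix plot scores is useful.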
Helices are indicated by sets of parallel chords (lines). They can be straight or curved, as presented in picture 3. If the lines cross, this may be evidence of a pseudoknot, or it may be due to weak or conflicting evidence for different helices.

Picture 3: Circle plots of mitochondrial 12S rRNA sequences in human

RNA Shapes [3]

RNA Shapes offers five major program modes:

Shape folding: RNA folding based on abstract shapes. This is the standard mode of operation when no other options are given. It calculates the shapes and the corresponding shreps (shape representative structures) based on free energy minimization.

Suboptimal shape folding: Complete suboptimal folding of RNA. This mode uses a non-ambiguous grammar that also handles dangling bases of multiloop components in a non-ambiguous way.

Shape probabilities: This option calculates the shape probabilities based on the partition function. The probability of a shape is the sum of the probabilities of all structures that fall into this shape.

Sampling: Probabilistic sampling based on the partition function. This mode combines stochastic sampling with a-posteriori shape abstraction. A sample from the structure space holds M structures together with their shapes, on which classification is performed. The probability of a shape can then be approximated by its frequency in the sample. Sequences up to a length of around 1500 can be handled with this mode. According to the program documentation, 1000 iterations are sufficient to achieve reasonable results for shapes with high probability.

Consensus shapes: For a family of RNA sequences, this method independently enumerates the near-optimal abstract shape space, and predicts as the consensus an abstract shape common to all sequences. For each sequence, it delivers the thermodynamically best structure which has this common shape.
Since the shape space is much smaller than the structure space, and identification of common shapes can be done in linear time (in the number of shapes considered), the method is essentially linear in the number of sequences.

Shape type

The shape type is the level of abstraction or dissimilarity which defines a different shape. In general, helical regions are depicted by a pair of opening and closing square brackets, and unpaired regions are represented as a single underscore. The shape types differ in whether a structural element (bulge loop, internal loop, multiloop, hairpin loop, stacking region and external loop) contributes to the shape representation. Five types are implemented:

1. Most accurate - all loops and all unpaired regions
2. Nesting pattern for all loop types and unpaired regions in external loop and multiloop
3. Nesting pattern for all loop types but no unpaired regions
4. Helix nesting pattern and unpaired regions in external loop and multiloop
5. Most abstract - helix nesting pattern and no unpaired regions

The user can change many parameters so that, for example, lonely base pairs are allowed or unstable structures (positive free energy) are ignored. Structure graphs can also be created as PostScript files. One structure graph is presented in picture 4.

Picture 4: Prediction of folding created with RNAshapes

Protein Structure Prediction

WHAT IF [4]

WHAT IF is a server that provides various methods for exploring the properties of proteins. Here I show how a 2D image of a protein is constructed. When the molecule is coloured by B-factor, it is coloured according to its temperature factor, from dark blue for low B-factor to red for high B-factor. When it is coloured by secondary structure, blue means helix, red means strand, and green means turns and random coil. In the B-factor plot, the height at each residue position indicates the average B-factor of all atoms in the residue.
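The per-residue average behind such a B-factor plot can be sketched directly from the atom records of a PDB file. This is not how WHAT IF itself is implemented; it only assumes the standard fixed-column ATOM record layout (residue name in columns 18-20, chain in column 22, residue number in columns 23-26, B-factor in columns 61-66). The helper atom_line and all values in the demo are made up for illustration and do not come from a real PDB entry.

```python
from collections import defaultdict


def atom_line(serial, name, res_name, chain, res_seq, x, y, z, occupancy, b_factor):
    """Build a simplified ATOM record with the standard fixed-column layout."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}"
    )


def mean_b_factor_per_residue(pdb_text: str) -> dict[tuple[str, int, str], float]:
    """Average the B-factor over all ATOM records of each residue.

    Keys are (chain, residue number, residue name).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        key = (line[21], int(line[22:26]), line[17:20].strip())
        sums[key] += float(line[60:66])  # columns 61-66: temperature factor
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = "\n".join([
        atom_line(1, "N", "VAL", "A", 1, 0.0, 0.0, 0.0, 1.0, 10.0),
        atom_line(2, "CA", "VAL", "A", 1, 1.5, 0.0, 0.0, 1.0, 20.0),
        atom_line(3, "N", "LEU", "A", 2, 2.5, 1.0, 0.0, 1.0, 30.0),
    ])
    for residue, value in mean_b_factor_per_residue(demo).items():
        print(residue, round(value, 2))
```

Plotting these averages against the residue number would give the height profile described above; mapping each value onto a blue-to-red scale would give the B-factor colouring.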
Picture 5 (left): B-factor plot of oxyhemoglobin generated with WHAT IF
Picture 6 (right): 2D image of oxyhemoglobin generated with WHAT IF

Visualizing Proteins

Protein structures have already been widely modeled, so if you want to find out what a particular protein looks like, the easiest way is to use a program designed for visualizing proteins and load a PDB file into it. A PDB file is a data file that specifies the positions in space of every atom in a molecule. The generic name for such a file is an atomic coordinate file. I found several pages from which to get PDB data. With some programs it is enough to know the PDB identification code, which is a four-character code (examples: 1hho oxyhemoglobin, 1bl8 potassium channel). Atlas of Macromolecules [8] is an easy-to-use web page, and it works great with Protein Explorer. Worldwide Protein Data Bank [9] is a collaboration of several institutes around the world, and it has perhaps the largest collection of PDB files. I used DBGET Search [10] with DeepView, because it was the fastest page for downloading PDB files to your own computer.

The PDB format is quite old, but it is still the most widely used format because all relevant software can read it. A newer and more flexible alternative format, agreed upon by the International Union of Crystallography, is mmCIF (macromolecular Crystallographic Information File). Although mmCIF is offered by the PDB, its use is not yet universal.

Protein Explorer [5]

Protein Explorer is a web-based program which allows the user to easily visualize proteins. The program is used with an internet browser, and MDL Chime is required. This program is great if you tend to surf the internet and want to see proteins saved in PDB format. All you have to do is click the link, and Protein Explorer does the rest. So you don't have to be a rocket scientist to use Protein Explorer. About the user interface I have some complaints.
Visualizing the molecule is in itself good, and the user can choose between 2D and 3D views. But the rest of the program is a mess. It feels like you are browsing through a cheap commercial web page. There are also not many options for affecting the visualization. Below is a screenshot from Protein Explorer.

Picture 7: Collagen fiber, 1CAG (1994) viewed with Protein Explorer

DeepView Swiss-PdbViewer [6]

DeepView - Swiss-PdbViewer is an application that provides a user-friendly interface for analyzing several proteins at the same time. The proteins can be superimposed in order to deduce structural alignments and compare their active sites or any other relevant parts. DeepView is perhaps a little harder to use, because the user has to download PDB files from the internet. Compared to Protein Explorer, however, it looks more professional and the user interface is much more versatile. I believe there is much potential in this program, even though not all the functions in the menus were ready yet. Even now you can, for example, compute amino acid mutations, H-bonds, angles and distances between atoms, molecular surface, electrostatic potential etc.

Colour codes in the pictures are: carbon (C) white, oxygen (O) red, nitrogen (N) blue, sulfur (S) yellow, phosphorus (P) orange, hydrogen (H) cyan and other atoms grey.

Picture 8: Adenylated full-length T4 RNA Ligase 2 viewed with DeepView

With POVRay 3.1 and the POV modeler Moray 3 it is possible to create 3D-rendered images of proteins and other molecules. All that is needed is the PDB file from one of the previously mentioned protein data banks. An example is in picture 9, which shows hemocyanin rendered with extra effects.
Picture 9: 3D-rendered model of the oxygenated hemocyanin active site [7]

References

[1] Frontpage picture: 1R2W RNA BINDING PROTEIN/RNA, http://www.rcsb.org/pdb/explore/explore.do?structureId=1R2W
[2] Circles, http://taxonomy.zoology.gla.ac.uk/rod/circles/
[3] RNA Shapes, http://bibiserv.techfak.uni-bielefeld.de/rnashapes/
[4] WHAT IF, http://swift.cmbi.kun.nl/WIWWWI/
[5] Protein Explorer and MDL Chime, http://molvis.sdsc.edu/protexpl/frntdoor.htm, http://molvis.sdsc.edu/protexpl/mdlchime.htm
[6] DeepView Swiss-PdbViewer, http://au.expasy.org/spdbv/
[7] Hemocyanin active-site, http://wwwchem.leidenuniv.nl/metprot/armand/008.html
[8] Atlas of Macromolecules, http://molvis.sdsc.edu/atlas/atlas.htm
[9] Worldwide Protein Data Bank, http://www.wwpdb.org/
[10] DBGET Search, http://www.genome.jp/dbget-bin/www_bfind?pdb
[11] X.Z. Fu et al.: RNA Pseudoknot Prediction using Term Rewriting, http://www.lce.hut.fi/teaching/S-114.500/DNA,RNA,Protein.pdf
[12] Wikipedia: DNA, RNA, Protein Structure Prediction, http://en.wikipedia.org/wiki/