DNA, RNA and Protein Structure Prediction

advertisement
DNA, RNA and Protein Structure Prediction
Eero Pennala
63203L
Foreword
This is an exercise work for the course S-114.500 Basics for Biosystems of the Cell. First
we go through the basics of DNA, RNA and proteins, and structure prediction concerning
them. After that some freeware programs related to these issues are introduced. There are
more freeware programs that could have been tested, but most them require a special
license from the author. Even though I mailed to a couple of institutes, I didn’t get a
response and a license. Also many programs are only for Linux systems and I didn’t have
a chance to try them. Some of the free programs are only a server, where you can input
your file and get result. This is in itself easy, but the user doesn’t get an answer to the
question How?, which I as an engineer would appreciate.
Foreword ............................................................................................................................. 2
Basics .................................................................................................................................. 3
DNA................................................................................................................................ 3
Alpha-helix ................................................................................................................. 3
RNA ................................................................................................................................ 3
RNA folding................................................................................................................ 4
Predicting RNA secondary structure .......................................................................... 5
Predicting Protein Structure............................................................................................ 6
Programs ............................................................................................................................. 6
DNA and RNA Structure Prediction............................................................................... 6
Circles ......................................................................................................................... 6
RNA Shapes................................................................................................................ 7
Protein Structure Prediction............................................................................................ 9
WHAT IF .................................................................................................................... 9
Visualizing Proteins ........................................................................................................ 9
Protein Explorer ........................................................................................................ 10
DeepView Swiss-PdbViewer.................................................................................... 10
Basics
DNA
The structure of deoxyribonucleic acid (DNA) was discovered by James Watson and
Francis Crick in 1953. DNA was determined to be a right handed double helix based on
x-ray crystallographic data provided to Watson and Crick by Maurice Wilkins and
Rosalind Franklin. DNA is composed of repeating subunits called nucleotides.
Nucleotides are further composed of a phosphate group, a sugar, and a nitrogenous base.
Four different bases are commonly found in DNA: adenine (A), guanine (G), cytosine
(C), and thymine (T). In their common structural configurations, A and T form two
hydrogen bonds while C and G form three hydrogen bonds. Because of the specificity of
base pairing, the two strands of DNA are said to be complementary. This characteristic
makes DNA unique and capable of transmitting genetic information.
Alpha-helix
The amino acids in an α helix are arranged in a right-handed helical structure, 5.4 Å wide.
Each amino acid corresponds to a 100° turn in the helix), and a translation of 1.5 Å along
the helical axis. Most importantly, the N-H group of an amino acid forms a hydrogen
bond with the C = O group of the amino acid four residues earlier. This operation is
repeated
, and this hydrogen bonding defines an α-helix.
RNA
Ribonucleic acid (RNA) is a nucleic acid polymer consisting of nucleotide monomers.
RNA nucleotides contain ribose rings and uracil unlike deoxyribonucleic acid (DNA),
which contains deoxyribose and thymine. It is transcribed (synthesized) from DNA by
enzymes called RNA polymerases and further processed by other enzymes. RNA serves
as the template for translation of genes into proteins, transferring amino acids to the
ribosome to form proteins, and also translating the transcript into proteins.
Picture 1: RNA Structural motifs [11]
RNA folding
RNA is transcribed in cells as single strands of (ribose) nucleic acids. However, these
sequences are not simply long strands of nucleotides. Rather, intra-strand base pairing
will produce structures such as the ones shown in picture 1. In RNA, guanine and
cytosine pair (GC) by forming a triple hydrogen bond, and adenine and uracil pair (AU)
by a double hydrogen bond; additionally, guanine and uracil can form a single hydrogen
bond base pair.
The stability of a particular secondary structure is a function of several constraints:
1. The number of GC versus AU and GU base pairs.
(Higher energy bonds form more stable structures.)
2. The number of base pairs in a stem region.
(Longer stems result in more bonds.)
3. The number of base pairs in a hairpin loop region.
(Formation of loops with more than 10 or less than 5 bases requires more energy.)
4. The number of unpaired bases, whether interior loops or bulges.
(Unpaired bases decrease the stability of the structure.)
The stability of a secondary structure is quantified as the amount of free energy released
or used by forming base pairs. Positive free energy requires work to form a configuration;
negative free energies release stored work. Free energies are additive, so one can
determine the total free energy of a secondary structure by adding all the component free
energies (units are kilocalories per mole). The more negative the free energy of a
structure, the more likely is formation of that structure, because more stored energy is
released. This fact is used to predict the secondary structure of a particular sequence.
Discovering a base pair configuration with the minimum possible free energy is the goal
of most secondary structure prediction algorithms.
To compute the minimum free energy of a sequence, empirical energy parameters are
used. These parameters summarize free energy change (positive or negative) associated
with all possible pairing configurations, including base pair stacks and internal base pairs,
internal, bulge and hairpin loops, and various motifs which are know to occur with great
frequency.
Four major classes of RNA exist, and can be found in most organisms:
1. mRNA - messenger RNA, is a sequence which codes for formation of one or
more proteins.
2. tRNA - transfer RNA, small (~80 bases) sequences which bring amino acids
to the ribosome, where they translate mRNA into amino acid sequences.
3. rRNA - ribosomal RNA sequences form ribosomes (along with ribosomal
proteins).
4. viral RNA
It is important to note that most RNA folding algorithms predict only secondary, rather
than tertiary structure. The three-dimensional shape of the molecule is important to
molecular function, but is harder to predict. This is because tertiary structure is known
from crystallography for only tRNA sequences. Secondary structure is usually considered
a sufficient approximation, until more is known about tertiary structure of RNA.
Predicting RNA secondary structure
The number of possible secondary structures (S) of n bases with k base pairs is given as
A number of strategies for predicting secondary structure have been developed. A
taxonomy of folding algorithms could be summarized in the following way:
Deterministic
Minimum free energy
Kinetic folding*
5'-3' folding*
Partition function
Stochastic
Simulated annealing*
* algorithm can predict pseudo-knots
Now that we can find the minimum free energy structure of a sequence in
computationally tractable time, there may be more than one structure with the optimum
free energy. Or there may be many structures within 5% to 10% of the minimum free
energy, and these may be topologically very different. A minimum energy folding
algorithm will return only one secondary structure, though there are many candidates for
the natural structure. To address this, some software packages will display a number of
suboptimal folds. Inferring what structure is truly representative of the natural structure
requires additional information. Phylogenetic information is often used to constrain the
search by identifying highly conserved motifs. Some programs allow the user to specify
constraints on the secondary structure, by specifying paired, single-stranded, or nonpairable regions, or by actively participating in the folding process.
Of course, there are a number of limiting assumptions to existing folding algorithms.
These include the kinetics of folding during transcription, the difficulty of predicting
pseudo-knots, the role of chaperone proteins in folding, and the importance of modified
bases. Some algorithms attempt to incorporate these considerations. At best, RNA folding
algorithms are first-order approximations used to infer the natural structure of a known
sequence.
Predicting Protein Structure
A number of factors exist that make protein structure prediction a very difficult task,
including:
•
•
•
•
•
The number of possible structures that proteins may possess is extremely large
The physical basis of protein structural stability is not fully understood.
The tertiary structure of a native protein may not be readily formed without the
aid of trans-acting factors. For example, proteins known as chaperones are
required for some proteins to properly fold; other proteins cannot fold properly
without modifications such as glycosylation.
A particular sequence may be able to assume multiple conformations depending
on its environment, and the biologically active conformation may not be the most
thermodynamically favorable.
Direct simulation of protein folding via methods such as molecular dynamics is
not generally tractable for both practical and theoretical reasons.
Programs
DNA and RNA Structure Prediction
Circles[2]
Circles is an experimental Windows program for inferring RNA secondary structure
using the comparative method. The user can compute a maximum weight matching, and
export one or more secondary structures in standard formats. The program will display
the sequences in a sequence window (Picture 2).
Picture 2: alignment of mitochondrial 12S rRNA sequences in animals.
Viewing the result MWM can be computed from two different sources of information,
mutual information and helix plot scores. There are various options determining how the
scores are computed, and if both mutual information and helix plot scores are used the
relative weights given to each source can be specified. The program will display a circle
plot of the pairings for the sequence. You can toggle between two different styles of
drawing the pairings, and whether you want the bases displayed or not. A circle plot the
RNA sequence is depicted as a circle and the base pairs by lines or chords connecting
pairs of bases. Helices are indicated by sets of parallel chords (lines).They can be straight
or curved, as presented in picture 3. If the lines overlap then this may be evidence of a
pseudoknot, or it may be due to weak or conflicting evidence for different helices.
Picture 3: circle plots of mitochondrial 12S rRNA sequences in human
RNA Shapes[3]
RNA Shapes offers five major program modes:
Shape folding:
RNA folding based on abstract shapes. This is the standard mode of operation when no
other options are given. It calculates the shapes and the corresponding shreps based on
free energy minimization.
Suboptimal shape folding:
Complete suboptimal folding of RNA. This mode uses a non-ambiguous grammar that
also handles dangling bases of multiloop components in a non-ambiguous way.
Shape probabilities:
This option calculates the shape probabilities based on partition function. The probability
of a shape is the sum of the probabilities of all structures that fall into this shape.
Sampling:
Probabilistic sampling based on partition function. This mode combines stochastic
sampling with a-posteriori shape abstraction. A sample from the structure space holds M
structures together with their shapes, on which classification is performed. The
probability of a shape can then be approximated by its frequency in the sample.
Sequences up to a length of around 1500 can be handled with this mode. In our
experience, 1000 iterations are sufficient to achieve reasonable results for shapes with
high probability.
Consensus shapes:
For a family of RNA sequences, this method independently enumerates the near-optimal
abstract shape space, and predicts as the consensus an abstract shape common to all
sequences. For each sequence, it delivers the thermodynamically best structure which has
this common shape. Since the shape space is much smaller than the structure space, and
identification of common shapes can be done in linear time (in the number of shapes
considered), the method is essentially linear in the num
Shape type
The shape type is the level of abstraction or dissimilarity which defines a different shape.
In general, helical regions are depicted by a pair of opening and closing square brackets
and unpaired regions are represented as a single underscore. The differences of the shape
types are due to whether a structural element (bulge loop, internal loop, multiloop,
hairpin loop, stacking region and external loop) contributes to the shape representation.
Five types are implemented:
1 Most accurate - all loops and all unpaired
2 Nesting pattern for all loop types and unpaired regions in external loop and multiloop
3 Nesting pattern for all loop types but no unpaired regions
4 Helix nesting pattern and unpaired regions in external loop and multiloop
5 Most abstract - helix nesting pattern and no unpaired regions
User can change many parameters, so that for example lonely base pairs are allowed or
unstable structures (positive free energy) are ignored. Also structure graphs can be
created as postscript files. One structure graph is presented in picture 4.
Picture 4: Prediction of folding created with RNAshapes
Protein Structure Prediction
WHAT IF [4]
WHAT IF is a server which provides various methods to explore the properties of
proteins. Here is presented how 2d-image of a protein is constructed.
The B-factor plot means that the molecule will be colored accordingly to its
temperature factor, from dark blue for low B-factor to red for high B-factor. Blue
means helix, red means strand and green means turns and random coil. The height at each
residue position indicates the average B-factor of all atoms in the residue.
Left Picture 5: B-factor plot of oxyhemoglobin generated with WHAT IF
Right Picture 6: 2d image of oxyhemoglobin generated with WHAT IF
Visualizing Proteins
Protein structures have already been widely modeled, so if you want to find out how a
particular protein looks like, the easiest way is to use a program designed for visualizing
proteins and load a PDB file into it. PDB files is a data file that specifies the positions in
space of every atom in a molecule. The generic name for such a file is an atomic
coordinate file. I found several pages from where to get pdb-data. With some programs it
is enough to know the PDB identification code, which is a four-character code
(Examples:1hho oxyhemoglobin, 1bl8 potassium channel).
Atlas of Macromolecules [8] is easy to use webpage, and it works great with Protein
Explorer. Worldwide Protein Data Bank[9] is a communion of several institutes around
the world, and it has perhaps the largest collection of PDB files. I used DBGET Search
[10] with Deepview, because it was the fastest page to download PDB files to your own
computer. PDB format is quite old, but is still the most widely used format because all
relevant software can read it. An newer and more flexible alternative format, agreed upon
by the International Union of Crystallographers, is mmCIF (macromolecular
crystallographic information format). Although mmCIF is offered by the PDB, its use is
not yet universal.
Protein Explorer [5]
Protein Explorer is a web-based program, which allows the user to easily visualize
proteins. Program is used with internet browser, and MDL Chime is required. This
program is great if you tend surf through internet and want to see proteins saved in PDB
format. All you have to do is to click the link and Protein Explorer does the rest. So you
don’t have to be a rocket scientist to use Protein Explorer.
About the user interface I have some complaints. Visualizing the molecule is in itself
good, and the user can choose between 2D and 3D views. But the rest of the program is a
mess. It feels like you are browsing through a cheap commercial webpage. Also there are
not many options how the user can affect in the visualization. Below is a screenshot from
Protein Explorer.
Picture 7: Collagen fiber, 1CAG (1994) viewed with Protein Explorer
DeepView Swiss-PdbViewer [6]
DeepView - Swiss-PdbViewer is an application that provides a user friendly interface
allowing to analyze several proteins at the same time. The proteins can be superimposed
in order to deduce structural alignments and compare their active sites or any other
relevant parts
DeepView is perhaps a little harder to use, because user has to download pdb-files from
the internet. Compared to Protein Explorer it is however more professional looking and
the user interface is much more versatile. I believe there is much potential in this
program, and all the functions in the menus were not even ready yet. But even now you
can for example compute amino acid mutations, H-bonds, angles and distances between
atoms, molecular surface, electrostatic potential etc. Colour codes in pictures are:
carbon(C) white, oxygen(O) red, nitrogen(N) blue, sulfur(S) yellow, phosphor(P) orange
hydrogen(H) cyan and other molecules grey.
Picture 8: Adenylated full-length T4 RNA Ligase 2 viewed with DeepView
With POVRay 3.1 and POV modeler Moray 3 it is possible to create 3D-rendered images
of proteins and other molecules. All that is needed is the pdb file from one of the
previously mentioned protein data banks. An example is in picture 9, which shows
hemocyanin rendered with extra effects.
Picture 9: 3d-rendered model of the oxygenated hemocyanin active-site [7]
Viitteet
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
Frontpage picture: 1R2W RNA BINDING PROTEIN/RNA
http://www.rcsb.org/pdb/explore/explore.do?structureId=1R2W
Circles
http://taxonomy.zoology.gla.ac.uk/rod/circles/
RNA Shapes
http://bibiserv.techfak.uni-bielefeld.de/rnashapes/
WHAT IF
http://swift.cmbi.kun.nl/WIWWWI/
Protein Explorer and MDL Chime
http://molvis.sdsc.edu/protexpl/frntdoor.htm
http://molvis.sdsc.edu/protexpl/mdlchime.htm
DeepView Swiss-PdbViewer
http://au.expasy.org/spdbv/
Hemocyanin active-site
http://wwwchem.leidenuniv.nl/metprot/armand/008.html
Atlas of Macromolecules
http://molvis.sdsc.edu/atlas/atlas.htm
Worldwide Protein Data Bank
http://www.wwpdb.org/
DBGET Search
http://www.genome.jp/dbget-bin/www_bfind?pdb
X.Z. Fu et.al.: RNA Pseudoknot Prediction using Term Rewriting,
http://www.lce.hut.fi/teaching/S-114.500/DNA,RNA,Protein.pdf
Wikipedia: DNA, RNA, Protein Structure Prediction.
http://en.wikipedia.org/wiki/
Download