4. Nucleic Acids

Overview
1. Introduction
2. Chemical structure of nucleic acids
3. 3D-structure of DNA
4. Copying of DNA
5. Genetic code
6. Translation
7. Tools in genetics

1. Introduction

Dogma in molecular biology

"Information is stored in DNA, copied to RNA and used to build proteins"

    (replication) DNA --transcription--> RNA --translation--> Protein

DNA:            deoxyribonucleic acid
RNA:            ribonucleic acid
replication:    copying of the entire genome prior to cell division
transcription:  copying of one or a few genes (from the DNA) to RNA
translation:    synthesis of a protein according to information from RNA

Processes that "violate" the dogma (empty arrows in the figure): RNA viruses and reverse transcription.

The sequence of nucleic acids represents the basis for the storage of genetic information. The 3D-structure is important for processes such as the reading of this information.

Chemical composition of E. coli cells

Molecule         Number    Types   Molecular mass    % of cell mass
nucleic acids:
  DNA            2-4       1       2.6*10^9          1
  mRNA           1000      1000    8*10^5            \
  tRNA           4*10^5    60      25'000             } 6
  rRNA           30'000    3       10^5 - 10^6       /
proteins         10^6      3000    40'000            15
H2O                                                  70

Nucleic acids and proteins make up a large fraction of the cellular mass.

Overview of nucleic acids

There are two main types of nucleic acids: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). As explained later, the chemical differences are subtle but have significant consequences for stability and structure.

DNA

DNA is found in cells of prokaryotes, in cell nuclei and mitochondria of eukaryotes, and in viruses. It occurs in linear or circular form. The following table summarizes information on the occurrence and size of DNA. ("Base pairs" will be explained later.)

Organism                   base pairs    genes    length [mm]
Simian virus 40            5243          6        0.0017
Bacteriophage T4           ~166'000      >100     0.061
E. coli                    4'720'000     >3000    1.3
Yeast (S. cerevisiae)      13*10^6                4.3
Drosophila melanogaster    165*10^6               56
Mammals                    3000*10^6              1000

Mammals have about 1 m of DNA distributed over chromosomes (23 in humans, haploid genome).
The muntjak, an Asian deer, has about 1000*10^6 base pairs, but only 3 large chromosomes. Only about 10% of the DNA in eukaryotes stores information: ~30'000 genes.

RNA

There are 3 types of RNA: transfer RNA, ribosomal RNA and messenger RNA. The sizes below are for E. coli.

Name                size (nucleotides)    function
tRNA (transfer)     75-95                 transfer of amino acid to ribosome
rRNA (ribosomal)    16S: 1542             structure and function of ribosome
                    23S: 2904
                    5S:  120
mRNA (messenger)    100 - 10'000          copy of gene read by ribosome

Ribosomes are very large molecular systems consisting of many proteins and nucleic acids. They translate genomic information from mRNAs into proteins (see later). The size indications "16S" etc. stem from ultracentrifugation measurements and indicate the sedimentation behavior (units are Svedberg).

Special attention has recently been given to certain small RNA strands that form specific 3D structures with catalytic activity, the ribozymes. Interest in these molecules is twofold: they may shed new light on the question of what came first, proteins or DNA, and they may allow the construction of new types of "enzymes".
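The length column of the DNA table can be checked with a short calculation. The sketch below is an illustration rather than part of the original notes: it assumes that every base pair contributes the B-DNA rise of 3.4 Å given in the table of duplex forms later in these notes, and the function name `contour_length_mm` is hypothetical. The tabulated lengths agree with this estimate only roughly (the entries for bacteriophage T4 and E. coli deviate noticeably).

```python
# Rise per base pair in ideal B-DNA: 3.4 Å = 0.34 nm (assumed for all genomes).
RISE_PER_BP_NM = 0.34


def contour_length_mm(base_pairs):
    """Estimate the length of a B-DNA duplex in millimetres."""
    length_nm = base_pairs * RISE_PER_BP_NM
    return length_nm * 1e-6  # 1 mm = 10^6 nm


# Base-pair counts copied from the table of DNA sizes.
genomes = {
    "Simian virus 40": 5243,
    "Bacteriophage T4": 166_000,
    "E. coli": 4_720_000,
    "Yeast (S. cerevisiae)": 13e6,
    "Drosophila melanogaster": 165e6,
    "Mammals": 3000e6,
}

if __name__ == "__main__":
    for name, bp in genomes.items():
        print(f"{name:26s} {contour_length_mm(bp):12.4f} mm")
```

For the mammalian genome the estimate is about 1020 mm, consistent with the "about 1 m of DNA" quoted above.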
History

Year       Event                                   Names (Nobel Prize)
1941       "one gene - one enzyme" hypothesis      Beadle, Tatum (1958)
1944       genetic information is on DNA           Avery
1946       bacterial genetics                      Lederberg, Tatum (1958)
1952       DNA: information for proteins           Hershey, Chase (1969)
1953       DNA double helix                        Watson, Crick (1962)
~1957      dogma of molecular biology              (Crick)
1961       gene regulation, lac-operon             Jacob, Monod (1965)
1962       restriction enzymes                     Arber (1978)
1965       first sequence of a tRNA (yeast, Ala)   Holley (1968)
1962-66    genetic code                            Nirenberg, Khorana (1968)
~1970      gene technology
1976       first 3D structure of an RNA            Rich, Klug (1982)
1977       sequencing of DNA                       Gilbert, Sanger (1980)
1978       splicing of mRNA                        Sharp, Roberts (1993)
1980...    x-ray/NMR: DNA structure studies        Rich, Dickerson
1982       ribozymes                               Cech, Altman (1989)
1982       photosynthetic reaction center          Deisenhofer, Huber, Michel (1988)
1984       homeobox                                Lewis, Nüsslein-Volhard, Wieschaus (1995)
1986       PCR (polymerase chain reaction)         Mullis (1993)
1991       prions                                  Prusiner (1997)
2000       ribosome at atomic resolution           Steitz

Years in parentheses are the years in which a Nobel Prize was awarded for the work.

2. Chemical structure of nucleic acids

Nucleic acids are linear chains (like proteins). The elements of these chains, forming a linear sequence of "letters" with genomic information, are nucleotides, each composed of a base, a sugar and a phosphate. Sugars and phosphates form the backbone; they are identical for all nucleotides of a nucleic acid chain. The bases are attached as side chains to the sugars; they differ from one "letter" to the next. Four different bases occur in DNA (and RNA). Consult also the figure below for the following discussion of the three components.

Phosphate group

Each phosphate group carries (at normal pH) one negative charge. The oxygen atoms provide reactive centers for hydrogen bonds or ion binding.
Sugars (see figure below)

Consist of 5 carbons: C1' to C5'
Chiral centers at: C2', C3', C4'
Full name: (2'-deoxy-)β-D-ribofuranose
  furanose:  5-membered ring
  ribose:    same chirality at C2', C3', C4'
  D:         configuration around C4'
  β:         trans configuration of oxygens on C1' and C2' (base!)
  2'-deoxy:  DNA rather than RNA

Chemical structure

[Figure: RNA fragment AUG (end groups missing); atom radii: P > O > N > C > H. Left: gray shades for different nucleotides. Right: phosphate groups (black), sugars (light gray) and bases (dark gray).]

[Figure: sugar fragment with atom labels H-O5', C5', C4', O4', C3', H-O3', C2', (O2'), C1' and the base attached at C1'.]

The ring has internal flexibility: ring pucker (with strain). The DNA sugars have two O-H groups, 3' and 5', where the phosphate groups are attached. This means that the backbone is oriented.

Bases

Bases are heterocyclic rings with aromatic character; the rings are therefore (almost) planar. The chemical structure of the common bases and the atom numbering is provided in the figure below. Additional, rare bases occur in tRNA (see later). Characteristic features of the bases are:

Purines:
  adenine (A, Ade):   NH2 at position 6
  guanine (G, Gua):   O at position 6, NH2 at position 2
Pyrimidines:
  thymine (T, Thy):   CH3 at position 5, O at positions 2 and 4
  cytosine (C, Cyt):  O at position 2, NH2 at position 4
  uracil (U, Ura):    O at positions 2 and 4

These chemical differences define different patterns for interaction with other molecules (e.g. proteins): hydrogen bond donors and acceptors, hydrophobic patches (see the discussion of the 3D-structure of DNA below).

Nucleotides

Nucleotides consist of one (or more) phosphates, a sugar and a base. The base is attached to the ribose at C1' via a glycosidic linkage. Ester bonds connect the riboses and the phosphates. Normally, the phosphates bind to the ribose at the 5' end.
DNA nucleotides:  deoxyriboses plus A, G, T, C
RNA nucleotides:  riboses plus A, G, U, C

Other related nucleotides are:
  ATP: adenosine triphosphate
  GTP: guanosine triphosphate
  NAD: nicotinamide adenine dinucleotide
These often serve as energy carriers in cells.

Polynucleotides

The combination of several nucleotides results in a sugar-phosphate backbone with a 5'-end group and a 3'-end group, each with or without phosphate. Several notations are used: 5'-pA-C-G-T-3' or simply ACGT. DNA strands are always written from 5' to 3'.

3. 3D-structure of DNA

Interaction between nucleotides

The following interactions (de-)stabilize the 3D-structure of DNA:

- Electrostatic interactions: DNA and RNA carry negative charges. For stability reasons these must interact with external cations, since no positive charges are found on the DNA.
- Van der Waals interactions and base stacking: Electron clouds around atoms are polarized, which results in weak attraction between non-polar atoms. In DNA, a sizeable force is due to the interaction between π-orbitals of stacked bases: "base stacking". This interaction is an important factor in stabilizing DNA double helices.
- Hydrogen bonds: A hydrogen atom may be "shared" between two polar atoms, a donor and an acceptor. Polar groups are found in all parts of a nucleotide; most interesting are those in the bases.

Base pairs are formed by hydrogen bonds between A-T and G-C in DNA, and between A-U and G-C in RNA (see the figure in the section on bases). The bases in base pairs are complementary with respect to hydrogen bonds and space requirements. Consequences are (a) [A] = [T] and [G] = [C] when base pairs are formed ([X] indicates the concentration of X), and (b) GC-rich fragments are more stable than AT-rich ones (see figures below).

For the specific recognition of DNA by other molecules, e.g. proteins, hydrogen bond donors and acceptors as well as hydrophobic groups (CH3 in T) are essential.
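The pairing rules above translate directly into a few lines of code. The following sketch is illustrative only (the function names are hypothetical, not taken from the notes): it derives the complementary strand of a DNA sequence written 5' to 3', measures the G+C fraction that the notes relate to duplex stability, and counts bases in a duplex to show consequence (a), [A] = [T] and [G] = [C].

```python
from collections import Counter

# Watson-Crick pairing partners in DNA.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the complementary strand, also written 5' to 3'.

    The partner strand runs antiparallel, so the complemented bases
    are read in reverse order.
    """
    return "".join(DNA_COMPLEMENT[base] for base in reversed(strand.upper()))


def gc_fraction(strand):
    """Fraction of G and C bases in a strand (0.0 to 1.0)."""
    strand = strand.upper()
    return (strand.count("G") + strand.count("C")) / len(strand)


def duplex_base_counts(strand):
    """Count the bases of both strands of the duplex formed by `strand`."""
    return Counter(strand.upper() + reverse_complement(strand))


if __name__ == "__main__":
    strand = "AAGC"
    print(reverse_complement(strand))   # partner strand, 5' to 3'
    print(gc_fraction(strand))          # G+C fraction of the single strand
    print(duplex_base_counts(strand))   # A equals T and G equals C in the duplex
```

Note that equal base counts hold for the duplex as a whole, not for a single strand taken alone.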
Conformation

Structures may be described in various, equivalent ways:
- (x,y,z) coordinates: provide the 3D-structure, but require many numbers.
- "Internal coordinates": bond lengths, bond angles and torsion angles around bonds. The advantage here is that bond lengths and bond angles can be considered constant, leaving only the torsion angles as parameters. A structure description by torsion angles is called conformation.
- Bonds that can be rotated are all bonds along the backbone and the sugar-base connection; bases are rigid.

Note that the torsion angles in the sugar ring are correlated, and different "sugar puckers" are observed:

[Figure: sugar ring (C1', C2', C3', C4', O4') in the "C3'-endo" pucker and in the "C2'-endo" pucker.]

Duplex structures

Double-stranded DNA molecules adopt the famous double helix proposed by Watson and Crick. Their model was based on the observation of periodicities of both 3.4 Å and 34 Å in fiber diffraction experiments, and on the occurrence of equal concentrations [A]=[T] and [G]=[C]. The model corresponds to an ideal B-DNA form, which is adopted by most DNA duplexes. Other double helix forms are A-DNA, mostly observed for RNA, and Z-DNA, mostly an artifact caused by high salt and the exclusive presence of G-C base pairs (this was, however, the first experimental 3D-structure, in 1980).

General aspects of the model are that complementary strands are formed through the base pairs A-T and C-G, and that these strands run antiparallel to each other. A major groove and a minor groove provide access to the bases, and thus allow sequence-specific recognition! Finally, it should be mentioned that DNA structures are flexible; one should therefore talk about "B-DNA type DNA".

The following figure and table illustrate the three forms and summarize their characteristic features.
Parameter                        A-DNA           B-DNA           Z-DNA
Handedness                       right           right           left
Major groove                     very deep       deep, wide      shallow
Minor groove                     shallow         deep, narrow    very deep
Repeating unit                   1 base pair     1 base pair     2 base pairs
Bases per turn                   10.9            10.0            12.0
Twist per base pair              33°             36°             GC: -51°; CG: -9°
Height per base                  2.9 Å           3.4 Å           GC: 3.5 Å; CG: 4.1 Å
Pitch                            31.6 Å          34.0 Å          45.6 Å
Base pairing                     Watson-Crick    Watson-Crick    Watson-Crick
Sugar pucker                     3' endo         2' endo         C: 2' endo; G: 3' endo
Glycosyl angle                   anti            anti            C: anti; G: syn
Base inclination (90° - tilt)    13°             -2°             9°
Base roll                        6°              -1°             3°
Propeller twist                  15°             12°             4°
Axis displacement                4 Å             0 Å             -3 Å

The following figure explains the quantities twist, tilt, inclination, roll and propeller twist used above.

Other structural features of DNA

Topology

A relaxed DNA is like normal B-DNA ("Tw=14"). Certain enzymes, the topoisomerases, can cut the DNA open and undo double-helical turns ("Tw=12"). In circular DNA, the unwinding can be compensated by supercoiling ("Tw=14, Wr=-2, Lk=12"). Supercoiling is described by the number of twists Tw, the writhe (supercoil turns) Wr, and the linking number Lk: Lk = Tw + Wr. In the supercoiled example, Lk = 14 + (-2) = 12. Lk can only be changed by cutting circular DNA open.

Palindromic DNA

Palindromic DNA plays an important role during various events of DNA reading; an example is the early stages of replication. Palindromic DNA is double-stranded DNA that reads the same from both 5' ends. An example is:

5'-GCATTAATGC-3'
3'-CGTAATTACG-5'

Longer stretches can adopt cross-shaped forms with B-DNA like arms.

Triplexes and quadruplexes

New interactions among DNA strands can be formed based on new types of base pairing: Hoogsteen and reverse Hoogsteen base pairs. Triplexes occur for long sequences with only purines or only pyrimidines (example GAGAGA...). Quadruplexes are essential for the protein-DNA structures found at the ends of chromosomes, the telomeres.

Packing of DNA in chromosomes

While the DNA occurs as a single long molecule in prokaryotes, it is more organized in the cell nuclei of eukaryotes.
The DNA double helix is first wound around histone proteins, forming nucleosomes; the x-ray structure of a complete nucleosome has recently been determined at atomic resolution. These nucleosomes are then further organized into chromatin and eventually form the chromosomes.

4. Copying of DNA

The replication of DNA is achieved by DNA polymerases. These require the presence of a template DNA strand, a short starting strand called primer, and the deoxynucleoside 5'-triphosphates dATP, dGTP, dTTP and dCTP. Polymerases elongate the primer with nucleotides that are complementary to the template strand. Chain elongation always occurs in the 5' to 3' direction. Many polymerases possess in addition a nuclease activity that allows them to remove mismatched nucleotides. They achieve error rates of less than 10^-8 per base pair!

Replication occurs in a "semiconservative" manner: each strand of a DNA duplex is copied, i.e. complemented by a newly synthesized strand. The results are two daughter molecules that each contain a parent strand and a new strand. Note that for both daughter molecules synthesis occurs in the 5' to 3' direction, which requires discontinuous synthesis in one case.

Many viruses have RNA as genetic material. Some of them rely on RNA polymerases to replicate their genome. Others, called retroviruses (e.g. HIV-1), use reverse transcription to make a DNA copy of their RNA genome. The figure shows a hybrid DNA-RNA duplex with an RNA strand (dark) and a strand consisting of an RNA primer (dark) and a DNA continuation (light), as it occurs in HIV reverse transcription. This hybrid involves a variation in the width of the minor groove.

Somewhat similar to replication, (other) RNA polymerases are used to copy selected genes from DNA to RNA: transcription. Again a template made of DNA as well as ribonucleoside triphosphates are needed. However, no primer for the new strand is required (in fact, primer synthesis for DNA replication is performed by RNA polymerases).
Transcription therefore relies on a complex system of promoter sites on the DNA and proteins (e.g. transcription factors) to start transcription at the desired site, i.e. to create an mRNA copy of the requested gene(s).

5. Genetic code

Genetic information is the sequence of nucleotides in DNA (or RNA). It codes for the sequences of amino acid residues in proteins. Because the "DNA alphabet" contains only 4 "letters" while the "protein alphabet" contains 20 "letters", fragments of 3 nucleotides are required to identify a specific amino acid (pairs would give only 4*4 = 16 combinations, triplets give 4*4*4 = 64). These triplets of nucleotides are called codons (and anticodons on tRNA). The table below provides the translation from nucleotide triplets to amino acid residues, i.e. the genetic code.

Genetic code

First                 Second position                Third
position (5')    U       C       A       G       position (3')
----------------------------------------------------------------
U                Phe     Ser     Tyr     Cys     U
                 Phe     Ser     Tyr     Cys     C
                 Leu     Ser     STOP    STOP    A
                 Leu     Ser     STOP    Trp     G
----------------------------------------------------------------
C                Leu     Pro     His     Arg     U
                 Leu     Pro     His     Arg     C
                 Leu     Pro     Gln     Arg     A
                 Leu     Pro     Gln     Arg     G
----------------------------------------------------------------
A                Ile     Thr     Asn     Ser     U
                 Ile     Thr     Asn     Ser     C
                 Ile     Thr     Lys     Arg     A
                 Met     Thr     Lys     Arg     G
----------------------------------------------------------------
G                Val     Ala     Asp     Gly     U
                 Val     Ala     Asp     Gly     C
                 Val     Ala     Glu     Gly     A
                 Val     Ala     Glu     Gly     G
----------------------------------------------------------------

The genetic code has several interesting properties. It is degenerate, meaning that most amino acids are coded by several codons; exceptions are Trp and Met. There is a correlation between the number of codons for an amino acid and its frequency of occurrence in proteins. Often, degenerate codons have the first two "letters" in common. Three codons serve as stop signals, ending a gene.
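The genetic code table can be written as a lookup from codon to residue. The sketch below is an illustration, not part of the original notes: the rows are copied from the table in the same layout (blocks of four rows per first base, one row per third base, one column per second base), and the helper names are hypothetical. The `translate` function makes a strong simplifying assumption, namely that translation starts at the first AUG and ends at the first stop codon in that reading frame; real start signals are more complex.

```python
from collections import Counter

BASES = "UCAG"  # order of rows and columns in the genetic code table

# Rows of the table: four rows per first base (U, C, A, G); within a block
# the rows are the third base (U, C, A, G); columns are the second base.
TABLE_ROWS = [
    "Phe Ser Tyr Cys", "Phe Ser Tyr Cys", "Leu Ser STOP STOP", "Leu Ser STOP Trp",
    "Leu Pro His Arg", "Leu Pro His Arg", "Leu Pro Gln Arg", "Leu Pro Gln Arg",
    "Ile Thr Asn Ser", "Ile Thr Asn Ser", "Ile Thr Lys Arg", "Met Thr Lys Arg",
    "Val Ala Asp Gly", "Val Ala Asp Gly", "Val Ala Glu Gly", "Val Ala Glu Gly",
]

CODON_TABLE = {}
for row_index, row in enumerate(TABLE_ROWS):
    first = BASES[row_index // 4]
    third = BASES[row_index % 4]
    for second, residue in zip(BASES, row.split()):
        CODON_TABLE[first + second + third] = residue


def translate(mrna):
    """Translate an mRNA (5' to 3') into a list of residue names.

    Simplification: start at the first AUG, stop at the first stop codon.
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start < 0:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide


def codons_per_residue():
    """Number of codons for each residue (degeneracy of the code)."""
    return Counter(CODON_TABLE.values())


if __name__ == "__main__":
    print(translate("GGAUGUUUGGCUAAGG"))
    print(codons_per_residue())
```

Counting the entries per residue reproduces the degeneracy described above: Trp and Met have a single codon each, and three codons are stop signals.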
The codon AUG codes for Met but is also part of the initiation signal (start of a gene). Chemically similar amino acids often share the middle base; an example is the presence of U or C for hydrophobic residues.

The genetic code is almost universal. Thus human genes are correctly read by bacterial systems. However, a few exceptions exist. Human mitochondria read slightly differently; e.g. UGA codes for Trp rather than being a stop signal. Ciliated protozoa have only one stop signal.

Messenger RNAs contain the information for one or a few genes, with start and stop signals (the start signal is more complex than just AUG). Eukaryotic genes and their mRNA copies are often discontinuous, i.e. large parts of these sequences are not translated. These intervening parts are called introns, while the coding parts are called exons. The mRNA is subjected to a further process called splicing to exclude the introns. The advantage of this complication is higher flexibility to form new proteins. The 7700 bp long ovalbumin gene, for example, has 1872 bp coding for 624 residues (1872 / 3 = 624).

[Figure: gene with alternating exons and introns: exon, intron, exon, intron, exon, intron, exon.]

6. Translation

tRNA

Protein synthesis, translation, occurs on a molecular complex called the ribosome. tRNAs bring amino acids to the ribosome. To this end they have two binding sites: one for the amino acid and one, called the anticodon, for the codon of the mRNA. There are about 60 different tRNAs, corresponding to the number of possible codons. They consist of about 80 nucleotides and have a molecular weight of about 30'000. Many bases in tRNAs are rare (examples: inosine, thymine) and often obtained by modification of normal bases. The function of these rare bases remains unknown.

tRNAs form stable and water-soluble structures with short parts of double helices formed by complementary segments.
By maximizing the number of base pairs (using the sequence) one arrives at a cross-like shape with four arms:
- acceptor-arm with CCA at the 3'-end: the amino acid binds here
- anticodon-arm
- D-arm including dihydrouridine
- T-arm including the sequence TΨC (Ψ: pseudouridine)

The 3D-structure resembles the letter "L", with the acceptor and anticodon arms at opposite ends, yielding a maximal distance between the two binding sites.

The ribosome

Ribosomes are molecular "machines" for the synthesis of proteins. A cell contains about 20'000 ribosomes, corresponding to 1/3 of the cell mass. The ribosome consists of two subunits. In E. coli the 50S subunit contains 32 proteins, the 5S and the 23S rRNA; the 30S subunit contains 21 proteins and the 16S rRNA.

[Figure: schematic view of peptide growth in ribosomes, showing the mRNA (5' end marked), loaded tRNAs and the growing protein.]

Recently a crystal structure at 2.4 Å resolution of the large subunit of the ribosome from the prokaryote Haloarcula marismortui has been presented. A major finding is that the ribosome acts as a ribozyme, i.e. the active site is formed exclusively of rRNA, with an adenine playing a role similar to that of the histidine in chymotrypsin.

7. Tools in genetics

Restriction enzymes

Restriction enzymes are endonucleases that cut specific, palindromic sequences of DNA duplexes (nucleases are enzymes that cut DNA; exonucleases cut terminal nucleotides, endonucleases cut within the DNA). The tool chest of molecular biologists contains more than 100 restriction enzymes. An example is EcoRI, which cuts as follows:

-N-N-N-G-A-A-T-T-C-N-N-N-          -N-N-N-G         A-A-T-T-C-N-N-N-
-N-N-N-C-T-T-A-A-G-N-N-N-   --->   -N-N-N-C-T-T-A-A         G-N-N-N-

The result of using restriction enzymes on a DNA fragment is called a restriction map. Consider the following example: A 10 kb DNA is cut by a restriction enzyme R1 into fragments of 2 and 8 kb, and by another enzyme R2 into fragments of 3 and 7 kb. If both enzymes are applied, fragments of 2, 3 and 5 kb are obtained.
Conclusion: R1 cuts near one end, after 2 kb, and R2 cuts 3 kb before the other end (2 + 5 + 3 = 10 kb).

The following figure shows the application of a restriction enzyme to insert a DNA fragment into a plasmid (plasmids are small circular DNA duplexes (1-200 kb) that can duplicate autonomously).

[Figure: a plasmid is cut by a restriction enzyme, leaving single-stranded AATT/TTAA overhangs; a DNA fragment for insertion with matching AATT/TTAA ends anneals to the opened plasmid.]

Separation of DNA fragments

Several techniques allow the separation of DNA fragments obtained, for example, from restriction enzyme analysis. An often-used characteristic is the electrophoretic mobility. Polyacrylamide gels can be used for fragments up to 1000 base pairs, and porous agarose gels can resolve larger fragments with as many as 20 kb. Resolution can be as good as one nucleotide difference in length for fragments of a few hundred nucleotides. After separation, DNA fragments can be transferred to nitrocellulose and hybridized with a 32P-labeled probe. An autoradiogram then shows whether a fragment is complementary to the probe, and which one.

DNA sequencing

DNA can be sequenced by controlled termination of enzymatic replication. The DNA to be sequenced is added to a polymerization mixture with DNA polymerase and labeled triphosphates as building blocks. In each of four such mixtures one of the nucleotides is also added as an analog (2',3'-dideoxy). Insertion of this analog terminates the replication process. The results are fragments ending at the various positions of A (or T or G or C, respectively), which can be separated according to their length, and thus provide the sequence. This method can, in an automated version, be used to sequence entire genomes (human genome project).
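The logic of reading a sequence from terminated fragments can be sketched in a few lines. The example below is an idealized illustration, not a model of a real experiment: it assumes that each of the four reactions yields exactly one fragment for every position at which its dideoxy analog can be inserted, that lengths are counted from the first newly added nucleotide (the primer is ignored), and that separation resolves single-nucleotide differences. The function names are hypothetical.

```python
def termination_fragments(new_strand):
    """Fragment lengths produced in each of the four dideoxy reactions.

    `new_strand` is the newly synthesized strand, written 5' to 3'.
    The reaction containing the analog of base X yields one fragment for
    every position where X occurs, ending exactly at that position.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand.upper(), start=1):
        lanes[base].append(length)
    return lanes


def read_sequence(lanes):
    """Reconstruct the sequence by sorting all fragments by length.

    The lane (reaction) in which a fragment of a given length appears
    identifies the base at that position.
    """
    base_at_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            base_at_length[length] = base
    return "".join(base_at_length[n] for n in sorted(base_at_length))


if __name__ == "__main__":
    lanes = termination_fragments("GATTACA")
    for base, lengths in lanes.items():
        print(base, lengths)
    print(read_sequence(lanes))
```

The sequence obtained this way is that of the newly synthesized strand; the template strand is its complement read in the opposite direction.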
PCR (polymerase chain reaction)

[Figure: one PCR cycle, starting from the double-stranded DNA to be amplified (strands drawn 5' to 3' and 3' to 5').]
Step 1: denature DNA
Step 2: anneal primers
Step 3: primer extension
Product of first cycle: two double-stranded DNA molecules.
Repeat cycles to yield a greater than 10^6-fold increase in DNA.

PCR can be used to amplify very small amounts of DNA, including the DNA of a single cell. Besides genetic testing, PCR is used in forensics and molecular paleontology.
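Because each cycle ideally doubles the number of double-stranded molecules, the number of cycles needed for a given amplification follows from a logarithm. The sketch below illustrates this arithmetic; the `efficiency` parameter (fraction of molecules copied per cycle, 1.0 meaning perfect doubling) is an assumption added for illustration and does not appear in the notes, and the function names are hypothetical.

```python
import math


def copies_after(cycles, initial_copies=1, efficiency=1.0):
    """Number of DNA molecules after a number of PCR cycles.

    With efficiency 1.0 every molecule is copied in every cycle,
    so the amount doubles per cycle.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


def cycles_needed(fold_increase, efficiency=1.0):
    """Smallest number of cycles that reaches the given fold increase."""
    return math.ceil(math.log(fold_increase) / math.log(1.0 + efficiency))


if __name__ == "__main__":
    print(cycles_needed(1e6))   # cycles for a 10^6-fold increase, ideal doubling
    print(copies_after(20))     # molecules after 20 ideal cycles from one template
```

Under ideal doubling, 20 cycles give 2^20 = 1'048'576 copies, i.e. slightly more than the 10^6-fold increase quoted above; with lower efficiency more cycles are required.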