SOUTHWESTERN COLLEGE, CHULA VISTA
SCHOOL OF MATHEMATICS, SCIENCE & ENGINEERING
Molecular & Cellular Biology; Instructor: Elmar Schmid, Ph.D.

Genes & Genome organization

Introduction

The genetic information for the heritable traits of all biological organisms on planet Earth is laid down in the form of a sequence of the nitrogenous bases adenine (A), guanine (G), cytosine (C) and thymine (T), the central part of the DNA double helix.

However, not all nucleotide letters of the DNA molecule actually code for a final gene product, i.e. a protein or enzyme, and these letters are therefore not translated.

Only certain nucleotide sequences along the chromosomal DNA, the sequences of so-called genes, are actually translated into a final, functional gene product
- among the thousands to millions (or more) of base pairs comprising the complete genome of a biological organism (for comparison see Table below), only some stretches are coding genes
- the DNA sequences between genes, the so-called intergenic sequences, fulfill other, largely unknown functions
- in recent years, scientists have uncovered other important biological functions, e.g.
gene regulation and imprinting, for many of these sequences, which are often referred to as “junk DNA” (see also: micro- or silencer RNA)

The genomes of all organisms also contain many other non-coding sequences and DNA regions, which we will look at in more detail in this chapter
- eukaryotic chromosomal DNA is organized in a much more complex way than prokaryotic chromosomal material
- eukaryotic chromosomes contain so-called scaffold proteins, which help to shape and organize the complex 3-dimensional chromosomal structure
  - some of these proteins play a role in the control of gene activity (see Chapter 10)
- each eukaryotic chromosome consists of one long, linear DNA double helix which codes for hundreds to thousands of genes
  - the chromosomal ends, the so-called telomeres, consist of repetitive DNA that ends in a single-stranded overhang
  - the telomeres themselves are protected from “erosion” by several telomeric proteins
- a gene is a segment on the DNA strand of the genome which codes for a distinct protein or enzyme
- each of the genes on a eukaryotic chromosome comprises important elements and sections (see Figure below)
Figure: Organization and functional regions of a eukaryotic gene
(labels: Enhancer; CpG island; Promoter proximal elements; Promoter; TATA box; Transcription Start Site; ATG start codon; Exon (= coding); Intron (= non-coding); Termination Sites (TGA, TAA); AAUAAA; AU-rich Site; Gene transcript. Size annotations: ~5000 bp, 100-200 bp, 100 bp, 25-35 bp, 20-50 bp, 6 bp)

- the average gene is about 1000 nucleotide base pairs long
  - almost all genes of a eukaryotic organism are found in the cell nucleus
  - some genes are located on so-called extra-chromosomal DNA, which is found in mitochondria

Definition: Gene
A gene is the entire nucleic acid sequence of a DNA molecule that is necessary for the synthesis of a functional polypeptide
- exceptions are the genes for rRNA or tRNA molecules, whose final products are RNAs rather than polypeptides

A gene includes the following DNA sequences:
1. Coding sequence
   - the DNA sequence that codes for the final polypeptide
   - begins in most organisms with an ATG start codon
2. Initiation sequences
   - the site on the gene that directs DNA transcription
   - can be located 1,000 bp away from the actual coding region
3. Enhancer sequences
   - transcription-control regions in eukaryotes
   - can be located more than 50,000 bp away from the actual coding region
4. 3′ cleavage sites
5. Polyadenylation [poly(A)] sites

Genes in prokaryotes, e.g. the E. coli bacterium, are organized in functional units/clusters called operons
- operons contain genes which encode enzymes involved in related functions
- operons are transcribed as a single transcription unit = ‘polycistronic RNA’
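The coding-sequence rule listed above (a coding region begins with an ATG start codon and is read codon by codon until a termination codon) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `first_orf` and the toy sequence are hypothetical, TAG is included alongside the TGA and TAA termination codons named in the figure, and introns, promoters and the reverse strand are ignored.

```python
# Minimal sketch: locate the first open reading frame (ORF) in a DNA string.
# Assumption: the sequence is intron-free and only the given strand is read.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf(dna):
    """Return the DNA from the first ATG through the first in-frame stop codon.

    Returns None if there is no ATG or no in-frame stop codon after it.
    """
    dna = dna.upper()
    start = dna.find("ATG")          # position of the first start codon
    if start == -1:
        return None
    # Step through the sequence three bases (one codon) at a time.
    for i in range(start, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            return dna[start:i + 3]  # include the stop codon itself
    return None


# Toy example (hypothetical sequence, not taken from a real gene):
print(first_orf("ccATGGCCAAATGAtt"))  # prints ATGGCCAAATGA
```

Reading in fixed steps of three from the ATG is what keeps the stop codon "in frame"; a TGA that straddles two codons is not treated as a termination site.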
Genes in eukaryotic organisms produce mRNAs that encode only one protein = ‘monocistronic RNA’
- translation begins at the AUG start codon closest to the mRNA 5’-cap region

Genes of eukaryotes have exon-intron structures
- exons contain coding sequences
- introns are non-coding sequences
- about 95 percent of eukaryotic gene sequences are introns
- bacterial and yeast genes generally lack introns

Eukaryotic chromosomes contain many more genes and are much more complex than prokaryotic chromosomes
- e.g. early estimates put the human genome at about 35,000 - 40,000 genes (more recent estimates are closer to 20,000 protein-coding genes), while the genome of a bacterium harbors about 3,000 genes
- eukaryotic chromosomes contain proteins which help to organize the complex 3-dimensional (X-shaped) structure
- some of these proteins play a role in the control of gene activity

The sequence of nucleotides (see Graphic below), the so-called letter code which makes up a gene, determines the later shape and function of the gene product
- the gene product can be either a protein, which helps to build up the cell structure, or an enzyme, which regulates an essential part of the cell’s biochemical pathways

The DNA sequence of a typical gene (= gene of the human enzyme superoxide dismutase)

SOURCE      human.
ORGANISM    Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 560)
AUTHORS     Sherman,L., Dafni,N., Lieman-Hurwitz,J. and Groner,Y.
TITLE       Nucleotide sequence and expression of human chromosome 21-encoded superoxide dismutase mRNA
JOURNAL     Proc. Natl. Acad. Sci. U.S.A.
80 (18), 5465-5469 (1983)
BASE COUNT  158 a  108 c  160 g  134 t
ORIGIN (human)
bp1
ATGGCGACGA AGGCCGTGTG CGTGCTGAAG GGCGACGGCC
CAGTGCAGGCATCATCAATTTCGAGCAGA AGGAAAGTAA TGGACCAGTG
AAGGTGTGGGAAGCATTAAAGGACTGACTGAAGGCCTGCATGGATTCCTGTTCAT
GAGTTTGGAGATAATACGGCAGCTGTACCAGTGCAGGTCCTCACTTTAATCCTCTA
TCCAGAAAACACGGTGGGCCAAAGGATGAAGAGAGGCATGTTGGAGACTTGGGCA
ATGTGACTGCTGACAAAGATGGTGTGGCCGATGTGTCTATTGAAGATTCTGTGATC
TCACTCTCAGGAGACCATTGCATCATTGGCCGCACACTGGTGGTCCATGAAAAAG
CAGATGACTTGGGCAAAGGTGGAAATGAAGAAAGTACAAAGACAGGAAACGCTGG
AAGTCGTTTGGCTTGTGGTGTAATTGGGATCGCCCAATAAACATTCCCTTGGATGT
AGTCTGAGG CCCCTTAACT CATCTGTTAT CCTGCTAGCT GTAGAAATGT
ATCCTGATAAACATTAAACA CTGTAATCTT
bp561
//
(from: NIH/NCBI Entrez Nucleotide database)

Nucleotide abbreviations: A = Adenine, T = Thymine, G = Guanine, C = Cytosine; ATG = Start codon

The invention and improvement of DNA sequencing technology over the past 20 years (see: DNA sequencers), as well as the introduction of computer-assisted comparison of the nucleotide sequences of different genomes (see: Bioinformatics), have led to a deeper understanding of the complex organization of the genetic information in genomes and to the identification of different types of genes and other genetic elements

Today molecular biologists classify the contents of genomes into different genetic elements, which are:

1. Protein-coding genes
- solitary protein-coding genes are genes which appear in only a single version within the genome, e.g.
the eukaryotic gene for the enzyme lysozyme (see Figure below)

The lysozyme gene: an example of a solitary gene
Example: Chicken lysozyme gene
• 15-kb DNA sequence
• single transcription unit
• protein component of chicken egg-white
• cleaves the polysaccharides in bacterial cell walls
• also found as an anti-bacterial enzyme in human tears and in white blood cells
(Figure legend: ATG start; gene; mRNA; exons; introns; Alu sequences)

2. Duplicated and diverged genes
- this class includes genes which appear in multiple, but variant, versions within the genomes of eukaryotic organisms
- some of the gene variants form large gene families, e.g. the globin gene family (see Figure below)
- a gene family is a set of duplicated genes that encode proteins with similar but non-identical amino acid sequences
- most gene families arose by duplication of an ancestral gene, most likely as the result of an “unequal crossover” during meiosis in an ancestral germ-cell (egg or sperm) precursor
- the encoded proteins usually belong to the same protein family but may have gained different cellular functions during the evolution of the organism
- today, newly sequenced proteins or genes are checked for sequence similarity with known proteins or genes and classified into protein or gene families with the help of mathematical algorithms and databases such as:
  1. Prosite
     - a database of protein families and domains
     - helps to connect new protein sequences with known protein families
     - http://www.expasy.ch/prosite/
  2. Pfam
  3. BLOCKS
     - detects and verifies protein sequence homology by comparing a protein or DNA sequence to a protein blocks database
     - http://www.blocks.fhcrc.org/blocks/
- examples of evolutionarily conserved and important protein families are:
  1. Protein kinases
  2. Transcription factors
  3. Immunoglobulins (vertebrates)
  4. Cyclins
  5. Heat shock proteins
  6. Cytoskeletal proteins (tubulin, actin, keratin)
  7. Globins (see β-globin gene family)
- some gene variants have lost their biological function during the course of evolution and turned into non-functional, so-called “pseudogenes”

3. Tandemly repeated genes (= Tandem Repeats)
- tandem repeats are coding DNA sequences which appear in more than one copy, but with the same gene sequence, within the genome (see Figure below)
- important examples of tandem repeats in the genomes of higher organisms are the genes for:
  - rRNA
  - 5S rRNA
  - tRNA
  - histones

The globin gene family: an example showing duplicated genes and pseudogenes
• Gene duplication of the β-globin gene resulted from unequal crossing over between 2 homologous chromosomes carrying an ancestral globin gene
• it most likely involved the two homologous L1 repeated sequences located 3’ and 5’ to the globin gene
(Figure: Human globin gene cluster on chromosome 11, drawn 5’ to 3’, showing functional globin genes and a non-functional pseudogene; inset: the functional β-globin gene with Exon 1, Exon 2 and Exon 3 on a scale of 0 - 1600 bp)

Tandem repeats of the 5S-rRNA gene
- the genes encoding rRNAs, tRNAs, histones and several other proteins are organized as tandemly repeated arrays, i.e. repeated copies of the same gene
  - e.g. frogs have more than 20,000 copies of the 5S rRNA gene!
- the nucleotide sequences of rRNA or tRNA tandem repeats are exactly, or almost exactly, identical
- only the non-transcribed, so-called intergenic spacer regions located between the transcribed regions show sequence variation
- tandem repeats meet the cell’s great demand for rRNA and tRNA transcripts

(Figure: Tandem repeats of the 5S-rRNA gene, 100 - 20,000 copies; each single-copy 5S-rRNA gene is separated from the next by an intergenic spacer region (= variant DNA))

4. Repetitious DNA
Vast parts of the eukaryotic genome consist of so-called non-coding repetitious DNA, which can be:

1. Simple-sequence DNA
- makes up about 10 - 15% of mammalian genomic DNA
- is composed largely of several different sets of 5- to 10-bp sequences repeated in long tandem arrays
- long tandem repeats of simple sequences 20 - 200 bp in length also exist; these are also referred to as satellite DNA
- in humans, some simple-sequence DNA exists in short 1- to 5-kb regions made up of 20 - 50 repeat units of 15 - 100 bp each, which are called minisatellites
- since the total length of a given minisatellite differs between human individuals, minisatellites are used for genetic fingerprinting, e.g. in forensic science
- in most mammals, much of the simple-sequence DNA is found near the chromosomal centromere region
  - role in the structure and functioning of the kinetochore?
- the function of most other simple-sequence DNAs is not known
- in chromosomes of Drosophila melanogaster, simple-sequence DNA is found in centromeres and telomeres
- since in humans simple-sequence DNA is found at different locations on the chromosomes, it is useful for chromosome identification by fluorescence in situ hybridization (FISH)
- the repeat units composing simple-sequence DNA tandem arrays are highly conserved among human individuals, while the number of repeats differs, so they can be used for genetic fingerprinting
  - see: Variable number tandem repeat (VNTR) method
  - the individual differences are due to different unequal crossing over events during meiosis

2. Moderately repeated DNA or mobile DNA elements
- first discovered by the American cytogeneticist and Nobel prize winner Barbara McClintock in common maize/corn
- moderately repeated DNA comprises transposons, viral retrotransposons and non-viral retrotransposons (for more info see below)
• the characteristics of mobile DNA elements are:
  1. they are interspersed throughout the genomes of bacteria, higher plants and animals
  2. they are hundreds to a few thousand bp long
  3. they copy and insert into new sites in the genome by a cellular process called transposition
  4. transposition requires either DNA or RNA intermediates
• mobile DNA with DNA intermediates (“transposons”)
  - requires excision, copying and insertion by enzymes, e.g. transposase
• mobile DNA with RNA intermediates (“retrotransposons”)
  - requires RNA polymerase & reverse transcriptase
  - movement and DNA insertion are analogous to the infectious process of retroviruses
• based on their mechanism of movement and genome integration, transposons and retrotransposons are further classified into:

1. Bacterial insertion sequences (IS elements)
- have typical inverted repeats (IRs) of about 50 bp at their ends (see Figure below)
- have a DNA sequence which codes for the enzyme transposase (or resolvase) necessary for transposition

2. Bacterial transposons
• bacterial transposons are mobile DNA elements widely observed in bacteria that are able to:
  1. cause mutations
  2. mediate genomic rearrangements
• they are also responsible for:
  1. duplications of existing gene sequences
  2. acquisition of new genes and their dissemination within a bacterial population
     - role in horizontal gene transfer?
     - role in “DNA scavenging” from biofilms?
• 5 major classes of bacterial transposons have been identified (see Figure below):
  1. Composite transposons
     - simple insertion sequences
     - 780 - 1,500 bp long
     - inverted repeats (IR) (15-25 bp) at the 3’ and 5’ ends
     - contain one or 2 transposase genes
  2. Complex transposons
     - 2,000 - 40,000 bp long
     - contain insertion sequences as IRs
     - carry genes other than transposase, e.g. for adhesins, toxins, antibiotic-resistance genes & other virulence factors
     - e.g. Tn5, Tn10 (E. coli)

(Figure: Example of a typical bacterial IS element. 1. General structure, with the transposase (or resolvase) gene; 2. Non-replicative transposition of IS10 in E. coli)

  3. TnA family transposons
     - plasmid-bound transposons
     - contain, e.g., ampicillin-resistance genes
     - e.g. Tn3, Tn1000

Transposons & Disease
A conjugative plasmid-bound transposon, Tn1546, has recently been identified in a vancomycin-resistant strain of Staphylococcus aureus (VRSA) in a hospital in the U.S.
This observation is alarming since the antibiotic vancomycin is commonly considered the “last resort” antibiotic to treat bacterial infections!

  4. Bacteriophage Mu & related temperate phage TPs
  5. Conjugative transposons
     - mostly found in gram-positive bacteria
     - e.g. Tn917
• Bacterial transposons are larger DNA segments than IS elements
• Bacterial transposons are widely used as highly selective biological mutagens in basic research “gene knock-out” studies (they affect only a single cellular gene)
• Bacterial transposons are easily identifiable by newly acquired antibiotic-resistance phenotypes of certain bacteria and through the appearance of different restriction fragments

Figure: Domains and genes of important E. coli transposons (Graphics © E. Schmid/2002)
- Tn10 organization (5’ to 3’): IS10L, tet operon, IS10R; inverted repeats (IR) at the ends
- Tn3 organization: β-lactamase gene between two 38-bp repeats (labeled IS3L and IS3R in the figure)
- Tn5 organization: kanamycin/neomycin/bleomycin/streptomycin resistance genes and a possible virulence gene (?) between two 19-bp repeats (labeled IS3L and IS3R in the figure)
  IRleft:  CTGACTCTTATACACAAGT
  IRright: ACTTGTGTATAAGAGTCAG

3. Eukaryotic transposons
• are mobile genetic elements which are observed in many eukaryotic genomes
  - e.g. the so-called P elements in Drosophila account for approx. 50% of all spontaneous mutations
• eukaryotic transposons were originally discovered by B. McClintock in the form of the mobile Ac and Ds elements in Zea mays (maize/corn), which lead to mutant kernel-color phenotypes
  - Ds elements are deleted forms of the Ac element that lack a portion of the sequence encoding the enzyme transposase
  - Ds elements cannot revert kernel mutations unless Ac is also present in the genome
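The two 19-bp Tn5 end sequences given in the transposon figure (IRleft and IRright) illustrate what "inverted repeat" means: read on opposite strands, the two ends are the same sequence. A minimal Python sketch of that check follows; the helper names `reverse_complement` and `is_inverted_repeat_pair` are hypothetical and chosen only for this illustration.

```python
# Minimal sketch: check that two transposon ends form an inverted repeat,
# i.e. that one end is the reverse complement of the other.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def is_inverted_repeat_pair(left, right):
    """True if `right` is the reverse complement of `left`."""
    return reverse_complement(left) == right.upper()


# End sequences as printed in the Tn5 panel of the figure above.
IR_LEFT = "CTGACTCTTATACACAAGT"
IR_RIGHT = "ACTTGTGTATAAGAGTCAG"

print(is_inverted_repeat_pair(IR_LEFT, IR_RIGHT))  # prints True
```

The same comparison could be applied to the ends of any candidate IS element, although real inverted repeats are often imperfect, so an exact-match test like this one is stricter than what is seen in nature.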
  - Ac elements have introns and transpose via direct DNA movement w/o an RNA intermediate
• the structure of these elements is similar to that of bacterial IS elements
• transposition occurs by a non-replicative mechanism
  - simple, non-replicative excision of the DNA and its insertion at a target site within the genome

4. Viral retrotransposons
• are abundant mobile DNA elements in yeast (e.g. Ty elements) and in Drosophila (e.g. copia elements)
• they have characteristic ≈250- to 600-bp long terminal repeats (LTRs) on both ends
  - LTRs are characteristic of integrated retroviral DNA (see: Retroviruses; see Figure below)
• the transposition is similar to the mechanism used by retroviruses to integrate their DNA into the host-cell genome
• Ty elements transpose at a very low rate
• Ty elements and copia encode reverse transcriptase and integrase, which are important for transposition and for integration of the dsDNA product into a new genome site

(Figure: Schematic organization of a viral retrotransposon. General structure: left LTR, which serves as promoter site, and right LTR, flanked by genomic host DNA)

5. Non-viral retrotransposons
• are the most abundant mobile DNA elements in mammalian genomes
  - they are present in thousands of copies throughout the genome
• non-viral retrotransposons lack LTRs
• most belong to two classes of moderately repeated DNA sequences:

1. Long interspersed elements (LINES)
- are ≈6 - 7 kb long (in H. sapiens) (see Figure below)
- are very abundant in mammalian genomes
- 10 classes of LINES have been identified in mammalian genomes
- the most common is the L1 LINE family
  - the human genome has approx.
600,000 copies of L1 elements
- L1 LINE sequence insertion mutations have been found in many human genetic diseases
- transposition of non-viral retrotransposons occurs through an RNA intermediate and requires the enzyme reverse transcriptase
- the majority of L1 sequences contain stop codons and frame-shift mutations in ORF1 and ORF2

(Figure: Schematic organization of a LINE sequence as an example of a non-viral retrotransposon. General structure of an L1 element within genomic DNA; labels: RNA-binding protein; reverse transcriptase-homolog protein; role in transposition?)

2. Short interspersed elements (SINES)
- SINES are short, ≈300-bp long mobile DNA elements (see Figure below)
  - they contain A/T-rich regions
- SINES are flanked by direct repeats and do not encode proteins
  - they are transcribed by RNA polymerase III and are found primarily in the genomes of mammals
- so far, several hundred different SINES have been identified, all of them having regions of high nucleotide sequence homology
  - the nucleotide sequence of SINES is 80% identical between different species (= 80% inter-species identity)
- many of the SINE sequences in human DNA contain a unique recognition site for the restriction enzyme AluI
  - they are collectively called the Alu family or Alu sequences
- an astonishing ≈1 million Alu sequences are located in the human genome
  - Alus make up 10% of the total human DNA
- an Alu sequence insertion has been discovered as the inactivating mutation in one NF1 allele of a patient suffering from the heritable disorder neurofibromatosis
- Alu sequences show a high nucleotide sequence homology to the small cellular 7SL RNA
  - 7SL RNA is part of the signal-recognition particle, a ribonucleoprotein complex that plays an important role in polypeptide trafficking through the phospholipid membrane of the endoplasmic reticulum
  - 7SL RNA genes are evolutionarily conserved and
probably existed long before the Alu sequences arose
- the biological function of SINES is not known
  - one hypothesis states that they may have an impact on the speed of evolutionary change (= mutation rates) by causing homologous recombinations and other DNA rearrangements
  - creation of novel combinations of preexisting exons?
  - control of gene expression?

Example of a SINE/Alu sequence located on chromosome 7 of Homo sapiens (= 7q22)

ggctgggtacagtggctcaggcctgtaatcccagcacctttcgaggctgaggcaggtgga
ttgcttgaggtcaggagtttgagaccagcctgggcagcttggcaaaacctcatctctgca
aaaaatacaaaaatca

(the AluI cut site is marked in the figure)
BASE COUNT: 37 a, 32 c, 39 g, 28 t

5. Unclassified spacer DNA
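The base count and the AluI site of the Alu example above can be recomputed with a short script. The following is a minimal sketch, assuming the sequence is copied exactly as printed in these notes; the names `base_count` and `find_sites` are hypothetical helpers. AluI recognizes the sequence AGCT and cuts between the G and the C. If the printed sequence was transcribed imperfectly, the computed counts may differ slightly from the BASE COUNT line.

```python
from collections import Counter

# Alu fragment as printed in the example above (three lines joined).
ALU_FRAGMENT = (
    "ggctgggtacagtggctcaggcctgtaatcccagcacctttcgaggctgaggcaggtgga"
    "ttgcttgaggtcaggagtttgagaccagcctgggcagcttggcaaaacctcatctctgca"
    "aaaaatacaaaaatca"
)
ALUI_SITE = "agct"  # AluI recognition sequence (cut between g and c)


def base_count(seq):
    """Count how often each base letter occurs in the sequence."""
    return Counter(seq.lower())


def find_sites(seq, site):
    """Return the 0-based start positions of every occurrence of `site`."""
    seq, site = seq.lower(), site.lower()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # keep searching past this hit
    return positions


counts = base_count(ALU_FRAGMENT)
sites = find_sites(ALU_FRAGMENT, ALUI_SITE)

print("length:", len(ALU_FRAGMENT))
print("base count:", {base: counts[base] for base in "acgt"})
print("AluI site start positions (0-based):", sites)
```

The same two helpers can be pointed at the superoxide dismutase sequence shown earlier to compare against its BASE COUNT line, after removing the spaces and line breaks from the printed sequence.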