Genes & Genome organization

SOUTHWESTERN COLLEGE, CHULA VISTA
SCHOOL OF MATHEMATICS, SCIENCE & ENGINEERING
Molecular & Cellular Biology; Instructor: Elmar Schmid, Ph.D.
Introduction
• The genetic information for the heritable traits of all biological organisms on planet Earth
is laid down as a sequence of the nitrogenous bases adenine (A), guanine
(G), cytosine (C) and thymine (T), which form the central part of the DNA double helix
• However, not all nucleotide letters of the DNA molecule code for a final
gene product, e.g. a protein such as an enzyme; those that do not are never translated
• Only certain nucleotide sequences along the chromosomal DNA, the sequences of
so-called genes, are actually expressed as a final, functional gene product
- of the thousands to millions (or more) of base pairs comprising the complete
genome of a biological organism (for comparison see Table below), only some
stretches are coding genes
- the DNA sequences between genes, the so-called intergenic sequences, fulfill
other, largely unknown functions
- in recent years, scientists have uncovered important biological functions, e.g. in
gene regulation and imprinting, for many of these sequences, which are often referred to as “junk DNA”
(see also: micro- or silencer RNA)
• The genomes of all organisms also contain many other non-coding sequences
and DNA regions, which we will look at in more detail in this chapter

• eukaryotic chromosomal DNA is organized in a much more complex way than prokaryotic
chromosomal material
➢ eukaryotic chromosomes contain so-called scaffold proteins which help to
shape and organize the complex 3-dimensional chromosomal structure
➢ some of these proteins play a role in the control of gene activity (see Chapter
10)

• each eukaryotic chromosome consists of one long, linear DNA double helix which
carries hundreds to thousands of genes
- the chromosomal ends, the so-called telomeres, are made up of short, repeated DNA
sequences that end in a single-stranded overhang
- the telomeres themselves are protected from “erosion” by several telomeric proteins

• a gene is a segment on the DNA strand of the genome which codes for a distinct
gene product, usually a protein (e.g. an enzyme)

• the long DNA double helix of each eukaryotic chromosome carries many
genes, each comprising important elements and sections (see Figure below)
Figure: Organization and functional regions of a eukaryotic gene
(schematic of the DNA and the gene transcript, ≈5000 bp; labelled elements: enhancer,
CpG island, promoter-proximal elements, promoter with TATA box, transcription start site,
ATG start codon, exons (= coding), introns (= non-coding), TGA/TAA termination sites,
AAUAAA site and AU-rich site; element sizes given in the figure: 100-200 bp, 100 bp,
25-35 bp, 20-50 bp and 6 bp)
• the coding sequence of an average gene is about 1000 nucleotide base pairs long
- almost all genes of a eukaryotic organism are found in the cell
nucleus
- some genes are located on so-called extra-chromosomal DNA, which is
found in mitochondria (and, in plants, in chloroplasts)
Definition: Gene
A gene is the entire nucleic acid sequence of a DNA molecule that is necessary for the
synthesis of a functional polypeptide
➢ exceptions are genes for rRNA or tRNA molecules, whose final product is an RNA rather than a polypeptide

• a gene includes the following DNA sequences:
1. Coding sequence
➢ DNA sequence that codes for the final polypeptide
➢ begins in most organisms with an ATG start codon
2. Initiation sequences
➢ sites on the gene that direct the start of transcription
➢ can be located 1,000 bp away from the actual coding region
3. Enhancer sequences
➢ transcription-control regions in eukaryotes
➢ can be located more than 50,000 bp away from the actual coding region
4. 3′ cleavage sites
5. Polyadenylation [poly(A)] sites

• genes in prokaryotes, e.g. the bacterium E. coli, are organized in functional
units/clusters called operons
➢ operons contain genes which encode enzymes involved in related functions
➢ each operon is transcribed as a single transcription unit (= ‘polycistronic RNA’)

• genes in eukaryotic organisms produce mRNAs that encode only one protein (=
‘monocistronic RNA’)
➢ translation usually begins at the AUG start codon closest to the mRNA 5’-cap region
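The rule that translation starts at the AUG nearest the 5’ end can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `first_orf` and the toy mRNA are made up for illustration, and the three stop codons (UAA, UAG, UGA) are those of the standard genetic code.

```python
# Sketch: read an mRNA from the 5'-most AUG until the first in-frame stop codon.
# The sequence used below is a hypothetical toy mRNA, not a real transcript.
STOP_CODONS = {"UAA", "UAG", "UGA"}


def first_orf(mrna: str) -> list[str]:
    """Return the codons from the first AUG up to (not including) the stop codon."""
    start = mrna.find("AUG")          # position of the AUG closest to the 5' end
    if start == -1:
        return []                     # no start codon, nothing is translated
    codons = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break                     # translation terminates here
        codons.append(codon)
    return codons


toy_mrna = "GGCACCAUGGCGACGAAGUAAGGAUGCCC"
print(first_orf(toy_mrna))  # ['AUG', 'GCG', 'ACG', 'AAG']
```

Note that the second AUG further downstream in the toy sequence is ignored, because reading has already started (and stopped) at the first one.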

• genes of eukaryotes have exon-intron structures
➢ exons contain coding sequences
➢ introns are non-coding sequences
➢ in higher eukaryotes such as humans, introns can account for about 95 percent of a gene’s sequence
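How exons and introns relate to the mature mRNA can be sketched in Python. The example below is a toy illustration under stated assumptions: the pre-mRNA is invented, introns are written in lower case only to make them visible, and exon positions are given as 0-based, end-exclusive coordinates chosen for this example.

```python
# Sketch: join the exons of a (hypothetical) pre-mRNA and compute the intron share.
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Concatenate the exon segments; everything between them (introns) is dropped."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def intron_fraction(gene_length: int, exons: list[tuple[int, int]]) -> float:
    """Fraction of the gene that is NOT covered by exons."""
    exon_length = sum(end - start for start, end in exons)
    return 1 - exon_length / gene_length


# exon1 + intron1 + exon2 + intron2 + exon3 (toy sequence, 31 nt in total)
PRE_MRNA = "AUGGCG" "guaaguccag" "ACGAAG" "guuuag" "UAA"
EXONS = [(0, 6), (16, 22), (28, 31)]

print(splice(PRE_MRNA, EXONS))                 # AUGGCGACGAAGUAA
print(intron_fraction(len(PRE_MRNA), EXONS))   # 16 of 31 nt are intron
```

In a real gene the same arithmetic applies, only with exon coordinates taken from an annotated sequence record.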

• bacterial genes and most yeast genes lack introns

• eukaryotic chromosomes contain many more genes and are much more complex
than prokaryotic chromosomes
- e.g. a human cell has about 20,000 protein-coding genes (earlier estimates were 35,000 – 40,000), while the genome of a
bacterium harbors about 3000 genes
- eukaryotic chromosomes contain proteins which help to organize the complex 3-dimensional structure (X-shaped in metaphase)
- some of these proteins play a role in the control of gene activity

• the sequence of nucleotides (see Graphic below), the so-called letter code
that makes up a gene, determines the later shape and function of the gene
product
➢ the gene product can be, for example, a structural protein, which helps to build up
the cell structure, or an enzyme, which catalyzes essential steps of
the cell’s biochemical pathways
The DNA sequence of a typical gene
(= gene of the human enzyme superoxide dismutase)
SOURCE human.
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 560)
AUTHORS Sherman,L., Dafni,N., Lieman-Hurwitz,J. and Groner,Y.
TITLE Nucleotide sequence and expression of human chromosome 21-encoded
superoxide dismutase mRNA
JOURNAL Proc. Natl. Acad. Sci. U.S.A. 80 (18), 5465-5469 (1983)
BASE COUNT 158 a 108 c 160 g 134 t
ORIGIN (human)
bp1 ATGGCGACGA AGGCCGTGTG CGTGCTGAAG GGCGACGGCC
CAGTGCAGGCATCATCAATTTCGAGCAGA AGGAAAGTAA TGGACCAGTG
AAGGTGTGGGAAGCATTAAAGGACTGACTGAAGGCCTGCATGGATTCCTGTTCAT
GAGTTTGGAGATAATACGGCAGCTGTACCAGTGCAGGTCCTCACTTTAATCCTCTA
TCCAGAAAACACGGTGGGCCAAAGGATGAAGAGAGGCATGTTGGAGACTTGGGCA
ATGTGACTGCTGACAAAGATGGTGTGGCCGATGTGTCTATTGAAGATTCTGTGATC
TCACTCTCAGGAGACCATTGCATCATTGGCCGCACACTGGTGGTCCATGAAAAAG
CAGATGACTTGGGCAAAGGTGGAAATGAAGAAAGTACAAAGACAGGAAACGCTGG
AAGTCGTTTGGCTTGTGGTGTAATTGGGATCGCCCAATAAACATTCCCTTGGATGT
AGTCTGAGG CCCCTTAACT CATCTGTTAT CCTGCTAGCT GTAGAAATGT
ATCCTGATAAACATTAAACA CTGTAATCTT bp561
//
(from: NIH/NCBI Entrez Nucleotide data base)
Nucleotide abbreviation:
A = Adenine
T = Thymine
G = Guanine
C = Cytosine
ATG = Start codon
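The BASE COUNT line of such a sequence record is simply a tally of the four letters. A minimal Python sketch (the function names `base_count` and `gc_content` are chosen here for illustration, and the short test sequence is only the first ten bases of the coding sequence, ATGGCGACGA):

```python
from collections import Counter


def base_count(seq: str) -> dict[str, int]:
    """Count A, C, G and T in a DNA sequence (case-insensitive, other characters ignored)."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(seq: str) -> float:
    """Fraction of G and C among all counted bases."""
    counts = base_count(seq)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


print(base_count("ATGGCGACGA"))   # {'A': 3, 'C': 2, 'G': 4, 'T': 1}
print(gc_content("ATGGCGACGA"))   # 0.6

# Consistency check on the record above: the four counts add up to the
# 560 bases stated in the REFERENCE line.
record_counts = {"a": 158, "c": 108, "g": 160, "t": 134}
print(sum(record_counts.values()))  # 560
```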
• The invention and improvement of DNA sequencing technology in the
past 20 years (see: DNA sequencers), as well as the introduction of computer-assisted comparison of nucleotide sequences of different genomes (see:
Bioinformatics), led to a deeper understanding of the complex organization of the
genetic information in genomes and to the identification of different types of
genes and other genetic elements
• Today molecular biologists divide genomes into different classes of genetic elements,
which are:
1. Protein-coding genes
- solitary protein-coding genes
➢ are genes which appear in only one single version within the genome
➢ e.g. the eukaryotic gene for the enzyme lysozyme (see Figure below)
The lysozyme gene: an example of a solitary gene
Example: Chicken lysozyme gene
• 15-kb DNA sequence
• single transcription unit
• protein component of chicken egg-white
• cleaves the polysaccharides in bacterial cell walls
• also found as anti-bacterial enzyme in human tears
and in white blood cells
(Figure: map of the lysozyme gene and its mRNA, marking the ATG start codon, exons, introns and Alu sequences)
2. Duplicated and diverged genes
- this class of genes includes genes which appear in multiple, but variant versions,
within the genomes of eukaryotic organisms
- some of the gene variants form large gene families, e.g. the globin gene
family (see Figure below)
- a gene family is a set of duplicated genes that encode proteins with similar but
non-identical amino acid sequences
- most gene families arose by duplication of an ancestral gene, most likely as the
result of an “unequal crossover” during meiosis in an ancestral germ-cell (egg or
sperm) precursor
- the coded proteins usually belong to the same protein family but may have
gained different cellular functions during the evolution of the biological organism
- today, newly sequenced proteins or genes are checked for sequence similarity
with known proteins or genes and classified into protein or gene families with the
help of mathematical algorithms and databases such as:
1. Prosite
➢ a database of protein families and domains
➢ helps to connect new protein sequences with known protein
families
➢ http://www.expasy.ch/prosite/
2. Pfam
3. BLOCKS
➢ detects and verifies protein sequence homology by comparing a
protein or DNA sequence to a protein blocks database
➢ http://www.blocks.fhcrc.org/blocks/
- examples of evolutionarily conserved and important protein families are:
1. Protein kinases
2. Transcription factors
3. Immunoglobulins (vertebrates)
4. Cyclins
5. Heat shock proteins
6. Cytoskeletal proteins (tubulin, actin, keratin)
7. Globins → see β-globin gene family
- some gene variants have lost their biological function during the course of
evolution and turned into non-functional, so-called “pseudo-genes”
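The idea of checking a new sequence for similarity with known family members can be illustrated with a very simple measure, percent identity between two already aligned sequences of equal length. This is only a sketch: the two peptide strings are invented, and the databases named above rely on more elaborate patterns, profiles and alignments rather than on this plain position-by-position comparison.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two aligned, equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Two hypothetical 10-residue peptide fragments that differ at 2 positions
known_member = "MKTAYIAKQR"
new_sequence = "MKTAHIAKQK"
print(percent_identity(known_member, new_sequence))  # 80.0
```

A high value suggests, but does not prove, that the two sequences descend from a common ancestral gene.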
3. Tandemly repeated genes (= Tandem Repeats)
- tandem repeats are coding DNA sequences which appear in multiple copies
with the same gene sequence, arranged one after another within the genome
(see Figure below)
- important examples of tandem repeats in the genomes of higher organisms
are the genes for:
➢ rRNA
➢ 5S rRNA
➢ tRNA
➢ histones
The globin gene family: an example showing duplicated genes and
pseudogenes
• Gene duplication of the β-globin gene resulted from unequal
crossing over between 2 homologous chromosomes carrying an
ancestral globin gene
• it most likely involved the two homologous L1 repeated sequences
located 3’ and 5’ to the globin gene
Figure: Human globin gene cluster on chromosome 11, drawn 5’ to 3’; an enlarged map
(scale 0 – 1600 bp) compares a non-functional pseudogene with the functional β-globin gene
and its three exons (Exon1, Exon2, Exon3)
Tandem repeats of the 5S-rRNA gene
➢ the genes encoding rRNAs, tRNAs, histones and several other
gene products are organized as tandemly repeated arrays, i.e.
repeated copies of the same gene
- e.g. frogs have more than 20,000 copies of the 5S rRNA gene!!
➢ the nucleotide sequences of rRNA or tRNA tandem repeats are
exactly, or almost exactly, identical
➢ only the non-transcribed, so-called intergenic spacer regions
located between the transcribed regions show sequence variation
➢ tandem repeats meet the cell’s great demand for
rRNA and tRNA transcripts
Figure: Tandem repeats of the 5S-rRNA gene (100 – 20,000 copies of the single-copy gene,
separated by intergenic spacer regions = variant DNA)
4. Repetitious DNA

• vast parts of the eukaryotic genome consist of so-called non-coding repetitious
DNA, which can be:
1. Simple-sequence DNA
- makes up about 10 – 15% of the mammalian genomic DNA
- is composed largely of several different sets of 5- to 10-bp sequences
repeated in long tandem arrays
- long tandem repeats of simple sequences of 20 – 200 bp length also
exist; these are also referred to as satellite DNA
- in humans some simple-sequence DNA exists in short 1- to 5-kb regions
made up of 20 – 50 repeat units of 15 – 100 bp each, which are called
minisatellites
- since the total length of a given minisatellite differs between human
individuals, minisatellites are used for genetic fingerprinting, e.g. in forensic science
- in most mammals, much of the simple-sequence DNA is found near the
chromosomal centromere region
➢ role in the structure and functioning of the kinetochore?
➢ the function of most other simple-sequence DNAs is not known
- in chromosomes of Drosophila melanogaster, simple-sequence DNA is
found in centromeres and telomeres
- since in humans simple-sequence DNA is found at different locations
on different chromosomes, it is useful for chromosome identification by
fluorescence in situ hybridization (FISH)
- the repeat units composing simple-sequence DNA tandem arrays are
highly conserved among human individuals, but the number of repeats varies, so the arrays can be used for genetic
fingerprinting → see: Variable number tandem repeat (VNTR) method
➢ the individual differences are due to different unequal crossing over events during
meiosis
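The principle behind VNTR fingerprinting, same repeat unit but different copy number, can be sketched in Python. The repeat unit GATA and both alleles are hypothetical and much shorter than real minisatellite units (15 – 100 bp); the function name `max_tandem_repeats` is likewise just an illustrative choice.

```python
import re


def max_tandem_repeats(seq: str, unit: str) -> int:
    """Largest number of back-to-back copies of `unit` found anywhere in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)   # every uninterrupted run of the unit
    return max((len(run) // len(unit) for run in runs), default=0)


# Two hypothetical alleles with identical flanks but different repeat numbers
allele_1 = "CCGT" + "GATA" * 4 + "TTC"
allele_2 = "CCGT" + "GATA" * 7 + "TTC"

print(max_tandem_repeats(allele_1, "GATA"))  # 4
print(max_tandem_repeats(allele_2, "GATA"))  # 7
print(len(allele_2) - len(allele_1))         # 12 bp length difference between the alleles
```

It is this length difference between alleles that shows up as bands of different size in a genetic fingerprint.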
2. Moderately repeated DNA or mobile DNA elements
- first discovered by the American geneticist and Nobel prize winner
Barbara McClintock in maize (corn)
- moderately repeated DNA comprises transposons, viral retrotransposons and non-viral retrotransposons (for more info see below)
• the characteristics of mobile DNA elements are:
1. they are interspersed throughout the genomes of bacteria,
higher plants and animals
2. they are hundreds to a few thousand bp long
3. they copy and insert into new sites in the genome by a cellular process called
transposition
4. transposition requires either DNA or RNA intermediates
• mobile DNA with DNA intermediates (“transposons”)
- requires excision, copying and insertion by enzymes, e.g. transposase
• mobile DNA with RNA intermediates (“retrotransposons”)
- requires RNA polymerase & reverse transcriptase
- movement and DNA insertion are analogous to the
infectious process of retroviruses
• based on their mechanism of movement and genome integration, transposons
and retrotransposons are further classified into:
1. Bacterial insertion sequences (IS elements)
- typically have ≈50 bp inverted repeats (IRs) at both ends (see Figure below)
- have a DNA sequence which codes for the enzyme transposase (or
resolvase) necessary for transposition
2. Bacterial transposons
• bacterial transposons are mobile DNA elements widely observed in bacteria that
are capable of:
1. causing mutations
2. mediating genomic rearrangements
• they are also responsible for:
1. duplications of existing gene sequences
2. the acquisition of new genes and their dissemination
within bacterial populations
➢ role in horizontal gene transfer?
➢ role in “DNA scavenging” from bio-films?
• 5 major classes of bacterial transposons have been identified (see Figure
below):
1. Composite transposons
- simple insertion sequences
- 780 – 1,500 bp long
- inverted repeats (IR) (15 – 25 bp) at the 3’ and 5’ ends
- contain one or 2 transposase genes
2. Complex transposons
- 2,000 – 40,000 bp long
- contain insertion sequences as IRs
- also carry genes other than transposase, e.g.
for adhesins, toxins, antibiotic resistance & other virulence
factors
- e.g. Tn5, Tn10 (E. coli)
Figure: Example of a typical bacterial IS element
1. General structure: gene for transposase (or resolvase)
2. Non-replicative transposition of IS10 in E. coli
3. TnA family transposons
- plasmid-bound transposons
- contain, e.g., ampicillin-resistance genes
- e.g. Tn3, Tn1000
Transposons & Disease
A conjugative, plasmid-bound transposon, Tn1546, has recently been identified in
a vancomycin-resistant strain of Staphylococcus aureus (VRSA) in a hospital in the U.S.
This observation is alarming since vancomycin is commonly considered
the “last resort” antibiotic for treating bacterial infections!
4. Bacteriophage Mu & related temperate phage TPs
5. Conjugative transposons
- mostly found in gram-positive bacteria
- e.g. Tn917
• Bacterial transposons are larger DNA segments than IS elements
• Bacterial transposons are widely used as highly selective biological mutagens
in basic research “gene knock-out” studies (they affect only a single cellular gene)
• Bacterial transposons are easily identifiable by newly acquired antibiotic
resistance phenotypes of certain bacteria and through the appearance of
different restriction fragments
Figure: Domains and genes of important E. coli transposons (Graphics © E. Schmid/2002)
- Tn10 organization (5’ to 3’): IS10L, tet operon, IS10R
- Tn3 organization: β-lactamase gene flanked by 38-bp inverted repeats (IR; labelled IS3L and IS3R in the figure)
- Tn5 organization: kanamycin, neomycin, bleomycin and streptomycin resistance genes (plus a possible
virulence gene?) flanked by 19-bp inverted repeats (labelled IS3L and IS3R in the figure)
IRleft: CTGACTCTTATACACAAGT
IRright: ACTTGTGTATAAGAGTCAG
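What “inverted repeat” means can be checked directly on the two 19-bp Tn5 end sequences given in the figure: the right repeat should be the reverse complement of the left one. A minimal Python sketch (the helper name `reverse_complement` is chosen here for illustration):

```python
# Sketch: verify that IRright of Tn5 is the reverse complement of IRleft.
COMPLEMENT = str.maketrans("ACGT", "TGCA")   # A<->T, C<->G


def reverse_complement(seq: str) -> str:
    """Complement every base, then read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


IR_LEFT = "CTGACTCTTATACACAAGT"
IR_RIGHT = "ACTTGTGTATAAGAGTCAG"

print(reverse_complement(IR_LEFT))               # ACTTGTGTATAAGAGTCAG
print(reverse_complement(IR_LEFT) == IR_RIGHT)   # True
```

Both ends of the element therefore present the same sequence when each is read 5’ to 3’ on its own strand.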
3. Eukaryotic transposons
• Are mobile genetic elements which are observed in many eukaryotic genomes
➢ e.g. the so-called P elements in Drosophila account for approx. 50% of
all spontaneous mutations
• Eukaryotic transposons were originally discovered by B. McClintock in the form of
the mobile Ac and Ds elements in Zea mays (maize/corn), which lead to mutant
phenotypes of the kernel color
➢ Ds elements are deleted forms of the Ac element that lack
part of the sequence encoding the enzyme transposase
➢ Ds elements cannot revert kernel mutations unless Ac is
also present in the genome
➢ Ac elements have introns → they transpose via direct DNA movement without an RNA intermediate
• the structure of these elements is similar to that of bacterial IS elements
• transposition occurs by a non-replicative mechanism
➢ simple, non-replicative excision of DNA and its insertion at a target site within
the genome
4. Viral retrotransposons
• Are abundant mobile DNA elements in yeast (e.g. Ty elements) and in
Drosophila (e.g. copia elements)
• They have characteristic ≈250- to 600-bp long terminal repeats (LTRs) on both
ends
➢ LTRs are characteristic of integrated retroviral DNA (see: Retroviruses)
➢ see Figure below
• Their transposition is similar to the mechanism used by retroviruses to integrate their
DNA into the host-cell genome
• Ty elements transpose at a very low rate
• Ty elements and copia encode reverse transcriptase and integrase
➢ important for transposition and integration of the dsDNA product
into a new genome site
Figure: Schematic organization of a viral retrotransposon (general structure: left LTR, which
serves as promoter site, and right LTR, flanked by genomic host DNA)
5. Non-viral retrotransposons
• Are the most abundant mobile DNA elements in mammalian genomes
➢ they are present in thousands of copies throughout the genome
• non-viral retrotransposons lack LTRs
• most belong to two classes of moderately repeated DNA sequences:
1. Long interspersed elements (LINEs)
- are ≈6 – 7 kb long (in H. sapiens) (see Figure below)
- are very abundant in mammalian genomes
- 10 classes of LINEs have been identified in mammalian genomes
- the most common is the L1 LINE family
➢ the human genome has approx. 600,000 copies of L1 elements
- L1 LINE insertion mutations have been found in many human genetic
diseases
- transposition of non-viral retrotransposons occurs through an RNA
intermediate and requires the enzyme reverse transcriptase
- the majority of L1 sequences contain stop codons and frame-shift mutations in
ORF1 and ORF2
Figure: Schematic organization of a LINE sequence as an example of a non-viral
retrotransposon (general structure of an L1 element within genomic DNA: reading frames encoding
an RNA-binding protein and a reverse transcriptase homolog protein; role in transposition?)
2. Short interspersed elements (SINEs)
- SINEs are short, ≈300 bp long mobile DNA elements (see Figure below)
➢ they contain A/T-rich regions
- SINEs are flanked by direct repeats and do not encode proteins
➢ they are transcribed by RNA polymerase III and are found primarily in
the genomes of mammals
- so far, several hundred different SINEs have been identified, all of them having
regions of high nucleotide sequence homology
➢ the nucleotide sequence of SINEs is 80% identical between
different species (= 80% inter-species identity)
- many of the SINE sequences in human DNA contain a unique recognition site
for the restriction enzyme AluI
➢ collectively called Alu family or Alu sequences
- an astonishing ≈1 million Alu sequences are located in the human genome
➢ Alu sequences make up 10% of the total human DNA
- an Alu sequence insertion has been identified as the inactivating
mutation in one NF1 allele of a patient suffering from the heritable disorder
neurofibromatosis
- Alu sequences show a high nucleotide sequence homology to the small cellular
7SL RNA
➢ 7SL RNA is part of the signal-recognition ribonucleoprotein particle
complex, which plays an important role in polypeptide trafficking
through the phospholipid membrane of the endoplasmic reticulum
➢ 7SL RNA genes are evolutionarily conserved and probably existed long
before the Alu sequences arose
- the biological function of SINEs is not known
➢ one hypothesis states that they may have an impact on the speed of
evolutionary change (= mutation rates) by causing
homologous recombinations and other DNA rearrangements
➢ creation of novel combinations of preexisting exons?
➢ control of gene expression?
Example of a SINE/Alu sequence located on chromosome #7
of Homo sapiens (= 7q22)
ggctgggtacagtggctcaggcctgtaatcccagcacctttcgaggctgaggcaggtgga
ttgcttgaggtcaggagtttgagaccagcctgggcagcttggcaaaacctcatctctgca
aaaaatacaaaaatca
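The AluI recognition site in the sequence above can be located with a simple string search. The sketch below assumes the standard AluI recognition sequence AGCT, which the enzyme cuts between G and C; the function name `find_sites` and the 0-based position convention are choices made for this example.

```python
# Sketch: locate AluI recognition sites (AGCT) in the Alu example sequence.
ALU_SEQ = (
    "ggctgggtacagtggctcaggcctgtaatcccagcacctttcgaggctgaggcaggtgga"
    "ttgcttgaggtcaggagtttgagaccagcctgggcagcttggcaaaacctcatctctgca"
    "aaaaatacaaaaatca"
)


def find_sites(seq: str, site: str = "agct") -> list[int]:
    """Return the 0-based start positions of every occurrence of `site` in `seq`."""
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i)
        i = seq.find(site, i + 1)
    return positions


sites = find_sites(ALU_SEQ)
print(len(ALU_SEQ), sites)

# AluI cuts between g and c, i.e. two bases into the site
for start in sites:
    cut = start + 2
    print(len(ALU_SEQ[:cut]), len(ALU_SEQ[cut:]))  # sizes of the two fragments
```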
→ AluI cut site (marked in the original figure)
BASE COUNT: 37 a 32 c 39 g 28 t
5. Unclassified spacer DNA