Chap. 6 Genes, Genomics, and Chromosomes (Part A)
Topics
• Eukaryotic Gene Structure
• Chromosomal Organization of Genes and Noncoding DNA
• Transposable (Mobile) DNA Elements
Goals
• Learn how genes encoded by complex transcription units are expressed.
• Learn the origin, types, and functions of DNA in higher organisms.
• Learn the properties of transposons and their roles in gene evolution.
RxFISH-painted human chromosomes.
Overview of Human Genes & Chromosomes
Human diploid genomic DNA contains ~10^9 bp divided among 22
autosomes and 2 sex chromosomes. The longest autosome (#1)
contains 280 x 10^6 bp. Only 1.5% of human DNA encodes
proteins or functional RNA products. The expressed, coding
segments of genes are called exons. Exons are highly conserved in
sequence. Noncoding DNA
consists of spacer DNA
between genes and intron
DNA within genes.
Noncoding DNA is not
strongly conserved and
accounts for most of the
variations in sequences
between individual humans.
As discussed later, DNA
is highly condensed
(overall ~10^5-fold in
mitotic chromosomes) by
protein-nucleic acid
complexes called
nucleosomes and other
higher-order structures
(Fig. 6.1).
Simple Transcription Units
Eukaryotic genes are monocistronic in that only one protein is
produced from a given mRNA. However, multiple forms of mRNAs,
and therefore proteins, are produced from many genes. Simple
gene transcription units produce only one type of mRNA and protein
(Fig. 6.3a). Mutations at sites a & b often reduce or prevent
transcription. Mutations at site c can change the amino acid
sequence of the protein and interfere with its function. Mutations
at site d affecting the selection of the exon 2/3 splice site can
result in an abnormally spliced mRNA and nonfunctional protein.
Complex Transcription Units
Complex gene transcription
units produce several species
of mRNAs, and thus proteins
(Fig. 6.3b). The exon content
of mRNAs and domain
composition of proteins are
varied by selection of
alternative splice sites (Top),
polyadenylation sites (Middle),
and even promoter sites
(Bottom). Site selection may
vary in different cell types and
during different stages of
development. The effects of
mutations (e.g., c & d) on the
gene products synthesized
from these transcription units
will be discussed in class.
About 60% of human genes
are contained in complex
transcription units.
Alternative Splicing & Gene Regulation
Protein domains can be encoded by a single exon or by a small
collection of exons within a larger gene. The coding regions for
domains can be spliced in or out of the primary transcript by the
process of alternative splicing. The resulting mRNAs encode
different forms of the protein, known as isoforms. Alternative
splicing is an important method for regulation of gene expression
in different tissues and different physiological states. It is
estimated that 60% of all human genes are expressed as
alternatively spliced mRNAs. Alternative splicing is illustrated in
Fig. 4.16 for the fibronectin gene. The fibroblast and hepatocyte
isoforms differ in their content of the EIIIA and EIIIB domains
which mediate cell surface binding. Twenty different isoforms of
fibronectin produced by alternative splicing have been identified.
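As a rough illustration of how optional exons multiply the number of isoforms, the sketch below enumerates every mRNA that results from including or skipping each optional exon. Only the names EIIIA and EIIIB come from the text; the other exon names and the helper `enumerate_isoforms` are hypothetical, and real splice-site selection is regulated rather than exhaustive.

```python
from itertools import product

def enumerate_isoforms(exons, optional):
    """Return every exon combination made by keeping or skipping each optional exon.

    exons: ordered list of exon names in the primary transcript.
    optional: set of exon names that may be spliced out.
    """
    optional_in_order = [e for e in exons if e in optional]
    isoforms = []
    for choices in product([True, False], repeat=len(optional_in_order)):
        keep = dict(zip(optional_in_order, choices))
        # Constitutive exons are always kept; optional ones follow `keep`.
        isoforms.append([e for e in exons if keep.get(e, True)])
    return isoforms

# Hypothetical exon layout: only EIIIA and EIIIB are named in the text.
exons = ["exon_5prime", "EIIIB", "exon_mid", "EIIIA", "exon_3prime"]
isoforms = enumerate_isoforms(exons, {"EIIIA", "EIIIB"})
for iso in isoforms:
    print("-".join(iso))
```

Two optional exons give 2 x 2 = 4 combinations; each additional independent splice choice doubles the count, which is how a single gene can yield many isoforms.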
Human Genomic DNA: Protein-coding Genes
Genomic DNA of higher eukaryotes contains 4 main classes of
DNA--1) protein-coding genes, 2) tandemly repeated genes, 3)
repetitious DNA, and 4) unclassified spacer DNA (Table 6.1).
Protein coding genes are grouped into the categories known as
solitary genes, and duplicated or diverged genes belonging to gene
families. In humans, roughly equal numbers of protein-coding genes
occur in these two categories. Groups of homologous duplicated
genes form gene and protein families, such as the β-globin family.
The Human β-globin Gene Family
The β-globin gene cluster on chromosome 11 is shown in Fig. 6.4a.
The β-globin genes are expressed in different stages of life. ε, Aγ,
and Gγ are expressed during different trimesters of fetal
development (next slide). β expression begins around birth &
continues throughout adult life. Fetal hemoglobin molecules made
with the d and Gg or Ag polypeptides have a higher affinity for
O2 than maternal hemoglobin, facilitating O2 transfer to the fetus.
The 5 β-globin genes are derived from an ancestral β-globin gene
via gene duplication. Over time, these genes accumulated adaptive
mutations via sequence drift resulting in the specialized species of
β-globin proteins. Genomic DNA also contains nonfunctional DNA
sequences called pseudogenes that are derived from gene
duplication or reverse transcription and integration of cDNA
sequences made from mRNA (covered below). β-globin pseudogenes
contain introns and thus were derived by gene duplication. Over
time these genes also became nonfunctional due to sequence drift.
Because they are not harmful, pseudogenes remain in the genome,
marking a gene duplication event in an earlier ancestor.
Expression of Human Globin Genes
Exon and Gene Duplication from Unequal
Crossing Over
Fig. 6.2 illustrates how duplication of genes (e.g., the β-globins)
and exons can occur via unequal crossing over during meiosis and
formation of gametes. Exon duplication results in proteins
containing repeated domains (e.g., the EGF precursor, Fig. 3.11).
In the examples shown, recombination occurs between
L1 retrotransposon sequences, which are common in genomic DNA.
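The outcome of unequal crossing over can be sketched with chromosomes modeled as lists of elements. This is a toy model under stated assumptions: the function `unequal_crossover` and the three-element chromosome are hypothetical, and the crossover is placed exactly at the start of the chosen repeat copy.

```python
def unequal_crossover(chrom_a, chrom_b, copy_a, copy_b, repeat="L1"):
    """Recombine two homologs at chosen copies of a shared repeat.

    copy_a / copy_b: which occurrence (0-based) of `repeat` pairs on each homolog.
    Returns the two recombinant chromosomes.
    """
    pos_a = [i for i, x in enumerate(chrom_a) if x == repeat][copy_a]
    pos_b = [i for i, x in enumerate(chrom_b) if x == repeat][copy_b]
    recombinant_1 = chrom_a[:pos_a] + chrom_b[pos_b:]
    recombinant_2 = chrom_b[:pos_b] + chrom_a[pos_a:]
    return recombinant_1, recombinant_2

# Toy homolog: one gene flanked by two L1 repeats.
chrom = ["L1", "globin", "L1"]

# The downstream L1 of one homolog mispairs with the upstream L1 of the other.
duplicated, deleted = unequal_crossover(chrom, chrom, 1, 0)
print(duplicated)  # ['L1', 'globin', 'L1', 'globin', 'L1']
print(deleted)     # ['L1']
```

One recombinant carries two copies of the gene and the other carries none, so a duplication in one gamete is always paired with a deletion in the other.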
Modular Domain Structure of Proteins
Domains are independently folding and functionally specialized
tertiary structure units within a protein. The respective
globular and fibrous structural domains of the hemagglutinin
monomer (which happen to be individual polypeptide chains) are
illustrated above in Fig. 3.10a. Domains (such as the EGF
domain) also may be encoded within a single polypeptide chain,
as illustrated in Fig. 3.11. Domains still perform their
standard functions although fused together in a longer
polypeptide (e.g., DNA binding and ATPase domains of a
transcription factor). The modular domain structure of many
proteins has resulted from the shuffling and splicing together
of their coding sequences within longer genes.
Epidermal growth
factor (EGF) domain
Gene Density in Genomic DNA
Higher eukaryotes contain far more noncoding DNA between
genes than bacteria and simple eukaryotes (Fig. 6.4). The region
of human genomic DNA containing the β-globin gene cluster
shown in the figure actually is a relatively "gene-rich" region of
human DNA. Some regions known as gene-poor "deserts" also
occur. Higher eukaryotes also contain a larger amount of intron
DNA. Although one-third of human DNA is transcribed into
pre-mRNA, 95% of that RNA is degraded after RNA splicing
reactions. On average, the typical exon is 50-200 bp in length,
while the median length of introns is 3.3 kb in human genes.
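A quick arithmetic check of the figures above, using exact fractions. The variable names are illustrative; the two input values are the ones quoted in the text.

```python
from fractions import Fraction

# Figures quoted in the text above.
transcribed = Fraction(1, 3)     # fraction of genomic DNA transcribed into pre-mRNA
degraded = Fraction(95, 100)     # fraction of that RNA degraded after splicing

# Fraction of genomic DNA represented in spliced RNA.
retained = transcribed * (1 - degraded)
print(f"{float(retained):.1%}")  # prints 1.7%
```

The result, about 1.7% of genomic DNA, is close to the 1.5% figure given earlier for DNA that encodes proteins or functional RNA products.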
Human Genomic DNA: Tandemly
Repeated Genes
Tandemly repeated genes also are derived by gene duplication.
Unlike gene families, the sequences of these duplicated genes
are identical or strongly conserved. In addition, they commonly
are arranged in a head-to-tail fashion in tandem arrays over a
long stretch of DNA. rRNAs and snRNAs (used in splicing
reactions, Chap. 8) are representative of this group (Table
6.1). Multiple copies of these genes are needed due to the
requirement for vast amounts of these RNAs in the cell. tRNA
and histone genes are included in this category, but these
genes typically occur in clusters and not true tandem arrays.
Nonprotein-coding Genes in Human
Genomic DNA
Thousands of genes in the human genome encode functional RNAs (Table
6.2). The functions of several of these are covered in later chapters.
Repetitious DNA
Two main categories of repetitious DNA--simple-sequence DNA
and interspersed repeats--occur in eukaryotic genomes (Table
6.1). Interspersed repeats are more common and are derived
largely from transposons. Simple-sequence DNA is less prevalent,
accounting for ~ 6% of human genomic DNA. Simple-sequence DNA
is also known as satellite DNA, due to its formation of satellite
bands during cesium chloride density gradient ultracentrifugation.
The function of this DNA is mostly obscure. It is commonly found
at the centromere and telomere regions of chromosomes.
Properties of Satellite DNA
Satellite DNA is classified into 3 types
based on length. True satellite DNA
consists of 14-500 bp sequence units
that tandemly repeat over 20-100 kb
lengths of genomic DNA. Minisatellite
DNA consists of 15-100 bp sequence
units that tandemly repeat over 1-5 kb
stretches of DNA. Microsatellite DNA
consists of 1-13 bp units that can
repeat up to 150 times. Microsatellite
DNA is thought to originate from
“backward slippage” of a growing
daughter strand on its template strand
during DNA replication (Fig. 6.5). The
sequences of repeat units are highly
conserved, which suggests they perform
important functions. Each category of
satellite DNA contains a number of
different repeat sequences. Simple-sequence DNAs can serve as DNA
markers due to variations in repeat
number. Satellite DNAs are exploited in
FISH (fluorescence in situ hybridization)
chromosome staining (Fig. 6.6).
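Because repeat number is the variable that makes simple-sequence DNA useful as a marker, a minimal sketch of counting tandem copies in a sequence may help. The helper `longest_tandem_run` and the example sequences are hypothetical; real repeat detection must also tolerate imperfect copies.

```python
import re

def longest_tandem_run(sequence, unit):
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)

# Two hypothetical alleles of a CA microsatellite differing by one repeat unit,
# as could arise from backward slippage during replication.
allele_1 = "GGCACACACATT"      # 4 copies of CA
allele_2 = "GGCACACACACATT"    # 5 copies of CA

print(longest_tandem_run(allele_1, "CA"))  # prints 4
print(longest_tandem_run(allele_2, "CA"))  # prints 5
```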
DNA Fingerprinting
DNA fingerprinting is a method for
identifying individuals based on their
minisatellite DNA (Fig. 6.7). It was
developed in the mid-80s and is
widely used in forensics, paternity
analysis, and for research purposes.
In the method, minisatellite DNA
from a genomic DNA specimen is
amplified by PCR using primers that
bind to unique sequences flanking
minisatellite repeat units. Bands
corresponding to each minisatellite
locus then are separated on gels.
Although satellite DNA is highly
conserved in sequence, the number
of tandem copies at each locus is
highly variable between individuals.
This results from unequal crossing
over during formation of gametes in
meiosis. Due to the variation in the
number of repeats at each locus,
different individuals can be readily
distinguished based on banding
patterns.
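The logic of the method can be sketched as follows: the size of each PCR product is the flanking sequence between the primers plus the repeat unit length times the number of copies. All locus names, lengths, and genotypes below are hypothetical values chosen for illustration.

```python
# Hypothetical loci: (flanking sequence between the primers in bp, repeat unit in bp).
LOCI = {"locus_A": (120, 16), "locus_B": (90, 33)}

def band_sizes(genotype, loci=LOCI):
    """Map each locus to the sorted PCR product sizes (bp) of its two alleles.

    genotype: dict of locus name -> (copies on allele 1, copies on allele 2).
    """
    sizes = {}
    for locus, copies in genotype.items():
        flank_bp, unit_bp = loci[locus]
        sizes[locus] = sorted(flank_bp + unit_bp * n for n in copies)
    return sizes

# Two hypothetical individuals with different repeat numbers.
person_1 = {"locus_A": (10, 14), "locus_B": (7, 7)}
person_2 = {"locus_A": (10, 12), "locus_B": (5, 9)}

print(band_sizes(person_1))  # {'locus_A': [280, 344], 'locus_B': [321, 321]}
print(band_sizes(person_2))  # {'locus_A': [280, 312], 'locus_B': [255, 387]}
```

The two individuals share one band at locus_A but differ elsewhere; typing more loci makes a chance match between unrelated people increasingly unlikely.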
Chap. 6 Problem 3
Satellite DNA is classified into 3 categories based on
length. Satellite DNA consists of 14-500 bp sequence
units that tandemly repeat over 20-100 kb lengths of
genomic DNA. Minisatellite DNA consists of 15-100
bp sequence units that tandemly repeat over 1-5 kb
stretches of DNA. Microsatellite DNA consists of 1-13 bp units that can repeat up to 150 times.
Although the sequences of satellite DNA are highly
conserved, the number of tandem copies at each locus
is highly variable between individuals. This variation
arises from unequal crossing over during formation of
gametes in meiosis (Upper figure). DNA fingerprinting
is a method for identifying individuals based on
variations in minisatellite DNA (Fig. 6.7). In the
method, minisatellite DNA is amplified by PCR using
unique primers flanking repeat regions, and the
collection of fragments is run on a gel. Due to the
variation in the number of repeats at different loci,
different individuals can be readily distinguished.
Interspersed Repeats
Interspersed repeat DNA comprises the largest fraction of
repetitious DNA in eukaryotic genomes. This DNA, which is also
called moderately repeated DNA, makes up ~45% of human genomic
DNA. Interspersed repeat DNA is composed of partial and
complete transposon sequences or "mobile DNA". Mobile DNAs
were discovered by Barbara McClintock in the 1940s. These
sequences move by "transposition". Transpositions in germ line cells
are inheritable and occur at a rate of one transposition per 8
individuals. In somatic cells they can cause somatic cell mutations.
Mobile DNA has been very important in genome evolution.
Mobile DNA Elements
Mobile DNA elements are
grouped into two classes,
DNA transposons and
retrotransposons (Fig. 6.8).
DNA transposons move
directly as DNA via a "cut-and-paste" mechanism.
Retrotransposons move via an
RNA intermediate and a
"copy-and-paste" mechanism,
wherein the original copy of
the transposon is preserved.
Retroviruses, like HIV,
formally are a subclass of
retrotransposons that can
move between cells because
they encode viral coat
proteins. DNA transposons
predominate in bacteria;
retrotransposons are more
prevalent in eukaryotes.
Mobile DNA in Prokaryotes
Bacteria contain DNA transposons called insertion sequences (IS
elements; Fig. 6.9). IS elements are 1-2 kb DNAs that transpose within the
bacterial genome to random locations. Transposition ("jumping") is
mediated by an encoded transposase protein. Insertion usually
causes gene inactivation and is harmful. Nonetheless, E. coli
encodes ~20 types of IS elements. They are tolerated in part
due to their low transposition rate (1 in 10^5 - 10^7 cells per
generation). This rate is set by the low rate of transcription of
the transposase gene. IS elements contain inverted repeat
sequences of ~50 bp at each end of the protein-coding region
that are crucial for transposition.
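An inverted repeat means that the sequence at one end of the element, read on the top strand, is the reverse complement of the sequence at the other end. A minimal check is sketched below; the function names and the toy sequences (5 bp repeats instead of ~50 bp) are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def has_terminal_inverted_repeats(element, length):
    """True if the last `length` bases are the reverse complement of the first `length`."""
    return element[-length:] == reverse_complement(element[:length])

# Toy IS element: 5 bp inverted repeats flanking a short stand-in coding region.
is_element = "GGTCA" + "ATGAAATAG" + "TGACC"
direct_repeat_element = "GGTCA" + "ATGAAATAG" + "GGTCA"

print(has_terminal_inverted_repeats(is_element, 5))             # prints True
print(has_terminal_inverted_repeats(direct_repeat_element, 5))  # prints False
```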
Mechanism of IS Element Transposition
Transposition occurs in 3 main
steps, as summarized in Fig.
6.10. The excision of the IS
element and its cutting-and-pasting into the target sequence
is mediated by the transposase
(Steps 1 & 2). The single-stranded DNA regions remaining
at the insertion site after
transposase action are filled-in
and the nicks sealed by cellular
DNA polymerase and DNA ligase
(Step 3). All transposases we
will cover produce staggered
cuts at their target sites. This
leads to production of short
direct repeat sequences
immediately flanking the sites of
insertion. Eukaryotic DNA
transposons jump in genomic
DNA by a similar mechanism.
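Why staggered cuts leave short direct repeats can be seen in a top-strand sketch: the bases between the two offset cuts end up single-stranded on both sides of the inserted element, and fill-in synthesis copies them, so they appear twice. The function name, target sequence, stagger length, and the `[IS]` placeholder below are all hypothetical.

```python
def insert_with_staggered_cut(target, position, stagger, element):
    """Insert `element` into `target` and duplicate the `stagger` bases at the cut.

    Models the final top-strand sequence after transposase action and gap
    fill-in: the bases between the staggered cuts flank the element on both sides.
    """
    duplicated = target[position:position + stagger]
    return target[:position] + duplicated + element + target[position:]

# Hypothetical 11 bp target with a 5 bp stagger starting at index 3.
target = "AAACCGTATTT"
result = insert_with_staggered_cut(target, 3, 5, "[IS]")
print(result)  # prints AAACCGTA[IS]CCGTATTT
```

The 5 bp sequence CCGTA was present once in the target and now flanks the element as a direct repeat.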
Mechanism of DNA Transposon Copy
Number Increase
About 3 x 10^5 copies of
full-length and truncated
DNA transposons occur in
human genomic DNA (3%
of DNA). Although DNA
transposons move via a
cut-and-paste mechanism,
their copy number in the
genome will increase if
they transpose during
DNA synthesis preceding
the first meiotic division
of gametogenesis (Fig.
6.11).
LTR Retrotransposons
Eukaryotic retrotransposons fall into two major groups--LTR
retrotransposons and non-LTR retrotransposons. Together, these
sequences account for 42% of human genomic DNA.
LTR stands for long
terminal repeat. LTRs consist
of 250-600 bp direct repeat
sequences located at the ends
of the retrotransposon coding
region (Fig. 6.12). LTR
retrotransposons share many
features with retroviruses.
They both encode LTRs,
reverse transcriptase, and
DNA integrase. However, LTR
retrotransposons lack coat
proteins that allow
retroviruses to move between
cells. Transposition occurs via
an RNA intermediate that is
transcribed from a promoter
in the left LTR (Fig. 6.13).
The primary transcript is
polyadenylated, forming the
retroviral genomic RNA.
Retroviral & LTR-retrotransposon DNA
Synthesis
The mechanism by which
retroviral and LTR
retrotransposon DNA is
synthesized prior to
integration into genomic
DNA is shown in Fig. 6.14.
DNA integrase inserts the
completed retroviral DNA
into genomic DNA via a
mechanism similar to that
described for bacterial IS
elements. Namely, a short
direct repeat is produced at
each end of the integrated
DNA. On the order of 4.4
x 10^5 LTR retrotransposon
sequences occur in human
DNA. Most of these are
non-functional due to
recombination between LTR
sequences and deletion of
the intervening DNA.
Non-LTR Retrotransposons
Even more abundant in human genomic DNA are non-LTR
retrotransposon sequences. There are two main classes of non-LTR
retrotransposons, known as long interspersed elements (LINEs, ~6
kb), and short interspersed elements (SINEs, ~300 bp). LINEs
encode a reverse transcriptase (ORF2) needed for transposition
(Fig. 6.16), whereas SINEs do not. Instead SINEs are thought to
rely on LINE-encoded enzymes for transposition. LINEs are
grouped into L1, L2, and L3 families, of which only L1 is active
today. LINE sequences occur at ~9 x 10^5 copies per human
genome. SINEs occur at ~1.6 x 10^6 copies. The most abundant
SINE is the Alu element, which is named based on the fact that it
contains an AluI restriction site. Alu elements were important for
gene duplications at the β-globin locus (Fig. 6.4).
L/SINE Transposition (I)
The mechanism of LINE (and
SINE) transposition is illustrated
in Fig. 6.17. In summary, LINE
primary transcripts are translated
into the ORF1 and ORF2 gene
products in the cytosol. The RNA
then returns to the nucleus with
the ORF1 & 2 proteins. These
enzymes catalyze reverse
transcription and integration of
the LINE element at AT-rich
regions of genomic DNA. The
poly(A) tail of the LINE RNA is
used for selection of integration
sites. SINE element
retrotransposition is thought to be
mediated by the ORF1 & 2
proteins encoded by LINEs.
L/SINE Transposition (II)
Many LINEs are truncated at the
5' end due to incomplete reverse
transcription of the LINE RNA.
For this reason, and sequence
drift, only 0.01% of LINE
elements are functional today
(~100 per genome). It is further
thought that LINE & SINE
transpositions occur at a rate of
~1 in 8 individuals in the
population. LINE transpositions
have been implicated in human
disease. About 1/600 mutations
causing disease can be traced to
LINE transposition. However,
LINE & SINE transpositions have
been crucial in the evolution of
the human genome, as discussed
in the remaining slides. Lastly,
the ORF1 and 2 LINE proteins
are thought to be responsible for
insertion of processed
pseudogenes into genomic DNA.
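The count of functional LINEs quoted above can be checked against the LINE copy number given earlier (~9 x 10^5 per genome). The variable names are illustrative, and integer arithmetic is used to avoid rounding artifacts.

```python
line_copies = 9 * 10**5   # LINE sequences per human genome (from the text)
functional_per_10k = 1    # 0.01% expressed as 1 in 10,000

functional_lines = line_copies * functional_per_10k // 10_000
print(functional_lines)   # prints 90
```

The result of 90 agrees with the rounded figure of ~100 functional LINEs per genome.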
Exon Shuffling via Recombination Between
Homologous Interspersed Repeats
We previously have noted that gene evolution has involved exon
shuffling between protein-coding genes in the genome. A large
amount of shuffling has occurred due to the prevalence of
interspersed repeats in the genome. Due to sequence conservation
within these regions, crossover events can take place at these
sites (Fig. 6.18). This results in exon shuffling between
nonhomologous genes and the formation of new genes with new
combinations of protein domains. As illustrated in Fig. 6.2, such
events also have been important in exon and gene duplications.
Exon Shuffling via Transposition
Exon shuffling can also occur via cut-and-paste transpositions
mediated by DNA transposons. The mechanism by which this
occurs is illustrated in Fig. 6.19a. It requires that two copies of
the transposon flank the target exon. Both DNA transposons and
the exon will move as one piece of DNA if the transposase
happens to cleave DNA at the left inverted repeat of the
upstream transposon and at the right inverted repeat of the
downstream transposon. Gene 1 ends up losing the exon, and Gene
2 acquires the exon.
Exon Shuffling via Transposition
Exons can move along with a LINE element when it transposes via
its copy-and-paste mechanism (Fig. 6.19b). When a LINE element
has a weak poly(A) signal, RNA polymerase II continues to
transcribe downstream, potentially through an exon. If this exon
has a strong poly(A) signal, then transcription stops and the RNA
is polyadenylated. Then following the mechanism in Fig. 6.17,
DNA encoding the exon and the LINE element can be incorporated
into another gene. The spliced mRNA produced from the acceptor
gene may contain the newly introduced exon. Exon shuffling is
supported by experimental evidence and the enormous amount of
interspersed repeat DNA in genomes. Over billions of years, it
has played a major role in evolution of genomes.