Genome Organization & Evolution

Genome Organization &
Protein Synthesis and
Processing in Plants
Viral genomes
Viral genomes: ssRNA, dsRNA, ssDNA, dsDNA, linear or
circular
Viruses with RNA genomes:
•Almost all plant viruses and some bacterial and animal viruses
•Genomes are rather small (a few thousand nucleotides)
Viruses with DNA genomes (e.g. lambda = 48,502 bp):
•Often a circular genome.
Replicative form of viral genomes
•all ssRNA viruses produce dsRNA molecules
•many linear DNA molecules become circular
Molecular weight and contour length:
• Duplex length per base pair = 3.4 Å
• Molecular weight per base pair = ~660 Da
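As a rough worked example of the two rules of thumb above, the sketch below estimates contour length and molecular weight for a double-stranded genome. The function names are illustrative only; the lambda genome size is the one quoted on the earlier slide.

```python
# Rules of thumb from the slide (duplex DNA).
RISE_PER_BP_ANGSTROM = 3.4   # duplex length per base pair
DALTONS_PER_BP = 660         # approximate molecular weight per base pair


def contour_length_um(n_bp: int) -> float:
    """Contour length in micrometres (1 um = 10,000 angstroms)."""
    return n_bp * RISE_PER_BP_ANGSTROM / 1e4


def molecular_weight_da(n_bp: int) -> int:
    """Approximate molecular weight in daltons."""
    return n_bp * DALTONS_PER_BP


lambda_bp = 48_502  # phage lambda genome size quoted above
print(f"contour length:   {contour_length_um(lambda_bp):.1f} um")
print(f"molecular weight: {molecular_weight_da(lambda_bp):.2e} Da")
```

The same two functions can be applied to any genome size given in base pairs.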
Procaryotic genomes
• Generally 1 circular chromosome (dsDNA)
• Usually without introns
• Relatively high gene density (~2500 genes
per mm of E. coli DNA)
• Contour length of E. coli genome: 1.7 mm
• Often indigenous plasmids are present
Plasmids
Extrachromosomal circular DNAs
• Found in bacteria, yeast and other fungi
• Size varies from ~3,000 bp to 100,000 bp
• Replicate autonomously (origin of replication)
• May contain resistance genes
• May be transferred from one bacterium to another
• May be transferred across kingdoms
• Multicopy plasmids (up to ~400 plasmids per cell)
• Low copy plasmids (1-2 copies per cell)
• Plasmids may be incompatible with each other
• Are used as vectors that could carry a foreign gene of
interest (e.g. insulin)
[Figure: plasmid map showing ori, the β-lactamase gene and a foreign gene]
Eukaryotic genome
• Moderately repetitive
– Functional (protein coding, tRNA coding)
– Unknown function
• SINEs (short interspersed elements)
– 200-300 bp
– 100,000 copies
• LINEs (long interspersed elements)
– 1-5 kb
– 10-10,000 copies
Eukaryotic genome
• Highly repetitive
– Minisatellites
• Repeats of 14-500 bp
• 1-5 kb long
• Scattered throughout genome
– Microsatellites
• Repeats up to 13 bp
• 100s of kb long, 10^6 copies
• Around centromere
– Telomeres
• Short repeats (6 bp)
• 250-1,000 at ends of chromosomes
Eucaryotic genomes
• Located on several chromosomes
• Relatively low gene density (50 genes per mm of
DNA in humans)
• Contour length of DNA from a single human cell = 2
meters
• Approximately 10^11 cells = total length 2 x 10^11 m (2 x 10^8 km)
• Distance between sun and earth (1.5 x 10^8 km)
• Human chromosomes vary in length over a 25 fold
range
• Carry organelle genomes as well
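A quick check of the length arithmetic on this slide, written as a small sketch. The three constants are the slide's own values; the variable names are illustrative.

```python
DNA_PER_CELL_M = 2.0    # contour length of DNA from one human cell (slide value)
CELLS_PER_BODY = 1e11   # cell count used on the slide
SUN_EARTH_KM = 1.5e8    # sun-earth distance (slide value)

total_m = DNA_PER_CELL_M * CELLS_PER_BODY   # metres of DNA in all cells
total_km = total_m / 1000                   # same length in kilometres
ratio = total_km / SUN_EARTH_KM             # how many sun-earth distances

print(f"total DNA length: {total_m:.1e} m = {total_km:.1e} km")
print(f"sun-earth distances covered: {ratio:.2f}")
```

With these inputs the total comes to the same order of magnitude as the sun-earth distance, which is the point of the comparison.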
Mitochondrial genome (mtDNA)
• Multiple identical circular chromosomes
• Size ~15 kb in animals
• Size ~200 kb to 2,500 kb in plants
• Over 95% of mitochondrial proteins are
encoded in the nuclear genome.
• Often A+T rich genomes.
• mtDNA is replicated before or during
mitosis
Chloroplast genome
(cpDNA)
• Multiple circular molecules
• Size ranges from 120 kb to 160 kb
• Similar to mtDNA
• Many chloroplast proteins are encoded
in the nucleus (separate signal
sequence)
“Cellular” Genomes
[Figure: genomes of viruses (viral genome inside a capsid), procaryotes (bacterial chromosome and plasmids) and eucaryotes (nuclear chromosomes, mitochondrial genome, chloroplast genome)]
Genome: all of an organism’s genes plus intergenic DNA
Intergenic DNA = DNA between genes
Estimated genome sizes
[Figure: genome size ranges on a log axis from 1e1 to 1e12 for mammals, plants, fungi, bacteria (>100), mitochondria (~100) and viruses (1024)]
Size in nucleotides. Number in ( ) = completely sequenced genomes
Size of genomes
Organism             Genome size (bp)
Epstein-Barr virus   0.172 x 10^6
E. coli              4.6 x 10^6
S. cerevisiae        12.1 x 10^6
C. elegans           95.5 x 10^6
A. thaliana          117 x 10^6
D. melanogaster      180 x 10^6
H. sapiens           3200 x 10^6
Chromosome organization
Eucaryotic chromosome
[Figure: eucaryotic chromosome with a telomere at each end and a centromere between the p-arm and the q-arm]
Centromere:
• DNA sequence that serves as an attachment site for proteins during mitosis.
• In yeast these sequences (~130 nt) are very A+T rich.
• In higher eucaryotes centromeres are much longer and contain
“satellite DNA”
Telomeres:
• At the end of chromosomes; help stabilize the chromosome
• In yeast telomeres are ~ 100 bp long (imperfect repeats)
• Repeats are added by a specific telomerase
5’ – (TxGy)n
3’ – (AxCy)n
x and y = 1 - 4
n = 20 to 100; (1500 in mammals)
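To make the (TxGy)n notation concrete, here is a minimal sketch that writes out both strands for chosen x, y and n. The function name and the example values (x = 2, y = 4, n = 3) are illustrative assumptions, not taken from the slide.

```python
def telomere_strands(x: int, y: int, n: int) -> tuple:
    """Return (G-rich strand written 5'->3', C-rich strand written 3'->5')."""
    if not (1 <= x <= 4 and 1 <= y <= 4):
        raise ValueError("x and y must be between 1 and 4")
    top = ("T" * x + "G" * y) * n      # 5' - (TxGy)n
    bottom = ("A" * x + "C" * y) * n   # 3' - (AxCy)n, base-paired to the top strand
    return top, bottom


top, bottom = telomere_strands(x=2, y=4, n=3)  # illustrative values
print("5'-" + top + "-3'")
print("3'-" + bottom + "-5'")
```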
Gene classification
[Figure: simplified chromosome divided into intergenic regions, coding genes and non-coding genes. Coding genes give messenger RNA, which gives proteins (structural proteins, enzymes). Non-coding genes give structural RNA (transfer RNA, ribosomal RNA, other RNA).]
What is a gene ?
Definitions
1. Classical definition: Portion of DNA that determines a
single character (phenotype)
2. One gene – one enzyme (Beadle & Tatum 1941): “Every
gene encodes the information for one enzyme”
3. One gene – one protein: “One gene contains information
for one protein” (structural proteins included); also stated
as one gene – one polypeptide
4. Current definition: A piece of DNA (or in some cases
RNA) that contains the primary sequence to produce a
functional biological gene product (RNA, protein).
Coding region
Nucleotides (open reading frame) encoding
the amino acid sequence of a protein
The molecular definition of gene includes
more than just the coding region
Noncoding regions
• Regulatory regions
– RNA polymerase binding site
– Transcription factor binding sites
• Introns
• Polyadenylation [poly(A)] sites
Gene
Molecular definition:
Entire nucleic acid sequence necessary for the
synthesis of a functional polypeptide
(protein chain) or functional RNA
Anatomy of a gene
• ORF. From start (ATG) to stop (TGA, TAA,
TAG)
• Upstream region with binding site. (e.g.
TATA box).
• Poly(A) ‘tail’
• Splice sites. Introns are bounded by GT (5’ end) and AG (3’ end)
splice signals.
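The ORF definition above (start at ATG, read in triplets, end at TGA, TAA or TAG) can be sketched as a simple scan. This is a minimal illustration on one strand only; the function name and the test sequence are made up, and real gene finders do far more (both strands, introns, minimum length, statistics).

```python
from typing import Optional

STOP_CODONS = {"TGA", "TAA", "TAG"}


def find_first_orf(dna: str) -> Optional[str]:
    """Return the first ATG...stop reading frame on the given strand, or None."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        # Walk codon by codon in the frame set by this ATG.
        for i in range(start, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        # No in-frame stop: try the next ATG.
        start = dna.find("ATG", start + 1)
    return None


print(find_first_orf("ccATGAAATGGTAAgg"))
```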
Bacterial genes
• Most do not have introns
• Many are organized in operons: contiguous
genes, transcribed as a single polycistronic
mRNA, that encode proteins with related
functions
Polycistronic mRNA encodes several proteins
Bacterial operon
What would be the effect of a mutation in
the control region (a) compared to a
mutation in a structural gene (b)?
Eucaryotic genes
Hemoglobin beta subunit gene
Segment     Length
Exon 1      90 bp
Intron A    131 bp
Exon 2      222 bp
Intron B    851 bp
Exon 3      126 bp
Splicing
Introns: intervening sequences within a gene that are not translated
into a protein sequence. Collagen has 50 introns.
Exons: sequences within a gene that encode protein sequences
Splicing: Removal of introns from the pre-mRNA (primary transcript) molecule.
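As a small illustration of what splicing does to lengths, the sketch below uses the exon and intron sizes given for the hemoglobin beta subunit gene on this slide. The `splice` helper and the toy sequence are hypothetical; real splicing is guided by splice signals, not by a list of lengths.

```python
# Segment lengths for the hemoglobin beta subunit gene, as given on the slide.
SEGMENTS = [
    ("exon", 90),     # Exon 1
    ("intron", 131),  # Intron A
    ("exon", 222),    # Exon 2
    ("intron", 851),  # Intron B
    ("exon", 126),    # Exon 3
]


def spliced_length(segments) -> int:
    """Total length left after the introns are removed."""
    return sum(length for kind, length in segments if kind == "exon")


def splice(seq: str, segments) -> str:
    """Cut the intron intervals out of seq and join the exons in order."""
    pieces, pos = [], 0
    for kind, length in segments:
        if kind == "exon":
            pieces.append(seq[pos:pos + length])
        pos += length
    return "".join(pieces)


total = sum(length for _, length in SEGMENTS)
print(f"exons + introns: {total} bp, exons only: {spliced_length(SEGMENTS)} bp")
print(splice("AAAgtccagGGG", [("exon", 3), ("intron", 6), ("exon", 3)]))
```

With the slide's numbers the exons add up to 438 bp, a whole number of codons (146).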
Regulatory mechanisms
• ‘organize expression of genes’ (function
calls)
• Promoter region (binding site), usually near
coding region
• Binding can block (inhibit) expression
• Computational challenges
– Identify binding sites
– Correlate sequence to expression
Eukaryotic genes
• Most have introns
• Produce monocistronic mRNA: only one
encoded protein
• Large
Alternative splicing
• Splicing is the removal of introns
• mRNA from some genes can be spliced into
two or more different mRNAs
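One simple way to picture "two or more different mRNAs" is exon skipping. The sketch below lists every mRNA that keeps the first and last exon and any subset of the internal exons, in their original order. This is a deliberately simplified, hypothetical model (real alternative splicing also uses alternative splice sites, retained introns and so on), and the exon names are placeholders.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """List isoforms that keep the first and last exon and skip 0..all internal exons."""
    first, internal, last = exons[0], exons[1:-1], exons[-1]
    isoforms = []
    # Go from "keep every internal exon" down to "skip them all".
    for k in range(len(internal), -1, -1):
        for kept in combinations(internal, k):
            isoforms.append([first, *kept, last])
    return isoforms


for isoform in exon_skipping_isoforms(["E1", "E2", "E3", "E4"]):
    print("-".join(isoform))
```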
“Nonfunctional” DNA
• Higher eukaryotes have a lot of noncoding
DNA
• Some has no known structural or regulatory
function (no genes)
Types of eukaryotic DNA
Duplicated genes
• Encode closely related (homologous)
proteins
• Clustered together in genome
• Formed by duplication of an ancestral gene
followed by mutation
Five functional genes and two pseudogenes
Pseudogenes
• Nonfunctional copies of genes
• Formed by duplication of ancestral gene, or
reverse transcription (and integration)
• Not expressed due to mutations that
produce a stop codon (nonsense or
frameshift) or prevent mRNA processing, or
due to lack of regulatory sequences
Repetitive DNA
• Moderately repeated DNA
– Tandemly repeated rRNA, tRNA and histone
genes (gene products needed in high amounts)
– Large duplicated gene families
– Mobile DNA
• Simple-sequence DNA
– Tandemly repeated short sequences
– Found in centromeres and telomeres (and others)
– Used in DNA fingerprinting to identify
individuals
Types of DNA repeats
Perfect repeats vs degenerate repeats
Tandem repeats (e.g. satellite DNA)
5’-CATGTGCTGAAGGCTATGTGCTGCGACG- 3’
3’-GTACACGACTTCCGATACACGACGCTGC- 5’
Inverted repeats (e.g. in transposons)
5’-CATGTGCTGAAGGCTCAGCACATCGACG- 3’
3’-GTACACGACTTCCGAGTCGTGTAGCTGC- 5’
• Form stem-loop structures
Palindromes = adjacent inverted repeats
(e.g. restriction sites)
• Form hairpin structures
[Figure: hairpin with stem and loop]
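The difference between the repeat types above comes down to the reverse complement. The following sketch checks the two example sequences from this slide; the function names are illustrative, and GAATTC (the EcoRI restriction site) is used as a palindrome example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindrome(site: str) -> bool:
    """A palindromic site reads the same 5'->3' on both strands."""
    return site.upper() == reverse_complement(site)


def has_inverted_repeat(seq: str, arm: str) -> bool:
    """True if `arm` is followed downstream by its reverse complement."""
    seq, arm = seq.upper(), arm.upper()
    first = seq.find(arm)
    if first == -1:
        return False
    return seq.find(reverse_complement(arm), first + len(arm)) != -1


tandem = "CATGTGCTGAAGGCTATGTGCTGCGACG"    # top strand of the tandem-repeat example
inverted = "CATGTGCTGAAGGCTCAGCACATCGACG"  # top strand of the inverted-repeat example

print(has_inverted_repeat(tandem, "ATGTGCTG"))
print(has_inverted_repeat(inverted, "ATGTGCTG"))
print(is_palindrome("GAATTC"))
```

The two arms of an inverted repeat can base-pair with each other on the same strand, which is why they can form the stem of a stem-loop.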
Repetitive sequences
[Figure: caesium chloride density gradient separating satellite DNA from chromosomal DNA]
Repeats in the mouse genome
Type                    No. of repeats   Size              Percent of genome
Highly repetitive       > 1 million      < 10 bp           10 %
Moderately repetitive   > 1000           ~150 - ~300 bp    20 %
DNA repeats and forensics
Gender determination
1) Standard technique: PCR amplification
of the amelogenin locus
(Males = XY => 103 + 109 bp)
2) AluSTXa: Alu insertion on X
3) AluSTYa: Alu insertion on Y
[Figure: X-Y homologous regions, with the Alu sequence inserted on X (AluSTXa) or on Y (AluSTYa)]
[Figure: AluSTXa gel, lanes M, F and Suspect, bands at 878 bp and 556 bp]
[Figure: AluSTYa gel, lanes M, F and Suspect, bands at 528 bp and 199 bp]
Mobile DNA
• Move within genomes
• Most of moderately repeated DNA sequences
found throughout higher eukaryotic genomes
– L1 LINE is ~5% of human DNA (~50,000 copies)
– Alu is ~5% of human DNA (>500,000 copies)
• Some encode enzymes that catalyze
movement
Transposition
• Movement of mobile DNA
• Involves copying of mobile DNA element
and insertion into new site in genome
Why?
• Molecular parasite: “selfish DNA”
• Probably have significant effect on
evolution by facilitating gene duplication,
which provides the fuel for evolution, and
exon shuffling
RNA or DNA intermediate
• Transposon moves
using DNA
intermediate
• Retrotransposon
moves using RNA
intermediate
Types of mobile DNA elements
LTR (long terminal repeat)
• Flank viral retrotransposons and retroviruses
• Contain regulatory sequences
Transcription start site and poly (A) site
LINEs and SINEs
• Non-viral retrotransposons
– RNA intermediate
– Lack LTR
• LINEs (long interspersed elements)
– ~6000 to 7000 base pairs
– L1 LINE (~5% of human DNA)
– Encode enzymes that catalyze movement
• SINEs (short interspersed elements)
– ~300 base pairs
– Alu (~5% of human DNA)
Proteins
• Most protein sequences (today) are inferred
• What’s wrong with this?
• Proteins (and nucleic acids) are modified
• ‘Mature’ RNA
• Computational challenges
– Identify (possible) aspects of molecular life cycle
– Identify protein-protein and protein-nucleic acid
interactions
Genetic variation
• Variable number tandem repeats
(minisatellites). 10-100 bp. Forensic
applications.
• Short tandem repeat polymorphisms
(microsatellites). 2-5 bp, 10-30 consecutive
copies.
• Single nucleotide polymorphisms
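Tandem-repeat polymorphisms differ between individuals in the number of consecutive copies. A minimal sketch of counting copies is shown below; the motif GATA and both allele sequences are made-up examples, not real forensic markers.

```python
import re


def longest_tandem_run(seq: str, motif: str) -> int:
    """Number of copies in the longest uninterrupted run of `motif` in `seq`."""
    motif = motif.upper()
    runs = re.findall(f"(?:{re.escape(motif)})+", seq.upper())
    return max((len(run) // len(motif) for run in runs), default=0)


# Two hypothetical alleles that differ only in repeat number.
allele_a = "GG" + "GATA" * 12 + "CC"
allele_b = "GG" + "GATA" * 15 + "CC"

print(longest_tandem_run(allele_a, "GATA"), longest_tandem_run(allele_b, "GATA"))
```

In practice repeat number is usually read from PCR fragment length rather than from sequence, so this is only a way to picture what varies.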
Single nucleotide polymorphisms
• Frequency: about 1 per 2,000 bp.
• Types
– Silent
– Truncating
– Shifting
• Significance: much of individual variation.
• Challenge: correlation to disease
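The effect of a single-base substitution inside a coding region can be worked out from the standard genetic code. The sketch below labels a substitution as silent or truncating, the first two types listed above; the third label, "amino-acid change", is added here for completeness and is not on the slide. Frame "shifting" comes from insertions or deletions, so it is outside what this substitution-only sketch covers. Function and variable names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, pos: int, new_base: str) -> str:
    """Classify replacing the base at `pos` (0, 1 or 2) of a sense codon."""
    codon = codon.upper()
    mutant = codon[:pos] + new_base.upper() + codon[pos + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "silent"
    if after == "*":
        return "truncating"
    return "amino-acid change"  # label not on the slide


print(classify_substitution("CTT", 2, "C"))
print(classify_substitution("TGG", 2, "A"))
print(classify_substitution("GAG", 1, "T"))
```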
E. coli genome
• 4.6 x 10^6 bp. One chromosome. Published
1997.
• 4,285 protein-coding genes
• 122 structural RNA genes
• Repeats. Regulatory elements. Transposons.
• Lateral transfers.
E. coli protein functions
Function              Genes   Percent
Regulatory            45      1.05
Cell structure        182     4.24
Transposons, etc.     87      2.03
Transport & binding   281     6.55
Putative transport    146     3.40
Replication, repair   115     2.68
Transcription         55      1.28
Translation           182     4.24
Enzymes               251     5.85
Unknown               1632    38.06