Protein-coding genes in eukaryotic DNA

Gene Structure &
Genomes
Biology 224
Instructor: Tom Peavy
Oct 12 & 14, 2009
<Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner>
Similarities & Differences
Prokaryotic vs. Eukaryotic Genomic DNA
Size of genome?
Complexity of genes?
Open reading frames (1 gene per stretch)?
Regulatory sequences for transcription?
Density of genes?
One gene = 1 transcript?
Finding genes in eukaryotic DNA
Types of genes include
• protein-coding genes
• pseudogenes
• functional RNA genes: tRNA, rRNA and others
-- snoRNA: small nucleolar RNA
-- snRNA: small nuclear RNA
-- miRNA: microRNA
There are several kinds of exons:
-- noncoding
-- initial coding exons
-- internal exons
-- terminal exons
-- some single-exon genes are intronless
Eukaryotic gene prediction algorithms
distinguish several kinds of exons
Gene-finding algorithms
Homology-based searches (“extrinsic”)
Rely on previously identified genes
Algorithm-based searches (“intrinsic”)
Investigate nucleotide composition, open reading frames, and other intrinsic
properties of genomic DNA
(refer to Chapter 16, Eukaryotic Chromosome, Figure 16-9 for a list
of extrinsic vs intrinsic based algorithms).
Extrinsic, homology-based searching:
compare genomic DNA to expressed genes (ESTs)
[Figure: genomic DNA, including an intron, compared with RNA and protein]
Intrinsic, algorithm-based searching:
Identify open reading frames (ORFs).
Compare DNA in exons (unique codon usage)
to DNA in introns (unique splice sites)
and to noncoding DNA.
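The ORF step can be illustrated with a minimal sketch. It assumes a forward-strand scan for ATG...stop in each of the three frames; the function name `find_orfs` and the `min_codons` threshold are hypothetical, and the sketch ignores introns, the reverse strand, and codon usage, all of which a real eukaryotic gene finder must handle.

```python
# Toy ORF scan on the forward strand only; not a eukaryotic gene finder.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the 0-based index of the A in ATG; end is exclusive and
    includes the stop codon. min_codons counts the start and stop codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i + 3 - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 2)]
```

In eukaryotic DNA an ORF found this way is usually only a fragment of a gene, which is why exon and splice-site models are needed as well.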
[Figure: human DNA aligned with chimpanzee DNA]
Comparative genomics: Compare gene models between
species. (For annotation of the chimpanzee genome
reported in 2005, BLAT and BLASTZ searches were used to
align the two genomes.)
Finding genes in eukaryotic DNA
Cautionary Notes:
-- The quality of EST sequence is sometimes low
-- Highly expressed genes are disproportionately
represented in many cDNA libraries
-- ESTs provide no information on genomic location
Finding genes in eukaryotic DNA
Both intrinsic and extrinsic algorithms vary in their rates
of false-positive and false-negative gene identification.
Programs such as GENSCAN and Grail account for
features such as the nucleotide composition of coding
regions, and the presence of signals such as promoter
elements.
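False-positive and false-negative rates can be made concrete by comparing predicted exons with an annotated reference set. The sketch below assumes exons are (start, end) tuples and that only exact coordinate matches count as correct; the function name and the coordinates are hypothetical. Note that gene-finding papers often call the second quantity "specificity", while it is computed here as TP / (TP + FP).

```python
def prediction_rates(predicted, annotated):
    """Compare predicted exons with annotated exons (exact matches only)."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and annotated
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Hypothetical exon coordinates, for illustration only.
predicted_exons = [(100, 200), (300, 400), (500, 650)]
annotated_exons = [(100, 200), (300, 400), (700, 800), (900, 1000)]
print(prediction_rates(predicted_exons, annotated_exons))
```

Exact-match scoring is strict: an exon with one boundary off by a single base counts as both a false positive and a false negative.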
Finding genes in eukaryotic DNA
In a study using 100,000 base pairs of human DNA, intrinsic
algorithms correctly identified several exons of RBP4, but
failed to generate a complete gene model.
As another example, initial annotation of the rice genome
yielded over 75,000 gene predictions, only 53,000 of which
were complete (having initial and terminal exons). Also,
it is very difficult to accurately identify exon-intron boundaries.
Estimates of gene content improve dramatically when
finished (rather than draft) sequence is analyzed.
Page 561
Genome sequencing projects
There are three main resources for genomes:
EBI
European Bioinformatics Institute
http://www.ebi.ac.uk/genomes/
NCBI
National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov
TIGR
The Institute for Genomic Research
http://www.tigr.org
C value paradox:
why eukaryotic genome sizes vary
The haploid genome size of eukaryotes, called the C value,
varies enormously.
Small genomes include:
Encephalitozoon cuniculi (2.9 Mb)
A variety of fungi (10-40 Mb)
Takifugu rubripes (pufferfish)(365 Mb)(same number of genes
as other fish or as the human genome, but 1/10th the size)
Large genomes include:
Pinus resinosa (Canadian red pine)(68 Gb)
Protopterus aethiopicus (Marbled lungfish)(140 Gb)
Amoeba dubia (amoeba)(690 Gb)
Genome sizes in nucleotide base pairs
[Figure: genome size ranges on a log scale from 10^4 to 10^11 bp for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]
The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.
The human genome is thought to contain ~30,000-40,000 genes.
http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt
C value paradox:
why eukaryotic genome sizes vary
The range in C values does not correlate well with the
complexity of the organism. This phenomenon is called
the C value paradox.
Why?
Britten and Kohne (1968) identified
repetitive DNA classes
Reassociation kinetics: isolate genomic DNA, shear it,
denature (melt) it, and measure the rate of DNA
reassociation.
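Reassociation data are usually summarized as a Cot curve. Under ideal second-order kinetics, the fraction of DNA still single-stranded is C/C0 = 1 / (1 + k·C0t), so half of the DNA has reassociated when C0t = 1/k. The sketch below assumes this idealized model; the rate constants are hypothetical and chosen only to show that a repetitive class (large k) reassociates at a much lower Cot than single-copy DNA.

```python
def fraction_single_stranded(c0t, k):
    """Ideal second-order reassociation: C/C0 = 1 / (1 + k * C0t)."""
    return 1.0 / (1.0 + k * c0t)


def cot_half(k):
    """C0t value at which half of the DNA has reassociated."""
    return 1.0 / k


# Hypothetical rate constants (arbitrary units), for illustration only.
k_repetitive = 100.0   # many copies per genome, so strands find partners fast
k_single_copy = 0.01   # one copy per genome, so reassociation is slow

for c0t in (0.01, 1.0, 100.0):
    print(c0t,
          round(fraction_single_stranded(c0t, k_repetitive), 3),
          round(fraction_single_stranded(c0t, k_single_copy), 3))
```

A genome with several repeat classes gives a curve with several steps, one per class, which is how the repetitive DNA classes were distinguished.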
Protein-coding genes in eukaryotic DNA:
a new paradox
Why is the number of protein-coding genes about the same
for worms, flies, plants, and humans?
This has been called the N-value paradox (N for number of genes)
or the G-value paradox (G for genes).
Five main classes of repetitive DNA
1. Interspersed repeats (RNA/DNA transposon-derived)
-- approx 45% of human genome (e.g. LINEs, SINEs, Alu)
2. Processed pseudogenes (gene loss)
3. Simple sequence repeats
-- Microsatellites (1-12 bp); Minisatellites (12-500 bp)
4. Segmental duplications
-- blocks of about 1 kilobase to 300 kb that are copied
intra- or interchromosomally (5% of human genome)
5. Blocks of tandem repeats
-- includes telomeric and centromeric repeats
and can span millions of bp (often species-specific)
The spectrum of variation
Category of variation         Size             Type
Single base pair changes      1 bp             SNPs, point mutations
Small insertions/deletions    1 – 50 bp
Short tandem repeats          1 – 500 bp       microsatellites
Fine-scale structural var.    50 bp – 5 kb     del, dup, inv, tandem repeats
Retroelement insertions       0.3 – 10 kb      SINEs, LINEs, LTRs, ERVs
Intermediate-scale struct.    5 kb – 50 kb     del, dup, inv, tandem repeats
Large-scale structural var.   50 kb – 5 Mb     del, dup, inv, large tandem repeats
Chromosomal variation         >>5 Mb           aneuploidy
Adapted from Sharp AJ et al. (2006) Annu Rev Genomics Hum Genet 7:407-42
[Figure: human chromosome 21 at www.ensembl.org, showing the nucleolar organizing center and the centromere]
[Figure: human chromosome 21 at UCSC Genome Browser, showing the centromere]
Chromosomes can be highly dynamic, in several ways.
• Whole genome duplication (autopolyploidy) can occur,
as in yeast (Chapter 15) and some plants.
• The genomes of two distinct species can merge, as in the
mule (male donkey, 2n = 62 and female horse, 2n = 64)
• An individual can acquire an extra copy of a chromosome
(e.g. trisomy 21 in Down syndrome, or trisomy 13 or 18)
• Chromosomes can fuse; e.g. human chromosome 2 derives
from a fusion of two ancestral primate chromosomes
• Chromosomal regions can be inverted or deleted
• Segmental and other duplications occur
Page 565
Conservative nature of chromosome evolution
Among placental mammals, the number of diploid
chromosomes is:
84 in black rhinoceros
46 in Homo sapiens
17 in two rodent species
The process of chromosome evolution tends to remain
conservative. Heterozygous carriers of most types of
chromosomal rearrangements are semisterile. Thus many
chromosomal changes cannot be fixed.
Ohno (1970) p. 41
Diploidization of the tetraploid
A species can become tetraploid. All loci are duplicated,
and what was formerly the diploid chromosome
complement is now the haploid set of the genome.
Polyploid evolution occurs commonly in plants. For
example, in the cereal plant Sorghum
S. versicolor (diploid) 2n = 2 x 5; 10 chromosomes
S. sudanense (tetraploid) 4n = 4 x 5; 20 chromosomes
S. halepense (octoploid) 8n = 8 x 5; 40 chromosomes
Ohno (1970) pp 98- 101
“Retrotransposons constitute over 40% of the human genome and
consist of several millions of family members. They play important
roles in shaping the structure and evolution of the genome and in
participating in gene functioning and regulation. Since L1, Alu, and
SVA retrotransposons are currently active in the human genome, their
recent and ongoing retrotranspositional insertions generate a unique
and important class of genetic polymorphisms (for the presence or
absence of an insertion) among and within human populations. As
such, they are useful genetic markers in population genetics studies
due to their identical-by-descent and essentially homoplasy-free
nature. Additionally, some polymorphic insertions are known to be
responsible for a variety of human genetic diseases. dbRIP is a
database of human Retrotransposon Insertion Polymorphisms (RIPs).
dbRIP contains all currently known Alu, L1, and SVA polymorphic
insertion loci in the human genome.”
--dbRIP
Homoplasy: having some states arise more than once on a tree.
http://falcon.roswellpark.org:9090/index.html
Five main classes of repetitive DNA
2. Processed pseudogenes
These genes have a stop codon or frameshift mutation
and do not encode a functional protein. They commonly
arise from retrotransposition, or following gene
duplication and subsequent gene loss.
For a superb on-line resource, visit Mark Gerstein’s
website, http://www.pseudogene.org. Gerstein and
colleagues (2006) suggest that there are ~19,000
pseudogenes in the human genome, slightly fewer than
the number of functional protein-coding genes. (11,000
non-processed, 8,000 processed [lack introns].)
Page 547
Five main classes of repetitive DNA
3. Simple sequence repeats
Microsatellites: from one to a dozen base pairs
Examples: (A)n, (CA)n, (CGG)n
These may be formed by replication slippage.
Minisatellites: a dozen to 500 base pairs
Simple sequence repeats of a particular length and
composition occur preferentially in different species.
In humans, an expansion of triplet repeats such as CAG
is associated with at least 14 disorders (including
Huntington’s disease).
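Simple sequence repeats such as (A)n, (CA)n, and (CGG)n can be located with a regular expression that captures a short unit and requires it to repeat in tandem. This is a minimal sketch, not a replacement for dedicated repeat-finding software; the function name and the thresholds (`max_unit=12` to follow the 1-12 bp microsatellite range above, `min_repeats=4`) are assumptions chosen for illustration.

```python
import re


def find_microsatellites(dna, min_unit=1, max_unit=12, min_repeats=4):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    The unit is matched lazily so the shortest repeating unit is reported
    (for example "A" rather than "AA" for a run of A's).
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(dna.upper()):
        unit = match.group(1)
        count = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, count))
    return hits


example = "TTCACACACACAGGAAAAAATTCGGCGGCGGCGGTT"
print(find_microsatellites(example))
# [(2, 'CA', 5), (14, 'A', 6), (22, 'CGG', 4)]
```

Only perfect repeats are found; real microsatellites often contain interruptions, which this pattern would split or miss.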
Page 546
Successive tandem
gene duplications
(after Lacazette et al., 2000)
observed today
Fig. 16.3
Page 548
Transcription factor databases
In addition to identifying repetitive elements and genes,
it is also of interest to predict the presence of genomic
DNA features such as promoter elements and GC content.
Websites that predict transcription factor binding sites
and related sequences include:
AliBaba2 (http://www.gene-regulation.de/)
Eukaryotic Promoter Database
(http://www.epd.isb-sib.ch)
PlantProm (http://mendel.cs.rhul.ac.uk)
Eponine predicts transcription start sites in promoter regions.
The algorithm uses a set of DNA weight matrices recognizing
sequence motifs that are associated with a position
distribution relative to the transcription start site. The model
is as follows: [Figure: Eponine model]
The specificity is good (~70%), and the positional accuracy is
excellent. The program identifies ~50% of TSSs—although it
does not always know the direction of transcription.
http://www.sanger.ac.uk/Users/td2/eponine
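The idea of scoring DNA with a weight matrix can be sketched in a few lines. This is not Eponine's model, which combines several matrices with position distributions; it is a single log-odds matrix scanned along a sequence. The 6-bp TATA-like frequencies, the uniform background, and the function names are all hypothetical, and the input is assumed to contain only A, C, G, and T.

```python
import math

# Hypothetical per-position base frequencies for a 6-bp TATA-like motif.
# Illustrative numbers only; not taken from Eponine or any database.
MOTIF_FREQS = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.70, "C": 0.05, "G": 0.05, "T": 0.20},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform base composition


def log_odds_score(window):
    """Sum of log2(motif frequency / background frequency) over the window."""
    return sum(math.log2(column[base] / BACKGROUND)
               for column, base in zip(MOTIF_FREQS, window))


def best_site(dna):
    """Return (score, position) of the highest-scoring window in dna."""
    dna = dna.upper()
    width = len(MOTIF_FREQS)
    scores = [(log_odds_score(dna[i:i + width]), i)
              for i in range(len(dna) - width + 1)]
    return max(scores)


score, position = best_site("GCGCGCTATAAAGGCGCC")
print(position, round(score, 2))
```

A positive score means the window looks more like the motif than like background; choosing the score cutoff is what trades sensitivity against specificity.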
The ENCODE project
Goal of ENCODE: build a list of all sequence-based functional
elements in human DNA. This includes:
► protein-coding genes
► non-protein-coding genes
► regulatory elements involved in the control of gene
transcription
► DNA sequences that mediate chromosomal structure and
dynamics.
VISTA output for an alignment of human and mouse
genomic DNA (including RBP4)
Chronology of genome sequencing projects
1977 first viral genome
(Sanger et al., bacteriophage φX174; 11 genes)
1981 Human mitochondrial genome
16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
Today, over 400 mitochondrial genomes sequenced
1986 Chloroplast genome
156,000 base pairs (most are 120 kb to 200 kb)
1995 Haemophilus influenzae genome sequenced
1996 Saccharomyces cerevisiae (1st eukaryotic genome)
and an archaeal genome, Methanococcus jannaschii.
Chronology of genome sequencing projects
1997 More bacteria and archaea
Escherichia coli 4.6 megabases, 4200 proteins (38% of unknown function)
1998 Nematode Caenorhabditis elegans (1st multicellular org.)
97 Mb; 19,000 genes.
1999 first human chromosome: Chrom 22 (49 Mb, 673 genes)
2000 Drosophila melanogaster (13,000 genes);
Plant Arabidopsis thaliana & Human chromosome 21
2001 draft sequence of the human genome
(public consortium and Celera Genomics)
Overview of genome analysis
[1] Selection of genomes for sequencing
[2] Sequence one individual genome, or several?
[3] How big are genomes?
[4] Genome sequencing centers
[5] Sequencing genomes: strategies
[6] When has a genome been fully sequenced?
[7] Repository for genome sequence data
[8] Genome annotation
Overview of genome analysis
[1] Selection of genomes for sequencing is based
on criteria such as:
• genome size (some plant genomes are far larger than the human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture
Overview of genome analysis
[2] Sequence one individual genome, or several?
--Each genome center may study one
chromosome from an organism
--It is necessary to measure polymorphisms
(e.g. SNPs) in large populations
For viruses, thousands of isolates may be sequenced.
For the human genome, cost is the impediment.
Overview of genome analysis
[3] How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 800 kb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 686,000 Mb (686 Gb)
Overview of genome analysis
[4] 20 Genome sequencing centers contributed
to the public sequencing of the human genome.
Many of these are listed at the Entrez genomes site.
Overview of genome analysis
[5] There are two main strategies for sequencing genomes
a) Whole genome shotgun (WGS) method
-- applied to the entire genome all at once
(sequenced fragments ordered by alignment of overlaps)
VERSUS
b) hierarchical shotgun method
--applied to large overlapping DNA fragments of known location
in the genome.
(Assemble contigs from chromosomes and then systematically
sequence them and reassemble complete sequence)
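Ordering fragments by their overlaps can be illustrated with a toy greedy assembler: repeatedly merge the two reads with the longest suffix-prefix overlap. This is a sketch under strong assumptions (error-free reads, one strand, no repeats); the function names and the `min_len` threshold are hypothetical, and production assemblers use far more elaborate methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by longest overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # remaining reads form separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]))
# ['ATGGCGTACGTTAGC']
```

Repetitive DNA breaks this approach because a repeat produces overlaps between reads from different loci, which is one motivation for the hierarchical strategy of sequencing fragments of known location.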
Overview of genome analysis
[6] When has a genome been fully sequenced?
A typical goal is to obtain five to ten-fold coverage.
Finished sequence: a clone insert is contiguously
sequenced to a high quality standard (error rate of
0.01%). There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several
regions separated by gaps. The true order and
orientation of the pieces may not be known.
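Fold coverage is the total number of sequenced bases divided by the genome size. Under the simplifying assumption that reads land uniformly at random (a Poisson model), the expected fraction of bases left unsequenced at coverage c is about e^(-c). The sketch below uses hypothetical function names and an assumed 500 bp read length for the example.

```python
import math


def fold_coverage(n_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size


def expected_fraction_missed(coverage):
    """Poisson estimate of the fraction of bases covered by no read."""
    return math.exp(-coverage)


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach the target fold coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


genome_size = 3_000_000_000  # ~3 x 10^9 bp, the human genome size given above
read_length = 500            # assumed read length for illustration

for c in (5, 10):
    print(c, reads_needed(c, read_length, genome_size),
          expected_fraction_missed(c))
```

At five-fold coverage the model leaves under 1% of bases unsequenced, and at ten-fold under 0.01%, which is consistent with five to ten-fold being a typical goal.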
Overview of genome analysis
[7] Repository for genome sequence data
Raw data from many genome sequencing projects
are stored at the trace archive at NCBI or EBI
(main NCBI page, bottom right)
Overview of genome analysis
[8] Genome annotation
Information content in genomic DNA includes:
-- repetitive DNA elements
-- nucleotide composition (GC content)
-- protein-coding genes, other genes
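Nucleotide composition is the simplest of these features to compute. The sketch below reports GC content overall and in sliding windows; the function names, window size, and step are hypothetical choices, and ambiguous bases such as N are excluded from the denominator.

```python
def gc_content(seq):
    """Fraction of A/C/G/T bases that are G or C (0.0 if none are present)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window=100, step=50):
    """Return (start, gc_fraction) for each window along the sequence."""
    last_start = max(len(seq) - window + 1, 1)
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, last_start, step)]


print(gc_content("ATGC"))                          # 0.5
print(windowed_gc("AAAAGGGG", window=4, step=4))   # [(0, 0.0), (4, 1.0)]
```

Plotting the windowed values along a chromosome is one way to see regional variation in GC content.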
How can whole genomes be compared?
-- molecular phylogeny
-- You can BLAST (or PSI-BLAST) all the DNA and/or
protein in one genome against another
-- TaxPlot and COG for bacterial (and for
some eukaryotic) genomes
-- PipMaker, MUMmer and other programs align large
stretches of genomic DNA from multiple species