How many genes are in a genome?

We earlier examined the great diversity in genome sizes across the living
world (see Table 1 in “How big are genomes?”). As a first step in refining
our understanding of the information content of these genomes, we need
a sense of the number of genes that they harbor. Interestingly, though
genome sizes differ by as much as 8 orders of magnitude (from < 2 kb for
Hepatitis D virus (BNID 105570) to > 100 Gbp for the Polychaos dubium
amoeba (BNID 104470), the Marbled lungfish (BNID 100597) and
certain Fritillaria flowers (BNID 102726)), the number of
genes varies by at most four orders of magnitude. Many bacteria have
several thousand genes. This gene content can be rationalized simply by
thinking about the relation between genome size and protein size as
shown below. By way of contrast, eukaryotic genomes, which are often a
thousand times or more larger than those in prokaryotes, contain only
an order of magnitude more genes than their prokaryotic counterparts.
The inability to successfully estimate the number of genes in eukaryotes
based on knowledge of the gene content of prokaryotes was one of the
unexpected twists of modern biology.
Clearly, the size of a genome is not necessarily a faithful measure of its
gene content. The simplest estimate of the number of genes in a genome
unfolds by assuming that the entirety of the genome codes for genes of
interest. To make further progress with the estimate, we need to have a
measure of the number of amino acids in a typical protein, which we will
take to be roughly 300, cognizant of the fact that, like genomes,
proteins come in a wide variety of sizes, as revealed in the
vignette on that topic, “What are the sizes of proteins?”. On the basis of
this meager assumption, we see that the number of bases needed to code
for our typical protein is roughly 1000 (3 base pairs per amino acid).
Hence, within this mindset, the number of genes contained in a genome
is estimated to be the genome size/1000. For bacterial genomes, this
strategy works surprisingly well, as can be appreciated from Table 1. For
example, when applied to the E. coli K-12 genome of 4.6 × 10^6 bp, this
rule of thumb leads to an estimate of 4600 genes, which can be
compared to the current best estimate of 4225.
Going through a dozen representative bacterial and archaeal genomes in
the table, a similarly striking predictive power, to within about 10%, is
observed. On the other hand, this strategy fails spectacularly when we
apply it to eukaryotic genomes, resulting for example in the estimate that
the number of genes in the human genome should be 3,000,000, a gross
overestimate. The unreliability of this estimate helps explain the
existence of the Genesweep betting pool, which as recently as the early
2000s had people betting on the number of genes in the human genome,
with estimates varying by more than a factor of ten.
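The rule of thumb above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text: the function names are hypothetical, the E. coli numbers are the ones quoted above, and the human genome size of 3 × 10^9 bp is an assumption inferred from the 3,000,000-gene estimate rather than a value stated in this vignette.

```python
# Naive estimate: every base codes for protein, and a typical protein
# of ~300 amino acids needs ~1000 bp (3 bp per amino acid, rounded up).
BP_PER_GENE = 1000


def naive_gene_estimate(genome_size_bp, bp_per_gene=BP_PER_GENE):
    """Estimated gene count if the whole genome were protein coding."""
    return genome_size_bp / bp_per_gene


def relative_error(estimate, known):
    """Signed fractional deviation of the estimate from the known value."""
    return (estimate - known) / known


# E. coli K-12: values quoted in the text.
ecoli_genome_bp = 4.6e6
ecoli_known_genes = 4225
ecoli_estimate = naive_gene_estimate(ecoli_genome_bp)
ecoli_error = relative_error(ecoli_estimate, ecoli_known_genes)

# Human: genome size assumed to be 3e9 bp (inferred from the text's estimate).
human_genome_bp = 3e9
human_estimate = naive_gene_estimate(human_genome_bp)

print(f"E. coli estimate: {ecoli_estimate:.0f} genes "
      f"({ecoli_error:+.1%} relative to {ecoli_known_genes})")
print(f"Human estimate:   {human_estimate:,.0f} genes")
```

For E. coli the estimate lands within about 10% of the annotated count, in line with the agreement reported for the bacteria and archaea in Table 1.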
What explains this spectacular failure of the most naïve estimate and
what does it teach us about the information organized in genomes?
Eukaryotic genomes, especially those associated with multicellular
organisms, are characterized by a host of intriguing features that disrupt
the simple coding picture exploited in the naïve estimate. These
differences in genome usage are depicted pictorially in Figure 1, which
shows the percentage of the genome used for purposes other than
protein coding. As evident in Figure 1, prokaryotes efficiently
compact their protein coding sequences such that they are almost
continuous, leaving only about 10% of their genomes as
noncoding DNA (12% in E. coli, BNID 105750), whereas in humans
over 98% (BNID 103748) is non-protein-coding. The discovery of these
other uses of the genome constitutes one of the most important insights
into DNA, and biology more generally, from the last 60 years. One of
these alternative uses for genomic real estate is the regulatory genome,
namely, the way in which large chunks of the genome are used as targets
for the binding of regulatory proteins that give rise to the combinatorial
control so typical of genomes in multicellular organisms. Another of the
key features of eukaryotic genomes is the organization of their genes
into introns and exons, with the expressed exons being much smaller
than the intervening and spliced out introns. Beyond these features,
there are endogenous retroviruses, fossil relics of former viral infections,
and, strikingly, over 50% of the genome is taken up by
repeating elements and transposons, various forms of which can perhaps
be interpreted as selfish genes that have mechanisms to proliferate in a
host genome. Some of these repeating elements and transposons are still
active today, whereas others remain as relics after losing the ability
to proliferate further in the genome.
Table 1: A comparison between the number of genes in an organism and a naïve
estimate based on the genome size divided by a constant factor of 1000 bp/gene, i.e.
predicted number of genes = genome size/1000. One finds that this crude rule of
thumb works surprisingly well for many bacteria and archaea.
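To see how much of the failure of the naïve estimate is accounted for by noncoding DNA alone, one can rescale the estimate by the protein coding fraction. The sketch below is an illustrative refinement, not a calculation made in the text: the function name is hypothetical, the noncoding fractions (12% for E. coli, 98% for human) are the ones quoted above, and the human genome size of 3 × 10^9 bp is an assumption inferred from the 3,000,000-gene naïve estimate.

```python
def coding_corrected_estimate(genome_size_bp, noncoding_fraction,
                              bp_per_gene=1000):
    """Gene estimate using only the protein coding part of the genome."""
    coding_bp = genome_size_bp * (1.0 - noncoding_fraction)
    return coding_bp / bp_per_gene


# E. coli: 4.6e6 bp genome, 12% noncoding (values quoted in the text).
ecoli_corrected = coding_corrected_estimate(4.6e6, 0.12)

# Human: assumed 3e9 bp genome; the text says "over 98%" is noncoding,
# so this value is an upper bound within this simple picture.
human_corrected = coding_corrected_estimate(3e9, 0.98)

print(f"E. coli, coding-corrected: {ecoli_corrected:.0f} genes")
print(f"Human, coding-corrected:   at most {human_corrected:,.0f} genes")
```

Under these assumptions the E. coli figure stays close to the annotated 4225 genes, while the human figure falls by a factor of fifty from the naïve 3,000,000. The correction is still crude, since it keeps the 1000 bp per gene assumption.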
Figure 1: Schematic view of the fraction of the genome that does not code for proteins, by
phylogenetic group. A significant fraction of the difference in genome sizes among organisms can
be related to the differing proportions of noncoding sequences. Note that this is a highly
simplified view. In reality, for example, some plants have a higher fraction of noncoding
genome than humans do (from Mattick, Sci. Amer. 2004).