How many genes are in a genome? We earlier examined the great diversity in genome sizes across the living world (see Table 1 in “How big are genomes?”). As a first step in refining our understanding of the information content of these genomes, we need a sense of the number of genes that they harbor. Interestingly, though genome sizes differ by as much as 8 orders of magnitude (from < 2kb for Hepatitis D virus (BNID 105570) to > 100Gbp for the Polychaos dubium amoeba (BNID 104470], the Marbled lungfish (BNID 100597) and certain Fritillaria flowers (BNID 102726)), the range in the number of genes varies by at most four orders of magnitude. Many bacteria have several thousand genes. This gene content can be rationalized simply by thinking about the relation between genome size and protein size as shown below. By way of contrast, eukaryotic genomes, which are often a thousand times or more larger than those in prokaryotes, contain only an order of magnitude more genes than their prokaryotic counterparts. The inability to successfully estimate the number of genes in eukaryotes based on knowledge of the gene content of prokaryotes was one of the unexpected twists of modern biology. Clearly, the size of a genome is not necessarily a faithful measure of its gene content. The simplest estimate of the number of genes in a genome unfolds by assuming that the entirety of the genome codes for genes of interest. To make further progress with the estimate, we need to have a measure of the number of amino acids in a typical protein which we will take to be roughly 300, cognizant however of the fact that like genomes, proteins come in a wide variety of sizes themselves as is revealed in the vignette on that topic ”what are the sizes of proteins?”. On the basis of this meager assumption, we see that the number of bases needed to code for our typical protein is roughly 1000 (3 base pairs per amino acid). Hence, within this mindset, the number of genes contained in a genome is estimated to be the genome size/1000. For bacterial genomes, this strategy works surprisingly well as can be appreciated from table 1. For example, when applied to the E. coli K-12, genome of 4.6 x 106 bp, this rule of thumb leads to an estimate of 4600 genes, which can be compared to the current best knowledge of this quantity which is 4225. In going through a dozen representative bacteria and archea genomes in the table a similarly striking predictive power to within about 10% is observed. On the other hand, this strategy fails spectacularly when we apply it to eukaryotic genomes, resulting for example in the estimate that the number of genes in the human genome should be 3,000,000, a gross overestimate. The unreliability of this estimate helps explain the existence of the Genesweep betting pool which as recently as the early 2000s had people betting on the number of genes in the human genome, with people’s estimates varying by more than a factor of ten. What explains this spectacular failure of the most naïve estimate and what does it teach us about the information organized in genomes? Eukaryotic genomes, especially those associated with multicellular organisms, are characterized by a host of intriguing features that disrupt the simple coding picture exploited in the naïve estimate. These differences in genome usage are depicted pictorially in Figure 1 which shows the percentage of the genome used for other purposes than protein coding. As evident in Figure 1, prokaryotes can efficiently compact their protein coding sequences such that they are almost continuous and result in less than 10% of their genomes being assigned to non coding DNA (12% in E. coli, BNID 105750) whereas in humans over 98% (BNID 103748) is non protein coding. The discovery of these other uses of the genome constitute some of the most important insights into DNA, and biology more generally, from the last 60 years. One of these alternative uses for genomic real estate is the regulatory genome, namely, the way in which large chunks of the genome are used as targets for the binding of regulatory proteins that give rise to the combinatorial control so typical of genomes in multicellular organisms. Another of the key features of eukaryotic genomes is the organization of their genes into introns and exons, with the expressed exons being much smaller than the intervening and spliced out introns. Beyond these features, Table 1: A comparison between the number of genes in an organism and a naïve estimate based on the genome size divided by a constant factor of 1000bp/gene, i.e. predicted number of genes = genome size/1000. One finds that this crude rule of thumb works surprisingly well for many bacteria and archea. there are endogenous retroviruses, fossil relics of former viral infections and strikingly, over 50% of the genome is taken up by the existence of repeating elements and transposons, various forms of which can perhaps be interpreted as selfish genes that have mechanisms to proliferate in a host genome. Some of these repeating elements and transposons are still active today whereas others have remained a relic after losing the ability to further proliferate in the genome. Figure 1 – Schematic view of the fraction of the genome that is not coding for proteins by phylogenetic group. A significant fraction of the difference in sizes among organisms can be related to the differing ratios of non coding sequences. Note that this is a highly simplified view, in reality, for example, some plants have a fraction of non coding genome higher than that of human (from Mattick, Sci. Amer. 2004).