What the papers say
Summary
Rattus norvegicus is an important experimental organism and interesting to evolutionary biologists. The recently published draft rat genome sequence (1) provides us with insights into both the rat’s evolution and its physiology.
We learn more about genome evolution and, in particular, the adaptive significance of gene family expansions and the evolution of rodent genomes, which appears to have decelerated since the divergence of mouse and rat. An important observation is that some regions of genomes, many in noncoding regions, show very high sequence conservation, while others show unexpectedly fast evolution. Both of these may be pointers to functional significance.
BioEssays 26:1039–1042, 2004.
© 2004 Wiley Periodicals, Inc.
Introduction
Now that a full representative human genome sequence is available, (2) we have entered a new phase of genome research. This period has much less well-defined goals than the first phase, when we were learning how to sequence large genomes and the basics of how to analyse the sequences.
We now have a large amount of high-throughput sequencing capacity, and this has given rise to an increasing variety of approaches. These range from large-scale sequencing of bacterial genomes, (3) and even ecological genomics, (4) where the sequences being generated are predominantly from unknown species, through diversity studies, (5) to more or less complete sequencing of additional metazoan genomes.
Additional genome sequences can provide us with a number of kinds of information. Firstly, they can provide us with specific information on an organism that is interesting to us for some reason—for example laboratory model organisms, pathogens or agricultural species. Secondly, they can give us an insight into the processes taking place at the genome level during evolution. Finally, if selected appropriately, they can provide us with an invaluable source of comparative sequence information that allows us to identify conserved elements with functional significance.
Experimental model organisms have done well in the race to have their genomes sequenced, with a draft mouse sequence appearing in 2002, (6) and 2004 has seen the appearance of a draft sequence from another important rodent model, the rat Rattus norvegicus (1) [I will refer to the authors of this paper as RGSPC¹ throughout for convenience]. The rat is an important part of the genomics jigsaw for a number of reasons. Firstly, it is the human physiologists’ model organism of choice, with a position equivalent to the mouse in mammalian genetics, and has made important contributions to a number of other research areas relevant to human health. Indeed the earliest experimental studies on the rat pre-date those on the mouse. (1)
As well as containing valuable material for rat researchers, especially as rat genetics becomes more sophisticated, (7) a rat genome adds to the set of mammalian genomes from an evolutionary perspective. The high degree of similarity between rat and mouse sequences (and indeed morphology) and their apparently recent divergence time mean that this contribution is generally specific to mouse and rat; the argument that rat can serve as an outgroup to human–mouse is fallacious as it is not an outgroup in the phylogeny. However, the rat sequence does cast light on the evolution of both rat as a species and the Murinae, which have been suggested to have undergone recent rapid sequence and genome evolution. (8,9)
Further, genomics is rapidly becoming a fully fledged evolutionary science with an interest in features of genomes either ignored by molecular evolutionists or previously inaccessible to them, and the rat sequence sheds light on a number of these topics. Examples are the relative roles of genome, segmental and individual gene duplication in genome evolution, (10) the evolutionary dynamics of large gene families, (11) the distribution and history of chromosome rearrangements, (12) and the origins of sequence conservation, especially outside exons. (13)
MRC Mammalian Genetics Unit, Harwell, U.K.
E-mail: j.hancock@har.mrc.ac.uk
DOI 10.1002/bies.20121
Published online in Wiley InterScience (www.interscience.wiley.com).
¹ The Rat Genome Sequencing Project Consortium is the consortium led by the Baylor College of Medicine Human Genome Sequencing Center.
² A gigabase is equivalent to 1 × 10⁹ base pairs.
Genome size and constituent parts
Calculations from the draft sequence suggest that the rat genome, at around 2.75–2.82 Gb,² is intermediate in size between human (2.9 Gb) and mouse (2.6 Gb). As known for some time, (14) these differences do not predominantly reflect differences in gene numbers (for example the 11,503 mouse–rat orthologous gene pairs are similar in number to the 11,084 human–mouse and the 10,066 human–rat pairs; RGSPC suggest that 86–94% of rat genes have orthologues in mouse and 89–90% have orthologues in human). Considerable emphasis has been given to the role of transposable elements (TEs³) in altering genome size. (14,15)
We might therefore expect to see evidence of a burst of transposition after the divergence of rat and mouse accounting for their genome size difference. Analysing the TE content of the two genomes, RGSPC showed a predominance of LINE-1⁴ (Long Interspersed Nuclear Element 1) expansion in the rat lineage over other TEs, and that LINE-1s make up about 2% more of the rat genome than they do in mouse. However, LINE-1 spreading also took place in mouse after divergence, so that lineage-specific elements make up 14.9% of the rat genome (0.41 Gb) and 14.26% of the mouse genome (0.37 Gb). Differential expansion of known TE families therefore only accounts for around 0.04 Gb of the 0.3 Gb genome size difference between the species. Another possible source of variation in genome size is microsatellites and minisatellites (types of simple sequence repeat or SSR⁵): analyses of the repetitiveness of mouse, rat and human genomic sequences have shown that mouse and rat sequences tend to be more repetitive than human, while repetitiveness correlates broadly with genome size in eukaryotes. (16,17) The mouse and rat genomes contain a much higher proportion of SSRs than the human genome (1.4% compared to 0.45% in humans) but again the proportion is similar in the two species. Proportions of satellite DNA, another rapidly evolving DNA class, are also similar between rat and mouse, (18,19) suggesting that any sequence-based explanation for the difference must lie elsewhere. The alternative is error in genome size estimation.
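The transposable-element arithmetic in this paragraph can be checked with a short Python sketch. It uses only the figures quoted in the text; taking the lower bound of the rat genome size estimate (2.75 Gb) is an assumption made here because it reproduces the quoted 0.41 Gb, and the variable and function names are illustrative only.

```python
# Figures quoted in the text. RAT_GENOME_GB uses the lower bound of the
# 2.75-2.82 Gb estimate (an assumption; it reproduces the quoted 0.41 Gb).
RAT_GENOME_GB = 2.75
MOUSE_GENOME_GB = 2.6
RAT_TE_FRACTION = 0.149      # lineage-specific elements, 14.9% of rat genome
MOUSE_TE_FRACTION = 0.1426   # lineage-specific elements, 14.26% of mouse genome
STATED_SIZE_DIFFERENCE_GB = 0.3  # genome size difference as stated in the text


def lineage_specific_te_gb(genome_gb: float, te_fraction: float) -> float:
    """Amount of sequence (in Gb) made up of lineage-specific elements."""
    return genome_gb * te_fraction


rat_te_gb = lineage_specific_te_gb(RAT_GENOME_GB, RAT_TE_FRACTION)
mouse_te_gb = lineage_specific_te_gb(MOUSE_GENOME_GB, MOUSE_TE_FRACTION)

# Differential expansion: how much more lineage-specific TE sequence rat has.
te_difference_gb = rat_te_gb - mouse_te_gb

# Share of the stated genome size difference explained by that expansion.
share_explained = te_difference_gb / STATED_SIZE_DIFFERENCE_GB

print(f"rat lineage-specific TEs:   {rat_te_gb:.2f} Gb")
print(f"mouse lineage-specific TEs: {mouse_te_gb:.2f} Gb")
print(f"difference:                 {te_difference_gb:.2f} Gb")
print(f"share of stated difference: {share_explained:.0%}")
```

On these figures the differential expansion explains only a small part of the stated size difference, which is the point the paragraph makes.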
SSRs are also found within coding regions, encoding amino acid repeats (typically polyglutamine) and, in some cases, causing diseases, which are predominantly neurological. (20) These repeats tend to lie within rapidly evolving regions of proteins, (21) and can be grouped into two classes: those that differ greatly in length between species and are encoded by tandem repeats of a single codon, and those that are relatively stable in length between species and are encoded by mixtures of different codons that are presumably not subject to replication slippage. (22) Most of this analysis was carried out on mouse but the rat provides an alternative, although not independent, perspective (because of shared evolutionary history).
As for the human–mouse comparison, there are classes of polyglutamine repeat that are expanded either in rat or in human and a class that is more or less conserved in length, although RGSPC do not present data on codon structure.
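The two classes of coding repeat described above differ in codon structure, and that distinction is easy to make computationally. The following Python sketch is an illustration only, not a method used by RGSPC or in the cited studies: it assumes the input is the in-frame DNA sequence of a pure glutamine tract, and the function names and class labels are hypothetical.

```python
# Both codons that encode glutamine in the standard genetic code.
GLUTAMINE_CODONS = {"CAG", "CAA"}


def split_codons(seq: str) -> list[str]:
    """Split an in-frame DNA sequence into codons."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def classify_polyq_tract(seq: str) -> str:
    """Label a glutamine tract by its codon structure.

    'single-codon' tracts are tandem repeats of one codon (the class
    expected to be prone to replication slippage); 'mixed-codon' tracts
    are interrupted by the alternative glutamine codon.
    """
    codons = split_codons(seq)
    if not codons or any(c not in GLUTAMINE_CODONS for c in codons):
        raise ValueError("sequence is not a pure glutamine tract")
    return "single-codon" if len(set(codons)) == 1 else "mixed-codon"


print(classify_polyq_tract("CAGCAGCAGCAG"))  # tandem repeat of CAG
print(classify_polyq_tract("CAGCAACAGCAG"))  # CAG tract interrupted by CAA
```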
Evolution of gene content
Differences in gene content between the sequenced mammalian genomes appear to be mostly due to expansion and contraction of gene families, although there is evidence of pseudogenization of a few genes. Some gene families, such as ribosomal DNA encoding the ribosomal RNAs, may change in copy number in a quasi-neutral manner, but others appear to evolve in a co-evolutionary relationship with the environment and lifestyle of the organism of which they form a part.
(6,11,23)
Both the mouse and rat genome papers (1,6) have remarked that gene families with higher copy numbers in rodents than in humans have functions related to olfaction and odorant reception, antigen recognition and reproduction. In the rat, there are also signs of duplications in the foreign compound detoxification system. All of these have a clear relationship with the rodent way of life. These duplicated genes often show signs of rapid adaptive evolution (high ratios of non-synonymous to synonymous substitution rate; Ka/Ks⁶). This puts new emphasis on the study of gene duplication in genome evolution.
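As a rough illustration of how the Ka/Ks ratio is read, the Python sketch below classifies a gene from its Ka and Ks estimates. It assumes both rates have already been estimated by some other method; the function name, the tolerance band around 1 and the example values are hypothetical choices made for this sketch, not values taken from RGSPC.

```python
def classify_selection(ka: float, ks: float, tolerance: float = 0.1) -> tuple[float, str]:
    """Return the Ka/Ks ratio and a coarse interpretation of it.

    ka: non-synonymous substitution rate
    ks: synonymous substitution rate (must be positive)
    tolerance: half-width of the band around 1 treated as 'neutral'
               (an arbitrary choice for this sketch)
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if ratio > 1 + tolerance:
        label = "positive (adaptive) selection"
    elif ratio < 1 - tolerance:
        label = "purifying selection"
    else:
        label = "approximately neutral"
    return ratio, label


# Hypothetical values for a duplicated gene and a conserved single-copy gene.
for name, ka, ks in [("duplicated gene", 0.30, 0.15), ("conserved gene", 0.02, 0.40)]:
    ratio, label = classify_selection(ka, ks)
    print(f"{name}: Ka/Ks = {ratio:.2f} ({label})")
```

In practice a ratio above 1 averaged over a whole gene is a conservative criterion, since adaptive change may affect only a few sites.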
An interesting question is what the underlying mechanism of adaptive changes in gene copy number might be. It seems likely that unequal crossing-over (UCO⁷) is the major source of individual gene duplication. (24) Duplicated genes appear to arise close to one another (at least in mammals) and the gene clusters are broken up by subsequent genome rearrangement. (25) UCO is stimulated by TEs, (15) and it is interesting that a recent analysis of a region of the mouse genome containing three expanded gene families showed an association of gene families with elevated TE concentrations. (26) Causality and correlation are not the same thing, and it could be that TEs accumulate in regions of frequent UCO, but an attractive hypothesis is that regions rich in TEs drive gene family evolution by UCO and act as “gene factories”. (26)
³ Transposable elements are DNA sequence elements that spread through genomes via DNA or RNA intermediates.
⁴ LINE-1 is a member of a class of autonomous transposable elements that contains all the genes required for its own transposition.
⁵ Simple sequence repeats are components of genomes made up of clusters of short-sequence motifs. In the case of microsatellites and minisatellites, these are arranged tandemly; in cryptically simple regions they are clustered but not tandemly arranged.
⁶ The ratio of the rates of non-synonymous (Ka) to synonymous (Ks) substitution in protein-coding regions is taken as an indicator of the type and level of selection acting on the sequence, as non-synonymous mutations change the protein sequence and are therefore subject to purifying or positive selection whereas this is not so for synonymous mutations.
⁷ Unequal crossing-over is a recombinational process that can take place between identical or near-identical sequences in different positions on different chromosomes. When recombination takes place between these sequences, deletions and duplications result.
Genome rearrangement breaks up gene clusters and will bring together new combinations of genes on chromosomes, which could give rise to different influences of the hitchhiking effect in different lineages. A more significant side effect may be the incidental generation of segmental duplication, the duplication of segments of genomes more than 5 kb⁸ (1 kb = 1,000 base pairs) long. This can give rise to duplication of sets of genes that are not members of a gene family, producing different effects to local, UCO-driven gene duplication of single gene types, and providing the organism with a broader spectrum of substrates for adaptive evolution.
It could also bring about rearrangement and duplication of regulatory elements, many of which act at great distance from their target gene in mammals and, in rare cases, may even give rise to gene fusion events, producing novel proteins (as for mutations in general, most of these events are likely to be deleterious, as in human cancers, but some may be advantageous). The rat genome is again intermediate in its content of detectable segmental duplications, 2.9% of the sequence compared to 1–2% in mouse and 5–6% in humans.
Rodents have long been known to show an elevated substitution rate compared to other mammals, hypothetically due to their short generation time. (8) This has given rise to significant problems dating the rat–mouse divergence by molecular means and in placing rodents in the mammalian phylogeny. RGSPC carried out joint analysis of the rat, human and mouse genomes using ancestral repeat sequences to root their maximum likelihood phylogeny and determine the relative rates of evolution in the rodent and primate branches.
Although ingenious, this approach could give rise to error if repeats evolve significantly differently from the rest of the genome, which is possible. With this caveat, their analysis appears to show that sequence evolution in the ancestral rodent genome was approximately twice as rapid as in primates, but has been slower in mouse and rat since their divergence, the rat lineage showing slightly faster evolution than the mouse lineage.
Non-coding conservation
A feature of mammalian genomes that has only recently been recognised is the Evolutionarily Conserved Region or ECR (27)⁹ (also known by a number of other acronyms, depending on researcher and definition, for example CNG (Conserved Nongenic Sequence) or MCS (Multi-species Conserved Sequence) (28,29)). These are short regions that show high conservation between species (typically human and mouse) and may show conservation in species as divergent as the sea squirt Ciona intestinalis. (26)
In human–mouse comparisons around 50% of ECRs lie outside recognised coding regions, suggesting that they may represent conserved regulatory regions. This is an exciting possibility as we are currently lacking computational or high-throughput experimental tools to identify regulatory regions of genes. There is some evidence that some non-coding ECRs resemble regulatory regions (30) and supporting experimental evidence is starting to appear. (31)
⁸ A kilobase is equivalent to 1,000 base pairs.
⁹ ECRs, CNGs and MCSs are generally short regions of a genome that show higher than expected conservation in another genome, although the precise definition differs for each type of sequence. Conservation of this kind is to be expected within coding regions, but many occur outside exons. It has been suggested that these may represent conserved regulatory elements.
It is likely that better discrimination between selectively and stochastically conserved ECRs can be achieved by adding additional branch length to the phylogeny of the species under study. Various approaches to this have been posited, from using very distant species (26,29) to using a set of species from a given lineage (such as the primates) which may be expected to have some common features of gene regulation. Adding rat to human + mouse adds relatively little (c. 15%) to the power of mouse–human comparisons (1) but might be worthwhile in particular for analysing ECRs that might be functional in mouse (or rat), as it forms the basis of a potentially broader rodent data set.
Although they did not carry out an extensive analysis of this kind, RGSPC looked at some specific regions with the aim of investigating whether adding conservation to gene prediction is worthwhile. They produced some encouraging results when looking for conserved transcription-factor-binding sites of a pre-defined class (GATA-1) in a study of two hypersensitive sites from the β-globin complex.
Bejerano et al. (32) recently described ultraconserved blocks of up to 779 bp that are identical in human, rat and mouse and show very high conservation in other mammals. In contrast, RGSPC detected 5,055 regions longer than 100 bp that showed more than tenfold difference in evolutionary rate between the mouse and rat lineages. Like ECRs, both of these classes of sequence may be valuable targets for identifying important functional genomic regions: ultraconserved regions (which are 20-fold more conserved than typical coding regions) may have some as yet unidentified, critical function in the genome, (32) while rapidly evolving regions may be responsible for some of the phenotypic difference between species (6) (or, potentially, between laboratory strains, which often show phenotypic differences).
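The idea of a block that is identical in several species can be made concrete with a short Python sketch. This is a minimal illustration, not the procedure used by Bejerano et al.: it assumes the sequences have already been aligned to equal length with "-" marking gaps and written in one letter case, and the function name, the default minimum length and the toy sequences are all hypothetical.

```python
def identical_blocks(aligned_seqs: list[str], min_len: int = 200) -> list[tuple[int, int]]:
    """Find runs of alignment columns that are identical in every sequence.

    aligned_seqs: equal-length aligned sequences, '-' marking gaps
    min_len: shortest run to report (an adjustable, illustrative threshold)
    Returns half-open (start, end) column intervals.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")

    blocks = []
    start = None
    # Iterate one column past the end so that a final run is closed.
    for i in range(length + 1):
        conserved = (
            i < length
            and aligned_seqs[0][i] != "-"
            and all(s[i] == aligned_seqs[0][i] for s in aligned_seqs)
        )
        if conserved:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks


# Toy three-species alignment; a much smaller threshold is used for display.
human = "ACGTACGTTTGA"
mouse = "ACGTACGATTGA"
rat = "ACGTACG-TTGA"
print(identical_blocks([human, mouse, rat], min_len=4))
```

Real analyses work from whole-genome alignments and must also handle sequencing error, which matters here because a single miscalled base splits an otherwise identical block.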
Future trends
The current trend in genomics is to sequence an increasing number of genomes at increasingly low coverage (and therefore higher error rate). The rat genome is currently sequenced to seven times coverage, corresponding to an error rate of around 10⁻⁴. RGSPC emphasise the importance of a high-quality draft sequence, but a higher error rate than for finished sequence is nevertheless unavoidable. This is not a problem for many types of analysis, for example of substitution rates, but is potentially a problem where precision is important. It may for example interfere with the identification of conserved elements, especially if they are small like transcription-factor-binding sites. It may also come into play when distinguishing bona fide genes from unprocessed pseudogenes, as single errors can lead to in silico pseudogenization of bona fide genes, and vice versa. It may be that part of the process of functional analysis of unfinished genome sequences will need to be systematic re-sequencing of genes and unprocessed pseudogenes to finished standard. Similar problems may arise for regulatory regions, which are harder to identify and for which false negatives are more difficult to detect.
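To give a feel for why draft-level error rates matter for individual genes, the Python sketch below estimates how often a coding sequence of a given length would contain at least one sequencing error. It reads the quoted error rate as about 1 error per 10,000 bases and assumes errors occur independently at each base, which is a simplification; the function names and the 1,500 bp example length are hypothetical.

```python
def expected_errors(length_bp: int, error_rate: float = 1e-4) -> float:
    """Expected number of sequencing errors in a sequence of this length."""
    return length_bp * error_rate


def prob_at_least_one_error(length_bp: int, error_rate: float = 1e-4) -> float:
    """Probability of one or more errors, assuming independent per-base errors."""
    return 1.0 - (1.0 - error_rate) ** length_bp


# Hypothetical coding sequence of 1,500 bp and a 10 bp binding site.
for name, length in [("coding sequence", 1500), ("binding site", 10)]:
    print(
        f"{name} ({length} bp): "
        f"expected errors = {expected_errors(length):.3f}, "
        f"P(at least one error) = {prob_at_least_one_error(length):.3f}"
    )
```

Only a fraction of such errors would introduce a frameshift or stop codon, so this is an upper bound on spurious pseudogenization per gene, but summed over tens of thousands of genes the effect is not negligible.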
We are still in the early stages of the exploration of genome sequence space, during which we will doubtless learn vastly more about genome sequences and their evolution than we currently know. The themes that are currently emerging in genomics are very much based on what we already know, but it may be that there are parts of genomes that have functions that we can currently only guess at—perhaps hinted at by the inexplicably high conservation seen in ultraconserved regions and amongst non-coding ECRs. There may yet be genomic dark matter, not all of which may represent regulatory regions as we currently understand them. Onwards and upwards!
References
1. Rat Genome Sequencing Project Consortium. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428:493–521.
2. International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
3. Meinke A, Henics T, Nagy E. 2004. Bacterial genomes pave the way to novel vaccines. Curr Opin Microbiol 7:314–320.
4. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. 2004. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428:37–43.
5. The International HapMap Consortium. 2003. The International HapMap Project. Nature 426:789–796.
6. Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.
7. Cowley AW Jr, Roman RJ, Jacob HJ. 2004. Application of chromosomal substitution techniques in gene-function discovery. J Physiol 554:46–55.
8. Wu CI, Li WH. 1985. Evidence for higher rates of nucleotide substitution in rodents than in man. Proc Natl Acad Sci USA 82:1741–1745.
9. Bourque G, Pevzner PA, Tesler G. 2004. Reconstructing the Genomic Architecture of Ancestral Mammals: Lessons From Human, Mouse, and Rat Genomes. Genome Res 14:507–516.
10. Eichler EE, Sankoff D. 2003. Structural dynamics of eukaryotic chromosome evolution. Science 301:793–797.
11. Dehal P, Predki P, Olsen AS, Kobayashi A, Folta P, et al. 2001. Human chromosome 19 and related regions in mouse: conservative and lineage-specific evolution. Science 293:104–111.
12. Pevzner P, Tesler G. 2003. Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc Natl Acad Sci USA 100:7672–7677.
13. Dermitzakis ET, Kirkness E, Schwarz S, Birney E, Reymond A, Antonarakis SE. 2004. Comparison of human chromosome 21 conserved nongenic sequences (CNGs) with the mouse and dog genomes shows that their selective constraint is independent of their genic environment. Genome Res 14:852–859.
14. Kidwell MG. 2002. Transposable elements and the evolution of genome size in eukaryotes. Genetica 115:49–63.
15. Kazazian HH Jr. 2004. Mobile elements: drivers of genome evolution. Science 303:1626–1632.
16. Hancock JM. 1995. The contribution of slippage-like processes to genome evolution. J Mol Evol 41:1038–1047.
17. Hancock JM. 2002. Genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects. Genetica 115:93–103.
18. Pech M, Igo-Kemenes T, Zachau HG. 1979. Nucleotide sequence of a highly repetitive component of rat DNA. Nucleic Acids Res 7:417–432.
19. Waring M, Britten RJ. 1966. Nucleotide sequence repetition: a rapidly reassociating fraction of mouse DNA. Science 154:791–794.
20. Bowater RP, Wells RD. 2001. The intrinsically unstable life of DNA triplet repeats associated with human hereditary disorders. Prog Nucleic Acid Res Mol Biol 66:159–202.
21. Hancock JM, Worthey EA, Santibanez-Koref MF. 2001. A role for selection in regulating the evolutionary emergence of disease-causing and other coding CAG repeats in humans and mice. Mol Biol Evol 18:1014–1023.
22. Alba MM, Santibanez-Koref MF, Hancock JM. 1999. Conservation of polyglutamine tract size between mice and humans depends on codon interruption. Mol Biol Evol 16:1641–1644.
23. Emes RD, Goodstadt L, Winter EE, Ponting CP. 2003. Comparison of genomes of human and mouse lays the foundation of genome zoology. Hum Mol Genet 12:701–709.
24. Schimenti JC. 1999. Mice and the role of unequal recombination in gene-family evolution. Am J Hum Genet 64:40–45.
25. Friedman R, Hughes AL. 2004. Two patterns of genome organization in mammals: the chromosomal distribution of duplicate genes in human and mouse. Mol Biol Evol 21:1008–1013.
26. Mallon A-M, Wilming L, Weekes J, Gilbert JGR, Ashurst J, et al. 2004. Organization and evolution of a gene-rich region of the mouse genome: a 12.7 Mb region deleted in the Del(13)Svea36H mouse. Genome Res (In Press).
27. Mallon AM, Platzer M, Bate R, Gloeckner G, Botcherby MR, et al. 2000. Comparative genome sequence analysis of the Bpa/Str region in mouse and Man. Genome Res 10:758–775.
28. Dermitzakis ET, Reymond A, Scamuffa N, Ucla C, Kirkness E, et al. 2003. Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science 302:1033–1035.
29. Margulies EH, Blanchette M, Haussler D, Green ED. 2003. Identification and characterization of multi-species conserved sequences. Genome Res 13:2507–2518.
30. Dermitzakis ET, Reymond A, Scamuffa N, Ucla C, Kirkness E, Rossier C, et al. 2003. Evolutionary discrimination of mammalian conserved nongenic sequences (CNGs). Science 302:1033–1035.
31. Frazer KA, Tao H, Osoegawa K, de Jong PJ, Chen X, et al. 2004. Noncoding sequences conserved in a limited number of mammals in the SIM2 interval are frequently functional. Genome Res 14:367–372.
32. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, et al. 2004. Ultraconserved elements in the human genome. Science 304:1321–1325.