
Review, Trends in Genetics Vol.26 No.1

A global view of genomic information – moving beyond the gene and the master regulator

John S. Mattick1, Ryan J. Taft1 and Geoffrey J. Faulkner2

1 Institute for Molecular Bioscience, The University of Queensland, St Lucia, 4072, QLD, Australia
2 Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Roslin, EH25 9PS, UK

The current view of gene regulation in complex organisms holds that gene expression is largely controlled by the combinatoric actions of transcription factors and other regulatory proteins, some of which powerfully influence cell type. Recent large-scale studies have confirmed that cellular differentiation involves many different regulatory factors. However, other studies indicate that the genome is pervasively transcribed to produce a variety of short and long non-protein-coding RNAs, including those derived from retrotransposed sequences, which also play important roles in the epigenetic regulation of gene expression. The evidence suggests that ontogenesis requires interplay between state-specific regulatory proteins, multitasked effector complexes and target-specific RNAs that recruit these complexes to their sites of action. Moreover, the semi-continuous nature of the transcriptome prompts the reassessment of ‘genes’ as discrete entities and indicates that the mammalian genome might be more accurately viewed as islands of protein-coding information in a sea of cis- and trans-acting regulatory sequences.

Regulatory paradigms in metazoa

Perhaps the biggest surprises of the genome sequencing projects are that the number of protein-coding genes in animals does not change appreciably with increasing developmental complexity (known as the ‘G-value paradox’) [1] and that, notwithstanding clade-specific expansions and innovations (e.g. RNA editing enzymes in vertebrates [2,3]), most proteins are orthologous [4].
It is generally accepted that phenotypic divergence in animals is based largely on the variation of the regulatory information that controls the expression of these proteins and their isoforms [5]. In addition, it is generally assumed that most regulatory transactions are conveyed by sequence-specific regulatory proteins that bind to enhancers, promoters and transcripts to modulate mRNA expression and processing, with the vast differences in developmental and cognitive complexity between nematodes and humans ascribed to an expanded suite of cis-regulatory elements and the presumed explosive power of the combinatoric interactions between the regulatory proteins that recognize these elements [6].

Here, we discuss recent advances in our understanding of the nature and regulation of gene expression in mammals, particularly in relation to the complexity of the hierarchical networks of regulatory factors involved, the unfolding discovery of previously hidden layers of regulatory RNAs (including many derived from retrotransposon sequences and pseudogenes) and the emerging realization that the genome might not be constructed as a discrete set of protein-coding genes with associated regulatory sequences, but as an interleaved continuum of both coding and cis- and trans-acting regulatory information.

Corresponding authors: Mattick, J.S. (j.mattick@uq.edu.au); Faulkner, G.J. (geoff.faulkner@roslin.ed.ac.uk).
0168-9525/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.tig.2009.11.002 Available online 26 November 2009

Glossary

Long non-protein-coding RNAs (lncRNAs): RNAs of little protein-coding potential, >200 nt in length, some of which can be >100 Kb [59,60]. The 200 nt lower limit is an arbitrary figure based on a convenient practical cut-off in RNA purification protocols [32] that excludes most known classes of small RNAs.
MicroRNAs (miRNAs): 22 nt small RNAs that regulate gene expression by partial complementary base pairing to specific mRNAs. This annealing inhibits protein translation and can also facilitate the degradation of the target mRNA.
Piwi-interacting RNAs (piRNAs): Dicer-independent 26–30 nt small RNAs principally restricted to the germline and somatic cells bordering the germline. They associate with Piwi-clade Argonaute proteins and regulate transposon activity and chromatin state [27,28].
Promoter-associated short RNAs (PASRs): generally 20–200 nt long with 5′ ends that coincide with the transcription start sites of protein-coding and non-coding genes [32].
Pseudogene: a supposedly non-functional paralog of a protein-coding gene generated by gene duplication or retrotransposition. The vast majority of pseudogenes are computationally defined by looking for features such as premature stop codons that prevent the translation of a viable protein.
Small interfering RNAs (siRNAs): 21 nt small RNAs produced by Dicer cleavage of perfectly complementary dsRNA duplexes. They form complexes with Argonaute proteins and are involved in gene regulation, transposon control and viral defense [27,28].
Small nucleolar RNAs (snoRNAs): two classes: C/D box snoRNAs (70–120 nt) guide the methylation of target RNAs and H/ACA box snoRNAs (100–200 nt) guide pseudouridylation [121]. Recent evidence suggests that they might also be precursors for at least two classes of small RNAs that might have miRNA-like activity [30,31].
Transcription initiation RNAs (tiRNAs): 18 nt tiny RNAs derived from sequences just downstream of transcription initiation sites, which seem to be linked to the position of the first nucleosome and might be derived from RNA Pol II backtracking and TFIIS-mediated cleavage [33,34].
Transposed element (TE): repetitive element generated by the activity of retrotransposons and transposons. Almost always incapable of further transposition because of mutations and truncations.

‘Transcription factors’ and regulatory networks

The term ‘transcription factor’ loosely describes proteins with different modes of action that operate at various levels to facilitate and control the production of RNA. Transcription factors often act generically, in that they are expressed in multiple cell types and can regulate the expression of many genes by recognizing cis-acting binding sites with relatively relaxed consensus sequences that occur in many places around the genome, suggesting that another layer of specificity is required. What determines which subset of these sites is addressed in a given cell or developmental context is unknown but is presumed to be influenced by chromatin accessibility [7].

Some transcription factors are expressed in a cell-restricted manner and can have a powerful influence on cell fate. Good examples are the transcription factors Pou5f1 (also known as Oct4), Sox2 and Nanog, which are considered ‘master regulators’ of stem cell pluripotency with the ability to revert somatic cells to undifferentiated states capable of elaborating various developmental trajectories [8], and Hox proteins, which control cellular patterning in many contexts, including the segmental organization and neural circuitry of the hindbrain [9]. Other proteins that seem to function as master regulators of subsequent cell differentiation (of which there are many) include the helix–loop–helix transcription factor Myod1, which can convert differentiated cells into muscle cells [10], and the zinc finger protein Egr2 (also known as Krox-20), which is expressed in and required for the development of specific rhombomeres (segmented compartments) in the embryonic hindbrain [11]. However, such proteins are only parts of larger networks that influence muscle or hindbrain development in vivo, respectively [9,12], and do not fully explain the diversity and fine structure of organs and tissues.
Similarly, chromatin-modifying proteins have a profound impact on developmental processes [13] because they lie at the functional centre of epigenetic regulatory networks, not because they themselves make locus-specific regulatory decisions but because they act on other information that does.

The complex interplay of regulatory factors in cellular differentiation was recently illustrated by examining the effects of the systematic small interfering RNA (siRNA)-mediated knockdown of 52 transcription factors during the PMA-induced differentiation of the human monocytic cell line THP-1. This study showed that cellular states are determined by complex networks involving both positive and negative regulatory interactions among substantial numbers of transcription factors and that no single transcription factor is necessary and sufficient to drive the differentiation process [14]. A similar approach using the short hairpin RNA (shRNA)-mediated perturbation of 125 transcription factors, chromatin modifiers and RNA-binding proteins found that the transcriptional response of mouse dendritic cells to pathogens involves (at least) 24 ‘core regulators’ and 76 ‘fine-tuners’ [15]. The layered complexity of regulatory networks is also illustrated by the interplay between microRNAs (miRNAs) and transcription factors in cell pluripotency and differentiation [16–18].

The regulatory challenge and regulatory hierarchies

This complexity is no surprise. Beyond cells in culture, the enormous and underappreciated challenge for genetic programming is not simply to define the phenotypic state of a cell, but rather to organize the 4-dimensional growth and differentiation of cells into a myriad of precisely sculpted organs and tissues.
In humans, these include the lungs, kidneys, heart, liver, pancreas, intestine and so on, a wide range of skeletal muscles including those in the face and many bones such as the vertebrae, each of which has a specific and unique architecture, as well as the dizzyingly complex organization of the brain with some 10^11 neurons and 10^14 synapses.

Organogenesis, which involves directional cell division, cell movement, cell differentiation and programmed apoptosis, requires networked interactions between many hierarchical levels of gene regulation, including the modulation of chromatin architecture, transcription initiation and elongation, alternative splicing, RNA editing and other forms of post-transcriptional modifications, translation, post-translational modifications, RNA half-life and RNA and protein trafficking and signaling. This is an extensive list, elements of which are often studied in isolation from others, with respect to both the gene(s) and level(s) of regulation under scrutiny. Each can have important if not profound effects on the cellular phenotype, as exemplified by the pleiotropic effects of protein phosphorylation by the serine/threonine kinases Akt1–3 on cell survival, growth, division, migration and metastasis, depending on the phosphorylated substrate and crosstalk with other pathways [19]. However, the complex interactions between different gene products and different levels of gene regulation in these networks are as yet only poorly understood.

Guide RNAs in regulatory networks

Although it is presumed that differentiation and development are mainly controlled by regulatory proteins, it is becoming increasingly apparent that there exists an additional, potentially vast, layer of regulatory RNAs that interact with some of these proteins and provide specificity to them (Figure 1).
A well-documented, although by no means fully explored, example is that of miRNAs, which are generated by the RNA interference (RNAi) pathway and have unexpectedly emerged over the past decade as major players in global and specific gene regulation. Indeed, two recent reports have shown that a single miRNA, miR-302, like the transcription factors Pou5f1, Sox2 and Nanog, is capable of reprogramming cells into an embryonal stem cell-like pluripotent state, including the induction of these transcription factors [16,17], which are then repressed by other miRNAs during differentiation [18]. miRNAs have no intrinsic catalytic function but act as guides to recruit a relatively generic protein complex (the RNA-induced silencing complex, or RISC, which contains ‘regulatory’ proteins of the Argonaute family [20]) to regulate the translation and half-life of specific target mRNAs through binding sites in both their coding [18,21] and 3′ untranslated regions, whose length and miRNA-recognition repertoire are modulated during differentiation [22,23]. Altogether, miRNAs regulate almost every known developmental process [24] and are perturbed in pathological processes such as cancer [25]. Although these examples illustrate the regulatory interplay between regulatory factors themselves, the key general point is that in the case of miRNAs, and other classes of regulatory RNAs, the regulatory signal has been de-coupled from the consequent analog action, which provides enormous efficiency and flexibility in the evolution and deployment of such regulatory signals and networks. The number of known miRNAs in mammals is of the order of 10^3, but might be much higher given the indications of deep sequencing and evidence of cell-specific miRNAs [26].

Figure 1. Transcriptional complexity at a single locus. Recent research indicates that most of the eukaryotic genome is transcribed into interweaving nests of both sense and antisense RNA species, whose expression is regulated to some extent by transcription factor activity, and also by local chromatin modifications, boundary elements and TEs, and other regulatory RNA species. Examples of these phenomena are depicted in this figure.
(a) Transcription can initiate at multiple sites in a single locus, including from the 5′ ends of annotated genes (left-hand edge of the red blocks), which frequently show evidence of antisense transcription in addition to a single dominant and many minor sense-oriented TSSs. Transcription can also initiate at transposed elements upstream of a canonical protein-coding gene or at sense- or antisense-oriented sites in introns. Such noncanonical transcriptional activity can also have a direct regulatory function – blocking the activity and accessibility of downstream promoters. Both protein-coding and non-coding RNA TSSs are regulated by transcription factors, and transcription factor targeting itself can be regulated by small ncRNAs. Tethered long ncRNAs (represented by light blue wavy line) can recruit chromatin or other DNA modifying complexes to regulate TSS accessibility to transcription factors and RNA polymerase II.
(b) Transcripts generated from the TSSs described in panel (a) are depicted. The first three large transcripts shown below the arrow depict a canonical mRNA, an mRNA-like ncRNA and an alternate mRNA product with an exon extension derived from an alternative TSS in a repetitive element. The long ncRNAs shown in yellow are derived from the bidirectional transcription of the protein-coding gene and transcription factor-regulated intronic TSSs. Like protein-coding mRNAs, long ncRNAs can be spliced and capped. Small TSS-proximal RNAs (e.g. tiRNAs and PASRs) are derived from both protein-coding and non-coding transcription and might regulate transcriptional activity. Other small RNAs, such as snoRNAs and pre-miRNAs, can be processed from the introns of protein-coding or non-coding transcripts, and further cleaved into sdRNAs and mature miRNAs. Indeed, miRNAs bound to RISC complexes (top right) add an additional layer of regulation by targeting transcripts for degradation or inhibiting translation.
Thin blocks represent short (green) or long (yellow) non-coding RNAs, TEs (purple) or 5′ and 3′ untranslated regions (red). Thick blocks indicate protein-coding exons (red). Blocks connected by thin lines indicate spliced transcripts and their respective splicing patterns. Double-stranded structures are representative of the genome. The size of the arrow indicates the relative abundance of transcripts derived from the TSS. Abbreviations: transcription initiation RNA (tiRNA); promoter-associated small RNA (PASR); small nucleolar RNA (snoRNA); sno-derived RNA (sdRNA); microRNA (miRNA); RNA-induced silencing complex (RISC); transcription factor (TF); 7-methylguanosine cap (7mG).

In recent years, a number of other classes of small RNAs have emerged, including (i) Piwi-interacting small RNAs (piRNAs) that interact with other members of the Argonaute family and seem to have a role in silencing transposon activity in the germline [27,28], (ii) siRNAs derived from sense–antisense duplexes that play a role in the epigenetic regulation of adjacent loci [27,28] or are imputed to do so by Argonaute-dependent processes [29], (iii) at least two classes of small RNAs derived from small nucleolar RNAs (snoRNAs) that might also have miRNA-like activity [30,31], (iv) promoter-associated small RNAs (PASRs) of unknown function [32] and (v) transcription initiation RNAs (tiRNAs) linked to transcription start sites and nucleosome positioning [33,34].
Interestingly, it has recently been shown that exons are preferentially positioned in nucleosomes in both somatic [35–38] and sperm cells [38], which might provide a mechanistic basis for the observed coupling of chromatin structure, transcription and splicing, and a potential basis for exon selection through various histone modifications within these nucleosomes that report the status of particular exons during differentiation and development, a process that itself might be RNA-directed.

Hidden layers of RNA

It is now evident that the genomes of mammals are almost entirely transcribed (as are, as far as one can tell, the genomes of all other organisms), apparently in a developmentally regulated manner. That is, the expansion in the extent of non-coding sequences with increased complexity is paralleled by a corresponding increase in the extent of transcription [4]. This has been shown by whole chromosome tiling arrays [39–42] and deep sequencing of normalized cDNA libraries, which have revealed tens of thousands of intergenic, antisense, overlapping and intronic long non-protein-coding RNAs (long ncRNAs or lncRNAs) that are dynamically expressed from the mammalian genome [43–46], yielding a picture of extraordinary transcriptional complexity at individual loci (Figure 1). Although initially suspected to be transcriptional ‘noise’, a view that has been entertained if not favored because it does not disturb the orthodox view of the informational structure of the genome, there is now substantial genome-wide evidence pointing to the intrinsic functionality of these transcripts (for a review, see [47]).
In addition, although the mechanisms are not yet well understood, there is increasing evidence that these RNAs play important roles in the regulation of differentiation and development, including the involvement of lncRNAs in the regulation of the expression of homeotic genes [48,49] and oncogenes [50], and in the regulation of skeletal development, eye development and epithelial-to-mesenchymal transition among many others (for reviews, additional examples and references, see [51,52]).

The range of functions of the large number of documented lncRNAs has barely been explored, and there is as yet little information that would allow their structural parsing and functional classification. Nonetheless, lncRNAs might be expected to perform a wide range of functions in eukaryotic cell and developmental biology, given the capacity of RNA to form sophisticated structures that can be allosterically altered by interactions with other molecules [53,54], as well as to embrace highly sequence-specific recognition of other RNAs, DNA and proteins. Preliminary evidence suggests that many if not most lncRNAs are trafficked to specific subcellular locations [55], and there are well-documented examples of lncRNAs that are required for the formation and structural integrity of nuclear paraspeckles (which seem to regulate mRNA export) in differentiated cells [56,57] or are specifically associated with a novel subnuclear domain in a subset of neurons [58]. These observations demonstrate that RNA molecules themselves are crucial for proper cellular function.

There are even more subterranean layers of the transcriptome. Most identified lncRNAs seem to be polyadenylated and produced by RNA polymerase II, although many are derived from internal oligo(dT) priming of internal A-rich sequences during cDNA cloning, often from much larger precursors (‘macroRNAs’) that can extend over 100 Kb in length [59,60].
However, tiling array analyses, which do not depend on oligo(dT)-based capture and priming to reduce the background of rRNA and other infrastructural RNAs, have shown that almost half of all transcripts are not polyadenylated and that this fraction is largely distinct in sequence composition from polyadenylated RNAs [41], indicating that a large proportion of the dynamic transcriptome has remained hidden from view for unexpected technical reasons. Many of these RNAs, including those derived from transposed elements (see below), might be produced by RNA polymerase III (RNAPIII), which also transcribes various types of known or putative regulatory RNAs, although the full extent of the RNAPIII transcriptome is unknown [61].

Large-scale screening of ncRNAs, using siRNA- or shRNA-mediated knockdown (which is surprisingly effective [47]) or ectopic expression strategies, in parallel with existing efforts to deconstruct the roles of protein-coding genes, would undoubtedly reveal their role in many cellular and developmental processes. Indeed, given the phenotypic differences between species that retain a similar complement of proteins, it is possible that regulatory RNAs, including those that are not well conserved [62–64], underlie many adaptations.

RNA regulation of the epigenome

Dynamic and mitotically heritable changes to chromatin architecture are a hallmark of differentiation and development and are central to the 4-dimensional control of these processes. These changes include DNA methylation as well as a myriad of different modifications to different residues in the tails of histones that form the nucleosome [65], and that are imposed by a relatively generic set of enzymes and chromatin-modifying complexes that, like RISC, lack inherent sequence specificity.
To date, the field of epigenomics has largely been concerned with cataloging the changes in these modifications during differentiation (including cancer) and their association with particular features such as promoters or exons [66]. It seems to have been widely assumed, although not often articulated, that the positional specificity of these modifications is regulated by proteins that recruit the appropriate chromatin-modifying enzymes and/or complexes to their various sites of action at different loci in different cell types at different stages of growth and differentiation [67]. There is a degree of circularity in this assumption because it is also thought that changes to chromatin structure can facilitate or restrict access to the transcription factors [7] that regulate the next level of specificity in the gene expression hierarchy (i.e. transcription itself).

Recent evidence strongly suggests that both short and long ncRNAs are involved in, and perhaps central to, the control of chromatin architecture [68,69], as predicted earlier [70]. Indeed, ncRNA-directed regulatory circuits underpin most if not all complex epigenetic phenomena in eukaryotes, including transcriptional and post-transcriptional gene silencing, position effect variegation, hybrid dysgenesis, chromosome dosage compensation, parental imprinting and allelic exclusion, and possibly transvection and transinduction [47]. More specifically, it has been shown that many long ncRNAs are spatially and temporally expressed from homeotic loci during development [71], and are associated with either chromatin repressor (Polycomb group) complexes [49,72] or chromatin activator (Trithorax group) complexes and activated forms of histones [73,74], suggesting that these RNAs function to guide infrastructural chromatin-modifying complexes to specific genomic loci to regulate gene expression [67–69].
Interestingly, and significantly for conceptions of evolution [75], RNA-directed epigenetic changes can also be meiotically inherited in animals and plants [76,77].

Enhancers are long distance and often very highly conserved regulatory elements that drive tissue-specific patterns of gene expression during development, a process that is not well understood but is thought to involve the recruitment of transcription factors and chromosomal looping to bring the resultant complexes into contact with specific promoters [67,78]. Intriguingly, the available evidence suggests that enhancers are themselves transcribed in the tissues in which they are active [79,80]. The resulting ncRNAs have been thought not to be functional but rather to be a passive byproduct of a transcriptional event that is required to open up the enhancer (or the promoter) to protein binding [81], but there are also documented examples of trans-acting functions for ncRNAs in regulating the expression of developmental genes [48,49,73]. There is evidence that both the act of transcription and the RNA itself are crucial factors in establishing cellular identity, and it remains an open question whether ncRNAs have an integral role in enhancer function [71]. There is also evidence that some classes of transcription factors and chromatin-modifying complexes have RNA-binding domains or high affinity for RNA:DNA structures ([82] and references therein).

Retrotransposon sequences and pseudogenes

Sequences derived from transposable elements (TEs) comprise at least half of the mammalian genome. They are thought to be largely non-functional and have consequently been used to assess the rate of ‘neutral evolution’, despite McClintock’s originally derided but later celebrated depiction of them as ‘controlling elements’ in maize [83], and Britten and Davidson’s subsequent suggestion that they form gene regulatory networks in animals [84].
Indeed, there is increasing evidence that these sequences might play a key role not only in genome evolution [85,86] but also in genome biology and the control of gene expression [87–91], including the potentially dynamic remodeling of the somatic genome in neurogenesis and/or neuronal function [92]. Moreover, it has been shown that the transcription of ncRNAs from a SINE B2 element controls chromatin structure in the mouse growth hormone locus [81] and that human Alu RNA acts as a modular trans-acting repressor of mRNA transcription during heat shock [93]. More recently, it has been found that thousands of TEs are transcribed in a tissue-specific manner, are enriched near protein-coding genes, typically coincide with the expression of those genes and can comprise up to 30% of the total capped RNA present in a cell [91].

The central message here is that TEs contribute an integral – and underestimated – fraction to the mammalian transcriptome and regulome. Owing to their enrichment near protein-coding genes, TEs can harness nearby exons to generate a large number of protein-coding transcripts, or produce ncRNAs that overlap or regulate protein-coding genes [94]. Moreover, the dramatic variation in TE composition among mammals might reflect lineage-specific functional exaptation [95], as well as examples of convergent evolution [96], calling into question the relevance of conservation-based metrics of TE function [97]. Because TEs contribute a substantial proportion of human polymorphisms [98], they might also be responsible for many phenotypic differences between individuals. This is a particularly important perspective for genome resequencing projects, given that genome-wide association studies indicate that the vast majority of variations affecting complex traits and complex diseases lie within non-coding regions of the genome [99].
Another related class of sequences that might have a regulatory role is pseudogenes, which are presumed non-functional paralogs of functional protein-coding genes generated by gene duplication or retrotransposition. Computational analyses of genome sequence data currently indicate a cohort of 20 000 pseudogenes per mammalian genome [100]. Numerous pseudogenes have been discovered in recent years that contain active promoters and generate either sense or antisense RNAs distinct from their ancestral paralogs [100–104], including key markers of embryonic stem cell pluripotency (Oct4 and Nanog have at least six and 10 pseudogenes, respectively) [105]. The functional consequences of pseudogene transcription are unclear, but it has been proposed that ncRNAs generated by pseudogenes might silence paralogous mRNAs in trans directly by forming RNA–RNA duplexes [102] or generating siRNAs from such duplexes [106,107]. Taken together, these data suggest that the term ‘pseudogene’ might ultimately prove to be a misnomer.

It should be noted that the repetitive nature of both TEs and pseudogenes hinders, but does not prevent, the reliable identification of their associated RNAs. Massively parallel sequencing technologies can achieve a degree of resolution impossible with hybridization-based strategies, which can only reliably target the non-repetitive half of the human genome [41]. If deep sequencing reads of sufficient quality and length are produced, particularly with the use of paired-end protocols and sophisticated bioinformatic methods [108,109], it is possible to discriminate the expression of individual repetitive elements and investigate these species with conventional laboratory techniques [91] including RNAi, despite challenges in the use of exogenous siRNA molecules against repetitive regions.

An information continuum?

All these observations suggest that the extent of regulatory information in mammalian genomes is far greater than previously thought.
They also call into question the longstanding assumption that most genes encode proteins (with their cis-acting regulatory elements) and that proteins transact most genetic information. We suggest that a new paradigm needs to emerge that expands the definition of genetic information to include large numbers of regulatory RNAs, and recognizes the possibility that many ‘regulatory’ proteins, including those that might be state-specific, are provided with an additional layer of target specificity by guide RNAs. The most parsimonious explanation for the G-value paradox is that we have constrained ourselves to a limited and incomplete definition of the gene, with respect not only to the fact that many different product isoforms can be produced by post-transcriptional and post-translational modifications, but also the increasing likelihood that many functional genes do not encode proteins.

Moreover, even the conception of a discrete ‘gene’ as a basic organizational unit might be inappropriate to describe the true nature of genetic information in higher organisms. Not only is there the pervasive transcription of the genome, but also a complex mix of overlapping, interleaved and bidirectional coding and non-coding, sense and ‘antisense’ transcripts expressed from most loci [32,110,111]. Many transcripts contain distal 5′ exons (an average of 20 Kb in Drosophila and almost 200 Kb in human) that are only used in particular developmental contexts and traverse large genomic regions including other genes [112,113], contain internal initiation sites for alternative transcripts [46,114] and are processed by alternative splicing and other pathways to produce a variety of long and short RNAs [32,110,111]. Thus, the boundaries of ‘genes’ are blurred and become indefinable.
In addition, transcriptomic studies have revealed interesting and unexpected species, such as chimeric transcripts, that might indicate a higher order network organization [115,116], consistent with the cell type-specific organization of chromosomes into territories and transcription factories [117].

For this reason it might be difficult to parse the genome in only one dimension or associate a genomic locus with a single developmental process, physiological function or role in disease progression. Rather, the genome might be better viewed as a highly organized, information dense and heavily transcribed structure wherein each region variably responds to feed-forward regulatory signals and environmental stimuli through a myriad of RNA and protein products [53,82]. This creates challenges for resolving the complex genetic bases for common human diseases, where gene networks rather than master genes drive specific phenotypes [118], and for modeling genetic networks based on the limited conventional descriptions of a gene.

This conceptual upheaval requires a fresh look at how the genomes of complex organisms, and perhaps all organisms, might be described and parsed to best reflect their biological information content. It has been suggested that genes might be redefined as ‘fuzzy transcription clusters with multiple products’ [119] and more recently as a ‘union of genomic sequences encoding a coherent set of potentially overlapping functional products’ [120], which explicitly includes ncRNAs, moves away from a protein-centric model and recapitulates some of the earliest definitions of genes as genetic loci. However, this does not deal with transcripts containing distal exons that traverse apparently unrelated loci. A not mutually exclusive alternative would be to invert the functional genomics paradigm of annotating ‘genes’ as discrete entities by the product(s) they produce.
Instead, RNAs could be annotated by their genomic origin, genomic environment and what is known of their function, including open reading frame content and interactions with other molecules. In effect, this would circumvent the issue of gene definition (including its boundaries) by consigning it to obsolescence.

Acknowledgements
We thank Alistair Forrest, Piero Carninci and the reviewers for their constructive and helpful comments. JSM and RJT are supported by a Federation Fellowship grant (FF0561986) and a Discovery Project grant (DP0988851) from the Australian Research Council. GJF is supported by an Overseas Based Biomedical Fellowship (CJ Martin Award) from the Australian National Health and Medical Research Council (ID 575585) and a UK BBSRC Institutional Strategic Programme Grant.

References
1 Hahn, M.W. and Wray, G.A. (2002) The G-value paradox. Evol. Dev. 4, 73–75
2 Maas, S. et al. (2003) A-to-I RNA editing: recent news and residual mysteries. J. Biol. Chem. 278, 1391–1394
3 Navaratnam, N. and Sarwar, R. (2006) An overview of cytidine deaminases. Int. J. Hematol. 83, 195–200
4 Taft, R.J. et al. (2007) The relationship between non-protein-coding DNA and eukaryotic complexity. Bioessays 29, 288–299
5 Carroll, S.B. (2008) Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell 134, 25–36
6 Levine, M. and Tjian, R. (2003) Transcription regulation and animal diversity. Nature 424, 147–151
7 Cairns, B.R. (2009) The logic of chromatin architecture and remodeling at promoters. Nature 461, 193–198
8 Pei, D. (2009) Regulation of pluripotency and reprogramming by transcription factors. J. Biol. Chem. 284, 3365–3369
9 Narita, Y. and Rijli, F.M. (2009) Hox genes in neural patterning and circuit formation in the mouse hindbrain. Curr. Top. Dev. Biol. 88, 139–167
10 Tapscott, S.J. (2005) The circuitry of a master switch: Myod and the regulation of skeletal muscle gene transcription.
Development 132, 2685–2695
11 Giudicelli, F. et al. (2001) Krox-20 patterns the hindbrain through both cell-autonomous and non cell-autonomous mechanisms. Genes Dev. 15, 567–580
12 Brand-Saberi, B. (2005) Genetic and epigenetic control of skeletal muscle development. Ann. Anat. 187, 199–207
13 Kwon, C.S. and Wagner, D. (2007) Unwinding chromatin for development and growth: a few genes at a time. Trends Genet. 23, 403–412
14 Suzuki, H. et al. (2009) The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line. Nat. Genet. 41, 553–562
15 Amit, I. et al. (2009) Unbiased reconstruction of a mammalian transcriptional network mediating pathogen responses. Science, Epub ahead of print, doi:10.1126/science.1179050
16 Lin, S.L. et al. (2008) Mir-302 reprograms human skin cancer cells into a pluripotent ES-cell-like state. RNA 14, 2115–2124
17 Lee, N.S. et al. (2008) miR-302b maintains ‘stemness’ of human embryonal carcinoma cells by post-transcriptional regulation of cyclin D2 expression. Biochem. Biophys. Res. Commun. 377, 434–440
18 Tay, Y. et al. (2008) MicroRNAs to Nanog, Oct4 and Sox2 coding regions modulate embryonic stem cell differentiation. Nature 455, 1124–1128
19 Manning, B.D. and Cantley, L.C. (2007) AKT/PKB signaling: navigating downstream. Cell 129, 1261–1274
20 Peters, L. and Meister, G. (2007) Argonaute proteins: mediators of RNA silencing. Mol. Cell 26, 611–623
21 Rigoutsos, I. (2009) New tricks for animal microRNAs: targeting of amino acid coding regions at conserved and non-conserved sites. Cancer Res. 69, 3245–3248
22 Sandberg, R. et al. (2008) Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science 320, 1643–1647
23 Bartel, D.P. (2009) MicroRNAs: target recognition and regulatory functions. Cell 136, 215–233
24 Stefani, G. and Slack, F.J. (2008) Small non-coding RNAs in animal development. Nat. Rev. Mol. Cell Biol. 9, 219–230
25 Medina, P.P.
and Slack, F.J. (2008) microRNAs and cancer: an overview. Cell Cycle 7, 2485–2492
26 Berezikov, E. et al. (2006) Many novel mammalian microRNA candidates identified by extensive cloning and RAKE analysis. Genome Res. 16, 1289–1298
27 Malone, C.D. and Hannon, G.J. (2009) Small RNAs as guardians of the genome. Cell 136, 656–668
28 Ghildiyal, M. and Zamore, P.D. (2009) Small silencing RNAs: an expanding universe. Nat. Rev. Genet. 10, 94–108
29 Morris, K.V. et al. (2008) Bidirectional transcription directs both transcriptional gene activation and suppression in human cells. PLoS Genet. 4, e1000258
30 Ender, C. et al. (2008) A human snoRNA with microRNA-like functions. Mol. Cell 32, 519–528
31 Taft, R.J. et al. (2009) Small RNAs derived from snoRNAs. RNA 15, 1233–1240
32 Kapranov, P. et al. (2007) RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316, 1484–1488
33 Taft, R.J. et al. (2009) Tiny RNAs associated with transcription start sites in animals. Nat. Genet. 41, 572–578
34 Taft, R.J. et al. (2009) Evolution, biogenesis and function of promoter-associated RNAs. Cell Cycle 8, 2332–2338
35 Schwartz, S. et al. (2009) Chromatin organization marks exon–intron structure. Nat. Struct. Mol. Biol. 16, 990–995
36 Tilgner, H. et al. (2009) Nucleosome positioning as a determinant of exon recognition. Nat. Struct. Mol. Biol. 16, 996–1001
37 Andersson, R. et al. (2009) Nucleosomes are well positioned in exons and carry characteristic histone modifications. Genome Res. 19, 1732–1741
38 Nahkuri, S. et al. (2009) Nucleosomes are preferentially positioned at exons in somatic and sperm cells. Cell Cycle 8, 3420–3424
39 Kapranov, P. et al. (2002) Large-scale transcriptional activity in chromosomes 21 and 22. Science 296, 916–919
40 Bertone, P. et al. (2004) Global identification of human transcribed sequences with genome tiling arrays. Science 306, 2242–2246
41 Cheng, J. et al.
(2005) Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308, 1149–1154
42 Birney, E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816
43 Okazaki, Y. et al. (2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563–573
44 Carninci, P. et al. (2005) The transcriptional landscape of the mammalian genome. Science 309, 1559–1563
45 Katayama, S. et al. (2005) Antisense transcription in the mammalian transcriptome. Science 309, 1564–1566
46 Carninci, P. et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 38, 626–635
47 Mattick, J.S. (2009) The genetic signatures of non-coding RNAs. PLoS Genet. 5, e1000459
48 Feng, J. et al. (2006) The Evf-2 non-coding RNA is transcribed from the Dlx-5/6 ultraconserved region and functions as a Dlx-2 transcriptional coactivator. Genes Dev. 20, 1470–1484
49 Rinn, J.L. et al. (2007) Functional demarcation of active and silent chromatin domains in human Hox loci by non-coding RNAs. Cell 129, 1311–1323
50 Yu, W. et al. (2008) Epigenetic silencing of tumor suppressor gene p15 by its antisense RNA. Nature 451, 202–206
51 Prasanth, K.V. and Spector, D.L. (2007) Eukaryotic regulatory RNAs: an answer to the ‘genome complexity’ conundrum. Genes Dev. 21, 11–42
52 Amaral, P.P. and Mattick, J.S. (2008) Non-coding RNA in development. Mamm. Genome 19, 454–492
53 St Laurent, G., III and Wahlestedt, C. (2007) Non-coding RNAs: couplers of analog and digital information in nervous system function? Trends Neurosci. 30, 612–621
54 Serganov, A. (2009) The long and the short of riboswitches. Curr. Opin. Struct. Biol. 19, 251–259
55 Mercer, T.R. et al. (2008) Specific expression of long non-coding RNAs in the mouse brain. Proc. Natl. Acad. Sci. U. S. A. 105, 716–721
56 Sunwoo, H. et al.
(2009) MEN ε/β nuclear-retained non-coding RNAs are upregulated upon muscle differentiation and are essential components of paraspeckles. Genome Res. 19, 347–359
57 Chen, L.L. and Carmichael, G.G. (2009) Altered nuclear retention of mRNAs containing inverted repeats in human embryonic stem cells: functional role of a nuclear non-coding RNA. Mol. Cell 35, 467–478
58 Sone, M. et al. (2007) The mRNA-like non-coding RNA Gomafu constitutes a novel nuclear domain in a subset of neurons. J. Cell Sci. 120, 2498–2506
59 Ravasi, T. et al. (2006) Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res. 16, 11–19
60 Furuno, M. et al. (2006) Clusters of internally primed transcripts reveal novel long non-coding RNAs. PLoS Genet. 2, e37
61 Dieci, G. et al. (2007) The expanding RNA polymerase III transcriptome. Trends Genet. 23, 614–622
62 Pang, K.C. et al. (2006) Rapid evolution of non-coding RNAs: lack of conservation does not mean lack of function. Trends Genet. 22, 1–5
63 Pheasant, M. and Mattick, J.S. (2007) Raising the estimate of functional human sequences. Genome Res. 17, 1245–1253
64 Nordstrom, K.J. et al. (2009) Critical evaluation of the FANTOM3 non-coding RNA transcripts. Genomics 94, 169–176
65 Kouzarides, T. (2007) Chromatin modifications and their function. Cell 128, 693–705
66 Bernstein, B.E. et al. (2007) The mammalian epigenome. Cell 128, 669–681
67 Simon, J.A. and Kingston, R.E. (2009) Mechanisms of Polycomb gene silencing: knowns and unknowns. Nat. Rev. Mol. Cell Biol. 10, 697–708
68 Mattick, J.S. et al. (2009) RNA regulation of epigenetic processes. Bioessays 31, 51–59
69 Morris, K.V. (2009) Long antisense non-coding RNAs function to direct epigenetic complexes that regulate transcription in human cells. Epigenetics 4, 296–301
70 Mattick, J.S. and Gagen, M.J.
(2001) The evolution of controlled multitasked gene networks: the role of introns and other noncoding RNAs in the development of complex organisms. Mol. Biol. Evol. 18, 1611–1630
71 Lempradl, A. and Ringrose, L. (2008) How does non-coding transcription regulate Hox genes? Bioessays 30, 110–121
72 Khalil, A.M. et al. (2009) Many human large intergenic non-coding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc. Natl. Acad. Sci. U. S. A. 106, 11667–11672
73 Sanchez-Elsner, T. et al. (2006) Non-coding RNAs of trithorax response elements recruit Drosophila Ash1 to Ultrabithorax. Science 311, 1118–1123
74 Dinger, M.E. et al. (2008) Long non-coding RNAs in mouse embryonic stem cell pluripotency and differentiation. Genome Res. 18, 1433–1445
75 Mattick, J.S. (2009) Has evolution learnt how to learn? EMBO Rep. 10, 665
76 Chandler, V.L. (2007) Paramutation: from maize to mice. Cell 128, 641–645
77 Cuzin, F. et al. (2008) Inherited variation at the epigenetic level: paramutation from the plant to the mouse. Curr. Opin. Genet. Dev. 18, 193–196
78 Kleinjan, D.A. and van Heyningen, V. (2005) Long-range control of gene expression: emerging mechanisms and disruption in disease. Am. J. Hum. Genet. 76, 8–32
79 Calin, G.A. et al. (2007) Ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas. Cancer Cell 12, 215–229
80 Ling, J. et al. (2005) The HS2 enhancer of the beta-globin locus control region initiates synthesis of non-coding, polyadenylated RNAs independent of a cis-linked globin promoter. J. Mol. Biol. 350, 883–896
81 Lunyak, V.V. et al. (2007) Developmentally regulated activation of a SINE B2 repeat as a domain boundary in organogenesis. Science 317, 248–251
82 Mattick, J.S. (2007) A new paradigm for developmental biology. J. Exp. Biol. 210, 1526–1547
83 McClintock, B. (1956) Controlling elements and the gene. Cold Spring Harb. Symp. Quant. Biol. 21, 197–216
84 Britten, R.J. and Davidson, E.H.
(1971) Repetitive and non-repetitive DNA sequences and a speculation on the origins of evolutionary novelty. Q. Rev. Biol. 46, 111–138
85 Britten, R. (2006) Transposable elements have contributed to thousands of human proteins. Proc. Natl. Acad. Sci. U. S. A. 103, 1798–1803
86 Oliver, K.R. and Greene, W.K. (2009) Transposable elements: powerful facilitators of evolution. Bioessays 31, 703–714
87 Brosius, J. (1999) RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene 238, 115–134
88 Lowe, C.B. et al. (2007) Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proc. Natl. Acad. Sci. U. S. A. 104, 8005–8010
89 Feschotte, C. (2008) Transposable elements and the evolution of regulatory networks. Nat. Rev. Genet. 9, 397–405
90 Cordaux, R. and Batzer, M.A. (2009) The impact of retrotransposons on human genome evolution. Nat. Rev. Genet. 10, 691–703
91 Faulkner, G.J. et al. (2009) The regulated retrotransposon transcriptome of mammalian cells. Nat. Genet. 41, 563–571
92 Coufal, N.G. et al. (2009) L1 retrotransposition in human neural progenitor cells. Nature 460, 1127–1131
93 Mariner, P.D. et al. (2008) Human Alu RNA is a modular transacting repressor of mRNA transcription during heat shock. Mol. Cell 29, 499–509
94 Volff, J.N. and Brosius, J. (2007) Modern genomes with retro-look: retrotransposed elements, retroposition and the origin of new genes. Genome Dyn. 3, 175–190
95 Mattick, J.S. and Mehler, M.F. (2008) RNA editing, DNA recoding and the evolution of human cognition. Trends Neurosci. 31, 227–233
96 Romanish, M.T. et al. (2007) Repeated recruitment of LTR retrotransposons as promoters by the antiapoptotic locus NAIP during mammalian evolution. PLoS Genet. 3, e10
97 Faulkner, G.J. and Carninci, P. (2009) Altruistic functions for selfish DNA. Cell Cycle 8, 2895–2900
98 Wang, J. et al.
(2006) dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum. Mutat. 27, 323–329
99 Hindorff, L.A. et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. U. S. A. 106, 9362–9367
100 Svensson, O. et al. (2006) Genome-wide survey for biologically functional pseudogenes. PLoS Comput. Biol. 2, e46
101 Zhou, B.S. et al. (1992) Identification of antisense RNA transcripts from a human DNA topoisomerase I pseudogene. Cancer Res. 52, 4280–4285
102 Korneev, S.A. et al. (1999) Neuronal expression of neural nitric oxide synthase (nNOS) protein is suppressed by an antisense RNA transcribed from an NOS pseudogene. J. Neurosci. 19, 7711–7720
103 Frith, M.C. et al. (2006) Pseudo-messenger RNA: phantoms of the transcriptome. PLoS Genet. 2, e23
104 Zheng, D. and Gerstein, M.B. (2007) The ambiguous boundary between genes and pseudogenes: the dead rise up, or do they? Trends Genet. 23, 219–224
105 Pain, D. et al. (2005) Multiple retropseudogenes from pluripotent cell-specific gene expression indicates a potential signature for novel gene identification. J. Biol. Chem. 280, 6265–6268
106 Watanabe, T. et al. (2008) Endogenous siRNAs from naturally formed dsRNAs regulate transcripts in mouse oocytes. Nature 453, 539–543
107 Tam, O.H. et al. (2008) Pseudogene-derived small interfering RNAs regulate gene expression in mouse oocytes. Nature 453, 534–538
108 Faulkner, G.J. et al. (2008) A rescue strategy for multi-mapping short sequence tags refines surveys of transcriptional activity by CAGE. Genomics 91, 281–288
109 Hashimoto, T. et al. (2009) Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. Bioinformatics 25, 2613–2614
110 Mattick, J.S. and Makunin, I.V. (2006) Non-coding RNA. Hum. Mol. Genet. 15, R17–R29
111 Kapranov, P. et al.
(2007) Genome-wide transcription and the implications for genomic organization. Nat. Rev. Genet. 8, 413–423
112 Manak, J.R. et al. (2006) Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nat. Genet. 38, 1151–1158
113 Denoeud, F. et al. (2007) Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 17, 746–759
114 Valen, E. et al. (2009) Genome-wide detection and analysis of hippocampus core promoters using DeepCAGE. Genome Res. 19, 255–265
115 Li, X. et al. (2009) Short homologous sequences are strongly associated with the generation of chimeric RNAs in eukaryotes. J. Mol. Evol. 68, 56–65
116 Gingeras, T.R. (2009) Implications of chimaeric non-co-linear transcripts. Nature 461, 206–211
117 Dekker, J. (2008) Gene regulation in the third dimension. Science 319, 1793–1794
118 Schadt, E.E. (2009) Molecular networks as sensors and drivers of common human diseases. Nature 461, 218–223
119 Mattick, J.S. (2003) Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. Bioessays 25, 930–939
120 Gerstein, M.B. et al. (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res. 17, 669–681
121 Matera, A.G. et al. (2007) Non-coding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nat. Rev. Mol. Cell Biol. 8, 209–220
