Review
A global view of genomic information – moving beyond the gene and the master regulator
John S. Mattick1, Ryan J. Taft1 and Geoffrey J. Faulkner2
1 Institute for Molecular Bioscience, The University of Queensland, St Lucia, 4072, QLD, Australia
2 Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Roslin, EH25 9PS, UK
The current view of gene regulation in complex organisms holds that gene expression is largely controlled by
the combinatoric actions of transcription factors and
other regulatory proteins, some of which powerfully
influence cell type. Recent large-scale studies have confirmed that cellular differentiation involves many different regulatory factors. However, other studies indicate
that the genome is pervasively transcribed to produce a
variety of short and long non-protein-coding RNAs, including those derived from retrotransposed sequences,
which also play important roles in the epigenetic regulation of gene expression. The evidence suggests that
ontogenesis requires interplay between state-specific
regulatory proteins, multitasked effector complexes
and target-specific RNAs that recruit these complexes
to their sites of action. Moreover, the semi-continuous
nature of the transcriptome prompts the reassessment
of ‘genes’ as discrete entities and indicates that the
mammalian genome might be more accurately viewed
as islands of protein-coding information in a sea of cis- and trans-acting regulatory sequences.
Regulatory paradigms in metazoa
Perhaps the biggest surprises of the genome sequencing
projects are that the number of protein-coding genes in
animals does not change appreciably with increasing
developmental complexity (known as the ‘G-value paradox’) [1] and that, notwithstanding clade-specific expansions and innovations (e.g. RNA editing enzymes in
vertebrates [2,3]), most proteins are orthologous [4]. It is
generally accepted that phenotypic divergence in animals
is based largely on the variation of the regulatory information that controls the expression of these proteins and
their isoforms [5]. In addition, it is generally assumed that
most regulatory transactions are conveyed by sequence-specific regulatory proteins that bind to enhancers, promoters and transcripts to modulate mRNA expression and
processing, with the vast differences in developmental and
cognitive complexity between nematodes and humans
ascribed to an expanded suite of cis-regulatory elements
and the presumed explosive power of the combinatoric
interactions between the regulatory proteins that recognize these elements [6].
Corresponding authors: Mattick, J.S. (j.mattick@uq.edu.au); Faulkner, G.J. (geoff.faulkner@roslin.ed.ac.uk).
Here, we discuss recent advances in our understanding
of the nature and regulation of gene expression in mammals, particularly in relation to the complexity of the
hierarchical networks of regulatory factors involved, the
unfolding discovery of previously hidden layers of regulatory RNAs (including many derived from retrotransposon
sequences and pseudogenes) and the emerging realization
that the genome might not be constructed as a discrete set
of protein-coding genes with associated regulatory
sequences, but as an interleaved continuum of both coding
and cis- and trans-acting regulatory information.
‘Transcription factors’ and regulatory networks
The term ‘transcription factor’ loosely describes proteins
with different modes of action that operate at various levels
to facilitate and control the production of RNA. Transcription factors often act generically, in that they are expressed
in multiple cell types and can regulate the expression of
Glossary
Long non-protein-coding RNAs (lncRNAs): RNAs of little protein-coding
potential, >200 nt in length, some of which can be >100 Kb [59,60]. The 200 nt
lower limit is an arbitrary figure based on a convenient practical cut-off in RNA
purification protocols [32] that excludes most known classes of small RNAs.
MicroRNAs (miRNAs): ~22 nt small RNAs that regulate gene expression by
partial complementary base pairing to specific mRNAs. This annealing inhibits
protein translation and can also facilitate the degradation of the target mRNA.
Piwi-interacting RNAs (piRNAs): Dicer-independent 26–30 nt small RNAs
principally restricted to the germline and somatic cells bordering the germline.
They associate with Piwi-clade Argonaute proteins and regulate transposon
activity and chromatin state [27,28].
Promoter-associated short RNAs (PASRs): generally 20–200 nt long with 5′
ends that coincide with the transcription start sites of protein-coding and non-coding genes [32].
Pseudogene: a supposedly non-functional paralog of a protein-coding gene
generated by gene duplication or retrotransposition. The vast majority of
pseudogenes are computationally defined by looking for features such as
premature stop codons that prevent the translation of a viable protein.
Small interfering RNAs (siRNAs): ~21 nt small RNAs produced by Dicer
cleavage of perfectly complementary dsRNA duplexes. They form complexes
with Argonaute proteins and are involved in gene regulation, transposon
control and viral defense [27,28].
Small nucleolar RNAs (snoRNAs): two classes: C/D box snoRNAs (70–120 nt)
guide the methylation of target RNAs and H/ACA box snoRNAs (100–200 nt)
guide pseudouridylation [121]. Recent evidence suggests that they might also
be precursors for at least two classes of small RNAs that might have miRNA-like activity [30,31].
Transcription initiation RNAs (tiRNAs): ~18 nt tiny RNAs derived from
sequences just downstream of transcription initiation sites, which seem to be
linked to the position of the first nucleosome and might be derived from RNA
Pol II backtracking and TFIIS-mediated cleavage [33,34].
Transposed element (TE): repetitive element generated by the activity of
retrotransposons and transposons. Almost always incapable of further
transposition because of mutations and truncations.
0168-9525/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.tig.2009.11.002 Available online 26 November 2009
many genes by recognizing cis-acting binding sites with
relatively relaxed consensus sequences that occur in many
places around the genome, suggesting that another layer of
specificity is required. What determines which subset of
these sites is addressed in a given cell or developmental
context is unknown but is presumed to be influenced by
chromatin accessibility [7].
Some transcription factors are expressed in a cell-restricted manner and can have a powerful influence on
cell fate. Good examples are the transcription factors
Pou5f1 (also known as Oct4), Sox2 and Nanog, which
are considered ‘master regulators’ of stem cell pluripotency
with the ability to revert somatic cells to undifferentiated
states capable of elaborating various developmental trajectories [8], and Hox proteins, which control cellular patterning in many contexts, including the segmental
organization and neural circuitry of the hindbrain [9].
Other proteins that seem to function as master regulators
of subsequent cell differentiation (of which there are many)
include the helix–loop–helix transcription factor Myod1,
which can convert differentiated cells into muscle cells [10],
and the zinc finger protein Egr2 (also known as Krox-20),
which is expressed in and required for the development of
specific rhombomeres (segmented compartments) in the
embryonic hindbrain [11]. However, such proteins are only
parts of larger networks that influence muscle or hindbrain
development in vivo, respectively [9,12], and do not fully
explain the diversity and fine structure of organs and
tissues. Similarly, chromatin-modifying proteins have a
profound impact on developmental processes [13] because
they lie at the functional centre of epigenetic regulatory
networks, not because they themselves make locus-specific
regulatory decisions but because they act on other information that does.
The complex interplay of regulatory factors in cellular
differentiation was recently illustrated by examining the
effects of the systematic small interfering RNA (siRNA)-mediated knockdown of 52 transcription factors during the
PMA-induced differentiation of the human monocytic cell
line THP-1, which showed that cellular states are determined by complex networks involving both positive and
negative regulatory interactions among substantial numbers of transcription factors and that no single transcription factor is necessary and sufficient to drive the
differentiation process [14]. A similar approach using
the short hairpin RNA (shRNA)-mediated perturbation
of 125 transcription factors, chromatin modifiers and
RNA-binding proteins found that the transcriptional
response of mouse dendritic cells to pathogens involves
(at least) 24 ‘core regulators’ and 76 ‘fine-tuners’ [15]. The
layered complexity of regulatory networks is also illustrated by the interplay between microRNAs (miRNAs)
and transcription factors in cell pluripotency and differentiation [16–18].
The regulatory challenge and regulatory hierarchies
This complexity is no surprise. Beyond cells in culture, the
enormous and underappreciated challenge for genetic
programming is not simply to define the phenotypic state
of a cell, but rather to organize the 4-dimensional growth
and differentiation of cells into a myriad of precisely
Trends in Genetics Vol.26 No.1
sculpted organs and tissues. In humans, these include
the lungs, kidneys, heart, liver, pancreas, intestine and so
on, a wide range of skeletal muscles including those in the
face and many bones such as the vertebrae, each of which
has a specific and unique architecture, as well as the
dizzyingly complex organization of the brain with some
10¹¹ neurons and 10¹⁴ synapses. Organogenesis, which
involves directional cell division, cell movement, cell
differentiation and programmed apoptosis, requires networked interactions between many hierarchical levels of
gene regulation, including the modulation of chromatin
architecture, transcription initiation and elongation,
alternative splicing, RNA editing and other forms of
post-transcriptional modifications, translation, post-translational modifications, RNA half-life and RNA and
protein trafficking and signaling. This is an extensive list,
elements of which are often studied in isolation from
others, with respect to both the gene(s) and level(s) of
regulation under scrutiny. Each can have important if not
profound effects on the cellular phenotype, as exemplified
by the pleiotropic effects of protein phosphorylation by the
serine/threonine kinases Akt1–3 on cell survival, growth,
division, migration and metastasis, depending on the
phosphorylated substrate and crosstalk with other pathways [19], although the complex interactions between
different gene products and different levels of gene
regulation in these networks are as yet only poorly
understood.
Guide RNAs in regulatory networks
Although it is presumed that differentiation and development are mainly controlled by regulatory proteins, it is
becoming increasingly apparent that there exists an
additional, potentially vast, layer of regulatory RNAs that
interact with some of these proteins and provide specificity
to them (Figure 1). A well-documented, although by no
means fully explored, example is that of miRNAs, which
are generated by the RNA interference (RNAi) pathway
and have unexpectedly emerged over the past decade as
major players in global and specific gene regulation.
Indeed, two recent reports have shown that a single
miRNA, miR-302, like the transcription factors Pou5f1,
Sox2 and Nanog, is capable of reprogramming cells into
an embryonal stem cell-like pluripotent state, including
the induction of these transcription factors [16,17], which
are then repressed by other miRNAs during differentiation
[18]. miRNAs have no intrinsic catalytic function but act as
guides to recruit a relatively generic protein complex (the
RNA-induced silencing complex, or RISC, which contains
‘regulatory’ proteins of the Argonaute family [20]) to
regulate the translation and half-life of specific target
mRNAs through binding sites in both their coding
[18,21] and 3′ untranslated regions, whose length and
miRNA-recognition repertoire are modulated during differentiation [22,23].
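The guide principle described above, in which a short RNA supplies the target specificity and a generic protein complex supplies the action, can be illustrated with a toy sequence search. The sketch below is only an illustration under stated assumptions: it treats nucleotides 2–8 of the miRNA as a "seed" that must pair perfectly with the target, a common simplification that this review does not itself specify, and both sequences are invented for the example. Real target prediction takes many further features into account.

```python
# Toy illustration of guide-RNA targeting: find candidate miRNA sites in an
# mRNA by searching for perfect matches to the miRNA "seed".
# ASSUMPTION: the seed is taken as nucleotides 2-8 (1-based, inclusive);
# this is a simplification, not a rule stated in the review.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_sites(mirna: str, mrna: str, seed_start: int = 2, seed_end: int = 8) -> list[int]:
    """Return 0-based mRNA positions where the seed's reverse complement occurs."""
    seed = mirna[seed_start - 1:seed_end]      # convert 1-based inclusive to a slice
    site = reverse_complement(seed)            # sequence the seed would pair with
    positions = []
    start = mrna.find(site)
    while start != -1:                         # collect every (possibly overlapping) hit
        positions.append(start)
        start = mrna.find(site, start + 1)
    return positions


# Hypothetical sequences, invented for this example.
mirna = "UACGGAUCCUAGGCAUUACGAA"
mrna = "AAA" + "GAUCCGU" + "AAACCC" + "GAUCCGU" + "GG"
print(seed_sites(mirna, mrna))
```

The point of the sketch is the decoupling: changing the seven-nucleotide seed redirects the same search (standing in for the same effector complex) to an entirely different set of transcripts.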
Altogether, miRNAs regulate almost every known
developmental process [24] and are perturbed in pathological processes such as cancer [25]. Although these
examples illustrate the regulatory interplay between regulatory factors themselves, the key general point is that in
the case of miRNAs, and other classes of regulatory RNAs,
Figure 1. Transcriptional complexity at a single locus. Recent research indicates that most of the eukaryotic genome is transcribed into interweaving nests of both sense
and antisense RNA species, whose expression is regulated to some extent by transcription factor activity, and also by local chromatin modifications, boundary elements
and TEs, and other regulatory RNA species. Examples of these phenomena are depicted in this figure. (a) Transcription can initiate at multiple sites in a single locus,
including from the 5′ ends of annotated genes (left-hand edge of the red blocks), which frequently show evidence of antisense transcription in addition to a single dominant
and many minor sense-oriented TSSs. Transcription can also initiate at transposed elements upstream of a canonical protein-coding gene or at sense- or antisense-oriented
sites in introns. Such noncanonical transcriptional activity can also have a direct regulatory function – blocking the activity and accessibility of downstream promoters. Both
protein-coding and non-coding RNA TSSs are regulated by transcription factors, and transcription factor targeting itself can be regulated by small ncRNAs. Tethered long
ncRNAs (represented by light blue wavy line) can recruit chromatin or other DNA modifying complexes to regulate TSS accessibility to transcription factors and RNA
polymerase II. (b) Transcripts generated from the TSSs described in panel (a) are depicted. The first three large transcripts shown below the arrow depict a canonical mRNA,
an mRNA-like ncRNA and an alternate mRNA product with an exon extension derived from an alternative TSS in a repetitive element. The long ncRNAs shown in yellow are
derived from the bidirectional transcription of the protein-coding gene and transcription factor-regulated intronic TSSs. Like protein-coding mRNAs, long ncRNAs can be
spliced and capped. Small TSS-proximal RNAs (e.g. tiRNAs and PASRs) are derived from both protein-coding and non-coding transcription and might regulate
transcriptional activity. Other small RNAs, such as snoRNAs and pre-miRNAs, can be processed from the introns of protein-coding or non-coding transcripts, and further
cleaved into sdRNAs and mature miRNAs. Indeed, miRNAs bound to RISC complexes (top right) add an additional layer of regulation by targeting transcripts for
degradation or inhibiting translation. Thin blocks represent short (green) or long (yellow) non-coding RNAs, TEs (purple) or 5′ and 3′ untranslated regions (red). Thick blocks
indicate protein-coding exons (red). Blocks connected by thin lines indicate spliced transcripts and their respective splicing patterns. Double-stranded structures are
representative of the genome. The size of the arrow indicates the relative abundance of transcripts derived from the TSS. Abbreviations: transcription initiation RNA
(tiRNA); promoter-associated small RNA (PASR); small nucleolar RNA (snoRNA); sno-derived RNA (sdRNA); microRNA (miRNA); RNA-induced silencing complex (RISC);
transcription factor (TF); 7-methylguanosine cap (7mG).
the regulatory signal has been de-coupled from the consequent analog action, which provides enormous efficiency
and flexibility in the evolution and deployment of such
regulatory signals and networks. The number of known
miRNAs in mammals is of the order of 10³, but might be
much higher given the indications of deep sequencing and
evidence of cell-specific miRNAs [26].
In recent years, a number of other classes of small RNAs
have emerged, including (i) Piwi-interacting small RNAs
(piRNAs) that interact with other members of the Argonaute family and seem to have a role in silencing transposon activity in the germline [27,28], (ii) siRNAs derived
from sense–antisense duplexes that play a role in the
epigenetic regulation of adjacent loci [27,28] or are imputed
to do so by Argonaute-dependent processes [29], (iii) at
least two classes of small RNAs derived from small nucleolar RNAs (snoRNAs) that might also have miRNA-like
activity [30,31], (iv) promoter-associated small RNAs
(PASRs) of unknown function [32] and (v) transcription
initiation RNAs (tiRNAs) linked to transcription start sites
and nucleosome positioning [33,34]. Interestingly, it has
recently been shown that exons are preferentially positioned in nucleosomes in both somatic [35–38] and sperm
cells [38], which might provide a mechanistic basis for the
observed coupling of chromatin structure, transcription
and splicing, and a potential basis for exon selection
through various histone modifications within these nucleosomes that report the status of particular exons during
differentiation and development, a process that itself
might be RNA-directed.
Hidden layers of RNA
It is now evident that the genomes of mammals are almost
entirely transcribed (as are, as far as one can tell, the
genomes of all other organisms), apparently in a developmentally regulated manner. That is, the expansion in
the extent of non-coding sequences with increased complexity is paralleled by a corresponding increase in the
extent of transcription [4]. This has been shown by whole
chromosome tiling arrays [39–42] and deep sequencing of
normalized cDNA libraries, which have revealed tens of
thousands of intergenic, antisense, overlapping and intronic long non-protein-coding RNAs (long ncRNAs or
lncRNAs) that are dynamically expressed from the mammalian genome [43–46], yielding a picture of extraordinary
transcriptional complexity at individual loci (Figure 1).
Although initially suspected to be transcriptional
‘noise’, a view that has been entertained if not favored
because it does not disturb the orthodox view of the informational structure of the genome, there is now substantial
genome-wide evidence pointing to the intrinsic functionality of these transcripts (for a review, see [47]). In
addition, although the mechanisms are not yet well understood, there is increasing evidence that these RNAs play
important roles in the regulation of differentiation and
development, including the involvement of lncRNAs in the
regulation of the expression of homeotic genes [48,49] and
oncogenes [50], and in the regulation of skeletal development, eye development and epithelial-to-mesenchymal
transition among many others (for reviews, additional
examples and references, see [51,52]).
The range of functions of the large number of documented lncRNAs has barely been explored, and there is as yet
little information that would allow their structural parsing
and functional classification. Nonetheless, lncRNAs might
be expected to perform a wide range of functions in eukaryotic cell and developmental biology, given the capacity of
RNA to form sophisticated structures that can be allosterically altered by interactions with other molecules
[53,54], as well as to embrace highly sequence-specific
recognition of other RNAs, DNA and proteins. Preliminary
evidence suggests that many if not most lncRNAs are trafficked to specific subcellular locations [55], and there are
well-documented examples of lncRNAs that are required for
the formation and structural integrity of nuclear paraspeckles (which seem to regulate mRNA export) in differentiated cells [56,57] or are specifically associated with a
novel subnuclear domain in a subset of neurons [58]. These
observations demonstrate that RNA molecules themselves
are crucial for proper cellular function.
There are even more subterranean layers of the transcriptome. Most identified lncRNAs seem to be polyadenylated and produced by RNA polymerase II, although
many are derived from internal oligo(dT) priming of
internal A-rich sequences during cDNA cloning, often from
much larger precursors (‘macroRNAs’) that can extend
over 100 Kb in length [59,60]. However, tiling array
analyses, which do not depend on oligo(dT)-based capture
and priming to reduce the background of rRNA and other
infrastructural RNAs, have shown that almost half of all
transcripts are not polyadenylated and that this fraction is
largely distinct in sequence composition from polyadenylated RNAs [41], indicating that a large proportion of the
dynamic transcriptome has remained hidden from view for
unexpected technical reasons. Many of these RNAs, including those derived from transposed elements (see
below), might be produced by RNA polymerase III (RNAPIII), which also transcribes various types of known or
putative regulatory RNAs, although the full extent of the
RNAPIII transcriptome is unknown [61].
Large-scale screening of ncRNAs, using siRNA- or
shRNA-mediated knockdown (which is surprisingly effective [47]) or ectopic expression strategies, in parallel with
existing efforts to deconstruct the roles of protein-coding
genes, would undoubtedly reveal their role in many cellular and developmental processes. Indeed, given the phenotypic differences between species that retain a similar
complement of proteins, it is possible that regulatory
RNAs, including those that are not well conserved [62–
64], underlie many adaptations.
RNA regulation of the epigenome
Dynamic and mitotically heritable changes to chromatin
architecture are a hallmark of differentiation and development and are central to the 4-dimensional control of
these processes. These changes include DNA methylation
as well as a myriad of different modifications to different
residues in the tails of histones that form the nucleosome
[65], and that are imposed by a relatively generic set of
enzymes and chromatin-modifying complexes that, like
RISC, lack inherent sequence specificity. To date, the field
of epigenomics has largely been concerned with cataloging
the changes in these modifications during differentiation
(including cancer) and their association with particular
features such as promoters or exons [66], and it seems
to have been widely assumed, although not often articulated, that the positional specificity of these modifications
is regulated by proteins that recruit the appropriate chromatin-modifying enzymes and/or complexes to their various sites of action at different loci in different cell types at
different stages of growth and differentiation [67]. There is
a degree of circularity in this assumption because it is also
thought that changes to chromatin structure can also
facilitate or restrict access to the transcription factors
[7] that regulate the next level of specificity in the gene
expression hierarchy (i.e. transcription itself).
Recent evidence strongly suggests that both short and
long ncRNAs are involved in, and perhaps central to, the
control of chromatin architecture [68,69], as predicted earlier [70]. Indeed, ncRNA-directed regulatory circuits
underpin most if not all complex epigenetic phenomena
in eukaryotes, including transcriptional and post-transcriptional gene silencing, position effect variegation,
hybrid dysgenesis, chromosome dosage compensation,
parental imprinting and allelic exclusion, and possibly
transvection and transinduction [47]. More specifically,
it has been shown that many long ncRNAs are spatially
and temporally expressed from homeotic loci during development [71], and are associated with either chromatin
repressor (Polycomb group) complexes [49,72] or chromatin activator (Trithorax group) complexes and activated
forms of histones [73,74], suggesting that these RNAs
function to guide infrastructural chromatin-modifying
complexes to specific genomic loci to regulate gene expression [67–69]. Interestingly, and significantly for conceptions of evolution [75], RNA-directed epigenetic
changes can also be meiotically inherited in animals and
plants [76,77].
Enhancers are long distance and often very highly conserved regulatory elements that drive tissue-specific patterns of gene expression during development, a process
that is not well understood but is thought to involve the
recruitment of transcription factors and chromosomal looping to bring the resultant complexes into contact with
specific promoters [67,78]. Intriguingly, the available evidence suggests that enhancers are themselves transcribed
in the tissues in which they are active [79,80]. The resulting ncRNAs have been thought not to be functional but
rather to be a passive byproduct of a transcriptional event
that is required to open up the enhancer (or the promoter)
to protein binding [81], but there are also documented
examples of trans-acting functions for ncRNAs in regulating the expression of developmental genes [48,49,73].
There is evidence that both the act of transcription and
the RNA itself are crucial factors in establishing cellular
identity, and it remains an open question whether ncRNAs
have an integral role in enhancer function [71]. There is
also evidence that some classes of transcription factors and
chromatin-modifying complexes have RNA-binding
domains or high affinity for RNA:DNA structures ([82]
and references therein).
Retrotransposon sequences and pseudogenes
Sequences derived from transposable elements (TEs) comprise at least half of the mammalian genome. They are
thought to be largely non-functional and have consequently been used to assess the rate of ‘neutral evolution’, despite McClintock’s originally derided but later
celebrated depiction of them as ‘controlling elements’ in
maize [83], and Britten and Davidson’s subsequent suggestion that they form gene regulatory networks in
animals [84]. Indeed, there is increasing evidence that
these sequences might not only play a key role in genome
evolution [85,86] but also in genome biology and the control of gene expression [87–91], including the potentially
dynamic remodeling of the somatic genome in neurogenesis and/or neuronal function [92]. Moreover, it has been
shown that the transcription of ncRNAs from a SINE B2
element controls chromatin structure in the mouse growth
hormone locus [81] and that human Alu RNA acts as a
modular trans-acting repressor of mRNA transcription
during heat shock [93]. More recently, it has been found
that thousands of TEs are transcribed in a tissue-specific
manner, are enriched and typically coincide with the
expression of nearby protein-coding genes and can comprise up to 30% of the total capped RNA present in a cell
[91].
The central message here is that TEs contribute an
integral – and underestimated – fraction to the mammalian transcriptome and regulome. Owing to their enrichment near protein-coding genes, TEs can harness nearby
exons to generate a large number of protein-coding transcripts, or produce ncRNAs that overlap or regulate
protein-coding genes [94]. Moreover, the dramatic variation in TE composition among mammals might reflect
lineage-specific functional exaptation [95], as well as
examples of convergent evolution [96], calling into question
the relevance of conservation-based metrics of TE function
[97]. Because TEs contribute a substantial proportion of
human polymorphisms [98], they might also be responsible
for many phenotypic differences between individuals, a
particularly important perspective for genome resequencing projects given that genome-wide association studies
indicate that the vast majority of variations affecting
complex traits and complex diseases lie within non-coding
regions of the genome [99].
Another related class of sequences that might have a
regulatory role is pseudogenes, which are presumed non-functional paralogs of functional protein-coding genes
generated by gene duplication or retrotransposition. Computational analyses of genome sequence data currently
indicate a cohort of 20 000 pseudogenes per mammalian
genome [100]. Numerous pseudogenes have been discovered in recent years that contain active promoters and
generate either sense or antisense RNAs distinct from their
ancestral paralogs [100–104], including key markers of
embryonic stem cell pluripotency (Oct4 and Nanog have
at least six and 10 pseudogenes, respectively) [105]. The
functional consequences of pseudogene transcription are
unclear, but it has been proposed that ncRNAs generated
by pseudogenes might silence paralogous mRNAs in trans
directly by forming RNA–RNA duplexes [102] or generating siRNAs from such duplexes [106,107]. Taken together
these data suggest that the term ‘pseudogene’ might ultimately prove to be a misnomer.
It should be noted that the repetitive nature of both TEs
and pseudogenes hinders, but does not prevent, the
reliable identification of their associated RNAs. Massively
parallel sequencing technologies can achieve a degree of
resolution impossible with hybridization-based strategies,
which can only reliably target the non-repetitive half of the
human genome [41]. If deep sequencing reads of sufficient
quality and length are produced, particularly with the use
of paired-end protocols and sophisticated bioinformatic
methods [108,109], it is possible to discriminate the expression of individual repetitive elements and investigate these
species with conventional laboratory techniques [91] including RNAi, despite challenges in the use of exogenous
siRNA molecules against repetitive regions.
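To make the role of paired-end reads concrete, the following toy sketch shows one way a read that matches several copies of a repeat could be assigned to a single copy when its mate lands in unique flanking sequence. It is a minimal illustration, not the method of [108,109]: the function name, the coordinates and the `max_insert` threshold are all hypothetical, and real pipelines also weigh strand orientation, mismatches and mapping quality.

```python
# Toy sketch: resolve a multi-mapping (repetitive) read using its mate.
# ASSUMPTIONS: function name, coordinates and max_insert are hypothetical;
# strand orientation and alignment quality are ignored for brevity.

def resolve_with_mate(read_hits, mate_hit, max_insert=500):
    """Pick the placement of a repetitive read that is consistent with its mate.

    read_hits: list of (chrom, pos) candidate placements of the repetitive read.
    mate_hit:  (chrom, pos) unique placement of the paired mate.
    Returns the single consistent candidate, or None when zero or several
    candidates remain (the read stays ambiguous).
    """
    mate_chrom, mate_pos = mate_hit
    consistent = [
        (chrom, pos)
        for chrom, pos in read_hits
        if chrom == mate_chrom and abs(pos - mate_pos) <= max_insert
    ]
    return consistent[0] if len(consistent) == 1 else None


# A read from a repeat family matches three genomic copies equally well,
# but its mate maps uniquely 250 bp from only one of them.
hits = [("chr1", 1000), ("chr1", 90000), ("chr7", 4200)]
print(resolve_with_mate(hits, ("chr1", 1250)))
```

Returning `None` for unresolved cases, rather than guessing, mirrors the caution in the text: repetitiveness hinders but does not prevent reliable assignment, so ambiguous reads are better set aside than forced onto a locus.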
An information continuum?
All these observations suggest that the extent of regulatory
information in mammalian genomes is far greater than
previously thought. They also call into question the longstanding assumption that most genes encode proteins
(with their cis-acting regulatory elements) and that
proteins transact most genetic information. We suggest
that a new paradigm needs to emerge that expands the
definition of genetic information to include large numbers
of regulatory RNAs, and recognizes the possibility that
many ‘regulatory’ proteins, including those that might be
state-specific, are provided with an additional layer of
target specificity by guide RNAs. The most parsimonious
explanation for the G-value paradox is that we have constrained ourselves to a limited and incomplete definition of
the gene, with respect not only to the fact that many
different product isoforms can be produced by post-transcriptional and post-translational modifications, but also
the increasing likelihood that many functional genes do not
encode proteins.
Moreover, even the conception of a discrete ‘gene’ as a
basic organizational unit might be inappropriate to
describe the true nature of genetic information in higher
organisms. Not only is there the pervasive transcription of
the genome, but also a complex mix of overlapping, interleaved and bidirectional coding and non-coding, sense and
‘antisense’ transcripts expressed from most loci
[32,110,111]. Many transcripts contain distal 5′ exons
(an average of 20 Kb in Drosophila and almost 200 Kb
in human) that are only used in particular developmental
contexts and traverse large genomic regions including
other genes [112,113], contain internal initiation sites
for alternative transcripts [46,114] and are processed by
alternative splicing and other pathways to produce a
variety of long and short RNAs [32,110,111]. Thus, the
boundaries of ‘genes’ are blurred and become indefinable.
In addition, transcriptomic studies have revealed interesting and unexpected species, such as chimeric transcripts,
that might indicate a higher order network organization
[115,116], consistent with the cell type-specific organization of chromosomes into territories and transcription
factories [117]. For this reason it might be difficult to parse
the genome in only one dimension or associate a genomic
locus with a single developmental process, physiological
function or role in disease progression. Rather, the genome
might be better viewed as a highly organized, information
dense and heavily transcribed structure wherein each
region variably responds to feed-forward regulatory signals and environmental stimuli through a myriad of RNA
and protein products [53,82]. This creates challenges both for
resolving the complex genetic bases of common human
diseases, where gene networks rather than master genes
drive specific phenotypes [118], and for modeling genetic networks based on the limited conventional descriptions of a
gene.
This conceptual upheaval requires a fresh look at how
the genomes of complex organisms, and perhaps all organisms, might be described and parsed to best reflect their
biological information content. It has been suggested that
genes might be redefined as ‘fuzzy transcription clusters
with multiple products’ [119] and more recently as a ‘union
of genomic sequences encoding a coherent set of potentially
overlapping functional products’ [120], which explicitly
includes ncRNAs, moves away from a protein-centric
model and recapitulates some of the earliest definitions
of genes as genetic loci. However, this does not deal with
transcripts containing distal exons that traverse apparently unrelated loci. An alternative that is not mutually exclusive
would be to invert the functional genomics paradigm of
annotating ‘genes’ as discrete entities by the product(s)
they produce. Instead, RNAs could be annotated by their
genomic origin, genomic environment and what is known of
their function, including open reading frame content and
interactions with other molecules. In effect, this would
circumvent the issue of gene definition (including its
boundaries) by consigning it to obsolescence.
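As an illustration only, the transcript-centred annotation suggested above might be sketched as a simple record type. The class, field and function names below (Interval, TranscriptAnnotation, genomic_environment) are hypothetical and do not correspond to any existing annotation standard or database schema; the sketch assumes half-open coordinates and treats any overlap of transcript spans, on either strand, as part of the genomic environment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Interval:
    """A half-open genomic interval [start, end) on one strand (hypothetical type)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"

    def overlaps(self, other: "Interval") -> bool:
        # Strand is deliberately ignored so that antisense neighbours are found too.
        return (self.chrom == other.chrom
                and self.start < other.end
                and other.start < self.end)


@dataclass
class TranscriptAnnotation:
    """Transcript-centred record: no gene identifier or gene boundary is required."""
    transcript_id: str
    exons: List[Interval]                 # genomic origin
    longest_orf_nt: Optional[int] = None  # open reading frame content, if any
    known_functions: List[str] = field(default_factory=list)
    interactors: List[str] = field(default_factory=list)  # other molecules

    def span(self) -> Interval:
        """Full genomic extent, from the first exon start to the last exon end."""
        first = self.exons[0]
        return Interval(first.chrom,
                        min(e.start for e in self.exons),
                        max(e.end for e in self.exons),
                        first.strand)


def genomic_environment(query: TranscriptAnnotation,
                        others: List[TranscriptAnnotation]) -> List[Tuple[str, str]]:
    """List (transcript_id, orientation) for every other transcript the query spans."""
    q = query.span()
    environment = []
    for other in others:
        if other.transcript_id == query.transcript_id:
            continue
        s = other.span()
        if q.overlaps(s):
            orientation = "sense" if s.strand == q.strand else "antisense"
            environment.append((other.transcript_id, orientation))
    return environment
```

Under this assumed scheme, a transcript with a distal 5′ exon that traverses an unrelated locus is simply reported as overlapping the transcripts of that locus, without any decision about which 'gene' either belongs to.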
Acknowledgements
We thank Alistair Forrest, Piero Carninci and the reviewers for their
constructive and helpful comments. JSM and RJT are supported by a
Federation Fellowship grant (FF0561986) and a Discovery Project grant
(DP0988851) from the Australian Research Council. GJF is supported by
an Overseas Based Biomedical Fellowship (CJ Martin Award) from the
Australian National Health and Medical Research Council (ID 575585)
and a UK BBSRC Institutional Strategic Programme Grant.
References
1 Hahn, M.W. and Wray, G.A. (2002) The G-value paradox. Evol. Dev.
4, 73–75
2 Maas, S. et al. (2003) A-to-I RNA editing: recent news and residual
mysteries. J. Biol. Chem. 278, 1391–1394
3 Navaratnam, N. and Sarwar, R. (2006) An overview of cytidine
deaminases. Int. J. Hematol. 83, 195–200
4 Taft, R.J. et al. (2007) The relationship between non-protein-coding
DNA and eukaryotic complexity. Bioessays 29, 288–299
5 Carroll, S.B. (2008) Evo-devo and an expanding evolutionary
synthesis: a genetic theory of morphological evolution. Cell 134, 25–36
6 Levine, M. and Tjian, R. (2003) Transcription regulation and animal
diversity. Nature 424, 147–151
7 Cairns, B.R. (2009) The logic of chromatin architecture and
remodeling at promoters. Nature 461, 193–198
8 Pei, D. (2009) Regulation of pluripotency and reprogramming by
transcription factors. J. Biol. Chem. 284, 3365–3369
9 Narita, Y. and Rijli, F.M. (2009) Hox genes in neural patterning and
circuit formation in the mouse hindbrain. Curr. Top. Dev. Biol. 88, 139–167
10 Tapscott, S.J. (2005) The circuitry of a master switch: Myod and the
regulation of skeletal muscle gene transcription. Development 132,
2685–2695
11 Giudicelli, F. et al. (2001) Krox-20 patterns the hindbrain through
both cell-autonomous and non cell-autonomous mechanisms. Genes
Dev. 15, 567–580
12 Brand-Saberi, B. (2005) Genetic and epigenetic control of skeletal
muscle development. Ann. Anat. 187, 199–207
13 Kwon, C.S. and Wagner, D. (2007) Unwinding chromatin for
development and growth: a few genes at a time. Trends Genet. 23,
403–412
14 Suzuki, H. et al. (2009) The transcriptional network that controls
growth arrest and differentiation in a human myeloid leukemia cell
line. Nat. Genet. 41, 553–562
15 Amit, I. et al. (2009) Unbiased reconstruction of a mammalian
transcriptional network mediating pathogen responses. Science,
Epub ahead of print 10.1126/science.1179050
16 Lin, S.L. et al. (2008) Mir-302 reprograms human skin cancer cells
into a pluripotent ES-cell-like state. RNA 14, 2115–2124
17 Lee, N.S. et al. (2008) miR-302b maintains “stemness” of human
embryonal carcinoma cells by post-transcriptional regulation of
cyclin D2 expression. Biochem. Biophys. Res. Commun. 377, 434–440
18 Tay, Y. et al. (2008) MicroRNAs to Nanog, Oct4 and Sox2 coding
regions modulate embryonic stem cell differentiation. Nature 455,
1124–1128
19 Manning, B.D. and Cantley, L.C. (2007) AKT/PKB signaling:
navigating downstream. Cell 129, 1261–1274
20 Peters, L. and Meister, G. (2007) Argonaute proteins: mediators of
RNA silencing. Mol. Cell 26, 611–623
21 Rigoutsos, I. (2009) New tricks for animal microRNAs: targeting of
amino acid coding regions at conserved and non-conserved sites.
Cancer Res. 69, 3245–3248
22 Sandberg, R. et al. (2008) Proliferating cells express mRNAs with
shortened 3′ untranslated regions and fewer microRNA target sites.
Science 320, 1643–1647
23 Bartel, D.P. (2009) MicroRNAs: target recognition and regulatory
functions. Cell 136, 215–233
24 Stefani, G. and Slack, F.J. (2008) Small non-coding RNAs in animal
development. Nat. Rev. Mol. Cell Biol. 9, 219–230
25 Medina, P.P. and Slack, F.J. (2008) microRNAs and cancer: an
overview. Cell Cycle 7, 2485–2492
26 Berezikov, E. et al. (2006) Many novel mammalian microRNA
candidates identified by extensive cloning and RAKE analysis.
Genome Res. 16, 1289–1298
27 Malone, C.D. and Hannon, G.J. (2009) Small RNAs as guardians of
the genome. Cell 136, 656–668
28 Ghildiyal, M. and Zamore, P.D. (2009) Small silencing RNAs: an
expanding universe. Nat. Rev. Genet. 10, 94–108
29 Morris, K.V. et al. (2008) Bidirectional transcription directs both
transcriptional gene activation and suppression in human cells.
PLoS Genet. 4, e1000258
30 Ender, C. et al. (2008) A human snoRNA with microRNA-like
functions. Mol. Cell 32, 519–528
31 Taft, R.J. et al. (2009) Small RNAs derived from snoRNAs. RNA 15,
1233–1240
32 Kapranov, P. et al. (2007) RNA maps reveal new RNA classes and a
possible function for pervasive transcription. Science 316, 1484–1488
33 Taft, R.J. et al. (2009) Tiny RNAs associated with transcription start
sites in animals. Nat. Genet. 41, 572–578
34 Taft, R.J. et al. (2009) Evolution, biogenesis and function of promoter-associated RNAs. Cell Cycle 8, 2332–2338
35 Schwartz, S. et al. (2009) Chromatin organization marks exon–intron
structure. Nat. Struct. Mol. Biol. 16, 990–995
36 Tilgner, H. et al. (2009) Nucleosome positioning as a determinant of
exon recognition. Nat. Struct. Mol. Biol. 16, 996–1001
37 Andersson, R. et al. (2009) Nucleosomes are well positioned in exons
and carry characteristic histone modifications. Genome Res. 19, 1732–
1741
38 Nahkuri, S. et al. (2009) Nucleosomes are preferentially positioned at
exons in somatic and sperm cells. Cell Cycle 8, 3420–3424
39 Kapranov, P. et al. (2002) Large-scale transcriptional activity in
chromosomes 21 and 22. Science 296, 916–919
40 Bertone, P. et al. (2004) Global identification of human transcribed
sequences with genome tiling arrays. Science 306, 2242–2246
41 Cheng, J. et al. (2005) Transcriptional maps of 10 human
chromosomes at 5-nucleotide resolution. Science 308, 1149–1154
42 Birney, E. et al. (2007) Identification and analysis of functional
elements in 1% of the human genome by the ENCODE pilot
project. Nature 447, 799–816
43 Okazaki, Y. et al. (2002) Analysis of the mouse transcriptome based on
functional annotation of 60,770 full-length cDNAs. Nature 420, 563–
573
44 Carninci, P. et al. (2005) The transcriptional landscape of the
mammalian genome. Science 309, 1559–1563
45 Katayama, S. et al. (2005) Antisense transcription in the mammalian
transcriptome. Science 309, 1564–1566
46 Carninci, P. et al. (2006) Genome-wide analysis of mammalian
promoter architecture and evolution. Nat. Genet. 38, 626–635
47 Mattick, J.S. (2009) The genetic signatures of non-coding RNAs. PLoS
Genet. 5, e1000459
48 Feng, J. et al. (2006) The Evf-2 non-coding RNA is transcribed from
the Dlx-5/6 ultraconserved region and functions as a Dlx-2
transcriptional coactivator. Genes Dev. 20, 1470–1484
49 Rinn, J.L. et al. (2007) Functional demarcation of active and silent
chromatin domains in human Hox loci by non-coding RNAs. Cell 129,
1311–1323
50 Yu, W. et al. (2008) Epigenetic silencing of tumor suppressor gene p15
by its antisense RNA. Nature 451, 202–206
51 Prasanth, K.V. and Spector, D.L. (2007) Eukaryotic regulatory RNAs:
an answer to the ‘genome complexity’ conundrum. Genes Dev. 21, 11–
42
52 Amaral, P.P. and Mattick, J.S. (2008) Non-coding RNA in
development. Mamm. Genome 19, 454–492
53 St Laurent, G., III and Wahlestedt, C. (2007) Non-coding RNAs:
couplers of analog and digital information in nervous system
function? Trends Neurosci. 30, 612–621
54 Serganov, A. (2009) The long and the short of riboswitches. Curr.
Opin. Struct. Biol. 19, 251–259
55 Mercer, T.R. et al. (2008) Specific expression of long non-coding RNAs
in the mouse brain. Proc. Natl. Acad. Sci. U. S. A. 105, 716–721
56 Sunwoo, H. et al. (2009) MEN ε/β nuclear-retained non-coding RNAs
are upregulated upon muscle differentiation and are essential
components of paraspeckles. Genome Res. 19, 347–359
57 Chen, L.L. and Carmichael, G.G. (2009) Altered nuclear retention of
mRNAs containing inverted repeats in human embryonic stem
cells: functional role of a nuclear non-coding RNA. Mol. Cell 35,
467–478
58 Sone, M. et al. (2007) The mRNA-like non-coding RNA Gomafu
constitutes a novel nuclear domain in a subset of neurons. J. Cell
Sci. 120, 2498–2506
59 Ravasi, T. et al. (2006) Experimental validation of the regulated
expression of large numbers of non-coding RNAs from the mouse
genome. Genome Res. 16, 11–19
60 Furuno, M. et al. (2006) Clusters of internally primed transcripts
reveal novel long non-coding RNAs. PLoS Genet. 2, e37
61 Dieci, G. et al. (2007) The expanding RNA polymerase III
transcriptome. Trends Genet. 23, 614–622
62 Pang, K.C. et al. (2006) Rapid evolution of non-coding RNAs: lack of
conservation does not mean lack of function. Trends Genet. 22, 1–5
63 Pheasant, M. and Mattick, J.S. (2007) Raising the estimate of
functional human sequences. Genome Res. 17, 1245–1253
64 Nordstrom, K.J. et al. (2009) Crucial evaluation of the FANTOM3 non-coding RNA transcripts. Genomics 94, 169–176
65 Kouzarides, T. (2007) Chromatin modifications and their function.
Cell 128, 693–705
66 Bernstein, B.E. et al. (2007) The mammalian epigenome. Cell 128,
669–681
67 Simon, J.A. and Kingston, R.E. (2009) Mechanisms of Polycomb gene
silencing: knowns and unknowns. Nat. Rev. Mol. Cell Biol. 10, 697–
708
68 Mattick, J.S. et al. (2009) RNA regulation of epigenetic processes.
Bioessays 31, 51–59
69 Morris, K.V. (2009) Long antisense non-coding RNAs function to
direct epigenetic complexes that regulate transcription in human
cells. Epigenetics 4, 296–301
70 Mattick, J.S. and Gagen, M.J. (2001) The evolution of controlled
multitasked gene networks: the role of introns and other noncoding RNAs in the development of complex organisms. Mol. Biol.
Evol. 18, 1611–1630
71 Lempradl, A. and Ringrose, L. (2008) How does non-coding
transcription regulate Hox genes? Bioessays 30, 110–121
72 Khalil, A.M. et al. (2009) Many human large intergenic non-coding
RNAs associate with chromatin-modifying complexes and affect gene
expression. Proc. Natl. Acad. Sci. U. S. A. 106, 11667–11672
73 Sanchez-Elsner, T. et al. (2006) Non-coding RNAs of trithorax
response elements recruit Drosophila Ash1 to Ultrabithorax.
Science 311, 1118–1123
74 Dinger, M.E. et al. (2008) Long non-coding RNAs in mouse embryonic
stem cell pluripotency and differentiation. Genome Res. 18, 1433–
1445
75 Mattick, J.S. (2009) Has evolution learnt how to learn? EMBO Rep.
10, 665
76 Chandler, V.L. (2007) Paramutation: from maize to mice. Cell 128,
641–645
77 Cuzin, F. et al. (2008) Inherited variation at the epigenetic level:
paramutation from the plant to the mouse. Curr. Opin. Genet. Dev. 18,
193–196
78 Kleinjan, D.A. and van Heyningen, V. (2005) Long-range control of
gene expression: emerging mechanisms and disruption in disease.
Am. J. Hum. Genet. 76, 8–32
79 Calin, G.A. et al. (2007) Ultraconserved regions encoding ncRNAs
are altered in human leukemias and carcinomas. Cancer Cell 12, 215–
229
80 Ling, J. et al. (2005) The HS2 enhancer of the beta-globin locus control
region initiates synthesis of non-coding, polyadenylated RNAs
independent of a cis-linked globin promoter. J. Mol. Biol. 350, 883–896
81 Lunyak, V.V. et al. (2007) Developmentally regulated activation of a
SINE B2 repeat as a domain boundary in organogenesis. Science 317,
248–251
82 Mattick, J.S. (2007) A new paradigm for developmental biology. J.
Exp. Biol. 210, 1526–1547
83 McClintock, B. (1956) Controlling elements and the gene. Cold Spring
Harb. Symp. Quant. Biol. 21, 197–216
84 Britten, R.J. and Davidson, E.H. (1971) Repetitive and non-repetitive
DNA sequences and a speculation on the origins of evolutionary
novelty. Q. Rev. Biol. 66, 111–138
85 Britten, R. (2006) Transposable elements have contributed to
thousands of human proteins. Proc. Natl. Acad. Sci. U. S. A. 103,
1798–1803
86 Oliver, K.R. and Greene, W.K. (2009) Transposable elements:
powerful facilitators of evolution. Bioessays 31, 703–714
87 Brosius, J. (1999) RNAs from all categories generate retrosequences
that may be exapted as novel genes or regulatory elements. Gene 238,
115–134
88 Lowe, C.B. et al. (2007) Thousands of human mobile element
fragments undergo strong purifying selection near developmental
genes. Proc. Natl. Acad. Sci. U. S. A. 104, 8005–8010
89 Feschotte, C. (2008) Transposable elements and the evolution of
regulatory networks. Nat. Rev. Genet. 9, 397–405
90 Cordaux, R. and Batzer, M.A. (2009) The impact of retrotransposons
on human genome evolution. Nat. Rev. Genet. 10, 691–703
91 Faulkner, G.J. et al. (2009) The regulated retrotransposon
transcriptome of mammalian cells. Nat. Genet. 41, 563–571
92 Coufal, N.G. et al. (2009) L1 retrotransposition in human neural
progenitor cells. Nature 460, 1127–1131
93 Mariner, P.D. et al. (2008) Human Alu RNA is a modular transacting
repressor of mRNA transcription during heat shock. Mol. Cell 29,
499–509
94 Volff, J.N. and Brosius, J. (2007) Modern genomes with retro-look:
retrotransposed elements, retroposition and the origin of new genes.
Genome Dyn. 3, 175–190
95 Mattick, J.S. and Mehler, M.F. (2008) RNA editing, DNA recoding and
the evolution of human cognition. Trends Neurosci. 31,
227–233
96 Romanish, M.T. et al. (2007) Repeated recruitment of LTR
retrotransposons as promoters by the antiapoptotic locus NAIP
during mammalian evolution. PLoS Genet. 3, e10
97 Faulkner, G.J. and Carninci, P. (2009) Altruistic functions for selfish
DNA. Cell Cycle 8, 2895–2900
98 Wang, J. et al. (2006) dbRIP: a highly integrated database of
retrotransposon insertion polymorphisms in humans. Hum. Mutat.
27, 323–329
99 Hindorff, L.A. et al. (2009) Potential etiologic and functional
implications of genome-wide association loci for human diseases
and traits. Proc. Natl. Acad. Sci. U. S. A. 106, 9362–9367
100 Svensson, O. et al. (2006) Genome-wide survey for biologically
functional pseudogenes. PLoS Comput. Biol. 2, e46
101 Zhou, B.S. et al. (1992) Identification of antisense RNA transcripts
from a human DNA topoisomerase I pseudogene. Cancer Res. 52,
4280–4285
102 Korneev, S.A. et al. (1999) Neuronal expression of neural nitric oxide
synthase (nNOS) protein is suppressed by an antisense RNA
transcribed from an NOS pseudogene. J. Neurosci. 19, 7711–7720
103 Frith, M.C. et al. (2006) Pseudo-messenger RNA: phantoms of the
transcriptome. PLoS Genet. 2, e23
104 Zheng, D. and Gerstein, M.B. (2007) The ambiguous boundary
between genes and pseudogenes: the dead rise up, or do they?
Trends Genet. 23, 219–224
105 Pain, D. et al. (2005) Multiple retropseudogenes from pluripotent cell-specific gene expression indicates a potential signature for novel gene
identification. J. Biol. Chem. 280, 6265–6268
106 Watanabe, T. et al. (2008) Endogenous siRNAs from naturally formed
dsRNAs regulate transcripts in mouse oocytes. Nature 453, 539–543
107 Tam, O.H. et al. (2008) Pseudogene-derived small interfering RNAs
regulate gene expression in mouse oocytes. Nature 453, 534–538
108 Faulkner, G.J. et al. (2008) A rescue strategy for multi-mapping short
sequence tags refines surveys of transcriptional activity by CAGE.
Genomics 91, 281–288
109 Hashimoto, T. et al. (2009) Probabilistic resolution of multi-mapping
reads in massively parallel sequencing data using MuMRescueLite.
Bioinformatics 25, 2613–2614
110 Mattick, J.S. and Makunin, I.V. (2006) Non-coding RNA. Hum. Mol.
Genet. 15, R17–29
111 Kapranov, P. et al. (2007) Genome-wide transcription and the
implications for genomic organization. Nat. Rev. Genet. 8, 413–423
112 Manak, J.R. et al. (2006) Biological function of unannotated
transcription during the early development of Drosophila
melanogaster. Nat. Genet. 38, 1151–1158
113 Denoeud, F. et al. (2007) Prominent use of distal 5′ transcription start
sites and discovery of a large number of additional exons in ENCODE
regions. Genome Res. 17, 746–759
114 Valen, E. et al. (2009) Genome-wide detection and analysis of
hippocampus core promoters using DeepCAGE. Genome Res. 19,
255–265
115 Li, X. et al. (2009) Short homologous sequences are strongly associated
with the generation of chimeric RNAs in eukaryotes. J. Mol. Evol. 68,
56–65
116 Gingeras, T.R. (2009) Implications of chimaeric non-co-linear
transcripts. Nature 461, 206–211
117 Dekker, J. (2008) Gene regulation in the third dimension. Science 319,
1793–1794
118 Schadt, E.E. (2009) Molecular networks as sensors and drivers of
common human diseases. Nature 461, 218–223
119 Mattick, J.S. (2003) Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. Bioessays 25, 930–939
120 Gerstein, M.B. et al. (2007) What is a gene, post-ENCODE? History
and updated definition. Genome Res. 17, 669–681
121 Matera, A.G. et al. (2007) Non-coding RNAs: lessons from the small
nuclear and small nucleolar RNAs. Nat. Rev. Mol. Cell Biol. 8, 209–220